
## Exercise 1

The course GitHub repo has the syntactically analyzed tweets in the data folder. The file is called simply `covidtweets.conllu.gz`. Let's work on this a bit.

### Basics
* How many tweets?
* How many tokens? What if you exclude punctuation and numbers? 
* How many sentences? Note that `\t` does not work with egrep. Can you google how to match a tab instead?

### Lexical characteristics
* The most frequent lemmas? What if you exclude function words? 
* The definition of function words can vary a bit. Which POS classes do you think would be the most useful to keep to get a general overview of the contents of the tweets?
* Note: it might be hard to figure out the POS tags associated with the words. You can also analyze this by combining the lemma and POS columns.

### More
* Now that we have POS classes, we can also focus on specific kinds of words. So let's count the most frequent lemmas for
  * Nouns (NOUN)
  * Verbs (VERB)
  * Proper nouns (PROPN)
* In your opinion, which POS class gives the most interesting results?

In [None]:
! git clone https://github.com/TurkuNLP/ATP_kurssi.git

Cloning into 'ATP_kurssi'...
remote: Enumerating objects: 596, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 596 (delta 0), reused 0 (delta 0), pack-reused 594
Receiving objects: 100% (596/596), 79.58 MiB | 25.94 MiB/s, done.
Resolving deltas: 100% (336/336), done.


In [None]:
%cd ATP_kurssi/data

In [None]:
! zcat covidtweets.conllu.gz| head # we can see that each tweet starts with the mention ###C: NEWDOC


In [1]:
! zcat covidtweets.conllu.gz| egrep "^###C: NEWDOC" | wc -l # we can just grep for those and count the lines



In [None]:
# for tokens, we need to focus on the lines starting with a number and count those
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | head #these seem to match the correct lines

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | wc -l

In [None]:
# to exclude tokens tagged as numbers or punctuation, we should exclude lines with those tags
# first we need to figure out what those tags are.
# this can be searched for too!
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "!" | head 

In [None]:
# to find how numbers are tagged, I'll keep the columns 2 (for the running words) and 4 (POS)
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 2,4 | egrep "[0-9]" | head 

In [None]:
# then the tokens without punctuation and numbers
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "NUM|PUNCT" | head # looks about right!

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "NUM|PUNCT" | wc -l
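
One caveat with `egrep -v "NUM|PUNCT"`: it drops every line that contains those strings anywhere, not only in the POS column. Below is a minimal Python sketch of a stricter version that compares column 4 exactly. The sample lines and the helper name `count_tokens` are made up for illustration; they are not taken from `covidtweets.conllu.gz`.

```python
# Made-up CoNLL-U sample (tab-separated), NOT from the course data.
# The first token's word form contains the string "NUM" but is tagged PROPN.
SAMPLE = (
    "###C: NEWDOC\n"
    "1\tNUMMELA\tNummela\tPROPN\t_\t_\t0\troot\t_\t_\n"
    "2\ton\tolla\tAUX\t_\t_\t1\tcop\t_\t_\n"
    "3\t19\t19\tNUM\t_\t_\t1\tnummod\t_\t_\n"
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_\n"
    "\n"
)


def count_tokens(conllu_text, skip_upos=()):
    """Count token lines whose POS tag (column 4) is not in skip_upos."""
    count = 0
    for line in conllu_text.splitlines():
        if not line[:1].isdigit():  # skip metadata, comments and blank lines
            continue
        cols = line.split("\t")
        if cols[3] not in skip_upos:  # column 4 is index 3
            count += 1
    return count


all_tokens = count_tokens(SAMPLE)
content_tokens = count_tokens(SAMPLE, skip_upos={"PUNCT", "NUM"})
print(all_tokens, content_tokens)
```

On this sample the column check keeps `NUMMELA`, whereas a plain `egrep -v "NUM|PUNCT"` would throw that line away too. On real data the difference is probably small, but it is worth knowing about.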

In [None]:
# to count sentences, we can search for lines starting with the token ID 1.
# be sure to match just 1, not 10!
# [[:space:]] works with egrep and matches a tab
# another option is grep -P, which accepts \t
# there are surely other ways as well!

! zcat covidtweets.conllu.gz| egrep "^1[[:space:]]" | head -5
! zcat covidtweets.conllu.gz| grep -P "^1\t" | head -5

In [None]:
! zcat covidtweets.conllu.gz| grep -P "^1\t" | wc -l # then the sentence count!
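
To see why the tab after the `1` matters, here is a small Python sketch that counts sentences on generated dummy data. The helpers `make_sentence` and `count_sentences` are hypothetical names used only for this illustration, and the tokens are placeholders rather than real tweets.

```python
def make_sentence(n_tokens):
    """Build a dummy CoNLL-U sentence with n_tokens placeholder tokens."""
    rows = [
        f"{i}\tw{i}\tw{i}\tX\t_\t_\t0\tdep\t_\t_"
        for i in range(1, n_tokens + 1)
    ]
    return "\n".join(rows) + "\n\n"  # sentences end with a blank line


# Two sentences: one with 2 tokens, one with 12 tokens (IDs 10, 11, 12 exist)
SAMPLE = make_sentence(2) + make_sentence(12)


def count_sentences(conllu_text):
    """Count lines whose first tab-separated column is exactly '1'."""
    return sum(
        1 for line in conllu_text.splitlines()
        if line.split("\t")[0] == "1"
    )


# The sloppy version: anything starting with the character 1
naive = sum(1 for line in SAMPLE.splitlines() if line.startswith("1"))

print(count_sentences(SAMPLE), naive)
```

The exact comparison plays the same role as `"^1\t"` in the grep version: token IDs 10, 11 and 12 start with `1` as well, so the sloppy count comes out too high.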

In [None]:
### Then the lexical characteristics, let's start with the lemmas

In [None]:
# lemmas are column number 3
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 3 | head

In [None]:
# we can just do the frequency list from column 3
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 3 | sort | uniq -c | sort -rn | head

In [None]:
# I want to exclude at least numbers and punctuation for starters
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM" | cut -f 3 | sort | uniq -c | sort -rn | head

In [None]:
# there's a lot I don't want to have in the above list. But it's kinda hard to know what POS categories they are, so I'll just do a
# frequency list with the lemmas + POS tags

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM" | cut -f 3,4 | sort | uniq -c | sort -rn | head

In [None]:
# so at least SCONJ, AUX, PRON away...

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM|PRON|SCONJ|AUX"  | cut -f 3,4 | sort | uniq -c | sort -rn | head -20

In [None]:
# looks so much better! but CCONJ should go as well...
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM|PRON|SCONJ|CCONJ|AUX"  | cut -f 3,4 | sort | uniq -c | sort -rn | head -20

In [2]:
# but I think that one does it!
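
For comparison, the same lemma + POS frequency list can be sketched in Python with `collections.Counter`. This is only an illustration under assumptions: the sample tokens are invented, and `FUNCTION_TAGS` simply mirrors the tags excluded in the pipeline above.

```python
from collections import Counter

# The tags excluded in the shell pipeline above
FUNCTION_TAGS = {"PUNCT", "NUM", "PRON", "SCONJ", "CCONJ", "AUX"}

# Invented sample tokens, NOT from the course data
SAMPLE = (
    "1\tKorona\tkorona\tNOUN\t_\t_\t4\tnsubj\t_\t_\n"
    "2\tja\tja\tCCONJ\t_\t_\t3\tcc\t_\t_\n"
    "3\tkorona\tkorona\tNOUN\t_\t_\t1\tconj\t_\t_\n"
    "4\tleviää\tlevitä\tVERB\t_\t_\t0\troot\t_\t_\n"
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_\n"
)


def lemma_pos_counts(conllu_text, skip_upos=FUNCTION_TAGS):
    """Count (lemma, POS) pairs, skipping tokens whose POS is in skip_upos."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line[:1].isdigit():  # only token lines start with a digit
            continue
        cols = line.split("\t")
        lemma, upos = cols[2], cols[3]  # columns 3 and 4
        if upos not in skip_upos:
            counts[(lemma, upos)] += 1
    return counts


counts = lemma_pos_counts(SAMPLE)
for (lemma, upos), n in counts.most_common(20):
    print(n, lemma, upos)
```

Keeping the tags in a set makes it easy to try different definitions of "function word" without retyping the whole pipeline.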

In [None]:
# Then the specific POS classes

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "NOUN" | head # so these match nouns
# then just column 3 and the frequencies

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "NOUN" | cut -f 3 | sort | uniq -c | sort -rn| head -20

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "ADJ" | cut -f 3 | sort | uniq -c | sort -rn| head -20 # quite sick adjectives right?

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "VERB" | cut -f 3 | sort | uniq -c | sort -rn| head -20


## Exercise 2

With a Python wrapper we can print only the tweets that match the content and time of our query.

Here's an example:

`! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "Covid" --time "2020" `

Both options match any string in the respective metadata lines, so `--time "2021"` and `--time "2021-02"` are both valid options.

Also, the `--text` option normalises everything to lower case, so `--text "TWEET"` matches any upper / lower case version.
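
To make the matching rule concrete, here is a rough Python sketch of how such a filter could behave. It is not the code of `read_conllu_docs.py`. The function `doc_matches` is a hypothetical name, and the exact layout of the `###C:TEXT` and `###C:TIME` lines in the example is an assumption.

```python
def doc_matches(doc_lines, text=None, time=None):
    """Return True if the metadata lines contain the given query strings.

    text is compared case-insensitively, time as a plain substring.
    """
    text_line = next(
        (line for line in doc_lines if line.startswith("###C:TEXT")), ""
    )
    time_line = next(
        (line for line in doc_lines if line.startswith("###C:TIME")), ""
    )
    if text is not None and text.lower() not in text_line.lower():
        return False
    if time is not None and time not in time_line:
        return False
    return True


# Invented example document; the metadata layout is assumed
doc = [
    "###C: NEWDOC",
    "###C:TEXT Covid-19 tilanne pahenee",
    "###C:TIME 2021-02-03 10:15:00",
]

print(doc_matches(doc, text="COVID", time="2021"))
print(doc_matches(doc, time="2021-02"))
print(doc_matches(doc, time="2020"))
```

Because the match is a plain substring test, a shorter query like `"2021"` always matches at least as many tweets as a longer one like `"2021-02"`.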



---

Let's focus on tweets that mention specific persons. At least the prime minister @MarinSanna is mentioned frequently, so let's try her. How could you first fetch just the tweets that mention her? Direct those tweets to a file.

How many tweets does this file have? How are they distributed over time? The time stamps are quite detailed and not evenly distributed, so what would be a reasonable way to analyze the distribution?

Let's try to compare the contents of tweets mentioning @MarinSanna at different periods of time. Ideally, we would have a sufficient number of tweets to compare, representing different phases of the crisis.

Gather the tweets to be compared, and analyze possible differences in terms of frequent words. Which POS classes provide the most interpretable results? If any? Can you see some differences that could be related to the different periods in the crisis and events that took place?

When you find interesting words that could reflect some events, remember to analyze tweets with those words to check if the words are used like you anticipated.

If you have time left after this, you can try with different politicians or other handles. For instance, @THL_org seems quite frequent - what is it and what do Twitter users tweet about it?


In [None]:
! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "Covid" --time "2020" | head -40

In [None]:
! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2019-12" --text "pää" | head -40

In [None]:
# then the tweets with @MarinSanna to a file
! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "@MarinSanna" | gzip > sannamarin.conllu.gz

In [None]:
! zcat sannamarin.conllu.gz | head -50 # these look ok!

In [None]:
# by grepping and counting the lines starting with ###C:TEXT we can count the number of tweets
! zcat sannamarin.conllu.gz | egrep "^###C:TEXT" | wc -l 

In [None]:
# Then by grepping and counting the time stamps we can get their distribution over time

! zcat sannamarin.conllu.gz | egrep "^###C:TI" | sort | uniq -c | sort -rn | head 
# but clearly these stamps are too detailed, no trends can be seen

In [None]:
# I'll take months (and delete the days and times)
! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| head  # for the dates

In [None]:
# for the months
! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| cut -f 1,2 -d '-' | head # this looks ok!

In [None]:
# to make sure, let's see how the distribution is
! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| cut -f 1,2 -d '-' | sort | uniq -c | sort -rn | head -30

# or, actually, I think the distribution is kinda wonky, mostly just from the first months of the crisis. 
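
The month-level grouping can also be sketched in Python. The time stamp lines below are invented, and their layout (date as the second space-separated field, as the `cut -f 2 -d ' '` step suggests) is an assumption. `month_counts` is a hypothetical helper name.

```python
from collections import Counter

# Invented time stamp lines; the layout is assumed from the cut commands above
TIME_LINES = [
    "###C:TIME 2020-03-17 09:12:44",
    "###C:TIME 2020-03-18 21:03:10",
    "###C:TIME 2020-04-14 07:55:01",
]


def month_counts(lines):
    """Count tweets per month, using the YYYY-MM prefix of each date."""
    counts = Counter()
    for line in lines:
        if line.startswith("###C:TIME"):
            date = line.split(" ")[1]  # like cut -f 2 -d ' '
            counts[date[:7]] += 1      # keep only "YYYY-MM"
    return counts


by_month = month_counts(TIME_LINES)
for month, n in sorted(by_month.items()):  # chronological order
    print(month, n)
```

Sorting by month instead of by frequency (the `sort -rn` order) makes it easier to see how the tweets are spread over the course of the crisis.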

In [None]:
# what about the first days of the crisis? What would be frequent enough?
# Based on this frequency list, I'll take March 17 and March 18 
! zcat sannamarin.conllu.gz | egrep "TIME" | egrep "2020-03" | cut -f 2 -d ' ' | cut -f 1,2,3 -d '-'| sort | uniq -c | sort -rn | head -30

In [None]:
# unfortunately my python script doesn't support regexes, so each day needs its own run
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-03-18" > 03-18.conllu 
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-03-17" > 03-17.conllu 

In [None]:
# I want to check that the numbers match - looks ok!
! cat 03-18.conllu 03-17.conllu | egrep "###C: NEWDOC" | wc -l
! cat 03-18.conllu 03-17.conllu | egrep "###C:TIME" | cut -f 2 -d ' ' | cut -f 1,2,3 -d '-' | sort | uniq -c | sort -rn

In [None]:
# the most frequent adjectives, nouns, verbs
! cat 03-18.conllu 03-17.conllu  | egrep "ADJ|NOUN|VERB" | cut -f 3 | sort | uniq -c | sort -rn | head -20
# I guess these are ok, but I'm not sure about the verbs. would the list be better without verbs?

In [None]:
! cat 03-18.conllu 03-17.conllu  | egrep "ADJ|NOUN" | cut -f 3 | sort | uniq -c | sort -rn | head -20


In [None]:
# well, I guess these do reflect some moments in the crisis! 
# there's valmiuslaki "emergency powers legislation", tärkeä "important", raja "border"
# I'll still check how these are used in the tweets. Unfortunately my script doesn't search for lemmas, but maybe the string will get some matches anyway!

! cat 03-18.conllu 03-17.conllu  | python3 ../scripts/read_conllu_docs.py --text "raja" | egrep "###C:TEXT" | head
# well, not horrible! The tweets mostly seem to discuss borders and border control

Then the comparison to April. Again, we need enough tweets, preferably from late April so that there's some time gap to the March tweets we have. Let's see how many tweets we have from late April.

In [None]:
# these are now just the days of April
! zcat sannamarin.conllu.gz | egrep "###C:TIME" | cut -f 2 -d ' ' | egrep "2020-04" | cut -f 3 -d '-' | sort | uniq -c | sort -rn 

# not many tweets from late April, but let's take April 15 and 14

In [None]:
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-04-15" >  04-15.conllu 
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-04-14" >  04-14.conllu

In [None]:
! cat 04*  | egrep "ADJ|NOUN" | cut -f 3 | sort | uniq -c | sort -rn | head -20 

I guess there are some differences, at least mökki "summer house" and maski "mask". Let's see what the tweets with these words look like.

In [None]:
! cat 04* | python3 ../scripts/read_conllu_docs.py --text "mökki" | egrep "###C:TEXT" | head

# well, looks like it should, I'd think!
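
The March vs. April comparison above was done by eyeballing two frequency lists. One possible way to make it more explicit is to compare relative frequencies, as in the Python sketch below. The counts are made-up placeholders, not results from the data, and `relative_diff` is a hypothetical helper name.

```python
from collections import Counter

# Made-up lemma counts for illustration only, NOT real results
march = Counter({"raja": 6, "valmiuslaki": 5, "hallitus": 4})
april = Counter({"maski": 5, "mökki": 3, "hallitus": 4})


def relative_diff(a, b):
    """For each lemma: its share in a minus its share in b.

    Positive values are more typical of a, negative ones of b.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    return {
        lemma: a[lemma] / total_a - b[lemma] / total_b  # missing lemmas count as 0
        for lemma in set(a) | set(b)
    }


diff = relative_diff(march, april)
for lemma, d in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lemma}\t{d:+.2f}")
```

Using shares rather than raw counts keeps the comparison fair when the two periods have different numbers of tweets. With this few tweets per period the numbers are still noisy, so checking the actual tweets stays important.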