<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/ATP_2025_Notebook_7_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise I

In the GitHub repo https://github.com/TurkuNLP/ATP_kurssi.git, there are syntactically analyzed tweets in a folder called `data`. The file is called simply `covidtweets.conllu.gz`. Let's work on it a bit.

### 1. Preparations


*   a. clone the repo
*   b. go to the folder


### 2. Basics
* a. How many tweets?
* b. How many tokens? What if you exclude punctuation and numbers?
* c. How many sentences? Note that `\t` (the tab that separates the columns in the file) does not work with egrep. Can you google how to match a tab instead?

### 3. Lexical characteristics
* a. The most frequent lemmas?
* b. What if you exclude function words? The definition of function words can vary a bit. Which POS classes do you think would be the most useful to keep to get a general view of the contents of the tweets?
* Note: it might be hard to figure out the POS tags associated with the words. You can analyze this by combining the lemma and POS columns.

### 4. More
* Now that we have POS classes, we can also focus on specific kinds of words. So let's count the most frequent lemmas for
  * a. nouns (NOUN)
  * b. adjectives (ADJ)
  * c. verbs (VERB)
* Which POS class words provide the most interesting results in your opinion?

In [None]:
#1a. Let's start by cloning the repo

! git clone https://github.com/TurkuNLP/ATP_kurssi.git

In [None]:
#1b. Then go to the folder where the covidtweets file is

%cd ATP_kurssi/data

In [None]:
! ls

In [None]:
#As always, let's start by checking the file

#remember to print the gzipped file with zcat

! zcat covidtweets.conllu.gz| head -20 # we can see that each tweet starts with the mention ###C: NEWDOC


In [None]:
#2a. Count the tweets
# we can just grep for ###C: NEWDOC indicating the start of a new document and count the lines

! zcat covidtweets.conllu.gz| egrep "^###C: NEWDOC" | wc -l

In [None]:
#2b. Count the tokens
# for tokens, we need to focus on the lines starting with a number and count those

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | head #these seem to match the correct lines

In [None]:
#Count the lines for token count

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | wc -l

In [None]:
# to exclude tokens tagged as numbers or punctuation, we should exclude lines with those tags
# first we need to figure out those tags.
# this can be searched for too!

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "!" | head

**N.B.** The fourth column gives the POS tag for a token, and here we see that punctuation marks are tagged with PUNCT. We can then `egrep -v "PUNCT"` to exclude punctuation.

In [None]:
#similarly, we can find the POS tag for numbers
#to make it a bit simpler (and easier to read), I'll keep the columns 2 (for the running words) and 4 (POS)

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 2,4 | egrep "[0-9]" | head

Now we can see that the POS tag for numbers is NUM. We can `egrep -v "NUM|PUNCT"` to remove both numbers and punctuation simultaneously.

Remember that `|` marks alternatives, and thus `egrep -v "NUM|PUNCT"` keeps only the lines that include neither NUM nor PUNCT.
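
One caveat: `egrep -v "NUM|PUNCT"` looks for these strings anywhere on the line, so a token whose word form or lemma happens to contain `NUM` would be dropped as well. Below is a minimal sketch of a stricter, column-based filter with `awk`. The four sample lines are made up for illustration and are not taken from the data; `grep -E` is used as the equivalent of `egrep`.

```shell
# Hypothetical four-token sample in a tab-separated layout (ID, FORM, LEMMA, POS).
# The first token is the word "NUM" itself, tagged as a proper noun.
sample=$(printf '1\tNUM\tNUM\tPROPN\n2\ton\tolla\tAUX\n3\t18.5\t18.5\tNUM\n4\t!\t!\tPUNCT\n')

# Substring filter: drops every line that contains NUM or PUNCT anywhere.
by_substring=$(printf '%s\n' "$sample" | grep -E -v "NUM|PUNCT" | wc -l)

# Column filter: drops a line only when column 4 is exactly NUM or PUNCT.
by_column=$(printf '%s\n' "$sample" | awk -F'\t' '$4 != "NUM" && $4 != "PUNCT"' | wc -l)

echo "substring filter keeps: $by_substring"
echo "column filter keeps:    $by_column"
```

Whether the difference matters depends on the data; for a quick overview the substring filter is usually good enough.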

In [None]:
# Now, get the tokens without punctuation and numbers

#grep the lines that start with a number (get the tokens)
#remove the lines with numbers or punctuation

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "NUM|PUNCT" | head -50 # looks about right!

In [None]:
#Let's check if the numbers were removed by comparing with the original file
#and we can see that the number 18.5 is removed indeed
#however, 7/x # remains since it is not tagged as NUM, but SYM

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | head -50

In [None]:
#Count the tokens without numbers and punctuation

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "NUM|PUNCT" | wc -l

In [None]:
#2c.Count the sentences
# to count sentences, we can search for lines with 1 (the first token in each sentence has line number 1)
# be sure to match just 1, not 10!
# [[:space:]] works with egrep and matches \t
# another option is grep -P, which accepts \t
# surely there can be others as well!

! zcat covidtweets.conllu.gz| egrep "^1[[:space:]]" | head -5
! echo '---'
! zcat covidtweets.conllu.gz| grep -P "^1\t" | head -5

In [None]:
# then the sentence count!

! zcat covidtweets.conllu.gz| grep -P "^1\t" | wc -l
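
As a side note on matching the tab: `[[:space:]]` also matches an ordinary space, and `grep -P` is not available in every grep build. The sketch below shows two further options that should work in most shells: a literal tab produced with `printf`, and an exact comparison of the first column in `awk`. The two sample lines are made up.

```shell
# Hypothetical sample: token lines 1 and 10 of a sentence, tab-separated.
sample=$(printf '1\tHei\thei\tINTJ\n10\tmaailma\tmaailma\tNOUN\n')

# POSIX character class: matches a tab, but also a space or other whitespace.
n_space=$(printf '%s\n' "$sample" | grep -E -c "^1[[:space:]]")

# A literal tab stored in a variable and placed inside the pattern.
tab=$(printf '\t')
n_tab=$(printf '%s\n' "$sample" | grep -c "^1${tab}")

# awk compares the whole first column, so no regex is needed.
n_awk=$(printf '%s\n' "$sample" | awk -F'\t' '$1 == "1"' | wc -l)

echo "$n_space $n_tab $n_awk"
```

All three counts should agree: only the line whose first column is exactly `1` is counted, not the one starting with `10`.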

In [None]:
### Then the lexical characteristics, let's start with the lemmas

In [None]:
# 3a. The most frequent lemma
#lemmas are in column number 3

#grep the token lines
#choose the correct column
#and check that you got the lemma column

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 3 | head

In [None]:
# we can just do the frequency list from column 3

#after cut -f 3 add the frequency list
#and check what the frequency list looks like

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | cut -f 3 | sort | uniq -c | sort -rn | head
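
The `sort | uniq -c | sort -rn` chain comes back in almost every cell from here on. One possible shortcut is to wrap it in a small shell function. The name `freq` is made up for this sketch, and the lemma list is illustrative only.

```shell
# Hypothetical helper: turn a stream of items (one per line) into a frequency list.
freq() {
    sort | uniq -c | sort -rn
}

# Usage on a made-up lemma column: the two most frequent lemmas.
printf 'korona\nolla\nkorona\nja\nkorona\nolla\n' | freq | head -2
```

Note that in a notebook each `!` line runs in its own shell, so a function defined on one `!` line is not available on the next. Defining and using the function inside a single `%%bash` cell keeps them together.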

In [None]:
#3b. Remove function words
# let's start with removing numbers and punctuation

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM" | cut -f 3 | sort | uniq -c | sort -rn | head

In [None]:
# there's a lot I don't want to have in the above list.
#But it's kinda hard to know what POS categories they are, so first,
# I'll just do a frequency list with the lemmas + POS tags

In [None]:
#grep the token lines
#remove punctuation and numbers
#take the lemma and pos columns
#(N.B. if you use cut -f on several columns, there's no space between the comma and the following number)
#and check that you got the correct columns

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM" | cut -f 3,4 | head

In [None]:
#then do the frequency list of the lemmas and pos tags

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM" | cut -f 3,4 | sort | uniq -c | sort -rn | head

In [None]:
#3b.
# so at least SCONJ, AUX, PRON away...

In [None]:
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM|PRON|SCONJ|AUX"  | cut -f 3,4 | sort | uniq -c | sort -rn | head -20

In [None]:
# looks so much better! but at least CCONJ should still go...
! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep -v "PUNCT|NUM|PRON|SCONJ|CCONJ|AUX"  | cut -f 3,4 | sort | uniq -c | sort -rn | head -20

In [None]:
# but I think that one does it!

In [None]:
# Then the specific POS classes

In [None]:
#4a. Most frequent lemmas for NOUN

#grep the NOUNS
#check that you got the NOUNS

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "NOUN" | head # so these match nouns

In [None]:
# then just column 3 and the frequencies

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "NOUN" | cut -f 3 | sort | uniq -c | sort -rn| head -20

In [None]:
#4b. Most frequent lemmas for ADJ

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "ADJ" | cut -f 3 | sort | uniq -c | sort -rn| head -20

In [None]:
#4c. Most frequent lemmas for VERB

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "VERB" | cut -f 3 | sort | uniq -c | sort -rn| head -20

**EXTRA TIME LEFT?**

Choose a word and count the frequencies of the different forms of the word.

In [None]:
#I decided to have a look at the different adjective compounds with the word 'korona'

! zcat covidtweets.conllu.gz| egrep "^[0-9]" | egrep "korona#" | cut -f 3,4 | egrep "ADJ" | sort | uniq -c | sort -nr | head
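
For the task as literally stated (the different forms of one word), a possible approach is to select the tokens by lemma and count their word forms. The sketch below runs on made-up token lines; the forms and the choice of the lemma `korona` are illustrative assumptions.

```shell
# Made-up token lines (ID, FORM, LEMMA, POS); the forms are illustrative only.
sample=$(printf '1\tKoronan\tkorona\tNOUN\n2\tkoronan\tkorona\tNOUN\n3\tkorona\tkorona\tNOUN\n4\tkoronavirus\tkoronavirus\tNOUN\n')

# Keep the tokens whose lemma (column 3) is exactly "korona",
# take the word form (column 2), lowercase it, and make a frequency list.
forms=$(printf '%s\n' "$sample" \
    | awk -F'\t' '$3 == "korona"' \
    | cut -f 2 \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn)

echo "$forms"
```

On the real file, the same steps would follow `zcat covidtweets.conllu.gz | egrep "^[0-9]"`.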

## Exercise II

Let's do this exercise together.

In this exercise we'll use a **Python wrapper**. Let's not go into detail here, since this is not a course in Python. It suffices to understand the wrapper on the surface.

The point is to show how we can use Python scripts in Bash to get more out of our data. In this Python wrapper we use a script that takes **arguments**, and in these arguments we can **specify details** of the Tweets we want to find.

With this Python wrapper we can print only the tweets that match the content and time of our query.

Here's an example:

`! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "Covid" --time "2020"`

Let's break this down.



*   First we print the file `! zcat covidtweets.conllu.gz`
*   Then on this file we run the Python script. We need to give the command `python3` and the path to the script `../scripts/read_conllu_docs.py`
*   Finally we have two options marked with `--`, in this case `--time` and `--text`


  *   These refer to the metadata lines in the data file

      `###C:TIME`
      
      `###C:TEXT`



  *   By assigning a specific time or word(s) to these options respectively, we can get Tweets from a specific date/time frame and on a specific topic



  *   Both options match any string in the respective metadata lines, so all of the following are valid options:
  
      `--time "2021"`  
      `--time "2021-02"`

      `--text "Covid"`  
      `--text "korona on"`


  *   Also, the `--text` option normalises everything to lower case, so `--text "TWEET"` matches any upper / lower case versions.
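
To make the idea of filtering whole documents more concrete, here is a rough `awk` approximation of this kind of filter. It is **not** the code of `read_conllu_docs.py`; the function name `filter_docs`, its two positional arguments, and the sample documents are all assumptions made for this sketch. It relies only on what is described above: each document starts with `###C: NEWDOC` and carries a `###C:TIME` and a `###C:TEXT` line.

```shell
# filter_docs TEXT TIME: hypothetical stand-in for the course script, NOT its actual code.
# Prints every document whose ###C:TEXT line contains TEXT (ignoring case)
# and whose ###C:TIME line contains TIME. An empty argument matches everything.
filter_docs() {
    awk -v text="$1" -v when="$2" '
        function flush_doc() {
            if (doc != "" &&
                (text == "" || index(tolower(txt), tolower(text)) > 0) &&
                (when == "" || index(tm, when) > 0))
                printf "%s", doc
            doc = ""; txt = ""; tm = ""
        }
        /^###C: NEWDOC/ { flush_doc() }   # a new document starts: decide on the previous one
        /^###C:TEXT/    { txt = $0 }
        /^###C:TIME/    { tm = $0 }
        { doc = doc $0 "\n" }             # buffer every line of the current document
        END { flush_doc() }               # also decide on the last document
    '
}

# Two made-up documents in the metadata layout described above.
sample=$(printf '%s\n' \
    '###C: NEWDOC' '###C:TIME 2020-03-17 10:00:00' '###C:TEXT Covid on uusi' \
    '###C: NEWDOC' '###C:TIME 2021-02-01 09:00:00' '###C:TEXT Aurinko paistaa')

printf '%s\n' "$sample" | filter_docs "covid" "2020"
```

The key design point is that the decision is made per document, not per line: all lines of a tweet are buffered until the next `###C: NEWDOC`, and only then printed or discarded.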

Let's see how the Python wrapper works. In the example, we look for tweets with a mention of "Covid" from the year 2020.

In [None]:
#Tweets with "Covid" from 2020.

! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "Covid" --time "2020" | head -40

In [None]:
#Here we can see that the options match any string
#"aurinko paistaa" means the sun is shining

! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2019-12" --text "aurinko paistaa" | head -40

In [None]:
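#How are the tweets distributed over the years?
#take the date from the time stamp lines, keep just the year, and make a frequency list

! zcat covidtweets.conllu.gz | egrep "###C:TIME" | cut -f 2 -d ' ' | cut -f 1 -d '-' | sort | uniq -c | sort -nr | head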

## Exercise III


Let's focus on tweets that mention specific persons. At least the prime minister @MarinSanna is mentioned frequently, so let's try her.

1. How could you first fetch **just the tweets** that mention her? **Direct** those tweets **to a file**.

2. a) **How many tweets** does this file have? How are they **distributed over time**?

    b) What would be **a reasonable way to analyze the distribution** - the time stamps are quite detailed and not equally distributed?

3. Let's try to **compare** the contents of **tweets** mentioning @MarinSanna **at different periods of time**. Ideally, we would have a sufficient number of tweets to compare, representing different points in time during the crisis.

    a) Gather the tweets to be compared, and analyze possible differences in terms of **frequent words**. Which POS classes provide the most interpretable results? If any?
  
    b) Can you see some **differences** that could be **related to the different periods** in the crisis and **events** that took place?

    When you find interesting words that could reflect some events, remember to analyze tweets with those words to check if the words are used like you anticipated.

4. **If you have time left** after this, you can try with different politicians or **other handles**. For instance, @THLorg seems quite frequent - what is it and what do Twitter users tweet about it?

In [None]:
# 1. Tweets with @MarinSanna to a file
#First check what you get with your command
! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "@MarinSanna" | head

In [None]:
#Then direct to a file

! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "@MarinSanna" | gzip > sannamarin.conllu.gz

In [None]:
#And check that the file looks as intended

! zcat sannamarin.conllu.gz | head -50 # these look ok!

In [None]:
#2.a) How many tweets does this file have?
# by grepping and counting the lines starting with ###C:TEXT we can count the number of tweets

! zcat sannamarin.conllu.gz | egrep "^###C:TEXT" | wc -l

In [None]:
#How are they distributed over time?
# By grepping and counting the time stamps we can get their distribution over time

! zcat sannamarin.conllu.gz | egrep "^###C:TI" | sort | uniq -c | sort -rn | head

# but clearly these stamps are too detailed, no trends can be seen

In [None]:
#2.b) What would be a reasonable way to analyze the distribution?
# I'll take months (and delete the days and times)

# for the dates
#cut -f 2 -d ' ' (second column, delimiter white space)

! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| head

In [None]:
# for the months

#cut -f 2 -d ' ' | cut -f 1,2 -d '-'
# first take the dates, then from the dates, take columns 1 and 2 (year and month) delimited by a hyphen

! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| cut -f 1,2 -d '-' | head # this looks ok!

In [None]:
# to make sure, let's see how the distribution is
#i.e. make a frequency list

! zcat sannamarin.conllu.gz | egrep "^###C:TI" |  cut -f 2 -d ' '| cut -f 1,2 -d '-' | sort | uniq -c | sort -rn | head -30

# or, actually, I think the distribution is kinda wonky, mostly just from the first months of the crisis.
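
The two-step `cut` can be easier to follow on a handful of lines. The sketch below uses made-up time stamps; the layout `###C:TIME <date> <time>` is an assumption inferred from the `cut` commands above.

```shell
# Made-up time stamps; the layout "###C:TIME <date> <time>" is an assumption.
stamps=$(printf '%s\n' \
    '###C:TIME 2020-03-17 10:15:00' \
    '###C:TIME 2020-03-18 08:00:00' \
    '###C:TIME 2020-04-14 21:30:00')

# Step 1: the second space-separated field is the date.
dates=$(printf '%s\n' "$stamps" | cut -f 2 -d ' ')

# Step 2: the first two hyphen-separated fields of the date are year and month.
months=$(printf '%s\n' "$dates" | cut -f 1,2 -d '-')

# Step 3: frequency list of the months.
printf '%s\n' "$months" | sort | uniq -c | sort -rn
```

Changing `-f 1,2` to `-f 1` would group by year, and `-f 1,2,3` by day.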

In [None]:
#Let's try to compare the contents of tweets mentioning @MarinSanna at different periods of time.

# what about the first days of the crisis? What would be frequent enough?
# Based on this frequency list, I'll take March 17 and March 18

! zcat sannamarin.conllu.gz | egrep "^###C:TI" | egrep "2020-03" | cut -f 2 -d ' ' | cut -f 1,2,3 -d '-'| sort | uniq -c | sort -rn | head -30

In [None]:
# unfortunately my python script doesn't support regexes
#let's gather the tweets from the chosen dates to two files

! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-03-18" > 03-18.conllu
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-03-17" > 03-17.conllu

In [None]:
# I want to check that the numbers match - looks ok!

! cat 03-18.conllu 03-17.conllu | egrep "###C: NEWDOC" | wc -l
! cat 03-18.conllu 03-17.conllu | egrep "###C:TIME" | cut -f 2 -d ' ' | cut -f 1,2,3 -d '-' | sort | uniq -c | sort -rn

In [None]:
! ls

In [None]:
! head -20 03-18.conllu
! echo '---'
! head -20 03-17.conllu

In [None]:
! head *conllu

In [None]:
#3.a) Analyze possible differences in terms of frequent words.
#Which POS classes provide the most interpretable results?

#Let's first check the frequencies of POS tags
#N.B. Unless you remove the metadata lines before cutting the column you are interested in,
#the metadata lines will be part of your frequency list

! cat 03-18.conllu 03-17.conllu  | egrep -v "^###|^$" | cut -f 4 | sort | uniq -c | sort -rn | head -20

In [None]:
#3.a) Analyze possible differences in terms of frequent words.
#Which POS classes provide the most interpretable results?

# let's check the most frequent adjectives, nouns, verbs

#get the lemmas of the selected POS tags:
#print the file, remove metadata and empty lines, grep the lines with the POS tags, cut the lemma column, frequency list

! cat 03-18.conllu 03-17.conllu  | egrep -v "^###|^$" | egrep "ADJ|NOUN|VERB" | cut -f 3 | sort | uniq -c | sort -rn | head -20

# I guess these are ok, but I'm not sure about the verbs. would the list be better without verbs?

In [None]:
#let's leave the verbs out

! cat 03-18.conllu 03-17.conllu  | egrep -v "^###|^$" | egrep "ADJ|NOUN" | cut -f 3 | sort | uniq -c | sort -rn | head -20


In [None]:
# well, I guess these do reflect some moments in the crisis!
# there's valmiuslaki "emergency powers legislation", tärkeä "important", raja "border"
# I'll still check how these are used in the tweets. Unfortunately, my script doesn't search for lemmas,
#but maybe the string will get some matches anyway!

#to get the texts only, egrep "###C:TEXT" after the option

#N.B. two files can be read simultaneously by listing them after cat

! cat 03-18.conllu 03-17.conllu  | python3 ../scripts/read_conllu_docs.py --text "raja" | egrep "###C:TEXT" | head

# well, not horrible! The tweets mostly seem to discuss borders and border control

Then the comparison to April. Again, we need enough tweets, preferably from late April so that there's some time gap between them and the March tweets we have. Let's see how many tweets we have from late April.

In [None]:
#3.b) Can you see some differences that could be related to the different periods in the crisis and events that took place?
# these are now just the days of April

#cut -f 2 -d ' ' | egrep "2020-04" | cut -f 3 -d '-'
#take the year-month-day column, then grep April in 2020, and finally take the day column separated by a hyphen

! zcat sannamarin.conllu.gz | egrep "###C:TIME" | cut -f 2 -d ' ' | egrep "2020-04" | cut -f 3 -d '-' | sort | uniq -c | sort -rn

# not many tweets from late April, but let's take April 15 and 14

In [None]:
#direct the tweets from the chosen dates to files

! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-04-15" >  04-15.conllu
! zcat sannamarin.conllu.gz | python3 ../scripts/read_conllu_docs.py --time "2020-04-14" >  04-14.conllu

In [None]:
! ls

In [None]:
! head 04-14.conllu
! echo '---'
! head 04-15.conllu

In [None]:
#get the same POS for April as for March

#N.B. cat 04* matches all files that start with 04, so in this case both April files

! cat 04*  | egrep -v "^###|^$" | egrep "ADJ|NOUN" | cut -f 3 | sort | uniq -c | sort -rn | head -20
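
Comparing the March and April lists by eye works for 20 lines, but the comparison can also be done mechanically with `comm`. The sketch below is one possible way, run on made-up lemma lists; on the real data, the lists could be produced by appending `| head -20 | awk '{print $2}'` to the frequency-list commands above.

```shell
tmp=$(mktemp -d)

# Made-up top-lemma lists, one lemma per line; comm requires sorted input.
printf '%s\n' raja valmiuslaki kiitos | sort > "$tmp/march.txt"
printf '%s\n' maski kiitos            | sort > "$tmp/april.txt"

# -23 hides columns 2 and 3: lines that occur only in the first file remain.
only_march=$(comm -23 "$tmp/march.txt" "$tmp/april.txt")
# -13 hides columns 1 and 3: lines that occur only in the second file remain.
only_april=$(comm -13 "$tmp/march.txt" "$tmp/april.txt")
# -12 hides columns 1 and 2: lines shared by both files remain.
shared=$(comm -12 "$tmp/march.txt" "$tmp/april.txt")

echo "only in March: $only_march"
echo "only in April: $only_april"
echo "in both:       $shared"

rm -r "$tmp"
```

The lemmas that occur in only one of the periods are the natural candidates for a closer look in the tweets themselves.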

In [None]:
#to check which files you match with the *
#you can use ls which will list you all the files that match the given pattern

! ls 04*

I guess there are some differences, at least mökki "summer house" and maski "mask". Let's see what the tweets with these words look like.

In [None]:
#checking the words in the tweets

! cat 04* | python3 ../scripts/read_conllu_docs.py --text "mökki" | egrep "###C:TEXT" | head

# well, looks like it should, I'd think!

In [None]:
#4. Extra.
#@THLorg seems quite frequent - what is it and what do Twitter users tweet about it?

#fetch the tweets that have the handle

! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "@THLorg" | head

In [None]:
#direct these tweets to a new zipped file

! zcat covidtweets.conllu.gz | python3 ../scripts/read_conllu_docs.py --text "@THLorg" | gzip > thl_tweets.gz.conllu

In [None]:
#check that it succeeded

! ls

In [None]:
#check that the new file looks ok

! zcat thl_tweets.gz.conllu | head

In [None]:
#count the texts in the file

! zcat thl_tweets.gz.conllu | egrep "^###C: NEWDOC" | wc -l

In [None]:
#What are these tweets about?

#remove metadata and empty lines
#fetch the lemma and POS columns
#remove punctuation marks
#fetch the lemma column; N.B. now the lemmas are in the first column since you cut from the output of the previous command
#in which you selected two columns
#normalize to lower case
#remove any handles
#make a frequency list
#function words are most common, so checking a bit further down in the frequency list is perhaps more interesting

! zcat thl_tweets.gz.conllu | egrep -v "^###C:|^$" | cut -f 3,4 | egrep -v "PUNCT" | cut -f 1 | tr '[:upper:]' '[:lower:]' | egrep -v "^@" | sort | uniq -c | sort -nr | head -150 | tail -50

In [None]:
# How could you find other frequent handles in the texts?

! zcat covidtweets.conllu.gz | egrep -v "^###" | cut -f 2 | egrep "^@" | sort | uniq -c | sort -nr | head
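
Twitter handles are not case-sensitive, so `@THLorg` and `@thlorg` refer to the same account but would be counted separately above. A possible refinement is to lowercase the handles before counting. The sample lines below are made up.

```shell
# Made-up token lines (ID, FORM, LEMMA, POS) with the same handle in two spellings.
sample=$(printf '1\t@THLorg\t@THLorg\tPROPN\n2\t@thlorg\t@thlorg\tPROPN\n3\t@MarinSanna\t@MarinSanna\tPROPN\n4\tkiitos\tkiitos\tNOUN\n')

# Take the word forms, keep the handles, normalize to lower case, and count.
handles=$(printf '%s\n' "$sample" \
    | cut -f 2 \
    | grep -E "^@" \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn)

echo "$handles"
```

The trade-off is that the original spelling is lost in the output, which is usually acceptable for a frequency overview.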

## **Summary of useful commands in these exercises**

*   match **tab** in egrep

    `egrep "^1[[:space:]]"`

*   grep lines with ADJ, NOUN, **or** VERB

    `egrep "ADJ|NOUN|VERB"`


*   get **several columns**; N.B. no white space between the column numbers

    `cut -f 3,4`

*   get a **column separated by** a hyphen; `-d` sets the delimiter, which is given in single quotation marks (e.g. `-d ' '` for a space); the default delimiter is tab


    `cut -f 3 -d '-'`


*   print **several files simultaneously**

    `cat 03-18.conllu 03-17.conllu`


*   print all files **starting** with 04

    `cat 04*`


*   print all files **ending** with `.txt`

    `cat *.txt`



*   **list** all files ending in `.txt`

    `ls *.txt`


