<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/2024_Notebook_9_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Please use the server to do these exercises. If you are unfamiliar with scripts, you can also start with the time-out exercises listed in Notebook 8.**

## Exercise 1

The GitHub repo https://github.com/MarkHershey/CompleteTrumpTweetsArchive has all the tweets published by Donald Trump while he was in office. Clone the repo to your home folder on the server.

### 1.1
Count the most frequent hashtags and/or handles in the dataset covering the tweets from when Trump was in office. Make sure to ignore any attached punctuation, so that, for example, these two are counted as the same handle:
```
@realDonaldTrump:
@realDonaldTrump
```

Most frequent hashtags:

```
#!/bin/bash

cat realDonaldTrump_in_office.csv | #print the file
    cut -f 2 -d '"'  | #extract the tweet text (N.B. use " as the separator to get the whole tweet; splitting on commas would cut off tweets that contain commas)
    tr ' ' '\n' | #token per line
    egrep "^#" | #grep the lines starting with a hashtag
    perl -pe 's/[[:punct:]]+$//' | #remove all punctuation at the end of the line
    egrep -v "^$" | #remove empty lines
    sort | #frequency list
    uniq -c |
    sort -nr |
    less
```

Run as follows:

```
./your_script_hashtags.sh | less
```

N.B. `less` at the end of a pipe is useful because it shows the output in an easy-to-read, searchable form.
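
To see why the script splits on `"` rather than on commas, here is a small sketch on an invented line. The layout (ID, time, URL, quoted text, with a space after each comma) is an assumption made for the demo, and the URL is a placeholder.

```shell
#!/bin/bash
# Made-up line in the assumed layout: ID, time, URL, "text".
line='1, 2020-01-01 10:00:00, https://example.com/1, "Great day, thank you #MAGA!"'

# Splitting on commas cuts the tweet at its first internal comma.
by_comma=$(printf '%s\n' "$line" | cut -f 4 -d ',')

# Splitting on double quotes returns the whole quoted text.
by_quote=$(printf '%s\n' "$line" | cut -f 2 -d '"')

printf 'comma: [%s]\nquote: [%s]\n' "$by_comma" "$by_quote"
```

The comma version keeps only the part before `thank you`, so the hashtag at the end of the tweet would never be counted.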

Most frequent handles (essentially the same as for hashtags):



```
#!/bin/bash

cat realDonaldTrump_in_office.csv | #print the file
    cut -f 2 -d '"'  | #extract the tweets
    #tr ' ' '\n' | #token per line; you can use either tr or perl
    perl -pe 's/ /\n/g' | #token per line
    egrep "^@" | #grep the lines that start with @
    perl -pe 's/[[:punct:]]+$//' | #remove all punctuation at the end of the line
    egrep -v "^$" | #remove empty lines
    sort | #frequency list
    uniq -c |
    sort -nr #most frequent first
```
Run as:



```
./your_script_handles.sh | less
```







### 1.2
Make a script that takes as argument a handle and prints out its distribution over time month by month. Run the script on a couple of interesting handles / hashtags. Do you see any trends?

**Extra:** sort the output so that the months run from oldest to newest, each followed by the number of tweets for that month, like

```
YEAR-MONTH1 NUM-OF-TWEETS
YEAR-MONTH2 NUM-OF-TWEETS
```
If you get a permission denied error, you have forgotten to add execution rights to your script. This can be done with, for example, `chmod u+x your_script.sh`.
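
The effect of adding execution rights can be checked with a small sketch like the one below. The file name and the temporary directory are hypothetical, and `chmod u+x` (execute permission for the owner only) is just one reasonable choice of mode.

```shell
#!/bin/bash
# Hypothetical file name, created in a temporary directory just for this demo.
workdir=$(mktemp -d)
script="$workdir/your_script.sh"
printf '#!/bin/bash\necho "it runs"\n' > "$script"

# A freshly created file normally has no execute bit.
before=$([ -x "$script" ] && echo yes || echo no)

chmod u+x "$script"    # add execute permission for the file's owner

after=$([ -x "$script" ] && echo yes || echo no)
echo "executable before: $before, after: $after"

rm -r "$workdir"
```

You can also inspect the permissions of a script with `ls -l`.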



```
#!/bin/bash

#run: cat file.csv | ./your_script.sh handle

#frequency list of timestamps

egrep -i "$1" | # grep for lines with the handle / hashtag given as the argument
    cut -f 2 -d ',' | # take the 2nd column (the timestamps)
    cut -f 2 -d ' ' | # take the date; N.B. the timestamp field starts with a space, so the date is the 2nd space-separated field
    cut -f 1,2 -d '-'| # take just the years and months
    sort |
    uniq -c | #count frequencies
    sort -rn #most frequent month first

```
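
The pipeline can be tried on a few invented rows before it is run on the real file. In the sketch below, the rows, the placeholder URLs, and the helper function `month_counts` are all hypothetical; the function only stands in for the body of the script, with `"$1"` as the search term. Note that a hashtag has to be quoted when it is passed as an argument, because an unquoted `#` starts a shell comment.

```shell
#!/bin/bash
# Made-up rows in the assumed layout: ID, time, URL, "text" (a space follows each comma).
rows=$(printf '%s\n' \
    '1, 2017-02-01 10:00:00, https://example.com/1, "Thank you #MAGA"' \
    '2, 2017-02-15 11:00:00, https://example.com/2, "#MAGA rally tonight"' \
    '3, 2017-03-02 12:00:00, https://example.com/3, "Great meeting today"' \
    '4, 2017-03-09 09:30:00, https://example.com/4, "Big crowd! #maga"')

# Hypothetical helper standing in for the script body; "$1" is the search term.
month_counts() {
    grep -E -i -- "$1" |      # same as egrep -i
        cut -f 2 -d ',' |     # timestamp column
        cut -f 2 -d ' ' |     # date part
        cut -f 1,2 -d '-' |   # year and month
        sort |
        uniq -c
}

# Quote the hashtag: an unquoted # would start a shell comment.
result=$(printf '%s\n' "$rows" | month_counts '#MAGA')
echo "$result"
```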
For the extra part of the task, append the following to the pipe:



```
 ###extra
    perl -pe 's/^ *([0-9]+) ([0-9-]+)/$2 $1/g' | #switch the columns the other way around
    sort -n # sort by the timestamp (now the 1st column) in ascending order
```




### 1.3
Make another script that takes as argument a handle and prints out a cleaned and normalized frequency list of the words that occur in the tweets with the handle.

You can try out different ways of cleaning the data. Does it make sense to include tokens with numbers and/or punctuation at all? Or is it better to just, e.g., delete the punctuation and numbers and otherwise keep the strings?



```
#!/bin/bash

#run: cat your_file.txt | ./your_script.sh handle | less

#frequency list of tweets with a specific handle/hashtag

cut -f 2 -d '"' | #set the separator to " in order to get the entire text in the tweets
    egrep -i "$1" | #grep lines with the handle given as an argument when you run the script
    #tr ' ' '\n' | #word per line; either tr or perl
    perl -pe 's/ /\n/g' |
    egrep -v '^[[:punct:]]|[0-9]' | #drop tokens that start with punctuation or contain a number
    tr '[:upper:]' '[:lower:]' | #normalize text to lowercase
    egrep -v "^$" | #remove empty lines
    sort | #frequency list
    uniq -c |
    sort -nr
```
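
The two cleaning strategies mentioned in the task can be compared side by side on a few invented tokens. This is a sketch of one possible comparison, not a recommendation; `grep -E` is used here as the equivalent of `egrep`.

```shell
#!/bin/bash
# Made-up tokens, one per line, as they look after splitting on spaces.
tokens=$(printf '%s\n' 'Great' 'day!' '"Thanks' '2020' 'COVID19' 'great')

# Option 1: drop every token that starts with punctuation or contains a digit.
dropped=$(printf '%s\n' "$tokens" |
    grep -E -v '^[[:punct:]]|[0-9]' |
    tr '[:upper:]' '[:lower:]')

# Option 2: delete punctuation and digits but keep what is left of each token.
stripped=$(printf '%s\n' "$tokens" |
    tr -d '[:punct:][:digit:]' |
    tr '[:upper:]' '[:lower:]' |
    grep -E -v '^$')

printf 'dropped:\n%s\n\nstripped:\n%s\n' "$dropped" "$stripped"
```

Option 1 loses `"Thanks` and `COVID19` entirely and keeps `day!` with its exclamation mark; option 2 keeps them as `thanks`, `covid` and `day`.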



### 1.4 Advanced

The output of 1.3 would be much better if it excluded function words, don't you think?

Trankit is super slow and cannot be run on the server. But UDPipe can; its accuracy is lower, but maybe it is enough. Let's try!

The UDPipe documentation can be found at https://lindat.mff.cuni.cz/services/udpipe/api-reference.php

With the following call, UDPipe prints the CoNLL-U analysis of the input file, here called delme.csv, to standard output:

`curl -F data=@delme.csv -F model=english -F tokenizer= -F tagger= -F parser= http://lindat.mff.cuni.cz/services/udpipe/api/process | PYTHONIOENCODING=utf-8 python -c "import sys,json; sys.stdout.write(json.load(sys.stdin)['result'])" `

If you have time, try to implement a script that is similar to 1.3 but outputs lemmas instead of running words, and only content POS classes, such as nouns and verbs, instead of all the most frequent words.

You can try different POS classes to find the output you consider most useful. What happens if you include only nouns? Or proper nouns? Or adjectives? Or all of these classes?



```
#!/bin/bash

#run cat your_file.csv | ./your_script.sh handle

egrep -i "$1" | cut -f 4- -d ',' > toparse.csv #extract the tweets (field 4 onwards, so tweets containing commas stay whole), direct to a new file

curl -F data=@toparse.csv -F model=english -F tokenizer= -F tagger= -F parser= http://lindat.mff.cuni.cz/services/udpipe/api/process |
    PYTHONIOENCODING=utf-8 python -c "import sys,json; sys.stdout.write(json.load(sys.stdin)['result'])" | #parse
    egrep -w "NOUN|PROPN|ADJ" | #grep the lines with these tags as individual words, i.e., not part of a word
    cut -f 3 | #lemmas
    tr '[:upper:]' '[:lower:]' | #normalize
    egrep -vx "rt|amp|&amp;?" | #remove the retweet marker and HTML-entity residue; -x matches whole lines only, so words such as "party" are kept
    sort | #frequency list
    uniq -c |
    sort -nr |
    less
```
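
The filtering step can be tried offline on a few invented CoNLL-U lines. In CoNLL-U the columns are tab-separated, with the lemma in column 3 and the universal POS tag in column 4. The sketch below swaps in `awk` for the `egrep -w` and `cut` combination, so that the tag is compared against column 4 only; the sample sentence and its analysis are made up.

```shell
#!/bin/bash
# Made-up CoNLL-U lines: ID, FORM, LEMMA, UPOS, then the remaining columns.
conllu=$(printf '%b\n' \
    '# text = Walls are great' \
    '1\tWalls\twall\tNOUN\tNNS\t_\t3\tnsubj\t_\t_' \
    '2\tare\tbe\tAUX\tVBP\t_\t3\tcop\t_\t_' \
    '3\tgreat\tgreat\tADJ\tJJ\t_\t0\troot\t_\t_')

# Keep the lowercased lemma (column 3) when the UPOS tag (column 4) is a content class.
lemmas=$(printf '%s\n' "$conllu" |
    awk -F '\t' '$4 == "NOUN" || $4 == "PROPN" || $4 == "ADJ" {print tolower($3)}')

# Frequency list of the selected lemmas.
printf '%s\n' "$lemmas" | sort | uniq -c | sort -nr
```

Changing the tags in the `awk` condition is an easy way to experiment with different POS classes.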

### Tips
A perl one-liner can be useful for modifying strings. This example replaces all instances of ORIGINAL with REPLACED.

```
cat inputfile.txt | perl -pe 's/ORIGINAL/REPLACED/g'
```

The substitution also supports back references: wrap part of `ORIGINAL` in parentheses and refer to what it matched with `$1` in `REPLACED`. Here, for example, every punctuation character is replaced by a newline followed by that same character:

```
 perl -pe 's/([[:punct:]])/\n$1/g'
```
----------------

Note that a script can contain several commands, and they are executed one after another (with the exception of pipes, where all the commands run at the same time).

For instance, if a script has the two commands below, it first writes "kukkuu" to a file called delme.txt and then, as a separate step, prints the file and egreps the lines matching "kuu".

```
echo "kukkuu" > delme.txt

cat delme.txt | egrep "kuu"
```