<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/ATP_2025_Notebook_9_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hands-on 9-1

### 9-1.1
The GitHub repo https://github.com/MarkHershey/CompleteTrumpTweetsArchive has all the tweets published by Donald Trump while he was in office. **Clone** the repository to your home folder on the server.

```
git clone https://github.com/MarkHershey/CompleteTrumpTweetsArchive
```


### 9-1.2
Count the most frequent hashtags (#) and / or handles (@) in the dataset covering tweets from when Trump was in office. Make sure to ignore possible punctuation to avoid losing data, e.g.:
```
@realDonaldTrump:
@realDonaldTrump
```

Most frequent hashtags:
* (N.B. to get entire tweets, use `"` as the separator; tweets can include commas, so using the comma as the separator gives only parts of some tweets)
```
#!/bin/bash

cat realDonaldTrump_in_office.csv | # print the file
    cut -f 2 -d '"' | # extract the column with the tweets
    tr ' ' '\n' | # one token per line
    egrep "^#" | # keep the tokens starting with a hashtag
    perl -pe 's/[[:punct:]]+$//' | # remove punctuation at the end of the token
    egrep -v "^$" | # remove empty lines
    sort | # frequency list
    uniq -c |
    sort -nr
```
Run like this:
* N.B. `less` at the end of a pipe is useful as it gives the output in an easy-to-read and searchable format.
```
./your_script_hashtags.sh | less
```
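
The effect of the separator choice can be checked on a single line. The sketch below uses a made-up CSV line. The column layout (ID, time, URL, quoted text) is an assumption about the file, and the URL is a placeholder.

```shell
#!/bin/bash
# Made-up CSV line; the column layout (ID, time, URL, quoted text) is an assumption.
line='123, 2017-01-20 06:31,https://example.com/123,"Thank you, Ohio! #MAGA"'

# Comma as separator: the tweet is cut at its first internal comma.
by_comma=$(printf '%s\n' "$line" | cut -f 4 -d ',')

# Double quote as separator: field 2 is everything between the first pair of quotes.
by_quote=$(printf '%s\n' "$line" | cut -f 2 -d '"')

echo "comma: $by_comma"
echo "quote: $by_quote"
```

Note that a tweet that itself contains a double quote would still be cut short by this approach, since `cut` knows nothing about CSV quoting rules.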


### 9-1.3
Make a script that takes a tweet handle as an argument and prints out its distribution over time month by month. Run the script on a couple of interesting handles / hashtags. Do you see any trends?

```
#!/bin/bash

# run: cat file.csv | ./your_script.sh handle

# frequency list of timestamps

egrep -i "$1" | # keep the lines with the handle / hashtag
    cut -f 2 -d ',' | # take the 2nd column (the timestamps)
    cut -f 2 -d ' ' | # take the date; N.B. the timestamp field starts with a space, so the 1st space-separated field is empty and the date is the 2nd
    cut -f 1,2 -d '-' | # take just the year and month
    sort |
    uniq -c | # count frequencies
    sort -rn # most frequent month first
```
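
A minimal sketch of the same idea on made-up data, so that the result can be checked by eye. The sample lines, the assumed column layout (ID, time with a leading space, URL, text) and the function name `monthly_counts` are all hypothetical. `grep -E` is used here in place of `egrep`, which does the same thing.

```shell
#!/bin/bash
# Hypothetical sample in the assumed layout: ID, time (with a leading space), URL, text.
sample='1, 2017-01-20 06:31,https://example.com/1,"Thanks @FoxNews"
2, 2017-01-25 09:00,https://example.com/2,"Watch @foxnews tonight"
3, 2017-02-03 12:15,https://example.com/3,"More on @FoxNews"
4, 2017-02-04 08:45,https://example.com/4,"Nothing relevant here"'

# Count matching tweets per year-month; "$1" is the handle to search for.
monthly_counts() {
    grep -E -i -- "$1" |
        cut -f 2 -d ',' |     # timestamp column, e.g. " 2017-01-20 06:31"
        cut -f 2 -d ' ' |     # date part (field 1 is the empty string before the space)
        cut -f 1,2 -d '-' |   # keep YEAR-MONTH
        sort |
        uniq -c
}

printf '%s\n' "$sample" | monthly_counts '@foxnews'
```

Quoting `"$1"` keeps a handle with special characters in one piece, and `-i` makes `@foxnews` match `@FoxNews` as well.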

**Extra:** sort the output so that the months are ordered from oldest to newest, each followed by the number of tweets for that month, like this:

```
YEAR-MONTH1 NUM-OF-TWEETS
YEAR-MONTH2 NUM-OF-TWEETS
```
If you get a "permission denied" error, you have forgotten to add execution rights to your script. This can be done with `chmod u+x your_script.sh`.

So what to do:
* switch the two columns around with Perl, regex capture groups, and back references
* sort by the timestamp (now the 1st column) in ascending order

```
perl -pe 's/^ *([0-9]+) ([0-9-]+)/$2 $1/' |
  sort
```







### 9-1.4
Make another script that takes a handle as an argument and prints out a cleaned and normalized frequency list of the words that occur in the tweets with the handle.

You can try out different ways of cleaning the data. Does it make sense to include tokens with numbers and / or punctuation at all? Or is it better to just, e.g., delete the punctuation and numbers and otherwise keep the strings there?

```
#!/bin/bash

# run: cat your_file.txt | ./your_script.sh handle | less

# frequency list of the words in tweets with a specific handle/hashtag

cut -f 2 -d '"' | # use " as the separator to get the entire text of each tweet
    egrep -i "$1" | # keep the lines with the handle given as an argument
    #tr ' ' '\n' | # word per line; either tr or perl
    perl -pe 's/ /\n/g' |
    egrep -v '^[[:punct:]]|[0-9]' | # drop tokens that start with punctuation or contain a digit
    tr '[:upper:]' '[:lower:]' | # normalize to lowercase
    egrep -v "^$" | # remove empty lines
    sort | # frequency list
    uniq -c |
    sort -nr
```
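
As one possible answer to the cleaning question, the sketch below deletes punctuation characters and digits from inside the tokens instead of dropping whole tokens. This is an alternative strategy, not the course solution, and the function name `word_freq` is hypothetical.

```shell
#!/bin/bash
# Alternative cleaning strategy (an assumption, not the course solution):
# delete punctuation characters and digits from tokens instead of dropping the tokens.
word_freq() {
    tr ' ' '\n' |                     # one token per line
        tr -d '[:punct:][:digit:]' |  # strip punctuation and digits inside tokens
        tr '[:upper:]' '[:lower:]' |  # lowercase everything
        grep -v '^$' |                # drop tokens that became empty
        sort |
        uniq -c |
        sort -nr
}

echo 'Great GREAT great, 2020 win!' | word_freq
```

The trade-off is that `great,` and `great` are now counted together, but handles and hashtags lose their `@` and `#`, and `don't` becomes `dont`.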

## Hands-on 9-2

Finnish parliamentary speeches have been published on parlamenttisampo.fi, but it is difficult to get large amounts of data from there. Luckily, the full datasets are available as yearly CSV files here:
https://a3s.fi/parliamentsampo/speeches/csv/index.html

Unluckily for us, the format of the files is not optimal for our tools, but let's see what we can get out of it anyway.

### 9-2.1

Get the speech file for the year 2020. Browse the file to see what it contains. Notice that it is comma-separated and that the actual speech ("content") is enclosed in double quotes. More advanced tools can handle that, but our `cut` tool cannot.

(N.B. A good way to deal with this would be to use e.g. Python and the Pandas library, but that is advanced stuff. We will make do with simpler tools.)

Make a pipe that only uses the first line of text (containing the headers) and prints a list of headers, one per line.

```
wget https://a3s.fi/parliamentsampo/speeches/csv/speeches_2020.csv
less speeches_2020.csv  # exit with q

head -1 speeches_2020.csv | tr ',' '\n'
```
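
Since the later tasks need column numbers for `cut -f`, it can help to number the headers. The sketch below does that with `cat -n` on a hypothetical miniature file. The header names and the sample row are made up, and `list_headers` is not part of the course material.

```shell
#!/bin/bash
# Print the header names of a comma-separated file, numbered, one per line.
# The numbers are the field numbers to pass to cut -f.
list_headers() {
    head -1 "$1" | tr ',' '\n' | cat -n
}

# Hypothetical miniature file; the real header has many more columns.
tmpfile=$(mktemp)
printf 'speech_id,date,family,content\n1,2020-02-04,Virtanen,"Arvoisa puhemies!"\n' > "$tmpfile"

list_headers "$tmpfile"
rm -f "$tmpfile"
```

On the real data this would be run as `list_headers speeches_2020.csv`.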



### 9-2.2

Find out the 10 last names (column "family") that have given the most speeches in 2020.

Find out the 10 full names (column "name_in_source") that have given the most speeches in 2020.

```
cat speeches_2020.csv | cut -f7 -d ',' | sort | uniq -c | sort -rn | head

cat speeches_2020.csv | cut -f23 -d ',' | sort | uniq -c | sort -rn | head

```



### 9-2.3


Find out which months were the most active (most speeches given) and which months were the least active (fewest speeches given) for debate.

Which months are missing? Why do you think this is?

```
cat speeches_2020.csv |
  cut -f3 -d ',' |
  cut -f1,2 -d '-' |
  egrep '^2020' | # to get rid of noise
  sort |
  uniq -c |
  sort -rn |
  head -20  # to make sure all 12 months are shown and to see if something else was caught
```


### 9-2.4

Try to find out the most discussed topic.

```
cat speeches_2020.csv | cut -f10 -d ',' | sort | uniq -c | sort -rn | head -20

# this does not really give us anything useful :/
```



### 9-2.5

Try to find out the party that spoke the most often.


```
cat speeches_2020.csv | cut -f9 -d ',' | sort | uniq -c | sort -rn | head -20

```





---






## Hands-on 9-3



### 9-3.1

a) Create a file that contains this string of text on one line:
```
These are the rules:BREAKJava is trashBREAKPython is coolBREAKAnd so is Bash
```

* either use nano and copypaste...
* ...or use `echo` and save to file from stdout:
```
echo "These are the rules:BREAKJava is trashBREAKPython is coolBREAKAnd so is Bash" > poetry.txt
```

b) Formulate a simple pipeline that reads the file you just created, uses the `perl` substitution command to change all `BREAK` words into newlines `\n`, then egreps all lines that have the word "is"

```
cat poetry.txt | perl -pe 's/BREAK/\n/g' | egrep "is"
```



### 9-3.2

a) Make a script file called `print_rules.sh` that contains the previous pipeline but divided on multiple lines for readability (remember to indent!).

```
#!/bin/bash
cat poetry.txt |
  perl -pe 's/BREAK/\n/g' |
  egrep "is"
```

b) Then add one command to the beginning of the script that prints the text string "RULES:", then add another command to the end of the script that prints a name (`Anna`).

```
#!/bin/bash
echo "RULES:"
cat poetry.txt |
  perl -pe 's/BREAK/\n/g' |
  egrep "is"
echo "Anna"
```

c) Modify the last line so that you can give any name as an argument (replace "Anna" with something). Test the script with your name.

```
#!/bin/bash
echo "RULES:"
cat poetry.txt |
  perl -pe 's/BREAK/\n/g' |
  egrep "is"
echo "$1"
```

Run with:
```
./print_rules.sh somenamehere
```
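
How the argument ends up in `$1` can be tried out in isolation. The function `sign_off` below is a hypothetical stand-in for the last line of the script, and the fallback value `nobody` is an assumption added for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for the last line of print_rules.sh.
# ${1:-nobody} falls back to "nobody" when no argument is given (the fallback is an assumption).
sign_off() {
    echo "${1:-nobody}"
}

sign_off Anna            # prints: Anna
sign_off "Anna Maria"    # quotes keep the two words together as one argument
sign_off Anna Maria      # without quotes only the first word ends up in $1
sign_off                 # prints: nobody
```

The same applies to the script itself: `./print_rules.sh "Anna Maria"` passes the full name as a single argument.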



### 9-3.3

Modify the Perl script so that it finds all instances of "is " that are followed by a lowercase character and substitutes them with "is super ", like this:
```
Python is cool >>>> Python is super cool
And so is Bash >>>> And so is Bash
```

Use regex capture groups and back references!

```
#!/bin/bash
echo "RULES:"
cat poetry.txt |
  perl -pe 's/BREAK/\n/g' |
  perl -pe 's/\b(is) ([[:lower:]])/$1 super $2/g' | # \b keeps the "is" inside words like "This" from matching
  egrep "is"
echo "$1"
```



## Hands-on 9-4

Get a copy of a recipe dataset from here:

`/home/ankrris/data/recipes_modified.csv`

a) Find out how many recipes do not use tomatoes in the "ingredients" column.

```
cat recipes_modified.csv | cut -f6 -d '|' | egrep -v "tomato" | wc -l
```

b) Replace every "potato" with "tomato". Now how many recipes do not use any tomatoes in the "ingredients" column?

```
cat recipes_modified.csv | cut -f6 -d '|' | perl -pe 's/potato/tomato/g' | egrep -v "tomato" | wc -l
```

d) Replace every "tomato" and "potato" with "pomato" in a single Perl substitution. Count lines with "pomato".

```
cat recipes_modified.csv | cut -f6 -d '|' | perl -pe 's/(potato|tomato)/pomato/g' | egrep "pomato" | wc -l
```

e) Make a script that substitutes every word that begins with an uppercase letter with that same word plus "-ho". E.g. "Place" >>>>> "Place-ho". Provide the file to the script as an argument.

```
#!/bin/bash
cat "$1" | perl -pe 's/\b([A-Z][a-z]*)\b/$1-ho/g'
```

* `\b`: word boundary (ensures we match whole words).

* `[A-Z]`: the first character must be uppercase.

* `[a-z]*`: zero or more lowercase letters after that.

* `$1-ho`: puts back the captured word and adds `-ho`.