
## Exercises

Today's data consists of speeches held in the Finnish parliament, provided by the Language Bank of Finland. The dataset is in the file `eduskunta.vrt.gz`, which can be found at `/home/mavela/data-2022` on the course server. The file is quite big, so please don't copy it to your own working directory. Instead, access it directly with `less` or `zcat`, e.g.

`zcat /home/mavela/data-2022/eduskunta.vrt.gz `

The data format mimics CoNLL-U, but is slightly different (please don't ask why they do not respect standards). You'll notice that the metadata is provided on lines starting with `<`. For the speeches, the column order is

`word word_id lemma lemmacomp POS morpho dephead deprel nertag `
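
As a rough illustration of the column layout, the sketch below pushes a tiny invented sample through `grep` and `cut`. The tokens, their annotations, and the `<sentence>` tags are made up for the example and the columns are assumed to be tab-separated; only the column order comes from the description above.

```shell
# Invented two-token sample in the column order listed above (tab-separated).
# The real data would come from: zcat /home/mavela/data-2022/eduskunta.vrt.gz
sample=$(
  printf '<sentence>\n'
  printf 'Arvoisa\t1\tarvoisa\tarvoisa\tA\t_\t2\tamod\tO\n'
  printf 'puhemies\t2\tpuhemies\tpuhe|mies\tN\t_\t0\troot\tO\n'
  printf '</sentence>\n'
)

# Drop the metadata lines, then keep column 3 (lemma) and column 5 (POS).
lemmas_and_pos=$(printf '%s\n' "$sample" | grep -a -v '^<' | cut -f 3,5)
printf '%s\n' "$lemmas_and_pos"
```

Changing the list given to `cut -f` picks other columns, e.g. `cut -f 1` for the plain words.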

### Basic stats and metadata

- How many tokens does the dataset have?

The metadata lines include a lot of interesting information about the data.
- Timestamps for the speeches are given on the lines starting with `<text`. Make a list of the timestamps so you get to know from when the speeches are.

- The names of the speakers are given on the lines starting with `<paragraph`. Who are the most active speakers?

Note that the data includes binary characters (noise), so instead of plain `egrep` you should use `grep -a`.

Also, the Perl one-liner `perl -pe 's/ORIGINAL/REPLACED/g'` can be useful here. Remember that it accepts regular expressions and back references.
So in the following example, the regular expression `[a-zöäå ]+` matches one or more of the characters listed inside the `[]` (lowercase letters and the white space!). Each match is replaced by a newline, the matched string (`$1`), and another newline. Note that since the example input is in upper case, only the spaces match here.

` echo "THIS IS MY + SENTENCE!" | perl -pe 's/([a-zöäå ]+)/\n$1\n/g' | less`

Also remember to escape special characters, such as `"` with `\"`, when using the Perl one-liner, so that they are matched literally.

- The lines starting with `<paragraph` also include the party of the speaker. What are the most frequent parties? How many different parties are there?

In [None]:
# The tokens can be counted by dropping the metadata lines starting with "<" and then counting the remaining lines.
# grep -a is used because the data contains binary noise.

zcat /home/mavela/data-2022/eduskunta.vrt.gz | grep -a -v "^<" | wc -l

# The timestamps can be listed by first keeping the lines starting with "<text", then splitting each attribute onto a line of its own, and finally keeping the lines starting with "date=".

zcat /home/mavela/data-2022/eduskunta.vrt.gz | grep -a "^<text" | tr ' ' '\n' | grep -a "^date=" | less

# The most active speakers can be listed in many ways. This solution uses a regex that matches the name following speaker=" (note the whitespace for first + last names, and the hyphen for hyphenated names),
# puts each name on a line of its own, and then counts and sorts those lines.

zcat /home/mavela/data-2022/eduskunta.vrt.gz | grep -a "^<paragraph" | perl -pe 's/speaker=\"([A-ZÖÄÅa-zöäå -]+)/\nSpeaker $1\n/g' | grep -a "Speaker" | sort | uniq -ci | sort -rn | less

# The parties are matched with a technique very similar to the one used for the speakers' names.

zcat /home/mavela/data-2022/eduskunta.vrt.gz | grep -a "^<paragraph" | perl -pe 's/speaker_parl_group=\"([a-z]+)\"/\nPARTY $1 \n/g' | grep -a "PARTY" | sort | uniq -ci | sort -rn

# To get the number of different parties, it's enough to count the lines of the deduplicated list.

zcat /home/mavela/data-2022/eduskunta.vrt.gz | grep -a "^<paragraph" | perl -pe 's/speaker_parl_group=\"([a-z]+)\"/\nPARTY $1 \n/g' | grep -a "PARTY" | sort | uniq | wc -l


### Keywords - if you have time

The data can be used for many interesting analyses of, e.g., the characteristics of individual speakers and parties, and changes over time.

Let's count keywords for something that interests you. Select two individual politicians, points in time, or parties, and count keywords for those using the keyword script we looked at earlier.

The keyword script takes in data where each text, or speech, is on a line of its own. To get this, you can use the script `read_vrt.py` in the course scripts folder on GitHub. The script reads the vrt file from standard input (pipe). As arguments, you can use:

```
--speaker
--time
--party
```
The `speaker` and `party` arguments accept speakers and parties as they are in the vrt file metadata. The `speaker` argument also accepts partial matches, so e.g. just the last name is sufficient. The `time` argument accepts years.

So e.g., the following commands print the speeches matching the specified argument:


```
zcat /home/mavela/data-2022/eduskunta.vrt.gz | python3 read_vrt.py --speaker Lipponen

zcat /home/mavela/data-2022/eduskunta.vrt.gz | python3 read_vrt.py --time 2009 

zcat /home/mavela/data-2022/eduskunta.vrt.gz | python3 read_vrt.py --party vas 
```

The keyword script input should be two files where each document is on a line of its own. The lines can contain whatever information you want to get from the syntactically analyzed files: lemmas, words, or even word ngrams.
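
As a small illustration of the ngram option, the sketch below turns one document line into word bigrams joined with an underscore. The sample line is invented, and the underscore convention is just an assumption that keeps each bigram as a single whitespace-separated unit.

```shell
# One invented document, words separated by spaces.
doc='arvoisa puhemies kiitos puheenvuorosta'

# Join every pair of neighbouring words with "_" and print them on one line.
bigrams=$(printf '%s\n' "$doc" \
  | awk '{
      out = ""
      for (i = 1; i < NF; i++) {
        sep = (i > 1) ? " " : ""
        out = out sep $i "_" $(i + 1)
      }
      print out
    }')
printf '%s\n' "$bigrams"
```

Since `awk` handles the input line by line, the same command works on a whole file with one document per line.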


So, to get there, you still need to modify the output of the `read_vrt.py` script to produce your two files in this format. This can be done relatively easily with `egrep -v`, `cut -f` and the Perl one-liner. Just drop the lines you don't need, take the columns you want, and then put back line breaks to match the document boundaries marked as "###C: NEWDOC".

Note that you can also ignore function words at this point if you want to!
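
One possible way to do this is to filter on the POS column before cutting out the lemmas. In the sketch below the sample lines and the tag names (`N`, `V`, `A`, `C`) are assumptions made for illustration; check which tags the real data uses before relying on them.

```shell
# Invented sample: a conjunction, a noun and a verb inside one document.
sample=$(
  printf '###C: NEWDOC\n'
  printf 'ja\t1\tja\tja\tC\t_\t2\tcc\tO\n'
  printf 'puhemies\t2\tpuhemies\tpuhe|mies\tN\t_\t3\tnsubj\tO\n'
  printf 'puhuu\t3\tpuhua\tpuhua\tV\t_\t0\troot\tO\n'
)

content=$(printf '%s\n' "$sample" \
  | awk -F '\t' '
      /^###C: NEWDOC/ { print; next }                      # keep document boundaries
      $5 == "N" || $5 == "V" || $5 == "A" { print $3 }     # lemmas of content words only
    ')
printf '%s\n' "$content"
```

The output still contains the "###C: NEWDOC" markers, so it can go through the same line-joining and splitting steps as the unfiltered data.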

Finally, you can get the keywords with 

` python3 text_dispersion_text.py vayrynen lipponen.txt` 


In [None]:
# This extracts just the speeches held by Lipponen.
zcat /home/mavela/data-2022/eduskunta.vrt.gz | python3 read_vrt.py --speaker Lipponen > lipponen.conlluish

# This drops the metadata lines, keeps just the words (column 1), joins everything onto one line,
# and then breaks that line again at each document boundary.

grep -a -v -E "SPEAKER|PARTY|TIME" lipponen.conlluish | cut -f 1 | perl -pe 's/\n/ /g' | perl -pe 's/###C: NEWDOC/\n/g' > lipponen.txt

# And then we can just give the two files to the keyword script (the vayrynen file is assumed to have been prepared in the same way).

python3 text_dispersion_text.py vayrynen lipponen.txt