<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/ATP_2025_Notebook_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2: more commands, flags (options) and pipes

**Today's topics:**
*   finding patterns (e.g., words) in a line (egrep/grep; grepping)
*   making commands more versatile with flags
*   combining commands (pipes)
*   counting frequencies



## egrep (or grep)
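* Searches a file **line by line** and prints every line that contains a match for the pattern;
  
  e.g. `egrep "is" filename.txt`

  prints all lines in the file that contain the string `is`

* `egrep` is the same as `grep -E`; recent versions of GNU grep print a warning that recommends `grep -E`, but the results are identical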

* see http://people.uta.fi/~jm58660/jutut/unix/grep.html (in Finnish)
  

* N.B. 1: **capitals and whitespace count**!

  So `egrep " is" filename.txt` gives a different output than `egrep "is" filename.txt`

* N.B. 2: **lines**

  A line
  1) ends in a line break, so it might be long.
  2) does not necessarily correspond to a sentence.
  3) can consist of several sentences, or even an entire text.

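To see how capitals and whitespace change the result, here is a minimal sketch on a small made-up file (the sample lines are an assumption, not part of the course data). It uses `grep -E`, which behaves like `egrep`, together with `-c`, which prints the number of matching lines instead of the lines themselves. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Build a tiny sample file (hypothetical content, for illustration only).
sample=$(mktemp)
printf '%s\n' \
  'This is a line.' \
  'Paris at night' \
  'Snow fell on the hills' \
  'the snow melted' \
  'Is it over?' \
  'Snowflakes everywhere' > "$sample"

is_lines=$(grep -Ec 'is' "$sample")        # "This is a line." and "Paris at night"
space_is_lines=$(grep -Ec ' is' "$sample") # only "This is a line."
snow_lower=$(grep -Ec 'snow' "$sample")    # only "the snow melted"
snow_upper=$(grep -Ec 'Snow' "$sample")    # "Snow fell..." and "Snowflakes..."
snow_space=$(grep -Ec 'Snow ' "$sample")   # only "Snow fell...": the space must follow

echo "is: $is_lines, ' is': $space_is_lines"
echo "snow: $snow_lower, Snow: $snow_upper, 'Snow ': $snow_space"
```

Note that `Is it over?` is not counted for `is`, because the capital `I` does not match the lowercase pattern.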

In [None]:
# let's first get the data we had previously
! wget https://www.gutenberg.org/cache/epub/20748/pg20748.txt
! ls

In [None]:
! mv pg20748.txt book.txt # let's change the name to an easier one


In [None]:
! head book.txt # let's have a look to remind us what's inside

In [None]:
! egrep "snow" book.txt

In [None]:
! egrep "Snow" book.txt

In [None]:
! egrep "Snow " book.txt

## More about flags / options

* Options are arguments that can be given to Unix commands to change their behavior. These form one of the core elements of Unix command line work.

* Most flags have both a short form (e.g. `-v`) and a long form (e.g. `--verbose`); we use short forms whenever possible since they are faster to type

* Options are difficult to remember, but you can look them up, for instance, on man pages, or google for instructions (which are probably easier to understand than man pages)

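As a small illustration of short and long forms, the sketch below runs the same command both ways on a made-up file. It assumes the GNU versions of `wc` and `grep` (as found on Colab and most Linux systems); other systems may only accept the short forms. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Tiny sample file (hypothetical content).
sample=$(mktemp)
printf '%s\n' 'one two' 'three' 'Four five six' > "$sample"

# Short form and long form of the same option give the same result.
short_count=$(wc -l < "$sample")
long_count=$(wc --lines < "$sample")            # GNU long form of -l

short_match=$(grep -i 'four' "$sample")
long_match=$(grep --ignore-case 'four' "$sample")  # GNU long form of -i

echo "lines: $short_count / $long_count"
echo "match: $short_match / $long_match"
```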
In [None]:
# an example of the manual page for ls
! man ls

In [None]:
# In Colab the man pages have been 'minimized', so you need to run this first
# before Colab will show them to you:
! unminimize
# Colab will ask you if you really want to do this;
# click on the area behind the question, type 'y' and press enter.
# After running this cell you can try the cell above again.

### Some useful options

* `head -n 15` or `head -15` prints the first 15 lines
* `tail -n 15` or `tail -15` prints the last 15 lines
* `wc -w` prints the word count
* `wc -l` prints the line count

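These options are easy to check on a file whose content is predictable. The sketch below uses `seq` to create a made-up file in which every line holds its own line number (the file is an assumption for illustration). It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# 20 lines containing the numbers 1..20, one per line.
sample=$(mktemp)
seq 1 20 > "$sample"

first_three=$(head -n 3 "$sample")   # lines 1, 2 and 3
last_two=$(tail -n 2 "$sample")      # lines 19 and 20
line_count=$(wc -l < "$sample")      # reading via "<" keeps the file name out of the output
word_count=$(wc -w < "$sample")      # one word per line here, so same as the line count

echo "$first_three"
echo "-----"
echo "$last_two"
echo "lines: $line_count, words: $word_count"
```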

In [None]:
# let's try a bit!
! head -5 book.txt
! echo "-----"
! wc -l book.txt
! wc -w book.txt

### More useful options

* `egrep -v` (reverse) prints lines without a match
* `egrep -c` counts the matching lines (not the individual matches)
* `egrep -i` ignores case
* `egrep -w` matches whole words only
* `egrep -B N` prints N lines before each match
* `egrep -A N` prints N lines after each match



N.B. The flags can also be **combined**, e.g., `egrep -vi`

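The effect of each flag is easiest to see on a file small enough to check by eye. The following is a minimal sketch on made-up lines (an assumption for illustration), using `grep -E`, which behaves like `egrep`. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Five hypothetical lines.
sample=$(mktemp)
printf '%s\n' \
  'Snow fell all night.' \
  'The snowman smiled.' \
  'No flakes here.' \
  'We love snow' \
  'Last line.' > "$sample"

plain=$(grep -Ec 'snow' "$sample")         # lines 2 and 4
ignore_case=$(grep -Eci 'snow' "$sample")  # lines 1, 2 and 4
whole_word=$(grep -Ecw 'snow' "$sample")   # line 4 only ("snowman" is a different word)
both=$(grep -Eciw 'snow' "$sample")        # lines 1 and 4
inverted=$(grep -Evc 'snow' "$sample")     # lines 1, 3 and 5 have no lowercase "snow"

# -B 1 also prints one line of context before the match.
before=$(grep -E -B 1 'flakes' "$sample")

echo "plain=$plain -i=$ignore_case -w=$whole_word -iw=$both -v=$inverted"
echo "$before"
```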
In [None]:
! egrep -i "snow" book.txt
! egrep -w "snow" book.txt
! egrep -wi "snow" book.txt # the flags can also be combined
! egrep -v "snow" book.txt

In [None]:
! egrep -c "snow" book.txt
! egrep -c "Snow" book.txt
! egrep -ic "snow" book.txt
! egrep -icw "snow" book.txt

In [None]:
! egrep -B 5 -A 5 "Snow" book.txt

### Time to try things out!

Let's use the same file we already have, now called `book.txt`. Do the following tasks:

1. Rename the file as e.g., `fairytales.txt`
2. Count how many **words** and **lines** the file has.
3. Count **how many times** the pronoun _he_ appears in the file. How can you control the use of **capitals** and **white spaces** so that the query matches **only entire words**?
4. Direct the number of lines that match the query (_he_) to **a new file** called `pron_in_file.txt`.



### Advanced
* You can also combine commands on one line using `;`
* You can assign a variable with `name=value`, capture the output of a command with `$(...)`, and read the variable back with `$name`:

In [None]:
! wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"


In [None]:
! wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"
! echo "Word count is", $wordcount # prints no number in Colab: each line starting with ! runs in its own shell, so the variable set on the previous line is gone

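The three pieces (assignment, `$(...)` and `$name`) can be sketched on a made-up file as follows. The file content and the variable names are assumptions for illustration. Because the whole block runs in one shell, the variables stay available from line to line; in Colab this is the case inside a cell that starts with `%%bash`.

```shell
# Tiny sample file (hypothetical content).
sample=$(mktemp)
printf '%s\n' 'one two three' 'four five' > "$sample"

# name=value assigns (no spaces around "="); $(...) captures a command's output.
wordcount=$(wc -w < "$sample")   # "<" feeds the file via stdin, so wc prints only the number
linecount=$(wc -l < "$sample")

# $name reads the value back; ";" simply runs two commands one after the other.
message="Words: $wordcount"; message="$message, lines: $linecount"
echo "$message"
```

Reading the file with `<` is a design choice: `wc -w book.txt` would also store the file name in the variable.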
### Pipes (*putket* in Finnish)

* You can combine commands with a pipe `|` (alt gr + the key next to z)
* This way the output of the first command goes as input to the next one (the order of the commands is important!)
* `cat file.txt | wc -w` first prints a file, then counts its words (the count is the same as with `wc -w file.txt`)
* `cat file.txt | egrep "is"` first prints a file, then matches the pattern
* `cat file.txt | egrep "is" | wc -l` counts the matching lines
* `cat file.txt | egrep "is" | head` shows the first 10 matching lines
* `cat file.txt | egrep "is" > output.txt` saves the matching lines in `output.txt`
* `cat file.txt | head -1000 | tail -100` prints lines 901-1000 (the last 100 of the first 1000 lines)

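Which lines a `head | tail` combination returns can be verified with `seq`, whose output has one number per line, so each line's content equals its line number. This is a small sketch with made-up input, not part of the course data. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Take the first 1000 of 2000 lines, then the last 100 of those.
slice=$(seq 1 2000 | head -n 1000 | tail -n 100)

first=$(echo "$slice" | head -n 1)   # first line of the slice
last=$(echo "$slice" | tail -n 1)    # last line of the slice
count=$(echo "$slice" | wc -l)       # number of lines in the slice

echo "lines $first to $last ($count lines)"
```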
In [None]:
! cat book.txt | egrep "is" | wc -l
! cat book.txt | egrep "is" | head

In [None]:

! cat book.txt | egrep -w "is" | wc -l
! cat book.txt | egrep -w "is" | head


In [None]:
! cat book.txt | egrep -wi "is" | head
! cat book.txt | egrep -wi "is" | wc -l

In [None]:
# Again, you can direct these to a file as before
! cat book.txt | egrep -wi "is" | wc -l > file.txt
! cat file.txt

## Frequency counts

* A frequency counter can be built by combining the *sort* and *uniq* commands with a pipe
* `sort` sorts the input lines into alphabetical order
* `uniq` removes repeated lines that follow each other


Let's first get some data. This time we'll fetch some parliamentary data.

In [None]:
!wget https://a3s.fi/parliamentsampo/speeches/csv/speeches_2024.csv

Next, we'll need some Python to make the data format suitable for our purposes. **You need not know how to do this**, but feel free to have a look at the code snippet.

In [None]:
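import pandas as pd
# Read the comma-separated file and write it back out tab-separated so that `cut` can split the columns.
df = pd.read_csv('speeches_2024.csv')
# Note: by default pandas also writes the row index as an extra first column.
df.to_csv('speeches_2024.tsv', sep='\t')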

Let's check what we got.

In [None]:
!ls

Looks good. Let's use the **tsv-file** and just ignore the original csv-file we downloaded.

In [None]:
# What does the data look like?

! cat speeches_2024.tsv | head

#! head speeches_2024.tsv # outputs the same thing as the command above

In [None]:
! cat speeches_2024.tsv | wc -l

There's quite a lot of data, so let's focus on one month only. We'll pick March this time.

Let's grep the lines that match the timestamp 2024-03 and direct them into a new file. Note that the pattern is matched anywhere on the line, not only in the date column.

In [None]:
!cat speeches_2024.tsv | egrep '2024-03' > speeches_2024_03.tsv

In [None]:
!ls

In [None]:
!cat speeches_2024_03.tsv | wc -l

It's always possible that data files have empty lines. Let's check if we have any in this data. We can use

`egrep "^$"`

to do that.

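In the pattern, `^` stands for the start of a line and `$` for its end, so `^$` matches only lines with nothing in between. The sketch below, on a made-up file, shows that a line containing only spaces is not caught by `^$`; the stricter pattern with `[[:space:]]` is an extra suggestion, not something the course data is known to need. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Four lines: text, a truly empty line, a line holding only spaces, text.
sample=$(mktemp)
printf 'first\n\n   \nlast\n' > "$sample"

empty=$(grep -Ec '^$' "$sample")                     # only the truly empty line
blank=$(grep -Ec '^[[:space:]]*$' "$sample")         # empty or whitespace-only lines
kept=$(grep -Ev '^[[:space:]]*$' "$sample" | wc -l)  # lines left after removing both kinds

echo "empty=$empty blank=$blank kept=$kept"
```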
In [None]:
! cat speeches_2024_03.tsv | egrep "^$" | wc -l

This data has several columns.
Let's first take just the speakers' surnames, which are in column 8.

`cut` prints only the selected column(s); by default, it treats tabs as the column separator.

`cut -f 8` will give us column 8.

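How `cut -f` picks columns can be sketched on a small tab-separated file. The three-column table below is made up for illustration and is much simpler than the parliamentary data. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Hypothetical three-column, tab-separated data (\t is a tab).
sample=$(mktemp)
printf 'id\tgiven\tfamily\n1\tAino\tVirtanen\n2\tEero\tKorhonen\n' > "$sample"

second=$(cut -f 2 "$sample")             # column 2 only
first_and_third=$(cut -f 1,3 "$sample")  # columns 1 and 3, still separated by a tab

# For other separators, -d sets the delimiter, e.g. a comma.
csv_second=$(printf 'a,b,c\n' | cut -d ',' -f 2)

echo "$second"
echo "$first_and_third"
echo "$csv_second"
```

Counting the columns by hand in a wide file is error-prone, so it is worth checking the output of `cut` with `head` before building on it.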

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | head

Let's count how many lines we have.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | wc -l

In case a name is missing in a row in the original data, we will have empty lines here. Let's check.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep '^$' | wc -l

Let's remove the empty line with `egrep -v '^$'` and check that the line count has dropped by one.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep -v '^$' | wc -l

Let's save these in a new file.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep -v '^$' > speakers.txt

In [None]:
! cat speakers.txt | head

Then to the **frequency counter** part

*   `sort` sorts the lines **alphabetically**



In [None]:
! cat speakers.txt | sort | head # the beginning of the sorted document looks like this

In [None]:
! cat speakers.txt | sort | tail

* `uniq` prints only unique lines, i.e., deletes duplicate lines that follow each other
* `uniq -c` does the same and also prefixes each line with the number of times it occurred

  --> it is important to sort the lines first, so that identical lines end up next to each other

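Why the sorting matters can be seen on a handful of made-up names (an assumption for illustration): without `sort`, repeated names that are not next to each other survive `uniq`. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Five hypothetical surnames; the duplicates are not all adjacent.
sample=$(mktemp)
printf '%s\n' 'Virtanen' 'Korhonen' 'Virtanen' 'Virtanen' 'Korhonen' > "$sample"

unsorted_unique=$(uniq "$sample" | wc -l)        # only adjacent repeats are merged
sorted_unique=$(sort "$sample" | uniq | wc -l)   # after sorting, all repeats are adjacent
counts=$(sort "$sample" | uniq -c)               # each name prefixed with its frequency

echo "without sort: $unsorted_unique lines, with sort: $sorted_unique lines"
echo "$counts"
```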
In [None]:
! cat speakers.txt | sort | uniq | head

In [None]:
! cat speakers.txt | sort | uniq -c | head # always use sort before uniq!

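Sorting again with `sort -n` orders the lines **numerically**, here by the count at the start of each line. Compare plain `sort` with `sort -n` in the next two cells.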

In [None]:
! cat speakers.txt | sort | uniq -c | sort | head

In [None]:
! cat speakers.txt | sort | uniq -c | sort -n | head  # This command pipe may take a while to print out! How would you fix that?

`sort -nr` sorts the lines numerically **in reverse order** (largest count first), which is usually more practical for a frequency list.

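The difference between the three sort variants can be sketched on a few made-up `count name` lines. These are an assumption for illustration: real `uniq -c` output pads the counts with leading spaces, which `sort -n` skips. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Hypothetical lines in the "count name" shape.
counts=$(printf '%s\n' '9 Virtanen' '10 Korhonen' '2 Laine')

# Plain sort compares character by character, so "10" comes before "2" and "9".
text_last=$(echo "$counts" | sort | tail -n 1)

# sort -n compares the leading numbers as numbers: 2, 9, 10.
num_low=$(echo "$counts" | sort -n | head -n 1)

# sort -nr reverses that order: 10, 9, 2.
num_top=$(echo "$counts" | sort -nr | head -n 1)

echo "plain sort ends with: $text_last"
echo "sort -n starts with:  $num_low"
echo "sort -nr starts with: $num_top"
```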
In [None]:
! cat speakers.txt | sort | uniq -c | sort -nr | head

Since there are some URLs among the names, let's remove these so we'll get a frequency list of names only.

In [None]:
! cat speakers.txt | egrep -v '^http' | sort | uniq -c | sort -nr | head

### ... so this is how we finally have the frequencies!

### What did we learn?

* Flags are used to change or specify the behavior of commands
* Pipes are used to combine commands
* New flags and commands
  * `egrep` matches character strings
  * `egrep` + flags
  * `egrep -v "^$"` removes empty lines
  * `cut -f` selects a specific column
  * `sort` sorts lines
  * `uniq` removes adjacent duplicate lines
  * `sort | uniq -c | sort -rn` produces a frequency list

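Put together, the commands listed above form one pipeline. The following is a minimal sketch on a made-up two-column file (the ids, names and URL are assumptions for illustration); it uses `grep -E`, which behaves like `egrep`, and removes empty lines and URLs in a single step with the alternation `|`. It is plain shell; in Colab it can be run in a cell that starts with `%%bash`.

```shell
# Hypothetical TSV: an id and a surname; one surname is missing, one is a URL.
sample=$(mktemp)
printf '1\tVirtanen\n2\tKorhonen\n3\t\n4\tVirtanen\n5\thttp://example.org/x\n6\tVirtanen\n7\tKorhonen\n' > "$sample"

# Select column 2, drop empty lines and URLs, then count and rank the names.
freq=$(cut -f 2 "$sample" | grep -Ev '^$|^http' | sort | uniq -c | sort -nr)

top=$(echo "$freq" | head -n 1)        # the most frequent name with its count
names=$(echo "$freq" | wc -l)          # number of distinct names

echo "$freq"
```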