<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/ATP_2025_Notebook_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2: more commands, flags (options) and pipes

**Today's topics:**
*   finding patterns (e.g., words) in a line (egrep/grep; grepping)
*   making commands more versatile with flags
*   combining commands (pipes)
*   counting frequencies



## egrep (or grep)
* Matches a **pattern** on a **line** and **prints the line** with the match;
  
  e.g. `egrep "is" filename.txt`

  prints all lines in the file that match (have) the string 'is'

* see http://people.uta.fi/~jm58660/jutut/unix/grep.html about egrep (in Finnish)
  
---

N.B. 1: **capitals and whitespaces count**!

  All the following commands give a different output:

  `egrep "is" filename.txt` - lines with the string *is*, as a word or part of it (e.g., '**is**', 'th**is**', 'l**is**t', '**is**land')
  
  `egrep " is" filename.txt` - lines with the string *is* preceded by a space, where *is* can be a separate word or at the beginning of a word (e.g., '**is**', '**is**land')

  `egrep "is " filename.txt` - lines with the string *is* followed by a space, where *is* can be a separate word or at the end of a word (e.g., '**is**', 'th**is**')
  
  `egrep " is " filename.txt` - lines with the whitespace separated string *is*, i.e., a separate word ('**is**')

  `egrep "Is" filename.txt` - lines with the string *Is*, as a word or part of it (e.g., '**Is**', '**Is**land', 'th**Is**' [camel case strings: thisIsCamelCase])
  
  `egrep " Is " filename.txt` - lines with the whitespace separated string *Is*, i.e., a separate word ('**Is**')
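
The effect of capitals and whitespace can be checked on a tiny file. This is a minimal sketch: the sample lines are invented, `grep -E` is used as the equivalent of `egrep` (recent GNU grep versions print a deprecation warning for `egrep`), and the `-c` flag, which counts matching lines, is only there to keep the output short. In Colab the snippet could be run in a `%%bash` cell.

```shell
# Sample data, invented for this sketch
tmp=$(mktemp)
printf '%s\n' 'This list is long' 'Is it an island' 'thisIsCamelCase' 'it is' > "$tmp"

# -c counts the matching lines instead of printing them
n_plain=$(grep -Ec "is" "$tmp")    # 'is' anywhere: all four lines
n_lead=$(grep -Ec " is" "$tmp")    # space before: lines 1, 2 (' island') and 4
n_trail=$(grep -Ec "is " "$tmp")   # space after: line 1 only
n_both=$(grep -Ec " is " "$tmp")   # spaces on both sides: line 1 only
n_cap=$(grep -Ec "Is" "$tmp")      # capital I: lines 2 and 3

echo "is:$n_plain  _is:$n_lead  is_:$n_trail  _is_:$n_both  Is:$n_cap"
rm -f "$tmp"
```

Note that `" is "` misses the last line ('it is'), because the word ends the line and has no space after it.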

---

N.B. 2: **lines**

  A line
  1) ends in a line break, so it might be long.
  2) does not necessarily correspond to a sentence.
  3) can consist of several sentences, or even an entire text.
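
A small sketch of what this means in practice, using two invented sample lines and `grep -E` as the equivalent of `egrep`: the match sits in the second sentence, but the whole line is printed.

```shell
# Two lines of invented sample text; the first line holds two sentences
tmp=$(mktemp)
printf '%s\n' 'This is one sentence. Here is another one.' 'A short line.' > "$tmp"

# grep prints the whole matching line, not just the matching sentence
matched=$(grep -E "another" "$tmp")
echo "$matched"
rm -f "$tmp"
```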


In [None]:
# let's first get the limerick data we had previously
! wget https://raw.githubusercontent.com/aristila/totally-random-stuff/refs/heads/main/bash_limerick.txt
! ls

In [None]:
! head bash_limerick.txt # let's have a look to remind us what's inside

In [None]:
# find patterns that match the string 'Bash' (capital B!), as a separate word or part of a word
! egrep "Bash" bash_limerick.txt

In [None]:
# lower-cased 'bash', as a separate word or part of a word
! egrep "bash" bash_limerick.txt

In [None]:
# 'rep' as a separate word or part of a word
! egrep "rep" bash_limerick.txt

In [None]:
# 'rep' as a separate word or in the beginning of a word
! egrep " rep" bash_limerick.txt

In [None]:
# 'rep' as a separate word or in the end of a word
! egrep "rep " bash_limerick.txt

# N.B. the comma in 'rep,'

In [None]:
# 'he' as a separate word or part of a word; lower-case!
! egrep "he" bash_limerick.txt

In [None]:
# 'He' (capital H!) as a separate word or part of a word
! egrep "He" bash_limerick.txt

In [None]:
# 'he' as a separate word; lower-case!
! egrep " he " bash_limerick.txt

## More about flags / options

* Options are arguments that can be given to Unix commands to change their behavior. These form one of the core elements of Unix command line work.

* Most flags have both a short form (e.g. `-v`) and a long form (e.g. `--verbose`); we use short forms whenever possible since they are faster to write

* Options are difficult to remember but you can read them for instance on `man` pages or google for instructions (which are probably easier to understand than `man` pages)
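
As an illustration of short and long forms, here is a hedged sketch with invented sample data. It assumes GNU grep (as found on Colab), where `-i` can also be written `--ignore-case` and `-c` can be written `--count`; other systems may lack the long forms.

```shell
# Invented sample data
tmp=$(mktemp)
printf '%s\n' 'Bash is fun' 'bash is a shell' 'zsh is another shell' > "$tmp"

# Short forms: -i (ignore case), -c (count matching lines)
short=$(grep -E -i -c "bash" "$tmp")
# Long forms of the same two options (GNU grep)
long=$(grep -E --ignore-case --count "bash" "$tmp")

echo "short form: $short, long form: $long"
rm -f "$tmp"
```

Both commands do the same thing; the long form is easier to read, the short form quicker to type.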

In [None]:
# an example of the manual page for ls
! man ls

In [None]:
# In Colab the man pages have been 'minimized', so you need to run this first
# before Colab will show them to you:
! unminimize
# Colab will ask you if you really want to do this;
# click on the area behind the question, type 'y' and press enter.
# After running this cell you can try the cell above again.

### Some useful options

* `head -n 15` or `head -15` prints the first 15 lines
* `tail -n 15` or `tail -15` prints the last 15 lines
* `wc -w` prints **word** count
* `wc -l` prints **line** count
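
These options can be tried on a file whose contents are known in advance. The sketch below uses an invented file containing the numbers 1 to 20; reading the file with `<` is an extra not covered above, used here so that `wc` prints only the number and not the filename.

```shell
# Invented sample data: the numbers 1 to 20, one per line
tmp=$(mktemp)
seq 1 20 > "$tmp"

first=$(head -n 3 "$tmp")   # lines 1, 2, 3
last=$(tail -n 2 "$tmp")    # lines 19, 20
lines=$(wc -l < "$tmp")     # line count only, no filename
words=$(wc -w < "$tmp")     # word count only, no filename

echo "$first"
echo "$last"
echo "lines: $lines, words: $words"
rm -f "$tmp"
```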


In [None]:
# let's try a bit!
! head -3 bash_limerick.txt
! echo "-----"
! wc -l bash_limerick.txt
! wc -w bash_limerick.txt

In [None]:
# if no flag is given, wc prints the line, word, and byte counts (in that order)
! wc bash_limerick.txt

## More useful options

* `egrep -v` (invert) prints lines **without a match**
* `egrep -c` **counts** the lines with a match (lines, not individual occurrences)
* `egrep -i` **ignores case**
* `egrep -w` matches **whole words only** (prints lines where the match is a word and not part of a word)
* `egrep -B N` also prints N lines **Before** each matching line
* `egrep -A N` also prints N lines **After** each matching line



N.B. The flags can also be **combined**, e.g.,

`egrep -vi`

ignores case and prints lines without the pattern.
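
A minimal sketch of these flags, alone and combined, on four invented sample lines (`grep -E` stands in for `egrep`):

```shell
# Invented sample data
tmp=$(mktemp)
printf '%s\n' 'He said hello' 'then she left' 'the end' 'HE came back' > "$tmp"

plain=$(grep -E -c "he" "$tmp")       # lower-case 'he' anywhere: lines 1, 2, 3
nocase=$(grep -E -ic "he" "$tmp")     # any case: all four lines
word=$(grep -E -wic "he" "$tmp")      # whole word, any case: 'He' and 'HE'
without=$(grep -E -vc "he" "$tmp")    # lines with no lower-case 'he': line 4
before=$(grep -E -B 1 "end" "$tmp")   # the matching line plus one line before it

echo "plain:$plain nocase:$nocase word:$word without:$without"
echo "$before"
rm -f "$tmp"
```

The second line contains 'he' twice ('then', 'she') but is counted once, since `-c` counts lines rather than occurrences.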

In [None]:
# let's print the text here so it's easier to compare egrep output with the entire text
! cat bash_limerick.txt

In [None]:
# ignores case (so all lines with the string 'he', capitalized or not, and as a separate word or part of a word, are printed)
! egrep -i "he" bash_limerick.txt

In [None]:
# only matches lines with 'a' as a word
! egrep -w "a" bash_limerick.txt

In [None]:
# all lines with 'a', as a separate word or part of a word
! egrep "a" bash_limerick.txt

In [None]:
# the string 'he' as a word, case ignored
! egrep -wi "he" bash_limerick.txt # the flags can also be combined

In [None]:
# prints all lines that do not match the string 'Bash'
! egrep -v "Bash" bash_limerick.txt

In [None]:
# prints all lines that do not match the string 'he' (lower-case!) as a separate word or part of a word
! egrep -v "he" bash_limerick.txt

In [None]:
# counts lines with the string "Bash"
! egrep -c "Bash" bash_limerick.txt

In [None]:
# counts lines with the string 'he', ignoring case
! egrep -ic "he" bash_limerick.txt

In [None]:
# let's check which lines it matched (and counted)
! egrep -i "he" bash_limerick.txt

In [None]:
# counts lines with 'he' (lower-case), as a separate word or as part of word
! egrep -c "he" bash_limerick.txt

In [None]:
# these are the lines that were counted above
! egrep "he" bash_limerick.txt

In [None]:
# ignores case, matches only 'he' as a word and counts the lines
! egrep -icw "he" bash_limerick.txt

In [None]:
# order of the flags does not matter
! egrep -wic "he" bash_limerick.txt

In [None]:
# ignores case, 'he' as a separate word
! egrep -iw "he" bash_limerick.txt

In [None]:
# prints one line before and two lines after the line(s) that match the string 'and'
! egrep -B 1 -A 2 "and" bash_limerick.txt

### Time to try things out!

Let's fetch some data from project Gutenberg at www.gutenberg.org. Do the following tasks:



1. Download this file: https://www.gutenberg.org/cache/epub/20748/pg20748.txt
2. Rename the file to, e.g., `fairytales.txt`
3. Count how many **words** and **lines** the file has.
4. Count on **how many lines** the pronoun _he_ appears in the file. How can you control the use of **capitals** and **white spaces** so that the query matches **only entire words**?
5. Direct the number of lines that match the query (_he_) to **a new file** called `pron_in_file.txt`.


**Advanced**

Direct the output to a file called `pron_count_with_text.txt` with the following text: 'The pronoun he appeared this many times:'

I.e., when you print the file, the output should look something like this:
```
The pronoun he appeared this many times:
15
```

N.B. The commands etc. in **'Advanced' sections will not be tested in the exam**. They're just there to give you a bit more (advanced) information.

### Advanced
* You can also combine commands on one line using `;`
* You can store the output of a command in a variable with `name=$(command)` and read the value back with `$name`:



In [None]:
! wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"


In [None]:
! wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"
! echo "Word count is", $wordcount # the variable is empty here: on Colab each `!` line runs in its own shell, so the variable set on the line above is gone
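
The same idea as a self-contained sketch. The sample file is invented and stands in for `book.txt`; reading it with `<` is an assumption added here so that `wc` returns only the number, without the filename.

```shell
# Invented sample file standing in for book.txt
tmp=$(mktemp)
printf '%s\n' 'one two three' 'four five' > "$tmp"

# name=$(command) stores the command's output in a variable (no spaces around =)
wordcount=$(wc -w < "$tmp")
# $name reads the value back; ';' separates two commands on one line
echo "Word count is $wordcount" ; echo "done"
rm -f "$tmp"
```

Because all lines run in the same shell here (e.g. in a `%%bash` cell on Colab), the variable is still available on later lines.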

### Pipes (*putket* in Finnish)

* You can combine commands with a pipe `|` (alt gr + the key next to z)
* This way the output of the first command goes as input to the next one (**the order of the commands is important!**)
* `cat file.txt | wc -w` first prints a file, then counts words (this is actually the same as just `wc -w file.txt`)
* `cat file.txt | egrep "is"`  first prints a file, then matches the pattern
* `cat file.txt | egrep "is"  | wc -l` prints file, matches a pattern, counts the lines with the pattern
* `cat file.txt | egrep "is" | head` prints file, matches a pattern, prints the head of the lines with the pattern
* `cat file.txt | egrep "is" > output.txt` prints file, matches a pattern, directs the matched lines to a new file
* `cat file.txt | head -1000 | tail -100`  prints lines 901-1000: `head -1000` passes the first 1000 lines of the file to the next command, and `tail -100` prints the last 100 lines of that output
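
Why the order of commands matters can be seen with a file of numbered lines. A minimal sketch with invented data (the numbers 1 to 2000, one per line):

```shell
# Invented sample data: the numbers 1 to 2000, one per line
tmp=$(mktemp)
seq 1 2000 > "$tmp"

# head keeps lines 1-1000; tail then keeps the last 100 of those
slice=$(cat "$tmp" | head -n 1000 | tail -n 100)
first=$(echo "$slice" | head -n 1)
last=$(echo "$slice" | tail -n 1)
echo "head, then tail: lines $first to $last"

# Swapping the commands gives a different result
swapped=$(cat "$tmp" | tail -n 100 | head -n 1)
echo "tail, then head: starts at line $swapped"
rm -f "$tmp"
```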

In [None]:
# once again, let's use the limerick
! cat bash_limerick.txt

In [None]:
# prints file, matches a pattern, counts the lines with the pattern
! cat bash_limerick.txt | egrep "and" | wc -l

In [None]:
# here I use echo to separate output, but it's not necessary

# prints file, matches a pattern as a word, prints the lines with the pattern
! cat bash_limerick.txt | egrep -w "a"

! echo
! echo --- line count ---

# prints file, matches a pattern as a word, counts the lines with the pattern
! cat bash_limerick.txt | egrep -w "a" | wc -l

! echo
! echo --- the first two lines with a match ---

# prints file, matches a pattern as a word, prints the head of the lines with the pattern
! cat bash_limerick.txt | egrep -w "a" | head -2


In [None]:
# prints file, matches a pattern as a word and ignores case, prints the head of the lines with the pattern
! cat bash_limerick.txt | egrep -wi "he" | head

# prints file, matches a pattern as a word and ignores case, counts the lines with the pattern
! cat bash_limerick.txt | egrep -wi "he" | wc -l

In [None]:
# Again, you can direct these to a file as before
! cat bash_limerick.txt | egrep -wi "he" | wc -l > file.txt
! cat file.txt

## Frequency counts



* A frequency counter is useful for counting, e.g., how often each string occurs in a text
* A frequency counter can be built by combining the *sort* and *uniq* commands with a pipe
* `sort` sorts the input lines in **alphabetical order**
* `uniq` removes **repeated, consecutive** lines


Let's first get some data. This time we'll fetch some parliamentary data.

In [None]:
!wget https://a3s.fi/parliamentsampo/speeches/csv/speeches_2024.csv

In [None]:
# let's check that the file made it to our directory
! ls

In [None]:
# let's check what we got
! head speeches_2024.csv

This file is a **csv** file.

**csv** stands for comma-separated values.

This file format is a bit impractical for us at the moment, so let's change it to **tsv**.

**tsv** stands for tab-separated values.

Next, we'll need some Python to make the data format suitable for our purposes. **You need not know how to do this**, but feel free to have a look at the code snippet.

In [None]:
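# let's save the csv file as a tsv file
import pandas as pd
df = pd.read_csv('speeches_2024.csv')
# sep='\t' makes the output tab-separated;
# note that by default to_csv also writes the row index as an extra first column
df.to_csv('speeches_2024.tsv', sep='\t')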

Let's check what we got.

In [None]:
!ls

In [None]:
# What does the data look like?

! cat speeches_2024.tsv | head

#! head speeches_2024.tsv #outputs the same thing as the command above

Looks good. Let's use the **tsv-file** and just ignore the original csv-file we downloaded.

In [None]:
# let's count the lines
! cat speeches_2024.tsv | wc -l

There's quite a lot of data, so let's focus on one month only. I select March this time.

Let's grep the lines that match the time stamp 2024-03 and direct them into a new file.

In [None]:
!cat speeches_2024.tsv | egrep '2024-03' > speeches_2024_03.tsv

In [None]:
!ls

In [None]:
# let's count the lines in this new file with speeches from March only
!cat speeches_2024_03.tsv | wc -l

It's always possible that data files have empty lines. Let's check if we have any in this data. We can use

`egrep "^$"`

to do that.

Let's have a quick look at the command:

`^` stands for line **start**

`$` stands for line **end**

so together, line start and line end with nothing in between make an empty line!
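
A small sketch of the pattern at work, on invented sample data (`grep -E` stands in for `egrep`). One line contains a single space, to show that it does not count as empty:

```shell
# Invented sample data: two empty lines and one line containing only a space
tmp=$(mktemp)
printf 'first\n\nsecond\n \n\nthird\n' > "$tmp"

empty=$(grep -Ec '^$' "$tmp")   # truly empty lines
kept=$(grep -Evc '^$' "$tmp")   # all other lines, including the space-only line
echo "empty: $empty, kept: $kept"
rm -f "$tmp"
```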


In [None]:
! cat speeches_2024_03.tsv | egrep "^$" | wc -l

This data has several columns.
Let's first take just the speaker surnames from the data; they are in column number 8.

`cut` will output just (a) specific column(s).

`cut -f 8` will give us column 8.

**N.B.** `cut` counts columns starting from 1 (unlike indexing in Python, which starts from 0!), so the first column is accessed with `cut -f 1`
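
A minimal sketch of `cut -f` on a tiny tab-separated file. The names and the column layout are invented for illustration and do not come from the parliamentary data:

```shell
# Invented tab-separated sample: id, first name, surname
tmp=$(mktemp)
printf '1\tAino\tVirtanen\n2\tMatti\tKorhonen\n' > "$tmp"

surnames=$(cut -f 3 "$tmp")   # third column only
ids=$(cut -f 1 "$tmp")        # first column: counting starts from 1
two=$(cut -f 1,3 "$tmp")      # several columns, given as a comma-separated list

echo "$surnames"
echo "$ids"
echo "$two"
rm -f "$tmp"
```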


In [None]:
# always check that you've cut the intended column!
!cat speeches_2024_03.tsv | cut -f 8 | head

Let's count how many lines we have.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | wc -l

Good. This line count corresponds to the line count of the entire file.

In case a name is missing in a row in the original data, we will have an empty line here. Let's check.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep '^$' | wc -l

Let's remove this one empty line and check that we succeeded.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep -v '^$' | wc -l

It seems that we were able to remove the empty line; cf. line count with the previous line count of the speaker column.

Let's save these in a new file.

In [None]:
!cat speeches_2024_03.tsv | cut -f 8 | egrep -v '^$' > speakers.txt

In [None]:
# and let's check the file looks ok
! cat speakers.txt | head

Then to the **frequency counter** part

*   `sort` sorts the lines **alphabetically**



In [None]:
! cat speakers.txt | sort | head # the beginning of the sorted document looks like this

In [None]:
! cat speakers.txt | sort | tail

* `uniq` prints only unique lines, i.e., deletes consecutive duplicate lines (lines that follow each other)
* `uniq -c` does the same, and also prints in front of each line the number of times it occurred

  --> **important to sort the lines first**
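
Why sorting comes first can be seen on four invented sample lines, where the repeats are not all next to each other:

```shell
# Invented sample data with repeats that are not all adjacent
tmp=$(mktemp)
printf '%s\n' 'b' 'a' 'b' 'b' > "$tmp"

unsorted=$(uniq "$tmp")            # only the adjacent pair is collapsed: b a b
sorted=$(sort "$tmp" | uniq)       # after sorting all repeats are adjacent: a b
counted=$(sort "$tmp" | uniq -c)   # each unique line with its count in front

echo "$unsorted"
echo "$sorted"
echo "$counted"
rm -f "$tmp"
```

Without `sort`, 'b' still appears twice in the output, so any counts from `uniq -c` would be split across several lines.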

In [None]:
! cat speakers.txt | sort | uniq | head

In [None]:
# then let's count the sorted unique lines
! cat speakers.txt | sort | uniq -c | head # always use sort before uniq!

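Piping the counts into a second sort with `sort -n` sorts the lines **numerically**, i.e., by their count. Let's first see what plain `sort` does with the counts: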

In [None]:
# sort sorts alphabetically
! cat speakers.txt | sort | uniq -c | sort | head

There seems to be some data in the file that does not correspond to a surname, but let's ignore that for now and get back to it later.

In [None]:
# sort -n sorts by number
! cat speakers.txt | sort | uniq -c | sort -n | head  # This command pipe may take a while to print out! How would you fix that?

`sort -nr` sorts the lines by their number **in reverse order** --> perhaps a bit more practical

In [None]:
# sort -nr sorts lines by number in descending order (largest count first)
! cat speakers.txt | sort | uniq -c | sort -nr | head
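
The difference between `sort`, `sort -n` and `sort -nr` shows clearly on three invented numbers:

```shell
# Invented sample counts
tmp=$(mktemp)
printf '%s\n' '10' '9' '100' > "$tmp"

alpha_first=$(sort "$tmp" | head -n 1)       # character by character: '10' comes first
numeric_first=$(sort -n "$tmp" | head -n 1)  # by value, smallest first
reverse_first=$(sort -nr "$tmp" | head -n 1) # by value, largest first

echo "sort: $alpha_first, sort -n: $numeric_first, sort -nr: $reverse_first"
rm -f "$tmp"
```

Plain `sort` compares the lines as text, so '10' and '100' end up before '9'.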

Now, let's do something about the lines that do not correspond to surnames. `egrep -v` could help us here. In the pattern we want to match we must not forget to add line start `^`, otherwise it will match patterns also elsewhere in the line. Here the `egrep -v` without line start would work, but in general it is good practice to remember to indicate the line start if you wish to match patterns that begin a line.

In [None]:
# so here is our final frequency list
! cat speakers.txt | egrep -v '^http' | sort | uniq -c | sort -nr | head
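
What the line start `^` changes can be sketched on invented sample data, where one line starts with `http` and another only mentions it (`grep -E` stands in for `egrep`):

```shell
# Invented sample data: a name, a stray URL, and a name with a note
tmp=$(mktemp)
printf '%s\n' 'Virtanen' 'http://example.org/speech' 'Korhonen (source: http archive)' > "$tmp"

anchored=$(grep -Evc '^http' "$tmp")   # drops only lines that START with http
loose=$(grep -Evc 'http' "$tmp")       # drops every line that contains http anywhere
echo "kept with '^http': $anchored, kept with 'http': $loose"
rm -f "$tmp"
```

Without `^`, the line for 'Korhonen' would be lost as well.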

### ... so this is how we finally have the frequencies!

## Now it's your turn!


1. Use the same file with speeches from March 2024.
2. Focus on (i.e., cut) the column with the party.
3. Remove any empty lines.
4. Save the parties to a new file called `parties.txt`
5. Make a frequency list of the parties.

N.B. Not all data is neatly in the column it belongs to. Thus, make sure to remove any items that do not refer to a party.

Also, remember to check after each command that you have or did what you intended to! I have not added separate code cells for these.







In [None]:
# 1. Check the file to be used

In [None]:
# 2. Cut the column with the parties

In [None]:
# 3. Remove any empty lines

In [None]:
# 4. Save the column to a new file called parties.txt

In [None]:
# 5. Make the frequency list

In [None]:
# 6. Remove any items that do not belong to the frequency list (N.B. you might want to run this command earlier!)

### What did we learn?

* Flags are used to change or specify commands
* Pipes are used to combine commands
* New flags and commands
  * `egrep` match character strings
  * `egrep` + flags
  * `egrep -v "^$"` delete empty lines

      `^` beginning of a line

      `$` line end
  * `cut -f` select a specific column
  * `sort`
  * `uniq`
  * `sort | uniq -c | sort -rn` frequency list
