<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/Notebook2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2: more commands, flags (options) and pipes

## egrep (or grep)
* Prints every line that contains a match for a pattern
* see http://people.uta.fi/~jm58660/jutut/unix/grep.html
* e.g. `egrep "is" file`
* Note that capitals and whitespace count!
* So `egrep " is" file` gives a different output
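
The effect of capitals and whitespace can be tried on a tiny file first. The sketch below is only an illustration: `sample.txt` and its three lines are made up, and `grep -E` is used as the equivalent of `egrep`. It is plain shell, so in Colab it can go in a cell that starts with `%%bash`.

```shell
# Create a small hypothetical sample file (not part of the course data)
printf 'This is a line\nIsland life\nthe list\n' > sample.txt

# "is" matches anywhere on the line, also inside "This" and "list";
# "Island" is skipped because of the capital I
plain=$(grep -E "is" sample.txt)
echo "$plain"

# " is" requires a space before "is", so only the first line matches
with_space=$(grep -E " is" sample.txt)
echo "$with_space"
```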



In [None]:
# let's first get the data we had previously
! wget https://www.gutenberg.org/cache/epub/20748/pg20748.txt
! ls

In [3]:
! mv pg20748.txt book.txt # let's name this nicer


In [None]:
! head book.txt # let's have a look

In [None]:
! egrep "snow" book.txt

In [None]:
! egrep "Snow" book.txt

In [6]:
! egrep "Snow " book.txt 

## Options / flags

* Options are arguments that can be given to Unix commands. These form one of the core elements of Unix command line work.

* Options are difficult to remember, but you can look them up, for instance on man pages or by searching online for instructions (which are probably easier to understand than man pages)


In [None]:
! man ls 

### Some useful options

* `head -n 15` prints the first 15 lines
* `tail -n 15` prints the last 15 lines
* `wc -w` prints the word count
* `wc -l` prints the line count
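
As a rough sketch, these options can be checked on a file small enough to count by hand. The file `sample_lines.txt` and its contents are assumptions made for this example. Reading the file with `<` is used here only so that `wc` prints the bare number without the file name.

```shell
# Create a hypothetical five-line sample file (not part of the course data)
printf 'one\ntwo words\nthree\nfour\nfive\n' > sample_lines.txt

first_two=$(head -n 2 sample_lines.txt)   # the first 2 lines
last_two=$(tail -n 2 sample_lines.txt)    # the last 2 lines
line_count=$(wc -l < sample_lines.txt)    # reading from < omits the file name
word_count=$(wc -w < sample_lines.txt)

echo "$first_two"
echo "-----"
echo "$last_two"
echo "lines: $line_count, words: $word_count"
```

The second line has two words, so the word count is one higher than the line count.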


In [None]:
# let's try a bit!
! head -n 5 book.txt
! echo "-----"
! wc -l book.txt
! wc -w book.txt

* `egrep -v` (invert) prints the lines without a match
* `egrep -c` counts the matching lines (not the individual matches)
* `egrep -i` ignores case
* `egrep -w` matches whole words only
* `egrep -B N` also prints N lines Before each match
* `egrep -A N` also prints N lines After each match
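
The flags are easiest to compare on a file where the expected answers are obvious. This is a minimal sketch with a made-up file `weather.txt` (an assumption for illustration); `grep -E` stands in for `egrep`, and the block is plain shell, for instance for a `%%bash` cell.

```shell
# Hypothetical sample file (not part of the course data)
printf 'Snow fell\nsnow and more snow\nsnowman\nrain\n' > weather.txt

# -c counts matching LINES, not individual matches:
# the second line contains "snow" twice but is counted once
count_lines=$(grep -E -c "snow" weather.txt)

# -i ignores case, so "Snow fell" is included as well
count_nocase=$(grep -E -ic "snow" weather.txt)

# -w matches whole words only, so "snowman" is left out
count_words=$(grep -E -icw "snow" weather.txt)

# -v inverts the match: lines WITHOUT "snow" in any case
no_snow=$(grep -E -vi "snow" weather.txt)

echo "$count_lines $count_nocase $count_words"
echo "$no_snow"
```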

In [None]:
! egrep -i "snow" book.txt 
! egrep -w "snow" book.txt 
! egrep -wi "snow" book.txt # the flags can also be combined
! egrep -v "snow" book.txt

In [None]:
! egrep -c "snow" book.txt
! egrep -c "Snow" book.txt
! egrep -ic "snow" book.txt
! egrep -icw "snow" book.txt

In [None]:
! egrep -B 5 -A 5 "Snow" book.txt

### Time out

* How to rename a file?
* How to count how many words and lines a file has?
* How to count how many times the pronoun _he_ appears in a file? How can you control the use of capitals and white spaces so that the query matches only entire words?
* How to direct all the lines that match the query (_he_) to a file?
For instance, the file could include the following lines:

```
He appeared this many times:
15
```
### Advanced
* You can also combine commands on one line using `;`
* You can store the output of a command in a variable with `name=$(command)` and read it back with `$name`:


In [None]:
!wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"


In [None]:
!wordcount=$(wc -w book.txt) ; echo "Word count is $wordcount"
!echo "Word count is", $wordcount # this no longer works on Colab: each ! line runs in its own shell, so the variable is empty here
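
Within a single shell session the variable does survive from one line to the next. The sketch below assumes it is run as one block (for instance a `%%bash` cell); `notes.txt` is a made-up file. It also shows that `wc -w file` stores the file name along with the number, while reading the file with `<` stores the number only.

```shell
# Hypothetical sample file (not part of the course data)
printf 'a few words here\nand some more\n' > notes.txt

# $( ... ) captures the output of a command; name=value stores it
with_name=$(wc -w notes.txt)      # wc adds the file name to its output
only_number=$(wc -w < notes.txt)  # reading from < leaves the file name out

# $name reads the value back; ; separates two commands on one line
echo "Raw wc output: $with_name" ; echo "Word count is $only_number"
```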

### Pipes (putket)

* You can combine commands with a pipe `|` (on a Finnish keyboard: Alt Gr + the key next to Z)
* This way the output of the first command goes as input to the next one
* `cat file.txt | wc -w` # first prints the file, then counts the words (the same count as plain `wc -w file.txt`)
* `cat file.txt | egrep "is"` # first prints the file, then matches the pattern
* `cat file.txt | egrep "is" | wc -l`
* `cat file.txt | egrep "is" | head`
* `cat file.txt | egrep "is" > output.txt`
* `cat file.txt | head -1000 | tail -100` # prints lines 901-1000
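
Which lines a `head | tail` combination returns can be verified with input whose content is its own line number. In this sketch `seq 1 2000` is a stand-in for a real file (an assumption for illustration); it prints the numbers 1 to 2000, one per line.

```shell
# seq prints the numbers 1..2000 one per line, so each line's content
# equals its line number
range=$(seq 1 2000 | head -n 1000 | tail -n 100)

first=$(echo "$range" | head -n 1)   # first line that survived
last=$(echo "$range" | tail -n 1)    # last line that survived
howmany=$(echo "$range" | wc -l)     # number of lines that survived

echo "first: $first, last: $last, lines: $howmany"
```

`head` keeps lines 1 to 1000 and `tail` then keeps the last 100 of those, so the result starts at line 901.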

In [None]:
! cat book.txt | egrep "is" | wc -l
! cat book.txt | egrep "is" | head

In [None]:

! cat book.txt | egrep -w "is" | wc -l
! cat book.txt | egrep -w "is" | head


In [None]:
! cat book.txt | egrep -wi "is" | head
! cat book.txt | egrep -wi "is" | wc -l

In [None]:
# Again, you can direct these to a file as before
!  cat book.txt | egrep -wi "is" | wc -l > file.txt
! cat file.txt

## Frequency counts

* A frequency counter can be built by combining the `sort` and `uniq` commands with a pipe
* `sort` sorts the input lines into alphabetical order
* `uniq` removes repeated lines that directly follow each other
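
Why the two commands are used together can be seen on a tiny example. This is a sketch with a made-up file `letters.txt` (an assumption for illustration), written as plain shell.

```shell
# Hypothetical sample file (not part of the course data)
printf 'b\na\nb\nb\na\n' > letters.txt

# uniq alone only merges duplicates that are next to each other
unsorted=$(uniq letters.txt)

# after sort, identical lines are adjacent, so uniq leaves one of each
sorted=$(sort letters.txt | uniq)

echo "$unsorted"
echo "---"
echo "$sorted"
```

Without `sort`, the output still contains `b` and `a` twice, because only the two neighbouring `b` lines were merged.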


In [None]:
! wget http://dl.turkunlp.org/atp/tweets_en_nort.csv  # this gives us a new dataset - what do they look like?

Let's first take just the tweet texts from the data. They are in column 3, which we can get with `cut -f 3` (by default `cut` splits columns on tabs).

Also, there seem to be empty lines. Those we can delete with `egrep -v "^$"`
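
Both steps can be tried on a miniature version of such data. The file `mini.tsv`, its three columns and its values are all invented for this sketch; the real dataset may look different. `grep -E` is used as the equivalent of `egrep`.

```shell
# Hypothetical tab-separated sample with an empty line in the middle
# (\t is a tab; the columns and values are made up)
printf '1\tanna\thello world\n\n2\tben\tgood morning\n' > mini.tsv

# ^ is the start of a line and $ the end, so "^$" matches empty lines;
# -v keeps everything else. cut -f 3 then picks the third tab-separated field.
texts=$(cat mini.tsv | grep -E -v "^$" | cut -f 3)

echo "$texts"
```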



In [None]:
! cat tweets_en_nort.csv | egrep -v "^$" | cut -f 3 | head -20

In [20]:
! cat tweets_en_nort.csv | egrep -v "^$" | cut -f 3 > tweets_text.txt # let's direct this to a file that we can then use

In [None]:
! cat tweets_text.txt | head 

Then to the frequency counter part

In [None]:
! cat tweets_text.txt | sort | head # the beginning of the sorted document looks like this 

In [None]:
! cat tweets_text.txt | sort | head -3000 | tail

`uniq` prints only unique lines, i.e., deletes duplicate lines that follow each other. `uniq -c` also prefixes each line with the number of times it occurred

In [None]:
! cat tweets_text.txt | sort | uniq -c | head

`sort -n` sorts the lines numerically, here by the count at the start of each line

In [None]:
! cat tweets_text.txt | sort | uniq -c | sort -n

`sort -nr` sorts the lines numerically in reverse order, so the most frequent lines come first, which is usually more practical

In [None]:
! cat tweets_text.txt | sort | uniq -c | sort -nr | head

### ... and that gives us the frequency counter!
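
The whole counter can be followed step by step on a file where the counts are easy to check by hand. A minimal sketch, assuming a made-up file `animals.txt` (not part of the course data):

```shell
# Hypothetical sample file with repeated lines
printf 'cat\ndog\ncat\nbird\ncat\ndog\n' > animals.txt

# sort groups identical lines, uniq -c prefixes each with its count,
# sort -nr puts the highest counts first
counts=$(cat animals.txt | sort | uniq -c | sort -nr)
echo "$counts"

# the most frequent line is the first row of the result
top=$(echo "$counts" | head -n 1)
echo "Most frequent: $top"
```

The counts are padded with spaces on the left, which is how `uniq -c` formats its output.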

### Today

* Flags are used to change or specify commands
* Pipes are used to combine commands
* New flags and commands 
  * egrep + flags
  * sort
  * uniq
  * sort | uniq -c | sort -rn


## Exercise 

* It would be useful to have a look at the actual text in the Gutenberg book, not just the beginning or end. How could you print the lines between 210 and 220? What do they look like?

* Can you think of two ways to count the lines that match the word "Gutenberg"?
* How can you direct to a file the first 5 lines that match "Gutenberg"?
* Filter away lines that have the word "gutenberg" in some form and direct the "cleaned" version to a file. Compare this to the original file. How many words or lines did you delete with this filtering?

### Advanced
* Can you find egrep options you can use to match entire words?
* Make one-liners that print different counts nicely to a file
* By the way, you can also store the output of piped commands in a variable




In [None]:
!linecount=$(cat book.txt | egrep -i "the" | wc -l) ; echo "Line count for 'the' is $linecount" >> niceoutput.txt
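
The same pattern can be sketched on a small file to see what ends up in the variables and in the output file. The names `tiny.txt` and `counts_report.txt` and the sample lines are assumptions for illustration. The block is plain shell (for instance for a `%%bash` cell) and uses `grep -E` as the equivalent of `egrep`.

```shell
# Hypothetical sample file (not part of the course data)
printf 'The cat\nthe dog\na bird\nOther things\n' > tiny.txt
rm -f counts_report.txt   # start from an empty report

# store the result of a whole pipeline in a variable;
# -w keeps "Other" from counting as a match for "the"
the_lines=$(cat tiny.txt | grep -E -iw "the" | wc -l)
all_lines=$(cat tiny.txt | wc -l)

# >> appends to the file instead of overwriting it
echo "Lines with the word the: $the_lines" >> counts_report.txt
echo "Lines in total: $all_lines" >> counts_report.txt

cat counts_report.txt
```

Because each `echo` appends with `>>`, running the last lines again would add two more rows to the report rather than replace it.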
