<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/Notebook2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 2: more commands, flags (options) and pipes

## egrep (or grep)
* Prints every line that matches a pattern
* See http://people.uta.fi/~jm58660/jutut/unix/grep.html
* e.g. egrep "is" file
* Note that matching is case-sensitive and whitespace counts!
* So egrep " is" file gives a different output than egrep "is" file
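
The effect of case and whitespace can be seen on a tiny test file. This is only a sketch: the file name sample.txt and its contents are made up for illustration, and it uses grep -E, which behaves like egrep. It is written as a plain shell script; in Colab you can run it in a %%bash cell.

```shell
# Build a small sample file (hypothetical content) in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' 'This is a duck.' 'A list of ducks' 'IS IT?' > "$workdir/sample.txt"

# "is" matches anywhere on a line, also inside "This" and "list"
grep -E "is" "$workdir/sample.txt"

# " is" needs a space right before "is", so "A list of ducks" drops out
grep -E " is" "$workdir/sample.txt"

# -c prints the number of matching lines instead of the lines themselves
plain=$(grep -E -c "is" "$workdir/sample.txt")
spaced=$(grep -E -c " is" "$workdir/sample.txt")
echo "lines with 'is': $plain, lines with ' is': $spaced"
```

The line "IS IT?" matches neither pattern, because the pattern is lowercase.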



In [None]:
# let's first get the data we had previously
! wget https://www.gutenberg.org/cache/epub/20748/pg20748.txt
! ls

In [None]:
! mv pg20748.txt book.txt # let's name this nicer
! head book.txt # let's have a look


In [None]:
! egrep "project" book.txt

In [None]:
! egrep "Duck" book.txt

## Options / flags

* Options are arguments that can be given to Unix commands. These form one of the core elements of Unix command line work.

* Options are hard to remember, but you can look them up, for instance on the man pages


In [None]:
! man ls 

* head -n 15 prints the first 15 lines
* tail -n 15 prints the last 15 lines
* ls -lah prints more information than plain ls (long format, hidden files, human-readable sizes)
* egrep -v (invert) prints the lines that do not match
* egrep -c counts the matching lines (not the individual matches)
* egrep -i ignores case
* egrep -B N also prints N lines before each match
* egrep -A N also prints N lines after each match
* wc -w prints the word count
* wc -l prints the line count
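
A minimal sketch of these options on a five-line test file. The file birds.txt and its contents are invented for the example, grep -E stands in for egrep, and the variable names are only there so that the results can be printed at the end.

```shell
# Hypothetical five-line sample file in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' 'One duck' 'Two geese' 'A DUCK again' 'Three swans' 'The end' > "$workdir/birds.txt"

head -n 2 "$workdir/birds.txt"                # first two lines
tail -n 1 "$workdir/birds.txt"                # last line
grep -E -v "duck" "$workdir/birds.txt"        # lines without lowercase "duck"
grep -E -B 1 -A 1 "geese" "$workdir/birds.txt"  # the match plus one line before and after

exact=$(grep -E -c "duck" "$workdir/birds.txt")       # lines containing "duck"
anycase=$(grep -E -c -i "duck" "$workdir/birds.txt")  # same, ignoring case
# Reading the file through < keeps the file name out of wc's output
lines=$(wc -l < "$workdir/birds.txt")
words=$(wc -w < "$workdir/birds.txt")
echo "exact=$exact anycase=$anycase lines=$lines words=$words"
```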

In [None]:
# let's try a bit!
! head -n 5 book.txt
! wc -l book.txt
! wc -w book.txt

In [None]:
! egrep -c "of" book.txt
! egrep -i "of" book.txt


In [None]:
# Flags can also be combined!
! egrep -ci "of" book.txt
! egrep -c "of" book.txt
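
The sketch below illustrates, on an invented three-line file, that combined flags such as -ci give the same result as -c -i. It also shows one possible way to count individual occurrences rather than lines: the -o option of grep prints every match on its own line, and wc -l then counts those lines. The file of.txt and the variable names are assumptions made for the example, and grep -E stands in for egrep.

```shell
# Hypothetical sample file in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' 'Of mice and men' 'a cup of tea, a slice of cake' 'nothing here' > "$workdir/of.txt"

separate=$(grep -E -c -i "of" "$workdir/of.txt")   # flags given one by one
combined=$(grep -E -ci "of" "$workdir/of.txt")     # the same flags combined
sensitive=$(grep -E -c "of" "$workdir/of.txt")     # case-sensitive: "Of" is not counted

# -c counts lines, so two "of" on one line count once;
# -o prints each match separately, and wc -l counts them
occurrences=$(grep -E -o -i "of" "$workdir/of.txt" | wc -l)

echo "separate=$separate combined=$combined sensitive=$sensitive occurrences=$occurrences"
```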

### Exercise on flags

* Count how many words and lines the Gutenberg book has

* First check what the actual text looks like. Remember that the default head and tail showed only the opening and closing notes, not the actual stories? How could you get a longer chunk of the file?

* Let's analyze personal pronouns in the book. How many times is the pronoun "it" used? What about the female and male pronouns? How can you use flags to catch both "He" and "he"?

* (Later we will also learn how to match optional sequences so you can get both "he" and "himself" with a one-liner, but let's not go there quite yet.)

* Create a file where you direct all the counts and their descriptors. For instance, the file could include the following lines:

```
She appeared this many times:
15
```

### Advanced
* You can also run several commands on one line by separating them with ;
* You can store the output of a command in a variable with $(...):


In [None]:
# reading the file through < keeps the file name out of the output of wc
!wordcount=$(wc -w < book.txt) ; echo "Word count is $wordcount"
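
A small sketch of the two ideas on an invented two-line file (tiny.txt is a made-up name). It is written as a plain shell script. In a notebook, each ! line runs in its own shell, so a variable has to be used on the same line where it is set, which is why ; is handy there.

```shell
# Hypothetical sample file in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' 'one two three' 'four five' > "$workdir/tiny.txt"

# No spaces around = in an assignment; $(...) is replaced by the command's output
wordcount=$(wc -w < "$workdir/tiny.txt")
linecount=$(wc -l < "$workdir/tiny.txt")

# ; separates commands that run one after another on the same line
echo "Word count is $wordcount" ; echo "Line count is $linecount"
```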


### Pipes

* You can combine commands with a pipe | (on a Finnish keyboard: AltGr + the key next to Z)
* This way the output of the first command becomes the input of the next one
* cat file.txt | wc -w # first prints the file, then counts the words (the same count as wc -w file.txt)
* cat file.txt | egrep "is" # first prints the file, then keeps the lines that match the pattern
* cat file.txt | egrep "is" | wc -w # counts the words on the matching lines
* cat file.txt | egrep "is" | head # shows the first 10 matching lines
* cat file.txt | egrep "is" > output.txt # writes the matching lines to output.txt
* cat file.txt | egrep "is" | wc -l # counts the matching lines
* cat file.txt | head -n 1000 | tail -n 100 # prints lines 901-1000
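
To check which lines a head and tail combination really picks, it helps to use a file where every line contains its own line number. The sketch below builds such a file with seq; the file name numbers.txt and the variable names are assumptions made for the example.

```shell
# A 1500-line file where line N contains the number N
workdir=$(mktemp -d)
seq 1 1500 > "$workdir/numbers.txt"

# head keeps lines 1-1000, tail then keeps the last 100 of those
range=$(cat "$workdir/numbers.txt" | head -n 1000 | tail -n 100)

first=$(echo "$range" | head -n 1)
last=$(echo "$range" | tail -n 1)
count=$(echo "$range" | wc -l)
echo "picked lines $first to $last, $count lines in total"
```

Changing the two numbers moves and resizes the window: the number given to head sets the last line, and the number given to tail sets how many lines are kept.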

## Exercise 

* It would be useful to look at the actual text of the Gutenberg book, not just the beginning or the end. How could you print lines 210 to 220? What do they look like?

* Can you think of two ways to count the lines that match the word "Gutenberg"?
* How can you direct to a file the first 5 lines that match "Gutenberg"?
* Filter out the lines that contain the word "gutenberg" in any form (for example, any capitalization) and direct the "cleaned" version to a file. Compare it with the original file. How many words and lines did the filtering remove?

### Advanced
* Can you find egrep options that match only entire words?
* Write one-liners that print the different counts neatly to a file
* By the way, you can also store the output of a piped command in a variable




In [None]:
!wordcount=$(cat book.txt|egrep -i "the" | wc -l) ; echo "Line count for the is $wordcount" >> niceoutput.txt
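
The cell above appends to its output file with >>, while earlier examples used >. The sketch below shows the difference on an invented file: > creates or overwrites the target, >> adds to the end of it. The names tiny.txt and report.txt are assumptions made for the example, and grep -E stands in for egrep.

```shell
# Hypothetical sample file in a temporary directory
workdir=$(mktemp -d)
printf '%s\n' 'The duck' 'a goose' 'the end' > "$workdir/tiny.txt"

# Store the result of a whole pipeline in a variable
the_lines=$(cat "$workdir/tiny.txt" | grep -E -i "the" | wc -l)
total=$(wc -l < "$workdir/tiny.txt")

echo "Lines with the: $the_lines" > "$workdir/report.txt"   # > starts the file from scratch
echo "All lines: $total" >> "$workdir/report.txt"           # >> appends a second line

cat "$workdir/report.txt"
report_lines=$(wc -l < "$workdir/report.txt")
```

Running the first echo again with > would reset the report to a single line, so >> is the safer choice when collecting several counts in one file.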


## Copying a Github repo

* GitHub is a common place to store code and data in NLP.
* Repos (directories) can be copied to a local computer programmatically
* This is especially handy with Google Colab
* The command for copying is *git clone*, followed by the URL shown under the green "Code" button on the repo's GitHub page
 

In [2]:
! git clone https://github.com/TurkuNLP/ATP_kurssi.git
! ls #to check that we got the repo

Cloning into 'ATP_kurssi'...
remote: Enumerating objects: 320, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 320 (delta 43), reused 17 (delta 5), pack-reused 233
Receiving objects: 100% (320/320), 17.09 MiB | 14.24 MiB/s, done.
Resolving deltas: 100% (172/172), done.
ATP_kurssi  sample_data


In [3]:
# cd will take us to that folder
%cd ATP_kurssi/
! ls # check that we are at the correct place

/content/ATP_kurssi
Notebook1.ipynb  Notebook2.ipynb  old_versions	README.md  tweets_en_nort.csv


### Exercise with tweets 

* The file tweets_en_nort.csv includes tweets
* How many are there?
* What does the last tweet look like?
* Can you tell what they discuss? What seems to be the best way to read the data?
* The dataset has a lot of bot-generated tweets that we'd like to filter out. At least one such tweet has the string "Message of Leader". Filter those out. How many such tweets were there? How many tweets do you have in the "cleaned" dataset after the filtering?
* (There may be other repetitive tweets too; if you spot any, you can remove those as well)

* The tweets mention at least France and Paris. Direct these tweets to a separate file, including all spelling variants (e.g. different capitalization).
* Can you find tweets where France or Paris is not capitalized?
* There seem to be many news agencies. How many times is FoxNews mentioned? What about other news agencies, can you see any?



