<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/Notebook4_2022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Today's topics

* Cloning GitHub repos
* Gzipped files using `gzip` and `zcat`
* Changing characters using `tr`
  * Combining `tr` to a frequency list pipeline
  * Using `tr` to normalize
* Regular expressions

### Copying a GitHub repo

GitHub is a common place to store code and data in NLP.
The repos (directories) can be copied to a local computer programmatically.

This is especially handy with Google Colab.

The command for copying is `git clone`, and it should be followed by the URL shown under the green "Code" button on the repo's GitHub page.

In [None]:
! git clone https://github.com/TurkuNLP/CORE-corpus.git
! ls #to check that we got the repo

In [None]:
# cd will take us to that folder
%cd CORE-corpus/
! ls # check that we are at the correct place

### Basic checks on gzipped files

* `zcat` for printing
* `gzip` for producing

---
* A `gz` file is compressed, so you need to print its decompressed contents with `zcat` before other commands can process them
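A minimal sketch of the round trip between `gzip` and `zcat`. The file `sample.txt` and the temporary directory are made up for illustration and are not part of the CORE repo. In a notebook cell, each command would be prefixed with `!` or the whole block placed in a `%%bash` cell.

```shell
# Work in a throwaway directory so nothing in the repo is touched
tmpdir=$(mktemp -d)
printf 'first line\nsecond line\nthird line\n' > "$tmpdir/sample.txt"

# gzip -c writes the compressed data to stdout, so the original file is kept
gzip -c "$tmpdir/sample.txt" > "$tmpdir/sample.txt.gz"

# zcat decompresses to stdout (same as gzip -dc), so the result can be piped onwards
line_count=$(zcat "$tmpdir/sample.txt.gz" | wc -l)
echo "lines in the gzipped file: $line_count"
```

The compressed file itself is never modified by `zcat`; only the stream that flows through the pipe is decompressed.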

In [None]:
! zcat train.tsv.gz | head # what's in?

In [None]:
! head register_label_abbreviations.txt # check what the abbreviations mean

In [None]:
! zcat train.tsv.gz | wc -l #How many lines?

### Focus on specific columns

* ` cut -f `

In [None]:
! zcat train.tsv.gz | cut -f 3 | head # to focus on the text
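To see what `cut -f` does without the full corpus, here is a small sketch on made-up tab-separated data. The three-column layout (label, id, text) is an assumption for the example only.

```shell
# Three tab-separated columns: label, id, text (a made-up sample)
sample=$(printf 'NA\tdoc1\tOnce upon a time\nOP\tdoc2\tI think this is fine\n')

# cut -f 3 keeps only the third tab-separated field of every line
texts=$(printf '%s\n' "$sample" | cut -f 3)
printf '%s\n' "$texts"

# cut -f 1,3 keeps several fields; the tab between them is preserved
labels_and_texts=$(printf '%s\n' "$sample" | cut -f 1,3)
printf '%s\n' "$labels_and_texts"
```

`cut` splits on tabs by default, which is why it works directly on `.tsv` files.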

### Filtering away the duplicates

* Unfortunately, the train file has some duplicates and empty documents. Before we move on, make a file that includes only the text parts of the file, and no duplicates or empty documents


In [None]:
! zcat train.tsv.gz | wc -l # this many docs in the original

In [None]:
! zcat train.tsv.gz | cut -f 3 | egrep "^$" | wc -l # how many empty ones?

In [None]:
! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq -c | sort -rn | head -10 # frequency list to see if there are duplicates

In [None]:
! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | wc -l # how many unique non-empty documents? 


In [None]:
! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | gzip > cleaned.txt.gz # drop empty documents and duplicates; note the gzip command to create a gzipped file!
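The filtering steps can be checked on a tiny made-up list of documents, sketched below. `grep -E` is used here and behaves like `egrep`.

```shell
# Six made-up "documents", one per line: two are empty, one is a duplicate
docs=$(printf 'a cat\n\na dog\na cat\n\na bird\n')

total=$(printf '%s\n' "$docs" | wc -l)
# -c counts matching lines; ^$ matches lines with nothing between start and end
empty=$(printf '%s\n' "$docs" | grep -E -c '^$')
# -v inverts the match, sort groups identical lines, uniq collapses them
unique=$(printf '%s\n' "$docs" | grep -E -v '^$' | sort | uniq | wc -l)

echo "total=$total empty=$empty unique=$unique"
```

`uniq` only merges identical lines that are next to each other, which is why `sort` has to come first.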

### Changing characters

Changing characters is often a useful thing to do:
* Splitting tokens to one per line (a useful format for Bash)
* Splitting to sentences
* Normalization --> all to lower case
* Deleting punctuation or numbers
* The `tr` (translate) command can be used for this

### Using *tr* to split tokens one per line
* Replace each space with a line break `\n`

In [None]:
! zcat cleaned.txt.gz | tr ' ' '\n' | head # each space (first argument) is replaced by a line break (second argument)


In [None]:
! zcat cleaned.txt.gz | head 

In [None]:
! zcat cleaned.txt.gz | tr ' ' '\n' > outputfile.txt # You can direct this to a file
! cat outputfile.txt | wc -l # or you can count how many lines (tokens) you have!

In [None]:
! head outputfile.txt # what did this look like again?

### Combining *tr* to a frequency list pipeline

* First split the tokens one per line, then count the frequencies using *sort | uniq -c | sort -rn*

In [None]:
! zcat cleaned.txt.gz | tr ' ' '\n' | sort | uniq -c | sort -rn | head -5
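The same pipeline can be traced step by step on a single made-up sentence, which makes it easier to see what each command contributes.

```shell
text='the cat saw the dog and the dog saw the bird'

# 1) one token per line, 2) identical tokens next to each other,
# 3) count each group, 4) largest counts first (-n numeric, -r reversed)
freq=$(printf '%s\n' "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn)
printf '%s\n' "$freq"

# The most frequent token is on the first line
top=$(printf '%s\n' "$freq" | head -1)
```

`uniq -c` pads the counts with leading spaces, so `sort -n` is needed to compare them as numbers rather than as text.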

### Using `tr` to normalize

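* From upper case to lower case: `tr '[:upper:]' '[:lower:]'` replaces every upper case letter with its lower case counterpart
* Deleting numbers: replace every digit `[0-9]` with a space. Note that `tr` treats the brackets in `'[0-9]'` as literal characters, so `[` and `]` are replaced as well; `tr '0-9' ' '` affects only the digits
* Deleting punctuation: replace every punctuation character `[:punct:]` with a space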


In [None]:
! zcat cleaned.txt.gz  | tr '[:upper:]' '[:lower:]' | head -20

In [None]:
! zcat cleaned.txt.gz  | tr '[0-9]' ' ' | head -20

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | head -20

### We can combine all these to make a cleaned and normalized frequency list

* Delete punctuation, numbers
* Normalize to lowercase
* Transform to string-per-line format
* Make a frequency list of the lines

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | head

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | head

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort | uniq -c | sort -rn | head -10

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$"| sort | uniq -c | sort -rn  | head -10 # yet without empty lines
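A sketch of the full normalization pipeline on one made-up sentence, so that the effect of each `tr` step can be checked by hand. The digit range is written as `'0-9'` here, which touches only digits.

```shell
text='The cat, the Cat and 2 CATS! The end.'

tokens=$(printf '%s\n' "$text" \
  | tr '[:punct:]' ' ' \
  | tr '0-9' ' ' \
  | tr '[:upper:]' '[:lower:]' \
  | tr ' ' '\n' \
  | grep -E -v '^$')   # drop the empty lines left by repeated spaces

# Frequency list of the normalized tokens
freq=$(printf '%s\n' "$tokens" | sort | uniq -c | sort -rn)
printf '%s\n' "$freq"
top=$(printf '%s\n' "$freq" | head -1)
```

After normalization, `The`, `the` and `The` are counted as one word, and `cat,` and `Cat` as another.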

### Time out!

New commands
```
* git clone
* gzip, zcat
* tr
```
Character classes and ranges for matching larger groups of characters

`[:punct:], [0-9], [:upper:], [:lower:]`

#### Recap

Let's count the most frequent words of one text class from the CORE corpus, *NA*.

* Before counting the most frequent words, let's normalize to lowercase and clean punctuation and numbers away
* How far down the frequency list do you need to go before you start getting content words? (What do we mean by them?)
* Where do you think these texts come from?

In [None]:
! zcat train.tsv.gz | egrep -w NA | head # first we need to egrep for the correct labels + texts

In [None]:
! zcat train.tsv.gz | egrep -w NA | wc -l # good to check how many we got

In [None]:
! zcat train.tsv.gz | egrep -w NA | cut -f 3 > na.txt # let's then take the third column and direct to a file for simplicity
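A small sketch of what the `-w` option changes, on made-up rows. The column layout and the label strings are assumptions for the example only.

```shell
# Made-up rows: label(s), id, text
sample=$(printf 'NA\tdoc1\tA story about a trip\nOP\tdoc2\tNASA should get more funding\nNA OP\tdoc3\tMy day and what I think of it\n')

# Without -w the pattern matches anywhere, also inside the word NASA
plain=$(printf '%s\n' "$sample" | grep -E -c 'NA')
# With -w the match has to be a whole word
whole=$(printf '%s\n' "$sample" | grep -E -w -c 'NA')
echo "plain=$plain whole=$whole"

# Keep only the text column of the matching rows
na_texts=$(printf '%s\n' "$sample" | grep -E -w 'NA' | cut -f 3)
printf '%s\n' "$na_texts"
```

Note that `-w` still searches the whole line, so a document from another class would also be matched if its text happened to contain `NA` as a separate word.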

In [None]:
! cat na.txt  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$" | sort | uniq -c | sort -rn | head -100 | tail -50 # ranks 51-100 of the frequency list

## Regular expressions

Above, we saw that some expressions - *regular expressions* - can be used to match a larger group of strings
* `[:punct:] [:upper:] [:lower:] [0-9]`

Note: regex syntax can vary between tools and programming languages

Some useful *operators*
* `^` beginning of line
* `$` end of line
* `^$` empty line (beginning + end without anything between)
* `|` alternative, e.g., `"cat|dog"`
* `[]` set of characters, e.g. `[A-ZÅÄÖa-zåäö] [0-9] [abc]` *any one of the characters*
* `()` group to form a whole, e.g. `(abc)|(def)`
* The same thing can be expressed in many ways, e.g. `[abc]` is the same as `"a|b|c"`

NOTE: if you want to search for an operator character literally, you need to *escape* it with `\`

These (and more) are listed also here: https://www.guru99.com/linux-regular-expressions.html
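The operators above can be tried out on a short made-up word list, as sketched below. `grep -E` behaves like `egrep`. The last two patterns also illustrate escaping: an unescaped `.` stands for any single character, while `\.` matches a literal dot.

```shell
words=$(printf 'is\nisland\nthis\ncat\ndog\ncatalog\n3.14\n3514\n')

starts=$(printf '%s\n' "$words" | grep -E '^is')          # lines starting with is
ends=$(printf '%s\n' "$words" | grep -E 'is$')            # lines ending with is
either=$(printf '%s\n' "$words" | grep -E '^(cat|dog)$')  # exactly cat or dog
c_or_d=$(printf '%s\n' "$words" | grep -E '^[cd]')        # starting with c or d
any_char=$(printf '%s\n' "$words" | grep -E '3.14')       # . is any character
literal_dot=$(printf '%s\n' "$words" | grep -E '3\.14')   # \. is a real dot

printf 'starts: %s\n' $starts
printf 'ends: %s\n' $ends
```

Single quotes around the pattern keep the shell from interpreting `$`, `|` and `\` before `grep` sees them.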



Let's first make a version of the original file with one token per line

In [None]:
! zcat cleaned.txt.gz  | tr ' ' '\n' | egrep -v "^$" > one-per-line.txt

In [None]:
! cat one-per-line.txt | head -5

In [None]:
! echo "any line with the string"
! cat one-per-line.txt | egrep "is" | head -4 # any line with the string
! echo
! echo "lines starting with the string"
! cat one-per-line.txt | egrep "^is" | head -4 # lines starting with is 
! echo 
! echo "lines ending with the string"
! cat one-per-line.txt | egrep "is$" |  egrep -v "^is$"| head -4 # lines ending with is, excluding lines that are exactly "is"

In [None]:
! cat one-per-line.txt | egrep "ing$" | head # any line ending with ing (head keeps the output short)

In [None]:
! cat one-per-line.txt | egrep "^[[:upper:]]" | head -5 # any line starting with a capital letter; in egrep the class [:upper:] goes inside a [] set

In [None]:
! cat one-per-line.txt | egrep "[[:punct:]]" | head -10

In [None]:
! cat one-per-line.txt | egrep "^[[:punct:]]" | head -5 # anything starting with punctuation

In [None]:
! cat one-per-line.txt | egrep "[[:punct:]]$" | head -5 # anything ending with punctuation

In [None]:
! cat one-per-line.txt | egrep "^[[:punct:]][A-Z]" | head -5 # anything starting with punctuation and then a capital letter

In [None]:
! cat  one-per-line.txt | egrep "[a-zA-Z],[a-zA-Z]" | head -5 # tokenization mistakes

In [None]:
! cat  one-per-line.txt | tr 'aeiouy' ' ' | head -5 # every lower case vowel replaced with a space

### A couple more useful operators

* `.` any character
* `*` 0 times or more
* `+` 1 time or more
* `?`  0 or 1 times

Operators can also be combined
- `.*` --> any character 0 times or more
- `.?` --> any character 0 or 1 times
- `.+` --> any character 1 or more times
- `a+` --> (the letter) a 1 or more times
- `a.*` --> (the letter) a, then any character 0 or more times
- `a?.$` --> (the letter) a 0 or 1 times, any character, line end
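A sketch of the quantifiers on a made-up word list. Setting `LC_ALL=C` is a choice made for this example so that ranges like `[A-Z]` behave predictably regardless of the system locale.

```shell
export LC_ALL=C
words=$(printf 'a\nat\nt\nacting\naiming\nbaa\nsing\nA\nAB\nAbc\n')

single=$(printf '%s\n' "$words" | grep -E '^[A-Z]$')    # exactly one capital letter
caps=$(printf '%s\n' "$words" | grep -E '^[A-Z]+$')     # one or more capitals, nothing else
a_ing=$(printf '%s\n' "$words" | grep -E '^a.*ing$')    # a, anything (or nothing), ing
opt_a=$(printf '%s\n' "$words" | grep -E '^a?t$')       # optional a, then t
rep_a=$(printf '%s\n' "$words" | grep -E '^ba+$')       # b followed by one or more a

printf 'caps: %s\n' $caps
printf 'a...ing: %s\n' $a_ing
```

Without the anchors `^` and `$`, each pattern would also match lines where the expression occurs only in the middle of a longer word.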

In [None]:
! cat  one-per-line.txt | egrep "^[A-Z]$" | head -5 # lines consisting of a single capital letter

In [None]:
! cat  one-per-line.txt | egrep "^[A-Z]+$" | tail -5 # lines consisting only of capital letters (one or more)

In [None]:
! cat  one-per-line.txt | egrep "^a.*ing$" | head # starting with a, then any character 0 or more times, then ing at the end

In [None]:
! cat  one-per-line.txt | egrep "^A.*'s$" | sort | uniq | head # starting with A and ending with 's; sort first so that uniq removes all duplicates

In [None]:
! cat  one-per-line.txt  | egrep "^[[:lower:]]+[[:upper:]]+" | head -5 # one or more lower case letters followed by one or more capitals