
## Changing characters

Changing characters is useful for many text-processing tasks:
* Splitting a text into one token per line (a convenient format for Bash tools)
* Splitting a text into sentences
* Normalization, for example converting everything to lower case
* Deleting punctuation

The *tr* (translate) command can be used for all of these.
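
As a minimal sketch of how *tr* works, the example below maps each character of the first set to the character at the same position in the second set. The input word is an invented sample, not taken from the book.

```shell
# tr reads standard input and replaces characters one by one:
# a -> x, b -> y, n -> z (position in set 1 decides the replacement in set 2)
swapped=$(printf 'banana\n' | tr 'abn' 'xyz')
echo "$swapped"   # prints yxzxzx
```

In a notebook, a multi-line block like this can be run in a cell that starts with `%%bash`.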

In [None]:
# First let's get data: A Christmas Carol by Charles Dickens
! wget https://www.gutenberg.org/ebooks/19337.txt.utf-8

In [None]:
! ls
! mv 19337.txt.utf-8 xmascarol.txt
! head -200 xmascarol.txt  | tail -40

### Using *tr* to split tokens one per line
* Replace each space with a line break (\n)

In [None]:
! cat xmascarol.txt | tr ' ' '\n' # each character in the first set (here a space) is replaced by the matching character in the second set (here a line break)
! cat xmascarol.txt | tr ' ' '\n' > outputfile.txt # you can redirect the output to a file
! cat xmascarol.txt | tr ' ' '\n' | wc -l # or you can count how many lines (tokens) you have
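
One thing to watch for: when a text contains several spaces in a row, each of them becomes its own line break, so the output has empty lines and *wc -l* counts them too. The sketch below, on an invented sample string, compares the plain replacement with the *-s* (squeeze) option of *tr*, which collapses runs of the replaced character into one.

```shell
sample='to  be or   not'   # note the double and triple spaces

# Plain replacement: every single space becomes a line break,
# so repeated spaces leave empty lines that wc -l also counts.
plain=$(printf '%s\n' "$sample" | tr ' ' '\n' | wc -l)

# With -s, runs of the resulting line breaks are squeezed into one.
squeezed=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | wc -l)

echo "plain: $plain lines, squeezed: $squeezed lines"
```

The squeezed count matches the number of actual tokens in the sample, which is usually what you want when counting words.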

### Combining *tr* with a frequency list pipeline

* First split the tokens one per line, then count the frequencies using *sort | uniq -c | sort -n*

In [None]:
! cat xmascarol.txt | tr ' ' '\n' | sort | uniq -c | sort -n 
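
To see why *sort* has to come before *uniq -c*, here is a small sketch on an invented sentence. *uniq* only merges identical lines that are next to each other, so without sorting the repeated token is never combined.

```shell
sample='the cat saw the dog and the bird'

# Without sort: no two identical tokens are adjacent,
# so uniq -c reports every token separately.
unsorted=$(printf '%s\n' "$sample" | tr ' ' '\n' | uniq -c | wc -l)

# With sort: identical tokens end up next to each other and are counted together.
sorted=$(printf '%s\n' "$sample" | tr ' ' '\n' | sort | uniq -c | wc -l)

# The most frequent token, using sort -rn to put the largest count first.
top=$(printf '%s\n' "$sample" | tr ' ' '\n' | sort | uniq -c | sort -rn | head -1)

echo "lines without sort: $unsorted, with sort: $sorted"
echo "most frequent: $top"
```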

### Exercise
* Download A Christmas Carol and rename the file to something easy to remember
* How many words does the book have?
* Using combinations of *tr*, *sort* and *uniq*:
  * How many unique (different) words does the book include?
  * What are the 20 most frequent tokens in the book?
  * How many tokens occur only once?

## Using *tr* to normalize

* From upper case to lower case: tr '[:upper:]' '[:lower:]' # replace every upper case letter with the corresponding lower case letter
* Deleting numbers: replace every digit (0-9) with a space
* Deleting punctuation: replace every punctuation character ([:punct:]) with a space
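
Replacing with a space and actually deleting are not the same thing. The sketch below, on an invented sample line, compares the replacement approach with the *-d* (delete) option of *tr*. Note what happens to the apostrophe in each case.

```shell
line="It's 2 pounds, 10 shillings."

# Replace punctuation and digits with spaces: the line keeps its length,
# and "It's" is split into two tokens, "It" and "s".
replaced=$(printf '%s\n' "$line" | tr '[:punct:][:digit:]' ' ')

# Delete punctuation and digits: the characters disappear,
# and "It's" becomes the single token "Its".
deleted=$(printf '%s\n' "$line" | tr -d '[:punct:][:digit:]')

echo "replaced: $replaced"
echo "deleted:  $deleted"
```

Which one is better depends on the task: replacing keeps words apart when punctuation is the only separator between them, while deleting keeps contracted forms together.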



In [None]:
! cat xmascarol.txt | tr '[:upper:]' '[:lower:]' | head -20

In [None]:
! cat xmascarol.txt | tr '0-9' ' ' | head -20

In [None]:
! cat outputfile.txt | sort | uniq -c | sort -rn | head -5 | tr '0-9' ' '
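
A note on how digit ranges are written: the sets given to *tr* are not regular expressions, so square brackets around a range are treated as ordinary characters. The sketch below shows this on an invented sample, assuming GNU *tr* as found in Colab (other implementations may behave differently).

```shell
sample='see [12] now'

# With brackets in the set, the [ and ] in the text are replaced too.
with_brackets=$(printf '%s\n' "$sample" | tr '[0-9]' ' ')

# With a plain range, only the digits are replaced.
plain_range=$(printf '%s\n' "$sample" | tr '0-9' ' ')

echo "with brackets: $with_brackets"
echo "plain range:   $plain_range"
```

The named class `[:digit:]` is an exception: there the brackets are part of the class name, as in `tr '[:digit:]' ' '`.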

In [None]:
! cat outputfile.txt | head -30
! echo
! echo "And now without punctuation:"
! cat outputfile.txt | tr '[:punct:]' ' '  | head -30

### *tr* and *sort uniq* to compare texts

* Combining the commands with pipes lets you turn running text into a cleaned frequency list of its words
* Steps:
  1. Normalize everything to lower case
  2. Delete punctuation and numbers (separate commands are fine for now)
  3. Split the tokens onto their own lines (replace each space with a line break)
  4. Build a frequency list of the cleaned tokens (lines)

In [None]:
! cat xmascarol.txt | tr '[:upper:]' '[:lower:]'  | tr '0-9' ' ' | tr '[:punct:]' ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn | head -30
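
If you want to run the same cleaning steps on several files, one option is to wrap the pipeline in a shell function. This is only a sketch: the name `freqlist` and the sample text are made up for illustration, and it uses `tr -s` so that runs of spaces do not produce empty tokens.

```shell
# freqlist is a hypothetical helper name, not part of the course material.
# Usage: freqlist FILE
freqlist() {
  tr '[:upper:]' '[:lower:]' < "$1" |   # 1) normalize to lower case
    tr '[:punct:][:digit:]' ' ' |       # 2) replace punctuation and digits with spaces
    tr -s ' ' '\n' |                    # 3) one token per line, squeezing empty lines
    sort | uniq -c | sort -rn           # 4) frequency list, most frequent first
}

# Small invented sample so the sketch runs without downloading anything.
sample=$(mktemp)
printf 'The ghost saw 3 ghosts.\nThe Ghost left!\n' > "$sample"

freqlist "$sample" | head -3
```

With the real data, the call would be `freqlist xmascarol.txt | head -30`.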

### Exercise

Compare *A Christmas Carol* to *The Pickwick Papers*, also written by Dickens.

The Pickwick Papers is available at https://www.gutenberg.org/files/580/580-0.txt 

* First look at the data you have. When were the books released on Project Gutenberg?
* Then normalize the data by turning all to lowercase and deleting numbers and punctuation 
* How many words/tokens do the books include?
* How many unique words/tokens do the books include?
* Compare (by reading) the 20 most frequent tokens of the two books. Do you find any differences?

#### Tip for comparing files: *paste file1 file2* prints the files side by side.
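
A minimal sketch of *paste* in action is shown below. The two token lists are invented placeholders, not real results from the books, and the file names are assumptions chosen for the example.

```shell
# Work in a temporary directory so no existing files are overwritten.
tmpdir=$(mktemp -d)

# Two short, invented "top token" lists standing in for the real frequency lists.
printf 'the\nand\nof\n' > "$tmpdir/top_a.txt"
printf 'the\nof\nmr\n' > "$tmpdir/top_b.txt"

# paste joins line 1 with line 1, line 2 with line 2, and so on,
# separating the columns with a tab.
side_by_side=$(paste "$tmpdir/top_a.txt" "$tmpdir/top_b.txt")
echo "$side_by_side"
```

For the exercise, you would first save the top 20 lines of each book's frequency list to its own file and then paste those two files.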