<a href="https://colab.research.google.com/github/TurkuNLP/ATP_kurssi/blob/master/ATP_2025_Notebook_4_answers_part_1_and_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Today's topics

I Cloning Github repos

II Gzipped files using `gzip` and `zcat`

III Changing characters using `tr`
  * Combining `tr` with a frequency list pipeline
  * Using `tr` to normalize

IV Regular expressions

### I Copying a Github repo

GitHub is a common place to store code and data in NLP. Repos (repositories, i.e. project directories) can be copied to a local computer programmatically.

This is especially handy with Google Colab.

The command for copying is `git clone`, followed by the URL of the repo. You can get the URL from the green **Code** button on the repo's GitHub page.



### Let's start by fetching some data from GitHub.

1. Clone the following repo: https://github.com/TurkuNLP/CORE-corpus.git and check that we got it.

2. Cloning creates a folder called `CORE-corpus`. Go to the folder.



In [None]:
#1. Clone the repo

! git clone https://github.com/TurkuNLP/CORE-corpus.git

In [None]:
#Check that we got it

! ls

In [None]:
#2. Go to the folder

# cd will take us there

%cd CORE-corpus/
! ls # check that we are at the correct place

### II Gzipped files: basic check-ups

* `zcat` for printing
* `gzip` for compressing (writing to a gzipped file)

---
* The contents of a `gz` file must be decompressed, e.g. printed with `zcat`, before other commands can process them
---
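
As a small self-contained illustration of the two commands, the sketch below writes three lines to a file, compresses it and reads it back. The file name `demo.txt` and the temporary directory are made up for this example and are not part of the corpus. It is plain shell; in a notebook, run it in a cell that starts with `%%bash`.

```shell
# Work in a throwaway directory so nothing in the repo is touched.
workdir=$(mktemp -d)

# Write three lines to a plain text file, then compress it.
printf 'first line\nsecond line\nthird line\n' > "$workdir/demo.txt"
gzip "$workdir/demo.txt"   # replaces demo.txt with demo.txt.gz

# zcat decompresses to standard output; the .gz file itself is left unchanged.
line_count=$(zcat "$workdir/demo.txt.gz" | wc -l)
first_line=$(zcat "$workdir/demo.txt.gz" | head -n 1)

echo "lines: $line_count"
echo "first: $first_line"
```

Because `zcat` only prints, the compressed file can be piped into `head`, `cut` or `wc` as many times as needed without ever being unpacked on disk.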


### Next

3. Print the train file in the folder. A gzipped file is always printed with `zcat`, e.g. `zcat file.tsv.gz`.
4. The first column in the file indicates the register label for the text. Check what the abbreviations mean. The CORE-corpus folder contains a file register_label_abbreviations.txt that features all the abbreviations for the register labels.
5. Then count the lines. Since the file has one text per line, the line count is also the number of texts in the file.

In [None]:
#3. Print the file
! zcat train.tsv.gz | head # what's in the file?

In [None]:
# Register labels are in the first column
! zcat train.tsv.gz | cut -f 1 | head

In [None]:
#4. Check what the labels are.

! head register_label_abbreviations.txt

In [None]:
#5. Line (in this case also text) count

! zcat train.tsv.gz | wc -l

### Try the commands yourselves.

1. Check what the `test` file includes.
2. And do the line count for the `test` file.

In [None]:
# 1. Check what the *test* file includes
! zcat test.tsv.gz | head

In [None]:
# 2. And do the line count for the *test* file
! zcat test.tsv.gz | wc -l

### Focus on specific columns

A reminder:

` cut -f `

Columns in a file can be accessed with the ` cut -f ` command. The **column must be specified** in the command with the flag `-f`. E.g.,

`cut -f 5`

prints the 5th column in the file.

**Note** the white space in the command! `cut -f 5` and `cut -f5` both work, but e.g. `cut-f 5` or `cut - f 5` gives an error.
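
To see `cut -f` on something small, here is a sketch with two invented tab-separated rows (label, id, text). The rows are toy data, not taken from the corpus. In a notebook, run it in a `%%bash` cell.

```shell
# Two rows with three tab-separated columns: label, id, text (toy data).
rows=$(printf 'NA\t001\tA short story\nOP\t002\tAn opinion piece\n')

first_col=$(printf '%s\n' "$rows" | cut -f 1)
third_col=$(printf '%s\n' "$rows" | cut -f 3)

# These rows have only three columns, so column 4 comes out empty.
fourth_col=$(printf '%s\n' "$rows" | cut -f 4)

echo "$first_col"
echo "$third_col"
```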

## Exercise 1

1. Print the text column in the `train` file and check that you got it right.
2. Check what the different columns have.
3. Check the columns in the file `register_label_abbreviations.txt`. What do you get with `cut -f 3`?

In [None]:
# 1. Print the text column (the 3rd column) in the train file and check that you got it right.

#check with head

! zcat train.tsv.gz | cut -f 3 | head

In [None]:
# 2. Check what the different columns have.
! zcat train.tsv.gz | cut -f 1 | head

In [None]:
# 3. Check the columns in the file register_label_abbreviations.txt. What do you get with cut -f 3?
! cat register_label_abbreviations.txt | cut -f 1 | head

In [None]:
#What happens if you try cut -f 3 on the file register_label_abbreviations.txt?
! cat register_label_abbreviations.txt | cut -f 3 | head

## Exercise 2

### Filtering away the duplicates and the empty lines

Unfortunately, the train file has some duplicates and empty documents. Before we move on, make a file that includes only the text parts of the file, and no duplicates or empty documents. It's advisable to use e.g. line count and frequency list to check that everything that was supposed to be removed is removed.

Make a clean version of the file in which you include **only the texts**.
1. Remove **all empty lines** and
2. **duplicates**.
3. Direct the output in a **new file** called `cleaned.txt.gz`
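
Before running the steps on the real file, it can help to watch them on a handful of lines. The sketch below uses seven invented lines (two of them empty, two duplicated) and shows why `sort` has to come before `uniq`. `grep -E` is used here; it behaves the same as `egrep`. In a notebook, run it in a `%%bash` cell.

```shell
# Seven toy lines: two are empty, "a text" and "b text" occur twice.
texts=$(printf 'b text\n\na text\nb text\n\na text\nc text\n')

total=$(printf '%s\n' "$texts" | wc -l)
empty=$(printf '%s\n' "$texts" | grep -E '^$' | wc -l)
non_empty=$(printf '%s\n' "$texts" | grep -E -v '^$' | wc -l)

# uniq only collapses neighbouring duplicates, so without sort nothing is removed here.
no_sort=$(printf '%s\n' "$texts" | grep -E -v '^$' | uniq | wc -l)

# Sorting first puts the duplicates next to each other.
unique=$(printf '%s\n' "$texts" | grep -E -v '^$' | sort | uniq)

echo "total=$total empty=$empty non_empty=$non_empty no_sort=$no_sort"
echo "$unique"
```

Checking that `total` equals `empty` plus `non_empty` is the same sanity check the exercise suggests for the real file.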


In [None]:
# 1. First, check how many lines there are in the original file.

#N.B. Here, when you count the lines, you get the number of docs/texts

! zcat train.tsv.gz | wc -l

In [None]:
! zcat train.tsv.gz | cut -f 3 | head

In [None]:
#Then, check how many empty lines (in the text column) the file includes.

#cut -f 3: access the text column
#egrep "^$": get the empty lines (^ line start, $ line end)

! zcat train.tsv.gz | cut -f 3 | egrep "^$" | wc -l

In [None]:
#Check that all the empty lines are removed (count the remaining lines after the removal of the empty ones).

! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | wc -l

In [None]:
# 2. Check the duplicates in the file.

#sort: sort the lines alphabetically
#uniq -c: counts the number of consecutive duplicate lines
#sort -rn: n sorts according to numbers, r sorts in reverse order (descending order)

! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq -c | sort -rn | head -10

In [None]:
#Count how many unique non-empty docs there are in the file.

#sort: sorts alphabetically
#uniq: prints only once each (duplicate) line

! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | wc -l


In [None]:
# 3. Direct the output to a new file.

#gzip: create a gzipped file

! zcat train.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | gzip > cleaned.txt.gz

In [None]:
# Check that the new file exists.
! ls

In [None]:
# Check the file contents.
! zcat cleaned.txt.gz | head

**EXTRA TIME LEFT?**

Clean the test file as above (i.e. do all the same things for the test file as for the train file). Remember to check the output of each pipe so you can be sure that you end up with the intended final output. Finally, direct the output of the cleaned test file to a new file called `cleaned_test_file.txt.gz`

In [None]:
#Clean the test file.

! zcat test.tsv.gz | cut -f 3 | head

In [None]:
! zcat test.tsv.gz | cut -f 3 | wc -l

In [None]:
! zcat test.tsv.gz | cut -f 3 | egrep "^$" | wc -l

In [None]:
! zcat test.tsv.gz | cut -f 3 | egrep -v "^$" | wc -l

In [None]:
! zcat test.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | head

In [None]:
! zcat test.tsv.gz | cut -f 3 | egrep -v "^$" | sort | uniq | wc -l

### III Changing characters

Changing characters is often a useful thing to do:
* **Splitting** tokens to one per line (a useful format for Bash), i.e. the format is **word per line**
* Splitting **to sentences**
* **Normalization** (i.e. all to lower case)
* **Deleting** punctuation or numbers

### Using `tr` to split tokens one per line
* `tr` stands for *translate*: it replaces (or deletes) single characters
* use **single quotation** marks to indicate **what is replaced** (within the first single quotation marks) and **with what** (within the second single quotation marks)

  e.g. `tr ' ' '\n'`

  replaces each space character (' ') with a line break ('\n')
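
A minimal sketch of the same replacement on one invented sentence, so the effect is easy to see. In a notebook, run it in a `%%bash` cell.

```shell
sentence='the cat sat on the mat'

# Each space becomes a line break, so every token ends up on its own line.
tokens=$(echo "$sentence" | tr ' ' '\n')

# With one token per line, counting lines counts tokens.
token_count=$(echo "$sentence" | tr ' ' '\n' | wc -l)

echo "$tokens"
echo "tokens: $token_count"
```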

## Let's try this.
1. **Split the tokens** one per line in the file `cleaned.txt.gz`
2. **Direct the output** (the word-per-line output) to `outputfile.txt`
3. **Count the lines** in the file.

In [None]:
#1. Word per line: the contents of the first ' ' is transformed to the contents of the second ' '
#In the present context this means that each white space is transformed to line break

# ' ' refers to white space
# '\n' refers to new line

! zcat cleaned.txt.gz | tr ' ' '\n' | head


In [None]:
#Compare to the original file

! zcat cleaned.txt.gz | head

In [None]:
#2. Direct the output to a new file

! zcat cleaned.txt.gz | tr ' ' '\n' > outputfile.txt


In [None]:
#Again, check that the file looks as intended.

! head outputfile.txt

In [None]:
# 3. Count the lines in the new file

#N.B. This is actually also a token count!

! cat outputfile.txt | wc -l

### Combining `tr` with a frequency list pipeline

By combining `tr` with a frequency list pipeline you get a **token frequency list**.

In practice, first print the file, then split the tokens one per line, and finally make a frequency list.
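
The three steps can be sketched on a single invented sentence, where the counts are easy to verify by hand. In a notebook, run it in a `%%bash` cell.

```shell
text='the cat saw the dog and the dog saw the cat'

# split into one token per line, group identical tokens, count, sort by count
freq=$(echo "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn)

# The first line of the list is the most frequent token with its count.
top=$(echo "$freq" | head -n 1)

echo "$freq"
```

Here "the" occurs four times, so it ends up on the first line. Note that `uniq -c` pads the counts with leading spaces.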

## Now, let's try this.

4. Make a token frequency list of the `cleaned.txt.gz` file.

In [None]:
# 4. Token frequency list

#First split the tokens one per line: tr ' ' '\n'
#then count the frequencies: sort | uniq -c | sort -rn

! zcat cleaned.txt.gz | tr ' ' '\n' | sort | uniq -c | sort -rn | head -5

In [None]:
! cat outputfile.txt | sort | uniq -c | sort -rn | head -5

### Using `tr` to normalize

`tr` can be used to **normalize data**, i.e. to change or remove characters so that different forms of the same word are recognized as the same, so that e.g.,
  
    cat
    Cat
    cat.
    CAT
    cat,

are recognized as the same word (cat).

With `tr` we can

* **normalize letters** from upper case to lower case (i.e. replace any upper case letter with a lower case letter):

  `tr '[:upper:]' '[:lower:]'`
  
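* **remove numbers** (i.e. replace any digit `0-9` with a space). `tr` takes plain character sets, so no square brackets are needed around the range: `tr '[0-9]' ' '` would also replace the literal characters `[` and `]`:

  `tr '0-9' ' '`
* **remove punctuation** (i.e. replace any punctuation character `[:punct:]` with a space):

  `tr '[:punct:]' ' '`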
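
The sketch below runs these replacements on one invented string. It also shows two details that are easy to miss: square brackets around a range are treated by `tr` as ordinary characters, and `tr -d` deletes characters outright instead of replacing them with a space. The behaviour described is that of GNU `tr`, as found on Colab. In a notebook, run it in a `%%bash` cell.

```shell
sample='Cat, CAT. cat [3] 2024!'

# Upper case to lower case.
lower=$(echo "$sample" | tr '[:upper:]' '[:lower:]')

# Digits to spaces: the square brackets in the text survive.
no_digits=$(echo "$sample" | tr '0-9' ' ')

# With brackets around the range, [ and ] are replaced as well.
bracket_pitfall=$(echo "$sample" | tr '[0-9]' ' ')

# -d deletes the characters instead of replacing them.
no_punct=$(echo "$sample" | tr -d '[:punct:]')

echo "$lower"
echo "$no_digits"
echo "$bracket_pitfall"
echo "$no_punct"
```

Whether to replace with a space or to delete depends on the data: deleting the hyphen in "co-operate" glues the parts together, while replacing it splits the word in two.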


## Let's practice these.

In the `cleaned.txt.gz` file

5. **normalize** the text,
6. delete the **numbers** and
7. delete the **punctuation**.


In [None]:
#Start by checking the file contents

! zcat cleaned.txt.gz | head -10

In [None]:
#5. Normalize - replace upper case with lower case.

! zcat cleaned.txt.gz  | tr '[:upper:]' '[:lower:]' | head -20

In [None]:
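#6. Delete numbers (each digit is replaced with a space).

#no square brackets around the range: tr '[0-9]' ' ' would also replace the characters [ and ]

! zcat cleaned.txt.gz  | tr '0-9' ' ' | head -20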

In [None]:
#7. Delete punctuation.

! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | head -20

**Note** that here `tr` **replaces each character with a space**, so e.g. a four-digit year becomes four spaces. After splitting on spaces, this leads to empty lines in e.g. frequency lists, but the empty lines can be filtered out with `egrep -v "^$"`.
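
A small sketch of where the empty lines come from, using one invented phrase. It also shows an alternative not used elsewhere in this notebook: the `-s` (squeeze) option of `tr`, which collapses repeated output characters into one. `grep -E` behaves the same as `egrep`. In a notebook, run it in a `%%bash` cell.

```shell
sample='in 2024, cats'

# Punctuation and digits become spaces, then every space becomes a line break.
with_empty=$(echo "$sample" | tr '[:punct:]' ' ' | tr '0-9' ' ' | tr ' ' '\n')
empty_count=$(printf '%s\n' "$with_empty" | grep -E -c '^$')

# Option 1: filter the empty lines out afterwards.
filtered=$(printf '%s\n' "$with_empty" | grep -E -v '^$')

# Option 2: let tr squeeze runs of line breaks into a single one.
squeezed=$(echo "$sample" | tr '[:punct:]' ' ' | tr '0-9' ' ' | tr -s ' ' '\n')

echo "empty lines: $empty_count"
echo "$squeezed"
```

Squeezing still leaves one empty line if the text starts with a space, so filtering with `grep -E -v '^$'` remains the safer general choice.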

We can **combine** all these to make a cleaned and normalized frequency list.

## Next, continuing with the file `cleaned.txt.gz`

8. a) delete punctuation and numbers, and normalize to lower case
      
    b) transform to string-per-line format (i.e. word per line)
    
    c) make a frequency list of the lines

In [None]:
#8. a) Clean and normalize the data

! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | head

In [None]:
#8. b) Transform to string/token per line

#N.B. The empty lines!

! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | head

In [None]:
#8. c) Make a frequency list

#N.B. The most frequent item is an empty line.

! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort | uniq -c | sort -rn | head -10

In [None]:
# To get a frequency list without empty lines we need to remove them

#egrep -v "^$": print lines that do not match ^$ (i.e. print lines that are not empty)

! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$"| sort | uniq -c | sort -rn  | head -20

In [None]:
! zcat cleaned.txt.gz  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$"| sort | uniq -c | sort -rn | egrep "^the"

### Time out!

**New  commands**

`git clone`

`gzip`

`zcat`

`tr`

**Character classes and ranges** for matching larger groups of characters

`[:punct:]`

`[0-9]`

`[:upper:]`

`[:lower:]`

(**N.B.** `[:punct:]` matches **most** punctuation marks, so even after replacing `[:punct:]` with something else, you might still find some less common punctuation marks, e.g. typographic quotes or dashes, in your data)




## Exercise 3

#### **Recap**

Let's count the most frequent words of one text class from the CORE corpus, *NA*.

1. Grab the **NA texts** and direct them to a file called `na.txt`
2. Before counting the most frequent words, let's **normalize** to lower case and **remove** punctuation and numbers.
3. Make a **frequency list** of the words in the file. How long in the frequency list do you need to go before you start getting content words? (What do we mean by them?)
4. Where do you think these texts come from?
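
One thing to keep in mind when grepping for a label: the search looks at the whole line, not only at the label column. The sketch below uses three invented rows to show the difference between matching `NA` anywhere and matching it only at the start of the line. The rows and the one-label-per-row layout are assumptions made for this example; whether the real corpus contains such cases needs to be checked on the data. In a notebook, run it in a `%%bash` cell.

```shell
# Toy rows: label, id, text. The second row mentions NA inside the text.
rows=$(printf 'NA\t001\tOnce upon a time\nOP\t002\tThe value was NA in the form\nIN\t003\tBananas are yellow\n')

# -w matches NA as a whole word anywhere on the line.
anywhere=$(printf '%s\n' "$rows" | grep -E -w 'NA' | wc -l)

# ^ ties the match to the line start; [[:space:]] covers the tab after the label.
label_only=$(printf '%s\n' "$rows" | grep -E '^NA[[:space:]]' | wc -l)

echo "anywhere=$anywhere label_only=$label_only"
```

If a label column can hold several labels at once, the anchored pattern would need to be adapted accordingly.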

In [None]:
#1. Grab the NA texts and direct them to a file called na.txt

# first we need to egrep for the correct labels + texts

#N.B. egrep -w : match only whole words (i.e. prints lines with the whole word)
# egrep -w "cat" matches "cat", but not "cats", "catering" or "vacation" (egrep "cat" would match these, too)
# egrep searches the whole line, so a line is also printed if NA occurs as a word in the text column

#! zcat train.tsv.gz | egrep -w NA | head  # notice that this line would also work without quotes around NA, but it is good practice to use quotes anyway
! zcat train.tsv.gz | egrep -w "NA" | head

In [None]:
# good to check how many we got

! zcat train.tsv.gz | egrep -w NA | wc -l

In [None]:
#Then take the texts only and check that the output is correct

#cut -f 3 takes the third column with the text

! zcat train.tsv.gz | egrep -w NA | cut -f 3 | head

In [None]:
#Finally, direct the NA texts to a new file

! zcat train.tsv.gz | egrep -w NA | cut -f 3 > na.txt

In [None]:
! ls

In [None]:
! cat na.txt | head

In [None]:
#2. Normalize the data and remove punctuation and numbers

#It's a good idea to check that everything looks as it should
#Sometimes it might be useful to check what the command does after each pipe to make sure you get the intended output in the end
#Checking is a way to avoid mistakes (or if you have a mistake in the final output, it's a good idea to check each pipe's output)

! cat na.txt  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$" | head -50

In [None]:
#3. Frequency list

#head -100 | tail -50: first print the top 100 words, then the last 50 (the tail with 50 lines) from these top 100

! cat na.txt  | tr '[:punct:]' ' ' | tr '[0-9]' ' ' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | egrep -v "^$" | sort | uniq -c | sort -rn | head -100 | tail -50

## IV Regular expressions

Above, we saw that **regular expressions** can be used to match a larger group of strings
* `[:punct:]`
* `[:upper:] [:lower:]`
* `[0-9]`

**Note1**: regex syntax can vary between programming languages and tools

**Note2**: the letters **å**, **ä**, **ö** may not be recognized e.g. by `[:upper:]` and `[:lower:]`, depending on the tool and the locale settings

Some useful **operators**
* `^` beginning of line
* `$` end of line
* `^$` empty line (beginning + end without anything in between)
* `|` alternative, e.g., `"cat|dog"`
* `[]` character set, e.g. `[A-ZÅÄÖa-zåäö]`, `[0-9]`, `[abc]` ***any** of the characters*
* `()` group to form a whole, e.g. `(abc)|(def)`
* The same thing can be expressed in many ways, e.g. `[abc]` is the same as `"a|b|c"`

**NOTE**: if you want to search for an operator character literally (e.g. an actual `$` or `.`), you need to **escape** it with `\`

e.g. `egrep '\$|€|£'`

These (and more) are listed also here: https://www.guru99.com/linux-regular-expressions.html
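
The operators above can be tried out on a few invented lines before using them on the corpus. This is only a sketch: the word list is made up, and `grep -E` is used, which behaves the same as `egrep`. In a notebook, run it in a `%%bash` cell.

```shell
# Six toy lines, one of them empty.
words=$(printf 'cat\ncats\ndog\n\nconcat\n$5\n')

starts=$(printf '%s\n' "$words" | grep -E '^cat')       # line starts with cat
ends=$(printf '%s\n' "$words" | grep -E 'cat$')         # line ends with cat
either=$(printf '%s\n' "$words" | grep -E 'cat|dog')    # cat or dog anywhere
empty=$(printf '%s\n' "$words" | grep -E -c '^$')       # number of empty lines

# Escaped, $ is a literal dollar sign; unescaped, it is the line end and matches every line.
literal_dollar=$(printf '%s\n' "$words" | grep -E '\$')
every_line=$(printf '%s\n' "$words" | grep -E -c '$')

echo "$starts"
echo "$ends"
echo "$literal_dollar"
```

Single quotes are used around the patterns here so that the shell leaves `$` and `\` untouched.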




## Practicing regex

1. Let's first make a version of the `cleaned.txt.gz` file with one token per line. Direct the new version to the file `one-per-line.txt`


2. In this new file, grep the lines that
  
    a) match the string "is"
  
    b) start with "is"
  
    c) end with "is"

3. Grep the lines

    a) ending with "ing"
  
    b) starting with a capital letter

    c) with any punctuation mark

    d) beginning with a punctuation mark

    e) ending with a punctuation mark

    f) beginning with a punctuation mark followed by a capital letter

4. Find tokenization mistakes.

5. Remove all vowels.

In [None]:
# Check the cleaned.txt.gz file

! zcat cleaned.txt.gz | head

In [None]:
#1. Make a token per line file; start by checking the output

! zcat cleaned.txt.gz  | tr ' ' '\n' | egrep -v "^$" | head

In [None]:
#Direct the output to a new file

! zcat cleaned.txt.gz  | tr ' ' '\n' | egrep -v "^$" > one-per-line.txt

In [None]:
! ls

In [None]:
#Check the file contents

! cat one-per-line.txt | head -5

In [None]:
#2. a) Lines with "is"

! echo "any line with the string"
! cat one-per-line.txt | egrep "is" | head -4 # any line with the string
! echo

#2. b) Lines starting with "is"

! echo "lines starting with the string"
! cat one-per-line.txt | egrep "^is" | head -4 # lines starting with is
! echo

#2. c) Lines that end with "is"

! echo "lines ending with the string"
! cat one-per-line.txt | egrep "is$" |  egrep -v "^is$"| head -4 # lines ending with is

In [None]:
#3. a) Lines that end with "ing"

! cat one-per-line.txt | egrep "ing$" | head -20 # any line ending with ing

In [None]:
#3. b) Lines that start with a capital letter

! cat one-per-line.txt | egrep "^[[:upper:]]" | head -5 # any line starting with a capital letter

In [None]:
#3. c) Lines with a punctuation mark

! cat one-per-line.txt | egrep "[[:punct:]]" | head -10

In [None]:
#3. d) Lines that begin with a punctuation mark

! cat one-per-line.txt | egrep "^[[:punct:]]" | head -5 # anything starting with punctuation

In [None]:
#3. e) Lines that end with a punctuation mark

! cat one-per-line.txt | egrep "[[:punct:]]$" | head -5 # anything ending with punctuation

In [None]:
#3. f) Lines that begin with a punctuation mark followed by a capital letter

! cat one-per-line.txt | egrep "^[[:punct:]][A-Z]" | head -5 # anything starting with punctuation and then a capital letter

In [None]:
#4. Find tokenization mistakes (i.e. not tokenized words)

#egrep "[a-zA-Z],[a-zA-Z]" matches lines in which any letter is followed by a comma (no white space in between)
#followed by any letter (no white space after the comma)

! cat one-per-line.txt | egrep "[a-zA-Z],[a-zA-Z]" | head -5

In [None]:
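#5. Remove all vowels

#tr -d deletes the listed characters; no square brackets are needed, since tr treats [ and ] as ordinary characters

! cat one-per-line.txt | tr -d 'aeiouy' | head -5 # all lower case vowels away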

### A couple more useful operators

* `.` any character
* `*` the preceding item 0 times or more (i.e. present or absent)
* `+` the preceding item 1 time or more
* `?` the preceding item 0 or 1 time

Operators can also be **combined**
- `.*` any character 0 or more times
- `.?` any character 0 or 1 time
- `.+` any character 1 or more times
- `a+` (the letter) *a* 1 or more times
- `a.*` (the letter) *a* followed by any character 0 or more times
- `a?.$` (the letter) *a* 0 or 1 time followed by any character and line end
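
The differences between `*`, `+` and `?` are easiest to see side by side. The sketch below applies them to five invented tokens; `grep -E` behaves the same as `egrep`. In a notebook, run it in a `%%bash` cell.

```shell
tokens=$(printf 'a\nab\naab\nb\nbanana\n')

star=$(printf '%s\n' "$tokens" | grep -E '^a*b$')      # zero or more a, then b
plus=$(printf '%s\n' "$tokens" | grep -E '^a+b$')      # one or more a, then b
optional=$(printf '%s\n' "$tokens" | grep -E '^a?b$')  # at most one a, then b

# .* may match nothing, so a lone "a" is included; .+ needs at least one more character.
dot_star=$(printf '%s\n' "$tokens" | grep -E '^a.*$')
dot_plus=$(printf '%s\n' "$tokens" | grep -E '^a.+$')

echo "star:     $(echo $star)"
echo "plus:     $(echo $plus)"
echo "optional: $(echo $optional)"
echo "dot_star: $(echo $dot_star)"
echo "dot_plus: $(echo $dot_plus)"
```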

# Exercise 4

In the file `one-per-line.txt` grep lines

1. with only one capital letter and nothing else
2. with one or more capital letters but nothing else
3. starting with the letter "a" followed by any character 0 or more times, and ending with ing
4. with proper nouns starting with A in the possessive form, i.e. words starting with the capital A followed by any character 0 or more times followed by 's and line end
5. with words in the form wOrd, wORD, woRD, or worD, i.e. lines that start with one or more lower case letter(s) followed by one or more capital letter(s)

Let's do these first exercises **together**.

In [None]:
# 1. Lines with only one capital letter and nothing else

! cat one-per-line.txt | egrep "^[A-Z]$" | head -5

In [None]:
# 2. Lines with one or more capital letters but nothing else

! cat one-per-line.txt | egrep "^[A-Z]+$" | tail -5

In [None]:
# 3. Lines starting with the letter "a" followed by any character 0 or more times, and ending with ing

! cat one-per-line.txt | egrep "^a.*ing$" | head

In [None]:
# 4. Lines with proper nouns starting with A in the possessive form, i.e. words starting with a capital A
# followed by any character 0 or more times followed by 's and line end
# (sort before uniq: uniq only removes consecutive duplicates)

! cat one-per-line.txt | egrep "^A.*'s$" | sort | uniq | head

In [None]:
# 5. Lines with words in the form wOrd, wORD, woRD, worD
# i.e. lines that start with one or more lower case letter(s) followed by one or more capital letter(s)

! cat one-per-line.txt  | egrep "^[[:lower:]]+[[:upper:]]+" | head -5

## **N.B.**

`.` stands for any **character**; i.e., not only letters but also e.g. numbers

`[a-z]` stands for *any* lower case **letter**

`[A-Z]` stands for *any* upper case **letter**
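
A short sketch of the difference on four invented two-character tokens. `LC_ALL=C` is set here as a precaution, because in some locale settings a range like `[a-z]` can behave differently; with it, the ranges cover exactly the ASCII letters. In a notebook, run it in a `%%bash` cell.

```shell
tokens=$(printf 'a1\nab\naB\na-\n')

# . accepts any second character: digit, letter or punctuation.
any=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E -c '^a.$')

# The ranges accept only letters of the given case.
lower=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E '^a[a-z]$')
upper=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E '^a[A-Z]$')

echo "any=$any lower=$lower upper=$upper"
```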


### Continue with the following exercises

In the file `one-per-line.txt` grep lines

6. with the word "cat" ending with punctuation mark(s)
7. with the word "cat" ending with just one punctuation mark
8. with compound words with a hyphen, e.g. "co-operate"
9. that end in full stop
10. written fully in capitals
11. starting with punctuation
12. with words only (i.e. lines with only letters)

13. Can you tell what the difference is between these two lines:

    `egrep "^[[:punct:]]$"`

    `egrep "^[[:punct:]]+$"`

14. Can you tell what the difference is between these two lines:

    `egrep "^.[[:punct:]].[[:punct:]]."`

    `egrep "^[a-z][[:punct:]][a-z][[:punct:]][a-z]"`



In [None]:
# 6. Lines with the word "cat" ending with punctuation mark(s)

# + allows one or more punctuation marks, $ makes sure nothing else follows them

! cat one-per-line.txt  | egrep "^cat[[:punct:]]+$" | head -10

In [None]:
# 7. Lines with the word "cat" ending with just one punctuation mark

! cat one-per-line.txt  | egrep "^cat[[:punct:]]$" | head -10

In [None]:
# 8. Lines with compound words with a hyphen, e.g. "co-operate"

! cat one-per-line.txt  | egrep "^.+-.+$" | head -10  # this will match also e.g. 123-123
! echo '----------------------------------------'
! cat one-per-line.txt  | egrep "^[a-zA-Z]+-[a-zA-Z]+$" | head -10  # this will only match words that consist of letters, so it also omits e.g. words ending in a comma

In [None]:
# 9. Lines that end in full stop

! cat one-per-line.txt  | egrep "^.+\.$" | head -10

In [None]:
# 10. Lines written fully in capitals
# (The output might make more sense if you remove 'I')

! cat one-per-line.txt  | egrep "^[[:upper:]]+$" | egrep -v "^I$" | head -10

In [None]:
# 11. Lines starting with punctuation

! cat one-per-line.txt | egrep "^[[:punct:]]+" | head -10

In [None]:
# 12. Lines with words only (i.e. lines with only letters)

! cat one-per-line.txt | egrep '^[a-zA-Z]+$' | head -10

In [None]:
# 13. Can you tell what the difference is between these two lines:
# egrep "^[[:punct:]]$"
# egrep "^[[:punct:]]+$"

! cat one-per-line.txt | egrep "^[[:punct:]]$" | tail -10  # finds lines with exactly one punctuation mark
! echo '----------------------------------'
! cat one-per-line.txt | egrep "^[[:punct:]]+$" | tail -10  # finds lines with one or more punctuation marks

In [None]:
# 14. Can you tell what the difference is between these two lines:
# egrep "^.[[:punct:]].[[:punct:]]."
# egrep "^[a-z][[:punct:]][a-z][[:punct:]][a-z]"

! cat one-per-line.txt | egrep "^.[[:punct:]].[[:punct:]]." | head -10  # one character + one punct + one character + one punct + one character
! echo '-----------------------------------'
! cat one-per-line.txt | egrep "^[a-z][[:punct:]][a-z][[:punct:]][a-z]" | head -10  # one lower case letter + one punct + one lower case letter + one punct + one lower case letter

# **In this Notebook we covered the following**

`git clone`

`gzip`

`zcat`

`tr`

**Regular expressions, a.k.a. regex** such as

`[[:punct:]]`

`[0-9]`

`.`

`*`

`^`

`$`

and the combination of these, e.g.:

`^a?.$`

`^a.*ing$`

`^[[:lower:]]+[[:upper:]]+`



