# Text processing on the Linux command line

### Data-oriented Programming Paradigms, 2021 WS
10/12/2021

Gábor Recski

In this notebook we learn how to perform simple text processing tasks by combining a set of tools available on __UNIX-like systems__ (such as Linux and Mac OS) using __pipes__.

For a very brief introduction to the Linux command line, including links to additional documentation, see this notebook:


[Introduction to the Linux command line](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01a_Intro_to_Linux_command_line.ipynb)

To run the commands in this notebook, you need to install the bash kernel for Jupyter, for example like this:
```
pip install bash-kernel
python -m bash_kernel.install
```

In [None]:
export LC_ALL=C
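Setting `LC_ALL=C` makes tools such as `sort`, `comm` and `join` compare lines byte by byte, so that they all agree on what "sorted" means. A small sketch of the effect, using made-up input (the variable name `c_sorted` is only for illustration):

```shell
# In the C locale lines are compared byte by byte,
# so all uppercase ASCII letters sort before all lowercase ones.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
```

In many other locales the same input would be ordered case-insensitively, which is why mixing locales between `sort` and `comm` or `join` can lead to "input is not in sorted order" complaints.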

The following steps are based on two files in the `data` folder: `data/alice_tok.txt` contains the **tokenized** version of the novel _Alice in Wonderland_, and `data/stopwords.txt` contains a list of English **stopwords**, i.e. words that express some grammatical function and that we often want to ignore in text processing applications. Both files were created in [this notebook](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01_Text_processing.ipynb) on the basics of text processing.

Let's have a look at the two files we are going to work with:

In [None]:
head data/alice_tok.txt

In [None]:
head data/stopwords.txt

`grep` is the command for matching regular expressions. Let's use it to find capitalized words.

The pipe symbol `|` means that the output of one command becomes the input of the next one:

In [None]:
grep -E '^[A-Z][a-z]+' data/alice_tok.txt | head

_You can ignore the occasional `Broken pipe` errors. They just mean that a command in the pipeline was still writing output when the next one had already finished._
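To see what the pattern `^[A-Z][a-z]+` accepts and rejects, here is a small sketch on made-up tokens (the sample words and variable names are assumptions, not taken from the data file):

```shell
# Made-up sample tokens, one per line
sample_tokens=$(printf 'Alice\nwas\nI\nTHE\nRabbit\nrabbit-hole\n')
# Keep lines that start with one uppercase letter
# followed by at least one lowercase letter
capitalized=$(printf '%s\n' "$sample_tokens" | LC_ALL=C grep -E '^[A-Z][a-z]+')
printf '%s\n' "$capitalized"
```

`I` and `THE` are dropped because the second character has to be a lowercase letter. The pattern is only anchored at the start of the line, so a token like `Alice's` would also be kept.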

The `cat` command is often used at the beginning of a pipeline, since all it does by default is send the contents of a file to the standard output:

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | head
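Starting with `cat` is a matter of style: the same result can be obtained by passing the file name to `grep` or by redirecting the file to its standard input. A minimal sketch with a temporary file standing in for the real data (file contents and variable names are made up):

```shell
# Temporary stand-in for data/alice_tok.txt
tmp_tokens=$(mktemp)
printf 'Alice\nwas\nRabbit\n' > "$tmp_tokens"

with_cat=$(cat "$tmp_tokens" | grep -E '^[A-Z][a-z]+')       # file piped in by cat
with_arg=$(grep -E '^[A-Z][a-z]+' "$tmp_tokens")             # file given as argument
with_redirect=$(grep -E '^[A-Z][a-z]+' < "$tmp_tokens")      # file redirected to stdin

printf '%s\n' "$with_cat"
rm -f "$tmp_tokens"
```

The `cat` form keeps the input at the left end of the pipeline, which makes long pipelines easier to read and to extend step by step.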

Counting can be implemented as a combination of sorting and aggregation:

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -20
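The first `sort` is needed because `uniq` only merges identical lines that are adjacent. A small sketch on made-up words (the sample and the variable names are assumptions for illustration):

```shell
# Made-up sample in which no two identical words are adjacent
sample_words=$(printf 'Alice\nRabbit\nAlice\nQueen\nAlice\nRabbit\n')

# Without sorting first, uniq merges nothing and we still get 6 lines
unsorted_lines=$(printf '%s\n' "$sample_words" | uniq -c | wc -l | tr -d ' ')

# sort groups identical lines, uniq -c counts each group,
# sort -nr ranks the groups by count; the last sed strips the padding
ranked=$(printf '%s\n' "$sample_words" | sort | uniq -c | sort -nr | sed 's/^ *//')
printf '%s\n' "$ranked"
```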

Let's save this for later. The `sed` command can be used for regex-based search-and-replace; here we use it to get a more convenient format (the word, a tab, then the count). Then we sort the lines alphabetically; we'll see why later.

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort | head

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort > data/ent_freqs.txt
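The `sed` expression captures the count and the word as two groups and swaps them. Writing `\t` for a tab in the replacement works in GNU `sed`, but other `sed` implementations may not support it. The sketch below applies the same expression to one made-up line and also shows a possible `awk` alternative, which is an assumption on our part and not the command used in this notebook:

```shell
# Made-up line in the shape produced by uniq -c: padded count, space, word
uniq_line='      3 Alice'

# The command from the notebook: swap the two captured groups,
# with a tab between them (relies on GNU sed understanding \t)
printf '%s\n' "$uniq_line" | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/'

# Possible alternative: awk splits on whitespace, so the count is
# field 1 and the word is field 2 (assumes the word contains no spaces)
swapped=$(printf '%s\n' "$uniq_line" | awk '{ print $2 "\t" $1 }')
printf '%s\n' "$swapped"
```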

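Let's filter out stopwords. First we strip the counts, then we lowercase the words and use `comm -13` to keep only those that do not appear in the (sorted) stopword list: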

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | head

In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | head
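`comm` compares two sorted inputs and prints three columns: lines only in the first, lines only in the second, and lines in both. With `-13` the first and third columns are suppressed, so what remains are the lines of the second input (here `-`, the standard input) that are not in the stopword file. A minimal sketch with a made-up stopword list:

```shell
# Made-up, already sorted stopword list standing in for data/stopwords.txt
tmp_stop=$(mktemp)
printf 'and\nthe\nthen\n' > "$tmp_stop"

# Candidate words are sorted first, because comm expects sorted input;
# comm -13 then keeps only the lines that are unique to standard input
kept=$(printf 'the\nalice\nrabbit\nand\n' | LC_ALL=C sort | LC_ALL=C comm -13 "$tmp_stop" -)
printf '%s\n' "$kept"
rm -f "$tmp_stop"
```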

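Now let's find a way to get rid of the first words of sentences, since these are capitalized whether or not they are names. In the tokenized file, sentences are separated by empty lines.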

In [None]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | head
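The pipeline above marks sentence boundaries with `@`, joins everything into one line, splits it again at the markers to get one sentence per line, drops the first field of each line with `cut`, and finally goes back to one token per line. It relies on `sed` turning `\n` in the replacement into a newline, which GNU `sed` does. As a rough alternative sketch (not the method used in this notebook), `awk` can read blank-line-separated blocks directly; the sample sentences and variable names below are made up:

```shell
# Two made-up sentences in the tokenized format:
# one token per line, an empty line between sentences
sample_sents=$(printf 'The\nRabbit\nran\n.\n\nAlice\nsaw\nthe\nQueen\n.\n')

# RS="" makes awk treat each blank-line-separated block as one record,
# FS="\n" makes every token a field, so the loop can start at field 2
non_initial=$(printf '%s\n' "$sample_sents" |
  awk 'BEGIN { RS = ""; FS = "\n" } { for (i = 2; i <= NF; i++) print $i }')
printf '%s\n' "$non_initial"
```

The sentence-initial `The` and `Alice` are gone, while `Rabbit` and `Queen` survive and would be picked up by the `grep` step.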

In [None]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | head -30

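Now we can add the frequencies from the saved file. The words have to be capitalized again (`\u&` in GNU `sed`) so that they match the entries in `data/ent_freqs.txt`; `join` then merges the matching lines: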

In [None]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | sed 's/^./\u&/' | join - data/ent_freqs.txt | sort -k2 -nr
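This is why the frequency file was sorted alphabetically before saving: `join` matches lines of two inputs on their first field and expects both to be sorted on that field. A minimal sketch with a made-up frequency file (the counts and variable names are invented for illustration):

```shell
# Made-up frequency file in the word<TAB>count format, sorted by word
tmp_freqs=$(mktemp)
printf 'Alice\t3\nQueen\t1\nRabbit\t2\n' > "$tmp_freqs"

# join keeps only the words present in both inputs and appends the count;
# sort -k2 -nr then orders the result by the count in the second field
joined=$(printf 'Queen\nRabbit\n' | LC_ALL=C join - "$tmp_freqs" | sort -k2 -nr)
printf '%s\n' "$joined"
rm -f "$tmp_freqs"
```

Note that `join` writes its output with a single space between fields by default, even though the frequency file uses tabs.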

### Homework

1. (25 points) Redo all steps of extracting a frequency count of entities, but using _Alice in Wonderland_ in another language. The German version is in `data/alice_de.txt`, but you can choose any other language for which you can find a plain text version online (try [Project Gutenberg](https://www.gutenberg.org/))! Start by adapting the preprocessing and segmentation steps in [this notebook](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01_Text_processing.ipynb) to your chosen language and creating the two files used in this notebook (the tokenized text and the list of stopwords). Then check if the remaining steps also need modification.

1. (75 points) Improve the solution in this notebook to also include multi-word entities (we didn't find the Mock Turtle or the March Hare!). There may be many different ways to do this.

### Submission instructions

Submit your solution via TUWEL, by uploading 3 files:
- The tokenized input text for your chosen language (e.g. `alice_de_tok.txt`)
- The list of stopwords for your chosen language (e.g. `stopwords_de.txt`)
- A file with the extension `.sh` containing your command(s), with short explanations as comments (lines starting with `#`).

__The cell below shows how the solution in this notebook would have to be submitted.__

In [None]:
# extract capitalized words, count them, reformat, save to file: 
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort > data/ent_freqs.txt

# reformat to get one sentence per line, keep all but the first words of sentences, reformat again to one word per line, extract capitalized words, count them, keep the top 50, and then only those that are not in the stopwords file, finally match the lines to the lines of the word frequency file and sort by frequency
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | sed 's/^./\u&/' | join - data/ent_freqs.txt | sort -k2 -nr

## Homework solution

### Tokenization

In [None]:
# pwd

In [None]:
# cd homework/ex1/

In [None]:
import re
from nltk.tokenize import word_tokenize, sent_tokenize


def clean_text(text):
    cleaned_text = re.sub("_", "", text)
    cleaned_text = re.sub("\n", " ", cleaned_text)
    return cleaned_text


def head(text, n: int = 1000):
    print(text[:n])

In [None]:
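# read the German text; the encoding is given explicitly because of the umlauts
with open("data/alice_de.txt", encoding="utf-8") as f:
    text = f.read()
print(text[:100])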

In [None]:
text = clean_text(text)
head(text)

In [None]:
# use the German sentence splitting model instead of the default English one
sens = sent_tokenize(text, language="german")
head(sens, 10)

In [None]:
toks = [word_tokenize(sen, language="german") for sen in sens]

# one token per line, sentences separated by an empty line
with open("data/alice_tok_de.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join("\n".join(sen) for sen in toks) + "\n")

### Stopwords

In [None]:
import nltk

nltk.download("punkt")
nltk.download("stopwords")

In [None]:
from nltk.corpus import stopwords

stopwords = set(stopwords.words("german"))

# should all be lower case words
print(list(stopwords)[0:10])



In [None]:
def batch(iterable, n=1):
    # sets do not support slicing, so convert to a list first
    items = list(iterable)
    for ndx in range(0, len(items), n):
        yield items[ndx:ndx + n]

words_per_line = 7

print("\n".join(", ".join(w_slice) for w_slice in batch(stopwords, words_per_line)))

In [None]:
import re

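def check_if_in(wset, pattern, regex=False):
    if regex:
        # match whole words only, so that a pattern like '[Ss]ie'
        # is not reported just because it occurs inside a longer word
        res = sorted(w for w in wset if re.fullmatch(pattern, w))
        print(f"found: {res}")

    else:
        if pattern in wset:
            print(f"found: {pattern}")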

# check_if_in(stopwords, 'sie')
check_if_in(stopwords, '[Ss]ie', regex=True)
check_if_in(stopwords, '[Jj]a', regex=True)
check_if_in(stopwords, '[Nn]ein', regex=True)


In [None]:
# I think 'ja' and 'nein' would be good additions to
# the German stopword list.
# Both words were still in the filtered entity/character list.
# Other stopword lists I looked at suggest that the NLTK list is rather minimal.
# One could argue that a character named 'Ja'
# (which some authors might actually do)
# would get removed this way. Then again, the same is true
# of any word on the stopword list.
stopwords.add('ja')
stopwords.add('nein')

In [None]:
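# if the 'data' folder doesn't exist, it needs to be created,
# or the cwd of the JupyterLab kernel instance needs to be adjusted
# sorted() orders strings by code point, which for UTF-8 text matches
# the byte order used by `sort` and `comm` under LC_ALL=C
with open("data/stopwords_de.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(stopwords)) + "\n")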

## 1) Frequency count of entities

In [None]:
#!/bin/bash

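# make sure the output directory exists
mkdir -p out

# extract capitalized words, count them, reformat, lowercase, save to file
# (note: with LC_ALL=C, [A-Z][a-z] is ASCII-only, so words that start with
# an umlaut or have one as their second letter are not matched)
cat data/alice_tok_de.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | tr '[:upper:]' '[:lower:]' | sort > out/ent_freqs.txt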

# reformat to get one sentence per line, 
# get rid of the bracketed numbers: [ xx ] (e.g. [1] before 'Erstes Kapitel'),
# keep all but the first words of sentences, reformat again to one word per line, 
# extract capitalized words, count them, keep the top 50, 
# and then only those that are not in the stopwords file, 
# finally match the lines to the lines of the word frequency file and sort by frequency
cat data/alice_tok_de.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | \
sed -E "s/\[ ?[0-9]+ ?\] //" | \
cut -d' ' -f2- | tr ' ' '\n' | \
grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | \
sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords_de.txt - | \
join - out/ent_freqs.txt | sort -k2 -nr > out/result_german.txt

## 2) Multi-word Entities

(75 points) Improve the solution in this notebook to also include multi-word entities (we didn't find the Mock Turtle or the March Hare!). There may be many different ways to do this.
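One possible direction, sketched here under simplifying assumptions, is to treat every run of two or more consecutive capitalized tokens within a sentence as a candidate multi-word entity, again skipping the first token of each sentence. The helper `multiword_entities` and the sample text are hypothetical and not part of the original notebook, and the ASCII-only pattern would need to be adapted for German:

```shell
# Hypothetical helper: reads one token per line (empty line = sentence boundary)
# and prints each run of at least two consecutive capitalized tokens
multiword_entities() {
  LC_ALL=C awk '
    # print the current run if it spans at least two tokens, then reset it
    function emit() { if (n >= 2) print run; run = ""; n = 0 }
    # an empty line ends the sentence and the current run
    /^$/ { emit(); pos = 0; next }
    {
      pos++
      # extend the run with capitalized tokens that are not sentence-initial
      if (pos > 1 && $0 ~ /^[A-Z][a-z]+$/) { run = (n ? run " " : "") $0; n++ }
      else emit()
    }
    END { emit() }
  '
}

# Made-up sample in the tokenized format
sample_text=$(printf 'The\nMock\nTurtle\nsighed\n.\n\nThen\nthe\nMarch\nHare\nspoke\nto\nAlice\n.\n')
entities=$(printf '%s\n' "$sample_text" | multiword_entities)
printf '%s\n' "$entities"
```

The output of such a helper could be counted with the same `sort | uniq -c | sort -nr` combination as before. Single-word entities such as `Alice` are deliberately ignored here and would still come from the existing pipeline.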