# Text processing on the Linux command line

### Data-oriented Programming Paradigms, 2021 WS
10/12/2021

Gábor Recski

In this notebook we learn how to perform simple text processing tasks by combining a set of tools available on __UNIX-like systems__ (such as Linux and macOS) using __pipes__.

For a very brief introduction to the Linux command line, including links to additional documentation, see this notebook:


[Introduction to the Linux command line](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01a_Intro_to_Linux_command_line.ipynb)

To run the commands in this notebook, you need to install the bash kernel for Jupyter, for example like this:
```
pip install bash-kernel
python -m bash_kernel.install
```

In [2]:
export LC_ALL=C
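
Setting `LC_ALL=C` makes tools such as `sort`, `comm` and `join` compare lines byte by byte instead of using locale-specific collation rules, so every step of a pipeline agrees on what "sorted" means. A minimal sketch of the effect, using a made-up four-line input (the variable name `c_sorted` is only for illustration):

```shell
# In the C locale, sort compares raw bytes, so all uppercase ASCII letters
# come before all lowercase ones.
c_sorted=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# prints A, B, a, b (one per line)
```

In other locales the same input may come out in a different order, and `comm` or `join` may then complain about unsorted input if the locale changes between steps.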

The following steps are based on two files in the `data` folder: `data/alice_tok.txt` contains the **tokenized** version of the novel _Alice in Wonderland_, with one token per line, and `data/stopwords.txt` contains a list of English **stopwords**, words that express some grammatical function and that we often want to ignore in text processing applications. Both files were created in [this notebook](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01_Text_processing.ipynb) on the basics of text processing.

Let's have a look at the two files we are going to work with:

In [22]:
head -n 50 data/alice_tok.txt

CHAPTER
I
.

Down
the
Rabbit-Hole
Alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on
the
bank
,
and
of
having
nothing
to
do
:
once
or
twice
she
had
peeped
into
the
book
her
sister
was
reading
,
but
it
had
no
pictures
or


In [4]:
head data/stopwords.txt

a
about
above
after
again
against
ain
all
am
an


`grep` prints the lines of its input that match a regular expression. Let's use it to find capitalized words.

The pipe symbol `|` means that the output of one command becomes the input of the next one:

In [6]:
grep -E '^[A-Z][a-z]+' data/alice_tok.txt | head

Down
Rabbit-Hole
Alice
Alice
So
White
Rabbit
There
Alice
Rabbit
grep: write error: Broken pipe


_You can ignore the occasional `Broken pipe` errors. They only mean that a command in the pipeline was still writing output after the next one (here `head`) had already finished._
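
To see what the pattern `^[A-Z][a-z]+` does and does not match, it can be tried on a handful of tokens. This is a small sketch on made-up input, not on the actual data file; the variable name `matches` is hypothetical:

```shell
# Made-up sample tokens, one per line, in the same format as data/alice_tok.txt
matches=$(printf 'CHAPTER\nI\nDown\nthe\nRabbit-Hole\n.\n' | LC_ALL=C grep -E '^[A-Z][a-z]+')
printf '%s\n' "$matches"
# prints Down and Rabbit-Hole
```

The pattern is anchored only at the start of the line, so it accepts any token that begins with one uppercase letter followed by at least one lowercase letter. Single-letter tokens such as `I` and all-caps tokens such as `CHAPTER` are skipped.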

The `cat` command is often used at the beginning of a pipe, since all it does by default is send the contents of a file to the standard output:

In [7]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | head

Down
Rabbit-Hole
Alice
Alice
So
White
Rabbit
There
Alice
Rabbit
grep: write error: Broken pipe
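
For a single input file, `cat file | command` is interchangeable with passing the file as an argument or redirecting it to standard input. A small sketch with a throwaway temporary file (`tmp` and its three tokens are made up for illustration):

```shell
# Hypothetical temporary file standing in for data/alice_tok.txt
tmp=$(mktemp)
printf 'Alice\nwas\nbeginning\n' > "$tmp"

# Count matching lines in three equivalent ways
via_cat=$(cat "$tmp" | grep -c 'Alice')     # file sent through a pipe
via_arg=$(grep -c 'Alice' "$tmp")           # file passed as an argument
via_redirect=$(grep -c 'Alice' < "$tmp")    # file redirected to standard input

echo "$via_cat $via_arg $via_redirect"
rm -f "$tmp"
```

Starting with `cat` keeps the pipeline readable from left to right, which is presumably why this notebook uses it throughout.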


Counting can be implemented as a combination of sorting and aggregation (`sort` followed by `uniq -c`), after which `sort -nr` orders the result by decreasing count:

In [8]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -20

    397 Alice
    108 The
     74 Queen
     67 And
     64 It
     61 King
     57 Turtle
     56 Mock
     55 You
     55 Hatter
     55 Gryphon
     44 Rabbit
     43 What
     42 Duchess
     40 She
     40 Dormouse
     37 But
     33 There
     33 Oh
     32 March
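
`uniq -c` only merges identical lines that are adjacent, which is why the input has to be sorted first. A minimal sketch on five made-up tokens (all variable names are hypothetical):

```shell
# Made-up token list with repeated, non-adjacent entries
tokens=$(printf 'Queen\nAlice\nQueen\nAlice\nAlice\n')

# Without sorting, only the last two (adjacent) Alice lines are merged
unsorted_counts=$(printf '%s\n' "$tokens" | uniq -c)

# With sorting, each word ends up on exactly one line with its total count
sorted_counts=$(printf '%s\n' "$tokens" | LC_ALL=C sort | uniq -c | sort -nr)

printf '%s\n' "$unsorted_counts"
printf '%s\n' "$sorted_counts"
```

The first result has four lines, the second only two: one for `Alice` with count 3 and one for `Queen` with count 2.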


Let's save this for later. The `sed` command can be used for regex-based search-and-replace; here we use it to get a more convenient format, with the word first and the count second, separated by a tab. Then we sort the lines alphabetically; we'll see why later.

In [9]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort | head

Ada	1
Adventures	2
Advice	1
After	6
Ah	5
Ahem	1
Alas	1
Alice	397
All	6
Allow	1
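
The reformatting step can be tried in isolation on two made-up lines in `uniq -c` format. The sketch below assumes GNU `sed`, since `\t` in the replacement is a GNU extension. The `awk` variant is an alternative added here for comparison, not part of the original pipeline:

```shell
# Two made-up lines in the format produced by `uniq -c`
counts=$(printf '    397 Alice\n      1 Ada\n')

# Variant used in this notebook; \t in the replacement assumes GNU sed
with_sed=$(printf '%s\n' "$counts" | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/')

# Alternative sketch with awk, which swaps the two fields without GNU extensions
with_awk=$(printf '%s\n' "$counts" | awk '{ print $2 "\t" $1 }')

printf '%s\n' "$with_sed"
printf '%s\n' "$with_awk"
```

With GNU `sed`, both variants should produce the same tab-separated lines, word first and count second.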


In [None]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort > data/ent_freqs.txt

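Let's filter out stopwords. First we keep the 50 most frequent capitalized words and strip the counts; then we sort and lowercase them and use `comm -13` to keep only the lines that do not appear in the stopword list.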

In [10]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | head

Alice
The
Queen
And
It
King
Turtle
Mock
You
Hatter


In [14]:
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | head

alice
bill
cat
caterpillar
come
dinah
dodo
dormouse
duchess
gryphon
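
A minimal sketch of the `comm -13` step on made-up data (the miniature stopword list and the variable names `stop` and `kept` are assumptions for illustration). `comm` expects both inputs to be sorted in the same order:

```shell
# Hypothetical miniature stopword list, already sorted
stop=$(mktemp)
printf 'a\nthe\nyou\n' > "$stop"

# -1 suppresses lines found only in the first file (the stopwords),
# -3 suppresses lines common to both inputs,
# so only lines unique to the second input (standard input, `-`) remain.
kept=$(printf 'alice\nqueen\nthe\nyou\n' | LC_ALL=C comm -13 "$stop" -)

printf '%s\n' "$kept"
# prints alice and queen
rm -f "$stop"
```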


Now let's find a way to get rid of the first words of sentences.

In [None]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | head
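
The individual steps of this pipeline are easier to follow on a tiny input. The sketch below wraps them in a hypothetical helper function, `drop_first_tokens`, and runs it on two made-up sentences separated by an empty line. It assumes GNU `sed`, because `\n` in the replacement is a GNU extension:

```shell
# Hypothetical helper: remove the first token of every sentence
drop_first_tokens() {
  sed 's/^$/@/' |      # mark sentence boundaries, i.e. empty lines, with @
  tr '\n' ' ' |        # join all tokens into one space-separated line
  sed 's/ @ /\n/g' |   # break at the markers: one sentence per line
  cut -d' ' -f2- |     # keep everything from the second token on
  tr ' ' '\n'          # back to one token per line
}

# Made-up tokenized text: two sentences separated by an empty line
sample=$(printf 'The\nQueen\nsaid\n.\n\nAlice\nsaw\nthe\nKing\n.\n')

rest=$(printf '%s\n' "$sample" | drop_first_tokens | LC_ALL=C grep -E '^[A-Z][a-z]+')
printf '%s\n' "$rest"
# prints Queen and King
```

Note that the sentence-initial `Alice` is dropped together with `The`: removing first tokens also removes names that happen to start a sentence.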

In [None]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | head -30

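Now we can add the frequencies from the saved file. We capitalize the words again with `sed` so that they match the entries in `data/ent_freqs.txt`, then combine the two lists with `join`. Since `join` requires both inputs to be sorted on the join field, this is why we sorted the saved file alphabetically.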

In [18]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | sed 's/^./\u&/' | join - data/ent_freqs.txt | sort -k2 -nr

Alice 397
Queen 74
King 61
Turtle 57
Mock 56
Hatter 55
Gryphon 55
Rabbit 44
Duchess 42
Dormouse 40
Oh 33
March 32
Hare 31
Mouse 30
Caterpillar 27
Cat 26
Well 23
White 22
Come 20
Dinah 14
Bill 14
Soup 13
Dodo 13
Yes 12
Majesty 12
Pigeon 11
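
The last three steps (re-capitalizing, joining, sorting by count) can be sketched on a miniature frequency table. The file contents and the variable names `freqs` and `joined` are made up for illustration, and `\u` in the `sed` replacement assumes GNU `sed`:

```shell
# Hypothetical miniature frequency table, tab-separated and sorted
# alphabetically, in the same format as data/ent_freqs.txt
freqs=$(mktemp)
printf 'Ada\t1\nAlice\t397\nQueen\t74\n' > "$freqs"

# Capitalize the first letter, look up each word in the table,
# then sort numerically by the second field in descending order
joined=$(printf 'alice\nqueen\n' |
  sed 's/^./\u&/' |
  LC_ALL=C join - "$freqs" |
  LC_ALL=C sort -k2 -nr)

printf '%s\n' "$joined"
rm -f "$freqs"
```

`join` matches lines on their first field and prints only the pairs found in both inputs, so `Ada`, which is missing from the word list, does not appear in the result.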


In [17]:
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt -

alice
bill
cat
caterpillar
come
dinah
dodo
dormouse
duchess
gryphon
hare
hatter
king
majesty
march
mock
mouse
oh
pigeon
queen
rabbit
soup
turtle
well
white
yes


### Homework

1. (25 points) Redo all steps of extracting a frequency count of entities, but using _Alice in Wonderland_ in another language. The German version is in `data/alice_de.txt`, but you can choose any other language for which you can find a plain text version online (try [Project Gutenberg](https://www.gutenberg.org/))! Start by adapting the preprocessing and segmentation steps in [this notebook](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01_Text_processing.ipynb) to your chosen language and creating the two files used in this notebook (the tokenized text and the list of stopwords). Then check if the remaining steps also need modification.

1. (75 points) Improve your solution to also include multi-word entities (above we didn't find the Mock Turtle or the March Hare!). There may be many different ways to do this.

### Submission instructions

Submit your solution via TUWEL, by uploading 3 files:
- The tokenized input text for your chosen language (e.g. `alice_de_tok.txt`)
- The list of stopwords for your chosen language (e.g. `stopwords_de.txt`)
- A file with the extension `.sh` containing your command(s) for exercises 1 and 2, with short explanations as comments (lines preceded by #).

__IMPORTANT: Make sure that running your script (e.g. with the command `bash your_solution.sh`) requires only the two files you uploaded and produces the desired output!__

The cell below shows how the solution in this notebook would have to be submitted:

In [19]:
# extract capitalized words, count them, reformat, save to file: 
cat data/alice_tok.txt | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | sed 's/^ *\([0-9]*\) \(.*\)$/\2\t\1/' | sort > data/ent_freqs.txt

# reformat to get one sentence per line, keep all but the first words of sentences, reformat again to one word per line, extract capitalized words, count them, keep the top 50, and then only those that are not in the stopwords file, finally match the lines to the lines of the word frequency file and sort by frequency
cat data/alice_tok.txt | sed 's/^$/@/' | tr '\n' ' ' | sed 's/ @ /\n/g' | cut -d' ' -f2- | tr ' ' '\n' | grep -E '^[A-Z][a-z]+' | sort | uniq -c | sort -nr | head -50 | sed 's/^[ 0-9]*//' | sort | tr '[:upper:]' '[:lower:]' | comm -13 data/stopwords.txt - | sed 's/^./\u&/' | join - data/ent_freqs.txt | sort -k2 -nr

Alice 397
Queen 74
King 61
Turtle 57
Mock 56
Hatter 55
Gryphon 55
Rabbit 44
Duchess 42
Dormouse 40
Oh 33
March 32
Hare 31
Mouse 30
Caterpillar 27
Cat 26
Well 23
White 22
Come 20
Dinah 14
Bill 14
Soup 13
Dodo 13
Yes 12
Majesty 12
Pigeon 11
