# Basics of text processing

### Natural Language Processing and Information Extraction, 2021 WS
10/15/2021

Gábor Recski

## In this lecture
- Regular Expressions

- Text segmentation and normalization:
   - sentence splitting and tokenization
   - lemmatization, stemming, decompounding, morphology

## Import dependencies

In [1]:
import re
from collections import Counter

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize, sent_tokenize
import stanza
stanza.download('en')
stanza.download('de')

[nltk_data] Downloading package punkt to /home/eszter/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/eszter/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



## Regular expressions

### Basics

![re1](media/re1.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [2]:
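# the text contains non-ASCII quotation marks, so state the encoding explicitly
with open('data/alice.txt', encoding='utf-8') as f:
    text = f.read()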
print(text[:100])


CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on


In [12]:
re.search('Alice', text)

<re.Match object; span=(35, 40), match='Alice'>

In [13]:
text[35:40]

'Alice'

![re2](media/re2.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [14]:
re.search('[Rr]abbit', text)

<re.Match object; span=(21, 27), match='Rabbit'>

In [15]:
re.findall('[Rr]abbit', text[:5000])

['Rabbit', 'Rabbit', 'Rabbit', 'Rabbit', 'rabbit', 'rabbit', 'rabbit']

In [16]:
for match in re.finditer('[Rr]abbit', text[:5000]):
    print(match.group(), match.span())

Rabbit (21, 27)
Rabbit (589, 595)
Rabbit (743, 749)
Rabbit (959, 965)
rabbit (1149, 1155)
rabbit (1341, 1347)
rabbit (1486, 1492)


![re3](media/re3.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [17]:
re.findall(' [A-Za-z][a-z][a-z] ', text[:5000])

[' the ',
 ' was ',
 ' get ',
 ' her ',
 ' and ',
 ' she ',
 ' her ',
 ' was ',
 ' but ',
 ' had ',
 ' the ',
 ' she ',
 ' her ',
 ' she ',
 ' for ',
 ' day ',
 ' her ',
 ' and ',
 ' the ',
 ' the ',
 ' the ',
 ' was ',
 ' nor ',
 ' out ',
 ' the ',
 ' the ',
 ' say ',
 ' she ',
 ' her ',
 ' she ',
 ' but ',
 ' all ',
 ' but ',
 ' the ',
 ' out ',
 ' its ',
 ' and ',
 ' and ',
 ' her ',
 ' for ',
 ' her ',
 ' out ',
 ' and ',
 ' she ',
 ' and ',
 ' was ',
 ' see ',
 ' pop ',
 ' the ',
 ' the ',
 ' she ',
 ' get ',
 ' for ',
 ' and ',
 ' had ',
 ' she ',
 ' the ',
 ' was ',
 ' she ',
 ' for ',
 ' she ',
 ' her ',
 ' she ',
 ' and ',
 ' she ',
 ' but ',
 ' was ',
 ' see ',
 ' the ',
 ' the ',
 ' and ',
 ' and ',
 ' and ',
 ' she ',
 ' and ',
 ' She ',
 ' jar ',
 ' one ',
 ' the ',
 ' was ',
 ' but ',
 ' her ',
 ' was ',
 ' she ',
 ' not ',
 ' the ',
 ' for ',
 ' put ',
 ' one ',
 ' she ',
 ' How ',
 ' all ',
 ' say ',
 ' off ',
 ' the ',
 ' was ',
 ' the ',
 ' she ',
 ' the ',
 ' the ',


In [18]:
Counter(re.findall(' [A-Za-z][a-z][a-z] ', text)).most_common(10)

[(' the ', 1191),
 (' and ', 611),
 (' she ', 348),
 (' was ', 233),
 (' you ', 206),
 (' her ', 154),
 (' all ', 110),
 (' had ', 105),
 (' for ', 105),
 (' but ', 90)]
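
The pattern above requires a literal space on both sides of the word. That misses words next to punctuation or at the start of the text, and because matches cannot overlap, it also misses the second of two adjacent three-letter words (the shared space is already consumed). A minimal sketch of the difference, using the word-boundary anchor `\b` on a made-up sample sentence (the sentence and variable names are illustrative only):

```python
import re

# Made-up sample sentence, not taken from the corpus
sample = 'The cat and the dog sat, but the fox ran.'

# Space-delimited pattern, as in the cell above
spaced = re.findall(r' [A-Za-z][a-z][a-z] ', sample)

# Word boundaries match without consuming any characters
bounded = re.findall(r'\b[A-Za-z][a-z][a-z]\b', sample)

print(spaced)
print(bounded)
```

Note that `\b` also fires next to an apostrophe, so this variant would count the `isn` of `isn't` as a three-letter word.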

![re4](media/re4.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

![re5](media/re5.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

![re6](media/re6.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [19]:
re.findall('...', text[:100])

['CHA',
 'PTE',
 'R I',
 'Dow',
 'n t',
 'he ',
 'Rab',
 'bit',
 '-Ho',
 'Ali',
 'ce ',
 'was',
 ' be',
 'gin',
 'nin',
 'g t',
 'o g',
 'et ',
 'ver',
 'y t',
 'ire',
 'd o',
 'f s',
 'itt',
 'ing',
 ' by',
 ' he',
 'r s',
 'ist',
 'er ']

![re7](media/re7.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [21]:
re.findall(r'\w', text[:10])

['C', 'H', 'A', 'P', 'T', 'E', 'R', 'I']

In [22]:
re.split(r'\s', text[:100])

['',
 'CHAPTER',
 'I.',
 'Down',
 'the',
 'Rabbit-Hole',
 '',
 '',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on']

![re8](media/re8.png)([SLP Ch.2](https://web.stanford.edu/~jurafsky/slp3/2.pdf))

In [23]:
re.findall(r'\w+', text[:100])

['CHAPTER',
 'I',
 'Down',
 'the',
 'Rabbit',
 'Hole',
 'Alice',
 'was',
 'beginning',
 'to',
 'get',
 'very',
 'tired',
 'of',
 'sitting',
 'by',
 'her',
 'sister',
 'on']

In [24]:
Counter(re.findall(r'\w+', text)).most_common(20)

[('the', 1533),
 ('and', 803),
 ('to', 728),
 ('a', 617),
 ('it', 528),
 ('I', 523),
 ('she', 510),
 ('of', 502),
 ('said', 456),
 ('Alice', 396),
 ('in', 356),
 ('was', 351),
 ('you', 345),
 ('that', 274),
 ('as', 246),
 ('her', 244),
 ('t', 216),
 ('at', 202),
 ('s', 196),
 ('on', 189)]

In [25]:
Counter(re.findall(r'[^\w\s]', text)).most_common(20)

[(',', 2426),
 ('“', 1118),
 ('”', 1114),
 ('.', 987),
 ('’', 702),
 ('!', 451),
 ('—', 263),
 (':', 233),
 ('?', 203),
 (';', 193),
 ('-', 142),
 ('*', 60),
 ('(', 56),
 (')', 56),
 ('‘', 46),
 ('[', 2),
 (']', 2)]

### Substitution and groups

In [26]:
re.sub(r'\s+', ' ', text[:100])

' CHAPTER I. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on'

In [27]:
print(re.sub(r'\s+', '\n', text[:100]))


CHAPTER
I.
Down
the
Rabbit-Hole
Alice
was
beginning
to
get
very
tired
of
sitting
by
her
sister
on


In [28]:
re.findall(r'CHAPTER [^\s]+', text)

['CHAPTER I.',
 'CHAPTER II.',
 'CHAPTER III.',
 'CHAPTER IV.',
 'CHAPTER V.',
 'CHAPTER VI.',
 'CHAPTER VII.',
 'CHAPTER VIII.',
 'CHAPTER IX.',
 'CHAPTER X.',
 'CHAPTER XI.',
 'CHAPTER XII.']

In [29]:
print(re.sub(r'CHAPTER ([^\s]+)', r'Chapter \1', text[:100]))


Chapter I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on


In [30]:
print(re.sub(r'CHAPTER ([^\s.]+)\.\n([^\n]*)', r'Chapter \1: \2', text[:100]))


Chapter I: Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on


In [31]:
re.findall(r'CHAPTER ([^\s.]+)\.\n([^\n]*)', text)

[('I', 'Down the Rabbit-Hole'),
 ('II', 'The Pool of Tears'),
 ('III', 'A Caucus-Race and a Long Tale'),
 ('IV', 'The Rabbit Sends in a Little Bill'),
 ('V', 'Advice from a Caterpillar'),
 ('VI', 'Pig and Pepper'),
 ('VII', 'A Mad Tea-Party'),
 ('VIII', 'The Queen’s Croquet-Ground'),
 ('IX', 'The Mock Turtle’s Story'),
 ('X', 'The Lobster Quadrille'),
 ('XI', 'Who Stole the Tarts?'),
 ('XII', 'Alice’s Evidence')]
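
Numbered groups get hard to read once a pattern has more than two of them. As a sketch of an alternative, Python's `re` module also supports named groups via `(?P<name>...)`, which can be referenced as `\g<name>` in the replacement string. The group names `number` and `title` and the short sample string below are chosen for illustration only:

```python
import re

# Short stand-in for the beginning of two chapters
sample = 'CHAPTER I.\nDown the Rabbit-Hole\n\nCHAPTER II.\nThe Pool of Tears\n'

# Same pattern as above, with named instead of numbered groups
pattern = re.compile(r'CHAPTER (?P<number>[^\s.]+)\.\n(?P<title>[^\n]*)')

# groupdict() maps each group name to the text it matched
chapters = [match.groupdict() for match in pattern.finditer(sample)]
print(chapters)

# Named groups can be used in substitutions as \g<name>
renamed = pattern.sub(r'Chapter \g<number>: \g<title>', sample)
print(renamed)
```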

Regular expressions are surprisingly powerful. With the right implementation they are also very fast: regular expressions in the formal sense are equivalent to [finite state automata (FSAs)](https://en.wikipedia.org/wiki/Finite-state_machine), so an automaton-based matcher can process its input in time linear in the length of the text. More precisely, every regular expression defines a [regular language](https://en.wikipedia.org/wiki/Regular_language) and can be converted into an equivalent [regular grammar](https://en.wikipedia.org/wiki/Regular_grammar).
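
To make the connection to automata concrete, here is a minimal sketch of a hand-written deterministic automaton for the regular expression `(ab)+`. The transition table, state numbering, and function name are illustrative assumptions, not how Python's `re` module works internally. The loop reads each character exactly once, which is where the linear running time comes from.

```python
import re

# Transition table of a DFA for the regular expression (ab)+
# States: 0 = start, 1 = just read 'a', 2 = just read 'ab' (accepting)
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 2,
    (2, 'a'): 1,
}
ACCEPTING = {2}


def dfa_accepts(string):
    """Return True if the DFA ends in an accepting state after reading string."""
    state = 0
    for char in string:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING


# The automaton and the regular expression agree on these test strings
for s in ['ab', 'abab', 'aba', '', 'ba']:
    print(repr(s), dfa_accepts(s), bool(re.fullmatch('(ab)+', s)))
```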

![re_xkcd](media/re_xkcd.png)([XKCD #208](https://xkcd.com/208/))

## Text segmentation

### Sentence splitting

#### How to split a text into sentences?

In [32]:
text2 = "'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."

Naive: split on `.`, `!`, `?`, etc.

In [33]:
re.split('[.!?]', text2)

["'Of course it's only because Tom isn't home,' said Mrs",
 ' Parsons vaguely',
 '']

Better: use a language-specific list of abbreviations, collocations, etc.

In [34]:
nltk.sent_tokenize(text2)

["'Of course it's only because Tom isn't home,' said Mrs. Parsons vaguely."]

Custom lists of patterns are often necessary for special domains. 

In [None]:
text3 = "An die Stelle der Landesgesetze vom 17. Jänner 1883, n.ö.L.G. u. V.Bl. Nr. 35, vom 26. Dezember 1890, n.ö.L.G. u. V.Bl. Nr. 48, vom 17. Juni 1920 n.ö.L.G. u. V.Bl. Nr. 547, vom 4. November 1920 n.ö.L.G. u. V.Bl. Nr. 808, und vom 9. Dezember 1927, L.G.Bl. für Wien Nr. 1 ex 1928, die, soweit dieses Gesetz nichts anderes bestimmt, zugleich ihre Wirksamkeit verlieren, hat die nachfolgende Bauordnung zu treten."

In [None]:
print(text3)

In [None]:
nltk.sent_tokenize(text3, language='german')
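
One simple way to add such custom patterns is to keep a list of known abbreviations and refuse to split after them. The following is a minimal sketch using only the standard library. The abbreviation list and the function name are made up for illustration, and a real domain would need a much longer list.

```python
import re

# Hypothetical, hand-picked abbreviation list for illustration
ABBREVIATIONS = {'Mrs.', 'Mr.', 'Dr.', 'Nr.'}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on ., ! or ? followed by whitespace, except after known abbreviations."""
    sentences = []
    start = 0
    for match in re.finditer(r'[.!?]+(?=\s+|$)', text):
        # last whitespace-delimited token, including the punctuation just matched
        token = text[start:match.end()].split()[-1]
        if token in abbreviations:
            continue  # not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # keep any trailing text without final punctuation
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


sample = "'It's only because Tom isn't home,' said Mrs. Parsons vaguely. She left."
print(split_sentences(sample))
```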

### Tokenization

#### How to split text into words?

#### Naive approach: split on whitespace

In [35]:
text2.split()

["'Of",
 'course',
 "it's",
 'only',
 'because',
 'Tom',
 "isn't",
 "home,'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely.']

#### Better: separate punctuation marks

In [36]:
re.findall(r'(\w+|[^\w\s]+)', text2)[:30]

["'",
 'Of',
 'course',
 'it',
 "'",
 's',
 'only',
 'because',
 'Tom',
 'isn',
 "'",
 't',
 'home',
 ",'",
 'said',
 'Mrs',
 '.',
 'Parsons',
 'vaguely',
 '.']

#### Best: add some language-specific conventions:

In [37]:
nltk.word_tokenize(text2)

["'Of",
 'course',
 'it',
 "'s",
 'only',
 'because',
 'Tom',
 'is',
 "n't",
 'home',
 ',',
 "'",
 'said',
 'Mrs.',
 'Parsons',
 'vaguely',
 '.']
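
A few of these conventions can be approximated with a single regular expression. The sketch below handles English clitics such as `n't` and `'s` in roughly the way shown above. It is an illustrative pattern written for this example, not the rule set NLTK actually uses, and unlike NLTK it splits the period off abbreviations such as `Mrs.`.

```python
import re

# Alternatives are tried left to right, so the order matters
TOKEN_PATTERN = re.compile(r"""
    \w+(?=n't)    # the part of a word before the clitic n't
  | n't           # the clitic n't itself
  | '\w+          # other apostrophe clitics such as 's, 're, 'll
  | \w+           # ordinary words
  | [^\w\s]       # any single punctuation mark
""", re.VERBOSE)


def simple_tokenize(text):
    """Split text into words, clitics, and punctuation marks."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("It isn't Tom's."))
```

Like the NLTK output above, this pattern leaves an opening single quote attached to the following word.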

## Text normalization

In [None]:
words = nltk.word_tokenize(text)

In [None]:
words[:10]

In [None]:
Counter(words).most_common(10)

Let's get rid of punctuation:

In [None]:
words = [word for word in words if re.match(r'\w', word)]

In [None]:
Counter(words).most_common(10)

Filtering common function words is called **stopword removal**.

In [None]:
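# import under a different name so the set below does not shadow the corpus reader
from nltk.corpus import stopwords as nltk_stopwords
stopwords = set(nltk_stopwords.words('english'))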
print(stopwords)

In [None]:
words = [word for word in words if word.lower() not in stopwords]

In [None]:
Counter(words).most_common(20)
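
The whole filtering step fits in one expression. Here is a self-contained sketch on a made-up sentence, with a tiny hand-written stopword set standing in for the NLTK list (real stopword lists are much longer); the names `TOY_STOPWORDS` and `content_word_counts` are hypothetical:

```python
from collections import Counter

# Tiny stand-in for a real stopword list
TOY_STOPWORDS = {'the', 'a', 'of', 'and', 'to', 'was'}


def content_word_counts(tokens, stopwords=TOY_STOPWORDS):
    """Count lowercased tokens that are not stopwords."""
    return Counter(
        tok.lower() for tok in tokens if tok.lower() not in stopwords)


tokens = 'The Rabbit was late and the Rabbit ran'.split()
counts = content_word_counts(tokens)
print(counts)
```

Lowercasing before the lookup matters because stopword lists are usually lowercase, while sentence-initial tokens such as `The` are not.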

### Lemmatization and stemming

Words like _say_, _says_, and _said_ are all different **word forms** of the same **lemma**. Grouping them together can be useful in many applications. 

**Stemming** is the reduction of words to a common prefix, using simple rules that only work some of the time:

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
for word in ('say', 'says', 'said'):
    print(stemmer.stem(word))

In [None]:
for word in ('he', 'his', 'him'):
    print(stemmer.stem(word))
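
To see why rule-based stemming only works some of the time, consider this deliberately naive suffix stripper. It is a toy written for this example and far simpler than the Porter algorithm; the suffix list and the minimum stem length are arbitrary assumptions.

```python
# Hypothetical rule list, checked in this order
SUFFIXES = ('ing', 'ed', 'es', 's')


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_length:
            return stem
    return word


for word in ('says', 'walking', 'said', 'horse', 'horses'):
    print(word, naive_stem(word))
```

The regular forms `says` and `walking` are reduced as intended, the irregular `said` is left untouched, and `horse` and `horses` end up with different stems even though they share a lemma.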

**Lemmatization** is the mapping of word forms to their lemma, using either a dictionary of word forms, a grammar of how words are formed (a **morphology**), or both.
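
The combination of a dictionary and a grammar can be sketched in a few lines: look the word form up in a table of known (mostly irregular) forms first, and fall back to a rule otherwise. The dictionary entries, the single plural rule, and the function name are toy assumptions for illustration.

```python
# Hypothetical toy dictionary of irregular word forms
LEMMA_DICT = {'said': 'say', 'says': 'say', 'was': 'be', 'children': 'child'}


def toy_lemmatize(word_form, lemma_dict=LEMMA_DICT):
    """Dictionary lookup first, then a single fallback rule for regular -s forms."""
    lowered = word_form.lower()
    if lowered in lemma_dict:
        return lemma_dict[lowered]
    # stand-in for a morphology: strip a regular -s ending
    if len(word_form) > 3 and word_form.endswith('s') and not word_form.endswith('ss'):
        return word_form[:-1]
    return word_form


for form in ('Said', 'was', 'rabbits', 'glass', 'Alice'):
    print(form, toy_lemmatize(form))
```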

In [None]:
nlp = stanza.Pipeline('en', processors='tokenize,lemma,pos')

In [None]:
doc = nlp(text)

In [None]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print(word.text + '\t' + word.lemma)
    print()

Now we can count lemmas:

In [None]:
Counter(
    word.lemma for sentence in doc.sentences for word in sentence.words
    # lowercase before the lookup, since the stopword list is all lowercase
    if word.lemma.lower() not in stopwords and re.match(r'\w', word.lemma)).most_common(20)

The full analysis of how a word form is built from its lemma is known as **morphological analysis**.

In [None]:
for sentence in doc.sentences[:5]:
    for word in sentence.words:
        print('\t'.join([word.text, word.lemma, word.upos, word.feats if word.feats else '']))
    print()

A special case of lemmatization is **decompounding**: recognizing multiple lemmas in a single word.

In [None]:
nlp('roller-coaster')

In [None]:
nlp('wastebasket')

For English you might say that this is good enough... but _some languages_ allow forming compounds on the fly...

In [None]:
nlp_de = stanza.Pipeline('de', processors='tokenize,lemma,pos')

In [None]:
nlp_de('Kassenidentifikationsnummer')

There is no good generic solution and no standard tool. There are some unsupervised approaches like [SECOS](https://github.com/riedlma/SECOS) and [CharSplit](https://github.com/dtuggener/CharSplit), and there are also full-fledged morphological analyzers that might work, like [SMOR](https://www.cis.lmu.de/~schmid/tools/SMOR/) and its extensions [zmorge](https://pub.cl.uzh.ch/users/sennrich/zmorge/) and [SMORLemma](https://github.com/rsennrich/SMORLemma).
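
To illustrate what such tools have to do, here is a minimal sketch of dictionary-based compound splitting with linking elements. It is a toy written for this example and unrelated to how SECOS, CharSplit, or SMOR work. The three-word lexicon and the list of linking elements are assumptions; with a realistic vocabulary the main difficulty is choosing among many competing splits.

```python
# Hypothetical toy lexicon; a real system would need a large vocabulary
LEXICON = {'kasse', 'identifikation', 'nummer'}
# A few common German linking elements; '' means no linking element
LINKERS = ('', 's', 'n', 'en', 'es')


def split_compound(word, lexicon=LEXICON, linkers=LINKERS):
    """Return a list of lexicon entries covering word, or None if there is none."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # try the longest possible first part first
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        for linker in linkers:
            if not head.endswith(linker):
                continue
            stem = head[:len(head) - len(linker)]
            if stem in lexicon:
                rest = split_compound(tail, lexicon, linkers)
                if rest is not None:
                    return [stem] + rest
    return None


print(split_compound('Kassenidentifikationsnummer'))
```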

## Examples

### Text processing with regular expressions

Load a sample text

In [None]:
with open('data/alice.txt', encoding='utf-8') as f:
    text = f.read()
print(text[:1000])

In [None]:
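def clean_text(text):
    # remove the underscores that mark emphasis in the source file
    cleaned_text = re.sub('_', '', text)
    # join hard-wrapped lines into running text
    cleaned_text = re.sub('\n', ' ', cleaned_text)
    return cleaned_text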

In [None]:
text = clean_text(text)

In [None]:
print(text[:1000])

Let's split this into sentences, then words.

In [None]:
sens = sent_tokenize(text)

In [None]:
print('\n\n'.join(sens[:5]))

In [None]:
toks = [word_tokenize(sen) for sen in sens]

In [None]:
print('\n\n'.join('\n'.join(sen) for sen in toks[:5]))

Let's also write this to a file

In [None]:
with open('data/alice_tok.txt', 'w', encoding='utf-8') as f:
    f.write('\n\n'.join('\n'.join(sen) for sen in toks) + '\n')

Let's try to find all names using regexes

In [None]:
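def find_names(toks):
    """Yield runs of consecutive capitalized tokens as candidate names."""
    for sen in toks:
        curr_name = []
        # skip the first token: sentence-initial words are capitalized anyway
        for tok in sen[1:]:
            if re.match('[A-Z][a-z]+', tok):
                curr_name.append(tok)
            elif curr_name:
                yield ' '.join(curr_name)
                curr_name = []

        # flush a name that runs up to the end of the sentence
        if curr_name:
            yield ' '.join(curr_name)

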
def count_names(toks):
    name_counter = Counter()
    
    for name in find_names(toks):
        name_counter[name] += 1
    
    for name, count in name_counter.most_common():
        print(name, count)

In [None]:
count_names(toks)

We can filter our tokens for stopwords:

In [None]:
toks_without_stopwords = [[tok for tok in sen if tok.lower() not in stopwords] for sen in toks]

In [None]:
print('\n\n'.join('\n'.join(sen) for sen in toks_without_stopwords[:5]))

In [None]:
count_names(toks_without_stopwords)

Let's also write the stopwords into a file

In [None]:
with open('data/stopwords.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(stopwords)) + '\n')

Continue to [Text processing on the Linux command line](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/01_Text_processing/01b_Text_processing_Linux_command_line.ipynb)