## Preprocessing with RegEx and NLTK

Here we compare two ways of preprocessing text: regular expressions and NLTK.  

**Regular expressions (RegEx)** are available in Python through the standard library module `re`. You define a pattern that describes the strings you want to match or find. A nice RegEx playground can be found here: https://regexr.com

The **Natural Language Toolkit (NLTK)** ([nltk.org](https://www.nltk.org)) is a well-known, tried and tested Python package for working with human language data.

In [None]:
# import regex
import re

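# import nltk and its tokenizers
# (the tokenizers need the Punkt models: nltk.download('punkt'),
#  or 'punkt_tab' on newer NLTK releases)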
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
# Feel free to change the text or add your own
text = '''The representatives of many countries met at the United Nations
conference in New York. UN general secretary António Guterres held a speech
to open the event; he was focusing on the importance of human rights.'''

Look how the text is represented in the output (e.g. the newlines `\n`):

In [None]:
text

## Removing the newlines (\n) with re.sub

`re.sub(pattern, repl, string)` from the `re` module replaces every match of `pattern` in `string` with the replacement `repl` and returns the result as a new string. The original string is not modified.
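As a minimal sketch of how `re.sub` behaves, here is a worked example on a made-up string (`sample` and `cleaned` are illustrative names, not part of the exercise):

```python
import re

# Hypothetical sample string, not the notebook's `text`
sample = "first line\nsecond line\nthird line"

# Replace every newline character with a single space.
# re.sub returns a new string; `sample` itself stays unchanged.
cleaned = re.sub(r"\n", " ", sample)

print(repr(sample))
print(repr(cleaned))
```

Because the result is a new string, you have to assign it to a name if you want to keep working with it.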

In [None]:
pattern = ___
replacement = ___
# re.sub returns a new string, so assign it back to keep the result
text = re.sub(pattern, replacement, text)

In [None]:
text

## Splitting with re.split()

Let's try to sentence-tokenize the text manually with `re.split(pattern, string)`. If the pattern contains a capturing group, the delimiters are returned as separate items in the result list; without a group they are dropped.
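To make the effect of the capturing group concrete, here is a small sketch on a made-up string. The pattern is only one naive possibility and the variable names are illustrative:

```python
import re

# Hypothetical sample string, not the notebook's `text`
sample = "It rained. Did it stop? Yes! Good."

# No capturing group: the delimiters are dropped
without_group = re.split(r"[.?!]\s*", sample)

# Capturing group around the punctuation: delimiters are kept as items
with_group = re.split(r"([.?!])\s*", sample)

print(without_group)
print(with_group)
```

Both results end with an empty string because the text ends with a delimiter. A pattern this simple also splits after abbreviations such as "Dr.", which is one reason to compare it with a dedicated sentence tokenizer.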

In [None]:
pattern = ___
sents_re = re.split(pattern, text)

In [None]:
# iterate over the sentences: does the split make sense?

for s in sents_re:
    print(s)
    print('*'*30)

## Splitting with the NLTK sent_tokenize

Now let's split the sentences automatically with the `sent_tokenize` function from NLTK.

In [None]:
# use the imported function to tokenize the sentences
sents_nltk = ___

In [None]:
for s in sents_nltk:
    print(s)
    print('*'*30)

## Tokenizing

Now that we have split the text into sentences and stored them in `sents_nltk`, we'll tokenize the first sentence, once with `re.findall(pattern, string)` and once with `nltk`.
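`re.findall` returns all non-overlapping matches of the pattern as a list. The following sketch uses one possible token pattern (runs of word characters, or single punctuation marks) on a made-up sentence. The pattern and names are assumptions for illustration, not the only reasonable choice:

```python
import re

# Hypothetical sample sentence, not taken from `sents_nltk`
sample = "Hello, world! It's late."

# \w+      : a run of letters, digits or underscores
# [^\w\s]  : any single character that is neither word character nor whitespace
demo_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(demo_tokens)
```

Note how this pattern breaks the contraction "It's" into three pieces. A dedicated tokenizer usually treats such cases differently, which is worth checking when you compare the two outputs.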

In [None]:
pattern = ___
tokens_re = re.findall(pattern, sents_nltk[0])

In [None]:
# do you like the result?
for t in tokens_re:
  print(t)

In [None]:
# tokenize only the first sentence
tokens_nltk = ___

In [None]:
for t in tokens_nltk:
  print(t)

## Lemmatization

It's going to be hard to lemmatize the text with the `re` package - this would be a huge pattern. 😄
We're going to use one of NLTK's lemmatizers for the English language, the `WordNetLemmatizer`.  

For German there is, for example, **GermaNLTK** from the Hochschule der Medien Stuttgart: https://docs.google.com/document/d/1rdn0hOnJNcOBWEZgipdDfSyjJdnv_sinuAUSDSpiQns  

It integrates the [GermaNet of the Universität Tübingen](https://uni-tuebingen.de/fakultaeten/philosophische-fakultaet/fachbereiche/neuphilologie/seminar-fuer-sprachwissenschaft/arbeitsbereiche/allg-sprachwissenschaft-computerlinguistik/ressourcen/lexica/germanet-1/) into NLTK.

In [None]:
# the lemmatizer needs the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

In [None]:
# compare the lemmatization without and with casing
for t in tokens_nltk:
  t_low = ___
  t_cased = ___

  print(t +':\n')
  print(t_low)
  print(t_cased)
  print('-'*30)

## Stopword removal

NLTK ships stopword lists for a range of languages (29 at the time of writing); `stopwords.fileids()` shows which ones are available.  
We'll use it to remove the stopwords from our text.
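The filtering itself is a plain membership test. Below is a minimal sketch with a tiny hand-made stopword set and token list (both hypothetical, chosen only so the example runs without downloading anything):

```python
# Hypothetical mini stopword list, for illustration only
demo_stop_words = {"the", "of", "at", "in"}

# Hypothetical token list
demo_tokens = ["The", "representatives", "of", "many", "countries", "met"]

demo_kept = []
demo_removed = []

for t in demo_tokens:
    # lowercase before the lookup, because the stopword list is lowercased
    if t.lower() in demo_stop_words:
        demo_removed.append(t)
    else:
        demo_kept.append(t)

print(demo_removed)
print(demo_kept)
```

Using a `set` keeps each lookup fast, and lowercasing only for the comparison preserves the original casing of the tokens you keep.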

In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords

stopwords.fileids()

In [None]:
print(set(stopwords.words('german')))

In [None]:
# careful: the stopword list is lowercased!

stop_words = set(stopwords.words('english'))
kept = []
removed = []

# check if the token t is a stopword
# if yes, append it to removed - else to kept
for t in tokens_nltk:
  ___

In [None]:
print(removed)
print('*'*30)
print(kept)