## Lab 1. NLP Basics
### Text preprocessing
Text preprocessing is probably one of the least pleasant yet most important steps of a natural language processing (NLP) pipeline. This step determines how your NLP algorithms are going to see the data. If your preprocessing breaks, the whole model can break or, even worse, fail silently and give incorrect results.

Text preprocessing can be divided into three main parts:
- Tokenization
- Normalization
- Noise reduction

The parts are not necessarily applied in that particular order. Sometimes noise reduction should be performed before tokenization. In other cases, some steps can be repeated several times.

In the next steps, we are going to look at each part in more detail.

In [None]:
from string import punctuation
from collections import Counter
import re

import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag, pos_tag_sents
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet, gutenberg, inaugural
from wordcloud import WordCloud

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Run this cell to install all the necessary files for NLTK
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('gutenberg')
nltk.download('inaugural')

### Tokenization
__Tokenization__ is a general term for splitting the text into smaller parts. We can distinguish __word segmentation__ and __sentence segmentation__. For some tasks, word segmentation alone is enough; for others, you might want to have both sentences and words.

As the names suggest, _word segmentation_ is dividing the raw text sequence into words and _sentence segmentation_ is dividing the text into sentences.

Imagine that we need to parse the first paragraph from the Wikipedia article about Hawaii. We have the following raw text:

In [None]:
raw_text = "Hawaii is a state of the United States of America. " + \
           "It is the only state located in the Pacific Ocean and the only state composed entirely of islands."
print(raw_text)

The simplest and the most logical way to split the text into tokens would be to split it by the whitespace:

In [None]:
tokens = raw_text.split()
print(tokens)

But already here, we can see a problem with tokens like `'America.'` and `'islands.'`. In our case, the dot is a part of the token that we definitely don't want. One solution is to strip the punctuation from each token.

In [None]:
def tokenize(text):
    return [token.strip(punctuation) for token in text.split()]

print(tokenize(raw_text))
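
Stripping punctuation from the edges throws it away completely and leaves punctuation inside a token untouched. As a rough alternative, a regular expression can keep punctuation marks as separate tokens. The sketch below is only an illustration: the name `regex_tokenize` and the pattern are our own assumptions, not something NLTK provides.

```python
import re

# A word, optionally continued by internal hyphens or apostrophes,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def regex_tokenize(text):
    """Split text into word tokens and standalone punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Hawaii is a state of the United States of America."))
print(regex_tokenize("This full-time student isn't living here."))
```

With this pattern, the final dot becomes its own token, while `full-time` and `isn't` each stay in one piece. Whether that is the right decision depends on the task.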

Now let's say that we want to split the text into sentences and then get tokens for each sentence. The simplest way is to split the text on the dot first and then tokenize each sentence.

In [None]:
def segment_sents(text):
    sents = []    # We are going to store the tokenized sentences here
    for sent in text.split('.'):
        # First check if the sentence is not empty
        if sent: 
            # Then we do the same process for each token in the sentence
            sents.append([token.strip(punctuation) for token in sent.split()])
    return sents

print(segment_sents(raw_text))

For this example, it has worked fine so far. But this task holds many surprises for the unprepared. Let's look at some other examples that can cause trouble for our function.

In [None]:
difficult_sents = [
    "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.",
    '"What is all the fuss about?" asked Mr. Peters.',
    "This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i."
]

for sent in difficult_sents:
    print(segment_sents(sent))

Here, we can see that abbreviations like *Dr.*, *Col.*, *Mr.* were treated as sentence ends. Also, contractions like _isn't_ and _she's_ are in fact two words: _is not_ and _she is_. However, _Smith's_ can be either _Smith is_ or, as in our case, one word showing possession. Finally, we have to decide whether _full-time_ and _on-campus_ are one word or two.
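
To get a feeling for why this is hard, here is a minimal sketch of a sentence splitter that skips a few known abbreviations. The `ABBREVIATIONS` set and the `split_sentences` function are hypothetical and deliberately tiny; they only illustrate the idea.

```python
# Hypothetical, deliberately tiny abbreviation list; real tools use much larger ones.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "col.", "prof."}

def split_sentences(text):
    """Split after tokens ending in '.', '!' or '?', unless the token is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # Ignore closing quotes and brackets when looking for the final mark
        stripped = token.rstrip("\"')]")
        ends_sentence = stripped.endswith((".", "!", "?"))
        if ends_sentence and stripped.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog. He left."))
print(split_sentences('"What is all the fuss about?" asked Mr. Peters.'))
```

The first text is split correctly, but the second one is still cut after `about?"`, because the sketch cannot tell that the quotation continues. Every new rule tends to reveal another exception.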

Luckily, for English, we can use different libraries like __nltk__ or __spacy__ which tackle most of these problems.

In [None]:
print("NLTK tokenization:\n")
for sent in difficult_sents:
    print([word_tokenize(s) for s in sent_tokenize(sent)])

### Normalization

Normalization is another important step in text preprocessing: it removes a lot of variation from the input and makes it easier for the model to focus on the most important things. Two main steps in normalization are __stemming__ and __lemmatization__. 
_Stemming_ usually refers to stripping affixes (mostly endings) from a word. For example, `playing` and `played` are going to be reduced to `play` after going through the stemmer. It works rather well for English but can be troublesome for languages with less straightforward morphology. Also, irregular forms are not handled: `ran`, the past tense of `run`, is not changed by stemming and ends up being treated as a different word.
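
The idea of suffix stripping can be sketched in a few lines. This toy function is an assumption made for illustration and is far cruder than a real stemmer such as the Porter stemmer; the suffix list and the minimum stem length are arbitrary choices.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ['playing', 'played', 'plays', 'play', 'running', 'ran', 'runs', 'run']:
    print(f'{word}: {naive_stem(word)}')
```

The forms of `play` all collapse into `play`, but `running` becomes `runn` and `ran` stays as it is. Real stemmers add many more rules to repair cases like the doubled consonant, yet irregular forms remain out of reach.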

The NLTK library includes a stemming package as well.

In [None]:
words_to_stem = ['playing', 'played', 'plays', 'play', 'running', 'ran', 'runs', 'run']
stemmer = PorterStemmer()
print('Stemming with NLTK:\n')
for word in words_to_stem:
    print(f'{word}: {stemmer.stem(word)}')

To solve the problem of words that change their roots in different grammatical forms, we should use a more complicated method called _lemmatization_. Lemmatization usually uses more sophisticated rules to find the normal form of the word. Nowadays, however, many lemmatizers are trained using neural networks.

In [None]:
print('Lemmatization with NLTK:\n')
lemmatizer = WordNetLemmatizer()
for word in words_to_stem:
    print(f'{word}: {lemmatizer.lemmatize(word)}')

We can see immediately that NLTK doesn't give correct lemmas for our words. This is because the NLTK lemmatizer expects a part-of-speech (POS) tag for each word, i.e. the information whether the word is a noun, a verb, an adjective etc.; without one, it treats every word as a noun. We can, of course, specify the POS tag for each word, but if our corpus is big, it will be tiresome to determine the POS tags by hand. Instead, we can use a pretrained POS tagger.
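
Why does the part of speech matter so much? The same word form can have different lemmas depending on its role in the sentence. The toy lookup table below illustrates this; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and this is not how NLTK implements its lemmatizer.

```python
# Hypothetical toy lexicon mapping (word form, POS) to a lemma.
TOY_LEMMAS = {
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
    ("ran", "v"): "run",
    ("leaves", "v"): "leave",
    ("leaves", "n"): "leaf",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma for a (word, POS) pair; return the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

for word, pos in [("saw", "n"), ("saw", "v"), ("leaves", "n"), ("leaves", "v"), ("ran", "n")]:
    print(f"{word} ({pos}): {toy_lemmatize(word, pos)}")
```

Note that `ran` looked up as a noun comes back unchanged, which mirrors what happens when a verb is lemmatized with the default noun tag.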

In [None]:
print('Lemmatization with NLTK with correct pos tags:\n')
for word in words_to_stem:
    print(f'{word}: {lemmatizer.lemmatize(word, pos=wordnet.VERB)}')

We can use the [`pos_tag()`](https://www.nltk.org/api/nltk.tag.html?highlight=pos_tag#nltk.tag.pos_tag) function from NLTK to automatically assign POS tags to each word in a sentence. This function uses NLTK's currently recommended part-of-speech tagger, which is now the [Perceptron Tagger](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python), to tag the given list of tokens. It offers decent performance at high speed.

In [None]:
pos_tag(word_tokenize(difficult_sents[0]))

However, the default NLTK tagger uses [the Penn Treebank part-of-speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) while the lemmatizer takes WordNet tags. The WordNet tags are the following: `'n'` for nouns, `'v'` for verbs, `'a'` for adjectives, `'r'` for adverbs. To make them work together, we will need to write a small function to convert one tagset to the other. 

It can be done either with or without regular expressions. Try both ways and compare!

In [None]:
def treebank_to_wordnet(tag):
    """Converts Penn Treebank part-of-speech tag into WordNet pos tag.
    
    Args:
        tag (str): Penn Treebank part-of-speech tag.
        
    Returns:
        str: WordNet part-of-speech tag. If the input tag is unknown, return the noun pos tag.
        
    """
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE

In [None]:
assert treebank_to_wordnet('NN') == wordnet.NOUN
assert treebank_to_wordnet('NNPS') == wordnet.NOUN

assert treebank_to_wordnet('VB') == wordnet.VERB
assert treebank_to_wordnet('VBG') == wordnet.VERB

assert treebank_to_wordnet('JJ') == wordnet.ADJ
assert treebank_to_wordnet('JJS') == wordnet.ADJ

assert treebank_to_wordnet('RB') == wordnet.ADV
assert treebank_to_wordnet('RBR') == wordnet.ADV

Now we can process our sentences from before and see how they look lemmatized. Here we will also use the `pos_tag_sents()` function, which takes a list of lists of tokens, i.e. it can work with multiple sentences.

In [None]:
def nltk_lemmatize(sents):
    """Takes in the tokenized sentences and converts each token into the lemma.
    
    Args:
        sents (list[list[str]]): List of tokenized sentences.
        
    Returns:
        list[list[str]]: List of lemmatized sentences.
        
    """
    nltk_lemmas = []
    
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    
    return nltk_lemmas

In [None]:
sent_raw = [word_tokenize(s) for s in sent_tokenize(difficult_sents[0])]
sent_lem = [['Dr.', 'Ford', 'do', 'not', 'ask', 'Col.', 'Mustard', 'the', 'name', 'of', 'Mr.', 'Smith', "'s", 'dog', '.']]
assert nltk_lemmatize(sent_raw)  == sent_lem

In [None]:
for sent in difficult_sents:
    nltk_sents = [word_tokenize(s) for s in sent_tokenize(sent)]
    print(f'Original sentence:\n{nltk_sents}')
    print(f'Tagged sentence:\n{pos_tag_sents(nltk_sents)}')    
    print(f'Lemmatized sentence:\n{nltk_lemmatize(nltk_sents)}')
    print('\n------\n')

We can also compare it to stemming.

In [None]:
def nltk_stem(sents):
    """Takes in the tokenized sentences and converts each token into the stem.
    
    Args:
        sents (list[list[str]]): List of tokenized sentences.
        
    Returns:
        list[list[str]]: List of stemmed sentences.
    
    """    
    nltk_stems = []
    
    # YOUR CODE STARTS HERE
    # YOUR CODE ENDS HERE
    
    return nltk_stems

In [None]:
sent_raw = [word_tokenize(s) for s in sent_tokenize(difficult_sents[0])]
sent_stem = [['dr.', 'ford', 'did', 'not', 'ask', 'col.', 'mustard', 'the', 'name', 'of', 'mr.', 'smith', "'s", 'dog', '.']]
assert nltk_stem(sent_raw) == sent_stem

In [None]:
print("NLTK stemming:\n")
for sent in difficult_sents:
    nltk_sents = [word_tokenize(s) for s in sent_tokenize(sent)]
    print(f'Original sentence:\n{nltk_sents}')
    print(f'Stemmed sentence:\n{nltk_stem(nltk_sents)}')
    print('\n------\n')

We can see that the NLTK stemmer also __converts all the words to lowercase__, which is another part of normalization. We can also see some stemming artifacts like `thi`, `full-tim`, `on-campu`.

We can now try to work with a slightly bigger text. NLTK has different [corpora](https://www.nltk.org/nltk_data/) available which are ready for you to use. You can see a bit more on how to use them here: https://www.nltk.org/howto/corpus.html.

For example, we can use the corpus of inaugural speeches of US presidents. 

In [None]:
inaugural.fileids()

In [None]:
text_fileid = '1789-Washington.txt'

In [None]:
print(inaugural.raw(text_fileid))

We can get a simple overview of the text by counting the number of words and the vocabulary size.

In [None]:
words = inaugural.words(text_fileid)
print(len(words))

In [None]:
vocab = set(words)
print(len(vocab))

We probably do not really care about punctuation or letter case here, so we can filter out the punctuation and lowercase the words.

In [None]:
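# `word not in punctuation` would be a substring test on a string, so tokens
# like '--' would slip through; check the characters against a set instead.
punct_chars = set(punctuation)
# Drop tokens made up only of punctuation characters and lowercase the rest
words_lower = [word.lower() for word in words if not all(ch in punct_chars for ch in word)]
vocab_lower = set(words_lower)
print(len(vocab_lower))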

Now, we can see that the vocabulary size is almost three times smaller than the number of words in the text. 

Interestingly, the relation between the number of words and the vocabulary size is described by [Heaps' law](https://en.wikipedia.org/wiki/Heaps%27_law). In general, the bigger the text, the slower the vocabulary grows. Try a bigger text to check it!
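
Heaps' law is commonly written as $V(n) \approx K n^{\beta}$ with $0 < \beta < 1$, where $n$ is the number of tokens read so far, $V(n)$ is the vocabulary size, and $K$ and $\beta$ depend on the corpus. A simple way to look at it is to record the vocabulary size while reading the text. The helper below is a sketch; `vocabulary_growth` and the toy sentence are our own assumptions.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens_seen, vocabulary_size) pairs measured every `step` tokens."""
    seen = set()
    growth = []
    for i, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        # Record a point every `step` tokens and once more at the very end
        if i % step == 0 or i == len(tokens):
            growth.append((i, len(seen)))
    return growth

toy_tokens = "the cat sat on the mat and the dog sat on the cat".split()
print(vocabulary_growth(toy_tokens, step=4))
```

Even in this tiny example, the vocabulary grows by four words over the first four tokens and by none over the last one. The same function can be applied to any sequence of tokens that supports `len()`, for example the words of a longer corpus.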

One more way to analyze the text is to make a frequency list.

In [None]:
freqs = Counter(words_lower)
freqs.most_common(25)

Not very informative, right? That's because very common words like articles (a, the) and prepositions (to, in, by) overwhelm the other words. These are usually called __stop words__ and can be filtered out as part of the __text normalization__ process.

P.S. Look how the word frequencies roughly follow [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law).
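
Zipf's law says that the frequency of a word is roughly inversely proportional to its rank in the frequency list, so the product of rank and frequency should stay roughly constant. A small sketch to check this on any token list follows; `rank_frequency` is a hypothetical helper, and the toy data is constructed to follow the law exactly.

```python
from collections import Counter

def rank_frequency(tokens, top_n=5):
    """Return (rank, word, count, rank * count) rows for the most frequent words."""
    counts = Counter(tokens)
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(counts.most_common(top_n), start=1)
    ]

# Artificial data: 'a' occurs 6 times, 'b' 3 times, 'c' twice
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(toy_tokens, top_n=3):
    print(row)
```

On real text the products will only be of a similar order of magnitude, not identical.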

Other parts of normalization include:
- Removing the punctuation
- Removing whitespace
- Removing numbers or converting them into text
- Converting contractions to their full forms (I've -> I have)
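
The steps from this list can be combined into one small function. The sketch below is an assumption for illustration only: the `CONTRACTIONS` table is hypothetical and very incomplete, and it uses plain substring replacement, which a real pipeline would want to replace with something more careful.

```python
import re
from string import punctuation

# Hypothetical, very incomplete contraction table used only for illustration.
CONTRACTIONS = {"i've": "i have", "isn't": "is not", "don't": "do not"}

def normalize(text):
    """Lowercase, expand known contractions, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    for contraction, full_form in CONTRACTIONS.items():
        text = text.replace(contraction, full_form)
    text = re.sub(r"\d+", " ", text)                            # remove numbers
    text = text.translate(str.maketrans("", "", punctuation))   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                    # collapse whitespace

print(normalize("I've   read 2 books,  but she isn't  done!"))
```

Contractions have to be expanded before the punctuation is removed, otherwise the apostrophes are already gone. Ambiguous forms like _she's_ are left out of the table on purpose.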

Finally, we can look a bit more into __stop words__. Stop words are words that are very common in a language but usually don't carry much useful information about the content of the text. For English, they can be _is_, _are_, _not_, _she_, _he_, _it_ etc. This usually also includes prepositions and particles. However, the stop list can be modified to fit a specific task.

Both NLTK and spaCy have built-in stop word lists; however, you are free to find one anywhere else on the internet or even compose your own.
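
Composing and adapting your own list is straightforward, since a stop list is just a set of words. In the sketch below, `BASE_STOP_WORDS` is a hypothetical hand-made list and `remove_stop_words` is our own helper; the sentiment analysis scenario is only an example of a task where the list should be adjusted.

```python
# Hypothetical hand-made stop list; real lists are much longer.
BASE_STOP_WORDS = {"a", "an", "the", "is", "are", "not", "it", "to", "of"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "The film is not a success".split()
print(remove_stop_words(tokens, BASE_STOP_WORDS))

# For a task such as sentiment analysis, negation matters, so we keep "not".
sentiment_stop_words = BASE_STOP_WORDS - {"not"}
print(remove_stop_words(tokens, sentiment_stop_words))
```

With the base list, the sentence is reduced to `film` and `success`, which reverses its meaning. Keeping `not` preserves the negation.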

In [None]:
print('Stop words for English from NLTK:\n')
# NLTK stop word file IDs are lowercase; 'English' fails on case-sensitive file systems
nltk_stopwords = set(stopwords.words('english'))
print(sorted(nltk_stopwords))

In [None]:
no_stop_words = [word for word in words_lower if word not in nltk_stopwords.union(punctuation)]
freqs = Counter(no_stop_words)
freqs.most_common(25)

A bit better already! Now, if we show this word list to someone who doesn't know which text it was taken from, they would probably guess that it has something to do with politics. 

We can also plot it :)

In [None]:
fd = nltk.FreqDist(no_stop_words)
plt.figure(figsize=(10, 5))
fd.plot(30);

We can also see the trends in the US presidential speeches. For example, we can plot how the words "citizen" and "america" were used over time.

In [None]:
plt.figure(figsize=(15, 5))
cfd = nltk.ConditionalFreqDist(
           (target, fileid[:4])
           for fileid in inaugural.fileids()
           for w in inaugural.words(fileid)
           for target in ['america', 'citizen']
           if w.lower().startswith(target))
cfd.plot();
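
The generator passed to `ConditionalFreqDist` yields (condition, sample) pairs, here (target, year), and the object counts the samples separately for each condition. The same counting can be sketched in plain Python. Everything in this sketch (`count_targets_by_year`, the file names, the word lists) is hypothetical and stands in for the real corpus.

```python
from collections import Counter, defaultdict

def count_targets_by_year(speeches, targets):
    """Count, per target prefix, how many words in each year's speech start with it."""
    counts = defaultdict(Counter)
    for fileid, words in speeches.items():
        year = fileid[:4]  # file IDs are assumed to start with the year
        for word in words:
            for target in targets:
                if word.lower().startswith(target):
                    counts[target][year] += 1
    return counts

# Hypothetical miniature "corpus" standing in for the inaugural speeches.
toy_speeches = {
    "1789-Example.txt": ["Fellow", "Citizens", "of", "America"],
    "1793-Example.txt": ["American", "citizens", "and", "citizenship"],
}
counts = count_targets_by_year(toy_speeches, ["america", "citizen"])
print(dict(counts["citizen"]))
print(dict(counts["america"]))
```

Because the match is done with `startswith`, words like _American_ and _citizenship_ are counted as well, so the plot shows word families rather than exact words.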

Last but not least, you can create a word cloud to represent any text. You have probably seen one before, since they appear rather frequently in NLP blogs. Luckily, it is very easy to create one using the [wordcloud](https://pypi.org/project/wordcloud/) package!

In [None]:
wordcloud = WordCloud().generate_from_frequencies(freqs)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud)
plt.axis("off")
plt.show();

### Noise removal

In this lab, we are not going to go into the details of this step. It includes:
- Removal of headers, footers and other parts of the articles
- Removal of HTML, XML etc. markup
- Extracting the data from various formats, like JSON, CoNLL etc.

Most of these steps can be done with regular expressions. There are also good libraries out there to help you. For example, [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a very powerful tool for HTML and XML parsing.
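
As a minimal sketch of the regular expression approach, the function below removes tags, decodes HTML entities and cleans up the whitespace using only the standard library. The name `strip_markup` and the sample string are assumptions made for illustration.

```python
import html
import re

def strip_markup(raw):
    """Remove tags with a simple regex, decode HTML entities and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)   # replace every <...> tag with a space
    decoded = html.unescape(no_tags)         # e.g. &amp; becomes &
    return re.sub(r"\s+", " ", decoded).strip()

raw_html = "<html><body><h1>Hawaii</h1><p>Hawaii is a <b>state</b> &amp; an archipelago.</p></body></html>"
print(strip_markup(raw_html))
```

Such a pattern is fragile: it knows nothing about scripts, comments or broken markup, and it keeps the heading text glued to the paragraph. For anything beyond quick experiments, a real parser is usually the safer choice.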