## Lab 1. NLP Basics
### Text preprocessing
Text preprocessing is probably one of the least pleasant yet most important steps of a natural language processing (NLP) pipeline. This step determines how your NLP algorithms are going to see the data. If your preprocessing breaks, the whole model can break or, even worse, silently produce incorrect results.

Text preprocessing can be divided into three main parts:
- Tokenization
- Normalization
- Noise reduction

The parts are not necessarily applied in that particular order. Sometimes noise reduction has to be performed before tokenization. In other cases, some steps are repeated several times.

In the following sections, we look at each part in more detail.

In [None]:
from string import punctuation

import nltk
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

import spacy
nlp = spacy.load("en_core_web_sm")
# If you don't have the model installed run "python -m spacy download en_core_web_sm"
# in the console and restart the python kernel

In [None]:
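# Run this cell to install all the necessary files for NLTK
# Note: newer NLTK releases (3.9+) may additionally require 'punkt_tab'
# and 'averaged_perceptron_tagger_eng'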
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

### Tokenization
__Tokenization__ is a general term for splitting text into smaller parts. Its two most common forms are __word segmentation__ and __sentence segmentation__. For some tasks, word segmentation alone is enough; for others, you might want both sentences and words.

As the names suggest, _word segmentation_ is dividing the raw text sequence into words and _sentence segmentation_ is dividing the text into sentences.

Imagine that we need to parse the first paragraph from the Wikipedia article about Hawaii. We have the following raw text:

In [None]:
raw_text = "Hawaii is a state of the United States of America. " + \
           "It is the only state located in the Pacific Ocean and the only state composed entirely of islands."
print(raw_text)

The simplest and most obvious way to split the text into tokens is to split it on whitespace, for example with `raw_text.split()`.
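
A minimal sketch of this whitespace split, using only Python's built-in `str.split()`. The text is repeated here so that the snippet runs on its own; the variable name `whitespace_tokens` is just an illustration.

```python
# Whitespace tokenization: str.split() without arguments splits on any run of whitespace.
raw_text = (
    "Hawaii is a state of the United States of America. "
    "It is the only state located in the Pacific Ocean and the only state composed entirely of islands."
)

whitespace_tokens = raw_text.split()
print(whitespace_tokens)

# Show the tokens that are not purely alphabetic, i.e. still carry punctuation
print([token for token in whitespace_tokens if not token.isalpha()])
```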

But already here we can see a problem with tokens like `'America.'` and `'islands.'`. In our case, the trailing period is a part of the token that we definitely don't want. One solution is to strip the punctuation from each token.

In [None]:
def tokenize(text):
    ...

print(tokenize(raw_text))
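
One possible way to do this, given as a sketch rather than as the reference solution for the cell above: split on whitespace and strip leading and trailing punctuation with `str.strip(punctuation)`. The helper name `simple_tokenize` is hypothetical.

```python
from string import punctuation


def simple_tokenize(text):
    """Split on whitespace, then strip punctuation from both ends of each token."""
    stripped = (token.strip(punctuation) for token in text.split())
    # Drop tokens that consisted only of punctuation and are now empty
    return [token for token in stripped if token]


print(simple_tokenize("Hawaii is a state of the United States of America."))
```

Because `str.strip` only touches the ends of a token, inner punctuation such as the apostrophe in `isn't` or the hyphen in `full-time` is left in place.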

Now let's say that we want to split the text into sentences and then get the tokens for each sentence. The simplest way is to split the text on periods first and then tokenize each sentence.

In [None]:
def segment_sents(text):
    ...

print(segment_sents(raw_text))
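
A naive version of this idea could look as follows. This is only a sketch under the assumption that every period ends a sentence; the helper name `naive_segment_sents` and the short example text are made up for illustration.

```python
from string import punctuation


def naive_segment_sents(text):
    """Split the text on periods, then tokenize each sentence on whitespace."""
    sentences = []
    for chunk in text.split("."):
        tokens = [token.strip(punctuation) for token in chunk.split()]
        tokens = [token for token in tokens if token]
        if tokens:  # skip the empty chunk that follows the final period
            sentences.append(tokens)
    return sentences


text = "Hawaii is a state. It is composed entirely of islands."
print(naive_segment_sents(text))
```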

For this example, it works fine. But this task holds many surprises for the unprepared. Let's look at some other examples that can cause trouble for our function.

In [None]:
difficult_sents = [
    "Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog.",
    '"What is all the fuss about?" asked Mr. Peters.',
    "This full-time student isn't living in on-campus housing, and she's not wanting to visit Hawai'i."
]

for sent in difficult_sents:
    print(segment_sents(sent))

Here, we can see that abbreviations like *Dr.*, *Col.*, and *Mr.* were treated as sentence ends. Also, contractions like _isn't_ and _she's_ are in fact two words: _is not_ and _she is_. However, _Smith's_ can be either _Smith is_ or, as in our case, one word showing possession. Finally, we have to decide whether _full-time_ and _on-campus_ are one word or two.
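
One way to work around the abbreviation problem is to keep a list of known abbreviations and never end a sentence on them. The sketch below assumes that a small hand-written list is acceptable; both `ABBREVIATIONS` and the helper `segment_with_abbreviations` are hypothetical and far from complete.

```python
# Hand-written, incomplete list of abbreviations that should not end a sentence
ABBREVIATIONS = {"dr.", "col.", "mr."}


def segment_with_abbreviations(text):
    """Split into sentences at '.', '?' or '!', unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "?", "!"))
        if ends_sentence and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences


print(segment_with_abbreviations("Dr. Ford did not ask Col. Mustard the name of Mr. Smith's dog."))
print(segment_with_abbreviations("Hawaii is a state. It is composed of islands."))
```

This still fails when an abbreviation really does end a sentence, and it does nothing about contractions or hyphenated words, which is why dedicated libraries are usually the better choice.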

Luckily, for English, we can use libraries like __NLTK__ or __spaCy__, which tackle most of these problems. Let's see how they handle our sentences.

In [None]:
print("NLTK tokenization:\n")
...

In [None]:
print("Spacy tokenization:\n")
...

As we can see, spaCy does somewhat better on this task. However, this approach works this well only for English and, probably, most other European languages. For languages where words are not separated by spaces in writing, such as Chinese or Thai, or for long German compound words, we need to choose another approach.

### Normalization

Normalization is another important step in text preprocessing: it collapses surface variation in the input (for example, different inflected forms of the same word), which makes it easier for the model to focus on the most important things. Two main steps in normalization are __stemming__ and __lemmatization__.

_Stemming_ usually refers to removing affixes, mostly endings, from a word. For example, `playing` and `played` are both reduced to `play` after going through the stemmer. It works rather well for English, but it can be troublesome for languages with richer morphology. Irregular forms are also a problem: `ran`, the past tense of `run`, is left unchanged by stemming and ends up being treated as a different word.

The NLTK library includes a stemming package.

In [None]:
words_to_stem = ['playing', 'played', 'plays', 'play', 'running', 'ran', 'runs', 'run']
...
print('Stemming with NLTK:\n')
...
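
To make the idea of suffix stripping concrete, here is a deliberately simplified stemmer. It is not the Porter algorithm that NLTK implements; the suffix list, the undoubling rule, and the helper name `toy_stem` are assumptions made for illustration only.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order
VOWELS = "aeiou"


def toy_stem(word):
    """Strip one common English suffix; a toy rule set, not the Porter algorithm."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling ("runn" -> "run"), but keep "ll", "ss", "zz"
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS + "lsz":
                stem = stem[:-1]
            return stem
    return word


words_to_stem = ['playing', 'played', 'plays', 'play', 'running', 'ran', 'runs', 'run']
for word in words_to_stem:
    print(f"{word} -> {toy_stem(word)}")
```

The regular forms collapse to `play` and `run`, while the irregular form `ran` is left untouched, because no suffix rule applies to it.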

To handle words whose root changes across grammatical forms, we need a more sophisticated method called _lemmatization_. Lemmatization traditionally relies on more elaborate rules and dictionaries to find the normal form (lemma) of a word. Many modern lemmatizers, however, are trained models, often based on neural networks.

Both NLTK and spaCy have a lemmatization module for English.

In [None]:
print('Lemmatization with NLTK:\n')
...

We can see immediately that NLTK doesn't give correct lemmas for our words. This is because the NLTK lemmatizer expects a part-of-speech (POS) tag for each word, i.e. the information whether the word is a noun, a verb, an adjective, etc.; without a tag, it treats every word as a noun. We can, of course, specify the POS tag for each word, but if our corpus is big, determining the POS tags by hand is tedious. Instead, we can use a pretrained POS tagger. We're going to look at POS tagging later.

In [None]:
print('Lemmatization with NLTK with correct pos tags:\n')
...
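
To illustrate why the POS tag matters, here is a toy lookup-based lemmatizer. The table `LEMMA_TABLE` and the helper `toy_lemmatize` are hypothetical; the only thing borrowed from NLTK is the convention that the `pos` argument defaults to noun (`"n"`), as it does in `WordNetLemmatizer.lemmatize`.

```python
# Tiny hand-written table: (word form, POS tag) -> lemma
LEMMA_TABLE = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for a (word, POS) pair; return the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("ran"))               # looked up as a noun: no entry, word is returned as-is
print(toy_lemmatize("ran", pos="v"))      # looked up as a verb
print(toy_lemmatize("meeting", pos="n"))  # the same form can have different lemmas
print(toy_lemmatize("meeting", pos="v"))
```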

Conveniently for us, spaCy does POS tagging and the other preprocessing needed for lemmatization, so we can get all the lemmas with only one command.

In [None]:
print('Lemmatization with Spacy:\n')
...

We can also see how our sentences from the previous exercise look after stemming and lemmatization: 

In [None]:
print("NLTK stemming:\n")
for sent in difficult_sents:
    nltk_sents = ...
    print(f'Original sentence:\n{nltk_sents}')
    nltk_stems = []
    for sent in nltk_sents:
        ...
    print(f'Stemmed sentence:\n{nltk_stems}')
    print('\n------\n')

We can see that the NLTK stemmer also __converts all the words to lowercase__, which is another part of normalization. We can also see some stemming artifacts like `thi`, `full-tim`, `on-campu`.

Let's now see the lemmatized sentences from spaCy:

In [None]:
print("Spacy lemmatization:\n")
for sent in difficult_sents:
    doc = ...
    print(f'Original sentence:\n{...}')
    print(f'Lemmatized sentence:\n{...}')
    print('\n------\n')

With lemmatization, the results look better: `did` was transformed to `do`, and `is` and `'s` to `be`. Another good thing is that in the first sentence the `'s` in `Smith's dog` stayed as `'s`, which is important because in this case it is not a contraction of the verb _is_.

Other parts of normalization include:
- Removing the punctuation
- Removing whitespace
- Removing numbers or converting them into text
- Removing stop words
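
The first three items can be sketched with the standard library alone. The helper name `basic_normalize` and the example string are made up for illustration, and the order of the steps is one reasonable choice rather than a fixed rule.

```python
import re
from string import punctuation


def basic_normalize(text):
    """Remove digits and punctuation, then collapse whitespace."""
    text = re.sub(r"\d+", " ", text)                           # remove runs of digits
    text = text.translate(str.maketrans("", "", punctuation))  # delete punctuation characters
    text = re.sub(r"\s+", " ", text).strip()                   # collapse repeated whitespace
    return text


print(basic_normalize("  Order  #123:   two  coffees, please!  "))
```

Note that deleting punctuation this way also glues hyphenated words together (`full-time` becomes `fulltime`), so in practice this step is often applied after tokenization rather than on the raw string.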

Finally, let's look a bit more closely at __stop words__. Stop words are words that are very common in a language but usually don't carry much useful information about the content of the text. For English, they include *is*, *are*, *not*, *she*, *he*, *it*, etc. Stop lists usually also include prepositions and other function words. However, the stop list can be modified to fit a specific task.

Both NLTK and spaCy have built-in stop word lists. You are also free to find one elsewhere on the internet or even to compose your own.

In [None]:
print('Stop words for English from NLTK:\n')
nltk_stopwords = ...
print(nltk_stopwords)

In [None]:
print('Stop words for English from Spacy:\n')
...
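
As a sketch of the "compose your own list" option, the snippet below filters tokens against a tiny hand-picked stop list. `MY_STOPWORDS` and `remove_stopwords` are hypothetical names, and the list is far too short for real use.

```python
# A tiny, hand-picked stop list for illustration only
MY_STOPWORDS = frozenset({"is", "are", "not", "she", "he", "it", "the", "a", "of", "in", "and"})


def remove_stopwords(tokens, stopwords=MY_STOPWORDS):
    """Keep only the tokens whose lowercase form is not in the stop list."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "It is the only state located in the Pacific Ocean".split()
print(remove_stopwords(tokens))
```

The list above contains *not*: dropping it can reverse the meaning of a sentence, which is one reason to adapt the stop list to the task, for example for sentiment analysis.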

Finally, we can see how our sentences look with the stop words removed:

In [None]:
print("NLTK stemming and stop words:\n")
for sent in difficult_sents:
    nltk_sents = ...
    print(f'Original sentence:\n{nltk_sents}')
    nltk_stems = []
    nltk_no_stop = []
    for sent in nltk_sents:
        ...
    print(f'Stemmed sentence:\n{nltk_stems}')
    print(f'Stemmed sentence without stop words:\n{nltk_no_stop}')
    print('\n------\n')

In [None]:
print("Spacy lemmatization and stop words:\n")
for sent in difficult_sents:
    doc = ...
    print(f'Original sentence:\n{...}')
    print(f'Lemmatized sentence:\n{...}')
    print(f'Lemmatized sentence without stop words:\n{...}')
    print('\n------\n')

To demonstrate how this works on a real example, we can try to detect sentiment with the [Microsoft Azure Text Analytics service](https://azure.microsoft.com/en-in/services/cognitive-services/text-analytics/). Let's see whether our example is classified as negative or positive:

In [None]:
negative_review = "The service in this restaurant is a nightmare."
print(negative_review)

![Not negative](review-example-01.png)

In [None]:
preprocessed_negative_review = ' '.join([token.lemma_ for token in nlp(negative_review) if not token.is_stop])
print(preprocessed_negative_review)

![Negative](review-example-02.png)

As we can see, the preprocessed text was correctly identified as negative, while the raw text was classified as neutral.

### Noise reduction

In this lab, we are not going to go into the details of this step. It includes:
- Removal of headers, footers and other non-content parts of the articles
- Removal of markup such as HTML and XML
- Extracting the data from various formats, like JSON, CoNLL, etc.

Most of these steps can be done with regular expressions. There are also good libraries out there to help you. For example, [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a very powerful tool for parsing HTML and XML.
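
As a rough sketch of the regular-expression route, the snippet below strips tags from a small HTML string and pulls a text field out of a JSON record, using only the standard library. The helper name `strip_markup`, the sample strings, and the field names are made up for illustration.

```python
import html
import json
import re


def strip_markup(raw):
    """Remove anything that looks like a tag, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)  # replace tags with a space
    text = html.unescape(text)           # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()


raw_html = "<html><body><h1>Hawaii</h1><p>Islands &amp; ocean.</p></body></html>"
print(strip_markup(raw_html))

# Extracting a text field from a JSON record before cleaning it
raw_json = '{"title": "Hawaii", "text": "<p>Islands &amp; ocean.</p>"}'
record = json.loads(raw_json)
print(strip_markup(record["text"]))
```

A tag-matching regular expression like this is fragile on real-world pages (scripts, comments, malformed markup), which is where a proper parser such as Beautiful Soup becomes the safer choice.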