# Introduction to NLTK

# Text Pre-Processing

In [None]:
import nltk

In [None]:
sample_text = "     The day for the race came, and    the Tortoise and Hare  started   together.        The Tortoise never stopped for a moment, walking slowly but steadily, right to the end of the course. The Hare ran fast and stopped to lie down for a rest. But he fell fast asleep. Eventually, he woke up and ran as fast as he could. But when he reached the end, he saw the Tortoise there already, sleeping comfortably after her effort."

## Text Cleaning

#### Lowercasing

In [None]:
# lower() is a built-in string method; it returns a lowercased copy of the string
lower_text = sample_text.lower()
lower_text

#### Extra spaces removal

In [None]:
# we define a function to remove extra spaces. split() with no arguments divides the string into words on any run
# of whitespace, discarding the spaces; join() then rebuilds the string with a single space between words.

def remove_whitespace(text):
    return " ".join(text.split())

lower_text = remove_whitespace(lower_text)
lower_text
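
To see why the split/join approach works, the short sketch below runs it on a made-up string (`messy` is only an illustration, not part of the notebook data) and compares it with a regular-expression variant. Both are assumed to be interchangeable for ordinary text; the regex version is shown as an alternative, not as the method used in this notebook.

```python
import re

# Made-up input with leading/trailing spaces, a tab and a newline
messy = "   The  day for\tthe race\n came.   "

# split() with no argument cuts on any run of whitespace and
# drops the empty pieces at both ends
words = messy.split()

def remove_whitespace(text):
    # same function as in the cell above
    return " ".join(text.split())

def remove_whitespace_re(text):
    # alternative: collapse each whitespace run to one space, then trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(words)
print(remove_whitespace(messy))
print(remove_whitespace_re(messy))
```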

## Tokenization

In [None]:
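# download the Punkt models used by the tokenizers below
# (recent NLTK releases may ask for the 'punkt_tab' resource instead)
nltk.download('punkt')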

#### Sentence Splitting

In [None]:
# the output is a list, where each element is a sentence of the original text
nltk.sent_tokenize(lower_text)

#### Tokenization of the whole text

In [None]:
# the output is a list, where each element is a token of the original text
tokenized_text = nltk.word_tokenize(lower_text)
print(tokenized_text)

## Stopword removal

In [None]:
# we download the stopword corpus, then import the list of English stopwords and save it into stopwords_en
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords_en = stopwords.words('english')

In [None]:
# we prepare an empty list, which will contain the tokens left after stopword removal
tokenized_text_2 = []

# we iterate over the list of tokens obtained through tokenization
for token in tokenized_text:
    # if a token is not a stopword, we insert it in the list
    if token not in stopwords_en:
        tokenized_text_2.append(token)

# the output is a list of all the tokens of the original text excluding the stopwords
print(tokenized_text_2)
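
The same filtering can be written more compactly with a list comprehension. The sketch below uses a tiny hand-picked stopword set and token list (`demo_stopwords` and `demo_tokens` are hypothetical, not NLTK's data) so that it runs without any download. With the real list, one would presumably build the set with `set(stopwords_en)`.

```python
# Tiny hand-picked stopword set, only for illustration (not NLTK's list)
demo_stopwords = {"the", "and", "for", "a", "to"}

# Hand-written tokens standing in for the output of word_tokenize
demo_tokens = ["the", "tortoise", "and", "the", "hare", "started", "together", "."]

# keep every token that is not in the stopword set
filtered_tokens = [token for token in demo_tokens if token not in demo_stopwords]

print(filtered_tokens)
```

Using a set rather than a list for the stopwords makes each membership test faster on average, which matters on longer texts. Note that punctuation tokens such as "." are not stopwords, so they survive this step.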

## POS Tagging

In [None]:
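# download the model used by nltk.pos_tag
# (recent NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')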

The output of POS tagging is a list of tuples. A tuple is an ordered, immutable collection. You can recognize a tuple because it looks like a list but is written between round brackets:<br>
List = [element 1, element 2, ...]<br>
Tuple = (element 1, element 2, ...)<br>
Set = {element 1, element 2, ...}<br>
Dictionary = {key 1: value 1, key 2: value 2, ...}
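
As a quick illustration of these four collection types, here is a minimal sketch with made-up values (all the `sample_*` names are hypothetical). It shows that a list can be modified in place, while trying to modify a tuple raises a `TypeError`.

```python
sample_list = ["tortoise", "hare"]              # ordered, changeable
sample_tuple = ("tortoise", "NN")               # ordered, unchangeable
sample_set = {"the", "and", "the"}              # unordered, duplicates collapse
sample_dict = {"tortoise": "NN", "ran": "VBD"}  # maps keys to values

# a list element can be replaced in place
sample_list[0] = "Tortoise"

# a tuple element cannot: the assignment raises TypeError
try:
    sample_tuple[0] = "Tortoise"
    tuple_changed = True
except TypeError:
    tuple_changed = False

# a tuple can be unpacked into separate variables
token, tag = sample_tuple

print(sample_list, len(sample_set), tuple_changed, token, tag)
```

Tuple unpacking is handy when looping over (token, tag) pairs such as the ones produced by POS tagging.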

In [None]:
pos_tagging = nltk.pos_tag(tokenized_text_2)
print(pos_tagging)

## Punctuation Removal

Now we want to remove the punctuation. As you can see, punctuation marks do not get a dedicated POS tag. All the other parts of speech have a label made up of at least two characters ("NN", "VBD", etc.), while for punctuation the tag is the mark itself: ".", ",", etc.
So, to remove punctuation, we can drop all the tokens whose POS tag has length 1.

In [None]:
cleaned_POS_text = []

# POS tagged text is a list of tuples, where the first element is the token and the second one is its
# Part of Speech. If the POS has length 1, the token is punctuation; otherwise it is not, and we insert
# the tuple in the list cleaned_POS_text
for token, tag in pos_tagging:
    if len(tag) > 1:
        cleaned_POS_text.append((token, tag))

print(cleaned_POS_text)
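
The length test is enough for this text, which contains only commas and full stops. It is worth knowing, though, that the Penn Treebank tagset also has two-character tags for quotation marks (`` and ''), which a length check would keep. The sketch below uses hand-written (token, tag) pairs, not real tagger output, to compare the length test with a character-based check built on `string.punctuation`; the helper `is_punctuation` is a hypothetical name introduced for this example.

```python
import string

# Hand-written (token, tag) pairs standing in for nltk.pos_tag output
tagged = [("``", "``"), ("hello", "NN"), (",", ","),
          ("''", "''"), ("said", "VBD"), (".", ".")]

def is_punctuation(token):
    # True when every character of the token is an ASCII punctuation mark
    return all(char in string.punctuation for char in token)

# filter on the tag length, as in the cell above
by_tag_length = [(tok, tag) for tok, tag in tagged if len(tag) > 1]

# filter on the characters of the token itself
by_characters = [(tok, tag) for tok, tag in tagged if not is_punctuation(tok)]

print(by_tag_length)
print(by_characters)
```

The character-based check only knows ASCII punctuation, so it is an alternative under that assumption rather than a complete solution.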

nltk.pos_tag returns many different Part of Speech tags. For the next step, lemmatization, it is better to simplify them. To do so, we use the following function, which maps each tag to a simpler form. E.g. "VBD", "VBG", etc. (which are all verbs) become "v". Tags outside the four groups (adjective, verb, noun, adverb) become None.

In [None]:
def simpler_pos_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return "a"
    elif nltk_tag.startswith('V'):
        return "v"
    elif nltk_tag.startswith('N'):
        return "n"
    elif nltk_tag.startswith('R'):
        return "r"
    else:
        return None

simpler_POS_text = []

# for each tuple of the list, we create a new tuple: the first element is the token, the second is
# the simplified POS tag, obtained by calling the function simpler_pos_tag()
# then we append the newly created tuple to a new list, which will be the output
for token, tag in cleaned_POS_text:
    POS_tuple = (token, simpler_pos_tag(tag))
    simpler_POS_text.append(POS_tuple)

print(simpler_POS_text)

## Lemmatization

In [None]:
nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

The **lemmatize()** method of WordNetLemmatizer can be called with or without the POS as a second argument:<br>
1 - lemmatize(token)<br>
2 - lemmatize(token, pos="...")<br>
Without a POS, the token is treated as a noun (the default is pos="n"). Giving the correct POS makes the lemmatization more accurate.<br>
The POS tags that this method accepts are the simplified ones that we produced previously.

In [None]:
lemmatizer.lemmatize('cars')

In [None]:
lemmatizer.lemmatize('was', pos='v')

In [None]:
lemmatized_text = []

# if the simplified POS is None, we lemmatize without it; otherwise we pass it to lemmatize()
for token, pos in simpler_POS_text:
    if pos is None:
        lemmatized_text.append(lemmatizer.lemmatize(token))
    else:
        lemmatized_text.append(lemmatizer.lemmatize(token, pos=pos))

print(lemmatized_text)

## Stemming

In [None]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stem_text = []

# the stemmer does not use the POS, so we only pass it the token
for token, pos in simpler_POS_text:
    stem_text.append(stemmer.stem(token))

print(stem_text)

In [None]:
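# stem() expects a single word, so a whole sentence is processed as if it were one long token;
# to stem every word, tokenize the sentence first and call stem() on each token
stemmer.stem("I was feeding two dogs")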