# Introduction to Natural Language Processing (NLP) with unsupervised models

## What is NLP?

Natural Language Processing is an artificial intelligence domain that focuses on the interaction between computers and humans using natural language.

Before using AI models in a project, it is very important to choose your data wisely and to preprocess it properly.

One such preprocessing step is called cleansing.

## Preprocessing data (cleansing)

There are six steps to clean the data:
- Removal of punctuation marks
- Tokenization
- Removal of stop words
- Stemming
- Lemmatization
- Part-of-speech (POS) tagging

### Removal of punctuation marks

We delete the punctuation marks from the sentences: for the word-level processing used here, they add little meaning to the text.

Example in Python code:

First, we need to load our data set.

In [1]:
import pandas as pd

In [2]:
dataframe = pd.read_csv("../../../../data/NLP/Corona_NLP_train.csv", encoding='latin1')

print(dataframe)

       UserName  ScreenName                      Location     TweetAt  \
0          3799       48751                        London  16-03-2020   
1          3800       48752                            UK  16-03-2020   
2          3801       48753                     Vagabonds  16-03-2020   
3          3802       48754                           NaN  16-03-2020   
4          3803       48755                           NaN  16-03-2020   
...         ...         ...                           ...         ...   
41152     44951       89903  Wellington City, New Zealand  14-04-2020   
41153     44952       89904                           NaN  14-04-2020   
41154     44953       89905                           NaN  14-04-2020   
41155     44954       89906                           NaN  14-04-2020   
41156     44955       89907  i love you so much || he/him  14-04-2020   

                                           OriginalTweet           Sentiment  
0      @MeNyrbie @Phil_Gahan @Chrisitv https

In [3]:
import re

In [4]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

We then use this function to remove the punctuation marks from our text:

In [5]:
text_without_punctuation = [remove_punctuation(text) for text in dataframe['OriginalTweet']]

print(text_without_punctuation)
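To see what the regular expression does on a single string, here is a minimal sketch. The sample sentence is made up for illustration and is not taken from the data set. Note that `\w` also matches the underscore, so underscores are kept.

```python
import re

def remove_punctuation(text):
    # Delete every character that is neither a word character (\w) nor whitespace (\s)
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample sentence, not taken from the data set
sample = "Stay home, stay safe! #covid_19"
print(remove_punctuation(sample))  # Stay home stay safe covid_19
```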



After removing the punctuation marks, we can move on to tokenization.

### Tokenization

Tokenization splits a string into a list of separate words (tokens). This is very important for data processing in the different models.

To perform tokenization, we can use various libraries, such as NLTK, which specializes in natural language processing.

In [6]:
import nltk
from nltk.tokenize import word_tokenize

# Required data to achieve tokenization with the NLTK library
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gaspa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Here we write the tokenize function:

In [7]:
def tokenize(text_without_punctuation):
    text_tokenized = []
    for text in text_without_punctuation:
        text_tokenized.append(word_tokenize(text))
    return text_tokenized

In [8]:
text_tokenized = tokenize(text_without_punctuation)

print(text_tokenized[1])

['advice', 'Talk', 'to', 'your', 'neighbours', 'family', 'to', 'exchange', 'phone', 'numbers', 'create', 'contact', 'list', 'with', 'phone', 'numbers', 'of', 'neighbours', 'schools', 'employer', 'chemist', 'GP', 'set', 'up', 'online', 'shopping', 'accounts', 'if', 'poss', 'adequate', 'supplies', 'of', 'regular', 'meds', 'but', 'not', 'over', 'order']
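As a rough illustration of what tokenization produces, the sketch below uses only the standard library. It is a simplified stand-in, not NLTK's `word_tokenize`: the function name `simple_tokenize` and the sample sentence are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    # Return every maximal run of word characters as one token
    return re.findall(r"\w+", text)

# Hypothetical sample sentence
sample = "Talk to your neighbours  and family"
print(simple_tokenize(sample))
# ['Talk', 'to', 'your', 'neighbours', 'and', 'family']
```

`word_tokenize` applies more elaborate rules (for example for contractions), so its output can differ from this sketch on raw text.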


After tokenization, we remove the stop words.

### Removal of stop words

Stop word removal deletes the words that only serve to structure the language (function words) and that do not affect the content or how it is understood.

To achieve this, we can use the stopwords list of the NLTK library.

In [9]:
from nltk.corpus import stopwords

# Data required to get a stop word list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gaspa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Then we can define our function

In [10]:
def remove_stopwords(text_tokenized):
    text_without_stopwords = []
    stop_words = set(stopwords.words('english'))
    for text in text_tokenized:
        text_without_stopwords.append([word for word in text if word not in stop_words])
    return text_without_stopwords

In [11]:
text_without_stopwords = remove_stopwords(text_tokenized)

print(text_without_stopwords[1])

['advice', 'Talk', 'neighbours', 'family', 'exchange', 'phone', 'numbers', 'create', 'contact', 'list', 'phone', 'numbers', 'neighbours', 'schools', 'employer', 'chemist', 'GP', 'set', 'online', 'shopping', 'accounts', 'poss', 'adequate', 'supplies', 'regular', 'meds', 'order']


After removing the stop words, we can convert the remaining words to lowercase.

Here's a function for it:

In [12]:
def convert_to_lowercase(text_without_stopwords):
    lowercase_text = []
    for text in text_without_stopwords:
        lowercase_text.append([word.lower() for word in text])
    return lowercase_text

In [13]:
lowercased_text = convert_to_lowercase(text_without_stopwords)

print(lowercased_text[1])

['advice', 'talk', 'neighbours', 'family', 'exchange', 'phone', 'numbers', 'create', 'contact', 'list', 'phone', 'numbers', 'neighbours', 'schools', 'employer', 'chemist', 'gp', 'set', 'online', 'shopping', 'accounts', 'poss', 'adequate', 'supplies', 'regular', 'meds', 'order']
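The order of stop word removal and lowercasing matters, because the membership test in remove_stopwords is case-sensitive. The sketch below shows the difference, assuming a stop word set that contains only lowercase entries (as NLTK's English list does). The miniature set `STOP_WORDS`, the helper `filter_stop_words` and the sample tokens are hypothetical.

```python
# Hypothetical miniature stop word set (all lowercase), not NLTK's list
STOP_WORDS = {"the", "to", "of"}

def filter_stop_words(tokens):
    # Case-sensitive membership test, as in remove_stopwords above
    return [word for word in tokens if word not in STOP_WORDS]

tokens = ["The", "price", "of", "food"]

# Filtering first leaves the capitalized stop word "The" in place
filter_then_lower = [word.lower() for word in filter_stop_words(tokens)]

# Lowercasing first lets the filter catch it
lower_then_filter = filter_stop_words([word.lower() for word in tokens])

print(filter_then_lower)  # ['the', 'price', 'food']
print(lower_then_filter)  # ['price', 'food']
```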


### Stemming

Stemming removes the affixes of a word to reduce it to an invariant canonical form, in other words to its grammatical "root" (stem). The resulting stems are not necessarily real words of the language.

Here is a function that does this. We still use the NLTK library.

In [14]:
from nltk.stem import PorterStemmer

In [15]:
def stemming(lowercased_text):
    # Create the stemmer once and reuse it for every word
    stemmer = PorterStemmer()
    stemming_text = []
    for text in lowercased_text:
        stemming_text.append([stemmer.stem(word) for word in text])
    return stemming_text

In [16]:
stemming_text = stemming(lowercased_text)

print(stemming_text[1])

['advic', 'talk', 'neighbour', 'famili', 'exchang', 'phone', 'number', 'creat', 'contact', 'list', 'phone', 'number', 'neighbour', 'school', 'employ', 'chemist', 'gp', 'set', 'onlin', 'shop', 'account', 'poss', 'adequ', 'suppli', 'regular', 'med', 'order']
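To illustrate why stems are not always real words, here is a toy suffix-stripping sketch. It is not the Porter algorithm used by PorterStemmer, which applies many more rules; the rule list `SUFFIXES` and the function `toy_stem` are assumptions made for this example.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left untouched
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["numbers", "schools", "shopping", "supplies"]])
# ['number', 'school', 'shopp', 'suppli']
```

The first two results are real words, while "shopp" and "suppli" are not: a stemmer only cuts characters according to its rules, without checking a dictionary.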


### Lemmatization

Lemmatization is an algorithmic process that reduces a word to its dictionary form, called its "lemma".

_Lemmatization is recommended over stemming when the given text (data) is clean and contains few errors. Stemming, for example, may not distinguish between the word "great" and a word such as "Great-Britain"._

_The text needs to be clean because otherwise a word can be misinterpreted, for example if it contains a typo, and its meaning can change after lemmatization. (The same can happen with grammar, syntax or other errors.)_

⚠️ Lemmatization needs to be performed before removing the stop words! POS tagging is needed if we want lemmatization to give better results than stemming.

_Here we only show how to apply the techniques that we will need in the future to preprocess the data._

Here is a function that performs lemmatization:

In [17]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gaspa\AppData\Roaming\nltk_data...


True

In [18]:
def lemmatization(lowercase_text):
    # Create the lemmatizer once and reuse it for every word
    lemmatizer = WordNetLemmatizer()
    lemma_text = []
    for text in lowercase_text:
        lemma_text.append([lemmatizer.lemmatize(word) for word in text])
    return lemma_text

In [19]:
lemma_text = lemmatization(lowercased_text)

print(lemma_text[1])

['advice', 'talk', 'neighbour', 'family', 'exchange', 'phone', 'number', 'create', 'contact', 'list', 'phone', 'number', 'neighbour', 'school', 'employer', 'chemist', 'gp', 'set', 'online', 'shopping', 'account', 'po', 'adequate', 'supply', 'regular', 'med', 'order']
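The reason the part of speech matters for lemmatization can be sketched with a toy lookup table. This is not how WordNetLemmatizer is implemented; the table `LEMMAS`, its entries and the function `toy_lemmatize` are hypothetical and only illustrate that the same word can have different lemmas depending on its grammatical class.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech)
LEMMAS = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("meeting"))           # meeting (treated as a noun)
print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("better"))            # better (no noun entry, unchanged)
print(toy_lemmatize("better", pos="a"))   # good
```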


### Part of Speech Tagging (POS tagging)

Part-of-speech tagging labels each word of a sentence with its grammatical class, such as adjective, noun, verb or adverb.

Most lemmatization functions perform POS tagging as part of the process. However, the NLTK function that we used above does not: unless a POS is passed to it, WordNetLemmatizer.lemmatize treats every word as a noun. That is why I recommend reading the documentation of the libraries that you use for preprocessing.

POS tagging is performed right after the tokenization stage, so we resume from that stage.

Let's implement this stage:

In [20]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gaspa\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [21]:
def pos_tagging(texts):
    tag_text = []
    for text in texts:
        tag_text.append(nltk.pos_tag(text))
    return tag_text

In [22]:
tag_text = pos_tagging(text_tokenized)

print(tag_text[1])

[('advice', 'NN'), ('Talk', 'NN'), ('to', 'TO'), ('your', 'PRP$'), ('neighbours', 'NNS'), ('family', 'NN'), ('to', 'TO'), ('exchange', 'VB'), ('phone', 'NN'), ('numbers', 'NNS'), ('create', 'VBP'), ('contact', 'JJ'), ('list', 'NN'), ('with', 'IN'), ('phone', 'NN'), ('numbers', 'NNS'), ('of', 'IN'), ('neighbours', 'NNS'), ('schools', 'NNS'), ('employer', 'VBP'), ('chemist', 'JJ'), ('GP', 'NNP'), ('set', 'VBD'), ('up', 'RP'), ('online', 'JJ'), ('shopping', 'NN'), ('accounts', 'NNS'), ('if', 'IN'), ('poss', 'JJ'), ('adequate', 'JJ'), ('supplies', 'NNS'), ('of', 'IN'), ('regular', 'JJ'), ('meds', 'NNS'), ('but', 'CC'), ('not', 'RB'), ('over', 'IN'), ('order', 'NN')]


Lemmatization is usually done at this stage. In this case, we need to write a new function that uses the tags.

In [23]:
def pos_tagging_lemmatization(tagged_texts):
    # Create the lemmatizer once and reuse it for every token
    lemmatizer = WordNetLemmatizer()
    lemma_text = []
    for text in tagged_texts:
        lemma_tags = []
        for token, tag in text:
            # Map the Penn Treebank tag prefix to the matching WordNet POS code
            if tag.startswith('N'):
                lemma = lemmatizer.lemmatize(token, pos='n')
            elif tag.startswith('V'):
                lemma = lemmatizer.lemmatize(token, pos='v')
            elif tag.startswith('J'):
                lemma = lemmatizer.lemmatize(token, pos='a')
            elif tag.startswith('R'):
                lemma = lemmatizer.lemmatize(token, pos='r')
            else:
                # No matching class: use the lemmatizer's default (noun)
                lemma = lemmatizer.lemmatize(token)
            lemma_tags.append(lemma)
        lemma_text.append(lemma_tags)
    return lemma_text

In [24]:
lemma_text = pos_tagging_lemmatization(tag_text)

print(lemma_text[1])

['advice', 'Talk', 'to', 'your', 'neighbour', 'family', 'to', 'exchange', 'phone', 'number', 'create', 'contact', 'list', 'with', 'phone', 'number', 'of', 'neighbour', 'school', 'employer', 'chemist', 'GP', 'set', 'up', 'online', 'shopping', 'account', 'if', 'poss', 'adequate', 'supply', 'of', 'regular', 'med', 'but', 'not', 'over', 'order']
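The tag mapping used in pos_tagging_lemmatization can also be isolated in a small helper. The sketch below is one possible way to write it; the name `penn_to_wordnet` is an assumption made for this example. It reproduces the same prefix rules and falls back to noun, which is the default POS of WordNetLemmatizer.lemmatize.

```python
def penn_to_wordnet(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS code
    prefix_to_pos = {"N": "n", "V": "v", "J": "a", "R": "r"}
    # Default to noun, which is also WordNetLemmatizer's default
    return prefix_to_pos.get(tag[:1], "n")

for tag in ["NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, penn_to_wordnet(tag))
```

With such a helper, the body of the inner loop reduces to a single call of the form `lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))`.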


### To sum up

We need to carry out each of these steps, in this order, if we want to use lemmatization:

Tokenization -> POS Tagging (if not contained in lemmatization) -> Lemmatization -> Punctuation removal -> Put in same case (lower or upper case) -> Stop words removal

Otherwise:

Punctuation removal -> Tokenization -> Put in same case -> Stop words removal -> Stemming
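The stemming pipeline can be sketched as a single function that applies the steps in this order. This is a minimal sketch under stated assumptions: a whitespace split stands in for NLTK's word_tokenize, the stop word set is passed in by the caller, and the stemmer is an optional callable (for example `PorterStemmer().stem`) that defaults to leaving words unchanged. The function name `preprocess` is hypothetical.

```python
import re

def preprocess(text, stop_words, stem=lambda word: word):
    # 1. Punctuation removal
    text = re.sub(r'[^\w\s]', '', text)
    # 2. Tokenization (whitespace split as a stand-in for word_tokenize)
    tokens = text.split()
    # 3. Put in same case
    tokens = [token.lower() for token in tokens]
    # 4. Stop words removal
    tokens = [token for token in tokens if token not in stop_words]
    # 5. Stemming (identity by default; pass e.g. PorterStemmer().stem)
    return [stem(token) for token in tokens]

# Hypothetical stop word set and sample sentence
print(preprocess("Talk to your neighbours, family!", {"to", "your"}))
# ['talk', 'neighbours', 'family']
```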

### To conclude

This type of data preprocessing is part of text mining (see the course about text mining).