<a href="https://colab.research.google.com/github/HHansi/Applied-AI-Course/blob/main/NLP/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing

This notebook contains the practical examples and exercises for the Natural Language Processing part of the Applied AI course.

*Created by Hansi Hettiarachchi*

Importing libraries

In [1]:
import nltk

# download NLTK modules
nltk.download('punkt')  # required for Tokenizers
nltk.download('wordnet')  # required for WordNetLemmatizer
nltk.download('omw-1.4')  # required for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # required for PoS tagger
nltk.download('stopwords')  # required for stop word lists

from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords
import string

import spacy
from spacy import displacy
import en_core_web_sm  # spacy model
nlp = en_core_web_sm.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Tokenisation
Tokenisation is the process which divides text into smaller parts called tokens.
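
As a rough illustration of the idea, the sketch below splits text into word and punctuation tokens with a single regular expression. It is a simplification written for this explanation, not how NLTK's tokenizers work internally, and `simple_tokenize` is a hypothetical helper name.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of word characters (a word);
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("This is a sentence, which contains all kind of words, and needs to be tokenized!")
print(tokens)
```

A pattern this simple breaks down on contractions, URLs and emoticons, which is one reason to prefer a library tokenizer.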

Let's see how to use the [tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in the NLTK (Natural Language Toolkit) package to tokenise text.

In [2]:
sample_text = "This is a sentence, which contains all kind of words, and needs to be tokenized!"
sample_tweet1 = "This is a cooool :-) :-P <3 #cool"
sample_tweet2 = "@remy: This is waaaaayyyy too much for you!!!!!!"

Tokenising normal text

In [3]:
tokenized_text = word_tokenize(sample_text)
print(tokenized_text)

['This', 'is', 'a', 'sentence', ',', 'which', 'contains', 'all', 'kind', 'of', 'words', ',', 'and', 'needs', 'to', 'be', 'tokenized', '!']


Tokenising tweets

In [4]:
tokenized_tweet1 = word_tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = word_tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':', '-', ')', ':', '-P', '<', '3', '#', 'cool']
tokenized tweet2: ['@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']


As the outputs above show, <i>word_tokenize</i> does not handle tweet text well: it splits emoticons, hashtags and user handles into separate characters.
Because tweet text differs from normal text, NLTK provides another tokenizer, <i>TweetTokenizer</i>, which is designed specifically for tweets.

In [5]:
tknzr = TweetTokenizer()

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
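
To get a feel for why a tweet-aware tokenizer behaves differently, here is a minimal sketch that tries tweet-specific patterns (handles, hashtags, a few emoticons) before falling back to words and single punctuation marks. The pattern and the name `toy_tweet_tokenize` are assumptions made for illustration; the rules used by TweetTokenizer are far more extensive.

```python
import re

# alternatives are tried left to right, so tweet-specific patterns come first
TOY_TWEET_PATTERN = re.compile(
    r"[@#]\w+"          # handles and hashtags, e.g. @remy, #cool
    r"|[:;]-?[)(DPp]"   # a few western emoticons, e.g. :-) ;P
    r"|<3"              # heart
    r"|\w+"             # ordinary words
    r"|[^\w\s]"         # any other single non-space character
)


def toy_tweet_tokenize(text):
    return TOY_TWEET_PATTERN.findall(text)


toy_tokens = toy_tweet_tokenize("This is a cooool :-) :-P <3 #cool")
print(toy_tokens)
```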


Let's analyse more features available with [TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html?highlight=tweettokenizer#nltk.tokenize.casual.TweetTokenizer).
- preserve_case (default setting=True) - Keep the original casing of the text; when set to False, tokens are converted to lowercase (emoticons such as :-P are left unchanged)
- reduce_len (default setting=False) - Normalise text by replacing repeated character sequences of length 3 or greater with sequences of length 3
- strip_handles (default setting=False) - Remove Twitter usernames (handles) from the text
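
The effect of `reduce_len` and `strip_handles` can be approximated with two regular expressions. This is a simplified sketch under the assumption that a handle is just `@` followed by word characters; the helper names are hypothetical, and NLTK's own handle pattern is stricter.

```python
import re


def toy_reduce_len(text):
    # (.) captures one character, \1{2,} requires at least two more copies of it;
    # the whole run is replaced by exactly three copies
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def toy_strip_handles(text):
    # simplified handle pattern: '@' followed by word characters
    return re.sub(r"@\w+", "", text)


print(toy_reduce_len("waaaaayyyy"))
print(toy_reduce_len("cooool"))
print(toy_strip_handles("@remy: This is too much"))
```

Runs of one or two repeated characters (for example the "oo" in "too") are left untouched, since the pattern only fires on three or more.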

In [6]:
# convert the tokens into lowercase (case insensitive)
print('configs: preserve_case=False')
tknzr = TweetTokenizer(preserve_case=False)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


# make the tokens case insensitive and reduce length
print('\nconfigs: preserve_case=False, reduce_len=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


# make the tokens case insensitive, reduce length and remove usernames
print('\nconfigs: preserve_case=False, reduce_len=True, strip_handles=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False
tokenized tweet1: ['this', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']

configs: preserve_case=False, reduce_len=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

configs: preserve_case=False, reduce_len=True, strip_handles=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: [':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


# Text Normalisation


## Lower casing

In [7]:
sample_text = "The striped BATs are hanging on their feet for best"

In [8]:
# Any string can be lowercased using the string method lower()
lower_cased_text = sample_text.lower()
print(lower_cased_text)

the striped bats are hanging on their feet for best


If you are not familiar with string methods, you can find a list of all of them in the [documentation](https://docs.python.org/3.7/library/stdtypes.html#string-methods).

## Stemming
Stemming chops off the end or beginning of a word using a list of common suffixes and prefixes, without checking whether the result is a valid word.

The most common and effective algorithm for stemming English is <i>Porter’s algorithm</i>.

[Stemmers in NLTK](https://www.nltk.org/howto/stem.html)
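
To make the suffix-stripping idea concrete, here is a toy stemmer with a handful of hand-picked rules. It is not Porter's algorithm (which applies several ordered phases of rules with extra conditions); the rule list and the name `toy_stem` are assumptions for illustration only.

```python
# ordered (suffix, replacement) rules, tried from top to bottom
TOY_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("s", "")]


def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            # accept the rule only if enough of the word is left
            if len(stem) >= min_stem_len:
                return stem
    # no rule applied: return the (lowercased) word unchanged
    return word


toy_stems = [toy_stem(w) for w in ["dogs", "ponies", "eating", "corpora"]]
print(toy_stems)
```

Note that a stem such as "poni" is not a dictionary word; stemming only aims to map related word forms to the same string.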

In [9]:
sample_words = ["dogs", "ponies", "eating", "corpora"]
sample_sentence = "The striped bats are hanging on their feet for best."

In [10]:
ps = PorterStemmer()

stem_words = [ps.stem(word) for word in sample_words]
print(f'Stemmed words: {stem_words}\n')

stem_words = [ps.stem(word) for word in word_tokenize(sample_sentence)]
print(f'Stemmed words: {stem_words}')
print(f'Stemmed sentence: {" ".join(stem_words)}')

Stemmed words: ['dog', 'poni', 'eat', 'corpora']

Stemmed words: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best', '.']
Stemmed sentence: the stripe bat are hang on their feet for best .


## Lemmatisation

Lemmatisation is a more organised procedure that obtains the base form of a word (its lemma) using a vocabulary and morphological analysis (word structure and grammar relations) of words.
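
The vocabulary-based idea can be sketched as a lookup keyed by the word and its part of speech. The tiny lexicon and the name `toy_lemmatize` below are hypothetical; a real lemmatiser relies on a full lexical resource such as WordNet together with morphological rules.

```python
# tiny hand-written lexicon: (surface form, part of speech) -> lemma
TOY_LEXICON = {
    ("feet", "n"): "foot",
    ("ponies", "n"): "pony",
    ("ate", "v"): "eat",
    ("eating", "v"): "eat",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    # unknown (word, pos) pairs fall back to the word itself
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("feet"))             # looked up as a noun
print(toy_lemmatize("eating"))           # not in the lexicon as a noun
print(toy_lemmatize("eating", pos="v"))  # found as a verb
```

The same word can map to different lemmas depending on the part of speech supplied, which is why the lookup key includes it.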

### NLTK [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html#nltk.stem.WordNetLemmatizer.lemmatize)

In [11]:
sample_words = ["dogs", "ponies", "eating", "corpora"]
sample_sentence = "The striped bats are hanging on their feet for best."

In [12]:
wnl = WordNetLemmatizer()

lemma_words = [wnl.lemmatize(word) for word in sample_words]
print(f'Lemmatised words: {lemma_words}\n')

lemma_words = [wnl.lemmatize(word) for word in word_tokenize(sample_sentence)]
print(f'Lemmatised words: {lemma_words}')
print(f'Lemmatised sentence: {" ".join(lemma_words)}')

Lemmatised words: ['dog', 'pony', 'eating', 'corpus']

Lemmatised words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best', '.']
Lemmatised sentence: The striped bat are hanging on their foot for best .


**Exercise**

Compare the difference between the outputs by stemmer and lemmatiser.


### NLTK WordNetLemmatizer with Part-of-Speech (PoS) tags

[Parts of speech](https://www.englishclub.com/grammar/parts-of-speech.htm) are also known as word classes or lexical categories.

In [13]:
lemma_word = wnl.lemmatize('ponies', pos='n')
print(lemma_word)

lemma_word = wnl.lemmatize('eating', pos='v')
print(lemma_word)

lemma_word = wnl.lemmatize('ate', pos='v')
print(lemma_word)

lemma_word = wnl.lemmatize('better', pos='a')
print(lemma_word)
print(lemma_word)

pony
eat
eat
good


[NLTK PoS Taggers](https://www.nltk.org/api/nltk.tag.html)

In [14]:
nltk.pos_tag(sample_words)

[('dogs', 'NNS'), ('ponies', 'NNS'), ('eating', 'VBG'), ('corpora', 'NN')]

In [15]:
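def get_wordnet_pos(word):
    # tag the word on its own and keep the first letter of its tag (e.g. 'VBG' -> 'V')
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # map that letter to the PoS constants used by WordNet
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # any other tag falls back to noun
    return tag_dict.get(tag, wordnet.NOUN)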

If you are not familiar with the Python dictionary get() method, you can find more details [here](https://www.w3schools.com/python/ref_dictionary_get.asp).
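
As a quick illustration of `get()` with a default value, the sketch below uses a plain dictionary of strings (`tag_to_pos` is a hypothetical name, and the single-letter values stand in for the WordNet constants).

```python
tag_to_pos = {"J": "a", "N": "n", "V": "v", "R": "r"}

found = tag_to_pos.get("V", "n")    # key exists, so its value is returned
missing = tag_to_pos.get("D", "n")  # key does not exist, so the default is returned
print(found, missing)

# indexing with [] raises KeyError for a missing key, whereas get() does not
try:
    tag_to_pos["D"]
except KeyError:
    print("KeyError for 'D'")
```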

In [16]:
lemma_words = [wnl.lemmatize(word, pos=get_wordnet_pos(word)) for word in sample_words]
print(f'Lemmatised words: {lemma_words}\n')

# note: no pos argument is passed here, so each token is lemmatised with the default PoS (noun)
lemma_words = [wnl.lemmatize(word) for word in word_tokenize(sample_sentence)]
print(f'Lemmatised words: {lemma_words}')
print(f'Lemmatised sentence: {" ".join(lemma_words)}')

Lemmatised words: ['dog', 'pony', 'eat', 'corpus']

Lemmatised words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best', '.']
Lemmatised sentence: The striped bat are hanging on their foot for best .


### spaCy Lemmatisation

spaCy models are pipelines designed with multiple components.<br>
[spaCy English Models](https://spacy.io/models/en)


In [17]:
doc = nlp(sample_sentence)
print([token.lemma_ for token in doc])

['the', 'stripe', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good', '.']


**Exercise**

Compare the lemmatised outputs generated by WordNetLemmatizer and spaCy Lemmatizer.

# Stop Word Removal

In [18]:
sample_text = "This is a sample sentence, showing off the stop words removal."

In [19]:
# define set of English stopwords
stop_words = set(stopwords.words('english')) 

# tokenise text
tokens = word_tokenize(sample_text)

# remove stopwords from tokens
filtered_words = [token for token in tokens if token not in stop_words]
print(filtered_words)

['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'removal', '.']
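
Note that 'This' remains in the output above: the NLTK stop word list is lowercase and the membership test is case-sensitive. One possible adjustment, sketched here with a small hypothetical stop word set and a pre-tokenised list so that it runs without NLTK, is to lowercase each token for the comparison only.

```python
# small stand-in for the full stop word list (assumption for illustration)
toy_stop_words = {"this", "is", "a", "off", "the"}

tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'removal', '.']

# compare the lowercased token, but keep the original token in the output
filtered_ci = [token for token in tokens if token.lower() not in toy_stop_words]
print(filtered_ci)
```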


**Exercise**

Update the default stop word list by removing 'off', and remove stop words in the sample_text.

expected output = \['This', 'sample', 'sentence', ',', 'showing', 'off', 'stop', 'words', 'removal', '.']

# Punctuation Removal

In [20]:
sample_text = "Let's remove punctuation marks!"

In [21]:
print(f'Punctuation marks: {string.punctuation}\n')

# remove punctuation marks in sample text
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punctuation = sample_text.translate(table)

print(no_punctuation)

Punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Lets remove punctuation marks
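
The translation table deserves a closer look: `dict.fromkeys(string.punctuation)` maps every punctuation character to `None`, and `translate()` deletes characters mapped to `None`. The sketch below shows this next to the three-argument form of `str.maketrans`, whose third argument lists characters to delete; variable names are chosen for illustration.

```python
import string

sample = "Let's remove punctuation marks!"

# dictionary form: each punctuation character is mapped to None (delete)
table_from_dict = str.maketrans(dict.fromkeys(string.punctuation))

# three-argument form: the third argument lists characters to delete
table_three_arg = str.maketrans("", "", string.punctuation)

result_dict = sample.translate(table_from_dict)
result_three_arg = sample.translate(table_three_arg)
print(result_dict)
print(result_three_arg)
```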


# Named Entity Recognition (NER)

Let's see how to use [spaCy](https://spacy.io/usage/linguistic-features#named-entities) models for NER.

[spaCy English Models](https://spacy.io/models/en)

In [22]:
sample_text = "Apple is looking at buying U.K. startup for $1 billion"

In [23]:
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
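
The two numbers printed for each entity are character offsets into the original string, with the end offset exclusive. The sketch below checks this with plain string slicing, using the offsets copied from the output above (no spaCy needed).

```python
sample_text = "Apple is looking at buying U.K. startup for $1 billion"

# (start_char, end_char, label) triples copied from the printed output
spans = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]

# slicing with [start:end] recovers the entity text
recovered = [(sample_text[start:end], label) for start, end, label in spans]
print(recovered)
```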


Replace the text of each recognised named entity with its entity label.

In [24]:
doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY MONEY MONEY


An entity that spans several tokens produces one label per token. These repetitions can be merged by adding ['merge_entities'](https://spacy.io/api/pipeline-functions#merge_entities) to the pipeline, which joins each named entity into a single token.

In [25]:
nlp.add_pipe("merge_entities")

doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY
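
For comparison, a similar result can be approximated after the fact by collapsing consecutive identical labels in the token list. This is only a sketch and not what `merge_entities` does: it would also merge two distinct adjacent entities that share a label. `ENTITY_LABELS` and `collapse_repeated_labels` are hypothetical names.

```python
from itertools import groupby

# labels that may appear in place of entity tokens (assumption for this example)
ENTITY_LABELS = {"ORG", "GPE", "MONEY"}


def collapse_repeated_labels(tokens):
    collapsed = []
    # groupby yields runs of consecutive equal tokens
    for token, run in groupby(tokens):
        if token in ENTITY_LABELS:
            collapsed.append(token)  # keep one label per run
        else:
            collapsed.extend(run)    # keep ordinary tokens as they are
    return collapsed


labelled = "ORG is looking at buying GPE startup for MONEY MONEY MONEY".split()
collapsed_sentence = " ".join(collapse_repeated_labels(labelled))
print(collapsed_sentence)
```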
