<a href="https://colab.research.google.com/github/HHansi/Applied-AI-Course/blob/main/NLP/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Processing

This notebook contains the practical examples and exercises for the Natural Language Processing part of the Applied AI course.

*Created by Hansi Hettiarachchi*

# Tokenisation
Tokenisation is the task of cutting a string into identifiable linguistic units that constitute a piece of language data.

Let's see how to use the [tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in the NLTK (Natural Language Toolkit) package to tokenise text.

In [1]:
import nltk
nltk.download('punkt')  # NLTK module required for Tokenizers

from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
sample_text = "This is a sentence, which contains all kind of words, and needs to be tokenized!"
sample_tweet1 = "This is a cooool :-) :-P <3 #cool"
sample_tweet2 = "@remy: This is waaaaayyyy too much for you!!!!!!"

Tokenising normal text

In [3]:
tokenized_text = word_tokenize(sample_text)
print(tokenized_text)

['This', 'is', 'a', 'sentence', ',', 'which', 'contains', 'all', 'kind', 'of', 'words', ',', 'and', 'needs', 'to', 'be', 'tokenized', '!']
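
To make the idea of cutting a string into units concrete, here is a minimal sketch that splits text into word and punctuation tokens with a regular expression. It is only an illustration: `simple_tokenise` is a hypothetical helper, not part of NLTK, and word_tokenize applies considerably more rules than this (for example, for contractions).

```python
import re


def simple_tokenise(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      : one or more letters, digits or underscores (a "word")
    # [^\w\s]  : any single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "This is a sentence, which contains all kind of words, and needs to be tokenized!"
print(simple_tokenise(sample))
```

For this particular sentence the sketch produces the same tokens as the output above, but it would diverge on harder inputs.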


Tokenising tweets

In [4]:
tokenized_tweet1 = word_tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = word_tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':', '-', ')', ':', '-P', '<', '3', '#', 'cool']
tokenized tweet2: ['@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']


As the above outputs show, <i>word_tokenize</i> does not tokenise the tweet text well: it splits emoticons, hashtags and handles into separate characters.
Because tweets differ from normal text, NLTK provides another tokenizer, <i>TweetTokenizer</i>, which is designed specifically for tweets.

In [5]:
tknzr = TweetTokenizer()

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
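
One way to think about a tweet-aware tokeniser is that it tries patterns for handles, hashtags and emoticons before falling back to plain words and punctuation. The sketch below mimics that idea with a small regular expression. The pattern and the helper name `simple_tweet_tokenise` are assumptions made for illustration; TweetTokenizer itself uses a much richer set of patterns.

```python
import re

# Alternatives are tried left to right, so the tweet-specific patterns win
# over the generic word and punctuation patterns.
TWEET_PATTERN = re.compile(
    r"[@#]\w+"          # handles and hashtags, e.g. @remy, #cool
    r"|<3"              # heart emoticon
    r"|[:;=]-?[()DPp]"  # a few simple emoticons, e.g. :-) ;) :-P
    r"|\w+"             # ordinary words
    r"|[^\w\s]"         # any other single punctuation mark
)


def simple_tweet_tokenise(text):
    """Tokenise a tweet while keeping handles, hashtags and some emoticons whole."""
    return TWEET_PATTERN.findall(text)


print(simple_tweet_tokenise("This is a cooool :-) :-P <3 #cool"))
```

Unlike TweetTokenizer, this sketch does not shorten long runs of punctuation such as `!!!!!!`.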


Let's analyse more features available with [TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html?highlight=tweettokenizer#nltk.tokenize.casual.TweetTokenizer).
- preserve_case (default setting=True) - Keep case sensitivity of the text
- reduce_len (default setting=False) - Normalise text by replacing repeated character sequences of length 3 or greater with sequences of length 3.
- strip_handles (default setting=False) - Remove Twitter usernames in the text
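
As a rough illustration of what the last two options do, the following sketch reproduces their effect with plain regular expressions. The helper names `reduce_length` and `strip_handles_simple` are hypothetical, and the handle pattern in particular is a simplification (it would also affect e-mail addresses, for instance), so treat this as an approximation rather than NLTK's implementation.

```python
import re


def reduce_length(text):
    """Shorten any character repeated 3 or more times to exactly 3 repeats."""
    # (.) captures one character; \1{2,} requires at least two more copies of it
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def strip_handles_simple(text):
    """Remove Twitter-style handles such as @remy (simplified pattern)."""
    return re.sub(r"@\w+", "", text)


print(reduce_length("waaaaayyyy"))
print(reduce_length("cooool"))
print(strip_handles_simple("@remy: This is too much"))
```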

In [6]:
# setting1: make the tokens case insensitive or convert into lowercase
print('configs: preserve_case=False')
tknzr = TweetTokenizer(preserve_case=False)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

configs: preserve_case=False
tokenized tweet1: ['this', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']


In [7]:
# setting2: make the tokens case insensitive and reduce length
print('\nconfigs: preserve_case=False, reduce_len=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False, reduce_len=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


In [8]:
# setting3: make the tokens case insensitive, reduce length and remove usernames
print('\nconfigs: preserve_case=False, reduce_len=True, strip_handles=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False, reduce_len=True, strip_handles=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: [':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


### <font color='green'>**Activity 1**</font>

Analyse the outputs from TweetTokenizer above under settings 1, 2 and 3 and identify the best setting to use, if your final aim is to predict the sentiment (positive, negative or neutral) of the tweet.

# Text Normalisation


## Lower casing

In [9]:
import string

In [10]:
sample_text = "The striped BATs are hanging on their feet for best"

In [11]:
# Any string can be lower cased using the function lower()
lower_cased_text = sample_text.lower()
print(lower_cased_text)

the striped bats are hanging on their feet for best


If you are not familiar with string methods, you can find a list of all of them in the [documentation](https://docs.python.org/3.7/library/stdtypes.html#string-methods).

## Stemming
Stemming chops off the end or beginning of words by taking into account a list of common prefixes or suffixes that could be found in that word.

The most common and effective algorithm for stemming English is <i>Porter’s algorithm</i>.

Details of the different stemmers available in NLTK can be found [here](https://www.nltk.org/howto/stem.html).
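
To illustrate the basic idea of chopping off common suffixes, here is a deliberately crude sketch. It is not Porter's algorithm: the suffix list, the minimum stem length and the name `toy_stem` are all assumptions chosen for illustration.

```python
def toy_stem(word, suffixes=("ing", "ies", "es", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suffix in suffixes:
        # only strip when enough of the word remains to be a plausible stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([toy_stem(w) for w in ["dogs", "ponies", "eating", "corpora"]])
```

The result need not be a real word ("ponies" becomes "pon" here), which is typical of stemming: the goal is to map related word forms to the same string, not to produce dictionary entries.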

In [12]:
import nltk

from nltk.stem import PorterStemmer

In [13]:
ps = PorterStemmer()

word = "dogs"
stem_word = ps.stem(word)

print(f'Stemmed word: {stem_word}')

Stemmed word: dog


In [14]:
# If you have a list of words, you need to iteratively go through each to do the conversion
sample_words = ["dogs", "ponies", "eating", "corpora"]
stem_words = [ps.stem(word) for word in sample_words]

print(f'Stemmed words: {stem_words}')

Stemmed words: ['dog', 'poni', 'eat', 'corpora']


Stemmers take a single word as the input. If you have a sentence, you need to first tokenise it.

In [15]:
sample_sentence = "The striped bats are hanging on their feet for best."

tokens = word_tokenize(sample_sentence)
stem_words = [ps.stem(word) for word in tokens]

print(f'Stemmed words: {stem_words}')

Stemmed words: ['the', 'stripe', 'bat', 'are', 'hang', 'on', 'their', 'feet', 'for', 'best', '.']


In [16]:
# If required, you can also convert the stem words into a sentence by merging them with a space between each word.
print(f'Stemmed sentence: {" ".join(stem_words)}')

Stemmed sentence: the stripe bat are hang on their feet for best .


## Lemmatisation

Lemmatisation is a more organised procedure for obtaining the base form of a word (lemma), using a vocabulary and morphological analysis (word structure and grammar relations) of words.
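
The difference from stemming can be sketched with a toy lemmatiser that only returns a candidate if it is a known word. Everything here is hypothetical and tiny: the vocabulary, the exception list for irregular forms, the suffix rules and the name `toy_lemmatise` are assumptions for illustration, not the data or logic of any real lemmatiser.

```python
# Hypothetical, tiny resources for illustration only
VOCABULARY = {"dog", "pony", "bat", "foot", "corpus", "eating"}
EXCEPTIONS = {"feet": "foot", "corpora": "corpus"}  # irregular forms
RULES = [("ies", "y"), ("es", ""), ("s", "")]       # (suffix, replacement)


def toy_lemmatise(word):
    """Return a lemma via exceptions, then suffix rules checked against a vocabulary."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # accept the candidate only if it is a known word
            if candidate in VOCABULARY:
                return candidate
    return word


print([toy_lemmatise(w) for w in ["dogs", "ponies", "eating", "corpora"]])
```

Because every candidate is checked against the vocabulary, the output is always either a known word or the unchanged input.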

### NLTK [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.wordnet.html#nltk.stem.WordNetLemmatizer.lemmatize)

In [17]:
import nltk

nltk.download('wordnet')  # NLTK module required for WordNetLemmatizer
nltk.download('omw-1.4') # NLTK module required for WordNetLemmatizer

from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [18]:
wnl = WordNetLemmatizer()

word = "dogs"
lemma_word = wnl.lemmatize(word)

print(f'Lemmatised word: {lemma_word}')

Lemmatised word: dog


If you have a list of words, you need to iteratively go through each to do the conversion.

In [19]:
sample_words = ["dogs", "ponies", "eating", "corpora"]
lemma_words = [wnl.lemmatize(word) for word in sample_words]

print(f'Lemmatised words: {lemma_words}')

Lemmatised words: ['dog', 'pony', 'eating', 'corpus']


Similar to stemmers, lemmatisers also take a single word as the input. If you have a sentence, you need to first tokenise it.

In [20]:
sample_sentence = "The striped bats are hanging on their feet for best."

tokens = word_tokenize(sample_sentence)
lemma_words = [wnl.lemmatize(word) for word in tokens]

print(f'Lemmatised words: {lemma_words}')

Lemmatised words: ['The', 'striped', 'bat', 'are', 'hanging', 'on', 'their', 'foot', 'for', 'best', '.']


In [21]:
# If required, you can also convert the lemmatised words into a sentence by merging them with a space between each word.
print(f'Lemmatised sentence: {" ".join(lemma_words)}')

Lemmatised sentence: The striped bat are hanging on their foot for best .


### spaCy Lemmatization

spaCy models are pipelines designed with multiple components.<br>
You can find more details about available pipelines and models [here](https://spacy.io/models/en).


In [22]:
import spacy
from spacy import displacy
import en_core_web_sm  # spacy model
nlp = en_core_web_sm.load()

In [23]:
sample_sentence = "The striped bats are hanging on their feet for best."

# process a sentence using the spaCy pipeline
doc = nlp(sample_sentence)
# iterate through each token in the output document (processed sentence) and get its lemmatised version
print([token.lemma_ for token in doc])

['the', 'stripe', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'good', '.']


### <font color='green'>**Activity 2**</font>

Fill in the following table by applying WordNet and spaCy lemmatisation to each given word.

|Original word | Lemmatised word (WordNetLemmatizer) | Lemmatised word (spaCy) |
|------|--------------------|------------|
|walking    | ? | ? |
|is    | ? | ? |
|main    | ? | ? |
|animals    | ? | ? |
|terrestrial    | ? | ? |
|jumping    | ? | ? |
|best    | ? | ? |
|sleeping    | ? | ? |

Which lemmatiser is the best to normalise text and why?


### NLTK WordNetLemmatizer with Part-of-Speech (PoS) tags

[Parts of speech](https://www.englishclub.com/grammar/parts-of-speech.htm) are also known as word classes or lexical categories.

By feeding the corresponding PoS tag along with the word, we can further improve the WordNetLemmatizer.

According to NLTK's [documentation](https://www.nltk.org/api/nltk.stem.wordnet.html#nltk.stem.WordNetLemmatizer.lemmatize), “n” for nouns, “v” for verbs, “a” for adjectives and “r” for adverbs are the valid PoS tag options for WordNetLemmatizer.

In [24]:
import nltk

nltk.download('wordnet')  # NLTK module required for WordNetLemmatizer
nltk.download('omw-1.4') # NLTK module required for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # NLTK module required for PoS tagger

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [25]:
lemma_word = wnl.lemmatize('ponies', pos='n')
print(lemma_word)

lemma_word = wnl.lemmatize('walking', pos='v')
print(lemma_word)

lemma_word = wnl.lemmatize('better', pos='a')
print(lemma_word)

lemma_word = wnl.lemmatize('effectively', pos='r')
print(lemma_word)

pony
walk
good
effectively


We can use the PoSTagger available with [NLTK](https://www.nltk.org/api/nltk.tag.html) to automatically identify the PoS tags.

In [26]:
nltk.pos_tag(['ponies', 'walking', 'best', 'effectively'])

[('ponies', 'NNS'), ('walking', 'VBG'), ('best', 'JJS'), ('effectively', 'RB')]

WordNetLemmatizer requires PoS tags in the format 'n', 'v', 'a' and 'r'.
However, the PoS tagger returns Penn Treebank tags such as 'NNS', 'VBG', 'JJS' and 'RB'.

Let's write a simple function to get the PoS tag of a word in the format required by WordNetLemmatizer.

In [27]:
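def get_wordnet_pos(word):
    """Return the WordNet PoS constant for a single word (defaults to noun)."""
    # pos_tag returns [(word, tag)]; keep only the first letter of the tag, e.g. 'VBG' -> 'V'
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # fall back to noun when the tag is not one of the four supported classes
    return tag_dict.get(tag, wordnet.NOUN)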

If you are not familiar with how the Python dictionary get() method works, you can find more details [here](https://www.w3schools.com/python/ref_dictionary_get.asp).

In [28]:
sample_words = ['ponies', 'walking', 'best', 'effectively']

for word in sample_words:
    pos = get_wordnet_pos(word)
    lemma_word = wnl.lemmatize(word, pos)
    print(lemma_word)

pony
walk
best
effectively
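
The get_wordnet_pos function tags each word in isolation. Since a tagger generally relies on the surrounding words, one possible variation is to tag a whole tokenised sentence once and then map each tag. The sketch below shows only the mapping step, using the tagger output shown earlier as hard-coded input so that it runs without any downloads. `penn_to_wordnet` is a hypothetical helper, and the single-letter values are the PoS options listed above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet PoS letter, defaulting to noun."""
    mapping = {"J": "a", "N": "n", "V": "v", "R": "r"}
    # only the first letter of the tag decides the word class
    return mapping.get(tag[:1].upper(), "n")


# (word, tag) pairs copied from the pos_tag output shown earlier
tagged = [('ponies', 'NNS'), ('walking', 'VBG'), ('best', 'JJS'), ('effectively', 'RB')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

With NLTK available, the pairs could instead come from `nltk.pos_tag(tokens)`, and each mapped letter could be passed as the `pos` argument of `wnl.lemmatize`.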


# Stop Word Removal

In [29]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


You do not have to create a stop word list from scratch, as NLTK provides a readily available one. However, you may need to update it depending on the problem you are trying to solve.

In [30]:
stop_words = set(stopwords.words('english'))

print(stop_words)

{"should've", 'because', 'by', "mightn't", 'do', 'doing', 'have', 'having', 'to', 'when', 'few', 'no', 'he', "wasn't", 'their', 'nor', 'just', 'those', 'at', 'has', 'between', 'where', 'his', 'theirs', "it's", 'through', 'such', 'can', 'will', 'aren', 'hers', 'are', 'both', 'i', 'while', 'more', 'you', 'only', 'further', 'themselves', 'each', "you're", 'y', "needn't", 'than', 'them', 'ain', "hadn't", 'off', 'here', 'yours', 've', 'be', 'against', 'yourself', 'below', 'my', 'how', 'again', 'what', 'hasn', 'about', 'over', 'weren', 'before', 's', "wouldn't", 'o', 'they', 'if', "weren't", 'other', 'its', 'am', "she's", 't', 'after', 'of', 'there', 'shan', 'with', 'during', 'a', 'yourselves', 'needn', 'she', 'any', 'whom', 'don', 'should', 'most', 'these', 'won', 'had', 'why', 're', 'd', "you'd", 'did', 'from', 'too', 'been', 'which', 'out', 'own', 'in', 'her', 'is', 'up', 'on', 'once', 'and', 'not', "hasn't", 'same', "isn't", 'doesn', 'didn', 'isn', 'were', 'that', "you'll", 'shouldn', 'a

Let's try to remove stopwords in a sentence.

In [31]:
sample_text = "This is a sample sentence, showing off the stop words removal."

# tokenise text
tokens = word_tokenize(sample_text)

# remove stopwords from tokens
filtered_words = [token for token in tokens if token not in stop_words]
print(filtered_words)

['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'removal', '.']
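
Note that the comparison above is case-sensitive: the NLTK list is lowercase, so 'This' survives the filter. If that is not desired, one option is to lowercase each token only for the comparison. The sketch below uses a small hand-picked subset of stop words and a hard-coded token list so that it runs without NLTK; both are assumptions for illustration.

```python
# Small illustrative subset of stop words (assumption, not the full NLTK list)
stop_words_subset = {"this", "is", "a", "off", "the"}

# Tokens of the sample sentence, hard-coded for the sketch
tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'removal', '.']

# compare the lowercased token, but keep the original token in the output
filtered_words = [t for t in tokens if t.lower() not in stop_words_subset]
print(filtered_words)
```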


### <font color='green'>**Activity 3**</font>

Update the default stop word list by removing 'off', and remove stop words in the sample_text.

**Expected output:** \['This', 'sample', 'sentence', ',', 'showing', 'off', 'stop', 'words', 'removal', '.']

**Hint:** [Python - Remove List Items](https://www.w3schools.com/python/python_lists_remove.asp)

# Punctuation Removal

In [32]:
import string

You can get a readily available string of punctuation marks from Python's string module.

In [33]:
print(f'Punctuation marks: {string.punctuation}')

Punctuation marks: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [34]:
sample_text = "Let's remove punctuation marks!"

# remove punctuation marks in sample text
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punctuation = sample_text.translate(table)

print(no_punctuation)

Lets remove punctuation marks
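
Removing every punctuation mark also turns "Let's" into "Lets". If some marks should survive, one possible approach is to exclude them from the set of characters to delete. The helper `remove_punctuation` and its `keep` parameter are hypothetical names used for this sketch.

```python
import string


def remove_punctuation(text, keep=""):
    """Delete all punctuation marks from text except the characters listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # the third argument of str.maketrans lists characters to delete
    return text.translate(str.maketrans("", "", to_remove))


print(remove_punctuation("Let's remove punctuation marks!", keep="'"))
```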


# Named Entity Recognition (NER)

Let's see how to use [spaCy](https://spacy.io/usage/linguistic-features#named-entities) models for NER.

[spaCy English Models](https://spacy.io/models/en)

In [35]:
import spacy
from spacy import displacy
import en_core_web_sm  # spacy model
nlp = en_core_web_sm.load()

In [36]:
sample_text = "Apple is looking at buying U.K. startup for $1 billion"

In [37]:
doc = nlp(sample_text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

displacy.render(doc, jupyter=True, style='ent')

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


Replace the recognised named entities in the text with their entity labels.

In [38]:
doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY MONEY MONEY


Repetitions of the same entity label, which occur when an entity spans several tokens, can be merged by adding ['merge_entities'](https://spacy.io/api/pipeline-functions#merge_entities) to the pipeline.

In [39]:
nlp.add_pipe("merge_entities")

doc = nlp(sample_text)

updated_tokens = [t.text if not t.ent_type_ else t.ent_type_ for t in doc]
updated_sentence = " ".join(updated_tokens)
print(updated_sentence)

ORG is looking at buying GPE startup for MONEY
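
An alternative that leaves the pipeline untouched is to replace entities by their character offsets, which also preserves the original spacing. The sketch below hard-codes the offsets and labels printed earlier so that it runs without spaCy; `replace_entities` is a hypothetical helper.

```python
def replace_entities(text, entities):
    """Replace each (start, end, label) span in text with its label."""
    # work from right to left so that earlier offsets stay valid
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + label + text[end:]
    return text


sample_text = "Apple is looking at buying U.K. startup for $1 billion"
# (start_char, end_char, label) triples copied from the NER output shown earlier
entities = [(0, 5, "ORG"), (27, 31, "GPE"), (44, 54, "MONEY")]
print(replace_entities(sample_text, entities))
```

With a spaCy document, the triples could be built from `ent.start_char`, `ent.end_char` and `ent.label_` for each entity in `doc.ents`.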


### <font color='green'>**Activity 4**</font>

Let's assume you need to identify the sentiment (positive, negative or neutral) of a given product review. A few sample reviews are given below.

* "Apple's new product is amazing."
* "I'm quite disappointed with recent Apple products."
* "Android products are amazing and versatile."

a) Replace the entities in these sentences using entity tags.

b) Would this replacement be helpful for sentiment identification from the perspective of a machine learning model?


