
# Tokenisation
Tokenisation is the process of dividing text into smaller units called tokens, such as words, punctuation marks, or emoticons.
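
As a rough illustration of the idea, independent of NLTK, the sketch below compares a naive whitespace split with a small regular expression. The example sentence and the regex are assumptions chosen for illustration, not the rules NLTK applies.

```python
import re

text = "Hello, world! Tokenise me."

# Naive approach: split on whitespace; punctuation stays glued to the words.
whitespace_tokens = text.split()

# Regex approach: a run of word characters, or one non-space symbol.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split keeps `'Hello,'` as one token, while the regex separates the comma, which is closer to what a word tokeniser is expected to do.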

Let's see how to use the [tokenizers](https://www.nltk.org/api/nltk.tokenize.html) available in the NLTK (Natural Language Toolkit) package to tokenise text.

In [1]:
sample_text = "This is a sentence, which contains all kind of 'words', and needs to be tokenized!"
sample_tweet1 = "This is a cooool :-) :-P <3 #cool"
sample_tweet2 = "@remy: This is waaaaayyyy too much for you!!!!!!"

Tokenising normal text

In [3]:
# download the 'punkt' tokenizer models (recent NLTK releases may ask for 'punkt_tab' instead)
import nltk
nltk.download('punkt')

from nltk.tokenize import word_tokenize

tokenized_text = word_tokenize(sample_text)
print(tokenized_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['This', 'is', 'a', 'sentence', ',', 'which', 'contains', 'all', 'kind', 'of', "'words", "'", ',', 'and', 'needs', 'to', 'be', 'tokenized', '!']


Tokenising tweets

In [4]:
tokenized_tweet1 = word_tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = word_tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':', '-', ')', ':', '-P', '<', '3', '#', 'cool']
tokenized tweet2: ['@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']


As you can see in the above outputs, <i>word_tokenize</i> does not tokenise tweet text correctly: it splits emoticons, hashtags and user handles into separate characters.
Because tweet text differs from normal text in these ways, NLTK provides another tokenizer named <i>TweetTokenizer</i>.

In [5]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')

tokenized tweet1: ['This', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
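
The idea behind a tweet-aware tokeniser can be sketched with a single regular expression that tries emoticons, handles and hashtags before falling back to ordinary words. The pattern and the function name `simple_tweet_tokenize` below are hypothetical and far simpler than what <i>TweetTokenizer</i> uses internally; they only illustrate the ordering of alternatives.

```python
import re

# Alternatives are tried left to right, so tweet-specific patterns come first.
TWEET_TOKEN = re.compile(r"""
    [:;=][-o]?[)(DPp]      # a few western emoticons, e.g. a smiley with a nose
  | <3                     # heart
  | [@\#]\w+               # @handles and #hashtags
  | \w+(?:'\w+)?           # words, optionally with an internal apostrophe
  | \S                     # any other single non-space character
""", re.VERBOSE)


def simple_tweet_tokenize(text):
    """Return all non-overlapping matches of TWEET_TOKEN in text."""
    return TWEET_TOKEN.findall(text)


print(simple_tweet_tokenize("This is a cooool :-) :-P <3 #cool"))
```

Unlike <i>TweetTokenizer</i>, this sketch does not shorten runs of repeated punctuation, so a tweet ending in six exclamation marks would yield six separate tokens.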


Let us look at more options available with TweetTokenizer.
- preserve_case (default setting=True) - Keep the original casing of the text; when set to False, tokens are converted to lowercase (emoticons such as ':-P' are left unchanged)
- reduce_len (default setting=False) - Normalise text by replacing repeated character sequences of length 3 or greater with sequences of length 3
- strip_handles (default setting=False) - Remove Twitter usernames (handles) from the text
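
To make the last two options more concrete, here is a minimal sketch of what they do, written with plain regular expressions. The helper names `shorten_repeats` and `drop_handles` are hypothetical, and the handle pattern is a simplification that assumes handles are an `@` followed by up to 15 word characters.

```python
import re


def shorten_repeats(text):
    """Replace any character repeated 3+ times in a row with exactly 3 copies."""
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


def drop_handles(text):
    """Remove @handles that do not follow a word character (simplified rule)."""
    return re.sub(r"(?<!\w)@\w{1,15}", "", text)


print(shorten_repeats("waaaaayyyy"))
print(drop_handles("@remy: This is too much"))
```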

In [6]:
# convert the tokens to lowercase
print('configs: preserve_case=False')
tknzr = TweetTokenizer(preserve_case=False)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


# convert the tokens to lowercase and shorten repeated characters
print('\nconfigs: preserve_case=False, reduce_len=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


# convert the tokens to lowercase, shorten repeated characters and remove usernames
print('\nconfigs: preserve_case=False, reduce_len=True, strip_handles=True')
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)

tokenized_tweet1 = tknzr.tokenize(sample_tweet1)
print(f'tokenized tweet1: {tokenized_tweet1}')

tokenized_tweet2 = tknzr.tokenize(sample_tweet2)
print(f'tokenized tweet2: {tokenized_tweet2}')


configs: preserve_case=False
tokenized tweet1: ['this', 'is', 'a', 'cooool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']

configs: preserve_case=False, reduce_len=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: ['@remy', ':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']

configs: preserve_case=False, reduce_len=True, strip_handles=True
tokenized tweet1: ['this', 'is', 'a', 'coool', ':-)', ':-P', '<3', '#cool']
tokenized tweet2: [':', 'this', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']


# Text Normalisation


## Lower casing

In [7]:
# Any string can be lower cased using the string method lower()
lower_cased_text = sample_text.lower()
print(lower_cased_text)

this is a sentence, which contains all kind of 'words', and needs to be tokenized!
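
For text that is not purely English, Python also offers `str.casefold()`, a more aggressive variant of `lower()` intended for caseless comparison. The short sketch below uses two spellings of a German word as an assumed example.

```python
words = ["Straße", "STRASSE"]

# lower() keeps the 'ß', so the two spellings still differ
lowered = [w.lower() for w in words]

# casefold() maps 'ß' to 'ss', so the two spellings become identical
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
```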


## Stemming
Stemming chops off the end or beginning of a word, using a list of common suffixes and prefixes, to reduce it to a stem. The stem need not be a dictionary word.

A widely used and effective algorithm for stemming English is <i>Porter’s algorithm.</i>

In [8]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

sample_words = ["ponies", "cats", "eating", "ate", "eat", "goose", "geese"]

# stem each word in the list
stem_words = [ps.stem(word) for word in sample_words]

print(f'Stemmed words: {stem_words}')

Stemmed words: ['poni', 'cat', 'eat', 'ate', 'eat', 'goos', 'gees']
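
Outputs such as 'poni' and 'cat' can be traced back to simple suffix rules. The sketch below implements only step 1a of Porter's algorithm (plural handling) as described in the original algorithm; the function name is hypothetical, and NLTK's <i>PorterStemmer</i> applies several further steps and may differ in details.

```python
def porter_step_1a(word):
    """Apply the plural-stripping rules of Porter's step 1a to one word."""
    if word.endswith("sses"):
        return word[:-2]      # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]      # ponies -> poni
    if word.endswith("ss"):
        return word           # caress -> caress
    if word.endswith("s"):
        return word[:-1]      # cats -> cat
    return word


for w in ["ponies", "cats", "caresses", "caress", "eat"]:
    print(w, "->", porter_step_1a(w))
```

The rules are checked from the longest suffix to the shortest, so 'caresses' is handled by the 'sses' rule rather than by the plain 's' rule.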


## Lemmatisation

Lemmatisation is a more organised procedure that obtains the base form of a word (its lemma) using a vocabulary and morphological analysis (word structure and grammatical relations) of words.

In [9]:
# download the 'wordnet' corpus
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer


wnl = WordNetLemmatizer()

sample_words = ["ponies", "cats", "eating", "ate", "eat", "goose", "geese"]

# lemmatise each word in the list (no PoS tag given)
lemma_words = [wnl.lemmatize(word) for word in sample_words]

print(f'Lemmatised words: {lemma_words}')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Lemmatised words: ['pony', 'cat', 'eating', 'ate', 'eat', 'goose', 'goose']


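Compare the outputs of the stemmer and the lemmatiser: the stemmer produces truncated forms such as 'poni', 'goos' and 'gees', whereas the lemmatiser returns dictionary words such as 'pony' and 'goose' but leaves 'eating' and 'ate' unchanged.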


<b>Lemmatisation with Part-of-Speech (PoS) tags</b>

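[Parts of speech](https://www.englishclub.com/grammar/parts-of-speech.htm) are also known as word classes or lexical categories. By default, <i>WordNetLemmatizer</i> treats every word as a noun, which is why 'eating' and 'ate' were left unchanged above; passing the correct tag through the <i>pos</i> argument lets it find the verb lemma.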

In [10]:
# pass the PoS tag explicitly: 'n' = noun, 'v' = verb
lemma_word = wnl.lemmatize('ponies', pos='n')
print(lemma_word)

lemma_word = wnl.lemmatize('eating', pos='v')
print(lemma_word)

lemma_word = wnl.lemmatize('ate', pos='v')
print(lemma_word)

lemma_word = wnl.lemmatize('eat', pos='v')
print(lemma_word)


pony
eat
eat
eat
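
The effect of the PoS tag can be mimicked with a toy lookup table keyed by both the word and its tag. The vocabulary `TOY_LEMMAS` and the function `toy_lemmatize` below are hypothetical and hold only a handful of entries; a real lemmatiser relies on a full dictionary and morphological rules.

```python
# Hypothetical miniature vocabulary: (inflected form, PoS tag) -> lemma
TOY_LEMMAS = {
    ("ponies", "n"): "pony",
    ("geese", "n"): "goose",
    ("eating", "v"): "eat",
    ("ate", "v"): "eat",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when unknown."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("ate"))           # looked up as a noun
print(toy_lemmatize("ate", pos="v"))  # looked up as a verb
```

With the default noun tag, 'ate' has no entry and is returned unchanged; with the verb tag, the lookup succeeds and returns 'eat'.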


# Stop Word Removal

In [11]:
# download the 'stopwords' corpus
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# define set of English stopwords
stop_words = set(stopwords.words('english')) 

sample_text = "This is a sample sentence, showing off the stop words removal."

# tokenise text
tokens = word_tokenize(sample_text)

# remove stopwords from tokens (the comparison is case-sensitive)
filtered_words = [token for token in tokens if token not in stop_words]
print(filtered_words)

# join tokens into a sentence using space
merged_text = " ".join(filtered_words)
print(merged_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'removal', '.']
This sample sentence , showing stop words removal .
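
Note that 'This' survives in the output above: NLTK's English stop-word list is lowercase, and the membership test is case-sensitive. One possible fix is to lowercase each token before the lookup. The sketch below uses a tiny hand-written stop-word set and a pre-tokenised list as stand-ins (both are assumptions for illustration) so that it runs without any downloads.

```python
# Hypothetical stand-in for the NLTK stop-word list (lowercase entries only)
TOY_STOP_WORDS = {"this", "is", "a", "off", "the"}

tokens = ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing',
          'off', 'the', 'stop', 'words', 'removal', '.']


def remove_stop_words(tokens, stop_words):
    """Keep tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words(tokens, TOY_STOP_WORDS)
print(filtered)
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain.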


# Punctuation Removal

In [12]:
import string

sample_text = "Let's remove punctuation marks!"

print(f'Punctuation marks to remove: {string.punctuation}')

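# remove punctuation marks from the sample text:
# build a translation table that maps every punctuation character to None (i.e. deletes it)
table = str.maketrans(dict.fromkeys(string.punctuation))
no_punctuation = sample_text.translate(table)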

print(no_punctuation)

Punctuation marks to remove: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Lets remove punctuation marks
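
Removing every punctuation mark also strips the apostrophe, turning "Let's" into "Lets". If that is unwanted, some characters can be excluded from the removal set. The function `strip_punctuation` and its `keep` parameter below are hypothetical names used for this sketch.

```python
import string


def strip_punctuation(text, keep=""):
    """Delete all ASCII punctuation from text except characters listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to delete
    return text.translate(str.maketrans("", "", to_remove))


sample = "Let's remove punctuation marks!"
print(strip_punctuation(sample))            # apostrophe removed as well
print(strip_punctuation(sample, keep="'"))  # apostrophe preserved
```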
