In [2]:
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer, TweetTokenizer, MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

In [3]:
# sample sentence
sentence = "This is a sample sentence for tokenization, stemming, and lemmatization using NLTK library in Python!"

# Tokenization

In [4]:
# Whitespace tokenization
whitespace_tokens = sentence.split()
print("Whitespace Tokenization: ", whitespace_tokens)

Whitespace Tokenization:  ['This', 'is', 'a', 'sample', 'sentence', 'for', 'tokenization,', 'stemming,', 'and', 'lemmatization', 'using', 'NLTK', 'library', 'in', 'Python!']


<b>Punctuation-based tokenization:</b>
Punctuation-based tokenization goes a step beyond whitespace tokenization: it splits on both whitespace and punctuation, and keeps each punctuation mark as a separate token.

In [5]:
# Punctuation-based tokenization
punctuation_tokens = word_tokenize(sentence)
print("Punctuation-based Tokenization: ", punctuation_tokens)

Punctuation-based Tokenization:  ['This', 'is', 'a', 'sample', 'sentence', 'for', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'using', 'NLTK', 'library', 'in', 'Python', '!']
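
To make the contrast with whitespace splitting concrete, here is a minimal sketch that approximates punctuation-based tokenization with a regular expression from the standard library. The pattern is an illustrative assumption, not what word_tokenize does internally; it only mimics the behaviour shown above for simple sentences like this one.

```python
import re

sentence = "This is a sample sentence for tokenization, stemming, and lemmatization using NLTK library in Python!"

# Hypothetical pattern: a run of word characters, or a single character
# that is neither a word character nor whitespace (i.e. punctuation)
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

split_tokens = sentence.split()
regex_tokens = TOKEN_PATTERN.findall(sentence)

# Whitespace splitting leaves punctuation glued to the word
print("Last whitespace token: ", split_tokens[-1])
# The regex keeps the punctuation mark, but as its own token
print("Last two regex tokens: ", regex_tokens[-2:])
```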


<b>Treebank Tokenization:</b>
This technique separates punctuation and clitics (contracted forms attached to another word, such as the 'm in I'm or the n't in don't) into their own tokens, while keeping hyphenated words together.

In [6]:
# Treebank tokenization
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(sentence)
print("Treebank Tokenization: ", treebank_tokens)

Treebank Tokenization:  ['This', 'is', 'a', 'sample', 'sentence', 'for', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'using', 'NLTK', 'library', 'in', 'Python', '!']
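
The sample sentence contains no contractions or hyphens, so the Treebank output above looks the same as the punctuation-based one. The sketch below uses a made-up sentence (an assumption for illustration only) to show how clitics and hyphenated words are treated.

```python
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical example sentence with two clitics and one hyphenated word
clitic_sentence = "I'm sure they don't sell state-of-the-art phones."

clitic_tokens = TreebankWordTokenizer().tokenize(clitic_sentence)
print("Treebank Tokenization: ", clitic_tokens)

# Clitics become separate tokens, the hyphenated word stays in one piece
print("'m" in clitic_tokens, "n't" in clitic_tokens, "state-of-the-art" in clitic_tokens)
```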


<b>Tweet Tokenization:</b>
NLTK provides a dedicated class, TweetTokenizer, for splitting tweets into relevant tokens. Its advantage over the regular word_tokenize is that tweets often contain emojis, hashtags, and user mentions, which need to be kept intact rather than split apart.

In [7]:
# Tweet tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(sentence)
print("Tweet Tokenization: ", tweet_tokens)

Tweet Tokenization:  ['This', 'is', 'a', 'sample', 'sentence', 'for', 'tokenization', ',', 'stemming', ',', 'and', 'lemmatization', 'using', 'NLTK', 'library', 'in', 'Python', '!']
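
The sample sentence has nothing tweet-specific in it, which is why the output above matches the earlier tokenizers. As a rough illustration, the sketch below runs a made-up tweet (an assumption, not data from this notebook) through TweetTokenizer and TreebankWordTokenizer, and also tries the strip_handles and reduce_len options.

```python
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer

# Hypothetical tweet containing a hashtag, an emoticon and a user handle
tweet = "Loving #NLTK :-) thanks @nltk_org"

hashtag_tokens = TweetTokenizer().tokenize(tweet)
tweet_tb_tokens = TreebankWordTokenizer().tokenize(tweet)

print("Tweet Tokenization:    ", hashtag_tokens)
print("Treebank Tokenization: ", tweet_tb_tokens)

# strip_handles=True drops @mentions; reduce_len=True shortens runs of
# three or more repeated characters down to three
clean_tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
clean_tokens = clean_tokenizer.tokenize("@nltk_org this is sooooooo cool")
print("Cleaned tweet tokens:  ", clean_tokens)
```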


<b>Multi-word expression Tokenization:</b>
The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.

In [8]:
# Multi-word expression tokenization
mwe_tokenizer = MWETokenizer([('NLTK', 'library')]) # 'NLTK' and 'library' are merged into the single token 'NLTK_library' in the output
mwe_tokens = mwe_tokenizer.tokenize(sentence.split())
print("MWE Tokenization: ", mwe_tokens)

MWE Tokenization:  ['This', 'is', 'a', 'sample', 'sentence', 'for', 'tokenization,', 'stemming,', 'and', 'lemmatization', 'using', 'NLTK_library', 'in', 'Python!']
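
Because the cell above feeds sentence.split() into the MWE tokenizer, punctuation stays attached to words such as 'tokenization,'. A possible variation, sketched here under the assumption that Treebank tokenization is the tokenizer of choice, runs the two steps in sequence. The extra expression ('sample', 'sentence') is added purely for illustration.

```python
from nltk.tokenize import MWETokenizer, TreebankWordTokenizer

sentence = "This is a sample sentence for tokenization, stemming, and lemmatization using NLTK library in Python!"

# Step 1: split punctuation off first, using a tokenizer of choice
base_tokens = TreebankWordTokenizer().tokenize(sentence)

# Step 2: re-group selected token sequences into multi-word expressions
regrouping_tokenizer = MWETokenizer([('NLTK', 'library')], separator='_')
regrouping_tokenizer.add_mwe(('sample', 'sentence'))  # expressions can also be added later

regrouped_tokens = regrouping_tokenizer.tokenize(base_tokens)
print("MWE after Treebank: ", regrouped_tokens)
```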


# Stemming

<b>Porter Stemmer:</b>
It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity.

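Example: the rule EED -> EE means “if the word ends in EED and the part before that ending contains at least one vowel followed by a consonant, change the ending to EE”, so this rule turns
‘<b>agreed</b>’ into ‘<b>agree</b>’ (later steps of the algorithm may shorten the word further).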

In [9]:
# Stemming using Porter Stemmer
porter_stemmer = PorterStemmer()
stemmed_tokens_porter = [porter_stemmer.stem(token) for token in punctuation_tokens]
print("Stemming using Porter Stemmer: ", stemmed_tokens_porter)

Stemming using Porter Stemmer:  ['thi', 'is', 'a', 'sampl', 'sentenc', 'for', 'token', ',', 'stem', ',', 'and', 'lemmat', 'use', 'nltk', 'librari', 'in', 'python', '!']
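
The EED -> EE rule described above can be sketched in plain Python. This is a simplified illustration of one rule only, not NLTK's implementation; the helper names (is_consonant, measure, apply_eed_rule) are hypothetical. The measure counts vowel-then-consonant sequences in the part of the word before the suffix, and the rule fires only when that count is greater than zero.

```python
import re

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the vowel-run + consonant-run (VC) sequences in the stem."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", pattern))
    return collapsed.count("VC")

def apply_eed_rule(word):
    """Apply only the rule (m > 0) EED -> EE; leave other words unchanged."""
    if word.endswith("eed"):
        stem = word[:-3]
        if measure(stem) > 0:
            return stem + "ee"
    return word

print(apply_eed_rule("agreed"))  # stem 'agr' has measure 1, so the rule applies
print(apply_eed_rule("feed"))    # stem 'f' has measure 0, so the word is unchanged
```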


<b>Snowball Stemmer:</b>

<i>multi-lingual stemmer</i>

<i>a revised version of the Porter Stemmer, also referred to as the Porter2 Stemmer, and often described as more aggressive than the original</i>

In [10]:
# Stemming using Snowball Stemmer
snowball_stemmer = SnowballStemmer('english')
stemmed_tokens_snowball = [snowball_stemmer.stem(token) for token in punctuation_tokens]
print("Stemming using Snowball Stemmer: ", stemmed_tokens_snowball)

Stemming using Snowball Stemmer:  ['this', 'is', 'a', 'sampl', 'sentenc', 'for', 'token', ',', 'stem', ',', 'and', 'lemmat', 'use', 'nltk', 'librari', 'in', 'python', '!']
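
The two outputs above differ only in the first token ('thi' versus 'this'). To compare the stemmers on a few more words, and to see which languages the Snowball stemmer supports, something like the following sketch could be used. The extra words are chosen for illustration and are not part of the sample sentence.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# 'This' comes from the sample sentence; the other words are illustrative
for word in ['This', 'generously', 'fairly']:
    print(word, '->', porter.stem(word), '|', snowball.stem(word))

# Languages for which NLTK ships a Snowball stemmer
print(SnowballStemmer.languages)
```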


# Lemmatization

<b>Lemmatization using WordNetLemmatizer:</b><br>
Lemmatization reduces a word to its dictionary form (lemma) by looking it up in a vocabulary rather than stripping suffixes.<br>
WordNet is a publicly available lexical database of English (wordnets for many other languages exist as related projects) that provides semantic relationships between its words.

WordNet links words through semantic relations (e.g. synonymy).<br>
It groups synonyms into synsets.<br>
synset: a set of words that are semantically equivalent.

In [11]:
# Lemmatization using WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
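lemmatized_tokens = []
for token in punctuation_tokens:
    # Tag each token on its own (without sentence context) and keep the first
    # letter of its Penn Treebank tag, lowercased: 'NN' -> 'n', 'VBG' -> 'v', 'JJ' -> 'j'
    tag_initial = nltk.pos_tag([token])[0][1][0].lower()
    if tag_initial == 'j':
        wordnet_pos = 'a'  # WordNet uses 'a' for adjectives
    elif tag_initial in ('v', 'n'):
        wordnet_pos = tag_initial
    else:
        wordnet_pos = 'n'  # fall back to noun for every other tag
    lemma = wordnet_lemmatizer.lemmatize(token, wordnet_pos)
    lemmatized_tokens.append(lemma)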
print("Lemmatization using WordNetLemmatizer: ", lemmatized_tokens)

Lemmatization using WordNetLemmatizer:  ['This', 'be', 'a', 'sample', 'sentence', 'for', 'tokenization', ',', 'stem', ',', 'and', 'lemmatization', 'use', 'NLTK', 'library', 'in', 'Python', '!']
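
The tag conversion inside the loop above can also be written as a small reusable helper. The sketch below is one possible version; the function name penn_to_wordnet is hypothetical, and the adverb branch is an addition that the original loop does not have.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if penn_tag.startswith('J'):
        return 'a'  # adjective
    if penn_tag.startswith('V'):
        return 'v'  # verb
    if penn_tag.startswith('RB'):
        return 'r'  # adverb (an addition; the loop above maps these to 'n')
    return 'n'      # nouns and every other tag fall back to noun

for tag in ['JJ', 'VBG', 'RB', 'NN', 'IN']:
    print(tag, '->', penn_to_wordnet(tag))
```

The returned letter can be passed as the second argument to wordnet_lemmatizer.lemmatize(token, pos), provided the WordNet data has been downloaded.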
