## 1. Tokenization

Tokenization is the process of breaking text into smaller pieces called tokens. These smaller pieces can be sentences, words, or sub-words. For example, the sentence “I won” can be tokenized into two word-tokens “I” and “won”.

In [1]:
import nltk
from nltk.tokenize import (TreebankWordTokenizer,
                           word_tokenize,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
sentence = "Today we would leanr about tokenization. Are you all ready?"

### A. Whitespace tokenization

This is the simplest and a commonly used form of tokenization. It splits the text on runs of whitespace, so punctuation stays attached to the neighboring word (note `'tokenization.'` and `'ready?'` in the output below).

In [3]:
print(f'Whitespace tokenization = {sentence.split()}')

Whitespace tokenization = ['Today', 'we', 'would', 'leanr', 'about', 'tokenization.', 'Are', 'you', 'all', 'ready?']


### B. Punctuation-based tokenization

Punctuation-based tokenization is slightly more advanced than whitespace tokenization: it splits on both whitespace and punctuation, and keeps the punctuation marks as separate tokens.

In [4]:
print(f'Punctuation-based tokenization = {wordpunct_tokenize(sentence)}')

Punctuation-based tokenization = ['Today', 'we', 'would', 'leanr', 'about', 'tokenization', '.', 'Are', 'you', 'all', 'ready', '?']
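The difference between the two approaches above can be reproduced with the standard library alone. This is a minimal sketch, assuming the pattern `\w+|[^\w\s]+`, which is the regular expression NLTK documents for `wordpunct_tokenize`. The `text` variable and both helper names are illustrative and not part of NLTK.

```python
import re

text = "Are you all ready?"

def whitespace_tokens(s):
    # str.split() with no arguments splits on any run of whitespace
    return s.split()

def wordpunct_tokens(s):
    # Match runs of word characters, or runs of non-word, non-space characters
    return re.findall(r"\w+|[^\w\s]+", s)

print(whitespace_tokens(text))  # ['Are', 'you', 'all', 'ready?']
print(wordpunct_tokens(text))   # ['Are', 'you', 'all', 'ready', '?']
```

Because the second pattern matches runs of punctuation, consecutive marks such as `?!` would come back as a single token.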


### C. Default/TreebankWordTokenizer

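The Treebank tokenizer splits text using regular expressions that follow the conventions of the Penn Treebank (based on English text); NLTK's recommended `word_tokenize` function builds on the same rules. It assumes that the text has already been split into sentences, which is why the period after "tokenization" stays attached in the output below: only sentence-final punctuation is separated.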

In [5]:
tokenizer = TreebankWordTokenizer()
print(f'Default/Treebank tokenization = {tokenizer.tokenize(sentence)}')

Default/Treebank tokenization = ['Today', 'we', 'would', 'leanr', 'about', 'tokenization.', 'Are', 'you', 'all', 'ready', '?']


### D. TweetTokenizer


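Special texts such as tweets have a characteristic structure (hashtags, @-mentions, emoticons), and the generic tokenizers above often fail to produce useful tokens on such data. The example sentence contains none of these features, so the result below matches the punctuation-based output.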

In [6]:
tokenizer = TweetTokenizer()
print(f'Tweet-rules based tokenization = {tokenizer.tokenize(sentence)}')

Tweet-rules based tokenization = ['Today', 'we', 'would', 'leanr', 'about', 'tokenization', '.', 'Are', 'you', 'all', 'ready', '?']
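To see tweet-aware rules make a difference, the input needs Twitter-style markup. The following is a toy sketch using only the standard library. It is not NLTK's actual rule set: the pattern, the helper name `toy_tweet_tokens`, and the sample tweet are all assumptions made for illustration.

```python
import re

# Hypothetical, simplified pattern: @-mentions, hashtags, a few emoticons,
# words, and finally any single punctuation character.
TWEET_PATTERN = re.compile(r"@\w+|#\w+|[:;=][-']?[)(DPp]|\w+|[^\w\s]")

def toy_tweet_tokens(s):
    # findall returns the full matches because the pattern has no groups
    return TWEET_PATTERN.findall(s)

tweet = "@nlp_fan loving #NLP :) so much!"

print(toy_tweet_tokens(tweet))
# ['@nlp_fan', 'loving', '#NLP', ':)', 'so', 'much', '!']

# A generic punctuation-based pattern separates '@' and '#' from their words
print(re.findall(r"\w+|[^\w\s]+", tweet))
# ['@', 'nlp_fan', 'loving', '#', 'NLP', ':)', 'so', 'much', '!']
```

Keeping `@nlp_fan` and `#NLP` intact preserves the information that these tokens are a mention and a hashtag, which the generic pattern discards.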


### E. MWETokenizer

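The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions, which are joined with an underscore by default. In the example below, the registered expression ('Martha', 'Jones') does not occur in the sentence, so the tokens come back unchanged.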

In [9]:
tokenizer = MWETokenizer()
tokenizer.add_mwe(('Martha', 'Jones'))
print(f'Multi-word expression (MWE) tokenization = {tokenizer.tokenize(word_tokenize(sentence))}')

Multi-word expression (MWE) tokenization = ['Today', 'we', 'would', 'leanr', 'about', 'tokenization', '.', 'Are', 'you', 'all', 'ready', '?']
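The re-grouping step can be sketched in plain Python to show what happens when a registered expression does occur in the input. This is a simplified illustration, not NLTK's implementation: the function `merge_mwes` and the sample sentence are hypothetical, and the underscore separator is chosen to mirror `MWETokenizer`'s default.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge known multi-word expressions in a token list."""
    # Try longer expressions first so they win over shorter overlapping ones;
    # empty expressions are skipped so the loop always advances.
    ordered = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts at this position; keep the token as-is
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "Martha Jones is ready".split()
print(merge_mwes(tokens, [("Martha", "Jones")]))
# ['Martha_Jones', 'is', 'ready']
```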


## 2. Stemming and Lemmatization

### A. Stemming 

Stemming is the process of reducing the morphological variants of a word to a common root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for [boat, boater, boating, boats]. Stemming is a somewhat crude method for cataloging related words; it essentially chops off letters from the end until the stem is reached.
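The "chop off letters from the end" idea can be sketched in a few lines. This toy stemmer is only an illustration with a hand-picked suffix list (an assumption, not the Porter or Snowball rules), and `toy_stem` is a hypothetical name.

```python
def toy_stem(word, suffixes=("ing", "er", "s"), min_stem=3):
    # Strip the first matching suffix, keeping at least min_stem letters
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["boat", "boater", "boating", "boats"]:
    print(w, "-->", toy_stem(w))
# boat --> boat
# boater --> boat
# boating --> boat
# boats --> boat
```

The same rules turn "running" into "runn", which shows why real stemmers need many more rules than plain suffix removal.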

#### i) Porter Stemmer

In [10]:
# Import the Porter stemmer explicitly instead of using a wildcard import
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    # Print each word next to its Porter stem
    print(f'{word} --> {p_stemmer.stem(word)}')

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


#### ii) Snowball Stemmer

In [12]:
from nltk.stem.snowball import SnowballStemmer

# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')
words = ['run','runner','running','ran','runs','easily','fairly']
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


In this case, the Snowball stemmer produced the same results as the Porter stemmer, except that it handled “fairly” more appropriately, reducing it to “fair” rather than “fairli”.

### B. Lemmatization 

In contrast to stemming, lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’.
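A vocabulary-based lookup can be sketched as a dictionary from surface forms to lemmas. The table below is a hypothetical, hand-written toy; real lemmatizers rely on much larger vocabularies and on part-of-speech information.

```python
# Hypothetical lookup table for illustration only
TOY_LEMMAS = {"was": "be", "mice": "mouse", "saw": "see"}

def toy_lemma(word):
    # Fall back to the lowercased word when it is not in the table
    key = word.lower()
    return TOY_LEMMAS.get(key, key)

for w in ["was", "mice", "today"]:
    print(w, "-->", toy_lemma(w))
# was --> be
# mice --> mouse
# today --> today
```

No amount of suffix stripping maps "mice" to "mouse" or "was" to "be", which is why lemmatization needs vocabulary knowledge rather than string rules alone.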

Lemmatization is typically seen as much more informative than simple stemming, which is why spaCy provides only lemmatization and does not include a stemmer.

In [13]:
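# Perform standard imports:
import spacy

# Load the small English pipeline (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

def show_lemmas(text):
    # Print each token's text, part of speech, lemma hash, and lemma string
    for token in text:
        print(f'{token.text:<12} {token.pos_:<6} {token.lemma:<22} {token.lemma_}')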

In [14]:
doc = nlp("I saw eighteen mice today!")

show_lemmas(doc)

I            PRON   561228191312463089     -PRON-
saw          VERB   11925638236994514241   see
eighteen     NUM    9609336664675087640    eighteen
mice         NOUN   1384165645700560590    mouse
today        NOUN   11042482332948150395   today
!            PUNCT  17494803046312582752   !
