In [1]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

In [2]:
sentence = "Don’t be disheartened, try and try until you succeed! To resubmit your project or view other projects go to your projects dashboard."

# Tokenization

Tokenization is usually the first step of an NLP pipeline. It breaks text into a list of words or sentences for further processing.

### 1. word_tokenize

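word_tokenize splits text into a list of word tokens. It does more than split on spaces: punctuation marks become separate tokens, as the output below shows.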

In [3]:
words = word_tokenize(sentence)
words

['Don',
 '’',
 't',
 'be',
 'disheartened',
 ',',
 'try',
 'and',
 'try',
 'until',
 'you',
 'succeed',
 '!',
 'To',
 'resubmit',
 'your',
 'project',
 'or',
 'view',
 'other',
 'projects',
 'go',
 'to',
 'your',
 'projects',
 'dashboard',
 '.']
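
To make the "more than spaces" point concrete, here is a minimal regex sketch that keeps punctuation as separate tokens. `simple_word_tokenize` is a hypothetical helper written for this illustration, not part of NLTK, and the real word_tokenize applies more elaborate rules.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+      -> a run of letters, digits or underscores
    # [^\w\s]  -> one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("Don’t be disheartened, try and try until you succeed!")
print(tokens)
```

On this sentence the sketch also splits the curly apostrophe into its own token, but that agreement should not be expected to hold for arbitrary text.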

### 2. sent_tokenize
sent_tokenize is a sentence tokenizer: it breaks a paragraph into sentences.

In [4]:
sent = sent_tokenize(sentence)
sent

['Don’t be disheartened, try and try until you succeed!',
 'To resubmit your project or view other projects go to your projects dashboard.']
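
For intuition only, a naive sentence splitter can be sketched with a single regular expression. `simple_sent_tokenize` is a hypothetical helper, not how sent_tokenize is implemented, and it shows why a purely rule-based split is fragile.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

paragraph = ("Don’t be disheartened, try and try until you succeed! "
             "To resubmit your project or view other projects go to your projects dashboard.")
sentences = simple_sent_tokenize(paragraph)
print(len(sentences))

# An abbreviation fools the naive rule: "Dr." is treated as a sentence end.
print(simple_sent_tokenize("Dr. Smith arrived. He sat down."))
```

The second call splits "Dr." off as its own sentence, which is the kind of case a dedicated sentence tokenizer is meant to handle.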

# Stemming

Stemming is the process of reducing inflected words to their root form, mapping a group of related words to the same stem even if the stem itself is not a valid word in the language.

It is typically the second stage of an NLP pipeline: we apply stemming to the tokenized words to convert them into root words.

### 1. PorterStemmer

PorterStemmer is the oldest of the three, originally developed in 1979. It uses suffix stripping to produce stems. **The PorterStemmer algorithm does not follow linguistic rules; instead it applies a set of rules for different cases in five phases (step by step) to generate stems**. This is why PorterStemmer often generates stems that are not actual English words. It does not keep a lookup table of actual stems but applies algorithmic rules, using them to decide whether it is wise to strip a suffix. Rule sets of this kind can be written for other languages as well, which is why NLTK also provides SnowballStemmer, which can be used to create non-English stemmers.
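
As a small illustration of rule-based suffix stripping, the sketch below implements only step 1a of the published Porter algorithm (the plural rules). The function name `porter_step_1a` is made up for this example, and library implementations such as NLTK's may add their own refinements on top of the original rules.

```python
# Step 1a of the Porter algorithm: rules are tried in order and the
# first matching suffix wins (longer suffixes are listed first).
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects a final "ss")
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply the first matching plural rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note that "poni" is not an English word, which is exactly the behaviour described above: the rules are mechanical and do not consult a dictionary.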


**PorterStemmer is known for its simplicity and speed**. It is commonly used in **Information Retrieval environments**, known as **IR environments**, for fast recall and fetching of search queries. In a typical IR environment, documents are represented as vectors of words or terms, and words with the same stem are assumed to have a similar meaning.

In [5]:
ps = PorterStemmer()

for word in words:
    rootWord = ps.stem(word)
    print(rootWord)

don
’
t
be
dishearten
,
tri
and
tri
until
you
succeed
!
To
resubmit
your
project
or
view
other
project
go
to
your
project
dashboard
.


### 2. LancasterStemmer (Paice-Husk Stemmer)

LancasterStemmer was developed in 1990 and takes a more aggressive approach than the Porter stemming algorithm. The LancasterStemmer (Paice-Husk stemmer) is an iterative algorithm whose rules are stored externally, in a single table of about 120 rules indexed by the last letter of a suffix. On each iteration, it tries to find an applicable rule using the last character of the word. Each rule specifies either the deletion or the replacement of an ending. If there is no such rule, the algorithm terminates. It also terminates if a word starts with a vowel and only two letters are left, or if a word starts with a consonant and only three characters are left. Otherwise, the rule is applied and the process repeats.

**LancasterStemmer is simple, but its repeated iterations make it a heavy stemmer, and over-stemming may occur. Over-stemmed words may not be linguistically valid, or they may have no meaning.**
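
The iterative loop described above can be sketched in a few lines. Everything here is an assumption made for illustration: the four rules in `RULES` are invented (the real Paice-Husk table has about 120 entries), and `is_acceptable` is a simplified version of the length checks.

```python
VOWELS = set("aeiou")

# Hypothetical mini rule table (NOT the real Paice-Husk table),
# indexed by the last letter of the ending: letter -> [(ending, replacement)].
RULES = {
    "d": [("ed", "")],
    "g": [("ing", "")],
    "s": [("s", "")],
    "y": [("ly", "")],
}

def is_acceptable(stem):
    """Simplified length check: 2+ letters if vowel-initial, else 3+."""
    if not stem:
        return False
    min_len = 2 if stem[0] in VOWELS else 3
    return len(stem) >= min_len

def iterative_stem(word):
    """Keep applying rules, looked up by last letter, until none applies."""
    word = word.lower()
    while True:
        for ending, replacement in RULES.get(word[-1:], []):
            candidate = word[: len(word) - len(ending)] + replacement
            if word.endswith(ending) and is_acceptable(candidate):
                word = candidate
                break  # a rule fired: repeat with the new last letter
        else:
            return word  # no applicable rule: stop

for w in ["hangings", "friendly", "needed", "sing"]:
    print(w, "->", iterative_stem(w))
```

"hangings" is stripped twice ("s", then "ing"), which shows how repeated passes make this style of stemmer heavier than a single-pass one, while "sing" is left alone because the remaining stem would be too short.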

In [6]:
ls = LancasterStemmer()

for word in words:
    rootWord = ls.stem(word)
    print(rootWord)

don
’
t
be
disheart
,
try
and
try
until
you
success
!
to
resubmit
yo
project
or
view
oth
project
go
to
yo
project
dashboard
.


### 3. SnowballStemmer

This algorithm is also known as the Porter2 stemming algorithm. It is widely regarded as better than the original Porter stemmer, a view shared by the author of the Porter stemmer himself. That said, it is also more aggressive than the Porter stemmer. Many of the changes in the Snowball stemmer address issues noticed with the Porter stemmer. Snowball and Porter stem roughly 5% of words differently.

In [7]:
# See which languages are supported
print(", ".join(SnowballStemmer.languages))

arabic, danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, porter, portuguese, romanian, russian, spanish, swedish


In [8]:
stemmer = SnowballStemmer("english")
for word in words:
    rootWord = stemmer.stem(word)
    print(rootWord)

don
’
t
be
dishearten
,
tri
and
tri
until
you
succeed
!
to
resubmit
your
project
or
view
other
project
go
to
your
project
dashboard
.


## Comparing all stemming results

In [9]:
word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football"]

print("{0:20} {1:20} {2:20} {3:20}".format("Word","PorterStemmer","LancasterStemmer","SnowballStemmer"))
print("-----------------------------------------------------------------------------------")
for word in word_list:
    print("{0:20} {1:20} {2:20} {3:20}".format(word, ps.stem(word), ls.stem(word), stemmer.stem(word)))

Word                 PorterStemmer        LancasterStemmer     SnowballStemmer     
-----------------------------------------------------------------------------------
friend               friend               friend               friend              
friendship           friendship           friend               friendship          
friends              friend               friend               friend              
friendships          friendship           friend               friendship          
stabil               stabil               stabl                stabil              
destabilize          destabil             dest                 destabil            
misunderstanding     misunderstand        misunderstand        misunderstand       
railroad             railroad             railroad             railroad            
moonlight            moonlight            moonlight            moonlight           
football             footbal              footbal              footbal      
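
One way to read the table is to group words by the stem they end up with. The sketch below does that for the first four rows, with the stems copied by hand from the output above; `group_by_stem` is a hypothetical helper for this comparison.

```python
from collections import defaultdict

# (word, Porter stem, Lancaster stem) copied from the table above.
ROWS = [
    ("friend", "friend", "friend"),
    ("friendship", "friendship", "friend"),
    ("friends", "friend", "friend"),
    ("friendships", "friendship", "friend"),
]

def group_by_stem(rows, column):
    """Collect the words that share a stem in the given column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)

porter_groups = group_by_stem(ROWS, 1)
lancaster_groups = group_by_stem(ROWS, 2)
print(porter_groups)
print(lancaster_groups)
```

Porter keeps "friend" and "friendship" apart, while Lancaster merges all four words into one group, which is the more aggressive behaviour discussed earlier.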

# Lemmatization

Lemmatization resolves a word to its lemma, its dictionary form. To do this well, it needs to know the word's part of speech, which requires extra computational linguistics machinery such as a part-of-speech tagger. This allows it to do better resolutions (like resolving “is” and “are” to “be”).

## 1. WordNetLemmatizer

The WordNet lemmatizer uses the WordNet database to look up the lemmas of words.

In [10]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

sentence = "The striped bats are hanging on their feet for best"

words = word_tokenize(sentence)

print("{0:20} {1:20}".format("word", "lemma"))
print("--------------------------------------")
for word in words:
    rootWord = wnl.lemmatize(word)
    print("{0:20} {1:20}".format(word, rootWord))

word                 lemma               
--------------------------------------
The                  The                 
striped              striped             
bats                 bat                 
are                  are                 
hanging              hanging             
on                   on                  
their                their               
feet                 foot                
for                  for                 
best                 best                


It didn’t do a good job: ‘are’ is not converted to ‘be’ and ‘hanging’ is not converted to ‘hang’ as expected. Only the nouns (‘bats’, ‘feet’) were resolved, because the words are given without context and lemmatize() treats each one as a noun by default.

We need to provide the context in which we want to lemmatize, that is, the part of speech (POS). This is done by passing the correct POS tag as the second argument (the pos parameter) to wnl.lemmatize(). The next cell passes pos='v' for every word, which fixes the verbs but also treats the nouns as verbs.
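
To see why the part of speech matters, it helps to think of a lemmatizer as a lookup keyed by both the word and its POS. The sketch below uses a tiny hand-written dictionary; `LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, and a real lemmatizer consults WordNet instead.

```python
# Hypothetical mini dictionary keyed by (surface form, part of speech).
LEMMAS = {
    ("bats", "n"): "bat",
    ("feet", "n"): "foot",
    ("are", "v"): "be",
    ("hanging", "v"): "hang",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("are"))        # looked up as a noun: no entry
print(toy_lemmatize("are", "v"))   # looked up as a verb
print(toy_lemmatize("feet"))       # looked up as a noun
print(toy_lemmatize("feet", "v"))  # looked up as a verb: no entry
```

The same word gives a different result depending on the POS it is looked up under, so a single fixed POS for the whole sentence cannot get every word right.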

In [11]:
print("{0:20} {1:20}".format("word", "lemma"))
print("--------------------------------------")
for word in words:
    rootWord = wnl.lemmatize(word, pos='v')
    print("{0:20} {1:20}".format(word, rootWord))

word                 lemma               
--------------------------------------
The                  The                 
striped              strip               
bats                 bat                 
are                  be                  
hanging              hang                
on                   on                  
their                their               
feet                 feet                
for                  for                 
best                 best                


### Part-of-speech (POS) tagging

POS tagging converts a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.

In [12]:
from nltk.corpus import wordnet
from nltk.tag import pos_tag

In [13]:
posTag = pos_tag(words)
posTag

[('The', 'DT'),
 ('striped', 'JJ'),
 ('bats', 'NNS'),
 ('are', 'VBP'),
 ('hanging', 'VBG'),
 ('on', 'IN'),
 ('their', 'PRP$'),
 ('feet', 'NNS'),
 ('for', 'IN'),
 ('best', 'JJS')]

In [14]:
universalPosTag = pos_tag(words, tagset='universal')
universalPosTag

[('The', 'DET'),
 ('striped', 'ADJ'),
 ('bats', 'NOUN'),
 ('are', 'VERB'),
 ('hanging', 'VERB'),
 ('on', 'ADP'),
 ('their', 'PRON'),
 ('feet', 'NOUN'),
 ('for', 'ADP'),
 ('best', 'ADJ')]

### Passing custom POS tags to the lemmatizer

nltk.pos_tag() returns a list of (word, tag) tuples. The key here is to map NLTK’s POS tags to the format the WordNet lemmatizer accepts. The get_wordnet_pos() function defined below does this mapping.

In [15]:
def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
               "N": wordnet.NOUN,
               "V": wordnet.VERB,
               "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [16]:
# Lemmatize a Sentence with the appropriate POS tag
sentence = "The striped bats are hanging on their feet for best"
print([wnl.lemmatize(w, get_wordnet_pos(w)) for w in word_tokenize(sentence)])

['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
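
Note that get_wordnet_pos() tags each word in isolation, which is presumably why ‘striped’ became ‘strip’ here even though the full-sentence tagger labelled it JJ earlier. A variation is to tag the sentence once and map each tag afterwards. The sketch below shows only the mapping step, using the tags copied from the pos_tag(words) output above; `penn_to_wordnet` is a hypothetical helper, and the single-letter values are the ones WordNet uses for adjective, noun, verb and adverb.

```python
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1].upper(), "n")

# Tags copied from the pos_tag(words) output shown earlier.
tagged = [("The", "DT"), ("striped", "JJ"), ("bats", "NNS"), ("are", "VBP"),
          ("hanging", "VBG"), ("on", "IN"), ("their", "PRP$"), ("feet", "NNS"),
          ("for", "IN"), ("best", "JJS")]

word_pos = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(word_pos)
# Each (word, pos) pair could then be passed to wnl.lemmatize(word, pos).
```

With sentence-level tags, ‘striped’ and ‘best’ map to adjectives while ‘are’ and ‘hanging’ map to verbs.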


## 2. spaCy Lemmatizer

spaCy is a relatively new entrant in this space and is billed as an industrial-strength NLP engine. It comes with pre-built models that can parse text and compute various NLP-related features through a single function call. Of course, it provides the lemma of each word too.

In [17]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

spaCy determines the part-of-speech tag by default and assigns the corresponding lemma. It comes with a number of prebuilt models; `en_core_web_sm`, loaded below, is one of the standard ones for English.

In [18]:
import spacy

# Load the English model, disabling the parser and NER components that lemmatization does not need
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

sentence = "The striped bats are hanging on their feet for best"

# Process the sentence with the loaded model object `nlp`
doc = nlp(sentence)

# Extract the lemma for each token and join
" ".join([token.lemma_ for token in doc])

'the stripe bat be hang on their foot for good'

It produced every lemmatization that the WordNet lemmatizer produced when supplied with the correct POS tag, and it also lemmatized ‘best’ to ‘good’.

## 3. TextBlob Lemmatizer

TextBlob is a powerful, fast and convenient NLP package as well. Using the Word and TextBlob objects, it's quite straightforward to parse and lemmatize words and sentences respectively.

In [19]:
# !pip install textblob

In [20]:
from textblob import TextBlob,Word

# Lemmatize a word
word = 'stripes'
w = Word(word)
w.lemmatize()

'stripe'

To lemmatize a sentence or paragraph, however, we parse it using TextBlob and call the lemmatize() function on the parsed words.

In [21]:
# Lemmatize a sentence
sentence = "The striped bats are hanging on their feet for best"
sent = TextBlob(sentence)
" ".join([w.lemmatize() for w in sent.words])

'The striped bat are hanging on their foot for best'

It did not do a great job at the outset because, like NLTK, TextBlob uses WordNet internally. So, let’s pass the appropriate POS tag to the lemmatize() method.

### TextBlob Lemmatizer with appropriate POS tag

In [22]:
# Define function to lemmatize each word with its POS tag

def lemmatize_with_postag(sentence):
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w,pos in sent.tags]
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)

# Lemmatize
sentence = "The striped bats are hanging on their feet for best"

lemmatize_with_postag(sentence)

'The striped bat be hang on their foot for best'

## 4. Pattern Lemmatizer

In [23]:
# !pip install pattern

In [24]:
import pattern
from pattern.en import lemma, lexeme

In [26]:
# Note: for reasons that are unclear, this always throws an error on the first execution
sent = "The striped bats were hanging on their feet and ate best fishes"

" ".join([lemma(wd) for wd in sent.split()])

'the stripe bat be hang on their feet and eat best fishes'

In [27]:
# We can also view the possible lexemes for each word.

[lexeme(wd) for wd in sentence.split()]

[['the', 'thes', 'thing', 'thed'],
 ['stripe', 'stripes', 'striping', 'striped'],
 ['bat', 'bats', 'batting', 'batted'],
 ['be',
  'am',
  'are',
  'is',
  'being',
  'was',
  'were',
  'been',
  'am not',
  "aren't",
  "isn't",
  "wasn't",
  "weren't"],
 ['hang', 'hangs', 'hanging', 'hung'],
 ['on', 'ons', 'oning', 'oned'],
 ['their', 'theirs', 'theiring', 'theired'],
 ['feet', 'feets', 'feeting', 'feeted'],
 ['for', 'fors', 'forring', 'forred'],
 ['best', 'bests', 'besting', 'bested']]

In [30]:
# We could also obtain the lemma by parsing the text.

from pattern.en import parse

In [32]:
print(parse('The striped bats were hanging on their feet and ate best fishes', 
            lemmata=True, tags=False, chunks=False))

The/DT/the striped/JJ/striped bats/NNS/bat were/VBD/be hanging/VBG/hang on/IN/on their/PRP$/their feet/NNS/foot and/CC/and ate/VBD/eat best/JJ/best fishes/NNS/fish


## Applications of Stemming and Lemmatization
Stemming and lemmatization are themselves forms of NLP and are widely used in text mining. Text mining is the process of analyzing texts written in natural language and extracting high-quality information from them. It involves looking for interesting patterns in the text or extracting data from the text to be inserted into a database. Text mining tasks include **text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities)**. Developers have to prepare text using lexical analysis, POS (part-of-speech) tagging, stemming and other natural language processing techniques to gain useful information from it.

#### Information Retrieval (IR) Environments:
Stemming and lemmatization help map documents to common topics and index them for search, which matters as document collections grow to mind-boggling sizes. **Query expansion** is a term used in search environments: when a user enters a query, the system expands or enhances it so that it matches additional documents.
Stemming has been used in query systems such as web search engines, but because of under-stemming and over-stemming its effectiveness in returning correct results has been found to be limited. For example, a person searching for 'marketing' may not be pleased with results that show 'markets' but not 'marketing'. Stemming may still be useful in other languages, and different stemming algorithms may give better results. Google Search adopted stemming in 2003.
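
A rough sketch of how stemming acts as query expansion: both documents and queries are reduced to stems, so a query matches any document that shares a stem with it. The stemmer, the index functions and the three documents below are all invented for this illustration; a real system would use a proper stemmer and index.

```python
from collections import defaultdict

def toy_stem(word):
    """Illustrative suffix stripper (an assumption, not a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents sharing at least one stem with the query."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(toy_stem(token), set())
    return sorted(hits)

DOCS = {
    1: "stock markets fell",
    2: "digital marketing tips",
    3: "holiday travel deals",
}
index = build_index(DOCS)
print(search(index, "marketing"))
```

The query 'marketing' also returns the document about 'markets', which reproduces the complaint described above: conflating the two words widens recall at the cost of precision.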

#### Sentiment Analysis
Sentiment analysis is the analysis of people's reviews and comments about something. It is widely used to analyze products in online retail shops. Stemming and lemmatization are used as part of the text-preparation process before the text is analyzed.

#### Document Clustering
Document clustering (or text clustering) is the application of cluster analysis to textual documents. It has applications in automatic document organization, topic extraction, and fast information retrieval or filtering. Examples of document clustering include web document clustering for search engines. Before clustering methods are applied, each document is prepared through tokenization, removal of stop words, and then stemming and lemmatization, which reduce the number of tokens that carry the same information and hence speed up the whole process. After this pre-processing, features are computed from the frequency of each token, and then clustering methods are applied.
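
The pre-processing steps above (tokenize, drop stop words, stem, count) can be sketched with the standard library. The stop-word list and the stemmer are tiny stand-ins chosen for this example, not the resources a real pipeline would use.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "of", "and", "is", "are", "in"}

def toy_stem(word):
    """Illustrative suffix stripper (an assumption, not a real stemmer)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def term_frequencies(text):
    """Tokenize on whitespace, drop stop words, stem, and count the stems."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(toy_stem(t) for t in tokens)

doc_a = term_frequencies("The cats are chasing the cat")
doc_b = term_frequencies("Dogs and a dog barking")
print(doc_a)
print(doc_b)
```

Because 'cats' and 'cat' collapse to one stem, each document ends up with fewer distinct features, and these frequency vectors are what a clustering method would then consume.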

## Stemming or lemmatization?

1. Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.

2. Stemming follows an algorithm with a fixed series of steps to perform on each word, which makes it faster. Lemmatization, in contrast, relies on a lexical resource such as the WordNet corpus to produce the lemma, which makes it slower than stemming. You also have to supply a part of speech to obtain the correct lemma.

So when should you use which? The points above show that if speed is the priority, stemming should be used, since lemmatizers look words up in a corpus, which costs time and processing. If you are building a language application in which linguistic correctness is important, you should use lemmatization, as it uses a corpus to match root forms.