# Stemming and Lemmatization

##### Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing, used to prepare text, words, and documents for further processing.

# Understanding 

In grammar, inflection is the modification of a word to express different grammatical categories such as tense, case, voice, aspect, person, number, gender, and mood. An inflection expresses one or more grammatical categories with a prefix, suffix, or infix, or with another internal modification such as a vowel change.

<img src="stemminglemmatization_n8bmou.jpg">

**Stemming** is the process of reducing inflected words to a root form (the stem). A group of related words is mapped to the same stem, even if the stem itself is not a valid word in the language.

**Lemmatization**, unlike stemming, reduces inflected words in a way that ensures the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.
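The contrast between the two definitions can be sketched with a toy example. This is only an illustration under stated assumptions: `toy_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are hypothetical names made up here, and neither function is the algorithm used by any NLTK stemmer or lemmatizer. The stemmer blindly strips a suffix, while the lemmatizer looks the word up in a tiny hand-written dictionary.

```python
# Toy illustration only: NOT the Porter, Snowball, or WordNet algorithms.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical suffix list, checked in order

# Hypothetical hand-written dictionary mapping inflected forms to lemmas.
LEMMA_TABLE = {
    "women": "woman",
    "hired": "hire",
    "trapping": "trap",
    "studies": "study",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["hired", "trapping", "studies", "women"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# hired -> hir | hire
# trapping -> trapp | trap
# studies -> studi | study
# women -> women | woman
```

The stems `hir`, `trapp`, and `studi` are not English words, whereas every lemma is. The irregular plural `women` shows the other side: suffix stripping cannot reach `woman`, but a dictionary lookup can.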

## Stemming

In [1]:
import nltk

## Types of Stemmers

1. Snowball Stemmer
2. Porter Stemmer - **most used**
3. Lancaster Stemmer

In [4]:
nltk.SnowballStemmer(language = 'english').stem("automate")

'autom'

In [5]:
tokens = ['waterloo','fortune','catchy','hired','trapping','inn','driven']

snowball = nltk.SnowballStemmer(language='english')  # build the stemmer once and reuse it
stemmed_tokens = [snowball.stem(token) for token in tokens]

In [6]:
stemmed_tokens

['waterloo', 'fortun', 'catchi', 'hire', 'trap', 'inn', 'driven']

In [7]:
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

In [8]:
print([porter.stem(t) for t in tokens])

['waterloo', 'fortun', 'catchi', 'hire', 'trap', 'inn', 'driven']


In [10]:
print([lancaster.stem(t) for t in tokens])

['waterloo', 'fortun', 'catchy', 'hir', 'trap', 'in', 'driv']
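To make the differences between the three stemmers easier to see, the outputs printed above can be lined up token by token. The sketch below does not call NLTK; it copies the lists shown in the cells above, and `disagreements` is a hypothetical helper name introduced only for this comparison.

```python
# Outputs copied from the cells above; no stemmer is run here.
tokens = ['waterloo', 'fortune', 'catchy', 'hired', 'trapping', 'inn', 'driven']
outputs = {
    'snowball': ['waterloo', 'fortun', 'catchi', 'hire', 'trap', 'inn', 'driven'],
    'porter': ['waterloo', 'fortun', 'catchi', 'hire', 'trap', 'inn', 'driven'],
    'lancaster': ['waterloo', 'fortun', 'catchy', 'hir', 'trap', 'in', 'driv'],
}


def disagreements(tokens, outputs):
    """Return {token: {stemmer: stem}} for tokens the stemmers do not agree on."""
    rows = {}
    for i, token in enumerate(tokens):
        per_stemmer = {name: stem_list[i] for name, stem_list in outputs.items()}
        if len(set(per_stemmer.values())) > 1:
            rows[token] = per_stemmer
    return rows


for token, per_stemmer in disagreements(tokens, outputs).items():
    print(token, per_stemmer)
```

On this sample, Snowball and Porter agree on every token, while Lancaster differs on `catchy`, `hired`, `inn`, and `driven`, and cuts `inn` down to `in`, which is a different English word.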


## Lemmatization

In [11]:
lemmer = nltk.WordNetLemmatizer()

In [15]:
words = ['women','supreme','stocking']
print("Actual     {}".format(words))
print("Lemmatized {}".format([lemmer.lemmatize(word) for word in words]))

Actual     ['women', 'supreme', 'stocking']
Lemmatized ['woman', 'supreme', 'stocking']


# Sentence Tokenisation / Segmentation

Sentence segmentation cannot rely on full stops alone: abbreviations and initials, such as the "G. K." in "G. K. Chesterton", contain full stops that do not end a sentence.
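A quick way to see the problem is to split on full stops and count the pieces. The sketch below uses only the standard library; `naive_split` is a hypothetical helper, and the sample text is a made-up two-sentence string built from the title line of the corpus used in this section.

```python
import re

# Two sentences, but the first contains the initials "G. K.".
text = ("The Man Who Was Thursday was written by G. K. Chesterton. "
        "It was published in 1908.")


def naive_split(text):
    """Split after every full stop that is followed by whitespace."""
    return [part for part in re.split(r"(?<=\.)\s+", text) if part]


pieces = naive_split(text)
print(len(pieces))
for piece in pieces:
    print(repr(piece))
# 4
# 'The Man Who Was Thursday was written by G.'
# 'K.'
# 'Chesterton.'
# 'It was published in 1908.'
```

The naive rule produces four fragments from two sentences because it treats the full stops in the initials as sentence boundaries. A trained tokenizer such as Punkt is meant to handle these cases.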

In [16]:
# Pre-trained Punkt sentence tokenizer; requires the 'punkt' data (nltk.download('punkt'))
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [21]:
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
print(text[:1000])

[The Man Who Was Thursday by G. K. Chesterton 1908]

To Edmund Clerihew Bentley

A cloud was on the mind of men, and wailing went the weather,
Yea, a sick cloud upon the soul when we were boys together.
Science announced nonentity and art admired decay;
The world was old and ended: but you and I were gay;
Round us in antic order their crippled vices came--
Lust that had lost its laughter, fear that had lost its shame.
Like the white lock of Whistler, that lit our aimless gloom,
Men showed their own white feather as proudly as a plume.
Life was a fly that faded, and death a drone that stung;
The world was very old indeed when you and I were young.
They twisted even decent sin to shapes not to be named:
Men were ashamed of honour; but we were not ashamed.
Weak if we were and foolish, not thus we failed, not thus;
When that black Baal blocked the heavens he had no hymns from us
Children we were--our forts of sand were even as weak as eve,
High as they went we piled them up to break that b

In [22]:
sents = sent_tokenizer.tokenize(text)

In [30]:
from pprint import pprint
pprint(sents[0])

('[The Man Who Was Thursday by G. K. Chesterton 1908]\n'
 '\n'
 'To Edmund Clerihew Bentley\n'
 '\n'
 'A cloud was on the mind of men, and wailing went the weather,\n'
 'Yea, a sick cloud upon the soul when we were boys together.')


In [32]:
pprint(sents[1:1000])

['Science announced nonentity and art admired decay;\n'
 'The world was old and ended: but you and I were gay;\n'
 'Round us in antic order their crippled vices came--\n'
 'Lust that had lost its laughter, fear that had lost its shame.',
 'Like the white lock of Whistler, that lit our aimless gloom,\n'
 'Men showed their own white feather as proudly as a plume.',
 'Life was a fly that faded, and death a drone that stung;\n'
 'The world was very old indeed when you and I were young.',
 'They twisted even decent sin to shapes not to be named:\n'
 'Men were ashamed of honour; but we were not ashamed.',
 'Weak if we were and foolish, not thus we failed, not thus;\n'
 'When that black Baal blocked the heavens he had no hymns from us\n'
 'Children we were--our forts of sand were even as weak as eve,\n'
 'High as they went we piled them up to break that bitter sea.',
 'Fools as we were in motley, all jangling and absurd,\n'
 'When all church bells were silent our cap and beds were heard.',
 '