## Stemming

Stemming is a process of stripping affixes from words.

A common first normalization step is converting every word to lowercase, so that The and the are treated as the same word.

Stemming goes further: the words playing, played and play are all reduced to a single stem, play.

NLTK ships with several stemmers.
The two most widely used are the **Porter** and **Lancaster** stemmers.
Each applies its own set of rules for stripping affixes.
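
To make "stripping affixes" concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter or Lancaster algorithm; the suffix list and the minimum stem length are assumptions chosen purely for illustration.

```python
# Toy suffix stripper: illustrates the idea of rule-based stemming only.
# SUFFIXES and min_stem_len are illustrative assumptions, not the rules
# used by the Porter or Lancaster stemmers.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays", "play", "sing"]:
    print(w, "->", naive_stem(w))
```

The length check keeps short words such as sing intact. Real stemmers use far larger rule sets with conditions on what remains after stripping.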

In [2]:
import nltk

In [3]:
# Porter stemmer

from nltk import PorterStemmer
porter = PorterStemmer()
porter.stem('builders')

'builder'

In [4]:
from nltk import LancasterStemmer
lanc = LancasterStemmer()
lanc.stem('builders')

'build'
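
The earlier claim that playing, played and play collapse to one stem can be checked directly. The sketch below runs both stemmers over a small, arbitrarily chosen word list and prints the results side by side, so the two rule sets can be compared on the same input.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lanc = LancasterStemmer()

# Word list chosen for illustration; replace it with your own words.
words = ["playing", "played", "play", "builders"]

for w in words:
    # Print the input word next to the stem produced by each stemmer.
    print(f"{w:10} porter={porter.stem(w):10} lancaster={lanc.stem(w)}")
```

Neither stemmer needs any downloaded corpus data, so this runs with a bare NLTK install.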

### Normalizing with Stemming

In [5]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [6]:
text1

<Text: Moby Dick by Herman Melville 1851>

In [10]:
type(text1)

nltk.text.Text

In [8]:
len(set(text1))

19317

In [14]:
lc_text1 = [w.lower() for w in text1]
lc_text1

['[',
 'moby',
 'dick',
 'by',
 'herman',
 'melville',
 '1851',
 ']',
 'etymology',
 '.',
 '(',
 'supplied',
 'by',
 'a',
 'late',
 'consumptive',
 'usher',
 'to',
 'a',
 'grammar',
 'school',
 ')',
 'the',
 'pale',
 'usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'i',
 'see',
 'him',
 'now',
 '.',
 'he',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and',
 'grammars',
 ',',
 'with',
 'a',
 'queer',
 'handkerchief',
 ',',
 'mockingly',
 'embellished',
 'with',
 'all',
 'the',
 'gay',
 'flags',
 'of',
 'all',
 'the',
 'known',
 'nations',
 'of',
 'the',
 'world',
 '.',
 'he',
 'loved',
 'to',
 'dust',
 'his',
 'old',
 'grammars',
 ';',
 'it',
 'somehow',
 'mildly',
 'reminded',
 'him',
 'of',
 'his',
 'mortality',
 '.',
 '"',
 'while',
 'you',
 'take',
 'in',
 'hand',
 'to',
 'school',
 'others',
 ',',
 'and',
 'to',
 'teach',
 'them',
 'by',
 'what',
 'name',
 'a',
 'whale',
 '-',
 'fish',
 'is',
 'to',
 'be',
 

In [15]:
len(set(lc_text1))

17231

In [18]:
# Now let's further normalize text1 with Porter Stemmer.
porter = PorterStemmer()
p_stem_words = [porter.stem(w) for w in set(lc_text1)]
p_stem_words

['abruptli',
 'height',
 'walk',
 'unimagin',
 'verbatim',
 'long',
 'whistl',
 '90',
 'erskin',
 'desist',
 'stolidli',
 'tape',
 'stealthili',
 'digniti',
 'hind',
 'mcculloch',
 'day',
 'quiet',
 'lock',
 'gazett',
 'spile',
 'aris',
 'cliff',
 ':--"',
 'lime',
 'humorist',
 'idolatr',
 'onset',
 'yanke',
 'twist',
 'drawn',
 'bedstead',
 'explor',
 'polici',
 'snap',
 'heav',
 'wheel',
 'denot',
 'goddess',
 'bayonet',
 'wealthiest',
 'mishap',
 'topsail',
 'period',
 'hope',
 'standest',
 'stoutest',
 'wrist',
 'buckl',
 'palm',
 'elabor',
 'calcul',
 'lavish',
 'slightest',
 'thrash',
 'anyway',
 'behr',
 'triangular',
 'salli',
 'deserv',
 'pleas',
 'spike',
 'orchard',
 'peer',
 'magnific',
 'bad',
 'whalebon',
 'solicit',
 'pickl',
 'victori',
 '000',
 'cetacea',
 'belay',
 'they',
 'breakwat',
 'heart',
 'process',
 'fearless',
 'lament',
 'uneven',
 'immeasur',
 'dimli',
 'beech',
 'allot',
 'pusi',
 'pistol',
 'curl',
 'dole',
 'continu',
 'after',
 'perpetu',
 'convey',
 '

In [20]:
len(set(p_stem_words))

10927

In [21]:
lanc = LancasterStemmer()
l_stem_words = [lanc.stem(w) for w in set(lc_text1)]
l_stem_words

['abrupt',
 'height',
 'walk',
 'unimagin',
 'verbatim',
 'long',
 'whistl',
 '90',
 'erskin',
 'desist',
 'stolid',
 'tap',
 'stealthy',
 'dign',
 'hind',
 'mcculloch',
 'day',
 'quiet',
 'lock',
 'gazet',
 'spil',
 'ar',
 'cliff',
 ':--"',
 'lim',
 'hum',
 'idol',
 'onset',
 'yank',
 'twist',
 'drawn',
 'bedstead',
 'expl',
 'policy',
 'snap',
 'heav',
 'wheel',
 'denot',
 'goddess',
 'bayonet',
 'wealthiest',
 'mishap',
 'topsail',
 'period',
 'hop',
 'standest',
 'stoutest',
 'wrist',
 'buckl',
 'palm',
 'elab',
 'calc',
 'lav',
 'slightest',
 'thrashing',
 'anyway',
 'behr',
 'triangul',
 'sal',
 'deserv',
 'pleas',
 'spik',
 'orchard',
 'peer',
 'magn',
 'bad',
 'whalebon',
 'solicit',
 'pickl',
 'vict',
 '000',
 'cetace',
 'belay',
 'they',
 'breakw',
 'heart',
 'process',
 'fearless',
 'lam',
 'unev',
 'immeas',
 'dim',
 'beech',
 'allot',
 'pusy',
 'pistol',
 'curl',
 'dol',
 'continu',
 'aft',
 'perpet',
 'convey',
 'outstretch',
 'dent',
 'springs',
 'gro',
 'esteem',
 'show

In [22]:
len(set(l_stem_words))

9036

## Lemmatization or Lemma
A lemma is a lexical entry in a lexical resource such as a dictionary. Lemmatization reduces a word to its base form, known as its lemma.

Several lemmas can share the same spelling. These are known as homonyms.

For example, the two lemmas listed below are homonyms.
1. saw [verb] - Past tense of see
2. saw [noun] - Cutting instrument

In [33]:
wnl = nltk.WordNetLemmatizer()
wnl_stem_words = [wnl.lemmatize(word) for word in set(lc_text1) ]
wnl_stem_words

['abruptly',
 'height',
 'walking',
 'unimaginative',
 'verbatim',
 'longing',
 'whistled',
 '90',
 'erskine',
 'desist',
 'stolidly',
 'tape',
 'stealthily',
 'dignity',
 'hind',
 'mcculloch',
 'day',
 'quiet',
 'locked',
 'gazette',
 'spile',
 'arises',
 'cliff',
 ':--"',
 'lime',
 'humorist',
 'idolatrous',
 'onset',
 'yankee',
 'twist',
 'drawn',
 'bedstead',
 'explored',
 'policy',
 'snapping',
 'heave',
 'wheeled',
 'denotes',
 'goddess',
 'bayonet',
 'wealthiest',
 'mishap',
 'topsail',
 'periodical',
 'hope',
 'standest',
 'stoutest',
 'wrist',
 'buckling',
 'palm',
 'elaborately',
 'calculation',
 'lavish',
 'slightest',
 'thrashing',
 'anyways',
 'behring',
 'triangular',
 'sally',
 'deserved',
 'pleased',
 'spike',
 'orchard',
 'peer',
 'magnificence',
 'bad',
 'whaleboning',
 'solicitously',
 'pickle',
 'victory',
 '000',
 'cetacea',
 'belaying',
 'they',
 'breakwater',
 'heart',
 'procession',
 'fearlessness',
 'lamentable',
 'uneven',
 'immeasurable',
 'dimly',
 'beech',


In [34]:
len(set(wnl_stem_words))

15168
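
The distinct-token counts reported so far can be put side by side to see how aggressively each step shrinks the vocabulary. The sketch below only does arithmetic on the numbers printed in the cells above; the dictionary keys are labels chosen for this example.

```python
# Distinct-token counts reported in the cells above for text1 (Moby Dick).
vocab_sizes = {
    "raw": 19317,
    "lowercased": 17231,
    "porter": 10927,
    "lancaster": 9036,
    "wordnet_lemmatizer": 15168,
}

baseline = vocab_sizes["raw"]

# Fraction of the raw vocabulary removed by each normalization step.
reduction = {name: 1 - size / baseline for name, size in vocab_sizes.items()}

for name, frac in reduction.items():
    print(f"{name:20} {vocab_sizes[name]:6d}  reduction vs raw: {frac:.1%}")
```

Lancaster removes the most distinct forms and the lemmatizer the least, which matches the truncated stems and intact words visible in the outputs.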

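NLTK comes with **WordNetLemmatizer**. This lemmatizer removes an affix only if the resulting word is found in the WordNet lexical resource. By default it treats every word as a noun, which is why verb forms such as walking and locked are unchanged in the output above.
WordNetLemmatizer is mainly used to build a vocabulary whose entries are all valid lemmas.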
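
The "only if the result is in the lexicon" rule can be sketched in a few lines. The tiny `LEXICON` set and the suffix rules below are hypothetical stand-ins for WordNet and its morphology rules, so this shows the idea rather than NLTK's implementation.

```python
# Hypothetical mini-lexicon standing in for WordNet (assumption for illustration).
LEXICON = {"dog", "church", "abruptly", "policy", "walk"}

# (suffix, replacement) pairs: a small assumed subset of noun-style rules.
RULES = [("ies", "y"), ("es", ""), ("s", "")]


def toy_lemmatize(word):
    """Return a shorter form only when that form is a known lexicon entry."""
    if word in LEXICON:
        return word
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word  # no valid lemma found: leave the word untouched


for w in ["dogs", "churches", "policies", "abruptly", "xyzs"]:
    print(w, "->", toy_lemmatize(w))
```

Unlike a stemmer, this never produces a non-word: an unknown form such as xyzs is returned unchanged instead of being cut down to xyz.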

In [31]:
lanc.stem('lying')

'lying'

In [27]:
# How many words in text6 end with 'ly'?
text6

<Text: Monty Python and the Holy Grail>

In [29]:
result = [w for w in text6 if w.lower().endswith('ly')]

In [30]:
len(result)

110

### POS tagging 

Categorizing words into their parts of speech and labeling them accordingly is called POS tagging.

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word.

**pos_tag** is the simplest tagger to use in NLTK.

In [4]:
text = 'Python is awesome.'
words = nltk.word_tokenize(text)
nltk.pos_tag(words)

# The words Python, is and awesome are tagged as proper noun (NNP), verb in present tense, third person singular (VBZ), and adjective (JJ) respectively.

[('Python', 'NNP'), ('is', 'VBZ'), ('awesome', 'JJ'), ('.', '.')]
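
Because the tagger returns plain `(word, tag)` tuples, ordinary list comprehensions are enough to work with the result. The sketch below copies the output shown above into a literal list, so it runs without any tagger data, and pulls out words by tag prefix; the variable names are chosen for this example.

```python
# Tagged output copied from the cell above.
tagged = [("Python", "NNP"), ("is", "VBZ"), ("awesome", "JJ"), (".", ".")]

# Penn Treebank adjective tags all start with JJ (JJ, JJR, JJS).
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]

# Verb tags all start with VB (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(adjectives)
print(verbs)
```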

In [6]:
## You can read more about the POS tags with the help command below
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [7]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


In [15]:
text = 'Python/NN is/VB awesome/JJ ./.'
# pos_tag does not parse 'word/TAG' strings: each whole string is treated as one token.
[ nltk.pos_tag([word]) for word in text.split() ]

[[('Python/NN', 'NN')],
 [('is/VB', 'NN')],
 [('awesome/JJ', 'NN')],
 [('./.', 'NN')]]

In [16]:
[ nltk.tag.str2tuple(word) for word in text.split() ]

[('Python', 'NN'), ('is', 'VB'), ('awesome', 'JJ'), ('.', '.')]
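
`str2tuple` has a counterpart, `tuple2str`, that goes the other way. As a small sketch, the round trip below parses a tagged string into pairs and joins them back; the variable names are chosen for this example.

```python
from nltk.tag import str2tuple, tuple2str

tagged_text = "Python/NN is/VB awesome/JJ ./."

# Parse each 'word/TAG' token into a (word, tag) pair.
pairs = [str2tuple(tok) for tok in tagged_text.split()]

# Convert the pairs back into 'word/TAG' tokens and rebuild the string.
roundtrip = " ".join(tuple2str(pair) for pair in pairs)

print(pairs)
print(roundtrip == tagged_text)
```

Both functions accept a `sep` argument for corpora that use a separator other than `/`.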

In [19]:
## Many of the corpora available in NLTK are already tagged with parts of speech.
## The tagged_words method returns the tagged words of a corpus.

from nltk.corpus import brown
b_tagged_words = brown.tagged_words()
b_tagged_words

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

#### Default tagger

DefaultTagger assigns one specified tag to every token of the given text.

In [22]:
text = 'Python is awesome.'
text_tokens = nltk.word_tokenize(text)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(text_tokens)

[('Python', 'NN'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]

#### Lookup Tagger
You can define a custom tagger from a word-to-tag lookup table and use it to tag the words of any text.

**UnigramTagger** accepts such a table through its `model` argument. Words missing from the table are tagged `None`.

In [23]:
text = 'Python is awesome.'
text_tokens = nltk.word_tokenize(text)
default_tags = {'Python': 'AT', 'is':'NN', 'over':'BEZ'}

tagger = nltk.UnigramTagger(model=default_tags)
tagger.tag(text_tokens)

[('Python', 'AT'), ('is', 'NN'), ('awesome', None), ('.', None)]
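
The `None` tags above can be avoided by chaining taggers. A minimal sketch, assuming you want every unknown word to fall back to `'NN'`: pass a `DefaultTagger` as the `backoff` argument of `UnigramTagger`. The tokens are written out by hand here so the example needs no tokenizer data.

```python
import nltk

# Tokens written out by hand so no tokenizer data is required.
tokens = ["Python", "is", "awesome", "."]

# Same lookup table as in the cell above.
lookup = {"Python": "AT", "is": "NN", "over": "BEZ"}

# Words missing from the lookup table are passed to the fallback tagger.
fallback = nltk.DefaultTagger("NN")
tagger = nltk.UnigramTagger(model=lookup, backoff=fallback)

tagged = tagger.tag(tokens)
print(tagged)
# [('Python', 'AT'), ('is', 'NN'), ('awesome', 'NN'), ('.', 'NN')]
```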

#### Training and Testing your UnigramTagger

In [40]:
from nltk.corpus import brown
print(brown.sents(categories='government'))
brown_tagged_sents = brown.tagged_sents(categories='government')

print('\n\n',brown_tagged_sents[0])

[['The', 'Office', 'of', 'Business', 'Economics', '(', 'OBE', ')', 'of', 'the', 'U.S.', 'Department', 'of', 'Commerce', 'provides', 'basic', 'measures', 'of', 'the', 'national', 'economy', 'and', 'current', 'analysis', 'of', 'short-run', 'changes', 'in', 'the', 'economic', 'situation', 'and', 'business', 'outlook', '.'], ['It', 'develops', 'and', 'analyzes', 'the', 'national', 'income', ',', 'balance', 'of', 'international', 'payments', ',', 'and', 'many', 'other', 'business', 'indicators', '.'], ...]


 [('The', 'AT'), ('Office', 'NN-TL'), ('of', 'IN-TL'), ('Business', 'NN-TL'), ('Economics', 'NN-TL'), ('(', '('), ('OBE', 'NP'), (')', ')'), ('of', 'IN'), ('the', 'AT'), ('U.S.', 'NP-TL'), ('Department', 'NN-TL'), ('of', 'IN-TL'), ('Commerce', 'NN-TL'), ('provides', 'VBZ'), ('basic', 'JJ'), ('measures', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('national', 'JJ'), ('economy', 'NN'), ('and', 'CC'), ('current', 'JJ'), ('analysis', 'NN'), ('of', 'IN'), ('short-run', 'NN'), ('changes', 'NNS'), (

In [41]:
brown_sents = brown.sents(categories='government')
print(brown_sents)
print(len(brown_sents))

[['The', 'Office', 'of', 'Business', 'Economics', '(', 'OBE', ')', 'of', 'the', 'U.S.', 'Department', 'of', 'Commerce', 'provides', 'basic', 'measures', 'of', 'the', 'national', 'economy', 'and', 'current', 'analysis', 'of', 'short-run', 'changes', 'in', 'the', 'economic', 'situation', 'and', 'business', 'outlook', '.'], ['It', 'develops', 'and', 'analyzes', 'the', 'national', 'income', ',', 'balance', 'of', 'international', 'payments', ',', 'and', 'many', 'other', 'business', 'indicators', '.'], ...]
3032


In [44]:
train_size = int(len(brown_sents)*0.8)
train_sents = brown_tagged_sents[:train_size]
test_sents = brown_tagged_sents[train_size:]

In [46]:
tagger = nltk.UnigramTagger(train_sents)
tagger

<UnigramTagger: size=6810>

In [47]:
tagger.evaluate(test_sents)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  tagger.evaluate(test_sents)


0.7799495586380832

In [48]:
tagger.accuracy(test_sents)

0.7799495586380832
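
To see what the accuracy figure measures, here is a plain-Python sketch of the idea behind a unigram tagger: remember the most frequent tag for each training word, then count how many test tokens get the gold tag. The tiny corpus is made-up data and the functions are illustrative, not NLTK's implementation.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical data, for illustration only).
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("run", "NN"), ("ends", "VBZ")],
    [("dogs", "NNS"), ("run", "VB")],
    [("they", "PPSS"), ("run", "VB")],
]
test_sents = [
    [("the", "AT"), ("dog", "NN"), ("ends", "VBZ")],
    [("cats", "NNS"), ("run", "VB")],
]


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, tagged_sents):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    total = correct = 0
    for sent in tagged_sents:
        for word, gold in sent:
            total += 1
            correct += model.get(word) == gold  # unseen words predict None
    return correct / total


model = train_unigram(train_sents)
score = accuracy(model, test_sents)
print(score)
```

In this toy data the only miss is cats, which never appears in training. Unseen words are one source of error for a unigram tagger, and a backoff tagger is a common way to reduce it.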

In [51]:
nltk.FreqDist(brown.tagged_words())[('The','AT')]

6725

In [None]:
from nltk.corpus import stopwords

# The original function header is missing; the name below is an assumption.
def stem_and_lemmatize(textcontent):
    # Tokenize, deduplicate and lowercase the input text.
    tokenizedwords = nltk.word_tokenize(textcontent)
    tokenizedwords = [w.lower() for w in set(tokenizedwords)]
    # Drop English stop words.
    eng_stopwords = set(stopwords.words('english'))
    filteredwords = [w for w in tokenizedwords if w not in eng_stopwords]

    porter = nltk.PorterStemmer()
    porterstemmedwords = [porter.stem(w) for w in filteredwords]

    lanc = nltk.LancasterStemmer()
    lancasterstemmedwords = [lanc.stem(w) for w in filteredwords]

    wnl = nltk.WordNetLemmatizer()
    lemmatizedwords = [wnl.lemmatize(w) for w in filteredwords]
    return (porterstemmedwords, lancasterstemmedwords, lemmatizedwords)