## Understand the difference between Stemming and Lemmatization

In [2]:
!pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Collecting nltk
  Downloading nltk-3.6.7-py3-none-any.whl (1.5 MB)
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting regex>=2021.8.3
  Downloading regex-2023.8.8-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB)
Collecting importlib-resources
  Downloading importlib_resources-5.4.0-py3-none-any.whl (28 kB)
Installing collected packages: importlib-resources, tqdm, regex, nltk
Successfully installed importlib-resources-5.4.0 nltk-3.6.7 regex-2023.8.8 tqdm-4.64.1


In [3]:
import nltk

# Natural Language Toolkit

In [4]:
from nltk.corpus import stopwords

In [5]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [6]:
# set stopwords to english
stop_words = set(stopwords.words('english'))

## Stemming

In [7]:
from nltk.stem import PorterStemmer

# Stemming Algorithm

In [8]:
# Create an instance of PorterStemmer
porter = PorterStemmer()

### Stem words

In [None]:
porter.stem("Walking")

'walk'

In [None]:
porter.stem("Walked")

'walk'

In [None]:
porter.stem("Walks")

'walk'

In [None]:
porter.stem("run")

'run'

In [None]:
porter.stem("ran")

'ran'

In [None]:
porter.stem("running")

'run'

In [None]:
porter.stem("bosses")

'boss'

In [None]:
porter.stem("replacement") # "ement" -> remove

'replac'

In [None]:
porter.stem("unnecessary")

'unnecessari'

In [None]:
porter.stem("berry") # y -> i

'berri'
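The outputs above (`'replac'`, `'berri'`) are not dictionary words because a stemmer only applies suffix-rewriting rules. The sketch below illustrates that idea with two hand-written rules modelled on the comments in the cells above. `toy_stem` is a hypothetical helper written for this illustration; it is not the Porter algorithm and the length thresholds are assumptions chosen to keep short words intact.

```python
def toy_stem(word):
    """Toy suffix stripper: NOT the Porter algorithm, only two rule ideas."""
    word = word.lower()
    # Rule 1 (assumed threshold): drop a trailing "ement" from longer words
    if word.endswith("ement") and len(word) > 7:
        return word[:-5]
    # Rule 2: a final "y" that follows a consonant becomes "i"
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        return word[:-1] + "i"
    # No rule matched: the word is returned unchanged (only lowercased)
    return word


for w in ["replacement", "berry", "boy", "mice"]:
    print(w, "->", toy_stem(w))
```

Because no rule knows that "mice" is the plural of "mouse", a rule-based approach like this leaves irregular forms untouched.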

### Stem sentence

In [33]:
sentence = "Stemming is a crude method for normalizing words"

In [34]:
# str.split() with no arguments splits on any run of whitespace
# Example : ["Stemming", "is", "a", ...]

sentence_split = sentence.split()

In [35]:
for token in sentence_split: # token = "Stemming", "is", "a", ...
  stemmed_word = porter.stem(token)
  print(stemmed_word, end = " ")

stem is a crude method for normal word 

## Stemming corpus

In [19]:
# load the corpus
with open('corpus_1.txt', 'r') as f:
    corpus_1 = f.read()

# print the original corpus
print('Corpus: \n\n', corpus_1)

# print the corpus after stemming is applied to it
print('\nAfter removing stopwords and Stemming: \n')
for sentence in corpus_1.splitlines():                   # takes each line (paragraph) from the corpus
    for token in sentence.split():                       # splits the line and loops on each token
        token = token.lower()                            # lowercase first: the stopword list is lowercase
        if token not in stop_words:                      # removes stop words
            stemmed_word = porter.stem(token)            # stems the token
            print(stemmed_word, end = " ")               # prints the stemmed token

Corpus: 

 She tried to explain that love wasn't like pie. There wasn't a set number of slices to be given out. There wasn't less to be given to one person if you wanted to give more to another. That after a set amount was given out it would all disappear. She tried to explain this, but it fell on deaf ears.
He lifted the bottle to his lips and took a sip of the drink. He had tasted this before, but he couldn't quite remember the time and place it had happened. He desperately searched his mind trying to locate and remember where he had tasted this when the bicycle ran over his foot.
Don't be scared. The things out there that are unknown aren't scary in themselves. They are just unknown at the moment. Take the time to know them before you list them as scary. Then the world will be a much less scary place for you.
She had been an angel for coming up on 10 years and in all that time nobody had told her this was possible. The fact that it could ever happen never even entered her mind. Yet 
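Splitting on whitespace leaves punctuation attached to tokens (for example `pie.` or `this,`), so such tokens never match the stopword list exactly. A minimal sketch of a reusable helper that strips leading and trailing punctuation before the stopword check is shown below. `stem_text`, `demo_stop_words` and `demo_stem` are hypothetical names introduced here; the stand-in stemmer and stopword set exist only so the block runs on its own.

```python
import string


def stem_text(text, stem, stop_words):
    """Lowercase, strip surrounding punctuation, drop stopwords, then stem."""
    result = []
    for token in text.split():
        # Only leading/trailing punctuation is removed, so "wasn't" stays whole
        token = token.lower().strip(string.punctuation)
        if token and token not in stop_words:
            result.append(stem(token))
    return result


# Stand-ins (assumptions for this sketch, not NLTK objects)
demo_stop_words = {"she", "to", "that"}


def demo_stem(word):
    # Crude placeholder: only strips a trailing "ed"
    return word[:-2] if word.endswith("ed") else word


print(stem_text("She tried to explain that love wasn't like pie.",
                demo_stem, demo_stop_words))
```

In the notebook, the same function could be called as `stem_text(text, porter.stem, stop_words)` with the objects created in the earlier cells.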

## Lemmatization

In [11]:
from nltk.stem import WordNetLemmatizer

In [12]:
# Lemmatization looks words up in the WordNet database
nltk.download("wordnet") # the lemma depends on the part of speech (POS), e.g. saw (VERB) -> see

[nltk_data] Downloading package wordnet to /home/user/nltk_data...


True

In [14]:
# Import the wordnet module from nltk.corpus

from nltk.corpus import wordnet

In [15]:
lemmatizer = WordNetLemmatizer() # The lemmatizer works on one word at a time

In [None]:
lemmatizer.lemmatize("walking") # lemmatize() takes a POS argument, which defaults to NOUN

'walking'

In [None]:
lemmatizer.lemmatize("walking", pos = wordnet.VERB)      # Part of speech tagged as verb

'walk'

In [None]:
lemmatizer.lemmatize("going")

'going'

In [None]:
lemmatizer.lemmatize("going", pos = wordnet.VERB)

'go'

In [None]:
lemmatizer.lemmatize("ran", pos = wordnet.VERB)

'run'

In [None]:
porter.stem("mice")

'mice'

In [None]:
lemmatizer.lemmatize("mice")

'mouse'

In [None]:
porter.stem("was")

'wa'

In [None]:
lemmatizer.lemmatize("was")

'wa'

In [None]:
lemmatizer.lemmatize("was", pos = wordnet.VERB)

'be'

In [None]:
porter.stem("is")

'is'

In [None]:
lemmatizer.lemmatize("is", pos = wordnet.VERB)

'be'

In [None]:
lemmatizer.lemmatize("better", pos = wordnet.ADJ)

'good'

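**Note:** The part-of-speech tags produced by the NLTK tagger (Penn Treebank tags such as 'NNP' or 'VBD') are not directly compatible with the Lemmatizer, so they must first be mapped to WordNet POS constants.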

In [16]:
# Map a Penn Treebank tag from nltk.pos_tag to a WordNet POS constant
# tags -> 'NN', 'VBD', 'JJ', 'RB', etc.
def get_wordnet_pos(tags):
  if tags.startswith('J'): # JJ, JJR, JJS
    return wordnet.ADJ
  elif tags.startswith('V'):
    return wordnet.VERB
  elif tags.startswith('N'):
    return wordnet.NOUN
  elif tags.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN # every other tag falls back to noun
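The same first-letter mapping can also be written as a dictionary lookup. The sketch below is a standalone illustration, assuming that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` are the single-character strings `'a'`, `'v'`, `'n'` and `'r'`, so it runs without NLTK. `WORDNET_POS` and `treebank_to_wordnet` are hypothetical names, not part of NLTK.

```python
# Assumed values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    # tag[:1] is safe for an empty string, which then falls back to the default
    return WORDNET_POS.get(tag[:1], default)


for tag in ["NNP", "VBD", "JJ", "RB", "DT"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

Falling back to noun for unmapped tags mirrors the lemmatizer's own default part of speech.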

In [17]:
# The averaged_perceptron_tagger.zip contains
# the pre-trained English part-of-speech (POS) tagger
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
sentence = "John Abraham have a devoted following".split() # "following" is Noun

In [None]:
# Execute the NLTK POS tagger

word_tag = nltk.pos_tag(sentence) # The output is a list of (word, POS) tuples

In [None]:
print(word_tag)

[('John', 'NNP'), ('Abraham', 'NNP'), ('have', 'VBP'), ('a', 'DT'), ('devoted', 'VBN'), ('following', 'NN')]


In [None]:
for word, tag in word_tag: # word_tag -> "John",'NNP'
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)) # wordnet.NOUN
  print(lemma)
  #print(word, tag)

John
Abraham
have
a
devote
following


In [None]:
# The same steps apply to sentences, paragraphs or whole documents:
# stemming or lemmatization is usually run after stopword removal
sentence = "The cat was following the mouse".split()


In [None]:
word_and_tag = nltk.pos_tag(sentence)
word_and_tag

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('mouse', 'NN')]

In [None]:
for word, tag in word_and_tag:
  lemma = lemmatizer.lemmatize(word,pos=get_wordnet_pos(tag)) # wordnet.VERB
  print(lemma, end = " ")

The cat be follow the mouse 
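The tag, map, lemmatize loop above can be wrapped in a function whose three steps are passed in as arguments. This is a minimal sketch rather than the notebook's own method: `lemmatize_tokens` and every `demo_*` / `DEMO_*` name are hypothetical, and the lookup tables are hand-written stand-ins for the NLTK tagger and WordNet so the block runs without any downloaded data.

```python
def lemmatize_tokens(tokens, tag_tokens, lemmatize, to_wordnet_pos):
    """Tag the tokens, map each tag to a WordNet POS, then lemmatize."""
    # Tagging sees the original casing; lemmatization gets the lowercased word
    return [lemmatize(word.lower(), to_wordnet_pos(tag))
            for word, tag in tag_tokens(tokens)]


# Hand-written stand-ins (assumptions for this sketch only)
DEMO_TAGS = {"The": "DT", "cat": "NN", "was": "VBD",
             "following": "VBG", "the": "DT", "mouse": "NN"}
DEMO_LEMMAS = {("was", "v"): "be", ("following", "v"): "follow"}


def demo_tagger(tokens):
    return [(t, DEMO_TAGS.get(t, "NN")) for t in tokens]


def demo_lemmatize(word, pos):
    return DEMO_LEMMAS.get((word, pos), word)


def demo_pos(tag):
    return {"J": "a", "V": "v", "R": "r"}.get(tag[:1], "n")


print(lemmatize_tokens("The cat was following the mouse".split(),
                       demo_tagger, demo_lemmatize, demo_pos))
```

With the objects from the earlier cells, the call would be `lemmatize_tokens(tokens, nltk.pos_tag, lemmatizer.lemmatize, get_wordnet_pos)`.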

## Lemmatizing a corpus

In [23]:
# load the corpus
with open('corpus_2.txt', 'r') as f:
    corpus_2 = f.read()

# print the original corpus
print('Corpus: \n\n', corpus_2)

# print the corpus after lemmatization is applied to it
print('\nAfter removal of stopwords and Lemmatization: \n')
for sentence in corpus_2.splitlines():                                 # loops over each line (paragraph)
    for word, tag in nltk.pos_tag(sentence.split()):                   # loops over a list of (word, POS) tuples
        word = word.lower()                                            # lowercase first: the stopword list is lowercase
        if word not in stop_words:                                     # removes stopwords
            lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))  # lemmatizes the word based on POS
            print(lemma, end=" ")                                      # prints lemmatized words

Corpus: 

 The alarm went off and Jake rose awake. Rising early had become a daily ritual, one that he could not fully explain. From the outside, it was a wonder that he was able to get up so early each morning for someone who had absolutely no plans to be productive during the entire day.
The water rush down the wash and into the slot canyon below. Two hikers had started the day to sunny weather without a cloud in the sky, but they hadn't thought to check the weather north of the canyon. Huge thunderstorms had brought a deluge o rain and produced flash floods heading their way. The two hikers had no idea what was coming.
Her hand was balled into a fist with her keys protruding out from between her fingers. This was the weapon her father had shown her how to make when she walked alone to her car after work. She wished that she had something a little more potent than keys between her fingers. It would have been nice to have some mace or pepper spray. He had been meaning to buy some but 