**Stemming and Lemmatization Practice with NLTK (Natural Language Toolkit)**

In [1]:
# import NLTK library
import nltk

In [40]:
# for Stemming, use PorterStemmer
from nltk.stem import PorterStemmer

In [3]:
porter = PorterStemmer()

In [4]:
porter.stem("walking")

'walk'

In [5]:
porter.stem("walked")

'walk'

In [6]:
porter.stem("walks")

'walk'

In [7]:
porter.stem("ran")

'ran'

In [8]:
porter.stem("running")

'run'

In [9]:
porter.stem("bosses")

'boss'
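
The plural cases above ("walks" to "walk", "bosses" to "boss") are handled by the first step of the Porter algorithm, which rewrites a few plural suffixes. Below is a minimal sketch of that single step as given in the original algorithm description. It is a toy for illustration, not NLTK's implementation, and the name `porter_step_1a` is hypothetical. The real stemmer goes on to apply several more steps with extra conditions.

```python
def porter_step_1a(word):
    """Toy version of Porter step 1a (plural suffixes only)."""
    word = word.lower()
    # Rules are tried in order; the first matching suffix wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    # No rule matched, so the word is returned unchanged.
    return word


for w in ["bosses", "walks", "berries", "boss", "mice"]:
    print(w, "->", porter_step_1a(w))
```

Because the rules only look at the characters at the end of the word, a form such as "mice" passes through untouched: no suffix matches it.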

In [10]:
porter.stem("replacement")

'replac'

In [11]:
sentence = "Lemmatization is more sophisticated than stemming".split()

In [12]:
for token in sentence:
  print(porter.stem(token), end=" ")

lemmat is more sophist than stem 

In [41]:
# PorterStemmer replaces a trailing "y" with "i"
porter.stem("unnecessary")

'unnecessari'

In [14]:
porter.stem("berry")

'berri'

In [42]:
# for Lemmatization, use WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

In [16]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [17]:
from nltk.corpus import wordnet

In [18]:
lemmatizer = WordNetLemmatizer()

In [19]:
# the default pos is wordnet.NOUN, so "walking" is treated as a noun and returned unchanged
lemmatizer.lemmatize("walking")

'walking'

In [20]:
# specify pos as wordnet.VERB
lemmatizer.lemmatize("walking", pos=wordnet.VERB)

'walk'

In [21]:
lemmatizer.lemmatize("going")

'going'

In [22]:
lemmatizer.lemmatize("going", pos=wordnet.VERB)

'go'

In [23]:
lemmatizer.lemmatize("ran", pos=wordnet.VERB)

'run'

In [24]:
# stemming applies suffix-stripping rules with no dictionary lookup,
# so an irregular plural like "mice" is left unchanged
porter.stem("mice")

'mice'

In [25]:
# lemmatization looks the word up in WordNet, so it can map an irregular plural to its dictionary form
lemmatizer.lemmatize("mice")

'mouse'

In [26]:
porter.stem("was")

'wa'

In [27]:
lemmatizer.lemmatize("was", pos=wordnet.VERB)

'be'

In [28]:
porter.stem("is")

'is'

In [29]:
lemmatizer.lemmatize("is", pos=wordnet.VERB)

'be'

In [30]:
porter.stem("better")

'better'

In [31]:
lemmatizer.lemmatize("better", pos=wordnet.ADJ)

'good'

In [32]:
# NLTK's tagger returns Penn Treebank tags (e.g. 'VBZ'), but WordNetLemmatizer expects WordNet POS constants,
# so define a function that converts one format to the other
def get_wordnet_pos(treebank_tag):
  if treebank_tag.startswith('J'):
    return wordnet.ADJ
  elif treebank_tag.startswith('V'):
    return wordnet.VERB
  elif treebank_tag.startswith('N'):
    return wordnet.NOUN
  elif treebank_tag.startswith('R'):
    return wordnet.ADV
  else:
    return wordnet.NOUN
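
The same mapping can also be written as a table lookup on the first letter of the tag. The sketch below is a self-contained variant that runs without downloading WordNet. It assumes that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV` are the single-letter strings `'a'`, `'v'`, `'n'` and `'r'`. The names `TREEBANK_PREFIX_TO_WORDNET` and `treebank_to_wordnet` are hypothetical.

```python
# Assumption: these letters equal wordnet.ADJ, wordnet.VERB,
# wordnet.NOUN and wordnet.ADV respectively.
TREEBANK_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def treebank_to_wordnet(treebank_tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter by its first character."""
    return TREEBANK_PREFIX_TO_WORDNET.get(treebank_tag[:1], default)


for tag in ["NNP", "VBZ", "DT", "VBN", "JJ", "RB"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

As in the function above, any tag outside the four groups (for example `DT`) falls back to the noun default.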

In [33]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [34]:
sentence = "Donald Trump has a devoted following".split()

In [35]:
# tag each token with NLTK's POS tagger (the tags use the Penn Treebank format)
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('Donald', 'NNP'),
 ('Trump', 'NNP'),
 ('has', 'VBZ'),
 ('a', 'DT'),
 ('devoted', 'VBN'),
 ('following', 'NN')]

In [36]:
# lemmatize each word, converting its Treebank tag to a WordNet pos with get_wordnet_pos
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

Donald Trump have a devote following 

In [37]:
sentence = "The cat was following the bird as it flew by".split()

In [38]:
words_and_tags = nltk.pos_tag(sentence)
words_and_tags

[('The', 'DT'),
 ('cat', 'NN'),
 ('was', 'VBD'),
 ('following', 'VBG'),
 ('the', 'DT'),
 ('bird', 'NN'),
 ('as', 'IN'),
 ('it', 'PRP'),
 ('flew', 'VBD'),
 ('by', 'IN')]

In [39]:
for word, tag in words_and_tags:
  lemma = lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
  print(lemma, end=" ")

The cat be follow the bird a it fly by 
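
In the output above, "as" became "a": its tag `IN` falls through to the noun default, and the lemmatizer then treats the word as a plural noun. One possible workaround, sketched below, is to lemmatize only tokens whose tag maps to a WordNet part of speech and to pass every other token through unchanged. The names `content_pos`, `lemmatize_tagged` and `fake_lemmatize` are hypothetical, and the stand-in lookup table exists only so the sketch runs without WordNet data.

```python
def content_pos(treebank_tag):
    """Return a WordNet POS letter for content words, or None otherwise."""
    # Assumption: 'a', 'v', 'n', 'r' equal wordnet.ADJ/VERB/NOUN/ADV.
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(treebank_tag[:1])


def lemmatize_tagged(words_and_tags, lemmatize):
    """Lemmatize only tokens whose tag maps to a WordNet POS.

    `lemmatize` is any callable taking (word, pos).
    """
    lemmas = []
    for word, tag in words_and_tags:
        pos = content_pos(tag)
        # Function words (DT, IN, PRP, ...) are kept exactly as written.
        lemmas.append(word if pos is None else lemmatize(word, pos))
    return lemmas


# Stand-in lemmatizer backed by a tiny hypothetical lookup table.
FAKE_LEMMAS = {("was", "v"): "be", ("following", "v"): "follow", ("flew", "v"): "fly"}


def fake_lemmatize(word, pos):
    return FAKE_LEMMAS.get((word, pos), word)


tagged = [("The", "DT"), ("cat", "NN"), ("was", "VBD"), ("following", "VBG"),
          ("the", "DT"), ("bird", "NN"), ("as", "IN"), ("it", "PRP"),
          ("flew", "VBD"), ("by", "IN")]
print(" ".join(lemmatize_tagged(tagged, fake_lemmatize)))
# prints: The cat be follow the bird as it fly by
```

With the notebook's own objects, the call would be `lemmatize_tagged(words_and_tags, lemmatizer.lemmatize)`. Since tokens tagged `IN` never reach the lemmatizer in this sketch, "as" stays as written.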