<h2 align="center" >Stemming and Lemmatization</h2>

In simple terms, both techniques reduce a word to its base form.


**Stemming:**
* Uses fixed rules, such as removing suffixes like `able` and `ing`, to derive a base word.


**Lemmatization:**
* Uses knowledge of the language (a.k.a. linguistic knowledge) to derive a base word.
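
The difference between the two approaches can be sketched in plain Python. This is a toy illustration only: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are hypothetical and are not how NLTK or spaCy work internally. Real stemmers apply many more rules, and real lemmatizers rely on full vocabularies and part-of-speech information.

```python
# Toy illustration only: a tiny rule set and a tiny lookup table.

SUFFIXES = ("able", "ing", "s")  # hypothetical, deliberately small rule set

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {"ate": "eat", "eating": "eat", "eats": "eat", "children": "child"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["eating", "eats", "ate", "adjustable", "children"]:
    print(w, " | ", toy_stem(w), " | ", toy_lemmatize(w))
```

The rule-based function has no way to relate "ate" to "eat" because no suffix applies, while the lookup-based function only handles words it already knows about.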


-------------------------------------------------
- NLTK supports both stemming and lemmatization
- spaCy supports only lemmatization


In [3]:
import nltk
import spacy

In [4]:
from nltk.stem import PorterStemmer  # another available stemmer is SnowballStemmer

stemmer = PorterStemmer()

In [9]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting", "better"]

for word in words:
    print(word, " | ", stemmer.stem(word))

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  ate
adjustable  |  adjust
rafting  |  raft
ability  |  abil
meeting  |  meet
better  |  better


In [8]:
nlp = spacy.load("en_core_web_sm")

doc = nlp("eating eats eat   ate  adjustable rafting ability meeting better")

for token in doc:
    print(token," | ",token.lemma_)
    

eating  |  eat
eats  |  eat
eat  |  eat
    |    
ate  |  eat
   |   
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meeting
better  |  well


In [11]:
doc = nlp('''The mother gave her baby a red apple. The baby tried to eat the apple. His mouth was too small.
And he didn’t have any teeth. His brother took the apple. His brother ate the apple.''')

for token in doc:
    print(token," | ",token.lemma_)

The  |  the
mother  |  mother
gave  |  give
her  |  her
baby  |  baby
a  |  a
red  |  red
apple  |  apple
.  |  .
The  |  the
baby  |  baby
tried  |  try
to  |  to
eat  |  eat
the  |  the
apple  |  apple
.  |  .
His  |  his
mouth  |  mouth
was  |  be
too  |  too
small  |  small
.  |  .

  |  

And  |  and
he  |  he
did  |  do
n’t  |  not
have  |  have
any  |  any
teeth  |  tooth
.  |  .
His  |  his
brother  |  brother
took  |  take
the  |  the
apple  |  apple
.  |  .
His  |  his
brother  |  brother
ate  |  eat
the  |  the
apple  |  apple
.  |  .


In [12]:
# customize behaviour: first check which components the pipeline contains

nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [14]:
doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")


for token in doc:
    print(token," | ",token.lemma_)

Bro  |  bro
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brah
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
exhausted  |  exhaust


In [19]:
# "Bro" and "Brah" are not mapped to "Brother" by default, so add custom rules to the attribute ruler


ar = nlp.get_pipe("attribute_ruler")

ar.add([[{"TEXT":"Bro"}], [{"TEXT":"Brah"}]], {"LEMMA": "Brother"})

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")

for token in doc:
    print(token," | ",token.lemma_)

Bro  |  Brother
,  |  ,
you  |  you
wanna  |  wanna
go  |  go
?  |  ?
Brah  |  Brother
,  |  ,
do  |  do
n't  |  not
say  |  say
no  |  no
!  |  !
I  |  I
am  |  be
exhausted  |  exhaust


Now it works: the lemma is shown as "Brother" for both "Bro" and "Brah".
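
The rules above match on `TEXT`, which compares the exact token text, so a lowercase "bro" would not be rewritten. spaCy token patterns also accept a `LOWER` key for case-insensitive matching. The effect can be pictured with a plain-Python analogy. This is not spaCy's implementation: the table `EXACT_OVERRIDES` and the function `apply_override` are hypothetical names used only for illustration.

```python
# Plain-Python analogy: an override table consulted before the model's lemma.

EXACT_OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}  # hypothetical table


def apply_override(token_text, model_lemma, ignore_case=False):
    """Return the override for token_text if one exists, else model_lemma."""
    if ignore_case:
        lowered = {key.lower(): value for key, value in EXACT_OVERRIDES.items()}
        return lowered.get(token_text.lower(), model_lemma)
    return EXACT_OVERRIDES.get(token_text, model_lemma)


print(apply_override("Bro", "bro"))                    # exact match found
print(apply_override("bro", "bro"))                    # no exact match, lemma kept
print(apply_override("bro", "bro", ignore_case=True))  # matched after lowercasing
```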

<h2>Tasks</h2>

**Exercise1:**

* Convert this list of words into their base forms using stemming and lemmatization, and observe the transformations
* Write a short note on the words whose base forms differ between stemming and lemmatization

In [22]:
#stemming
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

for word in lst_words:
    print(word ," | ", stemmer.stem(word))

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish


In [23]:
#lemmatization

doc = nlp("running painting walking dressing likely children who good ate fishing")

for token in doc:
    print(token," | ",token.lemma_)

running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
who  |  who
good  |  good
ate  |  eat
fishing  |  fishing


**Observations**

* Words whose results differ between stemming and lemmatization are:

    * likely
    * children
    * ate
    * fishing

* Stemming derives the base word by stripping suffixes such as `ing` and `ly`, so it reduces 'likely' and 'fishing' to 'like' and 'fish', while lemmatization leaves both words unchanged here. 'painting' is reduced to 'paint' by both methods.

* Lemmatization uses linguistic knowledge of word forms, so it maps irregular forms such as 'children' and 'ate' to 'child' and 'eat', while stemming leaves them unchanged.
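
The comparison can also be done programmatically instead of by eye. The sketch below hard-codes the outputs recorded in the two cells above (it does not call NLTK or spaCy) and lists the words whose results differ. The helper name `differing_words` is hypothetical. Note that the stemming cell used 'whom' while the lemmatization cell used 'who', so neither is compared.

```python
# Outputs copied from the stemming and lemmatization cells above.
stem_out = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "like", "children": "children",
    "whom": "whom", "good": "good", "ate": "ate", "fishing": "fish",
}
lemma_out = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "likely", "children": "child",
    "who": "who", "good": "good", "ate": "eat", "fishing": "fishing",
}


def differing_words(first, second):
    """Return words present in both mappings whose results are not equal."""
    return [word for word in first if word in second and first[word] != second[word]]


print(differing_words(stem_out, lemma_out))
# ['likely', 'children', 'ate', 'fishing']
```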

**Exercise2:**

* Convert the given text into its base form using both stemming and lemmatization

In [24]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhagi. she has a 
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [35]:
#stemming

#step 1: Word tokenizing
#words = list(text.split(" ")) #we can use this also
all_word_tokens = nltk.word_tokenize(text)


#step2: getting the base form for each token using stemmer

stemmed_words = []

for word in all_word_tokens:
    stemmed_words.append(stemmer.stem(word))

#step3: joining all words in a list into string using 'join()'

" ".join(stemmed_words)

'latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhagi . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .'
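
The stemmed output contains tokens such as `girl.sh` and `playing.sh` because the text has no space after some periods, so `girl.She` reaches the stemmer as a single token. One possible workaround, shown here as a minimal sketch using only the standard `re` module, is to split words and punctuation before stemming. The function `simple_tokenize` is a hypothetical helper, not part of NLTK, and it is cruder than a real tokenizer: for example, it would split a contraction like "don't" into three pieces.

```python
import re

# A shortened excerpt of the exercise text, including the missing space.
sample = "Latha is very multi talented girl.She is good at many skills like dancing, running."


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches letters, digits and underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize(sample)
print(tokens)
```

With this split, "girl", "." and "She" become separate tokens, so each one can be passed to `stemmer.stem` on its own.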

In [34]:
#Lemmatization

#step1: Creating the object for the given text
doc = nlp(text)

#step2: getting the base form for each token using spacy 'lemma_'

after_lemma = []

for token in doc:
    after_lemma.append(token.lemma_)

#step3: joining all words in a list into string using 'join()'

" ".join(after_lemma)

'Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . she also like eat Pav Bhagi . she have a \n habit of fishing and swim too . besides all this , she be a wonderful at cook too . \n'