
### **Stemming and Lemmatization**

- **Run this cell to import all necessary packages**

In [None]:
#import the necessary libraries and create the stemmer and spaCy pipeline objects

#for nltk
import nltk
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

#download the package 'punkt' related to nltk
nltk.download('punkt')


#for spacy
import spacy
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Exercise 1:**
- Convert this list of words into base form using stemming and lemmatization, and observe the transformations
- Write a short note on the words whose base forms differ between stemming and lemmatization

In [None]:
#using stemming in nltk
lst_words = ['running', 'painting', 'walking', 'dressing', 'likely', 'children', 'whom', 'good', 'ate', 'fishing']

for word in lst_words:
  print(word, " | ", stemmer.stem(word))


running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  like
children  |  children
whom  |  whom
good  |  good
ate  |  ate
fishing  |  fish


In [None]:
#using lemmatization in spacy

doc = nlp("running painting walking dressing likely children who good ate fishing")

for word in doc:
  print(word, " | ", word.lemma_)


running  |  run
painting  |  paint
walking  |  walk
dressing  |  dress
likely  |  likely
children  |  child
who  |  who
good  |  good
ate  |  eat
fishing  |  fishing
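
To support the short note asked for in this exercise, the two printed results can be compared programmatically. The sketch below is only an illustration: it copies the outputs shown above into two dictionaries by hand (it does not call NLTK or spaCy), and `differing_words` is a hypothetical helper name. The stemming cell uses "whom" while the spaCy cell uses "who", so that entry has no counterpart and is skipped.

```python
# Outputs copied by hand from the two cells above (word -> base form).
stems = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "like", "children": "children",
    "whom": "whom", "good": "good", "ate": "ate", "fishing": "fish",
}
lemmas = {
    "running": "run", "painting": "paint", "walking": "walk",
    "dressing": "dress", "likely": "likely", "children": "child",
    "who": "who", "good": "good", "ate": "eat", "fishing": "fishing",
}


def differing_words(stems, lemmas):
    """Return (word, stem, lemma) for words in both mappings whose results differ."""
    return [
        (word, stem, lemmas[word])
        for word, stem in stems.items()
        if word in lemmas and stem != lemmas[word]
    ]


differences = differing_words(stems, lemmas)
for word, stem, lemma in differences:
    print(f"{word:<10} stem={stem:<10} lemma={lemma}")
```

With the outputs recorded above, the words that differ are "likely", "children", "ate", and "fishing".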


**Exercise 2:**

- Convert the given text into its base form using both stemming and lemmatization

In [None]:
text = """Latha is very multi talented girl.She is good at many skills like dancing, running, singing, playing.She also likes eating Pav Bhaji. she has a
habit of fishing and swimming too.Besides all this, she is a wonderful at cooking too.
"""

In [None]:
#using stemming in nltk


#step1: Word tokenizing
all_words_token = nltk.word_tokenize(text)



#step2: getting the base form for each token using stemmer
base_words = []

for word in all_words_token:
  base_form = stemmer.stem(word)
  base_words.append(base_form)



#step3: joining all base words in the list into a single string using 'join()'
print(' '.join(base_words))

latha is veri multi talent girl.sh is good at mani skill like danc , run , sing , playing.sh also like eat pav bhaji . she ha a habit of fish and swim too.besid all thi , she is a wonder at cook too .
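
The stemmed output contains fused tokens such as `girl.sh`, `playing.sh`, and `too.besid`. The input text has no space after some full stops, so the tokenizer keeps "girl.She" as a single token. One possible pre-processing step, sketched here as an assumption rather than as part of the original exercise, is to insert the missing space with a regular expression before tokenizing. `add_space_after_period` is a hypothetical helper name.

```python
import re


def add_space_after_period(text):
    """Insert a space after '.' when it is directly followed by a letter."""
    # The lookahead matches a letter without consuming it, so only the
    # period itself is replaced by ". ".
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)


sample = "Latha is very multi talented girl.She is good at many skills."
print(add_space_after_period(sample))
```

This simple rule would also split abbreviations such as "e.g." and dotted names such as URLs, so it only suits plain prose like the text in this exercise.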


In [None]:
#using lemmatization in spacy


#step1: creating the spaCy Doc object for the given text
doc = nlp(text)


#step2: getting the base form for each token using spacy 'lemma_'
base_words = []

for word in doc:
  base_form = word.lemma_
  base_words.append(base_form)



#step3: joining all base words in the list into a single string using 'join()'
print(" ".join(base_words))

Latha be very multi talented girl . she be good at many skill like dancing , running , singing , play . she also like eat Pav Bhaji . she have a 
 habit of fishing and swim too . besides all this , she be a wonderful at cook too . 



**Observations**

- Stemming is a heuristic, rule-based process that strips suffixes from a word to approximate its root. It is typically faster than lemmatization but less accurate. In Exercise 1, NLTK's `PorterStemmer` reduces "running" to "run" and "likely" to "like", but leaves irregular forms such as "children" and "ate" unchanged because no suffix rule applies to them. Because the rules do not consult a dictionary, the result is not always a real word: in Exercise 2 the stemmer produces "veri", "mani", "thi", and "ha". It also lowercases its output ("Latha" becomes "latha").

- Lemmatization maps each word to its base or dictionary form (lemma). spaCy chooses the lemma using the word's part of speech (POS), which it predicts from context. This handles irregular forms that stemming misses: "children" becomes "child" and "ate" becomes "eat". The output is normally a valid word, but it depends on the predicted POS. In Exercise 1 "fishing" stays "fishing", and in Exercise 2 "dancing", "running", and "singing" are kept while "playing" becomes "play", presumably because spaCy tagged some of them as nouns and others as verbs.
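
The two mechanisms described above can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not the Porter algorithm and not spaCy's lemmatizer: the suffix list, the minimum stem length, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all assumptions made for illustration.

```python
# Toy suffix rules, tried in order (hypothetical; far simpler than Porter's rules).
SUFFIXES = ("ing", "ly", "es", "s")

# Toy lemma table keyed by (word, part of speech); entries are illustrative only.
LEMMA_TABLE = {
    ("children", "NOUN"): "child",
    ("ate", "VERB"): "eat",
    ("fishing", "VERB"): "fish",
    ("fishing", "NOUN"): "fishing",
}


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for w in ["walking", "running", "this", "children"]:
    print(w, "|", toy_stem(w))

print(toy_lemmatize("children", "NOUN"))
print(toy_lemmatize("fishing", "VERB"), toy_lemmatize("fishing", "NOUN"))
```

The toy stemmer turns "running" into "runn" and "this" into "thi", which shows how blind suffix rules can yield non-words, while the lookup-based lemmatizer returns different lemmas for "fishing" depending on the POS it is given.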