<a href="https://colab.research.google.com/github/MeSamiulIslam/NLP_Learning/blob/main/Stemming_vs_Lemmatization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stemming
## Stemming is the process of reducing a word to its **word stem**: the base form to which suffixes and prefixes attach, which approximates the root of the word (its lemma). Stemming is important in natural language understanding (NLU) and natural language processing (NLP).

[Documentation NLTK PorterStemmer](https://www.nltk.org/_modules/nltk/stem/porter.html#PorterStemmer)
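
To make the idea of "stripping affixes" concrete, here is a deliberately naive suffix stripper in plain Python. It is only a sketch: the name `naive_stem`, the suffix list, and the `min_stem_len` guard are all hypothetical choices made for illustration, and this is not the Porter algorithm, which applies many more rules in several ordered steps.

```python
# A toy suffix stripper (hypothetical helper, NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "ly", "s")  # illustrative list, checked in this order

def naive_stem(word, min_stem_len=3):
    """Remove the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["eating", "eats", "eaten", "writing", "programming", "finally"]:
    print(w + "--->" + naive_stem(w))
```

Results such as "writ" and "programm" show why real stemmers need extra rules, for example restoring a final "e" or undoubling consonants.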

In [1]:
from nltk.stem import PorterStemmer

In [2]:
stemming = PorterStemmer()

In [3]:
words = ["eating", "eats", "eaten", "writing", "programming", "programs", "history", "finally", "finalized"]

In [4]:
for word in words:
  print(word + "--->" + stemming.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


In [5]:
stemming.stem("congratulations")  # some words stem poorly: the result here is not a real word

'congratul'

In [6]:
stemming.stem("understanding")

'understand'

In [7]:
stemming.stem("sitting")

'sit'

# Lancaster Stemming Algorithm
[Documentation](https://www.nltk.org/api/nltk.stem.lancaster.html)

In [8]:
from nltk.stem import LancasterStemmer

In [9]:
lancaster = LancasterStemmer()

In [10]:
for word in words:
  print(word + "--->" + lancaster.stem(word))

eating--->eat
eats--->eat
eaten--->eat
writing--->writ
programming--->program
programs--->program
history--->hist
finally--->fin
finalized--->fin


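## The Porter stemmer is generally better than the Lancaster stemmer, which strips words more aggressively (for example, "history" becomes "hist" and "finally" becomes "fin"), but on some specific problems the Lancaster stemmer performs better.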
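
One rough way to see the difference is to line up the two sets of outputs recorded above. The sketch below copies those stems into plain lists rather than calling NLTK again; "characters kept" is only an informal, assumed proxy for how aggressive a stemmer is, not a standard metric.

```python
# Stems copied from the Porter and Lancaster outputs shown above.
words = ["eating", "eats", "eaten", "writing", "programming",
         "programs", "history", "finally", "finalized"]
porter_stems = ["eat", "eat", "eaten", "write", "program",
                "program", "histori", "final", "final"]
lancaster_stems = ["eat", "eat", "eat", "writ", "program",
                   "program", "hist", "fin", "fin"]

# Words on which the two stemmers disagree.
disagreements = [w for w, p, l in zip(words, porter_stems, lancaster_stems) if p != l]

# Total characters kept by each stemmer: fewer means more was stripped.
porter_chars = sum(len(s) for s in porter_stems)
lancaster_chars = sum(len(s) for s in lancaster_stems)

for w, p, l in zip(words, porter_stems, lancaster_stems):
    marker = "" if p == l else "  <-- differs"
    print(f"{w:<12}{p:<10}{l:<10}{marker}")
print("disagree on:", disagreements)
print("characters kept:", porter_chars, "(Porter) vs", lancaster_chars, "(Lancaster)")
```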

# NLTK has a RegexpStemmer class that makes it easy to implement a regular-expression stemmer. It takes a single regular expression and removes every part of the word that matches it, typically a prefix or suffix.

[Documentation](https://www.nltk.org/api/nltk.stem.regexp.html)

In [11]:
from nltk.stem import RegexpStemmer

In [12]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)  # first parameter is the regular expression; min is the minimum word length to stem

In [13]:
reg_stemmer.stem("eating")

'eat'

In [14]:
reg_stemmer = RegexpStemmer('ing|s$|e$|able$', min=4)

In [15]:
reg_stemmer.stem("ingplaying")

'play'

In [16]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [17]:
reg_stemmer.stem("ingplaying")

'ingplay'

In [18]:
reg_stemmer.stem('advisable')

'advis'

In [19]:
reg_stemmer.stem('mass')

'mas'
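
The results in this section can be reproduced with the standard `re` module, which helps explain them. The sketch below assumes the stemmer simply deletes every match of the pattern and leaves words shorter than `min` untouched; `regexp_stem` is a hypothetical stand-in written for illustration, not NLTK's own code.

```python
import re

def regexp_stem(word: str, pattern: str, min_len: int = 0) -> str:
    """Delete every match of `pattern` from `word`; skip words shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

anchored = "ing$|s$|e$|able$"    # every alternative must match at the end of the word
unanchored = "ing|s$|e$|able$"   # "ing" may match anywhere in the word

print(regexp_stem("eating", anchored, 4))          # suffix removed
print(regexp_stem("ingplaying", unanchored, 4))    # both "ing" occurrences removed
print(regexp_stem("ingplaying", anchored, 4))      # only the trailing "ing" removed
print(regexp_stem("mass", anchored, 4))            # over-stemming: the final "s" is stripped
print(regexp_stem("gas", anchored, 4))             # shorter than min_len, left unchanged
```

The `$` anchor is what restricts a rule to suffixes. Without it, the pattern also strips matches at the start or in the middle of a word.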

# Snowball Stemmer
[Documentation](https://www.nltk.org/api/nltk.stem.snowball.html)

In [20]:
from nltk.stem import SnowballStemmer

In [21]:
snowball_stemmer = SnowballStemmer('english', ignore_stopwords=False)  # first parameter is the language; ignore_stopwords=False means stop words are stemmed too

In [22]:
for word in words:
  print(word + "--->" + snowball_stemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
programming--->program
programs--->program
history--->histori
finally--->final
finalized--->final


In [23]:
# Compare Porter and Snowball on the same words

In [24]:
stemming.stem("fairly"), stemming.stem("sportingly")

('fairli', 'sportingli')

In [25]:
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")

('fair', 'sport')

# Lemmatization Technique

## WordNet Lemmatizer
Lemmatization is similar to stemming. The output of lemmatization is called a 'lemma', which is a root word rather than a root stem (the output of stemming). After lemmatization we get a valid word that means the same thing.
NLTK provides the WordNetLemmatizer class, which is a thin wrapper around the WordNet corpus. This class uses the morphy() function of the WordNet CorpusReader class to find a lemma.

[Documentation](https://www.nltk.org/api/nltk.stem.wordnet.html)

In [26]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [27]:
for word in words:
  print(word + "--->" + lemmatizer.lemmatize(word))  # default pos is 'n' (noun)

eating--->eating
eats--->eats
eaten--->eaten
writing--->writing
programming--->programming
programs--->program
history--->history
finally--->finally
finalized--->finalized


# The first parameter is the word and the second is the POS tag (e.g. noun=n, verb=v, adjective=a, adverb=r)

In [28]:
for word in words:
  print(word + "--->" + lemmatizer.lemmatize(word, pos='v'))

eating--->eat
eats--->eat
eaten--->eat
writing--->write
programming--->program
programs--->program
history--->history
finally--->finally
finalized--->finalize


In [29]:
lemmatizer.lemmatize("better", pos='n')

'better'

In [30]:
lemmatizer.lemmatize("better", pos='v')

'better'

In [31]:
lemmatizer.lemmatize("better", pos='r')

'well'

In [32]:
lemmatizer.lemmatize("better", pos='a')

'good'
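
Because the result depends so strongly on `pos`, a common pattern is to derive it from a part-of-speech tagger instead of hard-coding it. The sketch below assumes Penn Treebank style tags (such as `VBG` or `JJR`) as input. `penn_to_wordnet` is a hypothetical helper written for illustration and is not part of NLTK.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to the pos letter that lemmatize() expects."""
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # fall back to noun, the same default lemmatize() uses

for tag in ["NN", "VBG", "JJR", "RBR"]:
    print(tag + "--->" + penn_to_wordnet(tag))
```

With a tagger such as `nltk.pos_tag` (which needs its own model download), each `(word, tag)` pair could then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.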

# Some use cases


*   Sentiment analysis = stemming (fast, and the stems do not have to be valid words)
*   Chatbot = lemmatization (the output must be valid words)