# Title: Stemming and Lemmatization in NLP using NLTK

# 1. Stemming

**Stemming** is a **text processing** technique in NLP that reduces a word to its root form, called the 'stem'.

**For example**, a stemmer might convert words like "running" and "runs" to the common stem "run." Irregular forms such as "ran" are generally left unchanged, because stemmers only strip affixes by rule.
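To make the idea of rule-based suffix stripping concrete, here is a minimal sketch. It is a toy, not the Porter algorithm: the suffix list and the helper name `toy_stem` are assumptions made up for illustration.

```python
# Toy suffix-stripping stemmer: an illustrative sketch only, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["jumping", "jumped", "jumps", "jump"]:
    print(w + "---->" + toy_stem(w))
```

All four words come out as "jump". The sketch would turn "running" into "runn", which is why real stemmers add further rules, for instance for doubled consonants.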

Three stemming techniques are demonstrated in this notebook:

- PorterStemmer
- RegexpStemmer
- SnowballStemmer

### (i). Porter Stemmer

Developed by Martin Porter, the Porter stemming algorithm is one of the oldest and most widely used stemming algorithms.

In [1]:
from nltk.stem import PorterStemmer

In [4]:
stemmer = PorterStemmer()

In [2]:
words = ["eating","eaten","eater","congratulations","lovely","lover","loving"]

This technique performs well, but it has a drawback: the stem it produces is not always a valid word. For example, 'congratulations' is stemmed to 'congratul', which is not an English word.

In [5]:
for word in words:
  print(word + "---->" + stemmer.stem(word))

eating---->eat
eaten---->eaten
eater---->eater
congratulations---->congratul
lovely---->love
lover---->lover
loving---->love


### (ii). RegexpStemmer:

Regex (regular expression) stemming involves using regular expressions to find and remove suffixes from words. Developers can customize the regular expressions based on the specific requirements of their text processing tasks.
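The sketch below mimics this behaviour with the standard `re` module, assuming the stemmer simply deletes whatever the pattern matches and skips words shorter than a minimum length. The helper `regex_stem` and both patterns are hypothetical names chosen for illustration. It also shows a common pitfall: a pattern that matches the whole word deletes the whole word.

```python
import re


def regex_stem(word, pattern, min_len=0):
    """Delete every match of `pattern` from `word`, unless the word is shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)


suffix_only = r"(ing|ly|ed|s)$"        # matches just a trailing suffix
whole_word = r"^(.*?)(ing|ly|ed|s)?$"  # matches the entire word

print(regex_stem("eating", suffix_only, min_len=5))        # eat
print(regex_stem("eat", suffix_only, min_len=5))           # eat (shorter than min_len, left alone)
print(repr(regex_stem("eating", whole_word, min_len=5)))   # '' because the whole word matched
```

Anchoring each suffix with `$` and leaving the rest of the word out of the pattern keeps the stem intact.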

In [18]:
from nltk.stem import RegexpStemmer

In [19]:
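# RegexpStemmer deletes whatever the pattern matches, so match only the suffix, not the whole word
reg_stemmer = RegexpStemmer('ing$|ly$|ed$|ious$|ies$|ive$|es$|s$|ment$', min=5)

In [23]:
print(reg_stemmer.stem("eates"))

eat

In [21]:
reg_stemmer.stem("lover")

'lover'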

### (iii). Snowball Stemmer:

Snowball is a stemming framework, also developed by Martin Porter, that provides stemmers for several languages. Its English stemmer is an improvement upon the original Porter stemming algorithm.

In [25]:
from nltk.stem import SnowballStemmer

In [27]:
snow_ball_stemmer = SnowballStemmer("english")

In [28]:
for word in words:
  print(word + "--->" + snow_ball_stemmer.stem(word))

eating--->eat
eaten--->eaten
eater--->eater
congratulations--->congratul
lovely--->love
lover--->lover
loving--->love


### Difference between PorterStemmer and SnowballStemmer

SnowballStemmer often produces cleaner stems than PorterStemmer; for example, it reduces "fairly" to "fair" where PorterStemmer returns "fairli". SnowballStemmer also supports multiple languages, whereas PorterStemmer supports only English.

In [29]:
stemmer.stem("fairly"), snow_ball_stemmer.stem("fairly")

('fairli', 'fair')

In [31]:
stemmer.stem("congratulations"), snow_ball_stemmer.stem("congratulations")

('congratul', 'congratul')
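A quick way to explore both points is to list the languages NLTK ships Snowball stemmers for and to print only the words on which the two stemmers disagree. This is a small sketch; the word list is an arbitrary choice and the variable names `porter` and `snowball` are introduced here for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Languages for which NLTK provides a Snowball stemmer
print(SnowballStemmer.languages)

# Print only the words on which the two stemmers disagree
for word in ["fairly", "congratulations", "loving"]:
    p, s = porter.stem(word), snowball.stem(word)
    if p != s:
        print(word + "---->" + p + " (Porter) vs " + s + " (Snowball)")
```

Going by the cell outputs above, only "fairly" should be printed, since both stemmers return the same stem for the other two words.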

# 2. Lemmatization:

Lemmatization is also a technique used in NLP, but it goes beyond stemming by reducing words to their "lemma" or base form, which is a valid word found in the dictionary. Lemmatization involves considering the context of a word and its part of speech to produce a meaningful base form. Unlike stemming, lemmatization ensures linguistic accuracy but can be computationally more intensive.
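The role of the part of speech can be sketched with a tiny lookup table. This is only an illustration: the table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexical database rather than three hand-written entries.

```python
# Hypothetical, hand-written lemma table keyed by (word, pos); a real lemmatizer
# consults a full lexical database such as WordNet instead.
TOY_LEMMAS = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma from the toy table, or the word unchanged if it is not listed."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("meeting", pos="v"))  # meet
print(toy_lemmatize("meeting", pos="n"))  # meeting
print(toy_lemmatize("better", pos="a"))   # good
```

The same surface form maps to different lemmas depending on its part of speech, and an irregular form like "better" maps to a dictionary word that no suffix rule could produce.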

The lemmatization technique demonstrated in this notebook:

- WordNetLemmatizer

In [35]:
import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


The `pos` argument of `lemmatize` takes one of the following part-of-speech tags (it defaults to 'n'):

- Noun - 'n'
- Verb - 'v'
- Adjective - 'a'
- Adverb - 'r'
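In practice the part of speech often comes from a tagger rather than being typed by hand, and taggers commonly emit Penn Treebank tags such as 'VBZ' or 'NNS'. The sketch below maps those tags onto the four letters listed above. The helper name `to_wordnet_pos` is hypothetical, and falling back to noun for unrecognised tags is an assumption, not a rule from NLTK.

```python
def to_wordnet_pos(treebank_tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBZ') to one of the pos letters 'n', 'v', 'a', 'r'."""
    if treebank_tag.startswith("J"):
        return "a"  # adjective tags: JJ, JJR, JJS
    if treebank_tag.startswith("V"):
        return "v"  # verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    if treebank_tag.startswith("N"):
        return "n"  # noun tags: NN, NNS, NNP, NNPS
    if treebank_tag.startswith("RB"):
        return "r"  # adverb tags: RB, RBR, RBS
    return default  # assumption: treat everything else as a noun


for tag in ["VBZ", "NNS", "JJ", "RB", "DT"]:
    print(tag + "--->" + to_wordnet_pos(tag))
```

The returned letter can then be passed as the `pos` argument when lemmatizing a tagged word.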

In [36]:
lemmatizer = WordNetLemmatizer()

In [43]:
lemmatizer.lemmatize("goes", pos="v")

'go'

In [44]:
lemmatizer.lemmatize("goes", pos="n")

'go'

In [45]:
lemmatizer.lemmatize("goes", pos="a")

'goes'

In [46]:
lemmatizer.lemmatize("goes", pos="r")

'goes'

In [47]:
for word in words:
  print(word + "--->" + lemmatizer.lemmatize(word, pos="v"))

eating--->eat
eaten--->eat
eater--->eater
congratulations--->congratulations
lovely--->lovely
lover--->lover
loving--->love


End of Code!