# 03_NLP : Stemming and Lemmatization in NLP

## 1. Introduction

In Natural Language Processing (NLP), **text normalization** is the process of converting text into a standard, consistent form. Two common normalization techniques are **stemming** and **lemmatization**.
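
To make "standard, consistent form" concrete, here is a minimal sketch of a normalization step that usually runs before stemming or lemmatization. The function name `normalize` and the specific steps (lowercasing, removing ASCII punctuation, collapsing whitespace) are assumptions chosen for illustration; real pipelines vary.

```python
import re
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  The Runners were RUNNING, quickly!  "))
# the runners were running quickly
```

Stemming and lemmatization then operate on the individual tokens of the cleaned string.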

## 2. Conceptual Explanation

### Stemming

* Stemming reduces words to a **stem** (an approximate root form) by chopping off suffixes.
* It uses **rule-based heuristics**, not vocabulary or grammar.
* The resulting word may **not be a valid English word**.

**Example:**

* running → run
* studies → studi
* better → better

**Pros:** Fast, simple

**Cons:** Can produce incorrect or non-meaningful words
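
The rule-based idea can be sketched with a toy suffix stripper. The rules and the names `SUFFIX_RULES` and `toy_stem` are hypothetical and far simpler than a real algorithm such as Porter's. They only show why a stemmer needs no dictionary and why its output is not always a real word.

```python
# Hypothetical (suffix, replacement) rules, tried in order. Not the Porter rules.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", ""), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word  # no rule applied

for w in ["studies", "jumping", "quickly", "cats", "better", "running"]:
    print(w, "->", toy_stem(w))
# studies -> studi
# jumping -> jump
# quickly -> quick
# cats -> cat
# better -> better
# running -> runn
```

The last line shows the limits of naive stripping: production stemmers add further rules, such as undoing doubled consonants, so that "running" becomes "run".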


### Lemmatization

* Lemmatization reduces words to their **dictionary base form (lemma)**.
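* It relies on a **vocabulary and the word's part of speech (POS)**, which is usually inferred from context.
* The output is normally a **valid dictionary word**; a word the lemmatizer does not know is typically returned unchanged.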

**Example:**

* running → run (treated as a verb)
* studies → study
* better → good (treated as an adjective)

**Pros:** More accurate, meaningful output

**Cons:** Slower, computationally heavier
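
Conceptually, a lemmatizer is a lookup keyed by both the word and its part of speech. The sketch below uses a tiny hand-written lexicon. `LEXICON` and `toy_lemmatize` are hypothetical names, and real lemmatizers draw on far larger resources such as WordNet.

```python
# Hand-written (word, POS) -> lemma entries, for illustration only.
LEXICON = {
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better"))        # better (no noun entry, so unchanged)
```

The same surface form maps to different lemmas depending on the POS, which is why a lemmatizer needs more information than a stemmer.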


### Key Differences

| Feature         | Stemming             | Lemmatization |
| --------------- | -------------------- | ------------- |
| Speed           | Fast                 | Slower        |
| Accuracy        | Lower                | Higher        |
| Uses Vocabulary | No                   | Yes           |
| Output          | May not be real word | Real word     |


## 3. When to Use What?

* Use **stemming** when:

  * Speed is critical
  * Exact meaning is not required (e.g., search engines)

* Use **lemmatization** when:

  * Semantic meaning matters
  * You are building sentiment analysis, question answering (QA), or chatbot systems
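
As a sketch of the search-engine case, the example below indexes documents by stemmed tokens so that a query matches regardless of inflection. The documents and the helper names `stem_set` and `search` are invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_set(text):
    """Lowercase, split on whitespace, and return the set of stems."""
    return {stemmer.stem(token) for token in text.lower().split()}

# Hypothetical mini-corpus.
documents = {
    "doc1": "best running shoes for beginners",
    "doc2": "how to bake bread",
}
index = {doc_id: stem_set(text) for doc_id, text in documents.items()}

def search(query):
    """Return ids of documents containing every stemmed query term."""
    query_stems = stem_set(query)
    return [doc_id for doc_id, stems in index.items() if query_stems <= stems]

print(search("run shoe"))  # ['doc1']
```

The stems only have to be consistent between query and document, not readable, which is why the speed of stemming is a good trade here.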

# Stemming with NLTK

### Import Libraries

In [1]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Porter Stemmer

In [2]:
stemming = PorterStemmer()
words = ["running", "flies", "happily", "studies", "better"]

In [3]:
for word in words:
    print(word+"---->"+stemming.stem(word))

running---->run
flies---->fli
happily---->happili
studies---->studi
better---->better


### Snowball Stemmer

In [4]:
snowball = SnowballStemmer("english")

# Stem each word with the Snowball stemmer
for word in words:
    print(word+"---->"+snowball.stem(word))

running---->run
flies---->fli
happily---->happili
studies---->studi
better---->better
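
For the five sample words, the Snowball output matches the Porter output, but the two algorithms do diverge on other inputs. The sketch below prints both side by side for a few extra words chosen for illustration (not taken from the notebook), and lists the languages Snowball supports.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Supported languages are exposed as a class attribute.
print(SnowballStemmer.languages)

# Extra words, picked for illustration, where the two stemmers may disagree.
for word in ["generously", "fairly", "happily"]:
    print(word, "| porter:", porter.stem(word), "| snowball:", snowball.stem(word))
```

Unlike Porter, Snowball also covers languages other than English, which is often the practical reason to choose it.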


### Lancaster Stemmer

In [5]:
from nltk.stem import LancasterStemmer

In [6]:
lancaster=LancasterStemmer()

In [7]:
for word in words:
    print(word+"---->"+lancaster.stem(word))

running---->run
flies---->fli
happily---->happy
studies---->study
better---->bet
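
To compare the three stemmers directly, the sketch below runs the same word list through each one and prints a small table. The `compare` helper and the column layout are assumptions made for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
words = ["running", "flies", "happily", "studies", "better"]

def compare(words):
    """Return {word: {stemmer_name: stem}} for every word."""
    return {w: {name: s.stem(w) for name, s in stemmers.items()} for w in words}

table = compare(words)
print(f"{'word':<10}" + "".join(f"{name:<11}" for name in stemmers))
for word, row in table.items():
    print(f"{word:<10}" + "".join(f"{row[name]:<11}" for name in stemmers))
```

Lancaster is the most aggressive of the three: as the output above shows, it reduces "better" to "bet", which may merge unrelated words.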


# Lemmatization with NLTK

### Import WordNet Lemmatizer

In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True


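### RegexpStemmer Class

NLTK also provides a `RegexpStemmer` class for building a stemmer from a regular expression. It takes a single regular expression and removes every substring that matches it. An anchor such as `$` restricts an alternative to the end of the word, and `min` sets the minimum word length to which stemming is applied. In the pattern used here, `ing` is unanchored, so it is removed wherever it occurs, while `s`, `e`, and `able` are removed only at the end of a word.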

In [9]:
from nltk.stem import RegexpStemmer

In [10]:
reg_stemmer=RegexpStemmer('ing|s$|e$|able$', min=4)

In [11]:
reg_stemmer.stem("eating")

'eat'

In [12]:
reg_stemmer.stem("ingplaying")

'play'
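
The `ingplaying` result follows from the unanchored `ing` alternative. The sketch below mirrors the observed behaviour with plain `re.sub`, so it runs without NLTK. The function `regexp_stem` is a hypothetical stand-in, not NLTK's source, and the anchored pattern is an assumed alternative added for comparison.

```python
import re

def regexp_stem(word, pattern, min_len=0):
    """Delete every match of pattern, unless the word is shorter than min_len."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

unanchored = "ing|s$|e$|able$"   # pattern from the notebook
anchored = "ing$|s$|e$|able$"    # assumed variant: 'ing' only as a suffix

for word in ["eating", "ingplaying", "singing", "bee"]:
    print(word, "->", regexp_stem(word, unanchored, 4), "|", regexp_stem(word, anchored, 4))
# eating -> eat | eat
# ingplaying -> play | ingplay
# singing -> s | sing
# bee -> bee | bee
```

Anchoring each alternative is usually the safer choice, since an unanchored pattern can delete the middle or the start of a word, as "singing" shows.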

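### Lemmatization Without POS

With no `pos` argument, `lemmatize` treats every word as a noun, so verbs and adjectives such as "running" and "better" come back unchanged.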

In [13]:
lemmatizer = WordNetLemmatizer()

In [14]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word))

running---->running
flies---->fly
happily---->happily
studies---->study
better---->better


### Lemmatization With POS Tags

In [15]:
print(lemmatizer.lemmatize("running", pos='v')) # verb
print(lemmatizer.lemmatize("better", pos='a')) # adjective

run
good
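
In running text the POS is rarely known in advance. A common approach is to tag the tokens first (for example with `nltk.pos_tag`, which returns Penn Treebank tags by default) and translate each tag into the single-letter codes that `lemmatize` accepts. The helper below is a minimal sketch of that translation. `penn_to_wordnet` is a hypothetical name, and the tags are hand-written rather than produced by a tagger, so the example runs without extra downloads.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default to noun, as lemmatize() itself does

# Hand-written (token, tag) pairs standing in for tagger output.
tagged = [("running", "VBG"), ("better", "JJR"), ("flies", "NNS")]
for word, tag in tagged:
    print(word, tag, "->", penn_to_wordnet(tag))
# running VBG -> v
# better JJR -> a
# flies NNS -> n
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.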


In [16]:
# Speed-critical tasks such as search: stemming
# Meaning-sensitive tasks such as sentiment analysis and chatbots: lemmatization


## 4. Conclusion

Stemming and lemmatization are both important NLP preprocessing techniques. While stemming is fast and simple, lemmatization provides better linguistic accuracy. The choice depends on your application.


