# Stemming & Lemmatization in NLP

## Introduction
When processing text, words often appear in **different forms**. For example:

- `running`, `ran`, `runs` → root form is `run`
- `better` → base form is `good`

**Stemming** and **lemmatization** are techniques to **reduce words to their base forms**:

| Technique       | Example input                | Output |
|-----------------|------------------------------|--------|
| Stemming        | running, runs                | run    |
| Lemmatization   | better (tagged as adjective) | good   |

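**Key Difference:**
- **Stemming:** rule-based suffix stripping. It is fast, but the output may not be a real word, and irregular forms such as `ran` are left unchanged.
- **Lemmatization:** dictionary lookup (WordNet in NLTK) that takes the part of speech into account and returns a valid base form.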

**In this notebook, we will learn:**
1. Stemming using NLTK
2. Lemmatization using NLTK
3. Examples on a small dataset


In [1]:
# Step 0: Import libraries
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
import nltk

# Download the WordNet data used by WordNetLemmatizer
nltk.download('wordnet')
# Open Multilingual WordNet; some NLTK versions require it alongside 'wordnet'
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     D:\miniconda_setup\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     D:\miniconda_setup\nltk_data...


True

In [2]:
# Step 1: Stemming Example
stemmer = PorterStemmer()

words = ["running", "ran", "runs", "easily", "fairly"]
stemmed_words = [stemmer.stem(w) for w in words]

print("Original Words:", words)
print("Stemmed Words:", stemmed_words)

Original Words: ['running', 'ran', 'runs', 'easily', 'fairly']
Stemmed Words: ['run', 'ran', 'run', 'easili', 'fairli']


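**Explanation:**  
- The stemmer strips common suffixes such as `-ing` and `-s`, so `running` and `runs` both become `run`.  
- Irregular forms are not handled: `ran` stays `ran`.  
- The output is not always a real English word (`easily` → `easili`, `fairly` → `fairli`).  
- Good for **search engines or fast preprocessing**, where matching related forms matters more than readable output.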
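
To make the idea of suffix stripping concrete, here is a deliberately naive sketch in plain Python. It is not the Porter algorithm: `toy_stem` and its suffix list are hypothetical names chosen for illustration, and the real stemmer applies many more rules (for example, undoing the doubled consonant in `running`).

```python
def toy_stem(word, suffixes=("ing", "ly", "es", "s")):
    """Strip the first matching suffix. Illustrative only, not Porter."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["running", "ran", "runs", "easily", "fairly"]:
    print(w, "->", toy_stem(w))
```

Even this toy version shows the trade-off: `runs` becomes `run`, `ran` is untouched, and `running` becomes the non-word `runn` because no rule repairs the doubled consonant.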

In [3]:
# Step 2: Lemmatization Example
lemmatizer = WordNetLemmatizer()

words = ["running", "ran", "runs", "better", "fairly"]
lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]  # 'v' = verb

print("Original Words:", words)
print("Lemmatized Words (verbs):", lemmatized_words)

# Lemmatization with default POS (noun)
lemmatized_words_noun = [lemmatizer.lemmatize(w) for w in words]
print("Lemmatized Words (default/noun):", lemmatized_words_noun)


Original Words: ['running', 'ran', 'runs', 'better', 'fairly']
Lemmatized Words (verbs): ['run', 'run', 'run', 'better', 'fairly']
Lemmatized Words (default/noun): ['running', 'ran', 'run', 'better', 'fairly']


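**Explanation:**  
- Lemmatization returns **dictionary words**, not truncated stems.  
- The **part-of-speech** (POS) argument matters (`v`=verb, `n`=noun, `a`=adjective, `r`=adverb). With `pos='v'`, all three forms of `run` are reduced; with the default (`n`), `running` and `ran` are left unchanged.  
- `better` stays `better` in both runs above. It is only mapped to `good` when lemmatized as an adjective (`pos='a'`).  
- Works well for **text analysis and NLP pipelines** where the output must remain readable.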
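
Because the right POS differs from word to word, a common pattern is to tag each token first and convert the tag to the single-letter code that the lemmatizer expects. The helper below is a minimal sketch of that conversion, assuming Penn Treebank style tags (`JJ*`, `VB*`, `RB*`, `NN*`); `penn_to_wordnet` and the hand-written `tagged` list are hypothetical and only for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default


# Hand-tagged tokens (assumed tags, written out by hand for this example)
tagged = [("She", "PRP"), ("is", "VBZ"), ("better", "JJR"),
          ("at", "IN"), ("playing", "VBG"), ("football", "NN")]

for word, tag in tagged:
    print(f"{word:10} {tag:4} -> {penn_to_wordnet(tag)}")
```

In a real pipeline the tags would typically come from a tagger such as `nltk.pos_tag` (which needs its own data download), and the returned letter would be passed as `pos` to `lemmatizer.lemmatize`. With `better` tagged as `JJR`, the lemmatizer receives `pos='a'` and can return `good`.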

In [4]:
# Step 3: Stemming & Lemmatization on a Sample Dataset
sample_texts = [
    "The children are running faster than yesterday.",
    "He runs every morning and enjoys it.",
    "She is better at playing football than me."
]

# Tokenize on whitespace. Punctuation stays attached (e.g. 'yesterday.').
tokenized_texts = [text.split() for text in sample_texts]

for text, tokens in zip(sample_texts, tokenized_texts):
    stemmed = [stemmer.stem(w) for w in tokens]
    # pos='v' treats every token as a verb, so only verb forms are reduced
    lemmatized = [lemmatizer.lemmatize(w, pos='v') for w in tokens]
    print(f"Original: {text}")
    print(f"Stemmed: {stemmed}")
    print(f"Lemmatized: {lemmatized}")
    print('-'*50)


Original: The children are running faster than yesterday.
Stemmed: ['the', 'children', 'are', 'run', 'faster', 'than', 'yesterday.']
Lemmatized: ['The', 'children', 'be', 'run', 'faster', 'than', 'yesterday.']
--------------------------------------------------
Original: He runs every morning and enjoys it.
Stemmed: ['he', 'run', 'everi', 'morn', 'and', 'enjoy', 'it.']
Lemmatized: ['He', 'run', 'every', 'morning', 'and', 'enjoy', 'it.']
--------------------------------------------------
Original: She is better at playing football than me.
Stemmed: ['she', 'is', 'better', 'at', 'play', 'footbal', 'than', 'me.']
Lemmatized: ['She', 'be', 'better', 'at', 'play', 'football', 'than', 'me.']
--------------------------------------------------


- **Stemming**: Strips suffixes to approximate a root form. Fast, but may create non-words; NLTK's `PorterStemmer` also lowercases its input.  
  - Example: running → run, easily → easili  
- **Lemmatization**: Converts words to their dictionary base form using the part of speech. More accurate, but slower.  
  - Example: running → run (`pos='v'`), better → good (`pos='a'`)  

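**Tips:**  
1. Always tokenize first, and separate punctuation from words: whitespace splitting leaves tokens such as `yesterday.` that neither tool normalizes.  
2. Use stemming for speed, lemmatization for accuracy.  
3. Both reduce vocabulary size, which can help models that treat every distinct token as a separate feature.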
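
For the first tip, a small regex-based tokenizer is often enough to split punctuation from words. The sketch below uses only the standard library; `simple_tokenize` is a hypothetical helper, not an NLTK function, and it handles contractions poorly (`don't` is split into three tokens).

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("The children are running faster than yesterday.")
print(tokens)
# ['The', 'children', 'are', 'running', 'faster', 'than', 'yesterday', '.']
```

Feeding these tokens to the stemmer or lemmatizer means `yesterday` is processed as a word instead of `yesterday.`. For more robust handling of contractions and abbreviations, `nltk.word_tokenize` is one alternative, at the cost of an extra data download.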
