# 📌 Topic: Stemming

### What you will learn
- What stemming is and why it matters
- How the Porter Stemmer algorithm works
- How stemming reduces vocabulary size
- Trade-offs between stemming and lemmatization
- Real-world examples of word stemming

### Why this matters
Stemming is a text normalization technique that reduces words to a common base form (the stem). This shrinks the vocabulary, which can make models more efficient and helps them generalize across word forms. However, stemming is **lossy and rule-based**, so it sometimes produces non-words ("poni" for "ponies"). Understanding when to use stemming vs. lemmatization is essential for proper NLP preprocessing.

---

## What is Stemming?

**Stemming** is the process of reducing words to their base or root form (called the "stem") by stripping affixes, in practice mostly suffixes. The stem does not have to be a dictionary word.

### Examples:
- "connecting" → "connect"
- "connected" → "connect"
- "connectivity" → "connect"
- "connects" → "connect"

All these variations are treated as the same word, which:
- **Reduces vocabulary size**: One stem instead of 4+ word forms
- **Improves pattern recognition**: Models see more instances of the same concept
- **Saves memory**: Smaller vocabulary = smaller models
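The vocabulary reduction can be checked directly. The following is a minimal sketch, assuming NLTK is installed; the sample sentence and the whitespace tokenization are made up for illustration.

```python
from nltk.stem import PorterStemmer

# Hypothetical sample text, tokenized naively on whitespace
text = "she connects the cables he connected the router they are connecting devices connect them"
tokens = text.split()

ps = PorterStemmer()

# Vocabulary = set of distinct tokens, before and after stemming
raw_vocab = set(tokens)
stemmed_vocab = {ps.stem(token) for token in tokens}

print(f"Unique tokens before stemming: {len(raw_vocab)}")
print(f"Unique tokens after stemming:  {len(stemmed_vocab)}")
```

On real corpora the reduction depends heavily on the text and the tokenizer, so treat the numbers from a toy sentence as a demonstration only.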

### Stemming vs. Lemmatization:

| Aspect | Stemming | Lemmatization |
|--------|----------|----------------|
| **Method** | Rule-based (strip suffixes) | Dictionary-based (find root word) |
| **Speed** | Fast | Slower |
| **Accuracy** | May produce non-words | Produces dictionary words (lemmas) |
| **"running"** | "run" | "run" (as a verb) |
| **"better"** | "better" | "good" (as an adjective) |
| **Use case** | Search, document retrieval | NER, sentiment analysis |

**Key insight**: Stemming applies suffix rules without knowing the word; lemmatization looks the word up and takes its part of speech into account.

## The Porter Stemmer Algorithm

The **Porter Stemmer** is one of the most widely used stemming algorithms for English. It applies a sequence of **rewriting rules** to strip suffixes from English words.

### How it works:
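1. Computes the "measure" of a candidate stem (roughly, the number of vowel-consonant sequences it contains)
2. Applies suffix-stripping rules in a fixed sequence of steps, each guarded by a condition such as the measure or whether the stem contains a vowel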
3. Returns the resulting stem
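The measure in step 1 can be sketched in a few lines. This is an illustrative reimplementation of the definition in Porter's paper, not NLTK's internal code; the helper names `is_consonant` and `measure` are chosen here for the example, and input is assumed to be lowercase.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] acts as a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant (as in "ivy")
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count how often a run of vowels is followed by a run of consonants."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Each vowel-run/consonant-run boundary shows up exactly once as "VC"
    return pattern.count("VC")


for stem in ["tree", "trouble", "troubles", "connect", "p"]:
    print(f"{stem:10} m = {measure(stem)}")
```

With this definition, "tree" and "p" have measure 0, "trouble" has measure 1, and "troubles" and "connect" have measure 2.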

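### Example rule:
```
If the word ends in 'ed' and the part before 'ed' contains a vowel:
    Remove 'ed'
```

This is why "connected" → "connect" (the stem "connect" contains a vowel), but "ped" is not modified (the stem "p" has no vowel). Other rules are guarded by the measure instead: the "eed" → "ee" rule, for example, applies only when the measure of the preceding stem is greater than 0, which is why "feed" is left unchanged.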
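As a rough illustration, the "ed" rule can be written as a small function. This sketch follows Porter's published step 1b, where the condition on "ed" is that the remaining stem contains a vowel. It omits the clean-up the full algorithm performs after stripping, and `contains_vowel` and `strip_ed` are names made up for this example.

```python
VOWELS = "aeiou"


def contains_vowel(stem: str) -> bool:
    """Return True if the stem has a vowel, counting 'y' after a consonant."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        # 'y' acts as a vowel when it follows a consonant (e.g. "cry")
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def strip_ed(word: str) -> str:
    """Remove a trailing 'ed' only if what remains contains a vowel."""
    if word.endswith("ed"):
        stem = word[:-2]
        if contains_vowel(stem):
            return stem
    return word


for word in ["connected", "ped", "bled"]:
    print(f"{word:10} → {strip_ed(word)}")
```

Here "connected" loses its suffix, while "ped" and "bled" are returned unchanged because "p" and "bl" contain no vowel.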

### Why Porter Stemmer?
- Simple and fast (O(n) where n = word length)
- Works well for English
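- Widely available (for example, in NLTK; scikit-learn does not ship a stemmer, but one can be plugged into its vectorizers through a custom tokenizer)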
- Good baseline before moving to lemmatization

In [None]:
# Stemming reduces words to their base form by stripping suffixes
# Example: "running" and "runs" both become "run" (the irregular form "ran" stays "ran")
# This is different from lemmatization, which uses a dictionary to find the actual root word

In [None]:
# Import the Porter Stemmer from NLTK
# PorterStemmer is an implementation of the famous Porter Stemming Algorithm
# It's rule-based and doesn't use a dictionary
from nltk.stem import PorterStemmer

## Practical Example: Stemming a Word Family

Let's see how the Porter Stemmer reduces all variations of the word "connect" to a single stem.

In [None]:
# Create an instance of the Porter Stemmer
# This object has a .stem() method that reduces words to their stem
ps = PorterStemmer()

# Create a list of words that are all variations of "connect"
# These have different suffixes: -ing, -ed, -ivity, -s
connect_variations = ["connecting", "connected", "connectivity", "connects"]

# Iterate through each word and stem it
print("Porter Stemmer Output:")
print("=" * 40)
for word in connect_variations:
    # ps.stem() removes suffix and returns the root
    stem = ps.stem(word)
    print(f"{word:15} → {stem}")

# Notice: All variations reduce to "connect"
print("\n✓ Success: All word variants reduced to the same stem!")

## More Stemming Examples

Let's see how the Porter Stemmer handles various English words and their common variations.

In [None]:
# Test stemming on various word families
ps = PorterStemmer()

# Dictionary of word families for testing
test_words = {
    "run": ["run", "running", "ran", "runs"],  # "ran" is irregular, so it keeps its own stem
    "argue": ["argue", "argued", "arguing", "argument"],
    "play": ["play", "playing", "played", "plays"],
    "happy": ["happy", "happiness", "happily"],
}

# Display stemming results
for family, words in test_words.items():
    print(f"\nWord family: {family}")
    print("-" * 40)
    stems = set()  # Use a set to see unique stems
    for word in words:
        stem = ps.stem(word)
        stems.add(stem)
        print(f"  {word:15} → {stem}")
    print(f"  Unique stem(s): {stems}")

## Edge Cases and Limitations

The Porter Stemmer is rule-based, so it has limitations:

### ⚠️ Problem 1: Over-stemming (too aggressive)
```python
"universal" → "univers"
"university" → "univers"
# These are different concepts but produce the same stem!
```

### ⚠️ Problem 2: Under-stemming (not aggressive enough)
```python
"data" → "data" (not stemmed)
"datum" → "datum" (not stemmed)
# Same concept, but different stems
```

### ⚠️ Problem 3: Non-words as output
```python
"caresses" → "caress"  # Fine: a real English word
"ponies" → "poni"  # Not a real English word!
```

### ✅ Solutions:
- Use **lemmatization** if accuracy matters more than speed
- Use **stemming** if processing large datasets where speed is critical
- Combine with domain-specific dictionaries for specialized vocabularies
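The last point can be sketched as a thin wrapper that consults an exception table before falling back to the stemmer. This is one possible approach, assuming NLTK is installed; `DOMAIN_EXCEPTIONS` and `stem_with_exceptions` are hypothetical names, and the entries are examples rather than a recommended list.

```python
from nltk.stem import PorterStemmer

# Hypothetical overrides: words whose stem we want to control by hand
DOMAIN_EXCEPTIONS = {
    "university": "university",  # keep distinct from "universal"
    "data": "datum",             # merge with its singular form
}

ps = PorterStemmer()


def stem_with_exceptions(word, exceptions=DOMAIN_EXCEPTIONS):
    """Look the word up in the exception table first, then fall back to Porter."""
    key = word.lower()
    if key in exceptions:
        return exceptions[key]
    return ps.stem(key)


for word in ["universal", "university", "data", "datum"]:
    print(f"{word:12} → {stem_with_exceptions(word)}")
```

The table has to be maintained by hand, so this is most useful for a small set of terms that matter in a given domain.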

In [None]:
# Demonstrate edge cases
ps = PorterStemmer()

# Edge case examples
edge_cases = [
    "universal",   # Over-stemming example
    "university",  # Over-stemming example
    "data",        # Under-stemming example
    "ponies",      # Produces non-word
    "caresses",    # Works fine
]

print("Edge Case Analysis:")
print("=" * 50)
for word in edge_cases:
    stem = ps.stem(word)
    print(f"{word:15} → {stem}")

print("\n⚠️ Note: Some outputs aren't real English words!")

## When to Use Stemming

### ✅ Good use cases:
1. **Search and retrieval**: User searches "running", finds documents with "run", "runs", etc.
2. **Text classification**: Reducing vocabulary helps with sparse data
3. **Information retrieval**: Full-text search engines commonly apply a stemmer to both the indexed text and the query
4. **Fast preprocessing**: When you have billions of documents
5. **Early-stage exploration**: Quick baseline before more sophisticated methods
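To make the search case concrete, here is a minimal sketch of a stemmed inverted index, assuming NLTK is installed. The documents, the whitespace tokenizer, and the `search` helper are invented for this example; a real system would use a proper tokenizer and ranking.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical mini-corpus
documents = [
    "he runs every morning",
    "i run on weekends",
    "running is fun",
    "she reads books",
]

# Map each stem to the set of document ids that contain it
index = defaultdict(set)
for doc_id, doc in enumerate(documents):
    for token in doc.split():
        index[ps.stem(token)].add(doc_id)


def search(query):
    """Return ids of documents containing a word that shares the query's stem."""
    return index.get(ps.stem(query), set())


print(sorted(search("running")))
```

Because the query and the documents pass through the same stemmer, a search for "running" also matches the documents containing "runs" and "run".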

### ❌ Bad use cases:
1. **Named Entity Recognition**: You need the original word form
2. **Sentiment analysis**: Nuances matter, and collapsing word forms can blur distinctions a classifier relies on
3. **Machine translation**: Word form carries grammatical meaning
4. **Dense neural models**: Modern transformers use subword tokenizers and are trained on unstemmed text, so they handle morphology themselves

### Rule of thumb:
**Stemming works best for bag-of-words models and high-speed applications.** For neural networks and accurate NLP tasks, prefer lemmatization or skip normalization altogether.

## Stemming vs. Lemmatization

Let's compare with a preview of lemmatization (which we'll cover fully later):

In [None]:
# Comparison between stemming and lemmatization
# (Only stemming runs here; lemmatization needs extra tooling such as spaCy, which we'll install later)

ps = PorterStemmer()

test_words = ["better", "running", "arguing", "universal"]

print("Stemming vs. Lemmatization comparison:")
print("=" * 50)
print(f"{'Word':<15} | {'Stemmed':<15} | {'Notes':<20}")
print("-" * 50)
for word in test_words:
    stem = ps.stem(word)
    # Flag whether the stemmer changed the word at all
    note = "unchanged" if stem == word else "changed"
    print(f"{word:<15} | {stem:<15} | {note}")

print("\nNote: Lemmatization (next notebook) produces smarter results")
print("Example: 'better' → 'good' (lemmatization, as an adjective) vs 'better' (stemming)")

## Key Takeaways

1. **Stemming reduces word variance**: Multiple word forms → single stem
2. **Porter Stemmer is rule-based and fast**: Good for large datasets
3. **Trade-off: Speed vs. Accuracy**: Fast but may produce non-words or miss semantic differences
4. **Over-stemming problem**: "universal" and "university" → same stem
5. **Know when to use it**: Search/retrieval systems, bag-of-words models
6. **Know when NOT to use it**: NER, sentiment analysis, neural models

## Next Steps:
- Learn **lemmatization** (smarter, slower, dictionary-based)
- Combine stemming with stopword removal for complete preprocessing
- Experiment with different stemmers (Snowball, Lancaster) for comparison
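For the last suggestion, a comparison could be set up along these lines. This sketch assumes NLTK is installed and uses its `SnowballStemmer` and `LancasterStemmer` classes; the word list is arbitrary, and no outputs are shown here because they differ between stemmers.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three rule-based stemmers that ship with NLTK
stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

words = ["running", "universal", "ponies", "happily"]

# Print one row per word and one column per stemmer
print(f"{'Word':<12}" + "".join(f"{name:<12}" for name in stemmers))
for word in words:
    row = "".join(f"{stemmer.stem(word):<12}" for stemmer in stemmers.values())
    print(f"{word:<12}{row}")
```

Rows where the three columns disagree are good candidates for a closer look at how aggressive each stemmer is.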

## Practice Exercise

Try stemming these words and see if you can identify:
1. Words that stem correctly
2. Over-stemming examples
3. Under-stemming examples

In [None]:
# Practice: Stem this list and categorize the results
practice_words = [
    "testing", "tested", "test",
    "running", "runner", "runs",
    "quickly", "quick",
    "caresses", "ponies", "ties"
]

ps = PorterStemmer()
results = {}

for word in practice_words:
    stem = ps.stem(word)
    if stem not in results:
        results[stem] = []
    results[stem].append(word)

print("Words grouped by stem:")
print("=" * 50)
for stem, words in results.items():
    print(f"Stem '{stem}': {words}")