# NLTK Complete Guide - Section 5: Stemming

This notebook covers:
- Porter Stemmer
- Lancaster Stemmer
- Snowball Stemmer (Multi-language)
- Regexp Stemmer
- Comparing Stemmers
- Practical Applications

In [None]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
from nltk.tokenize import word_tokenize

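nltk.download('punkt', quiet=True)
# Newer NLTK releases look for 'punkt_tab' when word_tokenize is called
nltk.download('punkt_tab', quiet=True)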

## What is Stemming?

**Stemming** reduces a word to a root form (its *stem*) by stripping suffixes with heuristic rules, without consulting a dictionary.

- `running` → `run`
- `studies` → `studi`
- `happiness` → `happi`

⚠️ **Note**: Stems are not always valid words!

## 5.1 Porter Stemmer

The most widely used stemmer. It applies a fixed sequence of rules to strip suffixes.

In [None]:
ps = PorterStemmer()

words = [
    "running", "runs", "runner", "ran",
    "easily", "fairly", "happily",
    "studies", "studying", "studied",
    "connection", "connected", "connecting",
]

print("Porter Stemmer Results")
print("=" * 35)
print(f"{'Word':<20} {'Stem':<15}")
print("-" * 35)

for word in words:
    print(f"{word:<20} {ps.stem(word):<15}")

### Porter Stemmer Rules Demo

In [None]:
rules_demo = {
    "Plural (-s, -es)": ["cats", "dogs", "boxes", "churches"],
    "Past tense (-ed)": ["walked", "jumped", "added", "needed"],
    "Progressive (-ing)": ["running", "walking", "sitting", "getting"],
    "Adverbs (-ly)": ["quickly", "happily", "easily", "angrily"],
    "Nouns (-tion, -ment)": ["connection", "movement", "action", "judgment"],
}

print("Porter Stemmer - Common Rules")
print("=" * 50)

for rule, words in rules_demo.items():
    print(f"\n{rule}:")
    for word in words:
        print(f"  {word:<15} → {ps.stem(word)}")
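
To illustrate what "a series of rules" means, the sketch below reimplements only step 1a of Porter's original algorithm (the plural rules) in plain Python. `toy_step_1a` and `STEP_1A_RULES` are hypothetical names written for this guide, not part of NLTK. NLTK's `PorterStemmer` applies several further steps and its own extensions, so its output can differ from this toy version.

```python
# Toy version of Porter step 1a: an ordered list of (suffix, replacement) rules.
# The first matching rule wins, so longer suffixes are listed first.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (protects words ending in -ss)
    ("s", ""),       # cats     -> cat
]


def toy_step_1a(word):
    """Apply the first matching rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "caress", "cats", "run"]:
    print(f"  {word:<15} → {toy_step_1a(word)}")
```

The `("ss", "ss")` rule looks like a no-op, but it stops the final `("s", "")` rule from turning `caress` into `cares`.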

## 5.2 Lancaster Stemmer

More aggressive than Porter: it strips more of each word and often produces shorter stems.

In [None]:
ls = LancasterStemmer()

words = [
    "running", "maximum", "presumably", "multiply",
    "organization", "generalization", "maximize",
]

print("Lancaster Stemmer Results")
print("=" * 35)
print(f"{'Word':<20} {'Stem':<15}")
print("-" * 35)

for word in words:
    print(f"{word:<20} {ls.stem(word):<15}")

### Porter vs Lancaster

In [None]:
words = [
    "running", "maximum", "presumably", "multiply",
    "generalization", "organization", "loving", "happiness",
    "connection", "generate", "university", "friendship",
]

print("Porter vs Lancaster Comparison")
print("=" * 55)
print(f"{'Word':<18} {'Porter':<15} {'Lancaster':<15} {'Diff'}")
print("-" * 55)

for word in words:
    porter = ps.stem(word)
    lancaster = ls.stem(word)
    diff = "*" if porter != lancaster else ""
    print(f"{word:<18} {porter:<15} {lancaster:<15} {diff}")

print("\n* = Different results")
print("💡 Lancaster is more aggressive, often shorter stems")
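
One rough way to put a number on "more aggressive" is to compare the average stem length each stemmer leaves behind. This is a sketch under that assumption only: `mean_stem_length` is a hypothetical helper, and stem length is a crude proxy that says nothing about whether the stems are useful.

```python
from nltk.stem import PorterStemmer, LancasterStemmer


def mean_stem_length(stemmer, words):
    """Average number of characters left after stemming."""
    stems = [stemmer.stem(w) for w in words]
    return sum(len(s) for s in stems) / len(stems)


sample = [
    "running", "maximum", "presumably", "multiply",
    "generalization", "organization", "university", "friendship",
]

original_len = sum(len(w) for w in sample) / len(sample)
porter_len = mean_stem_length(PorterStemmer(), sample)
lancaster_len = mean_stem_length(LancasterStemmer(), sample)

print(f"Original words : {original_len:.2f} chars on average")
print(f"Porter stems   : {porter_len:.2f} chars on average")
print(f"Lancaster stems: {lancaster_len:.2f} chars on average")
```

A lower average means more of each word was stripped. Try it on a sample from your own corpus before drawing conclusions.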

## 5.3 Snowball Stemmer

An improved version of the Porter stemmer (the English variant is also known as Porter2) with support for multiple languages.

In [None]:
# Available languages
print("Available languages:")
print(SnowballStemmer.languages)

In [None]:
# English Snowball Stemmer
ss = SnowballStemmer("english")

words = ["running", "generously", "happiness", "beautiful", "organization"]

print("English Snowball Stemmer:")
for word in words:
    print(f"  {word} → {ss.stem(word)}")
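
To see where the "improved" rules matter, a side-by-side run of Porter and English Snowball on a few adverbs can help. The word list below is an arbitrary choice for illustration. Only differing rows are marked, and no particular outcome is assumed beyond the `generously` case that NLTK's own documentation uses.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

print(f"{'Word':<15} {'Porter':<12} {'Snowball':<12} {'Diff'}")
print("-" * 45)

for word in ["generously", "running", "fairly", "sportingly"]:
    p_stem = porter.stem(word)
    s_stem = snowball.stem(word)
    marker = "*" if p_stem != s_stem else ""
    print(f"{word:<15} {p_stem:<12} {s_stem:<12} {marker}")
```

For `generously`, Porter strips down to `gener` while Snowball stops at `generous`, which is usually closer to what a reader expects.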

### Multi-language Stemming

In [None]:
# Test words in different languages
test_cases = {
    "english": ["running", "happiness", "organization"],
    "spanish": ["corriendo", "felicidad", "organización"],
    "french": ["courant", "bonheur", "organisation"],
    "german": ["laufend", "Glück", "Organisation"],
    "italian": ["correndo", "felicità", "organizzazione"],
}

print("Multi-language Snowball Stemming")
print("=" * 60)

for language, words in test_cases.items():
    stemmer = SnowballStemmer(language)
    print(f"\n{language.capitalize()}:")
    for word in words:
        print(f"  {word:<20} → {stemmer.stem(word)}")
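
`SnowballStemmer` raises a `ValueError` when given a language it does not support, so code that takes the language from user input or metadata may want to check `SnowballStemmer.languages` first. A minimal sketch, where `get_snowball_stemmer` is a hypothetical helper and returning `None` is just one possible fallback policy:

```python
from nltk.stem import SnowballStemmer


def get_snowball_stemmer(language):
    """Return a SnowballStemmer, or None if the language is not supported."""
    language = language.lower()
    if language not in SnowballStemmer.languages:
        return None
    return SnowballStemmer(language)


for language in ["Spanish", "klingon"]:
    stemmer = get_snowball_stemmer(language)
    if stemmer is None:
        print(f"{language}: no Snowball stemmer available, leaving text unstemmed")
    else:
        print(f"{language}: corriendo → {stemmer.stem('corriendo')}")
```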

## 5.4 Regexp Stemmer

Create custom stemmers using regular expressions.

In [None]:
# Basic suffix removal
rs = RegexpStemmer('ing$|ed$|s$', min=4)

words = ["running", "walked", "cats", "dogs", "jumping", "needed"]

print("Regexp Stemmer (removes -ing, -ed, -s)")
print("Pattern: 'ing$|ed$|s$', min=4")
print("-" * 40)

for word in words:
    print(f"  {word:<15} → {rs.stem(word)}")
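
The `min` argument is a minimum word length: any word shorter than `min` is returned unchanged, which protects short words such as `was` from losing their final letter. The sketch below uses the pattern and words from NLTK's own `RegexpStemmer` documentation. The variable name `rs_demo` is only for illustration.

```python
from nltk.stem import RegexpStemmer

rs_demo = RegexpStemmer('ing$|s$|e$|able$', min=4)

for word in ["cars", "mass", "was", "bee", "compute", "advisable"]:
    stem = rs_demo.stem(word)
    note = "(shorter than min, untouched)" if len(word) < 4 else ""
    print(f"  {word:<15} → {stem:<10} {note}")
```

Note that `mass` becomes `mas`: a regular expression has no notion of which suffixes are real, so over-stemming is easy to introduce.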

In [None]:
# Custom patterns for different word types
patterns = {
    "Verbal": ('ing$|ed$|es$|s$', 3, ["running", "walked", "boxes", "plays"]),
    "Noun": ('tion$|ment$|ness$|ity$', 4, ["connection", "movement", "happiness", "ability"]),
    "Adjective": ('able$|ible$|ful$|less$', 4, ["readable", "visible", "beautiful", "careless"]),
    "Adverb": ('ly$', 4, ["quickly", "happily", "slowly", "carefully"]),
}

print("Custom Regexp Stemmers")
print("=" * 50)

for name, (pattern, min_len, words) in patterns.items():
    stemmer = RegexpStemmer(pattern, min=min_len)
    print(f"\n{name} suffixes (pattern: '{pattern}'):")
    for word in words:
        print(f"  {word:<15} → {stemmer.stem(word)}")

## 5.5 Comparing All Stemmers

In [None]:
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer("english")
rs = RegexpStemmer('ing$|ed$|s$|able$|tion$', min=4)

words = [
    "programming", "programmer", "programmed",
    "organization", "organized", "organizing",
    "beautiful", "beautifully", "beauty",
    "happiness", "happy", "happily",
]

print("All Stemmers Comparison")
print("=" * 75)
print(f"{'Word':<16} {'Porter':<12} {'Lancaster':<12} {'Snowball':<12} {'Regexp':<12}")
print("-" * 75)

for word in words:
    print(f"{word:<16} {ps.stem(word):<12} {ls.stem(word):<12} {ss.stem(word):<12} {rs.stem(word):<12}")

### Consistency Test

Do words from the same family (inflections and derivations of one root) get the same stem?

In [None]:
word_families = [
    ["run", "running", "runs", "runner", "ran"],
    ["connect", "connection", "connected", "connecting"],
    ["happy", "happiness", "happily", "happier"],
    ["beauty", "beautiful", "beautifully", "beautify"],
]

print("Stemmer Consistency Test")
print("=" * 60)

for family in word_families:
    print(f"\nWord family: {family}")
    
    for name, stemmer in [("Porter", ps), ("Lancaster", ls), ("Snowball", ss)]:
        stems = [stemmer.stem(word) for word in family]
        unique_stems = set(stems)
        is_consistent = len(unique_stems) == 1
        status = "✅ Consistent" if is_consistent else f"⚠️ {len(unique_stems)} different stems"
        print(f"  {name:<10} → {unique_stems}  {status}")

## 5.6 Practical Applications

### Stemming a Sentence

In [None]:
def stem_sentence(sentence, stemmer=None):
    """Stem all words in a sentence"""
    if stemmer is None:
        stemmer = PorterStemmer()
    
    tokens = word_tokenize(sentence.lower())
    stemmed = [stemmer.stem(t) for t in tokens if t.isalpha()]
    return stemmed

sentences = [
    "The cats are running and jumping happily.",
    "She was studying programming and organizing her notes.",
    "The beautiful organization connected many communities.",
]

print("Sentence Stemming")
print("=" * 60)

for sentence in sentences:
    stemmed = stem_sentence(sentence)
    print(f"\nOriginal: {sentence}")
    print(f"Stemmed:  {' '.join(stemmed)}")

### Stemming for Search

In [None]:
# Documents
documents = [
    "The runner was running in the marathon.",
    "She runs every morning before work.",
    "Running is good exercise for runners.",
    "The car drove quickly down the street.",
]

# Search query
query = "run"
query_stem = ps.stem(query)

print(f"Search query: '{query}' (stem: '{query_stem}')")
print("\nMatching documents:")
print("-" * 50)

for i, doc in enumerate(documents, 1):
    tokens = word_tokenize(doc.lower())
    stems = [ps.stem(t) for t in tokens if t.isalpha()]
    
    if query_stem in stems:
        # Find which words matched
        matched = [t for t in tokens if t.isalpha() and ps.stem(t) == query_stem]
        print(f"\n✅ Doc {i}: {doc}")
        print(f"   Matched words: {matched}")
    else:
        print(f"\n❌ Doc {i}: {doc}")
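
The loop above re-stems every document for each query. For more than a handful of queries, a common alternative is to stem once and store the result in an inverted index that maps each stem to the documents containing it. The following is a minimal sketch of that idea, not a full search engine: `stem_tokens`, `index`, and `search` are hypothetical names, and a simple regular expression stands in for `word_tokenize` so the example needs no downloaded data.

```python
import re
from collections import defaultdict

from nltk.stem import PorterStemmer

docs = [
    "The runner was running in the marathon.",
    "She runs every morning before work.",
    "Running is good exercise for runners.",
    "The car drove quickly down the street.",
]

stemmer = PorterStemmer()


def stem_tokens(text):
    """Lowercase, split on runs of letters, and stem each token."""
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())]


# Build the index once: stem -> set of 1-based document ids
index = defaultdict(set)
for doc_id, doc in enumerate(docs, 1):
    for stem in stem_tokens(doc):
        index[stem].add(doc_id)


def search(query):
    """Return the sorted ids of documents containing the query's stem."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print("run     →", search("run"))
print("running →", search("running"))
print("bicycle →", search("bicycle"))
```

Because the query is stemmed with the same stemmer as the documents, `run` and `running` retrieve the same set.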

## 5.7 Stemmer Utility Class

In [None]:
class Stemmer:
    """Utility class for stemming operations"""
    
    STEMMERS = {
        'porter': PorterStemmer,
        'lancaster': LancasterStemmer,
        'snowball': lambda: SnowballStemmer('english'),
    }
    
    def __init__(self, stemmer_type='porter'):
        if stemmer_type not in self.STEMMERS:
            raise ValueError(f"Unknown stemmer: {stemmer_type}")
        
        # Every entry in STEMMERS is a zero-argument factory
        self.stemmer = self.STEMMERS[stemmer_type]()
        self.stemmer_type = stemmer_type
    
    def stem(self, word):
        """Stem a single word"""
        return self.stemmer.stem(word)
    
    def stem_words(self, words):
        """Stem a list of words"""
        return [self.stemmer.stem(w) for w in words]
    
    def stem_text(self, text):
        """Tokenize and stem text"""
        tokens = word_tokenize(text.lower())
        return [self.stemmer.stem(t) for t in tokens if t.isalpha()]
    
    def stem_documents(self, documents):
        """Stem multiple documents"""
        return [self.stem_text(doc) for doc in documents]

In [None]:
# Use the utility class
text = "The programmers are programming different programs."

print(f"Text: {text}\n")

for stype in ['porter', 'lancaster', 'snowball']:
    stemmer = Stemmer(stype)
    result = stemmer.stem_text(text)
    print(f"{stype.capitalize():<10} → {result}")
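
Real text repeats the same tokens many times, so a utility like the one above may benefit from caching stems instead of recomputing them. One possible approach, sketched here with the standard library's `functools.lru_cache`: the name `cached_stem` and the cache size are arbitrary choices for illustration.

```python
from functools import lru_cache

from nltk.stem import PorterStemmer

_porter = PorterStemmer()


@lru_cache(maxsize=50_000)
def cached_stem(word):
    """Stem a word, reusing the result for words seen before."""
    return _porter.stem(word)


tokens = "the programmers are programming the programs the programmers wrote".split()
stems = [cached_stem(t) for t in tokens]

info = cached_stem.cache_info()
print(stems)
print(f"cache hits={info.hits}, misses={info.misses}")
```

Each distinct token is stemmed once (a miss). Repeated tokens such as `the` are served from the cache (a hit).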

## 5.8 When to Use Stemming

### ✅ Good Use Cases

| Use Case | Why |
|----------|-----|
| **Information Retrieval / Search** | Match different word forms to same concept |
| **Text Classification** | Reduce vocabulary size |
| **Document Clustering** | Group similar documents |
| **Quick Prototyping** | Faster than lemmatization |
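
The "reduce vocabulary size" row can be made concrete by counting distinct tokens before and after stemming. The sketch below uses the classic *connect* family on purpose, so the reduction is much larger than what a real corpus would show. The variable names are illustrative.

```python
from nltk.stem import PorterStemmer

tokens = ["connect", "connected", "connecting", "connection", "connections"]

ps_vocab = PorterStemmer()
raw_vocab = set(tokens)
stemmed_vocab = {ps_vocab.stem(t) for t in tokens}

print(f"Vocabulary before stemming: {len(raw_vocab)} entries")
print(f"Vocabulary after stemming:  {len(stemmed_vocab)} entries {stemmed_vocab}")
```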

### ❌ Not Recommended For

| Use Case | Why Not |
|----------|--------|
| **Sentiment Analysis** | Loses nuance ("happy" vs "happily") |
| **Machine Translation** | Need exact word forms |
| **Text Generation** | Stems aren't valid words |
| **Named Entity Recognition** | Proper nouns shouldn't be stemmed |

## Summary

| Stemmer | Aggressiveness | Speed | Multi-language |
|---------|---------------|-------|----------------|
| **Porter** | Medium | Fast | No |
| **Lancaster** | High | Fast | No |
| **Snowball** | Medium | Fast | Yes |
| **Regexp** | Custom | Very Fast | Custom |
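
The speed column is a qualitative ranking. Actual numbers depend on the machine, the NLTK version, and the word list. To check it on your own setup, a rough timing sketch with the standard library's `timeit` might look like this (the word list and repeat counts are arbitrary assumptions):

```python
import timeit

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer

stemmers = {
    "Porter": PorterStemmer(),
    "Lancaster": LancasterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Regexp": RegexpStemmer('ing$|ed$|s$', min=4),
}

# A small word list repeated to get a measurable workload
words = [
    "running", "organization", "happiness", "connected",
    "beautifully", "programming", "studies", "quickly",
] * 200

timings = {}
for name, stemmer in stemmers.items():
    # Total seconds for 5 passes over the word list
    timings[name] = timeit.timeit(lambda: [stemmer.stem(w) for w in words], number=5)

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:<10} {seconds:.4f} s")
```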

### Quick Reference
```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

ps.stem('running')  # 'run'
```