# NLTK Complete Guide - Section 5: Stemming

This notebook covers:
- Porter Stemmer
- Lancaster Stemmer
- Snowball Stemmer (Multi-language)
- Regexp Stemmer
- Comparing Stemmers
- Practical Applications

In [33]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, RegexpStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)

True

## What is Stemming?

**Stemming** reduces words to a common base form (the stem) by stripping suffixes with heuristic rules rather than a dictionary lookup.

- `running` → `run`
- `studies` → `studi`
- `happiness` → `happi`

⚠️ **Note**: Stems are not always valid words!
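To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The `toy_stem` function and its `TOY_RULES` table are hypothetical, invented for this illustration. They are far cruder than the NLTK stemmers covered in this section and are tuned only to reproduce the three examples above.

```python
# Hypothetical rule table for illustration: (suffix, replacement), tried in order.
TOY_RULES = [
    ("iness", "i"),  # happiness -> happi
    ("ies", "i"),    # studies   -> studi
    ("ing", ""),     # running   -> runn (then undoubled to run)
    ("s", ""),       # cats      -> cat
]


def toy_stem(word):
    """Strip at most one suffix, keeping at least two characters of stem."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            stem = word[: len(word) - len(suffix)] + replacement
            # After removing -ing, collapse a doubled final letter (runn -> run).
            if suffix == "ing" and len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


for w in ["running", "studies", "happiness"]:
    print(w, "->", toy_stem(w))
```

Because the rules only look at spelling, results such as `studi` and `happi` are expected: a stem is a shared key for related words, not a dictionary entry.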

## 5.1 Porter Stemmer

One of the most widely used English stemmers. It strips suffixes by applying a fixed sequence of rewrite rules.

In [34]:
ps = PorterStemmer()

words = [
    "running", "runs", "runner", "ran",
    "easily", "fairly", "happily",
    "studies", "studying", "studied",
    "connection", "connected", "connecting",
]

print("Porter Stemmer Results")
print("=" * 35)
print(f"{'Word':<20} {'Stem':<15}")
print("-" * 35)

for word in words:
    print(f"{word:<20} {ps.stem(word):<15}")

Porter Stemmer Results
Word                 Stem           
-----------------------------------
running              run            
runs                 run            
runner               runner         
ran                  ran            
easily               easili         
fairly               fairli         
happily              happili        
studies              studi          
studying             studi          
studied              studi          
connection           connect        
connected            connect        
connecting           connect        


### Porter Stemmer Rules Demo

In [35]:
rules_demo = {
    "Plural (-s, -es)": ["cats", "dogs", "boxes", "churches"],
    "Past tense (-ed)": ["walked", "jumped", "added", "needed"],
    "Progressive (-ing)": ["running", "walking", "sitting", "getting"],
    "Adverbs (-ly)": ["quickly", "happily", "easily", "angrily"],
    "Nouns (-tion, -ment)": ["connection", "movement", "action", "judgment"],
}

print("Porter Stemmer - Common Rules")
print("=" * 50)

for rule, words in rules_demo.items():
    print(f"\n{rule}:")
    for word in words:
        print(f"  {word:<15} → {ps.stem(word)}")

Porter Stemmer - Common Rules

Plural (-s, -es):
  cats            → cat
  dogs            → dog
  boxes           → box
  churches        → church

Past tense (-ed):
  walked          → walk
  jumped          → jump
  added           → ad
  needed          → need

Progressive (-ing):
  running         → run
  walking         → walk
  sitting         → sit
  getting         → get

Adverbs (-ly):
  quickly         → quickli
  happily         → happili
  easily          → easili
  angrily         → angrili

Nouns (-tion, -ment):
  connection      → connect
  movement        → movement
  action          → action
  judgment        → judgment
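In the last group, `connection` loses its suffix while `movement`, `action`, and `judgment` come back unchanged. In Porter's original algorithm, many rules fire only when the remaining stem is long enough, as counted by its *measure* m: the number of vowel-to-consonant transitions. The sketch below computes m in plain Python. `is_consonant` and `measure` are illustrative helpers, not NLTK functions, and the explanation assumes NLTK applies the original rule conditions to these particular words, which is consistent with the output above.

```python
def is_consonant(word, i):
    """Porter's definition: not a/e/i/o/u, and not a 'y' that follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


# Candidate stems left after removing -ion, -ion, -ment, and -ement.
for stem in ["connect", "act", "judg", "mov"]:
    print(f"{stem:<8} m = {measure(stem)}")
```

The suffix-removal rules for `-ion`, `-ment`, and `-ement` require m > 1 (and `-ion` also requires the stem to end in `s` or `t`). Only `connect` has m = 2, so it is the only one of the four that is stripped.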


## 5.2 Lancaster Stemmer

More aggressive than Porter: it strips more of each word, so stems are often shorter and more words collapse to the same stem.

In [36]:
ls = LancasterStemmer()

words = [
    "running", "maximum", "presumably", "multiply",
    "organization", "generalization", "maximize",
]

print("Lancaster Stemmer Results")
print("=" * 35)
print(f"{'Word':<20} {'Stem':<15}")
print("-" * 35)

for word in words:
    print(f"{word:<20} {ls.stem(word):<15}")

Lancaster Stemmer Results
Word                 Stem           
-----------------------------------
running              run            
maximum              maxim          
presumably           presum         
multiply             multiply       
organization         org            
generalization       gen            
maximize             maxim          


### Porter vs Lancaster

In [37]:
words = [
    "running", "maximum", "presumably", "multiply",
    "generalization", "organization", "loving", "happiness",
    "connection", "generate", "university", "friendship",
]

print("Porter vs Lancaster Comparison")
print("=" * 55)
print(f"{'Word':<18} {'Porter':<15} {'Lancaster':<15} {'Diff'}")
print("-" * 55)

for word in words:
    porter = ps.stem(word)
    lancaster = ls.stem(word)
    diff = "*" if porter != lancaster else ""
    print(f"{word:<18} {porter:<15} {lancaster:<15} {diff}")

print("\n* = Different results")
print("💡 Lancaster is more aggressive, often shorter stems")

Porter vs Lancaster Comparison
Word               Porter          Lancaster       Diff
-------------------------------------------------------
running            run             run             
maximum            maximum         maxim           *
presumably         presum          presum          
multiply           multipli        multiply        *
generalization     gener           gen             *
organization       organ           org             *
loving             love            lov             *
happiness          happi           happy           *
connection         connect         connect         
generate           gener           gen             *
university         univers         univers         
friendship         friendship      friend          *

* = Different results
💡 Lancaster is more aggressive, often shorter stems
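The "more aggressive" claim can be checked against the table itself. The sketch below hard-codes the twelve rows printed above (so it runs without NLTK) and compares average stem length; the variable names are illustrative.

```python
# (word, porter_stem, lancaster_stem), copied from the comparison output above.
rows = [
    ("running", "run", "run"),
    ("maximum", "maximum", "maxim"),
    ("presumably", "presum", "presum"),
    ("multiply", "multipli", "multiply"),
    ("generalization", "gener", "gen"),
    ("organization", "organ", "org"),
    ("loving", "love", "lov"),
    ("happiness", "happi", "happy"),
    ("connection", "connect", "connect"),
    ("generate", "gener", "gen"),
    ("university", "univers", "univers"),
    ("friendship", "friendship", "friend"),
]

porter_avg = sum(len(p) for _, p, _ in rows) / len(rows)
lancaster_avg = sum(len(l) for _, _, l in rows) / len(rows)
differ = sum(1 for _, p, l in rows if p != l)

print(f"Average stem length: Porter {porter_avg:.2f}, Lancaster {lancaster_avg:.2f}")
print(f"Words stemmed differently: {differ} of {len(rows)}")
```

On this small sample Lancaster stems are about one character shorter on average. The difference is a tendency rather than a rule: for `multiply` and `happiness` the two stems differ but have the same length.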


## 5.3 Snowball Stemmer

An improved version of the Porter stemmer (its English variant is also known as Porter2) with support for multiple languages.

In [38]:
# Available languages
print("Available languages:")
print(SnowballStemmer.languages)

Available languages:
('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [39]:
# English Snowball Stemmer
ss = SnowballStemmer("english")

words = ["running", "generously", "happiness", "beautiful", "organization"]

print("English Snowball Stemmer:")
for word in words:
    print(f"  {word} → {ss.stem(word)}")

English Snowball Stemmer:
  running → run
  generously → generous
  happiness → happi
  beautiful → beauti
  organization → organ


### Multi-language Stemming

In [40]:
# Test words in different languages
test_cases = {
    "english": ["running", "happiness", "organization"],
    "spanish": ["corriendo", "felicidad", "organización"],
    "french": ["courant", "bonheur", "organisation"],
    "german": ["laufend", "Glück", "Organisation"],
    "italian": ["correndo", "felicità", "organizzazione"],
}

print("Multi-language Snowball Stemming")
print("=" * 60)

for language, words in test_cases.items():
    stemmer = SnowballStemmer(language)
    print(f"\n{language.capitalize()}:")
    for word in words:
        print(f"  {word:<20} → {stemmer.stem(word)}")

Multi-language Snowball Stemming

English:
  running              → run
  happiness            → happi
  organization         → organ

Spanish:
  corriendo            → corr
  felicidad            → felic
  organización         → organiz

French:
  courant              → cour
  bonheur              → bonheur
  organisation         → organis

German:
  laufend              → laufend
  Glück                → gluck
  Organisation         → organisation

Italian:
  correndo             → corr
  felicità             → felic
  organizzazione       → organizz


## 5.4 Regexp Stemmer

Create custom stemmers using regular expressions.

In [41]:
# Basic suffix removal
rs = RegexpStemmer('ing$|ed$|s$', min=4)

words = ["running", "walked", "cats", "dogs", "jumping", "needed"]

print("Regexp Stemmer (removes -ing, -ed, -s)")
print(f"Pattern: 'ing$|ed$|s$', min_length=4")
print("-" * 40)

for word in words:
    print(f"  {word:<15} → {rs.stem(word)}")

Regexp Stemmer (removes -ing, -ed, -s)
Pattern: 'ing$|ed$|s$', min_length=4
----------------------------------------
  running         → runn
  walked          → walk
  cats            → cat
  dogs            → dog
  jumping         → jump
  needed          → need
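The `runn` result shows that `RegexpStemmer` only deletes the matched text and does no further clean-up such as undoubling consonants. Its behaviour appears to be roughly equivalent to a single `re.sub` call guarded by a length check, as sketched below. `regexp_stem` is a hypothetical stand-in written for illustration, not an NLTK function, and the parameter name `min_length` is an assumption (NLTK calls it `min`).

```python
import re


def regexp_stem(word, pattern=r"ing$|ed$|s$", min_length=4):
    """Delete text matching `pattern`; words shorter than `min_length` are returned as-is."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


for w in ["running", "walked", "cats", "red", "sing"]:
    print(f"{w:<10} -> {regexp_stem(w)}")
```

Note that the length check applies to the input word, not to the resulting stem: in this sketch `red` is protected because it has three letters, but `sing` has four and is cut down to `s`. A larger minimum or more specific patterns help avoid such over-stripping.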


In [42]:
# Custom patterns for different word types
patterns = {
    "Verbal": ('ing$|ed$|es$|s$', 3, ["running", "walked", "boxes", "plays"]),
    "Noun": ('tion$|ment$|ness$|ity$', 4, ["connection", "movement", "happiness", "ability"]),
    "Adjective": ('able$|ible$|ful$|less$', 4, ["readable", "visible", "beautiful", "careless"]),
    "Adverb": ('ly$', 4, ["quickly", "happily", "slowly", "carefully"]),
}

print("Custom Regexp Stemmers")
print("=" * 50)

for name, (pattern, min_len, words) in patterns.items():
    stemmer = RegexpStemmer(pattern, min=min_len)
    print(f"\n{name} suffixes (pattern: '{pattern}'):")
    for word in words:
        print(f"  {word:<15} → {stemmer.stem(word)}")

Custom Regexp Stemmers

Verbal suffixes (pattern: 'ing$|ed$|es$|s$'):
  running         → runn
  walked          → walk
  boxes           → box
  plays           → play

Noun suffixes (pattern: 'tion$|ment$|ness$|ity$'):
  connection      → connec
  movement        → move
  happiness       → happi
  ability         → abil

Adjective suffixes (pattern: 'able$|ible$|ful$|less$'):
  readable        → read
  visible         → vis
  beautiful       → beauti
  careless        → care

Adverb suffixes (pattern: 'ly$'):
  quickly         → quick
  happily         → happi
  slowly          → slow
  carefully       → careful


## 5.5 Comparing All Stemmers

In [43]:
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer("english")
rs = RegexpStemmer('ing$|ed$|s$|able$|tion$', min=4)

words = [
    "programming", "programmer", "programmed",
    "organization", "organized", "organizing",
    "beautiful", "beautifully", "beauty",
    "happiness", "happy", "happily",
]

print("All Stemmers Comparison")
print("=" * 75)
print(f"{'Word':<16} {'Porter':<12} {'Lancaster':<12} {'Snowball':<12} {'Regexp':<12}")
print("-" * 75)

for word in words:
    print(f"{word:<16} {ps.stem(word):<12} {ls.stem(word):<12} {ss.stem(word):<12} {rs.stem(word):<12}")

All Stemmers Comparison
Word             Porter       Lancaster    Snowball     Regexp      
---------------------------------------------------------------------------
programming      program      program      program      programm    
programmer       programm     program      programm     programmer  
programmed       program      program      program      programm    
organization     organ        org          organ        organiza    
organized        organ        org          organ        organiz     
organizing       organ        org          organ        organiz     
beautiful        beauti       beauty       beauti       beautiful   
beautifully      beauti       beauty       beauti       beautifully 
beauty           beauti       beauty       beauti       beauty      
happiness        happi        happy        happi        happines    
happy            happi        happy        happi        happy       
happily          happili      happy        happili      happily     
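One way to read this table is to count how many distinct stems each column contains: the fewer the stems, the more the stemmer shrinks the vocabulary. The sketch below hard-codes the rows printed above so that it runs without NLTK; `table` and `unique_counts` are illustrative names.

```python
# word -> (Porter, Lancaster, Snowball, Regexp), copied from the output above.
table = {
    "programming": ("program", "program", "program", "programm"),
    "programmer": ("programm", "program", "programm", "programmer"),
    "programmed": ("program", "program", "program", "programm"),
    "organization": ("organ", "org", "organ", "organiza"),
    "organized": ("organ", "org", "organ", "organiz"),
    "organizing": ("organ", "org", "organ", "organiz"),
    "beautiful": ("beauti", "beauty", "beauti", "beautiful"),
    "beautifully": ("beauti", "beauty", "beauti", "beautifully"),
    "beauty": ("beauti", "beauty", "beauti", "beauty"),
    "happiness": ("happi", "happy", "happi", "happines"),
    "happy": ("happi", "happy", "happi", "happy"),
    "happily": ("happili", "happy", "happili", "happily"),
}

names = ["Porter", "Lancaster", "Snowball", "Regexp"]
unique_counts = {
    name: len({stems[i] for stems in table.values()})
    for i, name in enumerate(names)
}

print(f"Distinct input words: {len(table)}")
for name, count in unique_counts.items():
    print(f"  {name:<10} {count} distinct stems")
```

On these twelve words, Lancaster collapses each of the four families to a single stem, Porter and Snowball leave six stems, and this particular regexp pattern leaves ten.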


### Consistency Test

Do inflected and derived forms of the same word get the same stem?

In [44]:
word_families = [
    ["run", "running", "runs", "runner", "ran"],
    ["connect", "connection", "connected", "connecting"],
    ["happy", "happiness", "happily", "happier"],
    ["beauty", "beautiful", "beautifully", "beautify"],
]

print("Stemmer Consistency Test")
print("=" * 60)

for family in word_families:
    print(f"\nWord family: {family}")
    
    for name, stemmer in [("Porter", ps), ("Lancaster", ls), ("Snowball", ss)]:
        stems = [stemmer.stem(word) for word in family]
        unique_stems = set(stems)
        is_consistent = len(unique_stems) == 1
        status = "✅ Consistent" if is_consistent else f"⚠️ {len(unique_stems)} different stems"
        print(f"  {name:<10} → {unique_stems}  {status}")

Stemmer Consistency Test

Word family: ['run', 'running', 'runs', 'runner', 'ran']
  Porter     → {'runner', 'run', 'ran'}  ⚠️ 3 different stems
  Lancaster  → {'ran', 'run'}  ⚠️ 2 different stems
  Snowball   → {'runner', 'run', 'ran'}  ⚠️ 3 different stems

Word family: ['connect', 'connection', 'connected', 'connecting']
  Porter     → {'connect'}  ✅ Consistent
  Lancaster  → {'connect'}  ✅ Consistent
  Snowball   → {'connect'}  ✅ Consistent

Word family: ['happy', 'happiness', 'happily', 'happier']
  Porter     → {'happili', 'happi', 'happier'}  ⚠️ 3 different stems
  Lancaster  → {'happy'}  ✅ Consistent
  Snowball   → {'happili', 'happi', 'happier'}  ⚠️ 3 different stems

Word family: ['beauty', 'beautiful', 'beautifully', 'beautify']
  Porter     → {'beautifi', 'beauti'}  ⚠️ 2 different stems
  Lancaster  → {'beauty', 'beaut'}  ⚠️ 2 different stems
  Snowball   → {'beautifi', 'beauti'}  ⚠️ 2 different stems


## 5.6 Practical Applications

### Stemming a Sentence

In [45]:
def stem_sentence(sentence, stemmer=None):
    """Stem all words in a sentence"""
    if stemmer is None:
        stemmer = PorterStemmer()
    
    tokens = word_tokenize(sentence.lower())
    stemmed = [stemmer.stem(t) for t in tokens if t.isalpha()]
    return stemmed

sentences = [
    "The cats are running and jumping happily.",
    "She was studying programming and organizing her notes.",
    "The beautiful organization connected many communities.",
]

print("Sentence Stemming")
print("=" * 60)

for sentence in sentences:
    stemmed = stem_sentence(sentence)
    print(f"\nOriginal: {sentence}")
    print(f"Stemmed:  {' '.join(stemmed)}")

Sentence Stemming

Original: The cats are running and jumping happily.
Stemmed:  the cat are run and jump happili

Original: She was studying programming and organizing her notes.
Stemmed:  she wa studi program and organ her note

Original: The beautiful organization connected many communities.
Stemmed:  the beauti organ connect mani commun


### Stemming for Search

In [46]:
# Documents
documents = [
    "The runner was running in the marathon.",
    "She runs every morning before work.",
    "Running is good exercise for runners.",
    "The car drove quickly down the street.",
]

# Search query
query = "run"
query_stem = ps.stem(query)

print(f"Search query: '{query}' (stem: '{query_stem}')")
print("\nMatching documents:")
print("-" * 50)

for i, doc in enumerate(documents, 1):
    tokens = word_tokenize(doc.lower())
    stems = [ps.stem(t) for t in tokens if t.isalpha()]
    
    if query_stem in stems:
        # Find which words matched
        matched = [t for t in tokens if t.isalpha() and ps.stem(t) == query_stem]
        print(f"\n✅ Doc {i}: {doc}")
        print(f"   Matched words: {matched}")
    else:
        print(f"\n❌ Doc {i}: {doc}")

Search query: 'run' (stem: 'run')

Matching documents:
--------------------------------------------------

✅ Doc 1: The runner was running in the marathon.
   Matched words: ['running']

✅ Doc 2: She runs every morning before work.
   Matched words: ['runs']

✅ Doc 3: Running is good exercise for runners.
   Matched words: ['running']

❌ Doc 4: The car drove quickly down the street.
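The loop above re-tokenizes and re-stems every document for each query. For more than a handful of queries, a common alternative is to stem each document once and store the result in an inverted index (stem to document ids). The sketch below is one way to do that, under stated assumptions: `build_index` is a hypothetical helper, tokenization uses a simple regular expression instead of `word_tokenize`, and `lookup_stem` is a table-based stand-in for `ps.stem` so the example runs without NLTK.

```python
import re
from collections import defaultdict

# Stand-in for ps.stem: a lookup of the Porter stems relevant to this example.
# Tokens not listed here are returned unchanged (an assumption of this sketch).
KNOWN_STEMS = {"running": "run", "runs": "run", "runners": "runner"}


def lookup_stem(token):
    return KNOWN_STEMS.get(token, token)


def build_index(documents, stem):
    """Map each stem to the set of 0-based positions of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in re.findall(r"[a-z]+", text.lower()):
            index[stem(token)].add(doc_id)
    return index


documents = [
    "The runner was running in the marathon.",
    "She runs every morning before work.",
    "Running is good exercise for runners.",
    "The car drove quickly down the street.",
]

index = build_index(documents, lookup_stem)
matches = sorted(index.get(lookup_stem("run"), set()))
print(matches)  # 0-based document positions: [0, 1, 2]
```

To use a real stemmer, pass `ps.stem` as the `stem` argument. The query must be stemmed with the same function as the documents, otherwise the keys will not line up.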


## 5.7 Stemmer Utility Class

In [47]:
class Stemmer:
    """Utility class for stemming operations"""
    
    STEMMERS = {
        'porter': PorterStemmer,
        'lancaster': LancasterStemmer,
        'snowball': lambda: SnowballStemmer('english'),
    }
    
    def __init__(self, stemmer_type='porter'):
        if stemmer_type not in self.STEMMERS:
            raise ValueError(f"Unknown stemmer: {stemmer_type}")
        
        creator = self.STEMMERS[stemmer_type]
        self.stemmer = creator() if callable(creator) else creator
        self.stemmer_type = stemmer_type
    
    def stem(self, word):
        """Stem a single word"""
        return self.stemmer.stem(word)
    
    def stem_words(self, words):
        """Stem a list of words"""
        return [self.stemmer.stem(w) for w in words]
    
    def stem_text(self, text):
        """Tokenize and stem text"""
        tokens = word_tokenize(text.lower())
        return [self.stemmer.stem(t) for t in tokens if t.isalpha()]
    
    def stem_documents(self, documents):
        """Stem multiple documents"""
        return [self.stem_text(doc) for doc in documents]

In [48]:
# Use the utility class
text = "The programmers are programming different programs."

print(f"Text: {text}\n")

for stype in ['porter', 'lancaster', 'snowball']:
    stemmer = Stemmer(stype)
    result = stemmer.stem_text(text)
    print(f"{stype.capitalize():<10} → {result}")

Text: The programmers are programming different programs.

Porter     → ['the', 'programm', 'are', 'program', 'differ', 'program']
Lancaster  → ['the', 'program', 'ar', 'program', 'diff', 'program']
Snowball   → ['the', 'programm', 'are', 'program', 'differ', 'program']
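Real text repeats the same words many times, so a utility class like this one may benefit from memoizing stems. The sketch below shows the idea with `functools.lru_cache`. `CachedStemmer` is a hypothetical wrapper, not part of NLTK, and `fake_stem` is a placeholder for a real `stem` method so the example runs on its own.

```python
from functools import lru_cache


class CachedStemmer:
    """Wrap any callable word -> stem and memoize its results."""

    def __init__(self, stem_fn, maxsize=10_000):
        self.calls = 0  # how many times the underlying stemmer actually ran

        def counted(word):
            self.calls += 1
            return stem_fn(word)

        self.stem = lru_cache(maxsize=maxsize)(counted)

    def stem_words(self, words):
        return [self.stem(w) for w in words]


def fake_stem(word):
    """Placeholder stemmer for this sketch: drop trailing 's' characters."""
    return word.rstrip("s")


cached = CachedStemmer(fake_stem)
stems = cached.stem_words(["programs", "programs", "program", "programs"])
print(stems)
print("underlying calls:", cached.calls)
```

Four words are stemmed but the underlying function runs only twice, once per distinct word. With a real stemmer you would pass its bound method, for example `CachedStemmer(PorterStemmer().stem)`.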


## 5.8 When to Use Stemming

### ✅ Good Use Cases

| Use Case | Why |
|----------|-----|
| **Information Retrieval / Search** | Match different word forms to same concept |
| **Text Classification** | Reduce vocabulary size |
| **Document Clustering** | Group similar documents |
| **Quick Prototyping** | Faster than lemmatization |

### ❌ Not Recommended For

| Use Case | Why Not |
|----------|--------|
| **Sentiment Analysis** | Loses nuance ("happy" vs "happily") |
| **Machine Translation** | Need exact word forms |
| **Text Generation** | Stems aren't valid words |
| **Named Entity Recognition** | Proper nouns shouldn't be stemmed |

## Summary

| Stemmer | Aggressiveness | Speed | Multi-language |
|---------|---------------|-------|----------------|
| **Porter** | Medium | Fast | No |
| **Lancaster** | High | Fast | No |
| **Snowball** | Medium | Fast | Yes |
| **Regexp** | Custom | Very Fast | Custom |

### Quick Reference
```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

ps.stem('running')  # 'run'
```