# Stemming and Lemmatization

Both **stemming** and **lemmatization** are text normalization techniques that reduce words to their base or root form.

## Stemming
- Uses a **rule-based** approach (heuristics)
- Chops off word endings to get the stem
- May produce **non-dictionary words** (e.g., "studies" → "studi")
- **Faster** but less accurate
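The rule-based idea can be illustrated with a toy suffix stripper. This is a minimal sketch, not the Porter algorithm: the `SUFFIX_RULES` list and the `toy_stem` helper are hypothetical names made up for this example, and the rules cover only a handful of endings.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration
# of the rule-based idea. The rule list below is a hypothetical example.
SUFFIX_RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # eating  -> eat
    ("ed", ""),     # talked  -> talk
    ("s", ""),      # eats    -> eat
]

def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; leave short words alone."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 characters to remain so "is" or "red" are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

for w in ["studies", "eating", "eats", "talked", "ate"]:
    print(w, "|", toy_stem(w))
```

Because the rules only look at the spelling, the irregular form "ate" passes through unchanged, and "studies" becomes the non-word "studi".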

## Lemmatization
- Uses **vocabulary and morphological analysis**
- Returns the **dictionary form** (lemma) of a word
- More **accurate** but more computationally expensive
- Context-aware: the lemma can depend on the word's part of speech (e.g., "better" → "good" when used as an adjective)
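A dictionary lookup keyed on the word and its part of speech gives a rough picture of why lemmatization needs vocabulary and context. This is only a sketch: `LEMMA_TABLE` and `toy_lemma` are hypothetical, and the part-of-speech tag is assumed to be supplied by the caller, whereas a real lemmatizer gets it from a tagger.

```python
# Toy dictionary-based lemmatizer. The tiny LEMMA_TABLE is a hypothetical
# stand-in for the vocabulary a real lemmatizer ships with.
LEMMA_TABLE = {
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("ate", "VERB"): "eat",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemma(word: str, pos: str) -> str:
    """Look up (word, part of speech); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

for word, pos in [("better", "ADJ"), ("better", "ADV"), ("ate", "VERB"), ("meeting", "NOUN")]:
    print(word, pos, "|", toy_lemma(word, pos))
```

The same surface form ("better", "meeting") gets a different lemma depending on the tag, which is the sense in which lemmatization is context-aware.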

| Word | Stemming | Lemmatization |
|------|----------|---------------|
| studies | studi | study |
| better | better | good |
| running | run | run |

<h3>Stemming in NLTK</h3>

In [1]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [2]:
words = ["eating", "eats", "eat", "ate", "adjustable", "rafting", "ability", "meeting"]

for word in words:
    print(word, "|", stemmer.stem(word))

eating | eat
eats | eat
eat | eat
ate | ate
adjustable | adjust
rafting | raft
ability | abil
meeting | meet


### Porter Stemmer Example

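The **Porter Stemmer** is one of the most popular stemming algorithms. Notice how it produces stems that may not be actual words, such as "abil" for "ability". It also leaves irregular forms untouched: "ate" stays "ate" rather than becoming "eat".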

<h3>Lemmatization in spaCy</h3>

In [3]:
import spacy

In [4]:
nlp = spacy.load("en_core_web_sm")

# Alternative input to try: "Mando talked for 3 hours although talking isn't his thing"
doc = nlp("eating eats eat ate adjustable rafting ability meeting better")
for token in doc:
    print(token, " | ", token.lemma_)

eating  |  eat
eats  |  eat
eat  |  eat
ate  |  eat
adjustable  |  adjustable
rafting  |  raft
ability  |  ability
meeting  |  meet
better  |  well


### Lemmatization Example

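spaCy's lemmatizer returns the dictionary form. Notice how the irregular form "ate" maps to "eat", something stemming cannot achieve. In this output "better" maps to "well" rather than "good": the lemma depends on the part of speech the model assigns, and here, with no sentence context, "better" was treated as the comparative of "well".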

<h3>Customizing lemmatizer</h3>

In [5]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

## Customizing the Lemmatizer

Sometimes we need to handle slang or domain-specific words. spaCy allows us to add custom lemmatization rules using the **attribute ruler**.

In [6]:
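# The attribute ruler runs before the lemmatizer in the pipeline listed above
ar = nlp.get_pipe('attribute_ruler')

# Two token patterns (exact text "Bro" or "Brah") share one attribute override
ar.add([[{"TEXT": "Bro"}], [{"TEXT": "Brah"}]], {"LEMMA": "Brother"})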

doc = nlp("Bro, you wanna go? Brah, don't say no! I am exhausted")
for token in doc:
    print(token.text, "|", token.lemma_)

Bro | Brother
, | ,
you | you
wanna | wanna
go | go
? | ?
Brah | Brother
, | ,
do | do
n't | not
say | say
no | no
! | !
I | I
am | be
exhausted | exhaust


### Adding Custom Lemma Rules

The rules map the slang words "Bro" and "Brah" to the lemma "Brother". Matching on `TEXT` is case-sensitive, so a lowercase "bro" would not be affected.

In [7]:
doc[6]

Brah

In [8]:
doc[6].lemma_

'Brother'
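Conceptually, the rule acts like an exact-text override layered on top of whatever lemma would otherwise be assigned. The sketch below shows that idea in plain Python; it is an illustration only and does not reflect how spaCy's attribute ruler is implemented. `LEMMA_OVERRIDES`, `lemma_with_overrides`, and the base lemmas in `tokens` are hypothetical.

```python
# Plain-Python picture of an exact-text lemma override. This is an illustration
# only and does not reflect how spaCy's attribute ruler is implemented.
LEMMA_OVERRIDES = {"Bro": "Brother", "Brah": "Brother"}  # keys match token text exactly

def lemma_with_overrides(token_text: str, base_lemma: str) -> str:
    """Return the override if the exact token text is listed, else the base lemma."""
    return LEMMA_OVERRIDES.get(token_text, base_lemma)

# (token text, lemma a base lemmatizer might have produced): hypothetical values
tokens = [("Bro", "bro"), ("bro", "bro"), ("Brah", "brah"), ("say", "say")]
for text, base in tokens:
    print(text, "|", lemma_with_overrides(text, base))
```

Only the capitalized "Bro" and "Brah" are rewritten here, mirroring the case-sensitive `TEXT` patterns used above. In spaCy, a pattern on the `LOWER` attribute would be one way to match regardless of case.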