# 8.1.2 Stemming and Lemmatization

## Explanation of Stemming and Lemmatization

**Stemming** and **lemmatization** are text normalization techniques used in natural language processing (NLP) to reduce words to their base or root forms.

- **Stemming** strips affixes (in practice, mostly suffixes) from a word using heuristic rules to approximate its root. The result is not guaranteed to be a dictionary word. For example, "running" is stemmed to "run," but "easily" becomes "easili" under the Porter stemmer.

- **Lemmatization** reduces a word to its dictionary form (its lemma) using a vocabulary and morphological analysis. Unlike stemming, it takes the word's part of speech into account, which generally yields more accurate base forms. For example, "running" used as a verb is lemmatized to "run," and "better" used as an adjective is lemmatized to "good."
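
To make the suffix-chopping idea concrete, here is a minimal sketch of a rule-based suffix stripper. The rule list and the `naive_stem` function are hypothetical and far cruder than the Porter algorithm; they only illustrate why a stem can fall outside the dictionary.

```python
# Illustrative only: a hypothetical rule list, not the Porter algorithm.
# Each rule is (suffix to remove, text to put in its place).
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ly", ""),
    ("ed", ""),
    ("s", ""),
]


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Apply the first matching rule, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["jumping", "jumped", "studies", "running", "easily"]:
    print(f"{w} -> {naive_stem(w)}")
# jumping -> jump
# jumped -> jump
# studies -> studi
# running -> runn
# easily -> easi
```

Real stemmers layer further rules on top of this, for example undoubling the final consonant so that "runn" becomes "run."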

## Differences between Stemming and Lemmatization

- **Accuracy**: Lemmatization provides more accurate base forms by considering the word's meaning and context, while stemming is a more heuristic approach that may produce less accurate results.
- **Output**: Stemming may produce non-dictionary words, whereas lemmatization returns dictionary words for the forms it recognizes (unknown words are typically returned unchanged).
- **Complexity**: Lemmatization is computationally more intensive and requires a more complex algorithm compared to the simpler stemming process.
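
The role of context can be sketched with a toy lookup-based lemmatizer. The `LEXICON` table and `toy_lemmatize` function below are hypothetical stand-ins; a real lemmatizer consults a full dictionary such as WordNet plus morphological rules, which is where the extra cost comes from.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("mice", "NOUN"): "mouse",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Return the lexicon's lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


print(toy_lemmatize("saw", "VERB"))    # see
print(toy_lemmatize("saw", "NOUN"))    # saw
print(toy_lemmatize("mice", "NOUN"))   # mouse
print(toy_lemmatize("blorp", "NOUN"))  # blorp (unknown words pass through)
```

The same surface form ("saw") maps to different lemmas depending on its part of speech, which a suffix-stripping stemmer cannot distinguish.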

## Methods for Implementing Stemming and Lemmatization

### Stemming

1. **Porter Stemmer**

   **Example**: 
   - "running" --> "run"
   - "easily" --> "easili" (may not be a real word)

2. **Snowball Stemmer**

   **Example**:
   - "running" --> "run"
   - "fairly" --> "fair"

3. **Lancaster Stemmer**

   **Example**:
   - "running" --> "run"
   - "easily" --> "eas" (overly simplified)

### Lemmatization

1. **WordNet Lemmatizer**

   **Example**:
   - "running" --> "run" (when tagged as a verb)
   - "easily" --> "easily" (the adverb is already its own lemma)

2. **spaCy Lemmatizer**

   **Example**:
   - "running" --> "run" (when tagged as a verb)
   - "fairly" --> "fairly" (adverbs are typically left unchanged)


These methods can be implemented using libraries like NLTK, spaCy, and others.

```python
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
```

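```python
# Download required NLTK data files
nltk.download('punkt')  # Tokenizer models
nltk.download('punkt_tab')  # Tokenizer data looked up by newer NLTK releases
nltk.download('wordnet')  # WordNet data for lemmatization
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('averaged_perceptron_tagger_eng')  # POS tagger name used by newer NLTK releases
nltk.download('omw-1.4')  # Open Multilingual WordNet
```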

```python
# Sample text
text = "The runners are running faster than the other runners"

# Stemming
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lancaster_stemmer = LancasterStemmer()

print("Original Text:", text)
print("Porter Stemmer:", [porter_stemmer.stem(word) for word in text.split()])
print("Snowball Stemmer:", [snowball_stemmer.stem(word) for word in text.split()])
print("Lancaster Stemmer:", [lancaster_stemmer.stem(word) for word in text.split()])
```

```text
Original Text: The runners are running faster than the other runners
Porter Stemmer: ['the', 'runner', 'are', 'run', 'faster', 'than', 'the', 'other', 'runner']
Snowball Stemmer: ['the', 'runner', 'are', 'run', 'faster', 'than', 'the', 'other', 'runner']
Lancaster Stemmer: ['the', 'run', 'ar', 'run', 'fast', 'than', 'the', 'oth', 'run']
```
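
The output above shows the Lancaster stemmer merging more word forms than Porter or Snowball. One way to inspect this is to group a word list by stem. The `group_by_stem` helper and the word list below are illustrative assumptions, not part of NLTK.

```python
from collections import defaultdict

from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def group_by_stem(stemmer, words):
    """Map each stem to the list of input words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


words = ["running", "runner", "runs", "easily", "easy", "fairly", "fair"]
stemmers = {
    "Porter": PorterStemmer(),
    "Snowball": SnowballStemmer("english"),
    "Lancaster": LancasterStemmer(),
}

for name, stemmer in stemmers.items():
    groups = group_by_stem(stemmer, words)
    # Fewer distinct stems means the stemmer conflates more aggressively.
    print(f"{name}: {len(groups)} distinct stems -> {groups}")
```

A lower stem count is not automatically better: merging unrelated words (over-stemming) hurts precision just as missing related forms (under-stemming) hurts recall.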


```python
# Lemmatization
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank POS tag to the corresponding WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # Fall back to noun, which is also the lemmatizer's own default
        return wordnet.NOUN

# Tokenize and POS-tag the sentence so each word is lemmatized with its tag
text_tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(text_tokens)
lemmatized_text = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in pos_tags]

print("Lemmatized Text:", lemmatized_text)
```

```text
Lemmatized Text: ['The', 'runner', 'be', 'run', 'faster', 'than', 'the', 'other', 'runner']
```


## Conclusion

Stemming and lemmatization are common normalization steps in the preprocessing pipeline of natural language processing (NLP).

- **Stemming** strips inflectional and derivational affixes with heuristic rules. This maps related word forms to a common stem, but the stem may not be a dictionary word.

- **Lemmatization** reduces words to their dictionary form, taking part of speech into account, so the output is usually a valid word that preserves the original meaning.

The choice between the two can affect the performance of NLP models. Stemming is often used where speed matters more than precision, such as in search engines and information retrieval systems. Lemmatization, with its more accurate output, is preferred where the exact word form matters, such as in sentiment analysis and text classification.
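
As a rough illustration of the retrieval use case, the sketch below builds a tiny inverted index whose terms are normalized before indexing and before querying. `crude_stem`, `build_index`, and `search` are hypothetical helpers written for this example; in practice `normalize` could be any stemmer's `stem` method.

```python
from collections import defaultdict


def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize=crude_stem):
    """Inverted index: normalized term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize=crude_stem):
    """Return ids of documents that contain every normalized query term."""
    term_sets = [index.get(normalize(t), set()) for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()


docs = {
    1: "training plans for runners",
    2: "running shoes on sale",
    3: "cooking for beginners",
}

stemmed_index = build_index(docs)
exact_index = build_index(docs, normalize=str)  # no normalization

print(search(stemmed_index, "runner"))               # {1, 2}
print(search(exact_index, "runner", normalize=str))  # set()
```

With stemming, the query "runner" reaches documents that contain "runners" and "running"; with exact matching it finds nothing.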

Applied during text preprocessing, both techniques shrink the vocabulary and group related word forms, which can improve the quality and efficiency of text analysis.
