## Stemming and lemmatization

- **Stemming** and **lemmatization** are text normalization techniques in **Natural Language Processing (NLP)** that reduce words to a base or root form. Normalizing word forms this way helps tasks such as search, text mining, and feature extraction for machine learning models.
- Here’s a brief overview of both, with example implementations using the NLTK library.

## Stemming

- Stemming reduces a word to a root form by stripping suffixes according to fixed rules. The result is often a truncated string rather than a valid word in the language.

```python
from nltk.stem import PorterStemmer

# Create a PorterStemmer object
stemmer = PorterStemmer()

# Example words
words = ["running", "ran", "runs", "flies", "easily", "fairly", "fairness", "studies"]

# Stem each word
stems = [stemmer.stem(word) for word in words]

print(stems)
```

Output:

```
['run', 'ran', 'run', 'fli', 'easili', 'fairli', 'fair', 'studi']
```

Here, you can see that some words, like **"flies" → "fli"**, are not actual words.
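To make the suffix-stripping idea concrete, here is a minimal sketch of a rule-based stemmer in plain Python. It is not the Porter algorithm: the rule list `SUFFIX_RULES` and the function `toy_stem` are hypothetical names invented for this illustration, and the rules are a tiny hand-picked subset.

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, only an illustration.
# SUFFIX_RULES is a hypothetical, hand-picked rule list for this example.
SUFFIX_RULES = [
    ("ies", "i"),  # "flies"   -> "fli"
    ("ing", ""),   # "jumping" -> "jump"
    ("y", "i"),    # "easily"  -> "easili"
    ("s", ""),     # "runs"    -> "run"
]


def toy_stem(word: str) -> str:
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of stem so short words are not destroyed
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + replacement
    return word


print([toy_stem(w) for w in ["flies", "jumping", "easily", "runs", "ran"]])
# ['fli', 'jump', 'easili', 'run', 'ran']
```

Because the rules only look at the end of the string, irregular forms such as "ran" pass through untouched, and "running" becomes "runn" here. Real stemmers such as Porter add further rules (for example, for doubled consonants) to handle cases like that.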

## Lemmatization

- Lemmatization reduces a word to its base or dictionary form, known as the lemma.
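- This process relies on a vocabulary and morphological analysis, and usually on the word's part of speech, so the result is a valid word. NLTK's `WordNetLemmatizer` does not infer context on its own: it treats every word as a noun unless a `pos` argument is supplied.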
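As a rough illustration of the dictionary-based idea, the sketch below looks up a lemma by word and part of speech. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical stand-ins invented for this example; a real lemmatizer consults a full lexical database such as WordNet.

```python
# Toy dictionary-based lemmatizer. LEMMA_TABLE is a tiny hypothetical
# stand-in for a real lexical database such as WordNet.
LEMMA_TABLE = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("runs", "v"): "run",
    ("better", "a"): "good",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("ran", pos="v"))     # run
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better", pos="v"))  # better (no entry for this part of speech)
```

The same surface form can map to different lemmas depending on the part of speech, which is why the lookup key includes `pos`.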

```python
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')  # For POS tagging if needed
```

Output:

```
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\devad\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\devad\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\devad\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
True
```

```python
from nltk.stem import WordNetLemmatizer
import nltk

# Download necessary data
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "ran", "runs", "better", "fairly"]

# Lemmatize words as verbs ('v')
lemmas_verbs = [lemmatizer.lemmatize(word, pos='v') for word in words]

# Lemmatize words as adjectives ('a')
lemmas_adjectives = [lemmatizer.lemmatize(word, pos='a') for word in words]

print("Lemmatized as Verbs:", lemmas_verbs)
print("Lemmatized as Adjectives:", lemmas_adjectives)
```

Output:

```
Lemmatized as Verbs: ['run', 'run', 'run', 'better', 'fairly']
Lemmatized as Adjectives: ['running', 'ran', 'runs', 'good', 'fairly']
```


# **Differences Between Stemming and Lemmatization**

## **1. Accuracy**
- **Stemming:** Can produce strings that are not real words, because it strips suffixes by rule.  
  - Example: *"flies"* is stemmed to *"fli"* by the Porter stemmer.
- **Lemmatization:** Returns dictionary words by combining morphological analysis with a vocabulary.

## **2. Context**
- **Stemming:** Does not consider context; it applies fixed rules that strip common suffixes.
- **Lemmatization:** Can take the word's **context** into account through its **part of speech** (supplied by the caller or by a tagger), providing more accurate results.
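One common way to feed part-of-speech information to a lemmatizer is to convert tagger output into the single-letter codes WordNet uses. The sketch below assumes Penn Treebank style tags; the helper name `penn_to_wordnet_pos` and the hand-written `tagged` list are hypothetical and exist only for illustration.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else falls back to noun


# Hand-written (word, tag) pairs for illustration, not real tagger output
tagged = [("she", "PRP"), ("ran", "VBD"), ("better", "JJR"), ("races", "NNS")]
print([(word, penn_to_wordnet_pos(tag)) for word, tag in tagged])
# [('she', 'n'), ('ran', 'v'), ('better', 'a'), ('races', 'n')]
```

With NLTK, the tags would typically come from `nltk.pos_tag`, and the returned letter would be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`. Falling back to noun mirrors the lemmatizer's own default.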

## **3. Complexity**
- **Stemming:** A simpler and faster process.
- **Lemmatization:** More complex, as it requires a dictionary and understanding of the word's context.

---

### **Example in Python**

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk

# Download necessary data for lemmatization
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
word = "better"

# Apply stemming
stemmed_word = stemmer.stem(word)

# Apply lemmatization (as adjective 'a')
lemmatized_word = lemmatizer.lemmatize(word, pos='a')

print(f"Stemmed: {stemmed_word}")        # Output: better (the Porter stemmer leaves it unchanged)
print(f"Lemmatized: {lemmatized_word}")  # Output: good
```


## **Key Differences**

Both **stemming** and **lemmatization** are text preprocessing techniques used in **Natural Language Processing (NLP)** to reduce words to a base form. They differ in accuracy, context handling, and complexity.

| Feature         | Stemming       | Lemmatization  |
|---------------|---------------|---------------|
| **Accuracy** | Can produce non-words by stripping suffixes (e.g., *"flies"* → *"fli"*) | Produces dictionary words by considering word structure (e.g., *"better"* → *"good"* as an adjective) |
| **Context** | Does not consider context; applies fixed rules to strip suffixes | Uses part of speech, and through it context, for more accurate results |
| **Complexity** | Faster and simpler | Requires a dictionary and morphological analysis, making it slower but more precise |
| **Use Case** | Suitable for quick and large-scale text processing where perfect accuracy is not required | Preferred for NLP tasks requiring grammatically correct words |


## 🎯 Key Differences: Stemming vs Lemmatization

| **Feature**   | **Stemming**                        | **Lemmatization**                   |
|--------------|-----------------------------------|-----------------------------------|
| **Approach**  | Rule-based (suffix stripping)   | Dictionary-based (linguistic)    |
| **Speed**     | Faster                          | Slower (needs word database)     |
| **Accuracy**  | Less accurate (truncated words) | More accurate (real words)       |
| **Example**   | "flies" → "fli"                 | "flies" → "fly"                  |

