# Q2. Word Stemming

1. **Importance of Stemming in Text Analytics**

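Stemming is the process of reducing a word to a base form (its stem) by stripping affixes, usually suffixes. The stem does not have to be a dictionary word; for example, `choosing` becomes `choos`. In text analytics, stemming is important because: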

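- Reduces Vocabulary Size: Inflected forms that share a stem (e.g., running, runs) are treated as a single token (run), reducing dimensionality in text analysis. Irregular forms such as ran are usually left unchanged, because suffix stripping cannot relate them to run.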

- Improves Consistency: Helps unify different word forms so that models recognize them as the same concept.

- Enhances Text Mining & ML Performance: Simplifies the feature space for tasks like sentiment analysis, document classification, and information retrieval, where matching on stems can also improve recall.
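The vocabulary-size effect described above can be made concrete with a small sketch. The helper names `toy_stem` and `vocabulary_sizes` and the token list are hypothetical and chosen only for illustration; `toy_stem` is a stand-in suffix stripper, not one of the stemmers compared later.

```python
import re

def toy_stem(word):
    # Hypothetical stand-in stemmer: lowercase, then strip one common suffix.
    return re.sub(r"(ing|ed|s)$", "", word.lower())

def vocabulary_sizes(tokens, stem):
    """Return (number of distinct raw tokens, number of distinct stems)."""
    raw_vocab = set(tokens)
    stemmed_vocab = {stem(token) for token in tokens}
    return len(raw_vocab), len(stemmed_vocab)

# Illustrative tokens: four forms of "connect" and two forms of "task".
tokens = ["connect", "connects", "connected", "connecting", "task", "tasks"]
before, after = vocabulary_sizes(tokens, toy_stem)
print(f"distinct tokens: {before}, distinct stems: {after}")
```

Any callable that maps a token to its stem can be passed as `stem`, for example `PorterStemmer().stem` from NLTK.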

2. **Demonstrate Word Stemming**

```python
import re
from nltk.stem import PorterStemmer, LancasterStemmer

# Load Data_1.txt
file_path = '../assets/dataset-a/Data_1.txt'
with open(file_path, 'r') as file:
    text = file.read().strip()

# Split text into words on whitespace.
# Punctuation stays attached to the token (e.g. 'input.').
words = text.split()
print("Original Words (first 20):")
print(words[:20])

# 1. Regex Stemmer: strips one trailing suffix from a fixed list.
#    It is case-sensitive and does not lowercase the word.
def regex_stemmer(word):
    return re.sub(r'(ing|ly|ed|es|s|er)$', '', word)

regex_stems = [regex_stemmer(word) for word in words]
print("\nRegex Stemmer Output (first 20):")
print(regex_stems[:20])

# 2. Porter Stemmer
porter = PorterStemmer()
porter_stems = [porter.stem(word) for word in words]
print("\nPorter Stemmer Output (first 20):")
print(porter_stems[:20])

# 3. Lancaster Stemmer
lancaster = LancasterStemmer()
lancaster_stems = [lancaster.stem(word) for word in words]
print("\nLancaster Stemmer Output (first 20):")
print(lancaster_stems[:20])
```


```text
Original Words (first 20):
['Classification', 'is', 'the', 'task', 'of', 'choosing', 'the', 'correct', 'class', 'label', 'for', 'a', 'given', 'input.', 'In', 'basic', 'classification', 'tasks,', 'each', 'input']

Regex Stemmer Output (first 20):
['Classification', 'i', 'the', 'task', 'of', 'choos', 'the', 'correct', 'clas', 'label', 'for', 'a', 'given', 'input.', 'In', 'basic', 'classification', 'tasks,', 'each', 'input']

Porter Stemmer Output (first 20):
['classif', 'is', 'the', 'task', 'of', 'choos', 'the', 'correct', 'class', 'label', 'for', 'a', 'given', 'input.', 'in', 'basic', 'classif', 'tasks,', 'each', 'input']

Lancaster Stemmer Output (first 20):
['class', 'is', 'the', 'task', 'of', 'choos', 'the', 'correct', 'class', 'label', 'for', 'a', 'giv', 'input.', 'in', 'bas', 'class', 'tasks,', 'each', 'input']
```
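
In the output above, `input.` and `tasks,` pass through every stemmer unchanged because the trailing punctuation hides the word ending. One possible remedy, sketched here under assumptions, is to normalize tokens before stemming. The `tokenize` helper is hypothetical, and its letters-only pattern is a simplification that would also drop digits and split hyphenated words or contractions. The sample string is taken from the first 20 words shown above.

```python
import re

def tokenize(text):
    # Lowercase, then keep runs of ASCII letters only, so "input." becomes "input".
    return re.findall(r"[a-z]+", text.lower())

def regex_stemmer(word):
    # Same suffix pattern as the regex stemmer in the demonstration.
    return re.sub(r"(ing|ly|ed|es|s|er)$", "", word)

sample = "In basic classification tasks, each input"
tokens = tokenize(sample)
stems = [regex_stemmer(token) for token in tokens]
print(tokens)
print(stems)
```

With the comma removed, `tasks` now reaches the suffix rule and is reduced to `task`.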


3. **Differences among Regex, Porter and Lancaster Stemmer**

| Stemmer               | Characteristics                                                                                      | Observations from Output                                                                                                                                                                                          |
| --------------------- | ---------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Regex Stemmer**     | Single regular expression that strips one suffix from a fixed list; case-sensitive, no stem checks   | Produces inconsistent stems: leaves `Classification` unchanged and does not lowercase, handles `choosing -> choos`, but damages short words (`is -> i`, `class -> clas`).                                           |
| **Porter Stemmer**    | Multi-step suffix-stripping algorithm for English; moderate aggressiveness                           | Produces more consistent, lowercased stems (`Classification -> classif`, `choosing -> choos`, `class -> class`). Leaves short words such as `is` and `given` intact.                                                |
| **Lancaster Stemmer** | Iterative rule-based algorithm; the most aggressive of the three                                     | Produces the shortest stems (`Classification -> class`, `choosing -> choos`, `given -> giv`, `basic -> bas`). Can over-stem: `classification` and `class` collapse to the same stem, losing the distinction.        |

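- Regex Stemmer is customizable but naive: it applies one pattern with no check on what remains of the word.
- Porter Stemmer is widely used in NLP and balances stem consistency against stemming aggressiveness.
- Lancaster Stemmer is very aggressive; it reduces words more drastically, which can merge distinct words or remove important meaning.
- All three leave `input.` and `tasks,` unchanged, because `text.split()` keeps trailing punctuation attached to the token and the suffix rules never see a clean word ending.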
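
Because the regex approach is customizable, its weakest results in the output (`is -> i`, `class -> clas`) can be reduced with two small guards. The following is a minimal sketch, not the stemmer used in the demonstration: the name `guarded_regex_stemmer`, the `min_stem_len` parameter, and its default of 3 are assumptions made for illustration.

```python
import re

# Same suffix list as the original pattern, except that a bare "s" is only
# stripped when it is not preceded by another "s" (protects "class").
SUFFIX = re.compile(r"(ing|ly|ed|es|er|(?<!s)s)$")

def guarded_regex_stemmer(word, min_stem_len=3):
    # Lowercase first so "Classification" and "classification" match.
    word = word.lower()
    stem = SUFFIX.sub("", word)
    # Reject stems that are too short (protects "is").
    return stem if len(stem) >= min_stem_len else word

for word in ["is", "class", "choosing", "tasks", "classes"]:
    print(word, "->", guarded_regex_stemmer(word))
```

These guards only patch specific failure cases; the result remains a single-pass suffix stripper and is still much simpler than Porter or Lancaster.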

Conclusion:

A regex stemmer suits quick, task-specific rules; Porter is a sound default for general NLP tasks; Lancaster fits cases where maximal vocabulary reduction matters more than keeping related words distinct.
The choice depends on the task goal, e.g., text classification vs. detailed linguistic analysis.