# $$ Step\ 6 : Stemming $$

_______________

# **Text Preprocessing: Stemming in NLP**

## **1️⃣ What is Stemming?**
Stemming is a technique in **Natural Language Processing (NLP)** that reduces words to a common base form (the *stem*) by stripping affixes, most often suffixes.  
For example:
- **"running" → "run"**
- **"connected" → "connect"**
- **"amazing" → "amaz"**

Stemming is useful when we want to group different word forms together to reduce vocabulary size and improve text processing efficiency.
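
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. The `SUFFIXES` list and the `naive_stem` function are hypothetical names invented for this illustration; this is not the Porter algorithm, which applies many more rules and conditions.

```python
# Illustrative only: a toy suffix stripper, far simpler than the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, hand-picked suffix list


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["connected", "connecting", "connects", "running"]:
    print(w, "->", naive_stem(w))
```

The three forms of "connect" collapse to one stem, while "running" becomes "runn". Rule-based stemmers such as Porter add extra rules (for example, undoubling the final consonant) to handle cases like this.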

---

## **2️⃣ Why Use Stemming?**
✔ **Reduces complexity:** Minimizes the number of unique words in a dataset.  
✔ **Supports text analysis:** Lets sentiment analysis, search, and chatbot pipelines treat the inflected forms of a word as a single term.  
✔ **Saves memory:** Fewer unique words mean less storage and computation.  

⚠ **But beware!** Stemming can sometimes produce incorrect base forms that are not actual words.  
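
The effect on vocabulary size can be checked with a short sketch. The token list below is a hypothetical toy corpus chosen for illustration; only `PorterStemmer` and its `stem` method come from NLTK.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical toy corpus, already split into tokens
tokens = ["connect", "connected", "connecting", "connects",
          "learn", "learned", "learning", "learns"]

vocabulary = set(tokens)
stemmed_vocabulary = {stemmer.stem(t) for t in tokens}

print(len(vocabulary), "unique tokens before stemming")
print(len(stemmed_vocabulary), "unique stems after stemming")

# A stem is not guaranteed to be a dictionary word
print(stemmer.stem("experience"))
```

Here eight distinct tokens collapse to two stems, and "experience" is reduced to "experi", which is not an English word.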

---

## **3️⃣ Implementing Stemming in Python**
The examples further below use the **PorterStemmer** class from the **NLTK** library to stem sample words and sentences.

__________

## **4️⃣ When to Use Stemming?**
🔹 When reducing vocabulary size matters more than preserving precise meaning. \
🔹 In search engines (e.g., "run", "runs", and "running" should be treated as the same term).\
🔹 For quick text preprocessing in machine learning models.
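
The search-engine use case can be sketched as follows. The `documents` collection and the `stem_text` and `search` helpers are hypothetical and exist only for this illustration; a real search engine would use a proper tokenizer and an inverted index.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini document collection
documents = {
    1: "she runs every morning",
    2: "running shoes on sale",
    3: "the connection was lost",
}


def stem_text(text):
    """Lower-case, split on whitespace, and stem each token."""
    return {stemmer.stem(tok) for tok in text.lower().split()}


# Map each document id to the set of stems it contains
index = {doc_id: stem_text(text) for doc_id, text in documents.items()}


def search(query):
    """Return ids of documents that share at least one stem with the query."""
    query_stems = stem_text(query)
    return sorted(doc_id for doc_id, stems in index.items() if stems & query_stems)


print(search("run"))  # matches the documents containing "runs" and "running"
print(search("ran"))  # no match: the irregular form "ran" is not stemmed to "run"
```

The second query shows a limit of suffix stripping: regular inflections are merged, but irregular forms such as "ran" are left as they are.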

________________

### Example:

In [4]:
from nltk.stem import PorterStemmer

In [5]:
# Create a stemmer 
stemmer = PorterStemmer()

In [7]:
tokens = ['connecting', 'connected', 'connectivity', 'connect', 'connects']

for token in tokens:
    print(token, " : ", stemmer.stem(token))

connecting  :  connect
connected  :  connect
connectivity  :  connect
connect  :  connect
connects  :  connect


### Example:

In [8]:
tokens = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']

for token in tokens:
    print(token, " : ", stemmer.stem(token))

learned  :  learn
learning  :  learn
learn  :  learn
learns  :  learn
learner  :  learner
learners  :  learner
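
The output above shows that the Porter stemmer keeps "learner" and "learners" apart from "learn". One way to see which forms end up together is to group tokens by their stem. The sketch below assumes the same token list; the `groups` dictionary is a name introduced here for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = ["learned", "learning", "learn", "learns", "learner", "learners"]

# Map each stem to the list of original tokens that produce it
groups = defaultdict(list)
for token in tokens:
    groups[stemmer.stem(token)].append(token)

for stem, forms in groups.items():
    print(stem, "<-", forms)
```

Two groups appear: one under "learn" and one under "learner". If a task needs "learner" merged with "learn", the Porter stemmer alone will not do that.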


### Example: Stemming in Sentiment Analysis

In [9]:
# Sample product reviews
reviews = [
    "The product is amazing and works perfectly!",
    "I really loved the experience of using this.",
    "This is the worst purchase I have ever made.",
    "The shipping was delayed, but the product is great."
]


In [12]:
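# Tokenize and stem words in reviews
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# word_tokenize needs the Punkt tokenizer data; recent NLTK releases name it "punkt_tab"
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

# Create a stemmer
ps = PorterStemmer()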

stemmed_reviews = []
for review in reviews:
    words = word_tokenize(review)  # Tokenize words
    stemmed_words = [ps.stem(word) for word in words]  # Apply stemming
    stemmed_reviews.append(" ".join(stemmed_words))  # Join words back into sentences

In [13]:
# Print results
for original, stemmed in zip(reviews, stemmed_reviews):
    print(f"Original: {original}\nStemmed: {stemmed}\n")

Original: The product is amazing and works perfectly!
Stemmed: the product is amaz and work perfectli !

Original: I really loved the experience of using this.
Stemmed: i realli love the experi of use thi .

Original: This is the worst purchase I have ever made.
Stemmed: thi is the worst purchas i have ever made .

Original: The shipping was delayed, but the product is great.
Stemmed: the ship wa delay , but the product is great .




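#### 📌 **Observation:**

- **"amazing" → "amaz"**
- **"perfectly" → "perfectli"**
- **"loved" → "love"**
- **"experience" → "experi"**

The Porter stemmer also lower-cases every token and clips some short words ("this" → "thi", "was" → "wa"). Many stems are therefore not dictionary words (e.g., "amaz" or "perfectli"), but that is acceptable for machine learning models as long as related word forms map to the same stem.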
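
To show how stemmed text might feed a sentiment step, here is a minimal lexicon-based sketch. The `positive` and `negative` word lists and the `score` function are hypothetical and were chosen for this illustration; tokenization uses a simple regular expression instead of `word_tokenize`, so no extra NLTK data is needed.

```python
import re

from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical mini lexicon, stemmed so that it matches stemmed review tokens
positive = {ps.stem(w) for w in ["amazing", "love", "great"]}
negative = {ps.stem(w) for w in ["worst", "delay"]}

reviews = [
    "The product is amazing and works perfectly!",
    "I really loved the experience of using this.",
    "This is the worst purchase I have ever made.",
    "The shipping was delayed, but the product is great.",
]


def score(review):
    """Count positive stems minus negative stems in a review."""
    stems = [ps.stem(tok) for tok in re.findall(r"[a-z]+", review.lower())]
    return sum(s in positive for s in stems) - sum(s in negative for s in stems)


for review in reviews:
    print(score(review), review)
```

Because both the lexicon and the reviews are stemmed, "loved" matches the entry "love" and "delayed" matches "delay" without listing every inflected form by hand.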


    