## Stemming
Stemming is the process of reducing a word to its stem, the base form that remains after affixes such as suffixes and prefixes are removed. Unlike a lemma, a stem does not have to be a valid dictionary word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

In [1]:
## Classification problem
## Is a product comment a positive review or a negative review?
## Reviews ----> [eating, eat, eaten] ---> eat, [going, gone, goes] ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

### PorterStemmer

In [2]:
from nltk.stem import PorterStemmer

In [3]:
stemming=PorterStemmer()

In [4]:
for word in words:
    print(word+"---->"+stemming.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [5]:
stemming.stem('congratulations')

'congratul'

In [6]:
stemming.stem("sitting")

'sit'

### RegexpStemmer class
NLTK provides a `RegexpStemmer` class for building simple regular-expression stemmers. It takes a single regular expression and removes any prefix or suffix that matches it. Let us see an example.

**Self Notes**

The code snippet below defines a **`RegexpStemmer`**, a type of stemmer used in Natural Language Processing (NLP) to reduce words to their base or root form by removing suffixes. Stemming is a common preprocessing step in NLP tasks like text classification, sentiment analysis, and information retrieval.

Let’s break down the code:

---

### **Code Explanation**
```python
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)
```

1. **`RegexpStemmer`**:
   - This is a stemmer that uses **regular expressions** to identify and remove suffixes from words.
   - It is provided by NLTK (Natural Language Toolkit) as `nltk.stem.RegexpStemmer`.

2. **`'ing$|s$|e$|able$'`**:
   - This is the **regular expression pattern** used to match suffixes in words.
   - The `|` symbol means "OR," so the pattern matches any of the following suffixes:
     - `ing$`: Matches words ending with "ing" (e.g., "running" → "runn").
     - `s$`: Matches words ending with "s" (e.g., "cats" → "cat").
     - `e$`: Matches words ending with "e" (e.g., "love" → "lov").
     - `able$`: Matches words ending with "able" (e.g., "comfortable" → "comfort").

   - The `$` symbol ensures that the pattern matches only at the **end of the word**.

3. **`min=4`**:
   - This parameter specifies the **minimum length of the word** for stemming to be applied.
   - Words shorter than 4 characters will not be stemmed.
   - For example:
     - "cats" (4 characters) → "cat" (stemmed).
     - "is" (2 characters) → remains "is" (not stemmed).
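
To see which alternative fires for a given word, the pattern can be tried directly with Python's built-in `re` module. This is only a sketch of the regular-expression behaviour, not NLTK's code; the helper name `matched_suffix` is made up for illustration.

```python
import re

# Same pattern as in the RegexpStemmer call above.
SUFFIX_PATTERN = re.compile(r"ing$|s$|e$|able$")

def matched_suffix(word):
    """Return the suffix the pattern would remove, or None if nothing matches."""
    match = SUFFIX_PATTERN.search(word)
    return match.group(0) if match else None

for word in ["running", "cats", "love", "comfortable", "ingeating", "quick"]:
    print(word, "->", matched_suffix(word))
```

For "comfortable" the search reports `able` rather than `e`, because the regex engine returns the leftmost match and `able` starts earlier in the word. For "ingeating" only the trailing `ing` qualifies, since `$` anchors every alternative to the end of the word.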

---

### **How It Works**
- The `RegexpStemmer` applies the regular expression pattern to each word passed to its `stem()` method.
- If a word matches the pattern and meets the minimum length requirement (`min=4`), the stemmer removes the matched suffix.
- If a word does not match the pattern or is too short, it remains unchanged.
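
The bullets above can be approximated in a few lines of plain Python. The class below is a hypothetical stand-in (`SimpleRegexpStemmer` and `min_length` are not NLTK names) that mirrors the behaviour described here: skip short words, otherwise delete whatever the pattern matches.

```python
import re

class SimpleRegexpStemmer:
    """Illustrative sketch only; not NLTK's implementation."""

    def __init__(self, pattern, min_length=0):
        self._regexp = re.compile(pattern)
        self._min_length = min_length

    def stem(self, word):
        # Words shorter than the minimum length are returned untouched.
        if len(word) < self._min_length:
            return word
        # Otherwise remove every substring the pattern matches.
        return self._regexp.sub("", word)

sketch_stemmer = SimpleRegexpStemmer(r"ing$|s$|e$|able$", min_length=4)
print([sketch_stemmer.stem(w) for w in ["running", "cats", "love", "is"]])
# ['runn', 'cat', 'lov', 'is']
```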

---

### **Example**
Let’s see how this stemmer would process a list of words:

```python
words = ["running", "cats", "love", "comfortable", "is", "jumping"]
stemmed_words = [reg_stemmer.stem(word) for word in words]
print(stemmed_words)
```

**Output**:
```
['runn', 'cat', 'lov', 'comfort', 'is', 'jump']
```

- "running" → "runn" (`ing$` removed)
- "cats" → "cat" (`s$` removed)
- "love" → "lov" (`e$` removed)
- "comfortable" → "comfort" (`able$` removed)
- "is" → "is" (too short, no stemming applied)
- "jumping" → "jump" (`ing$` removed)

---

### **Use Cases**
- **Text Preprocessing**:
  - Stemming reduces words to their base form, which helps in standardizing text data for analysis.
- **Information Retrieval**:
  - Improves search results by matching different forms of the same word (e.g., "run," "running," "runs").
- **Sentiment Analysis**:
  - Reduces the vocabulary size, which can help model performance.

---

### **Limitations**
- **Over-stemming**:
  - The stemmer removes any matching suffix blindly, which can produce incorrect or nonsensical stems (e.g., "love" → "lov", "this" → "thi").
- **Under-stemming**:
  - The stemmer leaves suffixes the pattern does not cover (e.g., "happiness" → "happines": only the final "s" is removed because the pattern has no rule for "ness").
- **Language-Specific**:
  - The suffixes in this pattern are English ones; other languages need their own patterns.
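
How the pattern treats "happiness", and how adding a `ness$` rule changes the result, can be checked with the built-in `re` module. This is a sketch under stated assumptions: `strip_suffix` and the extended pattern are illustrative and not part of the original notebook or of NLTK.

```python
import re

ORIGINAL = re.compile(r"ing$|s$|e$|able$")
EXTENDED = re.compile(r"ing$|s$|e$|able$|ness$")  # hypothetical extra rule

def strip_suffix(pattern, word, min_length=4):
    """Remove the matched suffix unless the word is shorter than min_length."""
    return word if len(word) < min_length else pattern.sub("", word)

print(strip_suffix(ORIGINAL, "happiness"))  # happines
print(strip_suffix(EXTENDED, "happiness"))  # happi
```

Adding one alternative per suffix quickly becomes hard to maintain, which is one reason to prefer a rule-based stemmer for anything beyond quick experiments.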

---

### **Alternative Stemmers**
- **Porter Stemmer**:
  - A rule-based stemmer that handles a wider range of suffixes.
- **Snowball Stemmer**:
  - An improved version of the Porter Stemmer, supporting multiple languages.
- **Lemmatization**:
  - A more advanced technique that reduces words to their dictionary form (e.g., "better" → "good").

---

In summary, the `RegexpStemmer` is a simple and customizable tool for stemming words based on regular expressions. It’s useful for quick preprocessing tasks but may require tuning for specific use cases.

In [18]:
from nltk.stem import RegexpStemmer

In [19]:
reg_stemmer=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [20]:
reg_stemmer.stem('eating')

'eat'

In [21]:
reg_stemmer.stem('ingeating')

'ingeat'

In [22]:
reg_stemmer.stem('is')

'is'

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes several of its known issues.

In [23]:
from nltk.stem import SnowballStemmer

In [24]:
snowballsstemmer=SnowballStemmer('english')

In [25]:
for word in words:
    print(word+"---->"+snowballsstemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


In [26]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [27]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [33]:
snowballsstemmer.stem('goes')

'goe'

In [34]:
stemming.stem('goes')

'goe'

In [28]:
snowballsstemmer.stem('congratulations')

'congratul'

In [29]:
stemming.stem('congratulations')

'congratul'