## Stemming
Stemming is the process of reducing a word to its stem, the base form to which affixes such as suffixes and prefixes attach. The stem is not necessarily the word's dictionary form, which is known as a lemma. Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP).

### **Stemming** in NLP

**Stemming** is the process of reducing a word to its base or root form, typically by stripping suffixes or prefixes. It is a common text-preprocessing step in NLP (Natural Language Processing) because it normalizes word variants. For example, "running" and "runs" are both reduced to "run", so these variations can be treated as a single term. Irregular forms such as "ran" are usually left unchanged, because no suffix rule maps them to "run".

The main goal of stemming is to improve tasks like **text classification**, **sentiment analysis**, and **information retrieval** (for example, in **search engines**) by treating different inflected forms of a word as the same term.
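
As a rough illustration of this normalization, the sketch below counts terms before and after stemming with NLTK's `PorterStemmer`. The token list is made up for the example.

```python
from collections import Counter

from nltk.stem import PorterStemmer

# Hypothetical tokens, as they might appear across a few documents
tokens = ["connect", "connected", "connecting", "connection", "connections"]

stemmer = PorterStemmer()
raw_counts = Counter(tokens)
stemmed_counts = Counter(stemmer.stem(token) for token in tokens)

print(len(raw_counts))   # number of distinct surface forms
print(stemmed_counts)    # counts after the forms are merged by stem
```

Merging the forms shrinks the vocabulary a classifier or search index has to handle, at the price of losing the distinction between the variants.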

### How Stemming Works
Stemming algorithms work by following a set of rules or heuristics to remove common suffixes and prefixes. These rules are language-specific and are based on common patterns in word inflection. However, stemming is a **heuristic process**, meaning that it can make mistakes by producing stems that aren’t actual valid words (non-linguistic stems). For example, stemming the word "universities" might yield "univers," which is not a valid word.
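
A quick way to see this heuristic behaviour is to run a few related words through NLTK's `PorterStemmer`. The word list below is chosen only for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["universities", "university", "universe", "universal"]

# Map each word to the stem the rules produce
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

All four words end up with the same truncated stem, which is not a dictionary word. That is convenient for "universities" and "university", but it also merges "universe" into the same term.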

### **Stemming vs. Lemmatization**
- **Stemming** focuses on chopping off the end of words based on predefined rules.
- **Lemmatization** considers the context and reduces words to their dictionary form (called the lemma) using vocabulary and morphological analysis.
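
The difference can be sketched in plain Python. The suffix list and the tiny lexicon below are hypothetical stand-ins: a real lemmatizer uses a full vocabulary and part-of-speech information rather than a four-entry dictionary.

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemma
LEMMAS = {"ran": "run", "running": "run", "better": "good", "studies": "study"}


def naive_stem(word):
    """Chop a known ending off the word, with no knowledge of vocabulary."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if the lexicon knows the word."""
    return LEMMAS.get(word, word)


for word in ["running", "ran", "studies", "better"]:
    print(f"{word}: stem={naive_stem(word)}, lemma={lookup_lemma(word)}")
```

The rule-based function leaves "ran" and "better" untouched and produces "runn" and "stud", while the lookup returns real dictionary forms for every word it knows.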

### **Types of Stemming Algorithms**
Let’s now look at some important stemming algorithms and classes used in NLP, including **PorterStemmer**, **RegexpStemmer**, **SnowballStemmer**, and others.

---

### 1. **Porter Stemmer**

The **Porter Stemmer**, published by Martin Porter in 1980, is one of the most commonly used stemming algorithms. It is designed for the English language and applies five sequential steps of suffix-stripping rules to reduce a word to its stem.

#### **Characteristics**:
- **Aggressive**: The Porter Stemmer is quite aggressive in removing suffixes, often producing non-linguistic roots. For instance, "caresses" becomes "caress" and "ponies" becomes "poni".
- **Rule-based**: It uses a set of predefined rules that deal with common suffixes like “-ing,” “-ly,” “-ed,” and others.

#### **Example**:
```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["running", "runner", "runs", "easily", "fairly"]
stems = [ps.stem(word) for word in words]
print(stems)
```
**Output**:
```python
['run', 'runner', 'run', 'easili', 'fairli']
```

As seen in the output, "easily" becomes "easili" and "fairly" becomes "fairli", showing that the Porter Stemmer doesn't always produce real words.

#### **Pros**:
- **Widely Used**: It’s one of the most widely used stemmers in the NLP community.
- **Fast**: The algorithm is efficient and quick to implement.
- **Simple to Use**: The rule-based nature makes it easy to understand and apply.

#### **Cons**:
- **Over-stemming**: It can be too aggressive, producing stems that are not valid words (e.g., "easili") or merging unrelated words into one stem (e.g., "universe" and "university" both become "univers").
- **Language-Specific**: It’s designed only for English and doesn't work well with other languages.

---

### 2. **Snowball Stemmer (Porter2 Stemmer)**

The **Snowball Stemmer** comes from Snowball, a framework that Martin Porter created for writing stemming algorithms. Its English stemmer is an improved version of the original Porter algorithm and is often referred to as **Porter2**. Snowball also provides stemmers for many other languages.

#### **Characteristics**:
- **Multilingual**: Snowball Stemmer supports various languages, such as English, French, German, Spanish, Dutch, Italian, and others, making it more versatile than Porter.
- **Less Aggressive**: Compared to Porter Stemmer, it generates more natural stems, avoiding overly aggressive stripping of suffixes.
- **More Consistent**: Porter himself described the Snowball Stemmer as a more consistent and "well-behaved" algorithm compared to his original algorithm.

#### **Example**:
```python
from nltk.stem import SnowballStemmer

# For English
sb = SnowballStemmer(language='english')
words = ["running", "runner", "easily", "fairly", "studies", "studying"]
stems = [sb.stem(word) for word in words]
print(stems)
```

**Output**:
```python
['run', 'runner', 'easili', 'fair', 'studi', 'studi']
```

#### **Pros**:
- **Supports Multiple Languages**: You can use it for several languages, not just English.
- **More Accurate**: It’s generally more accurate than the original Porter Stemmer in producing valid word stems.
- **Improved Consistency**: It produces more consistent results across a wider variety of inputs.

#### **Cons**:
- **Complexity**: Although it improves on the original Porter algorithm, it's slightly more complex.
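
To check which languages are available, NLTK exposes them on the class as `SnowballStemmer.languages`. The sketch below lists them and stems one sample word per language; the non-English words are illustrative choices, and their stems are printed rather than assumed.

```python
from nltk.stem import SnowballStemmer

# Languages bundled with NLTK's Snowball implementation
print(SnowballStemmer.languages)

# The same class is instantiated once per language
samples = [("english", "running"), ("german", "laufend"), ("spanish", "corriendo")]
for language, word in samples:
    stemmer = SnowballStemmer(language)
    print(language, word, "->", stemmer.stem(word))
```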

---

### 3. **Regex Stemmer**

The **Regex Stemmer** uses **regular expressions** to define custom patterns for stemming. This approach is useful when you want precise control over the stemming process or when you need to handle specific word patterns that are not covered by rule-based stemmers like Porter or Snowball.

#### **Characteristics**:
- **Customizable**: You can define your own regex patterns to remove suffixes or prefixes in a way that fits your specific needs.
- **Use Case Specific**: It’s useful in niche applications where a generic stemming algorithm might not suffice.
- **Control Over Precision**: You can control exactly how the text is stemmed, but this also requires some expertise in regular expressions.

#### **Example**:
```python
from nltk.stem import RegexpStemmer

# Defining a custom regex to remove common suffixes
regex_stemmer = RegexpStemmer('ing$|ly$|ed$', min=4)
words = ["running", "easily", "failed", "cooked"]
stems = [regex_stemmer.stem(word) for word in words]
print(stems)
```

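**Output**:
```python
['runn', 'easi', 'fail', 'cook']
```

Note that "running" becomes "runn" rather than "run": the regular expression only deletes the matched suffix and does not undo the doubled consonant.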

#### **Pros**:
- **Highly Flexible**: You can create custom rules tailored to your specific dataset or language.
- **Efficient for Known Patterns**: If you know the patterns you want to handle, it can be very efficient.

#### **Cons**:
- **Limited Generalization**: It works best in specific cases and isn’t suitable for general NLP applications.
- **Requires Regex Knowledge**: To use it effectively, you need to know how to write regular expressions.

---

### 4. **Lancaster Stemmer**

The **Lancaster Stemmer** (or Paice-Husk Stemmer) is another rule-based stemming algorithm, but it's even more aggressive than the Porter Stemmer. It’s designed for rapid stemming, often producing very short stems.

#### **Characteristics**:
- **Aggressive**: It can often over-stem words, reducing them to very short roots.
- **Iterative**: The algorithm applies a set of predefined rules iteratively until no further stemming can be performed.
- **Fast**: Its rules are simple to apply and it is often described as faster than the Porter Stemmer, but its aggressiveness comes at the cost of accuracy.

#### **Example**:
```python
from nltk.stem import LancasterStemmer

ls = LancasterStemmer()
words = ["running", "runner", "easily", "fairly", "studies", "studying"]
stems = [ls.stem(word) for word in words]
print(stems)
```

**Output**:
```python
['run', 'run', 'easy', 'fair', 'study', 'study']
```

#### **Pros**:
- **Very Fast**: It's faster than other stemming algorithms like Porter.
- **Good for Short Texts**: It’s useful in situations where you need to perform very fast stemming, such as processing short texts or logs.

#### **Cons**:
- **Too Aggressive**: It often cuts down words too much, producing stems that are hard to interpret.
- **Less Accurate**: The stems it produces can sometimes be less meaningful than those produced by other stemmers.

---

### Other Stemmer Classes

1. **ISRI Stemmer**: This is an Arabic-specific stemmer based on the ISRI algorithm. It’s similar to Porter but adapted to Arabic text and linguistics.

2. **Cistem**: A modern German-language stemmer known for being both accurate and fast. It improves on the older Snowball-based German stemmers.

3. **Lovins Stemmer**: One of the oldest stemming algorithms (from 1968). It's less used today due to being overly aggressive and imprecise, but it's still historically significant.
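
For reference, NLTK ships the first two of these as `nltk.stem.isri.ISRIStemmer` and `nltk.stem.cistem.Cistem` (it does not include a Lovins implementation). The sketch below only shows how they are called; the sample words are illustrative and the stems are printed rather than assumed.

```python
from nltk.stem.cistem import Cistem
from nltk.stem.isri import ISRIStemmer

german_stemmer = Cistem()
arabic_stemmer = ISRIStemmer()

# Both classes follow the same stem(word) interface as the English stemmers
german_stem = german_stemmer.stem("Speicherbehältern")
arabic_stem = arabic_stemmer.stem("يكتبون")

print(german_stem)
print(arabic_stem)
```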

---

### **Comparison Table of Different Stemmers**

| **Stemmer**           | **Strengths**                                         | **Weaknesses**                                        | **Languages Supported** |
|-----------------------|-------------------------------------------------------|-------------------------------------------------------|-------------------------|
| **Porter Stemmer**     | Fast, simple, widely used                             | Can over-stem, only works for English                 | English                 |
| **Snowball Stemmer**   | More accurate than Porter, supports multiple languages| Slightly more complex than Porter                     | Multiple                |
| **Regex Stemmer**      | Customizable, good for niche applications             | Requires regex knowledge, not general-purpose         | Custom (depends on patterns) |
| **Lancaster Stemmer**  | Very fast                                             | Too aggressive, less accurate                         | English                 |
| **ISRI Stemmer**       | Good for Arabic text                                  | Limited to Arabic                                     | Arabic                  |
| **Cistem**             | Fast and accurate for German                          | Limited to German                                     | German                  |
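
To compare the English stemmers on your own vocabulary, a side-by-side printout is usually enough. This is a minimal sketch; the word list is arbitrary and should be replaced with words from your data.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}
words = ["running", "fairly", "studies", "generously"]

# Header row, then one row per word with each stemmer's result
print(f"{'word':<12}" + "".join(f"{name:<12}" for name in stemmers))
for word in words:
    row = "".join(f"{stemmer.stem(word):<12}" for stemmer in stemmers.values())
    print(f"{word:<12}{row}")
```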

---

### **Conclusion**:
Stemming is a critical preprocessing step in NLP tasks like text classification and information retrieval. While several stemming algorithms exist, each comes with its strengths and weaknesses. For most general use cases, **Snowball Stemmer** provides a good balance of accuracy and efficiency across multiple languages, while **Porter Stemmer** remains popular for English text processing due to its simplicity. **Regex-based stemming** is useful for specific cases where flexibility is needed, while **Lancaster Stemmer** is best when speed is a priority.

```python
# Classification problem:
# decide whether a comment on a product is a positive or a negative review.
# Goal: reviews ----> [eating, eat, eaten] ----> eat, [going, gone, goes] ----> go

words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]
```

### PorterStemmer

```python
!pip install nltk
```






```python
from nltk.stem import PorterStemmer

stemming = PorterStemmer()

for word in words:
    print(word + "---->" + stemming.stem(word))
```

**Output**:
```
eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final
```


```python
stemming.stem('congratulations')  # 'congratul'
stemming.stem("sitting")          # 'sit'
```

### RegexpStemmer class
NLTK provides the `RegexpStemmer` class for building regular-expression stemmers. It takes a single regular expression and removes every part of the word that matches it, so anchors such as `$` decide whether a prefix, a suffix, or any occurrence is stripped. Words shorter than `min` characters are left unchanged. Let us see an example.

```python
from nltk.stem import RegexpStemmer

reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

reg_stemmer.stem('eating')     # 'eat'

# '$' anchors each pattern to the end of the word, so only the trailing 'ing'
# is removed. To remove 'ing' wherever it occurs, drop the '$' from the pattern.
reg_stemmer.stem('ingeating')  # 'ingeat'
```
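
The effect of the `$` anchor and of `min` can be reproduced with the standard `re` module. The helper below is a hypothetical stand-in that mimics the behaviour described above, not NLTK's own code.

```python
import re


def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern`, skipping words shorter than `min_length`."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


anchored = regexp_stem("ingeating", r"ing$|s$|e$|able$", min_length=4)
unanchored = regexp_stem("ingeating", r"ing|s$|e$|able$", min_length=4)
too_short = regexp_stem("was", r"ing$|s$|e$|able$", min_length=4)

print(anchored)    # ingeat: only the trailing "ing" is removed
print(unanchored)  # eat: every "ing" is removed
print(too_short)   # was: shorter than min_length, so left unchanged
```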

### Snowball Stemmer
The Snowball English stemmer is also known as the Porter2 stemming algorithm. It is an improved version of the Porter Stemmer that fixes some of the original algorithm's issues.

```python
from nltk.stem import SnowballStemmer

snowballstemmer = SnowballStemmer('english')

for word in words:
    print(word + "---->" + snowballstemmer.stem(word))
```

**Output**:
```
eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final
```


```python
# Porter (the `stemming` object created earlier)
stemming.stem("fairly"), stemming.stem("sportingly")                # ('fairli', 'sportingli')

# Snowball
snowballstemmer.stem("fairly"), snowballstemmer.stem("sportingly")  # ('fair', 'sport')

# Both stemmers produce the same incomplete stem for "goes"
snowballstemmer.stem('goes')  # 'goe'
stemming.stem('goes')         # 'goe'
```

Stemming **Hindi** (or other languages such as Arabic or Chinese) is more challenging than stemming English for several linguistic and computational reasons. Most well-known stemming algorithms (like **Porter** or **Snowball**) were designed for **English** or other European languages, so they cannot be applied directly to Hindi. Below are some key reasons why stemming Hindi words is difficult:

### 1. **Complex Morphology**:
Hindi has a **complex inflectional morphology** compared to English. This means that word forms change significantly based on gender, number, tense, and case. A single word can have many inflected forms that differ more drastically than the regular suffixes in English.

- For example, the word "लड़का" (boy) can change to "लड़के" (boys), "लड़कों" (boys in the plural oblique case), "लड़के की" (boy's), etc.
- In English, you would generally add simple suffixes like **-s**, **-ed**, or **-ing** to convey these changes, but in Hindi, word endings change more unpredictably, making stemming more complicated.
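
A naive suffix list can collapse the forms above, but it also shows how quickly such rules over-stem. The suffix list in this sketch is a hypothetical, incomplete sample and not a published Hindi stemmer; `unicodedata.normalize` is used so that differently encoded copies of the same word compare equal.

```python
import unicodedata

# Hypothetical suffix list, longest first; not a complete rule set
HINDI_SUFFIXES = ["ियों", "ों", "ें", "ा", "े", "ी"]


def naive_hindi_stem(word):
    """Strip one listed suffix if at least two characters remain."""
    word = unicodedata.normalize("NFC", word)
    for suffix in HINDI_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


# boy, boys, boys (oblique plural), girl
forms = ["लड़का", "लड़के", "लड़कों", "लड़की"]
stems = [naive_hindi_stem(form) for form in forms]
for form, stem in zip(forms, stems):
    print(form, "->", stem)
```

All four forms reduce to the same string, so the three "boy" forms are merged as intended, but "लड़की" (girl) is merged with them as well, and the shared stem is not itself a word.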

### 2. **Agglutination**:
Hindi shows some **agglutinative** behaviour (attaching affixes and particles to a base word to form new forms). Case markers, postpositions, and possessive markers can attach to or follow the base word and change its form, so a simple rule-based stemming algorithm may not be sufficient to handle these cases.

For example:
- "लड़कों" (boys) → "लड़का" (boy).
- But "लड़कों" is the oblique plural, the form used before postpositions, so it also appears in phrases such as "लड़कों के लिए" (for the boys), and the correct analysis depends on context.

### 3. **Derivational Morphology**:
In Hindi, word derivations involve adding **suffixes** and sometimes **prefixes**, which often convey more meaning and drastically change the word. For instance, verb forms in Hindi change more dramatically when derived into nouns or adjectives than in English.

Example:
- "खेलना" (to play) → "खेल" (game or play).
  
Stemming this correctly requires knowledge of the **morphological structure** and meaning of the word.

### 4. **Script and Phonetics**:
Hindi is written in the **Devanagari script**, which has a different structure from the Latin alphabet used for English. This script presents some challenges for standard NLP tools that have been developed with English in mind. Devanagari is an abugida: each consonant character carries an inherent vowel, and other vowels are written as dependent signs attached to the consonant, so written units do not map one-to-one onto letters as they do in English.

Additionally:
- **Matras** (diacritic marks used to modify the vowel sound) add another level of complexity. For instance, **शिक्षक** (teacher) and **शिक्षिका** (female teacher) have gender-based variations that change the stem.
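
In Unicode, a matra is a separate code point that follows its consonant, which matters for any rule that slices strings. The short sketch below inspects one word with the standard `unicodedata` module.

```python
import unicodedata

word = "की"  # the postposition used in "लड़के की"

# List each code point with its official Unicode name
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}", name)

print(len(word))  # counts code points, not written syllables
```

Because `len()` and slicing work on code points, a suffix rule can cut a consonant off from its vowel sign unless the rules are written with the script in mind.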

### 5. **Lack of Well-defined Stemming Rules for Hindi**:
Unlike English, where linguistic research has resulted in well-defined stemming rules (like Porter or Snowball), Hindi lacks a universally accepted stemming algorithm. This is because:
- **Hindi grammar is more irregular**, and no single set of rules can easily cover all the transformations that happen during word inflection or derivation.
- Hindi words have many **irregular forms**, and the way different words are transformed doesn't follow predictable patterns as consistently as in English.

### 6. **Compound Words**:
Hindi has **compound words** (समास), where multiple words combine into a single word. These compound words can be quite complex, and breaking them into constituent parts (for stemming) can be a difficult task.

For example:
- "राजपुत्र" (prince) is a compound word made from "राज" (king) and "पुत्र" (son). A simple stemming algorithm wouldn’t necessarily split this word correctly.

### 7. **Ambiguity**:
Hindi words often have multiple forms for different cases, and a simple stemming approach may not be able to differentiate between them accurately. For instance, the same word can serve different roles in different contexts (e.g., subject, object, possessive), and without contextual understanding, it is difficult to derive the correct base form of the word.

### 8. **Loanwords**:
Hindi also contains many **loanwords** from other languages like Arabic, Persian, English, and Sanskrit. These words often retain their original morphological rules, making stemming more challenging. An algorithm needs to handle these borrowed words separately, or it will fail to produce meaningful stems.

---

### **Solutions for Stemming in Hindi**:

1. **Hindi-Specific Stemmers**:
   Instead of applying English-based algorithms like Porter or Snowball, researchers have developed **Hindi-specific stemming algorithms**. These algorithms take into account the unique characteristics of the Hindi language, such as inflection and agglutination.
   
   For example, some work has been done to create **rule-based stemmers** or **machine learning models** that can stem Hindi text more accurately.

2. **Morphological Analysis**:
   Instead of traditional stemming, **morphological analyzers** are better suited for languages like Hindi. These tools analyze the internal structure of words to accurately derive their root forms. They consider grammatical rules, gender, number, and tense variations.
   
3. **Lemmatization**:
   For highly inflected languages like Hindi, **lemmatization** (finding the base dictionary form of a word) is often more effective than stemming. Lemmatization relies on morphological analysis and a predefined lexicon, making it better suited for handling complex Hindi word structures.

4. **Language-Specific NLP Tools**:
   Developing and using **Hindi-specific NLP libraries** is crucial. Tools like **Indic NLP Library** and **iNLTK** provide functionality tailored to the linguistic characteristics of Hindi.

---

### Conclusion:
Stemming Hindi words is challenging due to the complex inflectional morphology, agglutination, compound words, script complexity, and irregular word forms. While traditional stemming methods like Porter and Snowball work well for English, they fail to handle the intricacies of Hindi. For effective stemming, Hindi requires **language-specific stemming algorithms**, **morphological analyzers**, or **lemmatization techniques** that consider the unique characteristics of the language.

Creating a custom **stemmer** allows you to define specific rules for handling words according to the language or context you're working with. This is particularly useful when existing stemmers (like **Porter** or **Snowball**) don’t fit your needs, especially for highly inflected or non-English languages. Below, I'll walk you through how to create your own stemming function using Python, followed by a working example.

### Steps to Create Your Own Stemmer:
1. **Understand the language or context**: First, understand how words are formed in the language you're dealing with. Identify common suffixes, prefixes, and other morphological structures.
2. **Define rules for stripping affixes**: Based on your understanding, create rules for removing suffixes and prefixes from words.
3. **Apply conditional checks**: Use conditions to ensure that words aren't incorrectly stemmed (e.g., by checking word length or common exceptions).
4. **Test with a sample dataset**: Always test your stemmer on a sample of words and fine-tune the rules as needed.
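
Step 4 can be as simple as a table of words with the stems you expect. The sketch below is one possible harness; `strip_s` and the expected stems are hypothetical placeholders for your own stemmer and test data.

```python
def evaluate_stemmer(stem, expected):
    """Return (word, expected, actual) for every word the stemmer gets wrong."""
    return [
        (word, want, stem(word))
        for word, want in expected.items()
        if stem(word) != want
    ]


# Hypothetical placeholder stemmer: strips a trailing "s" from longer words
def strip_s(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


# Hypothetical gold data: word -> stem we would like to get
expected_stems = {"cats": "cat", "dogs": "dog", "bus": "bus", "glass": "glass"}

failures = evaluate_stemmer(strip_s, expected_stems)
for word, want, got in failures:
    print(f"{word}: expected {want}, got {got}")
```

Each failure points at a rule to refine. Here "glass" shows that the trailing "s" rule needs an exception for words ending in "ss".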

### Example: Custom Stemmer in Python
Let's create a simple custom stemmer for English, focusing on a few common suffixes like `-ing`, `-ly`, `-ed`, `-es`, etc. You can extend this by adding more rules or working on other languages.

```python
# Custom stemmer for English words
def custom_stemmer(word):
    """
    A simple custom stemmer that removes one common suffix from an English word.
    """

    # Common suffixes, longest first, so that "ness" is tried before "s"
    suffixes = ['ment', 'ness', 'tion', 'able', 'ing', 'ly', 'ed', 'es', 's']

    for suffix in suffixes:
        stem = word[:-len(suffix)]

        # Strip the suffix only if the remaining stem has at least 3 characters
        if word.endswith(suffix) and len(stem) >= 3:
            if stem.endswith('i'):
                # Restore a final "y": "happi" -> "happy"
                stem = stem[:-1] + 'y'
            elif stem[-1] == stem[-2] and stem[-1] not in 'aeioulsz':
                # Undo consonant doubling: "runn" -> "run"
                stem = stem[:-1]
            return stem

    # No suffix could be stripped safely
    return word


# Example usage
words = ["running", "happily", "played", "watches", "strongness", "explanation", "happiness", "portable"]

# Apply the custom stemmer to each word
stemmed_words = [custom_stemmer(word) for word in words]

# Display the results
for original, stemmed in zip(words, stemmed_words):
    print(f"Original: {original} -> Stemmed: {stemmed}")
```

### Explanation of the Code:
1. **Define suffixes**: We start by listing common suffixes that we want to strip from words (e.g., `-ing`, `-ly`, `-ed`, etc.).
2. **Check and strip suffixes**: For each word, we check whether it ends with one of the suffixes, and if it does, we remove the suffix.
3. **Handle short stems**: To avoid over-stemming, we ensure that the stemmed word doesn't become too short (e.g., stemming "sing" to "s" would be incorrect). We set a condition that if the stem becomes too short (less than 3 characters), we stop stripping suffixes.
4. **Apply the stemmer**: We apply the custom stemmer to a list of words and print out the original and stemmed words for comparison.

### Output:
```
Original: running -> Stemmed: run
Original: happily -> Stemmed: happy
Original: played -> Stemmed: play
Original: watches -> Stemmed: watch
Original: strongness -> Stemmed: strong
Original: explanation -> Stemmed: explana
Original: happiness -> Stemmed: happy
Original: portable -> Stemmed: port
```

### Explanation of Results:
- The custom stemmer works well for common suffixes like `-ing`, `-ly`, `-ed`, and `-es`. 
- However, it isn’t perfect: for example, "explanation" is stemmed to "explana", which is not ideal. This highlights the limitations of simple rule-based stemmers. You can improve this by refining the rules or adding exceptions.

### Enhancing the Stemmer
To improve the custom stemmer, consider:
1. **Handling Irregular Words**: Add rules to handle irregular word forms (e.g., "went" -> "go", "better" -> "good").
2. **Add Prefix Stripping**: You can extend the stemmer to handle common prefixes like "un-", "re-", "pre-", etc.
3. **Use Dictionaries**: Combine the stemmer with a dictionary of known base words to prevent over-stemming (similar to **lemmatization**).
4. **Language-Specific Rules**: Adapt the rules to work with languages like Hindi, which may have more complex morphological structures (as discussed earlier).

Here’s how you can add prefix handling:

```python
# Custom stemmer with prefix handling
def custom_stemmer_v2(word):
    """
    A custom stemmer that strips one common suffix and one common prefix.
    """

    # Common suffixes (longest first) and prefixes to handle
    suffixes = ['ment', 'ness', 'tion', 'able', 'ing', 'ly', 'ed', 'es', 's']
    prefixes = ['dis', 'pre', 'un', 're', 'in']

    # Strip at most one suffix, keeping a stem of at least 3 characters
    for suffix in suffixes:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            if stem.endswith('i'):
                # Restore a final "y": "happi" -> "happy"
                stem = stem[:-1] + 'y'
            elif stem[-1] == stem[-2] and stem[-1] not in 'aeioulsz':
                # Undo consonant doubling: "runn" -> "run"
                stem = stem[:-1]
            word = stem
            break

    # Strip at most one prefix, keeping at least 2 characters (e.g. "do")
    for prefix in prefixes:
        rest = word[len(prefix):]
        if word.startswith(prefix) and len(rest) >= 2:
            word = rest
            break

    return word

# Example usage with prefix handling
words = ["running", "undoing", "replay", "happily", "dislike", "prepaid", "unhappy"]
stemmed_words = [custom_stemmer_v2(word) for word in words]

# Display the results
for original, stemmed in zip(words, stemmed_words):
    print(f"Original: {original} -> Stemmed: {stemmed}")
```

### Output with Prefix Handling:
```
Original: running -> Stemmed: run
Original: undoing -> Stemmed: do
Original: replay -> Stemmed: play
Original: happily -> Stemmed: happy
Original: dislike -> Stemmed: like
Original: prepaid -> Stemmed: paid
Original: unhappy -> Stemmed: happy
```
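
The irregular-word and dictionary ideas from the enhancement list can be sketched in the same style. The exception table, the vocabulary, and the function names below are hypothetical and deliberately tiny; a real system would load them from a lexicon.

```python
# Hypothetical exception table for irregular forms
IRREGULAR = {"went": "go", "better": "good", "ran": "run"}
# Hypothetical vocabulary of known base words
KNOWN_STEMS = {"play", "watch", "sing", "bus"}


def strip_suffix(word):
    """Very small suffix stripper used as the fallback rule."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def stem_with_dictionary(word):
    if word in IRREGULAR:      # 1. irregular forms are looked up first
        return IRREGULAR[word]
    if word in KNOWN_STEMS:    # 2. already a base word: leave it alone
        return word
    candidate = strip_suffix(word)
    # 3. accept the stripped form only if the dictionary knows it
    return candidate if candidate in KNOWN_STEMS else word


for word in ["went", "sing", "bus", "played", "watches", "explanations"]:
    print(f"{word} -> {stem_with_dictionary(word)}")
```

The dictionary check makes the stemmer conservative: "sing" and "bus" are protected from over-stemming, while unknown words such as "explanations" are returned unchanged rather than cut to a guess.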

### Conclusion:
Creating your own custom stemmer allows for a flexible approach to handle specific requirements, especially when dealing with niche datasets or non-standard word patterns. By understanding the language or text's morphology, you can define tailored rules for stemming and enhance the stemmer with additional features like prefix handling and dictionary-based validation. Keep in mind, however, that custom stemmers may require iterative testing and fine-tuning to achieve the desired accuracy.