
# Using ***NLTK***

## **Stemming**
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem is not necessarily a valid dictionary word, which is what distinguishes it from a lemma. Stemming is a common preprocessing step in natural language processing (NLP) and natural language understanding (NLU).
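
To make the idea concrete before turning to NLTK, here is a toy suffix stripper in plain Python. It is only a sketch: `naive_stem` and its suffix list are hypothetical names invented for this illustration, and real stemmers apply far more careful rules.

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; a toy illustration, not a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["eating", "eats", "walked", "history"]:
    print(word + "---->" + naive_stem(word))
```

Even this crude version maps `eating` and `eats` to the same stem, which is the property a classifier benefits from.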

In [49]:
## Classification problem (say, sentiment analysis):
## decide whether a product comment is a positive or a negative review.
## Words in reviews: [eating, eats, eaten] ----> eat; [going, gone, goes] ----> go

words = ["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

# **PorterStemmer**

In [50]:
from nltk.stem import PorterStemmer

In [51]:
porter_stemmer = PorterStemmer()

In [52]:
for word in words:
  print(word+"---->"+porter_stemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


### Here we can see that for words like:
eaten---->eaten<br>
history---->histori<br>

The stemmer applies its rules, but the result is not always a meaningful word: `histori` is not an English word, and `eaten` is not reduced to `eat`. This is the major disadvantage of stemming. It works well for a good number of words, but not for all of them.
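
One way to see both effects at once is to group the words by their stem. The snippet below is a minimal sketch that reuses the `words` list from this notebook; `groups` is a name introduced here for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

words = ["eating", "eats", "eaten", "writing", "writes",
         "programming", "programs", "history", "finally", "finalized"]

porter_stemmer = PorterStemmer()

# Map each stem to the list of original words that produced it
groups = defaultdict(list)
for word in words:
    groups[porter_stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem + "---->" + ", ".join(members))
```

Here `eating` and `eats` share the key `eat`, while `eaten` ends up in a group of its own, and `history` is filed under the non-word `histori`.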

In [53]:
porter_stemmer.stem("congratulations")

'congratul'

In [54]:
porter_stemmer.stem("Speaking ")

'speaking '
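
The trailing space in `"Speaking "` is the reason no suffix is stripped: the word no longer ends in `ing`, so only lowercasing is applied. A small sketch of cleaning the input first, where `raw`, `stem_raw` and `stem_clean` are names introduced here:

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

raw = "Speaking "

# Stemming the raw string: the trailing space hides the "ing" suffix
stem_raw = porter_stemmer.stem(raw)

# Stripping whitespace first lets the suffix rule apply
stem_clean = porter_stemmer.stem(raw.strip())

print(repr(stem_raw))    # 'speaking '
print(repr(stem_clean))  # 'speak'
```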

## **RegexpStemmer class**
NLTK provides the RegexpStemmer class, which makes it easy to implement stemmers based on regular expressions. It takes a single regular expression and removes any substring that matches it, typically a prefix or a suffix. Let us see an example.

In [55]:
from nltk.stem import RegexpStemmer

**A stemmer that uses regular expressions to identify morphological
affixes.  Any substrings that match the regular expressions will
be removed.**

In [56]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$|tion$', min=4)

In [57]:
reg_stemmer.stem("eating")

'eat'

In [58]:
# `$` anchors each pattern to the end of the word, so the leading "ing" is kept
reg_stemmer.stem("ingeating")

'ingeat'

In [59]:
reg_stemmer.stem("congratulations")

'congratulation'

In [60]:
reg_stemmer.stem("congratulation")

'congratula'

In [61]:
reg_stemmer.stem("Finalized")

'Finalized'

In [62]:
reg_stemmer.stem("Hatred")

'Hatred'
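
The results above can be reproduced with the standard `re` module. The sketch below assumes that RegexpStemmer amounts to a minimum-length check followed by a regex substitution; `regexp_stem`, `PATTERN` and `min_length` are names introduced here, not part of NLTK.

```python
import re

PATTERN = r"ing$|s$|e$|able$|tion$"


def regexp_stem(word, pattern=PATTERN, min_length=0):
    """Remove whatever the pattern matches, unless the word is too short."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)


for word in ["eating", "ingeating", "congratulation", "Finalized", "is"]:
    print(word + "---->" + regexp_stem(word, min_length=4))
```

`Finalized` and `Hatred` come back unchanged simply because the pattern has no `ed$` alternative, and `is` is protected by the minimum length of 4.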

## **Snowball Stemmer**
The Snowball stemmer (its English variant is also known as Porter2) is a refinement of the Porter stemmer and often produces better stems.
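
Unlike the Porter stemmer, the Snowball stemmer must be told which language to use. As a quick sketch, the supported language names can be read from the `languages` attribute of the class:

```python
from nltk.stem import SnowballStemmer

# Tuple of language names accepted by the constructor
print(SnowballStemmer.languages)

# The language is chosen when the stemmer is created
snowball_stemmer = SnowballStemmer("english")
print(snowball_stemmer.stem("fairly"))  # 'fair'
```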

In [63]:
from nltk.stem import SnowballStemmer

In [64]:
snowball_stemmer = SnowballStemmer('english')  # various languages are available

In [65]:
for word in words:
  print(word+"---->"+snowball_stemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final


For the words above, the output is identical to that of the Porter stemmer.

In [66]:
porter_stemmer.stem("fairly"), porter_stemmer.stem("sportingly")

('fairli', 'sportingli')

In [67]:
# Now we can observe the difference
snowball_stemmer.stem("fairly"), snowball_stemmer.stem("sportingly")

('fair', 'sport')

In [68]:
snowball_stemmer.stem("goes")

'goe'
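
To spot where the two stemmers disagree without comparing outputs by eye, both can be run over the same words. This is a small sketch; `comparison` is a name introduced here.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")

# word -> (Porter stem, Snowball stem)
comparison = {}
for word in ["fairly", "sportingly", "history", "goes"]:
    comparison[word] = (porter_stemmer.stem(word), snowball_stemmer.stem(word))

for word, (porter, snowball) in comparison.items():
    note = "" if porter == snowball else "  (differs)"
    print(word + "---->" + porter + " | " + snowball + note)
```

Snowball handles the `-ly` adverbs better, but `goes` still becomes `goe`, so it shares the basic limitation of stemming.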

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stemmer Comparison</title>
<style>
    table {
        border-collapse: collapse;
        width: 100%;
    }
    th, td {
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }
    th {
        background-color: #f2f2f2;
    }
</style>
</head>
<body>

<h2>Stemmer Comparison</h2>

<table>
    <tr>
        <th>Feature</th>
        <th>PorterStemmer</th>
        <th>RegexpStemmer</th>
        <th>Snowball Stemmer</th>
    </tr>
    <tr>
        <td>Algorithm</td>
        <td>Uses a set of rules and suffix stripping to normalize words.</td>
        <td>Applies regular expressions to remove common word endings.</td>
        <td>Employs language-specific algorithms for stemming.</td>
    </tr>
    <tr>
        <td>Language Support</td>
        <td>Designed for English; not reliable for other languages.</td>
        <td>Language agnostic; can be customized for specific languages.</td>
        <td>Provides extensive language support with separate algorithms for each.</td>
    </tr>
    <tr>
        <td>Flexibility</td>
        <td>Limited flexibility; follows predefined rules for stemming.</td>
        <td>High flexibility; allows custom regular expressions for stemming.</td>
        <td>Moderate flexibility; offers language-specific stemming algorithms.</td>
    </tr>
    <tr>
        <td>Performance</td>
        <td>Generally faster due to its simplicity.</td>
        <td>Speed depends on the complexity of the regular expressions used.</td>
        <td>Performance varies based on language and algorithm complexity.</td>
    </tr>
    <tr>
        <td>Example</td>
        <td>'running' -> 'run'</td>
        <td>Depends on the pattern; with 'ing$', 'running' -> 'runn'</td>
        <td>'running' -> 'run'</td>
    </tr>
</table>

</body>
</html>
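
The Example row can be checked by running a single word through all three stemmers. This is a sketch that reuses the pattern from the RegexpStemmer section; `results` is a name introduced here, and the RegexpStemmer output depends entirely on the pattern supplied.

```python
from nltk.stem import PorterStemmer, RegexpStemmer, SnowballStemmer

stemmers = {
    "PorterStemmer": PorterStemmer(),
    "RegexpStemmer": RegexpStemmer('ing$|s$|e$|able$|tion$', min=4),
    "SnowballStemmer": SnowballStemmer('english'),
}

# Stem the same word with each stemmer
results = {name: stemmer.stem("running") for name, stemmer in stemmers.items()}

for name, stem in results.items():
    print(name + "---->" + stem)
```

The two rule-based stemmers undouble the final consonant and return `run`, whereas the regular expression only deletes `ing` and leaves `runn`.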


<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Stemmer Comparison</title>
<style>
    table {
        border-collapse: collapse;
        width: 100%;
    }
    th, td {
        border: 1px solid #dddddd;
        text-align: left;
        padding: 8px;
    }
    th {
        background-color: #f2f2f2;
    }
</style>
</head>
<body>

<h2>Major Conditions for Usage:</h2>
<ol>
    <li><strong>PorterStemmer:</strong>
        <ul>
            <li>Use when you need a simple and fast stemming algorithm.</li>
            <li>Suitable for general text processing tasks where accuracy is not critical.</li>
            <li>Available in various programming languages, making it widely accessible.</li>
        </ul>
    </li>
    <li><strong>RegexpStemmer:</strong>
        <ul>
            <li>Ideal when you require flexibility in defining custom stemming rules using regular expressions.</li>
            <li>Useful for languages or domains with specific word variations not covered well by predefined algorithms.</li>
            <li>Recommended for tasks where you need fine-grained control over stemming rules.</li>
        </ul>
    </li>
    <li><strong>Snowball Stemmer:</strong>
        <ul>
            <li>Choose when you need language-specific stemming algorithms for better accuracy.</li>
            <li>Suitable for tasks involving multilingual text processing, as it supports a wide range of languages.</li>
            <li>Ideal for applications where language nuances significantly affect stemming outcomes.</li>
        </ul>
    </li>
</ol>

<h2>Three Major Differences:</h2>
<ol>
    <li><strong>Algorithm Complexity:</strong>
        <ul>
            <li>PorterStemmer and Snowball Stemmer utilize predefined algorithms based on linguistic rules and suffix stripping.</li>
            <li>RegexpStemmer employs regular expressions to remove common word endings, allowing for custom-defined stemming rules.</li>
        </ul>
    </li>
    <li><strong>Language Support:</strong>
        <ul>
            <li>PorterStemmer is designed for English; applying it to other languages gives unreliable stems.</li>
            <li>RegexpStemmer is language-agnostic and can be customized for specific languages or domains.</li>
            <li>Snowball Stemmer provides extensive language support with separate algorithms for each language, ensuring better accuracy for different linguistic contexts.</li>
        </ul>
    </li>
    <li><strong>Flexibility:</strong>
        <ul>
            <li>PorterStemmer has limited flexibility as it follows predefined stemming rules.</li>
            <li>RegexpStemmer offers high flexibility, allowing users to define custom regular expressions for stemming based on specific requirements.</li>
            <li>Snowball Stemmer provides moderate flexibility by offering language-specific stemming algorithms, accommodating language nuances and variations.</li>
        </ul>
    </li>
</ol>

</body>
</html>


Lemmatization is an alternative that addresses the disadvantages above, because it maps each word to a valid dictionary form (its lemma).