<a href="https://www.kaggle.com/code/aravindanr22052001/stemming-lemmatization-stopwords?scriptVersionId=99013036" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### What is Stemming? 

* Stemming, in literal terms, is the process of cutting the branches of a tree down to its stem. In the same way,
with the use of some basic rules, any token can be cut down to its stem `(base form)`. Stemming is a
crude, rule-based process for grouping together the different variations of a token.  

* For example, the word `eat` has variations like `eating`, `eaten`, `eats`, and so on. In applications where it
does not make sense to differentiate between `eat` and `eating`, we typically use `stemming` to reduce both
grammatical variants to the root of the word. 

* Stemming is used most of the time because of its
simplicity, but for complex languages or complex NLP tasks it is necessary to use
lemmatization instead. 

* Lemmatization is a more robust and methodical way of combining grammatical
variations to the root of a word.


<center><img src="https://drive.google.com/uc?export=view&id=1XcK3OzdPd2ywO8Y4G6vfjuIFthPce3FH" width="800"/></center>

* In stemming, we are interested in stripping the `suffix` at the end of a word, reducing an `inflected` or `derived` word to its base form. Take a look at the figure above to get some intuition about the process.
* The stems and affixes (together called `morphemes`) are extracted and used to reduce inflected forms to their base form. For instance, the word `cats` has two morphemes, `cat` and `s`: `cat` is the stem and `s` is the **`affix`** marking plurality.
* Stemming is most commonly used by search engines for indexing words. Instead of storing all forms of
a word, a search engine can store only the stems, greatly reducing the size of the index while improving
recall, since a query then matches every form of the word.
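
To make the indexing idea concrete, here is a minimal sketch of a stem-based inverted index. It assumes NLTK is installed and uses its `PorterStemmer` (covered later in this notebook); the sample documents, the `index` structure, and the `search` helper are hypothetical names made up for illustration, and `str.split()` stands in for a real tokenizer.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical toy corpus: document id -> text
documents = {
    1: "She eats apples",
    2: "He was eating an apple",
    3: "They shop daily",
}

# Inverted index keyed by stem instead of by surface form
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[stemmer.stem(token)].add(doc_id)


def search(query):
    """Return the ids of documents containing any form of the query word."""
    return sorted(index.get(stemmer.stem(query.lower()), set()))


print(search("eat"))     # matches both "eats" and "eating"
print(search("apples"))  # matches both "apples" and "apple"
```

Because `eats` and `eating` share one index entry, the index stores fewer keys and a query for any one form finds documents that use the others.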

There are several types of stemming algorithms; let's look at them one by one! 


<center><img src="https://www.tutorialspoint.com/natural_language_toolkit/images/stemming_algorithms.jpg" width="600"/></center>

A basic rule-based stemmer that only removes `-s/-es`, `-ing`, or `-ed` can give you a precision of more than `70`
percent, while the **`Porter stemmer`** uses more rules and can achieve very good accuracy.
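
A basic rule-based stemmer of the kind described above can be sketched in a few lines of plain Python. The rule list, the `naive_stem` name, and the minimum stem length are assumptions chosen for illustration; this is not the Porter algorithm, and it makes no claim about the precision figure quoted above.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement)
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "y"),
    ("ing", ""),
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]


def naive_stem(token, min_stem_len=3):
    """Strip the first matching suffix, unless the result would be too short."""
    word = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word


for word in ["eating", "eats", "glasses", "shopping", "sing", "eaten"]:
    print(word, "=>", naive_stem(word))
# eating => eat
# eats => eat
# glasses => glass
# shopping => shopp
# sing => sing
# eaten => eaten
```

The `shopp` and `eaten` results show why such a stemmer is crude: it has no rule for doubled consonants and cannot handle irregular forms.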


In [1]:
# Let's see the porter stemmer in code! 
from nltk.stem import PorterStemmer # import Porter stemmer

pst = PorterStemmer() # create obj of the PorterStemmer

pst.stem("shopping")

'shop'

In [2]:
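# stem() treats its argument as a single token, so for a whole sentence
# it only lowercases the text and stems the ending of the last word.
pst.stem('I prefer not to argue')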

'i prefer not to argu'
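
Since `stem()` works on one token at a time, a sentence is normally split into tokens first. The sketch below assumes NLTK is installed and uses `str.split()` as a stand-in for a proper tokenizer; the variable names are illustrative.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()
sentence = "I prefer not to argue"

# Stem each whitespace-separated token on its own
stems = [porter.stem(token) for token in sentence.split()]
print(stems)
```

Each token is now lowercased and stemmed independently, so a suffix in the middle of the sentence is handled as well as one at the end.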

* There are many other
stemming algorithms around, and their `precision` and `performance` differ. You may want to have a
look at the [**NLTK stem documentation**](http://www.nltk.org/api/nltk.stem.html) for more details. 

* I have used the `Porter Stemmer` most often,
and if you are working with English, it's good enough. There is also a family of **`Snowball stemmers`** that can
be used for Dutch, English, French, German, Italian, Portuguese, Romanian, Russian, and so on. I also
came across a lightweight stemmer for Hindi [**here**](http://research.variancia.com/hindi_stemmer).

In [3]:
# !pip install snowballstemmer

In [4]:
# Let's see the Snowball stemmer in code!
import snowballstemmer
stemmer = snowballstemmer.stemmer('english')  # initialize the stemmer

print(stemmer.stemWords("I prefer not to argue".split()))

['I', 'prefer', 'not', 'to', 'argu']
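
NLTK ships its own implementation of the Snowball family as well, which can be handy if NLTK is already a dependency. The following is a small sketch assuming NLTK is installed; it lists the supported languages and stems the same sentence token by token.

```python
from nltk.stem import SnowballStemmer

# Tuple of language names supported by NLTK's Snowball implementation
print(SnowballStemmer.languages)

english = SnowballStemmer("english")
print([english.stem(token) for token in "I prefer not to argue".split()])
print(english.stem("running"))
```

Passing a different language name from `SnowballStemmer.languages` to the constructor selects the stemmer for that language.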


#### **`RegexpStemmer`** 

You can also construct your own `stemmer` using the `RegexpStemmer`. It takes a single regular
expression (either compiled or as a string) and removes every part of the word that matches it, so anchor the pattern with `^` or `$` when you only want to strip a prefix or a suffix:

In [5]:
from nltk.stem import RegexpStemmer
stemmer = RegexpStemmer('ing$')  # '$' restricts the match to the end of the word
stemmer.stem('cooking')

'cook'
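
To see how much the anchoring of the pattern matters, the sketch below compares an unanchored pattern with an anchored one. It assumes NLTK is installed; the word list, the combined pattern, and the `min=4` setting (words shorter than four characters are left alone) are choices made for illustration.

```python
from nltk.stem import RegexpStemmer

unanchored = RegexpStemmer("ing")                # removes "ing" anywhere in the word
anchored = RegexpStemmer("ing$|s$|ed$", min=4)   # removes suffixes only, skips short words

for word in ["cooking", "singing", "ingest", "cats", "bed"]:
    print(word, "=>", unanchored.stem(word), "|", anchored.stem(word))
# cooking => cook | cook
# singing => s | sing
# ingest => est | ingest
# cats => cats | cat
# bed => bed | bed
```

The unanchored pattern mangles `singing` and `ingest`, while the anchored one only touches word endings.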

#### **`LancasterStemmer class`**
The `LancasterStemmer` class works just like the `PorterStemmer`
class but can produce slightly different results. It is known to be slightly more aggressive than the
`PorterStemmer`:


In [6]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
stemmer.stem('eating')

'eat'

**Note:** Most users can get by with the Porter and Snowball stemmers for a large number of use cases. In modern
NLP applications, people sometimes skip stemming as a pre-processing step altogether, so it
depends on your domain and application. Also, if you want to use
NLP taggers such as a part-of-speech (POS) tagger, NER, or a dependency parser, you should avoid
stemming, because stemming modifies the tokens and this can change the tagger's output.

### What is lemmatization? 

* Lemmatization is a more methodical way of converting all the grammatical/inflected forms of a word to
its root. 

* Lemmatization uses `context` and `part of speech` to determine the inflected form of the word
and applies different `normalization` rules for each part of speech to get the `root word (lemma)`. 

* For instance, a lemmatization process reduces the inflections `"am"`, `"are"`, and `"is"` to the base form `"be"`. 

* Lemmatization is helpful for normalizing text for text classification tasks or search engines, and a variety of other NLP tasks such as sentiment classification. It is particularly important when dealing with morphologically rich languages like Arabic and Spanish.
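
The role that part of speech plays can be illustrated with a toy lookup lemmatizer. Everything here is an assumption made for illustration: the `LEMMA_TABLE` entries, the tag names, and the `toy_lemmatize` helper are hypothetical, and real lemmatizers rely on a full vocabulary and morphological rules rather than a handful of entries.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("eaten", "VERB"): "eat",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(toy_lemmatize("is", "VERB"))        # be
print(toy_lemmatize("eaten", "VERB"))     # eat
print(toy_lemmatize("meeting", "VERB"))   # meet
print(toy_lemmatize("meeting", "NOUN"))   # meeting
```

The two `meeting` lookups show why the part of speech is needed: the same surface form maps to different lemmas depending on how it is used, which a suffix-stripping stemmer cannot distinguish.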

<center><img src="https://drive.google.com/uc?export=view&id=1_-wxBOU_JebjdG1sxoobKYRCtX3dVF0L" width="800"/></center>

In [7]:
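## import the libraries
import spacy
# (the old `from spacy.lemmatizer import Lemmatizer` import was unused and fails on spaCy 3.x)

nlp = spacy.load('en_core_web_sm')
## lemmatization
## The output below was recorded with spaCy 2.x; spaCy 3.x no longer uses -PRON- as the lemma for pronouns.
doc = nlp('I love coding and writing')
for word in doc:
    print(word.text, "=>", word.lemma_)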

I => -PRON-
love => love
coding => code
and => and
writing => write


Let's look at some differences between `lemmatization` and `stemming`:


<center><img src="https://stringfixer.com/files/107654628.jpg" width="800"/></center>

`Stemming`: Reduces inflectional forms. It is a heuristic process that chops off prefixes and suffixes in the hope of achieving this goal correctly most of the time.

(vs)

`Lemmatization`: Reduces inflectional forms. It brings a word to its base form with the help of a vocabulary and morphological analysis of words. 


<center><img src="https://www.altexsoft.com/media/2021/03/word-image.jpeg" width="800"/></center>

### What are Stop words? 


- Words that are filtered out before processing are called **stop words**. These are the most common words in any language (like articles, prepositions, pronouns, conjunctions, etc.) and add little information to the text. 

Examples are: 

<center><img src="https://nlp.stanford.edu/IR-book/html/htmledition/img95.png" width="600"/></center>

> **Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.**

- Stop words are abundant in any human language. By removing these words, we remove the low-level information from our text in order to give more focus to the important information. In other words, removing such words usually has no negative consequences for the model we train for our task.

- Removing stop words reduces the dataset size and thus the training time, because fewer tokens are involved in training.

**Do we always remove stop words? Are they always useless for us? 🙋‍♀️**

The answer is no! 🙅‍♂️

We do not always remove the stop words. The removal of stop words is highly dependent on the task we are performing and the goal we want to achieve. 

**Word of caution:** Before removing stop words, research a bit about your task and the problem you are trying to solve, and then make your decision.
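
The caution above is easy to demonstrate with a sentiment-style sentence. This is a minimal sketch in plain Python: the `STOP_WORDS` set is a tiny hand-picked list for illustration (real lists are much longer, and several common ones include `not`), and `remove_stop_words` is a hypothetical helper that tokenizes with `str.split()`.

```python
# Tiny illustrative stop word list (an assumption, not a standard list)
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "it", "this", "not"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop the stop words."""
    return [token for token in text.lower().split() if token not in stop_words]


review = "The plot of this movie is not good"

print(remove_stop_words(review))
# ['plot', 'movie', 'good']

# Keeping the negation preserves the meaning of the review
print(remove_stop_words(review, STOP_WORDS - {"not"}))
# ['plot', 'movie', 'not', 'good']
```

With `not` removed, a negative review looks positive, which is why the stop word list should be adapted to the task.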

### References: 
* [**Stemming in Natural Language Processing**](https://www.c-sharpcorner.com/blogs/stemming-in-natural-language-processing) 
* [**NLP Basics (Colab)**](https://colab.research.google.com/drive/18ZnEnXKLQkkJoBXMZR2rspkWSm9EiDuZ#scrollTo=0lVd74BE5BXK) 
* [**NLTK official documentation**](https://www.nltk.org/_modules/nltk/stem/snowball.html) 
* [**Snowball Stemmer**](https://pypi.org/project/snowballstemmer/)
* [**Natural Language Processing with Python and NLTK (book)**](https://www.pdfdrive.com/natural-language-processing-python-and-nltk-d158232635.html)
* [**spaCy documentation**](https://spacy.io/api/data-formats#lemmatization)
* [**Stop words (TDS)**](https://towardsdatascience.com/text-pre-processing-stop-words-removal-using-different-libraries-f20bac19929a)
* [**Stanford IR**](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)