# <span style = "color:green">Stemming & Lemmatization</span>

***

A single word can carry several meanings depending on how it is used in text. Conversely, different forms of a word often convey closely related meanings: "toy" and "toys", for example, refer to the same concept.

A search for "toy" and a search for "toys" most likely share the same intent. This variation between the forms of a word is called "inflection", and it makes queries harder to interpret.

Now consider "came" and "camel": the strings look alike, but they do not share a root and their search intents differ. In contrast, if you search for the word "Love" on Google, the results also include inflected forms such as "Loves", "Loved", and "Loving".

Stemming and Lemmatization are two strategies used to simplify such search queries.
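
As a rough illustration of why normalization helps search, the sketch below builds a tiny inverted index keyed by normalized tokens. The `normalize` function is a hypothetical stand-in that only lowercases a token and strips a trailing plural "s"; it is not a real stemmer or lemmatizer, and the documents are made up for the example.

```python
from collections import defaultdict


def normalize(token: str) -> str:
    """Toy normalizer (assumption): lowercase and strip a trailing plural 's'."""
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def build_index(docs):
    """Map each normalized token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query):
    """Look up a single-word query after normalizing it the same way."""
    return sorted(index.get(normalize(query), set()))


# Made-up documents for illustration only
docs = {
    1: "wooden toy for kids",
    2: "best toys of the year",
    3: "camel rides in the desert",
}
index = build_index(docs)

print(search(index, "toy"))
print(search(index, "Toys"))
print(search(index, "came"))
```

With this toy normalizer, "toy" and "Toys" retrieve the same documents, while "came" does not match the document that contains "camel".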

Stemming and Lemmatization were developed in the 1960s. Both are text normalization techniques used in text mining and Natural Language Processing (NLP). They are widely used in tagging systems, SEO, web search, and information retrieval.

When implementing NLP, you will frequently encounter words that share a root but appear in different forms. The two methods can also disagree on the result: for example, a crude suffix-stripping stemmer can reduce the word "caring" to "car", while lemmatization reduces it to "care".

| Stemming | Lemmatization |
| --- | --- |
| history, historical => histori | history, historical => history |
| finally, final, finalized => fina | finally, final, finalized => final |

## <span style = "color:coral">What is Stemming?</span>

A word has a base (root) form and several variations. For example, "play" is the root, and "playing", "played", and "plays" are inflected forms of it. When these forms are stripped down to a root, the result may have an incorrect meaning or contain other errors.

Stemming is the process of reducing inflected words to their root form so that a group of related words maps to the same stem, even if that stem is not itself a meaningful word.

Moreover:
* Stemming is a rule-based approach: it slices prefixes or suffixes off inflected words using a set of commonly used affixes such as "-ing", "-ed", "-es", and "pre-". The result is often not an actual word.
* Two main errors can occur during stemming: over-stemming and under-stemming. Over-stemming occurs when two words with different roots are reduced to the same stem. Under-stemming occurs when two words that share a root are reduced to different stems.
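
The rule-based idea and both error types can be sketched with a deliberately naive suffix stripper. The suffix list and the `toy_stem` function below are assumptions made for illustration; they are far simpler than the Porter rules used by NLTK and will not reproduce its output.

```python
# Suffixes are tried in this order; the list is illustrative, not the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["playing", "played", "plays"]:
    print(w, ":", toy_stem(w))

# Over-stemming: unrelated words collapse to the same stem
print("caring :", toy_stem("caring"), "| cars :", toy_stem("cars"))

# Under-stemming: related words end up with different stems
print("study :", toy_stem("study"), "| studies :", toy_stem("studies"))
```

Here "caring" and "cars" both become "car" (over-stemming), while "study" and "studies" end up as "study" and "studi" (under-stemming).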

#### Stemming words with NLTK

In [1]:
from nltk.stem import PorterStemmer


ps = PorterStemmer()

# Choose some words to be stemmed
words = ["programs", "program", "programmer", "programming", "programmers"]

# Print each word next to its stem
for word in words:
    print(word, ":", ps.stem(word))


programs : program
program : program
programmer : programm
programming : program
programmers : programm


In [2]:
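# From a sentence
# Note: word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
# (the exact resource name can depend on the NLTK version)
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

ps = PorterStemmer()

sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)

# Stem every token in the sentence
for word in words:
    print(word, ":", ps.stem(word))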

Programmers : programm
program : program
with : with
programming : program
languages : languag


### Applications of Stemming
1. Stemming is used in information retrieval systems like search engines.
2. It is used to determine domain vocabularies in domain analysis.

## <span style = "color:coral">What is Lemmatization?</span>

In simple terms, Lemmatization is a method that converts any form of a word to its base (root) form.

In other words, Lemmatization groups the different inflected forms of a word under a single root form that has the same meaning. It is similar to stemming, except that the resulting word is a valid dictionary entry. Extracting the correct lemma of each word requires morphological analysis.

For example, Lemmatization correctly reduces 'troubled' to 'trouble', which is a meaningful word, whereas Stemming cuts off the 'ed' part and produces 'troubl', which is misspelled and is not a word.
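
The dictionary-based idea can be sketched with a small lookup table. The table and the `toy_lemmatize` function below are hypothetical and hand-written for this example; a real lemmatizer relies on a full lexical database and morphological rules.

```python
# Tiny hand-written lookup table standing in for a real lexical database (assumption).
# Keys are (word, part_of_speech): "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("troubled", "v"): "trouble",
    ("caring", "v"): "care",
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("better", "a"): "good",
    ("corpora", "n"): "corpus",
}


def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Return the dictionary form if it is known; otherwise return the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print("troubled :", toy_lemmatize("troubled", pos="v"))
print("better (adjective) :", toy_lemmatize("better", pos="a"))
print("better (default noun) :", toy_lemmatize("better"))
```

Because the lookup key includes the part of speech, the same word can resolve differently depending on the `pos` argument: in this sketch "better" maps to "good" only when it is treated as an adjective.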

#### Lemmatization with NLTK

In [3]:
# Note: WordNetLemmatizer needs the WordNet data, e.g. nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize('rocks'))
print("corpora :", lemmatizer.lemmatize('corpora'))

# a denotes adjective in 'pos'
print('better :', lemmatizer.lemmatize('better', pos = 'a'))

rocks : rock
corpora : corpus
better : good


## What is the difference between Stemming and Lemmatization?

| S.No | Stemming | Lemmatization |
| --- | --- | --- |
| 1 | Stemming is faster because it chops words without considering the context of the word in the sentence. | Lemmatization is slower than stemming, but it takes the context of the word (such as its part of speech) into account. |
| 2 | It is a rule-based approach. | It is a dictionary-based approach. |
| 3 | It is less accurate. | It is more accurate than Stemming. |
| 4 | The root form produced by stemming may not be an actual word. | Lemmatization returns a valid dictionary word as the root form. |
| 5 | Stemming is preferred when the meaning of the word is not important for the analysis. Example: spam detection | Lemmatization is recommended when the meaning of the word is important for the analysis. Example: question answering |
| 6 | "Studies" => "Studi" | "Studies" => "Study" |

***