# 1) Stemming

Stemming works fairly well in most cases, but English has many exceptions that require a more sophisticated process.

**spaCy doesn't include stemming**; it uses lemmatization instead.

Stemming **removes the suffixes** from a word, reducing it to its root form.

For example, the word “Flying” has the suffix “ing”; removing “ing” from “Flying” leaves the base (root) word “Fly”.
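The suffix-removal idea can be sketched with a toy function. This is only an illustration, not the Porter algorithm: the `SUFFIXES` list and the `naive_stem` helper are hypothetical names made up for this example, and the rule "keep at least three characters" is an arbitrary assumption.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ly", "ed", "s")  # hypothetical, hand-picked suffix list


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["Flying", "runs", "fairly", "running", "ran"]:
    print(w, "-->", naive_stem(w))
```

With these rules "Flying" becomes "fly", but "running" becomes "runn" and "ran" is left untouched. That is the kind of exception that real stemmers handle with additional rules, and that lemmatization handles with a vocabulary.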

We will use the Natural Language Toolkit (nltk) to explore stemming.

## Porter Stemmer

In [0]:
import nltk
from nltk.stem.porter import PorterStemmer

In [0]:
p_stemmer = PorterStemmer() # object of class PorterStemmer

In [0]:
words = ['run','runner','running','ran','runs','easily','fairly'] # list of words 

In [4]:
for word in words:
    print(word + '------>' + p_stemmer.stem(word))

run------>run
runner------>runner
running------>run
ran------>ran
runs------>run
easily------>easili
fairly------>fairli


## Snowball Stemmer

In [0]:
# Snowball Stemmer is also called the "English Stemmer" or "Porter2 Stemmer"
# It offers a slight improvement over the original Porter stemmer

In [0]:
from nltk.stem.snowball import SnowballStemmer
# Pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [0]:
words = ['run','runner','running','ran','runs','easily','fairly']

In [7]:
for word in words:
    print(word +' --> '+ s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair
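To see where the two stemmers disagree, it may help to run them side by side and keep only the words whose stems differ. The sketch below assumes `nltk` is installed; the variable names (`porter`, `snowball`, `disagreements`) are chosen here for illustration.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']

# Collect only the words on which the two stemmers produce different stems.
disagreements = {}
for word in words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    if porter_stem != snowball_stem:
        disagreements[word] = (porter_stem, snowball_stem)

for word, (porter_stem, snowball_stem) in disagreements.items():
    print(f"{word}: porter={porter_stem}, snowball={snowball_stem}")
```

In the outputs above, the only word of this list on which the two stemmers differ is "fairly" (`fairli` versus `fair`).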


# 2) Lemmatization

Lemmatization is **the process of converting words into their dictionary form**.

Lemmatization draws on the full vocabulary of a language to apply a morphological analysis to each word.

e.g.
Word: Feet, Lemma: Foot
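
The role of the vocabulary can be illustrated with a toy lookup. This is a minimal sketch, not how spaCy implements lemmatization: the `IRREGULAR_LEMMAS` table and the `toy_lemma` helper are hypothetical, and a real lemmatizer also uses part-of-speech information and a far larger vocabulary.

```python
# Toy lemmatizer: a tiny hand-written table stands in for the full
# vocabulary that a real lemmatizer would consult. (Hypothetical data.)
IRREGULAR_LEMMAS = {
    "feet": "foot",
    "are": "be",
    "ran": "run",
}


def toy_lemma(word):
    """Return the dictionary form if the word is known, else the word itself."""
    word = word.lower()
    return IRREGULAR_LEMMAS.get(word, word)


for w in ["Feet", "are", "ran", "bat"]:
    print(w, "-->", toy_lemma(w))
```

Unlike suffix stripping, a lookup of this kind can map "Feet" to "foot", because the answer comes from knowledge of the word rather than from its spelling.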

**Stemming**, by contrast, is the process of reducing words to their non-changing portion.


In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [0]:
doc1 = nlp("The striped bats are hanging on their feet for best")

In [10]:
for token in doc1:
    print(token.text, '\t', token.pos_, '\t',token.lemma_)

# token, POS, Lemma

The 	 DET 	 the
striped 	 VERB 	 stripe
bats 	 NOUN 	 bat
are 	 AUX 	 be
hanging 	 VERB 	 hang
on 	 ADP 	 on
their 	 DET 	 -PRON-
feet 	 NOUN 	 foot
for 	 ADP 	 for
best 	 ADJ 	 good


# 3) Stemming vs Lemmatization

In [0]:
doc2 = ['the','striped','bats','are','hanging','on','their','feet','for','best']

In [12]:
for word in doc2:
    print(word +' --> '+ s_stemmer.stem(word))

the --> the
striped --> stripe
bats --> bat
are --> are
hanging --> hang
on --> on
their --> their
feet --> feet
for --> for
best --> best


In [0]:
# lemma,      are ---> be
# stemming,   are ---> are

# lemma,      feet ---> foot
# stemming,   feet ---> feet

# lemma,      best ---> good
# stemming,   best ---> best
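
The comparison above can also be computed instead of read off by eye. The sketch below does not call spaCy or nltk; it assumes the `lemmas` and `stems` lists are copied by hand from the two outputs shown earlier, and simply keeps the tokens on which they differ.

```python
# Values copied from the spaCy (lemma) and Snowball (stem) outputs above.
tokens = ['the', 'striped', 'bats', 'are', 'hanging',
          'on', 'their', 'feet', 'for', 'best']
lemmas = ['the', 'stripe', 'bat', 'be', 'hang',
          'on', '-PRON-', 'foot', 'for', 'good']
stems = ['the', 'stripe', 'bat', 'are', 'hang',
         'on', 'their', 'feet', 'for', 'best']

# Keep only the tokens where the stem and the lemma disagree.
differences = [
    (token, stem, lemma)
    for token, stem, lemma in zip(tokens, stems, lemmas)
    if stem != lemma
]

for token, stem, lemma in differences:
    print(f"{token:<8} stem={stem:<8} lemma={lemma}")
```

Besides the three words listed in the comments, this also flags "their", whose lemma appears as `-PRON-` in the spaCy output above.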