## Stemming and Lemmatization
* Stemming strips suffixes from a word using fixed rules to produce a stem, which is not always a valid word (e.g. `has` -> `ha`).
* Lemmatization derives the dictionary base form of a word (the lemma) using knowledge of the specific language.
* NLTK -> supports both stemming and lemmatization
* spaCy -> supports only lemmatization
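
To make the distinction concrete, here is a minimal pure-Python sketch. It is a toy, not the Porter algorithm and not spaCy's lemmatizer: the suffix list `SUFFIXES`, the lookup table `LEMMA_TABLE`, and the functions `toy_stem` and `toy_lemma` are all hypothetical names invented for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT spaCy's lemmatizer.
SUFFIXES = ("ing", "ly", "ed", "s")  # hypothetical suffix list, longest first

# Hypothetical mini-dictionary standing in for real language knowledge.
LEMMA_TABLE = {
    "has": "have",
    "been": "be",
    "him": "he",
    "experiencing": "experience",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 2:
            return lower[: -len(suffix)]
    return lower


def toy_lemma(word: str) -> str:
    """Look the word up in the dictionary; fall back to the lowercased word."""
    lower = word.lower()
    return LEMMA_TABLE.get(lower, lower)


for w in ["has", "been", "experiencing", "recently"]:
    print(w, " | ", toy_stem(w), " | ", toy_lemma(w))
```

The rule-based function only sees characters, so it can cut a word down to something that is not a word. The lookup-based function returns a real base form, but only for words it knows about.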

In [1]:
import nltk
import spacy
words = """On discussion with him you establish that he has recently been experiencing back pain which prevents him from getting about as much as he used to"""

In [2]:
# Stemming (rule-based approach)
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in words.split():  # split on whitespace
    print(word, " | ", stemmer.stem(word))

On  |  on
discussion  |  discuss
with  |  with
him  |  him
you  |  you
establish  |  establish
that  |  that
he  |  he
has  |  ha
recently  |  recent
been  |  been
experiencing  |  experienc
back  |  back
pain  |  pain
which  |  which
prevents  |  prevent
him  |  him
from  |  from
getting  |  get
about  |  about
as  |  as
much  |  much
as  |  as
he  |  he
used  |  use
to  |  to


In [3]:
# Lemmatization
nlp = spacy.load('en_core_web_sm')  # the loaded model determines how lemmas are produced, so choose one that fits the task

doc = nlp(words)

for token in doc:
    print(token, " | ", token.lemma_)

On  |  on
discussion  |  discussion
with  |  with
him  |  he
you  |  you
establish  |  establish
that  |  that
he  |  he
has  |  have
recently  |  recently
been  |  be
experiencing  |  experience
back  |  back
pain  |  pain
which  |  which
prevents  |  prevent
him  |  he
from  |  from
getting  |  get
about  |  about
as  |  as
much  |  much
as  |  as
he  |  he
used  |  use
to  |  to


In [4]:
# Side-by-side comparison (prints only tokens whose stem and lemma differ)

for token in doc:
    stem = stemmer.stem(token.text)
    lemma = token.lemma_
    if stem != lemma:
        print(token.text + ", Stem: <", stem, "> ,Lemma: <", lemma, ">")

discussion, Stem: < discuss > ,Lemma: < discussion >
him, Stem: < him > ,Lemma: < he >
has, Stem: < ha > ,Lemma: < have >
recently, Stem: < recent > ,Lemma: < recently >
been, Stem: < been > ,Lemma: < be >
experiencing, Stem: < experienc > ,Lemma: < experience >
him, Stem: < him > ,Lemma: < he >
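
The differences printed above can be easier to scan as aligned columns. The following is a small sketch, assuming the rows are copied by hand from that output; `format_table` is a hypothetical helper, not part of NLTK or spaCy.

```python
# Rows copied from the comparison output above: (token, stem, lemma).
rows = [
    ("discussion", "discuss", "discussion"),
    ("him", "him", "he"),
    ("has", "ha", "have"),
    ("recently", "recent", "recently"),
    ("been", "been", "be"),
    ("experiencing", "experienc", "experience"),
]


def format_table(rows, headers=("Token", "Stem", "Lemma")):
    """Return the rows as left-aligned, fixed-width text columns."""
    table = [headers, *rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(r[i]) for r in table) for i in range(len(headers))]
    lines = [
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    ]
    return "\n".join(lines)


print(format_table(rows))
```

In the notebook, the same helper could be fed `(token.text, stem, lemma)` tuples collected inside the loop instead of hand-copied rows.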


### Custom Lemmatization Rules

In [5]:
# Custom rules: map the slang tokens "bro" and "bruh" to the lemma "brother"
ar = nlp.get_pipe('attribute_ruler')
ar.add([[{"TEXT": "bro"}], [{"TEXT": "bruh"}]], {"LEMMA": "brother"})

doc2 = nlp("hello bro, whats up. What are you doing bruh!!")
for token in doc2:
    print(token.text, " | ", token.lemma_)

hello  |  hello
bro  |  brother
,  |  ,
what  |  what
s  |  s
up  |  up
.  |  .
What  |  what
are  |  be
you  |  you
doing  |  do
bruh  |  brother
!  |  !
!  |  !
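
The patterns above match on `TEXT`, which is an exact, case-sensitive comparison, so `Bro` or `BRUH` would not be rewritten. The sketch below shows the idea of a case-insensitive override in plain Python. It is a conceptual model of the effect of such a rule, not spaCy's `AttributeRuler` internals; `CUSTOM_LEMMAS` and `lemma_with_overrides` are hypothetical names, and the default lemmas in `tokens` are placeholders rather than real model output.

```python
# Conceptual sketch only: mimics the effect of an override rule,
# not how spaCy's AttributeRuler is implemented.
CUSTOM_LEMMAS = {"bro": "brother", "bruh": "brother"}  # same mappings as the rule above


def lemma_with_overrides(token_text, default_lemma):
    """Return the custom lemma if one is defined (ignoring case), else the default."""
    return CUSTOM_LEMMAS.get(token_text.lower(), default_lemma)


# (token text, lemma that would otherwise be assigned) pairs.
# The defaults are hypothetical placeholders, not real spaCy output.
tokens = [("hello", "hello"), ("Bro", "bro"), ("doing", "do"), ("BRUH", "bruh")]
for text, default in tokens:
    print(text, " | ", lemma_with_overrides(text, default))
```

In spaCy itself, the analogous change would likely be to match on the `LOWER` token attribute instead of `TEXT` in the pattern dictionaries.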
