## Stemming and Lemmatization

* Words appear in different forms. When a text is tokenized, each form is treated as a different token even though the forms carry the same meaning.
* To reduce this problem, we use two techniques: **Stemming** and **Lemmatization**.
* By reducing the different forms of a word to their original or root form, we reduce the vocabulary size, which helps visualization, classification and analysis.

### Stemming : 
Stemming is the process of removing the suffixes of a word, based on the assumption that **different word forms**`(lightning, lightly, lighting)` consist of a **stem**`(light)` and an **ending**`(+ning, +ly, +ing)`.
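
To make the stem-plus-ending idea concrete, here is a minimal sketch of a suffix stripper in plain Python. It is only an illustration: the `toy_stem` function, its suffix list and the `min_stem_len` parameter are hypothetical names chosen for this example, and the rules are far simpler than those of a real stemmer such as Porter's.

```python
# Toy suffix stripper: an illustration only, not the Porter algorithm.
# Suffixes are listed longest first so that "ning" is tried before "ing".
SUFFIXES = ("ning", "ing", "ly")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["lightning", "lightly", "lighting", "light"]
toy_stems = [toy_stem(w) for w in words]

print(toy_stems)
# The four surface forms collapse into a single vocabulary entry.
print(len(set(words)), "forms ->", len(set(toy_stems)), "stem")
```

All four forms map to the same stem, which is the vocabulary reduction described above. Blind suffix stripping of this kind also explains why stems are not always real words.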

In [1]:
# For extracting the text from the wikipedia page
import requests
def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page 
    given a wikipedia page title
    '''
    params = { 
        'action': 'query', 
        'format': 'json', # request json formatted content
        'titles': title, # title of the wikipedia page
        'prop': 'extracts', 
        'explaintext': True
    }
    # send a request to the wikipedia api 
    response = requests.get(
         'https://en.wikipedia.org/w/api.php',
         params= params
     ).json()

    # Parse the result
    page = next(iter(response['query']['pages'].values()))
    # return the page content 
    if 'extract' in page.keys():
        return page['extract']
    else:
        return "Page not found"

Applying Stemming :
1. Extract the text from source
2. Tokenize the text
3. Extract the stem of a word

In [2]:
# Import Tokenizer, Stemmer and Stopwords
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer # Commonly used stemming algorithm
from nltk.corpus import stopwords # Needs a one-time nltk.download('stopwords')

In [3]:
# 1. Get the text from the wikipedia page
text = wikipedia_page('Earth').lower()

# 2. Apply tokenization
tokens = WordPunctTokenizer().tokenize(text)

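# Filter out the stopwords in the tokens list
# (build the set once instead of reloading the stopword list for every token)
stop_words = set(stopwords.words('english'))
tokens = [tk for tk in tokens if tk not in stop_words]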

In [4]:
# 3. Instantiate the stemmer
ps = PorterStemmer()

# Stem the tokens
stems = [ps.stem(tk) for tk in tokens]

In [6]:
# Let's look at a random selection of stemmed tokens
import numpy as np

for i in range(5) :
    print()
    print(np.random.choice(stems, size=10))


['complex' 'state' 'two' 'name' 'distribut' 'latitud' 'within' 'distribut'
 'system' 'fix']

['crust' 'french' ',' '.' '(' ',' 'arid' 'köppen' 'coal' 'kelvin']

["'" 'significantli' '2020' 'grain' 'less' 'atmospher' 'form' 'earliest'
 'hemispher' 'excess']

['.' 'descend' 'planetari' 'circl' 'biotic' 'solstic' 'sinc' 'univers'
 'continent' 'earth']

['water' ',' '.' 'planet' '.' 'rel' '.' 'earth' '6' '(']


* As you can see in the above results, stems such as `distribut` and `atmospher` are cut brutally and are no longer real words.
* So we need a smarter way of reducing the word forms.

### Lemmatization :
* Lemmatization reduces a word to its **lemma**, the form of the word that you **find in the dictionary**.
* The *lemma* is also called the **canonical form** of a word.
* It is more readable, more interpretable and less brutal than stemming.
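
The dictionary idea can be sketched with a plain lookup table. This is a deliberately tiny illustration: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names invented here, and a real lemmatizer relies on a much larger vocabulary and usually on part-of-speech information as well.

```python
# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {
    "roads": "road",
    "going": "go",
    "went": "go",   # irregular form: no suffix rule would find "go"
    "is": "be",
    "are": "be",
}

def toy_lemmatize(token):
    """Return the dictionary form if known, otherwise the lowercased token."""
    key = token.lower()
    return TOY_LEMMAS.get(key, key)

for word in ["Roads", "going", "went", "are", "planet"]:
    print(f"{word:>8} -> {toy_lemmatize(word)}")
```

Unlike suffix stripping, every result here is a real word, and irregular forms such as `went` can still be mapped to their lemma.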

In [7]:
# install spacy
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

* Using `spacy` involves 3 steps :
    1. Importing the library `import spacy`
    2. Loading the model `nlp = spacy.load("en_core_web_sm")`
    3. Applying the model to the text `doc = nlp("This is the sentence")`

In [8]:
# Import the library
import spacy

# Load the 'small english model'
nlp = spacy.load("en_core_web_sm")

##### Tokenization
Tokenization works right out of the box: applying the model to a text returns a sequence of tokens.

In [9]:
# Parse the sentence
doc = nlp("Roads? Where we're going we don't need roads!")

# Print the tokens
for token in doc:
    print(token)

Roads
?
Where
we
're
going
we
do
n't
need
roads
!


##### Lemmatization
* Lemmatization also works right out of the box.
* The lemma of a token is directly available via `token.lemma_`.

In [14]:
print(f"{'Token':10}\t Lemma")
print(f"{'------':10}\t ------")

# Print the lemma for each token
for token in doc:
    print(f"{token.text:>10}\t{token.lemma_}")

Token     	 Lemma
------    	 ------
     Roads	road
         ?	?
     Where	where
        we	we
       're	be
     going	go
        we	we
        do	do
       n't	not
      need	need
     roads	road
         !	!


Each token in the `doc` object holds information on the nature and style of the token:

* `is_space` : is a space token.  
* `is_punct` : is a punctuation sign token.
* `is_upper` : is an all uppercase token.
* `is_digit` : is a number token.
* `is_stop`  : is a stopword token.
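
To build intuition for what these flags test, the sketch below approximates four of them with plain Python string checks. It is an assumption-laden illustration, not spaCy's implementation: `rough_flags` is a hypothetical helper, its results may differ from spaCy's on edge cases, and `is_stop` is left out because it depends on a stopword list.

```python
import unicodedata

def rough_flags(text):
    """Approximate a few token flags with plain string checks (not spaCy's code)."""
    return {
        "space": text.isspace(),
        # Treat a token as punctuation if every character is in a Unicode "P" category.
        "punct": bool(text) and all(
            unicodedata.category(ch).startswith("P") for ch in text
        ),
        "upper": text.isupper(),
        "digit": text.isdigit(),
    }

for piece in ["All", "!", "\t", "NXH123", "22", "3:16"]:
    print(repr(piece), rough_flags(piece))
```

Note that `3:16` is not flagged as a digit because of the colon, and `NXH123` counts as uppercase because all of its letters are uppercase.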

In [24]:
# Parse the text
doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")

# Print the header
print(f"{'Token':10}\t\tspace?\tpunct?\tupper?\t digit?")
print(f"----------------------------------------")

# Extract the information on each token
for token in doc :
    print(f"{str(token):10} \t\t {token.is_space} \t {token.is_punct} \t {token.is_upper} \t {token.is_digit}")

Token     		space?	punct?	upper?	 digit?
----------------------------------------
All        		 False 	 False 	 False 	 False
aboard     		 False 	 False 	 False 	 False
!          		 False 	 True 	 False 	 False
	          		 True 	 False 	 False 	 False
Train      		 False 	 False 	 False 	 False
NXH123     		 False 	 False 	 True 	 False
departs    		 False 	 False 	 False 	 False
from       		 False 	 False 	 False 	 False
platform   		 False 	 False 	 False 	 False
22         		 False 	 False 	 False 	 True
at         		 False 	 False 	 False 	 False
3:16       		 False 	 False 	 False 	 False
sharp      		 False 	 False 	 False 	 False
.          		 False 	 True 	 False 	 False
