# Stemming and Lemmatization


[Lemmatization](https://en.wikipedia.org/wiki/Lemmatisation) is the algorithmic process of determining the lemma of a word, that is, its dictionary form, based on its intended meaning.

[Stemming](https://en.wikipedia.org/wiki/Stemming) is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
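
To make the idea of stemming concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm; the suffix list and the `naive_stem` name are made up for illustration. It only shows that a stemmer applies string rules without consulting a dictionary, so the result need not be a real word.

```python
# Hypothetical rule list for illustration only; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "s", "e")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["covering", "covered", "lakes", "surface", "ice"]:
    print(f"{word:10} -> {naive_stem(word)}")
```

Even these four rules map "covering" and "covered" to the same stem, and they also turn "surface" into the non-word "surfac".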

Let's see how stemming works on the Wikipedia page for Earth.


In [None]:
import requests
def wikipedia_page(title):
    '''
    This function returns the raw text of a wikipedia page
    given a wikipedia page title
    '''
    params = {
        'action': 'query',
        'format': 'json', # request json formatted content
        'titles': title, # title of the wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # Send a request to the Wikipedia API
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params,
        timeout=10,
    )
    response.raise_for_status()  # fail early on HTTP errors

    # The 'pages' dict is keyed by page id; take the first (and only) entry
    page = next(iter(response.json()['query']['pages'].values()))
    # Return the page content, or a placeholder if the page has no extract
    return page.get('extract', "Page not found")
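
The parsing step can be tried offline. The sketch below uses two made-up responses that only mimic the shape the function indexes into (`query` → `pages` → one entry per page); the page id, titles and extract text are placeholders, and `first_extract` is a hypothetical helper name.

```python
# Made-up sample shaped like a response that contains an extract.
found = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,  # placeholder id
                "title": "Earth",
                "extract": "Earth is the third planet from the Sun.",
            }
        }
    }
}

# Made-up sample for a title that does not exist: there is no 'extract' key.
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

def first_extract(response):
    """Return the extract of the first page in the response, or None."""
    page = next(iter(response["query"]["pages"].values()))
    return page.get("extract")

print(first_extract(found))
print(first_extract(missing))
```

Because `pages` is keyed by a page id we do not know in advance, taking the first value of the dict avoids having to look that id up.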


In [None]:
# Install NLTK and download relevant resources if you haven't done so already
!pip install nltk
import nltk
nltk.download('popular')




True

In [None]:
from nltk.tokenize import WordPunctTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# Get the text, for instance from Wikipedia.
text    = wikipedia_page('Earth').lower()

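# Tokenize and remove stopwords.
# Build the stopword set once: calling stopwords.words('english') inside
# the comprehension would reload the list for every token.
tokens = WordPunctTokenizer().tokenize(text)
stop_words = set(stopwords.words('english'))
tokens = [tk for tk in tokens if tk not in stop_words]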
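
To see why numbers and possessives get split the way they do, here is a standard-library sketch of the same tokenization rule. It assumes the regular expression that NLTK documents for `WordPunctTokenizer`; the stopword set is a tiny hand-picked stand-in for `stopwords.words('english')`, and the `sample_` names are ours.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

# Tiny hand-picked stand-in for the NLTK English stopword list.
SAMPLE_STOPWORDS = {"the", "of", "is", "s"}

sample_text = "the surface of earth's ocean is 70.8% water"
sample_tokens = WORD_PUNCT.findall(sample_text)
sample_kept = [tk for tk in sample_tokens if tk not in SAMPLE_STOPWORDS]

print(sample_tokens)
print(sample_kept)
```

The pattern splits "70.8%" into four tokens and "earth's" into three. Once the lone "s" is dropped as a stopword, only the apostrophe remains after "earth".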


In [None]:
tokens

['earth',
 'third',
 'planet',
 'sun',
 'astronomical',
 'object',
 'known',
 'harbor',
 'life',
 '.',
 'enabled',
 'earth',
 'water',
 'world',
 ',',
 'one',
 'solar',
 'system',
 'sustaining',
 'liquid',
 'surface',
 'water',
 '.',
 'almost',
 'earth',
 "'",
 'water',
 'contained',
 'global',
 'ocean',
 ',',
 'covering',
 '70',
 '.',
 '8',
 '%',
 'earth',
 "'",
 'surface',
 '.',
 'remaining',
 '29',
 '.',
 '2',
 '%',
 'earth',
 "'",
 'surface',
 'land',
 ',',
 'located',
 'form',
 'continental',
 'landmasses',
 'within',
 'one',
 'hemisphere',
 ',',
 'earth',
 "'",
 'land',
 'hemisphere',
 '.',
 'earth',
 "'",
 'land',
 'somewhat',
 'humid',
 'covered',
 'vegetation',
 ',',
 'large',
 'sheets',
 'ice',
 'earth',
 "'",
 'polar',
 'deserts',
 'retain',
 'water',
 'earth',
 "'",
 'groundwater',
 ',',
 'lakes',
 ',',
 'rivers',
 'atmospheric',
 'water',
 'together',
 '.',
 'earth',
 "'",
 'land',
 'part',
 'earth',
 "'",
 'crust',
 ',',
 'consisting',
 'several',
 'slowly',
 'moving',
 ...

In [None]:
# Instantiate a stemmer
ps      = PorterStemmer()

# and stem
stems   = [ps.stem(tk) for tk in tokens ]

In [None]:
stems

['earth',
 'third',
 'planet',
 'sun',
 'astronom',
 'object',
 'known',
 'harbor',
 'life',
 '.',
 'enabl',
 'earth',
 'water',
 'world',
 ',',
 'one',
 'solar',
 'system',
 'sustain',
 'liquid',
 'surfac',
 'water',
 '.',
 'almost',
 'earth',
 "'",
 'water',
 'contain',
 'global',
 'ocean',
 ',',
 'cover',
 '70',
 '.',
 '8',
 '%',
 'earth',
 "'",
 'surfac',
 '.',
 'remain',
 '29',
 '.',
 '2',
 '%',
 'earth',
 "'",
 'surfac',
 'land',
 ',',
 'locat',
 'form',
 'continent',
 'landmass',
 'within',
 'one',
 'hemispher',
 ',',
 'earth',
 "'",
 'land',
 'hemispher',
 '.',
 'earth',
 "'",
 'land',
 'somewhat',
 'humid',
 'cover',
 'veget',
 ',',
 'larg',
 'sheet',
 'ice',
 'earth',
 "'",
 'polar',
 'desert',
 'retain',
 'water',
 'earth',
 "'",
 'groundwat',
 ',',
 'lake',
 ',',
 'river',
 'atmospher',
 'water',
 'togeth',
 '.',
 'earth',
 "'",
 'land',
 'part',
 'earth',
 "'",
 'crust',
 ',',
 'consist',
 'sever',
 'slowli',
 'move',
 'tecton',
 'plate',
 ',',
 'interact',
 'produc',
 ...

In [None]:
# look at a random selection of stemmed tokens
import numpy as np
for i in range(5):
    print()
    print(np.random.choice(stems, size = 10))


[').' '-' 'stabil' ',' ';' 'asthenospher' 'appar' ',' ',' 'surfac']

['yr' 'reach' 'aphelion' 'captur' 'topograph' ',' 'plate' 'replac' ')' ',']

['tecton' 'plate' ',' 'biospher' 'water' 'life' 'field' 'take' 'explain'
 'million']

['23' 'basalt' '23' 'core' 'reconcil' '.' 'climat' 'view' 'asteroid' "'"]

['kilomet' 'consist' 'water' '0' 'portion' '.' 'lagrang' ',' 'mechan' '8']


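Your results will differ because the selection is random, but we can see that some words are brutally truncated into forms that are not real words, such as 'surfac', 'climat', 'kilomet' and 'stabil'.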
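
One way to check what the stemmer merged is to group the original tokens by their stem. The sketch below uses a handful of (token, stem) pairs copied from the outputs above so that it runs on its own; the variable names are ours.

```python
from collections import defaultdict

# (token, stem) pairs copied from the token and stem listings above.
sample_pairs = [
    ("covering", "cover"),
    ("covered", "cover"),
    ("surface", "surfac"),
    ("hemisphere", "hemispher"),
    ("continental", "continent"),
    ("several", "sever"),
]

# Map each stem to the set of surface forms that produced it.
forms_by_stem = defaultdict(set)
for tok, stm in sample_pairs:
    forms_by_stem[stm].add(tok)

for stm, forms in sorted(forms_by_stem.items()):
    print(f"{stm:10} <- {sorted(forms)}")
```

In the notebook, the same loop can be run over `zip(tokens, stems)` to inspect the whole page. Stems with several source forms show where the vocabulary shrank.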

# Lemmatize with spaCy
Since stemming can be brutal, we need a smarter way to reduce the number of forms a word can take.
Lemmatization reduces a word to its lemma, which is the form you would find in a dictionary.

Let's see how we can tokenize and lemmatize with the [spaCy](https://spacy.io/) library.


See https://spacy.io/usage for instructions on installing spaCy and downloading its models.


In [None]:
# install spacy
!pip install -U spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
#import the library
import spacy

# load the small English model
nlp = spacy.load("en_core_web_sm")

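# Tokenization
spaCy tokenizes text right out of the box: calling `nlp` on a string returns a document that you can iterate over token by token.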


In [None]:
# parse a text
doc = nlp("Roads? Where we’re going we don’t need roads!")

for token in doc:
    print(token)

Roads
?
Where
we
’re
going
we
do
n’t
need
roads
!


# Lemmatization
Lemmatization also works right out of the box.

The lemma of each token is directly available through its ```token.lemma_``` attribute.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I came in and met with her teammates at the meeting.")

print(f"{'Token':10}\t Lemma ")
print(f"{'-------':10}\t ------- ")
for token in doc:
    print(f"{token.text:10}\t {token.lemma_} ")

Token     	 Lemma 
-------   	 ------- 
I         	 I 
came      	 come 
in        	 in 
and       	 and 
met       	 meet 
with      	 with 
her       	 her 
teammates 	 teammate 
at        	 at 
the       	 the 
meeting   	 meeting 
.         	 . 


Notice how the verb "met" was correctly lemmatized to "meet", while the noun "meeting" kept "meeting" as its lemma. The lemma of a word depends on its context and its grammatical role.
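
The role of the grammatical category can be illustrated with a toy lookup keyed on both the word and its part of speech. This is not how spaCy implements lemmatization; the table, the tag names and the `toy_lemma` function are hypothetical and only show why the same surface form can have two lemmas.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse part of speech).
LEMMA_TABLE = {
    ("came", "VERB"): "come",
    ("met", "VERB"): "meet",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("teammates", "NOUN"): "teammate",
}

def toy_lemma(word, pos):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemma("meeting", "NOUN"))  # as in "at the meeting"
print(toy_lemma("meeting", "VERB"))  # as in "we are meeting tomorrow"
print(toy_lemma("Met", "VERB"))
```

A stemmer, which sees only the string, cannot make this distinction.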

# Form detection
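spaCy offers many other features, including some handy token attributes that characterize the form of a word:

- is_space: the token consists of whitespace
- is_punct: the token is punctuation
- is_upper: the token is in uppercase
- is_digit: the token consists of digits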



In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("All aboard! \t Train NXH123 departs from platform 22 at 3:16 sharp.")

# Header row
print("token \t\tspace? \tpunct?\tupper?\tdigit?")

# One row per token with its four boolean flags
for token in doc:
    print(f"{str(token):10} \t{token.is_space} \t{token.is_punct} \t{token.is_upper} \t{token.is_digit}")
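
To build some intuition for what these flags mean, here is a rough standard-library approximation applied to plain strings. It assumes the flags behave like the corresponding Python string checks on simple ASCII input; spaCy's own results may differ, and `rough_flags` is a hypothetical helper, not part of spaCy.

```python
import unicodedata

def rough_flags(text):
    """Approximate (is_space, is_punct, is_upper, is_digit) for one string."""
    # Treat a string as punctuation if every character is in a Unicode
    # punctuation category (categories starting with "P").
    is_punct = bool(text) and all(
        unicodedata.category(ch).startswith("P") for ch in text
    )
    return text.isspace(), is_punct, text.isupper(), text.isdigit()

for piece in ["All", "!", "\t", "NXH123", "22", "3:16"]:
    print(f"{piece!r:10}", rough_flags(piece))
```

Under this approximation "NXH123" counts as uppercase even though it contains digits, while "3:16" is neither a digit string nor punctuation because it mixes both.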
