# Stemming and Lemmatization

Stemming and lemmatization are two ways to further reduce the number of distinct words while still keeping their roots. Stemming cuts affixes off a word by rule, while lemmatization maps a word to its dictionary form (its lemma).

Examples of stemming:

* adjustable --> adjust

* formality --> formaliti (a stemmed token does not need to be a real word)


Examples of lemmatization:

* am --> be

* better --> good (a lemma token is a real word)
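
The contrast in the examples above can be sketched in plain Python. This is a toy illustration only: the suffix list and the lookup table are invented for the demonstration and are not how NLTK's stemmers or WordNet work internally, and `toy_stem` and `toy_lemmatize` are hypothetical helper names.

```python
# Toy illustration: rule-based suffix stripping vs. dictionary lookup.
# SUFFIXES and LEMMA_TABLE are invented for this sketch.
SUFFIXES = ["able", "ing", "ed", "s"]
LEMMA_TABLE = {"am": "be", "better": "good", "ran": "run"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ["adjustable", "cooking", "better", "am"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

No suffix rule connects "better" to "good", so only the lookup recovers it. That is the practical difference between the two approaches.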

To stem and lemmatize words, this exercise relies mainly on the NLTK package, one of many NLP packages. TextBlob and spaCy appear alongside it for comparison.

### Stemming

NLTK has stemmers.

TextBlob has stemmers too.

spaCy does not include a stemmer.


In [1]:
list_of_words = ['run',
                 'runner',
                 'running',
                 'ran',
                 'runs',
                 'easily',
                 'fairly',
                 "cook",
                 "cooker",
                 "cooking",
                 "cooked"]


In [2]:
from nltk.stem.porter import PorterStemmer  # import the Porter stemmer from NLTK

# Initialize the stemmer
# Porter is a popular choice; NLTK offers others: https://www.nltk.org/api/nltk.stem.html
porter = PorterStemmer()

stem_example = [porter.stem(word) for word in list_of_words]
stem_example

['run',
 'runner',
 'run',
 'ran',
 'run',
 'easili',
 'fairli',
 'cook',
 'cooker',
 'cook',
 'cook']
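
Porter is not the only stemmer in NLTK. As a small sketch, the snippet below runs three of the stemmers exported by `nltk.stem` over a few of the same words so their behaviour can be compared side by side. The word subset and the `stemmers` dictionary are choices made for this illustration, and no outputs are listed here because they can differ between stemmers and NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# A few words taken from list_of_words above
words = ["running", "easily", "fairly", "cooking"]

# Three rule-based stemmers shipped with NLTK (no data download required)
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

for name, stemmer in stemmers.items():
    print(name, [stemmer.stem(w) for w in words])
```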

In [3]:
# Uncomment to install TextBlob if it is not already available
#!pip install textblob

In [4]:
from textblob import Word
stemmed_words = [Word(word).stem() for word in list_of_words]
stemmed_words

['run',
 'runner',
 'run',
 'ran',
 'run',
 'easili',
 'fairli',
 'cook',
 'cooker',
 'cook',
 'cook']

### Lemmatization


In [6]:
# NLTK
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('omw-1.4')  # download omw-1.4 as a prerequisite for WordNet
nltk.download('wordnet')  # download the WordNet data used for lemmatization
nltk_lemmatizer = WordNetLemmatizer()
# lemmatize() treats each word as a noun unless a pos argument is given
nltk_example = [nltk_lemmatizer.lemmatize(word) for word in list_of_words]
nltk_example



['run',
 'runner',
 'running',
 'ran',
 'run',
 'easily',
 'fairly',
 'cook',
 'cooker',
 'cooking',
 'cooked']

In [10]:
# spaCy
import spacy
from spacy.cli import download

download("en_core_web_sm")  # fetch the small English pipeline (only needed once)
nlp = spacy.load("en_core_web_sm")
# Each word is processed on its own, so the lemmatizer sees no sentence context
lemmatized_words = [nlp(word)[0].lemma_ for word in list_of_words]
lemmatized_words


['run',
 'runner',
 'run',
 'run',
 'run',
 'easily',
 'fairly',
 'cook',
 'cooker',
 'cook',
 'cook']

In [11]:
# TextBlob
from textblob import Word

nltk.download('wordnet')  # TextBlob's lemmatizer uses WordNet
# Word.lemmatize() also treats the word as a noun unless a pos argument is given
lemmatized_words = [Word(word).lemmatize() for word in list_of_words]
lemmatized_words



['run',
 'runner',
 'running',
 'ran',
 'run',
 'easily',
 'fairly',
 'cook',
 'cooker',
 'cooking',
 'cooked']

# Tokenizing Demo


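The sample text below is reused by every tokenizer in this demo. It mixes an abbreviation, a curly apostrophe, hyphens, non-ASCII letters, digits, and a hashtag so that differences between tokenizers are easy to spot.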

In [12]:
text = "Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'! #NLP."


## Sentence Tokenization



In [13]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt_tab")  # download the pretrained Punkt sentence tokenizer data

nltk_sentences = sent_tokenize(text)

print("NLTK Sentences:", nltk_sentences)



NLTK Sentences: ["Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'!", '#NLP.']


In [14]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

spacy_sentences = [sent.text for sent in doc.sents]

print("spaCy Sentences:", spacy_sentences)

spaCy Sentences: ["Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'!", '#NLP.']
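
Both libraries keep `Dr.` attached to the first sentence. To see why a trained or rule-rich sentence splitter is useful, here is a deliberately naive baseline that uses only the standard library. The regular expression is an assumption made for this sketch, not what NLTK or spaCy use.

```python
import re

text = "Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'! #NLP."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
```

The naive rule breaks after the abbreviation `Dr.`, producing an extra fragment that the NLTK and spaCy results above avoid.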


## Whitespace Tokenization

In [15]:
tokens = text.split()

# Print tokens
print("Tokens:", tokens)

Tokens: ['Dr.', 'Smith’s', 'AI-powered', 'model', '(trained', 'on', 'Türkçe,', '10+', 'languages)', 'outperformed', 'others', 'in', 'the', "'2023_Challenge'!", '#NLP.']
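
Splitting on whitespace leaves punctuation glued to the words, as in `'(trained'` and `'Türkçe,'` above. One simple refinement, sketched here with the standard `re` module, is to treat runs of word characters and single punctuation marks as separate tokens. The pattern is an illustrative choice and much cruder than the library tokenizers in the next section.

```python
import re

text = "Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'! #NLP."

# \w+      : a run of letters, digits, or underscores (Unicode-aware in Python 3)
# [^\w\s]  : any single character that is neither a word character nor whitespace
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(regex_tokens)
```

Because the underscore counts as a word character, `2023_Challenge` stays in one piece, while `AI-powered` and `10+` are split at the punctuation.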


## Word Tokenization



In [16]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt_tab")  # word_tokenize relies on the same Punkt data

# NLTK word tokenization
nltk_word_tokens = word_tokenize(text)

print("NLTK Word Tokens:", nltk_word_tokens)


NLTK Word Tokens: ['Dr.', 'Smith', '’', 's', 'AI-powered', 'model', '(', 'trained', 'on', 'Türkçe', ',', '10+', 'languages', ')', 'outperformed', 'others', 'in', 'the', "'2023_Challenge", "'", '!', '#', 'NLP', '.']




In [17]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

spacy_tokens = [token.text for token in doc]
print("spaCy Word Tokens:", spacy_tokens)


spaCy Word Tokens: ['Dr.', 'Smith', '’s', 'AI', '-', 'powered', 'model', '(', 'trained', 'on', 'Türkçe', ',', '10', '+', 'languages', ')', 'outperformed', 'others', 'in', 'the', "'", '2023_Challenge', "'", '!', '#', 'NLP', '.']


In [18]:
from textblob import TextBlob

# .words splits the text into words and drops most punctuation
TextBlob(text).words

WordList(['Dr', 'Smith', '’', 's', 'AI-powered', 'model', 'trained', 'on', 'Türkçe', '10', 'languages', 'outperformed', 'others', 'in', 'the', "'2023_Challenge", 'NLP'])

## Subword Tokenization

In [19]:
from transformers import BertTokenizer

# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sample text
text = "Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'! #NLP."

# Tokenize the text
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Print tokens and IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)



Tokens: ['dr', '.', 'smith', '’', 's', 'ai', '-', 'powered', 'model', '(', 'trained', 'on', 'turk', '##ce', ',', '10', '+', 'languages', ')', 'out', '##per', '##formed', 'others', 'in', 'the', "'", '202', '##3', '_', 'challenge', "'", '!', '#', 'nl', '##p', '.']
Token IDs: [2852, 1012, 3044, 1521, 1055, 9932, 1011, 6113, 2944, 1006, 4738, 2006, 22883, 3401, 1010, 2184, 1009, 4155, 1007, 2041, 4842, 29021, 2500, 1999, 1996, 1005, 16798, 2509, 1035, 4119, 1005, 999, 1001, 17953, 2361, 1012]
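
The lowercase, accent-free pieces above (`dr`, `turk`, `##ce`) are consistent with the normalization that uncased BERT checkpoints apply before splitting into subwords: lowercasing followed by accent stripping. A rough standard-library sketch of that step is shown below. `normalize_uncased` is a hypothetical helper name, and the real tokenizer performs additional cleanup that is not reproduced here.

```python
import unicodedata


def normalize_uncased(text):
    """Lowercase, decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Türkçe"))
```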


In [20]:
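from transformers import AutoTokenizer

# AutoTokenizer selects the tokenizer class that matches the checkpoint,
# so the tokens and IDs below are identical to the BertTokenizer results
tokenizer2 = AutoTokenizer.from_pretrained("bert-base-uncased")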

# Sample text
text = "Dr. Smith’s AI-powered model (trained on Türkçe, 10+ languages) outperformed others in the '2023_Challenge'! #NLP."

# Tokenize the text
tokens = tokenizer2.tokenize(text)
token_ids = tokenizer2.convert_tokens_to_ids(tokens)

# Print tokens and IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)

Tokens: ['dr', '.', 'smith', '’', 's', 'ai', '-', 'powered', 'model', '(', 'trained', 'on', 'turk', '##ce', ',', '10', '+', 'languages', ')', 'out', '##per', '##formed', 'others', 'in', 'the', "'", '202', '##3', '_', 'challenge', "'", '!', '#', 'nl', '##p', '.']
Token IDs: [2852, 1012, 3044, 1521, 1055, 9932, 1011, 6113, 2944, 1006, 4738, 2006, 22883, 3401, 1010, 2184, 1009, 4155, 1007, 2041, 4842, 29021, 2500, 1999, 1996, 1005, 16798, 2509, 1035, 4119, 1005, 999, 1001, 17953, 2361, 1012]
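
The `##` prefix in these outputs marks a piece that continues the current word. BERT-style WordPiece splits each word greedily, taking the longest vocabulary entry that matches at the current position. The sketch below illustrates that idea with a tiny hand-made vocabulary. `toy_vocab` and the `wordpiece` helper are assumptions for this example, and the real vocabulary has roughly thirty thousand entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single, already normalized word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hand-made vocabulary, just large enough for the examples
toy_vocab = {"out", "##per", "##formed", "turk", "##ce", "nl", "##p"}

for w in ["outperformed", "turkce", "nlp", "xyz"]:
    print(w, "->", wordpiece(w, toy_vocab))
```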


In [21]:
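tokenizer3 = AutoTokenizer.from_pretrained("gpt2")
# Pre-tokenization step; its return value is not shown because it is not the cell's last expression
tokenizer3.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
# Tokenize the text ('Ġ' in the output marks a preceding space)
tokens = tokenizer3.tokenize(text)
token_ids = tokenizer3.convert_tokens_to_ids(tokens)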

# Print tokens and IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)


Tokens: ['Dr', '.', 'ĠSmith', 'âĢ', 'Ļ', 's', 'ĠAI', '-', 'powered', 'Ġmodel', 'Ġ(', 'trained', 'Ġon', 'ĠT', 'Ã¼r', 'k', 'Ã§', 'e', ',', 'Ġ10', '+', 'Ġlanguages', ')', 'Ġoutper', 'formed', 'Ġothers', 'Ġin', 'Ġthe', "Ġ'", '20', '23', '_', 'Chall', 'enge', "'", '!', 'Ġ#', 'N', 'LP', '.']
Token IDs: [6187, 13, 4176, 447, 247, 82, 9552, 12, 12293, 2746, 357, 35311, 319, 309, 25151, 74, 16175, 68, 11, 838, 10, 8950, 8, 33597, 12214, 1854, 287, 262, 705, 1238, 1954, 62, 41812, 3540, 6, 0, 1303, 45, 19930, 13]
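
Pieces such as `'Ġ'`, `'Ã¼r'`, and `'âĢ'` are not corrupted text. GPT-2 tokenizes UTF-8 bytes and displays each byte as a printable stand-in character. The sketch below rebuilds that byte-to-character table as it is understood from GPT-2's reference encoder and uses it to turn a few of the tokens above back into text. `byte_to_char_table` and `decode_byte_level` are hypothetical helper names for this illustration.

```python
def byte_to_char_table():
    """Map each of the 256 byte values to a printable character.

    Bytes that are already printable map to themselves; the remaining
    bytes are assigned code points from 256 upward.
    """
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


def decode_byte_level(tokens):
    """Turn byte-level token strings back into ordinary text."""
    char_to_byte = {ch: b for b, ch in byte_to_char_table().items()}
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8")


# Token strings copied from the GPT-2 output above
print(repr(decode_byte_level(["ĠT", "Ã¼r", "k", "Ã§", "e"])))
print(repr(decode_byte_level(["âĢ", "Ļ"])))
```

The space byte maps to `Ġ`, which is why that character shows up at the start of most words.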


In [22]:
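tokenizer4 = AutoTokenizer.from_pretrained("t5-small")
# Pre-tokenization step; its return value is not shown because it is not the cell's last expression
tokenizer4.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
# Tokenize the text ('▁' in the output marks a preceding space)
tokens = tokenizer4.tokenize(text)
token_ids = tokenizer4.convert_tokens_to_ids(tokens)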

# Print tokens and IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)


Tokens: ['▁Dr', '.', '▁Smith', '’', 's', '▁AI', '-', 'powered', '▁model', '▁(', 't', 'rained', '▁on', '▁Tür', 'k', 'ç', 'e', ',', '▁10', '+', '▁languages', ')', '▁out', 'per', 'formed', '▁others', '▁in', '▁the', '▁', "'", '20', '23', '_', 'C', 'hall', 'en', 'ge', "'", '!', '▁#', 'N', 'LP', '.']
Token IDs: [707, 5, 3931, 22, 7, 7833, 18, 17124, 825, 41, 17, 10761, 30, 12087, 157, 8970, 15, 6, 335, 1220, 8024, 61, 91, 883, 10816, 717, 16, 8, 3, 31, 1755, 2773, 834, 254, 11516, 35, 397, 31, 55, 1713, 567, 6892, 5]
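
In the T5 output, the `▁` character (U+2581) stands for a space that preceded the piece, so the original spacing can be recovered by concatenating the pieces and swapping the marker back. A minimal sketch follows. `join_sentencepiece` is a hypothetical helper, and in practice the tokenizer's own decoding methods would be used instead.

```python
def join_sentencepiece(tokens):
    """Concatenate pieces and turn the '▁' markers back into spaces."""
    return "".join(tokens).replace("\u2581", " ").strip()


# First few pieces copied from the T5 output above
t5_tokens = ['▁Dr', '.', '▁Smith', '’', 's', '▁AI', '-', 'powered', '▁model']

print(join_sentencepiece(t5_tokens))
```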


In [23]:
# How about DeepSeek?

tokenizer5 = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-r1")
# Pre-tokenization step; its return value is not shown because it is not the cell's last expression
tokenizer5.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
# Tokenize the text
tokens = tokenizer5.tokenize(text)
token_ids = tokenizer5.convert_tokens_to_ids(tokens)

# Print tokens and IDs
print("Tokens:", tokens)
print("Token IDs:", token_ids)


Tokens: ['Dr', '.', 'ĠSmith', 'âĢĻ', 's', 'ĠAI', '-powered', 'Ġmodel', 'Ġ(', 't', 'rained', 'Ġon', 'ĠTÃ¼r', 'k', 'Ã§', 'e', ',', 'Ġ', '10', '+', 'Ġlanguages', ')', 'Ġoutper', 'formed', 'Ġothers', 'Ġin', 'Ġthe', "Ġ'", '202', '3', '_', 'Challenge', "'", '!', 'Ġ#', 'N', 'LP', '.']
Token IDs: [12528, 16, 10201, 442, 85, 7703, 42793, 2645, 343, 86, 17021, 377, 82402, 77, 2341, 71, 14, 223, 553, 13, 10555, 11, 55006, 19886, 3628, 295, 270, 905, 939, 21, 65, 99042, 9, 3, 1823, 48, 23925, 16]
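
The `Ġ`-prefixed pieces in the GPT-2 and DeepSeek outputs appear to come from byte-pair-encoding (BPE) style vocabularies, which are built by repeatedly merging the most frequent adjacent pair of symbols. The toy learner below sketches that training loop on four words from the stemming section. It works on characters instead of bytes, breaks ties by first occurrence, and `learn_bpe` is a hypothetical name, so it illustrates the idea without reproducing either model's tokenizer.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe(["cook", "cooker", "cooking", "cooked"], num_merges=3)
print(merges)
```

After three merges the shared stem `cook` has become a single symbol, which mirrors how frequent words end up as one token while rarer words such as `Türkçe` are split into several pieces.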
