## **Tokenization**

---

### **What is Tokenization?**

Tokenization means **splitting text into smaller units** called tokens, usually **words** or **sentences**.
Example: `"I love NLP"` → `["I", "love", "NLP"]`
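
As a rough illustration of the idea, independent of any NLP library, the sketch below uses Python's built-in `re` module. The function name `simple_word_tokenize` and its pattern are assumptions made for this example, and the result is far cruder than a real tokenizer.

```python
import re

def simple_word_tokenize(text):
    """Return word tokens and individual punctuation marks (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

text = "I love NLP!"
print(text.split())                # ['I', 'love', 'NLP!']  (punctuation stays attached)
print(simple_word_tokenize(text))  # ['I', 'love', 'NLP', '!']
```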

---

### **Why is it Needed?**

* Models can't process raw text directly; they need it broken into discrete units first.
* Breaking text into tokens helps in **analyzing each word**.
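
Once text is a list of tokens, per-word analysis becomes straightforward. A minimal sketch, using only the standard library and a made-up sentence, counts how often each token occurs:

```python
from collections import Counter

# Hypothetical example sentence; whitespace splitting stands in for a real tokenizer.
tokens = "the cat sat on the mat".split()

counts = Counter(tokens)
print(counts.most_common(1))  # [('the', 2)]
```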

---




## **How is it Done?**

### **1. NLTK**

In [1]:
import nltk

In [2]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [3]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to C:\Users\SPPL
[nltk_data]     IT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


True

In [None]:
text = "I love NLP!"
tokens = word_tokenize(text)
print(tokens)  

['I', 'love', 'NLP', '!']


In [7]:
text = "I love NLP! It's fascinating. Let's learn it step by step."
sentences = sent_tokenize(text)
print(sentences)


['I love NLP!', "It's fascinating.", "Let's learn it step by step."]


In [8]:
text = "Email me at tanji.evan23@gmail.com"
tokens = word_tokenize(text)
print(tokens)  

['Email', 'me', 'at', 'tanji.evan23', '@', 'gmail.com']
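
In the output above, `word_tokenize` breaks the address into three tokens. If addresses should stay whole, one option is a custom pattern. The following is only a sketch with a deliberately simplified email pattern (an assumption for this example, not a full address validator), and `tokenize_keep_emails` is a hypothetical helper name:

```python
import re

# Alternatives are tried left to right, so the email pattern gets the first chance.
TOKEN_PATTERN = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # simplified email address
    r"|\w+"                          # ordinary word
    r"|[^\w\s]"                      # single punctuation mark
)

def tokenize_keep_emails(text):
    """Tokenize text while keeping simple email addresses as one token."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_keep_emails("Email me at tanji.evan23@gmail.com"))
# ['Email', 'me', 'at', 'tanji.evan23@gmail.com']
```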


In [25]:
text =  "I met with Mr. and Mrs. Khan. They live in the U.S. Do you know them?"

sentences = sent_tokenize(text)
print(sentences)


['I met with Mr. and Mrs. Khan.', 'They live in the U.S. Do you know them?']
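
Abbreviations such as `Mr.` are what make sentence splitting hard. To see why, consider a naive splitter (a hypothetical `naive_sent_split`, written only for this comparison) that breaks after every `.`, `!`, or `?` followed by whitespace:

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (deliberately naive)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "I met with Mr. and Mrs. Khan. They live in the U.S. Do you know them?"
for sentence in naive_sent_split(text):
    print(sentence)
# I met with Mr.
# and Mrs.
# Khan.
# They live in the U.S.
# Do you know them?
```

The naive rule over-splits at `Mr.` and `Mrs.`, which `sent_tokenize` handles correctly in the output above, although `sent_tokenize` in turn misses the boundary after `U.S.`.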


In [26]:
text =  "Wow!!😲 That's crazy...Right? I can't believe it..."

sentences = sent_tokenize(text)
print(sentences)


['Wow!', "!😲 That's crazy...Right?", "I can't believe it..."]


### **2. spaCy**

In [27]:
import spacy

In [28]:
# Load English model
nlp = spacy.load("en_core_web_sm")

In [30]:
text = "NLP is fun!"
doc = nlp(text)

In [33]:
type(text)

str

In [32]:
type(doc)

spacy.tokens.doc.Doc

In [35]:
for token in doc:
    print(token)

NLP
is
fun
!


In [36]:
text1 = "Email me at tanji.evan23@gmail.com"
text2 =  "I met with Mr. and Mrs. Khan. They live in the U.S. Do you know them?"
text3 =  "Wow!!😲 That's crazy...Right? I can't believe it..."

In [37]:
doc1 = nlp(text1)
doc2 = nlp(text2)
doc3 = nlp(text3)

In [38]:
for token in doc1:
    print(token)

Email
me
at
tanji.evan23@gmail.com


In [43]:
for token in doc2:
    print(token,end=",")

I,met,with,Mr.,and,Mrs.,Khan,.,They,live,in,the,U.S.,Do,you,know,them,?,

In [45]:
for sent in doc2.sents:
    print(sent)

I met with Mr. and Mrs. Khan.
They live in the U.S.
Do you know them?


In [46]:
for sent in doc3.sents:
    print(sent)

Wow!!
😲
That's crazy...
Right?
I can't believe it...



## **Stemming**

---

### **What is Stemming?**

Stemming means **cutting words down** to their **root form** by removing suffixes.
Example: `"playing"`, `"played"` → `"play"`

---

### **Why is it Needed?**

* Different forms of a word (play, playing, played) mean **similar things**.
* Stemming helps in **reducing word variations**.
* It improves **search**, **text matching**, and **model efficiency**.
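
To make the search benefit concrete, here is a small sketch of stem-based matching. `toy_stem` is a hypothetical stand-in that strips a few suffixes; it is not the Porter algorithm, and the documents are made up.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strip one common suffix (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def matches(query, document):
    """True if any word in the document shares a stem with the query word."""
    return toy_stem(query) in {toy_stem(w) for w in document.split()}

docs = ["He was playing outside", "She reads books"]
print([doc for doc in docs if matches("played", doc)])
# ['He was playing outside']
```

Without the stemming step, the query `"played"` would not match a document that only contains `"playing"`.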

---

### **How is it Done?**


In [47]:
from nltk.stem import PorterStemmer


In [48]:
stemmer = PorterStemmer()

In [50]:
stemmer.stem("playing")

'play'

In [None]:
words = ["play", "playing", "played", "plays"]

for word in words:
    print(word, "→", stemmer.stem(word))

play → play
playing → play
played → play
plays → play


In [51]:
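text = "Hello classes i am going to teach you all NLP "

# Common mistake: looping over a string yields single characters, not words,
# and stemmer.stem(text) treats the whole sentence as one "word",
# so the output is just the lowercased sentence repeated for every character.
for word in text:
    print(word, "→", stemmer.stem(text))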

H → hello classes i am going to teach you all nlp 
e → hello classes i am going to teach you all nlp 
l → hello classes i am going to teach you all nlp 
l → hello classes i am going to teach you all nlp 
o → hello classes i am going to teach you all nlp 
  → hello classes i am going to teach you all nlp 
c → hello classes i am going to teach you all nlp 
l → hello classes i am going to teach you all nlp 
a → hello classes i am going to teach you all nlp 
s → hello classes i am going to teach you all nlp 
s → hello classes i am going to teach you all nlp 
e → hello classes i am going to teach you all nlp 
s → hello classes i am going to teach you all nlp 
  → hello classes i am going to teach you all nlp 
i → hello classes i am going to teach you all nlp 
  → hello classes i am going to teach you all nlp 
a → hello classes i am going to teach you all nlp 
m → hello classes i am going to teach you all nlp 
  → hello classes i am going to teach you all nlp 
g → hello classes i am going to

In [53]:
stemmer = PorterStemmer()

# Input sentence
text = "He was playing with his friends and enjoyed the games."


In [54]:
# Step 1: Tokenize the sentence properly
words = word_tokenize(text)

words

['He',
 'was',
 'playing',
 'with',
 'his',
 'friends',
 'and',
 'enjoyed',
 'the',
 'games',
 '.']

In [56]:
# Step 2: Stem each word using a loop
stemmed_words = []
for word in words:
    stemmed_word = stemmer.stem(word)
    stemmed_words.append(stemmed_word)


In [57]:
stemmed_words

['he',
 'wa',
 'play',
 'with',
 'hi',
 'friend',
 'and',
 'enjoy',
 'the',
 'game',
 '.']

In [58]:
# Step 3: Join stemmed words back into a sentence
result = ' '.join(stemmed_words)
print(result)

he wa play with hi friend and enjoy the game .


### **Snowball Stemmer**

> The Snowball stemmer (also known as Porter2) is a revised version of the Porter stemmer and supports several languages besides English.


In [59]:
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')

# Create stemmer for English
stemmer = SnowballStemmer("english")

text = "He was playing with his friends and enjoyed the games."

# Tokenize the sentence
words = word_tokenize(text)

# Apply stemming using loop
stemmed_words = []
for word in words:
    stemmed_word = stemmer.stem(word)
    stemmed_words.append(stemmed_word)

# Join back to sentence
result = ' '.join(stemmed_words)
print(result)


he was play with his friend and enjoy the game .


[nltk_data] Downloading package punkt to C:\Users\SPPL
[nltk_data]     IT\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [60]:
text = "history eaten"

# Tokenize the sentence
words = word_tokenize(text)

# Apply stemming using loop
stemmed_words = []
for word in words:
    stemmed_word = stemmer.stem(word)
    stemmed_words.append(stemmed_word)

# Join back to sentence
result = ' '.join(stemmed_words)
print(result)

histori eaten
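
To compare the two stemmers directly, the sketch below runs both on a few of the words used above. It relies only on NLTK's `PorterStemmer` and `SnowballStemmer` classes, which need no extra data downloads; the word list is chosen for illustration.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Words taken from the examples in this notebook
for word in ["was", "his", "playing", "history"]:
    print(f"{word:10} porter={porter.stem(word):10} snowball={snowball.stem(word)}")
```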


### **Problem with Stemming Output**

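The Porter stemmer produced outputs such as:

* `"was"` → `"wa"`
* `"his"` → `"hi"`

The Snowball stemmer left these two intact but still turned `"history"` into `"histori"`. None of these results are **valid English words**. This is a **known limitation** of stemming.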

---

### **Why This Happens**

Stemming **just strips suffixes** by fixed rules, without checking whether the result is a **real word**.
The rules look only at spelling, not at meaning.
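
A toy rule makes the failure mode visible. The function below is a hypothetical, simplified rule written for this example, not NLTK's actual implementation: it removes a trailing `s` without consulting any dictionary.

```python
def strip_trailing_s(word):
    """Blindly remove a final 's' (but not 'ss'), as a fixed rule would."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for word in ["games", "friends", "was", "his", "class"]:
    print(word, "->", strip_trailing_s(word))
# games -> game
# friends -> friend
# was -> wa
# his -> hi
# class -> class
```

The same rule that correctly reduces `games` to `game` also damages `was` and `his`, because it has no way to tell a plural ending from part of the word.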

---

### **What’s the Solution?**

When the output must consist of real words, use **lemmatization**, which maps each word to its dictionary form (lemma), instead of stemming.

---

### **Stemming vs Lemmatization**

| Feature        | Stemming         | Lemmatization                            |
| -------------- | ---------------- | ---------------------------------------- |
| Method         | Strips suffixes  | Maps word to its dictionary form (lemma) |
| Valid word?    | Often **no**     | Usually **yes**                          |
| Based on       | Simple rules     | Dictionary + grammar                     |
| Example output | `"was"` → `"wa"` | `"was"` → `"be"`                         |
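
The dictionary-plus-grammar idea can be sketched with a tiny lookup table. The table `LEMMAS` and the function `toy_lemmatize` are hypothetical and hand-written for this example; real lemmatizers rely on much larger lexicons and on part-of-speech tagging.

```python
# Hand-written (word, part of speech) -> lemma table; illustrative only.
LEMMAS = {
    ("was", "VERB"): "be",
    ("playing", "VERB"): "play",
    ("enjoyed", "VERB"): "enjoy",
    ("friends", "NOUN"): "friend",
    ("games", "NOUN"): "game",
}

def toy_lemmatize(word, pos):
    """Look up the lemma; fall back to the lowercased word if it is unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("was", "VERB"))    # be
print(toy_lemmatize("games", "NOUN"))  # game
print(toy_lemmatize("his", "PRON"))    # his (not in the table, returned unchanged)
```

Because the result comes from a lookup rather than from cutting characters, an irregular form such as `was` can map to `be`, which no suffix rule could produce.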

---

