**Name**: Aditya Gujar  
**Reg no.**: 2448505    
**Subject**: Natural Language Processing  
**Date**: 04-02-2025    
**Practical-1 (Tokenization)**

# Tokenization  

## What is Tokenization?  
Tokenization is the process of breaking down text into smaller units called **tokens**. These tokens can be words, sentences, or subwords, depending on the type of tokenization used. It is a fundamental step in **Natural Language Processing (NLP)**.

## Types of Tokenization:  
1. **Word Tokenization** – Splits text into individual words.  
2. **Sentence Tokenization** – Divides text into sentences.  
3. **Punctuation-based Tokenization** – Splits text using punctuation marks.  
4. **Subword Tokenization** – Breaks words into meaningful subunits (useful for deep learning models).  
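
None of the tokenizers demonstrated in this practical performs subword tokenization, so here is a minimal sketch of the idea: a greedy longest-match split in the style of WordPiece. The vocabulary `TOY_VOCAB` and the function `subword_tokenize` are made up for illustration; real models learn their vocabulary from a corpus.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"token", "##ization", "##ize", "##s", "un", "##break", "##able"}

def subword_tokenize(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

print(subword_tokenize("tokenization"))
print(subword_tokenize("unbreakable"))
print(subword_tokenize("music"))
```

Words covered by the toy vocabulary are split into known pieces, while a word with no matching pieces falls back to the unknown marker.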

## Why is Tokenization Important?  
- Prepares text for **NLP tasks** like sentiment analysis, chatbots, and search engines.  
- Provides the units on which **text normalization** steps such as punctuation handling, contraction splitting, and stop-word removal operate.  
- Essential for **machine learning models** to process text effectively.  


In [1]:
paragraph = "Artificial Intelligence (AI) is revolutionizing industries, from healthcare 🏥 to finance 📉, and even creative fields like music 🎵 and art 🎨. However, AI doesn’t always get it right—sometimes it *doesn’t* understand context or nuances, leading to errors. For example, sentiment analysis tools might misinterpret sarcasm or negation (e.g., 'I don’t like this product 😒😤'). Despite these challenges, AI continues to evolve, offering solutions like personalized recommendations, fraud detection, and even autonomous vehicles 🚗🤖. The future of AI is bright, but it’s crucial to address ethical concerns and biases to ensure fairness and inclusivity."

## a. Word Tokenization (NLTK)

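- Splits text into individual words.
- Keeps punctuation marks as separate tokens.
- Splits contractions written with a straight apostrophe (`doesn't` → `does`, `n't`); the typographic `’` in the sample paragraph is instead split off as its own token (`doesn`, `’`, `t`).
- Example: `["Artificial", "Intelligence", "(", "AI", ")"]`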

In [2]:
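import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time model download; newer NLTK releases may need "punkt_tab"
words = word_tokenize(paragraph)
print("Word Tokenization:\n", words)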

Word Tokenization:
 ['Artificial', 'Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', ',', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', '.', 'However', ',', 'AI', 'doesn', '’', 't', 'always', 'get', 'it', 'right—sometimes', 'it', '*', 'doesn', '’', 't', '*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors', '.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e.g.', ',', "'", 'I', 'don', '’', 't', 'like', 'this', 'product', '😒😤', "'", ')', '.', 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous', 'vehicles', '🚗🤖', '.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it', '’', 's', 'crucial', 'to', 'address', 'ethical', 'concern

## b. Sentence Tokenization (NLTK)
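- Splits text into sentences using NLTK's pre-trained Punkt model, which does not break at every period (the abbreviation `e.g.` stays inside its sentence in the output).
- Useful for sentence-level analysis.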
- Example: `["Artificial Intelligence (AI) is revolutionizing industries.", "However, AI doesn’t always get it right."]`

In [3]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(paragraph)
print("Sentence Tokenization:\n", sentences)

Sentence Tokenization:
 ['Artificial Intelligence (AI) is revolutionizing industries, from healthcare 🏥 to finance 📉, and even creative fields like music 🎵 and art 🎨.', 'However, AI doesn’t always get it right—sometimes it *doesn’t* understand context or nuances, leading to errors.', "For example, sentiment analysis tools might misinterpret sarcasm or negation (e.g., 'I don’t like this product 😒😤').", 'Despite these challenges, AI continues to evolve, offering solutions like personalized recommendations, fraud detection, and even autonomous vehicles 🚗🤖.', 'The future of AI is bright, but it’s crucial to address ethical concerns and biases to ensure fairness and inclusivity.']
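
To see why sentence tokenization needs more than punctuation, the sketch below splits on whitespace that follows `.`, `!`, or `?`. It is a naive baseline written for illustration (the function name `naive_sent_split` is hypothetical), not how `sent_tokenize` works.

```python
import re

def naive_sent_split(text):
    """Split wherever whitespace follows '.', '!' or '?' (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Rao tested the model. It failed on sarcasm!"
print(naive_sent_split(sample))
```

The baseline breaks after `Dr.`, which is the kind of mistake a trained sentence tokenizer is designed to avoid.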


## c. Punctuation-based Tokenizer (WordPunctTokenizer)
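- Splits text into runs of alphanumeric characters and runs of other non-whitespace characters, so `e.g.,` becomes `e`, `.`, `g`, `.,`.
- Adjacent symbols stay together in one token (for example `📉,`).
- Example: `["Artificial", "Intelligence", "(", "AI", ")", "is"]`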

In [4]:
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
punct_tokens = tokenizer.tokenize(paragraph)
print("Punctuation-based Tokenization:\n", punct_tokens)

Punctuation-based Tokenization:
 ['Artificial', 'Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉,', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨.', 'However', ',', 'AI', 'doesn', '’', 't', 'always', 'get', 'it', 'right', '—', 'sometimes', 'it', '*', 'doesn', '’', 't', '*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors', '.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e', '.', 'g', '.,', "'", 'I', 'don', '’', 't', 'like', 'this', 'product', "😒😤').", 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous', 'vehicles', '🚗🤖.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it', '’', 's', 'crucial', 'to', 'address', 'ethical', 'co
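
`WordPunctTokenizer` is a regular-expression tokenizer; NLTK documents its pattern as `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module to show why `e.g.,` and emoji followed by punctuation come out the way they do in the output above. The helper name `wordpunct_like` is an assumption for illustration.

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def wordpunct_like(text):
    """Return all non-overlapping matches of the WordPunct-style pattern."""
    return WORDPUNCT_PATTERN.findall(text)

print(wordpunct_like("negation (e.g., 'I don’t like it')."))
```

Because the second alternative matches any run of non-word, non-space characters, neighbouring symbols such as `.,` or `').` are returned as a single token.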

## d. Treebank Word Tokenizer
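- Follows the tokenization conventions of the Penn Treebank corpus.
- Splits contractions written with a straight apostrophe, e.g. `doesn't → does + n't`; the typographic `’` in the sample paragraph is not recognized, so `doesn’t` stays one token.
- Separates a period only at the end of the input, so it is normally applied to one sentence at a time; that is why `errors.` keeps its period here.
- Example: `["does", "n't", "like"]`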

In [5]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
treebank_tokens = tokenizer.tokenize(paragraph)
print("Treebank Tokenization:\n", treebank_tokens)

Treebank Tokenization:
 ['Artificial', 'Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', ',', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨.', 'However', ',', 'AI', 'doesn’t', 'always', 'get', 'it', 'right—sometimes', 'it', '*doesn’t*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e.g.', ',', "'I", 'don’t', 'like', 'this', 'product', '😒😤', "'", ')', '.', 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous', 'vehicles', '🚗🤖.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it’s', 'crucial', 'to', 'address', 'ethical', 'concerns', 'and', 'biases', 'to', 'ensure', 'fairness', 'an
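
In the output above, `doesn’t` and `don’t` stay whole because the paragraph uses typographic apostrophes (`’`), while the Treebank contraction rules look for the straight ASCII apostrophe. One possible workaround, sketched here with a hypothetical helper `normalize_apostrophes`, is to map curly single quotes to straight ones before tokenizing.

```python
# Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
CURLY_TO_STRAIGHT = str.maketrans({"\u2019": "'", "\u2018": "'"})

def normalize_apostrophes(text):
    """Return text with curly single quotes replaced by straight apostrophes."""
    return text.translate(CURLY_TO_STRAIGHT)

normalized = normalize_apostrophes("AI doesn\u2019t always get it right")
print(normalized)
```

Passing the normalized text to `TreebankWordTokenizer().tokenize` should then yield `does` and `n't` as separate tokens, matching the example in the bullet list.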

## e. Tweet Tokenizer
- Designed for social media text: keeps hashtags and @mentions intact and separates adjacent emoji.
- Example: `["like", "this", "product", "😒", "😤"]`

In [None]:
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tweet_tokens = tokenizer.tokenize(paragraph)
print("Tweet Tokenization:\n", tweet_tokens)

Tweet Tokenization:
 ['Artificial', 'Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', ',', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', '.', 'However', ',', 'AI', 'doesn', '’', 't', 'always', 'get', 'it', 'right', '—', 'sometimes', 'it', '*', 'doesn', '’', 't', '*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors', '.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e', '.', 'g', '.', ',', "'", 'I', 'don', '’', 't', 'like', 'this', 'product', '😒', '😤', "'", ')', '.', 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous', 'vehicles', '🚗', '🤖', '.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it', '’', 's', 'crucial', 'to', '

## f. Multi-Word Expression Tokenizer (MWETokenizer)
- Keeps predefined multi-word expressions together.
- Example: `["Artificial_Intelligence", "(", "AI", ")"]`

In [7]:
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer([('Artificial', 'Intelligence'), ('autonomous', 'vehicles')])
mwe_tokens = tokenizer.tokenize(word_tokenize(paragraph))
print("MWE Tokenization:\n", mwe_tokens)

MWE Tokenization:
 ['Artificial_Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', ',', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', '.', 'However', ',', 'AI', 'doesn', '’', 't', 'always', 'get', 'it', 'right—sometimes', 'it', '*', 'doesn', '’', 't', '*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors', '.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e.g.', ',', "'", 'I', 'don', '’', 't', 'like', 'this', 'product', '😒😤', "'", ')', '.', 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous_vehicles', '🚗🤖', '.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it', '’', 's', 'crucial', 'to', 'address', 'ethical', 'concerns', 'an
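
The merging step can be pictured as a pass over an existing token list that joins any listed word sequence with `_`. The function below is a simplified pure-Python sketch (the name `merge_mwes` is hypothetical); it is not NLTK's implementation, and it simply prefers the longest expression that matches at each position.

```python
def merge_mwes(tokens, expressions, separator="_"):
    """Join each listed multi-word expression found in tokens into one token."""
    # Check longer expressions first so they win over shorter overlapping ones;
    # empty expressions are skipped.
    ordered = sorted((tuple(e) for e in expressions if e), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["even", "autonomous", "vehicles", "and", "Artificial", "Intelligence"]
print(merge_mwes(tokens, [("Artificial", "Intelligence"), ("autonomous", "vehicles")]))
```

As in the notebook cell, the merge works on tokens that another tokenizer has already produced, so its handling of punctuation and contractions is inherited from that tokenizer.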

## g. TextBlob Tokenization
- Uses NLTK's word tokenizer under the hood; the `words` property additionally drops punctuation tokens (the typographic `’` is still kept in the output).
- Example: `["Artificial", "Intelligence", "AI", "is"]`

In [None]:
from textblob import TextBlob

blob = TextBlob(paragraph)
textblob_tokens = blob.words
print("TextBlob Tokenization:\n", textblob_tokens)


TextBlob Tokenization:
 ['Artificial', 'Intelligence', 'AI', 'is', 'revolutionizing', 'industries', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', 'However', 'AI', 'doesn', '’', 't', 'always', 'get', 'it', 'right—sometimes', 'it', 'doesn', '’', 't', 'understand', 'context', 'or', 'nuances', 'leading', 'to', 'errors', 'For', 'example', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', 'e.g', 'I', 'don', '’', 't', 'like', 'this', 'product', '😒😤', 'Despite', 'these', 'challenges', 'AI', 'continues', 'to', 'evolve', 'offering', 'solutions', 'like', 'personalized', 'recommendations', 'fraud', 'detection', 'and', 'even', 'autonomous', 'vehicles', '🚗🤖', 'The', 'future', 'of', 'AI', 'is', 'bright', 'but', 'it', '’', 's', 'crucial', 'to', 'address', 'ethical', 'concerns', 'and', 'biases', 'to', 'ensure', 'fairness', 'and', 'inclusivity']


## h. spaCy Tokenizer
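- Rule-based tokenizer that runs as the first step of the spaCy pipeline; later pipeline components add features such as entity recognition.
- Handles punctuation, spaces, and contractions, including those written with a typographic apostrophe (`doesn’t` → `does`, `n’t`).
- Example: `["Artificial", "Intelligence", "(", "AI", ")"]`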

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(paragraph)
spacy_tokens = [token.text for token in doc]
print("spaCy Tokenization:\n", spacy_tokens)

spaCy Tokenization:
 ['Artificial', 'Intelligence', '(', 'AI', ')', 'is', 'revolutionizing', 'industries', ',', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', ',', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', '.', 'However', ',', 'AI', 'does', 'n’t', 'always', 'get', 'it', 'right', '—', 'sometimes', 'it', '*', 'does', 'n’t', '*', 'understand', 'context', 'or', 'nuances', ',', 'leading', 'to', 'errors', '.', 'For', 'example', ',', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', '(', 'e.g.', ',', "'", 'I', 'do', 'n’t', 'like', 'this', 'product', '😒', '😤', "'", ')', '.', 'Despite', 'these', 'challenges', ',', 'AI', 'continues', 'to', 'evolve', ',', 'offering', 'solutions', 'like', 'personalized', 'recommendations', ',', 'fraud', 'detection', ',', 'and', 'even', 'autonomous', 'vehicles', '🚗', '🤖', '.', 'The', 'future', 'of', 'AI', 'is', 'bright', ',', 'but', 'it', '’s', 'crucial', 'to', 'address', 'ethical', 'concer

## i. Gensim Tokenizer
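- Lightweight word tokenizer that yields only alphabetic runs, so punctuation and emoji are dropped and `doesn’t` becomes `doesn`, `t`.
- Returns a generator, which is why the code wraps it in `list()`.
- Example: `["Artificial", "Intelligence", "AI"]`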

In [10]:
from gensim.utils import tokenize

gensim_tokens = list(tokenize(paragraph))
print("Gensim Tokenization:\n", gensim_tokens)

Gensim Tokenization:
 ['Artificial', 'Intelligence', 'AI', 'is', 'revolutionizing', 'industries', 'from', 'healthcare', 'to', 'finance', 'and', 'even', 'creative', 'fields', 'like', 'music', 'and', 'art', 'However', 'AI', 'doesn', 't', 'always', 'get', 'it', 'right', 'sometimes', 'it', 'doesn', 't', 'understand', 'context', 'or', 'nuances', 'leading', 'to', 'errors', 'For', 'example', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', 'e', 'g', 'I', 'don', 't', 'like', 'this', 'product', 'Despite', 'these', 'challenges', 'AI', 'continues', 'to', 'evolve', 'offering', 'solutions', 'like', 'personalized', 'recommendations', 'fraud', 'detection', 'and', 'even', 'autonomous', 'vehicles', 'The', 'future', 'of', 'AI', 'is', 'bright', 'but', 'it', 's', 'crucial', 'to', 'address', 'ethical', 'concerns', 'and', 'biases', 'to', 'ensure', 'fairness', 'and', 'inclusivity']
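
The output above contains only alphabetic runs: punctuation, apostrophes, and emoji are all gone. A rough stand-in for that behavior, assuming it is enough to keep runs of non-digit word characters, can be written with `re` alone. The pattern and the name `alphabetic_tokens` are assumptions for illustration, not Gensim's source.

```python
import re

# One or more word characters, none of which is a digit.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def alphabetic_tokens(text):
    """Return runs of non-digit word characters, skipping everything else."""
    return ALPHABETIC.findall(text)

print(alphabetic_tokens("finance 📉, and (e.g., 'I don’t like this product 😒😤')."))
```

Since emoji and apostrophes are not word characters, they act as separators here, which mirrors the `don`, `t` split seen in the Gensim output.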


## j. Keras Tokenizer
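- Converts text to lowercase, strips a fixed set of punctuation characters, and splits on whitespace.
- Apostrophes are not in the default filter set, so tokens such as `don’t` and `'i` are kept as-is.
- Example: `["artificial", "intelligence", "ai"]`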

In [11]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence

keras_tokens = text_to_word_sequence(paragraph)
print("Keras Tokenization:\n", keras_tokens)

Keras Tokenization:
 ['artificial', 'intelligence', 'ai', 'is', 'revolutionizing', 'industries', 'from', 'healthcare', '🏥', 'to', 'finance', '📉', 'and', 'even', 'creative', 'fields', 'like', 'music', '🎵', 'and', 'art', '🎨', 'however', 'ai', 'doesn’t', 'always', 'get', 'it', 'right—sometimes', 'it', 'doesn’t', 'understand', 'context', 'or', 'nuances', 'leading', 'to', 'errors', 'for', 'example', 'sentiment', 'analysis', 'tools', 'might', 'misinterpret', 'sarcasm', 'or', 'negation', 'e', 'g', "'i", 'don’t', 'like', 'this', 'product', "😒😤'", 'despite', 'these', 'challenges', 'ai', 'continues', 'to', 'evolve', 'offering', 'solutions', 'like', 'personalized', 'recommendations', 'fraud', 'detection', 'and', 'even', 'autonomous', 'vehicles', '🚗🤖', 'the', 'future', 'of', 'ai', 'is', 'bright', 'but', 'it’s', 'crucial', 'to', 'address', 'ethical', 'concerns', 'and', 'biases', 'to', 'ensure', 'fairness', 'and', 'inclusivity']
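
`text_to_word_sequence` lowercases the text, replaces a fixed set of filter characters with spaces, and splits on whitespace. The sketch below imitates that with plain Python; the filter string is taken from the function's documented default, and the name `simple_word_sequence` is hypothetical.

```python
# Default filter characters as documented for Keras text_to_word_sequence.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True):
    """Lowercase, blank out filtered characters, and split on whitespace."""
    if lower:
        text = text.lower()
    # Replace every filtered character with a space, then split.
    table = str.maketrans({ch: " " for ch in filters})
    return text.translate(table).split()

print(simple_word_sequence("Negation (e.g., 'I don’t like this product 😒😤')."))
```

Because neither `'` nor `’` is in the filter set, the quote marks stay attached to their neighbours, as in the Keras output above.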


## Tokenizer Comparison Table

| Index | Tokenizer                         | Keeps Punctuation | Handles Contractions | Good for Social Media | NLP Optimized |
|-------|-----------------------------------|-------------------|----------------------|-----------------------|---------------|
| a     | **Word Tokenization (NLTK)**      | ✅ Yes            | ✅ Yes               | ❌ No                 | ✅ Yes        |
| b     | **Sentence Tokenization (NLTK)**  | ✅ Yes            | N/A                  | ❌ No                 | ✅ Yes        |
| c     | **Punctuation-based (WordPunct)** | ✅ Yes            | ❌ No                | ❌ No                 | ✅ Yes        |
| d     | **Treebank Tokenizer**            | ✅ Yes            | ✅ Yes               | ❌ No                 | ✅ Yes        |
| e     | **Tweet Tokenizer**               | ✅ Yes            | ✅ Yes               | ✅ Yes                | ✅ Yes        |
| f     | **Multi-Word Tokenizer**          | ✅ Yes            | ❌ No                | ❌ No                 | ✅ Yes        |
| g     | **TextBlob Tokenizer**            | ❌ No (removed)   | ✅ Yes               | ❌ No                 | ✅ Yes        |
| h     | **spaCy Tokenizer**               | ✅ Yes            | ✅ Yes               | ❌ No                 | ✅ Yes        |
| i     | **Gensim Tokenizer**              | ❌ No (removed)   | ❌ No                | ❌ No                 | ✅ Yes        |
| j     | **Keras Tokenizer**               | ❌ No (removed)   | ❌ No                | ❌ No                 | ✅ Yes        |

**Note:** The contraction column assumes straight apostrophes (`'`). The sample paragraph uses typographic apostrophes (`’`), and in the outputs above only spaCy split `doesn’t` into `does` and `n’t`.
