# NLP Assignment 1: Tokenization, Stemming, and Lemmatization

This notebook demonstrates various NLP techniques using NLTK:
- **Tokenization**: Whitespace, Punctuation-based, Treebank, Tweet, and MWE (Multi-Word Expression)
- **Stemming**: Porter Stemmer and Snowball Stemmer
- **Lemmatization**: WordNet Lemmatizer

In [1]:
# Import necessary libraries
import nltk
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer, TweetTokenizer, MWETokenizer
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')

print("All libraries imported and data downloaded successfully!")

All libraries imported and data downloaded successfully!


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Harsh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Sample Text

Let's define sample texts for demonstration:

In [2]:
# Sample text for general tokenization
text = "Hello, world! This is a sample text for NLP. It contains words like running, flies, better, and easily."

# Sample text for tweet tokenization
tweet_text = "@NLPStudent: Just learned about #NLP and #MachineLearning! 😊 Check it out: https://example.com"

print("Sample Text:")
print(text)
print("\nTweet Text:")
print(tweet_text)

Sample Text:
Hello, world! This is a sample text for NLP. It contains words like running, flies, better, and easily.

Tweet Text:
@NLPStudent: Just learned about #NLP and #MachineLearning! 😊 Check it out: https://example.com


## 1. Tokenization Techniques

### 1.1 Whitespace Tokenization
Splits text based on whitespace characters (spaces, tabs, newlines).

In [3]:
# Whitespace Tokenization
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(text)

print("Whitespace Tokenization:")
print(whitespace_tokens)
print(f"\nNumber of tokens: {len(whitespace_tokens)}")

Whitespace Tokenization:
['Hello,', 'world!', 'This', 'is', 'a', 'sample', 'text', 'for', 'NLP.', 'It', 'contains', 'words', 'like', 'running,', 'flies,', 'better,', 'and', 'easily.']

Number of tokens: 18
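
As a quick check of what "whitespace" covers, the sketch below uses Python's built-in `str.split()`, which for ordinary text like this behaves like `WhitespaceTokenizer`: it splits on any run of spaces, tabs, or newlines. The `messy` string is a made-up example, not part of the assignment data.

```python
# Hypothetical input mixing double spaces, a tab, and a newline.
messy = "Hello,  world!\tThis is\na test."

# str.split() with no arguments splits on any run of whitespace
# and drops empty strings, so punctuation stays attached to words.
tokens = messy.split()
print(tokens)
# ['Hello,', 'world!', 'This', 'is', 'a', 'test.']
```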


### 1.2 Punctuation-based Tokenization
Splits text on whitespace and punctuation marks.

In [4]:
# Punctuation-based Tokenization
wordpunct_tokenizer = WordPunctTokenizer()
wordpunct_tokens = wordpunct_tokenizer.tokenize(text)

print("Punctuation-based Tokenization:")
print(wordpunct_tokens)
print(f"\nNumber of tokens: {len(wordpunct_tokens)}")

Punctuation-based Tokenization:
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'NLP', '.', 'It', 'contains', 'words', 'like', 'running', ',', 'flies', ',', 'better', ',', 'and', 'easily', '.']

Number of tokens: 25
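
The NLTK documentation describes `WordPunctTokenizer` as a regular-expression tokenizer using the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module to show where the 25 tokens come from; the second input string is a made-up example.

```python
import re

text = ("Hello, world! This is a sample text for NLP. "
        "It contains words like running, flies, better, and easily.")

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation).
PATTERN = r"\w+|[^\w\s]+"
tokens = re.findall(PATTERN, text)

print(tokens[:4])   # ['Hello', ',', 'world', '!']
print(len(tokens))  # 25

# Consecutive punctuation marks stay together as one token.
print(re.findall(PATTERN, "Wait... what?!"))
# ['Wait', '...', 'what', '?!']
```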


### 1.3 Treebank Tokenization
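Follows Penn Treebank conventions: it splits contractions (e.g. "don't" becomes "do" and "n't") and separates most punctuation. A period is split off only at the very end of the input, which is why `NLP.` stays a single token in the output below.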

In [5]:
# Treebank Tokenization
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)

print("Treebank Tokenization:")
print(treebank_tokens)
print(f"\nNumber of tokens: {len(treebank_tokens)}")

Treebank Tokenization:
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'sample', 'text', 'for', 'NLP.', 'It', 'contains', 'words', 'like', 'running', ',', 'flies', ',', 'better', ',', 'and', 'easily', '.']

Number of tokens: 24
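
The sample text contains no contractions, so the following sketch shows that behaviour on a made-up sentence. It also shows one way to get every sentence-final period split off: tokenize one sentence at a time. The sentence list here is written by hand as an assumption; a sentence splitter would normally produce it.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Contractions are split into a base word and a clitic.
contraction_tokens = tokenizer.tokenize("They don't know it's NLP.")
print(contraction_tokens)
# ['They', 'do', "n't", 'know', 'it', "'s", 'NLP', '.']

# Tokenizing sentence by sentence: each period is now at the end of
# its input string, so it becomes a separate token.
sentences = ["Hello, world!", "This is a sample text for NLP."]
tokens = [tok for sent in sentences for tok in tokenizer.tokenize(sent)]
print(tokens)
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'sample', 'text',
#  'for', 'NLP', '.']
```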


### 1.4 Tweet Tokenization
Designed for tweets: it keeps hashtags, mentions, emoticons, and URLs intact as single tokens.

In [6]:
# Tweet Tokenization
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(tweet_text)

print("Tweet Tokenization:")
print(tweet_tokens)
print(f"\nNumber of tokens: {len(tweet_tokens)}")

Tweet Tokenization:
['@NLPStudent', ':', 'Just', 'learned', 'about', '#NLP', 'and', '#MachineLearning', '!', '😊', 'Check', 'it', 'out', ':', 'https://example.com']

Number of tokens: 15
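
`TweetTokenizer` also takes a few constructor options that are useful for noisy text. The sketch below uses two of them, `strip_handles` and `reduce_len`, on the example tweet from the NLTK documentation (not from this assignment's data).

```python
from nltk.tokenize import TweetTokenizer

# Example tweet taken from the NLTK documentation for TweetTokenizer.
tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"

# strip_handles removes @mentions; reduce_len shortens runs of a
# repeated character to at most three.
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokens = tokenizer.tokenize(tweet)
print(tokens)
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']
```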


### 1.5 MWE (Multi-Word Expression) Tokenization
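Merges specified multi-word expressions into single tokens. Matching is exact and case-sensitive, so in the output below only `New York` is merged: `Machine` does not match `machine`, and `intelligence.` still carries its period because `split()` leaves punctuation attached.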

In [7]:
# MWE Tokenization
mwe_text = "New York is a beautiful city. Machine learning is part of artificial intelligence."
print("Original text:")
print(mwe_text)

# First tokenize with a basic tokenizer
basic_tokens = mwe_text.split()

# Create MWE tokenizer and add multi-word expressions
mwe_tokenizer = MWETokenizer([('New', 'York'), ('machine', 'learning'), ('artificial', 'intelligence')], separator='_')

# Tokenize
mwe_tokens = mwe_tokenizer.tokenize(basic_tokens)

print("\nMWE Tokenization:")
print(mwe_tokens)
print(f"\nNumber of tokens: {len(mwe_tokens)}")

Original text:
New York is a beautiful city. Machine learning is part of artificial intelligence.

MWE Tokenization:
['New_York', 'is', 'a', 'beautiful', 'city.', 'Machine', 'learning', 'is', 'part', 'of', 'artificial', 'intelligence.']

Number of tokens: 12
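
One possible way to get all three expressions merged is to normalize the tokens before passing them to `MWETokenizer`. The sketch below lowercases the text and splits off punctuation with `WordPunctTokenizer`; both choices are assumptions made for illustration, and lowercasing does lose the capitalization of `New York`.

```python
from nltk.tokenize import MWETokenizer, WordPunctTokenizer

mwe_text = ("New York is a beautiful city. "
            "Machine learning is part of artificial intelligence.")

# Lowercase and separate punctuation first, so the tokens can match
# the lowercase expressions registered below.
basic_tokens = WordPunctTokenizer().tokenize(mwe_text.lower())

mwe_tokenizer = MWETokenizer(
    [("new", "york"), ("machine", "learning"), ("artificial", "intelligence")],
    separator="_",
)
mwe_tokens = mwe_tokenizer.tokenize(basic_tokens)
print(mwe_tokens)
# ['new_york', 'is', 'a', 'beautiful', 'city', '.', 'machine_learning',
#  'is', 'part', 'of', 'artificial_intelligence', '.']
```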


## 2. Stemming

Stemming reduces words to a common stem by stripping suffixes with heuristic rules. The stem does not have to be a dictionary word (e.g. "easili" for "easily").

In [8]:
# Sample words for stemming
words = ['running', 'runs', 'ran', 'runner', 'easily', 'fairly', 'fairness', 
         'flying', 'flies', 'connection', 'connections', 'connected', 'connecting']

print("Words to be stemmed:")
print(words)

Words to be stemmed:
['running', 'runs', 'ran', 'runner', 'easily', 'fairly', 'fairness', 'flying', 'flies', 'connection', 'connections', 'connected', 'connecting']
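
To make "rule-based suffix stripping" concrete, here is a deliberately naive toy stemmer. It is not the Porter or Snowball algorithm: the rule list `RULES`, the `MIN_STEM` threshold, and the function `toy_stem` are all invented for this illustration.

```python
# Toy suffix rules, tried in order: (suffix, replacement).
RULES = [("ions", ""), ("ion", ""), ("ing", ""), ("ies", "i"), ("ed", ""), ("s", "")]
MIN_STEM = 2  # never strip a suffix if fewer than 2 characters would remain


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)] + replacement
    return word


for w in ["connections", "connected", "flies", "running", "ran"]:
    print(w, "->", toy_stem(w))
# connections -> connect
# connected -> connect
# flies -> fli
# running -> runn
# ran -> ran
```

The `runn` result shows why real stemmers carry extra rules, such as undoing doubled consonants, on top of plain suffix removal.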


### 2.1 Porter Stemmer
The Porter Stemmer is one of the most common stemming algorithms.

In [9]:
# Porter Stemmer
porter = PorterStemmer()

print("Porter Stemmer Results:")
print("-" * 50)
print(f"{'Original Word':<20} {'Stemmed Word':<20}")
print("-" * 50)

for word in words:
    stemmed = porter.stem(word)
    print(f"{word:<20} {stemmed:<20}")

Porter Stemmer Results:
--------------------------------------------------
Original Word        Stemmed Word        
--------------------------------------------------
running              run                 
runs                 run                 
ran                  ran                 
runner               runner              
easily               easili              
fairly               fairli              
fairness             fair                
flying               fli                 
flies                fli                 
connection           connect             
connections          connect             
connected            connect             
connecting           connect             


### 2.2 Snowball Stemmer
The Snowball Stemmer (Porter2) is an improved version of the Porter Stemmer and supports multiple languages.

In [10]:
# Snowball Stemmer
snowball = SnowballStemmer('english')

print("Snowball Stemmer Results:")
print("-" * 50)
print(f"{'Original Word':<20} {'Stemmed Word':<20}")
print("-" * 50)

for word in words:
    stemmed = snowball.stem(word)
    print(f"{word:<20} {stemmed:<20}")

Snowball Stemmer Results:
--------------------------------------------------
Original Word        Stemmed Word        
--------------------------------------------------
running              run                 
runs                 run                 
ran                  ran                 
runner               runner              
easily               easili              
fairly               fair                
fairness             fair                
flying               fli                 
flies                fli                 
connection           connect             
connections          connect             
connected            connect             
connecting           connect             
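
The two claims above can be checked directly. The sketch below lists the languages exposed by `SnowballStemmer.languages`, finds the words in a shortened sample list on which the two English stemmers disagree, and builds a stemmer for another language. The German word is an arbitrary example, so no output is shown for it.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Languages available in NLTK's Snowball implementation.
print(SnowballStemmer.languages)

# Words from the sample list on which Porter and Snowball disagree.
porter = PorterStemmer()
snowball = SnowballStemmer("english")
words = ["running", "easily", "fairly", "fairness", "flies", "connection"]
differences = [
    (w, porter.stem(w), snowball.stem(w))
    for w in words
    if porter.stem(w) != snowball.stem(w)
]
print(differences)  # [('fairly', 'fairli', 'fair')]

# A stemmer for another language is created the same way.
german = SnowballStemmer("german")
german_stem = german.stem("laufen")
print(german_stem)
```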


### 2.3 Comparison: Porter vs Snowball Stemmer

In [11]:
# Comparison of both stemmers
print("Comparison: Porter vs Snowball Stemmer")
print("-" * 70)
print(f"{'Original Word':<20} {'Porter':<20} {'Snowball':<20}")
print("-" * 70)

for word in words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    print(f"{word:<20} {porter_stem:<20} {snowball_stem:<20}")

Comparison: Porter vs Snowball Stemmer
----------------------------------------------------------------------
Original Word        Porter               Snowball            
----------------------------------------------------------------------
running              run                  run                 
runs                 run                  run                 
ran                  ran                  ran                 
runner               runner               runner              
easily               easili               easili              
fairly               fairli               fair                
fairness             fair                 fair                
flying               fli                  fli                 
flies                fli                  fli                 
connection           connect              connect             
connections          connect              connect             
connected            connect              connect             
connecting           connect              connect             


## 3. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma) using vocabulary and morphological analysis.

### 3.1 WordNet Lemmatizer
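Uses the WordNet database to look up the lemma of each word. Without a POS argument it treats every word as a noun, so verb forms such as "running" come back unchanged.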

In [12]:
# WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

print("WordNet Lemmatizer Results (default - noun):")
print("-" * 50)
print(f"{'Original Word':<20} {'Lemmatized Word':<20}")
print("-" * 50)

for word in words:
    lemma = lemmatizer.lemmatize(word)
    print(f"{word:<20} {lemma:<20}")

WordNet Lemmatizer Results (default - noun):
--------------------------------------------------
Original Word        Lemmatized Word     
--------------------------------------------------
running              running             
runs                 run                 
ran                  ran                 
runner               runner              
easily               easily              
fairly               fairly              
fairness             fairness            
flying               flying              
flies                fly                 
connection           connection          
connections          connection          
connected            connected           
connecting           connecting          


### 3.2 Lemmatization with POS Tags
The lemmatizer only finds the right lemma when it is given the matching Part of Speech (POS): `n` (noun), `v` (verb), `a` (adjective), or `r` (adverb).

In [13]:
# Lemmatization with different POS tags
test_words = ['running', 'runs', 'ran', 'better', 'good', 'best', 'worse', 'worst']

print("Lemmatization with Different POS Tags:")
print("-" * 90)
print(f"{'Word':<15} {'Noun':<15} {'Verb':<15} {'Adjective':<15} {'Adverb':<15}")
print("-" * 90)

for word in test_words:
    noun_lemma = lemmatizer.lemmatize(word, pos='n')
    verb_lemma = lemmatizer.lemmatize(word, pos='v')
    adj_lemma = lemmatizer.lemmatize(word, pos='a')
    adv_lemma = lemmatizer.lemmatize(word, pos='r')
    print(f"{word:<15} {noun_lemma:<15} {verb_lemma:<15} {adj_lemma:<15} {adv_lemma:<15}")

Lemmatization with Different POS Tags:
------------------------------------------------------------------------------------------
Word            Noun            Verb            Adjective       Adverb         
------------------------------------------------------------------------------------------
running         running         run             running         running        
runs            run             run             runs            runs           
ran             ran             run             ran             ran            
better          better          better          good            well           
good            good            good            good            good           
best            best            best            best            best           
worse           worse           worse           bad             worse          
worst           worst           worst           bad             worst          
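
In practice the POS is usually not typed in by hand but taken from a tagger. A minimal sketch, assuming Penn Treebank tags as returned by `nltk.pos_tag`: the helper names `penn_to_wordnet` and `lemmatize_tokens` are hypothetical, and the mapping is a common simplification rather than an official NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback for all other tags


def lemmatize_tokens(tokens):
    """Tag tokens with nltk.pos_tag, then lemmatize each with its mapped POS.

    Needs the tagger and WordNet data downloaded in the first cell.
    """
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [
        lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag))
        for tok, tag in nltk.pos_tag(tokens)
    ]


print([penn_to_wordnet(t) for t in ["NNS", "VBG", "JJR", "RB", "DT"]])
# ['n', 'v', 'a', 'r', 'n']
```

Calling `lemmatize_tokens(["The", "children", "were", "running"])` would tag and lemmatize in one pass. The result depends on the tagger's output, so none is shown here.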


## 4. Comparison: Stemming vs Lemmatization

In [14]:
# Comparison of Stemming vs Lemmatization
comparison_words = ['studies', 'studying', 'cries', 'crying', 'better', 'caring', 'leaves']

print("Stemming vs Lemmatization Comparison:")
print("-" * 90)
print(f"{'Original':<15} {'Porter':<15} {'Snowball':<15} {'Lemma (n)':<15} {'Lemma (v)':<15}")
print("-" * 90)

for word in comparison_words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    noun_lemma = lemmatizer.lemmatize(word, pos='n')
    verb_lemma = lemmatizer.lemmatize(word, pos='v')
    print(f"{word:<15} {porter_stem:<15} {snowball_stem:<15} {noun_lemma:<15} {verb_lemma:<15}")

Stemming vs Lemmatization Comparison:
------------------------------------------------------------------------------------------
Original        Porter          Snowball        Lemma (n)       Lemma (v)      
------------------------------------------------------------------------------------------
studies         studi           studi           study           study          
studying        studi           studi           studying        study          
cries           cri             cri             cry             cry            
crying          cri             cri             crying          cry            
better          better          better          better          better         
caring          care            care            caring          care           
leaves          leav            leav            leaf            leave          


## Conclusion

**Key Differences:**

### Tokenization Methods:
- **Whitespace**: Simple; splits only on whitespace (spaces, tabs, newlines), so punctuation stays attached to words
- **Punctuation-based**: Separates punctuation as individual tokens
- **Treebank**: Standard linguistic conventions, handles contractions
- **Tweet**: Specialized for social media content (hashtags, mentions, URLs, emojis)
- **MWE**: Treats multi-word expressions as single tokens

### Stemming vs Lemmatization:
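- **Stemming**: Faster, rule-based, may produce non-words (e.g., "studi" from "studies")
- **Lemmatization**: Slower, dictionary-based; returns dictionary forms (e.g., "study" from "studies"), but only when given the correct POS, and leaves words it cannot resolve unchanged
- **Porter vs Snowball**: Snowball (Porter2) refines Porter's rules (e.g., it gives "fair" for "fairly" where Porter gives "fairli") and supports multiple languages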