# **Student Training Program on AIML**
### MODULE: CLASSIFICATION-1
### LAB-2 : Using KNN for Text Classification


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP application: text classification. First, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be cleaned and converted into a form suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words (like images) cannot be fed directly to an algorithm; they must first be converted into numeric vectors, a form that algorithms can work with.
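The first two cleaning steps listed above (removing numbers, handling capitalization and punctuation) can be sketched with a single regular expression. This is a minimal illustration only; the function name `basic_clean` and the sample sentence are made up for this example and are not part of the lab code.

```python
import re

def basic_clean(text):
    """Replace every non-letter with a space, lowercase, and collapse whitespace."""
    text = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    text = text.lower()                       # normalizes capitalization
    return " ".join(text.split())             # collapses runs of spaces

# Hypothetical sample sentence, used only for illustration
cleaned = basic_clean("I watched 2 movies in 2024... LOVED them!")
print(cleaned)
```

Note that this keeps only ASCII letters, so accented characters would also be replaced by spaces.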



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\IC1807\AppData\Roaming\nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
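        return text_result

    # Neither lemmatization nor stemming requested: return the cleaned tokens as-is
    return word_tokenize(text)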

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [5]:
# Converts train and test documents to bag-of-words document-term matrices, with the option of removing stopwords.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a key weakness of Bag of Words: raw counts treat every word as equally important. TF-IDF instead weights each word by how informative it is, which often helps in text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
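The definitions above can be worked through by hand. The sketch below assumes the textbook form IDF = log(N / DF), where N is the number of documents; scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row by default, so its numbers will differ. The three tiny documents and the helper names are made up for illustration.

```python
import math

# Three hypothetical, already-tokenized documents
docs = [
    "good movie".split(),
    "bad movie".split(),
    "good good plot".split(),
]

def term_frequency(term, doc):
    """Number of times the term occurs in one document."""
    return doc.count(term)

def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def inverse_document_frequency(term, docs):
    """Textbook IDF: log(N / DF). Assumes the term occurs in at least one document."""
    return math.log(len(docs) / document_frequency(term, docs))

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# In the third document, "good" occurs twice but is common across the corpus,
# while "plot" occurs once but appears in no other document.
for term in ["good", "plot"]:
    print(term, round(tf_idf(term, docs[2], docs), 3))
```

Even though "good" occurs twice in the third document, "plot" ends up with the higher weight because it appears in only one document.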




In [6]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [8]:
# Load the Reviews CSV file directly from the project directory
# Make sure you have 'reviews.csv' file in the same folder as this notebook
import pandas as pd
print("Loading reviews.csv...")
df = pd.read_csv('reviews.csv')
print(f"Reviews dataset loaded successfully! Shape: {df.shape}")

Loading reviews.csv...
Reviews dataset loaded successfully! Shape: (999, 2)


In [9]:
# Display basic info about the reviews dataset
print("Reviews Dataset Info:")
print(f"Columns: {list(df.columns)}")
print(f"Shape: {df.shape}")
print("\nFirst few rows:")
df.head()

Reviews Dataset Info:
Columns: ['sentence', 'sentiment']
Shape: (999, 2)

First few rows:


Unnamed: 0,sentence,sentiment
0,Not sure who was more lost - the flat characte...,0
1,Attempting artiness with black & white and cle...,0
2,Very little music or anything to speak of.,0
3,The best scene in the movie was when Gerardo i...,1
4,"The rest of the movie lacks art, charm, meanin...",0


In [10]:
df = df.dropna()

In [11]:
df

Unnamed: 0,sentence,sentiment
0,Not sure who was more lost - the flat characte...,0
1,Attempting artiness with black & white and cle...,0
2,Very little music or anything to speak of.,0
3,The best scene in the movie was when Gerardo i...,1
4,"The rest of the movie lacks art, charm, meanin...",0
...,...,...
994,I just got bored watching Jessice Lange take h...,0
995,"Unfortunately, any virtue in this film's produ...",0
996,"In a word, it is embarrassing.",0
997,Exceptionally bad!,0


In [12]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
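The two models below differ mainly in their distance metric (Euclidean vs. cosine). As a rough sketch of why that matters for text, the example below compares both distances on three made-up count vectors; the vectors and helper names are assumptions for illustration only.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not on length."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word-count vectors over a 3-word vocabulary
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 3 * short_doc                 # same word proportions, three times longer
other_doc = np.array([0.0, 1.0, 2.0])    # different word proportions

print("Euclidean short vs long :", round(euclidean_distance(short_doc, long_doc), 3))
print("Euclidean short vs other:", round(euclidean_distance(short_doc, other_doc), 3))
print("Cosine    short vs long :", round(cosine_distance(short_doc, long_doc), 3))
print("Cosine    short vs other:", round(cosine_distance(short_doc, other_doc), 3))
```

Under Euclidean distance the longer copy of the same document looks farther away than the unrelated document, while cosine distance treats the two proportional documents as identical in direction. This is one reason cosine distance is a common choice for text vectors.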

In [13]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [14]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 64.3979057591623%
Cross Validation Accuracy: 0.65
[0.64313725 0.60392157 0.7007874 ]






In [15]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 71.72774869109948%
Cross Validation Accuracy: 0.73
[0.71764706 0.74509804 0.73622047]




# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [16]:
# Load the spam text data CSV file directly from the project directory
# Make sure you have 'spam.csv' file in the same folder as this notebook
# You can download it from: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
print("Loading spam.csv...")
df_spam = pd.read_csv('spam.csv')
print(f"Spam dataset loaded successfully! Shape: {df_spam.shape}")
df_spam

Loading spam.csv...
Spam dataset loaded successfully! Shape: (5572, 2)


Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [17]:
# Display basic info about the spam dataset  
print("Spam Dataset Info:")
print(f"Columns: {list(df_spam.columns)}")
print(f"Shape: {df_spam.shape}")
print("\nSample data:")
df_spam.head()

Spam Dataset Info:
Columns: ['Category', 'Message']
Shape: (5572, 2)

Sample data:


Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [18]:
# Map text categories to numerical values
df_spam['Category'] = df_spam['Category'].map({'ham': 0, 'spam': 1})
print("Category mapping completed:")
print("ham -> 0, spam -> 1")
print(f"\nCategory distribution:")
print(df_spam['Category'].value_counts())

Category mapping completed:
ham -> 0, spam -> 1

Category distribution:
Category
0    4825
1     747
Name: count, dtype: int64
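The class counts above show that the dataset is imbalanced, which matters when reading accuracy figures. As a small sanity check (a sketch using only the two counts printed above), the accuracy of a trivial classifier that always predicts the majority class "ham" can be computed directly:

```python
# Class counts taken from the value_counts() output above
ham_count, spam_count = 4825, 747
total = ham_count + spam_count

# Accuracy of always predicting the majority class
baseline_accuracy = max(ham_count, spam_count) / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any model on this dataset should be judged against this baseline rather than against 50%.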


In [19]:
# Display first 5 rows of the processed spam dataset
df_spam.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [20]:
# Check the total number of records in the spam dataset
print(f"Total number of messages in spam dataset: {len(df_spam)}")
len(df_spam)

Total number of messages in spam dataset: 5572


5572

In [21]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [22]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.9064603  0.89973082 0.91313131]






In [23]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html


# **Answer to Questions**

## **Question 1: Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?**

TF-IDF generally outperforms Bag-of-Words because it addresses BoW's key limitations by incorporating word importance weighting. While BoW only counts word occurrences (treating all words equally), TF-IDF uses two critical components:

• Term Frequency (TF): Measures how frequently a word appears in a document
• Inverse Document Frequency (IDF): Measures how rare/important a word is across the entire corpus

This approach reduces the weight of common words (like "the", "is", "and") that appear frequently across documents but carry little meaning, while boosting the weight of rare, discriminative words that are more likely to be topic-specific. In classification tasks, this means TF-IDF creates more meaningful feature vectors that capture the essence of document content rather than just word counts, leading to better separation between different classes and improved accuracy.

In [24]:
# Code Example: Demonstrating TF-IDF vs BoW Weight Differences
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

# Sample documents for comparison
sample_docs = [
    "The movie is very good and entertaining",
    "The movie is not good and boring", 
    "This film is excellent and amazing",
    "The the the is is is and and and"  # Document with many common words
]

# Create BoW representation
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(sample_docs)
bow_feature_names = bow_vectorizer.get_feature_names_out()

# Create TF-IDF representation  
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(sample_docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Compare weights for the last document (lots of common words)
print("=== Document 4: 'The the the is is is and and and' ===")
print("\nBoW weights (raw counts):")
doc4_bow = bow_matrix[3].toarray()[0]
for i, word in enumerate(bow_feature_names):
    if doc4_bow[i] > 0:
        print(f"'{word}': {doc4_bow[i]}")

print("\nTF-IDF weights (importance-adjusted):")
doc4_tfidf = tfidf_matrix[3].toarray()[0]
for i, word in enumerate(tfidf_feature_names):
    if doc4_tfidf[i] > 0:
        print(f"'{word}': {doc4_tfidf[i]:.4f}")

print("\n=== Analysis ===")
print("BoW gives 'the', 'is' and 'and' the same raw count of 3")
print("TF-IDF weights are L2-normalized; 'is' and 'and' occur in all 4 documents, so they score lower than 'the' (3 of 4 documents)")
print("Words that appear in more documents are down-weighted relative to rarer words")

=== Document 4: 'The the the is is is and and and' ===

BoW weights (raw counts):
'and': 3
'is': 3
'the': 3

TF-IDF weights (importance-adjusted):
'and': 0.5348
'is': 0.5348
'the': 0.6542

=== Analysis ===
BoW gives 'the', 'is' and 'and' the same raw count of 3
TF-IDF weights are L2-normalized; 'is' and 'and' occur in all 4 documents, so they score lower than 'the' (3 of 4 documents)
Words that appear in more documents are down-weighted relative to rarer words


## **Question 2: Can you think of techniques that are better than both BoW and TF-IDF?**

Yes, several modern techniques outperform both BoW and TF-IDF by addressing their fundamental limitation: lack of semantic understanding. The most significant improvements come from:

• Word Embeddings (Word2Vec, GloVe): Create dense vector representations that capture semantic relationships between words (e.g., "king" - "man" + "woman" ≈ "queen")
• Contextualized Embeddings (BERT, GPT, RoBERTa): Generate different representations for the same word based on context (e.g., "bank" in "river bank" vs "money bank")
• N-grams with Neural Networks: Capture local word order and phrase meanings that BoW/TF-IDF completely ignore
• Doc2Vec/Paragraph Vectors: Learn document-level representations that consider word order and document structure
• Transformer-based Models: Use attention mechanisms to understand long-range dependencies and complex linguistic patterns

These techniques achieve better results because they understand word relationships, context, and meaning rather than just treating text as isolated word counts or weighted frequencies.

In [25]:
# Code Example: N-grams (Better than BoW/TF-IDF) - Capturing Word Order
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Example sentences with different meanings but similar words
sentences = [
    "The movie is not good",        # Negative sentiment
    "The movie is very good",       # Positive sentiment  
    "Good movie, not bad at all"    # Positive sentiment
]

print("=== Comparing BoW vs N-grams for capturing meaning ===\n")

# 1. Standard BoW (ignores word order)
bow_vectorizer = CountVectorizer()
bow_matrix = bow_vectorizer.fit_transform(sentences)
print("BoW Features:", bow_vectorizer.get_feature_names_out())
print("BoW Vectors:")
for i, sent in enumerate(sentences):
    print(f"Sentence {i+1}: {bow_matrix[i].toarray()[0]}")

# Calculate similarity between sentences 1 and 2 using BoW
bow_sim = cosine_similarity(bow_matrix[0:1], bow_matrix[1:2])[0][0]
print(f"BoW Similarity between 'not good' and 'very good': {bow_sim:.3f}")

print("\n" + "="*60 + "\n")

# 2. N-grams (captures word order and phrases)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))  # Unigrams + Bigrams
ngram_matrix = ngram_vectorizer.fit_transform(sentences)
print("N-gram Features (1-gram + 2-gram):", ngram_vectorizer.get_feature_names_out())
print("N-gram Vectors:")
for i, sent in enumerate(sentences):
    print(f"Sentence {i+1}: {ngram_matrix[i].toarray()[0]}")

# Calculate similarity between sentences 1 and 2 using N-grams
ngram_sim = cosine_similarity(ngram_matrix[0:1], ngram_matrix[1:2])[0][0]
print(f"N-gram Similarity between 'not good' and 'very good': {ngram_sim:.3f}")

print(f"\n=== Key Insight ===")
print("N-grams capture phrases like 'not good' vs 'very good' as separate features")
print("This leads to better understanding of negation and context!")
print("Modern techniques like BERT/GPT do this even better with deep learning.")

=== Comparing BoW vs N-grams for capturing meaning ===

BoW Features: ['all' 'at' 'bad' 'good' 'is' 'movie' 'not' 'the' 'very']
BoW Vectors:
Sentence 1: [0 0 0 1 1 1 1 1 0]
Sentence 2: [0 0 0 1 1 1 0 1 1]
Sentence 3: [1 1 1 1 0 1 1 0 0]
BoW Similarity between 'not good' and 'very good': 0.800


N-gram Features (1-gram + 2-gram): ['all' 'at' 'at all' 'bad' 'bad at' 'good' 'good movie' 'is' 'is not'
 'is very' 'movie' 'movie is' 'movie not' 'not' 'not bad' 'not good' 'the'
 'the movie' 'very' 'very good']
N-gram Vectors:
Sentence 1: [0 0 0 0 0 1 0 1 1 0 1 1 0 1 0 1 1 1 0 0]
Sentence 2: [0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 1 1 1 1]
Sentence 3: [1 1 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0]
N-gram Similarity between 'not good' and 'very good': 0.667

=== Key Insight ===
N-grams capture phrases like 'not good' vs 'very good' as separate features
This leads to better understanding of negation and context!
Modern techniques like BERT/GPT do this even better with deep learning.


## **Question 3: Pros and Cons of Stemming vs Lemmatization**

Both stemming and lemmatization aim to reduce words to their root forms, but they differ significantly in approach and outcomes:

### Stemming:
• Pros: Fast and computationally efficient, works well for IR/search tasks, simple rule-based approach, reduces vocabulary size effectively
• Cons: Crude heuristic approach, can produce non-dictionary words (e.g., "easily" → "easili"), may over-stem or under-stem, ignores word context and meaning

### Lemmatization:  
• Pros: Produces actual dictionary words (lemmas), considers word context and part-of-speech, more linguistically accurate, better for semantic analysis
• Cons: Computationally expensive, requires full vocabulary and morphological analysis, slower processing, may not improve retrieval performance significantly

### Key Insight: 
Stemming prioritizes speed and recall (finding more matches), while lemmatization prioritizes accuracy and precision (finding correct matches). The choice depends on your specific application needs.

In [26]:
# Code Example: Stemming vs Lemmatization Comparison
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
import time

# Sample words to demonstrate differences
test_words = [
    "running", "ran", "runs", "easily", "fairly", 
    "better", "good", "dogs", "feet", "geese",
    "organizing", "organized", "organization",
    "am", "are", "is", "was", "were", "being"
]

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Helper function to get POS tag for lemmatization
def get_wordnet_pos(word):
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

print("=== STEMMING vs LEMMATIZATION COMPARISON ===\n")
print(f"{'Original':<15} {'Stemmed':<15} {'Lemmatized':<15} {'Notes'}")
print("-" * 70)

# Compare each word
for word in test_words:
    # Stemming (fast, rule-based)
    stemmed = stemmer.stem(word)
    
    # Lemmatization (slower, context-aware)
    lemmatized = lemmatizer.lemmatize(word, get_wordnet_pos(word))
    
    # Determine if results differ
    if stemmed == lemmatized:
        note = "Same result"
    elif stemmed in ["better", "good"] or lemmatized in ["better", "good"]:
        note = "Lemma more accurate"
    elif len(stemmed) < len(lemmatized):
        note = "Stem more aggressive"
    else:
        note = "Different approaches"
    
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15} {note}")

# Performance comparison
print(f"\n=== PERFORMANCE COMPARISON ===")
test_text = " ".join(test_words * 1000)  # Large text for timing

# Time stemming
start_time = time.time()
stemmed_results = [stemmer.stem(word) for word in word_tokenize(test_text)]
stem_time = time.time() - start_time

# Time lemmatization  
start_time = time.time()
lemma_results = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in word_tokenize(test_text)]
lemma_time = time.time() - start_time

print(f"Stemming time: {stem_time:.4f} seconds")
print(f"Lemmatization time: {lemma_time:.4f} seconds") 
print(f"Lemmatization is {lemma_time/stem_time:.1f}x slower than stemming")

=== STEMMING vs LEMMATIZATION COMPARISON ===

Original        Stemmed         Lemmatized      Notes
----------------------------------------------------------------------
running         run             run             Same result
ran             ran             ran             Same result
runs            run             run             Same result
easily          easili          easily          Different approaches
fairly          fairli          fairly          Different approaches
better          better          well            Lemma more accurate
good            good            good            Same result
dogs            dog             dog             Same result
feet            feet            foot            Different approaches
geese           gees            geese           Stem more aggressive
organizing      organ           organize        Stem more aggressive
organized       organ           organize        Stem more aggressive
organization    organ           organization   