<a href="https://colab.research.google.com/github/005sudha/005sudha-fmml-lab/blob/main/Mod3_Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will be using KNN on a real world NLP application i.e. is text classification. But first look at some NLP techniques for text classification and tools that we use when we want to use python for NLP.

## Section 1.2: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors- a form that algorithms can understand.



### **NLTK**
NLTK (or Natural Language Tool Kit) is a commonly used library for processing text. We will use this tool in this lab. Lets first install it.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF technique is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers.

The number of times a term occurs in a document is called its Term frequency (TF).

 Document frequency is the number of documents in which the word is present.  Inverse DF (IDF) is the inverse of the document frequency which measures the informativeness of term *t*.




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?
The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy compared to the Bag-of-Words (BoW) model because it addresses some of the key limitations of BoW by considering not just the presence or frequency of words but also their relative importance in a corpus. Here are the main reasons for its improved performance:


---

1. Importance of Rare Words

Bag-of-Words: Treats all words equally, regardless of how common they are in the corpus. Frequently occurring words (e.g., "the," "is") dominate the feature space, even though they contribute little to the meaning of the document.

TF-IDF: Weighs terms based on their frequency in a document (TF) and their rarity across the corpus (IDF). Rare but meaningful words are given more weight, improving the representation of documents.



---

2. Reduction of Noise from Common Words

Bag-of-Words: Often includes many stopwords and highly frequent terms that don't differentiate between documents, which can act as noise.

TF-IDF: Reduces the influence of such common terms by assigning them a low weight, as their IDF value is small.



---

3. Contextual Relevance

Bag-of-Words: Simply counts occurrences without considering how significant a term is within a particular document relative to the entire corpus.

TF-IDF: Captures the importance of a term in a specific document compared to its general usage, thus better reflecting the document's unique content.



---

4. Sparsity and Dimensionality

Both models often produce sparse feature matrices, but TF-IDF can better distribute feature importance across terms, leading to a more meaningful and nuanced feature space, which often benefits downstream models.



---

5. Better Discrimination Between Documents

Bag-of-Words: May fail to distinguish between documents if they share many common words, even if their meanings differ.

TF-IDF: Helps distinguish documents by emphasizing unique or less frequent terms, which often carry more semantic weight.



---

6. Improved Generalization in Models

Machine learning models trained on TF-IDF features tend to generalize better because they focus on terms that contribute more meaningfully to document distinctions, while Bag-of-Words often leads to overfitting on irrelevant terms.



---

Limitations of TF-IDF

While TF-IDF often outperforms BoW, it isn't perfect:

It doesn't consider word order, semantics, or relationships between words.

Rare terms might be overemphasized if they're irrelevant or noise.

Modern techniques like word embeddings (e.g., Word2Vec, BERT) have started outperforming both TF-IDF and BoW in many applications.


However, for many traditional text classification and retrieval tasks, TF-IDF strikes an effective balance between simplicity and performance.


2. Can you think of techniques that are better than both BoW and TF-IDF ?
Yes, there are several techniques that outperform Bag-of-Words (BoW) and TF-IDF by capturing more semantic and contextual information in text data. These techniques leverage advances in machine learning, natural language processing (NLP), and deep learning. Here are some prominent ones:


---

1. Word Embeddings

Description: Word embeddings map words into dense, continuous vector spaces where semantically similar words are closer to each other. They capture semantic relationships and often outperform sparse representations like BoW and TF-IDF.

Examples:

Word2Vec (e.g., Skip-gram, CBOW)

GloVe (Global Vectors for Word Representation)

FastText (captures subword information, making it effective for rare or misspelled words)


Why better:

Embeddings capture contextual and semantic meaning beyond term frequency.

They are dense and lower-dimensional, making them computationally efficient.




---

2. Contextual Word Embeddings

Description: These embeddings represent words based on their meaning in a specific context, unlike static embeddings (e.g., Word2Vec), where a word has the same vector regardless of its usage.

Examples:

ELMo (Embeddings from Language Models): Contextual representations based on entire sentences.

BERT (Bidirectional Encoder Representations from Transformers): Pre-trained transformer model that generates embeddings based on both left and right context.

GPT (Generative Pre-trained Transformer): Focuses on language modeling with excellent text generation and understanding capabilities.


Why better:

They capture nuanced meanings, such as polysemy (e.g., "bank" as a financial institution vs. riverbank).

These models leverage massive pretraining on large corpora, making them highly effective across a variety of tasks.




---

3. Transformer-based Models

Description: Transformers use self-attention mechanisms to model relationships between words in a sequence, capturing long-range dependencies and contextual meaning.

Examples:

BERT, RoBERTa, DistilBERT, GPT, T5

Fine-tuned versions tailored for specific tasks (e.g., text classification, sentiment analysis).


Why better:

They encode entire sentences or paragraphs rather than treating words in isolation.

Excellent at capturing complex relationships and syntactic structure.




---

4. Topic Modeling Techniques

Description: Topic models uncover latent themes in a corpus, representing documents as distributions over topics and topics as distributions over words.

Examples:

Latent Dirichlet Allocation (LDA)

Non-Negative Matrix Factorization (NMF)


Why better:

These techniques go beyond individual word analysis by clustering semantically related terms, enabling a richer understanding of document content.




---

5. Sentence and Document Embeddings

Description: Instead of word-level embeddings, sentence or document embeddings provide a single dense vector representation of entire sentences or documents.

Examples:

Universal Sentence Encoder (USE)

InferSent

Sentence-BERT (SBERT): Adapts BERT for sentence similarity and embedding tasks.


Why better:

They capture the overall meaning of a sentence or document, which is useful for tasks like similarity analysis, classification, or clustering.




---

6. Neural Networks with Pretrained Word Embeddings

Description: Techniques like convolutional neural networks (CNNs), recurrent neural networks (RNNs), and transformers can be used with pretrained embeddings.

Examples:

CNNs for text classification.

RNNs and LSTMs/GRUs for sequence modeling.


Why better:

They can learn hierarchical features and contextual dependencies in a task-specific manner.




---

7. Latent Semantic Analysis (LSA)

Description: A dimensionality reduction technique using Singular Value Decomposition (SVD) on the term-document matrix to capture latent semantic structures.

Why better:

Reduces noise and highlights underlying patterns in text data.

Captures relationships between terms and documents that BoW and TF-IDF miss.




---

Comparison


---

In practice, techniques like transformer-based models (e.g., BERT) and sentence embeddings (e.g., SBERT) have largely replaced traditional approaches in applications requiring high accuracy and deep understanding of text. However, simpler methods like TF-IDF or BoW are still useful in resource-constrained environments or for straightforward tasks.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
Stemming and Lemmatization: An Overview

Stemming and lemmatization are text preprocessing techniques used to reduce words to their base or root form. While they are similar in purpose, they differ in methods and outcomes.


---

Stemming

Definition:
Stemming is a rule-based process of stripping suffixes and prefixes (affixes) from words to reduce them to their "stem," often without considering the word's meaning. It is a heuristic method and may produce non-linguistic or incomplete root forms.

Example:

Words: "running," "runner," "runs"

Stemmed Result: "run"


Popular Algorithms:

Porter Stemmer: Widely used, simple but effective.

Snowball Stemmer: An enhanced version of the Porter Stemmer.

Lancaster Stemmer: More aggressive and can over-stem words.


Pros:

1. Speed: Stemming is computationally faster as it uses simple, rule-based heuristics.


2. Simplicity: Easy to implement with minimal language understanding required.


3. Reduces Vocabulary Size: Useful for search engines or applications where precise semantics aren't critical.



Cons:

1. Inaccuracy: Can over-stem (e.g., "university" → "univers") or under-stem (e.g., "running" → "run" but misses "ran").


2. Ignores Context: Produces stems that may not be valid words or reflect correct meanings.


3. Language Dependency: Rule sets must be customized for different languages.




---

Lemmatization

Definition:
Lemmatization reduces words to their base or dictionary form (lemma) using linguistic analysis, including part-of-speech (POS) tagging and vocabulary knowledge. It ensures that the root form is meaningful and grammatically correct.

Example:

Words: "running" (verb), "better" (adjective)

Lemmatized Result: "run" (verb), "good" (adjective)


Popular Tools:

WordNet Lemmatizer (NLTK): Leverages WordNet to identify base forms.

SpaCy Lemmatizer: Uses linguistic rules for more accurate results.


Pros:

1. Accuracy: Produces valid dictionary words by considering grammatical context.


2. Semantics-Aware: Understands relationships between words, ensuring correct base forms (e.g., "better" → "good").


3. POS Tagging Integration: Handles complex linguistic scenarios more effectively.



Cons:

1. Slower: Computationally more expensive due to linguistic analysis and dependency on external resources (e.g., dictionaries).


2. Implementation Complexity: Requires POS tagging and may depend on large datasets or prebuilt tools.


3. Language-Specific: Needs separate implementations and dictionaries for different languages.




---

Comparison: Stemming vs. Lemmatization


---

When to Use Which?

1. Use Stemming If:

You need speed and simplicity (e.g., quick text classification, search engines).

Precision in word form isn't critical.



2. Use Lemmatization If:

You need accurate, meaningful base forms (e.g., chatbot development, semantic analysis).

The application requires understanding word context or grammatical relationships.




Conclusion

Both stemming and lemmatization have their place in text preprocessing, and the choice depends on the application's requirements. For modern NLP tasks, lemmatization is generally preferred due to its accuracy, especially when combined with advanced techniques like word embeddings or transformer-based models. However, stemming remains a practical choice for speed-critical or resource-constrained tasks.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
