<a href="https://colab.research.google.com/github/Ashwinidurga7/fmml_labs/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let us look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
For text, several things need to be taken into account:


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.

Most importantly, algorithms cannot consume words (or images) directly; they must first be converted into vectors, a numeric form that algorithms can work with.
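
As a minimal sketch of the first two cleaning steps (the helper name `basic_clean` is hypothetical, and this is a simplification of the fuller `cleanText` function used later in this lab), numbers and punctuation can be dropped with a regular expression before lowercasing:

```python
import re

def basic_clean(text):
    """Keep letters only, lowercase the text, and split it on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # numbers and punctuation become spaces
    return letters_only.lower().split()

tokens = basic_clean("Room 42 was GREAT, really!")
print(tokens)  # ['room', 'was', 'great', 'really']
```

Stemming or lemmatization would then be applied to each token in the resulting list.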



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [1]:
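import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Newer NLTK releases load 'punkt_tab' in word_tokenize; without it, cleanText raises a LookupError.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
# Assumption: newer NLTK releases may also look for this tagger resource in pos_tag.
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')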

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Clean text from train and test data: strip HTML, numbers, punctuation, and capitalization,
    then lemmatize or stem. Returns a list of tokens."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # unknown POS tag: fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

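    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged instead of None.
    return word_tokenize(text)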

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where they occur.
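
To make the idea concrete, here is a small pure-Python sketch, assuming whitespace tokenization and three made-up sentences (the helper `to_count_vector` is hypothetical). `CountVectorizer`, used in the next cell, does the same kind of counting with more careful tokenization:

```python
from collections import Counter

docs = ["the food was good", "the service was not good", "good good food"]

# The vocabulary is the sorted set of all words seen in the documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

bow = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['food', 'good', 'not', 'service', 'the', 'was']
print(bow[2])      # [1, 2, 0, 0, 0, 0]

# Word order is discarded: a shuffled sentence gets the same vector.
print(to_count_vector("good was food the", vocabulary) == bow[0])  # True
```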

In [None]:
# Convert train and test documents to bag-of-words count vectors, with the option of removing stopwords.
# Returns the document-term matrices for train and test.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test
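
Note that the function above fits the vocabulary on the training documents only and then reuses it for the test documents. A minimal sketch of what that implies, assuming whitespace tokenization and made-up sentences (the helper `transform` is hypothetical):

```python
train_docs = ["good food", "bad service"]
test_docs = ["good service", "terrible food"]

# "Fit": the vocabulary comes from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc):
    """Count vocabulary words in `doc`; words outside the vocabulary are ignored."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

test_vectors = [transform(doc) for doc in test_docs]

print(vocabulary)    # ['bad', 'food', 'good', 'service']
print(test_vectors)  # [[0, 0, 1, 1], [0, 1, 0, 0]]
```

The word "terrible" never appeared in training, so it leaves no trace in the test vector. This keeps train and test vectors the same length and avoids leaking test vocabulary into the model.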


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how informative it is. This addresses a weakness of plain Bag-of-Words, where raw counts let very common words dominate the representation, and makes TF-IDF well suited to text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the (usually log-scaled) inverse of the document frequency; it measures the informativeness of term *t*: the rarer a term is across documents, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
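
The definitions above can be sketched in a few lines. This assumes one common textbook variant (raw counts for TF and `log(N / DF)` for IDF) on three made-up tokenized documents. `TfidfVectorizer`, used in the next cell, applies its own smoothing and normalization, so its numbers will differ:

```python
import math
from collections import Counter

docs = [["good", "food"], ["good", "service"], ["bad", "food", "bad"]]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

# Inverse document frequency: rarer words get a larger weight.
idf = {word: math.log(n_docs / count) for word, count in df.items()}

def tfidf(doc):
    """TF-IDF weight of every word in one tokenized document."""
    tf = Counter(doc)
    return {word: tf[word] * idf[word] for word in tf}

weights = tfidf(docs[2])
print(weights)  # "bad" (frequent here, rare elsewhere) outweighs "food"
```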




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
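
Before looking at the scikit-learn models, it may help to see the core of KNN prediction written out. This is a simplified sketch with uniform voting and made-up count vectors (the function `knn_predict` and the toy vocabulary are hypothetical), not the library's implementation:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3, distance=euclidean):
    """Label `query` by majority vote among its k nearest training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy count vectors over a hypothetical vocabulary ["bad", "good", "great"].
train_vectors = [[0, 2, 1], [0, 1, 1], [0, 1, 0], [2, 0, 0], [1, 0, 0]]
train_labels = ["positive", "positive", "positive", "negative", "negative"]

print(knn_predict(train_vectors, train_labels, [0, 1, 2]))  # positive
print(knn_predict(train_vectors, train_labels, [1, 0, 0]))  # negative
```

Swapping the `distance` argument changes which neighbours count as "nearest", which is the role of the `metric` parameter in the models below.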

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
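
Until then, a rough sketch of the idea behind `cv=3`: the training set is split into three folds, and each fold takes a turn as the validation set while the model is fitted on the rest. The helper `k_fold_indices` below is hypothetical and ignores shuffling and class balance, which scikit-learn's splitters may handle differently:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample is used for validation exactly once, and the reported cross-validation accuracy is the mean of the per-fold scores.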

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()
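
For Task 1, it can help to see how the two metrics used above differ on text vectors. The sketch below uses made-up count vectors: a short review, a longer review with the same word proportions, and an unrelated review.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_review = [1, 1, 0]
long_review = [3, 3, 0]   # same word proportions, three times longer
other_review = [0, 1, 1]  # shares only one word with the short review

print(euclidean(short_review, long_review), euclidean(short_review, other_review))
print(cosine_distance(short_review, long_review), cosine_distance(short_review, other_review))
```

Euclidean distance places the unrelated review closer to the short one, simply because their lengths are similar. Cosine distance ignores length and treats the two proportional reviews as (almost exactly) identical, which is one reason it is a common choice for text.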

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
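# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Note: the code below assumes columns named 'Category' and 'Message'; a copy downloaded elsewhere may use different column names or encoding.
# Run this cell, click on the 'Choose files' button and upload the file.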
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

Answer:
TF-IDF (Term Frequency-Inverse Document Frequency) generally results in better accuracy than the Bag-of-Words (BoW) approach because it introduces weighting to the terms, which helps in distinguishing important words from less significant ones. Here's why:

1. Focus on Relevant Terms

Bag-of-Words: Treats all words equally, regardless of their importance or frequency. Common words like "the," "is," or "and" might dominate the representation, even if they contribute little to distinguishing between documents.

TF-IDF: Reduces the influence of common terms by assigning lower weights to words that appear frequently across many documents (via the Inverse Document Frequency component). This ensures rare but significant words have a greater impact.


2. Better Representation of Term Importance

TF-IDF considers both:

Term Frequency (TF): How often a word appears in a document, capturing its importance within that specific context.

Inverse Document Frequency (IDF): How unique or rare the word is across all documents, reducing the weight of ubiquitous terms.


This dual consideration allows TF-IDF to better represent the semantic relevance of terms to a document.


3. Improved Differentiation Between Documents

By emphasizing distinctive terms, TF-IDF improves the ability to separate documents into different categories or clusters. This is especially useful in tasks like text classification or clustering.


4. Reduced Noise from Common Words

Since common words (which are often noise) are downweighted in TF-IDF, it reduces the impact of irrelevant dimensions in the feature space, leading to better performance in machine learning models.


5. Better Suitability for Sparse Data

TF-IDF representations are typically more informative in sparse datasets because they provide a normalized measure of term importance, making it easier for models to identify patterns.


Trade-offs

However, while TF-IDF is generally more effective than BoW, it has some limitations:

TF-IDF doesn't capture word order or context (similar to BoW).

It might not perform well for very short texts, where every word might seem important.

More advanced models like Word Embeddings (e.g., Word2Vec, GloVe) or Transformers (e.g., BERT) often outperform TF-IDF by capturing contextual information.


In summary, the added weighting and normalization in TF-IDF often result in improved accuracy because it better captures the relevance of terms for a given document or task.


2. Can you think of techniques that are better than both BoW and TF-IDF?

Answer: Yes, there are several advanced techniques that often outperform both Bag-of-Words (BoW) and TF-IDF by capturing semantic meaning, word relationships, and context. These methods leverage advancements in natural language processing (NLP) and are particularly effective for complex tasks like text classification, clustering, and sentiment analysis. Here are some key techniques:


---

1. Word Embeddings (Dense Vector Representations)

Examples: Word2Vec, GloVe, FastText

Advantages:

Represent words as dense vectors in a continuous space, capturing semantic relationships.

Words with similar meanings are close to each other in the vector space.

Can encode syntactic and semantic properties of words.

FastText can handle out-of-vocabulary words by representing them as subword units.


Limitations: Word embeddings are static, meaning they don’t account for context or polysemy (e.g., "bank" as a financial institution vs. a riverbank).
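
A toy illustration of "similar words are close in the vector space". The three-dimensional vectors below are hand-made for this example and are not real embeddings; trained models such as Word2Vec or GloVe learn vectors with far more dimensions from large corpora.

```python
import math

# Hypothetical, hand-made vectors for illustration only.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bank":  [0.0, 0.3, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_good_great = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_good_bank = cosine_similarity(toy_embeddings["good"], toy_embeddings["bank"])
print(sim_good_great > sim_good_bank)  # True
```

The lookup table also shows the limitation named above: "bank" has exactly one vector, whichever sentence it appears in.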



---

2. Contextualized Word Embeddings

Examples: BERT, GPT, ELMo, RoBERTa

Advantages:

Capture the meaning of words in their context, allowing for better handling of polysemy.

Most (BERT, GPT, RoBERTa) are based on transformer architectures, which excel at capturing long-range dependencies in text; ELMo instead uses bidirectional LSTMs.

Pretrained on large datasets, providing strong performance even with limited task-specific data.


Limitations: Require significant computational resources for training and inference.



---

3. Sentence and Document Embeddings

Examples: Universal Sentence Encoder (USE), InferSent, Sentence-BERT

Advantages:

Represent entire sentences or documents as single vectors.

Useful for tasks like sentence similarity, document clustering, and semantic search.

Often built on top of contextualized word embeddings.


Limitations: May require fine-tuning for domain-specific tasks.



---

4. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)

Advantages:

LSA captures relationships between terms by reducing the dimensionality of BoW/TF-IDF representations using Singular Value Decomposition (SVD).

LDA models documents as mixtures of topics, making it effective for topic modeling.


Limitations:

LSA and LDA are less effective at capturing semantic relationships compared to neural embeddings.

Performance can degrade on very large or very small datasets.
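
The SVD step behind LSA can be sketched with NumPy on a tiny, made-up document-term matrix. This is an illustration of the truncation idea only; the variable names are hypothetical, and real pipelines typically work on sparse TF-IDF matrices.

```python
import numpy as np

# Rows are documents, columns are terms (toy counts).
# Documents 0-1 use the first two terms, documents 2-3 use the last two.
doc_term = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

U, singular_values, Vt = np.linalg.svd(doc_term, full_matrices=False)

k = 2  # number of latent dimensions to keep
doc_topics = U[:, :k] * singular_values[:k]  # each document in k latent dimensions
approximation = doc_topics @ Vt[:k]          # rank-k reconstruction of the matrix

print(singular_values)
print(np.round(approximation, 2))
```

After truncation, documents 0 and 1 end up with the same latent representation even though their raw counts differ, which is how LSA groups documents that use related terms.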




---

5. Neural Networks for Text (Recurrent and Convolutional)

Examples: RNNs (e.g., LSTM, GRU), CNNs for text

Advantages:

Capture sequential patterns and context in text.

LSTMs and GRUs handle longer-range dependencies than plain RNNs, while CNNs can identify key phrases.


Limitations: Require large labeled datasets and are less efficient compared to transformers.



---

6. Transformers for Sequence-to-Sequence Tasks

Examples: T5 (Text-to-Text Transfer Transformer), GPT series

Advantages:

Flexible architectures capable of handling summarization, question answering, and machine translation.

Can generate text and perform tasks requiring a deep understanding of context.


Limitations: High computational cost and complexity.



---

7. Hybrid Approaches

Combine techniques for better results:

Use TF-IDF with embeddings for enhanced interpretability.

Fine-tune BERT or GPT models on domain-specific data.




---

Choosing the Best Approach

The choice of technique depends on the task, dataset size, and computational resources:

Small datasets: Pretrained embeddings (e.g., Word2Vec, BERT).

Large datasets: Fine-tuning transformers or training embeddings from scratch.

Interpretability required: TF-IDF, LDA.

Real-time applications: Sentence embeddings or FastText.


These advanced techniques often outperform BoW and TF-IDF by capturing richer linguistic and contextual information.



3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

Answer:
Stemming and Lemmatization

1. Stemming

Stemming is a text normalization technique that reduces words to their root form by chopping off suffixes (e.g., "playing" → "play", "studies" → "studi"). It uses simple heuristics and does not consider the word's context or grammar.

Examples:

"running" → "run"

"better" → "better" (no change in some cases)

"cares" → "care"




---

Pros:

1. Computationally Efficient:

Stemming algorithms like Porter or Snowball are fast and simple, making them suitable for large datasets.



2. Effective in Some Applications:

Works well when exact root words aren't required (e.g., search engines where approximate matching suffices).




Cons:

1. Inaccuracy:

Results may not always be meaningful. For instance, "university" may stem to "univers," which is not a valid word.



2. Loss of Context:

Stemming does not consider the grammatical role of the word or its context, leading to less precise results.



3. Over-stemming or Under-stemming:

Over-stemming: Reducing words too aggressively (e.g., "generous" → "gener").

Under-stemming: Failing to reduce related words to the same root (e.g., "data" and "datum").





---

2. Lemmatization

Lemmatization reduces words to their base or dictionary form (lemma) while considering the word's context and grammar. It requires understanding the part of speech (POS) to correctly normalize words.

Examples:

"running" → "run"

"better" → "good" (accounts for comparative/superlative forms)

"studies" → "study"




---

Pros:

1. Accuracy:

Produces grammatically valid words and considers the word's meaning and part of speech, ensuring more meaningful results.



2. Context-Aware:

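Uses the part of speech to pick the right lemma: with POS tagging, "meeting" stays "meeting" as a noun but becomes "meet" as a verb. It does not, however, resolve word senses such as "bank" (financial institution) vs. "bank" (of a river), which share a part of speech.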



3. Semantic Preservation:

Preserves the original meaning of the text better than stemming.




Cons:

1. Computationally Expensive:

Lemmatization is slower and requires more resources due to its reliance on dictionaries and linguistic rules.



2. Complex Implementation:

Needs additional tools like POS taggers to determine the correct lemma.



3. Dependency on Quality of Dictionaries:

The quality of results depends on the completeness and accuracy of the dictionary used.





---

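Comparison Table

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Heuristic suffix stripping | Dictionary lookup guided by part of speech |
| Output | May not be a valid word (e.g., "studies" → "studi") | Valid dictionary form (e.g., "studies" → "study") |
| Uses grammar/POS | No | Yes |
| Speed | Fast | Slower |
| Extra tooling | None | POS tagger and dictionary |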


---

Choosing Between Stemming and Lemmatization

Use Stemming when:

Speed is critical.

Approximate matching is acceptable.

The application is simple, such as basic text search.


Use Lemmatization when:

Context and grammatical accuracy are important.

Semantic understanding is required (e.g., text classification, sentiment analysis).

The dataset is smaller, and computational resources are available.



Each method has its strengths and weaknesses, and the choice depends on the specific requirements of your task. In practice, many modern NLP pipelines use lemmatization for its accuracy and context-awareness, despite its higher computational cost.
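
The contrast can be sketched with two deliberately naive helpers. Both are hypothetical toys: `toy_stem` is not the Porter or Snowball algorithm, and `toy_lemmatize` stands in for a real dictionary such as WordNet with a three-entry lookup table.

```python
SUFFIXES = ("ing", "es", "s")  # toy rule list, checked in order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

TOY_LEMMAS = {"studies": "study", "better": "good", "running": "run"}

def toy_lemmatize(word):
    """Look the word up in a (tiny) dictionary; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "cares", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer is fast and needs no resources, but it produces non-words ("studi") and collisions ("cares" becomes "car"). The dictionary approach returns valid words, but only for entries it knows about.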

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
