<a href="https://colab.research.google.com/github/Nikhithaprasannadurga/FMML-Labs/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will be using KNN on a real world NLP application i.e. is text classification. But first look at some NLP techniques for text classification and tools that we use when we want to use python for NLP.

## Section 1.2: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors- a form that algorithms can understand.



### **NLTK**
NLTK (or Natural Language Tool Kit) is a commonly used library for processing text. We will use this tool in this lab. Lets first install it.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

In [40]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [5]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF technique is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers.

The number of times a term occurs in a document is called its Term frequency (TF).

 Document frequency is the number of documents in which the word is present.  Inverse DF (IDF) is the inverse of the document frequency which measures the informativeness of term *t*.




In [6]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [41]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [42]:
import pandas as pd
df = pd.read_csv('reviews.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'reviews.csv'

In [43]:
df = df.dropna()

NameError: name 'df' is not defined

In [44]:
df.to_csv('reviews.csv', index=False)

NameError: name 'df' is not defined

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

In [10]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [45]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

FileNotFoundError: [Errno 2] No such file or directory: 'spam.csv'

In [46]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

FileNotFoundError: [Errno 2] No such file or directory: 'spam.csv'

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [14]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [47]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

FileNotFoundError: [Errno 2] No such file or directory: 'spam.csv'

In [48]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

NameError: name 'df' is not defined

In [49]:
df.head(5)

NameError: name 'df' is not defined

In [50]:
 len(df)

NameError: name 'df' is not defined

In [25]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [51]:
# This cell may take some time to run
predicted, y_test = bow_knn()

FileNotFoundError: [Errno 2] No such file or directory: 'spam.csv'

In [52]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

FileNotFoundError: [Errno 2] No such file or directory: 'spam.csv'

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?

TF-IDF (Term Frequency-Inverse Document Frequency) often results in better accuracy than the Bag-of-Words (BoW) approach because it addresses key limitations of BoW by incorporating both term importance and document specificity into the representation of text. Here’s why:

1. Weighting Terms by Importance

Bag-of-Words: It treats all terms equally by only considering their presence (binary) or raw frequency. This can lead to common but less informative words (e.g., the, and, is) dominating the representation.

TF-IDF: It weighs terms by their significance. Frequently occurring terms in a specific document (high term frequency) are important but are downweighted if they appear in many documents (low inverse document frequency). This highlights distinguishing words rather than common ones.


2. Reducing Noise from Common Words

BoW captures all words equally, which often leads to noise from stop words or other high-frequency but uninformative terms.

TF-IDF mitigates this by assigning lower weights to such common terms, thus focusing on more meaningful words for classification or clustering tasks.


3. Better Feature Differentiation

With BoW, documents that contain the same frequent words but have different key terms may look similar, which can mislead machine learning models.

TF-IDF emphasizes unique terms in each document, providing better differentiation among documents with distinct topics or content.


4. Sparse Representations

TF-IDF produces a sparse matrix, where most weights are close to zero for less relevant terms. This often helps in reducing the dimensionality of the data and improving computational efficiency.


5. Impact on Classifiers

Machine learning models, such as logistic regression, support vector machines, or neural networks, often perform better with TF-IDF representations because the features are scaled and better represent the relevance of terms to each document.

BoW, by contrast, may lead to poor generalization due to the overemphasis on word frequency without context.


When TF-IDF Outperforms BoW

TF-IDF tends to outperform BoW in scenarios where:

The dataset contains many documents, and distinguishing between them requires identifying unique or rare terms.

Tasks like classification, clustering, or information retrieval require semantic differentiation rather than just frequency counts.


However, in some cases (e.g., when semantic relationships between words or context are crucial), both approaches may be outperformed by more advanced techniques like word embeddings (e.g., Word2Vec, GloVe) or transformer models (e.g., BERT).


2. Can you think of techniques that are better than both BoW and TF-IDF ?

Yes, there are several advanced techniques that generally outperform both Bag-of-Words (BoW) and TF-IDF because they capture richer linguistic and semantic information from text. Here are some examples:


---

1. Word Embeddings

Word embeddings map words into dense, low-dimensional vectors, capturing their semantic and syntactic relationships. Unlike BoW and TF-IDF, embeddings consider the context of words in sentences.

Examples:

Word2Vec: Learns word embeddings using skip-gram or CBOW models, capturing relationships like "king - man + woman = queen."

GloVe: Combines global co-occurrence statistics to learn word embeddings, capturing relationships between words based on their broader corpus context.

FastText: Extends Word2Vec by considering subword information, allowing it to handle out-of-vocabulary words.



Why it's better:

Captures semantic relationships.

Represents words as continuous vectors, which are computationally efficient.

Handles synonyms and similar contexts better than sparse representations like TF-IDF.



---

2. Contextualized Word Embeddings

These models generate word representations that depend on their context in a sentence, overcoming limitations of static embeddings like Word2Vec.

Examples:

ELMo (Embeddings from Language Models): Uses a deep bidirectional LSTM to generate embeddings based on the word’s context.

BERT (Bidirectional Encoder Representations from Transformers): Pre-trained transformer-based model that captures bidirectional context and fine-tunes well for downstream tasks.

GPT (Generative Pre-trained Transformer): Focuses on text generation and language modeling while also providing contextual embeddings.



Why it's better:

Dynamic embeddings that change based on context.

Excellent for understanding polysemy (e.g., "bank" as a financial institution vs. a riverbank).

Achieves state-of-the-art results in many NLP tasks.



---

3. Sentence and Document Embeddings

These techniques generate embeddings not just for words, but for entire sentences or documents, capturing broader contextual and relational information.

Examples:

Doc2Vec: Extends Word2Vec to represent entire documents as vectors, capturing overall context.

Sentence-BERT: Fine-tunes BERT to generate sentence embeddings optimized for tasks like semantic similarity and clustering.

Universal Sentence Encoder (USE): Generates embeddings for sentences and paragraphs, optimized for transfer learning.



Why it's better:

Captures relationships and semantics at the sentence or document level.

Reduces dimensionality while retaining important context.

Works well for tasks like text similarity, clustering, and summarization.



---

4. Neural Network-Based Models

Neural networks can directly model text and often outperform TF-IDF and BoW in tasks where context, structure, or semantics are crucial.

Recurrent Neural Networks (RNNs): Captures sequential information (e.g., LSTM or GRU).

Transformers: Leverages self-attention to model long-range dependencies and parallelize training (e.g., BERT, GPT, T5).

Convolutional Neural Networks (CNNs) for Text: Extracts local features from text for tasks like sentiment analysis or topic classification.


Why it's better:

Can process text in sequence (RNNs) or in parallel (Transformers) while maintaining context.

Scalable and customizable for specific NLP tasks.



---

5. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)

These are dimensionality reduction and topic modeling techniques, respectively, that capture deeper relationships between terms.

LSA: Uses Singular Value Decomposition (SVD) to reduce the dimensionality of the term-document matrix, revealing latent patterns.

LDA: Generates topics by modeling documents as mixtures of topics and topics as mixtures of words.


Why it's better:

Reduces noise in text data by focusing on latent patterns.

Useful for topic extraction and identifying underlying themes in large corpora.



---

6. Hybrid Approaches

Combining TF-IDF with embeddings or neural networks often yields better results.

TF-IDF + Word2Vec/Embeddings: TF-IDF can be used to weigh word embeddings, combining frequency with semantic information.

Attention Mechanisms: Attention models (e.g., in Transformers) allow the model to focus on important words or phrases in context.



---

Why These Techniques Outperform BoW and TF-IDF:

They capture semantic meaning, relationships, and context, which BoW and TF-IDF largely ignore.

They are less sparse, leading to better generalization and reduced overfitting.

They are more suited to modern NLP tasks like machine translation, summarization, and question answering.


Each technique is suitable for different tasks depending on the size of the dataset, the complexity of the language, and the specific NLP application. For instance, while embeddings like Word2Vec are faster, BERT and other transformers provide richer representations but require more computational resources.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

Stemming vs. Lemmatization

Both stemming and lemmatization are techniques used in natural language processing (NLP) to reduce words to their root forms. However, they differ significantly in their methods and applications. Here's an analysis based on their pros and cons:


---

1. Stemming

Definition: Stemming reduces a word to its base or root form, often by chopping off suffixes using simple heuristics, without considering the word's context or meaning.

Pros:

1. Fast and Simple:

Stemming algorithms (e.g., Porter Stemmer, Snowball Stemmer) use rule-based methods, making them computationally efficient.

Suitable for large datasets or real-time applications.



2. Language-Agnostic:

Many stemmers work based on simple rules and can be adapted to different languages with minimal adjustments.



3. Reduces Vocabulary Size:

By reducing words to their base forms, stemming decreases the dimensionality of text data, making it easier for machine learning models to process.




Cons:

1. Crudeness:

Stemming does not account for linguistic rules or context, often producing non-existent words (e.g., running → run, flies → fli).

This can lead to ambiguous roots that may confuse downstream tasks.



2. Lower Accuracy:

Can over-stem (e.g., universal → univers) or under-stem (e.g., went → went instead of go), leading to less meaningful reductions.



3. Language-Specific Limitations:

While simple rules work for many languages, stemming is less effective for morphologically complex languages.





---

2. Lemmatization

Definition: Lemmatization reduces words to their base or dictionary form (lemma), considering both linguistic rules and the word's context. It requires a vocabulary and sometimes part-of-speech (POS) tagging.

Pros:

1. Context-Aware:

Lemmatization identifies the correct base form based on the word's grammatical role (e.g., better → good if it's an adjective, better → better if it's a verb).



2. Accuracy:

Produces valid dictionary words, ensuring outputs are meaningful and consistent with linguistic norms.



3. Handles Complex Morphology:

Effective for morphologically rich languages and nuanced contexts, making it suitable for high-accuracy applications like machine translation.




Cons:

1. Computationally Expensive:

Lemmatization relies on pre-built dictionaries, POS tagging, and advanced linguistic rules, which can slow processing, especially on large datasets.



2. Language Dependency:

Requires language-specific resources (e.g., WordNet for English), making it less adaptable across multiple languages.



3. Dependency on POS Tagging:

Requires accurate tagging of parts of speech, as incorrect tagging can lead to errors (e.g., flies as a noun vs. a verb).





---

Comparison Table


---

When to Use Which?

Stemming:

Best for quick pre-processing where precision is not critical (e.g., search engines, topic modeling).

Suitable for exploratory data analysis or when computational resources are limited.


Lemmatization:

Ideal for tasks requiring linguistic accuracy (e.g., text summarization, machine translation, sentiment analysis).

Preferred for small datasets or high-stakes applications where precision matters.



By understanding the trade-offs, you can choose the approach that aligns best with your goals and constraints.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
