<a href="https://colab.research.google.com/github/Sai-coder260/FMML_M1L1.ipynb/blob/main/FMML_M3L3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let us look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for use with machine-learning algorithms.  
In the case of text, there are many things that need to be taken into account:


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
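
The first two steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the lab's `cleanText` function; the helper name `basic_clean` and the sample sentence are made up for the example.

```python
import re

def basic_clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    # Replace every character that is not a letter (digits, punctuation) with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase and split on whitespace to get a list of tokens.
    return letters_only.lower().split()

tokens = basic_clean("The 3 movies were GREAT, really great!")
print(tokens)  # ['the', 'movies', 'were', 'great', 'really', 'great']
```

Stemming or lemmatization would then be applied to each token, and the resulting tokens converted into vectors.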



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the resources we need.


In [7]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [8]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to noun, the lemmatizer's default part of speech

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # avoid a bare except; fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
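        return text_result

    # Neither lemmatize nor stemmer was requested: return the plain tokens rather than None.
    return word_tokenize(text)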

In [21]:
import nltk
# Download the required resource
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where in the document.
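
To make the idea concrete, here is a minimal hand-rolled sketch using only the standard library. It is not how `CountVectorizer` is implemented; the function name `bag_of_words` and the toy sentences are assumptions for illustration, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct word, sorted so the column order is deterministic.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

vocab, matrix = bag_of_words(["the food was good", "good good food"])
print(vocab)   # ['food', 'good', 'the', 'was']
print(matrix)  # [[1, 1, 1, 1], [1, 2, 0, 0]]

# Word order is discarded: these two sentences get identical vectors.
_, rows = bag_of_words(["dog bites man", "man bites dog"])
print(rows[0] == rows[1])  # True
```

The last two lines show the main limitation of the representation: sentences with opposite meanings can map to the same vector.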

In [22]:
# Convert train and test documents to bag-of-words (document-term) matrices, with the option of removing stopwords.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag of Words: raw counts give the most weight to the words that occur most often, even when those words appear in almost every document and say little about any one of them. TF-IDF instead weights each word by how informative it is.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is derived from the inverse of DF and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product TF × IDF.
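
The definitions above can be worked through by hand. The sketch below is a standard-library illustration, not the lab's `createTFIDF`. It assumes the smoothed variant `idf(t) = ln((1 + N) / (1 + df(t))) + 1` followed by L2 normalization of each row, which to my understanding matches scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`); the function name `tfidf` and the toy messages are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalized TF-IDF matrix) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocab}
    # Smoothed inverse document frequency (assumed variant, see the note above).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard against empty documents
        rows.append([x / norm for x in row])
    return vocab, rows

docs = ["free prize now", "call me now", "see you now"]
vocab, rows = tfidf(docs)
weights = dict(zip(vocab, rows[0]))
# "now" occurs in every document, so it is weighted below the rarer word "free".
print(weights["now"] < weights["free"])  # True
```

In the first message, "free" and "now" both occur once, yet "free" gets the larger weight because it appears in only one of the three documents.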




In [23]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [11]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [12]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [13]:
df = df.dropna()

In [14]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
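
Before tweaking the parameters, it can help to see why the distance metric matters for text. The sketch below is a standard-library illustration of KNN with uniform majority voting, not scikit-learn's `KNeighborsClassifier` (it does not implement `weights='distance'`). The count vectors, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_X)), key=lambda i: distance(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word counts, columns = ["free", "meeting"].
# The spam examples are short messages, the ham examples are long ones.
train_X = [[1, 0], [2, 0], [3, 1], [6, 8], [7, 9], [8, 7]]
train_y = ["spam", "spam", "spam", "ham", "ham", "ham"]
query = [10, 2]  # a long message that is mostly about "free"

print(knn_predict(train_X, train_y, query, k=3, distance=euclidean))        # ham
print(knn_predict(train_X, train_y, query, k=3, distance=cosine_distance))  # spam
```

Euclidean distance pairs the long query with other long messages, while cosine distance compares only the direction of the vectors (the proportion of each word), which is one reason cosine is a common choice for text.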

In [15]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
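
As a brief preview, the `cv=3` argument above asks for the training data to be split into three folds, with each fold used once for validation while the other two are used for fitting. The sketch below only illustrates that splitting idea with contiguous folds; the function name `k_fold_indices` is hypothetical, and it does not shuffle or stratify, which scikit-learn may do depending on its settings.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

for train_idx, val_idx in k_fold_indices(7, 3):
    print("validate on", val_idx, "train on", train_idx)
```

Each sample lands in exactly one validation fold, so the three scores printed by `cross_val_score` are computed on disjoint parts of the training set.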

In [34]:
!pip install nltk
import nltk

# Download the required resource
nltk.download('averaged_perceptron_tagger_eng')




[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [35]:
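## KNN accuracy after using TFIDF
## Run this cell before Section 4 redefines tfidf_knn() for spam.csv; otherwise it reports the spam-dataset result.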
predicted, y_test = tfidf_knn()


KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [19]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [25]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [26]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [27]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [28]:
len(df)

5572

In [29]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [36]:
# This cell may take some time to run
predicted, y_test = bow_knn()


KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [37]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()


KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words?

TF-IDF (Term Frequency-Inverse Document Frequency) tends to perform better than Bag-of-Words (BoW) because it takes into account not only how often a word occurs in a document but also how distinctive that word is across the entire corpus.

**BoW approach**

Bag-of-Words represents each document as a vector of word counts. It implicitly assumes that the more often a word appears in a document, the more important it is.

Limitations of BoW:

- High-frequency words: words such as "the", "and", and "is" appear in most documents and can dominate the feature vector without carrying useful information. Removing stopwords, as done in this lab, only partly helps, because many frequent words are not on the stopword list.
- No notion of distinctiveness: every word is counted independently, regardless of how relevant or distinctive it is across documents.
- High-dimensional sparse vectors: each vocabulary word is a feature, so the vectors are very sparse in large corpora, which hurts efficiency. TF-IDF shares this limitation, since it uses the same vocabulary.

**TF-IDF approach**

TF-IDF scales the term frequency (TF) by the inverse document frequency (IDF), which reduces the weight of terms that appear in many documents.

Why TF-IDF improves accuracy:

- Reduces the impact of common words: words that appear in most documents (e.g., "the", "in", "of") are given lower weight, while words that are frequent in only a few documents are given higher weight. This highlights the more discriminative terms for the classification task.
- Focus on distinctiveness: the model concentrates on the words that distinguish one document from another, which improves its ability to classify documents correctly.
- Normalization: scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default, so long and short documents become comparable and raw document length matters less.

Note that the two models in this lab also differ in distance metric (Euclidean vs. cosine) and neighbor weighting (uniform vs. distance), so the accuracy gap cannot be attributed to TF-IDF alone.

2. Can you think of techniques that are better than both BoW and TF-IDF?

BoW and TF-IDF are foundational text representations, but more advanced techniques can give better accuracy, particularly on large and complex datasets.

**Word embeddings (e.g., Word2Vec, GloVe, FastText)**

Word2Vec and GloVe represent each word as a dense vector in a continuous space, typically with a few hundred dimensions. The vectors capture semantic relationships between words (e.g., "king" - "man" + "woman" ≈ "queen"), so words with similar meanings end up close together.

Advantages:

- Semantic similarity: related words get similar vectors, unlike BoW and TF-IDF, where every word is an unrelated dimension.
- Lower dimensionality: the embeddings are much lower-dimensional than the sparse vectors in BoW or TF-IDF, making them more efficient.
- Better generalization: knowledge of semantic relationships can improve performance in tasks like sentiment analysis, machine translation, and text classification.

Disadvantages:

- Training data dependency: training embeddings from scratch requires large corpora and can be computationally expensive (pre-trained vectors are a common alternative).
- Contextual limitation: static embeddings (like Word2Vec and GloVe) assign one vector per word, so they do not capture context-dependent meanings (e.g., "bank" in "river bank" vs. "savings bank"). Contextual embeddings (like BERT) address this limitation.

**Contextual embeddings (e.g., BERT, GPT)**

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that provides contextualized word embeddings. Unlike Word2Vec, BERT takes the surrounding words into account, so the representation of a word depends on the context in which it appears.

Advantages:

- Context-awareness: BERT and similar models generate different embeddings for the same word depending on its context, which gives more accurate representations.
- Pre-trained models: models like BERT are pre-trained on massive datasets and can be fine-tuned for specific tasks, making them effective even with smaller task-specific datasets.

Disadvantages:

- Resource-intensive: BERT and other transformer models require significant computational resources for training and inference.
- Longer training times: fine-tuning BERT on task-specific data can be slow compared to traditional methods like TF-IDF.

**Transformers and the attention mechanism**

Transformer models such as BERT and GPT are built on attention, which lets each token weigh the other tokens in the input by relevance. This allows them to handle long-range dependencies in text and capture intricate relationships between words, improving accuracy in tasks like document classification, named entity recognition, and machine translation.

Advantages:

- Long-range dependencies: transformers are effective at capturing dependencies between distant words, which BoW and TF-IDF cannot represent at all because they discard word order.
- Scalability: these models can be scaled up and trained on massive corpora, leveraging pre-training and transfer learning.

Disadvantages:

- Computational cost: the attention mechanism and large model sizes require significant computational power, both during training and inference.
- Overfitting on small datasets: with little data, these models may overfit and might not outperform simpler methods like TF-IDF.
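
The vector-arithmetic intuition behind word embeddings can be illustrated with a toy example. The 2-D vectors below are hand-made (hypothetical values, not trained embeddings), chosen so that one axis loosely encodes "royalty" and the other "gender"; real embeddings have hundreds of dimensions learned from data.

```python
import math

# Hand-made 2-D "embeddings" (hypothetical values, not trained): [royalty, gender]
vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.0],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```

With BoW or TF-IDF, "king" and "queen" are simply two unrelated columns, so no such arithmetic is possible.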


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

Stemming and lemmatization are two techniques for reducing words to a base form, which can help text processing and NLP tasks.

**Stemming**

Stemming strips affixes (mostly suffixes) from a word to get a "stem". The stem is not always a valid dictionary word.

Example: a stemmer reduces "running" and "runs" to "run", but leaves the irregular form "ran" unchanged.

Algorithm: stemmers apply heuristic rules (e.g., the Porter stemmer, or the Snowball stemmer used in this lab) to remove suffixes.

Pros:

- Faster: stemming algorithms are typically faster and computationally cheaper than lemmatization.
- Simple: it is easy to implement and can be effective in applications where the exact form of the word is less important.

Cons:

- Non-dictionary results: the stem is often not a valid word (e.g., "studies" becomes "studi"), which is a problem for tasks requiring correct word forms.
- Over-stemming: words with different meanings can be reduced to the same stem (e.g., "universe", "university", and "universal" all become "univers").
- Loss of meaning: stemming ignores context and part of speech, so it can miss nuances in meaning.

**Lemmatization**

Lemmatization is a more sophisticated technique that reduces words to their lemma, the dictionary form of a word. It uses a vocabulary and morphological analysis together with the word's part of speech.

Example: a lemmatizer reduces "better" to "good" when it is tagged as an adjective, and "running" to "run" when it is tagged as a verb.

Pros:

- Produces valid words: the output is a dictionary word (e.g., "running" → "run"), which is more linguistically correct.
- More accurate: it takes the part of speech into account, leading to more accurate text normalization.

Cons:

- Slower: lemmatization requires more complex processing, such as part-of-speech tagging and dictionary lookups.
- More computationally expensive: the extra processing makes it more resource-intensive than stemming, especially on large datasets.
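
The contrast between the two approaches can be sketched with a toy example. The suffix rules and the tiny lemma dictionary below are hypothetical and far simpler than the Snowball stemmer or the WordNet lemmatizer used in this lab; they only illustrate rule-based stripping versus dictionary lookup.

```python
# Toy suffix rules and a tiny lemma dictionary (both hypothetical, for illustration only).
SUFFIXES = ["ies", "ing", "ed", "s"]
LEMMAS = {"studies": "study", "running": "run", "ran": "run", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix; no dictionary is used, so results may not be words."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            # Mimic the common "ies" -> "i" rule that yields stems such as "studi".
            return word[: -len(suffix)] + ("i" if suffix == "ies" else "")
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "ran", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function is fast but produces non-words ("studi", "runn") and cannot relate "ran" or "better" to their base forms; the lookup handles those cases but only for words its dictionary knows about.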
**Conclusion**

TF-IDF tends to outperform BoW by reducing the influence of common words and highlighting more discriminative terms. Word embeddings (Word2Vec, GloVe) and contextual embeddings (BERT) often do better still because they capture semantic meaning and context, provided enough data and compute are available.
Between stemming and lemmatization, lemmatization is usually preferred for its accuracy, though it is slower and more resource-intensive. Stemming is faster but can produce non-dictionary forms and lose meaning, which may hurt performance in some NLP tasks.




### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
