<a href="https://colab.research.google.com/github/Shrut718/FMML--lab/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
In the case of text, several things need to be taken into account:  


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.  

Most importantly, algorithms cannot use words or images directly; they need to be converted into vectors, a numerical form that algorithms can work with.
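As a minimal sketch of the first two steps, the snippet below uses only the standard library. The helper name `basic_clean` is hypothetical and is not the `cleanText` function defined later in this lab; it only illustrates removing numbers and punctuation and lowercasing.

```python
import re

def basic_clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    # Replace every non-letter (digits, punctuation) with a space
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase and split on whitespace to get a list of tokens
    return letters_only.lower().split()

tokens = basic_clean("The 3 BEST movies of 2020, hands down!")
print(tokens)
```

Stemming or lemmatization would then be applied to each token in the resulting list.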



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [5]:
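import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for in word_tokenize
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # tagger data that newer NLTK releases look for in pos_tag
nltk.download('wordnet')
nltk.download('omw-1.4')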

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Nouns and every other tag: noun is the lemmatizer's default POS
                return wordnet.NOUN

        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # List of (token, POS tag) pairs
        return [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in tagged]

    tokens = word_tokenize(text)
    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither lemmatize nor stemmer selected: return the cleaned tokens unchanged
    return tokens

In [7]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Note: on recent NLTK releases this cell fails with `LookupError: Resource punkt_tab not found` if the tokenizer data is missing. As the error message suggests, run `nltk.download('punkt_tab')` and then re-run the cell.


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
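To make the idea concrete, here is a minimal standard-library sketch of a bag-of-words matrix. The function name `bag_of_words` and the toy sentences are illustrative assumptions; the lab itself uses scikit-learn's `CountVectorizer`, which additionally handles tokenization details and stop words.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # Vocabulary: every distinct word in the corpus, sorted for a stable column order
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, one column per vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the movie was good", "the movie was not good", "good good movie"]
vocab, bow = bag_of_words(docs)
print(vocab)
for row in bow:
    print(row)
```

Because word order is discarded, "dog bites man" and "man bites dog" would map to exactly the same vector in this representation.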

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) addresses a limitation of the Bag-of-Words technique: raw counts treat every word as equally important, whereas TF-IDF weights each word by how informative it is. This makes it a good fit for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually on a log scale, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
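The definitions above can be sketched in a few lines of plain Python. This sketch assumes the textbook weighting tf(t, d) · log(N / df(t)), where N is the number of documents; the function name `tf_idf` and the toy sentences are illustrative. scikit-learn's `TfidfVectorizer`, used in the next cell, applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = ["the food was great", "the food was awful", "the service was great"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Terms that occur in every document ("the", "was") get weight zero, while a term that occurs in only one document ("awful") gets the largest weight.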




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
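Before running the scikit-learn models, it may help to see what a KNN prediction does under the hood. The following is a minimal standard-library sketch with uniform majority voting and two interchangeable distance functions. The function names and the toy count vectors are hypothetical; `KNeighborsClassifier` implements the same idea with faster neighbor search and optional distance weighting.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k=3, distance=euclidean_distance):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy count vectors over the vocabulary ["good", "bad", "movie"]
train_X = [[2, 0, 1], [1, 0, 1], [3, 0, 0], [0, 2, 1], [0, 1, 1], [0, 3, 0]]
train_y = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

print(knn_predict(train_X, train_y, [1, 0, 0], k=3))
print(knn_predict(train_X, train_y, [0, 4, 2], k=3, distance=cosine_distance))
```

Cosine distance ignores vector length and compares direction only, which is one reason it is often paired with text features where document lengths vary.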

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words?

The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the Bag-of-Words (BoW) model for the following reasons:

1. Weighing Word Importance:

BoW counts the frequency of each word in a document, treating all words as equally important.

TF-IDF weighs each word by its frequency in a document (TF) and its inverse frequency across all documents (IDF). This means that common words across many documents (e.g., "the", "is") are given less importance, while unique or rarer words (e.g., "rare", "specialized") receive higher weights. This helps highlight words that carry more informative value.



2. Handling Common Words:

In BoW, common words (like stop words) can dominate the feature space, leading to less meaningful features for machine learning models.

TF-IDF reduces the weight of such common words, making the model focus on more distinctive and relevant terms.



3. Length Normalization:

TF-IDF does not reduce the number of features: the vocabulary, and therefore the vector length, is the same as for BoW. What changes is the scale of each feature. scikit-learn's `TfidfVectorizer` also L2-normalizes every document vector by default, so long and short documents become directly comparable, which helps a distance-based model such as KNN.



4. Better Representation of Document Content:

TF-IDF can better capture the unique aspects of a document by giving higher importance to terms that are both frequent in the document and rare across other documents. This often leads to more relevant and discriminative features for tasks like classification or clustering.




In summary, TF-IDF's ability to downweight common words and highlight important, distinctive terms generally leads to better performance in text analysis tasks compared to the Bag-of-Words approach.


2. Can you think of techniques that are better than both BoW and TF-IDF?

Yes, several techniques have been developed that can outperform both Bag-of-Words (BoW) and TF-IDF, particularly in capturing semantic meaning and improving model performance. Some of these advanced techniques include:

1. Word Embeddings (e.g., Word2Vec, GloVe, FastText):

Word embeddings map words to dense, continuous vectors, typically a few hundred dimensions rather than one dimension per vocabulary word, where similar words lie closer together. These embeddings capture semantic relationships (e.g., "king" is closer to "queen" than to "car").

Word2Vec and GloVe are popular models that learn word representations from large text corpora. Embeddings often improve on BoW and TF-IDF because the vectors are learned from the contexts in which words appear, so related words get similar vectors, whereas BoW and TF-IDF treat every word as an unrelated dimension. Each word still gets a single, fixed vector regardless of the sentence it appears in.



2. Contextualized Word Representations (e.g., BERT, GPT, ELMo):

Models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer) provide dynamic, context-dependent word representations. Unlike static embeddings (e.g., Word2Vec, GloVe), these models adjust the representation of a word based on its context in the sentence.

BERT, for example, uses a deep transformer architecture and is pre-trained on large text datasets. It is particularly powerful in understanding the meaning of words in context and can perform well on downstream tasks like classification, question answering, and more.



3. Transformers with Attention Mechanisms:

The Attention Mechanism, and particularly Transformer models like BERT and T5, improves upon earlier models by allowing the model to focus on different parts of the input text when generating a prediction. This gives the model a better understanding of the relationships between words and context, outperforming traditional methods like BoW and TF-IDF.



4. Doc2Vec (Paragraph Vectors):

Doc2Vec is an extension of Word2Vec that represents entire documents (or paragraphs) as vectors. It captures the semantic meaning of the entire document rather than just individual words. This approach is particularly useful for tasks where the meaning of the whole document is more important than individual words.



5. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA):

LSA reduces the dimensionality of the term-document matrix using Singular Value Decomposition (SVD), capturing latent semantic structures within the text. LDA is a generative probabilistic model that can be used to discover topics in a collection of text documents, thus providing a more abstract representation than BoW and TF-IDF.

Both techniques provide a way to capture the underlying topics and meanings in a text corpus, offering a more meaningful representation than BoW or TF-IDF.



6. Topic Modeling (LDA, NMF):

Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) can be used for topic modeling, where documents are represented as distributions over topics. These methods capture the underlying topics in a corpus and allow for more structured document representations compared to BoW or TF-IDF.



7. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM):

RNNs and LSTMs are neural network architectures that excel at handling sequential data, such as text. They can capture dependencies between words over long distances, making them more effective for tasks like language modeling, text generation, and sequence labeling compared to BoW or TF-IDF.




Each of these methods brings a more nuanced understanding of text, incorporating context, semantics, and complex relationships, leading to better performance on a wide range of natural language processing tasks. Depending on the problem and available resources, models like BERT, GPT, or Doc2Vec can outperform both BoW and TF-IDF significantly.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

Stemming and Lemmatization are two popular text preprocessing techniques used in natural language processing (NLP) to reduce words to a common base form. Both simplify words, but they do so in different ways. Here is a breakdown of their pros and cons:

Stemming:

Definition: Stemming chops affixes (mostly suffixes) off words to obtain a root form called the "stem". It is usually rule-based: algorithms such as the Porter Stemmer or the Snowball Stemmer apply a fixed set of rules to trim words.

Pros:

1. Faster: Stemming algorithms are generally faster because they apply simple heuristic rules to shorten words.


2. Simplicity: The method is simple and easy to implement, making it a good starting point for basic text processing tasks.


3. Reduced Dimensionality: By reducing words to stems, it can help decrease the number of features in the dataset, thus reducing the complexity of the model.



Cons:

1. Overstemming: Because stemming is heuristic, words with different meanings can be reduced to the same stem. For example, the Porter stemmer reduces "universal", "university", and "universe" all to "univers", although the words mean different things.


2. Understemming: Related words may fail to reach a common stem and are then treated as unrelated features. For example, "alumnus" and "alumni" are typically left with different stems. Stems are also often not real words (e.g., the Snowball stemmer used in this lab turns "troubling" into "troubl").


3. Loss of Semantics: Since the process only focuses on chopping off affixes, it doesn’t consider the word’s meaning. This can lead to a loss of important information about the context and sense of the word.



Lemmatization:

Definition: Lemmatization reduces a word to its base or dictionary form (known as a "lemma"). It relies on a vocabulary and morphological analysis, usually together with the word's part of speech. For example, "running" (verb) becomes "run", and "better" (adjective) becomes "good".

Pros:

1. Context-aware: Lemmatization takes the word's meaning into account and reduces it to a valid word form, making it more accurate in understanding the actual root word.


2. Less Overstemming: Because lemmatization is dictionary-based, unrelated words are much less likely to be merged, so more of each word's original meaning is preserved.


3. Grammatical correctness: Lemmatization ensures that the words returned are proper dictionary words (e.g., "running" becomes "run," and "better" becomes "good"), maintaining grammatical correctness.



Cons:

1. Slower: Lemmatization is computationally more expensive and slower than stemming, as it requires understanding the word’s meaning and context (often using additional resources like a dictionary or part-of-speech tagging).


2. Complexity: Lemmatization requires more complex algorithms and often requires more computational resources compared to stemming.


3. Requires POS Tagging: In some cases, lemmatization needs part-of-speech (POS) tagging to determine the correct base form, adding an additional layer of complexity and potentially reducing efficiency.



Comparison:

Accuracy: Lemmatization is typically more accurate than stemming, as it respects the meaning and grammatical form of the word, whereas stemming can lead to stems that may not even be valid words.

Speed: Stemming is faster than lemmatization, making it a better choice for applications where processing speed is critical and accuracy is not as important.

Complexity: Lemmatization is more complex and requires more resources, especially if POS tagging is involved. Stemming is simpler and can be applied with minimal computational overhead.


When to Use Each:

Stemming is suitable for situations where the goal is to reduce dimensionality and speed is a priority, and the exact form of the word is not crucial (e.g., search engines or document clustering).

Lemmatization is more appropriate when preserving the meaning of words is important, such as in tasks like sentiment analysis, named entity recognition, and machine translation.


In summary, the choice between stemming and lemmatization depends on the task at hand. If you need speed and simplicity, stemming is the way to go, but if accuracy and understanding of word meanings are essential, lemmatization is preferable.
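The contrast can be illustrated with a deliberately tiny sketch. Both helpers below are toy stand-ins written for this example: `toy_stem` applies a hypothetical four-suffix rule list (not the Porter or Snowball rules), and `toy_lemmatize` looks words up in a hand-made table standing in for a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Strip the first matching suffix from a small, hypothetical rule list."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"troubling": "trouble", "studies": "study", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

for word in ["troubling", "studies", "better", "was"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer is cheap but produces non-words such as "troubl" and "studi" and cannot relate "better" to "good"; the lookup-based lemmatizer returns real words but only for entries its dictionary contains.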

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
