
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP task: text classification. First, we look at some NLP techniques for text classification and the tools used for NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be converted into a form that machine-learning algorithms can work with.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.  

Most importantly, algorithms cannot use words or images directly; they must first be converted into vectors, a numeric form that algorithms can operate on.
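As a minimal sketch of the first two steps, the snippet below strips digits and punctuation with a regular expression and lowercases the result. The function name `basic_clean` is hypothetical and used only for illustration; it is not the `cleanText` helper defined later in this lab.

```python
import re

def basic_clean(text):
    """Keep letters only, lowercase, and collapse repeated whitespace."""
    text = re.sub(r"[^A-Za-z]", " ", text)  # digits and punctuation become spaces
    text = text.lower()                      # "GREAT" and "great" become the same token
    return " ".join(text.split())            # collapse runs of spaces

print(basic_clean("I bought 3 books... GREAT value!!"))
# i bought books great value
```

Stemming or lemmatization would then be applied to the cleaned tokens, and the result passed to a vectorizer.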



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # WordNetLemmatizer's default part of speech

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # (token, Penn Treebank tag) pairs
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer selected: return the cleaned tokens unchanged
    return word_tokenize(text)

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
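
To make the description above concrete, here is a small hand-rolled sketch of a bag-of-words matrix using only the standard library. The three toy documents and the helper name `bag_of_words` are made up for illustration; the lab itself uses scikit-learn's `CountVectorizer`, which applies its own tokenization rules.

```python
from collections import Counter

docs = ["the book was good", "the plot was not good", "good good book"]

# Vocabulary: every distinct word, sorted so the column order is reproducible
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Return one count per vocabulary word; word order in `doc` is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
# ['book', 'good', 'not', 'plot', 'the', 'was']
# [1, 1, 0, 0, 1, 1]
# [0, 1, 1, 1, 1, 1]
# [1, 2, 0, 0, 0, 0]
```

Reordering the words of a document leaves its row unchanged, which is exactly the information a bag-of-words model discards.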

In [4]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag of Words: raw counts let frequent but uninformative words dominate the representation. TF-IDF instead weights each term by how often it occurs in a document, scaled down by how common the term is across the corpus.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) decreases as DF grows, so it measures the informativeness of term *t*: terms that appear in few documents get a high IDF, and terms that appear in almost every document get a low one. The TF-IDF weight of a term is the product TF × IDF.
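
The definitions above can be sketched by hand. The snippet below assumes the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is the form scikit-learn documents for `TfidfVectorizer` with its default `smooth_idf=True`; other references use ln(N / df(t)). The toy documents and function names are illustrative only, and the final L2 normalization that `TfidfVectorizer` applies by default is left out to keep the arithmetic visible.

```python
import math
from collections import Counter

docs = [["good", "book"], ["good", "plot"], ["bad", "book", "bad"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    tf = Counter(doc)  # raw term counts in this document
    return {term: count * idf(term) for term, count in tf.items()}

for doc in docs:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

Here "plot" and "bad" each appear in one document and so receive a higher IDF than "good" and "book", which appear in two.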




In [5]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [10]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()


Saving amazon_books_Data[1].csv to amazon_books_Data[1].csv


In [11]:
import pandas as pd
df = pd.read_csv('/content/amazon_books_Data[1].csv')

In [13]:
df = df.dropna()

In [12]:
df.to_csv('/content/amazon_books_Data[1].csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
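
Before looking at the scikit-learn models, the following standard-library sketch shows how the choice of distance metric can change a KNN prediction. The count vectors and labels are invented toy data, not results from the lab's datasets, and `knn_predict` is a hypothetical helper rather than scikit-learn's implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros

def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors over a 2-word vocabulary: "neg" documents are long and
# dominated by word 0, "pos" documents are short and dominated by word 1.
train_X = [[20, 0], [30, 1], [25, 2], [0, 1], [1, 2], [0, 3]]
train_y = ["neg", "neg", "neg", "pos", "pos", "pos"]
query = [10, 1]  # same word mix as the "neg" documents, but shorter

print(knn_predict(train_X, train_y, query, k=3, distance=euclidean))        # pos
print(knn_predict(train_X, train_y, query, k=3, distance=cosine_distance))  # neg
```

Euclidean distance is sensitive to document length, so the short query lands near the short "pos" documents. Cosine distance compares only the direction of the vectors, so the query matches the "neg" documents that share its word mix.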

In [14]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [20]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

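FileNotFoundError: [Errno 2] No such file or directory: 'reviews.csv'

Note: the run above failed because `reviews.csv` is not in the working directory; the upload cell in Section 2 saved the dataset as `amazon_books_Data[1].csv`. Rename the uploaded file or update the path passed to `pd.read_csv` before running these cells.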

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [19]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam[1].csv to spam[1].csv


In [23]:
import pandas as pd
# The default UTF-8 decoding fails on this file (UnicodeDecodeError), so read it as Latin-1.
# `error_bad_lines=False` is deprecated; `on_bad_lines='skip'` replaces it in pandas >= 1.3.
df = pd.read_csv('/content/spam[1].csv', encoding='latin-1', on_bad_lines='skip')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

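    # UTF-8 decoding fails on this file, so read it as Latin-1.
    # The upload cell saved the file as spam[1].csv; adjust the path if needed.
    training_data = pd.read_csv('spam.csv', encoding='latin-1')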
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

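    # UTF-8 decoding fails on this file, so read it as Latin-1.
    # The upload cell saved the file as spam[1].csv; adjust the path if needed.
    training_data = pd.read_csv('spam.csv', encoding='latin-1')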
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about stemming and lemmatization from the resources given below. Think about the pros and cons of each.

# QUESTION 1

TF-IDF (Term Frequency-Inverse Document Frequency) and Bag-of-Words are both techniques used in natural language processing for text representation. While the choice between them depends on the specific task and dataset, TF-IDF often outperforms Bag-of-Words in certain scenarios. Here are some reasons why TF-IDF may generally result in better accuracy than Bag-of-Words:

1. **Term Importance Weighting:**
   - TF-IDF takes into account not only the frequency of a term in a document (TF) but also its importance in the entire corpus (IDF).
   - High TF-IDF values are assigned to terms that are frequent in a document but relatively rare across the entire corpus. This helps in identifying terms that are discriminative and carry more meaningful information.

2. **Stop Word Handling:**
   - TF-IDF automatically reduces the importance of common words (stop words) that appear frequently across documents. These words might not be very informative and can be noise in certain tasks. Bag-of-Words treats all words equally, regardless of their importance or frequency.

3. **Document-Specific Relevance:**
   - TF-IDF reflects the importance of terms within a specific document. Terms that are important to a particular document but not to the entire corpus are given higher weights.
   - Bag-of-Words does not consider the uniqueness of terms within a document, treating each term independently of its relevance to the specific document.

4. **Normalization:**
   - TF-IDF rescales raw counts by the inverse document frequency, so terms that appear in most documents contribute little. Implementations such as scikit-learn's `TfidfVectorizer` also L2-normalize each document vector by default, which reduces the effect of document length.
   - Bag-of-Words uses raw counts, so long documents and very frequent terms can dominate distance computations.

5. **Better Handling of Large Corpora:**
   - Both representations produce large, sparse feature vectors with one dimension per vocabulary term, so TF-IDF does not reduce dimensionality.
   - What TF-IDF changes is the weighting: terms that are common across the entire corpus are down-weighted, so they have less influence on distances between documents.

6. **Semantic Information:**
   - Neither TF-IDF nor Bag-of-Words captures semantic relationships between words. TF-IDF only gives more emphasis to discriminative terms, which often helps distance-based classifiers such as KNN.

It's important to note that the effectiveness of these techniques depends on the specific characteristics of the dataset and the nature of the text processing task. In some cases, Bag-of-Words may be sufficient or even preferable, especially when the task does not require considering the importance of terms across the entire corpus.



# QUESTION 2
While Bag-of-Words (BoW) and TF-IDF are widely used text representation techniques, more advanced approaches capture richer semantic information from text. Here are some techniques that have shown improvements over BoW and TF-IDF in certain contexts:

1. **Word Embeddings:**
   - Word embeddings, such as Word2Vec, GloVe, and FastText, represent words as dense vectors in a continuous vector space. These embeddings capture semantic relationships between words, enabling the model to understand context and meaning.
   - Pre-trained word embeddings can be used, or embeddings can be learned from the specific dataset.

2. **Doc2Vec (Paragraph Vectors):**
   - An extension of Word2Vec, Doc2Vec represents entire documents as dense vectors in a continuous space, capturing document-level semantics.
   - This approach is useful when a fixed-length, document-level representation is needed, for example as input to a classifier.

3. **BERT (Bidirectional Encoder Representations from Transformers):**
   - BERT is a widely used pre-trained language model based on the transformer architecture. It captures bidirectional contextual information, allowing it to represent the meaning of a word in the context of its surrounding words.
   - Fine-tuning BERT on specific tasks or using its embeddings as features in downstream models has shown significant improvements in various natural language processing tasks.

4. **ELMo (Embeddings from Language Models):**
   - ELMo generates contextualized word embeddings by considering the context of a word within a sentence. It captures word meanings that may vary based on their context.
   - ELMo embeddings have been shown to be effective in tasks such as sentiment analysis, named entity recognition, and question answering.

5. **Transformer-based Models:**
   - Beyond BERT, other transformer-based models like GPT (Generative Pre-trained Transformer) and T5 (Text-to-Text Transfer Transformer) have been successful in various NLP tasks.
   - These models can be fine-tuned on specific tasks or used to generate contextual embeddings for downstream tasks.

6. **Ensemble Models:**
   - Combining multiple models or representations through ensemble methods can often lead to improved performance. This can include ensembling BoW/TF-IDF with word embeddings or combining predictions from multiple models.

7. **Attention Mechanisms:**
   - Attention mechanisms, as used in transformers, allow models to focus on different parts of the input when making predictions. They operate on sequences of token representations rather than on BoW or TF-IDF vectors, which discard word order.

The choice of technique depends on the specific task, dataset, and available resources. In many cases, pre-trained models like BERT or fine-tuned embeddings perform strongly across a range of NLP tasks, at a higher computational cost than BoW or TF-IDF.



# QUESTION 3

### Stemming:

**Pros:**
1. **Simplicity and Speed:** Stemming is generally faster and computationally cheaper than lemmatization. It applies simple heuristic rules that strip suffixes (and, in some stemmers, prefixes).

2. **Reduction of Vocabulary Size:** Stemming reduces the dimensionality of the feature space by mapping words with the same stem to a single feature. This can be beneficial in tasks like text classification.

**Cons:**
1. **Over-stemming and Under-stemming:** Stemming algorithms may over-stem, removing too many letters and leading to the loss of meaning, or under-stem, leaving too many variations of a word.

2. **Not Always Linguistically Correct:** Stemming does not always result in linguistically correct words. The stemmed forms may not be valid words, and this lack of linguistic accuracy can be a drawback in certain applications.

### Lemmatization:

**Pros:**
1. **Linguistic Accuracy:** Lemmatization aims to provide the base or dictionary form of a word, ensuring linguistic accuracy. The resulting lemma is a valid word.

2. **Use of Grammatical Context:** Lemmatization can use a word's part of speech in the sentence (as `cleanText` does via `pos_tag`) to choose the correct lemma. This helps in tasks where the meaning of the word is crucial, such as question answering or chatbots.

**Cons:**
1. **Computational Cost:** Lemmatization is generally more computationally expensive compared to stemming, as it requires access to a lexical database or a full morphological analysis.

2. **Variability in Lemmas:** The lemmatization process might produce different lemmas for words depending on their part of speech (POS). Handling different POS variations can be complex.
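
The trade-offs above can be seen in a deliberately crude sketch. `toy_stem` is a hypothetical suffix-stripper and `TOY_LEMMAS` is a tiny hand-written dictionary standing in for a lexical database such as WordNet; neither reproduces the behavior of NLTK's `SnowballStemmer` or `WordNetLemmatizer`.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude rule-based stemmer (illustrative only)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Require at least 3 remaining letters so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table; a real lemmatizer consults a full lexical database
TOY_LEMMAS = {"troubling": "trouble", "troubles": "trouble", "was": "be", "mice": "mouse"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["troubling", "troubles", "news", "was", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
# troubling -> troubl | trouble
# troubles -> troubl | trouble
# news -> new | news
# was -> was | be
# mice -> mice | mouse
```

The rule-based stemmer is cheap and groups "troubling" with "troubles", but it produces the non-word "troubl", over-stems "news" to "new", and cannot handle irregular forms such as "mice". The dictionary lookup returns valid words but only for entries it knows about, which is why real lemmatizers need a large lexical resource.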



### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
