<a href="https://colab.research.google.com/github/Harika-singana/FMML_LABS/blob/main/Mod3_lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
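As a minimal sketch of the first two steps (removing numbers, handling capitalization and punctuation), the snippet below uses only the standard library. The helper name `basic_clean` is hypothetical; the regular expression is the same one used in this lab's `cleanText` function, and splitting on whitespace is a simplification of proper tokenization.

```python
import re

def basic_clean(text):
    """Lowercase the text and replace every non-letter character with a space."""
    text = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return text.lower().split()             # split on whitespace into tokens

tokens = basic_clean("Great phone!! Bought 2 in 2021.")
print(tokens)  # ['great', 'phone', 'bought', 'in']
```

Stemming or lemmatizing would then be applied to each token, and the tokens converted to vectors.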



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK data packages we need.


In [10]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [11]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:
                # Unknown POS tag: fall back to the lemmatizer's default (noun)
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

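    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatizing nor stemming requested: return the cleaned tokens as-is
    return word_tokenize(text)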

In [12]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
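To make the "bag" idea concrete, here is a small standard-library sketch (not the `CountVectorizer` used later in this lab). The function name `bag_of_words` and the example sentences are made up for illustration. Two sentences with the same words in a different order get identical vectors.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the dog bit the man", "the man bit the dog", "the dog barked"]
vocab, vectors = bag_of_words(docs)
print(vocab)                     # ['barked', 'bit', 'dog', 'man', 'the']
print(vectors[0])                # [0, 1, 1, 1, 2]
print(vectors[0] == vectors[1])  # True: word order is discarded
print(vectors[2])                # [1, 0, 1, 0, 1]
```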

In [14]:
# Convert train and test documents to bag-of-words count vectors, with the option of removing stopwords. Returns the train and test document-term matrices.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
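TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document and by how rare it is across the whole collection of documents. It addresses a limitation of Bag of Words, which counts every word equally no matter how informative it is, and it is a common way to turn text into numbers for classification.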

The number of times a term occurs in a document is called its Term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually log-scaled, and measures the informativeness of term *t*: the rarer the term across documents, the higher its IDF.
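The definitions above can be sketched in a few lines of plain Python. This uses the textbook form, weight = TF × ln(N / DF), where N is the number of documents. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF, ln((1 + N) / (1 + DF)) + 1, and L2-normalizes each row by default, so its numbers will differ. The function name `tf_idf` and the example sentences are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook TF-IDF: tf(t, d) * ln(N / df(t)) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({word: tf[word] * idf[word] for word in tf})
    return idf, weights

docs = ["the food was great", "the service was slow", "the food was cold"]
idf, weights = tf_idf(docs)
print(idf["the"])              # 0.0, the term appears in every document
print(round(idf["food"], 3))   # 0.405, i.e. ln(3/2)
print(round(idf["great"], 3))  # 1.099, i.e. ln(3)
```

A word like "the" that occurs in every document gets weight zero, while "great", which occurs in only one, gets the largest weight.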




In [15]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
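To see why the choice of distance metric matters, here is a toy KNN written from scratch (an illustrative sketch, not scikit-learn's implementation). The vectors are made-up counts over a two-word vocabulary `[good, bad]`, and all function names are hypothetical. Euclidean distance is sensitive to document length, while cosine distance compares only the direction of the vectors.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up counts of [good, bad]: long positive reviews, short negative ones.
train = [[4, 1], [6, 0], [5, 1], [0, 1], [1, 5], [0, 2]]
labels = ["positive", "positive", "positive", "negative", "negative", "negative"]
query = [1, 0]  # a short review that says "good" once

print(knn_predict(train, labels, query, 3, euclidean_distance))  # negative
print(knn_predict(train, labels, query, 3, cosine_distance))     # positive
```

With Euclidean distance the short query lands closest to the short negative reviews; with cosine distance it matches the reviews dominated by "good".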

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
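Until then, a rough picture of what `cv=3` means: the training data is split into 3 folds, and each fold takes a turn as the validation set while the model is fit on the other two. The sketch below only generates the index splits, using a hypothetical helper `k_fold_indices` with contiguous, unshuffled folds. For classifiers, scikit-learn's `cross_val_score` uses a stratified variant that also preserves class proportions in each fold.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

for train_idx, val_idx in k_fold_indices(n_samples=7, n_folds=3):
    print(train_idx, val_idx)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```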

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv', on_bad_lines='skip')  # pandas >= 1.3; error_bad_lines was removed in pandas 2.0
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

ANSWER1:
The Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words (BoW) are both techniques used in natural language processing (NLP) for representing and vectorizing text data. While the choice between TF-IDF and BoW depends on the specific task and dataset, TF-IDF often results in better accuracy than BoW in certain scenarios. Here are some reasons why TF-IDF may perform better:

1. **Term Importance:**
   - **TF-IDF:** This approach takes into account not only the frequency of a term in a document (TF) but also its rarity across all documents in the dataset (IDF). Terms that are common in a specific document but rare in the entire dataset are assigned higher weights.
   - **BoW:** BoW represents documents as a bag of words, ignoring the importance of individual terms. It only considers the frequency of terms within a document.

2. **Common Word Handling:**
   - **TF-IDF:** Common words that appear in many documents (e.g., "the," "and") are assigned lower weights because of their high document frequency.
   - **BoW:** BoW treats all words equally in terms of importance, including common words that may not carry much discriminative information.

3. **Feature Scaling, Not Dimensionality:**
   - **TF-IDF:** The vector has one dimension per vocabulary term, exactly as in BoW, so TF-IDF does not reduce dimensionality. What changes is the scale of each dimension: common terms get small weights, so distance computations are dominated by the distinctive terms that differentiate documents from each other.
   - **BoW:** Raw counts let frequent but uninformative terms dominate distances between the high-dimensional sparse vectors, which hurts distance-based models such as KNN, especially when the dataset is not very large.

4. **Sensitive to Word Importance:**
   - **TF-IDF:** Emphasizes the importance of words that are informative and discriminative for a particular document in the context of the entire dataset.
   - **BoW:** Treats all words as equally important, which may not be suitable for tasks where certain words are more indicative of the document's meaning.

5. **Inverse Document Frequency:**
   - **TF-IDF:** The IDF component down-weights terms that are common across all documents. This helps in capturing the unique and discriminative features of a document.
   - **BoW:** BoW does not consider the rarity of terms across the entire corpus, potentially leading to less distinctive representations.

It's important to note that the performance of TF-IDF or BoW depends on the specific characteristics of the dataset and the nature of the NLP task. In some cases, BoW might be sufficient or even preferable, especially if the dataset is small and the task involves simple text classification. It's often a good practice to experiment with both representations and evaluate their performance on the specific task at hand. Additionally, more advanced methods, such as word embeddings and deep learning approaches, have gained popularity for NLP tasks and may outperform traditional TF-IDF and BoW in certain scenarios.

ANSWER2:
While Bag-of-Words (BoW) and TF-IDF are traditional and widely used techniques for text representation, there are more advanced methods that have shown better performance in various natural language processing (NLP) tasks. Here are a few techniques that often outperform BoW and TF-IDF:

1. **Word Embeddings:**
   - Word embeddings, such as Word2Vec, GloVe, and FastText, represent words as dense vectors in a continuous vector space. These embeddings capture semantic relationships between words and are typically pre-trained on large corpora. Each word gets a single vector regardless of the sentence it appears in, but because similar words get similar vectors, embeddings often perform better than BoW and TF-IDF in tasks like sentiment analysis and named entity recognition.

2. **Doc2Vec (Paragraph Embeddings):**
   - Doc2Vec extends the idea of word embeddings to represent entire documents as continuous vectors. It considers the context of words within a document and generates a fixed-size vector representation for the entire document. This can be effective for tasks where document-level semantics are important.

3. **BERT (Bidirectional Encoder Representations from Transformers):**
   - BERT is a transformer-based model pre-trained on large amounts of text data. It captures bidirectional context and has been shown to achieve state-of-the-art results in various NLP tasks, such as question answering, text classification, and named entity recognition. BERT representations can be fine-tuned for specific tasks.

4. **ULMFiT (Universal Language Model Fine-tuning):**
   - ULMFiT is a transfer learning approach for NLP that involves pre-training a language model on a large corpus and then fine-tuning it for specific tasks. It has been successful in achieving good performance with limited labeled data for various NLP tasks.

5. **ELMo (Embeddings from Language Models):**
   - ELMo generates contextualized word embeddings by considering the context in which a word appears. It uses a deep bidirectional LSTM to capture contextual information and has been used for tasks like question answering and sentiment analysis.

6. **Transformer Models (e.g., GPT-3, T5):**
   - Large-scale transformer models, such as GPT-3 (Generative Pre-trained Transformer) and T5 (Text-to-Text Transfer Transformer), have achieved remarkable performance on a wide range of NLP tasks. These models are pre-trained on massive amounts of data and can be fine-tuned for specific tasks.

These advanced techniques leverage deep learning and pre-training on large corpora to capture complex relationships and contextual information in text. The choice of technique depends on the specific task, dataset size, and available computational resources. It's worth noting that these methods often require more computational resources and data compared to traditional methods like BoW and TF-IDF.
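As a rough illustration of the word-embedding idea, the sketch below averages word vectors into a document vector and compares documents by cosine similarity. The 3-dimensional vectors in `TOY_EMBEDDINGS` are invented for this example and do not come from any trained model; real embeddings such as Word2Vec or GloVe typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "great":     [0.9, 0.1, 0.0],
    "excellent": [0.8, 0.2, 0.1],
    "terrible":  [-0.9, 0.1, 0.0],
    "movie":     [0.0, 0.9, 0.3],
    "film":      [0.1, 0.8, 0.4],
}

def document_vector(text):
    """Average the embeddings of the known words in the text."""
    vectors = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

a = document_vector("great movie")
b = document_vector("excellent film")
c = document_vector("terrible movie")
print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```

"great movie" and "excellent film" share no words, so their BoW or TF-IDF vectors have zero overlap, yet their averaged embeddings are close. "terrible movie" shares a word with "great movie" but ends up farther away.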

ANSWER3:

**Stemming:**

**Definition:** Stemming is the process of reducing words to their base or root form by removing suffixes. The goal is to simplify words so that different inflections or derivations of a word map to the same stem.

**Pros:**
1. **Simplicity:** Stemming is a simpler process compared to lemmatization, making it computationally less expensive.
2. **Speed:** Stemming algorithms are generally faster than lemmatization algorithms, which can be important for real-time applications and large datasets.
3. **Works well in information retrieval:** In information retrieval and search applications, stemming can be beneficial as it helps match different forms of a word.

**Cons:**
1. **Over-stemming:** Stemming may lead to over-stemming, where different words are reduced to the same stem even if they have different meanings. This can result in a loss of semantic information.
2. **Not always linguistically valid:** Stemming doesn't always produce linguistically valid words, and the resulting stems may not be actual words.

**Lemmatization:**

**Definition:** Lemmatization involves reducing words to their base or dictionary form (lemma) by considering the morphological analysis of the words. It aims to transform words to their canonical form.

**Pros:**
1. **Linguistic accuracy:** Lemmatization produces linguistically valid words, ensuring that the transformed words are actual words found in a dictionary.
2. **Preservation of meaning:** Lemmatization preserves the meaning of words, reducing the risk of losing semantic information compared to stemming.
3. **Better performance in certain NLP tasks:** In tasks where word meaning is crucial, such as question answering or sentiment analysis, lemmatization can outperform stemming.

**Cons:**
1. **Computational complexity:** Lemmatization is generally more computationally expensive than stemming due to the need for morphological analysis and dictionary lookups.
2. **Depends on part-of-speech tags and dictionary coverage:** A lemmatizer such as WordNet's needs the correct part of speech to pick the right lemma, and words missing from its dictionary are returned unchanged.

**Conclusion:**

The choice between stemming and lemmatization depends on the specific requirements of the NLP task. Stemming is often chosen for its simplicity and speed, especially in applications like information retrieval, where recall is crucial. On the other hand, lemmatization is preferred when linguistic accuracy and preserving word meaning are more important, as in tasks that involve deeper semantic analysis. Some applications may even use a combination of both techniques based on the specific use case and trade-offs between speed and accuracy.
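The trade-off can be seen in a deliberately crude sketch. `toy_stem` strips a few suffixes by rule (far simpler than the Snowball stemmer used in this lab), and `toy_lemmatize` looks words up in a tiny hand-written table standing in for a real dictionary such as WordNet. Both functions and the table are hypothetical.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # toy rule set, far cruder than Snowball

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary.
TOY_LEMMAS = {"troubling": "trouble", "troubled": "trouble", "mice": "mouse", "went": "go"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["troubling", "troubled", "mice", "went", "news"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# troubling troubl trouble
# troubled troubl trouble
# mice mice mouse
# went went go
# news new news
```

The rule-based stemmer is fast and needs no dictionary, but it produces non-words ("troubl"), misses irregular forms ("mice", "went"), and over-stems "news" to "new". The lookup handles those cases only because someone put them in the table.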

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
