<a href="https://colab.research.google.com/github/Sleepybutterfly01/fmml-lab/blob/main/Mod3_lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. First, we look at some NLP techniques for text classification and at the tools commonly used for NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, words and images cannot be used directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
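
As a minimal sketch of the first two steps in the list above (removing numbers, handling capitalization and punctuation), the snippet below uses only Python's `re` module. The function name `basic_clean` and the sample sentence are made up for illustration; stemming and lemmatization are left out here.

```python
import re

def basic_clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    # Replace every non-letter (digits, punctuation, symbols) with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, then split on whitespace to get a list of tokens.
    return letters_only.lower().split()

tokens = basic_clean("The 3 Movies were GREAT, weren't they?")
print(tokens)
# ['the', 'movies', 'were', 'great', 'weren', 't', 'they']
```

Note that this simple rule also splits contractions ("weren't" becomes "weren" and "t"), which is one reason real pipelines often add further steps.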



### **NLTK**
NLTK (or Natural Language Toolkit) is a commonly used library for processing text. We will use this tool in this lab. Let's first import it and download the resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Fall back to NOUN, the lemmatizer's default part of speech.
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

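    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged instead of None.
    return word_tokenize(text)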

In [3]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
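
To make the idea concrete, here is a small pure-Python sketch of a bag-of-words representation. The two example sentences and the helper name `bag_of_words` are assumptions for illustration; the lab itself uses scikit-learn's `CountVectorizer`, which also handles tokenization and stop words.

```python
from collections import Counter

docs = ["the movie was good", "the movie was bad bad"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['bad', 'good', 'movie', 'the', 'was']
print(vectors)     # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]

# Word order is discarded: a shuffled sentence maps to the same vector.
shuffled = bag_of_words("good was movie the", vocabulary)
```

Here `shuffled` equals `vectors[0]`, which is exactly the "bag" property described above.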

In [5]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
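TF-IDF (Term Frequency-Inverse Document Frequency) addresses a weakness of the Bag of Words technique: raw counts treat every word as equally important, so frequent but uninformative words can dominate. TF-IDF weights each term by how often it occurs in a document and by how rare it is across the corpus, which often works well for text classification.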

The number of times a term occurs in a document is called its Term frequency (TF).

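Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually log-scaled; it measures the informativeness of term *t*: the rarer a term is across documents, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.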
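
The definitions above can be sketched in a few lines of plain Python. The three toy documents are invented for illustration, and the IDF formula used here, `ln((1 + n) / (1 + df)) + 1`, is one common smoothed variant (the one scikit-learn documents as its default); other texts use slightly different formulas. This sketch also skips the per-document normalization that scikit-learn applies by default.

```python
import math
from collections import Counter

# Toy corpus (already tokenized); "movie" appears in every document.
docs = [["good", "movie"], ["bad", "movie"], ["boring", "movie"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Map each term in `doc` to its TF * IDF weight."""
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "movie" gets weight 1.0 because it occurs everywhere, while "good" gets about 1.69 because it occurs in only one document.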




In [6]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
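
To see why the distance metric matters, here is a small from-scratch sketch of KNN with uniform voting. It is not how scikit-learn implements `KNeighborsClassifier`; the function names and the toy count vectors (counts of the words "good" and "bad") are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    # Sort training points by distance to the query and keep the k closest.
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], query))
    nearest_labels = [label for _, label in ranked[:k]]
    # Uniform weighting: each neighbor gets one vote.
    return Counter(nearest_labels).most_common(1)[0][0]

# Each vector is [count of "good", count of "bad"] in a review.
train_vectors = [[10, 1], [8, 0], [9, 2], [1, 2], [0, 1], [1, 3]]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
query = [1, 0]  # a very short review containing only "good"

by_euclidean = knn_predict(train_vectors, train_labels, query, 3, euclidean)
by_cosine = knn_predict(train_vectors, train_labels, query, 3, cosine_distance)
print(by_euclidean, by_cosine)  # neg pos
```

With Euclidean distance the short query lands next to the other short reviews, which happen to be negative. Cosine distance compares direction rather than length, so the query matches the long positive reviews instead.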

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
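
Until then, the following sketch gives a rough idea of what `cv=3` means: the training data is split into three folds, and each fold is used once for validation while the other two are used for training. The helper `k_fold_indices` is hypothetical and deliberately simple; for classifiers, scikit-learn's `cross_val_score` uses a stratified split that also balances the class labels across folds.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, n_folds=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample appears in exactly one validation fold, so the mean of the three scores uses all of the training data for evaluation.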

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
# `error_bad_lines` was removed in pandas 2.0; `on_bad_lines='skip'` (pandas >= 1.3) replaces it.
df = pd.read_csv('spam.csv', on_bad_lines='skip')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

**Answer 1.** The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally yields better accuracy than the Bag-of-Words (BoW) model for several reasons:

1. **Term Importance:**
   - **TF-IDF:** It considers not only the frequency of a term in a document (TF) but also its importance in the entire corpus (IDF). Terms that appear frequently in a specific document but are rare in the overall corpus are assigned higher weights, capturing their significance.
   - **BoW:** BoW simply counts the occurrences of words in a document without considering their importance in the broader context. This may lead to less discriminative power for certain terms.

2. **Common Term Handling:**
   - **TF-IDF:** It penalizes terms that are common across many documents. Common words like "the" or "and" receive lower weights, making the representation more focused on distinctive terms.
   - **BoW:** BoW does not differentiate common words from important ones. As a result, common terms might dominate the representation and not contribute much to the discrimination between documents.

3. **Document Length Normalization:**
   - **TF-IDF:** TF-IDF weighting by itself does not normalize for length, but implementations usually pair it with normalization. scikit-learn's `TfidfVectorizer`, used in this lab, scales each document vector to unit L2 norm by default (`norm='l2'`), which reduces the bias towards longer documents.
   - **BoW:** Raw counts grow with document length, so BoW representations can be biased towards longer documents. This matters for distance-based models such as KNN when comparing documents of different lengths.

4. **Sparse Representation:**
   - **TF-IDF:** With scikit-learn's default settings, the TF-IDF matrix has non-zero entries in the same positions as the BoW matrix, so it is not sparser; it differs only in the weights.
   - **BoW:** BoW is also sparse: each document contains only a small fraction of the vocabulary, so most counts are zero. Sparsity is therefore not an advantage of one representation over the other.

5. **Semantic Understanding:**
   - **TF-IDF:** TF-IDF does not model meaning. It only reweights terms by how distinctive they are across documents, which can make topical or sentiment-bearing words stand out.
   - **BoW:** BoW likewise treats each word independently, neglecting the relationships between words.

In summary, TF-IDF addresses some of the limitations of the Bag-of-Words model by weighting terms by importance, down-weighting common terms, and (in scikit-learn's default configuration) normalizing for document length. Neither representation captures word order or meaning. These factors often improve accuracy, but the gain is not guaranteed and depends on the dataset and the classifier.
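
The effect of length normalization can be illustrated with a tiny sketch. The two count vectors are invented: the second document uses the same words in the same proportions but is three times as long. The helper `l2_normalize` is a plain-Python stand-in for the L2 scaling that scikit-learn's `TfidfVectorizer` applies by default.

```python
import math

def l2_normalize(vector):
    """Scale the vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

short_doc = [1, 2, 0]
long_doc = [3, 6, 0]  # same word proportions, three times as long

# Before normalization the two documents are far apart.
raw_distance = math.dist(short_doc, long_doc)

# After normalization they coincide (up to floating-point rounding).
normalized_distance = math.dist(l2_normalize(short_doc), l2_normalize(long_doc))
print(raw_distance, normalized_distance)
```

After normalization the distance between the two documents is essentially zero, so a distance-based classifier such as KNN no longer separates them just because one is longer.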

**Answer 2.** Yes, several techniques have been developed that aim to overcome some limitations of both Bag-of-Words (BoW) and TF-IDF. Two notable approaches are:

1. **Word Embeddings:**
   - **Example Technique: Word2Vec, GloVe, FastText**
   - **Key Features:**
      - **Semantic Understanding:** Word embeddings capture semantic relationships between words by representing them as dense vectors in a continuous vector space.
      - **Contextual Information:** These models learn each word's vector from the words that tend to appear around it in the training corpus. The resulting vector is static, however: a word gets the same vector in every sentence.
      - **Dimensionality Reduction:** Dense vector representations typically have lower dimensionality compared to high-dimensional BoW or TF-IDF matrices.

2. **Transformers and Attention Mechanisms:**
   - **Example Technique: BERT (Bidirectional Encoder Representations from Transformers)**
   - **Key Features:**
      - **Bidirectional Context:** Transformers consider bidirectional context during training, allowing them to capture dependencies from both preceding and succeeding words.
      - **Pre-trained Representations:** Models like BERT are pre-trained on large corpora, learning rich representations that can be fine-tuned for specific tasks.
      - **Contextualized Embeddings:** The embeddings produced by transformers are contextualized, meaning the representation of a word depends on its context in a sentence.

**Advantages over BoW and TF-IDF:**

1. **Semantic Richness:** Word embeddings and transformers capture more semantic nuances by representing words in a continuous vector space rather than as discrete counts.
   
2. **Context Awareness:** These techniques consider the context in which words appear, allowing them to capture contextual variations and dependencies.

3. **Dimensionality Reduction:** Dense vector representations have lower dimensionality compared to the high-dimensional BoW or TF-IDF matrices, which can be beneficial in terms of computational efficiency.

4. **Transfer Learning:** Pre-trained models like BERT can be fine-tuned for specific tasks, leveraging knowledge learned from vast amounts of data.

**Considerations:**

- **Computational Complexity:** These techniques often require more computational resources during training compared to BoW or TF-IDF.

- **Data Requirements:** Training word embeddings or transformers from scratch requires large corpora, which is often not feasible with small datasets. Pre-trained models reduce this burden.

- **Task-Specific:** The choice of technique depends on the specific NLP task at hand. While word embeddings and transformers excel in capturing semantic relationships, their performance can vary based on the nature of the task.

In summary, word embeddings and transformers represent advancements over BoW and TF-IDF, offering richer semantic representations and contextual understanding. The choice among these techniques depends on the specific requirements and characteristics of the NLP task.

**Answer 3.** Below is a general overview of stemming and lemmatization, followed by the common pros and cons of each technique.

**Stemming:**
Stemming is a text normalization technique that aims to reduce words to their root or base form by removing suffixes. For example, "running" would be stemmed to "run." Here are some pros and cons of stemming:

*Pros:*
1. **Simplicity:** Stemming is a simpler and faster process compared to lemmatization, making it computationally more efficient.
2. **Reduced Dimensionality:** It helps in reducing the dimensionality of the feature space by consolidating different forms of words.

*Cons:*
1. **Over-stemming:** Stemming may lead to over-stemming, where words with different meanings are reduced to the same root. This can result in a loss of meaning.
2. **Lack of Linguistic Accuracy:** Since stemming operates based on rules without considering linguistic context, it may not always generate linguistically accurate roots.

**Lemmatization:**
Lemmatization is a more sophisticated text normalization technique that involves reducing words to their base or dictionary form (lemma). For example, "better" would be lemmatized to "good" when it is tagged as an adjective. Here are some pros and cons of lemmatization:

*Pros:*
1. **Linguistic Accuracy:** Lemmatization provides linguistically accurate lemmas, considering the context and part of speech of words.
2. **Improved Semantics:** It preserves the semantic meaning of words better than stemming, as it maps words to their dictionary forms.

*Cons:*
1. **Computational Complexity:** Lemmatization is computationally more intensive and can be slower compared to stemming, which might be a concern for large datasets.
2. **Dependence on Resources:** Lemmatization relies on a dictionary and on correct part-of-speech tags. Words missing from the dictionary are left unchanged, and a wrong tag can produce the wrong lemma.

**Common Considerations:**
1. **Task-Specific:** The choice between stemming and lemmatization often depends on the specific NLP task and the trade-off between linguistic accuracy and computational efficiency.
2. **Language Sensitivity:** The effectiveness of stemming and lemmatization can vary across different languages, and certain languages might benefit more from one technique over the other.

In practice, the choice between stemming and lemmatization depends on the specific requirements of the natural language processing task, the characteristics of the dataset, and the balance between linguistic accuracy and computational efficiency.
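
The contrast can be sketched with two toy functions. Neither is a real algorithm: `toy_stem` is a crude suffix stripper (not the Snowball stemmer used in this lab), and `TOY_LEMMAS` is a hypothetical four-word dictionary standing in for a lexical resource such as WordNet.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"troubling": "trouble", "better": "good",
              "mice": "mouse", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["troubling", "studies", "mice", "better"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)   # ['troubl', 'studi', 'mice', 'better']
print(lemmas)  # ['trouble', 'study', 'mouse', 'good']
```

The rule-based stemmer is fast but produces non-words ("troubl", "studi") and misses irregular forms ("mice", "better"). The dictionary lookup returns real words but only for entries it knows about.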

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
