<a href="https://colab.research.google.com/github/12-19-2005/Fmml-2024/blob/main/lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words and images cannot be fed directly into algorithms; they must first be converted into vectors, a form that algorithms can work with.
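The first two steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the lab's full pipeline: `basic_clean` is a hypothetical helper name, and stemming and lemmatization are left to NLTK.

```python
import re

def basic_clean(text):
    """Replace every non-letter character with a space, lowercase, and split into words."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return letters_only.lower().split()

print(basic_clean("Call NOW!!! Win 1000 prizes."))
# ['call', 'now', 'win', 'prizes']
```

The output is still a list of strings; turning such lists into numeric vectors is a separate step.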



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to WordNet's default part of speech

        # Tokenize, tag each token with its part of speech, then lemmatize.
        tokens = word_tokenize(text)
        tagged = pos_tag(tokens)
        return [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in tagged]

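    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in word_tokenize(text)]

    # Neither option selected: return the cleaned tokens unchanged
    # (previously the function returned None here, which broke the callers' join).
    return word_tokenize(text)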

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
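To make the "bag" idea concrete, here is a minimal standard-library sketch. `bag_of_words` is a hypothetical helper that splits on whitespace only; the lab itself uses scikit-learn's `CountVectorizer`. Two sentences with opposite meanings but the same words get identical vectors.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word; words missing from the document count as 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["dog bites man", "man bites dog", "dog eats"])
print(vocabulary)  # ['bites', 'dog', 'eats', 'man']
print(vectors)     # [[1, 1, 0, 1], [1, 1, 0, 1], [0, 1, 1, 0]]
```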

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag-of-Words: raw counts give frequent but uninformative words as much weight as rare, distinctive ones, or more. TF-IDF scales each count down according to how many documents contain the word, which often helps text classification.

The number of times a term occurs in a document is called its Term frequency (TF).

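Document frequency (DF) is the number of documents in which a term *t* is present. Inverse document frequency (IDF) is a decreasing function of DF, typically logarithmic, so it measures how informative *t* is: terms that occur in few documents get a high IDF. A term's TF-IDF weight is the product of its TF and its IDF.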
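The definitions above can be sketched by hand as follows. The sketch assumes the smoothed IDF variant, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the documented default of scikit-learn's `TfidfVectorizer`; it omits the final per-document normalization, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return (idf per term, TF-IDF weights per document) without normalization."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    # Document frequency: in how many documents does each term appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({w: tf[w] * idf[w] for w in tf})
    return idf, weights

idf, weights = tf_idf(["good movie", "bad movie", "good good plot"])
print(round(idf["movie"], 3), round(idf["plot"], 3))  # 1.288 1.693
```

Here "movie" appears in two of the three documents and "plot" in only one, so "plot" receives the larger IDF.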




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
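The two models use different distance metrics (`euclidean` and `cosine`). As a rough illustration of why the choice matters for text, the sketch below compares both on hand-made count vectors; the vectors and helper names are hypothetical, and cosine distance is taken as 1 minus cosine similarity.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_review = [1, 1, 0]
long_review = [3, 3, 0]   # same words as short_review, used three times as often
other_review = [0, 1, 1]  # shares only one word with short_review

# Euclidean distance treats the longer review as farther away than the unrelated one.
print(euclidean_distance(short_review, long_review))   # about 2.83
print(euclidean_distance(short_review, other_review))  # about 1.41

# Cosine distance depends only on direction, so the longer review is the closest.
print(cosine_distance(short_review, long_review))      # close to 0
print(cosine_distance(short_review, other_review))     # about 0.5
```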

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
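Until then, here is a simplified sketch of what `cv=3` means: the training data is split into three folds, and each fold serves once as the validation set while the other two are used for fitting. `k_fold_indices` is a hypothetical helper; scikit-learn's own splitter for classifiers additionally stratifies the folds by class, which this sketch ignores.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

for train, validation in k_fold_indices(7, 3):
    print(train, validation)
# [3, 4, 5, 6] [0, 1, 2]
# [0, 1, 2, 5, 6] [3, 4]
# [0, 1, 2, 3, 4] [5, 6]
```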

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
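# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Note: the code below expects columns named 'Category' and 'Message'; a copy obtained elsewhere may use different column names and need renaming.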
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

A) TF-IDF (Term Frequency-Inverse Document Frequency) often outperforms raw Bag-of-Words (BoW) counts for several reasons:

*Limitations of Bag-of-Words:*

1. Ignores word importance: BoW treats all words equally, without considering their significance in the document.
2. Doesn't account for document length: Longer documents have more chances of having higher word frequencies.
3. Sensitive to noise: BoW is affected by stop words (common words like "the," "and") and outliers.

*Advantages of TF-IDF:*

1. *Term Frequency (TF)*: Measures word importance within a document.
2. *Inverse Document Frequency (IDF)*: Penalizes common words across documents, emphasizing rare words.
3. *Weighting*: Assigns higher weights to informative words, reducing noise.

*Key benefits:*

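1. *Improved feature representation*: TF-IDF weights reflect how distinctive a word is for a document, not just how often it occurs.
2. *Same dimensionality, better scaling*: The feature space is the same size as for BoW (one feature per vocabulary word), but common words receive smaller values.
3. *Length normalization*: scikit-learn's `TfidfVectorizer` L2-normalizes each document vector by default, which reduces the effect of document length. Like BoW, TF-IDF still ignores words that are not in the training vocabulary.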

*Comparison:*

| *Approach*   | *Accuracy*   | *Reason*                                                    |
| ------------ | ------------ | ----------------------------------------------------------- |
| Bag-of-Words | Often lower  | Ignores word importance, sensitive to noise                 |
| TF-IDF       | Often higher | Down-weights common words and emphasizes distinctive ones   |

*When to use TF-IDF:*

1. Text classification
2. Information retrieval
3. Topic modeling
4. Document similarity analysis

*When to use Bag-of-Words:*

1. Simple text analysis
2. Quick prototyping
3. Small datasets

In summary, TF-IDF's weighting scheme and consideration of word importance make it a more effective approach than Bag-of-Words for text analysis tasks.

*Additional Tips:*

- Preprocessing techniques like stemming, lemmatization, and stopword removal can further improve TF-IDF performance.
- Experiment with different TF-IDF variants, such as normalized TF-IDF or BM25.
- Consider using word embeddings (e.g., Word2Vec, GloVe) for more advanced text representation.

2. Can you think of techniques that are better than both BoW and TF-IDF?

A) Yes, several techniques have been developed to improve upon Bag-of-Words (BoW) and TF-IDF:

*Word Embeddings*

1. *Word2Vec*: Captures semantic relationships between words using vector representations.
2. *GloVe*: Represents words as vectors, considering global word co-occurrences.
3. *FastText*: Extends Word2Vec to handle out-of-vocabulary words.

*Advantages*

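1. Capture semantic relationships
2. Give similar vectors to words used in similar contexts (static embeddings assign one vector per word, so they do not separate the senses of polysemous words)
3. Often improve text classification, clustering, and information retrieval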

*Document Embeddings*

1. *Doc2Vec*: Extends Word2Vec to represent documents as vectors.
2. *BERT-based embeddings*: Utilize pre-trained language models for document representation.

*Advantages*

1. Capture document-level semantics
2. Improve document classification, clustering, and similarity analysis

*Deep Learning-based Methods*

1. *Convolutional Neural Networks (CNNs)*: Extract features from text using convolutional layers.
2. *Recurrent Neural Networks (RNNs)*: Model sequential relationships in text.
3. *Transformers*: Utilize self-attention mechanisms for text representation.

*Advantages*

1. Automatic feature extraction
2. Handle complex text structures
3. Improve text classification, sentiment analysis, and language modeling

*Other Techniques*

1. *Topic Modeling*: Represent documents by underlying topics extracted with, for example, Latent Dirichlet Allocation (LDA).
2. *Named Entity Recognition (NER)*: Identify and categorize entities in text, which can supply additional features.
3. *Part-of-Speech (POS) Tagging*: Identify the grammatical category of each word, which can also supply additional features.

*Comparison*

|  Technique  | Complexity | Performance |
|  -----------  | ----------- | ----------- |
| BoW          | Low        | Baseline    |
| TF-IDF       | Low-Moderate | Better      |
| Word Embeddings | Moderate    | Improved    |
| Document Embeddings | Moderate-High | Advanced    |
| Deep Learning  | High        | State-of-the-art |

Keep in mind that the best technique depends on the specific task, dataset, and computational resources.

*Additional Tips*

- Experiment with pre-trained models and fine-tune them for your task.
- Consider using transfer learning and domain adaptation.
- Evaluate techniques using metrics relevant to your task (e.g., accuracy, F1-score, ROUGE score).

3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

A) Stemming and Lemmatization are text normalization techniques used in Natural Language Processing (NLP) to reduce words to their base form.

*Stemming:*

- Reduces words to their stem using rule-based algorithms (e.g., Porter Stemmer).
- Strips suffixes (-ed, -ing, -ly); the resulting stem is not always a dictionary word.

*Example:*

- Running → Run
- Hopping → Hop

*Pros:*

1. Simple and fast.
2. Reduces dimensionality.
3. Improves text comparison.

*Cons:*

1. Can produce stems that are not real words (e.g., "troubling" → "troubl", as in the `cleanText` example earlier in this lab).
2. Fails to handle irregular verbs (e.g., "went" is not reduced to "go").
3. Doesn't account for word context.

*Lemmatization:*

- Uses dictionaries and morphological analysis to reduce words to their base or root form (lemma).
- Considers word context and part-of-speech (POS) tagging.

*Example:*

- Running → Run (verb)
- Runner → Runner (noun)

*Pros:*

1. More accurate than stemming.
2. Handles irregular verbs and word context.
3. Produces meaningful roots.

*Cons:*

1. Computationally expensive.
2. Requires large dictionaries and POS tagging.
3. Can be language-dependent.

*Comparison:*

|  Technique  | Accuracy | Speed | Complexity |
|  -----------  | ----------- | ------ | ----------- |
| Stemming    | Lower     | Faster | Simpler    |
| Lemmatization | Higher    | Slower | More complex |

*When to use:*

- Stemming: Simple text analysis, quick prototyping.
- Lemmatization: Advanced NLP tasks, high-accuracy requirements.

*Resources:*

- NLTK (Natural Language Toolkit) for Python: Provides stemming and lemmatization tools.
- spaCy: Offers high-performance lemmatization and POS tagging.
- Stanford CoreNLP: Includes lemmatization and POS tagging tools.

*Additional Tips:*

- Experiment with different stemming algorithms (e.g., Porter, Snowball).
- Consider using pre-trained models for lemmatization.
- Evaluate techniques using metrics relevant to your task (e.g., accuracy, F1-score).
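The contrast between the two approaches can be shown with a toy sketch. It is deliberately not the Porter or Snowball algorithm and not WordNet: `toy_stem` strips a few hypothetical suffixes by rule, and `toy_lemmatize` looks words up in a tiny hand-written table, which stands in for the dictionary a real lemmatizer relies on.

```python
# Hypothetical rule set and lookup table, for illustration only.
SUFFIXES = ("ing", "ed", "ly", "s")
LEMMA_TABLE = {"went": "go", "troubling": "trouble", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)

for word in ["Troubling", "went", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# Troubling troubl trouble
# went went go
# better better good
```

The rule-based version is fast but blind to irregular forms, while the lookup version is only as good as its dictionary.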

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
