
MODULE: CLASSIFICATION-1

LAB-3 : Using KNN for Text Classification

Section 1: Understanding NLP tools

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

Section 1.1: Data Cleaning and Preprocessing step

NLTK

NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.

In [21]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [22]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Nouns and all other tags fall back to NOUN, the lemmatizer's default.
                return wordnet.NOUN

        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)      # List of (token, POS tag) pairs
        text_result = [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in tagged]
        return text_result

    if stemmer:
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option selected: return the cleaned tokens unchanged
    # so callers can still join the result into a string.
    return word_tokenize(text)

In [23]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble
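
The character-level cleaning inside cleanText can also be tried on its own. The sketch below repeats only the regular-expression and lowercasing steps (no HTML stripping, tokenizing, stemming, or lemmatizing). The helper name `strip_non_letters` and the sample sentence are made up for illustration.

```python
import re

def strip_non_letters(text):
    """Replace every non-letter character with a space, then lowercase."""
    # Same pattern as in cleanText: anything outside A-Z / a-z becomes a space,
    # so digits and punctuation both disappear.
    text = re.sub(r"[^A-Za-z]", " ", text)
    return text.lower()

sample = "Great phone!!! 10/10, would buy again."  # hypothetical review text
cleaned = strip_non_letters(sample)
print(cleaned.split())
```

Note that digits are removed together with punctuation, so a rating such as "10/10" leaves no trace in the cleaned text.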


Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
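
To make the idea concrete, here is a minimal plain-Python sketch of a bag-of-words matrix. It is not how CountVectorizer is implemented (the real class also lowercases and applies its own tokenization rules); the three toy documents and the helper name `bag_of_words` are assumptions for illustration only.

```python
from collections import Counter

# Toy corpus (invented for this example).
docs = ["the food was good", "the service was not good", "good good food"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in doc; word order is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

Because only counts are kept, "good good food" and "good food good" map to the same vector.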

In [24]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


Section 1.3: TF-IDF

In [25]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET

In [26]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews (1).csv


In [27]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [28]:
df = df.dropna()

In [29]:
df.to_csv('reviews.csv', index=False)

Section 3: KNN MODEL

In [30]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [31]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 62.30366492146597%
Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]
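
For intuition about what a KNN classifier with uniform weights does, here is a minimal plain-Python sketch: find the k closest training vectors and take a majority vote. It is an illustration under simplifying assumptions (Euclidean distance, no tie handling, invented two-feature count vectors), not the scikit-learn implementation.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Predict the majority label among the k training vectors closest to query."""
    distances = [
        (euclidean(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy count vectors: [count of "good", count of "bad"] (hypothetical data).
train_vectors = [[3, 0], [2, 1], [2, 0], [0, 3], [1, 2]]
train_labels = ["positive", "positive", "positive", "negative", "negative"]

print(knn_predict(train_vectors, train_labels, query=[2, 0], k=3))
print(knn_predict(train_vectors, train_labels, query=[0, 2], k=3))
```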






In [32]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 70.15706806282722%
Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]
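
The two models above differ in more than the features: the BoW model uses the Euclidean metric, the TF-IDF model uses cosine. As a rough illustration of why the metric matters for text vectors, the sketch below compares both distances on hand-made count vectors. The numbers are invented and are not taken from the reviews data.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

short_doc = [1, 1, 0]   # toy word counts
long_doc = [4, 4, 0]    # same word proportions, four times longer
other_doc = [0, 1, 1]   # partly different words, same length as short_doc

print(euclidean_distance(short_doc, long_doc), euclidean_distance(short_doc, other_doc))
print(cosine_distance(short_doc, long_doc), cosine_distance(short_doc, other_doc))
```

Euclidean distance treats the long document as far from the short one even though they use the same words in the same proportions, while cosine distance treats them as almost identical. When working on Task 1, this is one reason to try several `metric` values rather than only several values of `n_neighbors`.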




Section 4: SPAM TEXT DATASET

In [33]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [34]:
import pandas as pd
# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
df = pd.read_csv('spam.csv', on_bad_lines='skip')
df





Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [35]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [36]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [37]:
len(df)

5572

In [38]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [39]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [40]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


Questions to Think About and Answer

1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

2. Can you think of techniques that are better than both BoW and TF-IDF?

3. Read about stemming and lemmatization from the resources given below. Think about the pros and cons of each.

Answer 1)

TF-IDF (Term Frequency-Inverse Document Frequency) and Bag-of-Words are both methods used in natural language processing (NLP) for representing and analyzing text data. While TF-IDF and Bag-of-Words serve similar purposes, they have distinct characteristics that can affect their performance in different ways.

Consideration of Word Importance:

TF-IDF: This approach takes into account not only the frequency of a word in a document (Term Frequency) but also its importance in the entire corpus (Inverse Document Frequency). Words that are common across many documents are given lower importance, while words that are specific to a document are given higher importance. This helps in capturing the uniqueness of each document and can be beneficial in tasks like text classification.
Bag-of-Words: In contrast, Bag-of-Words represents a document as an unordered collection of raw word counts, ignoring both word order and how informative each word is across the corpus. A word that is frequent in one document but also frequent everywhere else therefore gets a large count even though it does little to distinguish that document from others.
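
The inverse document frequency part can be sketched in a few lines of plain Python. The formula used here, idf(t) = ln((1 + n) / (1 + df(t))) + 1, is the smoothed variant that scikit-learn documents as the TfidfVectorizer default; other libraries and textbooks use slightly different formulas. The three toy documents and the helper name `smoothed_idf` are assumptions for illustration.

```python
import math

# Toy corpus (invented for this example).
docs = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def smoothed_idf(term):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) counts documents containing t."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["the", "great", "terrible"]:
    print(term, round(smoothed_idf(term), 3))
```

"the" appears in every document and gets the smallest weight, while "terrible" appears in only one and gets the largest.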

Handling Stop Words:

TF-IDF: It automatically downweights common words that appear in many documents (like "the," "and," etc.) since their IDF values are low.
Bag-of-Words: Stop words are often included in the representation, and their high frequency may not provide much meaningful information about the content of the document.

Document Length Normalization:

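TF-IDF: Implementations such as scikit-learn's TfidfVectorizer, used above, scale each document vector to unit length by default (L2 normalization), which reduces the effect of document length on distances between documents.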
Bag-of-Words: Longer documents may have higher word counts, which can make them appear different from shorter documents even if they convey similar information. This issue is mitigated to some extent in TF-IDF.
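
A small sketch of what unit-length (L2) normalization does to documents of different lengths. The two weight vectors are invented, and `l2_normalize` is a hypothetical helper rather than a library function.

```python
import math

def l2_normalize(vector):
    """Scale a vector so that its Euclidean length is 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

short_review = [1, 2, 0]   # toy term weights
long_review = [3, 6, 0]    # same proportions, three times the length

print(l2_normalize(short_review))
print(l2_normalize(long_review))
```

After normalization the two vectors agree up to floating-point rounding, so a distance-based classifier such as KNN no longer separates them on length alone.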

Sparse Representation:

TF-IDF: The representation is typically sparse, meaning that most entries in the vector are zero. This can be advantageous in terms of memory and computation efficiency.
Bag-of-Words: The representation is also sparse, but without the additional weighting of TF-IDF, it might not capture the importance of terms as effectively.
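
As a sketch of what "sparse" means here, the snippet below stores one row of a document-term matrix by keeping only its non-zero entries. The row is invented. Note that the helper functions earlier in this lab call .toarray() on the vectorizer output, which converts the sparse matrix into a dense array and gives up this memory saving.

```python
# Toy counts for one document over a 10-word vocabulary (hypothetical values).
dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]

# Sparse form: keep only the column index and value of each non-zero entry.
sparse_row = {index: value for index, value in enumerate(dense_row) if value != 0}

# Share of entries that carry no information.
zero_fraction = dense_row.count(0) / len(dense_row)

print(sparse_row)
print(zero_fraction)
```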

Word Embeddings and Contextual Information:

Both TF-IDF and Bag-of-Words lack the ability to capture semantic relationships and contextual information present in more advanced methods like word embeddings.

Answer 2)

Yes, several techniques have been developed that can outperform Bag-of-Words (BoW) and TF-IDF in certain NLP tasks. Here are a few notable ones:

Word Embeddings:

Description: Word embeddings represent words as dense vectors in a continuous vector space, capturing semantic relationships and contextual information.
Advantages: They can capture subtle meanings, word similarity, and relationships, which BoW and TF-IDF may miss. Popular methods include Word2Vec, GloVe, and FastText.

Word2Vec:

Description: Word2Vec is a popular word embedding technique that represents words as vectors in a continuous space. It is trained on large corpora and learns to predict the context of words.
Advantages: It captures semantic relationships and can provide a more nuanced representation of words compared to BoW and TF-IDF.
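
"Predicting the context of words" can be made more concrete by looking at the training pairs such a model sees. The sketch below only generates (center word, context word) pairs in the style of the skip-gram variant of Word2Vec; it does not train anything. The function name `skipgram_pairs` and the window size are assumptions chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the food was good".split()
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

Words that tend to appear with the same neighbours produce similar training pairs, which is why they end up with similar vectors.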

GloVe (Global Vectors for Word Representation):

Description: GloVe is another word embedding technique that focuses on capturing global word-word co-occurrence statistics. It represents words as vectors based on the overall statistics of their co-occurrence in a corpus.
Advantages: It captures semantic relationships and is especially effective in capturing word analogies.

FastText:

Description: FastText is an extension of Word2Vec that represents words as bags of character n-grams. This allows it to capture morphological information and handle out-of-vocabulary words better.
Advantages: It is effective in handling morphologically rich languages and capturing subword information.
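
A sketch of how a word can be broken into character n-grams. Wrapping the word in `<` and `>` boundary markers and using n-gram lengths from 3 to 6 follows the description in the FastText paper as far as I recall it, so treat both as assumptions; the function name `char_ngrams` is made up.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Only trigrams, to keep the output short.
print(char_ngrams("where", n_min=3, n_max=3))
```

An unseen word still shares many of these n-grams with known words, which is how subword models can assign it a vector.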

BERT (Bidirectional Encoder Representations from Transformers):

Description: BERT is a transformer-based model pre-trained on large amounts of text data. It captures bidirectional context information and has achieved state-of-the-art results in various NLP tasks.
Advantages: It understands context better than BoW, TF-IDF, or traditional word embeddings and can be fine-tuned for specific downstream tasks.

ULMFiT (Universal Language Model Fine-tuning):

Description: ULMFiT is a transfer learning approach for NLP. It involves pre-training a language model on a large corpus and fine-tuning it for a specific task with a smaller dataset.
Advantages: It leverages pre-trained language models to improve performance on specific tasks with limited data.

ELMo (Embeddings from Language Models):

Description: ELMo is a deep contextualized word representation model. It produces word embeddings that are context-dependent, considering the surrounding words in a sentence.
Advantages: It captures context-specific information, making it more powerful than traditional word embeddings.


Answer 3)

Stemming:

Definition: Stemming is a text normalization process that reduces words to their base or root form by removing suffixes.

Pros:

Simplicity: Stemming algorithms are generally simpler to implement compared to lemmatization, making them computationally faster.
Reduction of Dimensionality: Stemming reduces the dimensionality of the feature space by collapsing different forms of a word into a common root. This can be beneficial for tasks like information retrieval and document clustering.

Cons:

Over-Stemming or Under-Stemming: Stemming algorithms may sometimes over-stem (remove too many letters) or under-stem (fail to remove enough), which can produce incorrect stems or non-words. For example, the Snowball stemmer used above reduces "troubling" to the non-word "troubl".
Lack of Semantic Understanding: Stemming doesn't consider the context or semantics of words. It may reduce words to a common root even if they have different meanings. For example, "bank" (financial institution) and "bank" (side of a river) might both be reduced to "bank."

Lemmatization:

Definition: Lemmatization is a more advanced form of text normalization that reduces words to their base or dictionary form (lemma) based on the word's meaning.

Pros:

Accuracy: Lemmatization tends to be more accurate than stemming because it considers the context and meaning of words.
Improved Readability: Lemmatized words are usually more readable and closer to actual words in a language, which can be important for tasks like text generation.

Cons:

Computational Complexity: Lemmatization is computationally more intensive than stemming, making it slower.
Resource Requirements: Lemmatization often requires access to dictionaries or lexical resources to identify the correct base form of a word. This can be a limitation for languages with complex morphology.

Comparison:

Use Case: If the application requires speed and efficiency, stemming might be preferred. If accuracy and language understanding are critical, lemmatization is usually a better choice.
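Context: Stemming applies heuristic suffix-stripping rules to each word in isolation, while lemmatization looks words up in a dictionary and, as in the cleanText function above, uses the part-of-speech tag to choose the right lemma.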
Application: In information retrieval or search engines where speed is crucial, stemming might be suitable. In applications like machine translation or chatbots where understanding the meaning of words is essential, lemmatization is often preferred.
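
The difference between the two approaches can be sketched with toy versions of each. Neither function below is the Snowball or WordNet algorithm: the suffix list, the length check, and the three-entry lemma dictionary are all hypothetical and exist only to contrast rule-based stripping with dictionary lookup.

```python
# Toy suffix rules and a tiny hand-written lemma dictionary (both hypothetical).
SUFFIXES = ["ing", "ed", "es", "s"]
LEMMAS = {"troubling": "trouble", "went": "go", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix, with no check that the result is a real word."""
    for suffix in SUFFIXES:
        # Require at least three remaining letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["troubling", "went", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer is fast and needs no resources but cannot handle irregular forms such as "went", while the lemmatizer handles them only because someone supplied the dictionary entry.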