
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
For text, several things need to be taken into account:


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.

Most importantly, one can't use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.
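As a minimal sketch of the first two steps, the snippet below uses only Python's `re` module. The function name `basic_clean` and the sample sentence are assumptions made for illustration; stemming and lemmatization are left out here because they need NLTK.

```python
import re

def basic_clean(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    text = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return text.lower().split()              # simple whitespace tokenization

tokens = basic_clean("Call NOW!! Win 1000 prizes, free.")
print(tokens)  # ['call', 'now', 'win', 'prizes', 'free']
```

The resulting tokens are still strings; turning them into vectors is a separate step.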



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the data packages it needs.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # lemmatize() defaults to nouns, so fall back to that

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # (token, Penn Treebank tag) pairs
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the plain tokens instead of None.
    return word_tokenize(text)

## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
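To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. It is not how `CountVectorizer` is implemented; the function name `bag_of_words` and the two toy documents are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, in a fixed (sorted) order.
    vocabulary = sorted({word for doc in tokenized for word in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Shuffling the words of a document leaves its vector unchanged, which is exactly the loss of order described above.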

In [4]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) addresses a weakness of Bag of Words: raw counts treat every word as equally important, so words that appear in almost every document can dominate the representation. TF-IDF instead weights each term by how informative it is.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually taken on a log scale; it measures the informativeness of term *t*: the rarer the term across documents, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
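The definitions above can be sketched in a few lines of pure Python. This uses the textbook form `tf * log(N / df)`, which is an assumption of this sketch; scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF, `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ. The function name `tf_idf` and the toy messages are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(word for doc in tokenized for word in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({word: count * math.log(n_docs / df[word])
                        for word, count in tf.items()})
    return weights

docs = ["free prize call now", "call me now", "see you now"]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In the first message, "now" occurs in every document and gets weight 0, while "free" and "prize" occur in only one document and get the largest weights.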




In [5]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

## **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

## **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

Note: Cross-validation will be discussed in detail in the upcoming lab session.
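To see why the choice of distance metric matters, here is a minimal pure-Python sketch of KNN with uniform majority voting. It is not scikit-learn's implementation, and it does not cover `weights='distance'`; the helper names (`euclidean`, `cosine_distance`, `knn_predict`) and the toy vectors are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Label the query by majority vote among its k nearest training vectors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy word-count vectors: a long message using only word A, and a short mixed one.
train_vectors = [[10, 0], [1, 1]]
train_labels = ["spam", "ham"]
query = [2, 0]  # a short message using only word A

print(knn_predict(train_vectors, train_labels, query, k=1, distance=euclidean))        # ham
print(knn_predict(train_vectors, train_labels, query, k=1, distance=cosine_distance))  # spam
```

Euclidean distance is sensitive to document length, so the short query lands next to the short mixed message; cosine distance compares direction only, so it matches the message with the same word proportions.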

## **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [14]:
import pandas as pd  # needed for pd.read_csv below
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test
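Both functions return `(predicted, y_test)`, so the predictions can be inspected beyond accuracy. When one class is much rarer than the other, as is often the case for spam, precision and recall on the positive class can be more informative. The sketch below is a pure-Python illustration; the helper name `precision_recall` is hypothetical and the labels are made up, not results from `spam.csv`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels for illustration only (0 = ham, 1 = spam).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.5 0.5
```

With the lab's functions this could be used as `predicted, y_test = bow_knn()` followed by `precision_recall(list(y_test), list(predicted))`. scikit-learn provides the same measures as `metrics.precision_score` and `metrics.recall_score`.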

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

A) TF-IDF (Term Frequency-Inverse Document Frequency) generally gives better accuracy than the Bag-of-Words (BoW) model because it weights each word by its importance in a document relative to the entire corpus, rather than by raw count alone:

1. Term Frequency (TF):

BoW: Documents are represented purely by word counts, so every word is treated as equally important regardless of its relevance or how common it is in the corpus.

TF-IDF: TF-IDF also starts from word counts, and implementations such as scikit-learn's `TfidfVectorizer` normalize each document vector (L2 norm by default), which reduces the effect of document length.

2. Inverse Document Frequency (IDF):

BoW: BoW does not account for how common or rare a word is across the corpus. Common words (like "the", "is", etc.) can therefore dominate the document representation and overshadow more meaningful, less frequent words.

TF-IDF: The IDF component down-weights words that are common across all documents and up-weights words that are rare. This emphasizes words that are more distinctive and informative for a specific document.

3. Discrimination Power:

BoW: Because all words are treated as equally important, BoW can fail to differentiate between documents effectively; common words dilute the distinguishing features of the text.

TF-IDF: By highlighting the terms that contribute most to the meaning of a document, TF-IDF makes documents easier to tell apart, which helps tasks like classification and clustering.

In summary, TF-IDF refines the document representation by considering both the frequency of words within a document and their rarity across the corpus. This gives a more informative and discriminative feature set, which generally improves the accuracy of text-related tasks compared to the Bag-of-Words model.


2. Can you think of techniques that are better than both BoW and TF-IDF?

A) Yes. Several techniques are generally considered more advanced and effective than both Bag-of-Words (BoW) and TF-IDF for natural language processing (NLP) tasks, including:

1. Word Embeddings:

Word2Vec: This model learns word embeddings by predicting surrounding words in a context window (Skip-gram) or by predicting a word given its context (CBOW - Continuous Bag of Words). Word2Vec captures semantic relationships between words, allowing similar words to have similar vector representations.

GloVe (Global Vectors for Word Representation): GloVe combines the advantages of global matrix factorization and local context window methods. It leverages word co-occurrence statistics to generate word vectors that capture global statistical information about a corpus.



2. Document Embeddings:

Doc2Vec: An extension of Word2Vec, Doc2Vec (Paragraph Vector) learns vector representations for entire documents. It provides a fixed-length feature vector for variable-length text, capturing the semantics of the document.



3. Recurrent Neural Networks (RNNs):

LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit): These are types of RNNs that can capture sequential dependencies and long-range context in text. They are particularly useful for tasks involving sequence prediction, such as language modeling and text generation.



4. Convolutional Neural Networks (CNNs) for Text:

CNNs can be applied to text data by treating the text as a sequence of word embeddings and applying convolutional filters to capture local patterns and hierarchical structures in the text.



5. Transformers:

BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer model that captures bidirectional context from both the left and right sides of a word. It has set new benchmarks in many NLP tasks by leveraging attention mechanisms to model complex language patterns.

GPT (Generative Pre-trained Transformer): GPT is another transformer model designed for language generation tasks. It has demonstrated impressive performance in generating coherent and contextually relevant text.



6. Universal Sentence Encoder:

This model provides embeddings for sentences rather than individual words, capturing the semantics of the entire sentence. It is useful for tasks like semantic similarity, clustering, and classification.



7. Transformers with Transfer Learning:

Models like T5 (Text-to-Text Transfer Transformer) and RoBERTa (Robustly Optimized BERT Approach) build on the transformer architecture: they are pre-trained on large corpora and then fine-tuned on specific tasks, which has produced strong results on many NLP benchmarks.




These techniques often outperform BoW and TF-IDF because they can capture richer patterns in text data, including semantic relationships, contextual dependencies, and hierarchical structure.



3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

A) Pros and Cons of Stemming and Lemmatization

Stemming:

Pros:

Speed: Stemming is computationally efficient and faster than lemmatization. This makes it suitable for applications where processing time is a priority, such as in search engines and information retrieval systems.

Simplicity: Stemming algorithms are simpler to implement and do not require extensive linguistic resources.


Cons:

Accuracy: Stemming can be imprecise, often leading to over-stemming (where words are overly reduced and unrelated words are conflated) or under-stemming (where related words don’t appear related). This can result in non-words that may lose meaningful context.

Lack of Context: Stemming does not consider the context of the word usage, which can lead to less accurate text analysis.



Lemmatization:

Pros:

Accuracy and Context: Lemmatization provides more accurate results by considering the context and the morphological analysis of words. It returns valid dictionary words, preserving grammatical structure and meaning.

Reduction of Ambiguity: By converting words to their base or dictionary form, lemmatization reduces ambiguity, making the text analysis clearer and more meaningful.


Cons:

Computational Complexity: Lemmatization is more computationally intensive and requires more processing power and time than stemming.

Dependency on Language Resources: Lemmatization relies on extensive linguistic resources like dictionaries and morphological analyzers, which can be a limitation for certain languages.



Conclusion

When deciding between stemming and lemmatization, consider the specific needs of your application. If speed and computational efficiency are crucial, stemming might be the better choice. However, if accuracy and preserving the grammatical context are important, lemmatization is more suitable.
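The trade-off can be illustrated with a toy comparison. The sketch below is not a real stemmer or lemmatizer: `toy_stem` is a crude suffix stripper and `TOY_LEMMAS` is a hypothetical lookup table standing in for a dictionary such as WordNet. In the lab itself, `SnowballStemmer` and `WordNetLemmatizer` play these roles.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table; a real lemmatizer consults a full dictionary.
TOY_LEMMAS = {"studies": "study", "running": "run", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["studies", "running", "better", "was"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The rule-based stemmer is fast and needs no resources, but it produces non-words ("stud", "runn") and cannot relate "better" to "good". The lookup-based lemmatizer returns valid words, but only for entries its dictionary covers.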

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
