
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will apply KNN to a real-world NLP task: text classification. We first look at some common NLP preprocessing techniques and the Python tools that implement them.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be cleaned and converted into a form that machine-learning algorithms can use.  
For text, several preprocessing steps are usually needed:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words (like images) cannot be fed to an algorithm directly; they must first be converted into vectors, a numeric form that algorithms can work with.
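
As a minimal sketch of the first two cleaning steps (removing numbers and punctuation, and lowercasing), the snippet below uses only Python's `re` module. The function name `normalize` and the sample sentence are illustrative assumptions; stemming and lemmatization are left to NLTK, as in the `cleanText` function of this lab.

```python
import re

def normalize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    # Replace every character that is not a letter (digits, punctuation) with a space.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # Lowercase, then split on whitespace to get a list of tokens.
    return letters_only.lower().split()

print(normalize("The 3 cats weren't running!"))
# ['the', 'cats', 'weren', 't', 'running']
```

Note that the apostrophe splits "weren't" into two tokens; this is a side effect of the simple letters-only rule.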



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text, and we will use it throughout this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # WordNet has no tag for other parts of speech; fall back to
                # NOUN, which is also the lemmatizer's default.
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)  # List of (token, Penn Treebank tag) pairs
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in word_tokenize(text)]

    # Neither option requested: return the cleaned tokens unchanged.
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model records only which known words occur in the document (and, in the count-based version used here, how often), not where they occur.
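
The idea can be sketched in a few lines of plain Python. This is an illustration only: the two toy documents and the helper name `bag_of_words` are assumptions, and the lab itself relies on scikit-learn's `CountVectorizer` for the real work.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Build a sorted vocabulary from every word seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return the count of each vocabulary word in doc, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Shuffling the words of a document leaves its vector unchanged, which is exactly the loss of word order described above.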

In [None]:
# Convert the train and test documents to bag-of-words count vectors, optionally removing stopwords. Returns the train and test document-term matrices.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a key limitation of Bag of Words: raw counts treat every word as equally important. TF-IDF instead weights each word by how informative it is, which often gives better features for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency, and it measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
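
As a sketch of these definitions, one common formulation is given below, where $n$ is the number of documents and $\mathrm{df}(t)$ is the number of documents containing $t$. The smoothed IDF shown here is the default in scikit-learn's `TfidfVectorizer`, which also scales each document vector to unit Euclidean length; textbooks often use the plain $\log\left(n / \mathrm{df}(t)\right)$ instead.

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1
$$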




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
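
To make the role of the distance metric concrete, here is a minimal pure-Python sketch of KNN with a pluggable distance function. The two-word vocabulary, the toy count vectors and the labels are hypothetical, and the vote is unweighted (the TF-IDF model below uses `weights='distance'` instead), so this only illustrates the idea and does not stand in for `KNeighborsClassifier`.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Majority vote among the k training vectors closest to query."""
    by_distance = sorted(range(len(train_vectors)),
                         key=lambda i: distance(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical counts of the words ("good", "bad") in four labelled reviews.
train_vectors = [[1, 0], [2, 0], [3, 4], [2, 5]]
train_labels = ["pos", "pos", "neg", "neg"]
query = [6, 2]  # a long review that mostly says "good"

print(knn_predict(train_vectors, train_labels, query, 3, euclidean))        # neg
print(knn_predict(train_vectors, train_labels, query, 3, cosine_distance))  # pos
```

Euclidean distance is sensitive to document length, so the long query lands near the long negative reviews, whereas cosine distance compares only the direction of the vectors.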

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]


# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

| | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
The **TF-IDF (Term Frequency-Inverse Document Frequency)** approach often gives **better accuracy** than **Bag-of-Words (BoW)** in **text classification** and **information retrieval** tasks because of the way it weights words. Note that in this lab the two models also differ in their KNN settings (Euclidean distance with uniform weights for BoW, cosine distance with distance weighting for TF-IDF), so the accuracy gap is not due to the features alone. The key reasons **TF-IDF** tends to outperform **BoW** are:

### 1. **BoW Treats All Words Equally**:
   - In **BoW**, each word in the document is treated as a feature, and the feature vector simply counts how often each word appears in the document, regardless of how common or rare the word is across the entire corpus.
   - This means that in **BoW**, common words like "the", "is", "and", "in", etc., are treated the same as more informative words like "disease", "stock", or "recommendation", even though common words carry little to no useful information for classification tasks.
   - **TF-IDF**, on the other hand, adjusts for this by giving higher weight to **rare words** and reducing the weight of **frequent words** that occur across many documents. This makes the feature vectors more **discriminative** and better suited for distinguishing between documents.

### 2. **Inverse Document Frequency (IDF) Reduces the Impact of Common Words**:
   - The **Inverse Document Frequency (IDF)** component of **TF-IDF** helps adjust for the **universal importance** of words. Words that appear frequently across many documents (like "the", "and", "is") will have a low IDF score, meaning their contribution to the model is reduced.
   - This is crucial because if a word appears in almost every document, it doesn’t provide any useful information to distinguish between classes or topics. **TF-IDF** diminishes the effect of such words, allowing more **informative words** to stand out.
   - For example, in a **text classification task** for spam detection, the word "free" may be more relevant for distinguishing between spam and non-spam emails, whereas "the" or "from" is not. **TF-IDF** reduces the weight of "the" and increases the weight of "free".

### 3. **Term Frequency (TF) Keeps Word Relevance within a Document**:
   - The **Term Frequency (TF)** part of **TF-IDF** considers how often a word appears in a document, which captures the **importance of a word** within that document.
   - If a word appears multiple times in a document, it is likely to be more important to the document's content, and this is reflected in its weight.
   - On its own, TF carries the same information as a **BoW** count, so it is not where the advantage comes from. The gain comes from combining TF with IDF and, in scikit-learn's `TfidfVectorizer`, from normalizing each document vector to unit length, which reduces the effect of document length.

### 4. **Better Differentiation Between Documents**:
   - Since **TF-IDF** considers both **local word frequency** (TF) and the **global document frequency** (IDF), it helps **discriminate between documents** more effectively.
   - Documents that contain **unique or rare words** (according to the IDF) will have higher **TF-IDF scores** for those words, making them more **distinctive** from other documents.
   - This **discriminatory power** helps in classification tasks, where the goal is to differentiate between classes or topics based on the words present in the documents.

### 5. **Improved Performance in Sparse Data**:
   - In scenarios where **sparse data** exists (e.g., short text, or documents with very specific jargon), **TF-IDF**'s ability to assign higher weight to **rare but informative words** helps improve the model’s ability to generalize from limited data.
   - **BoW**, however, might overemphasize frequently occurring words that don’t help in distinguishing document classes, especially in specialized or domain-specific datasets.

### 6. **Captures Document Relevance More Effectively**:
   - In **information retrieval** systems, **TF-IDF** is widely used because it helps rank documents based on the **relevance** of query terms. It identifies documents that are **more relevant** to a search query by focusing on words that are more informative (i.e., appear in fewer documents).
   - This is especially useful in **search engines** and **document ranking**, where **BoW** would simply match query terms to document words without considering their relative importance in the corpus.

---

### Example to Illustrate the Difference:

Imagine a corpus with the following three documents:
1. **Doc 1**: "apple orange banana apple"
2. **Doc 2**: "apple fruit banana"
3. **Doc 3**: "car bus train"

- **BoW Representation** (ignoring case):
  - Doc 1: [apple: 2, orange: 1, banana: 1]
  - Doc 2: [apple: 1, fruit: 1, banana: 1]
  - Doc 3: [car: 1, bus: 1, train: 1]
  
  With BoW, every occurrence counts the same: **"apple"** and **"banana"**, which appear in two of the three documents, get the same weight per occurrence as **"orange"** and **"fruit"**, which each appear in only one.

- **TF-IDF Representation** (approximate values using scikit-learn's default formula: smoothed IDF, then each document vector scaled to unit length):
  - Doc 1: [apple: 0.77, orange: 0.51, banana: 0.39]
  - Doc 2: [apple: 0.52, fruit: 0.68, banana: 0.52]
  - Doc 3: [car: 0.58, bus: 0.58, train: 0.58]
  
  Here **"orange"** and **"fruit"** occur in only one document each, so each occurrence is weighted more heavily than an occurrence of **"apple"** or **"banana"**, which occur in two documents. In Doc 1, "apple" still has the largest weight because it appears twice. Words that do not appear in a document, such as "apple" in **Doc 3**, have weight 0 in both representations.
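
TF-IDF weights for this three-document corpus can be computed directly. The sketch below assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by scaling each document vector to unit length, which is the default behaviour of scikit-learn's `TfidfVectorizer`; other TF-IDF variants give different numbers. The helper names `idf` and `tfidf` are illustrative.

```python
import math
from collections import Counter

docs = ["apple orange banana apple", "apple fruit banana", "car bus train"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Return unit-length TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for tokens in tokenized:
    print({term: round(w, 2) for term, w in tfidf(tokens).items()})
# {'apple': 0.77, 'orange': 0.51, 'banana': 0.39}
# {'apple': 0.52, 'fruit': 0.68, 'banana': 0.52}
# {'car': 0.58, 'bus': 0.58, 'train': 0.58}
```

In Doc 2 the rarer word "fruit" outweighs "apple" and "banana" even though all three occur once, which is the effect of the IDF factor.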

---

### Conclusion:

- **TF-IDF** improves accuracy over **BoW** primarily because it **down-weights** frequent words that occur across many documents, focusing more on **informative** words that help differentiate between documents.
- It also helps in situations where you have a **large vocabulary** or **sparse data** and need a more nuanced representation of the text that accounts for both the word frequency in a document and its importance across the corpus.
- **BoW**, while simpler and easier to implement, does not consider the relative importance of words, which can lead to overemphasis on common words and less emphasis on those that are more **informative** for distinguishing document classes.


2. Can you think of techniques that are better than both BoW and TF-IDF?
Yes, several more advanced **text representation techniques** often outperform **Bag-of-Words (BoW)** and **TF-IDF** in natural language processing (NLP) tasks, especially on more complex datasets or when capturing semantic meaning is crucial. Some of these techniques are described below:

---

### 1. **Word Embeddings (Word2Vec, GloVe, FastText)**

#### **How It Works:**
Word embeddings, such as **Word2Vec**, **GloVe** (Global Vectors for Word Representation), and **FastText**, represent words as continuous, dense vectors in a comparatively low-dimensional space (typically 100-300 dimensions). Unlike **BoW** or **TF-IDF**, which produce sparse vectors with one dimension per vocabulary word, word embeddings encode the **semantic relationships** between words: words with similar meanings have similar vector representations.

- **Word2Vec** learns word embeddings using techniques like **Skip-gram** and **Continuous Bag of Words (CBOW)**, capturing both **local context** (nearby words) and **semantic similarity** (words used in similar contexts).
- **GloVe** generates embeddings by factoring in word co-occurrence statistics from a corpus, effectively capturing the relationships between words based on global context.
- **FastText**, an extension of Word2Vec, represents words as bags of character n-grams, making it effective for morphologically rich languages and better at handling out-of-vocabulary words.

#### **Why It’s Better than BoW/TF-IDF:**
- **Captures Semantic Similarity**: Word embeddings capture **semantic meaning** of words. For example, "king" and "queen" will have similar vector representations, unlike BoW or TF-IDF, which treat each word independently.
- **Dense Representations**: Embeddings provide a dense, fixed-length representation (e.g., 100-300 dimensions) for each word, in contrast to the sparse, high-dimensional vectors created by BoW or TF-IDF.
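- **Handles Synonyms**: Embeddings place **synonyms** (e.g., "car" and "automobile") close together. Static embeddings still assign a single vector per word, however, so they cannot separate different senses of a word (e.g., "bank" as a financial institution vs. a river bank); that requires the contextualized embeddings described in the next section.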

#### **Example Use Cases**:
- **Document Classification**: Embeddings are more effective in representing the overall meaning of documents, making them useful for tasks like sentiment analysis, topic modeling, and document classification.
- **Word Similarity**: In applications where understanding the meaning or similarity between words is critical (e.g., recommendation systems), word embeddings offer significant advantages over BoW/TF-IDF.
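
The similarity idea can be illustrated with a toy example. The three-dimensional vectors below are hand-written, hypothetical values chosen for illustration only; real embeddings are learned from a large corpus and have hundreds of dimensions. The helper `document_vector` shows one simple, commonly used way to turn word vectors into a document vector (averaging), which is an assumption here and not something this lab uses.

```python
import math

# Hypothetical, hand-written vectors (NOT learned embeddings).
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def document_vector(tokens, embeddings):
    """Average the vectors of the tokens found in the embedding table."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    return [sum(values) / len(known) for values in zip(*known)]

print(cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"]))
print(cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"]))
print(document_vector(["my", "car", "automobile"], toy_embeddings))
```

With these toy values "car" is far closer to "automobile" than to "banana", whereas in a BoW or TF-IDF space all three words are separate, orthogonal dimensions.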

---

### 2. **Contextualized Word Embeddings (BERT, GPT, ELMo)**

#### **How It Works:**
Contextualized word embeddings represent words dynamically based on the **context** in which they appear. Unlike static word embeddings (e.g., Word2Vec or GloVe), which assign a single vector to each word regardless of context, contextual embeddings generate different vector representations for the same word depending on its usage in a sentence.

- **BERT (Bidirectional Encoder Representations from Transformers)** uses a transformer-based architecture to learn deep contextualized representations of words. BERT is trained by predicting missing words in sentences, so it can understand the context of words in both directions (left-to-right and right-to-left).
- **ELMo (Embeddings from Language Models)** also produces contextualized word embeddings but uses a two-layer bidirectional LSTM trained on a language modeling task.
- **GPT (Generative Pretrained Transformer)** is another transformer-based model that, like BERT, generates context-aware embeddings but uses a unidirectional transformer architecture.

#### **Why It’s Better than BoW/TF-IDF:**
- **Context-Awareness**: Unlike BoW and TF-IDF, which treat words as independent features, **contextual embeddings** dynamically adjust based on the surrounding words in the sentence, making them **far superior at capturing nuanced meaning**.
- **Deep Understanding of Language**: Models like BERT and GPT are pre-trained on massive corpora and can capture deep, **complex syntactic and semantic patterns** within text, leading to significantly better performance in downstream tasks.
- **State-of-the-Art Performance**: Contextual embeddings have **revolutionized NLP** and are widely regarded as the state-of-the-art for tasks like **question answering**, **sentiment analysis**, **named entity recognition (NER)**, and **machine translation**.

#### **Example Use Cases**:
- **Named Entity Recognition (NER)**: In applications where understanding the entities (like people, places, and organizations) in a sentence is critical, contextual embeddings excel.
- **Question Answering Systems**: BERT has shown state-of-the-art performance on tasks like **SQuAD (Stanford Question Answering Dataset)**, where it is able to understand both the context of the question and the passage.

---

### 3. **Doc2Vec (Paragraph2Vec)**

#### **How It Works:**
**Doc2Vec** is an extension of Word2Vec that learns dense vector representations for entire documents, rather than individual words. It operates by training a model to predict surrounding words in a fixed-size context window, while also learning a unique vector for each document.

#### **Why It’s Better than BoW/TF-IDF:**
- **Captures Document-Level Semantics**: Doc2Vec is more powerful than BoW and TF-IDF because it learns to represent the **entire document** as a dense vector that captures semantic meaning at the document level, not just individual word frequencies.
- **Compact Fixed-Length Vectors**: Like BoW and TF-IDF, Doc2Vec yields one fixed-length vector per document regardless of document length, but the vector is dense and low-dimensional, and it can capture **thematic similarity** between documents even when they share few exact words.
- **Improved for Document Classification**: Since Doc2Vec learns representations for entire documents, it is well-suited for tasks where document-level semantics matter, such as **document classification**, **topic modeling**, and **semantic search**.

#### **Example Use Cases**:
- **Document Clustering**: Grouping documents into semantically similar clusters becomes more accurate using Doc2Vec because it captures document-level semantics beyond word frequency.
- **Document Retrieval**: In a semantic search system, where the goal is to retrieve documents based on their meaning rather than exact word matches, Doc2Vec can provide much better results than traditional methods like BoW/TF-IDF.

---

### 4. **Transformer-Based Models (T5, RoBERTa, XLNet)**

#### **How They Work:**
Models such as **T5** (Text-to-Text Transfer Transformer), **RoBERTa** (a robustly optimized BERT), and **XLNet** (a permutation-based autoregressive model built on Transformer-XL) refine the transformer approach behind BERT and GPT with different pre-training objectives and training recipes. They use self-attention to capture complex patterns in text, are typically fine-tuned on a specific task, and on many benchmarks outperform earlier methods like TF-IDF and Word2Vec.

#### **Why They’re Better than BoW/TF-IDF:**
- **Superior Contextualization**: These models use **transformer architectures** that capture long-range dependencies and contextual meaning at a much more sophisticated level than traditional techniques like BoW or TF-IDF.
- **Task-Specific Fine-Tuning**: Transformer models can be fine-tuned for a wide range of tasks, making them highly adaptable and extremely powerful for tasks like text generation, question answering, and summarization.

#### **Example Use Cases**:
- **Text Summarization**: Transformer-based models like T5 are able to generate **abstractive summaries** of long documents, going beyond simple keyword extraction.
- **Language Generation**: Models like GPT can generate human-like text, making them useful for creative writing, chatbots, and even code generation.

---

### 5. **Multimodal Representations (CLIP, ALIGN)**

#### **How It Works:**
**Multimodal models** like **CLIP** (Contrastive Language-Image Pretraining) and **ALIGN** (A Large-scale ImaGe and Noisy-text embedding) jointly learn representations of both **text** and **images**. These models produce **joint embeddings** that represent text and visual information in a single shared space.

#### **Why They’re Better than BoW/TF-IDF:**
- **Cross-Modal Understanding**: These models can understand relationships between text and other forms of media (e.g., images), allowing for tasks that involve both types of data, such as image captioning or visual question answering.
- **Improved Performance in Multimodal Search**: CLIP and ALIGN allow you to search for **images based on text queries** and vice versa, a task that BoW/TF-IDF would not be able to handle.

---

### Conclusion:

While **BoW** and **TF-IDF** are foundational techniques in NLP, the methods listed above (**word embeddings**, **contextual embeddings**, **Doc2Vec**, **transformer-based models**, and **multimodal representations**) represent **significant advances** in text representation. They often outperform the traditional methods in both **accuracy** and **contextual understanding**, at a higher computational cost. The choice of technique depends on the complexity of the task, the available computational resources, and the nature of the dataset. For many **general-purpose NLP tasks**, pre-trained **transformer-based models** like **BERT** and **GPT** are a strong default, while **Doc2Vec** or averaged **word embeddings** can be a lighter-weight option for **document-level** similarity tasks.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
### **Stemming vs. Lemmatization**

Both **stemming** and **lemmatization** are techniques used in **Natural Language Processing (NLP)** to reduce words to their root or base form, but they differ in their approach and the quality of results they produce. Let's break down the **pros and cons** of each technique.

---

### **1. Stemming**

#### **What is Stemming?**
Stemming is the process of reducing a word to its **root form** by chopping off affixes (mostly suffixes), typically using a set of predefined rules. The result may not always be a valid word in the dictionary, but it helps reduce word variations to a common base form for analysis.

#### **Examples**:
- "running" → "run"
- "troubling" → "troubl" (the Snowball stemmer output shown earlier in this lab)
- "better" → "better" (a stemmer cannot relate it to "good")

#### **Pros of Stemming**:
1. **Simplicity and Speed**: Stemming is typically **faster** than lemmatization because it uses simple heuristic rules (like removing suffixes) rather than relying on a more complex understanding of words.
2. **Basic Root Form**: For many applications, particularly those involving large datasets, stemming might work sufficiently well to group together words with similar meanings.
3. **Less Resource Intensive**: Stemming doesn't require access to large lexical databases like WordNet (used by lemmatizers), making it less computationally expensive.

#### **Cons of Stemming**:
1. **Over-Stemming**: Because stemmers apply rule-based heuristics, they can strip too much and merge words with unrelated meanings. For example, the Porter stemmer reduces both "organization" and "organ" to "organ". Merging "flying" and "fly" into "fli", by contrast, is the intended grouping rather than an error, because both are forms of the same word.
2. **Ambiguity**: Stemmers are not context-aware, so words with different meanings can end up with the same stem. For example, with the Porter stemmer "universal" and "universe" both stem to "univers", even though they mean different things.
3. **Not Always Linguistically Correct**: The output of stemming may not be a valid or meaningful word. For instance, "happy" becomes "happi", which is not a dictionary word.

#### **Use Cases for Stemming**:
- **Information Retrieval**: Where exact matching is not necessary, and you want to group together words with similar meanings.
- **Search Engines**: For queries that need to find documents containing related word forms, stemming is used to ensure that variations of a word are matched.
- **Quick Prototyping**: In some cases, especially with large datasets, where computational efficiency is prioritized over linguistic accuracy.
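
As a rough illustration of the search use case, the sketch below stems tokens both when indexing documents and when processing the query, so that different word forms match. The `crude_stem` helper, the sample documents, and the function names are invented for this example; a real system would plug in a proper stemmer and tokenizer.

```python
from collections import defaultdict


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every stemmed query term."""
    stems = [crude_stem(t) for t in query.lower().split()]
    if not stems:
        return set()
    return set.intersection(*(index.get(s, set()) for s in stems))


docs = {
    1: "the dog jumped over the fence",
    2: "dogs love jumping",
    3: "a cat sleeps",
}
index = build_index(docs)
print(search(index, "dog jumps"))  # matches documents 1 and 2
```

Without stemming, the query "dog jumps" would match neither document exactly, since "jumps" appears in none of them.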

---

### **2. Lemmatization**

#### **What is Lemmatization?**
Lemmatization is the process of reducing a word to its **base or dictionary form**, called a "lemma." Unlike stemming, which removes affixes, lemmatization involves understanding the **context** of the word and ensuring that the resulting word is a **valid word** in the dictionary. Lemmatization typically requires knowledge of the part-of-speech (POS) of the word.

#### **Examples**:
- "running" → "run"
- "better" → "good" (because "better" is a comparative form of "good")
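- "studies" → "study"
- "happily" → "happily" (an adverb is already its own lemma; mapping it to "happy" would undo derivation, which WordNet-style lemmatizers do not do)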
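
The sketch below shows, under simplifying assumptions, how a dictionary-based lemmatizer can work: look up irregular forms first, then try suffix rules and accept a candidate only if it is a known word for the given part of speech. The tiny `VOCAB`, `EXCEPTIONS`, and `SUFFIX_RULES` tables and the name `toy_lemmatize` are hand-written for illustration; a real lemmatizer such as NLTK's `WordNetLemmatizer` relies on the full WordNet database instead.

```python
# Tiny hand-written lexicon; a real lemmatizer would consult WordNet or similar.
VOCAB = {
    "n": {"run", "study", "mouse", "cat"},
    "v": {"run", "study", "jump"},
    "a": {"good", "happy"},
    "r": {"well", "happily"},
}
# Irregular forms that no suffix rule can recover, keyed by part of speech.
EXCEPTIONS = {
    "n": {"mice": "mouse"},
    "v": {"ran": "run", "running": "run"},
    "a": {"better": "good", "best": "good"},
    "r": {"better": "well", "best": "well"},
}
# (suffix, replacement) pairs tried in order for regular inflections.
SUFFIX_RULES = {
    "n": [("ies", "y"), ("s", "")],
    "v": [("ies", "y"), ("ed", ""), ("ing", ""), ("s", "")],
    "a": [("iest", "y"), ("ier", "y")],
    "r": [],
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma of `word` for the given POS, or the word unchanged."""
    word = word.lower()
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    for suffix, replacement in SUFFIX_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in VOCAB[pos]:  # accept only known dictionary entries
                return candidate
    return word  # unknown word or already a lemma


print(toy_lemmatize("better", pos="a"))   # adjective reading
print(toy_lemmatize("better", pos="r"))   # adverb reading
print(toy_lemmatize("better"))            # default noun reading: unchanged
print(toy_lemmatize("selfies"))           # not in the toy lexicon: unchanged
```

The default `pos="n"` mirrors the noun default of NLTK's `WordNetLemmatizer.lemmatize`, which is why passing the correct part of speech matters in practice.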

#### **Pros of Lemmatization**:
1. **Linguistic Accuracy**: Lemmatization ensures that the **output** is always a **valid word** in the language. It preserves the meaning of words and keeps the correct form (e.g., "running" → "run", not "runn").
2. **Context-Aware**: Lemmatization takes into account the **part-of-speech (POS)** of the word. For example, "better" will be lemmatized to "good" (adjective), while "run" will stay as "run" (verb).
3. **Meaning Preservation**: Lemmatization preserves **semantic meaning** better than stemming. The base form of a word represents its actual meaning, and lemmatization ensures that this meaning is retained.

#### **Cons of Lemmatization**:
1. **Slower and More Computationally Intensive**: Lemmatization involves complex processes like looking up words in a **lexical database** (e.g., **WordNet**) and determining the correct POS. This makes lemmatization slower and more resource-intensive than stemming.
2. **Requires POS Tagging**: Lemmatization often requires additional information such as part-of-speech (POS) tagging, which adds complexity. For instance, "better" can be lemmatized to "good" (adjective) or "well" (adverb) depending on the context, and this requires accurate tagging.
3. **Dependency on Lexical Databases**: Since lemmatization relies on resources like **WordNet** or other dictionaries, it might not work well for domain-specific words or neologisms not present in the dictionary.
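
One practical consequence of the POS requirement: taggers commonly emit Penn Treebank tags (e.g., `JJR`, `VBG`), while WordNet-based lemmatizers expect one of four letters (`n`, `v`, `a`, `r`). A small mapping helper like the one below is a common bridge. The function name `penn_to_wordnet`, the noun fallback, and the hand-tagged sample sentence are assumptions made for this sketch, not tagger output.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    return default  # pronouns, determiners, etc. fall back to the default


# Hand-tagged example sentence (tags written by hand for illustration).
tagged = [("She", "PRP"), ("is", "VBZ"), ("feeling", "VBG"), ("better", "JJR")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

If the tagger mislabels "better" as an adverb (`RBR`), the helper returns `r` and a WordNet-style lemmatizer would produce "well" instead of "good", which is how tagging errors propagate into lemmatization.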

#### **Use Cases for Lemmatization**:
- **Text Classification**: Where preserving the **precise meaning** of words is important, such as in sentiment analysis, where words like "better" and "good" need to be treated as equivalents.
- **Machine Translation**: Lemmatization is useful for ensuring that different word forms are treated correctly when translating text from one language to another.
- **Information Extraction**: For tasks where linking the inflected forms of a word (e.g., "acquire", "acquires", "acquired") to a single entry is important for extracting specific information. Note that lemmatization merges inflections, not synonyms.

---

### **Comparison: Stemming vs. Lemmatization**

| Feature                  | **Stemming**                               | **Lemmatization**                           |
|--------------------------|--------------------------------------------|--------------------------------------------|
| **Complexity**            | Simple, rule-based                        | More complex, context-aware                |
| **Speed**                 | Faster                                    | Slower                                    |
| **Output**                | May produce non-dictionary words (e.g., "fli") | Produces valid, dictionary words           |
| **Accuracy**              | Less accurate (may over-stem)              | More accurate and context-sensitive        |
| **Use Case**              | Useful for search engines, quick prototyping, and large datasets | Better for NLP tasks where semantic meaning is critical (e.g., sentiment analysis) |
| **Dependency**            | Does not require additional resources     | Often requires lexical databases (e.g., WordNet), POS tagging |

---

### **When to Use Stemming vs. Lemmatization**:

1. **Use Stemming When**:
   - Speed is a critical factor (e.g., in search engines or information retrieval tasks).
   - You’re working with a **large dataset** and need a **simple, fast preprocessing step**.
   - The exact meaning of words is not as important as grouping similar words together for general purposes (e.g., finding documents related to a particular concept).
   - The task does not require **linguistic accuracy** (e.g., clustering text for topic modeling).

2. **Use Lemmatization When**:
   - Semantic meaning and **accuracy** are important (e.g., sentiment analysis, question answering).
   - You need **context-aware analysis** (e.g., when distinguishing between different parts of speech).
   - You're working on more advanced **NLP tasks** like **machine translation**, **information extraction**, or **named entity recognition (NER)**.
   - Your corpus consists mostly of standard vocabulary that the lemmatizer's dictionary covers; out-of-dictionary terms (domain jargon, neologisms) are typically left unchanged.

---

### **Summary**:

- **Stemming** is **fast and efficient** but crude: stems may not be real words, and unrelated words can collapse to the same stem, which can hurt NLP tasks where precision matters.
- **Lemmatization**, on the other hand, produces **linguistically accurate** results that maintain the true meaning of words but is **slower** and more resource-intensive due to its reliance on external resources and the need for part-of-speech tagging.

In general, **lemmatization** is preferred for tasks that require a high level of semantic understanding (e.g., text classification, sentiment analysis), while **stemming** is often sufficient for simpler applications where computational efficiency is prioritized over linguistic accuracy.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
