
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when doing NLP in Python.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be cleaned and converted into a form that machine-learning algorithms can work with.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  
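
Steps 1 and 2 above can be sketched with a single regular expression. This is a minimal illustration only; the helper name `basic_clean` and the sample sentence are made up for this example.

```python
import re

def basic_clean(text):
    """Lowercase the text and replace every non-letter character with a space."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)   # drops digits and punctuation
    return " ".join(letters_only.lower().split())     # collapse repeated spaces

cleaned = basic_clean("Great phone!!! Bought 2 of them in 2021.")
print(cleaned)  # → great phone bought of them in
```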

Most importantly, words and images cannot be fed directly into an algorithm; they must first be converted into vectors, a numerical form that algorithms can work with.



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Nouns and all other tags fall back to NOUN, the lemmatizer's default
                return wordnet.NOUN

        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        return [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in tagged]

    if stemmer:
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option selected: return the cleaned tokens unchanged instead of None
    return word_tokenize(text)


In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
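
A small sketch of this idea, using scikit-learn's `CountVectorizer` (imported earlier in this lab) on three made-up sentences. Because word order is discarded, the first two sentences end up with identical vectors even though they mean different things.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog bites man", "man bites dog", "dog bites dog"]  # toy documents
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs).toarray()

# Column order follows the learned vocabulary (alphabetical here)
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)  # → ['bites', 'dog', 'man']
print(counts)      # rows 0 and 1 are identical: [1 1 1]
```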

In [None]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
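TF-IDF (Term Frequency-Inverse Document Frequency) addresses a weakness of Bag-of-Words: raw counts treat every word as equally informative. TF-IDF weights each word by how often it occurs in a document and how rare it is across the corpus, which makes it well suited to text classification and other tasks that need text represented as numbers.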

The number of times a term occurs in a document is called its Term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
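
To make these definitions concrete, here is a hand-rolled sketch that assumes the common textbook weighting `tf * log(N / df)`. The toy corpus and the helper name `tf_idf` are made up for illustration, and library implementations such as `TfidfVectorizer` use a smoothed and normalized variant, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, already tokenized
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term)                  # raw term frequency in this document
    idf = math.log(n_docs / df[term])     # unsmoothed inverse document frequency
    return tf * idf

print(tf_idf("good", docs[2]))   # 2 * log(3 / 2)
print(tf_idf("plot", docs[2]))   # 1 * log(3 / 1)
```

Here "plot" outweighs a single occurrence of "good" because it appears in only one document.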




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
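
The choice of distance metric matters for text vectors. The sketch below uses made-up count vectors to suggest why: Euclidean distance is sensitive to document length, while cosine distance only compares the direction (the word proportions) of two vectors.

```python
import numpy as np

short_doc = np.array([1.0, 1.0, 0.0])   # toy count vector
long_doc = 10 * short_doc                # same word proportions, ten times longer
other_doc = np.array([0.0, 1.0, 1.0])   # different words, similar length

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean: the long document looks far away despite identical proportions
print(euclidean(short_doc, long_doc), euclidean(short_doc, other_doc))
# Cosine: the long document is at distance ~0, the different document at 0.5
print(cosine_distance(short_doc, long_doc), cosine_distance(short_doc, other_doc))
```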

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()



KNN with BOW accuracy = 62.30366492146597%




Cross Validation Accuracy: 0.62
[0.60784314 0.58431373 0.66141732]




In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 70.15706806282722%




Cross Validation Accuracy: 0.73
[0.7254902  0.74117647 0.72834646]
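
For Task 1, one way to compare settings systematically is to loop over values of `n_neighbors` and several distance metrics. The sketch below uses a small made-up corpus instead of `reviews.csv`, so the scores it prints say nothing about the real dataset; the sentences, labels, and the `results` dictionary are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus standing in for reviews.csv (1 = positive, 0 = negative)
sentences = ["great food", "loved the food", "great service", "really loved it",
             "awful food", "hated the service", "awful slow service", "really hated it"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = TfidfVectorizer().fit_transform(sentences).toarray()

results = {}
for k in (1, 3):
    for metric in ("euclidean", "cosine", "manhattan"):
        # Brute-force search works with every metric tried here
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric, algorithm="brute")
        scores = cross_val_score(knn, X, labels, cv=2)
        results[(k, metric)] = scores.mean()
        print(f"k={k}, metric={metric}: mean CV accuracy = {scores.mean():.2f}")
```

The same loop could be wrapped around the `KNeighborsClassifier` call inside `bow_knn` or `tfidf_knn` to run the comparison on the actual reviews.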


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
len(df)

5572

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()



KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90040377 0.91245791]




In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()



KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
The **TF-IDF (Term Frequency-Inverse Document Frequency)** approach generally results in better accuracy than the **Bag-of-Words (BoW)** model, primarily due to how it handles the importance of words in a text corpus. Here's a detailed explanation of why **TF-IDF** typically outperforms **BoW**:

### 1. **Handling Word Importance**:
   - **Bag-of-Words (BoW)** simply counts the frequency of each word in the document, ignoring the context or importance of the word within the overall dataset.
   - **TF-IDF** improves upon this by adjusting the word frequency based on two factors:
     - **Term Frequency (TF)**: The raw frequency of a word in a document.
     - **Inverse Document Frequency (IDF)**: A measure of how common or rare the word is across all documents in the corpus.

   The idea behind TF-IDF is that **common words** (e.g., "the", "is", "and") that appear frequently across many documents should be given **lower weight**, while **rare words** that appear in only a few documents are given **higher weight**, assuming they are more informative.

   **Formula**:
   \[
   \text{TF-IDF}(w, d) = \text{TF}(w, d) \times \log\left(\frac{N}{\text{DF}(w)}\right)
   \]
   where:
   - \( \text{TF}(w, d) \) is the frequency of the word \( w \) in the document \( d \).
   - \( \text{DF}(w) \) is the number of documents containing the word \( w \).
   - \( N \) is the total number of documents.
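
Note that scikit-learn's `TfidfVectorizer`, which this lab uses, does not apply the formula above verbatim. With its default settings (`smooth_idf=True`, `norm='l2'`) it computes the IDF as `ln((1 + N) / (1 + DF)) + 1` and then scales each document vector to unit length. The sketch below checks this on a made-up corpus; the document frequencies are counted by hand.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good good plot"]  # toy corpus
vectorizer = TfidfVectorizer()           # defaults: smooth_idf=True, norm='l2'
tfidf = vectorizer.fit_transform(docs).toarray()

# Columns are in alphabetical order: bad, good, movie, plot
doc_freq = np.array([1, 2, 2, 1])        # counted by hand from the toy corpus
n_docs = len(docs)
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_matches = np.allclose(vectorizer.idf_, expected_idf)
rows_unit_length = np.allclose(np.linalg.norm(tfidf, axis=1), 1.0)
print(idf_matches, rows_unit_length)
```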

### 2. **Reduces Impact of Common Words (Stopwords)**:
   - In **Bag-of-Words**, common words (often called **stopwords** such as "and", "the", "a", "in") can dominate the feature set, as they appear in almost all documents. This can result in **poor performance** in tasks like text classification because these words don't provide much valuable information about the content.
   - **TF-IDF** addresses this issue by **down-weighting** the importance of these frequent words across documents. The term frequency of such common words may be high, but the **inverse document frequency** will be low (because these words appear in many documents), effectively reducing their influence on the model.

### 3. **Gives Higher Weight to Rare but Important Words**:
   - **TF-IDF** increases the weight of words that are **rare but significant** in the context of a particular document or class. For example, in a set of documents about **sports**, words like **"soccer"** or **"basketball"** might appear in only a few documents but are highly relevant for identifying topics related to sports.
   - **Bag-of-Words**, on the other hand, uses raw counts, so a word's weight depends only on how often it occurs in the document. A rare but telling word that appears once counts for less than a generic word that appears several times.

### 4. **Captures the Relevance of Words in the Context of the Entire Corpus**:
   - **BoW** is essentially a **frequency-based model** that doesn't account for how relevant a word is across the entire corpus. It doesn’t distinguish between words that are meaningful in the specific context of a document and those that appear frequently but aren’t informative.
   - **TF-IDF**, however, considers the **corpus-wide frequency** of terms, allowing the model to focus on the words that are **discriminative** for each document. This makes TF-IDF a more **contextually aware** feature representation compared to BoW, which treats all words with equal importance.

### 5. **Synonymy and Polysemy Are Not Captured, but Specific Terms Still Help**:
   - **Synonymy** refers to the fact that different words can have the same meaning (e.g., “car” and “automobile”), and **polysemy** refers to words that have multiple meanings (e.g., “bank” could mean a financial institution or the side of a river).
   - Neither **BoW** nor **TF-IDF** captures semantic relationships between words: synonyms remain separate features, and every sense of a polysemous word shares one feature. The advantage of **TF-IDF** here is indirect: by weighting words according to their **rarity and specificity**, it can still separate documents more sharply than raw counts.

### 6. **Reduces Overfitting in High-dimensional Space**:
   - **BoW** and **TF-IDF** both create sparse, high-dimensional vectors with one dimension per vocabulary term, and many of those features (words) are not particularly useful for the task. With raw counts, such features can encourage **overfitting** in machine learning models.
   - **TF-IDF** keeps the same dimensionality but scales down ubiquitous terms and highlights the more **informative terms**. By reducing the emphasis on unimportant words, it can improve model generalization, particularly for distance-based models such as KNN.

### 7. **Better at Discriminating between Documents**:
   - In classification tasks (such as spam detection, sentiment analysis, or topic classification), the ability to differentiate between documents based on the unique words they contain is crucial.
   - Since **TF-IDF** emphasizes the **informative, discriminative words**, it helps the machine learning model better distinguish between different categories or topics compared to **Bag-of-Words**, which treats every word equally.

### 8. **Improved Performance in Many NLP Tasks**:
   - For many Natural Language Processing (NLP) tasks, including **text classification**, **information retrieval**, and **document clustering**, **TF-IDF** has been shown to provide better results than **Bag-of-Words** because it accounts for both the **local context** (within a document) and the **global context** (across the entire corpus).
   - **BoW** does not take into account how common or rare the words are across the corpus, which can lead to a model that is less effective at distinguishing between different types of documents.

---

### Summary of Key Differences:
| **Aspect**                    | **Bag-of-Words (BoW)**                                       | **TF-IDF**                                            |
|-------------------------------|------------------------------------------------------------|-------------------------------------------------------|
| **Word Weighting**             | Counts word occurrences in a document (no weighting).       | Adjusts word frequency by the inverse of how common the word is across all documents. |
| **Handling Stopwords**         | Stopwords can dominate and appear in most documents.        | Stopwords are down-weighted due to their high frequency across documents. |
| **Word Importance**            | All words are treated equally, regardless of relevance.      | Rare and discriminative words get higher weight, while common words get lower weight. |
| **Effectiveness in High-Dimensional Data** | Sparse, high-dimensional vectors in which uninformative words carry full weight. | Same dimensionality and sparsity, but informative words are emphasized and noise is down-weighted. |
| **Performance in NLP Tasks**   | Can be less effective in tasks requiring discrimination.    | Better at distinguishing documents based on unique terms, improving classification accuracy. |

---

### Conclusion:
The **TF-IDF** approach improves over **Bag-of-Words** by giving more weight to **important, rare terms** and reducing the influence of **common, non-informative words**. The vectors have the same size as in BoW, but the weighting makes the representation more effective at distinguishing between different documents and topics. As a result, **TF-IDF** generally leads to **better model performance** and **higher accuracy** in tasks such as **text classification**, **information retrieval**, and **document clustering**.


2. Can you think of techniques that are better than both BoW and TF-IDF?
Yes, there are several **advanced techniques** that often outperform both **Bag-of-Words (BoW)** and **TF-IDF** in many Natural Language Processing (NLP) tasks, especially when dealing with complex language structures or large datasets. These techniques generally capture richer semantic relationships between words, phrases, or even entire documents, improving accuracy, scalability, and generalization. Below are some of the most prominent alternatives:

### 1. **Word Embeddings (e.g., Word2Vec, GloVe, FastText)**

#### **How It Works:**
- **Word Embeddings** represent words as dense, continuous vectors in a **semantic space**, where semantically similar words are mapped to nearby points. This is in contrast to BoW and TF-IDF, where words are treated as discrete entities.
- Techniques like **Word2Vec** (skip-gram and CBOW), **GloVe** (Global Vectors for Word Representation), and **FastText** (which takes into account subword information) learn these embeddings by training on large corpora to capture semantic meanings and relationships (e.g., synonyms, analogies).

#### **Advantages Over BoW and TF-IDF:**
- **Semantic Similarity**: Word embeddings can capture **semantic meaning** and **relationships** (e.g., "king" and "queen" are close in vector space), while BoW and TF-IDF only consider word frequency.
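- **Dense Representation**: Embeddings are dense rather than sparse, and related words have similar vectors, so a model can generalize to words it has rarely seen. Note that static embeddings such as Word2Vec and GloVe assign one vector per word regardless of context; context-dependent vectors come from transformer models such as BERT.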
- **Dimensionality**: Word embeddings are lower-dimensional compared to BoW, which can lead to better performance and lower computational costs.

#### **Limitations:**
- Embeddings are trained on large datasets, so they may not perform as well with smaller datasets or niche vocabularies.
- Pre-trained embeddings (e.g., GloVe or Word2Vec) may not capture domain-specific terms as well as training on your own corpus.
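
A toy illustration of what "nearby points" means. The three-dimensional vectors below are hand-written for this example and are not taken from any trained model; real Word2Vec or GloVe vectors are learned from a corpus and typically have tens to hundreds of dimensions. Averaging word vectors, as `document_vector` does, is one simple (assumed) way to get a dense vector for a whole document.

```python
import numpy as np

# Hypothetical hand-written embeddings, for illustration only
embeddings = {
    "car":        np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "banana":     np.array([0.0, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def document_vector(tokens):
    """Average the vectors of known tokens to get one dense vector per document."""
    return np.mean([embeddings[t] for t in tokens if t in embeddings], axis=0)

print(cosine_similarity(embeddings["car"], embeddings["automobile"]))  # close to 1
print(cosine_similarity(embeddings["car"], embeddings["banana"]))      # close to 0
print(document_vector(["car", "banana"]))
```

In BoW or TF-IDF, "car" and "automobile" would be two unrelated columns; here their similarity is explicit in the vectors.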

---

### 2. **Doc2Vec (Paragraph Vectors)**

#### **How It Works:**
- **Doc2Vec** extends **Word2Vec** by learning embeddings for entire documents (or paragraphs). It trains a vector representation for each document that captures the context of the document in a similar manner to how Word2Vec captures word meanings.

#### **Advantages Over BoW and TF-IDF:**
- **Document-level Representation**: Unlike BoW or TF-IDF, which work at the word level, **Doc2Vec** produces dense, fixed-size vectors for whole documents, making it easier to classify or cluster documents.
- **Contextual Information**: Like Word2Vec, Doc2Vec takes into account the context of words within a document, rather than treating words as independent features.
- **Compactness**: It produces lower-dimensional embeddings compared to the high-dimensional sparse vectors of BoW and TF-IDF.

#### **Limitations:**
- Requires **large datasets** for effective training. If the dataset is small, performance may degrade.
- Training a good Doc2Vec model can be computationally expensive.

---

### 3. **Transformers (e.g., BERT, GPT, RoBERTa)**

#### **How It Works:**
- **Transformers** are a class of models based on the **self-attention mechanism**, which allows them to process the entire context of a word in a sequence simultaneously. **BERT (Bidirectional Encoder Representations from Transformers)**, **GPT (Generative Pre-trained Transformer)**, and **RoBERTa** are pre-trained models that can generate highly **contextualized embeddings**.
- Unlike BoW and TF-IDF, which represent words in isolation, **transformers** understand words based on the surrounding context (i.e., they are **context-aware**). They generate word embeddings dynamically depending on the sentence in which they appear.
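
The self-attention mechanism mentioned above can be sketched in a few lines of NumPy. This is a single-head, scaled dot-product version with random toy weights, meant only to show the computation; real transformers stack many such layers with multiple heads, learned weights, and additional components.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention; X has shape (tokens, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity of every token with every other token
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                     # 4 tokens with 8-dimensional toy embeddings
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # one context-dependent vector per token
print(weights.sum(axis=1))   # attention weights of each token sum to 1
```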

#### **Advantages Over BoW and TF-IDF:**
- **Contextual Understanding**: Transformers like BERT and GPT provide **deep contextualization**, where the representation of each word changes depending on the sentence or surrounding words. This helps in understanding polysemy (words with multiple meanings).
- **Pre-trained Knowledge**: Models like BERT have been pre-trained on vast corpora and can be **fine-tuned** on specific tasks (e.g., sentiment analysis, question answering) with minimal labeled data.
- **State-of-the-art Performance**: For a wide variety of NLP tasks (including text classification, sentiment analysis, named entity recognition, etc.), **transformers** have achieved **state-of-the-art performance**.

#### **Limitations:**
- **Computationally Expensive**: Transformers, especially large models like GPT-3 or BERT, require significant computational resources for both training and inference.
- **Data Requirement**: Pre-training a transformer requires very large corpora, and even a pre-trained model usually needs some labeled data for fine-tuning before it performs well on a specific task.

---

### 4. **Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) Networks**

#### **How It Works:**
- **RNNs** and **LSTMs** are deep learning models designed to handle sequential data. They process text word by word and maintain an internal state that captures the sequential nature of the input.
- LSTMs, a more advanced form of RNNs, address issues like the **vanishing gradient problem** and are better suited for modeling long-range dependencies in text.
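
The "internal state" idea can be sketched as a plain (vanilla) RNN forward pass in NumPy. The weights and inputs below are random toy values, and an LSTM would add gating on top of this recurrence. Unlike a bag-of-words vector, the final state depends on the order in which the words arrive.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return the hidden state after each step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:                          # one word vector at a time, in order
        h = np.tanh(W_x @ x_t + W_h @ h + b)    # new state mixes the word and the previous state
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
sequence = rng.normal(size=(5, 3))              # 5 toy word vectors of dimension 3
W_x = rng.normal(size=(4, 3))
W_h = rng.normal(size=(4, 4))
b = np.zeros(4)

states = rnn_forward(sequence, W_x, W_h, b)
reversed_states = rnn_forward(sequence[::-1], W_x, W_h, b)
print(states.shape)
# Same words in reverse order generally give a different final state
print(np.allclose(states[-1], reversed_states[-1]))
```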

#### **Advantages Over BoW and TF-IDF:**
- **Sequential Data Handling**: Unlike BoW and TF-IDF, which treat words independently, RNNs and LSTMs can capture the **sequential dependencies** between words in a sentence, making them ideal for tasks like sentiment analysis and language modeling.
- **Contextualization**: RNNs and LSTMs can model the **flow of meaning** over time, which is especially useful for tasks requiring **contextual understanding** of the entire sentence or document.

#### **Limitations:**
- **Training Complexity**: RNNs and LSTMs can be difficult to train, especially on long sequences. They also tend to be slower to train compared to other methods.
- **Memory Requirements**: LSTMs are memory-intensive, and when working with large datasets, they can be computationally expensive.

---

### 5. **Latent Dirichlet Allocation (LDA)**

#### **How It Works:**
- **LDA** is a generative probabilistic model used for **topic modeling**. It assumes that documents are mixtures of topics, and each topic is represented as a distribution over words. The goal is to uncover the hidden thematic structure in a large collection of documents.
- While **BoW** and **TF-IDF** represent words as individual features, LDA groups words into **topics**, allowing a more **abstract and interpretable representation** of documents.
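
A minimal sketch of this using scikit-learn's `LatentDirichletAllocation` on a made-up four-document corpus. The number of topics (2) is an assumption chosen to match the toy data, and results on such a tiny corpus are only illustrative.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # toy corpus with two rough themes
    "football match goal team",
    "team wins football league",
    "stock market shares price",
    "market price falls shares",
]
counts = CountVectorizer().fit_transform(docs)   # LDA is fitted on raw word counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)           # one row per document, one column per topic

print(doc_topics.round(2))                       # each row is a topic mixture summing to 1
```

Each document is now described by two numbers instead of one column per vocabulary word, and these topic mixtures could be fed to KNN in place of BoW or TF-IDF vectors.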

#### **Advantages Over BoW and TF-IDF:**
- **Topic Modeling**: LDA uncovers the latent **topics** in a set of documents, which can be highly informative for tasks like clustering, summarization, and recommendation.
- **Reduced Dimensionality**: By representing documents as a mixture of topics (rather than a sparse vector of words), LDA can reduce the dimensionality and improve performance in certain tasks, especially when there is a clear topic structure.
  
#### **Limitations:**
- **Interpretability**: While LDA can reveal topics, the interpretation of those topics is sometimes subjective, and LDA may struggle when documents don’t clearly belong to a small number of topics.
- **Parameter Tuning**: LDA requires careful selection of the number of topics, which can be challenging without domain knowledge or hyperparameter optimization.

---

### 6. **Universal Sentence Encoder (USE)**

#### **How It Works:**
- **Universal Sentence Encoder (USE)** is a model developed by Google that generates dense vector representations of entire sentences or paragraphs. USE has been trained to capture the semantic meaning of a sentence, which makes it useful for a variety of NLP tasks.

#### **Advantages Over BoW and TF-IDF:**
- **Sentence-Level Embeddings**: Unlike BoW and TF-IDF, which operate on words, USE operates on entire sentences, making it better suited for tasks like **sentence similarity**, **paraphrase detection**, and **semantic textual similarity**.
- **Pre-trained Models**: USE offers pre-trained models, making it easy to use without extensive training on specific datasets.

#### **Limitations:**
- **Less Task-Specific**: While effective for many tasks, a fixed sentence embedding from USE can miss subtle distinctions that a model such as BERT, fine-tuned on the target task, would capture.

---

### Conclusion:

While **BoW** and **TF-IDF** are useful for many basic text processing tasks, more advanced techniques like **word embeddings** (Word2Vec, GloVe), **Doc2Vec**, **transformers** (BERT, GPT), and **LDA** can capture richer, more complex relationships in text data. These methods typically offer improved accuracy, better generalization, and more nuanced representations of text compared to simple frequency-based models like BoW and TF-IDF, especially when working with large and complex datasets.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
**Stemming** and **Lemmatization** are two common text preprocessing techniques used in **Natural Language Processing (NLP)** to normalize words by reducing them to their base or root form. While both methods aim to simplify text and reduce dimensionality, they differ in how they achieve this goal. Below is an analysis of the **pros** and **cons** of **stemming** and **lemmatization** based on their characteristics.

### **1. Stemming:**

#### **How It Works:**
Stemming is the process of removing affixes (typically suffixes) from words to reduce them to a root form, known as the **stem**. For example, with the Porter stemmer:
- **"running"** → **"run"**
- **"better"** → **"better"** (unchanged; a stemmer does not know that "better" is related to "good")
- **"happily"** → **"happili"** (Note: Some stems are not valid words)

Stemming algorithms, such as the **Porter Stemmer** or **Snowball Stemmer**, apply an ordered set of rules to chop off suffixes without considering the context or meaning of the word.
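
To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. The rule list and the minimum stem length are assumptions chosen for illustration. This is not the Porter or Snowball rule set, which is considerably larger and has extra conditions; in practice one would use a library stemmer such as NLTK's `PorterStemmer`.

```python
# Ordered (suffix, replacement) rules: a tiny illustrative subset,
# NOT the actual Porter or Snowball rules.
RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("s", ""),
]
MIN_STEM_LEN = 3  # assumed guard so short words like "is" are left alone

def toy_stem(word):
    """Apply the first rule whose suffix matches and leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LEN:
                return stem
    return word  # no rule applied

for w in ["jumping", "jumped", "jumps", "studies", "quickly", "went"]:
    print(w, "->", toy_stem(w))
```

Note that "studies" becomes the non-word "studi" and the irregular form "went" is left untouched, since no rule knows about its relation to "go".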

#### **Pros of Stemming:**
- **Simple and Fast**: Stemming is generally **faster** than lemmatization since it uses a rule-based approach without the need for dictionary lookups or deeper linguistic analysis.
- **Good for Information Retrieval**: Stemming can be useful in applications like **search engines** or **information retrieval** where the goal is to group similar words and match queries to documents. For instance, "run", "runs", and "running" are all reduced to the stem "run", making it easier to find relevant results. (Derived forms are not always merged: the Porter stemmer leaves "runner" unchanged.)
- **Reduces Dimensionality**: By mapping word variants to a common stem, stemming shrinks the **feature space** (the vocabulary), which reduces sparsity and can improve model performance, particularly on large datasets.
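
The effect on the feature space can be seen by counting distinct tokens before and after stemming. The following is a small sketch on a made-up three-sentence corpus; `crude_stem` is a hypothetical stand-in for a real stemmer and strips only three suffixes.

```python
def crude_stem(token):
    """Strip one common suffix; a simplified stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Toy corpus invented for this example.
docs = [
    "the cat jumps and the dogs jumped",
    "cats keep jumping while a dog walks",
    "the dog walked and the cat jump",
]

# Each distinct token would be one column in a BoW or TF-IDF matrix.
raw_vocab = {tok for doc in docs for tok in doc.split()}
stemmed_vocab = {crude_stem(tok) for tok in raw_vocab}

print("vocabulary size:", len(raw_vocab), "->", len(stemmed_vocab))
```

On this corpus the vocabulary shrinks from 15 to 9 entries, because variants such as "jumps", "jumped", and "jumping" collapse into a single feature.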

#### **Cons of Stemming:**
- **Over-Simplification**: Stemming can produce **stems that are not real words** (for example, "studies" → "studi") and can merge words with different meanings (over-stemming). The Porter stemmer, for instance, reduces "universe", "university", and "universal" to the same stem, "univers".
- **Loss of Meaning**: Since stems are produced by stripping suffixes without understanding the word’s context or meaning, the output can be hard to interpret, and related words with different surface forms are not linked (e.g., "better" stays "better" and is never connected to "good").
- **Inaccurate for Irregular Forms**: Stemming does not handle irregular forms. A word like "went" (past tense of "go") is left unchanged rather than being reduced to "go".

---

### **2. Lemmatization:**

#### **How It Works:**
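Lemmatization, unlike stemming, considers the **meaning of the word** and uses vocabulary and lexical knowledge (e.g., dictionaries or a lexicon) to reduce words to their **lemma** (the base form). The result depends on the part of speech supplied for the word. For example, with "running" and "went" treated as verbs and "better" as an adjective: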
- **"running"** → **"run"**
- **"better"** → **"good"**
- **"went"** → **"go"**

Lemmatization often requires more complex algorithms than stemming because it involves a deeper understanding of the word's grammatical context (e.g., whether it is a verb, noun, adjective, etc.) and may also involve rules for handling different parts of speech.
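
The dependence on part of speech can be illustrated with a toy dictionary lookup. The lexicon below is a hypothetical hand-written table with four entries, not WordNet. A real lemmatizer, such as NLTK's `WordNetLemmatizer` (whose `lemmatize` method takes a `pos` argument that defaults to noun), consults a full lexical database.

```python
# Hypothetical mini-lexicon mapping (word, part of speech) -> lemma.
# Invented for illustration; real lemmatizers use resources like WordNet.
LEXICON = {
    ("went", "verb"): "go",
    ("running", "verb"): "run",
    ("better", "adj"): "good",
    ("mice", "noun"): "mouse",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, POS) pair; fall back to the word itself."""
    word = word.lower()
    return LEXICON.get((word, pos), word)

examples = [
    ("went", "verb"),
    ("running", "verb"),
    ("running", "noun"),   # e.g. "Running is fun": already a base form
    ("better", "adj"),
    ("better", "verb"),    # e.g. "to better oneself"
]
for word, pos in examples:
    print(f"{word} ({pos}) -> {toy_lemmatize(word, pos)}")
```

The same surface form maps to different lemmas depending on the tag, which is why lemmatization is usually paired with a POS tagger.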

#### **Pros of Lemmatization:**
- **More Accurate and Meaningful**: Lemmatization produces **real words** (or valid lemmas) that have actual meanings. For instance, "went" becomes "go", and "better" becomes "good", which preserves their meaning in context.
- **Context-Aware**: Lemmatization is more **context-sensitive** than stemming, as it uses part-of-speech (POS) tags to determine how a word functions in a sentence. For example, "better" is reduced to "good" when it is used as an adjective, but remains "better" when it is used as a verb (as in "to better oneself").
- **Improved Accuracy in Tasks**: Because lemmatization maps words to valid base forms, it can lead to more accurate models in tasks like **text classification**, **named entity recognition (NER)**, and **sentiment analysis**, although the size of the gain depends on the dataset and the model.

#### **Cons of Lemmatization:**
- **Slower than Stemming**: Lemmatization is computationally **more expensive** than stemming, as it requires more sophisticated algorithms, including dictionary lookups and POS tagging.
- **Resource Intensive**: Lemmatization requires access to lexical databases like **WordNet** or other pre-built lexicons to look up the correct lemma, making it more resource-heavy than stemming, which relies on simple rule-based transformations.
- **More Complex**: The need to handle word meaning and part-of-speech tagging adds complexity to lemmatization, which may not be necessary for simpler tasks where speed is more important than accuracy.

---

### **Key Differences Between Stemming and Lemmatization:**

| **Aspect**               | **Stemming**                             | **Lemmatization**                           |
|--------------------------|------------------------------------------|---------------------------------------------|
| **Approach**              | Rule-based stripping of affixes (simple) | Context-aware, uses vocabulary and lexicon  |
| **Output**                | Stems (not necessarily valid words)      | Lemmas (valid, meaningful words)            |
| **Speed**                 | Faster                                   | Slower                                      |
| **Accuracy**              | Less accurate, can produce incorrect roots | More accurate, produces correct base forms  |
| **Context Sensitivity**   | No context awareness                     | Considers part of speech and context        |
| **Computational Cost**    | Low (less resource-intensive)            | High (requires dictionaries and POS tagging) |
| **Use Cases**             | Information retrieval, search engines, or when speed is critical | NLP tasks requiring accurate understanding, e.g., classification, sentiment analysis |

---

### **When to Use Stemming vs. Lemmatization:**

- **Use Stemming** when:
  - Speed is critical, and a slight loss in meaning can be tolerated.
  - The application is for **information retrieval**, **search engines**, or **topic modeling**, where the goal is to group words together based on similar roots rather than preserving meaning.
  - The dataset is large and computational efficiency is a priority over accuracy.

- **Use Lemmatization** when:
  - Accuracy and **meaning** are important (e.g., in **text classification**, **named entity recognition**, or **sentiment analysis**).
  - The words in the corpus contain irregular or ambiguous forms, and you need to preserve semantic information (for example, distinguishing the noun "saw" from the verb "saw", whose lemma is "see").
  - You have access to sufficient computational resources, as the process can be slower than stemming.

---

### **Conclusion:**
- **Stemming** is **fast and efficient**, but it tends to oversimplify and can generate words that are not meaningful, which can lead to potential loss of information.
- **Lemmatization**, on the other hand, is more **accurate** and preserves the meaning of the words but requires more computational resources and is slower.

Ultimately, the choice between stemming and lemmatization depends on the specific requirements of the task at hand: if speed is more important than precision, stemming might be more suitable, but if you need to preserve **semantic meaning**, lemmatization is the better choice.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW: https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
