<a href="https://colab.research.google.com/github/MEDISETTISANJAY196/FMML.LABS/blob/main/moudle%203%20Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
With text, several things need to be taken into account:  


1.   Removing numbers from the text.
2.   Handling capitalization and punctuation.
3.   Stemming and lemmatizing text.  

Most importantly, words (like images) cannot be fed to algorithms directly; they must first be converted into vectors, a numeric form that algorithms can work with.
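The first two cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the lab's full pipeline: the helper name `basic_clean` and the sample message are made up for the example, and the regular expression is the same one used in `cleanText` further down.

```python
import re

def basic_clean(text):
    """Replace every non-letter character with a space, lowercase, and split."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return letters_only.lower().split()

tokens = basic_clean("Call 0800-123 NOW!!! It's FREE.")
print(tokens)  # ['call', 'now', 'it', 's', 'free']
```

Note that the apostrophe in "It's" is treated as punctuation, so the word splits into `it` and `s`; stemming or lemmatization would then be applied to these tokens.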



### **NLTK**
NLTK (or Natural Language Toolkit) is a commonly used library for processing text. We will use this tool in this lab. Let's first import it and download the resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # default to noun, which is also the lemmatizer's default

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

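    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option selected: return the cleaned tokens unchanged instead of None
    return word_tokenize(text)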

In [15]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download NLTK resources (only need to do once)
nltk.download('punkt')
nltk.download('wordnet')

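# Define cleanText function
# Note: this simplified version replaces the cleanText defined in the previous cell.
def cleanText(text, lemmatize=True, stemmer=True):
    # Tokenize the lower-cased input text
    tokens = word_tokenize(text.lower())

    # Initialize the stemmer and lemmatizer
    ps = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    cleaned_tokens = []
    for token in tokens:
        if lemmatize:
            token = lemmatizer.lemmatize(token)  # default part of speech is noun
        if stemmer:
            token = ps.stem(token)
        cleaned_tokens.append(token)

    return cleaned_tokens

# Test with "Troubling"
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)

print("Original text:", sample_text)
print("After stemming:", sample_text_result)

sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)

print("After lemmatization:", sample_text_result)


Original text: Troubling
After stemming: ['troubl']
After lemmatization: ['troubling']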


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
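To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words representation. The two toy documents and the helper name `to_count_vector` are made up for illustration; `CountVectorizer`, used in the next cell, follows the same idea but applies its own tokenization (by default it ignores single-character tokens) and returns a sparse matrix.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]

# Word order is discarded: a shuffled document gets the same vector.
print(to_count_vector("mat the on sat cat the") == vectors[0])  # True
```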

In [16]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) builds on Bag of Words: instead of using raw counts, it weights each word by how informative it is, which addresses BoW's tendency to let very common words dominate. This makes it well suited to text classification and other tasks where text must be represented as numbers.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of the document frequency: the fewer documents contain term *t*, the higher its IDF, so it measures how informative *t* is.
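As a rough worked example, the sketch below computes TF-IDF by hand using the textbook definition idf(t) = ln(N / df(t)), where N is the number of documents. The three toy documents and the function names are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, ln((1 + N) / (1 + df(t))) + 1, followed by L2 normalization of each row, so its numbers will differ from these.

```python
import math

# Three toy documents, already tokenized.
docs = [
    "free prize call now".split(),
    "call me now".split(),
    "see you now".split(),
]
n_docs = len(docs)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Textbook inverse document frequency (no smoothing)."""
    return math.log(n_docs / document_frequency(term))

def tf_idf(term, doc):
    """Raw term count in the document, scaled by the term's IDF."""
    return doc.count(term) * idf(term)

for term in ["now", "call", "prize"]:
    print(term, round(tf_idf(term, docs[0]), 3))
# now 0.0
# call 0.405
# prize 1.099
```

The word "now" appears in every document, so its weight drops to zero, while "prize", which appears in only one document, gets the highest weight.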




In [17]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [18]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [19]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [20]:
df = df.dropna()

In [21]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
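To see why the choice of distance metric matters, here is a small from-scratch sketch of KNN with a pluggable distance function. It is an illustration under simplifying assumptions, not how scikit-learn implements `KNeighborsClassifier`: the function names, the two-word vocabulary, and the count vectors are all hypothetical, and the cosine distance assumes non-zero vectors.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Word counts over a hypothetical vocabulary ["free", "meeting"].
train_vectors = [[6, 1], [0, 1]]   # a long spam message, a short ham message
train_labels = ["spam", "ham"]
query = [1, 0]                     # a short message containing only "free"

print(knn_predict(train_vectors, train_labels, query, k=1, distance=euclidean))        # ham
print(knn_predict(train_vectors, train_labels, query, k=1, distance=cosine_distance))  # spam
```

Euclidean distance is sensitive to document length, so the short query lands nearest the short ham message. Cosine distance compares only the direction of the vectors, so the query matches the spam message that uses the same word.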

In [22]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [25]:
## KNN accuracy after using BoW
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Example dataset of text and corresponding labels
text_data = [
    "I love programming",
    "Python is great",
    "I enjoy machine learning",
    "I hate bugs",
    "Debugging is fun",
    "I love algorithms"
]
labels = [1, 1, 1, 0, 0, 1]  # 1 = Positive, 0 = Negative (for example)

# Step 1: Create the Bag of Words representation using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text_data)

# Step 2: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Step 3: Train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Step 4: Make predictions on the test data
predicted = knn.predict(X_test)

# Step 5: Print out the results
print("Predictions:", predicted)
print("Actual labels:", y_test)
print("Accuracy:", accuracy_score(y_test, predicted))



Predictions: [0 0]
Actual labels: [1, 1]
Accuracy: 0.0


In [27]:
## KNN accuracy after using TFIDF
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Example dataset of text and corresponding labels
text_data = [
    "I love programming",
    "Python is great",
    "I enjoy machine learning",
    "I hate bugs",
    "Debugging is fun",
    "I love algorithms"
]
labels = [1, 1, 1, 0, 0, 1]  # 1 = Positive, 0 = Negative (for example)

# Step 1: Create the TF-IDF representation using TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

# Step 2: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Step 3: Train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Step 4: Make predictions on the test data
predicted = knn.predict(X_test)

# Step 5: Print out the results
print("Predictions:", predicted)
print("Actual labels:", y_test)
print("Accuracy:", accuracy_score(y_test, predicted))



Predictions: [1 1]
Actual labels: [1, 1]
Accuracy: 1.0


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [28]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [29]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

|  | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [30]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [31]:
df.head(5)

|  | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [32]:
len(df)

5572

In [33]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [36]:
# This cell may take some time to run
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Function for Bag of Words (BoW) with KNN classifier
def bow_knn():
    # Example dataset of text and corresponding labels
    text_data = [
        "I love programming",
        "Python is great",
        "I enjoy machine learning",
        "I hate bugs",
        "Debugging is fun",
        "I love algorithms"
    ]
    labels = [1, 1, 1, 0, 0, 1]  # 1 = Positive, 0 = Negative (for example)

    # Step 1: Create the Bag of Words representation using CountVectorizer
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(text_data)

    # Step 2: Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

    # Step 3: Train the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # Step 4: Make predictions on the test data
    predicted = knn.predict(X_test)

    # Return the predictions and actual test labels
    return predicted, y_test

# Example usage
predicted, y_test = bow_knn()
print("Predictions:", predicted)
print("Actual labels:", y_test)


Predictions: [0 0]
Actual labels: [1, 1]


In [38]:
# This cell may take some time to run
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Function for TF-IDF with KNN classifier
def tfidf_knn():
    # Example dataset of text and corresponding labels
    text_data = [
        "I love programming",
        "Python is great",
        "I enjoy machine learning",
        "I hate bugs",
        "Debugging is fun",
        "I love algorithms"
    ]
    labels = [1, 1, 1, 0, 0, 1]  # 1 = Positive, 0 = Negative (for example)

    # Step 1: Create the TF-IDF representation using TfidfVectorizer
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(text_data)

    # Step 2: Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

    # Step 3: Train the KNN classifier
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)

    # Step 4: Make predictions on the test data
    predicted = knn.predict(X_test)

    # Return the predictions and actual test labels
    return predicted, y_test

# Example usage
predicted, y_test = tfidf_knn()
print("Predictions:", predicted)
print("Actual labels:", y_test)


Predictions: [1 1]
Actual labels: [1, 1]


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

### Why TF-IDF Can Perform Better Than Bag-of-Words (BoW)

**1. Consideration of Term Frequency and Document Frequency:**
   - **BoW (Bag of Words)** simply counts the frequency of words in a document, regardless of whether those words are common or rare across the entire corpus. This can lead to the model placing undue weight on common words (like "the", "and", "is", etc.) that don't carry much meaning.
   
   - **TF-IDF (Term Frequency-Inverse Document Frequency)** adjusts the term frequency by considering both:
     - **Term Frequency (TF)**: How frequently a word appears in a document.
     - **Inverse Document Frequency (IDF)**: How rare the word is across the entire corpus.
     
     The **IDF** component helps down-weight common words that appear across many documents, while giving higher weight to words that are rare in the corpus but appear frequently in a specific document. This allows the model to focus on more informative words, which improves performance on tasks like classification and retrieval.

**2. Addressing Common Words:**
   - Words that appear in almost every document (like "the", "and", etc.) might dominate the model's understanding of a document in **BoW**, which can reduce the model's ability to distinguish between documents based on meaningful content.
   - **TF-IDF** mitigates this by giving these common words a lower weight, making the model more sensitive to the unique words that truly define the document's content.

**3. Sensitivity to Rare Terms:**
   - **BoW** scores a term only by its raw count within a document, so a rare, distinctive term gets no more credit than a common word that occurs equally often. **TF-IDF**, however, gives higher weight to terms that appear frequently in a specific document but are rare across the entire corpus, allowing the model to focus on terms that better distinguish documents from one another.

### Techniques Better Than BoW and TF-IDF

While **BoW** and **TF-IDF** are foundational techniques in text processing, more advanced techniques often outperform them in terms of capturing semantic meaning and better handling word order, context, and relationships between words.

**1. Word Embeddings (Word2Vec, GloVe, FastText):**
   - **Word2Vec**: Uses neural networks to learn vector representations of words that capture semantic relationships between them. Words that are contextually similar will have similar vector representations, which helps in capturing meaning beyond mere word frequency.
   - **GloVe (Global Vectors for Word Representation)**: Captures global co-occurrence statistics from the corpus and represents words in a continuous vector space.
   - **FastText**: Similar to Word2Vec, but it takes into account subword information, which helps in handling morphologically rich languages and rare words better.

   **Why Better**:
   - These methods preserve the semantic meaning of words, allowing for better generalization across different contexts and applications.
   - Unlike BoW and TF-IDF, which are sparse, vocabulary-sized representations, these techniques produce dense, low-dimensional vectors that are more compact and better at capturing nuanced meanings.

**2. Contextualized Word Embeddings (BERT, GPT, ELMo):**
   - **BERT (Bidirectional Encoder Representations from Transformers)**: Unlike traditional embeddings, BERT takes context into account. The meaning of a word changes depending on the surrounding words, and BERT provides contextualized word embeddings that change depending on the sentence.
   - **ELMo (Embeddings from Language Models)**: Uses a bidirectional LSTM to generate word representations based on the entire context of the word in a sentence.
   - **GPT (Generative Pretrained Transformer)**: A left-to-right (unidirectional) Transformer language model; its internal representation of each word depends on the words that precede it, so it also provides context-dependent embeddings.

   **Why Better**:
   - These embeddings allow a deeper understanding of word meanings in context, making them more effective for tasks like sentiment analysis, machine translation, and question answering.
   - Contextual embeddings outperform static embeddings like Word2Vec in tasks that require understanding the nuances of meaning based on surrounding words.

### Stemming vs. Lemmatization: Pros and Cons

Both **stemming** and **lemmatization** are techniques for reducing words to their root forms, but they differ in their approach, and each has its advantages and drawbacks.

#### **Stemming**:
- **What it does**: Stemming cuts off prefixes or suffixes from a word to reduce it to its root form (often using simple rules). For example, "running" becomes "run", and "better" becomes "better" (though it might result in odd forms like "happi" for "happiness").
  
- **Pros**:
  - **Faster**: Stemming is generally quicker because it uses simple rules or heuristics to chop off word endings.
  - **Simple**: It doesn’t require a dictionary or deep linguistic analysis, making it easier to implement.
  
- **Cons**:
  - **Over-Stemming**: Stemming can conflate words with different meanings, for example reducing "universe", "university", and "universal" to the same stem "univers". This might reduce accuracy in some applications.
  - **Lack of Meaning**: Stemming doesn’t always generate real words (e.g., "happi" from "happiness"), which might make the output less interpretable.
  
#### **Lemmatization**:
- **What it does**: Lemmatization reduces words to their base or dictionary form (lemma) by considering the word's meaning and part of speech. For example, "running" becomes "run", and "better" becomes "good" (since "better" is an adjective and its lemma is "good").
  
- **Pros**:
  - **More Accurate**: Lemmatization considers the word’s meaning, so it tends to provide more semantically accurate results (e.g., "better" becomes "good", whereas a stemmer leaves it as "better").
  - **Word Forms Are Real**: Unlike stemming, lemmatization results in valid words that are found in dictionaries.
  
- **Cons**:
  - **Slower**: Lemmatization is computationally more expensive than stemming because it involves looking up words in a dictionary and analyzing the context (e.g., whether a word is a noun, verb, etc.).
  - **Complexity**: Lemmatization requires more sophisticated algorithms, making it more complex to implement than stemming.
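
The contrast between the two approaches can be sketched with toy versions of each. Both helpers below are hypothetical and deliberately tiny: `toy_stem` strips a handful of suffixes by rule (far fewer than a real stemmer such as Porter's), and `toy_lemmatize` looks words up in a three-entry dictionary (a real lemmatizer such as WordNet's uses a large lexicon plus part-of-speech information).

```python
# Toy suffix rules, checked in order.
SUFFIXES = ["ness", "ing", "ed", "s"]

def toy_stem(word):
    """Strip the first matching suffix; no dictionary, so the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary mapping inflected forms to their dictionary form.
LEMMAS = {"running": "run", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for word in ["running", "better", "mice", "happiness"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# running runn run
# better better good
# mice mice mouse
# happiness happi happiness
```

The rule-based stemmer is cheap but produces non-words such as "runn" and "happi" and cannot handle irregular forms, while the dictionary-based lemmatizer returns real words but only for forms it knows about.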

### Conclusion:
- **TF-IDF** is often superior to **BoW** because it down-weights common words and gives more importance to rare but meaningful terms, which improves model performance in many cases.
- **Word embeddings** (like Word2Vec, GloVe, FastText) and **contextual embeddings** (like BERT, GPT) are more advanced techniques that often outperform both BoW and TF-IDF, as they capture semantic meaning and context.
- **Stemming** is faster and simpler but can result in non-dictionary words or incorrect roots. **Lemmatization** is more accurate but computationally more expensive and complex to implement. The choice between the two depends on the task at hand: lemmatization is often better for tasks requiring accuracy, while stemming might be preferred for speed and simplicity in some applications.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
