<a href="https://colab.research.google.com/github/DBobby56/FMML_LABS-AND-PROJECTS/blob/main/module%203%20Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let us look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form suitable for machine-learning algorithms.  
For text, the usual preprocessing steps include:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot consume words or images directly; these must first be converted into vectors, a form that algorithms can understand.
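The first two steps in the list can be sketched with the standard library alone. This is a minimal illustration, not the lab's `cleanText` function; the name `basic_clean` and the sample sentence are made up for the example.

```python
import re

def basic_clean(text):
    """Lowercase the text and keep only alphabetic tokens."""
    text = text.lower()                    # handle capitalization
    text = re.sub(r"[^a-z]+", " ", text)   # replace digits and punctuation with spaces
    return text.split()                    # split on whitespace into tokens

tokens = basic_clean("The 3 cats were RUNNING, quickly!")
print(tokens)  # ['the', 'cats', 'were', 'running', 'quickly']
```

Stemming or lemmatization would then be applied to each token, and the cleaned tokens would be turned into vectors.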



### **NLTK**
NLTK (or Natural Language Toolkit) is a commonly used library for processing text. We will use this tool in this lab. Let's first download the NLTK resources we need.


In [20]:
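import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer from this resource (see the LookupError further down)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')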

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [21]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return wordnet.NOUN  # fall back to the lemmatizer's default part of speech

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither option requested: return the plain tokens instead of None
    return word_tokenize(text)

In [22]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def cleanText(text, lemmatize=False, stemmer=False):
    # Tokenize text
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalpha()]  # remove non-alphabetic tokens

    # Initialize the stemmer and lemmatizer
    ps = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    cleaned_tokens = []

    for token in tokens:
        if stemmer:
            cleaned_tokens.append(ps.stem(token))  # Apply stemming
        elif lemmatize:
            cleaned_tokens.append(lemmatizer.lemmatize(token))  # Apply lemmatization
        else:
            cleaned_tokens.append(token)  # Keep the original token

    return cleaned_tokens



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
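To make the idea concrete, here is a small from-scratch sketch of a bag-of-words matrix. The two sentences and the helper name `bag_of_words` are invented for illustration; the lab itself uses scikit-learn's `CountVectorizer`, which also handles tokenization and stop words.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]

# Word order is discarded: a shuffled sentence gives the same vector.
print(bag_of_words("mat the on sat cat the", vocab) == matrix[0])  # True
```

Each row is one document and each column is one vocabulary word, which is the document-term matrix that a classifier such as KNN consumes.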

In [23]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag of Words: raw counts let very common words dominate the representation. TF-IDF instead weights each word by how informative it is, which makes it well suited to text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is based on the inverse of DF, usually log-scaled, and measures the informativeness of a term *t*: the rarer the term across documents, the higher its IDF. The TF-IDF weight of a term is the product of its TF and its IDF.
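The definitions above can be worked through by hand on a tiny corpus. The sketch below is illustrative only: the three sentences are made up, and it assumes the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is the default that scikit-learn documents for `TfidfVectorizer` (scikit-learn additionally L2-normalizes each row, which this sketch skips).

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the bird flew"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "the" (present in all three documents) receives the lowest weight, "sat" (two documents) a middle weight, and "cat" (one document) the highest, even though each occurs exactly once.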




In [24]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [25]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews (1).csv


In [26]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [27]:
df = df.dropna()

In [28]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
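To see why the choice of distance metric matters, here is a minimal from-scratch KNN sketch. It is not how scikit-learn implements `KNeighborsClassifier`; the function names and the toy two-dimensional count vectors are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy vectors (made up): [count of "good", count of "bad"] per review.
train_X = [[9, 3], [12, 4], [2, 3], [1, 2]]
train_y = ["pos", "pos", "neg", "neg"]
query = [3, 1]  # a short review that is mostly "good"

print(knn_predict(train_X, train_y, query, k=3, distance=euclidean))        # neg
print(knn_predict(train_X, train_y, query, k=3, distance=cosine_distance))  # pos
```

Euclidean distance is sensitive to document length, so the short query lands next to the short negative reviews. Cosine distance compares only the direction of the vectors, so the query matches the longer reviews with the same mix of words.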

In [29]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
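As a preview, the `cv=3` argument used above splits the training data into three folds and evaluates on each fold in turn. The sketch below shows only the index bookkeeping, under the assumption of contiguous, unshuffled folds; the helper `k_fold_indices` is hypothetical, and scikit-learn additionally stratifies the folds by class for classifiers, which this sketch does not do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

for train_idx, val_idx in k_fold_indices(7, 3):
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once, and the reported cross-validation accuracy is the mean of the per-fold scores.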

## KNN accuracy after using BoW
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load your dataset
# For this example, assume you have a CSV file with columns 'sentence' and 'sentiment'
# 'sentence' contains the text data, 'sentiment' contains the labels (e.g., 0 or 1)

training_data = pd.read_csv('reviews.csv')  # Replace with your actual file path

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(training_data['sentence'], training_data['sentiment'], test_size=0.2, random_state=5)

# Step 2: Bag of Words (BoW) transformation
vectorizer = CountVectorizer(stop_words='english')  # You can remove stopwords if needed
X_train_bow = vectorizer.fit_transform(X_train)  # Fit and transform the training data
X_test_bow = vectorizer.transform(X_test)  # Only transform the test data

# Step 3: KNN Model
knn = KNeighborsClassifier(n_neighbors=5)  # You can adjust the number of neighbors
knn.fit(X_train_bow, y_train)  # Train the KNN model

# Step 4: Predictions and Accuracy
y_pred = knn.predict(X_test_bow)  # Predict on the test data
accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy

print(f"KNN Accuracy after using BoW: {accuracy:.4f}")

predicted, y_test = bow_knn()

In [30]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [31]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [32]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [33]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [34]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [35]:
len(df)

5572

In [36]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [38]:
# This cell may take some time to run
predicted, y_test = bow_knn()

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


In [39]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html



1. The **TF-IDF (Term Frequency-Inverse Document Frequency)** approach tends to result in better accuracy than **Bag of Words (BoW)** for the following reasons:

- **Weighting Importance of Words**:
  - **BoW** treats all words equally. It only counts the frequency of each word in a document without considering whether the word is significant across the entire dataset. This means common words (like "the", "is", "in") that appear in almost every document can dominate the features, even though they might not be important for distinguishing between classes.
  - **TF-IDF** adjusts the weight of each word based on its frequency in the current document (TF) and its importance across the entire corpus (IDF). Words that appear frequently in a document but are rare in the overall corpus are given higher weights. This reduces the influence of common words that do not help differentiate between documents, and emphasizes the words that are more specific to particular documents or classes.
  
- **Reduces Impact of Common Words**:
  - **BoW**: Words that are extremely frequent across all documents (like common stopwords) are not penalized, which may lead to noisy features in the model.
  - **TF-IDF**: Rare words (that are useful for classification) are given higher importance, while frequent words across the corpus are down-weighted.

- **Better Performance in Classification**:
  - By using **TF-IDF**, the model can better differentiate between documents based on the unique, informative words, leading to improved performance in text classification tasks.

2. Yes, there are several techniques that have been shown to perform better than both **BoW** and **TF-IDF** in certain natural language processing (NLP) tasks:

- **Word Embeddings (Word2Vec, GloVe, FastText)**:
  - **Word Embeddings** are a more advanced technique that learns dense, continuous vector representations of words. Unlike BoW and TF-IDF, which represent words as sparse vectors (each word is represented by a unique feature), word embeddings represent words in a continuous vector space where similar words are mapped closer together. This captures semantic relationships between words (e.g., "king" and "queen" will be closer than "king" and "dog").
  - **Advantages**:
    - Captures word meaning and context.
    - Reduces the dimensionality of the feature space, resulting in better performance for larger corpora.
    - Words with similar meanings are grouped together (semantic similarity).
  
- **Contextualized Word Embeddings (BERT, GPT, RoBERTa)**:
  - Models like **BERT (Bidirectional Encoder Representations from Transformers)** and other transformer-based models represent words based on their context in a sentence. Unlike traditional word embeddings that give a fixed vector for each word, these models generate different vectors for the same word depending on its usage in different contexts.
  - **Advantages**:
    - Context-aware embeddings, which allow the model to better understand word meanings in specific contexts (e.g., "bank" as a financial institution vs. "bank" as the side of a river).
    - Fine-tuning the model on specific tasks (e.g., sentiment analysis, question answering) can lead to significant improvements in performance.

- **Doc2Vec (Paragraph Vector)**:
  - **Doc2Vec** extends the Word2Vec model to represent entire documents (or sentences) as fixed-size vectors. It takes into account the context of the entire document, rather than just individual words.
  - **Advantages**:
    - Captures the meaning of an entire document, which is useful for document classification or retrieval tasks.
    - Embedding representations of the entire document can improve tasks that rely on the overall meaning (e.g., sentiment analysis, document similarity).

- **Transformers (e.g., GPT, T5, BERT)**:
  - Transformer-based architectures (like **GPT**, **T5**, and **BERT**) are among the most powerful models in NLP today. These models use attention mechanisms to weigh the importance of different words in a sentence, allowing them to capture complex dependencies and context better than older methods like BoW and TF-IDF.
  - **Advantages**:
    - High accuracy for a variety of NLP tasks (text classification, summarization, question answering, etc.).
    - Pretrained models can be fine-tuned on domain-specific data for further improvements.

3. **Stemming** and **Lemmatization** are both techniques for reducing words to their base or root form, but they differ in how they do this and the outcomes they produce.

#### **Stemming**:
- **What it does**: Stemming is a heuristic process that chops affixes (in practice mostly suffixes) off words to arrive at a "root" form, without regard for the actual meaning of the word. For example, the Porter stemmer turns "running" into "run" and "happy" into "happi".
  
- **Pros**:
  - **Faster**: Stemming algorithms are usually faster than lemmatization because they use simple rules for cutting off word endings.
  - **Simplicity**: Stemming is simple and doesn’t require a dictionary or additional information.

- **Cons**:
  - **Over-stemming**: Because stemming ignores meaning, it can reduce unrelated words to the same stem (e.g., the Porter stemmer maps "universe", "university", and "universal" all to "univers").
  - **Under-stemming and non-words**: Stems are often not real words (e.g., "happi"), and irregular forms are missed entirely (e.g., "better" is left unchanged rather than linked to "good").

#### **Lemmatization**:
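- **What it does**: Lemmatization is a more sophisticated process that reduces a word to its **lemma**, which is its base or dictionary form. It uses vocabulary and morphological analysis to ensure the root word is valid. For example, given the correct part-of-speech tag, "running" becomes "run" and "better" becomes "good". Without the tag, a lemmatizer such as NLTK's `WordNetLemmatizer` treats every word as a noun and leaves both words unchanged.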
  
- **Pros**:
  - **More accurate**: Lemmatization considers the meaning of a word and reduces it to a valid root form, preserving the meaning (e.g., "was" → "be").
  - **Fewer ambiguities**: It helps to distinguish between different meanings of words (e.g., "better" as an adjective vs. "better" as a verb).
  
- **Cons**:
  - **Slower**: Lemmatization can be slower than stemming because it requires additional resources, like word dictionaries, and performs more complex checks.
  - **Requires more resources**: It typically needs a dictionary or a set of rules for correct lemmatization.

### Summary of Pros/Cons:

| **Method**       | **Pros**                                     | **Cons**                                     |
|------------------|----------------------------------------------|----------------------------------------------|
| **Stemming**     | Fast, simple, good for quick tasks.         | May produce non-words, loss of meaning.     |
| **Lemmatization**| More accurate, preserves meaning.           | Slower, requires resources (dictionaries).  |
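The contrast in the table can be illustrated with a deliberately crude toy. Neither function below is a real algorithm: `toy_stem` is a hypothetical suffix stripper (far simpler than the Porter or Snowball stemmers), and `TOY_LEMMAS` is a made-up four-entry dictionary standing in for a real lexicon such as WordNet.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in ("ing", "ies", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"running": "run", "studies": "study", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["running", "studies", "better", "was"]:
    print(f"{word:8} stem={toy_stem(word):8} lemma={toy_lemmatize(word)}")
```

The rule-based stemmer is fast and needs no data, but it produces non-words ("runn", "stud") and cannot handle irregular forms ("better", "was"). The dictionary lookup returns valid words but only for entries it knows about.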

### Conclusion:
- **TF-IDF** is generally better than BoW because it emphasizes more important and informative words, reducing the impact of frequent but uninformative terms.
- **Word Embeddings** (e.g., Word2Vec, BERT) are more advanced techniques that often outperform both BoW and TF-IDF due to their ability to capture semantic meaning and context.
- **Lemmatization** is typically preferred over stemming when accuracy is important because it preserves the meaning of words, although it is slower and requires more resources. Stemming can be useful for faster, simpler tasks where a rough approximation is acceptable.