<a href="https://colab.research.google.com/github/005sudha/005sudha-fmml-lab/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be cleaned and converted into a form suitable for machine-learning algorithms.  
For text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot operate on words or images directly; these must first be converted into vectors, a numeric form that algorithms can process.
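As a minimal sketch of the first two steps, the snippet below removes numbers and punctuation and lowercases the text using only the standard library. The helper name `basic_clean` and the sample sentence are made up for illustration; the `cleanText` function later in this lab does the same thing and also stems or lemmatizes.

```python
import re

def basic_clean(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    return letters_only.lower().split()

tokens = basic_clean("The 3 Cats' toys cost $12.50!")
print(tokens)  # ['the', 'cats', 'toys', 'cost']
```

Note that this simple rule also discards digits that may carry meaning (such as prices), which is a trade-off of this kind of cleaning.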



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    tokens = word_tokenize(text)  # Split the cleaned text into word tokens

    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank POS tag to the matching WordNet POS constant."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # The lemmatizer treats words as nouns by default, so use that for all other tags
                return wordnet.NOUN

        # Lemmatize each token according to its part-of-speech tag
        return [wordnet_lemmatizer.lemmatize(word, get_tag(tag)) for word, tag in pos_tag(tokens)]

    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option selected: return the cleaned tokens unchanged instead of None
    return tokens

In [3]:
# word_tokenize in recent NLTK releases loads the 'punkt_tab' resource; without it
# this cell fails with "LookupError: Resource punkt_tab not found".
nltk.download('punkt_tab')
# Assumption: the same NLTK releases also expect this tagger resource for pos_tag.
nltk.download('averaged_perceptron_tagger_eng')

sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
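To make this concrete, here is a small sketch that builds count vectors by hand with the standard library. The three sentences and the helper name `to_count_vector` are invented for illustration; `CountVectorizer`, used in the next cell, does the same job with more options.

```python
from collections import Counter

docs = ["the movie was good", "the movie was not good", "good good plot"]

# Vocabulary: every distinct word in the corpus, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)  # ['good', 'movie', 'not', 'plot', 'the', 'was']
for vector in vectors:
    print(vector)

# Word order is discarded: a shuffled sentence gets the same vector
print(to_count_vector("good was movie the") == vectors[0])  # True
```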

In [None]:
# Cleans the train and test documents (optionally removing stopwords) and converts them to Bag-of-Words document-term matrices.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
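TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag-of-Words: raw counts give frequent but uninformative words as much influence as distinctive ones. TF-IDF instead weights each word by how informative it is, which often works well for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) decreases as DF grows, so it measures the informativeness of a term *t*: a term found in few documents gets a high IDF, and a term found in nearly every document gets a low one. The TF-IDF weight of a term in a document is the product of its TF and its IDF.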
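The sketch below computes these quantities by hand on a made-up three-document corpus, using the textbook form IDF = log(N / DF). This is an illustrative assumption, not what the lab code runs: scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the food was great",
    "the food was awful",
    "the service was great",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Textbook inverse document frequency: log(N / DF)."""
    return math.log(n_docs / df[term])

def tfidf(tokens):
    """TF-IDF weight of every term in one tokenized document."""
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(tokenized[1])  # "the food was awful"
for term, weight in weights.items():
    print(f"{term}: {weight:.3f}")
```

Here "the" and "was" occur in every document, so their weight is zero, while "awful" occurs in only one document and gets the largest weight.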




In [None]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
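To see why the choice of distance metric matters, here is a minimal pure-Python sketch of KNN voting with Euclidean and cosine distance. The two-dimensional count vectors, the labels, and the helper names are all invented for illustration; the models below use scikit-learn's `KNeighborsClassifier` instead.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Return the majority label among the k training points closest to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: distance(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy vectors: [count of "good", count of "bad"]
train_vectors = [[1, 0], [0, 1], [3, 5], [2, 0]]
train_labels = ["pos", "neg", "neg", "pos"]
query = [6, 2]  # a long review that is mostly positive

by_euclidean = knn_predict(train_vectors, train_labels, query, k=1, distance=euclidean_distance)
by_cosine = knn_predict(train_vectors, train_labels, query, k=1, distance=cosine_distance)
print(by_euclidean, by_cosine)
```

In this toy case Euclidean distance matches the long query with the other long document, which is negative, while cosine distance ignores document length and matches it with the short positive documents that point in the same direction.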

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.
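As a preview, the sketch below shows how `cv=3` can divide the training samples into three folds, each used once for validation while the rest is used for fitting. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds; for a classifier, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds, so the actual splits differ.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=7, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

The cross-validation accuracy printed by the functions above is the mean of the scores obtained on the validation folds.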

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the Bag-of-Words (BoW) model because it addresses some of the limitations of BoW, particularly in terms of weighting the importance of words. Here’s why TF-IDF often performs better:

1. Importance of Words:

BoW: The Bag-of-Words model counts the frequency of each word in a document, treating all words equally, regardless of how common or rare they are in the entire corpus.

TF-IDF: In contrast, TF-IDF gives more weight to words that are frequent in a specific document but rare across the entire corpus. This helps highlight words that are more likely to be important for distinguishing between documents.


Why this is better:

Common words (like "the," "and," "is") are usually not meaningful for classification tasks, as they appear in almost every document and do not help in distinguishing between them. TF-IDF reduces their importance, while BoW gives them equal weight to other words.


2. Downplaying Common Words (Stopwords):

BoW: Since it treats all words equally, common words or stopwords (e.g., "in," "on," "to") are given the same importance as more meaningful words.

TF-IDF: It assigns lower weights to words that appear frequently across many documents, reducing the effect of stopwords on the model. This leads to better feature representation and more meaningful word vectors.


3. Handling Rare Terms:

BoW: Rare words, although potentially important, are treated as equal to more common words, often leading to noise in the data.

TF-IDF: Rare words that appear in only a few documents are given higher weights, which can be crucial for distinguishing between documents in certain contexts.


4. Feature Scaling:

BoW: All words are treated as independent features, and the raw counts can cause imbalances if some words appear much more frequently than others.

TF-IDF: By applying the inverse document frequency, the importance of common terms is scaled down, creating a better representation of the text that avoids the dominance of frequent, non-informative terms.


5. Better for Text Classification:

BoW: BoW can capture the occurrence of words, but it does not take into account how significant those words are in the context of the entire corpus.

TF-IDF: TF-IDF helps to identify the distinctive words in each document by considering both local and global word frequency, which is important for text classification tasks like spam detection or sentiment analysis. This makes TF-IDF more effective at capturing patterns in the data that help differentiate between classes.


6. Sparsity:

BoW: The BoW model produces very sparse vectors with many zero counts, especially if the vocabulary is large. This increases computational cost and can hurt distance-based models such as KNN.

TF-IDF: TF-IDF vectors are just as sparse, because a word that is absent from a document still gets a weight of zero. The benefit lies in the non-zero entries, which are reweighted so that informative terms count for more than common ones.


Conclusion:

In summary, TF-IDF improves on the BoW model by giving higher weights to more informative and unique words and reducing the influence of common, unimportant words. This more nuanced representation of text typically leads to better performance in text classification tasks, making TF-IDF a more effective approach compared to the simpler Bag-of-Words model.


2. Can you think of techniques that are better than both BoW and TF-IDF?
Yes, there are several techniques that can outperform both Bag-of-Words (BoW) and TF-IDF in terms of capturing the semantic meaning and context of text data. These more advanced methods address the limitations of BoW and TF-IDF, such as ignoring word order, context, and the relationship between words. Here are some notable techniques:

1. Word Embeddings (Word2Vec, GloVe, FastText):

Word2Vec: This technique creates dense, low-dimensional vector representations of words by training a neural network model to predict context words given a target word (Skip-gram) or vice versa (CBOW). Word2Vec captures semantic relationships, meaning words with similar meanings are close in the vector space.

GloVe (Global Vectors for Word Representation): GloVe generates word embeddings based on the global co-occurrence matrix of words in a corpus. It models the relationships between words in a way that captures semantic similarity and word associations.

FastText: An extension of Word2Vec, FastText represents words as bags of character n-grams. This is useful for capturing the morphology of rare or unseen words, making it more effective for languages with rich morphology or out-of-vocabulary terms.


Why they are better:

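Unlike BoW and TF-IDF, word embeddings place words with similar meanings close together in the vector space, so a model can generalize across related words (e.g., "good" and "great"). Note that these static embeddings assign a single vector to each word, so they cannot distinguish different senses of a word such as "bank" (financial institution vs. riverbank); that requires contextualized embeddings.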

They provide dense, continuous-valued vectors instead of sparse binary or count vectors, reducing memory usage and computational complexity.
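The following sketch illustrates the idea with hand-made vectors. The three-dimensional values in `toy_embeddings` are invented for illustration only; real embeddings such as Word2Vec or GloVe are learned from large corpora and have far more dimensions. A document is represented here by averaging its word vectors, which is one simple option among several.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only
toy_embeddings = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.3, 0.1],
    "movie": [0.0, 0.9, 0.4],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def document_vector(tokens):
    """Average the vectors of the tokens that have an embedding."""
    known = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    dims = len(next(iter(toy_embeddings.values())))
    if not known:
        return [0.0] * dims
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

sim_similar = cosine_similarity(document_vector(["good", "movie"]),
                                document_vector(["great", "movie"]))
sim_opposite = cosine_similarity(document_vector(["good", "movie"]),
                                 document_vector(["awful", "movie"]))
print(sim_similar > sim_opposite)  # True
```

With Bag-of-Words, "good movie" shares exactly one word with both "great movie" and "awful movie", so it cannot tell the two apart; with these toy embeddings the first pair comes out as more similar.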


2. Contextualized Embeddings (BERT, GPT, RoBERTa, T5):

BERT (Bidirectional Encoder Representations from Transformers): BERT produces context-aware embeddings by taking into account the entire sentence or document, rather than treating words in isolation. It captures the meaning of a word based on its surrounding words in a bidirectional way (both left and right context).

GPT (Generative Pre-trained Transformer): Like BERT, GPT is a transformer-based model that learns contextualized representations. However, GPT is unidirectional (left to right) and is designed for language generation tasks.

RoBERTa and T5: RoBERTa is a BERT variant trained for longer and on more data with a refined training procedure. T5 is an encoder-decoder transformer that frames every NLP task as a text-to-text problem. Both perform well on a wide range of NLP tasks.


Why they are better:

These models learn deep contextual information, meaning that the same word will have different embeddings depending on its context (e.g., "bat" in "bat and ball" vs. "bat" in "flying bat").

They are pretrained on large corpora and fine-tuned on specific tasks, resulting in state-of-the-art performance across many NLP applications (e.g., text classification, question answering, and sentiment analysis).


3. Transformers (BERT, T5, etc.) for Text Representation:

Transformers are a type of model architecture that leverage attention mechanisms to learn relationships between words in a sequence, regardless of their distance. This allows transformers to capture long-range dependencies and context, which is a significant improvement over traditional methods like BoW and TF-IDF that ignore word order and context.


Why they are better:

Transformers handle sequential data efficiently and can understand complex language structures, capturing both syntax and semantics.

Models like BERT can be fine-tuned for specific downstream tasks, improving accuracy on tasks like text classification, named entity recognition, and machine translation.


4. Sentence Embeddings (Sentence-BERT, Universal Sentence Encoder):

Sentence-BERT (SBERT): SBERT is a modification of BERT that generates embeddings for entire sentences, making it suitable for tasks like sentence similarity, clustering, and semantic search.

Universal Sentence Encoder (USE): USE is another model that generates sentence embeddings optimized for downstream NLP tasks such as sentence-level classification or semantic similarity.


Why they are better:

These models take into account the entire sentence's meaning, rather than just individual words, making them useful for tasks where the context of the entire sentence matters.

They produce fixed-length vectors for variable-length text inputs, which can be compared directly for similarity, clustering, or retrieval tasks.


5. Doc2Vec:

Doc2Vec (also known as Paragraph Vector) is an extension of Word2Vec that generates fixed-length vector representations for entire documents or paragraphs. It takes into account the context of the entire document, allowing for more accurate document-level representations.


Why it’s better:

While Word2Vec provides vector representations for words, Doc2Vec is useful for capturing the semantic meaning of larger text units (like paragraphs or documents).

It maintains the local context and global structure of the document, unlike BoW and TF-IDF, which treat each word independently.


6. Attention Mechanisms:

Attention mechanisms, often used in combination with transformer models, allow the model to focus on specific parts of the text that are most relevant for the task at hand. This improves performance on tasks like machine translation, summarization, and question answering.


Why it’s better:

Attention helps the model capture the most relevant parts of the input sequence, which is especially useful for longer documents or complex language structures. This provides more fine-grained understanding compared to BoW or TF-IDF, which treat all words uniformly.
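A minimal sketch of scaled dot-product attention, the form used in transformer models, is shown below for a single query. The key, value, and query vectors are made-up toy numbers; real models learn them and process many queries at once with matrix operations.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Toy inputs: three positions, two dimensions each
keys = [[1, 0], [0, 1], [1, 1]]
values = [[10, 0], [0, 10], [5, 5]]
query = [4, 0]

weights, output = attention(query, keys, values)
print(weights)
print(output)
```

The query matches the first and third keys equally and the second key not at all, so the second position receives the smallest weight.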


Conclusion:

Word embeddings (like Word2Vec and GloVe) and contextualized embeddings (like BERT and GPT) are more advanced techniques that capture semantic meaning, context, and relationships between words.

Transformer-based models and sentence embeddings offer further improvements by capturing long-range dependencies and contextual relationships in text, enabling them to perform well on complex NLP tasks.

Doc2Vec is useful for capturing document-level representations, and attention mechanisms enhance the ability to focus on relevant parts of text.


These advanced methods are generally more effective than BoW and TF-IDF because they are capable of understanding the deeper structure and meaning of text, which improves performance in tasks like text classification, sentiment analysis, and semantic search.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
Stemming and lemmatization are both techniques used in Natural Language Processing (NLP) to reduce words to their base or root form. However, they differ in their approaches and outcomes. Let's break down the pros and cons of each:

Stemming

Stemming is the process of reducing a word to its base or root form by chopping off prefixes or suffixes, often through heuristic methods.

Example:

"running" → "run"

"better" → "better" (unchanged, because no suffix rule applies)

"happiness" → "happi"



Pros of Stemming:

1. Efficiency:

Stemming algorithms are typically faster than lemmatization because they rely on simple rules (e.g., removing affixes) and do not require a dictionary or linguistic analysis.



2. Simplicity:

The stemming process is simple to implement and does not require complex computational resources.



3. Works well for some tasks:

For some tasks, like information retrieval or document clustering, stemming can perform adequately because it reduces words to a common root form, aiding in matching terms.



4. No need for a dictionary:

Stemming doesn't need an extensive dictionary or knowledge of the language's grammar rules. It works based on patterns and suffixes, making it more flexible for certain types of applications.




Cons of Stemming:

1. Over-stemming:

Stemming may reduce words too aggressively, so that words with different meanings collapse to the same stem, which is often not a valid word. For example, the Porter stemmer reduces both "universe" and "university" to "univers."



2. Lack of Semantic Meaning:

The stem may not always reflect the meaning of the original word. For instance, a stemmer may reduce "organization" to "organ," even though the two words mean different things.



3. Loss of Accuracy:

Because stemming is rule-based, it can sometimes produce stems that don't exist in the language, or that don't preserve the intended meaning of the word, affecting tasks that require more precision.
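The toy suffix stripper below illustrates both the appeal and the pitfalls of rule-based stemming. The suffix list and the function `naive_stem` are invented for this sketch; real stemmers such as Porter or Snowball apply many more rules and conditions.

```python
# Hypothetical suffix list, checked in order; not the rules of any real stemmer
SUFFIXES = ("ness", "ing", "ies", "es", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["jumping", "jumped", "happiness", "running", "studies", "news", "better"]:
    print(word, "->", naive_stem(word))
```

"jumping" and "jumped" both become "jump", which is the desired effect, but "running" becomes "runn", "studies" becomes "stud", and "news" becomes "new", showing how simple rules produce non-words or change the meaning.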





---

Lemmatization

Lemmatization involves reducing a word to its base or dictionary form (the lemma) using a more sophisticated process that considers the word's meaning and part of speech.

Example:

"running" → "run"

"better" → "good"

"happiness" → "happiness" (unchanged, because it's already in its lemma form)



Pros of Lemmatization:

1. Semantic Accuracy:

Lemmatization preserves the meaning of words better than stemming, as it always returns a valid word that has meaning in the language (e.g., "running" becomes "run," not just a stem like "runn").



2. Context Sensitivity:

Lemmatization takes part of speech into account. For example, with the WordNet lemmatizer "better" tagged as an adjective becomes "good," while "better" tagged as a verb stays "better." This makes lemmatization more precise in distinguishing between word forms.



3. Results in Proper Words:

Unlike stemming, which can generate non-words, lemmatization results in proper dictionary words that are more meaningful and useful for tasks requiring accurate word representations.



4. Improved Task Performance:

For tasks like sentiment analysis, text classification, and machine translation, lemmatization’s accuracy in preserving meaning helps improve overall performance.




Cons of Lemmatization:

1. Slower:

Lemmatization is computationally more expensive than stemming because it involves linguistic rules and the use of dictionaries to identify the correct lemma, which can be slower.



2. Complexity:

The process of lemmatization is more complex and may require additional tools or libraries (e.g., WordNet) for effective implementation. This adds to the complexity of the NLP pipeline.



3. Requires More Resources:

Lemmatization often requires additional resources such as large dictionaries, parts-of-speech tagging, or external tools to properly identify the lemma, making it more resource-intensive than stemming.



4. Dependency on Part of Speech:

Lemmatization can be less effective if the part of speech is ambiguous or not tagged correctly, which can lead to inaccurate lemmatization.
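The dependence on part of speech can be sketched with a tiny lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical stand-ins; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults a full dictionary together with morphological rules.

```python
# Hypothetical lookup table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("studies", "noun"): "study",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("better", "adj"))   # good
print(toy_lemmatize("better", "verb"))  # better
print(toy_lemmatize("ran", "verb"))     # run
print(toy_lemmatize("ran", "noun"))     # ran (wrong tag, so no lemma is found)
```

The last call shows the weakness described above: when the part-of-speech tag is wrong, the lookup misses and the word is left unchanged.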





---

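Summary of Pros and Cons:

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Method | Heuristic rules that strip affixes | Dictionary lookup guided by part of speech |
| Speed | Faster | Slower |
| Output | May be a non-word (e.g., "happi") | A valid dictionary word |
| Resources needed | None beyond the rules | Dictionary (e.g., WordNet) and POS tagging |
| Main risk | Over-stemming and loss of meaning | Wrong lemma when the POS tag is wrong |
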
---

When to Use Stemming vs Lemmatization:

Stemming is useful when you need a quick, computationally efficient solution where exact meaning isn't critical (e.g., in search engines or document clustering).

Lemmatization is ideal when you need more accurate, semantically meaningful reductions of words, especially for tasks like sentiment analysis, text classification, and other NLP tasks that require precise understanding of words.


In general, lemmatization tends to perform better in tasks that require higher precision and understanding of the context and meaning of words, while stemming is faster and simpler for less critical tasks.

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
