<a href="https://colab.research.google.com/github/DeeptiDiddi/FMML_M1L1.ipynb/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, we look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form suitable for machine-learning algorithms.  
For text, this typically involves the following steps:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words (like images) cannot be fed to algorithms directly; they must first be converted into vectors, a numeric form that algorithms can work with.
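As a rough illustration of the first two steps, the sketch below strips everything except letters and lowercases the result. It uses only the standard library; `basic_clean` is a hypothetical helper name chosen for this example, and it mirrors (but does not replace) the `cleanText` function defined later in this section.

```python
import re

def basic_clean(text):
    """Replace non-letters with spaces, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return letters_only.lower().split()

print(basic_clean("Won 2 FREE tickets!!! Call now."))
# ['won', 'free', 'tickets', 'call', 'now']
```

Stemming or lemmatization would then be applied to each token in the returned list.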



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the data packages it needs.


In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [4]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Default to noun, the lemmatizer's own default part of speech
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

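    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatize nor stemmer requested: return the cleaned tokens as-is
    return word_tokenize(text)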

In [9]:

!pip install nltk
import nltk

# Download the 'punkt_tab' data package
nltk.download('punkt_tab')
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
Troubling


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
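To make the idea concrete, here is a minimal standard-library sketch of a bag-of-words representation. The function name `bag_of_words` and the toy sentences are assumptions made for this illustration; the lab itself uses scikit-learn's `CountVectorizer`, whose tokenization rules differ in detail.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in any document
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["dog bites man", "man bites dog", "dog eats food"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)                # ['bites', 'dog', 'eats', 'food', 'man']
print(vectors[0] == vectors[1])  # True: word order is discarded
```

The first two sentences mean different things but receive identical vectors, which is exactly the information a bag-of-words model throws away.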

In [10]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how often it occurs in a document and how rare it is across the corpus. This addresses a weakness of plain Bag-of-Words counts, where very frequent but uninformative words dominate the representation, and it often works well for text classification.

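The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) decreases as DF grows (it is commonly the logarithm of the number of documents divided by DF), so it measures the informativeness of term *t*: a term found in only a few documents gets a high IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.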
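The sketch below computes TF-IDF by hand using only the standard library. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector; the function name `tfidf` and the toy messages are hypothetical, and other libraries may define IDF slightly differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, idf per word, L2-normalized TF-IDF rows)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed IDF (assumed to match scikit-learn's default settings)
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocabulary, idf, rows

docs = ["free prize call now", "call me now", "see you now"]
vocabulary, idf, rows = tfidf(docs)
for word in ("now", "call", "prize"):
    print(word, round(idf[word], 3))
```

Here "now" occurs in every message and gets the lowest IDF, while "prize" occurs in only one and gets the highest, so it contributes more to that message's vector.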




In [11]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [12]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [13]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [14]:
df = df.dropna()

In [15]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
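To see why the choice of distance metric matters, the following standard-library sketch implements a tiny KNN vote with both Euclidean and cosine distance. All names and the toy count vectors are assumptions made for illustration; the lab's models use `KNeighborsClassifier` from scikit-learn instead.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction, not vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(range(len(train_vectors)),
                    key=lambda i: distance(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical count vectors over the vocabulary ["free", "win", "meeting", "lunch"]
train_vectors = [[4, 4, 0, 0], [0, 0, 1, 1], [0, 0, 2, 1]]
train_labels = ["spam", "ham", "ham"]
query = [1, 1, 0, 0]  # a short message that uses only the spam words

print(knn_predict(train_vectors, train_labels, query, 1, euclidean_distance))  # ham
print(knn_predict(train_vectors, train_labels, query, 1, cosine_distance))     # spam
```

Euclidean distance is dominated by the difference in message length, so the short query lands nearest a short ham message. Cosine distance compares only the mix of words, which is one reason it is often paired with text features.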

In [16]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [17]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 62.82722513089005%




Cross Validation Accuracy: 0.60
[0.57254902 0.57254902 0.65354331]




In [18]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 71.20418848167539%




Cross Validation Accuracy: 0.75
[0.75686275 0.74509804 0.75590551]


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [21]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [24]:
import pandas as pd

# Try reading the file with 'latin-1' encoding
df = pd.read_csv('spam.csv', encoding='latin-1')
df

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [41]:
import pandas as pd

# Load the dataframe with the correct encoding
df = pd.read_csv('spam.csv', encoding='latin-1')

# Assuming the spam/ham labels are in a column named 'v1', change it to 'Category'
df = df.rename(columns={'v1': 'Category'})

# Now apply the mapping
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

df.head(5) # Display the first 5 rows to confirm the changes

Unnamed: 0,Category,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,0,"Go until jurong point, crazy.. Available only ...",,,
1,0,Ok lar... Joking wif u oni...,,,
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,0,U dun say so early hor... U c already then say...,,,
4,0,"Nah I don't think he goes to usf, he lives aro...",,,


In [38]:
df.head(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [29]:
len(df)

5572

In [30]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

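    # The uploaded file is latin-1 encoded and names its columns 'v1' (label) and 'v2' (text)
    training_data = pd.read_csv('spam.csv', encoding='latin-1')
    training_data = training_data.rename(columns={'v1': 'Category', 'v2': 'Message'})
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})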
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

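    # The uploaded file is latin-1 encoded and names its columns 'v1' (label) and 'v2' (text)
    training_data = pd.read_csv('spam.csv', encoding='latin-1')
    training_data = training_data.rename(columns={'v1': 'Category', 'v2': 'Message'})
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})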
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [35]:
def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

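    # Read the CSV file with the 'latin-1' encoding and give the columns readable names
    # (the file shown above stores labels in 'v1' and message text in 'v2')
    training_data = pd.read_csv('spam.csv', encoding='latin-1')
    training_data = training_data.rename(columns={'v1': 'Category', 'v2': 'Message'})
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})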
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

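    # Read the CSV file with the 'latin-1' encoding and give the columns readable names
    # (the file shown above stores labels in 'v1' and message text in 'v2')
    training_data = pd.read_csv('spam.csv', encoding='latin-1')
    training_data = training_data.rename(columns={'v1': 'Category', 'v2': 'Message'})
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})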
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
Ans. The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the Bag-of-Words (BoW) model because it incorporates not just term frequency but also the importance of terms in the context of the entire dataset. Here’s why it often performs better:

1. Weighing Rare but Informative Words

Bag-of-Words: Treats all words equally by simply counting occurrences. Common words like "the" or "is" get as much importance as rare, more informative words.

TF-IDF: Weighs terms based on their frequency across documents. Rare words that are important for classification are given higher weight, while common words (e.g., stopwords) are down-weighted.


2. Reducing the Influence of Non-Discriminative Words

BoW: Common words like "and," "of," "the" (common in all classes) can dominate the feature space, leading to less discriminative power.

TF-IDF: Reduces the importance of these non-discriminative words by applying the IDF component, which penalizes terms that occur frequently across all documents.


3. Better Representation of Document Context

TF-IDF captures the relative importance of terms within a document in relation to the entire corpus, leading to a more meaningful representation of text data.

This makes it easier for machine learning models to distinguish between documents.


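4. Same Feature Space, Rescaled Weights

TF-IDF uses the same vocabulary as BoW, so the dimensionality and sparsity of the vectors are unchanged; only the values differ. scikit-learn's TfidfVectorizer also L2-normalizes each document vector by default, which reduces the effect of document length and helps distance-based models such as KNN.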


5. Example for Illustration

Consider the word "urgent" in a spam classification dataset, as in the message "Urgent: Win a free car now!". Suppose "urgent" appears in a handful of spam messages and almost nowhere else.

BoW: Records "urgent" as a count of 1, the same value as any other word that occurs once in the message.

TF-IDF: Gives "urgent" a high weight because it appears in few documents overall, so the model can use it as an indicator of spam.



6. Overall Better Signal-to-Noise Ratio

By emphasizing discriminative words and suppressing redundant words, TF-IDF improves the signal-to-noise ratio, which helps the model focus on features that matter most for classification.


In conclusion, TF-IDF generally outperforms Bag-of-Words because it provides a richer and more discriminative feature representation by incorporating information about how terms are distributed across the document corpus.


2. Can you think of techniques that are better than both BoW and TF-IDF?
Ans. Yes, several modern techniques outperform Bag-of-Words (BoW) and TF-IDF for many NLP tasks, especially when semantic understanding and context are essential. Below are some commonly used and effective approaches:


---

1. Word Embeddings (Dense Representations)

Examples: Word2Vec, GloVe, FastText

Why They’re Better:

Capture semantic meaning and relationships between words (e.g., similarity between "king" and "queen").

Represent words as dense vectors in a continuous space, unlike sparse vectors in BoW/TF-IDF.

Incorporate context-independent relationships between words.


Limitation: Static embeddings—each word has the same vector regardless of its context.



---

2. Contextualized Word Representations

Examples: ELMo, BERT, GPT, RoBERTa

Why They’re Better:

Contextualized embeddings capture the meaning of a word based on its context. For example, "bank" in "river bank" is treated differently than "bank" in "financial bank."

Models like BERT and GPT handle complex linguistic patterns such as syntactic dependencies, idiomatic phrases, and co-references.

State-of-the-art performance on many NLP tasks including text classification, sentiment analysis, and question answering.




---

3. Recurrent Neural Networks (RNNs)

Variants: Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU)

Why They’re Better:

RNNs process sequences, making them suitable for capturing word dependencies in sentences.

LSTMs/GRUs mitigate the vanishing gradient problem and handle long-range dependencies.

Better for tasks requiring sequential understanding, such as text generation and time-series prediction.




---

4. Transformers

Examples: the original Transformer (from the paper "Attention Is All You Need"), BERT, GPT, T5

Why They’re Better:

Use self-attention mechanisms to model relationships between words in a sequence, regardless of their distance.

Extremely efficient for parallel processing and large-scale text understanding.

Fine-tuning a pre-trained Transformer on a downstream task provides excellent results.




---

5. Sentence or Document Embeddings

Examples: Doc2Vec, Universal Sentence Encoder, Sentence-BERT

Why They’re Better:

Capture semantic meaning of entire sentences or documents rather than individual words.

Represent text as a dense vector that considers overall structure and meaning, leading to better performance on classification or clustering tasks.

Suitable for comparing entire documents or doing similarity-based tasks.




---

6. Graph Neural Networks for NLP

Examples: Text Graph Convolutional Networks (TextGCNs), knowledge graphs.

Why They’re Better:

Represent text as a graph (e.g., connecting words to contexts) and capture higher-order relationships between terms.

Integrate external knowledge (e.g., linking entities in the text to a knowledge graph like Wikidata).

Effective in tasks like information retrieval and entity recognition.




---

7. Pretrained Foundation Models + Fine-tuning

Examples: BERT, RoBERTa, GPT-3/4, ChatGPT, Falcon

Why They’re Better:

Models trained on massive amounts of diverse text data generalize well across tasks.

Fine-tuning these models on a specific task often surpasses traditional BoW/TF-IDF approaches.

Easily adapts to transfer learning and low-resource settings.




---

When to Use BoW/TF-IDF?

They are still useful for quick prototypes or tasks with small datasets where computational resources are limited.


For most modern applications, Transformer-based models and contextual embeddings are far more effective and versatile.


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.
Ans. Stemming and lemmatization both reduce words to a base form. Their definitions, differences, and respective pros and cons are summarized below:


---

1. What is Stemming?

Stemming is the process of reducing a word to its base or root form, typically using simple heuristics to cut off prefixes or suffixes. The resulting stem word might not be a valid word in the language.

Example:

Words: "running," "runs"

Stem: "run" (irregular forms such as "ran" are left unchanged)

Tools: Porter Stemmer, Snowball Stemmer, Lancaster Stemmer


Pros:

1. Efficiency:

Computationally lightweight since it applies simple rules.

Suitable for tasks where speed is more critical than accuracy.



2. Language Independence:

Rule-based stemmers need no dictionary and exist for many languages (e.g., Snowball provides stemmers for several languages).



3. Improved Recall:

Useful when approximate matches are sufficient, such as search engines or quick prototypes.




Cons:

1. Loss of Meaning:

Often produces stems that are not valid words (e.g., "troubling" becomes "troubl" in the Snowball example earlier in this lab).



2. Over-stemming:

Reduces words too aggressively, leading to loss of distinct meanings (e.g., "universe" and "university" could both become "univers").



3. Low Precision:

Might lead to noise in results for tasks requiring semantic understanding.





---

2. What is Lemmatization?

Lemmatization reduces a word to its root or dictionary (lemma) form, considering its context (e.g., part of speech). It uses lexical knowledge bases like WordNet to produce meaningful results.

Example:

Words: "running," "ran" (both tagged as verbs)

Lemma: "run"

Tools: WordNet Lemmatizer (NLTK), spaCy, Stanford CoreNLP Lemmatizer


Pros:

1. Accuracy:

Produces contextually meaningful lemmas (e.g., "better" → "good").



2. Preserves Semantics:

Distinguishes between different grammatical contexts (e.g., "saw" as a verb vs. "saw" as a noun).



3. Improved Precision:

Suitable for applications like document classification or sentiment analysis where preserving meaning matters.




Cons:

1. Computationally Expensive:

Requires more resources and time since it involves parsing grammar and referencing lexical databases.



2. Language-Specific:

Tools often need to be tailored for different languages, making it less general-purpose compared to stemming.



3. Dependency on POS Tagging:

Requires accurate part-of-speech tagging to work correctly, which introduces additional computational steps.





---

When to Use Which?

Use Stemming:

In applications where speed and recall are critical.

Example: Search engines (Google, ElasticSearch), where approximate matches suffice.


Use Lemmatization:

When accuracy and semantic understanding are critical.

Example: Text classification, chatbots, or sentiment analysis.



Both methods can also be combined with other preprocessing techniques depending on the use case and resource constraints.
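The contrast between the two approaches can be sketched with toy stand-ins that need no external library. Both functions below are deliberately simplified assumptions: `toy_stem` strips a single suffix (real stemmers such as Snowball apply many more rules), and `TOY_LEMMAS` is a hypothetical three-entry dictionary standing in for a lexical database such as WordNet.

```python
def toy_stem(word):
    """Strip one common suffix; a crude stand-in for a rule-based stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary; a real lemmatizer looks words up in a full lexicon
TOY_LEMMAS = {"ran": "run", "better": "good", "troubling": "trouble"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["troubling", "ran", "better"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

The suffix rule is fast and needs no data, but it yields the non-word "troubl" and cannot relate "ran" to "run". The lookup handles both, at the cost of needing a dictionary that covers the vocabulary.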

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
