<a href="https://colab.research.google.com/github/BudamaLakshmiPragnamanasvi/AIML-Tutorials/blob/main/Module_3_Lab_3_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Student Training Program on AIML**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. First, we look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, algorithms cannot operate on raw words or images directly; these must first be converted into vectors, a numeric form that algorithms can work with.
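The first two cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the `cleanText` function defined later in this lab; `basic_clean` is a hypothetical helper name, and stemming and lemmatization are left out here.

```python
import re

def basic_clean(text):
    """Lowercase the text and drop every character that is not a letter."""
    # Digits and punctuation are both non-letters, so one pattern removes both.
    letters_only = re.sub(r"[^A-Za-z]", " ", text)
    # split() with no arguments also collapses runs of whitespace.
    return letters_only.lower().split()

print(basic_clean("Great movie!!! 10/10, would watch again."))
# ['great', 'movie', 'would', 'watch', 'again']
```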



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [11]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perc

True

In [5]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        # Lemmatization takes precedence: if both flags are True, only this branch runs
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            """Map a Penn Treebank tag prefix to a WordNet POS constant."""
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # Nouns and all other tags use WordNet's default POS (noun)
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for token, tag in tagged:
            text_result.append(wordnet_lemmatizer.lemmatize(token, get_tag(tag)))
        return text_result

    tokens = word_tokenize(text)
    if stemmer:
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option requested: return the cleaned tokens instead of None
    return tokens

In [12]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

Troubling
troubl
trouble


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where in the document.
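To make the "order is discarded" point concrete, here is a minimal pure-Python sketch of a bag-of-words count matrix. It is an illustration only: `bag_of_words` is a hypothetical helper, tokenization is a plain whitespace split, and the lab itself relies on scikit-learn's `CountVectorizer` instead.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives every word a fixed column index.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocab, vectors = bag_of_words(["dog bites man", "man bites dog", "dog eats dog food"])
print(vocab)    # ['bites', 'dog', 'eats', 'food', 'man']
print(vectors)  # [[1, 1, 0, 0, 1], [1, 1, 0, 0, 1], [0, 2, 1, 1, 0]]
```

The first two sentences mean opposite things but receive identical vectors, which is exactly the information a bag-of-words representation throws away.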

In [14]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how informative it is, which addresses a key limitation of Bag of Words: raw counts treat every word as equally important. Like BoW, it converts text into numeric vectors that a classifier can use.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of DF, usually log-scaled, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.
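The definitions above can be worked through by hand on a toy corpus. The sketch below assumes the smoothed variant `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization, which is how scikit-learn documents the default behavior of `TfidfVectorizer`; other references use slightly different formulas. The function name `tfidf` and the three example sentences are made up for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return (vocabulary, idf per word, L2-normalized TF-IDF rows)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    # Smoothed inverse document frequency (assumed variant, see lead-in).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocabulary, idf, rows

docs = ["the movie was great", "the movie was boring", "the plot was great fun"]
vocab, idf, rows = tfidf(docs)
for word in ("the", "great", "boring"):
    print(word, round(idf[word], 3))
# the 1.0
# great 1.288
# boring 1.693
```

"the" appears in every document and gets the lowest possible IDF, while "boring" appears in only one and is weighted most heavily.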




In [15]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [19]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [20]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [21]:
df = df.dropna()

In [22]:
df

| Unnamed: 0 | sentence | sentiment |
|---|---|---|
| 0 | Not sure who was more lost - the flat characte... | 0 |
| 1 | Attempting artiness with black & white and cle... | 0 |
| 2 | Very little music or anything to speak of. | 0 |
| 3 | The best scene in the movie was when Gerardo i... | 1 |
| 4 | The rest of the movie lacks art, charm, meanin... | 0 |
| ... | ... | ... |
| 994 | I just got bored watching Jessice Lange take h... | 0 |
| 995 | Unfortunately, any virtue in this film's produ... | 0 |
| 996 | In a word, it is embarrassing. | 0 |
| 997 | Exceptionally bad! | 0 |


In [23]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
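Before looking at the scikit-learn models, it may help to see how the choice of distance metric alone can change a KNN prediction. The following is a minimal from-scratch sketch, not the `KNeighborsClassifier` implementation; the function names and the tiny two-word count vectors are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k=1, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical counts of the words ("good", "bad") per review; 1 = positive, 0 = negative.
train_X = [[2, 0], [1, 0], [2, 1], [9, 10], [0, 2], [1, 3]]
train_y = [1, 1, 1, 0, 0, 0]
long_positive_review = [10, 4]

print(knn_predict(train_X, train_y, long_positive_review, distance=euclidean))        # 0
print(knn_predict(train_X, train_y, long_positive_review, distance=cosine_distance))  # 1
```

With Euclidean distance the long review lands closest to the only other long review, which happens to be negative. Cosine distance compares direction rather than magnitude, so the review is matched with short reviews that have a similar mix of words. This is one reason cosine distance is often tried on text vectors.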

In [24]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [25]:
def bow_knn():
    """Modified KNN with Bag-of-Words"""
    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=42)
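    # Note: cleanText returns right after lemmatizing, so stemmer=True has no effect while lemmatize=True
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=True)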

    knn = neighbors.KNeighborsClassifier(
        n_neighbors=3,                # Reduced neighbors
        weights='distance',           # Use distance weighting
        algorithm='kd_tree',          # kd-tree algorithm
        leaf_size=50,                 # Larger leaf size
        p=1,                          # Manhattan distance
        metric='manhattan',           # Manhattan metric
        n_jobs=-1                     # Use all available cores
    )

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = {:.2f}%'.format(acc * 100))

    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print("Cross Validation Accuracy: {:.2f}".format(scores.mean()))
    return predicted, y_test


def tfidf_knn():
    """Modified KNN with TF-IDF"""
    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=42)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    knn = neighbors.KNeighborsClassifier(
        n_neighbors=7,                # Increased neighbors
        weights='uniform',            # Uniform weighting
        algorithm='brute',            # Brute-force search; tree indexes help little on high-dimensional text vectors
        leaf_size=30,                 # Default leaf size
        p=2,                          # Euclidean distance
        metric='euclidean',           # Euclidean metric
        n_jobs=1                      # Single-threaded
    )

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = {:.2f}%'.format(acc * 100))

    scores = cross_val_score(knn, X_train, y_train, cv=5)
    print("Cross Validation Accuracy: {:.2f}".format(scores.mean()))
    return predicted, y_test


Note: Cross-validation will be discussed in detail in the upcoming lab session.
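Until then, a rough picture of what `cv=5` means: the training data is split into 5 folds, and each fold takes one turn as the validation set while the model is fit on the rest. The sketch below is a simplified illustration with a hypothetical helper `k_fold_indices`; for classifiers, scikit-learn's `cross_val_score` additionally stratifies the folds by class, which this sketch does not attempt.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

for train_idx, val_idx in k_fold_indices(7, 3):
    print(val_idx, train_idx)
# [0, 1, 2] [3, 4, 5, 6]
# [3, 4] [0, 1, 2, 5, 6]
# [5, 6] [0, 1, 2, 3, 4]
```

The reported cross-validation accuracy is the mean of the scores obtained on each validation fold.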

In [26]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 57.59%




Cross Validation Accuracy: 0.62


In [27]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 66.49%
Cross Validation Accuracy: 0.69




# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [28]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam (1).csv


In [29]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

| Unnamed: 0 | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [30]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [31]:
df.head(5)

| Unnamed: 0 | Category | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [32]:
len(df)

5572

In [33]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [34]:
def bow_knn():
    """Modified KNN with Bag-of-Words"""
    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=42)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    # Experimenting with different hyperparameters
    knn = neighbors.KNeighborsClassifier(
        n_neighbors=7,                # Changed number of neighbors
        weights='distance',           # Distance-based weighting
        algorithm='kd_tree',          # kd-tree search; can be slow on high-dimensional BoW vectors
        leaf_size=40,                 # Changed leaf size
        p=1,                          # Manhattan distance (p=1)
        metric='manhattan',           # Changed to Manhattan distance
        n_jobs=-1                     # Use all available cores
    )

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = {:.2f}%'.format(acc * 100))

    scores = cross_val_score(knn, X_train, y_train, cv=5)  # Increased CV folds
    print("Cross Validation Accuracy: {:.2f}".format(scores.mean()))
    return predicted, y_test


def tfidf_knn():
    """Modified KNN with TF-IDF"""
    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=42)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)

    # Experimenting with different hyperparameters
    knn = neighbors.KNeighborsClassifier(
        n_neighbors=10,               # Increased number of neighbors
        weights='uniform',            # Uniform weighting
        algorithm='auto',             # Auto for choosing the best algorithm
        leaf_size=30,                 # Default leaf size
        p=2,                          # Euclidean distance (p=2)
        metric='euclidean',           # Changed to Euclidean distance
        n_jobs=1                      # Single-threaded
    )

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = {:.2f}%'.format(acc * 100))

    scores = cross_val_score(knn, X_train, y_train, cv=5)  # Increased CV folds
    print("Cross Validation Accuracy: {:.2f}".format(scores.mean()))
    return predicted, y_test


In [35]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KNN with BOW accuracy = 94.35%
Cross Validation Accuracy: 0.93


In [36]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 89.69%
Cross Validation Accuracy: 0.88


### Questions to Think About and Answer
## 1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?



 **Importance Weighting**:  
   Bag-of-Words (BoW) counts how often a word appears but treats all terms equally, which means frequent but unimportant words (e.g., "the," "is") can dominate the representation. TF-IDF (Term Frequency-Inverse Document Frequency) assigns weights to terms based on their importance, giving higher scores to unique words and lower scores to common ones. This helps models focus on terms that are more relevant to distinguishing between categories.  

 **Noise Reduction**:  
   In BoW, very frequent words contribute heavily to the feature set, even if they don't provide meaningful information. TF-IDF mitigates this by reducing the weight of such common terms (like stopwords) while highlighting less frequent, more informative words, such as "offer" or "discount" in spam classification.  

 **Better Feature Differentiation**:  
   BoW only captures term frequencies, which doesn’t reflect how informative a term is across the dataset. TF-IDF incorporates the inverse document frequency, highlighting terms that appear often in specific categories but not across the entire dataset, helping the model differentiate between documents effectively.  

 **Document Length Normalization**:  
   Raw BoW counts grow with document length, so longer documents produce larger vectors simply because they contain more words. Standard TF-IDF implementations, including scikit-learn's `TfidfVectorizer` with its default `norm='l2'`, scale each document vector to unit length, so the representation is not skewed by document length and comparisons between documents are fairer.  

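 **Context Relevance**:  
   TF-IDF captures both local and global term significance, so terms are evaluated based on their relevance in a specific document and across the corpus. This representation is typically more robust for text classification than BoW, which focuses solely on local term occurrences.  

These are tendencies, not guarantees. In the spam experiment above, the BoW model scored 94.35% against 89.69% for TF-IDF, although the two runs also differed in `n_neighbors`, `weights`, and `metric`, so that comparison is not controlled.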
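The document length normalization point above can be checked with a few lines of Python. This is a minimal sketch with made-up count vectors; `l2_normalize` is a hypothetical helper that mirrors the unit-length scaling described there.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

short_doc = [1, 2, 0]  # hypothetical counts for three vocabulary words
long_doc = [3, 6, 0]   # the same text repeated three times

print(short_doc == long_doc)  # False: raw counts differ
a, b = l2_normalize(short_doc), l2_normalize(long_doc)
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True: same direction after scaling
```

After normalization the two documents are indistinguishable, which is the desired behavior when length alone should not influence which neighbors KNN finds.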

## 2. Can you think of techniques that are better than both BoW and TF-IDF?


 **Word Embeddings (Word2Vec, GloVe, FastText)**:  
   Word embeddings are dense vector representations that capture the semantic relationships between words. Unlike Bag-of-Words (BoW) and TF-IDF, which rely on word frequencies, embeddings position words in a continuous vector space where similar words are closer together. For instance, "king" and "queen" have a smaller distance compared to "king" and "car." Embeddings also solve the sparsity problem in BoW and TF-IDF by using compact representations. FastText further enhances this by including subword information, which allows the generation of embeddings for out-of-vocabulary words.  

 **Contextualized Word Embeddings (BERT, RoBERTa, GPT)**:  
   Contextualized embeddings dynamically adapt to the meaning of a word based on its context. For example, in "bank of the river" and "financial bank," the word "bank" will have distinct representations. Models like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) use attention mechanisms to understand context and relationships within text, making them highly effective for tasks such as classification and summarization. These models are pretrained on massive datasets, enabling superior generalization to various NLP tasks.  

 **Doc2Vec**:  
   Doc2Vec extends Word2Vec to represent entire documents as dense vectors. This approach captures document-level semantics instead of just word-level features. By encoding the overall meaning of a document into a single vector, Doc2Vec is especially useful for tasks like document similarity, clustering, or information retrieval. It overcomes the limitations of BoW and TF-IDF by considering the relationships between words and phrases in a document.  

**Recurrent Neural Networks (RNNs, LSTMs, GRUs)**:  
   Recurrent Neural Networks (RNNs) process text sequentially, making them suitable for tasks that require understanding dependencies between words. Variants like Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) are designed to handle long-term dependencies, such as in sentences with negations or complex structures. These models maintain context across sequences, which BoW and TF-IDF fail to do, making them better for tasks like sentiment analysis and text generation.  

 **Transformers (BERT, GPT, T5)**:  
   Transformers revolutionized NLP by using self-attention mechanisms to capture relationships between all words in a text simultaneously. Unlike RNNs, transformers do not rely on sequential processing, allowing them to model long-range dependencies more effectively. Models like BERT and GPT leverage this capability to achieve state-of-the-art performance in tasks such as text classification, machine translation, and summarization. Their scalability and ability to handle large datasets make them superior to traditional methods.  

 **Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)**:  
   These techniques focus on identifying hidden patterns and topics within text data. LSA uses matrix factorization to reduce dimensionality while preserving important relationships between terms. LDA is a probabilistic model that assigns words and documents to latent topics. Both methods are useful for topic modeling and clustering, providing insights into the thematic structure of text data.  

 **Hybrid Approaches**:  
   Combining traditional methods like TF-IDF with advanced techniques can leverage the strengths of both. For example, TF-IDF can be used to select important features before feeding the data into a neural network or Word2Vec-based model. This hybrid approach is computationally efficient and often improves performance in tasks like spam detection or sentiment analysis.  
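To illustrate the word-embedding idea from the start of this answer, the sketch below compares cosine similarity between small hand-made vectors. The three-dimensional vectors are invented for illustration and do not come from Word2Vec, GloVe, or any trained model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings, chosen by hand so that related words point in similar directions.
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "car":   [0.10, 0.20, 0.95],
}

royal = cosine_similarity(embeddings["king"], embeddings["queen"])
unrelated = cosine_similarity(embeddings["king"], embeddings["car"])
print(royal > unrelated)  # True
```

In a BoW or TF-IDF representation, "king" and "queen" are simply different columns with no notion of similarity; dense embeddings make that relationship measurable.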


## 3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

#### Stemming  

Stemming reduces words to a root form by stripping affixes (mostly suffixes) using simple rule-based heuristics. For example, "running" and "runs" are both reduced to "run", and "troubling" becomes "troubl", as in the example earlier in this lab. Irregular forms such as "ran" are typically left unchanged, because no suffix rule maps them to "run". Stemming also doesn't always produce a valid word; for instance, "universities" might be reduced to "univers".  

The primary advantage of stemming is its simplicity and speed. It is computationally efficient and helps reduce vocabulary size by grouping related word forms. This can be particularly useful in tasks like information retrieval or keyword matching, where precise meaning is less critical. However, stemming has significant downsides, including a lack of context awareness and a tendency to "overstem" words, which can lead to a loss of meaning or misrepresentation of the original word.  

#### Lemmatization  

Lemmatization is a more sophisticated technique that reduces words to their dictionary form (lemma) while considering grammatical context, such as part of speech. For example, "running" is lemmatized to "run" if it's a verb, while "better" is lemmatized to "good" as an adjective. This process often relies on dictionaries and morphological analysis to ensure accuracy.  

The key advantage of lemmatization is its ability to produce meaningful, contextually accurate word forms. This makes it suitable for tasks requiring precise text representation, such as machine translation or sentiment analysis. However, lemmatization is computationally more expensive and slower than stemming due to its reliance on linguistic resources and complex algorithms. It also requires well-designed libraries and may not support all languages equally.  


Stemming and lemmatization serve similar purposes but differ in complexity and accuracy. Stemming is faster and simpler, making it ideal for approximate or resource-limited tasks. Lemmatization, on the other hand, is better suited for applications requiring linguistic precision, such as text summarization or chatbot development. The choice between the two depends on the specific requirements of the NLP task at hand.
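The contrast can be sketched with two toy functions. Neither is the Snowball stemmer or the WordNet lemmatizer used earlier in this lab: `toy_stem` is a hypothetical four-rule suffix stripper, and `TOY_LEMMAS` is a hand-written lookup table standing in for a real dictionary.

```python
def toy_stem(word):
    """Strip the first matching suffix; no dictionary, no part of speech."""
    for suffix in ("ies", "ing", "ed", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("better", "adj"): "good",
    ("ran", "verb"): "run",
    ("troubling", "verb"): "trouble",
    ("universities", "noun"): "university",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word, pos in [("troubling", "verb"), ("universities", "noun"), ("ran", "verb"), ("better", "adj")]:
    print(word, toy_stem(word), toy_lemmatize(word, pos))
# troubling troubl trouble
# universities universit university
# ran ran run
# better better good
```

The stemmer is fast and needs no resources, but it produces non-words and misses irregular forms. The lemmatizer returns valid words, but only because it is given a part of speech and a dictionary to consult.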



### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
