
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let us look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form that is suitable for machine-learning algorithms.  
In the case of text, there are many things to take into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  
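
As a rough illustration of the first two steps, the sketch below strips everything except letters and lowercases the result, using only Python's `re` module. The function name `basic_clean` and the sample sentence are made up for this example; stemming and lemmatization are left out here.

```python
import re

def basic_clean(text):
    """Replace every non-letter with a space, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", text)  # drops digits and punctuation
    return letters_only.lower().split()

tokens = basic_clean("Great phone!! Bought 2 in 2021, LOVED it.")
print(tokens)
```

Note that this kind of cleaning also throws away information (for example, the digits in a price or a year), so whether it helps depends on the task.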

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors, a form that algorithms can understand.



### **NLTK**
NLTK (the Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first import it and download the resources we need.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                # WordNet accepts only the four tags above; fall back to noun, the lemmatizer's default.
                return wordnet.NOUN

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(t) for t in tokens]

    # Neither option selected: return the cleaned tokens unchanged instead of None.
    return word_tokenize(text)

In [None]:
sample_text = "Troubling"
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
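
To make the idea concrete, here is a minimal sketch that builds count vectors by hand with `collections.Counter`. The two toy sentences and the helper `to_count_vector` are assumptions for illustration only; the lab itself uses scikit-learn's `CountVectorizer` in the next cell.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, sorted so column order is stable.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Return one count per vocabulary word; word order in the document is ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc) for doc in docs]
print(vocab)
print(vectors)

# Shuffling the words of a document leaves its vector unchanged.
print(to_count_vector("mat the on sat cat the") == vectors[0])
```

The last line shows why the representation is called a "bag": two documents with the same words in a different order map to the same vector.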

In [11]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test


## Section 1.3: TF-IDF
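TF-IDF (term frequency-inverse document frequency) addresses a weakness of Bag-of-Words: raw counts treat every word as equally important, so words that are frequent everywhere can dominate. TF-IDF instead weights each word by how characteristic it is of a document, which is useful for text classification and, more generally, for turning text into numbers that a machine can work with.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually log-scaled, and measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF. The TF-IDF weight of a term in a document is the product of its TF and its IDF.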
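
The definitions above can be turned into a short calculation. The sketch below assumes the smoothed IDF variant, ln((1 + n) / (1 + DF)) + 1, followed by scaling each document vector to unit length; as far as the scikit-learn documentation describes, this matches the default behaviour of `TfidfVectorizer`, but other variants exist. The toy corpus and the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]
n_docs = len(docs)

# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """Raw term count times IDF, scaled so the vector has unit Euclidean length."""
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

print({term: round(idf(term), 3) for term in sorted(df)})
print({term: round(w, 3) for term, w in tfidf(docs[0]).items()})
```

In the first document, "plot" and "movie" both occur once, but "plot" appears in fewer documents and therefore ends up with the larger weight.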




In [12]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [None]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [None]:
df = df.dropna()

In [None]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
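
Before running the full models, it may help to see why the choice of distance metric matters. The following is a minimal pure-Python sketch of KNN with uniform majority voting (it does not implement `weights='distance'`). The count vectors, labels, and function names are invented for this example and are not taken from the reviews dataset.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_X, train_y, query, k, distance):
    """Majority vote among the k training vectors closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy count vectors over the vocabulary [good, bad, movie]; made up for illustration.
train_X = [[10, 0, 2], [9, 1, 3], [0, 1, 0], [0, 2, 1]]
train_y = ["positive", "positive", "negative", "negative"]
query = [1, 0, 0]  # a very short review containing only "good"

print(knn_predict(train_X, train_y, query, k=3, distance=euclidean))
print(knn_predict(train_X, train_y, query, k=3, distance=cosine_distance))
```

Euclidean distance is sensitive to document length, so the short query lands next to the short negative reviews, while cosine distance compares direction only and groups it with the long positive reviews. This is one reason cosine distance is a common choice for text vectors.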

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [None]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

In [None]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [None]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

In [None]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

In [None]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [None]:
df.head(5)

In [None]:
len(df)

In [None]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [None]:
# This cell may take some time to run
predicted, y_test = bow_knn()

In [None]:
# This cell may take some time to run
predicted, y_test = tfidf_knn()

### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

**Answer:**

**1. Weighting of important terms**

Bag-of-Words treats all words equally and simply counts the frequency of each word, ignoring the importance or relevance of terms. Unless stop words are removed, high-frequency words like "the," "is," and "a" dominate the counts, even though they carry little meaning.

TF-IDF assigns weights based on two factors:

1. Term Frequency (TF): how often a term appears in a specific document.
2. Inverse Document Frequency (IDF): reduces the influence of words that appear in most documents (such as common stop words).

As a result, TF-IDF emphasizes rare but informative terms (like "machine," "learning," or "accuracy") that better characterize a document, leading to improved accuracy in tasks like classification or clustering.

**2. Reduction of noise from common words**

In BoW, very frequent words that appear in almost every document contribute disproportionately to the model but do not help in distinguishing between documents. This introduces noise. TF-IDF reduces the weight of such common words through the IDF component, which penalizes terms that appear in most documents. This allows more meaningful and discriminative terms to stand out.

**3. Improved handling of rare but significant words**

Rare words that appear in only a few documents are often the most informative for identifying specific topics or content. TF-IDF gives higher weights to such terms because their IDF value is high, making them more influential in distinguishing between documents.

**4. Better representation of semantic importance**

By weighting terms according to their importance (rather than just their frequency), TF-IDF provides a better representation of the document's content. For example, in a document about "machine learning," words like "algorithm" or "model" will have higher weights than generic words. This results in more accurate downstream tasks such as document classification, clustering, and information retrieval.

**5. Improved discrimination for document similarity**

TF-IDF helps identify and prioritize distinctive words when comparing documents or calculating similarity. Unlike BoW, where common words can obscure the real signal, TF-IDF lets the more distinctive terms drive similarity computations.

**Example**

Consider two documents:

1. Document A: "The cat sat on the mat."
2. Document B: "The cat plays with the toy."

In a BoW model, "the" and "cat" appear in both documents with the same counts, so they make the two vectors look similar without helping to tell the documents apart. In a TF-IDF model, words unique to one document, such as "mat" and "toy", receive a higher IDF than the shared words, so they carry relatively more weight and better distinguish the two documents.

**Summary**

TF-IDF generally achieves better accuracy because it:

1. Reduces the impact of common words (noise).
2. Highlights rare but informative words.
3. Provides a better representation of the importance of words in a document.
4. Improves document discrimination and similarity calculations.
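
The Document A / Document B example can be checked numerically. The sketch below compares the cosine similarity of the two documents under raw counts and under TF-IDF weights. It assumes a smoothed IDF, ln((1 + n) / (1 + DF)) + 1, which is only one of several common variants, and all helper names are hypothetical.

```python
import math
from collections import Counter

doc_a = "the cat sat on the mat".split()
doc_b = "the cat plays with the toy".split()
corpus = [doc_a, doc_b]

# Document frequency: in how many of the two documents each word appears.
df = Counter(word for doc in corpus for word in set(doc))

def idf(word):
    """Smoothed IDF, ln((1 + n) / (1 + df)) + 1; one of several common variants."""
    return math.log((1 + len(corpus)) / (1 + df[word])) + 1

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

def bow(doc):
    return dict(Counter(doc))

def tfidf(doc):
    return {word: count * idf(word) for word, count in Counter(doc).items()}

bow_sim = cosine(bow(doc_a), bow(doc_b))
tfidf_sim = cosine(tfidf(doc_a), tfidf(doc_b))
print(f"BoW similarity:    {bow_sim:.3f}")
print(f"TF-IDF similarity: {tfidf_sim:.3f}")
```

The only overlap between the two documents comes from "the" and "cat". TF-IDF gives the words unique to each document more weight, so the documents look less alike than they do under raw counts.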

2. Can you think of techniques that are better than both BoW and TF-IDF?

**Answer:**

**1. Word Embeddings (Word2Vec, GloVe, FastText)**

Word embeddings are dense vector representations of words in which similar words have similar vectors. Unlike BoW and TF-IDF, they capture semantic relationships between words.

1. Word2Vec: Uses neural networks to learn word representations based on context, using either CBOW (Continuous Bag-of-Words) or Skip-gram models. For example, "king" - "man" + "woman" ≈ "queen" in vector space.
2. GloVe (Global Vectors for Word Representation): Combines local context (like Word2Vec) and global word co-occurrence statistics to create word vectors.
3. FastText: Improves upon Word2Vec by considering subword information (e.g., character n-grams), which helps represent rare and out-of-vocabulary words better.

Advantages:

1. Captures semantic relationships between words.
2. More compact and efficient than sparse BoW/TF-IDF vectors.
3. Handles synonyms and context better.

**2. Contextual Embeddings (ELMo, BERT, GPT)**

These models generate word embeddings that vary depending on the context in which a word appears, addressing the ambiguity problem of static embeddings.

1. ELMo (Embeddings from Language Models): Generates embeddings based on the context of words using bidirectional LSTM networks.
2. BERT (Bidirectional Encoder Representations from Transformers): A pretrained Transformer-based model that uses a masked language model objective and bidirectional attention to learn rich contextual representations of words. It produces embeddings that vary depending on the sentence.
3. GPT (Generative Pretrained Transformer): Also Transformer-based, but uses an autoregressive (left-to-right) approach for language modeling.

Advantages:

1. Captures both context and word order.
2. Handles word sense disambiguation (e.g., "bank" as a financial institution vs. a riverbank).
3. Superior performance on tasks like text classification, sentiment analysis, and named entity recognition.

**3. Transformer-based Pretrained Language Models**

Models like BERT, GPT-3, RoBERTa, XLNet, and T5 are pretrained on massive datasets using Transformer architectures. They can then be fine-tuned for specific tasks.

Why are they better?

1. They process sequences of words with self-attention mechanisms, which allow the model to relate words to each other regardless of the distance between them.
2. They generate contextual representations at the sentence or document level, not just the word level.

Applications: text summarization, question answering, machine translation, and more.

**4. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)**

1. LSA: Reduces the dimensionality of the BoW/TF-IDF vectors using singular value decomposition (SVD). This reveals latent relationships between words and documents, capturing hidden semantic structure.
2. LDA: A probabilistic model that represents documents as mixtures of topics, where each topic is a distribution over words. It is commonly used for topic modeling.

Advantages:

1. Reduces dimensionality and noise.
2. Identifies latent semantics and topic structure.

**5. Doc2Vec**

Doc2Vec is an extension of Word2Vec that learns fixed-length vector representations for entire documents, not just words.

1. Distributed Memory (DM): Captures the context of words and documents together.
2. Distributed Bag-of-Words (DBOW): Learns document representations without word order.

Advantages:

1. Produces compact and meaningful representations for sentences or documents.
2. Better for tasks that require understanding an entire document (e.g., classification).

**6. Recurrent Neural Networks (RNNs) and LSTMs/GRUs**

RNNs process sequences of text while maintaining a memory of previous inputs, capturing the sequential nature of text. LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) mitigate the vanishing gradient problem in RNNs, making it easier to capture long-range dependencies.

Advantages:

1. Understands word order and sentence structure.
2. Useful for sequential tasks like language generation and sentiment analysis.

**7. Attention Mechanism and Self-Attention**

The attention mechanism allows models to focus on specific words in a sentence when generating predictions. Self-attention (used in Transformers) allows a model to relate each word to every other word in a sentence, which improves performance on tasks involving long-range dependencies.

**8. Hybrid Approaches**

Combining techniques like TF-IDF with embeddings or TF-IDF with LSA can further improve performance. For instance:

1. Use TF-IDF for word weighting and Word2Vec/BERT for embeddings.
2. Combine topic modeling (LDA) with word embeddings to get semantic and topical insights.

**Summary of Techniques Better than BoW and TF-IDF**

1. Word Embeddings: Word2Vec, GloVe, FastText.
2. Contextual Embeddings: ELMo, BERT, GPT.
3. Transformer-based Models: BERT, GPT-3, T5, etc.
4. LSA and LDA: For latent semantics and topic modeling.
5. Doc2Vec: Document-level embeddings.
6. RNNs, LSTMs, GRUs: Sequence-based models.
7. Attention and Transformers: Contextual and long-range understanding.
8. Hybrid Methods: Combine classic approaches with modern embeddings.
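
To illustrate the subword idea mentioned for FastText, here is a minimal sketch that extracts character n-grams from a word wrapped in `<` and `>` boundary markers. It only shows the extraction step for a single n-gram length; FastText itself uses a range of lengths and learns a vector per n-gram, which this sketch does not attempt. The function name `char_ngrams` is made up for this example.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("where"))

# Related word forms share many n-grams, even if one of them was never seen in training.
shared = set(char_ngrams("troubling")) & set(char_ngrams("troubled"))
print(sorted(shared))
```

Because "troubling" and "troubled" share most of their n-grams, a model built on subword units can give a sensible vector to one of them even if only the other appeared in the training data.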


3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

**Answer:**

**Stemming** chops off word endings using heuristic rules, as the Snowball stemmer in `cleanText` does.

Pros:

1. Fast and simple; it needs no dictionary and no part-of-speech tags.
2. Collapses many related word forms into one token, which shrinks the vocabulary.

Cons:

1. The result is often not a real word, because the rules are crude.
2. Unrelated words can be merged into the same stem, and related forms (especially irregular ones) can be missed.

**Lemmatization** uses a vocabulary and morphological analysis to return the base or dictionary form of a word (the lemma), as the WordNet lemmatizer in `cleanText` does.

Pros:

1. The output is a valid dictionary word.
2. It can take the part of speech into account. For the token "saw", a stemmer might return just "s", whereas a lemmatizer would return "see" or "saw" depending on whether the token is used as a verb or a noun.

Cons:

1. Slower, and it depends on language resources such as WordNet and a part-of-speech tagger.
2. The result is only as good as the part-of-speech tags, and words missing from the dictionary are left unchanged.
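
The trade-off in question 3 can be seen in a toy comparison. The sketch below is not the Snowball stemmer or the WordNet lemmatizer; it assumes a tiny hand-written suffix list and lookup table, both invented for this example, purely to contrast rule-based chopping with dictionary lookup.

```python
# Toy rule list, far smaller than a real stemmer's.
SUFFIXES = ("ing", "ed", "es", "s")
# Toy dictionary of known word forms and their lemmas.
LEMMA_TABLE = {"studies": "study", "better": "good", "mice": "mouse"}

def toy_stem(word):
    """Chop the first matching suffix, with no check that the result is a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; unknown words pass through."""
    return LEMMA_TABLE.get(word, word)

for word in ["studies", "better", "mice", "troubling"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer handles any word that matches a rule but produces non-words and misses irregular forms, while the lemmatizer returns real words but only for forms its dictionary knows about.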


### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
