# **Student Training Program on AIML**
### MODULE: CLASSIFICATION-1
### LAB-2 : Using KNN for Text Classification


---

## **Section 1: Understanding NLP tools**

In this lab we will use KNN on a real-world NLP application: text classification. But first, let's look at some NLP techniques for text classification and the tools we use when working with Python for NLP.

## Section 1.1: Data Cleaning and Preprocessing

Raw text must be cleaned and converted into a form suitable for machine-learning algorithms.  
With text, several things need to be taken into account:  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

Most importantly, words or images cannot be used directly in algorithms; they must first be converted into vectors, a numeric form that algorithms can work with.



### **NLTK**
NLTK (Natural Language Toolkit) is a commonly used library for processing text. We will use it in this lab. Let's first download the NLTK resources we need.


In [2]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [3]:
import re
from nltk.tokenize import word_tokenize

def clean_text_simple(text):
    """
    A simplified function to clean text.
    - Converts to lowercase
    - Removes non-alphabetic characters
    - Tokenizes the text
    """
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    return tokens

In [5]:
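sample_text = "Troubling"
# Note: clean_text_simple only lowercases, strips non-letters, and tokenizes.
# It does not stem or lemmatize, so both labels below print the same cleaned token.
stemmed_text = " ".join(clean_text_simple(sample_text))
print(f"Original: {sample_text}")
print(f"Stemmed: {stemmed_text}")
lemmatized_text = " ".join(clean_text_simple(sample_text))
print(f"Lemmatized: {lemmatized_text}")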

Original: Troubling
Stemmed: troubling
Lemmatized: troubling


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
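To make the "order is discarded" point concrete, here is a minimal from-scratch sketch using only the standard library. The function name `bag_of_words` and the toy sentences are made up for illustration; this is not how `CountVectorizer` is implemented, only the same idea in a few lines.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One count per vocabulary word; word positions are thrown away.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical toy documents.
docs = ["the dog bites the man", "the man bites the dog", "the cat sleeps"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

The first two sentences mean different things but get identical vectors, because only the word counts survive.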

In [6]:
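from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Convert the train and test documents into document-term count matrices,
# optionally removing English stopwords.
# Note: cleanText is defined in a later cell; run that cell before calling this function.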
def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = [" ".join(cleanText(p, lemmatize, stemmer)) for p in train]
    clean_test = [" ".join(cleanText(p, lemmatize, stemmer)) for p in test]

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF (term frequency-inverse document frequency) weights each word by how informative it is. This addresses a limitation of the Bag of Words technique, which counts every word equally. Like BoW, it converts text into numbers that a machine can use for text classification.

The number of times a term occurs in a document is called its term frequency (TF).

Document frequency (DF) is the number of documents in which the term is present. Inverse document frequency (IDF) is the inverse of the document frequency, usually log-scaled; it measures the informativeness of term *t*: the fewer documents contain *t*, the higher its IDF.
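As a rough illustration of these definitions, the sketch below computes TF and IDF by hand with the standard library. It assumes the textbook formula `idf = log(N / df)`; scikit-learn's `TfidfVectorizer` uses a smoothed variant by default and also normalizes each row, so its numbers will differ. The function name `tf_idf` and the toy reviews are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return (idf, per-document tf-idf weights) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Textbook IDF (an assumption of this sketch): log(N / df).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return idf, weights

# Hypothetical toy reviews.
docs = ["the movie was great", "the movie was boring", "the plot was great fun"]
idf, weights = tf_idf(docs)
for term in ["the", "movie", "boring"]:
    print(term, round(idf[term], 3))
```

Words that occur in every document, such as "the", get an IDF of zero under this formula, while a word that appears in only one document, such as "boring", gets the highest weight.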




In [7]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = [" ".join(cleanText(p, lemmatize, stemmer)) for p in train]
    clean_test = [" ".join(cleanText(p, lemmatize, stemmer)) for p in test]

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [8]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [9]:
import pandas as pd
import io

# Get the filename of the uploaded file
filename = list(uploaded.keys())[0]

# Read the uploaded file into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded[filename]))

display(df.head())

Unnamed: 0,sentence,sentiment
0,Not sure who was more lost - the flat characte...,0
1,Attempting artiness with black & white and cle...,0
2,Very little music or anything to speak of.,0
3,The best scene in the movie was when Gerardo i...,1
4,"The rest of the movie lacks art, charm, meanin...",0


In [10]:
df = df.dropna()

In [11]:
df

Unnamed: 0,sentence,sentiment
0,Not sure who was more lost - the flat characte...,0
1,Attempting artiness with black & white and cle...,0
2,Very little music or anything to speak of.,0
3,The best scene in the movie was when Gerardo i...,1
4,"The rest of the movie lacks art, charm, meanin...",0
...,...,...
994,I just got bored watching Jessice Lange take h...,0
995,"Unfortunately, any virtue in this film's produ...",0
996,"In a word, it is embarrassing.",0
997,Exceptionally bad!,0


In [12]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.
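Before looking at the scikit-learn models, it may help to see what the distance metric changes. The following is a minimal from-scratch sketch, not scikit-learn's implementation: `knn_predict` and the toy two-word vocabulary `["good", "bad"]` are hypothetical, and voting is uniform (comparable to `weights='uniform'`).

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention for this sketch: an all-zero vector is maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train_vectors, train_labels, query, k, distance):
    # Sort training points by distance to the query and keep the k nearest.
    nearest = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: distance(pair[0], query),
    )[:k]
    # Uniform weighting: every neighbor gets one vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical counts of the words ["good", "bad"]; label 1 = positive, 0 = negative.
train_vectors = [[4, 0], [5, 1], [0, 1], [0, 2]]
train_labels = [1, 1, 0, 0]
query = [1, 0]  # a short review that says "good" once

print(knn_predict(train_vectors, train_labels, query, 3, euclidean_distance))
print(knn_predict(train_vectors, train_labels, query, 3, cosine_distance))
```

With Euclidean distance the short query lands closest to the short negative reviews simply because they have similar length, so the vote is negative. Cosine distance compares direction only, so the query matches the longer positive reviews. This is one reason cosine distance is a common choice for text vectors.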

In [13]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
import pandas as pd

def bow_knn():
    training_data = df
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', p=2, metric='euclidean')
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    print('KNN with BOW accuracy = ' + str(metrics.accuracy_score(y_test, predicted) * 100) + '%')
    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    return predicted, y_test

def tfidf_knn():
    training_data = df
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                           test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', metric='cosine')
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    print('KNN with TFIDF accuracy = ' + str(metrics.accuracy_score(y_test, predicted) * 100) + '%')
    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import numpy
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # unknown POS tag: fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

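    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatizing nor stemming was requested: return the plain tokens
    # so that callers can always join the result (previously this returned None).
    return word_tokenize(text)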

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = [" ".join(cleanText(p, lemmatize, stemmer)) for p in train]
    clean_test = [" ".join(cleanText(p, lemmatize, stemmer)) for p in test]

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 66.49214659685863%




Cross Validation Accuracy: 0.63


In [18]:
!pip install lxml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import numpy
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from bs4 import BeautifulSoup
import pandas as pd
import io

# Get the filename of the uploaded file
filename = list(uploaded.keys())[0]

# Read the uploaded file into a pandas DataFrame
df = pd.read_csv(io.BytesIO(uploaded[filename]))
df['sentiment'] = pd.to_numeric(df['sentiment'], errors='coerce')
df = df.dropna()


def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except Exception:  # unknown POS tag: fall back to the default (noun) lemma
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

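    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

    # Neither lemmatizing nor stemming was requested: return the plain tokens
    # so that callers can always join the result (previously this returned None).
    return word_tokenize(text)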

def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = [" ".join(cleanText(p, lemmatize, stemmer)) for p in train]
    clean_test = [" ".join(cleanText(p, lemmatize, stemmer)) for p in test]

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

def bow_knn():
    training_data = df
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', p=2, metric='euclidean')
    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    print('KNN with BOW accuracy = ' + str(metrics.accuracy_score(y_test, predicted) * 100) + '%')
    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    return predicted, y_test

## KNN accuracy after using BoW
print("KNN with Bag of Words:")
predicted_bow, y_test_bow = bow_knn()

## KNN accuracy after using TFIDF
print("\nKNN with TF-IDF:")
predicted_tfidf, y_test_tfidf = tfidf_knn()

KNN with Bag of Words:
KNN with BOW accuracy = 68.06282722513089%
Cross Validation Accuracy: 0.60

KNN with TF-IDF:
KNN with TFIDF accuracy = 72.25130890052355%
Cross Validation Accuracy: 0.73


In [20]:
## KNN accuracy after using BoW
predicted, y_test = bow_knn()

KNN with BOW accuracy = 68.06282722513089%
Cross Validation Accuracy: 0.60


In [21]:
## KNN accuracy after using TFIDF
predicted, y_test = tfidf_knn()

KNN with TFIDF accuracy = 72.25130890052355%
Cross Validation Accuracy: 0.73


# **Section 4: SPAM TEXT DATASET**
Now let's use what we've learnt to classify texts as spam or not spam.

In [22]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [23]:
import pandas as pd
df = pd.read_csv('spam.csv', encoding='latin-1')
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [27]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [28]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [29]:
len(df)

5572

In [30]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

In [32]:
# This cell may take some time to run
predicted, y_test = bow_knn()
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predicted))
print(classification_report(y_test, predicted))

KNN with BOW accuracy = 92.19730941704036%
Cross Validation Accuracy: 0.91
[0.90713324 0.90174966 0.91447811]


[[970   0]
 [ 87  58]]
              precision    recall  f1-score   support

           0       0.92      1.00      0.96       970
           1       1.00      0.40      0.57       145

    accuracy                           0.92      1115
   macro avg       0.96      0.70      0.76      1115
weighted avg       0.93      0.92      0.91      1115



In [33]:
# This cell may take some time to run
predicted_tfidf, y_test_tfidf = tfidf_knn()
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test_tfidf, predicted_tfidf))
print(classification_report(y_test_tfidf, predicted_tfidf))

KNN with TFIDF accuracy = 98.56502242152466%
Cross Validation Accuracy: 0.97
[0.96837147 0.96769852 0.96363636]
[[970   0]
 [ 16 129]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       970
           1       1.00      0.89      0.94       145

    accuracy                           0.99      1115
   macro avg       0.99      0.94      0.97      1115
weighted avg       0.99      0.99      0.99      1115
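The figures in the two reports can be recomputed by hand from the confusion matrices printed above. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]` for the spam class); the helper name `binary_metrics` is made up for this example.

```python
def binary_metrics(confusion):
    """Metrics for class 1 from a [[TN, FP], [FN, TP]] confusion matrix."""
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Confusion matrices copied from the outputs above.
bow = binary_metrics([[970, 0], [87, 58]])
tfidf = binary_metrics([[970, 0], [16, 129]])

for name, m in [("BoW", bow), ("TF-IDF", tfidf)]:
    print(name, {key: round(value, 2) for key, value in m.items()})
```

Note how the BoW model reaches about 92% accuracy while catching only 58 of the 145 spam messages (recall 0.40). Because ham is far more common than spam in this test split (970 versus 145), accuracy alone hides this weakness, which is why the per-class recall and F1 are worth reading.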



### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?
2. Can you think of techniques that are better than both BoW and TF-IDF?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

### 1. Why does the TF-IDF approach generally result in better accuracy than Bag-of-Words?

*   **What they do:**
    *   **Bag-of-Words (BoW):** Simply counts the frequency of each word in a document. It treats all words equally, regardless of their importance.
    *   **TF-IDF (Term Frequency-Inverse Document Frequency):** Also considers word frequency (TF), but it gives more weight to words that are distinctive for a specific document and less weight to common words that appear in many documents (IDF).

*   **Why TF-IDF is often better:**
    *   **Focus on Important Words:** TF-IDF helps to highlight the words that are most relevant to a particular document, while downplaying common words like "the," "a," and "is." This allows the model to focus on the words that carry the most meaning and are most likely to be good predictors.
    *   **Reduced Noise:** By giving less weight to common words, TF-IDF reduces the noise in the data, which can help to improve the accuracy of the model.

### 2. Can you think of techniques that are better than both BoW and TF-IDF?

Yes, there are several more advanced techniques that can often outperform BoW and TF-IDF:

*   **Word Embeddings (Word2Vec, GloVe, FastText):**
    *   **What they are:** These techniques represent words as dense vectors in a low-dimensional space. The key idea is that words with similar meanings will have similar vector representations.
    *   **Why they are better:** Unlike BoW and TF-IDF, word embeddings capture the semantic relationships between words. This allows the model to understand the context of words and make more accurate predictions.

*   **Contextualized Word Embeddings (BERT, ELMo, GPT):**
    *   **What they are:** These are even more advanced techniques that generate different vector representations for a word depending on its context.
    *   **Why they are better:** They can understand that the word "bank" has a different meaning in "river bank" than it does in "investment bank." This allows for a much deeper understanding of the text and can lead to significant improvements in accuracy.

### 3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

*   **What they do:**
    *   **Stemming:** A rule-based approach that chops off the ends of words to get to the "stem." For example, "running" and "runs" are both stemmed to "run," but an irregular form such as "ran" is left unchanged because no suffix rule applies to it.
    *   **Lemmatization:** A more sophisticated approach that uses a dictionary (and usually the word's part of speech) to find the root form of a word (the "lemma"). For example, the lemma of the verbs "running" and "ran" is "run," while the noun "runner" keeps its own lemma, "runner."

*   **Pros and Cons:**

    | Feature | Stemming | Lemmatization |
    | :--- | :--- | :--- |
    | **Speed** | Faster | Slower |
    | **Accuracy** | Less accurate (can produce non-words) | More accurate (produces real words) |
    | **Complexity** | Simpler | More complex |

*   **When to use which:**
    *   **Stemming:** A good choice when you need a fast and simple way to normalize text, and you don't mind a little bit of inaccuracy.
    *   **Lemmatization:** A better choice when you need a more accurate way to normalize text, and you can afford the extra processing time.
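The trade-off in the table can be seen in a deliberately crude sketch. Both functions below are toy stand-ins written for this illustration: `toy_stem` is not the Porter or Snowball algorithm, and `TOY_LEMMAS` is a tiny hand-written table rather than a real dictionary such as WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping: fast and rule-based, but can produce non-words."""
    for suffix in ("ing", "ies", "ed", "s"):
        # Only strip if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table (hypothetical), standing in for a real dictionary.
TOY_LEMMAS = {"running": "run", "ran": "run", "studies": "study"}

def toy_lemmatize(word):
    """Dictionary lookup: returns real words, but only for entries it knows."""
    return TOY_LEMMAS.get(word, word)

for word in ["running", "ran", "studies"]:
    print(f"{word}: stem={toy_stem(word)}, lemma={toy_lemmatize(word)}")
```

The suffix rules yield "runn" and "stud" and cannot handle the irregular form "ran", whereas the lookup returns real words at the cost of needing a dictionary. Real stemmers use more careful rules than this, but the same kind of difference remains.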