### **TF-IDF: Exercises**

-   Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

-   On social media platforms such as Twitter and Instagram, many people share their views on a particular event or scenario in comments, and these comments can convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

-   For a given comment, we will use classical NLP techniques to classify which emotion that comment expresses.

-   We will use text representation techniques such as Bag of Words, n-grams, and TF-IDF, and apply different classification algorithms.


### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp

-   The data consists of two columns:
    -   comment
    -   emotion
-   The comment column holds statements or messages about a particular event or situation.

-   The emotion column tells which emotion the comment expresses: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

-   As there are six classes, this is a **multi-class classification** problem.


In [6]:
# import pandas library
import pandas as pd

# read the train, validation and test files (semicolon-separated)
# and combine them into a single dataframe df
train = pd.read_csv("emotion-detection/train.txt", sep=";")
val = pd.read_csv("emotion-detection/val.txt", sep=";")
test = pd.read_csv("emotion-detection/test.txt", sep=";")
# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat([train, val, test], axis=0, ignore_index=True)

# print the shape of dataframe
print(df.shape)

# print top 5 rows
df.head()

(20000, 2)


|     | comment | emotion |
| --- | --- | --- |
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [7]:
# check the distribution of Emotion
df["emotion"].value_counts()

emotion
joy         6761
sadness     5797
anger       2709
fear        2373
love        1641
surprise     719
Name: count, dtype: int64
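
The counts above are uneven, which matters later when reading per-class precision and recall. The following sketch turns the printed counts into class shares. The Series is typed in by hand from the output above; on the real dataframe, `df["emotion"].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {
        "joy": 6761,
        "sadness": 5797,
        "anger": 2709,
        "fear": 2373,
        "love": 1641,
        "surprise": 719,
    },
    name="count",
)

# Share of each emotion in the whole dataset
shares = (counts / counts.sum()).round(3)
print(shares)

# Ratio between the most and the least frequent class
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 1))
```

With the largest class roughly nine times the size of the smallest, stratified splitting and the macro-averaged scores in the reports are worth paying attention to.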

In [8]:
# Add a new column "emotion_num" which gives a unique number to each emotion
# joy --> 0, sadness --> 1, anger --> 2, fear --> 3, love --> 4, surprise --> 5
df.loc[:, "emotion_num"] = df["emotion"].map(
    {
        "joy": 0,
        "sadness": 1,
        "anger": 2,
        "fear": 3,
        "love": 4,
        "surprise": 5,
    }
)

# checking the results by printing top 5 rows
df.head()

|     | comment | emotion | emotion_num |
| --- | --- | --- | --- |
| 0 | i didnt feel humiliated | sadness | 1 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 1 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 2 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 4 |
| 4 | i am feeling grouchy | anger | 2 |


### **Modelling without Pre-processing Text data**


In [9]:
# import train-test split
from sklearn.model_selection import train_test_split

# Do the train-test split; test_size is not passed, so the default of 25% is used
# Note: use random state 2022 and stratified sampling
X_train, X_test, y_train, y_test = train_test_split(
    df["comment"], df["emotion_num"], random_state=2022, stratify=df["emotion_num"]
)

In [11]:
# print the shapes of X_train and X_test
X_train.shape, X_test.shape

((15000,), (5000,))
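
The shapes above correspond to a 75/25 split, which is what `train_test_split` uses when `test_size` is not passed. The sketch below checks this on hypothetical toy data and also shows that `stratify` keeps the class proportions in the test set; the arrays and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 samples of class 1 (hypothetical)
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# No test_size given: scikit-learn falls back to 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2022, stratify=y)
print(X_tr.shape, X_te.shape)

# Stratification keeps the 80/20 class ratio inside the test set
print(np.bincount(y_te))

# An explicit 20% test split needs test_size=0.2
X_tr20, X_te20, y_tr20, y_te20 = train_test_split(
    X, y, test_size=0.2, random_state=2022, stratify=y
)
print(X_tr20.shape, X_te20.shape)
```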

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use CountVectorizer with only trigrams.
-   use **RandomForest** as the classifier.
-   print the classification report.


In [12]:
# import the vectorizers, Pipeline, classification_report and the classifiers from sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(3, 3))),
        ("classifier", RandomForestClassifier()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.49      0.63      0.55      1690
           1       0.55      0.35      0.42      1449
           2       0.44      0.28      0.34       678
           3       0.21      0.47      0.29       593
           4       0.59      0.11      0.19       410
           5       0.75      0.10      0.18       180

    accuracy                           0.42      5000
   macro avg       0.50      0.32      0.33      5000
weighted avg       0.48      0.42      0.41      5000
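
One plausible reason for the weak trigram-only scores is sparsity: a three-word sequence from a new comment rarely appears verbatim in the training data. The sketch below illustrates the effect on made-up sentences, so it is an illustration of the mechanism and not a measurement on the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training sentences and one unseen sentence
train_texts = [
    "i am feeling very happy today",
    "i am feeling quite sad today",
]
new_text = ["i am feeling rather happy today"]

results = {}
for ngram_range in [(1, 1), (1, 2), (3, 3)]:
    vec = CountVectorizer(ngram_range=ngram_range).fit(train_texts)
    vocab_size = len(vec.vocabulary_)
    # Number of n-grams in the unseen sentence that exist in the vocabulary
    matched = int(vec.transform(new_text).sum())
    results[ngram_range] = (vocab_size, matched)
    print(ngram_range, "vocabulary:", vocab_size, "matched features:", matched)
```

Changing a single word leaves most unigrams intact, but every trigram of the new sentence is unseen, so the trigram-only vectorizer hands the classifier an all-zero row.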



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use CountVectorizer with both unigrams and bigrams.
-   use **Multinomial Naive Bayes** as the classifier.
-   print the classification report.


In [13]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", MultinomialNB()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.62      0.97      0.76      1690
           1       0.69      0.89      0.78      1449
           2       0.93      0.33      0.49       678
           3       0.88      0.26      0.40       593
           4       0.94      0.07      0.14       410
           5       1.00      0.03      0.06       180

    accuracy                           0.67      5000
   macro avg       0.84      0.43      0.44      5000
weighted avg       0.75      0.67      0.61      5000
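
The report rows are labelled 0 to 5, which makes it hard to see that the low-recall rows are the rarer emotions. A small sketch, assuming the same label mapping as in the `emotion_num` cell, shows how `target_names` and `output_dict` can make the report easier to read. The `y_true` and `y_hat` lists are hypothetical and only mimic the pattern of rare classes being predicted as joy.

```python
from sklearn.metrics import classification_report

# Same mapping as used for the emotion_num column
emotion_to_num = {
    "joy": 0,
    "sadness": 1,
    "anger": 2,
    "fear": 3,
    "love": 4,
    "surprise": 5,
}
# Emotion names ordered by their numeric label
target_names = sorted(emotion_to_num, key=emotion_to_num.get)

# Hypothetical labels: "love" and "surprise" are both predicted as "joy"
y_true = [0, 1, 2, 3, 4, 5, 0, 1]
y_hat = [0, 1, 2, 3, 0, 0, 0, 1]

labels = list(range(len(target_names)))
print(
    classification_report(
        y_true, y_hat, labels=labels, target_names=target_names, zero_division=0
    )
)

# output_dict=True returns the same numbers as a nested dictionary
report = classification_report(
    y_true,
    y_hat,
    labels=labels,
    target_names=target_names,
    zero_division=0,
    output_dict=True,
)
print(report["love"]["recall"], report["joy"]["precision"])
```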



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use CountVectorizer with both unigrams and bigrams.
-   use **RandomForest** as the classifier.
-   print the classification report.


In [14]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", RandomForestClassifier()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.96      0.88      1690
           1       0.91      0.90      0.90      1449
           2       0.89      0.82      0.85       678
           3       0.87      0.76      0.81       593
           4       0.90      0.64      0.75       410
           5       0.84      0.68      0.75       180

    accuracy                           0.86      5000
   macro avg       0.87      0.79      0.82      5000
weighted avg       0.87      0.86      0.86      5000



**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use the **TF-IDF vectorizer** for text representation.
-   use **RandomForest** as the classifier.
-   print the classification report.


In [15]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", TfidfVectorizer()),
        ("classifier", RandomForestClassifier()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.83      0.94      0.88      1690
           1       0.91      0.90      0.90      1449
           2       0.89      0.83      0.86       678
           3       0.83      0.81      0.82       593
           4       0.85      0.65      0.74       410
           5       0.83      0.69      0.75       180

    accuracy                           0.86      5000
   macro avg       0.86      0.80      0.83      5000
weighted avg       0.86      0.86      0.86      5000
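
The attempts above repeat the same fit, predict, and report steps with a different vectorizer or classifier each time. One way to compare such variants side by side is a small helper that returns a single score per pipeline. The `evaluate` function and the toy texts below are hypothetical and only sketch the idea; macro F1 is chosen here because it weights all classes equally.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def evaluate(vectorizer, classifier, X_train, y_train, X_test, y_test):
    """Fit a two-step pipeline and return the macro-averaged F1 on the test set."""
    pipe = Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
    pipe.fit(X_train, y_train)
    return f1_score(y_test, pipe.predict(X_test), average="macro")


# Toy data (hypothetical): 0 = joy, 1 = sadness
toy_train = ["i feel happy", "what a joyful day", "i feel sad", "such a gloomy day"]
toy_train_labels = [0, 0, 1, 1]
toy_test = ["happy joyful day", "sad gloomy day"]
toy_test_labels = [0, 1]

candidates = {
    "counts (1,2) + NB": (CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "tf-idf + NB": (TfidfVectorizer(), MultinomialNB()),
}

scores = {}
for name, (vectorizer, classifier) in candidates.items():
    scores[name] = evaluate(
        vectorizer, classifier, toy_train, toy_train_labels, toy_test, toy_test_labels
    )
    print(f"{name}: macro F1 = {scores[name]:.2f}")
```

On the real data, the same helper could be called with `X_train`, `y_train`, `X_test`, and `y_test` from the split above and with `RandomForestClassifier()` as the classifier.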



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**


In [16]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm")


# use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)

    return " ".join(filtered_tokens)

In [17]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df.loc[:, "preprocessed_comment"] = df["comment"].apply(preprocess)

**Build a model with pre processed text**


In [19]:
# Do the train-test split with random state 2022 and stratified sampling
# (test_size is not passed, so the default of 25% is used)
# Note: use the preprocessed_comment column
X_train, X_test, y_train, y_test = train_test_split(
    df["preprocessed_comment"],
    df["emotion_num"],
    random_state=2022,
    stratify=df["emotion_num"],
)

**Let's check the scores with our best model so far**

-   Random Forest


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use CountVectorizer with both unigrams and bigrams.
-   use **RandomForest** as the classifier.
-   print the classification report.


In [20]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", CountVectorizer(ngram_range=(1, 2))),
        ("classifier", RandomForestClassifier()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.90      0.90      0.90      1690
           1       0.89      0.94      0.91      1449
           2       0.87      0.89      0.88       678
           3       0.89      0.82      0.85       593
           4       0.78      0.73      0.75       410
           5       0.85      0.71      0.78       180

    accuracy                           0.88      5000
   macro avg       0.86      0.83      0.85      5000
weighted avg       0.88      0.88      0.88      5000



**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use the **TF-IDF vectorizer** for text representation.
-   use **RandomForest** as the classifier.
-   print the classification report.


In [21]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", TfidfVectorizer()),
        ("classifier", RandomForestClassifier()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.84      0.92      0.88      1690
           1       0.91      0.87      0.89      1449
           2       0.85      0.85      0.85       678
           3       0.82      0.82      0.82       593
           4       0.82      0.64      0.72       410
           5       0.82      0.71      0.76       180

    accuracy                           0.85      5000
   macro avg       0.84      0.80      0.82      5000
weighted avg       0.86      0.85      0.85      5000


**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**

-   use the **TF-IDF vectorizer** for text representation.
-   use **Multinomial Naive Bayes** as the classifier.
-   print the classification report.



In [22]:
# 1. create a pipeline object
pipeline = Pipeline(
    [
        ("n_gram", TfidfVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)

# 2. fit with X_train and y_train
pipeline.fit(X_train, y_train)

# 3. get the predictions for X_test and store it in y_pred
y_pred = pipeline.predict(X_test)

# 4. print the classification report
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.64      0.98      0.77      1690
           1       0.71      0.92      0.80      1449
           2       0.95      0.41      0.57       678
           3       0.89      0.28      0.43       593
           4       0.97      0.09      0.16       410
           5       1.00      0.03      0.05       180

    accuracy                           0.69      5000
   macro avg       0.86      0.45      0.46      5000
weighted avg       0.77      0.69      0.64      5000

