## TF-IDF: Exercises

**Emotional Intelligence from Words!**  

We humans are complex creatures, constantly expressing our emotions through facial expressions, tone of voice, and of course, our words. Social media platforms like Twitter and Instagram have become a treasure trove of these emotional expressions, where people share their thoughts and feelings about various events and scenarios.

**Can we, as data scientists, decipher these emotions from just text?** 

This exercise is an exciting foray into **Emotion Detection using Classical NLP Techniques**. We'll dive into the world of words and emotions, and try to classify a given comment based on the feelings it conveys.

**Our Toolbox:**

* **Text Representation:**
    * Bag of words and n-grams: Capture the frequency of words and word combinations.
    * TF-IDF: Weigh words based on their importance within the document and across the corpus.
* **Classification Algorithms:** Explore various algorithms to learn from the preprocessed text and assign an emotion label (fear, anger, or joy) to each comment.
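
To make the two text representations concrete, here is a minimal pure-Python sketch of word n-grams and TF-IDF weighting. The helper names `word_ngrams` and `tfidf` and the toy corpus are made up for illustration. The smoothed IDF formula and the L2 normalisation are assumptions about one common variant, not necessarily the exact computation used later in this notebook.

```python
import math
from collections import Counter

def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(docs):
    """Compute L2-normalised TF-IDF weights for a list of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant): a term found in every document gets IDF 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({t: v / norm for t, v in raw.items()})
    return weights

# Toy corpus, invented for this sketch.
corpus = [
    "i feel happy today".split(),
    "i feel scared today".split(),
    "i feel angry".split(),
]

print(word_ngrams(corpus[0]))
# ['i', 'feel', 'happy', 'today', 'i feel', 'feel happy', 'happy today']

weights = tfidf(corpus)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {w:.3f}")
```

In the first toy document, "happy" appears in no other document and so receives the largest weight, while "i" and "feel" appear everywhere and receive the smallest.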

**The Data:**

* **Source:** Emotions Dataset for NLP - [https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp](https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp)
* **Columns:**
    * `Comment`: Statement or message expressing an opinion or reaction to an event.
    * `Emotion`: Label indicating the predominant emotion in the comment, one of three categories:
        * `fear`: Feeling of worry or apprehension.
        * `anger`: Strong feeling of annoyance, displeasure, or hostility.
        * `joy`: Feeling of great happiness or pleasure.
* **Problem Type:** Multi-Class Classification (three classes)

**The Journey:**

1. **Load the dataset:** Let's download and explore the dataset containing comments and their associated emotions.
2. **Preprocess the text:** Clean, tokenize, and transform the comments into features using techniques like Bag of words, n-grams, and TF-IDF.
3. **Train the models:** Train various classification algorithms on the transformed data, optimizing their ability to learn and predict the underlying emotion.
4. **Evaluate and fine-tune:** Compare the performance of different models using metrics like accuracy and F1-score.
5. **Predict emotions:** Unleash the best performing model, and let it tell you the story behind the words - the hidden emotions!

**Get ready to embark on a journey of understanding, one comment at a time!**  




In [8]:
###Load Required dependencies

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

#spaCy is used later for stop-word removal and lemmatization
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### Load the dataset

In [2]:
#read the dataset "Emotion_classify_Data.csv" and store it in a variable df_input
df_input = pd.read_csv("Input_data/Emotion_classify_Data.csv")

#print the shape of dataframe
print(df_input.shape)
print('')

df_input.head()

(5937, 2)



| | Comment | Emotion |
|---|---|---|
| 0 | i seriously hate one subject to death but now ... | fear |
| 1 | im so full of life i feel appalled | anger |
| 2 | i sit here to write i start to dig out my feel... | fear |
| 3 | ive been really angry with r and i feel like a... | joy |
| 4 | i feel suspicious if there is no one outside l... | fear |


In [4]:
#Looking at distribution of the target column to check the imbalance

print(df_input['Emotion'].value_counts())

print('')

print(df_input['Emotion'].value_counts(normalize=True)*100)

Emotion
anger    2000
joy      2000
fear     1937
Name: count, dtype: int64

Emotion
anger    33.687047
joy      33.687047
fear     32.625905
Name: proportion, dtype: float64


**From the above distribution, the three classes are almost equally represented (roughly 33% each), so no imbalance treatment is required.**
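
As a quick sanity check on that conclusion, the ratio between the largest and smallest class can be computed directly. This is a small sketch that hard-codes the counts printed above. What ratio counts as "balanced enough" is a judgment call that this snippet does not settle.

```python
from collections import Counter

# Class counts copied from the value_counts() output above.
counts = Counter({"anger": 2000, "joy": 2000, "fear": 1937})

total = sum(counts.values())
ratio = max(counts.values()) / min(counts.values())

print(f"total comments: {total}")
print(f"largest / smallest class: {ratio:.3f}")
for label, n in counts.items():
    print(f"{label:>5}: {100 * n / total:.2f}%")
```

A ratio close to 1 means the classes are nearly the same size.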

In [5]:
#Add the new column which gives a unique number to each of these labels 
df_input['label_num'] = df_input['Emotion'].map({'anger': 0, 'joy': 1, 'fear': 2})
#check the results
print('')

print(df_input['label_num'].value_counts())


label_num
0    2000
1    2000
2    1937
Name: count, dtype: int64
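
Since the models will predict the numeric labels, it can help to keep an inverse mapping around for decoding predictions back into emotion names. A minimal sketch follows. The names `label_to_num`, `num_to_label`, and `example_predictions` are illustrative, and the prediction values are invented rather than real model output.

```python
# Same mapping as in the cell above.
label_to_num = {"anger": 0, "joy": 1, "fear": 2}

# Invert it so numeric predictions can be turned back into emotion names.
num_to_label = {num: label for label, num in label_to_num.items()}

# Hypothetical predictions, used only to show the decoding step.
example_predictions = [2, 0, 1, 1]
decoded = [num_to_label[p] for p in example_predictions]
print(decoded)  # ['fear', 'anger', 'joy', 'joy']
```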


### Use text pre-processing to remove stop words and punctuation, and apply lemmatization

In [6]:
# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens)

In [7]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient
df_input['preprocessed_comment'] = df_input['Comment'].apply(preprocess) 

### Build a model with preprocessed text

In [10]:
#Do the train-test split with a test size of 20%, a random state of 2022, and stratified sampling on the label
#Note: Use the preprocessed_comment column
X_train, X_test, y_train, y_test = train_test_split(
    df_input.preprocessed_comment, 
    df_input.label_num, 
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df_input.label_num
)

In [11]:
#print the shapes

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)

Shape of X_train:  (4749,)
Shape of X_test:  (1188,)
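
The shapes above can be reproduced with simple arithmetic. The sketch below assumes the test set size is rounded up and the training set receives the remainder, which is consistent with the printed shapes. The per-class figures are the expected test counts under proportional (stratified) allocation, before rounding to whole rows.

```python
import math

n_samples = 5937   # rows in the dataset, from df_input.shape
test_size = 0.2

# Assumption: the test set size is rounded up and the train set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 4749 1188

# Expected per-class test counts under proportional (stratified) allocation.
class_counts = {"anger": 2000, "joy": 2000, "fear": 1937}
expected = {label: count / n_samples * n_test for label, count in class_counts.items()}
for label, value in expected.items():
    print(label, round(value, 1))
```

Rounded to whole rows, these come to about 400, 400, and 388 test comments per class.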


**Attempt 1:**

Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

- Use CountVectorizer with both unigrams and bigrams.
- Use RandomForest as the classifier.
- Print the classification report.

 

In [12]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer_bi_grams', CountVectorizer(ngram_range=(1, 2))),    #ngram_range=(1, 2) keeps unigrams and bigrams
    ('random_forest', RandomForestClassifier(random_state=123))
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       400
           1       0.93      0.95      0.94       400
           2       0.95      0.90      0.92       388

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
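
To read the report, recall that the F1-score is the harmonic mean of precision and recall, and that the macro average is the unweighted mean over classes. The sketch below recomputes both from the rounded precision and recall values printed above, so small rounding differences from the report are possible. The helper name `f1` is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above, already rounded to 2 decimals.
report = {0: (0.91, 0.93), 1: (0.93, 0.95), 2: (0.95, 0.90)}

scores = {label: f1(p, r) for label, (p, r) in report.items()}
for label, score in scores.items():
    print(label, round(score, 2))

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(scores.values()) / len(scores)
print("macro avg:", round(macro_f1, 2))
```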



**Attempt 2:**

Using the sklearn pipeline module, create a classification pipeline to classify the data.
Note:

- Use the TF-IDF vectorizer to transform the text into features.
- Use RandomForest as the classifier.
- Print the classification report.

In [14]:
#1. create a pipeline object
clf = Pipeline([
     ('vectorizer_tfidf', TfidfVectorizer()),        #default settings, i.e. unigrams only
     ('random_forest', RandomForestClassifier(random_state=123))
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred)) 

              precision    recall  f1-score   support

           0       0.93      0.91      0.92       400
           1       0.93      0.95      0.94       400
           2       0.92      0.92      0.92       388

    accuracy                           0.93      1188
   macro avg       0.93      0.93      0.93      1188
weighted avg       0.93      0.93      0.93      1188
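
Note that the two attempts are not a strict like-for-like comparison: Attempt 1 counts unigrams and bigrams, while Attempt 2 calls `TfidfVectorizer()` with its defaults, which covers unigrams only. One possible follow-up is to pass `ngram_range=(1, 2)` to the TF-IDF vectorizer as well, and to pass `target_names` to `classification_report` so the rows show emotion names instead of 0, 1, 2. The sketch below uses a tiny invented corpus purely to stay self-contained; in the notebook you would fit on `X_train` and evaluate on `X_test` instead.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch self-contained.
texts = [
    "feel furious angry", "feel mad irritated", "angry annoyed feel",
    "feel happy joyful", "feel delighted glad", "happy cheerful feel",
    "feel scared afraid", "feel terrified nervous", "afraid anxious feel",
]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # 0 = anger, 1 = joy, 2 = fear

clf = Pipeline([
    ("vectorizer_tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("random_forest", RandomForestClassifier(random_state=123)),
])
clf.fit(texts, labels)

# Predicting on the training texts only demonstrates the API;
# it says nothing about how well the model generalises.
y_pred = clf.predict(texts)
print(classification_report(labels, y_pred, target_names=["anger", "joy", "fear"]))
```

Whether bigrams help with TF-IDF on the real data is an open question that only running the comparison can answer.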

