### **TF-IDF: Exercises**
 
- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.
 
- On social media platforms such as Twitter and Instagram, many people share their views on a particular event or scenario through comments, and these comments may express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.
 
- For a given comment/text, we will use classical NLP techniques to classify which emotion the comment expresses.
 
- We will use techniques such as bag of words, n-grams, and TF-IDF for text representation and apply different classification algorithms.
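
To make the TF-IDF representation mentioned above concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which mirrors the default settings of scikit-learn's `TfidfVectorizer`. The helper name `tfidf` and the three toy sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency.
    idf = {
        term: math.log((1 + n_docs) / (1 + count)) + 1
        for term, count in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_freq.items()}
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Toy comments, invented for this example.
docs = ["i feel happy today", "i feel scared", "i am so angry"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that occur in every document (here `i`) receive a lower weight than terms that are specific to one document (here `happy`). Note that `TfidfVectorizer` tokenizes differently by default (for example, it drops single-character tokens), so its numbers will not match this sketch exactly.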

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
    - Text (the comment)
    - Emotion
- Each comment is a statement or message about a particular event/situation.

- The Emotion column gives the label of the comment: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are 6 classes, this is a **multi-class classification** problem.

In [4]:
#import pandas library
import pandas

#read the test, validation and train files (semicolon-separated, no header row) and name the two columns
df_test = pandas.read_csv("./test.txt", sep=";", names=["Text", "Emotion"])
df_val = pandas.read_csv("./val.txt", sep=";", names=["Text", "Emotion"])
df_train = pandas.read_csv("./train.txt", sep=";", names=["Text", "Emotion"])

frames = [df_test, df_train, df_val]

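#combine the three splits into one dataframe; ignore_index gives every row a unique index
df = pandas.concat(frames, ignore_index=True)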

#print the shape of dataframe

df.shape

#print top 5 rows
df.head()

| | Text | Emotion |
|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness |
| 1 | im updating my blog because i feel shitty | sadness |
| 2 | i never make her separate from me because i do... | sadness |
| 3 | i left with my bouquet of red and yellow tulip... | joy |
| 4 | i was feeling a little vain when i did this one | sadness |


In [5]:
#check the distribution of Emotion
df.Emotion.value_counts()

Emotion
joy         6761
sadness     5797
anger       2709
fear        2373
love        1641
surprise     719
Name: count, dtype: int64

In [6]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#pandas.factorize assigns codes in order of first appearance, so here sadness --> 0, joy --> 1, and so on

#factorize returns the integer codes and the unique labels; emotions[code] gives the label name
df['Emotion_num'], emotions = pandas.factorize(df.Emotion)


#checking the results by printing top 5 rows
df.head()


| | Text | Emotion | Emotion_num |
|---|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | sadness | 0 |
| 1 | im updating my blog because i feel shitty | sadness | 0 |
| 2 | i never make her separate from me because i do... | sadness | 0 |
| 3 | i left with my bouquet of red and yellow tulip... | joy | 1 |
| 4 | i was feeling a little vain when i did this one | sadness | 0 |
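
As a small illustration of how `pandas.factorize` numbers the labels, the sketch below runs it on a made-up five-row series. Codes are assigned in order of first appearance, and the second return value holds the unique labels, so a code can be mapped back to its emotion name. The variable names are chosen for this example only.

```python
import pandas

# Toy labels, invented for this example (not taken from the dataset files).
labels = pandas.Series(["sadness", "sadness", "joy", "sadness", "anger"])

# codes: one integer per row; uniques: the labels in order of first appearance.
codes, uniques = pandas.factorize(labels)

# Lookup table from integer code back to the emotion name.
code_to_emotion = dict(enumerate(uniques))

print(codes)
print(code_to_emotion)
```

Keeping the `uniques` value around is useful later, because model predictions come back as integer codes rather than emotion names.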


### **Modelling with Pre-processed Text data**

In [7]:
#import train-test split
from sklearn.model_selection import train_test_split


#Do the 'train-test' splitting with test size of 20%
#Note: Give Random state 2022 and also do the stratify sampling

X = df.Text
y = df.Emotion_num

X_train, X_test, Y_train, Y_test=train_test_split(X, y, test_size=0.20, random_state=2022, stratify=y)

In [8]:
#print the shapes of X_train and X_test
print(f"X_train: {X_train.shape} and X_test: {X_test.shape}")

X_train: (16000,) and X_test: (4000,)
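
To see what `stratify` does in the split above, the following sketch applies the same `train_test_split` settings to a small made-up label list and counts the classes in each part. The toy labels and texts are assumptions for illustration and are not the notebook's data.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 60% joy, 30% sadness, 10% fear (invented for this example).
labels = ["joy"] * 60 + ["sadness"] * 30 + ["fear"] * 10
texts = [f"comment {i}" for i in range(len(labels))]

# Same settings as in the notebook: 20% test size, fixed seed, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.20, random_state=2022, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With stratification, the class proportions in the train and test parts match those of the full label list, which matters here because classes such as surprise are much rarer than joy.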


In [9]:
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

from sklearn.base import BaseEstimator, TransformerMixin

class PreProcessText(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.apply(preprocess)
        return X

#1. create a pipeline object

model = Pipeline(steps= [
    ('PreprocessText', PreProcessText()),
    ('tf_idf', TfidfVectorizer()),
    ('xg_boost', XGBClassifier()),
])


#2. fit with X_train and Y_train

model.fit(X_train, Y_train)

#3. get the predictions for X_test and store it in y_pred

y_pred = model.predict(X_test)

#4. print the classification report
print(classification_report(Y_test, y_pred))
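
Because the model is trained on integer codes, the report above lists the classes as numbers. As a hedged sketch, `classification_report` accepts a `target_names` argument that replaces those numbers with readable names. The label order and the toy predictions below are invented for illustration; in the notebook, the real order would be the unique labels returned as the second value of `pandas.factorize`.

```python
from sklearn.metrics import classification_report

# Hypothetical code -> name order (index 0 is "sadness", 1 is "joy", 2 is "anger").
label_names = ["sadness", "joy", "anger"]

# Toy ground truth and predictions, made up for this example.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Human-readable text report with emotion names instead of integer codes.
print(classification_report(y_true, y_pred, target_names=label_names))

# The same metrics as a nested dict, convenient for programmatic checks.
report = classification_report(
    y_true, y_pred, target_names=label_names, output_dict=True
)
```

The names in `target_names` must be listed in the order of the sorted integer codes, otherwise the rows of the report will be mislabelled.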