### **TF-IDF: Exercises**
 
- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.
 
- On social media platforms such as Twitter and Instagram, many people share their views on a particular event or scenario in comments, and these comments can express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.
 
- Given a comment, we are going to use classical NLP techniques to classify which emotion it expresses!
 
- We are going to use text representation techniques such as Bag of Words, n-grams, and TF-IDF, and apply different classification algorithms.
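
To make the idea of n-grams concrete, here is a minimal sketch in plain Python. The `ngrams` helper is hypothetical and uses simple whitespace tokenization, so it will not match `CountVectorizer` exactly (which, by default, also drops single-character tokens such as "i").

```python
def ngrams(text, n):
    """Return the word n-grams of `text`, using whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


comment = "i am feeling grouchy"
unigrams = ngrams(comment, 1)   # single words
bigrams = ngrams(comment, 2)    # pairs of consecutive words
trigrams = ngrams(comment, 3)   # triples of consecutive words

print(unigrams)
print(bigrams)
print(trigrams)
```

The longer the n-gram, the fewer of them a short comment contains and the less likely any one of them is to repeat across comments.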

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
        - Comment (loaded below as `text`)
        - Emotion (loaded below as `label`)
- Comments are statements or messages about a particular event or situation.

- The Emotion column tells which emotion the comment expresses. The full dataset has six emotions; this exercise keeps three of them: fear 😨, anger 😡, and joy 😂.

- As there are 3 classes, this problem comes under **Multi-Class Classification.**

In [85]:
#import pandas library
import pandas as pd

#read the train and test files ("train4.txt" and "test4.txt"), then combine them into a single dataframe df
df_train = pd.read_csv("train4.txt", 
                 sep=";",       
                 header=None,   
                 names=["text", "label"])  

df_test = pd.read_csv("test4.txt", 
                 sep=";",       
                 header=None,   
                 names=["text", "label"]) 

df = pd.concat([df_train,df_test],axis = 0)
#print the shape of dataframe
print(df.shape)

#print top 5 rows
df.head()

(18000, 2)


| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [86]:
#check the distribution of Emotion
df.label.value_counts()

label
joy         6057
sadness     5247
anger       2434
fear        2161
love        1463
surprise     638
Name: count, dtype: int64

In [87]:
#balance the classes: keep only joy, fear and anger, and sample 2000 comments from each
min_samples=2000
df_joy = df[df.label == 'joy'].sample(min_samples,random_state=2022)
df_fear = df[df.label == 'fear'].sample(min_samples,random_state=2022)
df_anger = df[df.label == 'anger'].sample(min_samples,random_state=2022)

df = pd.concat([df_joy,df_fear,df_anger],axis=0)
df.head()

| | text | label |
|---|---|---|
| 1971 | i was feeling more optimistic with blue skies ... | joy |
| 3538 | i feel more reassured now | joy |
| 7931 | i want to shout say something dont just smile ... | joy |
| 8849 | i feel extremely lucky and blessed to work wit... | joy |
| 5945 | i didn t feel terrific | joy |
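
The per-class sampling above can also be written as one grouped operation. The sketch below uses a small made-up dataframe (`toy`) in place of the real data and assumes pandas 1.1 or newer, where `DataFrameGroupBy.sample` is available. It may not pick the same rows as the three separate `.sample` calls, and the result is ordered by label.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "text": [f"comment {i}" for i in range(12)],
    "label": ["joy"] * 5 + ["fear"] * 3 + ["anger"] * 2 + ["love"] * 2,
})

keep = ["joy", "fear", "anger"]  # classes used in this exercise
n_per_class = 2                  # the notebook uses 2000

# Filter to the wanted classes, then draw the same number of rows from each.
balanced = (
    toy[toy["label"].isin(keep)]
    .groupby("label")
    .sample(n=n_per_class, random_state=2022)
)

print(balanced["label"].value_counts())
```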


In [88]:
#Add the new column "Emotion_num" which gives a unique number to each of these Emotions
#joy --> 0, fear --> 1, anger --> 2
df['Emotion_num'] = df['label'].map(
{
    'joy':0,
    'fear':1,
    'anger':2,
}
)

#checking the results by printing top 5 rows
df.head()

| | text | label | Emotion_num |
|---|---|---|---|
| 1971 | i was feeling more optimistic with blue skies ... | joy | 0 |
| 3538 | i feel more reassured now | joy | 0 |
| 7931 | i want to shout say something dont just smile ... | joy | 0 |
| 8849 | i feel extremely lucky and blessed to work wit... | joy | 0 |
| 5945 | i didn t feel terrific | joy | 0 |


In [89]:
df.Emotion_num.value_counts()

Emotion_num
0    2000
1    2000
2    2000
Name: count, dtype: int64

### **Modelling without Pre-processing Text data**

In [90]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
X_train, X_test, y_train, y_test = train_test_split(
    df.text, 
    df.Emotion_num,
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.Emotion_num
)
#Note: Give Random state 2022 and also do the stratify sampling



In [91]:
#print the shapes of X_train and X_test
print(X_train.shape,X_test.shape)


(4800,) (1200,)
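
To see what `stratify` does, the following sketch splits a small made-up, balanced label list and counts the classes on each side. The toy texts and labels are assumptions for illustration; only `train_test_split` and its parameters come from the cell above.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# 30 made-up comments, 10 per class.
texts = [f"comment {i}" for i in range(30)]
labels = [0] * 10 + [1] * 10 + [2] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=2022,
    stratify=labels,  # keep the class proportions equal in both parts
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

With the real data the same mechanism is why each class has a support of 400 in the classification reports: 20% of 2000 comments per class.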



**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only trigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [104]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',CountVectorizer(ngram_range=(3,3))),
    ('rf',RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.58      0.07      0.12       400
           1       0.86      0.20      0.32       400
           2       0.36      0.95      0.52       400

    accuracy                           0.41      1200
   macro avg       0.60      0.40      0.32      1200
weighted avg       0.60      0.41      0.32      1200
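
Attempt 1 scores far lower than the other attempts (0.41 accuracy, with almost everything predicted as class 2). One plausible reason, stated here as an assumption and not verified on this dataset, is sparsity: most trigrams in the test comments never occur in the training comments, so the model has little to go on. The sketch below measures that effect on made-up sentences; `ngram_set` and `unseen_fraction` are hypothetical helpers using whitespace tokenization.

```python
def ngram_set(texts, n):
    """Collect the distinct word n-grams found in a list of texts."""
    grams = set()
    for text in texts:
        tokens = text.lower().split()
        grams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def unseen_fraction(train_texts, test_texts, n):
    """Fraction of distinct test n-grams that never appear in training."""
    train = ngram_set(train_texts, n)
    test = ngram_set(test_texts, n)
    if not test:
        return 0.0
    return len(test - train) / len(test)


train_texts = ["i feel so happy today", "i feel so angry today"]
test_texts = ["i feel so scared today"]

unseen_uni = unseen_fraction(train_texts, test_texts, 1)
unseen_tri = unseen_fraction(train_texts, test_texts, 3)
print(f"unseen unigrams: {unseen_uni:.2f}, unseen trigrams: {unseen_tri:.2f}")
```

Running the same helpers on `X_train` and `X_test` would show whether this explanation holds for the real data.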




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.


In [93]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',CountVectorizer(ngram_range=(1,2))),
    ('nb',MultinomialNB())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.87      0.84      0.85       400
           1       0.82      0.85      0.84       400
           2       0.84      0.84      0.84       400

    accuracy                           0.84      1200
   macro avg       0.84      0.84      0.84      1200
weighted avg       0.84      0.84      0.84      1200
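
Because the vectorizer is part of the pipeline, a fitted pipeline accepts raw strings directly. The sketch below fits the same kind of pipeline on six made-up comments and maps the numeric prediction back to an emotion name. The toy data and the `num_to_emotion` dictionary are assumptions for illustration, following the notebook's joy/fear/anger numbering.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training comments; labels follow joy --> 0, fear --> 1, anger --> 2.
toy_texts = [
    "i feel happy and joyful",
    "i am so glad today",
    "i am scared of the dark",
    "this is terrifying and i feel afraid",
    "i am furious at him",
    "this makes me so angry",
]
toy_labels = [0, 0, 1, 1, 2, 2]
num_to_emotion = {0: "joy", 1: "fear", 2: "anger"}

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])
toy_clf.fit(toy_texts, toy_labels)

# Predict on a raw string; the pipeline vectorizes it before classifying.
pred = toy_clf.predict(["i feel so angry and furious"])[0]
predicted_emotion = num_to_emotion[int(pred)]
print(predicted_emotion)
```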




**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [111]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',CountVectorizer(ngram_range=(1,2))),
    ('rf',RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.93      0.94      0.94       400
           1       0.93      0.94      0.93       400
           2       0.93      0.92      0.92       400

    accuracy                           0.93      1200
   macro avg       0.93      0.93      0.93      1200
weighted avg       0.93      0.93      0.93      1200




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use **TF-IDF vectorizer** for representing the text.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [109]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer


#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',TfidfVectorizer()),
    ('rf',RandomForestClassifier())
])

#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.94      0.93       400
           1       0.94      0.92      0.93       400
           2       0.91      0.90      0.90       400

    accuracy                           0.92      1200
   macro avg       0.92      0.92      0.92      1200
weighted avg       0.92      0.92      0.92      1200
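
For reference, the weighting that `TfidfVectorizer` applies with its default settings can be sketched by hand: the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each document vector is scaled to unit L2 length. The code below is a simplified illustration on three made-up comments, using whitespace tokenization instead of sklearn's tokenizer; the helper names are hypothetical.

```python
import math
from collections import Counter

docs = ["i feel happy", "i feel angry", "i am scared"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency (sklearn's default formula)."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1


def tfidf_vector(tokens):
    """TF-IDF weights for one document, normalized to unit L2 length."""
    tf = Counter(tokens)
    raw = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

The term that appears in every document ("i") gets the lowest weight, while the rarer, more distinctive term ("happy") gets the highest.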



<h3>Use text pre-processing to remove stop words and punctuation, and apply lemmatization</h3>

In [96]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and punctuation, and lemmatize the remaining tokens
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [97]:
# create a new column "preprocessed_txt" and use the utility function above to get the clean data
# this will take some time, please be patient

df['preprocessed_txt'] = df['text'].apply(preprocess) 

**Build a model with pre-processed text**

In [98]:
#Do the 'train-test' splitting with test size of 20%, random state of 2022 and stratified sampling
#Note: Use the preprocessed_txt column
X_train, X_test, y_train, y_test = train_test_split(
    df.preprocessed_txt, 
    df.Emotion_num,
    test_size=0.2, # 20% samples will go to test dataset
    random_state=2022,
    stratify=df.Emotion_num
)

**Let's check the scores with our best model so far**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [106]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',CountVectorizer(ngram_range=(1,2))),
    ('rf',RandomForestClassifier())
])


#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93       400
           1       0.93      0.94      0.93       400
           2       0.93      0.90      0.92       400

    accuracy                           0.93      1200
   macro avg       0.93      0.93      0.93      1200
weighted avg       0.93      0.93      0.93      1200




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use **TF-IDF vectorizer** for representing the text.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [108]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer',TfidfVectorizer()),
    ('rf',RandomForestClassifier())
])

#2. fit with X_train and y_train
clf.fit(X_train,y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.92      0.94      0.93       400
           1       0.94      0.92      0.93       400
           2       0.91      0.91      0.91       400

    accuracy                           0.93      1200
   macro avg       0.93      0.92      0.92      1200
weighted avg       0.93      0.93      0.92      1200



**Dr. Rami, I have a question: the solution says that preprocessing changes the accuracy very significantly, but I have run the model multiple times, and the results before and after preprocessing are almost equal.**
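
One way to check this more carefully is to compare the two setups with cross-validation and a fixed seed, so that run-to-run noise does not hide or fake a difference. `RandomForestClassifier()` is created without a `random_state` in the cells above, so its scores vary slightly between runs. The sketch below is a minimal illustration under those assumptions; `cv_accuracy` is a hypothetical helper, and the toy data only shows that it runs.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def cv_accuracy(texts, labels, n_splits=5, seed=2022):
    """Mean and standard deviation of accuracy over stratified folds."""
    model = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("rf", RandomForestClassifier(random_state=seed)),  # fixed seed
    ])
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, texts, labels, cv=folds, scoring="accuracy")
    return scores.mean(), scores.std()


# Made-up data, only to show the call; results here say nothing about the real dataset.
toy_texts = ["happy glad joy"] * 6 + ["scared afraid fear"] * 6 + ["angry furious mad"] * 6
toy_labels = [0] * 6 + [1] * 6 + [2] * 6
toy_mean, toy_std = cv_accuracy(toy_texts, toy_labels, n_splits=3)
print(f"accuracy: {toy_mean:.2f} +/- {toy_std:.2f}")

# With the notebook's dataframe, the comparison would be:
# raw_mean, raw_std = cv_accuracy(df.text, df.Emotion_num)
# pre_mean, pre_std = cv_accuracy(df.preprocessed_txt, df.Emotion_num)
```

If the two means differ by less than about one standard deviation, the data gives no clear evidence that preprocessing helps or hurts, which would be consistent with the reports above (0.93 vs 0.93 with CountVectorizer, 0.92 vs 0.93 with TF-IDF).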

