### **TF-IDF: Exercises**
 
- Humans 👦 show different emotions and feelings depending on the situation, and communicate them through facial expressions or words.

- On social media such as Twitter and Instagram, many people comment on particular events, and these comments can express feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment, we are going to use classical NLP techniques to classify which emotion that comment expresses.

- We are going to use text representations such as bag of words, bag of n-grams, and TF-IDF, and apply different classification algorithms.
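
To make the bag-of-n-grams idea concrete, here is a minimal sketch in plain Python. The helper names `ngrams` and `bag_of_ngrams` are illustrative and not part of the exercises, and whitespace tokenization is a simplifying assumption: scikit-learn's `CountVectorizer`, used later, tokenizes differently (by default it drops single-character tokens such as `i`).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, n_min=1, n_max=2):
    """Count every n-gram with n_min <= n <= n_max (whitespace tokenization)."""
    tokens = text.lower().split()
    bag = Counter()
    for n in range(n_min, n_max + 1):
        bag.update(ngrams(tokens, n))
    return bag


# Unigrams + bigrams versus trigrams only, for one short comment
unigrams_bigrams = bag_of_ngrams("i am feeling grouchy", 1, 2)
trigrams_only = bag_of_ngrams("i am feeling grouchy", 3, 3)

print(unigrams_bigrams)
print(trigrams_only)
```

A four-word comment yields seven unigram and bigram features but only two trigrams, which hints at why a trigram-only representation can be very sparse on short texts.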

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns (loaded below as `tweet` and `emotion`).
        - Comment
        - Emotion
- Comments are statements or messages about a particular event or situation.

- The Emotion feature tells which emotion the given comment expresses: joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are six classes, this problem comes under **Multi-Class Classification.**

In [10]:
#import pandas library
import pandas as pd

#read the train, test and validation files (semicolon-separated, no header row) into dataframes
traindf = pd.read_csv('train.txt',header=None, names=['tweet','emotion'],sep=';')
testdf = pd.read_csv('test.txt',header=None, names=['tweet','emotion'],sep=';')
valdf = pd.read_csv('val.txt',header=None, names=['tweet','emotion'],sep=';')

#print the shape of dataframe
print(traindf.shape)

#print top 5 rows
traindf.head()

(16000, 2)


| | tweet | emotion |
|---|---|---|
| 0 | i didnt feel humiliated | sadness |
| 1 | i can go from feeling so hopeless to so damned... | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | love |
| 4 | i am feeling grouchy | anger |


In [13]:
#check the distribution of Emotion
traindf['emotion'].value_counts().sort_values(ascending=False)

joy         5362
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: emotion, dtype: int64
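
The classes are clearly imbalanced, so it helps to know what a trivial model would score. The sketch below uses the counts printed above to compute each class share and the accuracy of always predicting the most frequent class on the training set; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {
    "joy": 5362,
    "sadness": 4666,
    "anger": 2159,
    "fear": 1937,
    "love": 1304,
    "surprise": 572,
}

total = sum(counts.values())
shares = {emotion: n / total for emotion, n in counts.items()}

# A classifier that always predicts the majority class gets this accuracy
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

for emotion, share in shares.items():
    print(f"{emotion:<9}{share:.3f}")
print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this baseline, and with such imbalance the macro-averaged scores are worth reading alongside accuracy.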

In [16]:
#Add the new column "emotion_class" which gives a unique number to each of these emotions
#joy --> 0, fear --> 1, anger --> 2, sadness --> 3, love --> 4, surprise --> 5
class_dic = {
    'joy': 0,
    'fear':1,
    'anger':2,
    'sadness':3,
    'love':4,
    'surprise':5
}

traindf['emotion_class'] = traindf['emotion'].map(class_dic)
#checking the results by printing top 5 rows
traindf.head()

| | tweet | emotion | emotion_class |
|---|---|---|---|
| 0 | i didnt feel humiliated | sadness | 3 |
| 1 | i can go from feeling so hopeless to so damned... | sadness | 3 |
| 2 | im grabbing a minute to post i feel greedy wrong | anger | 2 |
| 3 | i am ever feeling nostalgic about the fireplac... | love | 4 |
| 4 | i am feeling grouchy | anger | 2 |


In [17]:
testdf['emotion_class'] = testdf['emotion'].map(class_dic)
valdf['emotion_class'] = valdf['emotion'].map(class_dic)

traindf = traindf.drop('emotion',axis=1)
testdf = testdf.drop('emotion',axis=1)
valdf = valdf.drop('emotion',axis=1)
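
Once the `emotion` column is dropped, only the numeric ids remain, and the classification reports in this notebook show those ids rather than names. A small reverse lookup can make them easier to read. This is a sketch; `id_to_emotion` and `target_names` are names introduced here, not part of the original notebook.

```python
# Same mapping as in the cell above
class_dic = {
    'joy': 0,
    'fear': 1,
    'anger': 2,
    'sadness': 3,
    'love': 4,
    'surprise': 5
}

# Reverse lookup: numeric id -> emotion name
id_to_emotion = {idx: name for name, idx in class_dic.items()}

# Names ordered by numeric id, matching the row order of the reports
target_names = [id_to_emotion[idx] for idx in sorted(id_to_emotion)]

print(id_to_emotion[3])
print(target_names)
```

If desired, `target_names` could be passed to `classification_report(..., target_names=target_names)` so the report rows show emotion names instead of ids.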

### **Modelling without Pre-processing Text data** (the train-test split below is not necessary, because the data now comes already split from Kaggle)

In [25]:
#import train-test split
from sklearn.model_selection import train_test_split

#Do the 'train-test' splitting with test size of 20%
#Note: Give Random state 2022 and also do the stratify sampling

tempX_1, tempX_2, tempy_1, tempy_2 = train_test_split(traindf['tweet'],traindf['emotion_class'],test_size=0.2,random_state=2022,stratify=traindf['emotion_class'])

print(tempX_1.shape)

tempy_1.value_counts()


(12800,)


0    4290
3    3733
2    1727
1    1550
4    1043
5     457
Name: emotion_class, dtype: int64

In [26]:
#print the shape and class distribution of the held-out 20% split

print(tempX_2.shape)

tempy_2.value_counts()

(3200,)


0    1072
3     933
2     432
1     387
4     261
5     115
Name: emotion_class, dtype: int64
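
A quick way to confirm that stratified sampling did its job is to compare class shares in the two parts of the split. The sketch below does that with the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the two value_counts() outputs above
train_counts = {0: 4290, 3: 3733, 2: 1727, 1: 1550, 4: 1043, 5: 457}
holdout_counts = {0: 1072, 3: 933, 2: 432, 1: 387, 4: 261, 5: 115}

train_total = sum(train_counts.values())
holdout_total = sum(holdout_counts.values())

# Compare the share of each class in both parts of the split
max_gap = 0.0
for label in sorted(train_counts):
    train_share = train_counts[label] / train_total
    holdout_share = holdout_counts[label] / holdout_total
    max_gap = max(max_gap, abs(train_share - holdout_share))
    print(f"class {label}: train={train_share:.4f}  holdout={holdout_share:.4f}")

print(f"largest gap: {max_gap:.4f}")
```

The shares agree to within a fraction of a percentage point, which is what `stratify=` is meant to guarantee.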


**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with only trigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [31]:
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

#1. create a pipeline object

pipe_rf = Pipeline([
    ('countvec',CountVectorizer(ngram_range=(3,3))),
    ('rf',RandomForestClassifier())
])


#2. fit with X_train and y_train

pipe_rf.fit(traindf['tweet'],traindf['emotion_class'])

#3. get the predictions for X_test and store it in y_pred

y_pred = pipe_rf.predict(testdf['tweet'])




In [32]:
#4. print the classification report
print(classification_report(testdf['emotion_class'],y_pred))

              precision    recall  f1-score   support

           0       0.43      0.87      0.58       695
           1       0.50      0.22      0.30       224
           2       0.50      0.19      0.28       275
           3       0.62      0.34      0.44       581
           4       0.53      0.15      0.24       159
           5       0.45      0.14      0.21        66

    accuracy                           0.47      2000
   macro avg       0.50      0.32      0.34      2000
weighted avg       0.51      0.47      0.43      2000




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **Multinomial Naive Bayes** as the classifier.
- print the classification report.


In [34]:
#import MultinomialNB from sklearn
from sklearn.naive_bayes import MultinomialNB


#1. create a pipeline object
pipe_naive = Pipeline([
    ('counv',CountVectorizer(ngram_range=(1,2))),
    ('nb',MultinomialNB())
])


#2. fit with X_train and y_train
pipe_naive.fit(traindf['tweet'],traindf['emotion_class'])


#3. get the predictions for X_test and store it in y_pred
y_pred = pipe_naive.predict(testdf['tweet'])

#4. print the classification report
print(classification_report(testdf['emotion_class'],y_pred))

              precision    recall  f1-score   support

           0       0.64      0.96      0.77       695
           1       0.90      0.35      0.50       224
           2       0.95      0.28      0.44       275
           3       0.67      0.89      0.76       581
           4       0.80      0.05      0.09       159
           5       0.00      0.00      0.00        66

    accuracy                           0.68      2000
   macro avg       0.66      0.42      0.43      2000
weighted avg       0.71      0.68      0.61      2000
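
The all-zero row for class 5 most likely means the model never predicted that class on the test set. In that situation precision is undefined, and scikit-learn reports it as 0 and emits an `UndefinedMetricWarning`. The sketch below reproduces the situation on hypothetical toy labels (not taken from the dataset) and uses the `zero_division` parameter of `classification_report` to state the fallback value explicitly.

```python
from sklearn.metrics import classification_report

# Toy labels (hypothetical, not from the dataset): class 2 is never predicted
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 0, 1]

# zero_division=0 reports undefined precision as 0 without raising a warning
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)

print(report["2"])
```

Silencing the warning does not fix the underlying problem: the classifier still ignores the rarest class, which is why the macro averages sit far below the weighted ones here.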






**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with both unigrams and bigrams.
- use **RandomForest** as the classifier.
- print the classification report.


In [36]:
#1. create a pipeline object
pipe_rf2 = Pipeline([
    ('cVec',CountVectorizer(ngram_range=(1,2))),
    ('rf2',RandomForestClassifier())
])



#2. fit with X_train and y_train
pipe_rf2.fit(traindf['tweet'],traindf['emotion_class'])


#3. get the predictions for X_test and store it in y_pred
y_pred = pipe_rf2.predict(testdf['tweet'])


#4. print the classification report
print(classification_report(testdf['emotion_class'],y_pred))

              precision    recall  f1-score   support

           0       0.78      0.97      0.87       695
           1       0.91      0.78      0.84       224
           2       0.91      0.79      0.84       275
           3       0.92      0.89      0.90       581
           4       0.86      0.56      0.68       159
           5       0.75      0.55      0.63        66

    accuracy                           0.85      2000
   macro avg       0.85      0.76      0.79      2000
weighted avg       0.86      0.85      0.85      2000



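**Attempt 3 (with grid search)** :

The same pipeline as above, with hyperparameters tuned through a cross-validated grid search. It takes a long time to run, so the code is wrapped in a triple-quoted string; remove the surrounding `'''` lines if you want to run it.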

In [None]:
'''
from sklearn.model_selection import GridSearchCV

# Create the pipeline object
pipe_rf2 = Pipeline([
    ('cVec', CountVectorizer(ngram_range=(1,2))),
    ('rf2', RandomForestClassifier())
])

# Define parameter grid for GridSearch
param_grid = {
    'rf2__n_estimators': [100, 200],
    'rf2__max_depth': [None, 10, 20],
    'rf2__min_samples_split': [2, 5]
}

# Set up GridSearchCV
grid_search = GridSearchCV(pipe_rf2, param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# Fit GridSearchCV with the training data
grid_search.fit(traindf['tweet'], traindf['emotion_class'])

# Get the best estimator
best_pipe_rf2 = grid_search.best_estimator_

# Get the predictions for the test set and store it in y_pred
y_pred = best_pipe_rf2.predict(testdf['tweet'])

# Print the classification report
print("Test Set Classification Report")
print(classification_report(testdf['emotion_class'], y_pred))
'''
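
To see why the grid search is slow, it helps to count how many times the pipeline is trained. The sketch below enumerates the grid from the cell above in plain Python; it assumes the default `refit=True` of `GridSearchCV`, which adds one final fit on the full training data.

```python
from itertools import product

# Same grid and fold count as in the GridSearchCV cell above
param_grid = {
    'rf2__n_estimators': [100, 200],
    'rf2__max_depth': [None, 10, 20],
    'rf2__min_samples_split': [2, 5]
}
cv_folds = 5

# Every combination of parameter values is one candidate model
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per fold, plus one refit of the best candidate
n_fits = n_candidates * cv_folds + 1

print(f"{n_candidates} candidates x {cv_folds} folds + 1 refit = {n_fits} fits")
```

Each of those fits vectorizes the text into unigrams and bigrams and trains a full random forest, so shrinking the grid or the number of folds is the most direct way to reduce the run time.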


**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use **TF-IDF vectorizer** to represent the text.
- use **RandomForest** as the classifier.
- print the classification report.


In [41]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer


#1. create a pipeline object
pipe_tfidf = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('rf3',RandomForestClassifier())
])


#2. fit with X_train and y_train
pipe_tfidf.fit(traindf['tweet'],traindf['emotion_class'])


#3. get the predictions for X_test and store it in y_pred
y_pred = pipe_tfidf.predict(testdf['tweet'])

#4. print the classification report
print(classification_report(testdf['emotion_class'],y_pred))

              precision    recall  f1-score   support

           0       0.83      0.94      0.88       695
           1       0.85      0.84      0.85       224
           2       0.91      0.83      0.86       275
           3       0.93      0.89      0.91       581
           4       0.80      0.65      0.72       159
           5       0.66      0.56      0.61        66

    accuracy                           0.86      2000
   macro avg       0.83      0.78      0.80      2000
weighted avg       0.86      0.86      0.86      2000
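
To show what the TF-IDF weighting does, here is a hand-rolled sketch in plain Python. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which to my understanding matches the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`). The toy corpus and whitespace tokenization are assumptions made for illustration, so the numbers will not match the real vectorizer exactly.

```python
import math
from collections import Counter

# Toy corpus (illustrative, loosely modelled on the dataset's style)
docs = [
    "i am feeling grouchy",
    "i am feeling happy",
    "i feel greedy",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    """Term frequency times idf, scaled to unit L2 norm."""
    tf = Counter(tokens)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term:<8}{weight:.3f}")
```

The word `grouchy`, which occurs in only one document, receives the largest weight, while `i`, which occurs in every document, receives the smallest. That down-weighting of ubiquitous words is the main difference from raw counts.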



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [42]:
import spacy

# load english language model and create nlp object from it
nlp = spacy.load("en_core_web_sm") 


#use this utility function to get the preprocessed text data
def preprocess(text):
    # remove stop words and lemmatize the text
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    
    return " ".join(filtered_tokens) 

In [43]:
# create a new column "preprocessed_comment" and use the utility function above to get the clean data
# this will take some time, please be patient

traindf['preprocessed_comment'] = traindf['tweet'].apply(preprocess)
testdf['preprocessed_comment'] = testdf['tweet'].apply(preprocess)

**Build a model with pre processed text**

**Let's check the scores with our best model till now**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use CountVectorizer with its default settings (unigrams only).
- use **RandomForest** as the classifier.
- print the classification report.


In [44]:
#1. create a pipeline object

final_pipe_bow = Pipeline([
    ('cvec',CountVectorizer()),
    ('rf4',RandomForestClassifier())
])


#2. fit with X_train and y_train
final_pipe_bow.fit(traindf['preprocessed_comment'],traindf['emotion_class'])


#3. get the predictions for X_test and store it in y_pred
y_pred = final_pipe_bow.predict(testdf['preprocessed_comment'])


#4. print the classification report
print(classification_report(testdf['emotion_class'],y_pred))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89       695
           1       0.86      0.84      0.85       224
           2       0.86      0.87      0.86       275
           3       0.90      0.90      0.90       581
           4       0.70      0.72      0.71       159
           5       0.61      0.62      0.62        66

    accuracy                           0.86      2000
   macro avg       0.80      0.81      0.80      2000
weighted avg       0.86      0.86      0.86      2000




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- use **TF-IDF vectorizer** to represent the pre-processed text.
- use **RandomForest** as the classifier.
- print the classification report.


In [47]:
#1. create a pipeline object
final_pipe_tfidf = Pipeline([
    ('tfidf',TfidfVectorizer()),
    ('rf5',RandomForestClassifier())
])


#2. fit with X_train and y_train
final_pipe_tfidf.fit(traindf['preprocessed_comment'],traindf['emotion_class'])


#3. get the predictions for X_test and store it in y_pred
y_pred = final_pipe_tfidf.predict(testdf['preprocessed_comment'])


#4. print the classification report
print(classification_report(testdf['emotion_class'], y_pred))

              precision    recall  f1-score   support

           0       0.84      0.93      0.88       695
           1       0.84      0.84      0.84       224
           2       0.88      0.83      0.85       275
           3       0.91      0.89      0.90       581
           4       0.79      0.65      0.71       159
           5       0.65      0.59      0.62        66

    accuracy                           0.86      2000
   macro avg       0.82      0.79      0.80      2000
weighted avg       0.86      0.86      0.86      2000
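
To compare all attempts at a glance, the sketch below collects the accuracy and macro-averaged F1 from the classification reports in this notebook and ranks them by macro F1. The dictionary keys are descriptive labels introduced here.

```python
# (accuracy, macro F1) copied from the classification reports in this notebook
results = {
    "raw text, trigrams + RandomForest": (0.47, 0.34),
    "raw text, uni+bigrams + MultinomialNB": (0.68, 0.43),
    "raw text, uni+bigrams + RandomForest": (0.85, 0.79),
    "raw text, TF-IDF + RandomForest": (0.86, 0.80),
    "preprocessed, CountVectorizer + RandomForest": (0.86, 0.80),
    "preprocessed, TF-IDF + RandomForest": (0.86, 0.80),
}

# Rank by macro F1, which treats the rare classes as equally important
ranked = sorted(results.items(), key=lambda item: item[1][1], reverse=True)

for name, (accuracy, macro_f1) in ranked:
    print(f"{name:<46} acc={accuracy:.2f}  macro-F1={macro_f1:.2f}")
```

At two-decimal precision the last three configurations are tied, so in these runs the spaCy pre-processing brings no measurable gain in accuracy or macro F1 over TF-IDF on the raw text. Since `RandomForestClassifier()` is created without a fixed `random_state`, the exact figures will vary slightly between runs.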

