### **TF-IDF: Exercises**

- Humans 👦 show different emotions depending on the situation and communicate them through facial expressions or words.

- On social media platforms such as Twitter and Instagram, many people express their views on a particular event or scenario through comments, and these comments can convey feelings such as sadness, happiness, joy, sarcasm, fear, and many others.

- For a given comment/text, we are going to use classical NLP techniques to classify which emotion that comment expresses!

- We are going to use techniques such as bag of words, bag of n-grams, and TF-IDF for text representation and apply different classification algorithms.
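
To make the text-representation step concrete, here is a minimal sketch of how `CountVectorizer` builds a bag-of-words vocabulary and how adding bigrams enlarges it. The two-sentence corpus is invented for illustration and is not taken from the dataset. Note that the default tokenizer drops single-character tokens such as "i".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not taken from the dataset)
corpus = [
    "i feel happy today",
    "i feel sad today",
]

# Bag of words: unigrams only
bow = CountVectorizer(ngram_range=(1, 1))
bow_matrix = bow.fit_transform(corpus)
unigram_vocab = sorted(bow.vocabulary_)

# Bag of n-grams: unigrams plus bigrams
bigram = CountVectorizer(ngram_range=(1, 2))
bigram.fit(corpus)
bigram_vocab = sorted(bigram.vocabulary_)

print(unigram_vocab)         # one column per distinct word
print(bow_matrix.toarray())  # one row per document, raw term counts
print(bigram_vocab)          # words plus adjacent word pairs
```

Each document becomes a row of counts over the vocabulary. Widening `ngram_range` keeps some word order at the cost of a larger, sparser feature space.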

### **About Data: Emotion Detection**

Credits: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp


- This data consists of two columns.
    - Comment
    - Emotion
- Comments are statements or messages about a particular event/situation.

- The Emotion feature tells whether the given comment expresses joy 😂, sadness, anger 😡, fear 😨, love, or surprise.

- As there are 6 classes, this problem comes under **Multi-Class Classification.**

In [2]:
!kaggle datasets download -d praveengovi/emotions-dataset-for-nlp

import zipfile

# Specify the path to the zip file and the destination directory
zip_file_path = '/content/emotions-dataset-for-nlp.zip'
extract_dir = '/content'

# Open the zip file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Extract all the contents into the specified directory
    zip_ref.extractall(extract_dir)

print(f"Extracted all files to {extract_dir}")

Dataset URL: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp
License(s): CC-BY-SA-4.0
emotions-dataset-for-nlp.zip: Skipping, found more recently modified local copy (use --force to force download)
Extracted all files to /content


In [6]:
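import pandas as pd

# train.txt has no header row, so the first record is read as the header here
# and lost when the columns are renamed; pass header=None,
# names=['text', 'emotion'] to read_csv if you want to keep it.
train_data = pd.read_csv('train.txt', delimiter=';')
train_data.columns = ['text' , 'emotion']
train_data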

| | text | emotion |
|---|---|---|
| 0 | i can go from feeling so hopeless to so damned... | sadness |
| 1 | im grabbing a minute to post i feel greedy wrong | anger |
| 2 | i am ever feeling nostalgic about the fireplac... | love |
| 3 | i am feeling grouchy | anger |
| 4 | ive been feeling a little burdened lately wasn... | sadness |
| ... | ... | ... |
| 15994 | i just had a very brief time in the beanbag an... | sadness |
| 15995 | i am now turning and i feel pathetic that i am... | sadness |
| 15996 | i feel strong and good overall | joy |
| 15997 | i feel like this was such a rude comment and i... | anger |


In [7]:
#check the distribution of Emotion
train_data['emotion'].value_counts()

| emotion | count |
|---|---|
| joy | 5362 |
| sadness | 4665 |
| anger | 2159 |
| fear | 1937 |
| love | 1304 |
| surprise | 572 |


In [8]:


train_data['emotion_num'] = train_data['emotion'].map({
    'joy' : 0,
    'sadness' : 1,
    'anger' : 2,
    'fear' : 3,
    'love' : 4,
    'surprise' : 5,
})
#checking the results by printing top 5 rows
train_data.head()


| | text | emotion | emotion_num |
|---|---|---|---|
| 0 | i can go from feeling so hopeless to so damned... | sadness | 1 |
| 1 | im grabbing a minute to post i feel greedy wrong | anger | 2 |
| 2 | i am ever feeling nostalgic about the fireplac... | love | 4 |
| 3 | i am feeling grouchy | anger | 2 |
| 4 | ive been feeling a little burdened lately wasn... | sadness | 1 |


### **Modelling without Pre-processing Text data**

In [9]:
# The dataset ships a separate validation file (val.txt), so the train-test split below is not needed.
# #import train-test split
# from sklearn.model_selection import train_test_split
# #Do the 'train-test' splitting with test size of 20%
# #Note: Give Random state 2022 and also do the stratify sampling
# X_train , X_val , y_train , y_val = train_test_split(train_data['text'] , train_data['emotion_num'] , test_size=0.2 , random_state = 2022 , stratify=train_data['emotion_num'])

val_data = pd.read_csv('val.txt', delimiter=';')
val_data.columns = ['text' , 'emotion']
val_data['emotion_num'] = val_data['emotion'].map({
    'joy' : 0,
    'sadness' : 1,
    'anger' : 2,
    'fear' : 3,
    'love' : 4,
    'surprise' : 5,
})
#checking the results by printing top 5 rows
val_data.head()

| | text | emotion | emotion_num |
|---|---|---|---|
| 0 | i feel like i am still looking at a blank canv... | sadness | 1 |
| 1 | i feel like a faithful servant | love | 4 |
| 2 | i am just feeling cranky and blue | anger | 2 |
| 3 | i can have for a treat or if i am feeling festive | joy | 0 |
| 4 | i start to feel more appreciative of what god ... | joy | 0 |


In [11]:
#print the shapes of X_train and X_val
X_train = train_data['text']
y_train = train_data['emotion_num']
print('x_train shape: ' , X_train.shape)

X_val = val_data['text']
y_val = val_data['emotion_num']
print('x_val shape: ' , X_val.shape)


x_train shape:  (15999,)
x_val shape:  (1999,)



**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with only unigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

#1. create a pipeline object (ngram_range=(1, 1) keeps unigrams only)
pipe = Pipeline([('v', CountVectorizer(ngram_range=(1, 1))),
                 ('rf_clf', RandomForestClassifier())])

#2. fit with X_train and y_train
pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store them in y_pred
y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.95      0.90       704
           1       0.91      0.90      0.90       549
           2       0.89      0.83      0.86       275
           3       0.83      0.81      0.82       212
           4       0.88      0.73      0.80       178
           5       0.83      0.67      0.74        81

    accuracy                           0.87      1999
   macro avg       0.87      0.81      0.84      1999
weighted avg       0.87      0.87      0.87      1999
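
The two summary rows of the report differ because the classes are imbalanced. As a rough check, the sketch below recomputes both averages from the per-class F1 scores and supports printed in the report above. Because the inputs are already rounded to two decimals, this reproduces the 0.84 and 0.87 shown here but is not guaranteed to match scikit-learn exactly in general.

```python
# Per-class F1 scores and supports copied from the report above (classes 0-5)
f1_scores = [0.90, 0.90, 0.86, 0.82, 0.80, 0.74]
supports = [704, 549, 275, 212, 178, 81]

# Macro average: plain mean, so every class counts equally
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: mean weighted by the number of samples in each class
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f"macro F1:    {macro_f1:.2f}")
print(f"weighted F1: {weighted_f1:.2f}")
```

The macro average is lower because the small classes (4 and 5) score worst and carry the same weight as the large ones, which makes it the more informative number for spotting weak minority classes.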




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **Multinomial Naive Bayes** as the classifier.
- Print the classification report.


In [None]:
#import MultinomialNB from sklearn



#1. create a pipeline object



#2. fit with X_train and y_train



#3. get the predictions for X_val and store them in y_pred


#4. print the classification report
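
As a hint for the cell above, here is a minimal sketch of the same four steps on a tiny invented dataset. The `toy_*` names, the step names inside the pipeline, and the sentences are all hypothetical; in the notebook you would use `X_train`, `y_train`, `X_val` and `y_val` instead. This is one possible shape for the answer, not the reference solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data invented for illustration: 0 = joy, 1 = sadness
toy_train_text = [
    "i feel happy and cheerful",
    "what a joyful happy day",
    "i feel sad and lonely",
    "such a sad gloomy day",
]
toy_train_labels = [0, 0, 1, 1]
toy_val_text = ["happy joyful morning", "sad lonely evening"]
toy_val_labels = [0, 1]

# 1. create a pipeline object: unigram + bigram counts, then Naive Bayes
toy_pipe = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('nb_clf', MultinomialNB()),
])

# 2. fit on the training split
toy_pipe.fit(toy_train_text, toy_train_labels)

# 3. predict on the validation split
toy_pred = toy_pipe.predict(toy_val_text)

# 4. print the classification report
print(classification_report(toy_val_labels, toy_pred, zero_division=0))
```

Multinomial Naive Bayes works directly on the non-negative count features that `CountVectorizer` produces, and it trains much faster than a random forest on a vocabulary of this kind.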



**Attempt 3** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [13]:
#1. create a pipeline object (ngram_range=(1, 2) uses unigrams and bigrams)
pipe = Pipeline([('v', CountVectorizer(ngram_range=(1, 2))),
                 ('rf_clf', RandomForestClassifier())])

#2. fit with X_train and y_train
pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store them in y_pred
y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

           0       0.81      0.97      0.88       704
           1       0.91      0.90      0.90       549
           2       0.91      0.80      0.85       275
           3       0.89      0.72      0.80       212
           4       0.89      0.70      0.78       178
           5       0.85      0.64      0.73        81

    accuracy                           0.86      1999
   macro avg       0.88      0.79      0.83      1999
weighted avg       0.87      0.86      0.86      1999




**Attempt 4** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to turn the text into features.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [14]:
#import TfidfVectorizer from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

#1. create a pipeline object
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('rf_clf', RandomForestClassifier())])

#2. fit with X_train and y_train
pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store them in y_pred
y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val, y_pred))




              precision    recall  f1-score   support

           0       0.84      0.95      0.90       704
           1       0.91      0.89      0.90       549
           2       0.90      0.82      0.86       275
           3       0.84      0.80      0.82       212
           4       0.90      0.72      0.80       178
           5       0.84      0.70      0.77        81

    accuracy                           0.87      1999
   macro avg       0.87      0.82      0.84      1999
weighted avg       0.87      0.87      0.87      1999
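
To see what `TfidfVectorizer` computes with its default settings (`smooth_idf=True`, `norm='l2'`), the sketch below rebuilds the weights by hand on an invented three-sentence corpus and compares them with scikit-learn's output. With these defaults the inverse document frequency of a term t is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. Each row is then scaled to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration (not taken from the dataset)
corpus = [
    "i feel happy today",
    "i feel sad today",
    "happy happy day",
]

# scikit-learn's result with default settings
vectorizer = TfidfVectorizer()
sklearn_tfidf = vectorizer.fit_transform(corpus).toarray()

# Manual computation, starting from raw term counts
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed idf
weighted = counts * idf                           # tf * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Terms that occur in many documents (here "feel" and "today") get a smaller idf than terms that occur in only one, so they contribute less to each document vector.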



### **Use text pre-processing to remove stop words and punctuation, and apply lemmatization**

In [15]:
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
  doc = nlp(text)
  filtered_list = []
  for token in doc:
    if token.is_stop or token.is_punct:
      continue
    filtered_list.append(token.lemma_)
  return ' '.join(filtered_list)

In [16]:
# create a new column "preprocessed_text" and use the utility function above to get the clean data
# this will take some time, please be patient

train_data['preprocessed_text'] = train_data['text'].apply(preprocess)
train_data['preprocessed_text']

| | preprocessed_text |
|---|---|
| 0 | feel hopeless damned hopeful care awake |
| 1 | m grab minute post feel greedy wrong |
| 2 | feel nostalgic fireplace know property |
| 3 | feel grouchy |
| 4 | ve feel little burden lately not sure |
| ... | ... |
| 15994 | brief time beanbag say anna feel like beat |
| 15995 | turn feel pathetic wait table sub teaching degree |
| 15996 | feel strong good overall |
| 15997 | feel like rude comment m glad t |


**Build a model with pre processed text**

In [17]:
from tqdm import tqdm
tqdm.pandas()
val_data['preprocessed_text'] = val_data['text'].progress_apply(preprocess)
val_data['preprocessed_text']


X_train = train_data['preprocessed_text']

X_val = val_data['preprocessed_text']


100%|██████████| 1999/1999 [00:20<00:00, 99.13it/s] 


**Let's check the scores with our best model till now**
- Random Forest

**Attempt 1** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use CountVectorizer with both unigrams and bigrams.
- Use **RandomForest** as the classifier.
- Print the classification report.




**Attempt 2** :

1. Using the sklearn pipeline module, create a classification pipeline to classify the data.

**Note:**
- Use the **TF-IDF vectorizer** to turn the pre-processed text into features.
- Use **RandomForest** as the classifier.
- Print the classification report.


In [19]:
#1. create a pipeline object
pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('rf_clf', RandomForestClassifier())])

#2. fit with X_train and y_train
pipe.fit(X_train, y_train)

#3. get the predictions for X_val and store them in y_pred
y_pred = pipe.predict(X_val)

#4. print the classification report
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.92      0.89       704
           1       0.89      0.89      0.89       549
           2       0.87      0.85      0.86       275
           3       0.83      0.84      0.84       212
           4       0.85      0.67      0.75       178
           5       0.81      0.75      0.78        81

    accuracy                           0.87      1999
   macro avg       0.85      0.82      0.84      1999
weighted avg       0.87      0.87      0.86      1999





## **Please write down Final Observations**


## [**Solution**](./tf_idf_exercise_solutions.ipynb)