![pipeline NLP](https://shubhangidabral13.github.io/Bits-and-Bytes-of-NLP/images/copied_from_nb/my_icons/topic_02.a.1.png)

**The preprocessing step**

![](https://miro.medium.com/max/1400/1*pzjECYWP8WOWhwfCjebZVw.png)

# Import needed libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

import spacy



# **EDA**

1. Data Acquisition

In [2]:
#read data
df=pd.read_csv("/kaggle/input/emotion-classification-nlp/emotion-labels-train.csv")

#show the shape
print(df.shape)

#show data
df.head(9)

(3613, 2)


Unnamed: 0,text,label
0,Just got back from seeing @GaryDelaney in Burs...,joy
1,Oh dear an evening of absolute hilarity I don'...,joy
2,Been waiting all week for this game ❤️❤️❤️ #ch...,joy
3,"@gardiner_love : Thank you so much, Gloria! Yo...",joy
4,I feel so blessed to work with the family that...,joy
5,"Today I reached 1000 subscribers on YT!! , #go...",joy
6,"@Singaholic121 Good morning, love! Happy first...",joy
7,#BridgetJonesBaby is the best thing I've seen ...,joy
8,Just got back from seeing @GaryDelaney in Burs...,joy


In [3]:
# check whether the class distribution is balanced
df["label"].value_counts()

label
fear       1147
anger       857
joy         823
sadness     786
Name: count, dtype: int64

**The data is imbalanced: fear has 1147 samples, while the other emotions have between 786 and 857.**

**So we will drop some of the fear rows.**

In [4]:
df.head(900)

Unnamed: 0,text,label
0,Just got back from seeing @GaryDelaney in Burs...,joy
1,Oh dear an evening of absolute hilarity I don'...,joy
2,Been waiting all week for this game ❤️❤️❤️ #ch...,joy
3,"@gardiner_love : Thank you so much, Gloria! Yo...",joy
4,I feel so blessed to work with the family that...,joy
...,...,...
895,#WeirdWednesday OKAY! That jump-scared the #Po...,fear
896,@BBCPolitics @BBCNews I'd rather leave my chil...,fear
897,@OutdoorLoverz is this a bridge if I have to d...,fear
898,My roommate talks and laughs in her sleep. It ...,fear


In [5]:
n_df=df.loc[895:1185 , ['text','label']]

**We will drop these 291 fear rows (index labels 895 through 1185, inclusive).**

In [6]:
n_df['label'].value_counts()

label
fear    291
Name: count, dtype: int64

In [7]:
# drop the fear rows selected in n_df (index labels 895 through 1185)
df = df.drop(n_df.index)
df.shape

(3322, 2)

In [8]:
df['label'].value_counts()

label
anger      857
fear       856
joy        823
sadness    786
Name: count, dtype: int64
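
The hard-coded index range above works for this particular file. As an alternative, here is a minimal sketch of label-based downsampling that does not depend on row order. It is not the notebook's method: the `toy` frame and its values are made up for illustration, and it trims every class to the size of the smallest one, whereas the notebook only trims fear.

```python
import pandas as pd

# Toy frame standing in for the emotion dataset (illustrative values only)
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "label": ["fear"] * 5 + ["anger"] * 3 + ["joy"] * 2,
})

# Size of the smallest class; every class is downsampled to this many rows
n_min = toy["label"].value_counts().min()

balanced = (
    toy.groupby("label", group_keys=False)
       .sample(n=n_min, random_state=42)   # random sample within each class
       .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```

Sampling at random within each class also avoids always discarding the same contiguous block of rows.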

2. Text cleaning

In [9]:
#show sample of data
print(f"{df['text'][0]} --> {df['label'][0]}")

# the data is treated as already clean (mentions, hashtags and emoji are kept as-is)

Just got back from seeing @GaryDelaney in Burslem. AMAZING!! Face still hurts from laughing so much #hilarious --> joy
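
The sample still contains a Twitter handle and a hashtag, and later rows contain HTML entities such as `&amp;`. If one wanted to strip these before tokenization, a minimal sketch using only the standard library could look like the following. `clean_tweet` and the regular expressions are hypothetical helpers, not part of the original notebook, and the patterns are deliberately simple.

```python
import html
import re

MENTION_RE = re.compile(r"@\w+")       # Twitter-style handles such as @GaryDelaney
URL_RE = re.compile(r"https?://\S+")   # bare links
SPACE_RE = re.compile(r"\s+")          # runs of whitespace

def clean_tweet(text: str) -> str:
    """Hypothetical helper: strip handles, links and HTML entities from a tweet."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")        # keep the hashtag word, drop the symbol
    return SPACE_RE.sub(" ", text).strip()

print(clean_tweet("Just got back from seeing @GaryDelaney in Burslem. AMAZING!! #hilarious"))
```

Whether removing handles helps is an empirical question for this dataset; the sketch only shows how it could be done.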


# **Preprocessing**

3. Preprocessing
* sentence tokenization (already done)
* word tokenization
* stop-word and punctuation removal
* lemmatization (spaCy provides lemmas but no stemmer)


In [10]:
# load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    filtered_tokens = []

    # word tokenization
    doc = nlp(text)

    for token in doc:
        # skip stop words and punctuation
        if token.is_stop or token.is_punct:
            continue

        # lemmatization: keep the token's base form
        filtered_tokens.append(token.lemma_)

    # join the list of lemmas back into a single string
    return " ".join(filtered_tokens)
    

In [11]:
#show a sample

txt=df['text'][0]
print(txt)

process_txt= preprocess(txt)
print(process_txt)


Just got back from seeing @GaryDelaney in Burslem. AMAZING!! Face still hurts from laughing so much #hilarious
get see @GaryDelaney Burslem AMAZING face hurt laugh hilarious


Apply the preprocess function to the DataFrame

In [12]:
df["preprocess_text"]=df["text"].apply(preprocess)

In [13]:
df

Unnamed: 0,text,label,preprocess_text
0,Just got back from seeing @GaryDelaney in Burs...,joy,get see @GaryDelaney Burslem AMAZING face hurt...
1,Oh dear an evening of absolute hilarity I don'...,joy,oh dear evening absolute hilarity think laugh ...
2,Been waiting all week for this game ❤️❤️❤️ #ch...,joy,wait week game ❤ ️ ❤ ️ ❤ ️ cheer friday ❤ ️
3,"@gardiner_love : Thank you so much, Gloria! Yo...",joy,@gardiner_love thank Gloria sweet thoughtful d...
4,I feel so blessed to work with the family that...,joy,feel blessed work family nanny ❤ ️ love amp ap...
...,...,...,...
3608,@VivienLloyd Thank you so much! Just home - st...,sadness,@VivienLloyd thank home stunned happy think si...
3609,Just put the winter duvet on ☃️❄️🌬☔️,sadness,winter duvet ☃ ️ ❄ ️ 🌬 ☔ ️
3610,@SilkInSide @TommyJoeRatliff that's so pretty!...,sadness,@SilkInSide @tommyjoeratliff pretty love sky b...
3611,@BluesfestByron second artist announcement loo...,sadness,@BluesfestByron second artist announcement loo...


In [14]:
df['label_num']=df["label"].map({'fear':0,'anger':1 ,'joy':2 ,'sadness':3})
df.head(9)

Unnamed: 0,text,label,preprocess_text,label_num
0,Just got back from seeing @GaryDelaney in Burs...,joy,get see @GaryDelaney Burslem AMAZING face hurt...,2
1,Oh dear an evening of absolute hilarity I don'...,joy,oh dear evening absolute hilarity think laugh ...,2
2,Been waiting all week for this game ❤️❤️❤️ #ch...,joy,wait week game ❤ ️ ❤ ️ ❤ ️ cheer friday ❤ ️,2
3,"@gardiner_love : Thank you so much, Gloria! Yo...",joy,@gardiner_love thank Gloria sweet thoughtful d...,2
4,I feel so blessed to work with the family that...,joy,feel blessed work family nanny ❤ ️ love amp ap...,2
5,"Today I reached 1000 subscribers on YT!! , #go...",joy,today reach 1000 subscriber YT goodday thankful,2
6,"@Singaholic121 Good morning, love! Happy first...",joy,@singaholic121 good morning love happy day fal...,2
7,#BridgetJonesBaby is the best thing I've seen ...,joy,bridgetjonesbaby good thing see age funny miss...,2
8,Just got back from seeing @GaryDelaney in Burs...,joy,get see @GaryDelaney Burslem AMAZING face hurt...,2


Preparing test dataset

In [15]:
df_test=pd.read_csv("/kaggle/input/emotion-classification-nlp/emotion-labels-test.csv")
df_test

Unnamed: 0,text,label
0,You must be knowing #blithe means (adj.) Happ...,joy
1,Old saying 'A #smile shared is one gained for ...,joy
2,Bridget Jones' Baby was bloody hilarious 😅 #Br...,joy
3,@Elaminova sparkling water makes your life spa...,joy
4,I'm tired of everybody telling me to chill out...,joy
...,...,...
3137,Why does Candice constantly pout #GBBO 💄😒,sadness
3138,"@redBus_in #unhappy with #redbus CC, when I ta...",sadness
3139,"@AceOperative789 no pull him afew weeks ago, s...",sadness
3140,I'm buying art supplies and I'm debating how s...,sadness


In [16]:
df_test['label'].value_counts()

label
fear       995
anger      760
joy        714
sadness    673
Name: count, dtype: int64

In [17]:
df_test.head(900)

Unnamed: 0,text,label
0,You must be knowing #blithe means (adj.) Happ...,joy
1,Old saying 'A #smile shared is one gained for ...,joy
2,Bridget Jones' Baby was bloody hilarious 😅 #Br...,joy
3,@Elaminova sparkling water makes your life spa...,joy
4,I'm tired of everybody telling me to chill out...,joy
...,...,...
895,watching my first Cage of Death and my word th...,fear
896,Pakistan continues to treat #terror as a matte...,fear
897,I think I must scare my coworkers when I'm eat...,fear
898,“Worry is a down payment on a problem you may ...,fear


In [18]:
df_test_edit =df_test.loc[895:1145,['text','label']]

In [19]:
df_test_edit['label'].value_counts()

label
fear    251
Name: count, dtype: int64

In [20]:
# note: this range (895:1185, 291 rows) is wider than the 251-row slice held in df_test_edit
df_test = df_test.drop(df_test.loc[895:1185, ['text', 'label']].index)
df_test.shape

(2851, 2)

In [21]:
df_test["preprocess_text"]=df_test["text"].apply(preprocess)

In [22]:
df_test

Unnamed: 0,text,label,preprocess_text
0,You must be knowing #blithe means (adj.) Happ...,joy,know blithe mean adj Happy cheerful
1,Old saying 'A #smile shared is one gained for ...,joy,old say smile share gain day @YEGlifer @scott_...
2,Bridget Jones' Baby was bloody hilarious 😅 #Br...,joy,Bridget Jones Baby bloody hilarious 😅 BridgetJ...
3,@Elaminova sparkling water makes your life spa...,joy,@elaminova sparkle water make life sparkly
4,I'm tired of everybody telling me to chill out...,joy,tired everybody tell chill everything ok fuck ...
...,...,...,...
3137,Why does Candice constantly pout #GBBO 💄😒,sadness,Candice constantly pout GBBO 💄 😒
3138,"@redBus_in #unhappy with #redbus CC, when I ta...",sadness,@redBus_in unhappy redbus cc talk week initiat...
3139,"@AceOperative789 no pull him afew weeks ago, s...",sadness,@AceOperative789 pull afew week ago sadly s ga...
3140,I'm buying art supplies and I'm debating how s...,sadness,buy art supply debate buy acrylic paint


In [23]:
df_test['label_num']=df_test["label"].map({'fear':0,'anger':1 ,'joy':2 ,'sadness':3})
df_test.head(9)

Unnamed: 0,text,label,preprocess_text,label_num
0,You must be knowing #blithe means (adj.) Happ...,joy,know blithe mean adj Happy cheerful,2
1,Old saying 'A #smile shared is one gained for ...,joy,old say smile share gain day @YEGlifer @scott_...,2
2,Bridget Jones' Baby was bloody hilarious 😅 #Br...,joy,Bridget Jones Baby bloody hilarious 😅 BridgetJ...,2
3,@Elaminova sparkling water makes your life spa...,joy,@elaminova sparkle water make life sparkly,2
4,I'm tired of everybody telling me to chill out...,joy,tired everybody tell chill everything ok fuck ...,2
5,#GBBO can cheer me up ☺️,joy,GBBO cheer ☺ ️,2
6,"&amp; as much as I hate for a dude to cheat, w...",joy,amp hate dude cheat woman forego please man la...,2
7,@GOT7Official @jrjyp happy birthday jin young!...,joy,@got7official @jrjyp happy birthday jin young ...,2
8,@GOT7Official @jrjyp happy birthday jin young!...,joy,@got7official @jrjyp happy birthday jin young ...,2


In [24]:
#x_train, x_test, y_train, y_test= train_test_split(df['preprocess_text'],df['label_num'] ,
 #                                                  test_size=0.2 , random_state=42 ,
  #                                                 stratify=df["label_num"]
   # )
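
The commented-out split above would carve a validation set out of the training data. A minimal sketch of how a stratified split behaves, on a made-up `toy` frame (the column names mirror the notebook, the data does not):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 20 rows, 4 classes with 5 rows each (illustrative values only)
toy = pd.DataFrame({
    "preprocess_text": [f"doc {i}" for i in range(20)],
    "label_num": [0, 1, 2, 3] * 5,
})

x_train, x_val, y_train, y_val = train_test_split(
    toy["preprocess_text"],
    toy["label_num"],
    test_size=0.2,              # 20% of the rows go to the validation set
    random_state=42,
    stratify=toy["label_num"],  # keep class proportions equal in both parts
)

print(y_val.value_counts())
```

With `stratify`, each class contributes the same share of rows to both parts, which matters when classes are close to balanced and the set is small.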

4. Feature Engineering  

In [25]:
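v = TfidfVectorizer()
train = v.fit_transform(df['preprocess_text'])   # learn vocabulary and IDF weights from the training text
test = v.transform(df_test['preprocess_text'])   # reuse the fitted vocabulary; do not refit on the test text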

print(v.vocabulary_)
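
The vectorizer should learn its vocabulary from the training text only, and the test text should be mapped onto that same vocabulary. A minimal sketch on toy sentences (the documents and the `vec` name are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["happy joyful day", "scared of the dark", "angry at the delay"]
test_docs = ["happy but scared", "completely unseen words"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)   # learns vocabulary and IDF from training text
X_test = vec.transform(test_docs)         # reuses them; words not seen in training are ignored

# Both matrices share the same columns, so a model fitted on X_train can score X_test
print(X_train.shape, X_test.shape)
```

Calling `fit_transform` a second time on the test text would build a different vocabulary, so the columns would no longer line up with the ones the model was trained on.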



5. Build Model 

In [26]:
NB_model=MultinomialNB()

#Model Training
NB_model.fit(train, df['label_num'])

In [27]:
# predict on the training set, so the scores below are training scores, not held-out scores
y_prediction = NB_model.predict(train)

In [28]:
print(accuracy_score(df['label_num'] , y_prediction))

0.973208910295003


In [29]:
print(classification_report(df['label_num'] , y_prediction))

              precision    recall  f1-score   support

           0       0.98      0.96      0.97       856
           1       0.97      0.98      0.97       857
           2       0.99      0.99      0.99       823
           3       0.96      0.95      0.96       786

    accuracy                           0.97      3322
   macro avg       0.97      0.97      0.97      3322
weighted avg       0.97      0.97      0.97      3322
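
The report above is computed on the same rows the model was trained on. A minimal sketch of scoring a separate held-out set, using toy data and a scikit-learn pipeline so the vectorizer is fitted on the training text only. The texts, labels and the `clf` name are assumptions for illustration, and the printed numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, illustrative only (0 = fear, 2 = joy, as in the label mapping)
train_texts = ["so happy today", "love this great day",
               "scared of the dark", "terrified and afraid"]
train_labels = [2, 2, 0, 0]
test_texts = ["happy great day", "afraid of the dark"]
test_labels = [2, 0]

# The pipeline fits TF-IDF and the classifier together on the training text
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

train_acc = accuracy_score(train_labels, clf.predict(train_texts))
test_acc = accuracy_score(test_labels, clf.predict(test_texts))
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

In the notebook, the equivalent step would be predicting on `test` and comparing against `df_test['label_num']`.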

