<a href="https://colab.research.google.com/github/Biffy21/Data-science-algorithms/blob/main/PRO_C115.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load the Dataset

In [2]:
!git clone https://github.com/procodingclass/PRO-C114-Text-Sentiment-Dataset

Cloning into 'PRO-C114-Text-Sentiment-Dataset'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 20 (delta 2), reused 0 (delta 0), pack-reused 11
Receiving objects: 100% (20/20), 2.94 MiB | 10.25 MiB/s, done.
Resolving deltas: 100% (3/3), done.


## Pandas:
**Pandas** is an open-source library built on top of NumPy. It provides high-performance, easy-to-use data structures and data analysis tools.

A DataFrame is a two-dimensional data structure in which data is arranged in a table of rows and columns. A Pandas DataFrame has three principal components: the data, the rows, and the columns.

A Pandas DataFrame can be created by loading data from existing MS Excel files, CSV files, or SQL databases. It can also be created from Python lists, dictionaries, and similar objects.
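
As a small illustration of the last point, the sketch below builds the same table once from a dictionary and once from a list of rows. The two toy sentences are hypothetical and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset used in this notebook.
from_dict = pd.DataFrame({"Text": ["i am happy", "i am sad"],
                          "Emotion": ["joy", "sadness"]})

# A list of rows plus explicit column names gives the same table.
from_rows = pd.DataFrame([["i am happy", "joy"], ["i am sad", "sadness"]],
                         columns=["Text", "Emotion"])

print(from_dict.shape)              # (number of rows, number of columns)
print(from_dict.equals(from_rows))  # True when both tables hold the same data
```

In the dictionary form each key becomes a column; in the list form each inner list becomes a row.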



In [3]:
#Import pandas
import pandas as pd

#read excel file
train_data_raw = pd.read_excel("/content/PRO-C114-Text-Sentiment-Dataset/text-emotion-training-dataset.xlsx")
#display first five entries of training dataset
train_data_raw.head()




Unnamed: 0,Text_Emotion
0,i didnt feel humiliated;sadness
1,i can go from feeling so hopeless to so damned...
2,im grabbing a minute to post i feel greedy wro...
3,i am ever feeling nostalgic about the fireplac...
4,i am feeling grouchy;anger


## Split the rows in two columns as Text and Emotions



In [4]:
#Use split() method to separate the Text and Emotions
train_data = pd.DataFrame(train_data_raw["Text_Emotion"].str.split(";").tolist(),columns = ['Text','Emotion'])
train_data.head(8)

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger
5,ive been feeling a little burdened lately wasn...,sadness
6,ive been taking or milligrams or times recomme...,surprise
7,i feel as confused about life as a teenager or...,fear
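
The split above assumes every row contains exactly one `;`. Whether the dataset has rows with extra semicolons is not known here, but a slightly more defensive variant could split only on the last `;` using `str.rsplit` with `n=1`. The rows below are hypothetical and only mimic the `text;emotion` format.

```python
import pandas as pd

# Hypothetical rows in the same "text;emotion" format as the dataset.
raw = pd.DataFrame({"Text_Emotion": ["i am feeling grouchy;anger",
                                     "so tired; yet i feel calm;joy"]})

# Split from the right, at most once, so a ";" inside the text is preserved.
parts = raw["Text_Emotion"].str.rsplit(";", n=1, expand=True)
parts.columns = ["Text", "Emotion"]
print(parts)
```

With `expand=True` the result is already a DataFrame, so no intermediate list is needed.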


## Giving labels to Emotions

In [5]:
#Find unique emotions
train_data["Emotion"].unique()

array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'],
      dtype=object)

In [6]:
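#Create a dictionary to replace emotions with integer labels
encode_emotions = {"anger":0,"fear":1,"joy":2,"love":3,"surprise":4,"sadness":5}
#Map only the Emotion column so that values in the Text column are never altered
train_data["Emotion"] = train_data["Emotion"].map(encode_emotions)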
train_data.head()

Unnamed: 0,Text,Emotion
0,i didnt feel humiliated,5
1,i can go from feeling so hopeless to so damned...,5
2,im grabbing a minute to post i feel greedy wrong,0
3,i am ever feeling nostalgic about the fireplac...,3
4,i am feeling grouchy,0


## Convert the DataFrame to lists

In [7]:
# Define two lists, one for sentences and one for emotion labels
training_sentences = []
training_labels = []

# Append each text and its label to the lists using the 'loc' accessor
for i in range(len(train_data)):
  sentence = train_data.loc[i,"Text"]
  training_sentences.append(sentence)
  label = train_data.loc[i,"Emotion"]
  training_labels.append(label)
#Check one text and its label from the lists
training_sentences[50],training_labels[50]

('i need to feel the dough to make sure its just perfect', 2)

## Tokenization & Padding
**Tokenization** splits text into tokens (here, words). The Keras `Tokenizer` class then maps each token to an integer, so every sentence becomes a sequence of integers.

**Padding** makes all sequences the same length, which the model requires. Sequences shorter than the chosen length are filled with zeros, and longer ones are truncated.
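
To make the idea concrete, here is a minimal pure-Python sketch of word-level tokenization. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation. The helper names `build_word_index` and `to_sequence` are hypothetical. The sketch follows the same convention as the cell below, where index 1 is reserved for the out-of-vocabulary token and more frequent words get smaller indices.

```python
from collections import Counter

def build_word_index(sentences, oov_token="<OOV>"):
    """Give index 1 to the OOV token, then 2, 3, ... to words by descending frequency."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(sentence, word_index, oov_token="<OOV>"):
    """Replace each word by its index; unknown words map to the OOV index."""
    return [word_index.get(w, word_index[oov_token]) for w in sentence.lower().split()]

# Hypothetical toy corpus, not taken from the dataset.
toy_sentences = ["i feel happy", "i feel sad today"]
toy_index = build_word_index(toy_sentences)

print(toy_index)
print(to_sequence("i feel great", toy_index))  # "great" was never seen, so it becomes 1
```

Index 0 is deliberately left unused so that it can serve as the padding value.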

In [8]:
#import tensorflow and the Tokenizer class
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
#Define parameters for Tokenizer
vocab_size = 10000
embedding_dim = 16
oov_tok = "<OOV>"
training_size = 20000
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

#Create a word_index dictionary
word_index = tokenizer.word_index
#Check the index assigned to a word
word_index["the"]

6

In [9]:
#Convert each sentence into a sequence of integers
training_sequences = tokenizer.texts_to_sequences(training_sentences)
#Check the first three tokenized sequences
print(training_sequences[0])
print(training_sequences[1])
print(training_sequences[2])

[2, 139, 3, 679]
[2, 40, 101, 60, 8, 15, 494, 5, 15, 3496, 553, 32, 60, 61, 128, 148, 76, 1480, 4, 22, 1255]
[17, 3060, 7, 1149, 5, 286, 2, 3, 495, 438]


In [10]:
#Import pad_sequences and define its parameters
from tensorflow.keras.preprocessing.sequence import pad_sequences
padding_type = 'post'
max_length = 100
trunc_type = 'post'

training_padded = pad_sequences(training_sequences,maxlen = max_length,padding = padding_type,truncating = trunc_type)
training_padded

array([[   2,  139,    3, ...,    0,    0,    0],
       [   2,   40,  101, ...,    0,    0,    0],
       [  17, 3060,    7, ...,    0,    0,    0],
       ...,
       [   2,    3,  327, ...,    0,    0,    0],
       [   2,    3,   14, ...,    0,    0,    0],
       [   2,   47,    7, ...,    0,    0,    0]], dtype=int32)
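
The effect of `padding = 'post'` and `truncating = 'post'` can be reproduced with a few lines of NumPy. This is only a sketch of the behaviour, not the Keras implementation, and `pad_post` is a hypothetical helper name.

```python
import numpy as np

def pad_post(sequences, maxlen):
    """Cut each sequence to maxlen from the end and fill the remainder with zeros."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        kept = seq[:maxlen]              # 'post' truncation drops tokens from the end
        padded[row, :len(kept)] = kept   # 'post' padding leaves the zeros at the end
    return padded

# The first sequence is shorter than maxlen, the second is longer.
demo = pad_post([[2, 139, 3, 679], [17, 3060, 7, 1149, 5, 286]], maxlen=5)
print(demo)
```

The first row gains one trailing zero and the second row loses its last token, so both end up with five entries.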

In [12]:
#converting padded sequence and labels into numpy array
import numpy as np
training_padded = np.array(training_padded)
training_labels = np.array(training_labels)


## Model compilation

In [16]:
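#Import the model class and the layers used below
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dense
from tensorflow.keras.layers import Conv1D,Dropout,MaxPooling1D
#Embedding -> two Conv1D/MaxPooling1D stages -> LSTM -> dense classifier over 6 emotions
model = Sequential([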
    Embedding(vocab_size,embedding_dim,input_length = max_length),
    Dropout(0.2),
    Conv1D(filters = 256,kernel_size = 3,activation = "relu"),
    MaxPooling1D(pool_size = 3),
    Conv1D(filters = 128,kernel_size = 3, activation = "relu"),
    MaxPooling1D(pool_size = 3),
    LSTM(128),

    Dense(128,activation = "relu"),
    Dropout(0.2),
    Dense(64,activation = "relu"),
    Dense(6, activation = "softmax")
])
model.compile(loss = 'sparse_categorical_crossentropy',optimizer = 'adam',metrics = ['accuracy'])
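
The `sparse_categorical_crossentropy` loss is what allows the labels to stay as plain integers (0 to 5) while the last layer outputs six softmax probabilities. A minimal NumPy sketch of the quantity it computes is shown below. The probabilities are hypothetical, and the sketch omits the numerical safeguards of the Keras version.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs):
    """Mean negative log-probability that the model assigns to each true class."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class per row
    return float(-np.log(picked).mean())

# Hypothetical softmax outputs for two samples over the six emotion classes.
probs = [[0.05, 0.05, 0.70, 0.10, 0.05, 0.05],
         [0.10, 0.10, 0.10, 0.10, 0.10, 0.50]]
labels = [2, 5]  # integer labels, as produced by encode_emotions

loss = sparse_categorical_crossentropy(labels, probs)
print(loss)
```

The loss shrinks towards zero as the probability assigned to the correct class approaches one.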



In [17]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 100, 16)           160000    
                                                                 
 dropout_4 (Dropout)         (None, 100, 16)           0         
                                                                 
 conv1d_4 (Conv1D)           (None, 98, 256)           12544     
                                                                 
 max_pooling1d_4 (MaxPoolin  (None, 32, 256)           0         
 g1D)                                                            
                                                                 
 conv1d_5 (Conv1D)           (None, 30, 128)           98432     
                                                                 
 max_pooling1d_5 (MaxPoolin  (None, 10, 128)           0         
 g1D)                                                 
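
The output shapes and parameter counts listed so far can be checked by hand. The sketch below redoes the arithmetic for the embedding, convolution, and pooling layers, assuming the Keras defaults of `'valid'` padding, a convolution stride of 1, and a pooling stride equal to the pool size. The helper function names are hypothetical.

```python
def conv1d_params(kernel_size, in_channels, filters):
    # One weight per (kernel position, input channel) plus one bias, for each filter.
    return (kernel_size * in_channels + 1) * filters

def conv1d_length(length, kernel_size):
    # Output length with 'valid' padding and stride 1.
    return length - kernel_size + 1

def pool_length(length, pool_size):
    # Output length with 'valid' padding and stride equal to pool_size.
    return length // pool_size

vocab_size, embedding_dim, max_length = 10000, 16, 100

embedding_params = vocab_size * embedding_dim        # one 16-dim vector per word
conv1_params = conv1d_params(3, embedding_dim, 256)
conv1_len = conv1d_length(max_length, 3)
pool1_len = pool_length(conv1_len, 3)
conv2_params = conv1d_params(3, 256, 128)
conv2_len = conv1d_length(pool1_len, 3)
pool2_len = pool_length(conv2_len, 3)

print(embedding_params, conv1_params, conv2_params)  # 160000 12544 98432
print(conv1_len, pool1_len, conv2_len, pool2_len)    # 98 32 30 10
```

These values match the `Param #` and `Output Shape` columns of the summary for the layers shown.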

In [None]:
#train model
num_epochs = 30
history = model.fit(training_padded,training_labels,epochs = num_epochs,verbose = 2)

Epoch 1/30
500/500 - 24s - loss: 1.4445 - accuracy: 0.3557 - 24s/epoch - 47ms/step
Epoch 2/30
500/500 - 22s - loss: 0.8258 - accuracy: 0.6308 - 22s/epoch - 43ms/step
Epoch 3/30
500/500 - 23s - loss: 0.5237 - accuracy: 0.7989 - 23s/epoch - 47ms/step
Epoch 4/30
500/500 - 22s - loss: 0.3734 - accuracy: 0.8761 - 22s/epoch - 44ms/step
Epoch 5/30
500/500 - 20s - loss: 0.2841 - accuracy: 0.9011 - 20s/epoch - 41ms/step
Epoch 6/30
500/500 - 21s - loss: 0.2365 - accuracy: 0.9167 - 21s/epoch - 42ms/step
Epoch 7/30
500/500 - 20s - loss: 0.1950 - accuracy: 0.9311 - 20s/epoch - 39ms/step
Epoch 8/30
500/500 - 23s - loss: 0.1722 - accuracy: 0.9408 - 23s/epoch - 45ms/step
Epoch 9/30
500/500 - 22s - loss: 0.1469 - accuracy: 0.9463 - 22s/epoch - 44ms/step
Epoch 10/30
500/500 - 22s - loss: 0.1354 - accuracy: 0.9509 - 22s/epoch - 43ms/step
Epoch 11/30
500/500 - 21s - loss: 0.1232 - accuracy: 0.9543 - 21s/epoch - 43ms/step
Epoch 12/30
500/500 - 22s - loss: 0.1155 - accuracy: 0.9580 - 22s/epoch - 43ms/step
E

In [None]:
model.save("Text_Emotion.h5")

In [None]:
sentence = ["It is a pretty gloomy day today, isn't it?","I love music."]
sequences = tokenizer.texts_to_sequences(sentence)
padded = pad_sequences(sequences,maxlen = max_length,padding = padding_type,truncating = trunc_type)
result = model.predict(padded)

predict_class = np.argmax(result,axis = 1)
predict_class
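
`predict_class` holds integer labels. To read them as emotions, the `encode_emotions` dictionary can be inverted. The sketch below uses a hypothetical `result` array in place of the real model output so that it runs on its own; `decode_emotions` is a name introduced here for illustration.

```python
import numpy as np

encode_emotions = {"anger": 0, "fear": 1, "joy": 2, "love": 3, "surprise": 4, "sadness": 5}
# Invert the mapping: integer label -> emotion name.
decode_emotions = {label: emotion for emotion, label in encode_emotions.items()}

# Hypothetical softmax output for two sentences (stand-in for model.predict).
result = np.array([[0.02, 0.03, 0.05, 0.05, 0.05, 0.80],
                   [0.05, 0.05, 0.30, 0.50, 0.05, 0.05]])

predict_class = np.argmax(result, axis=1)  # index of the most probable class per row
predicted_emotions = [decode_emotions[int(c)] for c in predict_class]
print(predicted_emotions)  # ['sadness', 'love']
```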