# Part I: Emotion classification
This is a supervised multiclass classification problem. The target is the emotion, which takes one of six classes: 'sadness', 'joy', 'love', 'anger', 'fear', and 'surprise'.  
We will test several models and keep the one with the best results on the test set:  
1. Support Vector Machines
2. Bayesian Networks / Naïve Bayes
3. Neural Networks / Deep Learning
4. BERT

## Prerequisites

In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Import the required libraries
import os

import nltk
from tkinter import *
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
import scipy

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
import string
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Input, Embedding, LSTM, Conv2D, Dense, GlobalAveragePooling1D, Flatten, Dropout,
    GRU, TimeDistributed, Conv1D, MaxPool1D, MaxPool2D, Bidirectional,
)

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

## Read the text-emotion dataset

In [6]:
train_dataset = pd.read_csv("../data/emotions/train.txt", delimiter=';', header=None, names=['Sentence','Label'])
test_dataset = pd.read_csv("../data/emotions/test.txt", delimiter=';', header=None, names=['Sentence','Label'])
vali_dataset = pd.read_csv("../data/emotions/val.txt", delimiter=';', header=None, names=['Sentence','Label'])
full_dataset = [train_dataset, test_dataset, vali_dataset]
full_dataset = pd.concat(full_dataset)

In [7]:
full_dataset.head()

                                            Sentence    Label
0                            i didnt feel humiliated  sadness
1  i can go from feeling so hopeless to so damned...  sadness
2   im grabbing a minute to post i feel greedy wrong    anger
3  i am ever feeling nostalgic about the fireplac...     love
4                               i am feeling grouchy    anger


In [20]:
full_dataset.Label.value_counts()

joy         6761
sadness     5797
anger       2709
fear        2373
love        1641
surprise     719
Name: Label, dtype: int64

In [8]:
# split target and features
feature = train_dataset['Sentence']
target = train_dataset['Label']

In [26]:
# split for validation set
X_val = vali_dataset['Sentence']
Y_val = vali_dataset['Label']

## Clean the sentences

In [9]:
# lower-case the text and drop punctuation, character by character
feature = feature.apply(lambda sequence: [ltrs.lower() for ltrs in sequence if ltrs not in string.punctuation]) 
# join the remaining characters back into a single string
feature = feature.apply(lambda wrd: ''.join(wrd))
feature.head()

0                              i didnt feel humiliated
1    i can go from feeling so hopeless to so damned...
2     im grabbing a minute to post i feel greedy wrong
3    i am ever feeling nostalgic about the fireplac...
4                                 i am feeling grouchy
Name: Sentence, dtype: object

## Stop word removal

In [None]:
# using NLTK
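
One way this step could look is sketched below. To stay self-contained, the sketch uses a small hand-picked stop word set (`EXAMPLE_STOP_WORDS`, an assumption made purely for illustration). With NLTK, the list returned by `nltk.corpus.stopwords.words('english')` could be passed in instead, once the `stopwords` corpus has been downloaded. The helper name `remove_stop_words` is hypothetical.

```python
# Illustrative stop word set; NOT the NLTK list (assumption for this sketch).
EXAMPLE_STOP_WORDS = {"i", "a", "an", "the", "to", "and", "am", "so", "from"}


def remove_stop_words(sentence, stop_words=EXAMPLE_STOP_WORDS):
    """Drop every whitespace-separated word that appears in stop_words."""
    kept = [word for word in sentence.split() if word not in stop_words]
    return " ".join(kept)


cleaned = remove_stop_words("i am feeling grouchy")
print(cleaned)
```

Applied to the whole column, this would be `feature.apply(remove_stop_words)`. Whether negations such as "not" should stay in the text is worth checking before removing them, since they may carry part of the emotional signal.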

## Tokenization and padding

In [10]:
# initialize the tokenizer; only the most frequent words (num_words - 1 = 4,999 of them) are kept when encoding
tokenizer = Tokenizer(num_words=5000)
# fit the tokenizer on the training text; this builds the word index
tokenizer.fit_on_texts(feature)
# convert each sentence to a sequence of token IDs; less frequent words are dropped
X_train = tokenizer.texts_to_sequences(feature)
# pad all sequences to the same length, because the neural network requires fixed-length input
X_train_pad = pad_sequences(X_train)
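
To make the two steps concrete, here is a minimal pure-Python sketch of what the tokenizer and the padding do, as far as their default behavior goes: words are ranked by frequency, ID 0 is reserved for padding, IDs at or above `num_words` are dropped, and zeros are added in front. The helper names `build_word_index` and `to_padded_sequences` are hypothetical, and ties in frequency are broken alphabetically here, which may differ from Keras.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets ID 1 (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically in this sketch
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_padded_sequences(texts, word_index, num_words, maxlen):
    """Encode texts as token IDs, drop IDs >= num_words, then left-pad with zeros."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.split()
               if w in word_index and word_index[w] < num_words]
        ids = ids[-maxlen:]  # keep the last maxlen tokens if the sentence is too long
        padded.append([0] * (maxlen - len(ids)) + ids)
    return padded


train_texts = ["i feel sad", "i feel so happy today"]
word_index = build_word_index(train_texts)
train_pad = to_padded_sequences(train_texts, word_index, num_words=5000, maxlen=5)
# unseen texts reuse the index built on the training data; unknown words are skipped
val_pad = to_padded_sequences(["i feel great"], word_index, num_words=5000, maxlen=5)
print(train_pad)
print(val_pad)
```

The same idea applies to the real pipeline: the validation and test sentences would be cleaned the same way, encoded with the already fitted `tokenizer.texts_to_sequences(...)`, and padded with `pad_sequences(..., maxlen=X_train_pad.shape[1])` so that all sets share one vocabulary and one sequence length.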


In [12]:
# using word_index we can see that the tokenizer gives each word an ID
tokenizer.word_index

{'i': 1,
 'feel': 2,
 'and': 3,
 'to': 4,
 'the': 5,
 'a': 6,
 'feeling': 7,
 'that': 8,
 'of': 9,
 'my': 10,
 'in': 11,
 'it': 12,
 'like': 13,
 'so': 14,
 'im': 15,
 'for': 16,
 'me': 17,
 'was': 18,
 'have': 19,
 'but': 20,
 'is': 21,
 'this': 22,
 'am': 23,
 'with': 24,
 'not': 25,
 'about': 26,
 'be': 27,
 'as': 28,
 'on': 29,
 'you': 30,
 'just': 31,
 'at': 32,
 'when': 33,
 'or': 34,
 'all': 35,
 'because': 36,
 'more': 37,
 'do': 38,
 'can': 39,
 'really': 40,
 'up': 41,
 't': 42,
 'by': 43,
 'are': 44,
 'very': 45,
 'know': 46,
 'been': 47,
 'if': 48,
 'out': 49,
 'myself': 50,
 'time': 51,
 'what': 52,
 'how': 53,
 'little': 54,
 'get': 55,
 'had': 56,
 'will': 57,
 'now': 58,
 'from': 59,
 'they': 60,
 'being': 61,
 'people': 62,
 'he': 63,
 'want': 64,
 'them': 65,
 'would': 66,
 'her': 67,
 'some': 68,
 'still': 69,
 'one': 70,
 'who': 71,
 'think': 72,
 'ive': 73,
 'him': 74,
 'even': 75,
 'an': 76,
 'life': 77,
 'its': 78,
 'there': 79,
 'bit': 80,
 'make': 81,
 'we': 82

In [21]:
len(tokenizer.word_index)

17096

In [14]:
# after tokenization and padding, zeros are added in front of shorter sentences
X_train_pad

array([[   0,    0,    0, ...,  138,    2,  625],
       [   0,    0,    0, ...,    3,   21, 1383],
       [   0,    0,    0, ...,    2,  495,  420],
       ...,
       [   0,    0,    0, ...,    5,  215,  191],
       [   0,    0,    0, ...,   30,   57, 2181],
       [   0,    0,    0, ...,   75,    5,   70]])

## Encoding target

In [16]:
# encode target labels as integers between 0 and n_classes-1 (here 0 to 5)
# initialize the label encoder for the target variable
labelencoder = LabelEncoder()
# fit the encoder on the training labels and transform them
Y_train = labelencoder.fit_transform(target)

In [17]:
# after encoding
Y_train

array([4, 4, 0, ..., 2, 2, 2])
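
`LabelEncoder` assigns integers in sorted class order, which is consistent with 'sadness' becoming 4 and 'anger' becoming 0 in the output above. The sketch below reproduces that mapping with NumPy alone; the short `labels` array is a made-up sample that contains each of the six classes at least once.

```python
import numpy as np

labels = np.array(["sadness", "sadness", "anger", "love", "fear", "joy", "surprise"])
# np.unique returns the sorted classes and, for each label, its index in that sorted list
classes, codes = np.unique(labels, return_inverse=True)
mapping = {str(name): int(code) for code, name in enumerate(classes)}
print(mapping)
print(codes)
```

With the fitted encoder from the cell above, the same information is available as `labelencoder.classes_`, and `labelencoder.inverse_transform(...)` turns integers back into emotion names.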

### Validation set

In [None]:
# reuse the encoder fitted on the training labels so both sets share the same class-to-integer mapping
Y_val = labelencoder.transform(Y_val)

## One hot encoding

In [None]:
Y_train_h = to_categorical(Y_train)

### Validation set

In [None]:
Y_val_h = to_categorical(Y_val)
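
For reference, one-hot encoding turns each integer label into a row with a single 1 at that label's position. A minimal NumPy sketch of the idea follows; `one_hot` is a hypothetical helper, not the Keras function.

```python
import numpy as np


def one_hot(codes, num_classes):
    """Return a (len(codes), num_classes) array with a single 1.0 per row."""
    return np.eye(num_classes, dtype="float32")[np.asarray(codes)]


encoded = one_hot([4, 4, 0], num_classes=6)
print(encoded)
```

`to_categorical` infers the number of classes from the largest label when `num_classes` is not given, so passing `num_classes=6` explicitly may be a safer choice if a split could ever lack the highest-numbered class.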

## Word embedding
Word embedding is a type of word representation in which words with similar meanings have similar vector representations. Two well-known embedding methods are:

- Word2vec
- Doc2Vec
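
In a neural network, a learned embedding is essentially a lookup table: one vector per token ID. The sketch below illustrates this with a random matrix standing in for learned weights (the sizes and values are assumptions for illustration only). It also shows why such a representation is context-free: the same token ID yields the same vector wherever it appears.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4
# one row per token ID; in a real model these values are learned during training
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# two padded "sentences" that share token ID 2 in different positions
batch = np.array([[0, 2, 7],
                  [2, 5, 1]])
vectors = embedding_matrix[batch]  # shape: (batch, sequence length, embedding_dim)
print(vectors.shape)
# context-free: token 2 maps to the same vector in both sentences
print(np.array_equal(vectors[0, 1], vectors[1, 0]))
```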

In [19]:
# do I need this?
# a plain word embedding is a context-free representation of a word (it doesn't take context into account)
# it will be replaced by a context-based model: BERT or an LSTM

## Creating the model
BERT or Bidirectional LSTM?

In [None]:
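# initialize the model
model = Sequential()
# embedding layer: input_dim (15212) must be larger than the largest token ID,
# and input_length (80) must match the padded sequence length
model.add(Embedding(15212, 64, input_length=80))
model.add(Dropout(0.6))
# two stacked bidirectional LSTM layers; the first returns the full sequence for the second
model.add(Bidirectional(LSTM(80, return_sequences=True)))
model.add(Bidirectional(LSTM(160)))
# output layer: one softmax unit per emotion class
model.add(Dense(6, activation='softmax'))
# summary (model.summary() prints the table itself)
model.summary()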


In [None]:
# compile
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
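
To show what this loss measures, here is a small NumPy sketch of softmax followed by categorical cross-entropy on one-hot targets. It is a simplified illustration of the formula, not the Keras implementation; the function names and the `eps` clipping value are choices made for this sketch.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_crossentropy(y_true_one_hot, y_prob, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true_one_hot * np.log(y_prob)).sum(axis=-1).mean())


# one sample, six classes; the true class is class 0
logits = np.array([[2.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
y_true = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.round(3), round(loss, 3))
```

A model that guesses uniformly over six classes would score a loss of log(6), roughly 1.79, which gives a rough baseline for reading the training logs.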

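## Fitting the model

In [None]:
# X_val_pad must hold the validation sentences, tokenized and padded the same way as X_train_pad
hist = model.fit(X_train_pad, Y_train_h, epochs=12, validation_data=(X_val_pad, Y_val_h))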

## Model performance overview

In [None]:
# Loss and accuracy plots
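
A possible shape for this cell is sketched below. It assumes the history dictionary uses the keys `loss`, `val_loss`, `accuracy`, and `val_accuracy`, which is what `metrics=['accuracy']` together with `validation_data` would typically produce. The helper `plot_history` is hypothetical, and the numbers in `dummy_history` are placeholders to exercise the function, not training results.

```python
import matplotlib.pyplot as plt


def plot_history(history):
    """Plot training vs. validation loss and accuracy from a Keras-style history dict."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, key, title in [(ax_loss, "loss", "Loss"), (ax_acc, "accuracy", "Accuracy")]:
        epochs = range(1, len(history[key]) + 1)
        ax.plot(epochs, history[key], label="train")
        ax.plot(epochs, history["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_title(title)
        ax.legend()
    fig.tight_layout()
    return fig


# placeholder numbers purely to exercise the function; not real training results
dummy_history = {
    "loss": [1.2, 0.8, 0.5], "val_loss": [1.1, 0.9, 0.7],
    "accuracy": [0.5, 0.7, 0.8], "val_accuracy": [0.5, 0.6, 0.7],
}
fig = plot_history(dummy_history)
```

With the fitted model, the call would be `plot_history(hist.history)`. A validation loss that rises while the training loss keeps falling is the usual sign of overfitting.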

## Check for test data

In [None]:
# test data results and accuracy
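
The evaluation itself reduces to taking the most probable class per sentence and comparing it with the true label. The sketch below shows that step on placeholder probabilities (three sentences, three classes, not model output); `accuracy_from_probabilities` is a hypothetical helper.

```python
import numpy as np


def accuracy_from_probabilities(y_prob, y_true_codes):
    """Fraction of rows whose highest-probability class equals the true class."""
    y_pred = np.argmax(y_prob, axis=1)
    return float((y_pred == np.asarray(y_true_codes)).mean())


# placeholder probabilities for three sentences and three classes; not model output
example_prob = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6],
                         [0.5, 0.4, 0.1]])
example_true = [0, 2, 1]
example_accuracy = accuracy_from_probabilities(example_prob, example_true)
print(example_accuracy)
```

For the real test set, the probabilities would come from `model.predict(...)` on test sentences that were cleaned, tokenized, and padded exactly like the training data, with labels encoded by the same `labelencoder`.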

## Confusion matrix and classification report
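
The already imported `confusion_matrix` and `classification_report` from scikit-learn compute these summaries directly from true and predicted labels. To show what they contain, here is a minimal NumPy sketch on placeholder labels (not real predictions), using the same convention of true classes in rows and predicted classes in columns. Both helper names are hypothetical.

```python
import numpy as np


def confusion_counts(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix


def precision_recall(matrix):
    """Per-class precision and recall; classes with no support get 0."""
    true_pos = np.diag(matrix).astype(float)
    predicted = matrix.sum(axis=0)
    actual = matrix.sum(axis=1)
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(true_pos), where=actual > 0)
    return precision, recall


# placeholder labels, not real predictions
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_counts(y_true, y_pred, num_classes=3)
precision, recall = precision_recall(cm)
print(cm)
```

Because the classes are imbalanced ('joy' has 6761 sentences, 'surprise' only 719), per-class precision and recall are likely to be more informative than overall accuracy.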

# Part II: Emoji emotion classification
In this part, we use the previously trained model to predict the emotions of emojis. The feature is the emoji name, e.g., FACE WITH TEARS OF JOY, and the target variable is the emotion.
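
As a sketch of how the feature could be obtained, the snippet below assumes the emoji names are the standard Unicode character names, which Python exposes through `unicodedata.name`. It lower-cases the name so it matches the cleaned training sentences. The helper `emoji_to_text` is hypothetical and handles only single-code-point emojis; sequences made of several code points would need separate handling.

```python
import unicodedata


def emoji_to_text(emoji):
    """Look up the Unicode name of a single-code-point emoji and lower-case it."""
    return unicodedata.name(emoji).lower()


# U+1F602 is the "face with tears of joy" emoji
emoji_text = emoji_to_text("\U0001F602")
print(emoji_text)
```

The resulting string would then go through the same tokenizer and padding as the training sentences before being passed to the model.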


# Part III: Text emoji recommendation
In this last part, we train a new model that takes a text and recommends an emoji for it. The feature is the text and the target variable is the emoji.