# Part I: Emotion classification
This is a supervised multiclass classification problem. The target is the emotion, which takes one of six classes: 'sadness', 'anger', 'love', 'surprise', 'fear', 'joy'.  
To solve it we apply transfer learning, using a model that has already been fine-tuned on this exact task.  
The model is hosted on Hugging Face:
https://huggingface.co/mrm8488/t5-base-finetuned-emotion

## 1.1 Prerequisites

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from transformers import AutoTokenizer, AutoModelWithLMHead

## 1.2 Transfer learning

In [2]:
# hugging face tokenizer
tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-emotion")
# load the model which is already trained on emotion dataset
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-base-finetuned-emotion")
# function that takes input and returns emotion
def get_emotion(text):
  input_ids = tokenizer.encode(text + '</s>', return_tensors='pt')

  output = model.generate(input_ids=input_ids,
               max_length=2)
  
  dec = [tokenizer.decode(ids) for ids in output]
  label = dec[0]
  return label
  
#get_emotion("i feel as if i havent blogged in ages are at least truly blogged i am doing an update cute") # Output: 'joy'
 
get_emotion("i have a feeling i kinda lost my best friend") # Output: '<pad> sadness' (the leading '<pad>' token is removed in section 2.3)



'<pad> sadness'

# Part II: Emoji emotion classification
In this part, we use the pre-trained model from Part I to help us predict the emotions of the emojis. The feature is the emoji name (e.g., FACE WITH TEARS OF JOY) and the target variable is the emotion.


## 2.1 Read the tweet-emoji dataset

In [3]:
# this dataset stores the names of the emojis, not the emoji characters themselves;
# we recover the characters by merging it with a second dataset (section 2.2)
tweets = pd.read_csv("../../../Desktop/tweets_emojis.csv")  # caveat: some Unicode names here don't match the Unicode names in the emoji dataset


In [4]:
# delete unneeded index column
tweets.drop('Unnamed: 0', axis=1, inplace=True)

In [5]:
# build a 'Unicode name' column in the same format as the emoji dataset, to help with the merge later:
# replace underscores with spaces, then convert to upper case
tweets['Unicode name'] = tweets['emoji'].str.replace('_', ' ').str.upper()

In [7]:
# drop old column
tweets.drop('emoji', axis=1, inplace=True)

In [6]:
tweets['Unicode name']

0                 FACE WITH TEARS OF JOY
1                 FACE WITH TEARS OF JOY
2                              THUMBS UP
3                 FACE WITH TEARS OF JOY
4                         CLAPPING HANDS
                       ...              
1320030                        MALE SIGN
1320031    BACKHAND INDEX POINTING RIGHT
1320032                     FLUSHED FACE
1320033                 PERSON SHRUGGING
1320034                    RAISING HANDS
Name: Unicode name, Length: 1320035, dtype: object

In [8]:
tweets.head()

Unnamed: 0,text,Unicode name
0,Idk who taught my baby this BS ️ IGmeetthesa...,FACE WITH TEARS OF JOY
1,Thats me in every lesson,FACE WITH TEARS OF JOY
2,There are MANY of you 🇺 🇸 🇺 🇸 🇮 🇱,THUMBS UP
3,Partner strategy LLRC Urban naxal theories ar...,FACE WITH TEARS OF JOY
4,Happy Birthday More blessings Matsatsi 🏽 Hop...,CLAPPING HANDS


In [9]:
tweets['Unicode name'].value_counts()

FACE WITH TEARS OF JOY            210309
RED HEART                         103172
LOUDLY CRYING FACE                 82659
SMILING FACE WITH HEART-EYES       67817
FIRE                               51030
FEMALE SIGN                        50941
MALE SIGN                          34149
FOLDED HANDS                       32273
WEARY FACE                         30489
TWO HEARTS                         30049
PERSON SHRUGGING                   29903
SMILING FACE WITH SMILING EYES     28519
RAISING HANDS                      26444
THINKING FACE                      26362
PERSON FACEPALMING                 24303
HUNDRED POINTS                     23743
SPARKLES                           23532
FACE WITH ROLLING EYES             21917
CLAPPING HANDS                     21462
ROLLING ON THE FLOOR LAUGHING      21335
FACE BLOWING A KISS                20878
EYES                               20824
THUMBS UP                          19456
BACKHAND INDEX POINTING RIGHT      17971
FLEXED BICEPS   

## 2.2 Read the emoji-unicode dataset

In [10]:
# use this dataset to get the emojis
emojis = pd.read_csv("../data/emojis/Emoji_Sentiment_Data_v1.0.csv")
emojis.head()

Unnamed: 0,Emoji,Unicode codepoint,Occurrences,Position,Negative,Neutral,Positive,Unicode name,Unicode block
0,😂,0x1f602,14622,0.805101,3614,4163,6845,FACE WITH TEARS OF JOY,Emoticons
1,❤,0x2764,8050,0.746943,355,1334,6361,HEAVY BLACK HEART,Dingbats
2,♥,0x2665,7144,0.753806,252,1942,4950,BLACK HEART SUIT,Miscellaneous Symbols
3,😍,0x1f60d,6359,0.765292,329,1390,4640,SMILING FACE WITH HEART-SHAPED EYES,Emoticons
4,😭,0x1f62d,5526,0.803352,2412,1218,1896,LOUDLY CRYING FACE,Emoticons


In [11]:
# since we dont need all features, only save emoji and name
emojis = emojis[['Emoji', 'Unicode name']]

In [12]:
# after
emojis.head()

Unnamed: 0,Emoji,Unicode name
0,😂,FACE WITH TEARS OF JOY
1,❤,HEAVY BLACK HEART
2,♥,BLACK HEART SUIT
3,😍,SMILING FACE WITH HEART-SHAPED EYES
4,😭,LOUDLY CRYING FACE


In [13]:
# merge the two datasets on the Unicode name to get a tweet-emoji dataset
# (inner join: tweets whose emoji name has no match in `emojis` are dropped)
tweets_emojis = emojis.merge(tweets, on='Unicode name')
tweets_emojis.head()

Unnamed: 0,Emoji,Unicode name,text
0,😂,FACE WITH TEARS OF JOY,Idk who taught my baby this BS ️ IGmeetthesa...
1,😂,FACE WITH TEARS OF JOY,Thats me in every lesson
2,😂,FACE WITH TEARS OF JOY,Partner strategy LLRC Urban naxal theories ar...
3,😂,FACE WITH TEARS OF JOY,Dont play with me 🏾 ‍ ️
4,😂,FACE WITH TEARS OF JOY,To the goofiest boy ever who apparently looks ...


In [14]:
tweets_emojis.shape

(741777, 3)
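
The merged frame has 741,777 rows against 1,320,035 tweets, so many tweets found no matching name (RED HEART in the tweets versus HEAVY BLACK HEART in the emoji table is one visible case). A minimal sketch of how the unmatched names could be listed, using toy stand-ins for `tweets` and `emojis`; the `*_demo` frames and their values are illustrative assumptions, not the real data:

```python
import pandas as pd

# Toy stand-ins for the `tweets` and `emojis` frames (illustrative values only).
tweets_demo = pd.DataFrame({
    "text": ["so funny", "love this", "so sad"],
    "Unicode name": ["FACE WITH TEARS OF JOY", "RED HEART", "LOUDLY CRYING FACE"],
})
emojis_demo = pd.DataFrame({
    "Emoji": ["😂", "❤", "😭"],
    "Unicode name": ["FACE WITH TEARS OF JOY", "HEAVY BLACK HEART", "LOUDLY CRYING FACE"],
})

# A left join with indicator=True adds a `_merge` column that marks
# tweets whose name has no partner in the emoji table as "left_only".
checked = tweets_demo.merge(emojis_demo, on="Unicode name", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Unicode name"].value_counts()
print(unmatched)
```

Sorting the unmatched names by count shows which ones are worth renaming by hand before the merge.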

In [15]:
# explore the distribution of emojis
tweets_emojis['Unicode name'].value_counts()

FACE WITH TEARS OF JOY            210309
LOUDLY CRYING FACE                 82659
FIRE                               51030
FEMALE SIGN                        50941
MALE SIGN                          34149
WEARY FACE                         30489
TWO HEARTS                         30049
SMILING FACE WITH SMILING EYES     28519
SPARKLES                           23532
EYES                               20824
FLEXED BICEPS                      16634
PURPLE HEART                       16457
PARTY POPPER                       16259
WINKING FACE                       16075
BLUE HEART                         15941
SMILING FACE WITH SUNGLASSES       15322
SPARKLING HEART                    14614
SKULL                              11976
CRYING FACE                        11265
YELLOW HEART                       10100
FLUSHED FACE                        8449
WHITE HEAVY CHECK MARK              7192
TROPHY                              6892
GLOWING STAR                        6189
HEAVY CHECK MARK

In [16]:
tweets_emojis.duplicated().sum()

0

In [17]:
# the dataframe is large, so split it into one dataframe per emoji to make lookups faster
# (.copy() lets us add columns to each group later without a SettingWithCopyWarning)
emoji_dict = {name: group.copy() for name, group in tweets_emojis.groupby('Unicode name')}

In [18]:
# a dictionary made up of dataframes
emoji_dict

{'BLUE HEART':        Emoji Unicode name                                               text
 462064     💙   BLUE HEART                                  OBF Lets go AMVT 
 462065     💙   BLUE HEART  BEAUTIFUL  ️  ️  ️ RaIna  ️  Unforgettable Mem...
 462066     💙   BLUE HEART                        So much love for youman  ️ 
 462067     💙   BLUE HEART  Idc what my man have or dont have Ima ride w h...
 462068     💙   BLUE HEART  This is JUST the beginning Number 9 on iTunes ...
 ...      ...          ...                                                ...
 478000     💙   BLUE HEART  Thanasi with a convincing win over Alexander B...
 478001     💙   BLUE HEART                      I love you so much hermanita 
 478002     💙   BLUE HEART                    Congrats to the whole family  🏼
 478003     💙   BLUE HEART                          All My Heart also played 
 478004     💙   BLUE HEART    ITS OFFICIAL Arnold Hall Room 153 with my girl 
 
 [15941 rows x 3 columns],
 'CRYING FACE':      

In [19]:
# querying
emoji_dict['HEAVY CHECK MARK']['text']

630503     Mini Giveaway  SATU WINNER GET EXO natrep den...
630504    Q If the items below fall into water which one...
630505    CALLING ALL CoD WWII PLAYERS  is recruiting fo...
630506     ️ The Azerbaijans Best Awarding Ceremony On T...
630507    Interest within reaps in absolute Bliss  ️ Gre...
                                ...                        
636409    I agree As they want 2 disarm us for a reasons...
636410    Gain 150 followers tonight extra FAST  Retweet...
636411    iKONICs  ️ Which team are you on ️  ️ Team Mel...
636412    Fresh squeezed flatforms  ️ Spring forward in the
636413     Check out ALL my gifs of the stunning Kylie Page
Name: text, Length: 5911, dtype: object

## 2.3 Send tweets as input to the pre-trained model


In [None]:
# classify emoji emotions and save in new emotions column
for key in emoji_dict:
    emoji_dict[key]['Emotions'] = emoji_dict[key]['text'].apply(lambda tweet: get_emotion(tweet.lower()))
    emoji_dict[key].drop(['text', 'Unicode name'], axis=1, inplace=True)

# This takes a while, so instead of keeping my machine occupied for the hours it needs to finish, I broke it down into smaller pieces (one cell per emoji)
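
One way to keep the single loop but still run it in pieces is to skip every emoji that was already labelled, so an interrupted run can simply be restarted. The sketch below is only an illustration: `classify_pending`, the toy `demo` dictionary and `fake_classifier` are hypothetical names, and the stand-in classifier replaces `get_emotion` so the example runs without the model.

```python
import pandas as pd

def classify_pending(frames, classify):
    """Label each frame that has no 'Emotions' column yet; return the names processed."""
    processed = []
    for name, frame in frames.items():
        if "Emotions" in frame.columns:
            continue  # labelled in an earlier run, skip it
        frame["Emotions"] = frame["text"].str.lower().map(classify)
        frame.drop(columns=["text", "Unicode name"], inplace=True)
        processed.append(name)
    return processed

# Toy data and a stand-in classifier (assumptions for illustration only).
demo = {
    "FIRE": pd.DataFrame({"Emoji": ["🔥"], "Unicode name": ["FIRE"], "text": ["So GOOD"]}),
    "SKULL": pd.DataFrame({"Emoji": ["💀"], "Emotions": ["joy"]}),  # already labelled
}
fake_classifier = lambda text: "joy" if "good" in text else "sadness"

print(classify_pending(demo, fake_classifier))
```

In the notebook the call would presumably be `classify_pending(emoji_dict, get_emotion)`, repeated after each interruption until it returns an empty list.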

In [20]:
# classify emoji emotions and save in new emotions column
emoji_dict['HEAVY CHECK MARK']['Emotions'] = emoji_dict['HEAVY CHECK MARK']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
# drop tweet column
emoji_dict['HEAVY CHECK MARK'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [22]:
# GLOWING STAR 
emoji_dict['GLOWING STAR']['Emotions'] = emoji_dict['GLOWING STAR']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['GLOWING STAR'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [23]:
# TROPHY
emoji_dict['TROPHY']['Emotions'] = emoji_dict['TROPHY']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['TROPHY'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [24]:
# WHITE HEAVY CHECK MARK
emoji_dict['WHITE HEAVY CHECK MARK']['Emotions'] = emoji_dict['WHITE HEAVY CHECK MARK']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['WHITE HEAVY CHECK MARK'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [25]:
# FLUSHED FACE
emoji_dict['FLUSHED FACE']['Emotions'] = emoji_dict['FLUSHED FACE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['FLUSHED FACE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [26]:
# YELLOW HEART
emoji_dict['YELLOW HEART']['Emotions'] = emoji_dict['YELLOW HEART']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['YELLOW HEART'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [29]:
# CRYING FACE
emoji_dict['CRYING FACE']['Emotions'] = emoji_dict['CRYING FACE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['CRYING FACE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [30]:
# SKULL
emoji_dict['SKULL']['Emotions'] = emoji_dict['SKULL']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['SKULL'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [31]:
# SPARKLING HEART
emoji_dict['SPARKLING HEART']['Emotions'] = emoji_dict['SPARKLING HEART']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['SPARKLING HEART'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [32]:
# SMILING FACE WITH SUNGLASSES
emoji_dict['SMILING FACE WITH SUNGLASSES']['Emotions'] = emoji_dict['SMILING FACE WITH SUNGLASSES']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['SMILING FACE WITH SUNGLASSES'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [33]:
# BLUE HEART
emoji_dict['BLUE HEART']['Emotions'] = emoji_dict['BLUE HEART']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['BLUE HEART'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [34]:
# WINKING FACE
emoji_dict['WINKING FACE']['Emotions'] = emoji_dict['WINKING FACE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['WINKING FACE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [35]:
# PARTY POPPER
emoji_dict['PARTY POPPER']['Emotions'] = emoji_dict['PARTY POPPER']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['PARTY POPPER'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [36]:
# PURPLE HEART
emoji_dict['PURPLE HEART']['Emotions'] = emoji_dict['PURPLE HEART']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['PURPLE HEART'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [37]:
# FLEXED BICEPS
emoji_dict['FLEXED BICEPS']['Emotions'] = emoji_dict['FLEXED BICEPS']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['FLEXED BICEPS'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [38]:
# EYES
emoji_dict['EYES']['Emotions'] = emoji_dict['EYES']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['EYES'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [44]:
# SPARKLES
emoji_dict['SPARKLES']['Emotions'] = emoji_dict['SPARKLES']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['SPARKLES'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [45]:
# SMILING FACE WITH SMILING EYES
emoji_dict['SMILING FACE WITH SMILING EYES']['Emotions'] = emoji_dict['SMILING FACE WITH SMILING EYES']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['SMILING FACE WITH SMILING EYES'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [46]:
# TWO HEARTS
emoji_dict['TWO HEARTS']['Emotions'] = emoji_dict['TWO HEARTS']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['TWO HEARTS'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [47]:
# WEARY FACE
emoji_dict['WEARY FACE']['Emotions'] = emoji_dict['WEARY FACE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['WEARY FACE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [48]:
# MALE SIGN
emoji_dict['MALE SIGN']['Emotions'] = emoji_dict['MALE SIGN']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['MALE SIGN'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [87]:
# FEMALE SIGN 
emoji_dict['FEMALE SIGN']['Emotions'] = emoji_dict['FEMALE SIGN']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['FEMALE SIGN'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [88]:
# FIRE
emoji_dict['FIRE']['Emotions'] = emoji_dict['FIRE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['FIRE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [89]:
# LOUDLY CRYING FACE
emoji_dict['LOUDLY CRYING FACE']['Emotions'] = emoji_dict['LOUDLY CRYING FACE']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['LOUDLY CRYING FACE'].drop(['text', 'Unicode name'], axis=1, inplace=True)

In [None]:
# FACE WITH TEARS OF JOY
emoji_dict['FACE WITH TEARS OF JOY']['Emotions'] = emoji_dict['FACE WITH TEARS OF JOY']['text'].apply(lambda tweet: get_emotion(tweet.lower()))
emoji_dict['FACE WITH TEARS OF JOY'].drop(['text', 'Unicode name'], axis=1, inplace=True)

----

In [None]:
# remove the '<pad>' token that the decoder puts at the start of each label (e.g. '<pad> sadness' -> 'sadness')
for key in emoji_dict:
    emoji_dict[key]['Emotions'] = emoji_dict[key]['Emotions'].apply(lambda label: label.split(' ', 1)[1])
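
Note that `split(' ', 1)[1]` raises an `IndexError` for any decoded string that contains no space. A more forgiving cleanup could strip the special tokens directly, as in the sketch below; `clean_label` and `SPECIAL_TOKENS` are hypothetical names, and the token strings `<pad>` and `</s>` are assumed to be the only ones that appear. Passing `skip_special_tokens=True` to `tokenizer.decode` is another option.

```python
import re

# Special tokens assumed to appear in the decoded output.
SPECIAL_TOKENS = re.compile(r"<pad>|</s>")

def clean_label(decoded):
    """Remove the special tokens and surrounding whitespace from a decoded label."""
    return SPECIAL_TOKENS.sub("", decoded).strip()

print(clean_label("<pad> sadness"))   # sadness
print(repr(clean_label("<pad>")))     # '' (no label, but no exception either)
```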

In [None]:
# FEMALE SIGN
emoji_dict['FEMALE SIGN']['Emotions'] = emoji_dict['FEMALE SIGN']['Emotions'].apply(lambda tweet: tweet.split(' ', 1)[1])
# FIRE
emoji_dict['FIRE']['Emotions'] = emoji_dict['FIRE']['Emotions'].apply(lambda tweet: tweet.split(' ', 1)[1])
# LOUDLY CRYING FACE
emoji_dict['LOUDLY CRYING FACE']['Emotions'] = emoji_dict['LOUDLY CRYING FACE']['Emotions'].apply(lambda tweet: tweet.split(' ', 1)[1])

In [None]:
# FACE WITH TEARS OF JOY
emoji_dict['FACE WITH TEARS OF JOY']['Emotions'] = emoji_dict['FACE WITH TEARS OF JOY']['Emotions'].apply(lambda tweet: tweet.split(' ', 1)[1])

In [53]:
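# save the new dictionary
# note: csv.writer calls str() on each DataFrame, so this file holds their printed form;
# the complete data is saved with to_csv further down
import csv
with open('../../../Desktop/capstone-data/emoji_emotion_dict.csv', 'w', newline='', encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in emoji_dict.items():
        writer.writerow(row)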

------

In [66]:
# merge dictionary dfs into one df
df_emoji_emotion = pd.concat(list(emoji_dict.values()))

In [None]:
df_emoji_emotion.Emotions.value_counts()

In [None]:
df_emoji_emotion[df_emoji_emotion['Emotions'] == 'surprise']

In [85]:
# show random 30 rows to check data quality
df_emoji_emotion[['Emotions', 'Emoji']].sample(n=30)

Unnamed: 0,Emotions,Emoji
538954,,🔥
363022,joy,😩
248599,,😭
552904,,🔥
128696,,😂
439653,joy,🎉
408399,fear,💪
684824,fear,♂
296115,anger,😊
473219,love,💙


In [None]:
# save the dataset as csv
df_emoji_emotion.to_csv('../../../Desktop/capstone-data/emoji_emotion_df.csv', index=False)

## 2.4 Merge emoji and emotion dataset

In [40]:
emotion_df = pd.read_csv("../data/emotions/train.txt", delimiter=';', header=None, names=['Sentence','Emotions'])
emoji_df = pd.read_csv('../../../Desktop/capstone-data/emoji_emotion_df.csv')

In [41]:
# show first 5 rows
emotion_df.head()

Unnamed: 0,Sentence,Emotions
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [None]:
emotion_df.shape

In [None]:
emotion_df.Emotions.value_counts()

In [42]:
emoji_df.head()

Unnamed: 0,Emoji,Emotions
630503,✔,joy
630504,✔,joy
630505,✔,joy
630506,✔,joy
630507,✔,joy


In [43]:
# merge the two tables on the 'Emotions' column
emotion_emoji_merged = emoji_df.merge(emotion_df, on='Emotions')

In [44]:
emotion_emoji_merged.head(20)

Unnamed: 0,Emoji,Emotions,Sentence
0,✔,joy,i have been with petronas for years i feel tha...
1,✔,joy,i do feel that running is a divine experience ...
2,✔,joy,i have immense sympathy with the general point...
3,✔,joy,i do not feel reassured anxiety is on each side
4,✔,joy,i have the feeling she was amused and delighted
5,✔,joy,i was able to help chai lifeline with your sup...
6,✔,joy,i feel more superior dead chicken or grieving ...
7,✔,joy,i get giddy over feeling elegant in a perfectl...
8,✔,joy,i can t imagine a real life scenario where i w...
9,✔,joy,i am not sure what would make me feel content ...


In [45]:
emotion_emoji_merged.duplicated().sum()

26327103
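
The very large duplicate count is what a many-to-many join produces: every emoji row is paired with every sentence that carries the same emotion. A small sketch with made-up rows (the `*_demo` frames are illustrative assumptions) shows the effect, and how dropping duplicates before the merge yields the same final table without building the large intermediate one:

```python
import pandas as pd

# Three identical (emoji, emotion) rows and two sentences with the same emotion.
emoji_demo = pd.DataFrame({"Emoji": ["✔", "✔", "✔"], "Emotions": ["joy", "joy", "joy"]})
sentences_demo = pd.DataFrame({
    "Sentence": ["i feel great", "i feel content"],
    "Emotions": ["joy", "joy"],
})

# Many-to-many join: 3 emoji rows x 2 sentences = 6 rows, 4 of them duplicates.
exploded = emoji_demo.merge(sentences_demo, on="Emotions")
print(len(exploded), exploded.duplicated().sum())

# Deduplicating the (Emoji, Emotions) pairs first avoids the blow-up.
compact = emoji_demo.drop_duplicates().merge(sentences_demo, on="Emotions")
print(len(compact))
```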

In [46]:
emotion_emoji_merged = emotion_emoji_merged.drop_duplicates()

In [47]:
emotion_emoji_merged.shape

(15999, 3)

In [48]:
emotion_emoji_merged.head(20)

Unnamed: 0,Emoji,Emotions,Sentence
0,✔,joy,i have been with petronas for years i feel tha...
1,✔,joy,i do feel that running is a divine experience ...
2,✔,joy,i have immense sympathy with the general point...
3,✔,joy,i do not feel reassured anxiety is on each side
4,✔,joy,i have the feeling she was amused and delighted
5,✔,joy,i was able to help chai lifeline with your sup...
6,✔,joy,i feel more superior dead chicken or grieving ...
7,✔,joy,i get giddy over feeling elegant in a perfectl...
8,✔,joy,i can t imagine a real life scenario where i w...
9,✔,joy,i am not sure what would make me feel content ...


In [49]:
emotion_emoji_merged['Emotions'].value_counts()

joy         5361
sadness     4666
anger       2159
fear        1937
love        1304
surprise     572
Name: Emotions, dtype: int64

In [50]:
# save the dataset as csv
emotion_emoji_merged.to_csv('../../../Desktop/capstone-data/emoji_emotion_merged.csv', index=False)

# Part III: Text emoji recommendation
In this last part, we train a new model that takes a text and recommends an emoji based on that text. The feature is the text and the target variable is the emoji. First we apply a multinomial naive Bayes classifier, one of the most popular machine learning algorithms in natural language processing; then we build a deep learning neural network and compare the results.

## 3.1 Prerequisites

In [None]:
import numpy as np
import nltk
from matplotlib import pyplot as plt
import scipy
import re
from nltk.corpus import stopwords
from tensorflow import keras
import string
from tensorflow.keras.layers import Dense, Activation, Input, Dropout, SimpleRNN, LSTM
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import seaborn as sns



## 3.2 Text preprocessing

In [None]:
# read dataset
emotion_emoji_merged = pd.read_csv('../../../Desktop/capstone-data/emoji_emotion_merged.csv')

In [None]:
emotion_emoji_merged.head()

In [60]:
# split the data
X = emotion_emoji_merged['Sentence']
y = emotion_emoji_merged['Emoji']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(X_train.shape, y_train.shape)

(15999,) (15999,)


In [None]:
sns.displot(emotion_emoji_merged, x="Emotions")

### 3.2.1 Embeddings

In [95]:
# open the pre-trained GloVe vectors (50 dimensions) that we use to build the embedding lookup
file = open("../../../Desktop/glove.6B/glove.6B.50d.txt", encoding = 'utf8')


In [96]:
# build a dictionary that maps each word (key) to its GloVe embedding vector (value)
def initialize_emb_matrix(file):
    embedding_matrix = {}
    for line in file:
        values = line.split()
        word = values[0]
        embedding = np.array(values[1:], dtype='float64')
        embedding_matrix[word] = embedding

    return embedding_matrix
embedding_matrix = initialize_emb_matrix(file)
file.close()

In [102]:
# the 50-dimensional GloVe vector for the word 'ok'
embedding_matrix['ok']

array([-0.53646 , -0.072432,  0.24182 ,  0.099021,  0.18426 , -0.86764 ,
        0.081939,  0.40473 , -0.40506 ,  0.47446 , -0.16865 ,  0.38936 ,
       -0.16916 ,  0.1661  ,  0.73543 ,  0.83612 ,  0.026771,  0.56956 ,
        0.41988 , -0.23297 , -0.58841 ,  0.5495  ,  0.71645 ,  0.22451 ,
        1.0043  , -1.5036  , -0.78521 ,  0.73364 ,  0.4161  , -1.6782  ,
        1.9156  ,  0.26593 , -0.41546 ,  0.97965 , -0.06039 , -0.74422 ,
        0.6166  , -0.023109,  0.77383 , -0.65267 , -0.20022 , -0.2479  ,
        0.04704 ,  0.31407 ,  0.32598 , -0.24481 ,  0.16835 ,  0.097793,
        0.12392 ,  1.1584  ])

In [None]:
# embedding our text dataset
# note: we shouldn't remove stop words before embedding
# because that would remove semantic structure
# that we need for our model to work well
def get_emb_data(data, max_len):
    embedding_data = np.zeros((len(data), max_len, 50))  # 50 = vector size of glove.6B.50d

    # enumerate gives positional row numbers; after train_test_split the Series
    # index is shuffled, so data[idx] would look up labels that may not exist
    for idx, sentence in enumerate(data):
        # keep at most max_len words; shorter sentences stay zero-padded
        for i, word in enumerate(sentence.split()[:max_len]):
            vector = embedding_matrix.get(word.lower())
            if vector is not None:  # words missing from GloVe stay as zeros
                embedding_data[idx][i] = vector

    return embedding_data

In [None]:
# get embedding for train data (168 = max_len, the number of word positions per sentence)
X_train_emb = get_emb_data(X_train, 168)
# training data after embedding
X_train_emb
# get embedding for test data
X_test_emb = get_emb_data(X_test, 168)

In [103]:
from scipy.spatial import distance
def find_closest_embeddings(embedding):
    return sorted(embedding_matrix.keys(), key=lambda word: distance.euclidean(embedding_matrix[word], embedding))


In [107]:
print(find_closest_embeddings(embedding_matrix["paper"])[1:6])

['print', 'sheet', 'printed', 'printing', 'ink']
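
`find_closest_embeddings` sorts the whole vocabulary and calls `distance.euclidean` once per word, which is slow for a large embedding file. A possible alternative, sketched here with NumPy, computes all distances in one vectorised step. `closest_words` and the tiny 2-d `toy_vectors` are hypothetical; the values are made up and are not real GloVe numbers.

```python
import numpy as np

def closest_words(query, vectors, k=5):
    """Return the k words whose vectors are nearest to `query` (Euclidean distance)."""
    words = list(vectors)
    matrix = np.stack([vectors[w] for w in words])      # shape (n_words, dim)
    distances = np.linalg.norm(matrix - query, axis=1)  # one distance per word
    nearest = np.argsort(distances)[:k]                 # indices of the k smallest
    return [words[i] for i in nearest]

# Tiny made-up 2-d vectors, for illustration only.
toy_vectors = {
    "paper": np.array([1.0, 1.0]),
    "sheet": np.array([1.1, 0.9]),
    "ink": np.array([1.5, 1.2]),
    "ocean": np.array([-3.0, 4.0]),
}
print(closest_words(toy_vectors["paper"], toy_vectors, k=3))  # ['paper', 'sheet', 'ink']
```

As in the original function, the query word itself comes back first with distance 0, so slicing from index 1 drops it.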

### 3.2.2 Encoding the target variable

In [None]:
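# the emojis are strings, so map them to integer class ids before one-hot encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder().fit(y)
n_classes = len(label_encoder.classes_)
# converting y_train to one-hot vectors so that cross-entropy loss can be used
y_train = to_categorical(label_encoder.transform(y_train), num_classes=n_classes)
y_train

In [None]:
y_test = to_categorical(label_encoder.transform(y_test), num_classes=n_classes)
y_test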

## 3.3 Naive Bayes

### 3.3.1 Model building

In [None]:
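# create the model
# note: MultinomialNB expects a 2-D matrix of non-negative features (e.g. word counts) and 1-D class labels
NBmodel = MultinomialNB()
# fit the model
NBmodel.fit(X_train_emb, y_train)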

### 3.3.2 Model testing

In [None]:
NB_y_pred = NBmodel.predict(X_test_emb)
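
`MultinomialNB` models count-like features, so it needs a 2-D matrix of non-negative values and 1-D labels; the GloVe tensor is 3-D and contains negative numbers. A common substitute, shown here only as a sketch, is to fit it on bag-of-words counts from `CountVectorizer`. This swaps the features for word counts and is not the notebook's embedding pipeline; the toy sentences and labels are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy sentences and emoji labels, for illustration only.
train_sentences = ["i feel so happy today", "i am happy and delighted",
                   "i feel sad and lonely", "i am so sad today"]
train_emojis = ["😂", "😂", "😭", "😭"]

vectorizer = CountVectorizer()                        # bag-of-words counts (non-negative)
X_counts = vectorizer.fit_transform(train_sentences)  # sparse matrix, shape (4, vocabulary size)

nb_demo = MultinomialNB()
nb_demo.fit(X_counts, train_emojis)                   # 1-D labels; strings are accepted

# New text must go through the same fitted vectorizer.
prediction = nb_demo.predict(vectorizer.transform(["feeling happy"]))
print(prediction[0])
```

With the real data, the raw `X_train` sentences and the emoji labels from before the one-hot step would take the place of the toy lists.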

In [None]:
plt.figure(figsize=(6, 6))

# use displot for newer versions
ax1 = sns.distplot(y_test, hist=False, color="r", label="Actual Values")
sns.distplot(NB_y_pred, hist=False, color="b", label="Predicted Values" , ax=ax1)


plt.title('Actual vs Predicted Values')
plt.xlabel('')
plt.ylabel('')

plt.show()
plt.close()

## 3.4 LSTM

### 3.4.1 Model building

In [None]:
model = Sequential()

In [None]:
# try different dropouts
model.add(LSTM(units = 256, return_sequences=True, input_shape = (168,50)))
model.add(Dropout(0.3))
model.add(LSTM(units=128))
model.add(Dropout(0.3))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=32, activation='relu'))
model.add(Dense(units=20, activation='relu'))
model.add(Dense(units=y_train.shape[1], activation='softmax'))  # one output unit per emoji class

In [None]:
model.summary()

In [None]:
# try different optimizers
model.compile(optimizer='adam', loss=keras.losses.categorical_crossentropy, metrics=['acc'])

### 3.4.2 Model training

In [None]:
# try different validation split
# try different epochs
res = model.fit(X_train_emb, y_train, validation_split=0.2, batch_size=32, epochs=100, verbose=2)

### 3.4.3 Model performance overview

In [None]:
# Loss and accuracy plots
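
A minimal sketch of what this cell could contain. It assumes the history keys are `loss`, `val_loss`, `acc` and `val_acc`, which is what `metrics=['acc']` together with `validation_split` is expected to produce; `plot_history` is a hypothetical helper name.

```python
from matplotlib import pyplot as plt

def plot_history(history):
    """Plot loss and accuracy curves from a Keras-style history dict."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))

    # left panel: training vs validation loss per epoch
    ax_loss.plot(history["loss"], label="train")
    ax_loss.plot(history["val_loss"], label="validation")
    ax_loss.set(title="Loss", xlabel="epoch")
    ax_loss.legend()

    # right panel: training vs validation accuracy per epoch
    ax_acc.plot(history["acc"], label="train")
    ax_acc.plot(history["val_acc"], label="validation")
    ax_acc.set(title="Accuracy", xlabel="epoch")
    ax_acc.legend()
    return fig
```

In the notebook it would be called as `plot_history(res.history)`; a widening gap between the training and validation curves is the usual sign of overfitting.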

### 3.4.4 Confusion matrix and classification report
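
A sketch of how this section could be filled in with scikit-learn. Because the targets are one-hot and the network outputs softmax probabilities, both are first converted back to class ids with `argmax`. The arrays below are made up for three classes; in the notebook they would presumably be `y_test` and `model.predict(X_test_emb)`.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up one-hot targets and softmax outputs for 3 classes, for illustration only.
y_true_onehot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.3, 0.4, 0.3],
                   [0.1, 0.6, 0.3]])

y_true = y_true_onehot.argmax(axis=1)   # back from one-hot to class ids
y_pred = y_prob.argmax(axis=1)          # most probable class per row

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_true, y_pred, zero_division=0))
```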

## 3.5 Conclusion

compare my results to other models?
- https://www.kaggle.com/code/satwiksrivastava/emoji-prediction/notebook
- https://huggingface.co/spaces/ml6team/emoji_predictor

### Bayes rule
1. Vector of length 6: the frequency of each emotion t.
2. Vector of length 25: the frequency of each emoji j.
3. Matrix of size 25x6 (rows: emoji j, columns: emotion t), where entry (j, t) is the frequency of emotion t given emoji j.
### Use Bayes' rule to create another table
4. Matrix of size 6x25 (rows: emotion t, columns: emoji j), where entry (t, j) is the probability of emoji j given emotion t.
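
The four steps can be sketched with pandas on a toy table. The `pairs` frame below is a made-up stand-in for `df_emoji_emotion`, and relative frequencies are used in place of raw counts. Step 4 applies Bayes' rule, P(j | t) = P(t | j) · P(j) / P(t).

```python
import pandas as pd

# Toy (Emoji, Emotions) pairs, for illustration only.
pairs = pd.DataFrame({
    "Emoji":    ["😂", "😂", "😂", "😭", "😭", "🔥"],
    "Emotions": ["joy", "joy", "sadness", "sadness", "sadness", "joy"],
})

p_emotion = pairs["Emotions"].value_counts(normalize=True)   # step 1: P(t)
p_emoji = pairs["Emoji"].value_counts(normalize=True)        # step 2: P(j)
# step 3: P(t | j), one row per emoji, one column per emotion
p_emotion_given_emoji = pd.crosstab(pairs["Emoji"], pairs["Emotions"], normalize="index")

# step 4, Bayes' rule: P(j | t) = P(t | j) * P(j) / P(t)
joint = p_emotion_given_emoji.mul(p_emoji, axis=0)           # P(t | j) * P(j), aligned on emoji
p_emoji_given_emotion = joint.div(p_emotion, axis=1).T       # rows: emotion, columns: emoji
print(p_emoji_given_emotion.round(3))
```

Each row of the final table sums to 1, and the most probable emoji for an emotion is the column with the largest value in that row.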