Special thanks to Alireza Akhavan

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras.preprocessing import sequence

## Dataset

Dataset EMOJISET

You have a tiny dataset (X, Y) where:
- X contains 127 sentences (strings)
- Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence

<img src="https://github.com/Alireza-Akhavan/rnn-notebooks/blob/master/images/data_set.png?raw=1" style="width:700px;height:300px;">


In [None]:
def read_csv(filename):
    data_frame = pd.read_csv(filename)
    X = np.asarray(data_frame["sentence"])
    Y = np.asarray(data_frame["label"], dtype=int)
    return X, Y

X_train, Y_train = read_csv('datasets/Emoji_Text_Classification/train.csv')
X_test, Y_test = read_csv('datasets/Emoji_Text_Classification/test.csv')

In [None]:
max_len = max(len(s.split()) for s in X_train)  # number of words in the longest sentence
max_len

In [None]:
def label_to_emoji(label):
    emojis = ["❤️", "🏐", "😄", "😞", "🍴"]
    return emojis[label]

In [None]:
index = 7
print(X_train[index], label_to_emoji(Y_train[index]))

## Emojifier-V1

<center>
<img src="https://github.com/Alireza-Akhavan/rnn-notebooks/blob/master/images/image_1.png?raw=1" style="width:900px;height:300px;">
</center>


In [None]:
# convert labels to one hot
Y_train_oh = tf.keras.utils.to_categorical(Y_train, 5)
Y_test_oh = tf.keras.utils.to_categorical(Y_test, 5)

In [None]:
index = 7
print(Y_train[index], "is converted into one hot", Y_train_oh[index])
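
For intuition, one-hot encoding can be reproduced with plain NumPy. The sketch below uses a made-up label array (`labels` is hypothetical, not the notebook's `Y_train`) and indexes an identity matrix to obtain the rows that `to_categorical` would produce for five classes.

```python
import numpy as np

# Hypothetical labels for illustration; not taken from the dataset.
labels = np.array([0, 3, 4, 1])
num_classes = 5

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype=np.float32)[labels]
print(one_hot)
```

Taking `argmax` along the last axis of `one_hot` recovers the original integer labels.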

In [None]:
!wget http://nlp.stanford.edu/data/glove.6B.zip --no-verbose
!unzip -q glove.6B.zip

In [None]:
def read_glove_vecs(glove_file):
    with open(glove_file, encoding="utf8") as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        words_to_index = {}
        for i, w in enumerate(sorted(words)):
            words_to_index[w] = i + 1

    return words_to_index, word_to_vec_map

In [None]:
word_to_index, word_to_vec_map = read_glove_vecs('glove.6B/glove.6B.50d.txt')

In [None]:
word_to_index["ali"]

In [None]:
word_to_vec_map["ali"]

In [None]:
# convert sentence to average of the word vectors
def sentence_to_avg(sentence):
    words = sentence.lower().split()
    sum_vectors = np.zeros((50,))
    for w in words:
        sum_vectors += word_to_vec_map[w] 
    
    avg_vectors = sum_vectors / len(words)
    return avg_vectors

In [None]:
sentence_to_avg("Pasta is my favorite food")
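
`sentence_to_avg` raises a `KeyError` as soon as a sentence contains a word that is missing from the GloVe vocabulary. One possible workaround, sketched here with a toy three-dimensional vocabulary (`toy_vectors` and `sentence_to_avg_safe` are hypothetical names, not part of the notebook), is to skip unknown words and average only the ones that were found.

```python
import numpy as np

# Toy embeddings for illustration only; the notebook's GloVe vectors have 50 dimensions.
toy_vectors = {
    "pasta": np.array([1.0, 0.0, 2.0]),
    "is": np.array([0.0, 1.0, 0.0]),
    "good": np.array([2.0, 2.0, 1.0]),
}

def sentence_to_avg_safe(sentence, vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return np.zeros((dim,))
    return np.mean(known, axis=0)

# "sooo" is not in the toy vocabulary, so the mean is taken over three words.
avg = sentence_to_avg_safe("Pasta is sooo good", toy_vectors, dim=3)
print(avg)
```

Whether silently dropping words is acceptable depends on the data; for a dataset this small, it may be worth logging the skipped words as well.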

In [None]:
X_train_avg = []
for i in range(X_train.shape[0]):
    X_train_avg.append(sentence_to_avg(X_train[i]))

X_train_avg = np.array(X_train_avg)

In [None]:
print(X_train_avg.shape)
print(Y_train_oh.shape)

In [None]:
class EmojiNet_V1(Model):
    def __init__(self):
        super().__init__()
        self.dense = Dense(5, input_shape=(50,), activation='softmax')

    def call(self, x):
        x = self.dense(x)
        return x

In [None]:
model = EmojiNet_V1()

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [None]:
model.fit(X_train_avg, Y_train_oh, epochs=400)

In [None]:
X_test_avg = []
for i in range(X_test.shape[0]):
    X_test_avg.append(sentence_to_avg(X_test[i]))

X_test_avg = np.array(X_test_avg)

In [None]:
model.evaluate(X_test_avg, Y_test_oh)

Random guessing would have had 20% accuracy given that there are 5 classes. This is pretty good performance after training on such a tiny dataset.
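
The 20% figure assumes uniform guessing over five classes. A slightly stronger baseline is to always predict the most frequent class. The sketch below computes both on a made-up label array (`toy_labels` is hypothetical; substituting `Y_test` would give the real numbers).

```python
import numpy as np

# Hypothetical labels for illustration; replace with Y_test in the notebook.
toy_labels = np.array([0, 2, 2, 3, 1, 2, 4, 0, 3, 2])
num_classes = 5

# Accuracy of guessing uniformly at random among the classes.
chance_accuracy = 1.0 / num_classes

# Accuracy of always predicting the most frequent class.
counts = np.bincount(toy_labels, minlength=num_classes)
majority_accuracy = counts.max() / len(toy_labels)

print(f"uniform guessing: {chance_accuracy:.2f}")
print(f"always predicting class {counts.argmax()}: {majority_accuracy:.2f}")
```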

In the training set, the algorithm saw the sentence "*I love you*" with the label ❤️. You can check, however, that the word "adore" does not appear in the training set. Nonetheless, let's see what happens if you write "*I adore you*".



In [None]:
X_me = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_me = np.array([[0], [0], [2], [1], [4], [3]])

X_me_avg = []
for x in X_me:
    X_me_avg.append(sentence_to_avg(x))
X_me_avg = np.array(X_me_avg)

pred = model.predict(X_me_avg)

for i in range(X_me.shape[0]):
    print(X_me[i], label_to_emoji(np.argmax(pred[i])))

Amazing! Because *adore* has an embedding similar to that of *love*, the algorithm has generalized correctly even to a word it has never seen before. Words such as *heart*, *dear*, *beloved* or *adore* have embedding vectors similar to *love*, and so might work too.
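
One way to check this claim is to compare embeddings with cosine similarity. The sketch below uses toy vectors (`toy_vectors` and its values are made up for illustration); in the notebook you could pass `word_to_vec_map["love"]` and `word_to_vec_map["adore"]` to the same function instead.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors for illustration only; not real GloVe values.
toy_vectors = {
    "love": np.array([0.9, 0.8, 0.1]),
    "adore": np.array([0.8, 0.9, 0.2]),
    "ball": np.array([-0.7, 0.1, 0.9]),
}

sim_love_adore = cosine_similarity(toy_vectors["love"], toy_vectors["adore"])
sim_love_ball = cosine_similarity(toy_vectors["love"], toy_vectors["ball"])
print(sim_love_adore, sim_love_ball)
```

With these toy values, the first similarity is close to 1 and the second is negative, which is the kind of gap you would hope to see between related and unrelated words.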

Feel free to modify the inputs above and try out a variety of input sentences. How well does it work?



## Emojifier-V2: Using RNNs

Let's build an LSTM model that takes as input word sequences. This model will be able to take word ordering into account. Emojifier-V2 will continue to use pre-trained word embeddings to represent words, but will feed them into an LSTM, whose job it is to predict the most appropriate emoji. 
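
To see why word order matters, note that the averaging used in Emojifier-V1 produces identical inputs for any reordering of the same words. The sketch below demonstrates this with random stand-in vectors (`toy_vectors` is hypothetical, not GloVe): the averaged representations match, while the ordered sequences that an LSTM would consume do not.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-ins for word embeddings; illustration only.
toy_vectors = {w: rng.normal(size=4) for w in ["happy", "not", "sad", "but"]}

def as_sequence(sentence):
    """Stack word vectors in sentence order: shape (num_words, 4)."""
    return np.stack([toy_vectors[w] for w in sentence.split()])

seq_a = as_sequence("not happy but sad")
seq_b = as_sequence("not sad but happy")

# Averaging discards order, so both sentences collapse to the same vector.
same_average = np.allclose(seq_a.mean(axis=0), seq_b.mean(axis=0))
# The ordered sequences themselves differ.
same_sequence = np.array_equal(seq_a, seq_b)
print(same_average, same_sequence)
```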

The Keras layers used below (`LSTM`, `Dropout`, `Dense`) were already imported at the top of the notebook.

<img src="https://github.com/Alireza-Akhavan/rnn-notebooks/blob/master/images/emojifier-v2.png?raw=1" style="width:700px;height:400px;"> <br>
<caption><center> **Figure 3**: Emojifier-V2. A 2-layer LSTM sequence classifier. </center></caption>



<img src="https://github.com/Alireza-Akhavan/rnn-notebooks/blob/master/images/embedding1.png?raw=1" style="width:700px;height:250px;">
<caption><center> **Figure 4**: Embedding layer. This example shows the propagation of two examples through the embedding layer. Both have been zero-padded to a length of `max_len=5`. The final dimension of the representation is  `(2,max_len,50)` because the word embeddings we are using are 50 dimensional. </center></caption>

In [None]:
def convert_sentences_to_embeddings(X):
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # dimensionality of the GloVe word vectors (= 50)

    # one zero-padded (max_len, emb_dim) matrix per sentence
    emb_matrix = np.zeros((X.shape[0], max_len, emb_dim))

    for i in range(X.shape[0]):
        words = X[i].lower().split()[:max_len]          # truncate sentences longer than max_len
        for j in range(len(words)):
            emb_matrix[i, j, :] = word_to_vec_map[words[j]]

    return emb_matrix

In [None]:
X_me = np.array(["funny lol", "lets play baseball", "food is ready for you"])
print(X_me)
print(convert_sentences_to_embeddings(X_me))

In [None]:
X_train_embs = convert_sentences_to_embeddings(X_train)
X_train_embs.shape

In [None]:
Y_train_oh = tf.keras.utils.to_categorical(Y_train, 5)
Y_train_oh.shape

In [None]:
class EmojiNet_V2(Model):
    def __init__(self, input_shape):
        super().__init__()
        
        self.lstm_1 = LSTM(128, return_sequences=True)
        self.dropout_1 = Dropout(0.5)
        self.lstm_2 = LSTM(128)
        self.dropout_2 = Dropout(0.5)
        self.dense = Dense(5, activation='softmax')

    def call(self, x):
        x = self.lstm_1(x)
        x = self.dropout_1(x)
        x = self.lstm_2(x)
        x = self.dropout_2(x)
        x = self.dense(x)
        return x

In [None]:
model = EmojiNet_V2((max_len,))

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='categorical_crossentropy', 
              metrics=['accuracy'])

In [None]:
model.summary()

It's time to train your model. Your Emojifier-V2 `model` takes as input an array of shape (`m`, `max_len`, `50`), in which each word is already represented by its GloVe vector, and outputs probability vectors of shape (`m`, `number of classes`). We therefore use `X_train_embs` (sentences converted to zero-padded sequences of word vectors) as the input and `Y_train_oh` (labels as one-hot vectors) as the target.

In [None]:
model.fit(X_train_embs, Y_train_oh, epochs=50, batch_size=32, shuffle=True)

Your model should reach close to **100% accuracy** on the training set. The exact accuracy you get may be a little different. Run the following cell to evaluate your model on the test set.

In [None]:
X_test_embs = convert_sentences_to_embeddings(X_test)
Y_test_oh = tf.keras.utils.to_categorical(Y_test, 5)
model.evaluate(X_test_embs, Y_test_oh)

You should get a test accuracy between 80% and 95%. Run the cell below to see the mislabelled examples. 

In [None]:
# This code allows you to see the mislabelled examples

pred = model.predict(X_test_embs)
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print(X_test[i], '| expected:', label_to_emoji(Y_test[i]), '| predicted:', label_to_emoji(num))
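
Beyond listing individual errors, a confusion matrix summarises which emojis get mistaken for which. This is a minimal NumPy sketch on made-up arrays (`y_true` and `y_pred` are hypothetical); in the notebook, `Y_test` and `np.argmax(pred, axis=1)` would take their place.

```python
import numpy as np

# Hypothetical labels and predictions for illustration only.
y_true = np.array([0, 0, 1, 2, 2, 3, 4, 4])
y_pred = np.array([0, 2, 1, 2, 2, 2, 4, 3])
num_classes = 5

# confusion[i, j] counts examples of true class i predicted as class j.
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (y_true, y_pred), 1)

# Correct predictions sit on the diagonal.
accuracy = np.trace(confusion) / confusion.sum()
print(confusion)
print("accuracy:", accuracy)
```

Off-diagonal entries in a given column show which true classes are being absorbed by that predicted class.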

Now you can try it on your own example. Write your own sentence below. 

In [None]:
# Change the sentence below to see your prediction. Make sure all the words are in the GloVe embeddings.
x_me = np.array(['not feeling happy'])
X_me_embs = convert_sentences_to_embeddings(x_me)
pred = model.predict(X_me_embs)
print(x_me[0], label_to_emoji(np.argmax(pred)))

Previously, the Emojifier-V1 model did not correctly label "not feeling happy," but our implementation of Emojifier-V2 got it right. (Keras' outputs vary slightly from run to run, so you may not have obtained the same result.) The current model still isn't very robust at understanding negation (like "not happy") because the training set is small and so doesn't have a lot of examples of negation. But if the training set were larger, the LSTM model would be much better than the Emojifier-V1 model at understanding such complex sentences.
