In [1]:
import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt

%matplotlib inline

### Dataset: EmojiSet
We have a tiny dataset (X, Y) where:
- X contains 127 sentences (strings).
- Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence.


In [2]:
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')


In [3]:
# Find the maximum sentence length (in words) in X_train
maxLen = max(len(sentence.split()) for sentence in X_train)
print(maxLen)


10


In [4]:
for idx in range(10):
    print(X_train[idx], label_to_emoji(Y_train[idx]))

never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾


In [21]:
# Reading the Glove Vectors
def read_glove_vecs(glove_file):
    with open(glove_file, 'r',encoding="utf-8") as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map

In [22]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

We have loaded:
- `word_to_index`: dictionary mapping words to their indices in the vocabulary
    - (400,000 words, with valid indices ranging from 1 to 400,000; index 0 is not assigned to any word)
- `index_to_word`: dictionary mapping indices to their corresponding words in the vocabulary
- `word_to_vec_map`: dictionary mapping words to their GloVe vector representation.
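
As a rough illustration of how these three dictionaries relate, the sketch below rebuilds them from three made-up lines in the GloVe text format. The `toy_` names and the vectors are invented for this example; only the indexing scheme (sorted words, indices starting at 1) follows `read_glove_vecs`.

```python
import numpy as np

# Three made-up lines in GloVe text format: a word followed by its vector components.
toy_glove_lines = [
    "banana 0.1 0.2 0.3",
    "apple 0.4 0.5 0.6",
    "cherry 0.7 0.8 0.9",
]

toy_word_to_vec_map = {}
for line in toy_glove_lines:
    parts = line.split()
    toy_word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

# Indices are assigned in sorted order, starting at 1, so index 0 stays unused.
toy_word_to_index = {w: i for i, w in enumerate(sorted(toy_word_to_vec_map), start=1)}
toy_index_to_word = {i: w for w, i in toy_word_to_index.items()}

print(toy_word_to_index)     # {'apple': 1, 'banana': 2, 'cherry': 3}
print(toy_index_to_word[2])  # banana
```

Because no word ever receives index 0, that value can safely be used to mark padded positions later on.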

In [23]:
word = "cucumber"
idx = 289846
print("the index of", word, "in the vocabulary is", word_to_index[word])
print("the", str(idx) + "th word in the vocabulary is", index_to_word[idx])

the index of cucumber in the vocabulary is 113317
the 289846th word in the vocabulary is potatos


In [24]:
word_to_vec_map['orange']

array([-0.42783 ,  0.43089 , -0.50351 ,  0.5776  ,  0.097786,  0.2608  ,
       -0.68767 , -0.31936 , -0.25337 , -0.37255 , -0.045907, -0.53688 ,
        0.97511 , -0.44595 , -0.50414 , -0.086751, -1.0645  ,  0.36625 ,
       -0.52428 , -1.3413  , -0.2391  , -0.58808 ,  0.56378 , -0.062501,
       -1.7429  , -0.88077 , -0.27933 ,  1.4705  ,  0.50436 , -0.69174 ,
        2.0018  ,  0.26663 , -0.85679 , -0.18893 , -0.021125, -0.055118,
       -0.50337 , -0.67157 ,  0.55502 , -0.8009  ,  0.10695 ,  0.1459  ,
       -0.55588 , -0.64971 ,  0.22046 ,  0.67415 , -0.45119 , -1.1462  ,
        0.16348 , -0.62946 ])

### Model 

* We will build an LSTM model that will take sequences as input
* To represent words we will use the pretrained word embeddings
* For a given sentence (a sequence of words) the model will learn to predict the most appropriate emoji

In [25]:
np.random.seed(0)
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
np.random.seed(1)

Using TensorFlow backend.


### Padding

To handle sequences of different lengths we pad them:
* We set a maximum sequence length
* We pad all the sequences to this same length

Padding is needed because most deep learning frameworks require all sequences in the same mini-batch to have the same length. If we had a 3-word sentence and a 4-word sentence, the computations needed for them would differ (one takes 3 steps of an LSTM, the other takes 4), so they could not be processed together in one batch without padding.
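
A minimal sketch of the idea, assuming sentences have already been converted to lists of word indices. The helper name `pad_to_length` and the index values are made up for illustration.

```python
import numpy as np

def pad_to_length(index_lists, max_len, pad_value=0):
    """Right-pad each list of word indices with pad_value up to max_len."""
    padded = np.full((len(index_lists), max_len), pad_value, dtype=np.int64)
    for row, indices in enumerate(index_lists):
        padded[row, :len(indices)] = indices
    return padded

# A 3-word sentence and a 4-word sentence become rows of equal length.
batch = pad_to_length([[7, 3, 9], [4, 8, 1, 2]], max_len=5)
print(batch)
# [[7 3 9 0 0]
#  [4 8 1 2 0]]
```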

### Embedding Layer 

* In Keras, the embedding matrix is represented as a layer
* The embedding matrix maps word indices to word vectors
* For the embedding vectors we use pre-trained GloVe word vectors
* For this we use [Embedding()](https://keras.io/layers/embeddings/) in Keras

#### Input and output
  The input of the embedding layer has shape (batch_size, max_len) and contains the indices of the words in each sentence. The layer outputs an array of shape (batch_size, max_len, embedding dimension). The embedding dimension in our case is 50.
  
We will initialise the embedding layer with the 50-dimensional GloVe vectors.
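
The shape change can be mimicked in plain NumPy, where indexing a matrix with an integer array picks one row per index. This is only a sketch of the lookup, not how Keras implements the layer; the sizes and the random matrix are toy values.

```python
import numpy as np

vocab_len, emb_dim = 6, 4             # toy sizes; the notebook uses 400,001 and 50
rng = np.random.default_rng(0)
toy_emb_matrix = rng.normal(size=(vocab_len, emb_dim))
toy_emb_matrix[0] = 0.0               # row 0 is the padding row

# A batch of 2 sentences, each padded to max_len = 3.
toy_indices = np.array([[1, 5, 0],
                        [2, 3, 4]])

# Integer-array indexing looks up one embedding row per word index.
toy_embeddings = toy_emb_matrix[toy_indices]
print(toy_embeddings.shape)  # (2, 3, 4) = (batch_size, max_len, emb_dim)
```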

In [26]:
for idx, val in enumerate(["I", "like", "learning"]):
    print(idx,val)

0 I
1 like
2 learning


In [31]:
# First we need a function that converts an array of sentences (strings) into an array of indices. The output of this function
# will be given to the Embedding layer
def sentences_to_indices(X, word_to_index, max_len):
    """
    Arguments:
    X -- array of sentences (strings), of shape (m,)
    word_to_index -- a dictionary mapping each word to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    
    """
    m = X.shape[0] 
    X_indices = np.zeros((m,max_len))
    for i in range(m):
        words = X[i].lower().split()
        j  = 0
        for w in words:
            X_indices[i,j] = word_to_index[w]
            j+=1
    return X_indices

In [33]:
X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1,word_to_index, max_len = 5)
print("X1 =", X1)
print("X1_indices =\n", X1_indices)


X1 = ['funny lol' 'lets play baseball' 'food is ready for you']
X1_indices =
 [[155345. 225122.      0.      0.      0.]
 [220930. 286375.  69714.      0.      0.]
 [151204. 192973. 302254. 151349. 394475.]]


In [36]:
# creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,000 words, indices starting at 1)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    vocab_len = len(word_to_index) + 1  # +1 because word indices start at 1; row 0 stays all zeros and matches the padding index
    emb_dim = word_to_vec_map['cucumber'].shape[0]
    
    # Initializing Embedding Matrix
    emb_mat = np.zeros((vocab_len,emb_dim))
    
    # Now we need to set each row of the matrix as the corresponding word vector
    for word,index in word_to_index.items():
        emb_mat[index,:] = word_to_vec_map[word]
    
    # Define a Keras embedding layer and make it non-trainable, since we do not need training to modify
    # these pretrained embeddings
    emb_layer = Embedding(vocab_len,emb_dim,trainable = False)
    
    # Before Setting the weights of the embedding layer it is required to build the embedding layer
    emb_layer.build((None,))
    emb_layer.set_weights([emb_mat])
    return emb_layer


In [40]:
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights[0][1][2] =", embedding_layer.get_weights()[0][1][2])

weights[0][1][2] = 0.65282


In [41]:
# Building the model
# The model takes as input an array of sentences of shape (m, max_len, ) defined by input_shape.
# The model outputs a softmax probability vector of shape (m, C = 5).

# This is a function to create emoji model's graph

def emoji_model(input_shape, word_to_vec_map, word_to_index):
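    """
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary
    
    Returns:
    model -- a Keras Model instance
    """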
    sentence_indices = Input(input_shape,dtype = 'int32') # Define sentence_indices as the input of the graph.
    emb_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagating sentence_indices through the embedding layer
    embeddings = emb_layer(sentence_indices)
    
    # Now we propagate the embeddings through an LSTM layer with a 128-dimensional hidden state
    # The returned output should be a batch of sequences.
    X = LSTM(128,return_sequences = True)(embeddings)
    X = Dropout(0.5)(X) # To prevent overfitting we add dropout with rate 0.5
    X = LSTM(128, return_sequences=False)(X) # the output should be a single hidden state per sentence, not a sequence
    X = Dropout(0.5)(X)
    # Propagating X through a Dense layer with 5 units (because there are 5 emojis, i.e. 5 labels)
    X = Dense(5)(X)
    X = Activation('softmax')(X)
    
    # Creating the model
    model = Model(inputs = sentence_indices,outputs = X)
    return model

In [42]:
model = emoji_model((maxLen,), word_to_vec_map, word_to_index)
model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 10)                0         
_________________________________________________________________
embedding_6 (Embedding)      (None, 10, 50)            20000050  
_________________________________________________________________
lstm_1 (LSTM)                (None, 10, 128)           91648     
_________________________________________________________________
dropout_1 (Dropout)          (None, 10, 128)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645 
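
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterisation (four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for this check.

```python
def lstm_param_count(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

embedding_params = 400001 * 50              # vocab_len * emb_dim
lstm_1_params = lstm_param_count(50, 128)   # input is the 50-d embedding
lstm_2_params = lstm_param_count(128, 128)  # input is the 128-d output of lstm_1
dense_params = dense_param_count(128, 5)

print(embedding_params, lstm_1_params, lstm_2_params, dense_params)
# 20000050 91648 131584 645
```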

In [43]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

### Training The Model

Our model takes as input an array of shape (m, max_len) and outputs probability vectors of shape (m, number of classes). We thus have to convert X_train (array of sentences as strings) to X_train_indices (array of sentences as list of word indices), and Y_train (labels as indices) to Y_train_oh (labels as one-hot vectors).
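
The label conversion can be sketched as follows. `one_hot` here is a hypothetical stand-in written for illustration; the notebook itself uses `convert_to_one_hot` from `emo_utils`, whose implementation is not shown.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical stand-in: row i of the identity matrix encodes class i."""
    return np.eye(num_classes)[np.asarray(labels, dtype=int).reshape(-1)]

toy_labels = np.array([3, 0, 4])
toy_one_hot = one_hot(toy_labels, num_classes=5)
print(toy_one_hot)
# [[0. 0. 0. 1. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]
```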

In [44]:
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)

In [45]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.callbacks.History at 0x1c884aac208>

### Evaluating the model on the Test Set


In [48]:
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.7142857313156128


In [49]:
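# Predictions with a simple averaging baseline (softmax over averaged word vectors), not the LSTM model above
def predict(X, Y, W, b, word_to_vec_map):
    """
    Given X (sentences) and Y (emoji indices), predict emojis and compute the accuracy of the model over the given set.
    
    Arguments:
    X -- input data containing sentences, numpy array of shape (m,)
    Y -- labels, containing index of the label emoji, numpy array of shape (m, 1)
    W -- weight matrix of the softmax layer, numpy array of shape (number of classes, 50)
    b -- bias of the softmax layer, numpy array of shape (number of classes,)
    word_to_vec_map -- dictionary mapping words to their 50-dimensional GloVe vector representation
    
    Returns:
    pred -- numpy array of shape (m, 1) with the predictions
    """
    m = X.shape[0]
    pred = np.zeros((m, 1))
    
    for j in range(m):                       # Loop over the examples in X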
        
        # Split jth test example (sentence) into list of lower case words
        words = X[j].lower().split()
        
        # Average words' vectors
        avg = np.zeros((50,))
        for w in words:
            avg += word_to_vec_map[w]
        avg = avg/len(words)

        # Forward propagation
        Z = np.dot(W, avg) + b
        A = softmax(Z)
        pred[j] = np.argmax(A)
        
    print("Accuracy: "  + str(np.mean((pred[:] == Y.reshape(Y.shape[0],1)[:]))))
    
    return pred


In [50]:
def print_predictions(X, pred):
    print()
    for i in range(X.shape[0]):
        print(X[i], label_to_emoji(int(pred[i])))

### Predicting the Emoji for any given sentence

In [55]:
sample_sentence = np.array(['not feeling happy'])
sample_sentence_indices = sentences_to_indices(sample_sentence, word_to_index, maxLen)
print(sample_sentence[0] +' '+  label_to_emoji(np.argmax(model.predict(sample_sentence_indices))))

not feeling happy 😞
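
The cell above calls `np.argmax` without an axis, which works because only one sentence is predicted. For several sentences at once, a per-row argmax is needed. The sketch below uses made-up probabilities in place of real `model.predict` output.

```python
import numpy as np

# Made-up softmax outputs for 3 sentences over 5 classes (each row sums to 1).
toy_probs = np.array([
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.10, 0.10],
    [0.10, 0.10, 0.20, 0.10, 0.50],
])

# axis=1 returns one label per sentence instead of a single flattened index.
toy_pred_labels = np.argmax(toy_probs, axis=1)
print(toy_pred_labels)  # [3 0 4]
```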
