# Emojify-AI


In [1]:
import numpy as np
import tensorflow
np.random.seed(0)
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.initializers import glorot_uniform
from utils import *

### Load the word vectors 

In this notebook, we'll be using GloVe vectors for our word embeddings. The GloVe vectors were introduced in the following paper:

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).

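The paper link above points to the PDF only. You can download the GloVe text file with the embeddings (`glove.6B.50d.txt`) from the [GloVe project page](https://nlp.stanford.edu/projects/glove/).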

In [2]:
word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')
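
`read_glove_vecs` comes from the notebook's `utils` module, which is not shown here. The following is only a minimal sketch of what such a loader might look like. It assumes the standard GloVe text format (one word per line, followed by its space-separated vector components) and assumes that indices are assigned in sorted word order starting at 1. The name `read_glove_vecs_sketch` and the toy data are hypothetical.

```python
import io
import numpy as np

def read_glove_vecs_sketch(lines):
    """Parse GloVe-format lines into three lookup structures (hypothetical helper)."""
    word_to_vec_map = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:
            continue  # skip blank lines
        # First token is the word, the remaining tokens are the vector components
        word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

    # Assumption: indices follow sorted word order and start at 1
    word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec_map), start=1)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word, word_to_vec_map

# Tiny in-memory stand-in for a GloVe file with 3-dimensional vectors
toy_file = io.StringIO("love 0.1 0.2 0.3\nhungry 0.4 0.5 0.6\n")
w2i, i2w, w2v = read_glove_vecs_sketch(toy_file)
print(w2i)
print(w2v["love"].shape)
```

Starting the indices at 1 would leave row 0 of an embedding matrix free, which is consistent with the "plus one" vocabulary size used later in this notebook.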

### Load the training and testing sets

In [3]:
X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/test_emoji.csv')

We need to find the maximum sentence length, measured in words, in the training set.

In [4]:
maxLen = len(max(X_train, key=lambda x: len(x.split())).split())
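
The one-liner above picks the sentence with the most whitespace-separated words and then counts its words. As a small illustration on made-up sentences, it gives the same result as taking the maximum of the word counts directly:

```python
sentences = ["I love you", "I want to play right now", "so hungry"]

# Form used in the notebook: pick the longest sentence by word count, then count its words
max_len_a = len(max(sentences, key=lambda x: len(x.split())).split())

# Equivalent form: take the maximum of the per-sentence word counts
max_len_b = max(len(s.split()) for s in sentences)

print(max_len_a, max_len_b)
```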

Convert the training sentences into arrays of word indices, and convert the labels into one-hot vectors.

In [5]:
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
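
Both helpers are defined in `utils`, which is not shown. The sketch below illustrates one plausible behavior for each. It assumes lowercasing, whitespace tokenization, zero padding up to `max_len`, and a fallback to index 0 for unknown words. The `_sketch` function names and the toy vocabulary are hypothetical.

```python
import numpy as np

def sentences_to_indices_sketch(X, word_to_index, max_len):
    """Map each sentence to a zero-padded row of word indices (hypothetical helper)."""
    X_indices = np.zeros((len(X), max_len), dtype=np.int32)
    for i, sentence in enumerate(X):
        # Assumption: words are lowercased and split on whitespace
        for j, word in enumerate(sentence.lower().split()[:max_len]):
            # Assumption: out-of-vocabulary words fall back to index 0
            X_indices[i, j] = word_to_index.get(word, 0)
    return X_indices

def convert_to_one_hot_sketch(Y, C):
    """Turn integer labels of shape (m,) into one-hot rows of shape (m, C)."""
    return np.eye(C)[np.asarray(Y).reshape(-1)]

toy_index = {"i": 1, "love": 2, "you": 3}
indices = sentences_to_indices_sketch(np.array(["I love you", "love"]), toy_index, max_len=4)
one_hot = convert_to_one_hot_sketch(np.array([0, 3]), C=5)
print(indices)
print(one_hot)
```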

## Embedding Layer

The embedding matrix maps word indices to embedding vectors. In TensorFlow Keras, the embedding matrix is represented as a "layer."

The `Embedding()` layer takes as input an integer matrix of shape **(batch size, max input length)**, where each row is one sentence encoded as word indices.

The layer outputs an array of shape **(batch size, max input length, dimension of word vectors)**.

<img src="images/embedding1.png" style="width:700px;height:250px;">
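
Conceptually, the layer is a table lookup. The following NumPy-only sketch is not the Keras implementation, and its sizes are toy values chosen for illustration. It shows how indexing an embedding matrix with an integer array produces the shapes described above.

```python
import numpy as np

batch_size, max_len, vocab_size, emb_dim = 2, 4, 6, 3

# Toy embedding matrix: row k holds the vector for word index k
emb_matrix = np.arange(vocab_size * emb_dim, dtype=np.float32).reshape(vocab_size, emb_dim)

# Integer input of shape (batch size, max input length); 0 is used here as padding
sentence_indices = np.array([[1, 2, 3, 0],
                             [5, 4, 0, 0]])

# Indexing with an integer array replaces every index by its row of emb_matrix
embedded = emb_matrix[sentence_indices]

print(sentence_indices.shape)
print(embedded.shape)
```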

We will implement this layer by doing the following:

1. Initialize the embedding matrix as a numpy array of zeros.
    * The embedding matrix has a row for each unique word in the vocabulary.
    * There is one additional row to handle "unknown" words.
    * So `vocab_size` is the number of unique words plus one.
    * Each row will store the vector representation of one word. 
        * For example, one row may be 50 positions long if using GloVe word vectors.
    * In the code below, `emb_dim` represents the length of a word embedding.
2. Fill in each row of the embedding matrix with the vector representation of a word.
    * Each word in `word_to_index` is a string.
    * `word_to_vec_map` is a dictionary where the keys are strings and the values are the word vectors.
3. Define the Keras embedding layer. 
    * Use [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding). 
    * The input dimension is equal to the vocabulary length (number of unique words plus one).
    * The output dimension is equal to the number of positions in a word embedding.

In [6]:
def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping words to their indices in the vocabulary (400,000 words, so vocab_size is 400,001)

    Returns:
    embedding_layer -- pretrained Embedding layer Keras instance
    """
    
    vocab_size = len(word_to_index) + 1  # Adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map[next(iter(word_to_vec_map))].shape[0]  # Dimensionality of GloVe word vectors (50)
      
    # Step 1: Initialize the embedding matrix with zeros
    emb_matrix = np.zeros((vocab_size, emb_dim))
    
    # Step 2: Fill the embedding matrix with GloVe word vectors
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Step 3: Create Keras Embedding layer
    embedding_layer = Embedding(input_dim=vocab_size, output_dim=emb_dim, trainable=False)

    # Build the embedding layer before setting the weights
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

## Building the Emojify Model

<img src="images/emojifier-v2.png" style="width:700px;height:400px;"> <br>

* The model takes as input an array of sentences of shape (`m`, `max_len`), where the per-example shape (`max_len`,) is given by `input_shape`.
* The model outputs a softmax probability vector of shape (`m`, `C = 5`).

* We will need to use the following Keras layers:
    * [Input()](https://www.tensorflow.org/api_docs/python/tf/keras/Input)
        * Set the `shape` and `dtype` parameters.
        * The inputs are integers, so we can specify the data type as a string, `int32`.
    * [LSTM()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
        * Set the `units` and `return_sequences` parameters.
    * [Dropout()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout)
        * Set the `rate` parameter.
    * [Dense()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)
        * Set the `units` parameter.
        * Set the `activation` parameter to `softmax`.
    * [Model()](https://www.tensorflow.org/api_docs/python/tf/keras/Model)
        * Set `inputs` and `outputs`
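
The softmax output mentioned above turns the five raw scores of the final `Dense` layer into probabilities that sum to one. The following is a minimal NumPy sketch of that computation, using made-up scores. Keras computes this internally when `activation='softmax'` is set.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum first for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1, 0.0, -1.0]])  # made-up scores for one sentence
probs = softmax(logits)
print(probs.shape, probs.sum(axis=-1))
```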

In [7]:
def Emojify(input_shape, word_to_vec_map, word_to_index):
    """
    Create Emojify-v2 model's graph.
    
    Args:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping words to their 50-dimensional vector representations
    word_to_index -- dictionary mapping words to their indices in the vocabulary

    Returns:
    model -- Keras model instance
    """
    
    # Input layer for sentence indices
    sentence_indices = Input(shape=input_shape, dtype='int32')
    
    # Embedding layer
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    embeddings = embedding_layer(sentence_indices)
    
    # LSTM layer 1
    X = LSTM(units=128, return_sequences=True)(embeddings)
    X = Dropout(rate=0.5)(X)
    
    # LSTM layer 2
    X = LSTM(units=128, return_sequences=False)(X)
    X = Dropout(rate=0.5)(X)
    
    # Dense layer
    X = Dense(units=5, activation='softmax')(X)
    
    # Create the model
    model = Model(inputs=sentence_indices, outputs=X)
    
    return model

In [8]:
model = Emojify((maxLen,), word_to_vec_map, word_to_index)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding (Embedding)       (None, 10, 50)            20000050  
                                                                 
 lstm (LSTM)                 (None, 10, 128)           91648     
                                                                 
 dropout (Dropout)           (None, 10, 128)           0         
                                                                 
 lstm_1 (LSTM)               (None, 128)               131584    
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense (Dense)               (None, 5)                 645   
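
The `Param #` column can be checked by hand. The sketch below assumes the standard LSTM parameterization with biases enabled (four gates, each with an input kernel, a recurrent kernel, and a bias) and a `vocab_size` of 400,001, which is the value consistent with the embedding row of the summary.

```python
def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit
    return input_dim * units + units

vocab_size, emb_dim = 400_001, 50  # sizes implied by the summary above

counts = {
    "embedding": vocab_size * emb_dim,
    "lstm": lstm_params(emb_dim, 128),
    "lstm_1": lstm_params(128, 128),
    "dense": dense_params(128, 5),
}
print(counts)
```

The embedding layer accounts for almost all of the parameters, but it is created with `trainable=False`, so only the LSTM and Dense weights are updated during training.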

In [9]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [10]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x1fe3eb89d10>

Let us evaluate the model on the test set.

In [11]:
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print("Test accuracy = ", acc)


Test accuracy =  0.75
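
For a softmax classifier, the accuracy metric is the fraction of examples whose highest-probability class matches the true label. The following small sketch uses made-up predictions, not the model's actual outputs, to show the computation:

```python
import numpy as np

# Toy softmax outputs for 4 examples and C = 5 classes (made-up numbers)
pred = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                 [0.1, 0.2, 0.5, 0.1, 0.1],
                 [0.2, 0.2, 0.2, 0.3, 0.1],
                 [0.6, 0.1, 0.1, 0.1, 0.1]])
y_true = np.array([0, 2, 4, 0])

# Predicted class is the index of the largest probability in each row
y_pred = np.argmax(pred, axis=1)
accuracy = np.mean(y_pred == y_true)
print(y_pred, accuracy)
```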


Now let us try the model on a few sentences of our own and see which emoji it predicts.

In [15]:
def try_model(string):
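    """
    Predicts an emoji for a single sentence and prints the sentence followed by the emoji.
    """
    # Wrap the sentence in an array so it forms a batch of size 1
    x_test = np.array([string])
    X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
    # argmax over the (1, 5) softmax output gives the predicted class label
    print(x_test[0] + ' ' + label_to_emoji(np.argmax(model.predict(X_test_indices))))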

In [22]:
try_model('I am very hungry')

I am very hungry 🍴


In [23]:
try_model("I want to play right now")

I want to play right now ⚾


In [25]:
try_model("I love you")

I love you ❤️
