<a href="https://colab.research.google.com/github/HasibAlMuzdadid/Machine-Learning-and-Deep-Learning-Projects/blob/main/emojify%20sentences%20using%20lstm/emojify_sentences_using_lstm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Emojify Sentences using LSTM**
 

To build this model, we will use pre-trained word embeddings to represent words. We'll feed the word embeddings into an LSTM, and the LSTM will learn to predict the most appropriate emoji. This model will be able to account for word ordering. 


In [None]:
pip install emoji==1.6.3



In [None]:
import numpy as np
import tensorflow
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, Dropout, LSTM, Activation
from tensorflow.keras.layers import Embedding
import csv
import emoji

In [None]:
# helper function

def read_csv(filename = "data/emojify_data.csv"):
    phrases = []
    labels = []      # named "labels" to avoid shadowing the imported emoji module

    with open(filename) as csvDataFile:
        csvReader = csv.reader(csvDataFile)

        for row in csvReader:
            phrases.append(row[0])    # sentence text
            labels.append(row[1])     # integer label, still stored as a string

    X = np.asarray(phrases)
    Y = np.asarray(labels, dtype=int)

    return X, Y

In [None]:
emoji_dictionary = {# the ":heart:" alias may render as a black heart depending on the font, so the Unicode red heart is used
                    "0": "\u2764\ufe0f",
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):

    # Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    return emoji.emojize(emoji_dictionary[str(label)], use_aliases=True)

In [None]:
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:    # GloVe files contain non-ASCII tokens
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map
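
To see what `read_glove_vecs` produces, here is a minimal sketch that applies the same parsing logic to a made-up three-word file held in memory. The words, the vector values and the helper name `parse_glove_lines` are assumptions for illustration only and are not part of the notebook.

```python
import io
import numpy as np

# Made-up GloVe-style content: one word per line followed by its vector.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "apple 0.7 0.8 0.9\n"
)

def parse_glove_lines(lines):
    """Hypothetical helper mirroring the parsing logic of read_glove_vecs."""
    word_to_vec_map = {}
    for line in lines:
        parts = line.strip().split()
        word_to_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)
    # Indices start at 1 and follow alphabetical order, so index 0 stays unused.
    word_to_index = {w: i for i, w in enumerate(sorted(word_to_vec_map), start=1)}
    index_to_word = {i: w for w, i in word_to_index.items()}
    return word_to_index, index_to_word, word_to_vec_map

toy_word_to_index, toy_index_to_word, toy_word_to_vec_map = parse_glove_lines(fake_glove)
print(toy_word_to_index)            # {'apple': 1, 'cat': 2, 'the': 3}
print(toy_word_to_vec_map["cat"])
```

Because indices start at 1, index 0 never refers to a real word, which is what later allows 0 to be used as the padding value.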

In [None]:
def convert_to_one_hot(Y, C):
    Y = np.eye(C)[Y.reshape(-1)]
    return Y
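
As a quick illustration of the indexing trick used in `convert_to_one_hot`, the sketch below selects rows of an identity matrix with a few made-up labels. The label values are assumptions chosen only to show the shape of the result.

```python
import numpy as np

labels = np.array([0, 3, 4])               # made-up labels in the range 0..4
one_hot = np.eye(5)[labels.reshape(-1)]    # row k of the identity matrix encodes class k
print(one_hot)

# Taking the argmax of each row recovers the original labels.
recovered = np.argmax(one_hot, axis=1)
print(recovered)                           # [0 3 4]
```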

**Dataset**

We have a tiny dataset (X, Y) where:
- X contains 127 sentences (strings)
- Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence


The dataset is split into training (127 examples) and testing (56 examples) sets.


In [None]:
# loading dataset

X_train, Y_train = read_csv("/content/train_emoji.csv")
X_test, Y_test = read_csv("/content/tesss.csv")

In [None]:
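# number of words in the longest training sentence (counting words, not characters)
maxLen = max(len(sentence.split()) for sentence in X_train)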
print(maxLen)

10


Printing sentences from X_train and corresponding labels from Y_train

In [None]:
for idx in range(10):
    print(X_train[idx], label_to_emoji(Y_train[idx]))

never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾


In [None]:
# loading the pre-trained model which contains all the vector representations

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs("/content/drive/MyDrive/colab files/glove.6B.50d.txt")

**Keras and Mini-batching** 

We will train the Keras model using mini-batches. However, most deep learning frameworks require that all sequences in the same mini-batch have the **same length**. 

This is what allows vectorization to work: if we had a 3-word sentence and a 4-word sentence, the computations needed for them would differ (one takes 3 LSTM steps, the other takes 4), so they could not be processed together in a single batch.
    
**Padding Handles Sequences of Varying Length**

The common solution to handling sequences of **different lengths** is to use padding. Specifically:
   * Set a maximum sequence length. One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set.
   * Pad all sequences to have the same length. 
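
The two steps above can be sketched with plain NumPy. The index values below are made up, and the variable names are assumptions for illustration rather than names used elsewhere in this notebook.

```python
import numpy as np

# Made-up index lists for a 2-word, a 3-word and a 4-word sentence.
toy_sequences = [[5, 9], [7, 2, 4], [3, 8, 1, 6]]

# Step 1: use the longest sequence as the maximum length.
toy_max_len = max(len(seq) for seq in toy_sequences)

# Step 2: start from zeros and copy each sequence in; unused positions stay 0.
toy_padded = np.zeros((len(toy_sequences), toy_max_len), dtype=int)
for row, seq in enumerate(toy_sequences):
    toy_padded[row, :len(seq)] = seq

print(toy_padded)
# [[5 9 0 0]
#  [7 2 4 0]
#  [3 8 1 6]]
```

After padding, all rows have the same length and can be stacked into a single array for one mini-batch.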
    
**The Embedding Layer**

In Keras, the embedding matrix is represented as a "layer". The embedding matrix maps word indices to embedding vectors.
* The word indices are positive integers
* The embedding vectors are dense vectors of fixed size
* A "dense" vector is the opposite of a sparse vector. It means that most of its values are non-zero.  As a counter-example, a one-hot encoded vector is not "dense".
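
To make the contrast between sparse and dense vectors concrete, here is a small sketch with a random stand-in for the embedding matrix. The sizes and values are assumptions. It also shows that looking up a row by index gives the same result as multiplying a one-hot vector by the matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_emb_dim = 6, 4
toy_emb_matrix = rng.normal(size=(toy_vocab_size, toy_emb_dim))  # random stand-in values

word_index = 2
one_hot = np.zeros(toy_vocab_size)
one_hot[word_index] = 1.0                       # sparse: a single non-zero entry

dense_by_lookup = toy_emb_matrix[word_index]    # row selection
dense_by_matmul = one_hot @ toy_emb_matrix      # same vector, computed the long way

print(np.count_nonzero(one_hot), "non-zero value in the one-hot vector")
print(np.allclose(dense_by_lookup, dense_by_matmul))   # True
```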

The embedding matrix can be derived in two ways:
* Training a model to derive the embeddings from scratch. 
* Using a pretrained embedding.
    
**Using and Updating Pre-trained Embeddings**

We'll create an [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer in Keras

* We will initialize the Embedding layer with GloVe 50-dimensional vectors. 


Inputs and outputs of the Embedding layer:

* The `Embedding()` layer's input is an integer matrix of size **(batch size, max input length)**. This input corresponds to sentences converted into lists of indices (integers). The largest integer (the highest word index) in the input must be smaller than the layer's `input_dim`, which here is the vocabulary size (number of unique words plus one).
* The embedding layer outputs an array of shape (batch size, max input length, dimension of word vectors).
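
The shapes described above can be checked without Keras. The sketch below uses NumPy indexing as a stand-in for the embedding lookup. It is not the Keras layer itself, and all sizes and index values are assumptions.

```python
import numpy as np

toy_vocab_size, toy_max_len, toy_emb_dim = 8, 5, 3
toy_emb_matrix = np.arange(toy_vocab_size * toy_emb_dim, dtype=float).reshape(
    toy_vocab_size, toy_emb_dim
)

# Two zero-padded "sentences" of word indices: shape (batch size, max input length).
toy_indices = np.array([[1, 4, 0, 0, 0],
                        [2, 7, 3, 5, 0]])

# Every index must point at an existing row of the matrix.
assert toy_indices.max() < toy_vocab_size

# One embedding vector per index: shape (batch size, max input length, embedding dimension).
toy_embeddings = toy_emb_matrix[toy_indices]
print(toy_indices.shape)      # (2, 5)
print(toy_embeddings.shape)   # (2, 5, 3)
```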


**Prepare Input Sentences**


We implement the `sentences_to_indices` function, which processes an array of sentences X and returns inputs to the embedding layer:

* Convert each training sentence into a list of indices (one index per word in the sentence)
* Zero-pad all these lists so that their length is the length of the longest sentence


In [None]:
for idx, val in enumerate(["I", "like", "learning"]):
    print(idx, val)

0 I
1 like
2 learning


In [None]:
# sentences to Indices

def sentences_to_indices(X, word_to_index, max_len):
     
    # Arguments:
    # X -> array of sentences (strings) of shape (m,)
    # word_to_index -> a dictionary mapping each word to its index
    # max_len -> maximum number of words in a sentence; every sentence in X is assumed to be no longer than this
    
    # Returns:
    # X_indices -> array of indices corresponding to words in the sentences from X of shape (m, max_len)

    
    m = X.shape[0]     # number of training examples

    # Initialize X_indices as a numpy matrix of zeros and the correct shape 
    X_indices = np.zeros((m, max_len))
    
    for i in range(m):              # loop over training examples
        
        # Convert the ith training sentence to lower case and split it into words. We should get a list of words.
        sentence_words = X[i].lower().split()
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # if w exists in the word_to_index dictionary
            if w in word_to_index:
                # Set the (i,j)th entry of X_indices to the index of the correct word.
                X_indices[i, j] = word_to_index[w]
                # Increment j to j + 1
                j =  j + 1
    
    return X_indices

In [None]:
X1 = np.array(["funny lol", "lets play baseball", "food is ready for you"])
X1_indices = sentences_to_indices(X1, word_to_index, max_len=5)
print(f"X1 = {X1}")
print(f"X1_indices =\n {X1_indices}")

X1 = ['funny lol' 'lets play baseball' 'food is ready for you']
X1_indices =
 [[155345. 225122.      0.      0.      0.]
 [220930. 286375.  69714.      0.      0.]
 [151204. 192973. 302254. 151349. 394475.]]


**Building Embedding Layer**

We will build the `Embedding()` layer in Keras using pre-trained word vectors. 

* The embedding layer takes as input a list of word indices
    * `sentences_to_indices()` creates these word indices
* The embedding layer will return the word embeddings for a sentence 


We will implement `pretrained_embedding_layer()` by following these steps:

1. Initializing the embedding matrix as a numpy array of zeros
    * The embedding matrix has a row for each unique word in the vocabulary
        * There is one additional row at index 0, which no word uses because word indices start at 1; it stays all zeros and lines up with the zero padding
        * So vocab_size is the number of unique words plus one
    * Each row will store the vector representation of one word 
        *  One row may be 50 positions long if using GloVe word vectors
    * `emb_dim` represents the length of a word embedding
2. Filling in each row of the embedding matrix with the vector representation of a word
    * Each word in `word_to_index` is a string
    * word_to_vec_map is a dictionary where the keys are strings and the values are the word vectors
3. Defining the Keras embedding layer 
    * Using [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) 
    * The input dimension is equal to the vocabulary length (number of unique words plus one)
    * The output dimension is equal to the number of positions in a word embedding
    * Making this layer's embeddings fixed
        * Setting `trainable = True` would allow the optimization algorithm to modify the values of the word embeddings
        * We don't want the model to modify the word embeddings, so we set `trainable = False`
4. Setting the embedding weights to be equal to the embedding matrix
 

In [None]:
# pretrained embedding layer

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
      
    # Arguments:
    # word_to_vec_map -> dictionary mapping words to their GloVe vector representation
    # word_to_index -> dictionary mapping from words to their indices in the vocabulary (400,000 words, indexed from 1)

    # Returns:
    # embedding_layer -> pretrained layer Keras instance
    
    
    vocab_size = len(word_to_index) + 1              # adding 1 to fit Keras embedding (requirement)
    any_word = list(word_to_vec_map.keys())[0]
    emb_dim = word_to_vec_map[any_word].shape[0]     # define dimensionality of your GloVe word vectors (= 50)
      
    # Step 1: Initialize the embedding matrix as a numpy array of zeros
    emb_matrix = np.zeros((vocab_size, emb_dim))
    
    # Step 2: Set each row "idx" of the embedding matrix to be the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Step 3: Define Keras embedding layer with the correct input and output sizes. Make it non-trainable
    embedding_layer = Embedding(vocab_size , emb_dim , trainable = False)

    # Step 4: Build the embedding layer; this is required before setting its weights
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Our layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

In [None]:
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print(f"weights[0][1][1] = {embedding_layer.get_weights()[0][1][1]}")
print(f"Input_dim  {embedding_layer.input_dim}")
print(f"Output_dim  {embedding_layer.output_dim}")

weights[0][1][1] = 0.39030998945236206
Input_dim  400001
Output_dim  50


**Building The Emojifier**

We will implement the `Emojify_V2()` function, which builds a Keras graph of the architecture. 

* The model takes as input an array of sentences of shape (`m`, `max_len`), where `input_shape = (max_len,)` describes a single example 
* The model outputs a softmax probability vector of shape (`m`, `C = 5`) 

* We will use the following Keras layers:
    * [Input()](https://www.tensorflow.org/api_docs/python/tf/keras/Input)
        * Set the `shape` and `dtype` parameters
        * The inputs are integers so we can specify the data type as a string, 'int32'.
    * [LSTM()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM)
        * Set the `units` and `return_sequences` parameters
    * [Dropout()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout)
        * Set the `rate` parameter
    * [Dense()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)
        * Set the `units` parameter
    * [Activation()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Activation)
        * We can pass in the activation of our choice as a lowercase string
    * [Model()](https://www.tensorflow.org/api_docs/python/tf/keras/Model)
        * Set `inputs` and `outputs`



Each of these Keras layer constructors returns a layer object, which we then call on the output of the previous layer. The layer can be created and called in the same line, as in `X = Dropout(0.5)(X)`.

The `embedding_layer` that is returned by `pretrained_embedding_layer` is a layer object that can be called as a function, passing in a single argument (sentence indices).

In [None]:
# Emojify V2

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    
    # Arguments:
    # input_shape -> shape of the input usually (max_len,)
    # word_to_vec_map -> dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    # word_to_index -> dictionary mapping from words to their indices in the vocabulary (400,000 words, indexed from 1)

    # Returns:
    # model -> a model instance in Keras
  
    
    # Define sentence_indices as the input of the graph
    # It should be of shape input_shape and dtype 'int32' as it contains indices, which are integers
    sentence_indices = Input(shape = input_shape, dtype = np.int32)
    
    # Create the embedding layer pretrained with GloVe Vectors 
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through embedding layer
    embeddings = embedding_layer(sentence_indices)   
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # The returned output should be a batch of sequences
    X = LSTM(128, return_sequences=True)(embeddings)
    
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    
    # Propagate X through another LSTM layer with 128-dimensional hidden state
    # The returned output should be a single hidden state not a batch of sequences
    X = LSTM(128)(X)
    
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
   
    # Propagate X through a Dense layer with 5 units
    X = Dense(5)(X)
   
    # Add a softmax activation
    X = Activation("softmax")(X)
    
    # Create Model instance which converts sentence_indices into X
    model = Model(inputs=sentence_indices, outputs=X)
    
    
    return model

**Creating the model**  

* Because no sentence in the dataset is longer than 10 words, `max_len = 10` was chosen.  
* Our architecture uses 20,223,927 parameters, of which 20,000,050 (the word embeddings) are non-trainable and the remaining 223,877 are trainable.
* Because the embedding matrix has 400,001 rows (with valid indices from 0 to 400,000) and 50 columns, there are 400,001\*50 = 20,000,050 non-trainable parameters.
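
The parameter counts quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias) and a Dense layer with a bias. The helper names are hypothetical and not part of Keras.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input-output pair plus one bias per unit.
    return input_dim * units + units

embedding_params = 400_001 * 50                              # frozen GloVe matrix
lstm_1_params = lstm_param_count(input_dim=50, units=128)    # first LSTM reads 50-d embeddings
lstm_2_params = lstm_param_count(input_dim=128, units=128)   # second LSTM reads 128-d states
dense_params = dense_param_count(input_dim=128, units=5)

trainable = lstm_1_params + lstm_2_params + dense_params
print(embedding_params)                             # 20000050
print(lstm_1_params, lstm_2_params, dense_params)   # 91648 131584 645
print(trainable)                                    # 223877
print(embedding_params + trainable)                 # 20223927
```

These values agree with the per-layer counts that `model.summary()` reports.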

In [None]:
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 10)]              0         
                                                                 
 embedding_4 (Embedding)     (None, 10, 50)            20000050  
                                                                 
 lstm_2 (LSTM)               (None, 10, 128)           91648     
                                                                 
 dropout_2 (Dropout)         (None, 10, 128)           0         
                                                                 
 lstm_3 (LSTM)               (None, 128)               131584    
                                                                 
 dropout_3 (Dropout)         (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 5)                 645 

**Compile The Model** 

After creating the model in Keras, we need to compile it and define which loss, optimizer and metrics to use. We compile our model using the `categorical_crossentropy` loss, the `adam` optimizer and `['accuracy']` as the metric:

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
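
As a rough illustration of what the `categorical_crossentropy` loss measures for a single example, the NumPy sketch below applies a softmax to made-up scores and compares the result with a one-hot label. The numbers are assumptions, and Keras's own implementation may differ in details such as clipping and averaging over the batch.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1, -1.0, 0.3])   # made-up outputs of the Dense layer
probs = softmax(logits)                         # five class probabilities

y_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])    # one-hot label for class 0

# Categorical cross-entropy: -sum(y_true * log(y_pred)).
loss = -np.sum(y_true * np.log(probs))
print(probs.sum())
print(loss, -np.log(probs[0]))
```

With a one-hot label, the sum reduces to the negative log of the probability assigned to the correct class, so the loss shrinks as that probability approaches 1.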

**Train The Model** 

It's time to train the model! The Emojifier-V2 `model` takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `number of classes`). Thus, we have to convert X_train (array of sentences as strings) to X_train_indices (array of sentences as list of word indices) and Y_train (labels as indices) to Y_train_oh (labels as one-hot vectors).

In [None]:
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)

In [None]:
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f062ec0c750>

In [None]:
X_test_indices = sentences_to_indices(X_test, word_to_index, max_len = maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C = 5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print()
print(f"Test accuracy = {acc}")


Test accuracy = 0.875


In [None]:
# see the mislabelled examples
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])    # predicted class for the ith test sentence
    if num != Y_test[i]:
        # each output line shows the expected emoji, then the sentence followed by the predicted emoji
        print(f"Expected emoji: {label_to_emoji(Y_test[i])}  prediction:  {X_test[i]} {label_to_emoji(num).strip()}")

Expected emoji: 😞  prediction:  work is hard	 😄
Expected emoji: 😞  prediction:  This girl is messing with me	 ❤️
Expected emoji: 😞  prediction:  work is horrible	 😄
Expected emoji: ❤️  prediction:  I love taking breaks	 😞
Expected emoji: 😄  prediction:  you brighten my day	 ❤️
Expected emoji: 😞  prediction:  she is a bully	 ❤️
Expected emoji: 😄  prediction:  will you be my valentine	 ❤️
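
The reported test accuracy and the list of mislabelled sentences are two views of the same computation: take the argmax of each probability row and compare it with the label. The sketch below shows this on made-up predictions (the `toy_` arrays are assumptions), then checks the arithmetic for the figures reported above.

```python
import numpy as np

# Made-up probability rows for three examples and five classes.
toy_pred = np.array([[0.7, 0.1, 0.1, 0.05, 0.05],
                     [0.1, 0.1, 0.6, 0.1, 0.1],
                     [0.2, 0.1, 0.1, 0.5, 0.1]])
toy_labels = np.array([0, 2, 4])

toy_classes = np.argmax(toy_pred, axis=1)          # predicted class per row
toy_accuracy = np.mean(toy_classes == toy_labels)  # fraction of correct predictions
print(toy_classes, toy_accuracy)

# Figures from this notebook: 56 test sentences, 7 of them mislabelled.
print((56 - 7) / 56)   # 0.875
```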


In [None]:
# Change the sentence below to see the prediction. Make sure all the words are in the GloVe vocabulary.
x_test = np.array(["I am not happy"])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(f"{x_test[0]} {label_to_emoji(np.argmax(model.predict(X_test_indices)))}")

I am not happy 😞


This model takes word order into account. With a larger training set, the LSTM model would be much better at understanding more complex sentences.