# Word embeddings

We can represent the words of a dictionary as vectors using one-hot encoding, where all components are zero except for one.

<img src="images/words-one-hot.png">

The advantage of such an encoding is that computing the vector for a word, and looking up the word for a given vector, are both easy. On the other hand, this encoding contains no information about how words relate to each other. An alternative kind of word vector is the **word embedding**. In such vectors, each component can be thought of as reflecting a different feature of a word's meaning (e.g. age, sex, food/non-food, word type, etc.). Therefore all components can have non-zero values. Words that are semantically similar have similar values in the individual components. For visualization we can reduce the dimensionality to two (or three) dimensions, e.g. by applying the t-SNE algorithm. Doing so shows that words with similar meanings end up close to each other in vector space.

<img src="images/vector-space.png" width="300px">
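To make the contrast concrete, here is a minimal sketch of one-hot encoding over a made-up five-word vocabulary. The word list and the helper `one_hot` are illustrative assumptions, not part of any dataset used later. It shows that looking up a word from its vector is trivial, but that any two different one-hot vectors have a dot product of zero, so the encoding says nothing about how related two words are.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["man", "woman", "king", "queen", "apple"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

o_king = one_hot("king")
o_queen = one_hot("queen")

# Looking up the word for a vector is a simple argmax.
recovered = vocab[int(np.argmax(o_king))]

# The dot product of two different one-hot vectors is always 0,
# so "king" looks exactly as unrelated to "queen" as it does to "apple".
dot_king_queen = float(o_king @ o_queen)
dot_king_apple = float(o_king @ one_hot("apple"))
print(recovered, dot_king_queen, dot_king_apple)
```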

## Properties of word embeddings

Word embeddings have become hugely popular in NLP and can, for example, be used for named entity recognition (NER). Often an existing model can be adapted to a specific task by performing additional training on suitable training data (transfer learning). The training set for this task, and also the dimensionality of the word vectors, can then be much smaller. The role of a word embedding $e$ is similar to that of a face encoding in face recognition in computer vision: it is a vectorized representation of the underlying data. An important distinction, however, is that to get word embeddings a model learns a fixed-size vocabulary. Vectors for words outside this vocabulary cannot be calculated. In contrast, a CNN can calculate a vector for a face it has never seen before.

Word embeddings are useful to model analogies and relationships between words:

$$e_{man} - e_{woman} \approx e_{king} - e_{queen}$$

The difference between the vectors for “man” and “woman” is similar to the difference between the vectors for “king” and “queen”, because the two pairs of words are related in the same way. We can observe that a trained model has learned this relationship because the difference vectors of the two pairs are approximately parallel. This also applies to other kinds of word pairings, like verbs in different tenses or the relationship between a country and its capital:

<img src="images/word-embeddings.png">

Therefore we can get an embedding by rearranging the formula:
$$e_{king} - e_{man} + e_{woman} \approx e_{queen}$$

This way the embedding for “queen” can be approximated from the embeddings of the other words. To get from an embedding back to a word we can use a similarity function $sim$, which measures the similarity between two embeddings $u$ and $v$. Often the cosine similarity is used for this function:

$$sim(u,v) = \frac{u^T v}{\lVert u\rVert_2 \lVert v \rVert_2}$$

With the help of the similarity function we can find the word “queen” by comparing the computed vector against the embeddings $e_w$ of all words $w$ in the vocabulary and picking the most similar one:

$$\hat{w} = \underset{w}{\operatorname{argmax}} \; sim(e_w, e_{king} - e_{man} + e_{woman})$$
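The analogy search can be sketched in a few lines. The 3-dimensional vectors below are hand-picked toy values (roughly: royalty, gender, food), not learned embeddings, and the helper names `cosine_similarity` and `complete_analogy` are assumptions made for this example. Excluding the three input words from the search is a common convention, since otherwise one of them often ends up as the nearest neighbour.

```python
import numpy as np

# Hand-picked toy embeddings: (royalty, gender, food). Not learned values.
embeddings = {
    "man":   np.array([0.0, -1.0, 0.0]),
    "woman": np.array([0.0,  1.0, 0.0]),
    "king":  np.array([1.0, -1.0, 0.0]),
    "queen": np.array([1.0,  1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine_similarity(u, v):
    """sim(u, v) = u^T v / (||u||_2 * ||v||_2)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(a, b, c):
    """Find the word w whose embedding is most similar to e_c - e_a + e_b."""
    target = embeddings[c] - embeddings[a] + embeddings[b]
    # Skip the three input words so they cannot be returned as the answer.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(embeddings[w], target))

# "man" is to "woman" as "king" is to ...?
answer = complete_analogy("man", "woman", "king")
print(answer)
```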

## Word2Vec

**Word2Vec (W2V)** is probably the most popular implementation of word embeddings. W2V comprises two approaches:
- Skip-Gram
- CBOW (Continuous Bag Of Words)

<img src="images/skip-gram-medium.webp" width="500px">

### Skip-Grams

One of the most famous approaches to dense representation is word2vec, proposed by Google in 2013.

We come up with a few context-target pairs to create our supervised learning problem. We pick a random word from the sentence, then we take a random number of words from a given window ($\pm k$) around it. Usually the context words are not chosen with a uniform random distribution but according to their frequency in the corpus: frequent words have a lower probability of being selected as context words.

<img src="images/skip grams.png" width="500px">
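As a rough sketch of how such pairs can be produced, the hypothetical helper `skipgram_pairs` below enumerates every (word, neighbour) pair inside a window of $\pm k$ words. This is a simplification: it takes all pairs instead of sampling a random subset, and it ignores frequency-based sampling.

```python
def skipgram_pairs(tokens, k):
    """Return all (word, neighbour) pairs where the neighbour lies within +-k positions."""
    pairs = []
    for i, word in enumerate(tokens):
        start = max(0, i - k)
        end = min(len(tokens), i + k + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((word, tokens[j]))
    return pairs

tokens = "merry christmas to you".split()
pairs = skipgram_pairs(tokens, k=2)

# Neighbours collected for the word "christmas"
christmas_neighbours = [neighbour for word, neighbour in pairs if word == "christmas"]
print(christmas_neighbours)
```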

The high-level idea of dense vector representation is to train two models: encoder and decoder. For example, in the case of skip-gram, we might pass the word “christmas” to the encoder. Then, the encoder will produce a vector that we pass to the decoder expecting to get the words “merry”, “to”, and “you”.

<img src="images/skip gram high level model.webp">

This model starts to take the meaning of words into account, since it is trained on the contexts in which the words appear. However, it ignores morphology (information we can get from the word parts, for example, that “-less” means the lack of something). This drawback was addressed later by looking at subword skip-grams in **fastText**.
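To give an idea of what subword units can look like, the sketch below splits a word into character n-grams after adding the boundary markers `<` and `>`. The function name `char_ngrams` and the choice of `n=3` are assumptions for illustration; real subword models typically combine several n-gram lengths.

```python
def char_ngrams(word, n=3):
    """Return all character n-grams of the word, including boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

ngrams = char_ngrams("careless")
print(ngrams)
```

The n-grams `les`, `ess` and `ss>` also occur in other words ending in “-less”, which is how a subword model can share information about the suffix.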

## Packages

In [1]:
import os

import numpy as np
from tensorflow import data as tf_data
from tensorflow import keras
import pandas as pd
from emo_utils import *
import tensorflow as tf


## Load Dataset: EMOJISET 

You have a tiny dataset (X, Y) where:
- X contains 127 sentences (strings).
- Y contains an integer label between 0 and 4 corresponding to an emoji for each sentence.

<img src="images/data_set.png" style="width:700px;height:300px;">

Load the dataset using the code below. The dataset is split between training (127 examples) and testing (56 examples).

In [2]:
data = pd.read_csv('data/train_emoji.csv', names=["sentence", "emoji", "remove1", "remove_2"])
data.drop(columns=["remove1", "remove_2"], inplace=True)
data_test = pd.read_csv('data/tesss.csv', names=["sentence", "emoji"])

In [3]:
X_train, y_train = data["sentence"], data["emoji"]
X_test, y_test = data_test["sentence"], data_test["emoji"]

In [4]:
for idx in range(10):
    print(X_train[idx], label_to_emoji(y_train[idx]))

never talk to me again 😞
I am proud of your achievements 😄
It is the worst day in my life 😞
Miss you so much ❤️
food is life 🍴
I love you mum ❤️
Stop saying bullshit 😞
congratulations on your acceptance 😄
The assignment is too long  😞
I want to go play ⚾️


In the code cell below, you find the sentence with the maximum number of words and store its length in `maxLen` (*i.e., the number of words in the longest sentence, which is used later on*). Let's break down this code for a better understanding.

- The first point to note here is that `split()` breaks a string into a list of its words. So, if `x` is a string, then `len(x.split())` returns the number of words in that string. You can read more about `split` [here](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split).

- The second point to note here is the way in which the `max` function is used. As can be read [here](https://docs.python.org/3/library/functions.html#max), apart from an iterable (*which in your case is `X_train`, a pandas Series of strings*), this function also has a `key` argument that can be used to modify the basis on which the largest element in the iterable is chosen.

In this case, `key` has been chosen as the number of words in a string. So the `max` function will return the string with the largest number of words.

In [5]:
maxLen = len(max(X_train, key=lambda x: len(x.split())).split())
maxLen

10
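As a small illustration of `max` with a `key`, the sketch below runs the same expression on three sentences copied from the sample output above and compares it with an equivalent formulation that takes the maximum over the word counts directly. The variable names are made up for this example.

```python
sentences = [
    "never talk to me again",
    "food is life",
    "I am proud of your achievements",
]

# max() returns the sentence itself; its word count is taken afterwards.
longest = max(sentences, key=lambda s: len(s.split()))
max_len_via_key = len(longest.split())

# Equivalent: take the maximum over the word counts directly.
max_len_direct = max(len(s.split()) for s in sentences)
print(longest, max_len_via_key, max_len_direct)
```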

## Create a Vocabulary index

In [6]:
vectorizer = keras.layers.TextVectorization(max_tokens=40000, output_sequence_length=10)
text_ds = tf_data.Dataset.from_tensor_slices(X_train)
vectorizer.adapt(text_ds)

In [7]:
vectorizer.get_vocabulary()[:5]

['', '[UNK]', 'i', 'you', 'is']

In [8]:
len(vectorizer.get_vocabulary())


262

In [9]:
output = vectorizer([["the cat sat on the mat"]])
output.numpy()[0, :6]

array([ 5,  1,  1, 39,  5,  1], dtype=int64)

In [10]:
voc = vectorizer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))

In [11]:
test = ["i", "love", "to",  "eat", "chinese", "food"]
[word_index[w] for w in test]

[2, 21, 10, 209, 227, 27]

#### One-hot Encoding
* To get your labels into a format suitable for training a softmax classifier, convert $Y$ from its current shape $(m, 1)$ into a "one-hot representation" of shape $(m, 5)$.

In [12]:
train_labels = keras.utils.to_categorical(y_train)
test_labels = keras.utils.to_categorical(y_test)

In [13]:
idx = 50
print(f"Sentence '{X_train[idx]}' has label index {y_train[idx]}, which is emoji {label_to_emoji(y_train[idx])}", )
print(f"Label index {y_train[idx]} in one-hot encoding format is {train_labels[idx]}")

Sentence 'I missed you' has label index 0, which is emoji ❤️
Label index 0 in one-hot encoding format is [1. 0. 0. 0. 0.]
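Conceptually, the one-hot conversion just selects rows of an identity matrix. The NumPy sketch below illustrates this with a few made-up labels; it is meant to show the idea, not to replace `to_categorical`.

```python
import numpy as np

num_classes = 5
labels = np.array([0, 3, 4, 1])  # made-up labels, for illustration only

# Row i of the identity matrix is the one-hot vector for class i.
one_hot_labels = np.eye(num_classes)[labels]
print(one_hot_labels.shape)
print(one_hot_labels[1])
```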


## Load pre-trained word embeddings

In [14]:
path_to_glove_file = "data/glove.6B.50d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf8") as f:  # explicit encoding avoids platform-dependent defaults
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print(embeddings_index["cat"])
print("Found %s word vectors." % len(embeddings_index))

[ 0.45281  -0.50108  -0.53714  -0.015697  0.22191   0.54602  -0.67301
 -0.6891    0.63493  -0.19726   0.33685   0.7735    0.90094   0.38488
  0.38367   0.2657   -0.08057   0.61089  -1.2894   -0.22313  -0.61578
  0.21697   0.35614   0.44499   0.60885  -1.1633   -1.1579    0.36118
  0.10466  -0.78325   1.4352    0.18629  -0.26112   0.83275  -0.23123
  0.32481   0.14485  -0.44552   0.33497  -0.95946  -0.097479  0.48138
 -0.43352   0.69455   0.91043  -0.28173   0.41637  -1.2609    0.71278
  0.23782 ]
Found 400000 word vectors.


Sentences differ in length, but most deep learning frameworks require that all sequences in the same mini-batch have the **same length**.

This is what allows vectorization to work: if you had a 3-word sentence and a 4-word sentence, the computations needed for them would be different (one takes 3 steps of an LSTM, the other takes 4 steps), so they cannot be processed together in a single vectorized operation.
    
#### Padding Handles Sequences of Varying Length
* The common solution to handling sequences of **different length** is to use padding.  Specifically:
    * Set a maximum sequence length
    * Pad all sequences to have the same length. 
    
#### Example of Padding:
* Given a maximum sequence length of 20, you could pad every sentence with "0"s so that each input sentence is of length 20. 
* Thus, the sentence "I love you" would be represented as $(e_{I}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$. 
* In this example, any sentences longer than 20 words would have to be truncated. 
* One way to choose the maximum sequence length is to just pick the length of the longest sentence in the training set. 
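The padding and truncation steps above can be sketched with plain NumPy. The helper `pad_sequences_sketch` is a hypothetical name for this example, the index lists are arbitrary, and index 0 is assumed to be reserved for padding.

```python
import numpy as np

def pad_sequences_sketch(sequences, max_len, pad_value=0):
    """Pad (or truncate) each list of word indices to exactly max_len entries."""
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]              # truncate sequences that are too long
        padded[row, :len(trimmed)] = trimmed  # the remaining positions keep pad_value
    return padded

# One short sequence (gets padded) and one long sequence (gets truncated).
batch = pad_sequences_sketch([[2, 21, 3], [5, 7, 9, 39, 5, 8]], max_len=5)
print(batch)
```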
  
### The Embedding Layer

In Keras, the embedding matrix is represented as a "layer."

* The embedding matrix maps word indices to embedding vectors.
    * The word indices are non-negative integers (index 0 is typically reserved for padding).
    * The embedding vectors are dense vectors of fixed size.
    * A "dense" vector is the opposite of a sparse vector. It means that most of its values are non-zero.  As a counter-example, a one-hot encoded vector is not "dense."
* The embedding matrix can be derived in two ways:
    * Training a model to derive the embeddings from scratch. 
    * Using a pretrained embedding.
    
#### Using and Updating Pre-trained Embeddings
In this section, you'll create an [Embedding()](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer in Keras

* You will initialize the Embedding layer with GloVe 50-dimensional vectors. 
* In the code below, you'll observe how Keras allows you to either train or leave this layer fixed.  
    * Because your training set is quite small, you'll leave the GloVe embeddings fixed instead of updating them.

#### Inputs and Outputs to the Embedding Layer

* The `Embedding()` layer's input is an integer matrix of size **(batch size, max input length)**. 
    * This input corresponds to sentences converted into lists of indices (integers).
    * The largest integer (the highest word index) in the input must be smaller than the vocabulary size passed to the layer (its `input_dim`).
* The embedding layer outputs an array of shape (batch size, max input length, dimension of word vectors).

* The figure shows the propagation of two example sentences through the embedding layer. 
    * Both examples have been zero-padded to a length of `max_len=5`.
    * The word embeddings are 50 units in length.
    * The final dimension of the representation is  `(2,max_len,50)`. 

<img src="images/embedding1.png" style="width:700px;height:250px;">
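The shapes described above can be checked with a NumPy sketch in which the embedding layer is modelled as plain row lookup in a matrix. The vocabulary size, the random matrix and the index values are arbitrary assumptions for this example; a real layer would hold learned or pretrained vectors.

```python
import numpy as np

vocab_size, emb_dim, max_len = 20, 50, 5

# Stand-in for an embedding matrix: one row per word index (random values).
rng = np.random.default_rng(0)
embedding_matrix_demo = rng.normal(size=(vocab_size, emb_dim))
embedding_matrix_demo[0] = 0.0  # row 0 set to zeros to mimic zero-padding

# Two sentences as word indices, zero-padded to max_len.
batch_indices = np.array([[2, 7, 3, 0, 0],
                          [4, 9, 11, 6, 0]])

# The lookup replaces every index with the corresponding row of the matrix.
embedded = embedding_matrix_demo[batch_indices]
print(embedded.shape)
```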

In [15]:
# Sized by the GloVe vocabulary (+2); only the first len(voc) rows are filled below.
num_tokens = len(embeddings_index) + 2
embedding_dim = 50
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representations for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 260 words (2 misses)


In [76]:
embedding_layer = keras.layers.Embedding(
    num_tokens,
    embedding_dim,
    trainable=False,
)
embedding_layer.build((None,))
embedding_layer.set_weights([embedding_matrix])

In [None]:

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_size = len(word_to_index) + 1              # adding 1 to fit Keras embedding (requirement)
    any_word = list(word_to_vec_map.keys())[0]
    emb_dim = word_to_vec_map[any_word].shape[0]    # define dimensionality of your GloVe word vectors (= 50)
      
    ### START CODE HERE ###
    # Step 1
    # Initialize the embedding matrix as a numpy array of zeros.
    # See instructions above to choose the correct shape.
    emb_matrix = np.zeros((vocab_size, emb_dim))
    
    # Step 2
    # Set each row "idx" of the embedding matrix to be 
    # the word vector representation of the idx'th word of the vocabulary
    for word, idx in word_to_index.items():
        emb_matrix[idx, :] = word_to_vec_map[word]

    # Step 3
    # Define Keras embedding layer with the correct input and output sizes
    # Make it non-trainable.
    embedding_layer = keras.layers.Embedding(vocab_size, emb_dim, trainable=False)
    ### END CODE HERE ###

    # Step 4 (already done for you; please do not modify)
    # Build the embedding layer, it is required before setting the weights of the embedding layer. 
    embedding_layer.build((None,)) # Do not modify the "None".  This line of code is complete as-is.
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

## Build the model

### Model Overview

Here is the Emojifier-v2 you will implement:

<img src="images/emojifier-v2.png" style="width:700px;height:400px;">

In [87]:
int_sequences_input = keras.Input(shape=(None,), dtype="int32")
embedded_sequences = embedding_layer(int_sequences_input)

# Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
# The returned output should be a batch of sequences.
X = keras.layers.LSTM(128, return_sequences=True)(embedded_sequences)

# Add dropout with a probability of 0.5
X =  keras.layers.Dropout(0.5)(X)

# Propagate X through another LSTM layer with 128-dimensional hidden state
# The returned output should be a single hidden state, not a batch of sequences.
X =  keras.layers.LSTM(128)(X)

# Add dropout with a probability of 0.5
X =  keras.layers.Dropout(0.5)(X)

# Propagate X through a Dense layer with 5 units
outputs =  keras.layers.Dense(5)(X)

model = keras.Model(int_sequences_input, outputs)
model.summary()

Model: "model_20"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_21 (InputLayer)       [(None, None)]            0         
                                                                 
 embedding_1 (Embedding)     (None, None, 50)          20000100  
                                                                 
 lstm_22 (LSTM)              (None, None, 128)         91648     
                                                                 
 dropout_24 (Dropout)        (None, None, 128)         0         
                                                                 
 lstm_23 (LSTM)              (None, 128)               131584    
                                                                 
 dropout_25 (Dropout)        (None, 128)               0         
                                                                 
 dense_12 (Dense)            (None, 5)                 645

In [88]:
x_train = vectorizer(np.array([[s] for s in X_train])).numpy()
x_val = vectorizer(np.array([[s] for s in X_test])).numpy()

y_train = np.array(train_labels)
y_val = np.array(test_labels)

In [89]:
vectorizer(["funny lol", "lets play baseball", "food is ready for you"])

<tf.Tensor: shape=(3, 10), dtype=int64, numpy=
array([[ 33,  64,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 43, 140,  34,   0,   0,   0,   0,   0,   0,   0],
       [ 27,   4, 131,  12,   3,   0,   0,   0,   0,   0]], dtype=int64)>

In [90]:
model.compile(loss=keras.losses.CategoricalCrossentropy(from_logits=True), optimizer='adam', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=40, batch_size = 32, shuffle=True, validation_data=(x_val, y_val))

Epoch 1/40
...
Epoch 40/40


<keras.callbacks.History at 0x23e8f8df190>

In [91]:
x_test = vectorizer(np.array([[s] for s in X_test])).numpy()
loss, acc = model.evaluate(x_test, test_labels)
print("Test accuracy = ", acc)



Test accuracy =  0.7321428656578064


In [75]:
# This code allows you to see the mislabelled examples
string_input = keras.Input(shape=(1,), dtype="string")
x = vectorizer(string_input)
x = model(x)
# Add a softmax activation
outputs =  keras.layers.Activation('softmax')(x)
end_to_end_model = keras.Model(string_input, outputs)

probabilities = end_to_end_model(
        tf.convert_to_tensor(X_test)
)
for i in range(len(X_test)):
    num = np.argmax(probabilities[i])
    if(num != y_test[i]):
        print('Expected emoji:'+ label_to_emoji(y_test[i]) + ' prediction: '+ X_test[i] + label_to_emoji(num).strip())

Expected emoji:😄 prediction: he got a very nice raise❤️
Expected emoji:😄 prediction: she got me a nice present❤️
Expected emoji:😄 prediction: We had such a lovely dinner tonight🍴
Expected emoji:😞 prediction: work is hard😄
Expected emoji:😞 prediction: This girl is messing with me❤️
Expected emoji:😞 prediction: work is horrible😄
Expected emoji:😄 prediction: Congratulation for having a baby❤️
Expected emoji:😄 prediction: you brighten my day❤️
Expected emoji:🍴 prediction: I boiled rice😞
Expected emoji:😞 prediction: she is a bully❤️
Expected emoji:❤️ prediction: My grandmother is the love of my life😄
Expected emoji:😄 prediction: will you be my valentine❤️
Expected emoji:⚾️ prediction: he can pitch really well😞
Expected emoji:😄 prediction: I like to laugh❤️
Expected emoji:😄 prediction: What you did was awesome😞
Expected emoji:😞 prediction: go away⚾️
Expected emoji:❤️ prediction: family is all I have😞
Expected emoji:😄 prediction: You deserve this nice prize😞
Expected emoji:🍴 prediction: I did