# <font color=  #C70039 > Adding t-SNE to visualize  </font> <font color= #13c113  >Word Embeddings</font> with Keras

![Deep Learning with Python](https://images-na.ssl-images-amazon.com/images/I/41DWjHboiyL._SX258_BO1,204,203,200_.jpg)


## Adapted from:

### [Section 6.1-using-word-embeddings](https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/6.1-using-word-embeddings.ipynb)

## By François Chollet

### And [classifying-yelp-review-comments-using-lstm-and-word-embeddings](https://medium.com/@sabber/classifying-yelp-review-comments-using-lstm-and-word-embeddings-part-1-eb2275e4066b)

## By Sabber Ahamed
<br>


# * [MSTC](http://mstc.ssr.upm.es/big-data-track) and MUIT: <font size=5 color='green'>Deep Learning with Tensorflow & Keras</font>



---

## A popular and powerful way to associate a vector with a word is the use of dense "word vectors", also called "word embeddings".

![Word embeddings](https://s3.amazonaws.com/book.keras.io/img/ch6/word_embeddings.png)

Figure from François Chollet, book.keras.io

- #### One-hot encodings are binary, sparse (mostly made of zeros) and very high-dimensional (same dimensionality as the number of words in the vocabulary), whereas "word embeddings" are low-dimensional floating point vectors (i.e. "dense" vectors, as opposed to sparse vectors).
- #### Unlike one-hot encoding, word embeddings are learned from data.

It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding of words generally leads to vectors that are 20,000-dimensional or higher (capturing a vocabulary of 20,000 tokens in this case). So, word embeddings pack more information into far fewer dimensions.
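
To make the comparison concrete, here is a minimal NumPy sketch. The vocabulary size, embedding size, and word index are illustrative assumptions, and the embedding matrix is random rather than learned. It shows that multiplying a one-hot vector by an embedding matrix simply selects one row, which is why a dense lookup can stand in for the sparse representation.

```python
import numpy as np

vocab_size = 20000    # assumed vocabulary size, as in the paragraph above
embedding_dim = 256   # assumed embedding size

rng = np.random.default_rng(0)
# Stand-in for a learned embedding matrix: one row per word.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_index = 42  # hypothetical word index

# Sparse representation: a 20,000-dimensional vector with a single 1.
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Dense representation: the row of the matrix for that word.
dense_vector = embedding_matrix[word_index]

# Multiplying the one-hot vector by the matrix selects the same row.
via_matmul = one_hot @ embedding_matrix

print(one_hot.shape, dense_vector.shape)      # (20000,) (256,)
print(np.allclose(via_matmul, dense_vector))  # True
```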



---

### There are two ways to obtain word embeddings:

-    Learn word embeddings jointly with the main task you care about (e.g. document classification or sentiment prediction). In this setup, you would start with random word vectors, then **learn your word vectors in the same way that you learn the weights of a neural network**.

<br>
-    Load into your model word embeddings that were pre-computed using a different machine learning task than the one you are trying to solve. These are called **"pre-trained word embeddings"**.



---

## Learning word embeddings with the $Embedding$ layer

- It is reasonable to learn a new embedding space with every new task.
- Thankfully, backpropagation makes this really easy,
- and Keras makes it even easier.

      It's just about learning the weights of a layer: the Embedding layer.

In [0]:
import tensorflow
from tensorflow import keras

from keras.layers import Embedding

# The Embedding layer takes at least two arguments:
# the number of possible tokens, here 10000 (1 + the maximum word index, which will be 9999),
# and the dimensionality of the embeddings, here 8.
# embedding_layer = Embedding(10000, 8)


---

### The Embedding layer is best understood as a dictionary:
- It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors.

- The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of integers.

- The Embedding layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality).
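
The shapes described above can be sketched with plain NumPy indexing. This is only an illustration of the lookup idea, not the Keras implementation; the table is random and the word indices are made up.

```python
import numpy as np

num_tokens = 10000   # size of the lookup table
embedding_dim = 8    # dimensionality of each word vector

rng = np.random.default_rng(0)
# Stand-in for the layer weights: one vector per token index.
table = rng.normal(size=(num_tokens, embedding_dim))

# A batch of 2 samples, each a sequence of 5 integer word indices.
batch = np.array([[4, 25, 7, 0, 9999],
                  [1, 1, 300, 12, 8]])

# Indexing the table with the integer batch performs the dictionary lookup.
embedded = table[batch]

print(batch.shape)     # (2, 5)      -> (samples, sequence_length)
print(embedded.shape)  # (2, 5, 8)   -> (samples, sequence_length, embedding_dimensionality)
```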





---
### Training the Embedding layer:

- When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just like with any other layer.

- During training, these word vectors will be gradually adjusted via backpropagation, structuring the space into something that the downstream model can exploit. 

- Once fully trained, your embedding space will show a lot of structure -- a kind of structure specialized for the specific problem you were training your model for.
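
As a toy illustration of this adjustment, the sketch below performs a single hand-written gradient step on a small random embedding matrix. The squared-error loss, the zero targets, and the learning rate are assumptions chosen only to keep the example short; Keras computes the gradients automatically for whatever loss the model uses. Note that only the rows of the tokens seen in the batch are updated.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3
E = rng.normal(size=(vocab_size, embedding_dim))  # random initial embeddings
E_before = E.copy()

token_ids = np.array([1, 4, 1])                       # tokens seen in this toy batch
targets = np.zeros((len(token_ids), embedding_dim))   # made-up targets

# Forward pass: look up the vectors and compute a squared-error loss.
vectors = E[token_ids]
loss = 0.5 * np.sum((vectors - targets) ** 2)

# Backward pass: the gradient w.r.t. each looked-up vector is (vector - target);
# it is accumulated into the corresponding rows of the embedding matrix.
grad_E = np.zeros_like(E)
np.add.at(grad_E, token_ids, vectors - targets)

learning_rate = 0.1
E -= learning_rate * grad_E

changed_rows = np.where(np.any(E != E_before, axis=1))[0]
print(changed_rows)  # [1 4]
```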


---


## Example: IMDB movie review sentiment prediction

    The "IMDB dataset" is a set of 50,000 highly polarized reviews from the Internet Movie Database. They are split into 25,000 reviews for training and 25,000 reviews for testing, each set consisting of 50% negative and 50% positive reviews.

    Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has already been preprocessed: the reviews (sequences of words) have been turned into sequences of integers, where each integer stands for a specific word in a dictionary.


Let's quickly prepare the data. 

- We will restrict the movie reviews to the top 10,000 most common words
- and keep only 20 words per review (with its default settings, `pad_sequences` keeps the last 20).

**Our network will simply learn 8-dimensional embeddings for each of the 10,000 words, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense layer on top for classification.**

---
# <font color= #C70039 > In this example we will only use the top <font color=black>300</font> words from IMDB so that we can visualize them quickly</font>

In [0]:
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 300
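# Keep the raw IMDB word indices (no offset). Note that load_data still uses
# 1 as the start marker and 2 for out-of-vocabulary words by default, so these
# two ids overlap with the two most frequent words.
INDEX_FROM = 0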

# Cut texts after this number of words 
# (among top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features,index_from=INDEX_FROM)

# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
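
With its default arguments (`padding='pre'`, `truncating='pre'`), `pad_sequences` pads short sequences with zeros at the front and drops the beginning of long ones. The sketch below mimics that behavior on two made-up sequences; `pad_or_truncate_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np

def pad_or_truncate_pre(sequence, maxlen, value=0):
    """Hypothetical helper mimicking 'pre' padding and 'pre' truncation."""
    kept = list(sequence)[-maxlen:]                 # keep the last maxlen items
    padding = [value] * (maxlen - len(kept))        # zeros go at the front
    return np.array(padding + kept)

short_review = [5, 6, 7]                  # made-up word indices
long_review = [11, 12, 13, 14, 15, 16]

print(pad_or_truncate_pre(short_review, maxlen=4))  # [0 5 6 7]
print(pad_or_truncate_pre(long_review, maxlen=4))   # [13 14 15 16]
```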

In [0]:
print('Train data shape:', x_train.shape)
print('Test  data shape:', x_test.shape)


In [0]:
x_train[0]

---
# <font color= #C70039 > Obtain <font color=black>$word\_to\_id$</font> and <font color=black>$id\_to\_word$</font> dictionaries from IMDB</font>

In [0]:
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}


In [0]:
print(id_to_word[62])

In [0]:
print(' '.join(id_to_word[i] for i in x_train[0]))
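
The following self-contained sketch uses a made-up four-word vocabulary to show how inverting the dictionary behaves when special tokens share an id with a real word, which is what happens when no offset is applied. When several keys map to the same id, the inverted dictionary keeps the one inserted last.

```python
# Toy vocabulary (made up), with raw indices starting at 1.
toy_word_to_id = {"the": 1, "and": 2, "movie": 3, "great": 4}

# Reserve special tokens without shifting the word indices:
# ids 1 and 2 now each have two keys.
toy_word_to_id["<PAD>"] = 0
toy_word_to_id["<START>"] = 1
toy_word_to_id["<UNK>"] = 2

# Inverting keeps the last key inserted for each id.
toy_id_to_word = {i: w for w, i in toy_word_to_id.items()}

sequence = [1, 3, 4, 2]
decoded = " ".join(toy_id_to_word[i] for i in sequence)
print(decoded)  # <START> movie great <UNK>
```

In this toy case the words "the" and "and" can no longer be recovered from their ids. Shifting the word indices by a nonzero offset (Keras uses `index_from=3` by default) avoids the overlap.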

## Check the maximum word index: is it 300?

In [0]:
import numpy as np
np.max(x_train)

In [0]:
np.unique(x_train)

---
## Model with a first layer Embeddings on the input sequence + Flatten + Dense

- Define the model
- Compile it
- Fit train data & evaluate with test data

# <font color= #C70039 > Let's try with a vector size of dimension <font color=black>8</font> </font>

In [0]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

batch_size=32


model = Sequential()

#### NOTE that Embedding requires input_length #######################################
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
# This argument is required if Flatten and then Dense layers follow the Embedding layer

model.add(Embedding(max_features, 8, input_length=maxlen))
# After the Embedding layer, 
# our activations have shape `(samples, maxlen, 8)`.

# We flatten the 3D tensor of embeddings 
# into a 2D tensor of shape `(samples, maxlen * 8)`
model.add(Flatten())

# We add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=15,
                    batch_size=batch_size,
                    validation_split=0.2)

In [0]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score without RNN:', score)
print('Test accuracy without RNN:', acc)
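
As a quick sanity check on what `model.summary()` reports for the Embedding + Flatten + Dense model above, the parameter counts can be worked out by hand. The sketch below just repeats that arithmetic with the values used in this notebook.

```python
max_features = 300   # vocabulary size used in this notebook
embedding_dim = 8    # size of each word vector
maxlen = 20          # words kept per review

embedding_params = max_features * embedding_dim   # one vector per word
flatten_units = maxlen * embedding_dim            # Flatten has no weights
dense_params = flatten_units * 1 + 1              # weights + bias of Dense(1)

print(embedding_params)                 # 2400
print(dense_params)                     # 161
print(embedding_params + dense_params)  # 2561
```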

---
# <font color= #C70039 > Get <font color=black>$embedding\ weights$</font> from the trained <font color=black>Embedding layer</font> in the Keras model</font>

In [0]:
word_embds = model.layers[0].get_weights()[0]

In [0]:
type(word_embds)

In [0]:
word_embds.shape

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(100, 10))
plt.imshow(word_embds.T,cmap='viridis')

---
# <font color= #C70039 > TO DO: apply t-SNE to the weights of the Embedding layer</font>

In [0]:
import time

from sklearn.manifold import TSNE

time_start = time.time()
tsne = ???
tsne_results = ???

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
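
One possible way to approach this step is sketched below on a random stand-in matrix with the same shape as the embedding weights (300 words, 8 dimensions). The `_demo` names, the perplexity, the PCA initialization, and the random seed are illustrative assumptions, not values prescribed by this notebook; other settings may give a more readable map.

```python
import time

import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the trained weights: 300 words x 8 dimensions of random values.
rng = np.random.default_rng(0)
word_embds_demo = rng.normal(size=(300, 8))

time_start = time.time()

# Project the 8-dimensional vectors down to 2 dimensions.
tsne_demo = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
tsne_results_demo = tsne_demo.fit_transform(word_embds_demo)

print('t-SNE done! Time elapsed: {} seconds'.format(time.time() - time_start))
print(tsne_results_demo.shape)  # (300, 2)
```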

---
# <font color= #C70039 > Create a list of words for using them as labels in plots</font>

In [0]:
# One label per embedding row, in index order
word_list = [str(id_to_word[i]) for i in range(max_features)]

---
# <font color= #C70039 > TO DO: plot one t-SNE point with its label in a figure</font>

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))  # in inches

i_to_plot=99

x, y = ???
label = ???

plt.scatter(x, y)

plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
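
As a hint for the two blanks, the sketch below shows the indexing pattern on stand-in data: a random 2-D array plays the role of the t-SNE output and a made-up list plays the role of the word labels. The `_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
coords_demo = rng.normal(size=(300, 2))              # stand-in for the 2-D t-SNE output
labels_demo = ["word_%d" % i for i in range(300)]    # made-up labels

i_to_plot = 99

# Row i of the 2-D array holds the (x, y) coordinates of point i.
x, y = coords_demo[i_to_plot, :]
# The label at the same position names that point.
label = labels_demo[i_to_plot]

print(label, x, y)
```

The resulting `x`, `y`, and `label` can then be passed to `plt.scatter` and `plt.annotate` exactly as in the cell above.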

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt


def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(22, 16))  # in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                 xy=(x, y),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right',
                 va='bottom')
    plt.savefig(filename)


In [0]:
# Finally plotting and saving the fig 
plot_with_labels(tsne_results, word_list)

---
# <font color= #C70039 > ... now you could try the same visualization for the LSTM + Dense model </font>

---
## Now test a Model with a first layer Embeddings on every input + LSTM + Dense

- Define the model
- Compile it
- Fit train data & evaluate with test data

In [0]:
from keras.models import Sequential
from keras.layers import Flatten, Dense, LSTM

batch_size=32

model = Sequential()

# Use the same vocabulary size as the loaded data (max_features)
model.add(Embedding(max_features, 8))
model.add(LSTM(8, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=batch_size,
                    validation_split=0.2)

In [0]:
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score with LSTM:', score)
print('Test accuracy with LSTM:', acc)

## See Pre-trained models

https://nlp.stanford.edu/projects/glove/
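
Pre-trained GloVe vectors are distributed as plain text, with one word per line followed by the components of its vector. The sketch below parses that layout from an in-memory stand-in; the two lines and their values are made up and are not real GloVe vectors.

```python
import io

import numpy as np

# Two made-up lines in the "word v1 v2 ... vn" text layout.
fake_glove_file = io.StringIO(
    "movie 0.1 -0.2 0.3\n"
    "great 0.5 0.4 -0.1\n"
)

# Build a dictionary mapping each word to its vector.
embeddings_index = {}
for line in fake_glove_file:
    parts = line.split()
    word, values = parts[0], parts[1:]
    embeddings_index[word] = np.asarray(values, dtype="float32")

print(len(embeddings_index))            # 2
print(embeddings_index["great"].shape)  # (3,)
```

With a real file, the same loop would read from `open(path, encoding="utf-8")`, where `path` points to the downloaded text file.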

## See Embeddings projector

http://projector.tensorflow.org/