# Understand embeddings with Word2Vec

### Exercise objectives:
- Convert words to vector representations thanks to embeddings
- Discover the powerful Word2Vec algorithm

<hr>

_Embeddings_ are representations of words as vectors. They can be learnt inside a neural network, jointly with the task, but this can take time to converge. Another option is to learn them in a separate first step, then feed the resulting word representations directly into an RNN.



# The data

Keras provides many datasets, among which the IMDB dataset: it consists of sentences that are movie reviews. Each of them is associated with a score given by the review writer.

❓ **Question** ❓ Let's first load the data. You don't have to understand what is going on in the function, it does not matter here.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and your RAM can even overflow. For that reason, **you should start with 10% of the sentences** and see if your computer handles it. Otherwise, rerun with a lower number.

⚠️ **DISCLAIMER** ⚠️ **No need to play _who has the biggest_ (RAM) !** The idea is to get to run your models quickly to prototype. Even in real life, it is recommended that you start with a subset of your data to loop and debug quickly. So increase the number only if you are into getting the best accuracy. 

In [1]:
###########################################
### Just run this cell to load the data ###
###########################################

import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import text_to_word_sequence

def load_data(percentage_of_sentences=None):
    train_data, test_data = tfds.load(name="imdb_reviews", split=["train", "test"], batch_size=-1, as_supervised=True)

    train_sentences, y_train = tfds.as_numpy(train_data)
    test_sentences, y_test = tfds.as_numpy(test_data)
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(train_sentences))
        train_sentences, y_train = train_sentences[:len_train], y_train[:len_train]
  
        len_test = int(percentage_of_sentences/100*len(test_sentences))
        test_sentences, y_test = test_sentences[:len_test], y_test[:len_test]
    
    X_train = [text_to_word_sequence(_.decode("utf-8")) for _ in train_sentences]
    X_test = [text_to_word_sequence(_.decode("utf-8")) for _ in test_sentences]
    
    return X_train, y_train, X_test, y_test

X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=10)



In the previous exercise, we jointly learnt a representation for the words and fed this representation to an RNN, as shown here:

<img src="layers_embedding.png" width="400px" />

However, this increases the number of parameters to learn, which can slow down convergence and make it harder!

For that reason, we will separate the step of learning the word representation from the step of feeding it to an RNN, as shown here:

<img src="word2vec_representation.png" width="400px" />

We will learn the embedding with Word2Vec.

The drawback is that the learnt embedding is not _specifically_ designed for our task. However, learning it independently of the task at hand (sentiment analysis) has some advantages:
- it is very fast to do in general (with word2vec)
- the representation learnt by word2vec is still meaningful 
- the convergence of the RNN alone will be easier and faster

So let's learn an embedding with word2vec and see how meaningful it is!


# Embedding with Word2Vec

Let's use Word2Vec to embed the words of our sentences. Word2Vec will be able to convert each word to a fixed-size vectorial representation.

For instance, we will have:
- 'dog' --> [0.1, -0.3, 0.8]
- 'cat' --> [-1.1, 2.3, 0.7]
- 'apple' --> [3.1, 0.9, -4.7]

Here, your embedding space is of size 3.

What you expect is a representation in which words with close meanings are close in the embedding space, as in this example:

![Embedding](word_embedding.png)
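To make "close in the embedding space" concrete, here is a minimal sketch that measures closeness with cosine similarity on the three toy vectors listed above. The numbers are the illustrative values from this notebook, not vectors learnt by a model, and the `cosine_similarity` helper is written here for illustration.

```python
import numpy as np

# Toy 3-dimensional embeddings copied from the example above (illustrative values)
embeddings = {
    "dog": np.array([0.1, -0.3, 0.8]),
    "cat": np.array([-1.1, 2.3, 0.7]),
    "apple": np.array([3.1, 0.9, -4.7]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
sim_dog_apple = cosine_similarity(embeddings["dog"], embeddings["apple"])

print(f"dog vs cat:   {sim_dog_cat:.3f}")
print(f"dog vs apple: {sim_dog_apple:.3f}")
```

With these toy values, 'cat' scores higher than 'apple' against 'dog', which is the kind of ordering you hope a real embedding will produce.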

❓ **Question** ❓ Let's run Word2Vec! The following code imports Word2Vec from [GENSIM](https://radimrehurek.com/gensim/), a great Python package that makes the use of the Word2Vec algorithm easy, fast and accurate - which is not an easy task. The second line learns the embedding representation of the words from the sentences in `X_train`.

In [3]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.1.2-cp38-cp38-macosx_10_9_x86_64.whl (24.0 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-4.1.2 smart-open-5.2.1


In [4]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=X_train)

Let's look at the embedded representation of some words.

You can use `word2vec.wv` as a dictionary.
For instance, `word2vec.wv['dog']` will return a representation of `dog` in the embedding space.

❓ **Question** ❓ Try different words - especially, try non-existing words to see that they don't have any representation (which is perfectly normal as their representations were not learnt).

In [6]:
len(X_train)

2500

In [7]:
word2vec_other = Word2Vec(sentences=X_train[:1000])

In [16]:
word2vec.wv['cat']

array([-0.15643266,  0.33435693, -0.23366775,  0.2813018 , -0.06015401,
       -0.3247563 ,  0.26593184,  0.7909876 , -0.1884312 , -0.34094706,
       -0.02625695, -0.48829427,  0.1413697 ,  0.1832338 , -0.09167619,
       -0.20158379,  0.24933021, -0.19483404,  0.02383106, -0.3536339 ,
        0.1257421 , -0.01322551,  0.1716626 , -0.29368004,  0.02738187,
        0.03229668, -0.29172075, -0.19693182, -0.32873565, -0.05513848,
        0.17672272, -0.03335209,  0.12025582, -0.56569934, -0.20316178,
        0.27316704,  0.1364251 , -0.20875664, -0.39918387, -0.25136122,
       -0.261277  , -0.36314073, -0.3872287 ,  0.21884948,  0.5352526 ,
       -0.12732477, -0.1233341 , -0.15305407,  0.5883032 ,  0.27347234,
        0.1514674 , -0.22549902, -0.29780993,  0.07541078, -0.34306324,
        0.2549272 ,  0.2706402 , -0.0867511 , -0.06323722,  0.13940597,
        0.31730366, -0.10724132, -0.04988686,  0.27300525, -0.08706812,
        0.20461924, -0.1615899 ,  0.23136964, -0.36312258,  0.32

In [14]:
((word2vec.wv['dog'] - word2vec_other.wv['dog'])/word2vec.wv['dog'])

array([ 8.0501831e-01,  7.3407519e-01,  1.2081716e+00,  1.2036262e+00,
        3.9037783e+00,  6.3459623e-01,  3.3966073e-01,  6.5524292e-01,
        4.2478058e-01,  6.8266410e-01,  6.4848608e-01,  6.5254378e-01,
       -2.7798958e+00,  2.0444266e-01,  1.4286186e+00,  5.2821344e-01,
        6.9730735e-01,  5.4862732e-01,  2.3428869e+00,  4.0071693e-01,
        6.7607351e-02,  1.1319041e+00,  6.0196811e-01,  4.7773981e-01,
       -6.3815123e-01, -2.1211631e+00,  6.1959749e-01,  6.1057067e-01,
        4.2678079e-01, -2.0190909e+00,  6.0705286e-01,  1.2013780e+00,
        6.8422717e-01,  5.5766332e-01,  1.1044139e+00,  8.3218586e-01,
        6.5301913e-01,  5.8150798e-01,  6.1348367e-01,  6.1959779e-01,
        9.8669279e-01,  7.8622591e-01,  8.3216226e-01,  8.5222393e-01,
        6.9723022e-01, -2.6848620e-01,  6.0028267e-01,  1.2149249e+00,
        7.0817971e-01,  3.2506171e-01,  8.7681198e-01,  5.0484598e-01,
        1.0674266e+00, -4.3424149e+00,  6.5786159e-01,  9.6915627e-01,
      

❓ **Question** ❓ What is the size of each word representation, and therefore, what is the size of the embedding space?

In [17]:
len(word2vec.wv['dog'])

100

How do we know if this embedding makes any sense? To check, we will verify that words with a close meaning have close representations.

Let's use the `word2vec.wv.most_similar(...)` method that, given an input word, displays the "closest" words in the embedding space. If the embedding is well done, then words that have close meanings will have close representations in the embedding space.

❓ **Question** ❓ Test the `most_similar` method on different words. 

❗ **Remark** ❗ The quality of the closeness depends on the quality of your embedding, and therefore on the number of sentences that you have loaded to train it.

In [20]:
word2vec.wv.most_similar('dog', topn = 10)

[('using', 0.9886091351509094),
 ('fallon', 0.988295316696167),
 ('passing', 0.9874628782272339),
 ('enormous', 0.9864641427993774),
 ('bright', 0.9864473938941956),
 ('wayne', 0.9861370325088501),
 ('loggia', 0.9857607483863831),
 ('green', 0.9855518341064453),
 ('tommy', 0.9854166507720947),
 ("girl's", 0.9850045442581177)]

Just as `most_similar` works on words directly, `similar_by_vector` does the same thing starting from a vector:

In [21]:
word2vec.wv.similar_by_vector('dog', topn = 10)

[('using', 0.9886091351509094),
 ('fallon', 0.988295316696167),
 ('passing', 0.9874628782272339),
 ('enormous', 0.9864641427993774),
 ('bright', 0.9864473938941956),
 ('wayne', 0.9861370325088501),
 ('loggia', 0.9857607483863831),
 ('green', 0.9855518341064453),
 ('tommy', 0.9854166507720947),
 ("girl's", 0.9850045442581177)]
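To show what such a nearest-neighbour query computes, here is a minimal NumPy sketch that ranks words by cosine similarity to a query vector. It is not gensim's implementation: the vocabulary, the 2-D vectors and the function below are hand-written assumptions for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary and 2-D embeddings, chosen by hand for illustration
vocab = ["dog", "cat", "puppy", "car"]
vectors = np.array([
    [1.0, 0.9],   # dog
    [0.9, 1.0],   # cat
    [1.0, 0.8],   # puppy
    [-1.0, 0.2],  # car
])

def similar_by_vector(query, vocab, vectors, topn=3):
    """Rank vocabulary words by cosine similarity to `query` (highest first)."""
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    best = np.argsort(-scores)[:topn]
    return [(vocab[i], float(scores[i])) for i in best]

# Query with the vector of "dog"
neighbours = similar_by_vector(vectors[0], vocab, vectors)
print(neighbours)
```

Note that when you query with a raw vector, the word the vector came from shows up as its own nearest neighbour. Querying gensim's `most_similar` with the word itself leaves that word out of the results.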

# Arithmetic on words

Now, let's do mathematical operations on words - meaning on their vector representations!

Since every word is represented as a vector, you can do basic arithmetic such as:

$$W2V(good) - W2V(bad)$$

❓ **Question** ❓ Do this mathematical operation and print the result

In [22]:
vec = word2vec.wv['good'] - word2vec.wv['bad']

Now, imagine for a second that, somehow, the following equality holds true - just for a second:

$$W2V(good) - W2V(bad) = W2V(nice) - W2V(stupid)$$

which is equivalent to 

$$W2V(good) - W2V(bad) + W2V(stupid) = W2V(nice)$$

❓ **Question** ❓ Let's, just for fun (as it would be foolish of us to think that this equality holds true ...), do the operation $W2V(good) - W2V(bad) + W2V(stupid)$ and store it in a `res` variable (which will be a vector of size 100 that you can print).

In [23]:
res = word2vec.wv['good'] - word2vec.wv['bad'] + word2vec.wv['stupid']

We earlier said that, for any vector, it is possible to see the closest vectors in the embedding space.

❓ **Question** ❓ Look at the closest vectors to `res` (thanks to the `word2vec.wv.similar_by_vector` function).

In [24]:
word2vec.wv.similar_by_vector(res, topn=10)

[('nice', 0.794916033744812),
 ('potential', 0.7929612398147583),
 ('always', 0.7898537516593933),
 ('done', 0.783433198928833),
 ('decent', 0.7773710489273071),
 ('reworking', 0.7765313982963562),
 ('rarely', 0.7746520042419434),
 ('written', 0.7716220021247864),
 ('known', 0.7715383768081665),
 ('given', 0.7694361209869385)]

Incredible, right? You can do arithmetic operations on words!
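To see why this arithmetic can work at all, here is a sketch with hand-built 2-D vectors (an assumption for illustration, not values taken from the trained model). The first coordinate plays the role of "sentiment" and the second of "describes a person". In such an idealised space, the equality holds exactly.

```python
import numpy as np

# Hand-built toy embeddings: axis 0 ~ sentiment, axis 1 ~ "describes a person"
toy = {
    "good":   np.array([1.0, 0.0]),
    "bad":    np.array([-1.0, 0.0]),
    "nice":   np.array([1.0, 1.0]),
    "stupid": np.array([-1.0, 1.0]),
    "movie":  np.array([0.0, -1.0]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# good - bad moves two steps along the sentiment axis; adding it to "stupid" lands on "nice"
res = toy["good"] - toy["bad"] + toy["stupid"]

# Rank every word by similarity to the result
ranking = sorted(toy, key=lambda word: cosine(res, toy[word]), reverse=True)
print(res, ranking)
```

A real Word2Vec space is much noisier, which is why the trained model only gets 'nice' as an approximate nearest neighbour.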

❓ **Question** ❓ You can try other arithmetic, such as

$$W2V(Boy) - W2V(Girl) = W2V(Man) - W2V(Woman)$$

or 

$$W2V(Queen) - W2V(King) = W2V(actress) - W2V(actor)$$

❗ **Remark** ❗ You will probably see that the results are not perfect. But don't forget that you trained your model on a very small corpus.

In [25]:
word2vec.wv.similar_by_vector(word2vec.wv['queen'] - word2vec.wv['king'] + word2vec.wv['actor'], topn=10)

[('actor', 0.9710970520973206),
 ('role', 0.873417317867279),
 ('actress', 0.8567383885383606),
 ('performance', 0.8553652763366699),
 ('job', 0.8025171756744385),
 ('guy', 0.7906687259674072),
 ('character', 0.7826564908027649),
 ('man', 0.7694869041442871),
 ('emmy', 0.7589439749717712),
 ('villain', 0.7445894479751587)]

You might wonder where this magic comes from (at quite a low price: you just ran one line of code on a very small corpus and it trained within a few minutes). The magic comes from the way Word2Vec is trained. The details are quite complex, but you can remember that Word2Vec, in `word2vec = Word2Vec(sentences=X_train)`, actually trains an internal neural network (that you don't see).

In a nutshell, this internal NN predicts a word from the surrounding words in a sentence. It picks many positions in the different sentences, takes the surrounding words as inputs $X$ and the word at that position as output $y$, and tries to predict $y$. The embedding is learnt as a by-product of this prediction task.

And like any neural network, Word2Vec has some hyperparameters. Let's check some.


# Word2Vec hyperparameters


❓ **Question** ❓ The first important hyperparameter is the `vector_size` argument. It corresponds to the size of the embedding space. Learn a new `word2vec_2` model, still trained on the `X_train`, but with a smaller or higher `vector_size`.

Verify on some words that the corresponding embedding is of your selected size.

In [26]:
word2vec_2 = Word2Vec(X_train, vector_size = 30)

❓ **Question** ❓ Use the `word2vec.wv.key_to_index` attribute to display the size of the learnt vocabulary. Then compare it to the number of different words in `X_train`.

In [29]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()

tokenizer.fit_on_texts(X_train)

X_train_token = tokenizer.texts_to_sequences(X_train)
X_test_token = tokenizer.texts_to_sequences(X_test)

In [32]:
tokenizer.index_word

{1: 'the',
 2: 'a',
 3: 'and',
 4: 'of',
 5: 'to',
 6: 'is',
 7: 'br',
 8: 'in',
 9: 'i',
 10: 'it',
 11: 'this',
 12: 'that',
 13: 'was',
 14: 'as',
 15: 'for',
 16: 'with',
 17: 'but',
 18: 'movie',
 19: 'film',
 20: 'on',
 21: 'not',
 22: 'you',
 23: 'his',
 24: 'are',
 25: 'have',
 26: 'one',
 27: 'be',
 28: 'he',
 29: 'all',
 30: 'at',
 31: 'by',
 32: 'they',
 33: 'an',
 34: 'so',
 35: 'like',
 36: 'who',
 37: 'from',
 38: 'her',
 39: 'or',
 40: 'just',
 41: 'if',
 42: 'out',
 43: 'about',
 44: "it's",
 45: 'has',
 46: 'what',
 47: 'some',
 48: 'there',
 49: 'good',
 50: 'more',
 51: 'when',
 52: 'very',
 53: 'no',
 54: 'up',
 55: 'she',
 56: 'my',
 57: 'time',
 58: 'even',
 59: 'which',
 60: 'would',
 61: 'really',
 62: 'only',
 63: 'had',
 64: 'story',
 65: 'me',
 66: 'see',
 67: 'can',
 68: 'their',
 69: 'were',
 70: 'well',
 71: 'than',
 72: 'much',
 73: 'get',
 74: 'do',
 75: 'great',
 76: 'been',
 77: 'we',
 78: 'first',
 79: 'bad',
 80: 'because',
 81: 'into',
 82: 'other',

In [35]:
len(word2vec_2.wv.key_to_index) , len(tokenizer.index_word)

(8006, 30419)

There is an important difference between the number of distinct words in the train sentences and the size of the Word2Vec vocabulary, even though the model was trained on the train sentence set. The reason is the second important hyperparameter of Word2Vec: `min_count`.

`min_count` is an integer that sets how many occurrences a word must have in the corpus to be given a vector in the embedding space. For instance, let's say that the word "movie" appears 1000 times in the corpus and "simba" only 2 times. If `min_count=3`, the word "simba" will be skipped during training.

The intention is to keep only words that appear often enough in the corpus to get a robust embedded representation.
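The filtering itself is easy to reproduce by hand. The sketch below counts word occurrences with `collections.Counter` on a hypothetical toy corpus and keeps only the words that reach a given `min_count`. It mimics the vocabulary filtering only, not the training, and the helper name is an assumption for this example.

```python
from collections import Counter

# Hypothetical toy corpus of tokenized sentences
toy_sentences = [
    ["this", "movie", "is", "great"],
    ["this", "movie", "is", "bad"],
    ["simba", "is", "in", "this", "movie"],
]

def vocabulary_with_min_count(sentences, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

full_vocab = vocabulary_with_min_count(toy_sentences, min_count=1)
frequent_vocab = vocabulary_with_min_count(toy_sentences, min_count=3)

print(len(full_vocab), sorted(full_vocab))
print(len(frequent_vocab), sorted(frequent_vocab))
```

Raising `min_count` shrinks the vocabulary: rare words such as "simba" disappear, while frequent ones such as "movie" remain.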

❓ **Question** ❓ Learn a new `word2vec_3` model with a `min_count` higher than 5 (which is the default value) and a `word2vec_4` with a `min_count` smaller than 5, and then, compare the size of the vocabulary for all the different word2vec that you have trained (you can choose any `vector_size` you want).

In [34]:
word2vec_3 = Word2Vec(X_train, vector_size=30, min_count=4)

In [36]:
len(word2vec_2.wv.key_to_index),len(word2vec_3.wv.key_to_index)

(8006, 9584)

Remember that we said Word2Vec has an internal neural network that it optimizes based on some predictions? These predictions consist of predicting a word from its surrounding words. The surrounding words are taken from a `window`, whose size is the maximum distance between the predicted word and the words taken into account. You can train Word2Vec with different `window` sizes.
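To picture what the `window` controls, here is a pure-Python sketch that builds (context words, target word) pairs from one sentence. It is a simplified illustration of the idea only: gensim's actual training involves additional details that are not shown here, and the function name is an assumption.

```python
def context_target_pairs(sentence, window):
    """Pair each word with the words at most `window` positions away from it."""
    pairs = []
    for position, target in enumerate(sentence):
        start = max(0, position - window)
        before = sentence[start:position]
        after = sentence[position + 1:position + 1 + window]
        pairs.append((before + after, target))
    return pairs

sentence = ["this", "movie", "is", "really", "great"]

pairs_small = context_target_pairs(sentence, window=1)
pairs_large = context_target_pairs(sentence, window=2)

for context, target in pairs_large:
    print(context, "->", target)
```

A larger window gives each target more context words, so the model sees broader (but less local) co-occurrence information.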

❓ **Question** ❓ Learn a new `word2vec_5` model with a `window` different from the previous one (default is 5).

In [37]:
word2vec_5 = Word2Vec(X_train, vector_size=30, min_count=4, window=8)

The arguments you have seen (`vector_size`, `min_count` and `window`) are usually the ones that you should start changing to get a better performance for your model.

But you can also look at other arguments in the [documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Text8Corpus).



# Convert our train and test set to RNN ready data

Remember that Word2Vec is the first step in the overall process of feeding such a representation into an RNN, as shown here:

<img src="word2vec_representation.png" width="400px" />



Now, let's work on Step 2 by converting the training and test data into their vector representations, ready to be fed into RNNs.

❓ **Question** ❓ Now, write a function that, given a sentence, returns a matrix that corresponds to the embedding of the full sentence, which means that you have to embed each word one after the other and concatenate the results to output a 2D matrix (be sure that your output is a NumPy array).

❗ **Remark** ❗ You will probably notice that some words you are trying to convert throw errors as they are said not to belong to the dictionary:

- for the test set, this is understandable: some words were not in the train set and thus their embedded representation is unknown
- for the train set, due to `min_count` hyperparameter, not all the words have a vector representation

In any case, just skip the missing words here.

In [64]:
"laaaaame" in word2vec.wv.key_to_index

False

In [65]:
import numpy as np
from tensorflow.keras.preprocessing.text import text_to_word_sequence

example = ['this', 'movie', 'is', 'the', 'worst', 'action', 'movie', 'ever']
example_missing_words = ['this', 'movie', 'is', 'laaaaaaaaaame']

def embed_sentence(word2vec, sentence):
    """Embed a list of words as a (n_known_words, vector_size) NumPy array."""
    X = []
    for word in sentence:
        # Skip words without a learnt vector; dict membership avoids rebuilding a list per word
        if word in word2vec.wv.key_to_index:
            X.append(word2vec.wv[word])
    return np.array(X)


### Checks
embedded_sentence = embed_sentence(word2vec, example)
assert(type(embedded_sentence) == np.ndarray)
assert(embedded_sentence.shape == (8, 100))

embedded_sentence_missing_words = embed_sentence(word2vec, example_missing_words)  
assert(type(embedded_sentence_missing_words) == np.ndarray)
assert(embedded_sentence_missing_words.shape == (3, 100))

In [66]:
embed_sentence(word2vec, example_missing_words).shape

(3, 100)

❓ **Question** ❓ Write a function that, given a list of sentences (each sentence being a list of words/strings), returns a list of embedded sentences (each sentence is a matrix). Apply this function to the train and test sentences.

Hint: Use the previous function `embed_sentence`

In [67]:
def embedding(word2vec, sentences):
    return [embed_sentence(word2vec, sentence) for sentence in sentences]
    
X_train = embedding(word2vec, X_train)
X_test = embedding(word2vec, X_test)

In [72]:
len(X_train[0]), len(X_train[1])

(108, 110)

❓ **Question** ❓ In order to have ready-to-use data, do not forget to pad them so that you get tensors that can be split into batches during the optimization. Store the padded values in `X_train_pad` and `X_test_pad`. Do not forget the important arguments of the padding ;)

In [73]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [75]:
X_train_pad = pad_sequences(X_train, maxlen = 200, dtype='float32', padding = 'post')
X_test_pad = pad_sequences(X_test, maxlen = 200, dtype='float32', padding = 'post')

assert(len(X_train_pad.shape) == 3)
assert(len(X_test_pad.shape) == 3)
assert(X_train_pad.shape[2] == 100)
assert(X_test_pad.shape[2] == 100)
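To make explicit what the padding step produces, here is a NumPy sketch of post-padding for embedded sentences. It is a hand-written stand-in for illustration, not the Keras implementation: the `pad_post` function and its `embedding_dim` argument are assumptions. Sequences longer than `maxlen` keep their last `maxlen` rows, which matches the default `truncating='pre'` behaviour of `pad_sequences` as far as I know.

```python
import numpy as np

def pad_post(sequences, maxlen, embedding_dim, dtype="float32"):
    """Stack (n_words, embedding_dim) matrices into one zero-padded 3D tensor."""
    padded = np.zeros((len(sequences), maxlen, embedding_dim), dtype=dtype)
    for i, sequence in enumerate(sequences):
        if len(sequence) == 0:
            continue                      # a sentence with no known word stays all zeros
        kept = sequence[-maxlen:]         # too long: keep only the last maxlen rows
        padded[i, :len(kept)] = kept      # zeros remain after the real words ("post" padding)
    return padded

# Two hypothetical embedded sentences: 2 words and 5 words, embedding size 4
short_sentence = np.ones((2, 4))
long_sentence = np.ones((5, 4))

batch = pad_post([short_sentence, long_sentence], maxlen=3, embedding_dim=4)
print(batch.shape)
print(batch[0])
```

Every sentence ends up with the same number of rows, so the whole dataset becomes a single tensor of shape (number of sentences, `maxlen`, embedding size).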