[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/demos/nlp/w2v_from_scratch.ipynb)

# Word-to-Vec (W2V) in Keras
This notebook is the third demo on natural language processing and word embeddings. Having discovered the [fundamentals of NLP](https://github.com/Humboldt-WI/adams/blob/master/demos/nlp/nlp_foundations.ipynb) and how we can train [custom word embeddings using the Gensim library](https://github.com/Humboldt-WI/adams/blob/master/demos/nlp/word-2-vec.ipynb), the purpose of this demo is to illustrate a from scratch implementation of the W2V algorithm. To that end, we will draw heavily on the very good Word-to-Vec tutorial on the [Tensorflow homepage](https://www.tensorflow.org/tutorials/text/word2vec). As shown there, we provide an implementation based on Keras. 

If you are interested, you can find many tutorials that walk you through a from scratch implementation of W2V using nothing but plain Python and Numpy. Here are some examples:
- [Word2vec from Scratch with NumPy](https://towardsdatascience.com/word2vec-from-scratch-with-numpy-8786ddd49e72)
- [Word2vec from Scratch](https://jaketae.github.io/study/word2vec/)
- [Word2vec from Scratch with Python and NumPy](https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/)

Last, it goes without saying that excellent resources including various Python codes can be obtained from [Dive into Deep Learning](https://www.d2l.ai/), [Chapter 14](https://www.d2l.ai/chapter_natural-language-processing-applications/index.html).

Now, without further delay, let's move on with our ADAMS demo.

## Recap W2V
Remember that W2V proposes two models for learning word vectors, continuous-bag-of-words (CBOW) and Skip-Gram. In a nutshell, CBOW predicts a central target word from surrounding context words, while Skip-Gram takes the opposite approach. Given a <font color='red'>target word</font>, predict <font color='green'>context words</font> with high chance to appear next to the target word in a corpus. Considering one of the above example sentences and a widow size of 2, we can highlight target and context words as follows:<br><br>
[doctors <font color='green'>claim the</font><font color='red'> air </font><font color='green'>you breath</font> defines]. 
<br><br>Using a question mark to indicate the target variable of the model, we obtain:

[doctors *? ?* **air** *? ?* breath] in Skip-Gram versus [doctors *claim the* **?** *you breath* defines] in CBOW.

In this section, we focus on Skip-Gram, which seems to be the preferred approach in practice. The code is based on a tutorial that is part of the [Tensorflow documentation](https://www.tensorflow.org/tutorials/text/word2vec) and hence optimized for keras 2.

Before moving on, let's recall the architecture of the skip-gram W2V model.

![sg](https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png)
<br>
Source: [Wikipedia](https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png)

Given a sentence – better to say sequence of text – we take a target word and predict a set of context words, that is, words, which appear in a certain <font color="green">**context window**</font>  $[w_{-i},\ldots, w, \ldots, w_{+i}]$, where $i$ is the *window size* and the number of context words to consider is window size $\times 2$. 

An important caveat with the above picture is that a corresponding model would not scale. Remember that the output layer involves a high-dimensional softmax which is too costly to compute for any reasonably sized corpus. Among the two options around this problem, *hierarchical softmax* and *negative sampling*, we will make use of the latter. So given a target word, our prediction task will be to classify whether another word is an actual context word for that target word, or a random word sampled from the corpus according to some probability distribution. This is a binary classification task. Thus, the output of our neural network is much cheaper to compute. Instead of a high-dimensional softmax we only need a simple logistic classifier. 

In [1]:
# Import standard libraries
import re
import numpy as np
import pandas as pd

## Importing the IMDB movie review data

In [12]:
import sys

# Configure variables pointing to directories and stored files 
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')  # mount Google-Drive
    directory = '/content/drive/My Drive/ADAMS/'  # adjust to Google drive folder with the data if applicable
else:
    directory = "../NLP/" # adjust to the directory where data is stored on your machine (if running the notebook locally)

sys.path.append(directory)

When running this notebook in Google Colab, ensure that you run it with a GPU as hardware accelerator. To enable this:
- Navigate to Edit → Notebook Settings
- select GPU from the Hardware Accelerator drop-down

In [13]:
import pickle

In [14]:
# We use our cleaned IMDB data set for the demo
with open(directory + 'imdb_clean_full_v2.pkl','rb') as path_name:
    df_imdb = pickle.load(path_name)

In [15]:
df_imdb.head()

Unnamed: 0,review,sentiment,review_clean
0,One of the other reviewers has mentioned that ...,positive,one reviewer mention watch oz episode hooked r...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production film technique una...
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter love time money visually stun film watc...


## Building the vocabulary

Let's start with building our vocabulary. It is common practice to not train on every word but words that occur reasonably frequent. For rare words, training a good embedding is difficult. Remember how this issue motivated subword embeddings like Fasttext. In our example, we simply use the most frequent words from the review corpus and try to compute embeddings for these words. This is the point where our word_counter comes in handy.

In [16]:
from collections import Counter

In [None]:
# This code is copied from the NLP foundations notebook. 
word_counter = Counter()
for r in df_imdb['review_clean']:
    for w in r.split():  # this is like tokenizing using the white space
        word_counter.update({w: 1})

# Extract the n most common words from the corpus
#vocab_size = 4000
vocab_size = 1000  # run the code with a vocab size of only 1000 to speed up the process
vocab = word_counter.most_common(vocab_size)
vocab = [x[0] for x in vocab]
vocab[:10]

['movie', 'film', 'one', 'make', 'like', 'see', 'get', 'well', 'time', 'good']

Next task is to build a dictionary. For Keras, we need to encode words as integers, which Keras will then interpret as indices into a one-hot vector of the size of the vocabulary. We build two dictionaries. One to map words to their code (i.e. unique integer) and one to revert the mapping and decode words.

In [22]:
idx = range(1, vocab_size)
word2id = dict(zip(vocab, idx))
id2word = dict(zip(idx, vocab))

In [23]:
print('Vocabulary size: {}'.format(vocab_size))
print('Vocabulary Sample:', list(word2id.items())[:10])
print(list(word2id.items())[-10:])

Vocabulary size: 1000
Vocabulary Sample: [('movie', 1), ('film', 2), ('one', 3), ('make', 4), ('like', 5), ('see', 6), ('get', 7), ('well', 8), ('time', 9), ('good', 10)]
[('mood', 990), ('regard', 991), ('jane', 992), ('garbage', 993), ('reference', 994), ('barely', 995), ('haunt', 996), ('super', 997), ('humour', 998), ('impressive', 999)]


You may have noted that we have so far left out the index 0. This index is commonly reserved for unknown words, which we map to a special token. Remember that our vocabulary is not very large when compared to the number of words that exists in a language (e.g. ~ 300k in English). So when processing texts, we will run into a lot of unknown words. We deal with these words by mapping them to the token `UNK`. This way, we learn one embedding for all unknown words.

In [24]:
word2id['UNK'] = 0
id2word[0] = 'UNK'

Now we are ready to turn our reviews into integer numbers, which is the format that Keras expects, while accounting for unknown words. 

In [25]:
# Build the corpus for W2V by encoding the reviews
def encode_review(review, dictionary):
    output = []
    for word in review:
        if word not in dictionary.keys():
            output.append(dictionary['UNK'])
        else:
            output.append(dictionary[word])
    return output

In [26]:
coded_reviews = []
for r in df_imdb['review_clean']:
    coded_reviews.append(encode_review(r.split(), word2id))

In [27]:
# Some testing
id_demo_review = 8  # one random review
demo_review = df_imdb['review_clean'][id_demo_review]
print(demo_review)  # plain text after cleaning

# One-hot-coding representation in which integer numbers represent the
# index of the single non-zero element in a one-hot-vector of dimensionality 
# vocab_size
print(coded_reviews[id_demo_review])

encourage positive comment film look forward watch film bad mistake see film truly one bad awful almost every way edit pace storyline soundtrack song lame country tune played less four time film look cheap nasty boring extreme rarely happy see end credit film thing prevents give score harvey keitel far best performance least seem make bit effort one keitel obsessive
[0, 928, 306, 2, 22, 687, 11, 2, 14, 840, 6, 2, 276, 3, 14, 280, 123, 85, 35, 432, 418, 614, 583, 257, 677, 442, 0, 162, 251, 519, 9, 2, 22, 556, 0, 267, 0, 0, 413, 6, 25, 379, 2, 37, 0, 32, 385, 0, 0, 134, 49, 65, 129, 40, 4, 107, 453, 3, 0, 0]


In [29]:
# One more example: compare the id's of words in this demo text with those from the output of encoding review no. 8 above
demo_txt = ["movie", "positive", "comment", "silly"]
encode_review(demo_txt, word2id)

[1, 928, 306, 531]

In [30]:
if len(coded_reviews[id_demo_review]) == len(demo_review.split()):
    print('Looks good')
else:
    raise ValueError('This can\'t be right')

Looks good


## Generating training data

The training data for our skip-gram model consists of tuples (target, context) with corresponding label (0/1), indicating whether the second word really appeared in the context of the target word or not. Fortunately, Keras has ready-made functions that we can use to generate that training data. Below you can find an overview of the process from the tensorflow documentation.

<div>
<img src="https://tensorflow.org/tutorials/text/images/word2vec_negative_sampling.png" width="650"/>
</div>

Source: [Tensorflow Documentation](https://tensorflow.org/tutorials/text/images/word2vec_negative_sampling.png)

Let's first illustrate the function `skipgrams()` for a single short sentence.

In [31]:
from tensorflow.keras.preprocessing.sequence import skipgrams

In [32]:
# Let's pick a random review (you can use any text)
text = df_imdb['review_clean'][27521]
print(text)

# Encoded version
encoded_text = coded_reviews[27521]
print(encoded_text)

read book forget movie
[234, 144, 626, 1]


Remember that a window size of `i` translates to $[w_{-i},\ldots, w, \ldots, w_{+i}]$, so the number of context words to consider is window size $\times 2$. 

In [33]:
positive_skip_grams, _ = skipgrams(encoded_text,
                         vocabulary_size=vocab_size,
                         window_size=2,
                         negative_samples=0)

for i in range(len(positive_skip_grams)):
    print('({:s} ({:d}), {:s} ({:d}))'.format(id2word[positive_skip_grams[i][0]], 
                                              positive_skip_grams[i][0], 
                                              id2word[positive_skip_grams[i][1]],
                                              positive_skip_grams[i][1]))

(forget (626), movie (1))
(read (234), book (144))
(read (234), forget (626))
(book (144), movie (1))
(book (144), read (234))
(forget (626), book (144))
(book (144), forget (626))
(movie (1), forget (626))
(movie (1), book (144))
(forget (626), read (234))


What we get is a list of positive skipgrams. Next we will have to sample negative skipgrams, i.e. those that don't appear in the context window. According to the empirical evidence, the probability of a word to be sampled as a negative example should be relative to its frequency. Otherwise, we might end up focussing too much on the most frequent words. Keras provides the utility function `make_sampling_table` to calculate sampling weights for each word in the corpus. Details are available in the [Tensorflow documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/make_sampling_table). The sampling table is a list of sampling probabilities, one for each word. 

In [34]:
from tensorflow.keras.preprocessing.sequence import make_sampling_table
sampling_table = make_sampling_table(vocab_size)
sampling_table[:10]

array([0.00315225, 0.00315225, 0.00547597, 0.00741556, 0.00912817,
       0.01068435, 0.01212381, 0.01347162, 0.01474487, 0.0159558 ])

Note the increasing magnitude of the sampling weights. Sampling words from the corpus using this sampling distribution requires that the words in the corpus are ordered by frequency. The idea is that when sampling negative examples we do not want to focus too much on the frequent words – in our case words like 'movie', 'film', and 'like'. We therefore raise the chance of less frequent words to be sampled as negative examples in accordance to their rank.

Let's now put everything together and create a helper function that will take care of generating the training data for our model.

In [35]:
import tensorflow as tf
from tqdm import tqdm
from tensorflow.keras import layers

In [36]:
# Set the window size
window_size = 2
# Set the number of negative samples per positive context
num_ns = 4

In [37]:
def generate_training_data(sequences, window_size, num_ns, vocab_size, seed):
    ''' This function generates skip-gram pairs with negative sampling for a list
    of sequences (int-encoded sentences) based on window size, number of negative
    samples and vocabulary size. '''

    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # Build the sampling table for `vocab_size` tokens.
    sampling_table = make_sampling_table(vocab_size)

    # Iterate over all sequences (sentences) in the dataset.
    for sequence in tqdm(sequences):

        # Generate positive skip-gram pairs for a sequence (sentence).
        positive_skip_grams, _ = skipgrams(
            sequence,
            vocabulary_size=vocab_size,
            sampling_table=sampling_table,
            window_size=window_size,
            negative_samples=0)

        # Iterate over each positive skip-gram pair to produce training examples
        # with a positive context word and negative samples.
        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(
                tf.constant([context_word], dtype='int64'), 1)
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1,
                num_sampled=num_ns,
                unique=True,
                range_max=vocab_size,
                seed=seed,
                name='negative_sampling')

            # Build context and label vectors (for one target word)
            negative_sampling_candidates = tf.expand_dims(
                negative_sampling_candidates, 1)

            context = tf.concat(
                [context_class, negative_sampling_candidates], 0)
            label = tf.constant([1]+[0]*num_ns, dtype='int64')

            # Append each element from the training example to global lists.
            targets.append(target_word)
            contexts.append(context)
            labels.append(label)

    # Transform the lists into numpy arrays
    targets = np.array(targets)
    contexts = np.array(contexts)[:, :, 0]
    labels = np.array(labels)

    return targets, contexts, labels

In [38]:
targets, contexts, labels = generate_training_data(
    sequences=coded_reviews,
    window_size=window_size,
    num_ns=num_ns,
    vocab_size=vocab_size,
    seed=111)

100%|███████████████████████████████████████████████████████████████████████████| 50000/50000 [02:00<00:00, 414.21it/s]


## Building the neural network

We are ready to design our NN architecture using Keras. We feed the network with pairs of target words and actual/fake context words. Each word is put through an embedding layer. Remember that W2V trains two embeddings per word, one when the word is the target word and one when the word appears in the context of some other target word. So using two embedding layers is important.

Having obtained word embeddings for the target and context word, we pass these embeddings to a merge layer in which we compute the dot product of these two vectors. We can think of the dot products as an unnormalized cosine similarity between the two embedding vectors. Put differently, we obtain a similarity score. We want that score to be large when the inputted 'context' word actually appeared in the context of the target word, and small otherwise. Hence, we forward the similarity score to a dense sigmoid layer, which computes a probability of the 'context' word being an actual context word. We then compare this probability, the output of our neural network, to the actual label, which we obtained above from `generate_training_data` function. Enter back-propagation. 

So far so good, but there is one issue. Our network is a little more advanced than those we have built so far. There were also some changes when moving to Keras 2, which hit us in this example. Long story short, we cannot use the nice and simple sequential API anymore and will have to use the functional API instead. For this reason, the code will look a little different from what you are used to. 

In [39]:
# Create tensorflow dataset from the generated training data (this includes
# randomizing the data, creating batches and speeding up the performance by
# enabling caching and prefetching)
dataset = (tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
           .shuffle(10000)
           .batch(1024, drop_remainder=True)
           .cache()
           .prefetch(buffer_size=tf.data.AUTOTUNE))

In [40]:
# Define the structure of the keras model
class Word2Vec(tf.keras.Model):
  
    # Define layers in __init__ function
    def __init__(self, vocab_size, embedding_dim, num_ns):
        super().__init__()
        self.target_embedding = layers.Embedding(vocab_size,
                                                 embedding_dim,
                                                 input_length=1,
                                                 name='target_embedding')
        self.context_embedding = layers.Embedding(vocab_size,
                                                  embedding_dim,
                                                  input_length=num_ns+1,
                                                  name='context_embedding')
        # 5 dimensions in the dense layer, because we have one positive and 4
        # negative context words
        self.dense = layers.Dense(5, activation='sigmoid')

    # Implement forward pass in call function
    def call(self, pair):
        target, context = pair
        word_emb = self.target_embedding(target)
        context_emb = self.context_embedding(context)
        dot_product = tf.einsum('be,bce->bc', word_emb, context_emb)
        output = self.dense(dot_product)
        return output

In [41]:
# Initiate the model and configure it with an optimizer, loss and metrics
embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim, num_ns)
word2vec.compile(optimizer='adam',
                 loss=tf.keras.losses.CategoricalCrossentropy(),
                 metrics=['accuracy'])

In [42]:
# Let's now fit the model while iterating over the dataset 20 times
word2vec.fit(dataset, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x1eb7b3083a0>

See how the model is not much of a neural network? The overwhelming majority of the trainable parameters are the embeddings, which are then dot-multiplied. We thus have two hidden layers side-by-side rather than one after the other and no non-linear activation of the hidden layers! This is very similar to matrix factorization and you can use the same architecture to build a collaborative filter on users (one embedding matrix) and items (one embedding matrix); just in case you are into recommender engines.

In [43]:
word2vec.summary()

Model: "word2_vec"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 target_embedding (Embedding  multiple                 128000    
 )                                                               
                                                                 
 context_embedding (Embeddin  multiple                 128000    
 g)                                                              
                                                                 
 dense (Dense)               multiple                  30        
                                                                 
Total params: 256,030
Trainable params: 256,030
Non-trainable params: 0
_________________________________________________________________


Here is a perhaps more intuitive visualization of the model:

<img src="https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png">

Source: [Medium](https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png)

## Extracting the weights

We can extract the word embeddings from the target embedding layer of our model  using `Model.get_layer` and `Layer.get_weights`. Converting the embeddings to a dataframe facilitates a quick look.

In [44]:
word_embeddings = word2vec.get_layer('target_embedding').get_weights()[0]
print(word_embeddings.shape)
w2v_df = pd.DataFrame(word_embeddings, index=id2word.values())
w2v_df.head()

(1000, 128)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
movie,0.007403,-0.013269,0.039849,-0.005933,-0.043833,0.015909,0.019333,-0.041979,-0.009944,-0.001832,...,-0.019483,0.044691,-0.015024,0.047083,0.040798,0.006431,0.024771,-0.007073,-0.014015,-0.007731
film,0.127619,0.146523,-0.105528,-0.156217,0.192956,0.189649,-0.102191,0.128531,-0.142733,0.173058,...,0.010634,-0.139883,0.158247,-0.110454,0.180473,0.097302,0.106072,0.114954,0.145535,0.128355
one,0.145414,0.086958,-0.155354,-0.154791,0.184328,0.169703,-0.148181,0.158294,-0.169188,0.149171,...,0.047883,-0.125324,0.092685,-0.156749,0.166181,0.155183,0.147127,0.195911,0.144122,0.113477
make,0.1704,0.067413,-0.146721,-0.101469,0.144049,0.180389,-0.18578,0.169884,-0.153277,0.10354,...,0.000444,-0.19223,0.084499,-0.189832,0.218721,0.132661,0.108176,0.140849,0.199588,0.081444
like,0.114674,0.118794,-0.163889,-0.181996,0.181742,0.164045,-0.160769,0.137906,-0.106992,0.185597,...,0.072031,-0.136441,0.148543,-0.19412,0.169525,0.155139,0.144008,0.160898,0.149344,0.183718


Now that we have the word embeddings we can calculate how similar words are to each other. We use some scikit-learn functionality to create a matrix of pairwise distances between words. We can then query the most similar words to some seed-words.

In [45]:
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(w2v_df)
print(f'distance_matrix.shape: {distance_matrix.shape}')

distance_matrix.shape: (1000, 1000)


In [47]:
# note that the results will not (yet) make much sense if you trained on a small corpus
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]-1].argsort()[1:6]+1] 
                   for search_term in ['movie', 'bad', 'good']}

similar_words

{'movie': ['us', 'anti', 'hole', 'disappoint', 'convincing'],
 'bad': ['three', 'head', 'wish', 'attack', 'always'],
 'good': ['might', 'right', 'say', 'future', 'home']}