[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/Ex10_word2vec/Ex10_word2vec.ipynb) 

# ADAMS Tutorial #10 Word2Vec

The tutorial revisits the famous Word-to-Vec (W2V) model for learning word embeddings. We introduce the Gensim library, which offers a nice interface to train embeddings. In addition, we implement a W2V model on our own using Keras. In case you would like to take it one step further and code everything yourself using just `numpy`, I recommend [Nathan Rooy's post](https://nathanrooy.github.io/posts/2018-03-22/word2vec-from-scratch-with-python-and-numpy/), the code of which is available from his [GitHub repo](https://github.com/nathanrooy/word2vec-from-scratch-with-python/blob/master/word2vec.py). A re-implementation of his example with a nice spreadsheet demo is available on [Towards Data Science](https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281).

#### Here is the outline of the tutorial

1. The IMDB Movie Review data set
2. Training W2V embeddings using Gensim
3. Manual W2V using Keras


Let's get started.

## 1. The IMDB Movie Review data set

We use a popular NLP data set consisting of movie reviews posted at [IMDB](https://www.imdb.com/). The data is available in different sizes and shapes (cleaned, raw, ...) on the web. We use a version from Kaggle, which includes 50K reviews and binary labels indicating whether a review is positive or negative. The labels are useful for sentiment analysis, which we will do in our next tutorial. Today, we will not use them and focus exclusively on the text of the reviews. You can download the data from Kaggle: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews/data

### Data integration and cleaning
We need a couple of libraries to pre-process the data. Most steps follow the example of our first NLP tutorial. For example, we will again use the `NLTK` toolkit for standard NLP operations. Although not the focus of this tutorial, we also use a library called `Beautiful Soup`, which has gained a lot of popularity in web scraping. We use it to deal with HTML tags that might occur in some of the reviews. So you might need to install the `bs4` library before running the following code.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Library re provides regular expression functionality
import re

# To keep an eye on runtimes
import time

# Saving and loading objects
import pickle

# Library beautifulsoup4 handles HTML
from bs4 import BeautifulSoup

# Standard NLP workflow
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

#### Load the data

In [None]:
# Remember to adjust the path so that it matches your environment
df = pd.read_csv("../../data/IMDB-50K-Movie-Review.zip", sep=",", encoding="ISO-8859-1")
df.info()

In [None]:
df.head()

Apparently, some of the reviews include HTML. So we will have to do some data cleaning.

In [None]:
df.loc[1, 'review']

The data has a nice balanced distribution of positive and negative reviews. 

In [None]:
df['sentiment'].value_counts()

In [None]:
# Map label
df['sentiment'] = df['sentiment'].map({'positive' : 1, 'negative': 0})

#### Sampling
Working with the full data set of 50K reviews is time-consuming. For the tutorial, you might want to use a random sample instead. On a modern computer, a sample size of 5000 should be feasible without increasing the runtime too much; the code below uses an even smaller default, which you can adjust. We will also make available results from using the full data set, for example a cleaned version of the full data set and word embeddings trained on it. 

In [None]:
# Draw a random sample to save time; replace=False ensures that no review is drawn twice
sample_size = 500
idx = np.random.choice(df.shape[0], size=sample_size, replace=False)
df = df.loc[idx,:]

df.reset_index(inplace=True, drop=True)  # dropping the index prohibits a reidentification of the cases in the original data frame
df.sentiment.value_counts()

### NLP pipeline
Our NLP workflow is almost the same as in the previous tutorial. We just clean up the code a little by putting everything into one function *clean_reviews()*. In that function, we will use lemmatization. As you might remember from Tutorial #9, the lemmatizer can take into account the grammatical form of a word, e.g., whether it is used as a noun, verb, etc. We implement a little helper function that uses the POS tagger of the NLTK toolkit. This will allow us to select a suitable grammatical form for the lemmatizer. Here is the helper function. 

In [None]:
# Lemmatize with POS Tag
def get_wordnet_pos(word):
    """Map POS tag to first character for lemmatization"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
# Test the helper function
[get_wordnet_pos(x) for x in ["house", "car", "go", "nice","nicely"]]

And here is our function to clean reviews. We implement it in such a way that it receives as input a set of reviews (in our case a Pandas series object), iterates over that set, and pre-processes every review. Alternatively, we could have written the function such that it processes a single review. The latter approach would then facilitate calling it via *Series.apply()* or *DataFrame.apply()*. Which approach is better is probably a matter of taste. 

In [None]:
def clean_reviews(df):
    """ Standard NLP pre-processing chain including removal of html tags, non-alphabetic characters, and stopwords.
        Words are lemmatized with the WordNet lemmatizer using POS tags determined by the NLTK tagger. 
    """
    reviews = []

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))  # build the stopword set once; set lookups are fast
    
    print('*' * 40)
    print('Cleaning {} movie reviews.'.format(df.shape[0]))
    counter = 0
    for review in df:
        
        # remove html content (naming the parser explicitly avoids a bs4 warning)
        review_text = BeautifulSoup(review, "html.parser").get_text()
        
        # remove non-alphabetic characters
        review_text = re.sub("[^a-zA-Z]"," ", review_text)
    
        # tokenize the sentences
        words = word_tokenize(review_text.lower())
  
        # filter stopwords
        words = [w for w in words if w not in stop_words]
        
        # lemmatize each word to its lemma
        lemma_words =[lemmatizer.lemmatize(i, get_wordnet_pos(i)) for i in words]
    
        reviews.append(lemma_words)
              
        if (counter > 0 and counter % 500 == 0):
            print('Processed {} reviews'.format(counter))
            
        counter += 1
        
    print('DONE')
    print('*' * 40)

    return(reviews) 

In [None]:
#* Do the cleaning
# CAUTION: depending on your data set size, the processing might take a while 
reviews = clean_reviews(df.review)

In [None]:
# Check all is well
print(df.review[0])
print(reviews[0])
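
The single-review alternative mentioned above could be sketched as follows. This is a simplified illustration that uses only the standard library: the regular expression for tags is much cruder than Beautiful Soup, the tiny stopword list `STOPWORDS` is a made-up stand-in for the NLTK list, and lemmatization is left out. The function name `clean_single_review` is hypothetical.

```python
import html
import re

# Assumption: a tiny illustrative stopword list; the tutorial uses NLTK's English list
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "it", "of", "to", "this", "was"})

def clean_single_review(text, stopwords=STOPWORDS):
    """Clean one review: strip tags, keep letters only, lowercase, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)     # crude tag removal (Beautiful Soup is more robust)
    text = html.unescape(text)               # turn entities such as &amp; into characters
    text = re.sub(r"[^a-zA-Z]", " ", text)   # remove non-alphabetic characters
    return [w for w in text.lower().split() if w not in stopwords]

print(clean_single_review("This movie was <br /><br />GREAT &amp; funny!"))
```

With a function of this shape, something like `df['review'].apply(clean_single_review)` would return a series of token lists.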

### Saving the data
Should you have used the full data set in the above cleaning, you will want to store your results. The following code exemplifies the use of the `pickle` library, which offers an easy way to store Python objects on your hard disk. The code is pretty self-explanatory. You can also skip over it, e.g. when using only a small sample of reviews. 

In [None]:
# Save cleaned reviews using Pickle
# 'wb' specifies 'write (open in binary mode)'
# binary mode is important on Win for non-text files
with open('imdb_clean.pkl','wb') as path_name:
    pickle.dump(reviews, path_name)

# 'rb' specifies 'read (open in binary mode)'
with open('imdb_clean.pkl','rb') as path_name:
    reviews = pickle.load(path_name)

#### Full cleaned IMDB data set 
If you do not want to invest the time to clean the full IMDB data set with its 50K reviews, you can find a clean version on Moodle. The following code loads that data set from disk (assuming it is located in your working directory).   

In [None]:
with open('imdb_clean_full.pkl','rb') as path_name:
    reviews = pickle.load(path_name)
len(reviews)

#### Bird's eye view
Let's have a look at what folks talk about in this data set. Using the class *Counter* from the collections package, we can easily count word occurrences and query the most common words. We can also check the number of occurrences of specific words. We do not really need the *word_counter* here and only use it to get a feeling for the data set. However, note that we will use it later on when building a vocabulary for our manual W2V model.


In [None]:
# Loop through the words and update a counter keeping track of word counts
import collections

word_counter = collections.Counter()
for r in reviews:
    for w in r:
        word_counter.update({w: 1})
        
word_counter.most_common(20)

In [None]:
#* Check frequency of some target word
word_counter["tarantino"]
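
As a side note, the nested loop above can be written more compactly by handing *Counter* a flat stream of tokens. The sketch below uses a made-up toy corpus (`toy_reviews`) in place of the cleaned reviews.

```python
import collections
from itertools import chain

# Assumption: a toy corpus standing in for the cleaned reviews
toy_reviews = [["good", "movie"], ["bad", "movie", "bad", "plot"]]

# chain.from_iterable flattens the list of token lists into one stream of tokens
toy_counter = collections.Counter(chain.from_iterable(toy_reviews))

print(toy_counter.most_common(2))
print(toy_counter["tarantino"])  # missing words count as 0 instead of raising a KeyError
```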

## 2. Training W2V Embeddings using Gensim
When it comes to embeddings, the most typical use case is to **download pre-trained embeddings** and employ these for some downstream tasks (with or without fine-tuning). The Keras *embedding layer* supports that use case very well. We will make use of it in a next tutorial. Another use case is that you want to **train your own embeddings**. Since this tutorial aims at deepening our understanding of W2V, we focus on this use case.

*Gensim* is a popular library for text processing. Although maybe even more geared toward topic modeling, it offers, amongst others, implementations of algorithms to learn word embeddings such as *W2V* and *Fasttext*, and it can load pre-trained vectors from other models such as *GloVe*. The following demonstrates training W2V embeddings on the IMDB data using Gensim. By the way, you might need to install it ;)   

### Recap W2V
Let's quickly revisit the principles of W2V. Please consult the paper of Mikolov et al. (2013) for a detailed description.

W2V establishes a word's meaning by the words that frequently appear close-by (distributional semantics). More specifically, the context of a word consists of the words that appear next to it within a pre-defined window (let's say 5 words).

 - the quality of *air* in mainland China has been decreasing since..
 - doctors claim the *air* you breathe defines the overall wellbeing...
 - the currents of hot *air* have been bursting from underground
 - the mountain *air* was crystal clean and filled with ..
 - in case of *air* supply shortages, the submarine will..

Taking the word *air* as our **target word**, the words around *air*, called context words, define the **meaning** of the word *air* in W2V.

![w2vprocess](w2v.jpg)
<br>
inspired by https://www.youtube.com/watch?v=BD8wPsr_DAI

### The Gensim W2V model

Training word embeddings using Gensim is very easy. But note that, depending on your data, the code may take quite a while to run. Again, word embeddings trained on the full 50K data set for 500 epochs are available on Moodle.

In [None]:
# CAUTION: Running the code might take a long time
from gensim.models import Word2Vec    

emb_dim = 100  # embedding dimension
# Train a Word2Vec model
model = Word2Vec(reviews, 
                 min_count=1,  # minimum frequency: words that occur fewer than min_count times are ignored
                 window=5,     # size of the context window
                 iter=100,     # number of passes over the corpus, i.e., epochs (renamed to `epochs` in Gensim 4)
                 size=emb_dim, # embedding dimension (renamed to `vector_size` in Gensim 4)
                 workers=2)    # number of worker threads for parallel training
# summarize the trained model
print(model)
words=list(model.wv.vocab)     # Gensim 4 replaces `wv.vocab` with `wv.key_to_index`

### Input / output handling
Gensim supports saving and loading of trained embeddings in two ways. Option a) allows you to load embeddings and continue training. This is very useful, for example to save intermediate results when training real embeddings for many epochs. The disadvantage is that saving/loading takes longer and that the file consumes more disk space. Therefore, when you are done with the training and only want to store the embeddings, you should go for option b). We showcase both options below. More information is available on the [Gensim homepage](https://radimrehurek.com/gensim/models/word2vec.html).

In [None]:
# Option a) save in such a way that you can continue training later
embs="gensim_movie_embeddings.model"
model.save(embs)

# Overwrite variable to show that loading works
model = 0
# Load model from disk
model =  Word2Vec.load(embs)
model.wv['nice']  # get one embedding to show that loading worked

In [None]:
# Option b) save only the trained word vectors; continuation of training is not possible but IO speed increases
embs="w2v_movie_embeddings.model"
save_as_bin = False
model.wv.save_word2vec_format(embs, binary=save_as_bin)  # set binary to True to save disk space; false facilitates inspecting the embeddings in a text editor

model = 0

# Load model from disk
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(embs, binary=save_as_bin)
model['nice']  # get one embedding to show that loading worked

### Working with the trained embeddings

We use a pre-trained version of the embeddings, which were trained on the full IMDB data set for 500 epochs. The examples are inspired by [this Kaggle kernel](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial). If you want to visualize the trained word vectors, have a look at [this post](https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne). It is fairly easy to create a t-SNE visualization, but to get meaningful results you would need to prepare the data more carefully, for example by removing very frequent and very infrequent words. Ultimately, working with the embeddings would involve building a proper NLP model and using it to solve some downstream task. So maybe this section should rather be called playing with embeddings.

In [None]:
# This data set is available on moodle
model = KeyedVectors.load_word2vec_format("w2v_imdb_full_500_epocs.model", binary=True)

#### Which word is most similar to another word?

In [None]:
model.most_similar(positive=['bad'])

#### How similar are two words?

In [None]:
model.similarity('good', 'great')

In [None]:
print('How similar is Tarantino to Spielberg: {}'.format(model.similarity('tarantino', 'spielberg')))
print('How similar is Emmerich to Spielberg: {}'.format(model.similarity('emmerich', 'spielberg')))

print('How similar is Paltrow to Bullock: {}'.format(model.similarity('paltrow', 'bullock')))
print('How similar is Paltrow to Alba: {}'.format(model.similarity('paltrow', 'alba')))

print('How similar is Cruise to Depp: {}'.format(model.similarity('cruise', 'depp')))
print('How similar is Cruise to Willis: {}'.format(model.similarity('cruise', 'willis')))
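
The similarity score reported above is the cosine similarity of the two word vectors, that is, the cosine of the angle between them. A minimal sketch in plain Python, using made-up three-dimensional vectors rather than trained embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumption: hand-made toy vectors for illustration only
good = [1.0, 2.0, 0.5]
great = [2.0, 4.0, 1.0]     # same direction as `good`, twice the length
awful = [-1.0, -2.0, -0.5]  # opposite direction

print(cosine_similarity(good, great))
print(cosine_similarity(good, awful))
```

The length of the vectors does not matter, only their direction: vectors pointing the same way score 1 and vectors pointing in opposite directions score -1.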


#### Which word does not fit in?

In [None]:
# Use print() because a notebook cell only displays the value of its last expression
print(model.doesnt_match(['cool', 'great', 'lovely', 'weak']))
print(model.doesnt_match(['cruise', 'willis', 'pacino', 'reeves']))

#### A is to B as C is to ? 

In [None]:
model.most_similar(positive=['woman', 'spielberg'], negative=['man'], topn=5)
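
The idea behind such analogy queries is vector arithmetic: add the vectors of the positive words, subtract the vectors of the negative words, and look for the word whose embedding is closest to the result. Gensim additionally normalizes the vectors before combining them; the sketch below skips that detail and works with hand-made two-dimensional toy vectors, so the words and numbers are assumptions for illustration only.

```python
import math

# Assumption: hand-made 2-d vectors for illustration, not trained embeddings
toy_vectors = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("man", "woman", "king", toy_vectors))
```

In this notation, the Gensim call above corresponds to `analogy('man', 'woman', 'spielberg', ...)`.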

### Phrase detection
W2V trains one embedding per word. The model is agnostic of common phrases such as 'New York'. It would train one embedding for *new* and another for *york*, provided both words are part of the vocabulary. You can get better embeddings by adding common phrases to the vocabulary. W2V will then train individual embeddings for these phrases. Gensim also comes with a phrase detection model, which allows you to handle bigrams, trigrams and the like. We will not retrain our W2V model but sketch how you can use Gensim to get these common phrases. You could then consider adding (some of) them to your vocabulary to enhance the model.  

In [None]:
from gensim.models.phrases import Phrases

# Train a bigram model
bigram_model = Phrases(reviews, min_count=10) 

In [None]:
# Compare the original review and the version after phrase detection
reviews[0]

In [None]:
bigram_model[reviews[0]]

We can again make use of the *Counter* class to examine the most common bigrams in the corpus, as follows:

In [None]:
bigram_counter = collections.Counter()
for key in bigram_model.vocab.keys():
    if key.decode().find('_')>-1: # the decode is needed because this version of Gensim stores keys as bytes
        bigram_counter[key] += bigram_model.vocab[key]

In [None]:
bigram_counter.most_common(25)

The above bigrams might be frequent. However, you would not consider training individual embeddings for phrases such as *look_like* or *waste_time*. This shows that proper phrase detection in the scope of W2V is nontrivial and would require more work before we can hope to get decent results.     
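
To get a feeling for how such phrase detection works, here is a sketch of the bigram score proposed by Mikolov et al. (2013): the count of the bigram, discounted by a constant `delta`, divided by the product of the counts of the two words. Gensim's default scoring follows the same idea but differs in detail (for example, in how the score is scaled), so treat this as an illustration of the principle and not as Gensim's implementation. The toy corpus is made up.

```python
from collections import Counter

# Assumption: a tiny hand-made corpus of tokenized documents
toy_corpus = [
    ["new", "york", "great", "city"],
    ["love", "new", "york"],
    ["look", "like", "new", "york"],
    ["look", "good"],
    ["like", "movie"],
    ["look", "like", "fun"],
]

unigrams = Counter(w for doc in toy_corpus for w in doc)
bigrams = Counter(pair for doc in toy_corpus for pair in zip(doc, doc[1:]))

def phrase_score(a, b, delta=1):
    """(count(a b) - delta) / (count(a) * count(b)); delta discounts rare pairs."""
    return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

print(phrase_score("new", "york"))
print(phrase_score("look", "like"))
```

A pair scores high if its words occur together more often than their individual frequencies suggest. In the toy corpus, *new* and *york* always occur together, whereas *look* and *like* also occur on their own, so the former pair receives the higher score.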

## 3. Manual W2V using Keras

In the following, we will re-implement W2V in Keras. Remember that W2V proposes two models for learning word vectors, continuous-bag-of-words (CBOW) and Skip-Gram. In a nutshell, CBOW predicts a central target word from surrounding context words, while Skip-Gram takes the opposite approach: given a <font color='red'>target word</font>, predict the <font color='green'>context words</font> that are likely to appear next to the target word in a corpus. Considering one of the above example sentences and a window size of 2, we can highlight target and context words as follows:<br><br>
[doctors <font color='green'>claim the</font><font color='red'> air </font><font color='green'>you breathe</font> defines]. 
<br><br>Using a question mark to indicate the target variable of the model, we obtain:

[doctors *? ?* **air** *? ?* defines] in Skip-Gram versus  [doctors *claim the* **?** *you breathe* defines] in CBOW.
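
The difference between the two models shows up in the training examples they generate from the same sentence. The following sketch, with hypothetical helper names `skipgram_pairs` and `cbow_examples`, produces both kinds of examples for the sentence above and a window size of 2.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (target, context) pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples

tokens = "doctors claim the air you breathe defines".split()
sg = [p for p in skipgram_pairs(tokens) if p[0] == "air"]
cb = [ex for ex in cbow_examples(tokens) if ex[1] == "air"]
print(sg)
print(cb)
```

For the target word *air*, Skip-Gram yields four separate pairs, one per context word, while CBOW yields a single example with four input words.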


In this tutorial, we focus on Skip-Gram, which seems to be the preferred approach in practice. The code is based on a great tutorial by [Dipanjan Sarkar](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa), in which you can also find a Keras implementation of CBOW, if you are interested. However, as nice as the post is, the code is not compatible with the more recent version of Keras that you probably use (i.e., Keras 2). So we will take care of that issue in our implementation.  


Another potentially useful demonstration of how to implement the W2V model in Keras is available at adventuresinmachinelearning.com. The post provides a lot of nice explanations. However, it uses a shared embedding layer for target and context words, which seems to be wrong. There is some more debate concerning code quality and correctness on Reddit. In summary, have a look at adventuresinmachinelearning.com for some additional explanations, but bear in mind that the Keras code seems to be flawed.

Before moving on, let's remember the architecture of the skip-gram W2V model.

![sg](https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png)
<br>
Source: https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png

Given a sentence, or more precisely a sequence of text, we take a target word and predict a set of context words, that is, words which appear in a certain <font color="green">**context window**</font>  $[w_{-i},\ldots, w, \ldots, w_{+i}]$, where $i$ is the *window size* and the number of context words to consider is window size $\times 2$. 

An important caveat with the above picture is that a corresponding model would not scale. Remember that the output layer involves a high-dimensional softmax, which is too costly to compute for any reasonably sized corpus. Among the two options around this problem, *hierarchical softmax* and *negative sampling*, we will make use of the latter. So given a target word, our prediction task will be to classify whether another word is an actual context word for that target word, or a random word sampled from the corpus according to some probability distribution. This is a binary classification task. Thus, the output of our neural network is much cheaper to compute. Instead of a high-dimensional softmax we only need a simple logistic classifier. 
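
For reference, the negative-sampling objective of Mikolov et al. (2013) for a single pair of target word $w$ and context word $c$ can be written as follows. The notation is an assumption made here for illustration: $v_w$ denotes the target embedding of $w$, $u_c$ the context embedding of $c$, $\sigma$ the logistic sigmoid, and $n_1, \ldots, n_k$ are $k$ negative words drawn from a noise distribution $P_n$.

$$
\log \sigma\left(u_c^\top v_w\right) + \sum_{j=1}^{k} \mathbb{E}_{n_j \sim P_n}\left[\log \sigma\left(-u_{n_j}^\top v_w\right)\right]
$$

Maximizing this expression pushes the dot product of the true pair up and the dot products with the sampled negatives down. Each update involves only $k + 1$ dot products, whereas the softmax requires one dot product for every word in the vocabulary.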

### Building the vocabulary
Let's start with building our vocabulary. It is common practice not to train an embedding for every word but only for words that occur reasonably frequently. For rare words, training a good embedding is difficult. Remember how this issue motivated subword embeddings like Fasttext. In our example, we simply use the most frequent words from the review corpus and try to compute embeddings for these words. This is the point where our word_counter (see above) comes in handy.

In [None]:
# It is best to start from a clearly defined version of the data. The names of files may be different on your machine.
# So make sure to adjust the code if you need to. Then, uncomment the line that you need, the small sample or the full one.

# This should be the small sample of cleaned reviews that we saved above
with open('imdb_clean.pkl','rb') as path_name:
    cleaned_reviews = pickle.load(path_name)
    
# And this should be the full version of 50K cleaned reviews    
#with open('imdb_clean_full.pkl','rb') as path_name:
#    cleaned_reviews = pickle.load(path_name)

In [None]:
# This code is copied from above. We run it again to make sure that the 
# word_counter is adjusted to the right set of cleaned reviews (e.g., small or full)
word_counter = collections.Counter()
for r in cleaned_reviews:
    for w in r:
        word_counter.update({w: 1})

# Extract the n most common words from the corpus
vocab_size = 1000
vocab = word_counter.most_common(vocab_size)
vocab = [x[0] for x in vocab]
vocab[:10]

The next task is to build a dictionary. For Keras, we need to encode words as integers, which Keras will then interpret as indices into a one-hot vector of the size of the vocabulary. We build two dictionaries: one to map words to their code (i.e., a unique integer) and one to revert the mapping and decode words. 

In the below code, we implicitly exploit the fact that our vocabulary is ordered by frequency. The most frequent word receives the index 1, the second-most frequent word the index 2, and so forth. That will prove useful later when calculating frequency-based sampling weights. 

In [None]:
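# Indices 1, ..., vocab_size-1; index 0 is reserved for unknown words (see below).
# Note that zip() stops at the shorter input, so the least frequent word in vocab receives no index
# and the total number of indices, including 0, equals vocab_size.
idx = range(1, vocab_size)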
word2id = dict(zip(vocab, idx))
id2word = dict(zip(idx, vocab))

In [None]:
print('Vocabulary size: {}'.format(vocab_size))
print('Vocabulary Sample:', list(word2id.items())[:10])
print(list(word2id.items())[-10:])

You may have noted that we have so far left out the index 0. This index is commonly reserved for unknown words, which we map to a special token. Remember that our vocabulary is not very large when compared to the number of words that exist in a language (e.g., ~300K in English). So there will be a lot of unknown words in a text, and we deal with them by mapping every unknown word to the token `UNK`.

In [None]:
word2id["UNK"] = 0
id2word[0] = "UNK"

In [None]:
# Helper function to map unknown words to index 'unknown'
def encode_review(review, dictionary):
    output = []
    for word in review:
        if word not in dictionary.keys():
            output.append(dictionary["UNK"])
        else:
            output.append(dictionary[word])
    return output
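
The same encoding logic can be written more compactly with `dict.get`, which returns a default value for missing keys. The sketch below uses a made-up three-word vocabulary and hypothetical names (`toy_word2id`, `encode`) to illustrate the round trip from words to indices and back.

```python
# Assumption: a toy vocabulary that is already sorted by frequency
toy_vocab = ["movie", "film", "good"]

toy_word2id = {w: i for i, w in enumerate(toy_vocab, start=1)}  # indices start at 1
toy_word2id["UNK"] = 0                                          # index 0 for unknown words
toy_id2word = {i: w for w, i in toy_word2id.items()}

def encode(review, dictionary):
    """Map each word to its index; unknown words are mapped to the index of 'UNK'."""
    unk = dictionary["UNK"]
    return [dictionary.get(word, unk) for word in review]

coded = encode(["good", "movie", "tarantino"], toy_word2id)
decoded = [toy_id2word[i] for i in coded]
print(coded)
print(decoded)
```

Note that decoding does not recover unknown words: every word outside the vocabulary comes back as `UNK`.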

Now we are ready to turn our reviews into integer numbers, which is the format that Keras expects, while accounting for unknown words. 

In [None]:
#* Build the corpus for W2V by encoding the reviews
coded_review = []
for r in cleaned_reviews:
    coded_review.append(encode_review(r, word2id))

In [None]:
# Some testing
#print(cleaned_reviews[0])
#print(coded_review[0])
if len(coded_review[0]) == len(cleaned_reviews[0]):
    print("Looks good")
else:
    print("that can't be right")

### Generate training data
The training data for our skip-gram model consists of tuples (target, context) with a corresponding label (0/1), indicating whether the second word really appeared in the context of the target word or not. Fortunately, Keras has a ready-made function that we can use for that purpose. Specifically, the function `skipgrams` takes a sentence as input and outputs:

1. target words in combination with a context word
2. a label if the context word is from the actual context or randomly sampled.

In [None]:
# Import the module because the code below calls sequence.skipgrams() and sequence.make_sampling_table()
from keras.preprocessing import sequence

Let's first illustrate the function *skipgrams()* for a single short sentence.

In [None]:
# Produce a list of review lengths
r_lengths  = [len(r) for r in coded_review]
# Indices of reviews ordered by their length in ascending order
ix = np.argsort(r_lengths)

pic = ix[15]  # the shortest review is maybe too short, this is just an arbitrary selection of some short review
[(id2word[t], t) for t in coded_review[pic]]

In [None]:
example_sentence = coded_review[pic]

Remember that a window size of `i` translates to $[w_{-i},\ldots, w, \ldots, w_{+i}]$, so the number of context words to consider is window size $\times 2$. 

In [None]:
window_size = 2

In [None]:
pairs, labels = sequence.skipgrams(example_sentence, vocabulary_size=vocab_size, window_size=window_size)
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d}))\t -> {:d}".format( id2word[pairs[i][0]], pairs[i][0], id2word[pairs[i][1]], pairs[i][1], labels[i]))

In the above demo, the negative examples were picked uniformly at random from the vocabulary. According to empirical evidence, W2V training should also take word frequency into account. Otherwise, we might end up focusing too much on frequent words. Keras provides a utility function, *make_sampling_table*, to calculate a sampling probability for each word based on its frequency rank. Note that *skipgrams()* uses this table to subsample the target words, so that very frequent words generate fewer training pairs; the random context words of the negative examples are still drawn uniformly. Details are available in the [Keras documentation](https://keras.io/preprocessing/sequence/). The `sampling_table` is a list of sampling probabilities, one for each word. 

In [None]:
samp_tab = sequence.make_sampling_table(vocab_size)
samp_tab
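
To demystify these numbers, the following plain-Python sketch reproduces the calculation as I understand it from the Keras source: the frequency of a word is estimated from its rank under a Zipf-law assumption and then plugged into the subsampling formula of W2V. The function name `toy_sampling_table`, the rounded constant, and the default `sampling_factor=1e-5` are assumptions of this sketch, so expect small deviations from the Keras output.

```python
import math

def toy_sampling_table(size, sampling_factor=1e-5):
    """Rank-based sampling probabilities under a Zipf assumption (sketch, not the Keras code)."""
    gamma = 0.577  # Euler-Mascheroni constant (rounded)
    table = []
    for rank in range(size):
        r = max(rank, 1)  # rank 0 is treated like rank 1 to avoid log(0)
        # Approximation of 1 / (relative frequency) of the word with rank r
        inv_freq = r * (math.log(r) + gamma) + 0.5 - 1.0 / (12.0 * r)
        f = sampling_factor * inv_freq
        # Subsampling formula: probability of keeping the word, capped at 1
        table.append(min(1.0, f / math.sqrt(f)))
    return table

toy_table = toy_sampling_table(1000)
print(toy_table[:3])
print(toy_table[-3:])
```

The probabilities grow with the rank, so rare words are almost always kept whereas very frequent words are dropped most of the time.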

Note the increasing magnitude of the sampling weights. The table is indexed by frequency rank, so using it requires that the word indices are ordered by frequency, with index 1 denoting the most frequent word. Remember that the idea is to not focus too much on the frequent words; in our case words like 'movie', 'film', and 'like'. Therefore, less frequent words receive a higher sampling probability than frequent words. 

When building our vocabulary above, we used the *most_common()* function. Therefore, the words in our vocabulary are ordered by decreasing frequency. We will make use of our sampling table when generating the training set for our W2V model.  

In [None]:
# CAUTION: yet another operation that is not cheap when using all data
start = time.time()
skip_grams = [sequence.skipgrams(r, vocabulary_size=vocab_size, window_size=window_size, sampling_table=samp_tab) for r in coded_review]
end = time.time()
print('Generated {} skip-grams in {} sec.'.format(len(skip_grams), end-start))
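
If you wonder what happens inside *skipgrams()*, the following simplified sketch may help. It is not the Keras implementation: it omits the sampling table and the shuffling of the output, and the name `toy_skipgrams` is hypothetical. It does mimic two details of the Keras function, namely that index 0 is skipped and that negative context words are drawn uniformly from the indices 1 to `vocabulary_size - 1`.

```python
import random

def toy_skipgrams(sequence, vocabulary_size, window_size=2, seed=0):
    """Return (pairs, labels): positive pairs from the window plus as many random negatives."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for i, target in enumerate(sequence):
        if target == 0:  # index 0 (unknown word) is never used as target
            continue
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j != i and sequence[j] != 0:
                pairs.append((target, sequence[j]))
                labels.append(1)
    # One negative example per positive example: keep the target, draw a random 'context' word
    for k in range(len(labels)):
        pairs.append((pairs[k][0], rng.randint(1, vocabulary_size - 1)))
        labels.append(0)
    return pairs, labels

toy_pairs, toy_labels = toy_skipgrams([5, 3, 0, 8], vocabulary_size=10, window_size=1)
for pair, label in zip(toy_pairs, toy_labels):
    print(pair, label)
```

Because the negatives are drawn at random, a negative example can coincide with a true context word by chance. With a large vocabulary, this happens rarely enough to be ignored.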

### Building the neural network
We are ready to design our NN architecture using Keras. We feed the network with pairs of a target word and an actual/fake context word. Each word is put through an embedding layer. Remember that W2V trains two embeddings per word, one for when the word is the target word and one for when it appears in the context of some other word. So using two embedding layers is important.

Having obtained word embeddings for the target and context word, we pass these embeddings to a merge layer in which we compute the dot product of these two vectors. We can think of the dot product as an unnormalized cosine similarity between the two embedding vectors. Put differently, we obtain a similarity score. We want that score to be large when the inputted 'context' word actually appeared in the context of the target word, and small otherwise. Hence, we forward the similarity score to a sigmoid activation, which turns it into the probability of the 'context' word being an actual context word. We then compare this probability, the output of our neural network, to the actual label, which we obtained above from *skipgrams()*. Enter back-propagation. 

So far so good, but there is one issue. Our network is a little more advanced than those we have built so far. There were also some changes when moving to Keras 2, which affect this example. Long story short, we cannot use the nice and simple sequential API anymore and will have to use the functional API instead. For this reason, the code will look a little different from what you are used to. 

In [None]:
import keras
from keras.models import Model
from keras.preprocessing import sequence
from keras.layers import Embedding, Input, Reshape, Dot, Activation

In [None]:
# Embedding dimension
emd_dim = 25  # relatively small but we do not use much data

In [None]:
# Set up embedding layers for the target and the context word:
embedding_target = Embedding(vocab_size, emd_dim, input_length=1, name='embedding_target')
embedding_context = Embedding(vocab_size, emd_dim, input_length=1, name='embedding_context')

In [None]:
# Build the model architecture using the functional API

# Take a single target word
input_target = Input((1,))
target = embedding_target(input_target)
target = Reshape((emd_dim, 1))(target)

# Take another word either from the context or a random word from vocabulary
input_context = Input((1,))
context = embedding_context(input_context)
context = Reshape((emd_dim, 1))(context)

# Calculate the dot product as an unnormalized cosine similarity
dot_product = Dot(axes=1, normalize=False)([target, context])
dot_product = Reshape((1,))(dot_product)

# Predict if the words are in the same context -> Binary yes/no
output = Activation(activation='sigmoid')(dot_product)

# Compile the model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

See how the model is not much of a neural network? The only trainable parameters are the embeddings, which are then dot-multiplied. We thus have two hidden layers side-by-side rather than one after the other and no non-linear activation of the hidden layers! This is very similar to matrix factorization and you can use the same architecture to build a collaborative filter on users (one embedding matrix) and items (one embedding matrix). 

In [None]:
model.summary()
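
To see what the network computes for a single training pair, the following plain-Python sketch performs the forward pass (dot product, sigmoid, binary cross-entropy) and one gradient step by hand. It uses plain gradient descent instead of Adam, random four-dimensional toy embeddings, and a made-up learning rate, so it illustrates the mechanics and does not reproduce what Keras does internally.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(target, context):
    """Probability that `context` is a true context word: sigmoid of the dot product."""
    return sigmoid(sum(t * c for t, c in zip(target, context)))

def bce(p, y):
    """Binary cross-entropy for one example."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

rng = random.Random(0)
dim = 4                                                     # assumption: toy embedding dimension
target_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]   # row of the target embedding matrix
context_vec = [rng.uniform(-0.5, 0.5) for _ in range(dim)]  # row of the context embedding matrix
label = 1.0                                                 # the pair is a true (target, context) pair
lr = 0.5                                                    # assumption: learning rate

p = predict(target_vec, context_vec)
loss_before = bce(p, label)

# The gradient of the loss w.r.t. the dot product is (p - label); the chain rule gives the rest
grad = p - label
new_target = [t - lr * grad * c for t, c in zip(target_vec, context_vec)]
new_context = [c - lr * grad * t for t, c in zip(target_vec, context_vec)]

loss_after = bce(predict(new_target, new_context), label)
print(loss_before, loss_after)
```

For a true pair, the update moves the two embeddings toward each other, which increases their dot product and lowers the loss. For a negative pair (label 0), the sign of the gradient flips and the embeddings are pushed apart.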

Here is a maybe more intuitive visualization of the model thanks to [Dipanjan Sarkar.](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)

<img src="https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png">
<br>
Image source: 
https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png

### Training loop

We train our model review by review, updating the model after every review (i.e., batch). Implementing this approach is not possible when using the standard Keras training loop. Therefore, we use the function *train_on_batch*, which gives us more control over the training.

In [None]:
# Number of epochs
nb_epoch = 5

for e in range(nb_epoch):
    print('-'*40)
    print('Epoch {}'.format(e))
    print('-'*40)
    start = time.time()

    losses = []

    # Each review is one batch: couples holds the word pairs, labels the 0/1 targets
    for couples, labels in skip_grams:
        if couples:  # skip reviews that did not yield any skip-grams
            X = np.array(couples, dtype="int32")
            y = np.array(labels, dtype="int32")
            loss = model.train_on_batch([X[:,0],X[:,1]], y)
            losses.append(loss)
    end = time.time()
    print(f'Average loss over last 1000 batches: {np.mean(losses[-1000:])}')
    print('Epoch {} took {:.2f} min.'.format(e, (end-start)/60))

### Extracting the weights
We can extract the word embeddings from the corresponding layer of our model. Converting the embeddings to a data frame facilitates a quick look.

In [None]:
word_embeddings = model.get_layer(name="embedding_target").get_weights()[0]
print(word_embeddings.shape)
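# Row i of the embedding matrix belongs to the word with index i (row 0 is 'UNK'),
# so we build the row labels in index order.
w2v_df = pd.DataFrame(word_embeddings, index=[id2word[i] for i in range(word_embeddings.shape[0])])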
w2v_df.head()

We can try to reproduce some of the functionality demonstrated above for the Gensim implementation. Of course, we don't go all the way. Still, doing a little similarity calculation is not too difficult. We use some scikit-learn functionality to create a matrix of pairwise distances between words. We can then query the most similar words to some seed-words.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(word_embeddings)
print(distance_matrix.shape)

# Note that this code only works if all search terms are part of the vocabulary,
# which is unlikely when training on a small corpus with a small vocabulary.
# Row i of the distance matrix belongs to the word with index i. Position 0 of the
# sorted neighbors is the search term itself, so we skip it.
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]].argsort()[1:6]] 
                   for search_term in ['tarantino', 'cruise', 'willis', 'lawrence', 'bullock']}

similar_words
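
The lookup logic is easy to get wrong because word indices and matrix rows must line up. The following sketch replays it in plain Python on a made-up embedding matrix with five rows, where row `i` holds the vector of the word with index `i` and row 0 belongs to `UNK`. All names and numbers are assumptions for illustration.

```python
import math

# Assumption: toy dictionaries and hand-made 2-d embeddings
toy_id2word = {0: "UNK", 1: "movie", 2: "film", 3: "bad", 4: "awful"}
toy_word2id = {w: i for i, w in toy_id2word.items()}
toy_embeddings = [
    [0.0, 0.0],    # row 0 -> UNK
    [1.0, 1.0],    # row 1 -> movie
    [1.1, 0.9],    # row 2 -> film
    [-1.0, 0.5],   # row 3 -> bad
    [-0.9, 0.6],   # row 4 -> awful
]

def nearest(word, k=1):
    """Return the k words with the smallest Euclidean distance to `word`."""
    query_id = toy_word2id[word]
    query = toy_embeddings[query_id]
    distances = [
        (math.dist(query, vec), idx)
        for idx, vec in enumerate(toy_embeddings)
        if idx not in (query_id, 0)   # exclude the word itself and the UNK row
    ]
    return [toy_id2word[idx] for _, idx in sorted(distances)[:k]]

print(nearest("movie"))
print(nearest("bad", k=2))
```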

Ok, we might want to continue training our embeddings.   