[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/demos/nlp/word-2-vec.ipynb)

# Word Embeddings and Word-to-Vec (W2V)
This demo notebook revisits the lecture on word embeddings and Google's word-to-vec algorithm. W2V, like backpropagation, is a very popular algorithm that enjoys much coverage in various blogs, YouTube channels, etc. In case you would like some additional material to read up on W2V, here are some useful resources:  
- [the original W2V paper](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
- the beautiful ["Illustrated Word2vec" by Jay Alammar](https://jalammar.github.io/illustrated-word2vec/)
- the [W2V Tensorflow tutorial](https://www.tensorflow.org/tutorials/text/word2vec)

Last but not least, our main textbook features excellent chapters on word embeddings, W2V, and related algorithms including GloVe and Fasttext. You can find those parts in [Section 14 of Dive into Deep Learning](http://d2l.ai/chapter_natural-language-processing-pretraining/index.html).

Let's get started with our ADAMS demo.

## Training word-to-vec embeddings
When it comes to embeddings, the most common use case is to **download pre-trained embeddings** and employ these for some downstream tasks (with or without fine-tuning). The Keras *embedding layer* supports that use case very well, as we will see in a future demo on sentiment analysis. Since this demo aims at deepening our understanding of W2V, we focus on a different use case and demonstrate the training of **custom word embeddings** using our IMDB data. 

You could argue that the IMDB forum exhibits a specific type of speech or jargon, and that this justifies training word embeddings for this specific corpus. In practice, using pre-trained embeddings will almost surely give better results than training embeddings from zero. However, without going into too much detail of the pros and cons of pre-training your own embeddings versus employing pre-trained embeddings, perhaps with some finetuning, the point of this section is simply to showcase how you could train from scratch if you want to. To that end, we will use a library called `Gensim`. 

`Gensim` is a popular library for text processing. Although arguably geared even more toward topic modeling, it also offers implementations of several algorithms to learn word embeddings, including *W2V* and *Fasttext*, and it can work with pre-trained vectors such as *GloVe*. We demonstrate training W2V embeddings using our cleaned IMDB movie review data set. Before moving on, make sure you have installed `Gensim`. 

**Credits and disclaimers**: many of the examples you are going to see in this section have been inspired by this very nice [Kaggle post](https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook).

In [98]:
# Create a global variable to indicate whether the notebook is run in Colab
import sys
import numpy as np
import pandas as pd

IN_COLAB = 'google.colab' in sys.modules

# Configure variables pointing to directories and stored files 
if IN_COLAB:
    # Mount Google-Drive
    from google.colab import drive
    drive.mount('/content/drive')
    DATA_DIR = '/content/drive/My Drive/'  # adjust to Google drive folder with the data if applicable
else:
    DATA_DIR = './' # adjust to the directory where data is stored on your machine (if running the notebook locally)

sys.path.append(DATA_DIR)

CLEAN_REVIEW = DATA_DIR + 'imdb_clean_full_v2.pkl'   # List with tokenized reviews after standard NLP preparation
IMBD_EMBEDDINGS = DATA_DIR + 'w2v_imdb_full_d100_e500.model'

### Recap W2V
Let's quickly revisit the principles of W2V. Please consult the paper of [Mikolov et al. (2013)](https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf) for a detailed description.

W2V establishes a word's meaning by the words that frequently appear close-by (distributional semantics). More specifically, the context of a word consists of the words that appear next to it within a pre-defined window (let's say 5 words).

 - the quality of *air* in mainland China has been decreasing since..
 - doctors claim the *air* you breathe defines the overall wellbeing...
 - the currents of hot *air* have been bursting from underground
 - the mountain *air* was crystal clean and filled with ..
 - in case of *air* supply shortages, the submarine will..

Taking the word *air* as our **target word**, the words around *air*, called context words, define the **meaning** of the word *air* in W2V.

![w2vprocess](w2v.jpg)
<br>
inspired by https://www.youtube.com/watch?v=BD8wPsr_DAI
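
To make the notion of a context window concrete, here is a minimal sketch in plain Python. The helper `context_words` is hypothetical (it is not part of Gensim or Keras) and simply collects the tokens that lie within `window` positions to the left and right of a target word.

```python
def context_words(tokens, position, window=2):
    """Return the tokens within `window` positions left and right of tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return left + right

# One of the example sentences from above, shortened and split on white space
sentence = "the mountain air was crystal clean".split()
target_position = sentence.index("air")
print(sentence[target_position], "->", context_words(sentence, target_position))
```

Near the beginning or end of a sentence the window is simply cut off, so the first and last words have fewer context words.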

### Loading the data
We load the data frame with the original and cleaned reviews. The original version does not matter for this session, so we delete it to save memory. 

In [47]:
import pickle
with open(CLEAN_REVIEW,'rb') as path_name:
    df = pickle.load(path_name)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review        50000 non-null  object
 1   sentiment     50000 non-null  object
 2   review_clean  50000 non-null  object
dtypes: object(3)
memory usage: 1.1+ MB


In [48]:
df.drop(labels="review", axis=1, inplace=True)
df.head()

Unnamed: 0,sentiment,review_clean
0,positive,one reviewer mention watch oz episode hooked r...
1,positive,wonderful little production film technique una...
2,positive,thought wonderful way spend time hot summer we...
3,negative,basically family little boy jake think zombie ...
4,positive,petter love time money visually stun film watc...


### The Gensim W2V model
Training word embeddings using `Gensim` is very easy and just a matter of calling a function. The reason it takes so little code is that we have already cleaned our data and have it available as a collection of texts, a format that `Gensim` supports. However, note that, depending on your data, the code may take quite a while to run. Word embeddings trained on the full 50K data set for 500 epochs are available in our course folder.

In [49]:
# we need a bit of infrastructure to run the algorithm
from gensim import utils

class CleanReviews:
    """An iterator that yields sentences (lists of str)."""
    
    def __init__(self, reviews):
        self.reviews = reviews
        
    def __iter__(self):
        for line in self.reviews:
            yield utils.simple_preprocess(line)

In [50]:
# CAUTION: Running the code might take a while
from gensim.models import Word2Vec    

emb_dim = 10  # embedding dimension, we use 10 for a quick demo of the code
reviews = CleanReviews(df.review_clean)
# Train a Word2Vec model
model = Word2Vec(sentences=reviews, 
                 min_count=5,  # frequency threshold: words that occur fewer than min_count times are ignored
                 window=5,     # the size of context window
                 epochs=5,     # epochs is set to 5 to decrease runtime; it would be much larger in practice
                 vector_size=emb_dim,  # size of embedding
                 workers=2)    # for parallel computing

Make sure to check out the docstring of the `Word2Vec` class to discover how word vectors are trained by default. Importantly, the argument `sg` lets you choose between *skip-gram* and *cbow*. Other concepts we discussed in the lecture include accelerating computations using *hierarchical softmax* and *negative sampling*. Gensim features these through its arguments `hs` and `negative`, respectively. Obviously, tons of other functionality is available, so make sure to study the [documentation](https://radimrehurek.com/gensim/models/word2vec.html?highlight=word2vec) if you plan to use the Gensim library for serious projects. Also, just to remind you, the [Kaggle post](https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook), which inspired this notebook, has a slightly more elaborate demo of how to set up training and, specifically, how you can break down the individual steps of W2V training into smaller pieces.

The trained word vectors are accessible through the field `wv` of the model class.

In [56]:
# what is the word vector of the words good and bad?
print(model.wv['good'])

[ 1.9756973  1.1006564  2.1407259 -2.7641182  4.1969585 -3.8379216
 -1.5564842  1.7839793 -1.1708127  1.1585494]


In [55]:
print(model.wv['bad'])

[ 0.03799623  1.7121277   3.5084138  -3.0625455   2.6999998  -5.3605113
 -1.6722724   3.313875   -0.6442533  -1.1794131 ]


In [51]:
len(model.wv.key_to_index)  # how many word vectors have been trained

30201

We will continue playing with word vectors shortly, but let us first discuss input and output handling with Gensim.

### Input / output handling
Gensim supports saving and loading of trained embeddings in different formats. This makes a lot of sense since training can take a long time. For example, you could train for a couple of epochs, store your results on disk, and continue training later. Note that continuing training requires the full model, which you can store with `model.save()`; the word vectors alone do not suffice for that. Here is how we can store our trained word vectors.

In [57]:
# Save trained word vectors to disk
file="w2v_tmp.model"
save_as_bin = False
model.wv.save_word2vec_format(file, binary=save_as_bin)  # set binary to True to save disk space; false facilitates inspecting the embeddings in a text editor

For ADAMS, you can obtain word vectors trained on the IMDB corpus for 500 epochs from our [GitHub repository](https://github.com/Humboldt-WI/adams/tree/master/demos/nlp). These vectors are far from comparable to real pre-trained W2V embeddings. On the other hand, their training took a couple of hours, so the vectors should carry a bit more information than those from the above training code with a small embedding dimension of ten and only five epochs. Let's load these word vectors from disk.

In [58]:
# Load model from disk
from gensim.models import KeyedVectors
w2v = KeyedVectors.load_word2vec_format(IMBD_EMBEDDINGS, binary=False)

Remember that you can also access the `KeyedVectors`, which we load with the previous statement, directly from a trained model object via the field `wv`. Thus, if you would like to run the following demos with the word vectors you trained yourself, simply run the following command. One would expect the demos to give nicer results with the pre-trained embeddings from our repo, but you are welcome to try this out yourself. 

In [65]:
# w2v = model.wv  # continue with the W2V embeddings trained above 

### Playing with embeddings
Again, the embeddings loaded above are far from solid but should give us somewhat meaningful results in algebraic comparisons. Let's see whether this works out. 

#### Which word is most similar to another word?

In [60]:
w2v.most_similar(positive=['movie'])

[('least', 0.9374186396598816),
 ('probably', 0.9304987192153931),
 ('still', 0.9181867837905884),
 ('ever', 0.9113300442695618),
 ('definitely', 0.9088698625564575),
 ('even', 0.9075060486793518),
 ('honestly', 0.9061371088027954),
 ('lately', 0.8953308463096619),
 ('actually', 0.8867319226264954),
 ('watch', 0.8728312253952026)]

#### How similar are two words?

In [61]:
w2v.similarity('good', 'great')

0.86766267
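
The similarity score reported above is the cosine similarity of the two word vectors. The following sketch computes it by hand in plain Python. The three vectors are made-up toy values for illustration; they are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings" (assumed values, for illustration only)
good = [1.0, 2.0, 0.5]
great = [0.9, 2.1, 0.4]
bad = [-1.0, 0.2, 2.0]

print(cosine_similarity(good, great))  # close to 1: vectors point in a similar direction
print(cosine_similarity(good, bad))    # much smaller: vectors point in different directions
```

Because the cosine only depends on the direction of the vectors, scaling an embedding by a constant does not change its similarity to other words.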

In [62]:
print('How similar is Tarantino to Spielberg: {}'.format(w2v.similarity('tarantino', 'spielberg')))
print('How similar is Lucas to Spielberg: {}'.format(w2v.similarity('lucas', 'spielberg')))

print('How similar is Paltrow to Bullock: {}'.format(w2v.similarity('paltrow', 'bullock')))
print('How similar is Paltrow to Alba: {}'.format(w2v.similarity('paltrow', 'alba')))

print('How similar is Cruise to Depp: {}'.format(w2v.similarity('cruise', 'depp')))
print('How similar is Cruise to Willis: {}'.format(w2v.similarity('cruise', 'willis')))


How similar is Tarantino to Spielberg: 0.8217860460281372
How similar is Lucas to Spielberg: 0.7440744042396545
How similar is Paltrow to Bullock: 0.7634718418121338
How similar is Paltrow to Alba: 0.8613905310630798
How similar is Cruise to Depp: 0.5666610598564148
How similar is Cruise to Willis: 0.6183159947395325


#### Which word does not fit in?

In [63]:
print(w2v.doesnt_match(['cool', 'great', 'lovely', 'weak']))
print(w2v.doesnt_match(['movie', 'film', 'good']))

weak
good


#### A is to B as C is to ? 

In [64]:
w2v.most_similar(positive=['spielberg', 'woman'], negative=['man'], topn=5)

[('deserves', 0.9519431591033936),
 ('deserve', 0.8851733803749084),
 ('thanks', 0.8781426548957825),
 ('surpass', 0.867881178855896),
 ('qualify', 0.8677225708961487)]
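
To see what such an analogy query computes, here is a simplified sketch in plain Python. To our understanding, `most_similar` combines the unit-normalized positive and negative vectors into one query vector and ranks all remaining words by cosine similarity; the function `analogy` below is a hypothetical re-implementation of that idea, and the 2-dimensional vectors are hand-made so that the expected answer is obvious.

```python
import math

def unit(v):
    """Scale a vector to length one."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # the query words themselves are not candidates
    scores = {w: sum(q * x for q, x in zip(query, unit(v)))
              for w, v in vectors.items() if w not in exclude}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]

# Hand-made toy vectors: first axis ~ "royalty", second axis ~ "gender"
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "movie": [-1.0, 0.1],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

With embeddings trained on a small corpus, such as ours, the geometry is far less clean than in this toy example, which explains the rather disappointing result above.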

### Phrase detection
W2V trains one embedding per word. The model is agnostic of common phrases such as 'New York'. It would train one embedding for *new* and another for *york*, provided both words are part of the vocabulary. You can get better embeddings by adding common phrases to the vocabulary. W2V will then train individual embeddings for these phrases. Gensim also comes with a phrase detection model, which allows you to handle bigrams, trigrams, and the like. We will not retrain our W2V model but sketch how you can use Gensim to get these common phrases. You could then consider adding (some of) them to your vocabulary to enhance the model.  

In [30]:
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS
# Train a bigram model
bigram_model = Phrases(sentences=reviews,min_count=10 , threshold=1, connector_words=ENGLISH_CONNECTOR_WORDS) 

After training, we can take text and put it through the bigram model. The model will then alter the text so as to introduce bigrams. Here is an example:

In [38]:
# to process text and replace phrases, we use our phrase detector as follows
bigram_model['I', 'like', 'this', 'movie']  # no phrases to be detected here

['I', 'like', 'this', 'movie']

In [45]:
bigram_model['sex', 'and', 'the', 'city', 'is', 'all', 'about', 'new', 'york']  # but we would expect city names to be detected

['sex', 'and', 'the', 'city', 'is', 'all', 'about', 'new_york']

We can also make use of a `Counter` to examine the most common bigrams in the corpus, as follows:

In [42]:
import collections
bigram_counter = collections.Counter()
for key in bigram_model.vocab.keys():
    if '_' in key:  # bigram candidates are stored with their two tokens joined by an underscore
        bigram_counter[key] += bigram_model.vocab[key]

In [43]:
bigram_counter.most_common(25)

[('look_like', 3715),
 ('watch_movie', 3121),
 ('ever_see', 2973),
 ('see_movie', 2752),
 ('bad_movie', 2727),
 ('make_movie', 2392),
 ('year_old', 2389),
 ('film_make', 2369),
 ('special_effect', 2308),
 ('movie_make', 2134),
 ('one_best', 2030),
 ('even_though', 1999),
 ('movie_ever', 1987),
 ('movie_like', 1921),
 ('low_budget', 1892),
 ('make_film', 1882),
 ('see_film', 1859),
 ('main_character', 1838),
 ('waste_time', 1793),
 ('watch_film', 1664),
 ('good_movie', 1634),
 ('horror_movie', 1611),
 ('much_well', 1532),
 ('want_see', 1494),
 ('seem_like', 1473)]

The above bigrams might be frequent. However, you would not consider training individual embeddings for phrases such as *look_like* or *waste_time*. This shows how proper phrase detection in the scope of W2V is nontrivial and would require more work before we can hope to get good results.     
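
One reason why raw frequency is a poor criterion is that phrase detectors typically score a bigram relative to the frequency of its parts. The sketch below implements the scoring rule of Mikolov et al. (2013), which, to our understanding, is also the default scorer behind Gensim's `Phrases`; please verify this against the Gensim documentation. The function name `original_score` and all counts are invented for illustration.

```python
def original_score(count_a, count_b, count_ab, vocab_len, min_count):
    """Bigram score in the style of Mikolov et al. (2013).

    A bigram scores high if it occurs often relative to how often
    its two words occur individually.
    """
    return (count_ab - min_count) / (count_a * count_b) * vocab_len

vocab_len = 100_000  # hypothetical vocabulary size
min_count = 10

# Hypothetical counts: a frequent bigram made of two very frequent words ...
score_frequent = original_score(30_000, 40_000, 3_715, vocab_len, min_count)
# ... versus a rarer bigram whose words almost always occur together
score_collocation = original_score(5_000, 1_500, 1_400, vocab_len, min_count)

print(score_frequent, score_collocation)
```

With a threshold of, say, 10, only the second bigram would be accepted as a phrase, although the first one is more frequent. Choosing a low threshold, as we did above with `threshold=1`, lets many more word pairs pass.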

### Plotting word vectors
It is fairly easy to create a visualization of the trained word vectors. You can find an example of how to do this in the [Kaggle kernel](https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial) mentioned above. Needless to say, many alternative demos are available online; here is just [one example](https://www.kaggle.com/jeffd23/visualizing-word-vectors-with-t-sne). However, to get meaningful results we would need to prepare the data more carefully by, for example, removing too frequent words and too infrequent words. We would also finetune the training, and, overall, invest a lot more work to craft our word embeddings. In practice, we would typically not train our own embeddings from scratch. Instead, we would download pre-trained embeddings, which are available in many flavors (multiple languages, trained on different corpora with different jargon, etc.), and use these in our NLP application. We could also finetune the pre-trained embeddings using our own text data. We will showcase a corresponding approach in a later notebook on sentiment analysis. 

## Manual Word2Vec using Keras

In the following, we will re-implement W2V in Keras. Remember that W2V proposes two models for learning word vectors, continuous-bag-of-words (CBOW) and Skip-Gram. In a nutshell, CBOW predicts a central target word from surrounding context words, while Skip-Gram takes the opposite approach: given a <font color='red'>target word</font>, it predicts the <font color='green'>context words</font> that have a high chance of appearing next to the target word in a corpus. Considering one of the above example sentences and a window size of 2, we can highlight target and context words as follows:<br><br>
[doctors <font color='green'>claim the</font><font color='red'> air </font><font color='green'>you breathe</font> defines]. 
<br><br>Using a question mark to indicate the target variable of the model, we obtain:

[doctors *? ?* **air** *? ?* defines] in Skip-Gram versus [doctors *claim the* **?** *you breathe* defines] in CBOW.


In this section, we focus on Skip-Gram, which seems to be the preferred approach in practice. The code is based on a nice tutorial by [Dipanjan Sarkar](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa), in which you can also find a Keras implementation of CBOW, if you are interested. However, as nice as the post is, the code is not compatible with the recent version of Keras, which is the one you probably use (i.e., Keras 2). So we will take care of that issue in our implementation.  

Before moving on, let's remember the architecture of the skip-gram W2V model.

![sg](https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png)
<br>
Source: https://upload.wikimedia.org/wikipedia/commons/9/95/Skip-gram.png

Given a sentence, or rather a sequence of text, we take a target word and predict a set of context words, that is, words that appear in a certain <font color="green">**context window**</font>  $[w_{-i},\ldots, w, \ldots, w_{+i}]$, where $i$ is the *window size* and the number of context words to consider is window size $\times 2$. 

An important caveat with the above picture is that a corresponding model would not scale. Remember that the output layer involves a high-dimensional softmax, which is too costly to compute for any reasonably sized corpus. Among the two options around this problem, *hierarchical softmax* and *negative sampling*, we will make use of the latter. So given a target word, our prediction task will be to classify whether another word is an actual context word for that target word, or a random word sampled from the corpus according to some probability distribution. This is a binary classification task. Thus, the output of our neural network is much cheaper to compute. Instead of a high-dimensional softmax, we only need a simple logistic classifier. 
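
For reference, the negative-sampling objective of Mikolov et al. (2013) for one target word $w$ with an observed context word $c$ can be written as follows. The symbols are our notation for this sketch: $v_w$ and $u_c$ denote the target and context embeddings, $\sigma$ is the logistic function, $k$ is the number of negative samples, and $P_n$ is the noise distribution from which negative words are drawn.

$$
\log \sigma\left(u_c^\top v_w\right) + \sum_{j=1}^{k} \mathbb{E}_{n_j \sim P_n}\left[\log \sigma\left(-u_{n_j}^\top v_w\right)\right]
$$

Maximizing the first term raises the score of the true pair, while the second term lowers the score of the $k$ sampled pairs. Each term is just the log-likelihood of a logistic classifier, which is why no softmax over the vocabulary is needed.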

### Building the vocabulary
Let's start with building our vocabulary. It is common practice not to train an embedding for every word but only for words that occur reasonably frequently. For rare words, training a good embedding is difficult. Remember how this issue motivated subword embeddings like Fasttext. In our example, we simply use the most frequent words from the review corpus and try to compute embeddings for these words. This is the point where a word counter, as used in the NLP foundations notebook, comes in handy.

In [75]:
# We use our cleaned IMDB data set for the demo
with open(CLEAN_REVIEW,'rb') as path_name:
    df_imdb = pickle.load(path_name)
    

In [76]:
# This code is copied from the NLP foundations notebook. 
word_counter = collections.Counter()
for r in df_imdb["review_clean"]:
    for w in r.split():  # this is like tokenizing using the white space
        word_counter.update({w: 1})

# Extract the n most common words from the corpus
vocab_size = 1000
vocab = word_counter.most_common(vocab_size)
vocab = [x[0] for x in vocab]
vocab[:10]

['movie', 'film', 'one', 'make', 'like', 'see', 'get', 'well', 'time', 'good']

The next task is to build a dictionary. For Keras, we need to encode words as integers, which Keras will then interpret as indices into a one-hot vector of the size of the vocabulary. We build two dictionaries: one to map words to their code (i.e., unique integer) and one to revert the mapping and decode words. 

In the below code, we implicitly exploit the fact that our vocabulary is ordered by frequency. The most frequent word receives the index 1, the second-most frequent word the index 2, and so forth. That will prove useful later when calculating sampling weights. 

In [77]:
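# range(1, vocab_size) holds vocab_size - 1 indices, so zip keeps the 999 most frequent words;
# together with the UNK token at index 0 (added below) this gives vocab_size entries in total
idx = range(1, vocab_size)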
word2id = dict(zip(vocab, idx))
id2word = dict(zip(idx, vocab))

In [78]:
print('Vocabulary size: {}'.format(vocab_size))
print('Vocabulary Sample:', list(word2id.items())[:10])
print(list(word2id.items())[-10:])

Vocabulary size: 1000
Vocabulary Sample: [('movie', 1), ('film', 2), ('one', 3), ('make', 4), ('like', 5), ('see', 6), ('get', 7), ('well', 8), ('time', 9), ('good', 10)]
[('mood', 990), ('regard', 991), ('jane', 992), ('garbage', 993), ('reference', 994), ('barely', 995), ('haunt', 996), ('super', 997), ('humour', 998), ('impressive', 999)]


You may have noted that we have so far left out the index 0. This index is commonly reserved for unknown words, which we map to a special token. Remember that our vocabulary is not very large when compared to the number of words that exist in a language (e.g., ~300K in English). So when processing texts, we will run into a lot of unknown words. We deal with these words by mapping them to the token `UNK`. This way, we learn one embedding for all unknown words.

In [79]:
word2id["UNK"] = 0
id2word[0] = "UNK"

In [80]:
# Helper function to map unknown words to index 'unknown'
def encode_review(review, dictionary):
    output = []
    for word in review:
        if word not in dictionary.keys():
            output.append(dictionary["UNK"])
        else:
            output.append(dictionary[word])
    return output

Now we are ready to turn our reviews into integer numbers, which is the format that Keras expects, while accounting for unknown words. 

In [120]:
# Build the corpus for W2V by encoding the reviews
coded_review = []
for r in df_imdb["review_clean"]:
    coded_review.append(encode_review(r.split(), word2id))

In [126]:
# Some testing
id_demo_review = 8  # one random review
demo_review = df_imdb["review_clean"][id_demo_review]
print(demo_review)  # plain text after cleaning

# One-hot-coding representation in which integer numbers represent the
# index of the single non-zero element in a one-hot-vector of dimensionality 
# vocab_size
print(coded_review[id_demo_review])

encourage positive comment film look forward watch film bad mistake see film truly one bad awful almost every way edit pace storyline soundtrack song lame country tune played less four time film look cheap nasty boring extreme rarely happy see end credit film thing prevents give score harvey keitel far best performance least seem make bit effort one keitel obsessive
[0, 928, 306, 2, 22, 687, 11, 2, 14, 840, 6, 2, 276, 3, 14, 280, 123, 85, 35, 432, 418, 614, 583, 257, 677, 442, 0, 162, 251, 519, 9, 2, 22, 556, 0, 267, 0, 0, 413, 6, 25, 379, 2, 37, 0, 32, 385, 0, 0, 134, 49, 65, 129, 40, 4, 107, 453, 3, 0, 0]


In [125]:
if len(coded_review[id_demo_review]) == len(demo_review.split()):
    print("Looks good")
else:
    print("that can't be right")

Looks good


### Generate training data
The training data for our skip-gram model consists of tuples (target, context) with a corresponding label (0/1), indicating whether the second word really appeared in the context of the target word or not. Fortunately, Keras has a ready-made function that we can use for that purpose. Specifically, the function `skipgrams` takes a sentence as input and outputs:

1. target words in combination with a context word
2. a label if the context word is from the actual context or randomly sampled.

In [96]:
from keras.preprocessing.sequence import skipgrams

Let's first illustrate the function *skipgrams()* for a single short sentence.

In [133]:
# Produce a list of review lengths
r_lengths  = [len(r) for r in coded_review]
# Indices of reviews ordered by their length in ascending order
ix = np.argsort(r_lengths)

# The second shortest review in the data set is a good example (you can use any review)
print(df_imdb["review_clean"][ix[1]])

# Encoded version
print(coded_review[ix[1]])

# Just for fun, we can of course also decode the one-hot-encoding    
[(id2word[c],c) for c in coded_review[ix[1]]]    

read book forget movie
[234, 144, 626, 1]


[('read', 234), ('book', 144), ('forget', 626), ('movie', 1)]

In [136]:
example_sentence = coded_review[ix[1]]

Remember that a window size of `i` translates to $[w_{-i},\ldots, w, \ldots, w_{+i}]$, so the number of context words to consider is window size $\times 2$. 

In [139]:
window_size = 2

In [140]:
pairs, labels = skipgrams(example_sentence, vocabulary_size=vocab_size, window_size=window_size)
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d}))\t -> {:d}".format( id2word[pairs[i][0]], 
                                                        pairs[i][0], 
                                                        id2word[pairs[i][1]],
                                                        pairs[i][1], labels[i]))

(read (234), book (144))	 -> 1
(read (234), laugh (120))	 -> 0
(forget (626), read (234))	 -> 1
(read (234), manages (855))	 -> 0
(book (144), attention (535))	 -> 0
(movie (1), disappointed (540))	 -> 0
(book (144), lose (227))	 -> 0
(forget (626), movie (1))	 -> 1
(book (144), forget (626))	 -> 1
(read (234), forget (626))	 -> 1
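
To demystify the output above, here is a simplified sketch of how such labeled pairs can be generated in plain Python. The function `toy_skipgrams` is a hypothetical stand-in, not the Keras implementation: to our understanding, Keras additionally shuffles the pairs, skips the index 0, and can subsample frequent target words.

```python
import random

def toy_skipgrams(sequence, vocabulary_size, window_size=2, seed=0):
    """Create (target, context) pairs with label 1 and random pairs with label 0."""
    rng = random.Random(seed)
    pairs, labels = [], []
    # Positive examples: every word within the window of the target word
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        stop = min(len(sequence), i + window_size + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, sequence[j]))
                labels.append(1)
    # Negative examples: one uniformly drawn word index per positive pair.
    # A random word may, by chance, be a true context word; we accept that noise.
    n_positive = len(pairs)
    for k in range(n_positive):
        target = pairs[k][0]
        pairs.append((target, rng.randint(1, vocabulary_size - 1)))
        labels.append(0)
    return pairs, labels

# The encoded example review from above: read book forget movie
pairs, labels = toy_skipgrams([234, 144, 626, 1], vocabulary_size=1000)
print(sum(labels), "positive and", labels.count(0), "negative pairs")
```

For a review of four words and a window size of 2, every word sees two or three neighbors, which gives ten positive pairs and, with one negative per positive, twenty training examples in total.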


In the above demo, the negative samples, that is, words not taken from the context window of the target word, were picked uniformly at random from the vocabulary. Sampling in W2V should also account for word frequency. Otherwise, we might end up focusing too much on frequent words. Keras provides a utility function, *make_sampling_table*, to calculate a sampling weight for each word rank in a frequency-ordered vocabulary. Details are available in the [Keras documentation](https://keras.io/preprocessing/sequence/). The `sampling_table` is an array of sampling probabilities, one for each word. Note that, in the Keras implementation, *skipgrams()* uses these probabilities to decide whether a word in the sequence is kept as a target word (i.e., subsampling of frequent words); the random negative words themselves are still drawn uniformly. 

In [145]:
from keras.preprocessing.sequence import make_sampling_table
samp_tab = make_sampling_table(vocab_size)
samp_tab[:10]

array([0.00315225, 0.00315225, 0.00547597, 0.00741556, 0.00912817,
       0.01068435, 0.01212381, 0.01347162, 0.01474487, 0.0159558 ])
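
The numbers above follow from a Zipf-like approximation of word frequencies. The sketch below re-implements the formula in plain Python; the constants and the exact form are taken from our reading of the Keras source code and should be treated as an assumption to verify. The function name `zipf_sampling_table` is ours.

```python
import math

def zipf_sampling_table(size, sampling_factor=1e-5):
    """Sampling probability per word rank, assuming word frequencies follow Zipf's law."""
    gamma = 0.577  # Euler-Mascheroni constant (rounded)
    table = []
    for index in range(size):
        rank = max(index, 1)  # index 0 is treated like rank 1
        # Approximate inverse frequency of the word with the given rank
        inv_freq = rank * (math.log(rank) + gamma) + 0.5 - 1.0 / (12.0 * rank)
        f = sampling_factor * inv_freq
        table.append(min(1.0, f / math.sqrt(f)))
    return table

table = zipf_sampling_table(1000)
print([round(p, 8) for p in table[:4]])
```

The first entries agree with the output of `make_sampling_table` shown above up to rounding. Since the probability grows with the rank, frequent words (small index) are sampled rarely and less frequent words more often.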

Note the increasing magnitude of the sampling weights. Using this table requires that the word indices are ordered by frequency, with the most frequent word having the smallest index. Remember the idea is that we do not want to focus too much on the frequent words; in our case words like 'movie', 'film', and 'like'. Therefore, frequent words receive a small probability of being kept as a target word, whereas less frequent words are kept more often. 

When building our vocabulary above, we used the *most_common()* function. Therefore, the words in our vocabulary are ordered in decreasing order by their frequency, which is exactly what the sampling table expects. We will pass the sampling table to *skipgrams()* when generating the training set for our W2V model.  

In [150]:
# CAUTION: yet another operation that is not cheap when using all data
import time
start = time.time()
skip_grams = [skipgrams(r, vocabulary_size=vocab_size, 
                        window_size=window_size, sampling_table=samp_tab) 
             for r in coded_review]
end = time.time()
print('Generated {} skip-grams in {} sec.'.format(len(skip_grams), end-start))

Generated 50000 skip-grams in 7.139723777770996 sec.


### Building the neural network
We are ready to design our NN architecture using Keras. We feed the network with pairs of target word and actual/fake context word. Each word is put through an embedding layer. Remember that W2V trains two embeddings per word, one when the word is the target word and one when the word appears in the context of some other target word. So using two embedding layers is important.

Having obtained word embeddings for the target and context word, we pass these embeddings to a merge layer in which we compute the dot product of these two vectors. We can think of the dot product as an unnormalized cosine similarity between the two embedding vectors. Put differently, we obtain a similarity score. We want that score to be large when the inputted 'context' word actually appeared in the context of the target word, and small otherwise. Hence, we pass the similarity score through a sigmoid activation, which computes a probability of the 'context' word being an actual context word. We then compare this probability, the output of our neural network, to the actual label, which we obtained above from *skipgrams()*. Enter back-propagation. 
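
The forward pass just described fits in a few lines of plain Python. The sketch below uses made-up 3-dimensional vectors in place of learned embeddings; the function names are ours and serve illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_probability(target_vec, context_vec):
    """Dot product of the two embeddings, squashed to a probability."""
    score = sum(t * c for t, c in zip(target_vec, context_vec))
    return sigmoid(score)

def binary_cross_entropy(label, probability):
    """Loss for a single (target, context) pair with label 0 or 1."""
    return -(label * math.log(probability) + (1 - label) * math.log(1 - probability))

# Toy embeddings (assumed values)
target = [0.5, -1.0, 2.0]
true_context = [0.4, -0.8, 1.5]   # points in a similar direction as the target
random_word = [-0.6, 1.2, -1.0]   # points away from the target

p_true = pair_probability(target, true_context)
p_random = pair_probability(target, random_word)
print(p_true, binary_cross_entropy(1, p_true))
print(p_random, binary_cross_entropy(0, p_random))
```

Both pairs incur a small loss here because the toy vectors already agree with the labels. During training, back-propagation adjusts the embeddings so that this holds for the actual data.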

So far so good, but there is one issue. Our network is a little more advanced than those we have built so far. There were also some changes when moving to Keras 2, which affect us in this example. Long story short, we cannot use the nice and simple sequential API anymore and will have to use the functional API instead. For this reason, the code will look a little different from what you are used to. 

In [151]:
import keras
from keras.models import Model
from keras.preprocessing import sequence
from keras.layers import Embedding, Input, Reshape, Dot, Activation

In [152]:
# Embedding dimension
emd_dim = 25  # relatively small but we do not use much data

In [153]:
# Set up embedding layers for the target and the context word:
embedding_target = Embedding(vocab_size, emd_dim, input_length=1, name='embedding_target')
embedding_context = Embedding(vocab_size, emd_dim, input_length=1, name='embedding_context')

In [154]:
# Build the model architecture using the functional API

# Take a single target word
input_target = Input((1,))
target = embedding_target(input_target)
target = Reshape((emd_dim, 1))(target)

# Take another word either from the context or a random word from vocabulary
input_context = Input((1,))
context = embedding_context(input_context)
context = Reshape((emd_dim, 1))(context)

# Calculate the dot product as an unnormalized cosine similarity
dot_product = Dot(axes=1, normalize=False)([target, context])
dot_product = Reshape((1,))(dot_product)

# Predict if the words are in the same context -> Binary yes/no
output = Activation(activation='sigmoid')(dot_product)

# Compile the model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam')

See how the model is not much of a neural network? The only trainable parameters are the embeddings, which are then dot-multiplied. We thus have two hidden layers side-by-side rather than one after the other and no non-linear activation of the hidden layers! This is very similar to matrix factorization and you can use the same architecture to build a collaborative filter on users (one embedding matrix) and items (one embedding matrix); just in case you are into recommender engines.

In [155]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
embedding_target (Embedding)    (None, 1, 25)        25000       input_1[0][0]                    
__________________________________________________________________________________________________
embedding_context (Embedding)   (None, 1, 25)        25000       input_2[0][0]                    
______________________________________________________________________________________________

Here is a maybe more intuitive visualization of the model thanks to [Dipanjan Sarkar.](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)

<img src="https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png">
<br>
Image source: 
https://miro.medium.com/max/1123/1*4Uil1zWWF5-jlt-FnRJgAQ.png

### Training loop
We train our model review by review, updating the model after every review (i.e., batch). Implementing this approach is not possible when using the standard Keras training loop. Therefore, we use the function *train_on_batch*, which gives us more control over the training.

In [156]:
# Number of epochs
nb_epoch = 5

for e in range(nb_epoch):
    print('-' * 40)
    print('Epoch {}'.format(e))
    print('-' * 40)
    start = time.time()
    losses = []

    for couples, labels in skip_grams:
        if couples:  # skip reviews for which no pairs were generated
            X = np.array(couples, dtype="int32")
            # train_on_batch expects arrays; passing the plain list of labels
            # raises "AttributeError: 'int' object has no attribute 'shape'"
            y = np.array(labels, dtype="int32")
            loss = model.train_on_batch([X[:, 0], X[:, 1]], y)
            losses.append(loss)

    end = time.time()
    print(f'Average loss over last 1000 batches: {np.mean(losses[-1000:])}')
    print('Epoch {} took {:.2f} min.'.format(e, (end - start) / 60))


### Extracting the weights
We can extract the word embeddings from the corresponding layer of our model. Converting the embeddings to a data frame facilitates a quick look.

In [None]:
word_embeddings = model.get_layer(name="embedding_target").get_weights()[0]
print(word_embeddings.shape)
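# Row i of the embedding matrix belongs to the word with id i (row 0 is UNK)
w2v_df = pd.DataFrame(word_embeddings, index=[id2word[i] for i in range(vocab_size)])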
w2v_df.head()

We can try to reproduce some of the functionality demonstrated above for the Gensim implementation. Of course, we don't go all the way. Still, doing a little similarity calculation is not too difficult. We use some scikit-learn functionality to create a matrix of pairwise distances between words. We can then query the most similar words to some seed-words.

In [None]:
from sklearn.metrics.pairwise import euclidean_distances

distance_matrix = euclidean_distances(word_embeddings)
print(distance_matrix.shape)

# Row i of the distance matrix belongs to the word with id i (row 0 is UNK).
# Search terms that are not part of our small vocabulary are skipped to avoid a KeyError.
similar_words = {search_term: [id2word[idx] for idx in distance_matrix[word2id[search_term]].argsort()[1:6]]
                 for search_term in ['tarantino', 'cruise', 'willis', 'lawrence', 'bullock']
                 if search_term in word2id}

similar_words

Ok, we might want to continue training our embeddings.   