<a href="https://colab.research.google.com/github/ShaunakSen/AI-for-Web-Accessibility/blob/master/Understanding_Word2Vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word2Vec word embedding tutorial

[tutorial link](https://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/)

### Why do we need Word2Vec?

If we want to feed words into machine learning models, unless we are using tree based methods, we need to convert the words into some set of numeric vectors.  A straight-forward way of doing this would be to use a “one-hot” method of converting the word into a sparse representation with only one element of the vector set to 1, the rest being zero. 

So, for the sentence “the cat sat on the mat” we would have the following vector representation:

\begin{equation} 
\begin{pmatrix} 
the \\ 
cat \\ 
sat \\ 
on \\ 
the \\ 
mat \\ 
\end{pmatrix} 
= 
\begin{pmatrix} 
1 & 0 & 0 & 0 & 0 \\ 
0 & 1 & 0 & 0 & 0 \\ 
0 & 0 & 1 & 0 & 0 \\ 
0 & 0 & 0 & 1 & 0 \\ 
1 & 0 & 0 & 0 & 0 \\ 
0 & 0 & 0 & 0 & 1 
\end{pmatrix} 
\end{equation}


Here we have transformed a six word sentence into a 6×5 matrix, with the 5 being the size of the vocabulary (“the” is repeated).  In practical applications, however, we will want machine and deep learning models to learn from gigantic vocabularies i.e. 10,000 words plus.  You can begin to see the efficiency issue of using “one hot” representations of the words – the input layer into any neural network attempting to model such a vocabulary would have to be at least 10,000 nodes.  Not only that, this method strips away any local context of the words – in other words, it strips away information about words which commonly appear close together in sentences (or between sentences).

For instance, we might expect to see “United” and “States” to appear close together, or “Soviet” and “Union”.  Or “food” and “eat”, and so on.  This method loses all such information, which, if we are trying to model natural language, is a large omission.  Therefore, we need an efficient representation of the text data which also conserves information about local word context.  This is where the Word2Vec methodology comes in.
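The one-hot representation described above can be reproduced in a few lines of NumPy. This is only a minimal sketch: the names `sentence`, `vocab`, `word_to_index` and `one_hot` are chosen for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

sentence = "the cat sat on the mat".split()

# Vocabulary in order of first appearance: the, cat, sat, on, mat
vocab = list(dict.fromkeys(sentence))
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per word in the sentence, one column per vocabulary entry
one_hot = np.zeros((len(sentence), len(vocab)), dtype=int)
for row, word in enumerate(sentence):
    one_hot[row, word_to_index[word]] = 1

print(one_hot.shape)  # (6, 5)
print(one_hot)
```

Rows 0 and 4 are identical because both encode “the”, and no row says anything about which words tend to sit next to each other.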

### The Word2Vec methodology

As mentioned previously, there are two components to the Word2Vec methodology.  The first is the mapping of a high dimensional one-hot style representation of words to a lower dimensional vector. **This might involve transforming a 10,000 columned matrix into a 300 columned matrix, for instance. This process is called word embedding.**  The second is to do this while still maintaining word context and therefore, to some extent, meaning. **One approach to achieving these two goals in the Word2Vec methodology is by taking an input word and then attempting to estimate the probability of other words appearing close to that word.  This is called the skip-gram approach.**  The alternative method, called Continuous Bag Of Words (CBOW), does the opposite – it takes some context words as input and tries to find the single word that has the highest probability of fitting that context.  In this tutorial, we will concentrate on the skip-gram method.

What’s a gram?  A gram is a group of n words, where n is the gram window size.  So for the sentence “The cat sat on the mat”, a 3-gram representation of this sentence would be “The cat sat”, “cat sat on”, “sat on the”, “on the mat”.  The “skip” part refers to the number of times an input word is repeated in the data-set with different context words (more on this later).  These grams are fed into the Word2Vec context prediction system. For instance, assume the input word is “cat” – the Word2Vec tries to predict the context (“the”, “sat”) from this supplied input word.  **The Word2Vec system will move through all the supplied grams and input words and attempt to learn appropriate mapping vectors (embeddings) which produce high probabilities for the right context given the input words.**
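As a rough illustration of the gram idea, the sketch below builds the 3-grams of the example sentence and pairs the middle word of a gram with its neighbours. The helpers `ngrams` and `centre_context_pairs` are hypothetical names made up for this example; they are not part of Word2Vec or Keras.

```python
def ngrams(words, n):
    """Return every run of n consecutive words."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def centre_context_pairs(gram):
    """Pair the middle word of a gram with each of its neighbours."""
    mid = len(gram) // 2
    return [(gram[mid], w) for i, w in enumerate(gram) if i != mid]

words = "the cat sat on the mat".split()
grams = ngrams(words, 3)
print(grams[0])                        # ['the', 'cat', 'sat']
print(centre_context_pairs(grams[0]))  # [('cat', 'the'), ('cat', 'sat')]
```

Each (input, context) pair is one training example: given “cat”, the network should assign a high probability to “the” and to “sat”.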

> What is this Word2Vec prediction system?  Nothing other than a neural network.







### The softmax Word2Vec method

Consider the diagram below – in this case we’ll assume the sentence “The cat sat on the mat” is part of a much larger text database, with a very large vocabulary – say 10,000 words in length.  We want to reduce this to a 300 length embedding.

![](https://i2.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2017/07/Word2Vec-softmax.jpg?w=676&ssl=1)

With respect to the diagram above, if we take the word “cat” it will be one of the words in the 10,000 word vocabulary.  Therefore we can represent it as a 10,000 length one-hot vector.  We then interface this input vector to a 300 node hidden layer. The weights connecting this layer will be our new word vectors – more on this soon.  The activations of the nodes in this hidden layer are simply linear summations of the weighted inputs (i.e. no non-linear activation, like a sigmoid or tanh, is applied).  These nodes are then fed into a softmax output layer. Note that the output layer has dimension equal to the vocabulary size.  During training, we want to change the weights of this neural network so that words surrounding “cat” have a higher probability in the softmax output layer.  So, for instance, if our text data set has a lot of Dr Seuss books, we would want our network to assign large probabilities to words like “the”, “sat” and “on” (given lots of sentences like “the cat sat on the mat”).

By training this network, we would be creating a 10,000 x 300 weight matrix connecting the 10,000 length input with the 300 node hidden layer.  Each row in this matrix corresponds to a word in our 10,000 word vocabulary – so we have effectively reduced 10,000 length one-hot vector representations of our words to 300 length vectors.  The weight matrix essentially becomes a look-up or encoding table of our words.  Not only that, but these weight values contain context information due to the way we’ve trained our network.  Once we’ve trained the network, we abandon the softmax layer and just use the `10,000 x 300` weight matrix as our word embedding lookup table.


$$
\begin{pmatrix}
word_1 & wt_{1,1} & wt_{1,2} & \dots & wt_{1,300} \\
word_2 & wt_{2,1} & wt_{2,2} & \dots & wt_{2,300} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
word_{10000} & wt_{10000,1} & wt_{10000,2} & \dots & wt_{10000,300}
\end{pmatrix}
$$
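The reason the weight matrix can act as a look-up table is that multiplying a one-hot vector by a matrix simply selects one row of it. The sketch below checks this with toy sizes (10 words and 4 dimensions standing in for 10,000 and 300); the random matrix `W` is a placeholder, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4                 # toy stand-ins for 10,000 and 300
W = rng.normal(size=(vocab_size, embed_dim))  # hidden-layer weight matrix

word_index = 3
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W     # linear hidden layer, no activation
lookup = W[word_index]   # plain row lookup

print(np.allclose(hidden, lookup))  # True
```

This is why an embedding layer can skip the one-hot vector altogether and index the matrix with the word's integer id.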

As with any machine learning problem, there are two components – the first is getting all the data into a usable format, and the next is actually performing the training, validation and testing.  First I’ll go through how the data can be gathered into a usable format, then we’ll talk about the TensorFlow graph of the model.



### Preparing the text data

The previously mentioned TensorFlow tutorial has a few functions that take a text database and transform it so that we can extract input words and their associated grams in mini-batches for training the Word2Vec system / embeddings (if you’re not sure what “mini-batch” means, check out this tutorial).  I’ll briefly talk about each of these functions in turn:



In [0]:
!pip install keras==2.2.4



In [0]:
import urllib.request
import collections
import os
import zipfile

import numpy as np
import tensorflow as tf


In [0]:
import keras
keras.__version__

'2.2.4'

In [0]:
def maybe_download(filename, url, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""

  # check if file exists
  if not os.path.exists(path=filename):
    # download: Returns a tuple containing the path to the newly created data file as well as the resulting HTTPMessage object.
    filename, _ = urllib.request.urlretrieve(url=url+filename, filename=filename)
  statinfo = os.stat(path=filename)
  # check file size
  if statinfo.st_size == expected_bytes:
    print('Found and verified', filename)
  else:
    print(statinfo.st_size)
    raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

This function checks to see if the filename already has been downloaded from the supplied url.  If not, it uses the urllib.request Python module which retrieves a file from the given url argument, and downloads the file into the local code directory.  If the file already exists (i.e. os.path.exists(filename) returns true), then the function does not try to download the file again.  Next, the function checks the size of the file and makes sure it lines up with the expected file size, expected_bytes.  If all is well, it returns the filename object which can be used to extract the data from.  To call the function with the data-set we are using in this example, we execute the following code:



In [0]:
url = 'http://mattmahoney.net/dc/'
filename = maybe_download('text8.zip', url, 31344016)

print (filename)

Found and verified text8.zip
text8.zip


In [0]:
# Read the data into a list of strings.
def read_data(filename):
  """Extract the first file enclosed in a zip file as a list of words."""
  with zipfile.ZipFile(filename) as f:
    data = tf.compat.as_str(f.read(f.namelist()[0])).split()
  return data

Using zipfile.ZipFile() to extract the zipped file, we can then use the reader functionality found in this zipfile module.  First, the namelist() function retrieves all the members of the archive – in this case there is only one member, so we access this using the zero index.  Then we use the read() function which reads all the text in the file and pass this through the TensorFlow function as_str which ensures that the text is created as a string data-type.  Finally, we use split() function to create a list with all the words in the text file, separated by white-space characters.  We can see some of the output here:



In [0]:
vocabulary = read_data(filename)

print (vocabulary[:10])

print ("Length of vocab:", len(vocabulary))

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
Length of vocab: 17005207


As you can observe, the returned vocabulary data contains a list of plain English words, ordered as they are in the sentences of the original extracted text file.  Now that we have all the words extracted in a list, we have to do some further processing to enable us to create our skip-gram batch data.  These further steps are:

1. Extract the top 10,000 most common words to include in our embedding vector
2. Gather together all the unique words and index them with a unique integer value – this is what is required to create an equivalent one-hot type input for the word.  We’ll use a dictionary to do this
3. Loop through every word in the dataset (vocabulary variable) and replace it with the unique integer identifier created in Step 2 above.  This will allow easy lookup / processing of the word data stream

In [0]:
count = [['UNK', -1]]

count.extend(collections.Counter([1,1,12,3,2,2,1]).most_common(2))

print (len(count))

3


In [0]:
count

[['UNK', -1], (1, 3), (2, 2)]

In [0]:
vocabulary_size = 10000

count = [('UNK', -1)]

count.extend(collections.Counter(vocabulary).most_common(vocabulary_size-1))

print (count[:10], len(count))

[('UNK', -1), ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201), ('a', 325873), ('to', 316376), ('zero', 264975), ('nine', 250430)] 10000


In [0]:
dictionary = dict()

for word,_ in count:
  dictionary[word] = len(dictionary)

print (dictionary)



In [0]:
data = list()

unk_count=0

for word in vocabulary:
  if word in dictionary:
    index = dictionary[word]
  else:
    index = 0 # unknown word so the index corr to 'UNK'
    unk_count += 1
  data.append(index)

print (len(data) == len(vocabulary))

print (data[:10])

True
[5234, 3081, 12, 6, 195, 2, 3134, 46, 59, 156]


In [0]:
def build_dataset(words, n_words):
  """Process raw inputs into a dataset.
  words: vocab corpus
  n_words: top most freq words we want to consider
  """
  count = [['UNK', -1]]
  # create list of most freq occurring words and the freq (in DESC order of freq)
  count.extend(collections.Counter(words).most_common(n_words-1))
  # sample: [('UNK', -1), ('the', 1061396), ('of', 593677), ('and', 416629), ...]
  dictionary = dict()

  # create mapping st most freq word -> 1 next most freq -> 2 and so on..
  # UNK will get mapped to 0
  for word, freq in count:
    dictionary[word] = len(dictionary)

  # sample: {'UNK': 0, 'the': 1, 'of': 2, 'and': 3, 'one': 4, ...}
  # now we create a list of the corr int mapping of the words in the original corpus
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      # get corr index
      index = dictionary[word]
    else:
      # unknown
      index = dictionary['UNK'] # map to 0
      # increment unk count
      unk_count+=1

    data.append(index)
  
  # data is a list of indices in order of the original words in the vocab

  # now that we have no of unk, update it in the count list
  # for this assignment we had made the 1st elem of count a list and not a tuple
  # as tuples are immutable
  count[0][1] = unk_count

  # now we have word->int mapping. We want to create the reverse
  reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))

  # check if len of data == len of words in the corpus
  if len(data) == len(words):
    return data, count, dictionary, reversed_dictionary


data, count, dictionary, reversed_dictionary = build_dataset(vocabulary, 10000)

In [0]:
print (data[:5], count[:5])

[5234, 3081, 12, 6, 195] [['UNK', 1737307], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]


### The softmax issue and negative sampling

The problem with using a full softmax output layer is that it is very computationally expensive.  Consider the definition of the softmax function:

$$
P(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^K e^{x^T w_k}}$$

Here the probability of the output being class j is calculated by taking the dot product of the hidden layer output x and the weights w_j connecting to the class j output, exponentiating it for the numerator, and dividing by the sum of the same exponentiated product over all K output classes.  When the output is a 10,000-word one-hot vector and the hidden layer has 300 nodes, that is 300 × 10,000 = 3 million weights that need to be updated in any gradient based training of the output layer.  This gets seriously time-consuming and inefficient.
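To see where the cost comes from, here is a small NumPy sketch of the softmax above with toy sizes. The max-subtraction step is a standard numerical-stability trick that I have added; it is not part of the formula as written, and the random `x` and `W` are placeholders.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    shifted = scores - np.max(scores)   # avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
hidden_dim, num_classes = 4, 6      # toy stand-ins for 300 and 10,000
x = rng.normal(size=hidden_dim)                 # hidden-layer output
W = rng.normal(size=(hidden_dim, num_classes))  # column j is w_j

probs = softmax(x @ W)   # every class contributes to the denominator
print(probs.sum())
print(W.size)            # 24 weights here: hidden_dim * num_classes
```

Because the denominator sums over every class, the gradient for a single training example touches every column of `W`, which is what makes the full softmax so expensive at realistic vocabulary sizes.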


To train the embedding layer using negative samples in Keras, we can re-imagine the way we train our network.  Instead of constructing our network so that the output layer is a multi-class softmax layer, we can change it into a simple binary classifier.  For words that are in the context of the target word, we want our network to output a 1, and for our negative samples, we want our network to output a 0. Therefore, the output layer of our Word2Vec Keras network is simply a single node with a sigmoid activation function.


We also need a way of ensuring that, as the network trains, words which are similar end up having similar embedding vectors.  Therefore, we want to ensure that the trained network will always output a 1 when it is supplied words which are in the same context, but 0 when it is supplied words which are never in the same context. Therefore, we need a vector similarity score supplied to the output sigmoid layer – with similar vectors outputting a high score and dissimilar vectors outputting a low score.  The most typical similarity measure used between two vectors is the cosine similarity score:

$$
similarity = cos(\theta) = \frac{\textbf{A}\cdot\textbf{B}}{\parallel\textbf{A}\parallel_2 \parallel \textbf{B} \parallel_2}
$$

The denominator of this measure acts to normalize the result – the real similarity operation is on the numerator: the dot product between vectors A and B.  In other words, to get a simple, non-normalized measure of similarity between two vectors, you simply apply a dot product operation between them.
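The relationship between the raw dot product and the cosine similarity can be checked numerically. This is a minimal sketch with made-up vectors; `cosine_similarity` is a helper written for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice the length
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

print(np.dot(a, b))             # 28.0
print(cosine_similarity(a, b))  # approximately 1
print(cosine_similarity(a, c))  # approximately -1
```

The dot product grows with the length of the vectors, while the cosine similarity depends only on the angle between them.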

So with all that in mind, our new negative sampling network for the planned Word2Vec Keras implementation features:

- An (integer) input of a target word and a real or negative context word
- An embedding layer lookup (i.e. looking up the integer index of the word in the embedding matrix to get the word vector)
- The application of a dot product operation
- The output sigmoid layer

The architecture of this implementation looks like:





### The final architecture

![](https://i0.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2017/08/Negative-sampling-architecture-1.jpg?w=931&ssl=1)

Let’s go through this architecture more carefully.  First, each of the words in our vocabulary is assigned an integer index from 0 up to (but not including) the size of our vocabulary (in this case, 10,000).  We pass two words into the network, one the target word and the other either a word from the surrounding context or a negative sample.  We “look up” these indexes as the rows of our embedding layer (10,000 x 300 weight tensor) to retrieve our 300 length word vectors.  We then perform a dot product operation between these vectors to get the similarity.  Finally, we pass the similarity through a sigmoid layer to give us a value between 0 and 1, which we can match with the label given to the context word (1 for a true context word, 0 for a negative sample).
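Before building this in Keras, the forward pass just described can be sketched in plain NumPy. Everything here is an illustrative assumption: the toy sizes, the random `toy_embedding` matrix, and the scalar `out_weight` / `out_bias` standing in for the single sigmoid output node.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4      # toy stand-ins for 10,000 and 300
toy_embedding = rng.normal(scale=0.1, size=(vocab_size, embed_dim))

# Hypothetical scalar weight and bias for the single sigmoid output node
out_weight, out_bias = 1.0, 0.0

def predict(target_idx, context_idx):
    """Look up both vectors, take their dot product, squash with a sigmoid."""
    target_vec = toy_embedding[target_idx]
    context_vec = toy_embedding[context_idx]
    score = np.dot(target_vec, context_vec)
    return float(sigmoid(out_weight * score + out_bias))

p = predict(target_idx=2, context_idx=7)
print(0.0 < p < 1.0)  # True
```

Training would push `p` towards 1 for true (target, context) pairs and towards 0 for negative samples by adjusting the rows of the embedding matrix.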

The back-propagation of our errors will work to update the embedding layer to ensure that words which are truly similar to each other (i.e. share contexts) have vectors such that they return high similarity scores. Let’s now implement this architecture in Keras and we can test whether this turns out to be the case.

In [0]:
print ("data:",data[:5])

print ("count", count[:5])

data: [5234, 3081, 12, 6, 195]
count [['UNK', 1737307], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]


Next, we need to define some constants for the training and also create a validation set of words so we can check the learning progress of our word vectors.

### Constants and the validation set

The first constant, window_size, is the window of words around the target word that will be used to draw the context words from. 

The second constant, vector_dim, is the size of each of our word embedding vectors – in this case, our embedding layer will be of size 10,000 x 300.

Finally, we have a large epochs variable – this designates the number of training iterations we are going to run.  Word embedding, even with negative sampling, can be a time-consuming process.

The next set of commands relate to the words we are going to check to see what other words grow in similarity to this validation set. During training, we will check which words begin to be deemed similar by the word embedding vectors and make sure these line up with our understanding of the meaning of these words.

In this case, we will select 16 words to check, and pick these words randomly from the top 100 most common words in the data-set (build_dataset has assigned the most common words in the data set integers in ascending order i.e. the most common word is assigned 1, the next most common 2, etc., with 0 reserved for 'UNK').

In [0]:
np.random.choice(100, 16).shape

(16,)

In [0]:
window_size = 3
vector_dim = 300
epochs = 1000000  # one million iterations (writing 10,00,000 would create the tuple (10, 0, 0))

valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
# pick 16 elems randomly from a set of 100 elems without replacement
valid_examples = np.random.choice(a=valid_window, size=valid_size, replace=False)

print (valid_examples)

[17 22 82 41 56 29 68 19 46  7 81  5  1 78 60 44]


Next, we are going to look at a handy function in Keras which does all the skip-gram / context processing for us.

### The skip-gram function in Keras

According to the official documentation:

```
Generates skipgram word pairs.

This function transforms a sequence of word indexes (list of integers) into tuples of words of the form:

- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).
```
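To make the quoted description concrete, here is a rough pure-Python sketch of the idea. `toy_skipgrams` is a hypothetical helper written for this notebook, not the Keras implementation: it ignores the sampling table and shuffling, and a random "negative" word can occasionally happen to be a genuine context word.

```python
import random

def toy_skipgrams(sequence, vocabulary_size, window_size, seed=0):
    """Return (couples, labels): in-window pairs get 1, random pairs get 0."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j == i:
                continue
            couples.append([target, sequence[j]])
            labels.append(1)
    # One negative sample per positive pair: keep the target, draw a random word
    for target, _ in list(couples):
        couples.append([target, rng.randint(1, vocabulary_size - 1)])
        labels.append(0)
    return couples, labels

toy_couples, toy_labels = toy_skipgrams([5, 3, 8, 2], vocabulary_size=10, window_size=1)
print(len(toy_couples), sum(toy_labels))  # 12 6
```

With a window of 1, the four-word sequence yields six positive pairs, and the sketch adds one negative pair for each of them.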



In [0]:
from keras.preprocessing import sequence

In [0]:
sampling_table = sequence.make_sampling_table(size=vocabulary_size)

print (sampling_table[:20])

couples, labels = sequence.skipgrams(sequence=data, vocabulary_size=vocabulary_size, window_size=window_size, sampling_table=sampling_table)


[0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
 0.01212381 0.01347162 0.01474487 0.0159558  0.0171136  0.01822533
 0.01929662 0.02033198 0.02133515 0.02230924 0.02325687 0.02418031
 0.02508148 0.02596208]


Ignoring the first line for the moment (make_sampling_table), the Keras skipgrams function does exactly what we want of it – it returns the word couples in the form of (target, context) and also gives a matching label of 1 or 0 depending on whether context is a true context word or a negative sample. By default, it returns randomly shuffled couples and labels.  Later on, we will split the couples into separate word_target and word_context variables and make sure they are the right type.  Printing the first few couples and labels produces the following instructive output:



In [0]:
print (couples[:10]) 
print (labels[:10])

print (len(couples), len(labels))

print (sum(labels))

[[470, 4], [5298, 2], [1004, 37], [3818, 8796], [5595, 7923], [185, 9592], [9335, 909], [1306, 1236], [5291, 6194], [1984, 5651]]
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
30042736 30042736
15021368


The make_sampling_table() operation creates a table of sampling probabilities indexed by word rank, which skipgrams uses to down-sample very frequent words so that the generated couples are not dominated by words such as “the” and “of”.  The skipgrams operation by default selects the same amount of negative samples as it does true context words.



In [0]:
# fraction of couples that are true context words
print (sum(labels)/len(labels))

0.5


We’ll feed the produced arrays (word_target, word_context) into our Keras model later – now onto the Word2Vec Keras model itself.

### The Keras functional API and the embedding layers

In this Word2Vec Keras implementation, we’ll be using the Keras functional API.  In my previous Keras tutorial, I used the Keras sequential layer framework. This sequential layer framework allows the developer to easily bolt together layers, with the tensor outputs from each layer flowing easily and implicitly into the next layer.  In this case, we are going to do some things which are a little tricky – the sharing of a single embedding layer between two tensors, and an auxiliary output to measure similarity – and therefore we can’t use a straightforward sequential implementation.

Thankfully, the functional API is also pretty easy to use.  I’ll introduce it as we move through the code. The first thing we need to do is specify the structure of our model, as per the architecture diagram which I have shown above. As an initial step, we’ll create our input variables and embedding layer:

![](https://i0.wp.com/adventuresinmachinelearning.com/wp-content/uploads/2017/08/Negative-sampling-architecture-1.jpg?w=931&ssl=1)

In [0]:
from keras.models import Model
from keras.layers import Input, Dense, Reshape, merge, Dot
from keras.layers.embeddings import Embedding

In [0]:
# create some input variables

input_target = Input(shape=(1,))
input_context = Input(shape=(1,))

print (input_target.shape, input_context.shape)

embedding = Embedding(input_dim=vocabulary_size, output_dim=vector_dim, input_length=1, name='embedding')



(?, 1) (?, 1)


First off, we need to specify what tensors are going to be input to our model, along with their size. In this case, we are just going to supply individual target and context words, so the input size for each input variable is simply (1,).  Next, we create an embedding layer, which Keras already has specified as a layer for us – Embedding().  The first argument to this layer definition is the number of rows of our embedding layer – which is the size of our vocabulary (10,000).  The second is the size of each word’s embedding vector (the columns) – in this case, 300. We also specify the input length to the layer – in this case, it matches our input variables i.e. 1.  

**Finally, we give it a name, as we will want to access the weights of this layer after we’ve trained it, and we can easily access the layer weights using the name.**

The weights for this layer are initialized automatically, but you can also specify an optional embeddings_initializer argument whereby you supply a Keras initializer object.  Next, as per our architecture, we need to look up an embedding vector (length = 300) for our target and context words, by supplying the embedding layer with the word’s unique integer value:




In [0]:
# embed the target to dense vector of size 300
target = embedding(input_target)
# reshape the target to 300,1
target = Reshape(target_shape=(vector_dim, 1))(target)

# embed the context to dense vector of size 300
context = embedding(input_context)
# reshape the context to 300,1
context = Reshape(target_shape=(vector_dim,1))(context)

print (target.shape, context.shape)


(?, 300, 1) (?, 300, 1)


As can be observed in the code above, the embedding vector is easily retrieved by supplying the word integer (i.e. input_target and input_context) in brackets to the previously created embedding operation/layer. For each word vector, we then use a Keras Reshape layer to reshape it ready for our upcoming dot product and similarity operation, as per our architecture.

The next layer involves calculating our cosine similarity between the supplied word vectors:

The `merge()` function used in the original tutorial is deprecated, so we use the `Dot` layer instead. According to the documentation:

```
Layer that computes a dot product between samples in two tensors.
```

In [0]:
test1 = tf.constant([
                     [[1.], [2.], [3.]], 
                     [[8.], [6.], [0.]]
                     ])
test2 = tf.constant([
                     [[0.], [2.], [-3.]], 
                     [[3.], [2.], [-5.]]
                     ])

print (test1.shape)

exp_dot1 = Dot(axes=0, normalize=False)
exp_dot2 = Dot(axes=1, normalize=False)
exp_dot3 = Dot(axes=2, normalize=False)

dot_prod1 = exp_dot1([test1, test2])
dot_prod2 = exp_dot2([test1, test2])
dot_prod3 = exp_dot3([test1, test2])

# Construct a `Session` to execute the graph.
sess = tf.compat.v1.Session()

# Execute the graph and fetch the value of each dot product.
result1 = sess.run(dot_prod1)
result2 = sess.run(dot_prod2)
result3 = sess.run(dot_prod3)

print (result1)
print (result2)
print (result3)

print (result1.shape, result2.shape, result3.shape)

(2, 3, 1)
[[[-5.]]

 [[36.]]]
[[[-5.]]

 [[36.]]]
[[[  0.   2.  -3.]
  [  0.   4.  -6.]
  [  0.   6.  -9.]]

 [[ 24.  16. -40.]
  [ 18.  12. -30.]
  [  0.   0.  -0.]]]
(2, 1, 1) (2, 1, 1) (2, 3, 3)


In [0]:
# create a Dot product layer; normalize=True makes it compute cosine similarity
# axes=1 so that the normalization runs over the 300-dim vector axis, not the batch axis
# this similarity operation will be the output of a secondary model
similarity_layer = Dot(axes=1, normalize=True, name='similarity_layer')

similarity = similarity_layer([target, context])

print (similarity.shape)

(?, 1, 1)


In [0]:
# now perform the dot product operation to get a similarity measure
dot = Dot(axes=1, normalize=False, name='dot')
dot_product = dot([target, context])

dot_product = Reshape((1,))(dot_product)

print (dot_product.shape)

# add the sigmoid output layer
output = Dense(units=1, activation='sigmoid')(dot_product)

print (output.shape)

(?, 1)
(?, 1)


We then do another Reshape layer, and take the reshaped dot product value (a single data point/scalar) and apply it to a Keras Dense layer, with the activation function of the layer set to ‘sigmoid’.  This is the output of our Word2Vec Keras architecture.

Next, we need to gather everything into a Keras model and compile it, ready for training:

In [0]:
# create the primary training model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')




Here, we create the functional API based model for our Word2Vec Keras architecture.  What the model definition requires is a specification of the input tensors to the model (the data later fed to them needs to be numpy arrays) and an output tensor – these are supplied as per the previously explained architecture.  We then compile the model, by supplying a loss function that we are going to use (in this case, binary cross entropy i.e. cross entropy when the labels are either 0 or 1) and an optimizer (in this case, rmsprop).  The loss function is applied to the output variable.
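For intuition about the loss, binary cross entropy for a single example can be written out by hand. This is a minimal sketch rather than the Keras implementation; the clipping constant `eps` is my own assumption to keep the logarithm finite.

```python
import math

def binary_crossentropy(label, prediction, eps=1e-7):
    """Cross entropy for a single 0/1 label and a sigmoid output."""
    p = min(max(prediction, eps), 1.0 - eps)  # clip to keep log() finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

# A true context pair (label 1) scored confidently vs. poorly
print(binary_crossentropy(1, 0.9))   # small loss
print(binary_crossentropy(1, 0.1))   # large loss
# A negative sample (label 0) scored near 0 is also cheap
print(binary_crossentropy(0, 0.1))
```

The loss is small when the sigmoid output agrees with the label and grows quickly as the two diverge, which is what drives the embedding vectors of true context pairs towards a large dot product.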

The question now is, if we want to use the similarity operation which we defined in the architecture to allow us to check on how things are progressing during training, how do we access it? We could output it via the model definition (i.e. outputs=[similarity, output]) but then Keras would be trying to apply the loss function and the optimizer to this value during training and this isn’t what we created the operation for.

There is another way, which is quite handy – we create another model:

In [0]:
# create a secondary validation model to run our similarity checks during training

validation_model = Model(inputs=[input_target, input_context], outputs=similarity)

  


We can now use this validation_model to access the similarity operation, and this model will actually share the embedding layer with the primary model.  Note, because this model won’t be involved in training, we don’t have to run a Keras compile operation on it.

Now we are ready to train the model – but first, let’s setup a function to print out the words with the closest similarity to our validation examples (valid_examples).

### The similarity callback

We want to create a “callback” which we can use to figure out which words are closest in similarity to our validation examples, so we can monitor the training progress of our embedding layer.



In [0]:
print (valid_size, valid_examples)

16 [17 22 82 41 56 29 68 19 46  7 81  5  1 78 60 44]


In [0]:
test = np.array([1,2,3,4,5])

(-test).argsort()

array([4, 3, 2, 1, 0])

In [0]:
class SimilarityCallback:
  def run_sim(self):
    # for each of the valid_examples compute similarity with every word and print out the top k
    for i in range(valid_size):
      # get the word
      valid_word = reversed_dictionary[valid_examples[i]]

      top_k = 8  # number of nearest neighbors

      # get all similarity scores
      sim = self._get_sim(valid_examples[i])

      # sort desc and get the top k except the first one
      # first one is the same as the original so ignore that
      nearest = (-sim).argsort()[1:top_k+1]
      log_str = 'Nearest to %s:' % valid_word
      # loop through and print out the top k words
      for k in range(top_k):
        close_word = reversed_dictionary[nearest[k]]
        log_str = '%s %s,' % (log_str, close_word)
      print(log_str)



  @staticmethod
  def _get_sim(valid_word_idx):
    # array of 0s of size vocab_size
    sim = np.zeros((vocabulary_size,))
    # target and context
    in_arr1 = np.zeros((1,))
    in_arr2 = np.zeros((1,))
    # for each word
    for i in range(vocabulary_size):
      # set target word as valid word
      # set context word as current word in the loop
      in_arr1[0,] = valid_word_idx
      in_arr2[0,] = i
      # predict_on_batch: Returns predictions for a single batch of samples.
      # note here we are accessing our secondary model
      out = validation_model.predict_on_batch([in_arr1, in_arr2])
      # store output as a scalar (the prediction comes back as a size-1 array)
      sim[i] = np.squeeze(out)
    return sim

# init the obj
sim_cb = SimilarityCallback()



This class runs through all the valid_examples and gets the similarity score between each validation word and every word in the vocabulary.  The scores come from _get_sim(), which loops over the vocabulary and runs a predict_on_batch() operation on the validation model for each word. This looks up the embedding vectors for the two supplied words (the valid_example and the current vocabulary word) and returns the result of the similarity operation.  The main loop then sorts the scores in descending order, skips the first entry (the validation word itself) and builds a string listing the 8 words with the closest similarity to the validation example.
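The callback above makes one predict_on_batch() call per vocabulary word. As a rough sketch of an alternative (not the tutorial's method), the same ranking could be computed in one step if the embedding weights were available as a NumPy array and the similarity operation were cosine similarity; both are assumptions here. The names `nearest_words` and `toy_embeddings` are hypothetical and the values are made up for illustration.

```python
import numpy as np

def nearest_words(embeddings, word_idx, top_k=8):
    """Return indices of the top_k rows most cosine-similar to row word_idx."""
    # normalise every embedding vector to unit length
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    # one matrix-vector product gives the similarity to every word at once
    sim = unit @ unit[word_idx]
    # exclude the query word itself, then sort in descending order
    sim[word_idx] = -np.inf
    return (-sim).argsort()[:top_k]

# toy embedding matrix: 5 "words", 3 dimensions (made-up values)
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

print(nearest_words(toy_embeddings, 0, top_k=2))
```

Masking the query word with -inf plays the same role as the [1:top_k+1] slice in run_sim(), but does not rely on the word always sorting first.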

The output of this callback will be seen during our training loop, which is presented below.

### The training loop

The main training loop of the model is:



In [0]:
print (couples[:5], labels[:5])

[[470, 4], [5298, 2], [1004, 37], [3818, 8796], [5595, 7923]] [1, 1, 1, 0, 0]


We need to input the target and context words separately, so we split the couples into two sequences.


In [0]:
word_target, word_context = zip(*couples)

print (word_target[:5])

(470, 5298, 1004, 3818, 5595)


In [0]:
print (word_context[:5])

(1705, 4584, 433, 2572, 2)
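The zip(*couples) idiom used above transposes a list of pairs into two tuples. A minimal illustration on a toy list (the variable names `toy_couples`, `toy_targets` and `toy_contexts` are hypothetical):

```python
# toy (target, context) pairs, for illustration only
toy_couples = [[470, 4], [5298, 2], [1004, 37]]

# zip(*pairs) groups all first elements together and all second elements together
toy_targets, toy_contexts = zip(*toy_couples)

print(toy_targets)
print(toy_contexts)
```

Both results are tuples, which is why they are converted to NumPy arrays before training.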


In [0]:
# convert the lists to np arrays
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print (word_target.shape, word_context.shape, len(labels))

(30042736,) (30042736,) 30042736


In [0]:
epochs = 1000000


arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))

for cnt in range(epochs):
  # pick a random int from the entire range of training data
  # (the upper bound of randint is exclusive, so len(labels) covers every index)
  idx = np.random.randint(0, len(labels))
  # set the target, context and label
  arr_1[0,] = word_target[idx]
  arr_2[0,] = word_context[idx]
  arr_3[0,] = labels[idx]
  # note here we are accessing our primary model
  loss = model.train_on_batch(x=[arr_1, arr_2], y=arr_3)
  if cnt % 100 == 0:
    print("Iteration {}, loss={}".format(cnt, loss))
  if cnt % 10000 == 0:
    sim_cb.run_sim()

Iteration 0, loss=0.6197246313095093
Nearest to three: greatest, experienced, ways, camouflage, shares, lifelong, telecommunications, guam,
Nearest to six: contexts, madrid, producer, early, accepts, wto, handle, farmer,



In this loop, we run through the total number of epochs; since each pass trains on a single example, "epochs" here really counts training iterations.  First, we select a random index into our word_target, word_context and labels arrays and place the values in dummy numpy arrays.  Then we supply the inputs ([word_target, word_context]) and output (labels) to the primary model and run a train_on_batch() operation.  This returns the current loss of the model, which we print every 100 iterations. Every 10,000 iterations we also run the similarity check in SimilarityCallback.
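The loop above trains on one example per step. As a hedged sketch of a variation, several random examples could be drawn at once and passed to train_on_batch() as a larger batch. The helper `sample_batch` and the `batch_size` parameter are hypothetical and not part of the original tutorial; the toy arrays reuse the first five couples printed earlier.

```python
import numpy as np

def sample_batch(word_target, word_context, labels, batch_size, rng):
    """Draw batch_size random training examples (with replacement)."""
    # rng.integers excludes the upper bound, so indices 0..len(labels)-1 can all be drawn
    idx = rng.integers(0, len(labels), size=batch_size)
    return word_target[idx], word_context[idx], labels[idx]

rng = np.random.default_rng(0)
toy_target = np.array([470, 5298, 1004, 3818, 5595], dtype="int32")
toy_context = np.array([4, 2, 37, 8796, 7923], dtype="int32")
toy_labels = np.array([1, 1, 1, 0, 0], dtype="int32")

t, c, y = sample_batch(toy_target, toy_context, toy_labels, batch_size=3, rng=rng)
print(t.shape, c.shape, y.shape)
```

The same index array selects from all three arrays, so each target stays aligned with its context word and label.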

Here is an example of the word similarity output for the validation words as we progress through the training iterations. Note that the nearest neighbors of “four” are all other number words:

```
Nearest to into: in, a, is, this, and, his, an, at,
Nearest to up: also, with, a, of, his, it, on, to,
Nearest to used: are, to, as, who, an, be, is, in,
Nearest to people: the, with, that, is, his, are, a, only,
Nearest to known: the, with, and, as, be, by, only, to,
Nearest to often: the, as, in, a, s, was, his, with,
Nearest to see: who, was, in, the, an, a, and, as,
Nearest to a: the, are, as, in, his, for, of, and,
Nearest to they: as, that, the, for, s, is, are, was,
Nearest to four: nine, zero, eight, one, two, six, five, three,
Nearest to over: can, the, region, from, are, which, we, s,
Nearest to from: the, a, in, as, for, his, is, was,
Nearest to so: an, as, s, from, that, was, can, some,
Nearest to world: ii, the, s, of, that, are, was, for,
Nearest to be: the, as, in, that, is, are, they, a,
Nearest to but: is, it, are, only, that, the, of, to,
```

We can simply look up the int encoding of a word and pass it through the trained embedding layer to determine its dense vector representation.