## Introduction to Word2Vec
#### The model Word2Vec is a simple word embedding neural network with a single hidden layer, based on the study of Le & Mikolov (2014).

In [1]:
import seaborn as sb
import numpy as np
import pandas as pd
%matplotlib inline

Word embeddings from plain text.

The model assumes the *Distributional Hypothesis* that words are characterized by words they hang out with. this idea is used to estimate the probability of two words occurring near each other.

#### Softmax Regression

Word2Vec is a very simple neural network with a single hidden layer.

In [2]:
sentences = ['the king loves the queen', 'the queen loves the king',
             'the dwarf hates the king', 'the queen hates the dwarf',
             'the dwarf poisons the king', 'the dwarf poisons the queen']

Bag-of-Words

In [8]:
from collections import defaultdict

def Vocabulary():
    dictionary = defaultdict()
    dictionary.default_factory = lambda: len(dictionary) # len of dictionary gives a unique integer to each new word
    return dictionary

def docs2bow(docs, dictionary):
    """Transforms a list of strings into a list of lists where 
    each unique item is converted into a unique integer."""
    for doc in docs:
        yield [dictionary[word] for word in doc.split()] # returns a generator

In [9]:
vocabulary = Vocabulary()
sentences_bow = list(docs2bow(sentences, vocabulary))
sentences_bow

[[0, 1, 2, 0, 3],
 [0, 3, 2, 0, 1],
 [0, 4, 5, 0, 1],
 [0, 3, 5, 0, 4],
 [0, 4, 6, 0, 1],
 [0, 4, 6, 0, 3]]

In [12]:
vocabulary

defaultdict(<function __main__.<lambda>>,
            {'dwarf': 4,
             'hates': 5,
             'king': 1,
             'loves': 2,
             'poisons': 6,
             'queen': 3,
             'the': 0})

In [19]:
V = len(vocabulary)
N = 3 # number of nodes in Hidden Layer

# random initialization of weights between [-0.5 , 0.5] normalized by number of nodes mapping to.
WI =(np.random.random((V, N)) - 0.5) / N # input weights
WO =(np.random.random((N, V)) - 0.5) / V # output weights

In [20]:
print WI

[[ 0.06259958 -0.04045636 -0.00231579]
 [-0.14813526  0.03140125  0.07526604]
 [-0.09669949  0.02225064  0.07541938]
 [-0.0804188  -0.06866706  0.12916966]
 [-0.06829591 -0.07356343 -0.16079142]
 [ 0.07614897  0.05643023 -0.10641049]
 [ 0.10718903  0.1422684   0.08797442]]


In [26]:
print WO

[[-0.04535321  0.05662719  0.00660368  0.03789646  0.0420766  -0.03693758
  -0.02067386]
 [-0.06374536  0.00364524  0.01388948  0.06994418 -0.02437946  0.02687771
  -0.03953408]
 [ 0.06046383  0.06506942  0.03423023  0.06397297 -0.0050463  -0.00836647
  -0.03639222]]


In [22]:
# input weights associated with dwarf
WI[vocabulary['dwarf']]

array([-0.06829591, -0.07356343, -0.16079142])

In [25]:
# output weights associated with hates
WO.T[vocabulary['hates']]

array([-0.03693758,  0.02687771, -0.00836647])

Using the dot product $W_I \cdot W'^T_O$ we compute the distance between the input word *dwarf* and the output word *hates*:

In [24]:
WI[vocabulary['dwarf']].dot(WO.T[vocabulary['hates']])

0.0018907259741998627

Now using softmax regression, we can compute the posterior probability $P(w_O|w_I)$:

$$ P(w_O|w_I) = y_i = \frac{exp(W_I \cdot W'^T_O)}{\sum^V_{j=1} exp(W_I \cdot W'^T_j)} $$

In [28]:
numerator = np.exp(WI[vocabulary['dwarf']].dot(WO.T[vocabulary['hates']]))
denominator = sum(np.exp(WI[vocabulary['dwarf']].dot(WO.T[vocabulary[word]])) for word in vocabulary)

P_hates_dwarf = numerator / denominator
P_hates_dwarf

0.1437309461247088

### Updating the hidden-to-output layer weights

loss function to minimize is: $E = -\log P(w_O|w_I)$

The error is computed with $t_j - P(w_O|w_I) = e_j$, where $t_j$ is 1 if $w_j$ is the actual output word, otherwise $t_j$ is 0.

To obtain the gradient on the hidden-to-output weights, we compute $e_j \cdot h_i$, where $h_i$ is a copy of the vector corresponding to the input word (only holds with a context of a single word). Finally, using stochastic gradient descent, with a learning rate $\nu$ we obtain the weight update equation for the hidden to output layer weights:

$$W'^{T (t)}_j = W'^{T (t-1)}_j - \nu \cdot e_j \cdot h_j$$.


In [43]:
target_word = 'king' 
input_word = 'queen' # context word
learning_rate = 1.0

denominator = sum(np.exp(WI[vocabulary[input_word]].dot(WO.T[vocabulary[word]])) for word in vocabulary)

for word in vocabulary:
    
    numerator = np.exp(WI[vocabulary[input_word]].dot(WO.T[vocabulary[word]]))
    P_word_queen = numerator / denominator # posterior probability P(word | queen)
    
    if word == target_word:
        t = 1
    else:
        t = 0
    
    err = t - P_word_queen # error
    
    # weight update using stochastic gradient descent
    WO.T[vocabulary[word]] -= learning_rate * err * WI[vocabulary[input_word]]
    # update brings word vector closer in the feature space if word = target, and push them apart otherwise.
    
print WO

[[-0.09218151  0.33392974 -0.03962089 -0.0082109  -0.00390225 -0.08302784
  -0.06674708]
 [-0.10373055  0.24042508 -0.02558022  0.03057457 -0.06363934 -0.0124773
  -0.07887454]
 [ 0.13568001 -0.38033732  0.10847671  0.13803118  0.06880549  0.06566426
   0.03761114]]


### Updating the input-to-hidden layer weights

backpropagate the prediction errors to the input-to-hidden weights