# Skip Gram for Word Embeddings (Word Vectors) Tutorial

This tutorial walks through the Skip-Gram model proposed by [Mikolov et al](https://arxiv.org/pdf/1301.3781.pdf). The focus is on the architecture of the model rather than on efficient estimation.

<b>IMPORTANT</b>: Here, I only discuss training the model with plain gradient descent over the full output layer. As practitioners will know, this is inefficient with large corpora and vocabularies. For that reason, [Mikolov et al](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) released another paper with a technique known as negative sampling, which updates only a small subset of the weights for each training example. That approach is more effective for obtaining word vectors from a real text corpus, and it will be the topic of a subsequent tutorial.

## Example ##
Textual analysis is a booming field with practical applications in social science. The first thing to consider in most textual analyses is the representation of words in a computer program. 

Consider the following sentences:

Example 1: <i>we think uncertainty about unemployment</i><br>
Example 2: <i>uncertainty and fears about inflation</i><br>
Example 3: <i>we think fears about unemployment</i><br>
Example 4: <i>we think fears and uncertainty about inflation and unemployment</i><br>
Example 5: <i>constant negative press covfefe</i><br>

In total, there are 12 unique words: and, about, constant, unemployment, uncertainty, fears, we, negative, inflation, press, think, covfefe

## One-Hot Representation ##
One common way of representing words is with a <b>one-hot encoded</b> vector. This would be a vector of length 12 where each index represents one of the 12 unique words. We can represent each word with such a vector, with a 1 at the index of the word and zero everywhere else. 

For example, we can represent the word <i>fears</i> as follows:
$$
\small{\begin{array}{|cccccccccccc|}
 \text{and}  & \text{uncertainty} & \text{fears}& \text{we}& \text{about}& \text{constant}& \text{unemployment}& \text{negative}& \text{inflation}& \text{press}& \text{think}& \text{covfefe} \\
\hline
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{array}}$$
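As a small illustration, the one-hot vector for <i>fears</i> can be built as below. This is only a sketch: the `vocab` list and the `one_hot` helper are names made up for this example, and the word order simply copies the column order of the table above.

```python
import numpy as np

# Vocabulary in the same column order as the table above (an assumption of this sketch)
vocab = ["and", "uncertainty", "fears", "we", "about", "constant",
         "unemployment", "negative", "inflation", "press", "think", "covfefe"]

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the position of `word`."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

fears = one_hot("fears", vocab)
print(fears)
```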

### Why use word vectors/embeddings? ###
One issue with this representation is how to relate two words. How is the word <i>fears</i> related to the word <i>uncertainty</i>? Both vectors have length 12, with a 1 at a single index and 0 everywhere else. As a result, every pair of distinct words is the same distance apart, so the representation carries no information about similarity.

$$\small{\begin{array}{l|cccccccccccc|}
& \text{and}  & \text{uncertainty} & \text{fears}& \text{we}& \text{about}& \text{constant}& \text{unemployment}& \text{negative}& \text{inflation}& \text{press}& \text{think}& \text{covfefe} \\
\hline
\text{fears} & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\text{uncertainty} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{array}}$$
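The "same distance" point can be checked numerically. The sketch below uses the rows of a 12-by-12 identity matrix as the one-hot vectors and compares every pair; the variable names are chosen for this example only.

```python
import numpy as np
from itertools import combinations

# Each row of the identity matrix is the one-hot vector of one of the 12 words
one_hot_vectors = np.eye(12)

# Euclidean distance and dot product for every pair of distinct words
distances = [np.linalg.norm(a - b) for a, b in combinations(one_hot_vectors, 2)]
dot_products = [a @ b for a, b in combinations(one_hot_vectors, 2)]

print(min(distances), max(distances))        # identical for every pair
print(min(dot_products), max(dot_products))  # always zero: no overlap between words
```

Every pair is exactly $\sqrt{2}$ apart and has a dot product of zero, so <i>fears</i> is no closer to <i>uncertainty</i> than it is to <i>covfefe</i>.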

Word vectors, or [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), address this issue. They map words into an H-dimensional space, $\mathbb{R}^H$, with H chosen by the modeler, in such a way that semantic and syntactic similarities are preserved. The figure below shows these 12 words mapped into a 3-dimensional space. The 3-dimensional vectors were obtained using the Skip-Gram model proposed by Mikolov et al. Notice that the words <i>fears</i> and <i>uncertainty</i> appear alongside each other, as do the words <i>inflation</i> and <i>unemployment</i>, suggesting similar meanings (with respect to their usage in sentences).

<img src="markdown/scatter.png">

## Overview of Skip-Gram Model to Obtain Word Embeddings ##

The Skip-Gram Model aims to capture the link between words in sentences and their context. For example, in Example 1, <i>we think uncertainty about unemployment</i>, consider the word <i>uncertainty</i>. The words in its immediate proximity are <i>think</i> and <i>about</i>. Word embeddings leverage the fact that similar words typically show up in similar contexts.
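To make the idea of a context concrete, here is a minimal sketch that pulls out the neighbours of <i>uncertainty</i> in Example 1 and turns them into (input, output) pairs. The `window` helper and its `size` parameter are hypothetical names used only for this illustration.

```python
sentence = "we think uncertainty about unemployment".split()

def window(words, position, size=1):
    """Return the words within `size` positions of `position`, excluding the word itself."""
    left = words[max(0, position - size):position]
    right = words[position + 1:position + 1 + size]
    return left + right

center = sentence.index("uncertainty")
context = window(sentence, center, size=1)

# One (input, output) training pair per context word
pairs = [("uncertainty", c) for c in context]
print(pairs)
```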

The Skip-Gram Model aims to use as input the one-hot encoded vector for the word <i>uncertainty</i> and output the word <i>think</i> and the word <i>about</i>, each represented as their own one-hot encoded vector. Let's take the input <i>uncertainty</i> and output <i>think</i>. Below is an image of the neural network at work (it might look daunting at first, so I'll try to explain each step below!).

### Example of 1 Training Example: Input <i>uncertainty</i>, Output <i>think</i> ###
<img src="markdown/neural_net.png">

Each step in the neural network is known as a layer. 

### Input Layer ###
Here we define the input, represented by a vector of length 12 as above, with a one at the index of the input word, <i>uncertainty</i>.

### Hidden Layer ###
This is the input word represented in the H-dimensional space, $\mathbb{R}^H$. In our example, we will use a 3-dimensional space.

How do we get the Hidden Layer from the Input Layer, which is just a one-hot encoded vector? We multiply the Input Vector by the transpose of a 3-by-12 matrix <b>we will call the $\mathbf{V}$ matrix</b>.

$x*\mathbf{V}^T=\begin{array}{cccccc}[0&1&0&...&0&0]\end{array}*\left(  \begin{array}{cccc}
v_{1,1} & v_{1,2} & ... & v_{1,12} \\
v_{2,1} & v_{2,2} & ... & v_{2,12} \\
v_{3,1} & v_{3,2} & ... & v_{3,12} \end{array}\right)^T = \begin{array}{ccc}[v_{1,2}&v_{2,2}&v_{3,2}]\end{array} $

Thus, the hidden layer $\begin{array}{ccc}[x_1^h&x_2^h&x_3^h]\end{array} $= $\begin{array}{ccc}[v_{1,2}&v_{2,2}&v_{3,2}]\end{array} $

<b>NOTE</b> that since $\mathbf{V}$ is H-by-V in size, each column represents one of the words in our vocabulary. Since <i>uncertainty</i> is the second index in our input vector, the second column of $\mathbf{V}$ represents <i>uncertainty</i> in our H-dimensional space. <b>This H-dimensional vector is the word embedding we are looking for</b>.

### Output Layer ###

We multiply the hidden layer, $\begin{array}{ccc}[v_{1,2}&v_{2,2}&v_{3,2}]\end{array}$, by the transpose of a 12-by-3 matrix <b>we will call the $\mathbf{W}$ matrix</b>.

$x^h\mathbf{W}^T=\begin{array}{ccc}[v_{1,2}&v_{2,2}&v_{3,2}]\end{array}*\left(  \begin{array}{ccc}
w_{1,1} & w_{1,2} &w_{1,3} \\
w_{2,1} & w_{2,2} &w_{2,3} \\
... \\
w_{12,1} & w_{12,2} &w_{12,3} \end{array}\right)^T = \begin{array}{cccc}[\sum_{n=1}^H v_{n,2}*w_{1,n} ;&\sum_{n=1}^H v_{n,2}*w_{2,n};&...&\sum_{n=1}^H v_{n,2}*w_{12,n}]\end{array} $

Thus the output layer $\begin{array}{cccc}[x_1^o&x_2^o&...&x_{12}^o]\end{array}=\begin{array}{cccc}[\sum_{i=1}^H v_{i,2}*w_{1,i} ;&\sum_{i=1}^H v_{i,2}*w_{2,i};&...&\sum_{i=1}^H v_{i,2}*w_{12,i}]\end{array}$
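The two matrix products can be sketched in NumPy as follows. The matrices here are random stand-ins rather than trained weights, the seed is arbitrary, and <i>uncertainty</i> is placed at the second position of the vocabulary, as in the one-hot table shown earlier (zero-based index 1 in Python).

```python
import numpy as np

rng = np.random.default_rng(0)             # arbitrary seed, for reproducibility only
vocab_size, H = 12, 3
V = rng.standard_normal((H, vocab_size))   # H-by-V matrix: one column per word
W = rng.standard_normal((vocab_size, H))   # V-by-H matrix: one row per word

# One-hot input with a 1 at the second position (zero-based index 1)
x = np.zeros(vocab_size)
x[1] = 1.0

x_h = x @ V.T    # hidden layer: simply picks out one column of V
x_o = x_h @ W.T  # output layer: one score per word in the vocabulary

print(x_h.shape, x_o.shape)
```

Multiplying by a one-hot vector is just a column lookup, which is why the columns of $\mathbf{V}$ can be read directly as the word embeddings.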

### Output Probability ###

Because we want to predict the words in the context of <i>uncertainty</i> in the sentence, i.e. <i>think</i> and <i>about</i>, the output scores need to be comparable to a one-hot target. One way is to normalize the values to probabilities by applying the [softmax function](https://en.wikipedia.org/wiki/Softmax_function).

$\begin{array}{cccc}[\frac{exp(x_1^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_2^o)}{\sum_{n=1}^{12} exp(x_n^o)}&...&\frac{exp(x_{12}^o)}{\sum_{n=1}^{12} exp(x_n^o)}]\end{array}$

We know we wanted to predict the word <i>think</i> in this example, therefore we have the prediction and the target value.

$$\small{\begin{array}{l|cccccccccccc|}
& \text{and}  & \text{uncertainty} & \text{fears}& \text{we}& \text{about}& \text{constant}& \text{unemployment}& \text{negative}& \text{inflation}& \text{press}& \text{think}& \text{covfefe} \\
\hline
\text{Input} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\hline
\text{Output} & \frac{exp(x_1^o)}{\sum_{n=1}^{12} exp(x_n^o)} & \frac{exp(x_2^o)}{\sum_{n=1}^{12} exp(x_n^o)}& \frac{exp(x_3^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_4^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_5^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_6^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_7^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_8^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_9^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_{10}^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_{11}^o)}{\sum_{n=1}^{12} exp(x_n^o)}&\frac{exp(x_{12}^o)}{\sum_{n=1}^{12} exp(x_n^o)}\\
\hline
\text{Target} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\end{array}}$$
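The normalization in the Output row, each exponentiated score divided by the sum of all exponentiated scores, is commonly called the softmax. A minimal sketch follows; the scores in `x_o` are made-up numbers, not values from a trained model, and the target puts <i>think</i> at the eleventh column as in the table.

```python
import numpy as np

def softmax(scores):
    """Exponentiate each score and divide by the sum, so the result sums to 1."""
    # Subtracting the maximum does not change the result but avoids overflow in exp
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Made-up output scores for the 12 words (illustrative values only)
x_o = np.array([0.2, -1.0, 0.5, 0.1, 1.3, -0.7, 0.0, -0.3, 0.4, -1.2, 2.0, -0.5])
a_o = softmax(x_o)

# One-hot target for "think" (eleventh column, zero-based index 10)
target = np.zeros(12)
target[10] = 1.0

print(a_o.round(3))
print(a_o - target)   # the prediction error that training will try to shrink
```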

## Estimating V and W ##

Now I will discuss exactly how we can use these training examples to train the neural network and obtain good estimates of the word embeddings, matrix $\mathbf{V}$.

To make the estimation derivation clear, I will clarify some of the notation.

$f()$ = softmax function <br>
$x^i$ = Input Vector <br>
$x^h$ = Hidden Layer Vector <br>
$x^o$ = Output Vector <br>
$a^o$ = Output Vector normalized with the softmax function $f()$ <br>
$t^o$ = Target Vector <br>
$V$ = total number of unique words <br>
$H$ = size of hidden dimension <br>

Thus the neural network can be summarized as:

Input Layer (size 1-by-V): $x^i$  <br>
Hidden Layer (size 1-by-H): $x^h=x^i\mathbf{V^T}$ <br>
Output Layer (size 1-by-V): $x^o = x^h \mathbf{W^T}$ <br>
Output Layer Normalized (size 1-by-V): $a^o = f(x^o)$ <br>

The loss function we will use to evaluate the model is:

$$
E_i = \frac{1}{2} \sum_{m=-M}^{M} (a^o - t_{m}^o)^2
$$

The loss function in the Skip-Gram model considers one word, $i$, and forward propagates it to obtain $a^o$ for that specific word. There are up to $M$ words to the left and $M$ words to the right of the word in the sentence, each with its own one-hot target $t_m^o$ (the term $m=0$, the word itself, is skipped), and the square is summed over all entries of the vector. We want the error to reflect the discrepancy between the model's prediction for the word $i$ and each of the up to $2*M$ words to its left and right.
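As a worked example of this loss, the sketch below evaluates it for a deliberately uninformative prediction (equal probability 1/12 for every word) and two context words. The helper name `skipgram_loss` and the chosen indices are assumptions made for illustration.

```python
import numpy as np

def skipgram_loss(a_o, targets):
    """Half the sum, over context words, of the squared differences a_o - t_m."""
    return 0.5 * sum(np.sum((a_o - t) ** 2) for t in targets)

# A uniform prediction over the 12 words
a_o = np.full(12, 1 / 12)

# One-hot targets for two context words (indices picked for illustration)
identity = np.eye(12)
targets = [identity[10], identity[4]]

loss = skipgram_loss(a_o, targets)
print(loss)
```

For each target the squared error is $(1-\frac{1}{12})^2 + 11(\frac{1}{12})^2 = \frac{11}{12}$, so with two context words the loss is $\frac{1}{2}\cdot 2\cdot\frac{11}{12} = \frac{11}{12}$.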

First, let us understand how the error function $E$ changes with respect to the $\mathbf{W}$ matrix.

### Updating W Matrix ###

$
\begin{align*}
\frac{\partial E}{\partial \mathbf{W}} &= \sum_{m=-M}^{M}(a^o-t_m^o)\frac{\partial a^o}{\partial \mathbf{W}} = \sum_{m=-M}^{M}(a^o-t_m^o)\frac{\partial f(x^h \mathbf{W^T})}{\partial \mathbf{W}}= \sum_{m=-M}^{M}[(a^o-t_m^o)\circ f\prime(x^h \mathbf{W^T})]\frac{\partial (x^h \mathbf{W^T})}{\partial \mathbf{W}} \\
&=\sum_{m=-M}^{M}[(a^o-t_m^o)\circ f\prime(x^h \mathbf{W^T})]x^{hT} = \sum_{m=-M}^{M}\delta_m^2 x^{hT} 
\end{align*}
$

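where $\circ$ is the [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) and $\delta_m^2 = [(a^o-t_m^o)\circ f\prime(x^h \mathbf{W^T})]$. Here $f\prime$ is applied elementwise, $f\prime(x^o) = a^o\circ(1-a^o)$, which keeps only the diagonal terms of the derivative of $f$. The product $\delta_m^2 x^{hT}$ denotes the outer product of the length-V vector $\delta_m^2$ with the length-H vector $x^h$, so the gradient has the same V-by-H shape as $\mathbf{W}$.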

Thus, we will update W as follows:

$
\begin{align*}
\mathbf{W}^{new} = \mathbf{W}^{old} - \alpha [\sum_{m=-M}^{M}\delta_m^2 x^{hT}] 
\end{align*}
$
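A single update of $\mathbf{W}$ can be sketched as below. The weights and hidden vector are random stand-ins, the learning rate `alpha` and the two context indices are arbitrary choices for this example, and $f\prime$ is taken elementwise as $a^o\circ(1-a^o)$.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
vocab_size, H, alpha = 12, 3, 0.025
W = rng.standard_normal((vocab_size, H))  # stand-in for the current W
x_h = rng.standard_normal(H)              # stand-in for the hidden layer

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

a_o = softmax(W @ x_h)

# One-hot targets for two context words, one per row (indices are illustrative)
targets = np.eye(vocab_size)[[10, 4]]

# delta summed over the context words: (a_o - t_m) * f'(x_o), elementwise
delta = ((a_o - targets) * (a_o * (1 - a_o))).sum(axis=0)

# Outer product gives a V-by-H gradient, the same shape as W
grad_W = np.outer(delta, x_h)
W_new = W - alpha * grad_W
```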

### Updating V Matrix ###

$
\begin{align*}
\frac{\partial E}{\partial \mathbf{V}} &= \sum_{m=-M}^{M}(a^o-t_m^o)\frac{\partial a^o}{\partial \mathbf{V}} = \sum_{m=-M}^{M}(a^o-t_m^o)\frac{\partial f(x^h \mathbf{W^T})}{\partial \mathbf{V}}= \sum_{m=-M}^{M}[(a^o-t_m^o)\circ f\prime(x^h \mathbf{W^T})]\frac{\partial (x^h \mathbf{W^T})}{\partial \mathbf{V}} \\
&=\sum_{m=-M}^{M}\delta_m^2 \frac{\partial (x^h \mathbf{W^T})}{\partial \mathbf{V}}=\sum_{m=-M}^{M} \mathbf{W^T} \delta_m^2 \frac{\partial (x^h)}{\partial \mathbf{V}}=\sum_{m=-M}^{M} \mathbf{W^T} \delta_m^2 \frac{\partial(x^i\mathbf{V^T})}{\partial \mathbf{V}} = \sum_{m=-M}^{M} \mathbf{W^T} \delta_m^2 x^{iT}
\end{align*}
$

Thus, we will update V as follows:

$
\begin{align*}
\mathbf{V}^{new} = \mathbf{V}^{old} - \alpha [\sum_{m=-M}^{M} \mathbf{W^T} \delta_m^2 x^{iT}] 
\end{align*}
$
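The corresponding update of $\mathbf{V}$ is sketched below with random stand-ins for $\mathbf{W}$, $\mathbf{V}$ and the summed $\delta$; the seed, `alpha`, and the position of the input word are assumptions for this example. Because the input is one-hot, only the column of $\mathbf{V}$ belonging to the input word changes.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary seed
vocab_size, H, alpha = 12, 3, 0.025
V = rng.standard_normal((H, vocab_size))  # stand-in for the current V
W = rng.standard_normal((vocab_size, H))  # stand-in for the current W
delta = rng.standard_normal(vocab_size)   # stand-in for the summed delta

# One-hot input for the word at zero-based index 1
x_i = np.zeros(vocab_size)
x_i[1] = 1.0

# W^T delta is a length-H vector; its outer product with x_i is H-by-V, like V
grad_V = np.outer(W.T @ delta, x_i)
V_new = V - alpha * grad_V

# Only the embedding of the input word moves
changed_columns = np.flatnonzero(np.any(V_new != V, axis=0))
print(changed_columns)
```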

## Implementation in Python ##

First, let's load the relevant libraries and create our documents.

```python
import itertools

import numpy as np
import matplotlib.pyplot as plt
# Importing Axes3D enables projection="3d" on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

docs = ["we think uncertainty about unemployment",
        "uncertainty and fears about inflation",
        "we think fears about unemployment",
        "we think fears and uncertainty about inflation and unemployment",
        "constant negative press covfefe"]
```

Next, we would like to find the unique words in our documents and then generate a one-hot encoded vector of length 12 for each word.

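```python
# Split each document into a list of words (a list, so it can be reused later)
docs_split = [doc.split() for doc in docs]
docs_words = list(itertools.chain(*docs_split))
# np.unique returns the distinct words in sorted order
words = np.unique(docs_words)

# Row i of the identity matrix is the one-hot vector of words[i]
vectors = np.eye(words.shape[0])
```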

Now we will define the window size of the context, $M$. That is, how many words to look left and right for each word in order to train the matrices. We will also define the matrices $\mathbf{V}$ and $\mathbf{W}$ as well as the number of dimensions in the hidden layer.

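```python
M = 2  # window size: words to look at on each side
H = 3  # dimension of the hidden layer

# Random starting values for both weight matrices
V = np.random.randn(H, words.shape[0])
W = np.random.randn(words.shape[0], H)
```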

For each word in each sentence, we need to know the context. We will create a function which takes in a sentence, in the form of a list of words, and returns up to $M$ words to the left and right of each word in the sentence (fewer near the start or end of the sentence).

```python
def get_context(word_list):
    samples = []
    for word_ind in range(len(word_list)):
        # Indices inside the window, clipped to the sentence boundaries
        context_inds = list(range(max(0, word_ind - M),
                                  min(word_ind + M + 1, len(word_list))))
        # The word itself is not part of its own context
        context_inds.remove(word_ind)
        context = [word_list[el] for el in context_inds]
        samples.append((word_list[word_ind], context))
    return samples
```

Now let's generate all the training examples by applying <b>get_context</b> to each sentence in <i>docs_split</i> and then combining the results into one list using itertools.

```python
# One (word, context) tuple per word occurrence, flattened across sentences
training = list(itertools.chain(*map(get_context, docs_split)))
```
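To see what the training examples look like, here is a self-contained sketch that rebuilds the list from the five sentences and prints a few entries. It restates the context function in a compact form, so `M`, `docs` and `get_context` are redefined here only to keep the snippet runnable on its own.

```python
import itertools

M = 2  # same window size as above

def get_context(word_list):
    """Pair each word with the (up to) 2*M words around it."""
    samples = []
    for i, word in enumerate(word_list):
        inds = [j for j in range(max(0, i - M), min(i + M + 1, len(word_list)))
                if j != i]
        samples.append((word, [word_list[j] for j in inds]))
    return samples

docs = ["we think uncertainty about unemployment",
        "uncertainty and fears about inflation",
        "we think fears about unemployment",
        "we think fears and uncertainty about inflation and unemployment",
        "constant negative press covfefe"]

training = list(itertools.chain.from_iterable(get_context(d.split()) for d in docs))

print(len(training))   # one example per word occurrence
for example in training[:3]:
    print(example)
```

Words at the edge of a sentence, such as the first <i>we</i>, get a shorter context than words in the middle.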

Next, we will get to the meat of the code. We will run 10,000 <b>epochs</b>, which means looping through each word in each sentence 10,000 times. For each word, we will <b>forward propagate</b> the one-hot encoded vector and obtain a probability distribution over all 12 words in our vocabulary. Then, for each word in the context, we compute the error. Lastly, we sum up the errors and update the $\mathbf{V}$ and $\mathbf{W}$ matrices.

Along the way, I will track the log likelihood, to check that training is increasing the probabilities.

Also, I use a linearly decaying learning rate, so that with each epoch it gets linearly closer to 0. I chose an initial value of 0.025 (2.5%), but this is a <b>hyperparameter</b> (along with the dimension H of the hidden layer) that should be tuned to get the best results.
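The learning rate schedule can be written as a small function of the epoch. This is a sketch with a hypothetical helper, `rate_at`, that reproduces the effect of subtracting a fixed `discount` after every epoch.

```python
epochs = 10000
initial_rate = 0.025
discount = initial_rate / epochs   # amount removed after each epoch

def rate_at(epoch):
    """Learning rate used during the given epoch (0-based)."""
    return initial_rate - epoch * discount

for epoch in (0, 2500, 5000, 9999):
    print(epoch, rate_at(epoch))
```

The rate starts at 0.025, is half that value midway through training, and is almost zero in the final epoch.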

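```python
def sigma(vector, deriv=False):
    """Softmax of a 1-D array; with deriv=True, the elementwise term s*(1-s)."""
    if deriv:
        s = sigma(vector)
        return s * (1 - s)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    exps = np.exp(vector - np.max(vector))
    return exps / exps.sum()

log_likelihood = np.array([])
epochs = 10000
learning_rate = 0.025
discount = learning_rate / epochs

for epoch in range(epochs):
    likelihood = 0
    for example in training:
        # Forward propagate the input word
        input_index = np.where(words == example[0])[0][0]
        l_input = vectors[input_index]
        l_hidden = np.dot(V, l_input)
        l_output = np.dot(W, l_hidden)
        l_output_a = sigma(l_output)
        errors = np.zeros(words.shape[0])
        # Sum the errors over every word in the context window
        for context in example[1]:
            output_index = np.where(words == context)[0][0]
            l_target = vectors[output_index]
            errors += (l_output_a - l_target)
            # Log probability assigned to the observed context word
            likelihood += np.log(l_output_a[output_index])
        # Compute both gradients before updating, so V uses the same W as the forward pass
        delta2 = errors * sigma(l_output, True)
        grad_W = np.outer(delta2, l_hidden)
        grad_V = np.outer(np.dot(W.T, delta2), l_input)
        W -= learning_rate * grad_W
        V -= learning_rate * grad_V
    log_likelihood = np.append(log_likelihood, likelihood)
    # Linear decay of the learning rate
    learning_rate -= discount
```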

Let's plot the word embeddings from the matrix $\mathbf{V}$ along with the log likelihood. Since the $\mathbf{V}$ and $\mathbf{W}$ matrices were initialized with random draws from a normal distribution, you will get different results on each run, though the log likelihood and the proximity of words should be broadly similar.

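```python
fig = plt.figure()
# Left panel: 3-D scatter of the word embeddings (the columns of V)
ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.scatter(V[0], V[1], V[2], alpha=0.3)
for i, txt in enumerate(words):
    ax.text(V[0][i], V[1][i], V[2][i], txt, size=10)
# Right panel: log likelihood after each epoch
ax = fig.add_subplot(1, 2, 2)
ax.plot(log_likelihood)
plt.show()
```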

<img src="markdown/scatter.png">
<img src="markdown/ll.png">
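Besides reading the scatter plot, proximity can be measured with cosine similarity between the columns of $\mathbf{V}$. The sketch below is an illustration only: `most_similar` is a hypothetical helper, and `demo_V` is a random stand-in, so its output says nothing about the trained embeddings. After training, the real `words` and `V` can be passed in instead.

```python
import numpy as np

def most_similar(word, words, V, top_n=3):
    """Return the top_n words whose embedding columns have the highest cosine similarity."""
    words = list(words)
    idx = words.index(word)
    # Scale every column of V to unit length so dot products are cosine similarities
    unit = V / np.linalg.norm(V, axis=0, keepdims=True)
    sims = unit.T @ unit[:, idx]
    # Sort from most to least similar, skipping the query word itself
    order = [j for j in np.argsort(-sims) if j != idx]
    return [(words[j], float(sims[j])) for j in order[:top_n]]

# Stand-in data: 12 words and a random 3-by-12 matrix (not trained embeddings)
demo_words = ["about", "and", "constant", "covfefe", "fears", "inflation",
              "negative", "press", "think", "uncertainty", "unemployment", "we"]
demo_V = np.random.default_rng(0).standard_normal((3, len(demo_words)))

neighbours = most_similar("fears", demo_words, demo_V, top_n=3)
print(neighbours)
```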

```python
#################################
### Author: Paul Soto         ###
###         paul.soto@upf.edu ###
#                               #
# This file is a script to run ##
# a Skip Gram Model using a toy #
# sample of documents. While ####
# negative sampling should be ###
# used to train the neural ######
# network, I use gradient #######
# descent to focus on the #######
# architecture rather than the ##
# optimal estimation ############
#################################

import itertools

import numpy as np
import matplotlib.pyplot as plt
# Importing Axes3D enables projection="3d" on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

# M is the number of words to look (on one side) of each word for the context
M = 2
# H is the dimension of the hidden layer
H = 3

def sigma(vector, deriv=False):
    """
    This function returns a vector normalized with the softmax function

    vector: numpy array of real values
    deriv: if True, return the elementwise term s*(1-s) used in the updates
    """
    if deriv:
        s = sigma(vector)
        return s * (1 - s)
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    exps = np.exp(vector - np.max(vector))
    return exps / exps.sum()

def get_context(word_list):
    """
    This function returns the (up to) 2*M words in the context of each
    word in the list

    word_list: List of words
    M: global variable of the window size
    """
    samples = []
    for word_ind in range(len(word_list)):
        # Indices inside the window, clipped to the sentence boundaries
        context_inds = list(range(max(0, word_ind - M),
                                  min(word_ind + M + 1, len(word_list))))
        # The word itself is not part of its own context
        context_inds.remove(word_ind)
        context = [word_list[el] for el in context_inds]
        samples.append((word_list[word_ind], context))
    return samples

docs = ["we think uncertainty about unemployment",
        "uncertainty and fears about inflation",
        "we think fears about unemployment",
        "we think fears and uncertainty about inflation and unemployment",
        "constant negative press covfefe"]

# Split each document into a list of words (a list, so it can be reused below)
docs_split = [doc.split() for doc in docs]
docs_words = list(itertools.chain(*docs_split))

# Find unique words across all documents
words = np.unique(docs_words)

# Generate a one hot encoded vector for each unique word
vectors = np.eye(words.shape[0])

# Initiate randomly V and W matrices
V = np.random.randn(H, words.shape[0])
W = np.random.randn(words.shape[0], H)

# Create list of all training examples
training = list(itertools.chain(*map(get_context, docs_split)))

log_likelihood = np.array([])
epochs = 10000
learning_rate = 0.025
discount = learning_rate / epochs

for epoch in range(epochs):
    likelihood = 0
    for example in training:
        # Forward propagate word
        input_index = np.where(words == example[0])[0][0]
        l_input = vectors[input_index]
        l_hidden = np.dot(V, l_input)
        l_output = np.dot(W, l_hidden)
        l_output_a = sigma(l_output)
        errors = np.zeros(words.shape[0])
        # Compute the error for each word in context window
        for context in example[1]:
            output_index = np.where(words == context)[0][0]
            l_target = vectors[output_index]
            errors += (l_output_a - l_target)
            # Log probability assigned to the observed context word
            likelihood += np.log(l_output_a[output_index])
        # Compute both gradients first, then update the V and W matrices
        delta2 = errors * sigma(l_output, True)
        grad_W = np.outer(delta2, l_hidden)
        grad_V = np.outer(np.dot(W.T, delta2), l_input)
        W -= learning_rate * grad_W
        V -= learning_rate * grad_V
    log_likelihood = np.append(log_likelihood, likelihood)
    # Linear decay of the learning rate
    learning_rate -= discount

# Plot out word embeddings and log-likelihood function
fig = plt.figure()
ax = fig.add_subplot(1, 2, 1, projection="3d")
ax.scatter(V[0], V[1], V[2], alpha=0.3)
for i, txt in enumerate(words):
    ax.text(V[0][i], V[1][i], V[2][i], txt, size=10)
ax = fig.add_subplot(1, 2, 2)
ax.plot(log_likelihood)
plt.show()
```

I hope this was useful for you. Please send any questions or feedback to my email, paul.soto@upf.edu, or message me on GitHub.