# The Skip-Gram Model
Dataset: movie reviews, 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz' <br>
From this dataset we will fit the skip-gram model of the Word2Vec algorithm
(https://arxiv.org/abs/1301.3781).
<br>Unlike bag-of-words approaches, Word2Vec takes into account which words appear near each other.

Skip-gram builds word embeddings that encode this local context information.

In [23]:
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import random
import os
import string
import requests
import collections
import io
import tarfile
import gzip
from nltk.corpus import stopwords
from tensorflow.python.framework import ops
import sys
sys.path.insert(0, './utils')
import text_helpers
ops.reset_default_graph()

In [24]:
sess = tf.Session()

In [25]:
batch_size = 50         # How many sets of words to train on at once.
embedding_size = 200    # The embedding size of each word to train.
vocabulary_size = 10000 # How many words to keep in the vocabulary.
generations = 100000    # How many training iterations to run.
print_loss_every = 500  # Print the loss every so many iterations

num_sampled = int(batch_size/2) # Number of negative examples to sample.
window_size = 2         # How many words in skip-gram to consider left and right.

In [26]:
# Declare stop words
stops = stopwords.words('english')

# How often (in iterations) to print the nearest neighbors of the test words
print_valid_every = 2000

# Five test words; ideally their nearest neighbors would be related words.
# They are converted to vocabulary indices once the dictionary is built.
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad']

### Loading data

In [27]:
texts, target = text_helpers.load_movie_data()

### Normalizing the dataset

In [28]:
texts = text_helpers.normalize_text(texts, stops)

# Texts must contain at least 3 words
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

### Build the vocabulary
Build a dictionary from word counts that maps each of the most frequent words (up to `vocabulary_size`) to an integer index; all remaining words are assigned to 'RARE'.

In [29]:
# Build our data set and dictionaries
word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_helpers.text_to_numbers(texts, word_dictionary)

# Get validation word keys
valid_examples = [word_dictionary[x] for x in valid_words]
valid_examples

[1490, 28, 940, 205, 359]
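
The helpers `text_helpers.build_dictionary` and `text_helpers.text_to_numbers` are not shown in this notebook. As a rough illustration of the idea only, the sketch below uses hypothetical functions (`build_dictionary_sketch`, `text_to_numbers_sketch`) and assumes that index 0 is reserved for 'RARE'; the real helpers may differ in these details.

```python
import collections

def build_dictionary_sketch(sentences, vocabulary_size):
    """Map the most frequent words to integer ids; id 0 is assumed to mean 'RARE'."""
    words = [w for s in sentences for w in s.split()]
    # Start with the placeholder entry for out-of-vocabulary words.
    count = [("RARE", -1)]
    # Keep the (vocabulary_size - 1) most frequent words.
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    return {word: idx for idx, (word, _) in enumerate(count)}

def text_to_numbers_sketch(sentences, word_dict):
    """Replace each word by its id, falling back to the 'RARE' id (0)."""
    return [[word_dict.get(w, 0) for w in s.split()] for s in sentences]

# Toy corpus: 'good' occurs 3 times, 'movie' twice, everything else once.
sentences = ["good movie good plot", "bad movie good"]
word_dict = build_dictionary_sketch(sentences, vocabulary_size=3)
ids = text_to_numbers_sketch(sentences, word_dict)
```

With a vocabulary of size 3, only 'good' and 'movie' receive their own ids; 'plot' and 'bad' both collapse to the 'RARE' id.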

### Generation of batch data
The training data consists of pairs of words: the word at the center of a window, and one of the other words in that window.

Example: sentence "the cat in the hat", window size 2

center word (input): "in"
context words (targets): ["the", "cat", "the", "hat"]
input-target pairs: ("in", "the"), ("in", "cat"), ("in", "the"), ("in", "hat")
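
The pairing step can be sketched in a few lines of plain Python. The function `skipgram_pairs` below is a hypothetical stand-in, not the notebook's `text_helpers.generate_batch_data` (which is not shown here and also handles random sampling and batching). It pairs each word with every other word at most `window_size` positions away.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every position in `tokens`."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not paired with itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat in the hat".split()
pairs = skipgram_pairs(tokens, window_size=2)

# Pairs whose center word is "in" (the middle of the sentence).
in_pairs = [p for p in pairs if p[0] == "in"]
# Pairs whose center word is "hat" (the window is clipped on the right).
hat_pairs = [p for p in pairs if p[0] == "hat"]
```

Words near the start or end of a sentence produce fewer pairs because their window is cut off at the boundary.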

### Model Creation

In [30]:
# Define Embeddings:
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# NCE loss parameters (explanation below)
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                               stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up the word embeddings:
# maps each word index in the batch to its row of the embedding matrix
# (equivalent to multiplying a one-hot vector by that matrix).
embed = tf.nn.embedding_lookup(embeddings, x_inputs)
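
What the embedding lookup does can be illustrated with NumPy, using toy sizes instead of the notebook's 10000 x 200 matrix. This is only an analogy for `tf.nn.embedding_lookup` on a single embedding matrix; the names `emb`, `word_ids`, and `looked_up` are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 6, 4                        # toy sizes for illustration
emb = rng.uniform(-1.0, 1.0, size=(vocab, dim))

word_ids = np.array([3, 0, 3])           # a toy batch of word indices
looked_up = emb[word_ids]                # row gather, shape (3, 4)

# Equivalent but wasteful formulation: one-hot rows times the embedding matrix.
one_hot = np.eye(vocab)[word_ids]
via_matmul = one_hot @ emb
```

Both formulations give the same result; the lookup simply avoids building the one-hot matrix.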

### Loss function
A softmax could be used, but with 10000 target categories the one-hot targets are very sparse and the full softmax is costly to evaluate at every step, which makes the model hard to train. <br>

To tackle this, a loss function called noise-contrastive estimation (NCE) can be used.<br>
The NCE loss turns the problem into a binary prediction: distinguishing the true word
class from randomly sampled noise classes. The num_sampled parameter sets how many
noise classes are sampled per batch:

In [31]:
# Get loss from prediction
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))

# Create optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)

# Cosine similarity between words (to find nearby ones)
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)


# Add variable initializer.
init = tf.global_variables_initializer()
sess.run(init)
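
To make the "true class versus noise" idea behind the loss more concrete, here is a simplified NumPy sketch. It is not a reimplementation of `tf.nn.nce_loss`: it omits the sampling-probability correction that full NCE applies and draws the noise classes uniformly, so it is closer to plain negative sampling. All names and sizes are assumptions made for illustration.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(embed, weights, biases, true_ids, sampled_ids):
    """Binary logistic loss: true classes are labelled 1, sampled noise classes 0."""
    # Score of each input against its own true class: shape (batch,)
    true_logits = np.sum(embed * weights[true_ids], axis=1) + biases[true_ids]
    # Score of each input against every sampled noise class: shape (batch, num_sampled)
    noise_logits = embed @ weights[sampled_ids].T + biases[sampled_ids]
    per_example = -log_sigmoid(true_logits) - np.sum(log_sigmoid(-noise_logits), axis=1)
    return per_example.mean()

rng = np.random.default_rng(0)
vocab, dim, batch, num_sampled = 20, 8, 5, 3      # toy sizes
embed = rng.normal(size=(batch, dim))             # stands in for the looked-up embeddings
weights = rng.normal(size=(vocab, dim)) / np.sqrt(dim)
biases = np.zeros(vocab)
true_ids = rng.integers(0, vocab, size=batch)
sampled_ids = rng.integers(0, vocab, size=num_sampled)  # noise classes shared by the batch

loss_value = negative_sampling_loss(embed, weights, biases, true_ids, sampled_ids)
# With all-zero weights every logit is 0, so each of the (1 + num_sampled)
# binary decisions contributes log(2) to the loss.
zero_loss = negative_sampling_loss(embed, np.zeros_like(weights), biases, true_ids, sampled_ids)
```

Each training example is scored against only `1 + num_sampled` classes instead of the whole vocabulary, which is what makes this kind of loss cheap compared with a full softmax.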

In [32]:
sim_init = sess.run(similarity)
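
The cosine-similarity computation in the graph above (normalize every row, then take a matrix product) can be checked on a toy example with NumPy. The sizes and the names `emb` and `valid_ids` are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
emb = rng.uniform(-1.0, 1.0, size=(6, 4))      # toy embedding matrix
valid_ids = np.array([2, 5])                   # toy validation word indices

# Divide every row by its Euclidean norm so that each row has length 1.
norm = np.sqrt(np.sum(emb ** 2, axis=1, keepdims=True))
normalized = emb / norm

# Dot products of unit vectors are cosine similarities: shape (2, 6).
sim = normalized[valid_ids] @ normalized.T
```

Row `j` of `sim` holds the similarity of validation word `j` to every word in the vocabulary, including itself (where the value is 1).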

### Training and testing
Note on the line *nearest = (-sim[j, :]).argsort()[1:top_k+1]* below: <br>
argsort() sorts values from least to greatest, so the similarities are negated first to get the indices of the largest values at the front. The slice starts at 1 because position 0 is the validation word itself (cosine similarity 1).
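
A small NumPy example of this indexing trick, using made-up similarity values for a single validation word:

```python
import numpy as np

# Toy similarities of one query word to a 5-word vocabulary.
# Index 1 is the query word itself, hence the similarity of 1.0.
sim_row = np.array([0.10, 1.00, -0.30, 0.75, 0.40])
top_k = 2

ascending = sim_row.argsort()          # indices from smallest to largest value
descending = (-sim_row).argsort()      # indices from largest to smallest value
nearest = descending[1:top_k + 1]      # drop position 0, which is the word itself
```

Here `nearest` contains the indices of the two most similar words other than the query word.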

In [33]:
# Train the skip-gram model.
loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, window_size)
    feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}

    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)

    # Return the loss
    if (i+1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print("Loss at step {} : {}".format(i+1, loss_val))
      
    # Validation: print each validation word and its top 5 nearest words
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5 # number of nearest neighbors
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                score = sim[j,nearest[k]]
                log_str = "%s %s," % (log_str, close_word)
            print(log_str)

Loss at step 500 : 77.35371398925781
Loss at step 1000 : 49.351463317871094
Loss at step 1500 : 49.78864288330078
Loss at step 2000 : 31.838579177856445
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 2500 : 31.272642135620117
Loss at step 3000 : 41.85002136230469
Loss at step 3500 : 24.471689224243164
Loss at step 4000 : 15.03326416015625
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 4500 : 12.656588554382324
Loss at step 5000 : 22.972822189331055
Los

Loss at step 36500 : 1.0897375345230103
Loss at step 37000 : 2.9010508060455322
Loss at step 37500 : 1.5280669927597046
Loss at step 38000 : 0.8578202724456787
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 38500 : 3.251991033554077
Loss at step 39000 : 1.2583551406860352
Loss at step 39500 : 1.2675714492797852
Loss at step 40000 : 1.3330646753311157
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 40500 : 1.1049021482467651
Loss at step 41000 : 0.823699

Loss at step 72500 : 1.1675937175750732
Loss at step 73000 : 0.8924819231033325
Loss at step 73500 : 1.0922127962112427
Loss at step 74000 : 0.9601470828056335
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 74500 : 3.1290388107299805
Loss at step 75000 : 1.1743953227996826
Loss at step 75500 : 0.9645028710365295
Loss at step 76000 : 0.24405761063098907
Nearest to cliche: confident, style, oleander, drown, fuller,
Nearest to love: civic, rocky, ivan, miyazaki, burns,
Nearest to hate: whiff, gravity, scientific, smiles, resultado,
Nearest to silly: mug, koreas, kissing, lots, toback,
Nearest to sad: smooth, suggestive, selfimportance, steven, disjointed,
Loss at step 76500 : 1.4617234468460083
Loss at step 77000 : 1.3120

Note: the closest words to the validation words are not synonyms. <br>
This is because synonyms rarely appear next to each other
in sentences. The model learns which words occur in proximity to each
other in our dataset. The hope is that embeddings like these make downstream prediction tasks easier.