Motivation: Image and audio processing data are rich high dimensional data encodede as vectors of the individual raw    pixels. Similary much of the information within natural language texts or speech are encoded as vectors of high dimensionality.But typical problem with traditional approach to extract useful information from natural language has been treatingeach word as discrete symbols and assigning ID to them. Then problem arises that the knowledge gained from the understanding of the word 'CAT' doesn't help to understand word 'DOG' although they belong to same class as 'ANIMAL' and shares many characteristic.Also traditonal method like Bag of Words and count based methods such as N-Gram is to deal with high dimensioanlity which is the total vocabulary size of the text and it could be in size of many millions. One other problem is to deal with any new vocab additions is complex. These methods also fails to capture all the semantic similarity of every words or phrases (e.g. wonderful & awesome) and how they exists in certain texts. Moreover, they are very hard to be robust to many different problems like machine translation, speech recognition and language understanding tasks.
Representing words as vectors can overcome many of these problems.

Vector Space Models (VSM) embeds words as vectors and has a long reach history in NLP tasks. It shars distibutional hypothesis .
In this hypothesis it is identified that words that appear together shares same semantic meaning or context. The different
approach that applies this hypothesis are Latent Semantic Analysis (LSA - Count based method) and Neural Probabilistic Method
(Precitive).
One of the popular method currently is word2vec which follows the principal of distributional similarity based representation.
There is a great saying in this regard and I quote, "You shall know a word by the company it keeps". Word is represented 
by the left and right words of a window with a size. We will explore this in greater details later.

word2vec is particularly efficient in terms of computation because it works with a window for co-occurance from raw text.
It comes in two flavor - Continuous Bag of Words and Skip Gram Model (Mikolov et. al.).Algorithimically these are same except that CBOW predicts the next word based on a context and Skip-Gram predicts the context based on a word. CBOW treats each context as one observation and smoothes over a lot of distributional information but skip gram treats every context-target as new observation and hence performs better for large datasets.

Traditionally probabilistic model uses MLE method to optimize but normalize over entrire vocab is expensive and word2vec doesn't use full probabilstic model rather uses a technique called negative sampling where it tries to maximize logprobability of correct word appearing in the context and minimize noise from random sampled words from the rest of the corpus not appearning in the context.

Mathametically speaking:
    


In [45]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import math
import os
import random
import zipfile

import numpy as np
from six.moves import urllib
from six.moves import xrange
import tensorflow as tf

In [46]:
#dowload the data
url = "http://mathmahoney.net/dc/"

def download_data(filename, expected_bytes):
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified:', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify;' + filename + '. Can you get it with a browser?')
    return filename

filename = download_data('text8.zip', 31344016)

Found and verified: text8.zip


In [47]:
#read the data into list of strings

def read_data(filename):
    #Extract first file in the zip as a list of words"
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data

words = read_data(filename)
print('Data size', len(words))

Data size 17005207


In [48]:
#Build the dictionary and replace the rare words with UNK token

vocab_size = 50000

def build_dataset(words, vocab_size):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocab_size -1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0
            unk_count +=1 
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

In [49]:
data, count, dictionary, reverse_dictionary = build_dataset(words, vocab_size)
#del words # to reduce memory
print("Most common words (+unk)", count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])

data_index = 0

#Step -3: Function to generate a training batch for the skip-gram model
def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1 # [skip_window target skip_window]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window # target level at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span -1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
        
        #Backtrack a little bit to avoid skipping words in the end of a batch
        data_index = (data_index + len(data) - span) % len(data)
        return batch, labels

Most common words (+unk) [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


In [50]:
batch, labels = generate_batch(batch_size = 8, num_skips = 2, skip_window=1)
for i in range(8):
    print(batch[i], reverse_dictionary[batch[i]], 
          '->', labels[i, 0], reverse_dictionary[labels[i, 0]])

3084 originated -> 5239 anarchism
3084 originated -> 12 as


KeyError: 268787658

In [None]:
batch_size = 128 
embedding_size = 128 # Dimension of embedding vectors
skip_window = 1 #How many words to consider left and right
num_skips = 2 # How many times to reuse an input to generate a label

#We pick a random validation set to sample nearest neighbos, here we limit the 
#validation samples to the words to the words that have low numeric ID which by construction are also
#most frequent words

valid_size = 16 #random set of words to evaluate simimilarity
va,id_window = 100 # Pick up samples 
valid_exeamples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64

graph = tf.Graph()


with graph as default():
    
    #Input data
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtypetf.int32)
    
    #Ops and variables pinned to the CPU because of missing GPU implementation
    with tf.device('/cpu:0'):
        #Look up embeddings for inputs
        embeddings = tf.Variable(
             tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        
        #Construct the variable for the NCE loss
        nce_weight = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size], -1.0, 1.0))
        nce_bias = tf.Variable(tf.zeros([vocab_size]))
        
        
        #Compute the average NCE loss for the batch
        #tf.nce_loss automatically drew as new sample of the negative labels
        #each time we evaluate the loss
        
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                             biases = nce_bias,
                                             labels=train_labels,
                                             inputs = embed,
                                             num_smapled=num_sampled,
                                             num_classes = vocab_size))
        
        #Construct the sgd optimizer using a learning rate of 1.0
        
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        
        #Compute the cosien similarity between minibatch examples and all other embeddings
        
        norm = tf.sqrt(tf.reduce_mean(tf.squre(embeddings), 1, keep_dims=True))
        
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(
                 normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
        
        #Add variable initializer
        init = tf.global_variables_initializer()
        
        #Begin training
        
        num_steps = 100001
        
        with tf.Session(graph=graph) as session:
            #We must initialize all variables before we use them
            init.run()
            print('Initialized')
            
            
            average_loss = 0
            for step in range(num_steps):
                batch_size, batch_labels = generate_batch(batch_size, num_skips, skip_window)
                feed_dict={train_inputs: batch_inputs, train_labels: batch_labels}