Word2Vec
=============

The goal of this tutorial is to train a Word2Vec skip-gram model. It is based on the tutorial at http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow (which trains on the http://mattmahoney.net/dc/text8.zip data).

The code in this notebook reads its text from a local letters/ directory instead. To follow the original tutorial's data loading, first download the file (text8.zip) from the given link and put it in the same directory as this notebook.

In [1]:
# Modules we'll be using later. 
import math
import numpy as np
import random
import tensorflow as tf
import fileinput
import os, sys
import shutil
import nltk
nltk.download('punkt')
from collections import Counter

[nltk_data] Downloading package punkt to /Users/fanfan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Read the data into a list of words.

In [2]:
from nltk.tokenize import word_tokenize

def extract_word_list():
    """Tokenize every file under letters/<directory>/<gender>/ into lowercase alphabetic words."""
    list_of_words = []
    directories = ["L_the_patient", "U_Patient", "U_The_patient"]
    genders = ["female", "male", "universal"]

    for directory in directories:
        for gender in genders:
            folder = "letters/" + directory + "/" + gender + "/"

            for filename in os.listdir(folder):
                if filename != ".DS_Store":  # skip macOS Finder metadata
                    with open(folder + filename, 'r') as f:
                        filedata = f.read()

                    tokens = word_tokenize(filedata)
                    words = [word for word in tokens if word.isalpha()]  # drop punctuation and numbers
                    lowercased = [word.lower() for word in words]
                    list_of_words.append(lowercased)

    # Flatten the per-file lists into one list of words.
    flattened_list = [y for x in list_of_words for y in x]
    print('list_of_words size is ', len(flattened_list))
    return flattened_list

list_of_words = extract_word_list()

print('list_of_words size is ', len(list_of_words))
print('Some words are ', list_of_words[:7])

list_of_words size is  989006
list_of_words size is  989006
Some words are  ['chelsey', 'holley', 'consult', 'screening', 'colonoscopy', 'history', 'a']


The function above walks the letters/ subdirectories, tokenizes each file with NLTK's *word_tokenize()*, keeps only the alphabetic tokens, lowercases them, and flattens everything into a single list. We can see some of the output above. The original tutorial instead loads text8.zip with *zipfile.ZipFile()*, using the reader functionality of the zipfile module. First, the *namelist()* function retrieves all the members of the archive (there is only one member, so it is accessed with the zero index). Then the *read()* function reads all the text in the file. Finally, the *split()* function creates a list of all the words in the text file, separated by white-space characters.
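
For reference, the zipfile-based loading described above could be sketched as follows. This is a minimal sketch, not the code this notebook runs: the function name read_words_from_zip is hypothetical, and the archive is assumed to hold UTF-8 text in its first member.

```python
import zipfile

def read_words_from_zip(zip_path):
    """Return the whitespace-separated words of the first member of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        first_member = archive.namelist()[0]   # text8.zip is described as having one member
        text = archive.read(first_member).decode("utf-8")  # assumption: UTF-8 text
    return text.split()

# Usage, assuming text8.zip sits next to the notebook:
# words = read_words_from_zip("text8.zip")
```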

As you can observe, the returned list contains plain English words, in the order in which they appear in the original text.  Now that we have all the words extracted in a list, we need some further processing before we can create our skip-gram batch data.  These further steps are:

1. Extract the most common words to include in our embedding. The code below keeps the top vocabulary_size - 1 = 999 words
2. Gather together all the unique words and index them with a unique integer value; this is what is required to create an equivalent one-hot type input for the word.  We'll use a Python dictionary to do this
3. Loop through every word in the dataset (the list_of_words variable) and replace it with the unique integer identifier created in Step 2 above.  This will allow easy lookup and processing of the word data stream
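
The three steps above can be sketched on a toy word list. The helper names build_vocabulary and encode are hypothetical, and dropping out-of-vocabulary words during encoding is an assumption of this sketch.

```python
from collections import Counter

def build_vocabulary(words, vocabulary_size):
    """Steps 1 and 2: map the (vocabulary_size - 1) most common words to ids 0, 1, 2, ..."""
    most_common = Counter(words).most_common(vocabulary_size - 1)
    word_to_id = {word: idx for idx, (word, _) in enumerate(most_common)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word

def encode(words, word_to_id):
    """Step 3: replace each word by its id (assumption: unknown words are dropped)."""
    return [word_to_id[w] for w in words if w in word_to_id]

toy_words = "the patient was seen and the patient left the room".split()
word_to_id, id_to_word = build_vocabulary(toy_words, vocabulary_size=3)
encoded = encode(toy_words, word_to_id)

print(word_to_id)  # {'the': 0, 'patient': 1}
print(encoded)     # [0, 1, 0, 1, 0]
```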

In [3]:
vocabulary_size = 1000

words_counts = Counter(list_of_words)

top_words_counts = words_counts.most_common(vocabulary_size - 1)

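# Step 2: give each of the most common words a unique integer id (0 = most frequent).
dictionary = {word: index for index, (word, _) in enumerate(top_words_counts)}

# Step 3: encode the word stream as ids, skipping words outside the vocabulary.
data = [dictionary[word] for word in list_of_words if word in dictionary]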

In [4]:
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 

#print('Most common words', top_words_counts[:5])
#print('Sample data', data[:10])
#print('Sample data', [reverse_dictionary[data[i]] for i in range(10)])
print(list(reverse_dictionary.items())[:20])

[(0, 'the'), (1, 'and'), (2, 'was'), (3, 'of'), (4, 'to'), (5, 'a'), (6, 'with'), (7, 'in'), (8, 'is'), (9, 'patient'), (10, 'she'), (11, 'for'), (12, 'were'), (13, 'no'), (14, 'he'), (15, 'on'), (16, 'this'), (17, 'at'), (18, 'then'), (19, 'right')]


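Function to generate a training batch for the skip-gram model. Note that batch_size counts center words, not (input, label) pairs, so the returned lists are usually longer than batch_size (39 pairs for 16 center words in the output below).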

In [5]:
word_index = 0

def generate_batch(batch_size=16, skip_window=2):
    global word_index   # now we can change this variable inside this function
    
    batch  = []
    labels = []
    
    for _ in range(batch_size): 
        if list_of_words[word_index] in dictionary:
            left  = max(0, word_index-skip_window)
            right = min(word_index+skip_window+1, len(list_of_words))
            labels_indexes = list(range(left,word_index)) + list(range(word_index+1,right))

            for i in labels_indexes:
                if list_of_words[i] in dictionary:
                    batch.append(dictionary[list_of_words[word_index]])
                    labels.append(dictionary[list_of_words[i]])

        word_index = (word_index + 1) % len(list_of_words)  # wrap around at the end of the corpus
    
    return batch, labels

batch, labels = generate_batch()

print([reverse_dictionary[x] for x in batch])
print([reverse_dictionary[x] for x in labels])
print(batch)
print(labels)

['consult', 'colonoscopy', 'colonoscopy', 'colonoscopy', 'history', 'history', 'history', 'a', 'a', 'a', 'a', 'is', 'is', 'is', 'is', 'a', 'a', 'a', 'a', 'who', 'who', 'who', 'i', 'i', 'i', 'well', 'well', 'well', 'because', 'because', 'because', 'i', 'i', 'i', 'i', 'have', 'have', 'have', 'have']
['colonoscopy', 'consult', 'history', 'a', 'colonoscopy', 'a', 'is', 'colonoscopy', 'history', 'is', 'a', 'history', 'a', 'a', 'who', 'a', 'is', 'who', 'i', 'is', 'a', 'i', 'a', 'who', 'well', 'i', 'because', 'i', 'well', 'i', 'have', 'well', 'because', 'have', 'been', 'because', 'i', 'been', 'taking']
[410, 808, 808, 808, 26, 26, 26, 5, 5, 5, 5, 8, 8, 8, 8, 5, 5, 5, 5, 82, 82, 82, 39, 39, 39, 37, 37, 37, 293, 293, 293, 39, 39, 39, 39, 45, 45, 45, 45]
[808, 410, 26, 5, 808, 5, 8, 808, 26, 8, 5, 26, 5, 5, 82, 5, 8, 82, 39, 8, 5, 39, 5, 82, 37, 39, 293, 39, 37, 39, 45, 37, 293, 45, 57, 293, 39, 57, 526]
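
To make the pairing in the output above easier to follow, here is a minimal sketch of the same windowing logic on a toy token list. The function name skipgram_pairs is hypothetical, and the sketch leaves out the vocabulary filtering that generate_batch performs.

```python
def skipgram_pairs(tokens, skip_window=2):
    """Return (center, context) pairs for every token and each neighbour within skip_window."""
    pairs = []
    for center in range(len(tokens)):
        left = max(0, center - skip_window)
        right = min(center + skip_window + 1, len(tokens))
        for context in range(left, right):
            if context != center:          # a word is not its own context
                pairs.append((tokens[center], tokens[context]))
    return pairs

pairs = skipgram_pairs(["a", "b", "c", "d"], skip_window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

Words at the edges of the list have fewer neighbours, which is one reason the number of pairs per batch is not fixed.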


Train a skip-gram model.

In [6]:
batch_size = 100 # Not used below: generate_batch() is called with its default batch_size=16.
embedding_size = 128 # Dimension of the embedding vector.

train_inputs = tf.placeholder(tf.int32)
train_labels = tf.placeholder(tf.int32)

# Look up embeddings for inputs.
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs) # train_inputs will be a batch at a time
# tf.nn.embedding_lookup() is a useful helper function in TensorFlow.
# It takes an input vector of integer indexes (here the train_inputs batch
# of training input words) and "looks up" these indexes in the supplied embeddings tensor.
# It therefore returns the current embedding vector for each of the supplied input words
# in the training batch.  The full embedding tensor will be optimized during the training process.

# Construct the variables for the softmax
weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
biases = tf.Variable(tf.zeros([vocabulary_size]))
hidden_out = tf.matmul(embed, tf.transpose(weights)) + biases

# Convert train_labels to a one-hot format.
train_one_hot = tf.one_hot(train_labels, vocabulary_size)
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=hidden_out, labels=train_one_hot))
# Construct the SGD optimizer using a learning rate of 1.0.
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cross_entropy)
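
As a rough illustration of what the cell above computes for a single training pair, here is a pure-Python sketch of the logits and the softmax cross-entropy. The function names logits_for and softmax_cross_entropy are hypothetical, and the toy sizes (3 words, 2 dimensions) are chosen only for readability.

```python
import math

def logits_for(embedding, weights, biases):
    """One logit per vocabulary word: dot(weights[v], embedding) + biases[v]."""
    return [sum(w * e for w, e in zip(row, embedding)) + b
            for row, b in zip(weights, biases)]

def softmax_cross_entropy(logits, label):
    """Cross-entropy between softmax(logits) and a one-hot target at index `label`."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

embedding = [1.0, 2.0]                            # embedding of the input word
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # one row per vocabulary word
biases = [0.0, 0.0, 0.0]
loss = softmax_cross_entropy(logits_for(embedding, weights, biases), label=1)
print(loss)  # ln(3): with all-zero weights every word is equally likely
```

By the same reasoning, a uniform prediction over 1,000 words has a loss of ln(1000) ≈ 6.91, which gives a reference point for the loss values printed during training.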

In [7]:
session = tf.InteractiveSession()
tf.global_variables_initializer().run()
print('Initialized')

average_loss = 0
num_steps = 101


for step in range(num_steps):
    batch_inputs, batch_labels = generate_batch()
    # We need to convert these lists to numpy arrays. 
    # batch_inputs needs to be a row array.
    # batch_labels needs to be a column array.  
    length = len(batch_inputs)
    batch_inputs = np.array(batch_inputs)
    batch_labels = np.array(batch_labels).reshape(length, 1)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    _, loss_val = session.run([optimizer, cross_entropy], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 20 == 0:
        if step > 0:
            average_loss /= 20
        # The average loss is an estimate of the loss over the last 20 batches.
        print('Average loss at step ', step, ': ', average_loss)
        average_loss = 0

Initialized
Average loss at step  0 :  7.04640007019
Average loss at step  20 :  6.61802151203
Average loss at step  40 :  6.23343007565
Average loss at step  60 :  6.24429838657
Average loss at step  80 :  5.98988995552
Average loss at step  100 :  6.00317900181


In [8]:
num_sampled = 64    # Number of negative examples to sample.

nce_loss = tf.reduce_mean( 
        tf.nn.nce_loss(weights=weights,
                       biases=biases,
                       labels=train_labels,
                       inputs=embed,
                       num_sampled=num_sampled,
                       num_classes=vocabulary_size))

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(nce_loss)

In [9]:
word_index = 0

session = tf.InteractiveSession()
tf.global_variables_initializer().run()
print('Initialized')

average_loss = 0
num_steps = 10001


for step in range(num_steps):
    batch_inputs, batch_labels = generate_batch()
    length = len(batch_inputs)
    batch_inputs = np.array(batch_inputs)
    batch_labels = np.array(batch_labels).reshape(length, 1)
    feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}

    _, loss_val = session.run([optimizer, nce_loss], feed_dict=feed_dict)
    average_loss += loss_val

    if step % 2000 == 0:
        if step > 0:
            average_loss /= 2000
        # The average loss is an estimate of the loss over the last 2000 batches.
        print('Average loss at step ', step, ': ', average_loss)
        average_loss = 0

Initialized
Average loss at step  0 :  143.895339966
Average loss at step  2000 :  nan
Average loss at step  4000 :  nan
Average loss at step  6000 :  nan
Average loss at step  8000 :  4.7423764149
Average loss at step  10000 :  nan
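
The notebook does not diagnose the nan values above. One possibility, offered only as an assumption, is that generate_batch() occasionally returns empty lists (for example when none of the 16 center words has an in-vocabulary neighbour); the mean of an empty set of losses is undefined, and a single nan is enough to turn a 2000-step average into nan. The learning rate of 1.0 is another candidate. A minimal sketch of a guard against empty batches, with the hypothetical helper next_nonempty_batch:

```python
def next_nonempty_batch(generate, max_tries=1000):
    """Call generate() until it returns at least one (input, label) pair."""
    for _ in range(max_tries):
        inputs, labels = generate()
        if len(inputs) > 0:
            return inputs, labels
    raise RuntimeError("no non-empty batch after %d tries" % max_tries)

# Toy stand-in for generate_batch(): the first two batches are empty.
_toy_batches = iter([([], []), ([], []), ([5, 5], [8, 26])])
inputs, labels = next_nonempty_batch(lambda: next(_toy_batches))
print(inputs, labels)  # [5, 5] [8, 26]
```

In the training loop this would be called as next_nonempty_batch(generate_batch) in place of generate_batch().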


This is what an embedding looks like:

In [10]:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm

final_embeddings = normalized_embeddings.eval()
print("final_embeddings.shape = ", final_embeddings.shape)
print("The first embedding vector is: \n", final_embeddings[0])

final_embeddings.shape =  (1000, 128)
The first embedding vector is: 
 [-0.06040489 -0.01330441 -0.09700806  0.09462865 -0.06287403  0.06696502
 -0.17774592 -0.0952784   0.12120966 -0.01762032  0.02647315 -0.08618247
  0.16042912 -0.01304143 -0.12926026  0.07320725 -0.05332165  0.00240781
  0.08730064  0.00759852 -0.03314163  0.0516535  -0.07823957  0.1113538
 -0.05894373  0.10860934  0.01588152  0.00113327  0.14620855 -0.17204115
 -0.05744638 -0.12161048 -0.03422917 -0.06331154  0.03345115 -0.00453212
 -0.00281025  0.07150039  0.0225572  -0.14017226 -0.01739253 -0.0360419
  0.01315403  0.07402837 -0.06505622 -0.0322975   0.06845841 -0.03555855
 -0.10300234  0.01720936 -0.05083563 -0.07368746 -0.00757827 -0.00188826
 -0.15294185  0.03227917  0.15326868 -0.05712933 -0.26599848 -0.01704332
 -0.09562656 -0.05190606 -0.04794063  0.03382377  0.04194435  0.03574634
  0.10957918 -0.03770536  0.03981107 -0.00529713  0.07970909  0.01328546
  0.0946495   0.26669094 -0.0146098   0.00049715  0.028

In [11]:
print(words_counts.most_common(200))

[('the', 65872), ('and', 36616), ('was', 33155), ('of', 25590), ('to', 22676), ('a', 18994), ('with', 16072), ('in', 14735), ('is', 10592), ('patient', 9822), ('she', 7581), ('for', 7548), ('were', 7177), ('no', 6987), ('he', 6477), ('on', 6339), ('this', 6211), ('at', 5818), ('then', 5775), ('right', 4904), ('as', 4881), ('that', 4720), ('left', 4716), ('has', 4713), ('her', 4495), ('or', 4138), ('history', 3950), ('there', 3716), ('had', 3559), ('procedure', 3549), ('be', 3530), ('his', 3438), ('placed', 3287), ('an', 3038), ('not', 3016), ('from', 2863), ('normal', 2753), ('well', 2719), ('pain', 2531), ('i', 2478), ('we', 2470), ('it', 2383), ('which', 2378), ('are', 2372), ('by', 2119), ('have', 2080), ('after', 2067), ('any', 2024), ('also', 1945), ('will', 1939), ('using', 1901), ('time', 1847), ('s', 1830), ('noted', 1814), ('into', 1798), ('mg', 1781), ('anesthesia', 1772), ('been', 1730), ('but', 1728), ('blood', 1719), ('incision', 1678), ('removed', 1673), ('performed', 165

In [18]:
test_word = 'patient'

print(dictionary[test_word])

test_word_index = dictionary[test_word]
test_word_embedding = final_embeddings[test_word_index:test_word_index+1, : ]

#print("The embedding vector of test_word is: \n", test_word_embedding)

top_k = 4
test_word_top_k_similar = list((-test_word_embedding @ final_embeddings.T).argsort()[:, 1:top_k+1].flat)
print( test_word_top_k_similar )
print([reverse_dictionary[x] for x in test_word_top_k_similar])

9
[10, 14, 807, 853]
['she', 'he', 'guide', 'release']
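
The last cell ranks words by the dot product of normalized embeddings, which is their cosine similarity, and drops the first column on the assumption that the query word itself ranks first. A pure-Python sketch of the same idea on a toy embedding matrix follows. The names normalize and most_similar are hypothetical, and this version excludes the query word explicitly rather than by position.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def most_similar(query_id, embeddings, top_k):
    """Return the ids of the top_k rows with the highest cosine similarity to row query_id."""
    unit = [normalize(row) for row in embeddings]
    query = unit[query_id]
    scores = [sum(q * x for q, x in zip(query, row)) for row in unit]
    ranked = sorted(range(len(unit)), key=lambda i: -scores[i])
    return [i for i in ranked if i != query_id][:top_k]

toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = most_similar(0, toy_embeddings, top_k=2)
print(neighbours)  # [1, 2]
```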
