*Copyright Continuum 2012-2016 All Rights Reserved.*

# Accelerate Natural Language Processing: Word2Vec

* Word2Vec is an unsupervised learning model that reconstructs the linguistic context of words.
* The model is trained with sentences.  
* The model produces a vector for each word.
* Elementary arithmetic operations (e.g. addition, subtraction) can be used on these vectors to compute analogies (e.g. "france" + "berlin" - "germany" ≈ "paris").
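
The vector arithmetic in the last bullet can be illustrated with a minimal sketch. The two-dimensional vectors below are made up for illustration (they are not produced by any trained model), and `analogy` is a hypothetical helper, not part of gensim. It adds and subtracts the raw vectors, then ranks the remaining words by cosine similarity.

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 separates the two countries,
# axis 1 encodes "is a capital city". Real models learn ~200 dimensions.
vectors = {
    "france":  np.array([1.0, 0.0]),
    "paris":   np.array([1.0, 1.0]),
    "germany": np.array([-1.0, 0.0]),
    "berlin":  np.array([-1.0, 1.0]),
    "london":  np.array([0.0, 1.0]),
    "banana":  np.array([0.2, -1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    query = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    excluded = set(positive) | set(negative)  # never answer with an input word
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


answer = analogy(["france", "berlin"], ["germany"], vectors)
print(answer)
```

With these toy vectors the query lands exactly on the vector for "paris"; with a trained model the match is only approximate, which is why the nearest neighbour is used.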

## Table of Contents
* [Accelerate Natural Language Processing: Word2Vec](#Accelerate-Natural-Language-Processing:-Word2Vec)
	* [Dataset](#Dataset)
	* [Create a Word2Vec model](#Create-a-Word2Vec-model)
	* [Warmup](#Warmup)
		* [Quick tests](#Quick-tests)
		* [Accuracy test](#Accuracy-test)
	* [Continue learning as a CUDA-enabled Word2Vec model](#Continue-learning-as-a-CUDA-enabled-Word2Vec-model)
		* [Train on the GPU](#Train-on-the-GPU)
		* [Quick test](#Quick-test)
	* [Downpour Stochastic Gradient Descent](#Downpour-Stochastic-Gradient-Descent)
		* [Quick test](#Quick-test)
		* [Accuracy test](#Accuracy-test)
	* [Hardware](#Hardware)


## Set-up

In [None]:
from __future__ import print_function, division

import random
from copy import deepcopy

import numpy as np

import gensim.scripts
from gensim.models import Word2Vec, CudaWord2Vec

## Dataset

Load the _text8_ dataset.  It is the first 100 MB of cleaned text from the English Wikipedia.

In [None]:
# Load the dataset (Linux/macOS)
#     !wget -c http://mattmahoney.net/dc/text8.zip
#     !unzip -o text8.zip

In [None]:
data_url = "http://mattmahoney.net/dc/text8.zip"
data_file = "tmp/text8.zip"

In [None]:
# create a tmp dir
import os
if not os.path.exists('tmp'):
    os.makedirs('tmp')
os.listdir("tmp")

In [None]:
# Download data file (30 MB compressed ZIP, ~ 10-20 seconds)
# Note: urllib interface differs from python2 to python3

import sys
python_version = sys.version_info.major

if python_version == 2:
    import urllib
    urllib.urlretrieve(data_url, data_file)
elif python_version == 3:
    import urllib.request, urllib.parse, urllib.error
    urllib.request.urlretrieve(data_url, data_file)

os.listdir("tmp")

In [None]:
# If needed, unzip the data file
import zipfile 
with zipfile.ZipFile(data_file, "r") as z:
    z.extractall("tmp")
os.listdir("tmp")

In [None]:
# Read data directly from the ZIP archive without extracting to disk
import zipfile
with zipfile.ZipFile(data_file, 'r') as archive:  # closes the archive when done
    text8data = archive.read('text8').decode('utf-8')

`text8data` is a single line of text.  There is no punctuation.  It is just a string of words, each separated from the next by a space.

Turn it into a list of words by splitting the text on whitespace.

In [None]:
text8words = text8data.split()
len(text8words)

The text has no sentence boundaries, so to approximate sentences we can simply group every 20 words into one.

In [None]:
sentences = []
sentlen = 20
for i in range(0, len(text8words), sentlen):
    sentences.append(text8words[i:i + sentlen])

In [None]:
len(sentences)

In [None]:
for s in sentences[:5]:
    print(' '.join(s), '\n')

## Create a Word2Vec model

In [None]:
model = Word2Vec(size=200, workers=4)  # learn a 200-dimensional vector per word

Build the vocabulary from all sentences in the dataset.

In [None]:
%%time
model.build_vocab(sentences)

## Warmup

Train the model on a reduced dataset (the first quarter of the sentences).

In [None]:
%%time
model.train(sentences[:len(sentences) // 4])

### Quick tests

Test the model.  The answers are probably not good yet, because the model has only seen a quarter of the data.

In [None]:
model.similarity('cat', 'dog')

In [None]:
model.most_similar(positive=['france', 'berlin'], negative=['germany'])  # expecting 'paris'

In [None]:
model.most_similar(positive=['jesus', 'buddhism'], negative=['christianity'])  # expecting 'buddha'

In [None]:
model.doesnt_match(['man', 'woman', 'boy', 'fork'])   
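
As a rough illustration of what an "odd one out" query such as `doesnt_match` can do, here is a minimal sketch. It assumes one common approach: average the unit-length word vectors and report the word least similar to that average. Whether the library does exactly this internally is an assumption, the vectors are made up, and `odd_one_out` is a hypothetical helper.

```python
import numpy as np

# Hypothetical toy vectors: three "person" words point one way, "fork" another.
toy_vectors = {
    "man":   np.array([1.0, 0.1]),
    "woman": np.array([0.9, 0.2]),
    "boy":   np.array([1.0, 0.3]),
    "fork":  np.array([0.1, 1.0]),
}


def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the group mean."""
    unit = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit.mean(axis=0)
    scores = unit @ mean  # dot product of each unit vector with the mean
    return words[int(np.argmin(scores))]


outlier = odd_one_out(["man", "woman", "boy", "fork"], toy_vectors)
print(outlier)
```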

### Accuracy test

Test our model with analogies.  This test is used in the original implementation by its authors at Google.  The `questions_words.txt` file came from https://code.google.com/archive/p/word2vec/source/default/source.

In [None]:
%%time
import os.path

directory = os.path.dirname(gensim.scripts.__file__)
question_file = os.path.join(directory, 'questions_words.txt')

def accuracy_test(model):
    # Run the accuracy test
    results = model.accuracy(question_file)
    # Print and format the result
    for sect in results:
        good = len(sect['correct'])
        bad = len(sect['incorrect'])
        total = good + bad
        if not total:
            score = 0
        else:
            score = 100 * good / total
        print('section', sect['section'], '| percent', "{:.1f}%".format(score))

accuracy_test(model)
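
To make the per-section scoring above more concrete, here is a minimal sketch of how an analogy question file can be parsed and scored. It assumes the layout of the original word2vec analogy set: a line starting with `:` names a section, and every other line holds four words "a b c expected". The sample text, `parse_questions`, `score`, and the dummy predictor are all hypothetical; lowercasing the words is a choice made for this sketch, not something taken from the library.

```python
sample = """\
: capital-common-countries
Athens Greece Berlin Germany
Paris France Berlin Germany
: family
boy girl man woman
"""


def parse_questions(text):
    """Split analogy text into sections of (a, b, c, expected) tuples."""
    sections = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            sections.append({"section": line[1:].strip(), "questions": []})
        else:
            a, b, c, expected = line.lower().split()
            sections[-1]["questions"].append((a, b, c, expected))
    return sections


def score(sections, predict):
    """Percentage of questions per section where predict(a, b, c) == expected."""
    report = {}
    for sect in sections:
        questions = sect["questions"]
        good = sum(1 for a, b, c, expected in questions
                   if predict(a, b, c) == expected)
        report[sect["section"]] = 100.0 * good / len(questions) if questions else 0.0
    return report


def always_germany(a, b, c):
    """Dummy predictor standing in for a real model's analogy query."""
    return "germany"


sections = parse_questions(sample)
report = score(sections, always_germany)
for name, percent in report.items():
    print("section", name, "| percent", "{:.1f}%".format(percent))
```

In practice `predict` would wrap a model query such as the `most_similar` calls shown earlier.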

## Continue learning as a CUDA-enabled Word2Vec model

The same CPU model can be turned into a GPU model by monkeypatching its class.  The GPU model, `CudaWord2Vec`, is a subclass of the original `Word2Vec` class.  It overrides the training-related methods to perform the training on the GPU.

In [None]:
CudaWord2Vec()  # to initialize the CUDA Word2Vec system which happens in __init__

gpumodel = deepcopy(model)
gpumodel.__class__ = CudaWord2Vec  # monkeypatch to use CUDA (bypasses __init__)

assert gpumodel.syn0 is not model.syn0

print(type(gpumodel))
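
The `__class__` reassignment used above is plain Python and can be shown in isolation. The two classes in this minimal sketch are hypothetical stand-ins for `Word2Vec` and `CudaWord2Vec`; the point is only that the copied object keeps its data, picks up the subclass's methods, and never runs the subclass's `__init__`.

```python
from copy import deepcopy


class CpuTrainer:
    """Hypothetical stand-in for the CPU model."""

    def __init__(self, weights):
        self.weights = list(weights)

    def train(self):
        return "trained on CPU"


class GpuTrainer(CpuTrainer):
    """Hypothetical stand-in for the GPU subclass; overrides only train()."""

    init_calls = 0

    def __init__(self, weights=()):
        super().__init__(weights)
        GpuTrainer.init_calls += 1  # track whether __init__ ever runs

    def train(self):
        return "trained on GPU"


cpu = CpuTrainer([1.0, 2.0])
gpu = deepcopy(cpu)           # independent copy of the state
gpu.__class__ = GpuTrainer    # swap the class; __init__ is not called

print(type(gpu).__name__, "|", gpu.train(), "|", gpu.weights)
print("GpuTrainer.__init__ calls:", GpuTrainer.init_calls)
```

Because `__init__` is bypassed, any one-time set-up it performs has to be triggered separately, which is why the cell above constructs a throwaway `CudaWord2Vec()` first.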

In [None]:
gpumodel.similarity('cat', 'dog')

In [None]:
gpumodel.most_similar(positive=['france', 'berlin'], negative=['germany'])

In [None]:
gpumodel.doesnt_match(['man', 'woman', 'boy', 'fork'])

### Train on the GPU

In [None]:
gpu_syn0, gpu_syn1 = gpumodel.syn0.copy(), gpumodel.syn1.copy()

In [None]:
%%time
gpumodel.train(random.sample(sentences, len(sentences) // 4))

### Quick test

In [None]:
gpumodel.similarity('cat', 'dog')

Compute the deltas of the layers (weight matrices) produced by the GPU training pass.

In [None]:
syn0_delta = gpumodel.syn0 - gpu_syn0
syn1_delta = gpumodel.syn1 - gpu_syn1

Add the deltas from the GPU model back to the original model.

In [None]:
model.syn0 += syn0_delta
model.syn1 += syn1_delta

In [None]:
model.similarity('cat', 'dog')

## Downpour Stochastic Gradient Descent

Reference: http://research.google.com/archive/large_deep_networks_nips2012.html


We will call our initial model the "master".  For each iteration, copy the master model to the workers and train each copy locally on a random subset of the dataset.  The deltas for each layer are then used to update the master model.

We will use this technique to train the master model on the CPU and GPU simultaneously.  Because `Word2Vec` and `CudaWord2Vec` have similar performance, there is no benefit to using one over the other, but using both simultaneously should roughly double our throughput.
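
Before applying the scheme to Word2Vec, here is a minimal sketch of the same loop on a toy problem. Everything in it is an assumption made for illustration: the "model" is a single NumPy weight vector, the objective is to move the weights towards the mean of some noisy points, and `local_delta` is a hypothetical helper. Only the structure mirrors the description above: copy the master, train locally, return a delta, and apply the deltas to the master with a decaying learning rate.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])
data = target + rng.normal(scale=0.1, size=(1000, 2))  # noisy samples


def local_delta(master_w, batch, step=0.05):
    """Train a private copy of the master weights; return the change."""
    w = master_w.copy()
    for x in batch:
        w -= step * 2.0 * (w - x)  # gradient step on ||w - x||^2
    return w - master_w


master = np.zeros(2)
start_error = np.linalg.norm(master - target)
learning_rate = 0.5

with ThreadPoolExecutor(max_workers=2) as exe:
    for _ in range(10):
        # Each worker gets its own random subset of the data
        batches = [data[rng.choice(len(data), size=50, replace=False)]
                   for _ in range(2)]
        futures = [exe.submit(local_delta, master, b) for b in batches]
        deltas = [f.result() for f in futures]  # wait for both workers
        # Only now touch the master, so workers never see a partial update
        for delta in deltas:
            master += learning_rate * delta
        learning_rate *= 0.75

end_error = np.linalg.norm(master - target)
print("error before:", start_error, "| error after:", end_error)
```

Gathering both results before updating the master keeps the toy deterministic; the Downpour approach in the referenced work also allows asynchronous updates, which this sketch does not attempt.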

In [None]:
def gradient(master, sentences, use_gpu=False):
    model = deepcopy(master)
    if use_gpu:
        # monkeypatch the class of the model to use GPU training
        model.__class__ = CudaWord2Vec
    model.train(sentences)
    # compute the deltas
    delta_syn0 = model.syn0 - master.syn0
    delta_syn1 = model.syn1 - master.syn1
    return delta_syn0, delta_syn1

def descent(model, deltas, learning_rate):
    model.syn0 += deltas[0] * learning_rate
    model.syn1 += deltas[1] * learning_rate
    
def show_norms(deltas):
    return 'syn0: {0} | syn1: {1}'.format(np.linalg.norm(deltas[0]), 
                                          np.linalg.norm(deltas[1]))


In [None]:
%%time

from concurrent.futures import ThreadPoolExecutor

learning_rate = 0.5
cat_dog_sims = []

with ThreadPoolExecutor(max_workers=2) as exe:
    for _ in range(10):
        # Sample the dataset
        cpu_sents = random.sample(sentences, len(sentences) // 4)
        gpu_sents = random.sample(sentences, len(sentences) // 4)

        # Train models in parallel
        future_cpu = exe.submit(gradient, model, cpu_sents)
        future_gpu = exe.submit(gradient, model, gpu_sents, use_gpu=True)

        # Gather data
        cpu_deltas = future_cpu.result()
        gpu_deltas = future_gpu.result()

        # Apply deltas to master model
        descent(model, cpu_deltas, learning_rate)
        descent(model, gpu_deltas, learning_rate)

        print("cpu delta norms", show_norms(cpu_deltas))
        print("gpu delta norms", show_norms(gpu_deltas))

        sim = model.similarity('cat', 'dog')
        print(sim)
        cat_dog_sims.append(sim)
        
        learning_rate *= 0.75

### Quick test

Clear the internal cache used for similarity testing, because the weights were modified directly.

In [None]:
model.clear_sims()

In [None]:
model.most_similar(positive=['france', 'berlin'], negative=['germany'])

In [None]:
model.most_similar(positive=['jesus', 'buddhism'], negative=['christianity'])

In [None]:
model.doesnt_match(['man', 'woman', 'boy', 'dog'])

### Accuracy test

In [None]:
%%time
accuracy_test(model)

## Hardware 

In [None]:
!cat /proc/cpuinfo

In [None]:
from numba import cuda
cuda.detect()
