# Word2Vec

In [1]:
import time
import numpy as np
import tensorflow as tf
import utils
from random import randint

Load the [text8 dataset](http://mattmahoney.net/dc/textdata.html), a file of cleaned-up Wikipedia articles from Matt Mahoney. Download the archive into the `data` folder, extract it, and then delete the archive file to save storage space. The next cell reads the extracted file from `data/text8`.

### Load the dataset

In [2]:
with open('data/text8') as f:
    text = f.read()

### Preprocessing

In [3]:
words = utils.preprocess(text)
print(words[:30])

['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst']


In [4]:
# Shape
print("Total words: {}".format(len(words)))
print("Unique words: {}".format(len(set(words))))

Total words: 16680599
Unique words: 63641


### Create lookup table

A lookup table maps each word to an integer index, and a second table maps each index back to its word, so we can convert in both directions.

In [5]:
vocab_to_int, int_to_vocab = utils.create_lookup_tables(words)
int_words = [vocab_to_int[word] for word in words]
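
`utils.create_lookup_tables` isn't shown in this notebook, so here is a minimal sketch of what such a helper might look like. It assumes ids are handed out in order of descending word frequency, which would be consistent with `'the'` getting id 0 in the `int_to_vocab` output; the function body and the toy sentence are illustrative, not the actual `utils` code.

```python
from collections import Counter

def create_lookup_tables(words):
    """Hypothetical stand-in for utils.create_lookup_tables.

    Assumption: ids are assigned by descending word frequency.
    """
    word_counts = Counter(words)
    # Most frequent word first, so it receives id 0
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    return vocab_to_int, int_to_vocab

# Toy usage: encode a sentence and decode it again
toy_words = "the cat sat on the mat the end".split()
vocab_to_int_toy, int_to_vocab_toy = create_lookup_tables(toy_words)
toy_ids = [vocab_to_int_toy[w] for w in toy_words]
print(toy_ids)
print([int_to_vocab_toy[i] for i in toy_ids])
```

The two dictionaries are inverses of each other, so encoding and then decoding returns the original word list.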

In [6]:
vocab_to_int

{'foretold': 32341,
 'chillout': 50095,
 'oakland': 8174,
 'sholay': 53721,
 'eide': 50096,
 'sagittarius': 27349,
 'indefatigable': 40561,
 'syntactic': 16266,
 'eta': 5929,
 'barbarian': 9126,
 'och': 23736,
 'deserts': 9304,
 'humps': 42432,
 'arrears': 36074,
 'luckily': 36075,
 'appear': 993,
 'bermuda': 9081,
 'conjectures': 18966,
 'plenty': 13360,
 'connoisseur': 44612,
 'remodeled': 58112,
 'confession': 9033,
 'chhatro': 58113,
 'draper': 26752,
 'committing': 13361,
 'strabismus': 58114,
 'peri': 27350,
 'aids': 2757,
 'aznavour': 50097,
 'angled': 29372,
 'hotelier': 58115,
 'satellites': 7573,
 'chasm': 44613,
 'finalized': 22514,
 'prosper': 16820,
 'dismutase': 50098,
 'rulings': 15504,
 'guerra': 30964,
 'wes': 27351,
 'comneni': 53722,
 'uncompressed': 26753,
 'larry': 4085,
 'dieterich': 56795,
 'panthera': 32814,
 'riddler': 34947,
 'taito': 30965,
 'title': 587,
 'moves': 2824,
 'grange': 30966,
 'guideline': 28629,
 'overhaul': 25097,
 'borwein': 53723,
 'ospf': 58

In [7]:
int_to_vocab

{0: 'the',
 1: 'of',
 2: 'and',
 3: 'one',
 4: 'in',
 5: 'a',
 6: 'to',
 7: 'zero',
 8: 'nine',
 9: 'two',
 10: 'is',
 11: 'as',
 12: 'eight',
 13: 'for',
 14: 's',
 15: 'five',
 16: 'three',
 17: 'was',
 18: 'by',
 19: 'that',
 20: 'four',
 21: 'six',
 22: 'seven',
 23: 'with',
 24: 'on',
 25: 'are',
 26: 'it',
 27: 'from',
 28: 'or',
 29: 'his',
 30: 'an',
 31: 'be',
 32: 'this',
 33: 'which',
 34: 'at',
 35: 'he',
 36: 'also',
 37: 'not',
 38: 'have',
 39: 'were',
 40: 'has',
 41: 'but',
 42: 'other',
 43: 'their',
 44: 'its',
 45: 'first',
 46: 'they',
 47: 'some',
 48: 'had',
 49: 'all',
 50: 'more',
 51: 'most',
 52: 'can',
 53: 'been',
 54: 'such',
 55: 'many',
 56: 'who',
 57: 'new',
 58: 'used',
 59: 'there',
 60: 'after',
 61: 'when',
 62: 'into',
 63: 'american',
 64: 'time',
 65: 'these',
 66: 'only',
 67: 'see',
 68: 'may',
 69: 'than',
 70: 'world',
 71: 'i',
 72: 'b',
 73: 'would',
 74: 'd',
 75: 'no',
 76: 'however',
 77: 'between',
 78: 'about',
 79: 'over',
 80: 'year

In [8]:
int_words

[5233,
 3081,
 11,
 5,
 194,
 1,
 3133,
 45,
 58,
 155,
 127,
 741,
 476,
 10592,
 133,
 0,
 27933,
 1,
 0,
 102,
 854,
 2,
 0,
 15188,
 58629,
 1,
 0,
 150,
 854,
 3580,
 0,
 194,
 10,
 190,
 58,
 4,
 5,
 10736,
 214,
 6,
 1325,
 104,
 454,
 19,
 58,
 2731,
 362,
 6,
 3675,
 0,
 708,
 1,
 371,
 26,
 40,
 36,
 53,
 539,
 97,
 11,
 5,
 1425,
 2759,
 18,
 567,
 686,
 7102,
 0,
 247,
 5233,
 10,
 1052,
 27,
 0,
 320,
 248,
 45961,
 2877,
 792,
 186,
 5233,
 11,
 5,
 200,
 602,
 10,
 0,
 1136,
 19,
 2623,
 25,
 9002,
 2,
 279,
 31,
 4157,
 141,
 59,
 25,
 6437,
 4196,
 1,
 153,
 32,
 362,
 5233,
 36,
 1137,
 6,
 447,
 345,
 1818,
 19,
 4868,
 0,
 6760,
 1,
 7588,
 1775,
 566,
 0,
 93,
 0,
 247,
 11117,
 11,
 51,
 7102,
 89,
 26,
 270,
 37,
 5957,
 4860,
 20341,
 28,
 55388,
 41,
 317,
 5,
 25803,
 527,
 7588,
 371,
 4,
 258,
 1,
 153,
 25,
 1206,
 11,
 7588,
 200,
 1580,
 2,
 15256,
 332,
 1775,
 7102,
 4868,
 345,
 764,
 160,
 406,
 5693,
 756,
 1,
 4114,
 1132,
 4343,
 1536,
 2,
 567,
 8

## Subsampling

Words that show up often, such as "the", "of", and "for", don't provide much context to the nearby words. If we discard some of them, we remove some of the noise from our data and in return get faster training and better representations. Mikolov et al. call this process subsampling. For each word $w_i$ in the training set, we'll discard it with probability

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.
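
To get a feel for the formula, here is a small sketch that evaluates $P(w_i)$ for a few made-up frequencies with $t = 10^{-5}$, the threshold used in the code in this section. The helper name `drop_probability` and the clipping at zero (for words rarer than the threshold, where the formula goes negative) are my own additions.

```python
import math

def drop_probability(frequency, threshold=1e-5):
    """P(w) = 1 - sqrt(t / f(w)), clipped at 0 for words rarer than t."""
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

# Illustrative frequencies: very common, fairly common, exactly at the threshold
for f in (1e-1, 1e-3, 1e-5):
    print("f = {:g}  P(drop) = {:.3f}".format(f, drop_probability(f)))
```

A word that makes up 10% of the corpus is dropped about 99% of the time, a word at 0.1% about 90% of the time, and a word whose frequency equals the threshold is never dropped.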

I'm going to leave this up to you as an exercise. This is more of a programming challenge than a deep learning one, but being able to prepare your data for your network is an important skill to have. Check out my solution to see how I did it.


In [9]:
from collections import Counter
import random

threshold = 1e-5
number_of_words = len(int_words)
word_counter = Counter(int_words)
frequencies = dict()
drop_probabilities = dict()
train_words = []

# Relative frequency f(w) and drop probability P(w) = 1 - sqrt(t / f(w)) per word
for word, count in word_counter.items():
    frequency = count / number_of_words
    frequencies[word] = frequency
    drop_probabilities[word] = 1 - np.sqrt(threshold / frequency)

# Note: this keeps every word with P(w) < 0.85, a fixed cutoff,
# rather than discarding each occurrence at random with probability P(w)
for word in int_words:
    if drop_probabilities[word] < 0.85:
        train_words.append(word)

print(len(train_words))

7852711
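
The cell above compares the drop probability against a fixed value of 0.85. For comparison, here is a sketch of the randomized variant that the formula describes, where each occurrence is discarded independently with probability $P(w_i)$. The function name `subsample`, the `seed` argument, and the toy data with its much larger threshold are assumptions for illustration; this is not the code that produced the count printed above.

```python
import math
import random
from collections import Counter

def subsample(int_words, threshold=1e-5, seed=None):
    """Drop each occurrence of word w with probability 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    total = len(int_words)
    counts = Counter(int_words)
    p_drop = {w: 1 - math.sqrt(threshold / (c / total))
              for w, c in counts.items()}
    # rng.random() is in [0, 1): keep the word when the draw is not below p_drop.
    # Words with a negative p_drop (rarer than the threshold) are always kept.
    return [w for w in int_words if rng.random() >= p_drop[w]]

# Toy corpus: word 0 makes up 90% of the text, word 1 makes up 10%
toy_words = [0] * 9000 + [1] * 1000
kept = subsample(toy_words, threshold=0.05, seed=0)
print(len(kept), Counter(kept))
```

With this variant a frequent word is thinned out rather than removed entirely, so it still appears in some training windows.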


# Making the batch

Now that our data is in good shape, we need to get it into the proper form to pass it into our network. With the skip-gram architecture, for each word in the text, we want to grab all the words in a window around that word, with size $C$. 

From [Mikolov et al.](https://arxiv.org/pdf/1301.3781.pdf): 

"Since the more distant words are usually less related to the current word than those close to it, we give less weight to the distant words by sampling less from those words in our training examples... If we choose $C = 5$, for each training word we will select randomly a number $R$ in range $< 1; C >$, and then use $R$ words from history and $R$ words from the future of the current word as correct labels."

In [10]:
def get_target(words, idx, window_size=5):
    ''' Get a list of words in a window around an index. '''

    # R from the quote above: a random window radius between 1 and window_size
    random_count = randint(1, window_size)

    # Clamp the window to the bounds of the list
    start_word = max(0, idx - random_count)
    end_word = min(len(words), idx + random_count + 1)

    # Words before and after idx, with duplicates removed
    return list(set(words[start_word:idx] + words[idx+1:end_word]))

# test
print(get_target([0,1,2,3,4,5,6,7,8,9],4,3)) #returns a list of the words around the given index

[3, 5]


Here's a function that returns batches for our network. It grabs `batch_size` words from a list of words, and then for each of those words it gets the target words in the window. I haven't found a way to pass in a random number of target words and get it to work with the architecture, so I make one row per input-target pair. This is a generator function, which helps save memory.

In [11]:
def get_batches(words, batch_size, window_size=5):
    ''' Create a generator of word batches as a tuple (inputs, targets) '''
    
    n_batches = len(words)//batch_size
    
    # only full batches
    words = words[:n_batches*batch_size]
    
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx+batch_size]
        for ii in range(len(batch)):
            batch_x = batch[ii]
            batch_y = get_target(batch, ii, window_size)
            y.extend(batch_y)
            x.extend([batch_x]*len(batch_y))
        yield x, y
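
To see what "one row per input-target pair" looks like, here is a self-contained toy run. It repeats compact versions of `get_target` and `get_batches` so the snippet runs on its own; the list `toy` and the small `batch_size` and `window_size` values are made up for illustration.

```python
from random import randint

def get_target(words, idx, window_size=5):
    r = randint(1, window_size)              # random window radius R
    start = max(0, idx - r)
    stop = min(len(words), idx + r + 1)
    return list(set(words[start:idx] + words[idx + 1:stop]))

def get_batches(words, batch_size, window_size=5):
    n_batches = len(words) // batch_size
    words = words[:n_batches * batch_size]   # only full batches
    for idx in range(0, len(words), batch_size):
        x, y = [], []
        batch = words[idx:idx + batch_size]
        for ii in range(len(batch)):
            targets = get_target(batch, ii, window_size)
            y.extend(targets)
            x.extend([batch[ii]] * len(targets))  # repeat the input per target
        yield x, y

# Toy data where each "word" equals its position, so pairs are easy to read
toy = list(range(10))
x, y = next(get_batches(toy, batch_size=5, window_size=2))
pairs = list(zip(x, y))
print(pairs)
```

Each input word appears once for every target drawn from its window, so `x` and `y` always have the same length. Because windows are taken inside a batch, targets never cross a batch boundary.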
    

# Building the graph

In [12]:
train_graph = tf.Graph()
with train_graph.as_default():
    inputs = tf.placeholder(tf.int32, shape=[None], name="inputs")
    labels = tf.placeholder(tf.int32, shape=[None, 1], name="labels")

### Embedding
The embedding matrix has a size of the number of words by the number of units in the hidden layer. So, if you have 10,000 words and 300 hidden units, the matrix will have size $10,000 \times 300$. Remember that we're using tokenized data for our inputs, usually as integers, where the number of tokens is the number of words in our vocabulary.


TensorFlow provides a convenient function [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup) that does this lookup for us. You pass in the embedding matrix and a tensor of integers, and it returns the rows of the matrix corresponding to those integers. Below, set the number of embedding features you'll use (200 is a good start; the cell below uses 300), create the embedding matrix variable, and use `tf.nn.embedding_lookup` to get the embedding tensors. For the embedding matrix, I suggest you initialize it with uniform random numbers between -1 and 1 using [tf.random_uniform](https://www.tensorflow.org/api_docs/python/tf/random_uniform).

In [13]:
n_vocab = len(int_to_vocab)
n_embedding = 300
with train_graph.as_default():
    embedding = tf.Variable(tf.random_uniform([n_vocab, n_embedding], -1, 1))
    embed = tf.nn.embedding_lookup(embedding, inputs)
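
The lookup itself is just row selection. As a rough illustration in NumPy (not TensorFlow), the sketch below shows that indexing the matrix with integer ids gives the same result as multiplying one-hot vectors by the matrix. The sizes and the names ending in `_toy` are made up for the example.

```python
import numpy as np

n_vocab_toy, n_embedding_toy = 5, 3
rng = np.random.RandomState(0)
# Same initialization idea as above: uniform values between -1 and 1
embedding_toy = rng.uniform(-1, 1, size=(n_vocab_toy, n_embedding_toy))

token_ids = np.array([3, 0, 3])

# Lookup: pick the rows for the given ids
looked_up = embedding_toy[token_ids]            # shape (3, 3)

# Equivalent, but wasteful: one-hot encode and matrix multiply
one_hot = np.eye(n_vocab_toy)[token_ids]        # shape (3, 5)
via_matmul = one_hot @ embedding_toy

print(np.allclose(looked_up, via_matmul))
```

With a vocabulary of tens of thousands of words, skipping the one-hot multiplication saves a lot of computation, which is why the lookup is used instead.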

In [14]:
n_sampled = 100
with train_graph.as_default():
    softmax_w = tf.Variable(tf.truncated_normal([n_vocab, n_embedding], stddev=0.1), name='softmax_w')
    softmax_b = tf.Variable(tf.zeros([n_vocab]), name='softmax_b')
    
    # calculate the loss with sampled softmax (only n_sampled negative classes per step)
    loss = tf.nn.sampled_softmax_loss(softmax_w, softmax_b, labels, embed, n_sampled, n_vocab)
    
    cost = tf.reduce_mean(loss)
    optimizer = tf.train.AdamOptimizer().minimize(cost)
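
The idea behind the sampled loss is to score the true class against a small random set of other classes instead of all `n_vocab` of them. Below is a simplified NumPy sketch for a single example. It assumes uniform sampling of negatives and leaves out the sampler and correction terms that TensorFlow applies, so treat it as an illustration of the idea, not a reimplementation of `tf.nn.sampled_softmax_loss`. The function name and the toy shapes are made up.

```python
import numpy as np

def sampled_softmax_loss_sketch(weights, biases, label, hidden, n_sampled, rng):
    """Cross-entropy over the true class plus n_sampled random negatives.

    Simplification: negatives are drawn uniformly, without replacement,
    from all classes except the true one.
    """
    n_classes = weights.shape[0]
    candidates = np.delete(np.arange(n_classes), label)
    negatives = rng.choice(candidates, size=n_sampled, replace=False)
    classes = np.concatenate(([label], negatives))   # true class first

    logits = weights[classes] @ hidden + biases[classes]
    logits = logits - logits.max()                   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                             # loss for the true class

# Toy usage: 50 classes, 8 hidden units, 5 sampled negatives
rng = np.random.RandomState(0)
w_toy = rng.normal(scale=0.1, size=(50, 8))
b_toy = np.zeros(50)
h_toy = rng.normal(size=8)
toy_loss = sampled_softmax_loss_sketch(w_toy, b_toy, label=7, hidden=h_toy,
                                       n_sampled=5, rng=rng)
print(toy_loss)
```

Each step only touches `n_sampled + 1` rows of the weight matrix, which is what makes training with a large vocabulary affordable.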

### Validation

In [15]:
with train_graph.as_default():
    ## From Thushan Ganegedara's implementation
    valid_size = 16 # Random set of words to evaluate similarity on.
    valid_window = 100
    # pick 8 samples each from the ranges (0, 100) and (1000, 1100); a lower id implies a more frequent word
    valid_examples = np.array(random.sample(range(valid_window), valid_size//2))
    valid_examples = np.append(valid_examples, 
                               random.sample(range(1000,1000+valid_window), valid_size//2))

    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    
    # We use the cosine similarity (dot product of L2-normalized embeddings):
    norm = tf.sqrt(tf.reduce_sum(tf.square(embedding), 1, keep_dims=True))
    normalized_embedding = embedding / norm
    valid_embedding = tf.nn.embedding_lookup(normalized_embedding, valid_dataset)
    similarity = tf.matmul(valid_embedding, tf.transpose(normalized_embedding))
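
The same computation can be sketched in NumPy, which makes it easy to check on a tiny example: normalize each row to unit length, take dot products, and sort to find the nearest neighbors. The function name `nearest_neighbors` and the four toy vectors are assumptions for illustration.

```python
import numpy as np

def nearest_neighbors(embedding, query_ids, top_k=3):
    """Indices of the top_k most cosine-similar rows for each query row."""
    norm = np.sqrt(np.sum(np.square(embedding), axis=1, keepdims=True))
    normalized = embedding / norm
    # Rows are unit length, so the dot product is the cosine similarity
    similarity = normalized[query_ids] @ normalized.T   # (n_queries, n_vocab)
    # Sort in descending order and skip position 0, which is normally
    # the query word itself (similarity 1)
    return (-similarity).argsort(axis=1)[:, 1:top_k + 1]

toy_embedding = np.array([[1.0, 0.0],
                          [0.9, 0.1],
                          [0.0, 1.0],
                          [-1.0, 0.0]])
neighbors = nearest_neighbors(toy_embedding, np.array([0, 2]), top_k=2)
print(neighbors)
```

For row 0, the closest other row is row 1, which points in almost the same direction, followed by the orthogonal row 2. This mirrors the `argsort()[1:top_k+1]` slice used later when printing the nearest words.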

### Training the model

In [16]:
epochs = 10
batch_size = 1000
window_size = 10

with train_graph.as_default():
    saver = tf.train.Saver()

with tf.Session(graph=train_graph) as sess:
    iteration = 1
    loss = 0
    sess.run(tf.global_variables_initializer())

    for e in range(1, epochs+1):
        batches = get_batches(train_words, batch_size, window_size)
        start = time.time()
        for x, y in batches:
            
            feed = {inputs: x,
                    labels: np.array(y)[:, None]}
            train_loss, _ = sess.run([cost, optimizer], feed_dict=feed)
            
            loss += train_loss
            
            if iteration % 100 == 0: 
                end = time.time()
                print("Epoch {}/{}".format(e, epochs),
                      "Iteration: {}".format(iteration),
                      "Avg. Training loss: {:.4f}".format(loss/100),
                      "{:.4f} sec/batch".format((end-start)/100))
                loss = 0
                start = time.time()
            
            if iteration % 1000 == 0:
                ## From Thushan Ganegedara's implementation
                # note that this is expensive (~20% slowdown if computed every 500 steps)
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = int_to_vocab[valid_examples[i]]
                    top_k = 8 # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k+1]
                    log = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = int_to_vocab[nearest[k]]
                        log = '%s %s,' % (log, close_word)
                    print(log)
            
            iteration += 1
    save_path = saver.save(sess, "checkpoints/text8.ckpt")
    embed_mat = sess.run(normalized_embedding)

Epoch 1/10 Iteration: 100 Avg. Training loss: 6.1234 0.0638 sec/batch
Epoch 1/10 Iteration: 200 Avg. Training loss: 6.0669 0.0331 sec/batch
Epoch 1/10 Iteration: 300 Avg. Training loss: 5.9736 0.0320 sec/batch
Epoch 1/10 Iteration: 400 Avg. Training loss: 5.9441 0.0319 sec/batch
Epoch 1/10 Iteration: 500 Avg. Training loss: 5.8015 0.0344 sec/batch
Epoch 1/10 Iteration: 600 Avg. Training loss: 5.8289 0.0346 sec/batch
Epoch 1/10 Iteration: 700 Avg. Training loss: 5.7666 0.0347 sec/batch
Epoch 1/10 Iteration: 800 Avg. Training loss: 5.6403 0.0360 sec/batch
Epoch 1/10 Iteration: 900 Avg. Training loss: 5.5042 0.0353 sec/batch
Epoch 1/10 Iteration: 1000 Avg. Training loss: 5.4067 0.0344 sec/batch
Nearest to for: witch, nugent, iir, alexius, demonstrator, strychnine, keywork, ans,
Nearest to from: kepulauan, firepower, conquistador, reuptake, dylan, penzias, publicised, zbigniew,
Nearest to had: restricted, doo, archaeoastronomy, autobiographic, informative, tunable, airways, nautical,
Neare

Epoch 1/10 Iteration: 4100 Avg. Training loss: 4.3564 0.0362 sec/batch
Epoch 1/10 Iteration: 4200 Avg. Training loss: 4.3394 0.0353 sec/batch
Epoch 1/10 Iteration: 4300 Avg. Training loss: 4.3404 0.0346 sec/batch
Epoch 1/10 Iteration: 4400 Avg. Training loss: 4.3101 0.0347 sec/batch
Epoch 1/10 Iteration: 4500 Avg. Training loss: 4.1879 0.0369 sec/batch
Epoch 1/10 Iteration: 4600 Avg. Training loss: 4.3198 0.0352 sec/batch
Epoch 1/10 Iteration: 4700 Avg. Training loss: 4.3227 0.0350 sec/batch
Epoch 1/10 Iteration: 4800 Avg. Training loss: 4.3184 0.0368 sec/batch
Epoch 1/10 Iteration: 4900 Avg. Training loss: 4.2722 0.0336 sec/batch
Epoch 1/10 Iteration: 5000 Avg. Training loss: 4.2662 0.0351 sec/batch
Nearest to for: witch, nugent, iir, alexius, demonstrator, strychnine, keywork, afer,
Nearest to from: firepower, kepulauan, reuptake, conquistador, penzias, dylan, coburn, zbigniew,
Nearest to had: restricted, archaeoastronomy, doo, autobiographic, tunable, airways, informative, strumming

Epoch 2/10 Iteration: 8100 Avg. Training loss: 4.0944 0.0353 sec/batch
Epoch 2/10 Iteration: 8200 Avg. Training loss: 4.0067 0.0347 sec/batch
Epoch 2/10 Iteration: 8300 Avg. Training loss: 4.0338 0.0347 sec/batch
Epoch 2/10 Iteration: 8400 Avg. Training loss: 4.0346 0.0350 sec/batch
Epoch 2/10 Iteration: 8500 Avg. Training loss: 4.0432 0.0349 sec/batch
Epoch 2/10 Iteration: 8600 Avg. Training loss: 3.9796 0.0343 sec/batch
Epoch 2/10 Iteration: 8700 Avg. Training loss: 4.0359 0.0348 sec/batch
Epoch 2/10 Iteration: 8800 Avg. Training loss: 4.0053 0.0368 sec/batch
Epoch 2/10 Iteration: 8900 Avg. Training loss: 3.9966 0.0349 sec/batch
Epoch 2/10 Iteration: 9000 Avg. Training loss: 3.8759 0.0358 sec/batch
Nearest to for: witch, nugent, iir, alexius, demonstrator, strychnine, keywork, ans,
Nearest to from: firepower, reuptake, kepulauan, penzias, blinking, conquistador, headlined, dylan,
Nearest to had: archaeoastronomy, autobiographic, strumming, barclay, doo, tyrant, restricted, pollack,
N

Epoch 2/10 Iteration: 12100 Avg. Training loss: 3.9470 0.0353 sec/batch
Epoch 2/10 Iteration: 12200 Avg. Training loss: 3.9788 0.0360 sec/batch
Epoch 2/10 Iteration: 12300 Avg. Training loss: 3.8730 0.0360 sec/batch
Epoch 2/10 Iteration: 12400 Avg. Training loss: 3.8465 0.0354 sec/batch
Epoch 2/10 Iteration: 12500 Avg. Training loss: 3.9734 0.0341 sec/batch
Epoch 2/10 Iteration: 12600 Avg. Training loss: 3.9662 0.0353 sec/batch
Epoch 2/10 Iteration: 12700 Avg. Training loss: 4.0143 0.0353 sec/batch
Epoch 2/10 Iteration: 12800 Avg. Training loss: 3.8832 0.0356 sec/batch
Epoch 2/10 Iteration: 12900 Avg. Training loss: 3.9637 0.0355 sec/batch
Epoch 2/10 Iteration: 13000 Avg. Training loss: 3.9232 0.0366 sec/batch
Nearest to for: witch, nugent, iir, alexius, demonstrator, fashion, strychnine, afer,
Nearest to from: firepower, reuptake, kepulauan, penzias, dylan, ambitious, blinking, multipole,
Nearest to had: archaeoastronomy, autobiographic, barclay, strumming, tyrant, disparage, fourfold

Epoch 3/10 Iteration: 16100 Avg. Training loss: 3.8720 0.0357 sec/batch
Epoch 3/10 Iteration: 16200 Avg. Training loss: 3.8289 0.0348 sec/batch
Epoch 3/10 Iteration: 16300 Avg. Training loss: 3.8500 0.0365 sec/batch
Epoch 3/10 Iteration: 16400 Avg. Training loss: 3.8370 0.0358 sec/batch
Epoch 3/10 Iteration: 16500 Avg. Training loss: 3.8355 0.0370 sec/batch
Epoch 3/10 Iteration: 16600 Avg. Training loss: 3.8819 0.0365 sec/batch
Epoch 3/10 Iteration: 16700 Avg. Training loss: 3.7966 0.0343 sec/batch
Epoch 3/10 Iteration: 16800 Avg. Training loss: 3.8570 0.0358 sec/batch
Epoch 3/10 Iteration: 16900 Avg. Training loss: 3.7412 0.0353 sec/batch
Epoch 3/10 Iteration: 17000 Avg. Training loss: 3.8463 0.0352 sec/batch
Nearest to for: nugent, witch, iir, demonstrator, alexius, fashion, ans, strychnine,
Nearest to from: firepower, reuptake, kepulauan, penzias, ambitious, pentagon, procreation, blinking,
Nearest to had: archaeoastronomy, autobiographic, barclay, strumming, fourfold, disparage, ty

Epoch 3/10 Iteration: 20100 Avg. Training loss: 3.7994 0.0356 sec/batch
Epoch 3/10 Iteration: 20200 Avg. Training loss: 3.7071 0.0343 sec/batch
Epoch 3/10 Iteration: 20300 Avg. Training loss: 3.8829 0.0330 sec/batch
Epoch 3/10 Iteration: 20400 Avg. Training loss: 3.8205 0.0382 sec/batch
Epoch 3/10 Iteration: 20500 Avg. Training loss: 3.9097 0.0353 sec/batch
Epoch 3/10 Iteration: 20600 Avg. Training loss: 3.8113 0.0360 sec/batch
Epoch 3/10 Iteration: 20700 Avg. Training loss: 3.8321 0.0352 sec/batch
Epoch 3/10 Iteration: 20800 Avg. Training loss: 3.8251 0.0351 sec/batch
Epoch 3/10 Iteration: 20900 Avg. Training loss: 3.8238 0.0382 sec/batch
Epoch 3/10 Iteration: 21000 Avg. Training loss: 3.8614 0.0360 sec/batch
Nearest to for: nugent, iir, witch, alexius, demonstrator, fashion, ans, strychnine,
Nearest to from: firepower, reuptake, kepulauan, pentagon, penzias, multipole, ambitious, dylan,
Nearest to had: archaeoastronomy, disparage, strumming, autobiographic, fourfold, barclay, tyrant,

Epoch 4/10 Iteration: 24100 Avg. Training loss: 3.7812 0.0358 sec/batch
Epoch 4/10 Iteration: 24200 Avg. Training loss: 3.7949 0.0353 sec/batch
Epoch 4/10 Iteration: 24300 Avg. Training loss: 3.7468 0.0378 sec/batch
Epoch 4/10 Iteration: 24400 Avg. Training loss: 3.8072 0.0341 sec/batch
Epoch 4/10 Iteration: 24500 Avg. Training loss: 3.7908 0.0360 sec/batch
Epoch 4/10 Iteration: 24600 Avg. Training loss: 3.7472 0.0364 sec/batch
Epoch 4/10 Iteration: 24700 Avg. Training loss: 3.6304 0.0341 sec/batch
Epoch 4/10 Iteration: 24800 Avg. Training loss: 3.7959 0.0367 sec/batch
Epoch 4/10 Iteration: 24900 Avg. Training loss: 3.8140 0.0345 sec/batch
Epoch 4/10 Iteration: 25000 Avg. Training loss: 3.7928 0.0353 sec/batch
Nearest to for: nugent, iir, witch, demonstrator, alexius, fashion, strychnine, keywork,
Nearest to from: firepower, reuptake, ambitious, kepulauan, pentagon, multipole, dylan, lightest,
Nearest to had: archaeoastronomy, disparage, barclay, fourfold, autobiographic, strumming, ty

Epoch 4/10 Iteration: 28100 Avg. Training loss: 3.7133 0.0372 sec/batch
Epoch 4/10 Iteration: 28200 Avg. Training loss: 3.7792 0.0357 sec/batch
Epoch 4/10 Iteration: 28300 Avg. Training loss: 3.7924 0.0357 sec/batch
Epoch 4/10 Iteration: 28400 Avg. Training loss: 3.8388 0.0365 sec/batch
Epoch 4/10 Iteration: 28500 Avg. Training loss: 3.6954 0.0359 sec/batch
Epoch 4/10 Iteration: 28600 Avg. Training loss: 3.7842 0.0352 sec/batch
Epoch 4/10 Iteration: 28700 Avg. Training loss: 3.7640 0.0357 sec/batch
Epoch 4/10 Iteration: 28800 Avg. Training loss: 3.7973 0.0360 sec/batch
Epoch 4/10 Iteration: 28900 Avg. Training loss: 3.7823 0.0347 sec/batch
Epoch 4/10 Iteration: 29000 Avg. Training loss: 3.7974 0.0350 sec/batch
Nearest to for: iir, nugent, witch, alexius, demonstrator, extraterrestrials, fashion, keywork,
Nearest to from: firepower, reuptake, kepulauan, pentagon, ambitious, multipole, headlined, dylan,
Nearest to had: archaeoastronomy, disparage, strumming, autobiographic, fourfold, vis

Epoch 5/10 Iteration: 32100 Avg. Training loss: 3.7290 0.0342 sec/batch
Epoch 5/10 Iteration: 32200 Avg. Training loss: 3.7173 0.0358 sec/batch
Epoch 5/10 Iteration: 32300 Avg. Training loss: 3.7570 0.0349 sec/batch
Epoch 5/10 Iteration: 32400 Avg. Training loss: 3.7368 0.0360 sec/batch
Epoch 5/10 Iteration: 32500 Avg. Training loss: 3.7461 0.0348 sec/batch
Epoch 5/10 Iteration: 32600 Avg. Training loss: 3.6467 0.0361 sec/batch
Epoch 5/10 Iteration: 32700 Avg. Training loss: 3.7478 0.0344 sec/batch
Epoch 5/10 Iteration: 32800 Avg. Training loss: 3.7530 0.0343 sec/batch
Epoch 5/10 Iteration: 32900 Avg. Training loss: 3.7563 0.0352 sec/batch
Epoch 5/10 Iteration: 33000 Avg. Training loss: 3.7585 0.0343 sec/batch
Nearest to for: iir, nugent, demonstrator, witch, alexius, lard, strychnine, keywork,
Nearest to from: firepower, reuptake, dylan, kepulauan, pentagon, multipole, eurofighter, comneni,
Nearest to had: archaeoastronomy, strumming, disparage, autobiographic, fourfold, seigneurial, 

Epoch 5/10 Iteration: 36100 Avg. Training loss: 3.7608 0.0356 sec/batch
Epoch 5/10 Iteration: 36200 Avg. Training loss: 3.7829 0.0363 sec/batch
Epoch 5/10 Iteration: 36300 Avg. Training loss: 3.7562 0.0361 sec/batch
Epoch 5/10 Iteration: 36400 Avg. Training loss: 3.6961 0.0351 sec/batch
Epoch 5/10 Iteration: 36500 Avg. Training loss: 3.7667 0.0366 sec/batch
Epoch 5/10 Iteration: 36600 Avg. Training loss: 3.7268 0.0363 sec/batch
Epoch 5/10 Iteration: 36700 Avg. Training loss: 3.7499 0.0352 sec/batch
Epoch 5/10 Iteration: 36800 Avg. Training loss: 3.7555 0.0366 sec/batch
Epoch 5/10 Iteration: 36900 Avg. Training loss: 3.6581 0.0366 sec/batch
Epoch 5/10 Iteration: 37000 Avg. Training loss: 3.6976 0.0334 sec/batch
Nearest to for: iir, nugent, alexius, witch, demonstrator, lard, strychnine, fashion,
Nearest to from: firepower, reuptake, dylan, kepulauan, comneni, discontinuation, multipole, headlined,
Nearest to had: archaeoastronomy, disparage, strumming, fourfold, airways, restricted, exp

Epoch 6/10 Iteration: 40100 Avg. Training loss: 3.7613 0.0350 sec/batch
Epoch 6/10 Iteration: 40200 Avg. Training loss: 3.7082 0.0367 sec/batch
Epoch 6/10 Iteration: 40300 Avg. Training loss: 3.6969 0.0363 sec/batch
Epoch 6/10 Iteration: 40400 Avg. Training loss: 3.5749 0.0345 sec/batch
Epoch 6/10 Iteration: 40500 Avg. Training loss: 3.7213 0.0366 sec/batch
Epoch 6/10 Iteration: 40600 Avg. Training loss: 3.7187 0.0358 sec/batch
Epoch 6/10 Iteration: 40700 Avg. Training loss: 3.7082 0.0357 sec/batch
Epoch 6/10 Iteration: 40800 Avg. Training loss: 3.7153 0.0360 sec/batch
Epoch 6/10 Iteration: 40900 Avg. Training loss: 3.7799 0.0357 sec/batch
Epoch 6/10 Iteration: 41000 Avg. Training loss: 3.7472 0.0355 sec/batch
Nearest to for: iir, nugent, alexius, demonstrator, lard, witch, strychnine, keywork,
Nearest to from: firepower, reuptake, dylan, comneni, eurofighter, ambitious, kepulauan, multipole,
Nearest to had: archaeoastronomy, strumming, disparage, seigneurial, visionary, fourfold, auto

Epoch 6/10 Iteration: 44100 Avg. Training loss: 3.7734 0.0353 sec/batch
Epoch 6/10 Iteration: 44200 Avg. Training loss: 3.6624 0.0346 sec/batch
Epoch 6/10 Iteration: 44300 Avg. Training loss: 3.7238 0.0353 sec/batch
Epoch 6/10 Iteration: 44400 Avg. Training loss: 3.6918 0.0365 sec/batch
Epoch 6/10 Iteration: 44500 Avg. Training loss: 3.7301 0.0349 sec/batch
Epoch 6/10 Iteration: 44600 Avg. Training loss: 3.7247 0.0364 sec/batch
Epoch 6/10 Iteration: 44700 Avg. Training loss: 3.7383 0.0356 sec/batch
Epoch 6/10 Iteration: 44800 Avg. Training loss: 3.6120 0.0357 sec/batch
Epoch 6/10 Iteration: 44900 Avg. Training loss: 3.6731 0.0345 sec/batch
Epoch 6/10 Iteration: 45000 Avg. Training loss: 3.6886 0.0353 sec/batch
Nearest to for: iir, nugent, alexius, lard, demonstrator, witch, strychnine, minimize,
Nearest to from: firepower, reuptake, comneni, dylan, multipole, discontinuation, eurofighter, headlined,
Nearest to had: archaeoastronomy, strumming, disparage, visionary, explicated, seigneur

Epoch 7/10 Iteration: 48100 Avg. Training loss: 3.6934 0.0377 sec/batch
Epoch 7/10 Iteration: 48200 Avg. Training loss: 3.6731 0.0350 sec/batch
Epoch 7/10 Iteration: 48300 Avg. Training loss: 3.5617 0.0347 sec/batch
Epoch 7/10 Iteration: 48400 Avg. Training loss: 3.7020 0.0350 sec/batch
Epoch 7/10 Iteration: 48500 Avg. Training loss: 3.7072 0.0348 sec/batch
Epoch 7/10 Iteration: 48600 Avg. Training loss: 3.7162 0.0351 sec/batch
Epoch 7/10 Iteration: 48700 Avg. Training loss: 3.7064 0.0356 sec/batch
Epoch 7/10 Iteration: 48800 Avg. Training loss: 3.7556 0.0355 sec/batch
Epoch 7/10 Iteration: 48900 Avg. Training loss: 3.6902 0.0361 sec/batch
Epoch 7/10 Iteration: 49000 Avg. Training loss: 3.5829 0.0342 sec/batch
Nearest to for: iir, nugent, lard, witch, demonstrator, alexius, strychnine, keywork,
Nearest to from: firepower, reuptake, comneni, dylan, eurofighter, discontinuation, multipole, kepulauan,
Nearest to had: archaeoastronomy, strumming, fourfold, disparage, airways, seigneurial, 

Epoch 7/10 Iteration: 52100 Avg. Training loss: 3.6767 0.0362 sec/batch
Epoch 7/10 Iteration: 52200 Avg. Training loss: 3.7169 0.0357 sec/batch
Epoch 7/10 Iteration: 52300 Avg. Training loss: 3.6768 0.0359 sec/batch
Epoch 7/10 Iteration: 52400 Avg. Training loss: 3.7370 0.0355 sec/batch
Epoch 7/10 Iteration: 52500 Avg. Training loss: 3.7325 0.0358 sec/batch
Epoch 7/10 Iteration: 52600 Avg. Training loss: 3.6463 0.0351 sec/batch
Epoch 7/10 Iteration: 52700 Avg. Training loss: 3.6144 0.0339 sec/batch
Epoch 7/10 Iteration: 52800 Avg. Training loss: 3.6278 0.0356 sec/batch
Epoch 7/10 Iteration: 52900 Avg. Training loss: 3.6718 0.0346 sec/batch
Epoch 7/10 Iteration: 53000 Avg. Training loss: 3.6860 0.0351 sec/batch
Nearest to for: iir, nugent, lard, fashion, alexius, demonstrator, strychnine, witch,
Nearest to from: firepower, reuptake, comneni, dylan, eurofighter, discontinuation, kepulauan, bearing,
Nearest to had: archaeoastronomy, strumming, disparage, fourfold, explicated, seigneurial,

Epoch 8/10 Iteration: 56100 Avg. Training loss: 3.5320 0.0371 sec/batch
Epoch 8/10 Iteration: 56200 Avg. Training loss: 3.6934 0.0349 sec/batch
Epoch 8/10 Iteration: 56300 Avg. Training loss: 3.7037 0.0349 sec/batch
Epoch 8/10 Iteration: 56400 Avg. Training loss: 3.6885 0.0348 sec/batch
Epoch 8/10 Iteration: 56500 Avg. Training loss: 3.6793 0.0370 sec/batch
Epoch 8/10 Iteration: 56600 Avg. Training loss: 3.7452 0.0349 sec/batch
Epoch 8/10 Iteration: 56700 Avg. Training loss: 3.7300 0.0344 sec/batch
Epoch 8/10 Iteration: 56800 Avg. Training loss: 3.6100 0.0358 sec/batch
Epoch 8/10 Iteration: 56900 Avg. Training loss: 3.5187 0.0365 sec/batch
Epoch 8/10 Iteration: 57000 Avg. Training loss: 3.6284 0.0348 sec/batch
Nearest to for: iir, nugent, lard, fashion, demonstrator, keywork, minimize, alexius,
Nearest to from: firepower, reuptake, eurofighter, dylan, comneni, kepulauan, discontinuation, cyclopes,
Nearest to had: archaeoastronomy, strumming, fourfold, disparage, airways, eschewed, seig

Epoch 8/10 Iteration: 60100 Avg. Training loss: 3.6466 0.0367 sec/batch
Epoch 8/10 Iteration: 60200 Avg. Training loss: 3.6911 0.0348 sec/batch
Epoch 8/10 Iteration: 60300 Avg. Training loss: 3.6952 0.0351 sec/batch
Epoch 8/10 Iteration: 60400 Avg. Training loss: 3.6811 0.0351 sec/batch
Epoch 8/10 Iteration: 60500 Avg. Training loss: 3.6071 0.0350 sec/batch
Epoch 8/10 Iteration: 60600 Avg. Training loss: 3.6534 0.0358 sec/batch
Epoch 8/10 Iteration: 60700 Avg. Training loss: 3.6396 0.0363 sec/batch
Epoch 8/10 Iteration: 60800 Avg. Training loss: 3.6565 0.0346 sec/batch
Epoch 8/10 Iteration: 60900 Avg. Training loss: 3.6685 0.0360 sec/batch
Epoch 8/10 Iteration: 61000 Avg. Training loss: 3.5392 0.0362 sec/batch
Nearest to for: iir, nugent, lard, fashion, demonstrator, minimize, wandering, keywork,
Nearest to from: firepower, reuptake, comneni, eurofighter, cyclopes, discontinuation, dylan, multipole,
Nearest to had: archaeoastronomy, strumming, disparage, fourfold, autobiographic, umbri

Epoch 9/10 Iteration: 64100 Avg. Training loss: 3.6723 0.0364 sec/batch
Epoch 9/10 Iteration: 64200 Avg. Training loss: 3.6650 0.0351 sec/batch
Epoch 9/10 Iteration: 64300 Avg. Training loss: 3.6935 0.0350 sec/batch
Epoch 9/10 Iteration: 64400 Avg. Training loss: 3.6763 0.0358 sec/batch
Epoch 9/10 Iteration: 64500 Avg. Training loss: 3.7377 0.0354 sec/batch
Epoch 9/10 Iteration: 64600 Avg. Training loss: 3.6858 0.0352 sec/batch
Epoch 9/10 Iteration: 64700 Avg. Training loss: 3.5559 0.0359 sec/batch
Epoch 9/10 Iteration: 64800 Avg. Training loss: 3.4900 0.0355 sec/batch
Epoch 9/10 Iteration: 64900 Avg. Training loss: 3.6621 0.0357 sec/batch
Epoch 9/10 Iteration: 65000 Avg. Training loss: 3.6412 0.0352 sec/batch
Nearest to for: iir, lard, nugent, fashion, demonstrator, minimize, keywork, wandering,
Nearest to from: firepower, reuptake, hesitation, eurofighter, cyclopes, dylan, comneni, discontinuation,
Nearest to had: archaeoastronomy, strumming, fourfold, disparage, umbria, eschewed, au

Epoch 9/10 Iteration: 68100 Avg. Training loss: 3.6873 0.0369 sec/batch
Epoch 9/10 Iteration: 68200 Avg. Training loss: 3.6900 0.0363 sec/batch
Epoch 9/10 Iteration: 68300 Avg. Training loss: 3.6552 0.0353 sec/batch
Epoch 9/10 Iteration: 68400 Avg. Training loss: 3.5877 0.0368 sec/batch
Epoch 9/10 Iteration: 68500 Avg. Training loss: 3.6169 0.0365 sec/batch
Epoch 9/10 Iteration: 68600 Avg. Training loss: 3.6101 0.0357 sec/batch
Epoch 9/10 Iteration: 68700 Avg. Training loss: 3.6784 0.0362 sec/batch
Epoch 9/10 Iteration: 68800 Avg. Training loss: 3.5083 0.0369 sec/batch
Epoch 9/10 Iteration: 68900 Avg. Training loss: 3.6445 0.0362 sec/batch
Epoch 9/10 Iteration: 69000 Avg. Training loss: 3.7218 0.0360 sec/batch
Nearest to for: iir, nugent, lard, fashion, demonstrator, minimize, wandering, keywork,
Nearest to from: firepower, cyclopes, reuptake, comneni, eurofighter, dylan, discontinuation, hesitation,
Nearest to had: archaeoastronomy, strumming, disparage, fourfold, autobiographic, dige

Epoch 10/10 Iteration: 72100 Avg. Training loss: 3.6673 0.0347 sec/batch
Epoch 10/10 Iteration: 72200 Avg. Training loss: 3.6608 0.0350 sec/batch
Epoch 10/10 Iteration: 72300 Avg. Training loss: 3.6818 0.0363 sec/batch
Epoch 10/10 Iteration: 72400 Avg. Training loss: 3.7369 0.0344 sec/batch
Epoch 10/10 Iteration: 72500 Avg. Training loss: 3.6000 0.0364 sec/batch
Epoch 10/10 Iteration: 72600 Avg. Training loss: 3.4824 0.0339 sec/batch
Epoch 10/10 Iteration: 72700 Avg. Training loss: 3.6311 0.0357 sec/batch
Epoch 10/10 Iteration: 72800 Avg. Training loss: 3.6334 0.0349 sec/batch
Epoch 10/10 Iteration: 72900 Avg. Training loss: 3.6161 0.0367 sec/batch
Epoch 10/10 Iteration: 73000 Avg. Training loss: 3.5967 0.0359 sec/batch
Nearest to for: iir, lard, nugent, keywork, fashion, demonstrator, wandering, minimize,
Nearest to from: firepower, cyclopes, reuptake, comneni, discontinuation, dylan, hesitation, eurofighter,
Nearest to had: archaeoastronomy, strumming, fourfold, autobiographic, esche

Epoch 10/10 Iteration: 76100 Avg. Training loss: 3.6700 0.0343 sec/batch
Epoch 10/10 Iteration: 76200 Avg. Training loss: 3.5646 0.0363 sec/batch
Epoch 10/10 Iteration: 76300 Avg. Training loss: 3.6374 0.0343 sec/batch
Epoch 10/10 Iteration: 76400 Avg. Training loss: 3.6004 0.0368 sec/batch
Epoch 10/10 Iteration: 76500 Avg. Training loss: 3.6211 0.0344 sec/batch
Epoch 10/10 Iteration: 76600 Avg. Training loss: 3.6636 0.0356 sec/batch
Epoch 10/10 Iteration: 76700 Avg. Training loss: 3.5300 0.0367 sec/batch
Epoch 10/10 Iteration: 76800 Avg. Training loss: 3.6380 0.0363 sec/batch
Epoch 10/10 Iteration: 76900 Avg. Training loss: 3.7469 0.0347 sec/batch
Epoch 10/10 Iteration: 77000 Avg. Training loss: 3.6927 0.0353 sec/batch
Nearest to for: lard, iir, nugent, fashion, demonstrator, keywork, wandering, minimize,
Nearest to from: firepower, cyclopes, reuptake, comneni, dylan, hesitation, discontinuation, rancho,
Nearest to had: archaeoastronomy, strumming, fourfold, diger, autobiographic, esc