# Lab 6: Word2Vec from scratch in TensorFlow

In this lab, we'll write Word2Vec, a particularly popular, simple, and powerful model for unsupervised learning of embedding vectors for words.

## Understanding Word2Vec
Word2Vec trains embeddings by using them as input to a logistic regression model, trained to predict a word given nearby words.

Alternatively, you can think of Word2Vec as a neural network with linear activations and one hidden layer, trained to predict a word given nearby words.
In this interpretation, the input and output of the model are both one-hot encoded vectors, and the embeddings of the words are the activations of the hidden layer given that word's one-hot encoding as input.
Since the inputs are one-hot encoded, it's more efficient (but equivalent) to use an embedding lookup instead of a matrix multiply for the first layer.
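
Here's a minimal NumPy sketch of that equivalence. The vocabulary size, embedding size, and token value are toy numbers I made up for illustration; they are not the lab's values.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 5, 3      # toy sizes, chosen for illustration
embeddings = rng.normal(size=(vocab_size, embedding_size))

token = 2                              # integer encoding of some word
one_hot = np.zeros(vocab_size)
one_hot[token] = 1.0

via_matmul = one_hot @ embeddings      # first layer as a matrix multiply
via_lookup = embeddings[token]         # same result by indexing a single row

print(np.allclose(via_matmul, via_lookup))
```

The lookup touches one row of the table, while the matrix multiply touches every row only to multiply most of them by zero.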

The data is produced by taking any large body of text (e.g. Wikipedia) and creating (context, target) pairs, where the target is any word and the context is the $n$ words to its left and right.

There are two common ways of training Word2Vec:
 - The "skip-gram" model uses the target word as input and predicts context words.
 - The "continuous bag-of-words" (CBOW) model uses the context words as input and predicts the target word.
 
We'll focus on the CBOW model, since it tends to work better for small datasets.

The process of generating the data for CBOW is as follows:
 1. Tokenize the dataset: assign each distinct word a unique integer "token" (its integer encoding), so that every word, which starts out as a string, is replaced by an integer.
 2. For each word in the dataset, create a single (`context`, `target`) pair, where `target` is the integer encoding of the word and `context` is a list of the integer encodings of the $n=1$ words to the left and right of the target word

(I've already written the code to do this for you below)
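
To make step 2 concrete, here's a small pure-Python sketch on a made-up four-word line. The helper name `make_pairs` and the token values are hypothetical. Unlike the preprocessing cell in Section 0.3, it uses every position that has a full window rather than skipping positions with a stride.

```python
def make_pairs(tokens, window_size=1):
    """Return one (context, target) pair per position that has a full window."""
    pairs = []
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

# Hypothetical integer encoding of a four-word line
toy_line = [1, 2, 3, 4]
pairs = make_pairs(toy_line)
print(pairs)
```

The first and last words never become targets here, because they don't have a neighbor on both sides.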
 
Then, the model is trained to predict the one-hot encoding of the target word given the integer encodings of the context words:
 1. For each context word, look up its embedding in a table
 2. Combine these into an "average context embedding," which is the depth-wise average of the embeddings of the individual context words. This results in a single embedding vector, which acts as the "total context" in some sense.
 3. Perform a logistic regression (equivalently, a single dense layer with softmax activation) to predict the target word using the average context embedding as input.
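
The three steps above can be sketched in NumPy as follows. This is only an illustration of the shapes involved, with toy sizes and random parameters that I made up; it computes the full softmax, which is exactly the expensive part we'll avoid in the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_size = 6, 4        # toy sizes, chosen for illustration

embeddings = rng.normal(size=(vocab_size, embedding_size))
weights = rng.normal(size=(vocab_size, embedding_size))  # one row per output word
biases = np.zeros(vocab_size)

context = np.array([[1, 3], [2, 5]])     # (batch_size, 2) integer encodings

# 1. Look up each context word's embedding: (batch_size, 2, embedding_size)
context_embeddings = embeddings[context]
# 2. Average over the context axis: (batch_size, embedding_size)
average_context = context_embeddings.mean(axis=1)
# 3. Softmax regression over the whole vocabulary: (batch_size, vocab_size)
logits = average_context @ weights.T + biases
logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(probs.shape, probs.sum(axis=1))
```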

Instead of full logistic regression, which uses the softmax function over all of the many words that appear in the dataset, we will use candidate sampling (specifically, noise-contrastive estimation) to compute the loss.
This should speed up training significantly.

For more info on Word2Vec, see [TensorFlow's "Vector Representations of Words" tutorial](https://www.tensorflow.org/tutorials/representation/word2vec), or Alex Minnaar's two tutorials, [one on the skip-gram model](http://alexminnaar.com/2015/04/12/word2vec-tutorial-skipgram.html) and [the other on the CBOW model](http://alexminnaar.com/2015/05/18/word2vec-tutorial-continuousbow.html).

## Section 0: Download and preprocess the data
The dataset is the Cornell Movie-Dialogs Corpus, a collection of dialogue from movie scripts.
Download it from the link [here](http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html), then unzip it in a subfolder called "data".
Your directory tree should include the file `./data/cornell movie-dialogs corpus/movie_lines.txt`.

I've written all the code for loading and preprocessing the data below, but read through it to understand what's going on.
You'll need to use parts of it later.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras

### 0.1: Read the data
First, we read the lines as separate strings from the text file.

If you run out of RAM while working on this assignment, you can increase `skip_header`, which skips that many lines at the start of the file, to train on a smaller dataset (but the resulting embeddings will be of lower quality).
There are about 300,000 lines total, so you can set `skip_header` to skip some percentage of those.

In [None]:
# Ignore warnings on incorrectly-formatted inputs
import warnings
warnings.filterwarnings('ignore')

_, _, _, _, lines = np.genfromtxt(
    './data/cornell movie-dialogs corpus/movie_lines.txt',
    dtype='<U128', 
    delimiter='+++$+++', autostrip=True,
    encoding='latin1', invalid_raise=False, unpack=True,
    skip_header=0)

### 0.2: Tokenize
Keras has a built-in tokenization utility, which we'll use.
This creates two dictionaries, which you'll need to use later:
 - `tokenizer.word_index` maps from string words to integer tokens
 - `tokenizer.index_word` maps from integer tokens to string words

Then, we convert the dialogue lines into lists of tokens.
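
To make the two dictionaries concrete, here's a pure-Python sketch of the same idea on two made-up lines. It is not the Keras implementation (which, among other things, also lowercases text and strips punctuation by default); it only mirrors the convention that more frequent words get smaller integers and that 0 is left unused.

```python
from collections import Counter

toy_lines = ["yes yes no", "no maybe yes"]   # made-up stand-in for the corpus

# Count words, then give the most frequent word the smallest index.
# Index 0 is left unused, so real words start at 1.
counts = Counter(word for line in toy_lines for word in line.split())
word_index = {word: i
              for i, (word, _) in enumerate(counts.most_common(), start=1)}
index_word = {i: word for word, i in word_index.items()}

# Convert each line to a list of integer tokens
tokenized = [[word_index[w] for w in line.split()] for line in toy_lines]

print(word_index)
print(tokenized)
```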

In [None]:
# Create a tokenizer and assign each word an integer
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)

# Convert the lines to lists of integers
tokenized_lines = tokenizer.texts_to_sequences(lines)

# Delete the original lines to save memory
del lines

# Here's an example of how to use word_index and index_word
print('The integer encoding of "yes" is:',
      tokenizer.word_index['yes'])
print('The word with integer encoding 10 is:',
      tokenizer.index_word[10])

### 0.3: Create (context, target) pairs
In this case, "target" means the integer encoding of a word, and "context" means a list of the integer encodings of the words to its left and right.

In [None]:
window_size = 1  # Consider 1 word on each side of target
stride = 2       # Avoid window overlap

X = []
Y = []

# Add (context, target) pairs to the dataset
for line in tokenized_lines:
    # Do not use lines that are too short
    if len(line) < 2 * window_size + 1:
        continue
    
    for target_idx in range(window_size, 
                            len(line) - window_size, 
                            stride):
        target = line[target_idx]
        left_context = line[target_idx - window_size : 
                            target_idx]
        right_context = line[target_idx + 1 :
                             target_idx + window_size + 1]
        
        Y.append(target)
        X.append(left_context + right_context)

# Convert to ndarrays
X = np.asarray(X)
Y = np.asarray(Y)

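# These constants may be useful for you later.
# Tokens run from 1 to the largest index (0 is reserved), so a table
# indexed by token needs largest index + 1 rows.
n_words = max(tokenizer.word_index.values()) + 1
n_examples = len(X)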

print('n_words:', n_words)
print('n_examples:', n_examples)

### 0.4: Make a TSV metadata file
This creates a metadata file in the format that the TensorBoard Projector uses (tab-separated values).
This will let us see the names for words later when we visualize the embedding.

In [None]:
import os

# Make a logs directory if none exists yet
os.makedirs('./logs', exist_ok=True)

# Make TSV metadata file
with open('./logs/metadata.tsv', 'w') as f:
    # Header specifies column name
    f.write('Word\tIndex\tCount\n')
    
    # Unrecognized values have an integer encoding of 0
    f.write('UNK\t0\t0\n')
    
    # One word per line
    for i in range(1, n_words):
        word = tokenizer.index_word[i]
        count = tokenizer.word_counts[word]
        # Column order must match the header: Word, Index, Count
        f.write('{}\t{}\t{}\n'.format(word, i, count))

### 0.5: Build a TensorFlow data pipeline
I've set up the `tf.data.Dataset` and the `tf.summary.SummaryWriter` for you.
Feel free to change the training hyperparameters and transforms if you like.

You might want to try:
 - Changing the batch size to suit your RAM / GPU memory situation
 - Removing `.cache()` from the transforms if you can't fit the whole dataset in RAM
 - Changing `n_negative_samples`. Increasing it makes computing the loss slower but more accurate per batch.

In [None]:
# Training hyperparameters
n_epochs = 10
batch_size = 256
# Round up: the last batch of an epoch may be smaller than batch_size
n_batches_per_epoch = int(np.ceil(n_examples / batch_size))
print('Batches per epoch:', n_batches_per_epoch)

# Number of wrong words to sample when computing
# the loss using noise-contrastive estimation.
n_negative_samples = 128

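# Construct dataset and apply transforms.
# Cache before shuffling so that every epoch is reshuffled;
# caching after the shuffle would replay one fixed order.
dataset = tf.data.Dataset.from_tensor_slices((X, Y)) \
    .cache() \
    .shuffle(1000) \
    .batch(batch_size) \
    .repeat(n_epochs)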

writer = tf.summary.create_file_writer('./logs')

## Section 1: Build a model graph

### 1.0: Model hyperparameters
`embedding_size` is the number of dimensions in the embedding vector of each word.
Try to train 64-dimensional embeddings, but if training takes too long, feel free to reduce this to 32 or 16.

`validation_words` is a list of words we'll use to see how good our embeddings are.
While training, we'll periodically print out the words that have embeddings most similar to these words.
As training progresses, this should start printing words with similar meanings.

In [None]:
embedding_size = 64
validation_words = ['yes', 'small', 'thousand']

### 1.1: Word2Vec class

Write a `tf.Module` subclass called `Word2Vec`. Its input `x` is a tensor holding the context integer encodings (shape: `(batch_size, 2)`), and its label `y` is a tensor holding the target integer encodings (shape: `(batch_size,)`).

#### 1.1.1: Embedding variable
Make a variable which holds the embeddings for each of the words.
It should be a rank-2 tensor (a matrix) where the $i$-th row is the embedding vector for the word with integer encoding $i$.
Its shape should be `(n_words, embedding_size)`.

#### 1.1.2: Create the variables for logistic regression
Create weight and bias variables for a logistic regression which takes in the average context embedding and outputs a probability for each word.

NOTE: We won't actually ever compute the output of this logistic regression by hand (i.e. the matrix multiplication and softmax activation), since we're using noise-contrastive estimation to approximate it.
TensorFlow's `nce_loss` function just takes the weights and bias tensors as input.

#### 1.1.3: Compute the average context embedding
Look up the embeddings of each of the context words (with a single call to `tf.nn.embedding_lookup`), then average them depth-wise into a single average embedding vector with shape `(batch_size, embedding_size)`. Do this in a method called `context_average_embedding` that takes in a `context` argument.

#### 1.1.4: Compute the loss
We want to jointly train the logistic regression weights and the word embeddings using cross-entropy loss on the output of the logistic regression, but this is inefficient because of the large number of outputs (we'd need one logit for each word).

Instead, approximate the per-example loss using noise-contrastive estimation by calling `tf.nn.nce_loss`, then return the mean loss for the batch.

Since we will require many variables from our model to compute this loss, do this in a method called `loss` in your model class.
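
To show why a sampled loss is cheaper, here's a simplified NumPy sketch of the idea. It is not what `tf.nn.nce_loss` computes exactly: I'm assuming uniformly drawn noise words shared across the batch, and the function names `sigmoid_cross_entropy` and `sampled_loss` are my own. The point is that each example needs only `n_negative_samples + 1` logits instead of one per word.

```python
import numpy as np

def sigmoid_cross_entropy(logits, labels):
    # Numerically stable -[y*log(s) + (1-y)*log(1-s)], where s = sigmoid(logits)
    return (np.maximum(logits, 0) - logits * labels
            + np.log1p(np.exp(-np.abs(logits))))

def sampled_loss(inputs, targets, weights, biases, n_negative_samples, rng):
    n_classes = weights.shape[0]
    # ASSUMPTION: noise words are drawn uniformly, once per batch
    noise = rng.integers(0, n_classes, size=n_negative_samples)
    # Log of the expected number of times a class shows up among the samples
    log_expected = np.log(n_negative_samples / n_classes)

    # One logit per example for the true word: (batch,)
    true_logits = (np.sum(inputs * weights[targets], axis=1)
                   + biases[targets] - log_expected)
    # One logit per (example, noise word): (batch, n_negative_samples)
    noise_logits = inputs @ weights[noise].T + biases[noise] - log_expected

    # True word should be classified as "real" (1), noise words as "noise" (0)
    true_loss = sigmoid_cross_entropy(true_logits, 1.0)
    noise_loss = sigmoid_cross_entropy(noise_logits, 0.0).sum(axis=1)
    return np.mean(true_loss + noise_loss)

rng = np.random.default_rng(0)
n_classes, dim, batch = 50, 8, 4          # toy sizes
inputs = rng.normal(size=(batch, dim))    # stand-in for average context embeddings
targets = rng.integers(0, n_classes, size=batch)
weights = rng.normal(size=(n_classes, dim))
biases = np.zeros(n_classes)

loss = sampled_loss(inputs, targets, weights, biases,
                    n_negative_samples=5, rng=rng)
print(loss)
```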

#### 1.1.5: Operations to find similar embeddings
The **cosine similarity** between two embeddings $\vec{x}$ and $\vec{y}$ is the cosine of the angle between them, computed as:
$$
\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot ||\vec{y}||}
$$
For embeddings, this is often a more useful similarity measure than Euclidean distance, since it depends only on direction: it is unchanged when either vector is scaled by a positive constant.

We'll use cosine similarity to find the most similar words to any given input word.
The steps for doing this are:
 1. Take a word's integer encoding as input, and compute its embedding with `tf.nn.embedding_lookup`.
 2. Compute a "similarity tensor" which holds the cosine similarity of this embedding with all of the word embeddings. This should result in a single tensor with shape `(n_words,)`.
 3. Use `tf.nn.top_k` to find the top 8 similarities and their indices in the similarity tensor. These indices will be the integer encodings of the words with the most similar embeddings to the input. Return this tensor.

Later, you can find the most similar words to a given input word by calling `compute_similar_words` to get the top 8 indices, then using `tokenizer.index_word` on each of those indices.

When debugging this, make sure that the most similar word to a given word is that word itself, with a cosine similarity of 1.

Do this under a method called `compute_similar_words`.
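
As a reference for what the result should look like, here's a NumPy sketch of the same computation on made-up random embeddings. The function name `most_similar` is hypothetical, and `np.argsort` stands in for the role `tf.nn.top_k` plays in your TensorFlow version.

```python
import numpy as np

def most_similar(embeddings, token, k=8):
    """Return the k largest cosine similarities to `token` and their indices."""
    # Normalize every row to unit length, so dot products are cosines
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    similarities = unit @ unit[token]              # shape: (n_words,)
    top_indices = np.argsort(-similarities)[:k]    # descending similarity
    return similarities[top_indices], top_indices

rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 4))          # made-up embeddings
values, indices = most_similar(toy_embeddings, token=7, k=3)
print(indices, values)
```

The first index returned is the query word itself with similarity 1, which is the sanity check described above.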

In [None]:
# Your code here

### 1.2: Optimizer and gradients
Make an optimizer (I used `tf.optimizers.Adam` with `learning_rate=1e-3`) and a `train` method with a `bool` switch that controls whether summaries are computed. If this switch is true, add a scalar summary so that the loss is plotted in TensorBoard.

In [None]:
# Your code here

## Section 2: Train the model
Use the same kind of training loop we've used before, repeatedly calling the method that applies gradient updates.

The dataset is large, but the model doesn't do much work for each example.
So, you might want to only run and save the summaries every 1000 batches or so to speed up training.

In addition, every 1000 batches, for every word in `validation_words`, print the 8 most similar words by cosine similarity.

Finally, use a `tf.train.Checkpoint()` to save the model's weights in `./logs`, which is necessary to visualize the embeddings in TensorBoard. Since we only care about the weights, initialize the checkpoint as `tf.train.Checkpoint(embedding=model.word_embeddings)`. Instead of `checkpoint.write(...)`, use `checkpoint.save('./logs/embedding.ckpt')` for a format TensorBoard can interpret in its embedding projector.

In [None]:
# Your code here

## Section 3: Visualize the learned embeddings
Run TensorBoard pointed at `./logs` and look in the Projector tab, then use the "Load data" button and select the `./logs/metadata.tsv` file we created earlier to load word labels. If the Projector tab does not show up, go to [`http://localhost:6006#projector`](http://localhost:6006#projector) directly, replacing 6006 with whichever port your TensorBoard is using.

Try typing some words into the search bar and see which words come up as most similar.

## Section 4 (optional): Generate analogies
Try using embedding vector arithmetic to generate analogies.
To do this:
 - Compute the embedding vectors for several words using `tf.nn.embedding_lookup`
 - Do vector arithmetic on the computed embeddings
 - Find the most similar word embeddings by cosine similarity, then find the words that map to those embeddings
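
Here's a NumPy sketch of those steps on hand-made toy embeddings. The vectors, the word list, and the helper name `analogy` are all hypothetical and chosen so the arithmetic works out cleanly; learned embeddings will be much noisier.

```python
import numpy as np

# Hand-made toy embeddings (not learned). The axes are roughly
# "royalty", "gender", and "fruit".
words = ['king', 'queen', 'man', 'woman', 'apple']
embeddings = np.array([
    [1.0,  1.0, 0.0],
    [1.0, -1.0, 0.0],
    [0.0,  1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0,  0.0, 1.0],
])
index = {w: i for i, w in enumerate(words)}

def analogy(a, b, c):
    """Return the word closest to (b - a + c): a is to b as c is to ?"""
    query = embeddings[index[b]] - embeddings[index[a]] + embeddings[index[c]]
    sims = embeddings @ query / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query))
    for w in (a, b, c):            # do not return one of the input words
        sims[index[w]] = -np.inf
    return words[int(np.argmax(sims))]

answer = analogy('man', 'king', 'woman')
print(answer)
```

Excluding the input words matters in practice, since the query vector often stays closest to one of the words it was built from.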

In [None]:
# Your code here?