In [1]:
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
import pandas as pd
import nltk
import os
import glovar
import collections
import time
from data import toxic
toxic.prepare()

Preparing necessary preliminaries...
Success.


# Skip-Gram Implementation Practice

Your goal is to implement the Skip-Gram model in PyTorch, including pre-processing. 

Pre-processing is an important step in deep learning with text, and you should learn it now. If you already know it well, skip to section 5 and follow the instructions there. 

This tutorial assumes you are familiar with the basics of PyTorch. If not, you can review some introductory tutorials such as:

https://github.com/jcjohnson/pytorch-examples

Stages in this tutorial:
1. Import the data
2. Tokenize the data
3. Build the vocab dictionary
4. Prepare training pairs
5. Implement negative sampling
6. Code the model
7. Train the model
8. Visualize the word embeddings

Tips are provided for the pre-processing stage to make it smoother. Try not to use them. If you have to, put aside some time to learn those skills properly.

You will see validation cells with `assert` calls along the way. Run these to check that you haven't made a mistake that will prevent you from proceeding. You will need to do the steps in order.

## 1. Import the Data

We will use the data from Kaggle's Toxic Comment classification task. This data is very noisy, so we have selected a subset of the data that reduces this noise to make this tutorial easier.

The files are provided in the `data` directory. They come in .csv format so we'll use pandas to handle them. If you haven't learned how to use pandas, do it! It is a very useful tool.

The next cell defines the paths to these files.

In [2]:
train_file_path = os.path.join(glovar.DATA_DIR, 'train.csv')
test_file_path = os.path.join(glovar.DATA_DIR, 'test.csv')

### To Do

Load both .csv files into pandas `DataFrame` objects called `train` and `test` and see what they look like by calling the `head()` function on one of them. Identify the name of the column that contains the text of the comments. 

Tips:
- If you don't know pandas, try using `pd.read_csv()`, but put aside some time to learn pandas properly
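
As a small illustration of the tip above, the sketch below reads a CSV with `pd.read_csv()` and inspects the result. The CSV content and the column names `id` and `text` are made up for the example; they are not the columns of the real files.

```python
import io

import pandas as pd

# Toy CSV standing in for a real file; the columns are hypothetical.
toy_csv = io.StringIO(
    "id,text\n"
    "1,Hello there\n"
    "2,Thanks for the edit\n"
)

# pd.read_csv() also accepts a file path such as train_file_path.
toy = pd.read_csv(toy_csv)

print(toy.head())          # the first rows of the table
print(list(toy.columns))   # the column names, to spot the text column
```

Calling `head()` and listing the columns is usually enough to identify which column holds the comment text.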

In [3]:
# train = ???
# test = ???

## 2. Tokenization

We need to determine the set of all tokens in our dataset. We therefore need to separate each comment string into individual tokens, then determine the unique set of those tokens. We focus on the tokenization step first.

We will use `nltk` for tokenization because it is lightweight. The `nltk` package defines a function called `word_tokenize()` that you can use.

### To Do
Given the `DataFrame`s above, obtain a `set` of all unique tokens from <strong>both</strong> the train and test sets. We need to account for all the data in our vocabulary. This variable should be called `token_set`.

Tips:
- The rows of `DataFrame` objects `df1` and `df2` can be joined together using `pd.concat([df1, df2])` (`df1.append(df2)` also works on pandas versions before 2.0)
- You can get an `array` of values in a `DataFrame` `df` column by calling `df['column_name'].values`
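
To show how these pieces fit together, here is a minimal sketch on toy data. The frames and the column name `text` are hypothetical, and `str.split()` is only a stand-in so the example runs without extra downloads; the exercise itself asks for `nltk.word_tokenize()`. The sketch joins the frames with `pd.concat`, which works across pandas versions.

```python
import pandas as pd

# Hypothetical toy frames; 'text' is a made-up column name.
toy_train = pd.DataFrame({'text': ['the cat sat', 'the dog ran']})
toy_test = pd.DataFrame({'text': ['a cat ran']})

# Join the rows of both frames into one.
both = pd.concat([toy_train, toy_test], ignore_index=True)

toy_token_set = set()
for comment in both['text'].values:
    # Stand-in tokenizer: replace with nltk.word_tokenize(comment).
    toy_token_set.update(comment.split())

print(sorted(toy_token_set))
```

Using `set.update()` keeps only one copy of each token, so repeated words across comments do not inflate the vocabulary.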

In [4]:
# token_set = ???

In [9]:
assert len(token_set) == len(toxic.get_token_set())

## 3. Build the Vocab Dictionary

We need to associate a unique `int` index with every unique token, and provide a map for lookup. A high-level view of text processing is often: 
1. receive text as input
2. tokenize that text to obtain tokens
3. use those tokens to obtain integer indices
4. use those indices to lookup word vectors
5. use those vectors as input to a neural network.

We focus on (3) now.

### To Do
Use the `token_set` to build a `dict` object called `vocab` that has every unique token in the `token_set` as a key and a unique integer as its value. <strong>Also</strong>, since we will add a padding token for recurrent neural network processing later, make sure your values start from `1` and not `0` so that we can keep `0` for padding.

Tips:
- The python `zip()` function can be used to bring two lists together - e.g. tokens and indices
- The `dict()` constructor can take a zipped object as input, mapping the first position to the key and the second to the value
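
A minimal sketch of these two tips on a toy token set follows. The names `toy_tokens`, `toy_vocab` and `toy_rev_vocab` are illustrative, and sorting the set first is my own choice so that the mapping is reproducible; it is not required by the exercise.

```python
# Fix an order for the tokens so the indices are reproducible.
toy_tokens = sorted({'the', 'cat', 'sat'})

# zip pairs each token with an index starting at 1 (0 is kept for padding).
toy_vocab = dict(zip(toy_tokens, range(1, len(toy_tokens) + 1)))

# Reverse lookup: index -> token.
toy_rev_vocab = {ix: tok for tok, ix in toy_vocab.items()}

# Tokens in, integer indices out.
indices = [toy_vocab[tok] for tok in ['the', 'cat', 'sat']]

print(toy_vocab)
print(indices)
```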

In [25]:
# vocab = ???

In [6]:
assert len(vocab) == len(toxic.get_vocab())
assert isinstance(list(vocab.keys())[0], str)
assert isinstance(list(vocab.values())[0], int)
assert 0 not in vocab.values()

We will also define a reverse lookup dictionary (integer indexes as keys and string tokens as values) for later use.

In [26]:
rev_vocab = {v: k for k, v in vocab.items()}

## 4. Prepare the Training Pairs

We need to present two words at a time to the network to train our Skip-Gram: a center word and a context word. We therefore need to determine these pairs beforehand.

Before coding deep learning models it is necessary to first fully think through how we are going to present the data to the network. This will avoid having to make annoying changes that might follow from small details that are easy to overlook.

We know we are going to present two words at a time: a center word, and a context word. But how are we going to present them: as tokens, or as indices? These details matter when you code the forward pass of the network: if you try a word vector lookup on an embedding matrix with a string, you will see an error. We will use integer indices, which avoids an extra dictionary lookup at training time.

Since finding the context tokens for all words over all instances in the dataset is not a generally useful skill, we do that for you.

In [11]:
m = 5  # our context window size - experiment with this!
contexts = toxic.get_contexts(m)

The `contexts` variable is a dictionary where the keys are the indices of all the tokens in the dataset, and the values are `set`s of token indices that occur in their contexts. We will sample from these during training.
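
To make the shape of this dictionary concrete, the sketch below builds one for a toy sequence of token indices. `build_contexts` is a hypothetical helper written for this illustration; it is not the implementation behind `toxic.get_contexts()`, which may differ in its details.

```python
from collections import defaultdict


def build_contexts(sequences, m):
    """Map each token index to the set of indices seen within m positions."""
    contexts = defaultdict(set)
    for seq in sequences:
        for pos, center in enumerate(seq):
            lo = max(0, pos - m)               # left edge of the window
            hi = min(len(seq), pos + m + 1)    # right edge (exclusive)
            for other in range(lo, hi):
                if other != pos:
                    contexts[center].add(seq[other])
    return dict(contexts)


toy_sequences = [[1, 2, 3, 4]]   # one "comment", already mapped to indices
toy_contexts = build_contexts(toy_sequences, m=1)

# Every (center, context) pair that could be sampled during training.
toy_pairs = sorted(
    (center, ctx) for center, ctxs in toy_contexts.items() for ctx in ctxs)

print(toy_contexts)
print(toy_pairs)
```

With a window of `m=1`, each index only sees its immediate neighbours; a larger `m` yields larger context sets.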

Finally, we need the frequencies of our words for the negative sampling algorithm.

### To Do

Use the training data and `nltk.word_tokenize` to create a `collections.Counter` object that holds each unique token as a key, and the frequency count as a value.

Tips:
- `Counter` has an `update()` function that can accept lists of key values (i.e. tokens) and automatically does the counting for you
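
As a quick illustration of this tip, the sketch below counts tokens from two hand-written token lists. The data is made up. Note that `update()` should receive a list of tokens; passing a raw string would count its characters instead.

```python
import collections

toy_counts = collections.Counter()
for tokens in [['the', 'cat'], ['the', 'dog', 'the']]:
    toy_counts.update(tokens)   # adds one count per token in the list

print(toy_counts['the'])
print(toy_counts.most_common(1))
```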

In [7]:
# frequencies = ???

In [None]:
assert len(frequencies) == len(vocab)
assert isinstance(list(frequencies.keys())[0], str)
assert isinstance(list(frequencies.values())[0], int)
rev_vocab = {v: k for k, v in vocab.items()}

## 5. Implement Negative Sampling

If you are skipping pre-processing, make sure the cell below that loads `vocab` and `frequencies` from `toxic` is uncommented, and run it.

To perform negative sampling, we need a function that
- Takes a token index as argument
- Returns the desired number of negative samples
- Randomly chooses those samples according to

$$
P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{n} f(w_j)^{3/4}}
$$

where $f(w_i)$ is the frequency count of token $w_i$ and $n$ is the size of the vocab.

We will define this function as a callable class, since it depends on state information (the `vocab`, `frequencies`, and `contexts`).
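
To get a feel for what the 3/4 power in the sampling distribution does, here is a small worked sketch with made-up counts. The numbers are chosen so the powers come out as whole numbers; nothing here comes from the real dataset.

```python
import numpy as np

toy_freqs = np.array([16.0, 1.0, 81.0])   # made-up counts f(w)

weights = toy_freqs ** 0.75               # f(w)^(3/4): 8, 1, 27
probs = weights / weights.sum()           # normalise so the values sum to 1

raw = toy_freqs / toy_freqs.sum()         # plain relative frequencies

print(probs)
print(raw)

# Draw five negative-sample positions from the smoothed distribution.
rng = np.random.default_rng(0)
draws = rng.choice(len(toy_freqs), size=5, p=probs)
print(draws)
```

Compared with the plain relative frequencies, the smoothed distribution gives the rare word a larger share (1/36 instead of 1/98) and the most frequent word a smaller one (27/36 instead of 81/98).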

In [10]:
vocab = toxic.get_vocab()
frequencies = toxic.get_frequencies()

In [59]:
class NegativeSampler:
    
    def __init__(self, vocab, frequencies, contexts):
        """Create a new NegativeSampler.
        
        Args:
          vocab: Dictionary mapping tokens to integer indices.
          frequencies: Dictionary (or Counter) mapping tokens to counts.
          contexts: Dictionary mapping token indices to sets of context
            token indices.
        """
        self.vocab = vocab
        self.n = len(vocab)
        self.contexts = contexts
        # Fix one ordering of the tokens and align indices and counts to it,
        # so that position i of the distribution belongs to self.indices[i].
        # Assumes frequencies is keyed by token; missing tokens count as 0.
        tokens = sorted(vocab, key=vocab.get)
        self.indices = np.array([vocab[tok] for tok in tokens])
        self.distribution = self.p([frequencies.get(tok, 0) for tok in tokens])
    
    def __call__(self, tok_ix, num_negs):
        """Get negative samples.
        
        Args:
          tok_ix: Integer, the index of the center word.
          num_negs: Integer, the number of negative samples to take.
        
        Returns:
          numpy.array of num_negs vocab indices.
        """
        # sample vocab indices (which start at 1), not positions 0..n-1
        samples = np.random.choice(
            self.indices, 
            size=num_negs, 
            p=self.distribution)
        # make sure we haven't sampled center word or its context
        invalid = {tok_ix} | set(self.contexts[tok_ix])
        for i, ix in enumerate(samples):
            # redraw a single index until it is valid
            while ix in invalid:
                ix = np.random.choice(self.indices, p=self.distribution)
            samples[i] = ix
        return samples
    
    def p(self, freqs):
        """Determine the probability distribution for negative sampling.
        
        Args:
          freqs: List of integers.
        
        Returns:
          numpy.array.
        """
        ### Implement Me ###
        weights = np.power(np.array(freqs, dtype=np.float64), 3 / 4)
        return weights / np.sum(weights)

sampler = NegativeSampler(vocab, frequencies, contexts)

Let's try out our negative sampler to see how it does.

In [60]:
def sample(token):
    if token not in vocab:
        raise ValueError('Try a different token, that one is not in the vocab.')
    print('Negative samples for "%s":' % token)
    for neg_ix in sampler(vocab[token], 5):
        print('\t%s' % (rev_vocab[neg_ix]))
    
sample('friend')

## 6. Code the Model

Now for the fun part.

Despite all the fancy neural network diagrams you see in word2vec tutorials, the maths comes down to picking vectors out of matrices and calculating their dot products. It's nothing to be frightened of. Just follow the equations closely and it will work. We don't even need any complicated functions or neural network modules. In fact, we could do it in numpy, but then we would have to calculate the gradients manually. PyTorch computes them for us.
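
To back up the claim that this is only vector lookups and dot products, here is a numpy sketch of the forward computation for a single (center, context) pair. It assumes the softmax is taken over the true context word plus a few sampled negatives; the sizes, the indices and the "sampled" negatives are all made up, and no gradients are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, emb_dim = 6, 4                              # toy vocab and embedding sizes
V = rng.uniform(-0.01, 0.01, (n, emb_dim))     # center vectors as rows
U = rng.uniform(-0.01, 0.01, (emb_dim, n))     # context vectors as columns

center_ix, context_ix = 2, 3                   # one training pair
negative_ixs = [0, 5]                          # pretend these were sampled

v = V[center_ix]                               # pick one row of V
candidates = U[:, [context_ix] + negative_ixs] # pick three columns of U
scores = v @ candidates                        # three dot products
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the candidates
loss = -np.log(probs[0])                       # true context is candidate 0

print(probs)
print(loss)
```

Because the embeddings start out tiny, the three probabilities are almost equal at first; training should push the first one (the true context word) up.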

Below is a template for the model. You need to implement the parts marked `### Implement Me ###`. They should be self-explanatory from the slides/notes. Refer to them if stuck.

In [None]:
class SkipGram(nn.Module):
    """SkipGram Model."""
    
    def __init__(self, vocab, emb_dim, num_negs, lr):
        """Create a new SkipGram.
        
        Args:
          vocab: Dictionary, our vocab dict with token keys and index values.
          emb_dim: Integer, the size of word embeddings.
          num_negs: Integer, the number of non-context words to sample.
          lr: Float, the learning rate for gradient descent.
        """
        super(SkipGram, self).__init__()
        self.vocab = vocab
        self.n = len(vocab)  # size of the vocab
        self.emb_dim = emb_dim
        self.num_negs = num_negs
        
        ### Implement Me: Initialize self.U and self.V,      ###
        ### the parameter matrices of the network.           ###
        ### They need to be nn.Parameter objects.            ###
        ### V should have word vectors as row matrices,      ###
        ### U should have word vectors as column matrices.   ###
        ### Use nn.init.uniform_() to perform random uniform ###
        ### initialization in the range [-0.01, 0.01].       ###
        
        # Assumes vocab indices run from 1 to n (0 is kept for padding), so
        # each matrix needs n + 1 slots for index n to be a valid lookup.
        self.V = nn.Parameter(torch.empty(self.n + 1, emb_dim))
        self.U = nn.Parameter(torch.empty(emb_dim, self.n + 1))
        nn.init.uniform_(self.V, a=-0.01, b=0.01)
        nn.init.uniform_(self.U, a=-0.01, b=0.01)
        
        ### End Implement ###
        
        # Adam is a good optimizer and usually converges faster than SGD
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)
    
    def forward(self, center_ixs, context_ixs):
        """Compute the forward pass of the network.
        
        1. Lookup embeddings for center and context words
            - use self.lookup()
        2. Sample negative samples
            - use self.negative_samples()
        3. Calculate the probability estimates
            - make sure to implement and use self.softmax()
        4. Calculate the loss
            - using self.loss()
        
        Args:
          center_ixs: List of integer indices.
          context_ixs: List of integer indices.
        
        Returns:
          loss (scalar torch.Tensor).
        """
        ### Implement Me ###
        center_words = self.lookup(self.V, center_ixs)  # [batch, emb_dim]
        neg_ixs = self.negative_samples(center_ixs)     # [batch, num_negs]
        # candidate 0 in every row is the true context word
        context_col = torch.as_tensor(context_ixs, dtype=torch.long).unsqueeze(1)
        candidate_ixs = torch.cat([context_col, neg_ixs], dim=1)
        # U holds word vectors as columns, so transpose before the row lookup
        candidates = self.lookup(self.U.t(), candidate_ixs)
        # dot product of each center word with each of its candidates
        logits = torch.bmm(candidates, center_words.unsqueeze(2)).squeeze(2)
        preds = self.softmax(logits)
        targets = torch.zeros(len(center_ixs), dtype=torch.long)
        return self.loss(preds, targets)
    
    def lookup(self, embedding, indices):
        """Lookup embeddings given indices.
        
        Args:
          embedding: Tensor of shape [num_words, emb_dim], an embedding
            matrix with word vectors as rows.
          indices: List or tensor of integers, the indices to lookup.
        
        Returns:
          Tensor with one word vector of size emb_dim per index.
        """
        lookup_tensor = torch.as_tensor(indices, dtype=torch.long)
        return embedding[lookup_tensor]
    
    def negative_samples(self, center_ixs):
        """Sample num_negs negative indices for each center word.
        
        Assumes the notebook-level `sampler` from section 5 is defined.
        
        Returns:
          LongTensor of shape [len(center_ixs), num_negs].
        """
        negs = [sampler(ix, self.num_negs) for ix in center_ixs]
        return torch.as_tensor(np.array(negs), dtype=torch.long)
    
    def loss(self, preds, targets):
        """Compute cross-entropy loss.
        
        Implement this for practice, don't use the built-in PyTorch function.
        
        Args:
          preds: Tensor of shape [batch_size, num_classes], our predictions.
          targets: LongTensor of shape [batch_size], the column of preds that
            holds the true context word for each row.
        """
        ### Implement Me ###
        # pick the predicted probability of the true class in every row
        rows = torch.arange(preds.shape[0])
        return -1 * torch.sum(torch.log(preds[rows, targets]))
    
    def optimize(self, loss):
        """Optimization step.
        
        Args:
          loss: Scalar.
        """
        # We first make sure to remove any previous gradient from our tensors
        # before calculating again.
        self.optimizer.zero_grad()
        ### Implement Me ###
        loss.backward()
        self.optimizer.step()        
    
    def softmax(self, logits):
        """Compute the softmax function.
        
        Implement this for practice, don't use the built-in PyTorch function.
        
        Args:
          logits: Tensor of shape [batch_size, num_classes].
        
        Returns:
          Tensor of shape [batch_size, num_classes], our predictions.
        """
        ### Implement Me ###
        # subtract the row max for numerical stability, then normalize per row
        exps = torch.exp(logits - logits.max(dim=1, keepdim=True)[0])
        return exps / torch.sum(exps, dim=1, keepdim=True)