In [26]:
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
import nltk
import pandas as pd
from data import toxic
import os
import glovar

# Skip-Gram Implementation Practice

Your goal is to implement the Skip-Gram model in PyTorch, including pre-processing. Pre-processing is an important step in deep learning with text, and you should learn it now. This tutorial assumes you are familiar with the basics of PyTorch. If not, you can review some introductory tutorials such as:

https://github.com/jcjohnson/pytorch-examples

Steps in this tutorial:
1. Import the data
2. Tokenize the data
3. Build the vocab dictionary
4. Prepare training pairs
5. Implement negative sampling
6. Code the model
7. Train the model
8. Visualize the word embeddings

## 1. Import the Data

We will use the data from Kaggle's Toxic Comment classification task. Head over to

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

and sign up for an account if you don't have one. Download the data and unzip it into the /data folder in the root directory of this project.

The files come in .csv format, so we'll use pandas to handle them. If you haven't learned how to use pandas, do it! It is a very useful tool.

Make sure, once again, you have unzipped the files into the correct folder (or the code in the next cell will tell you the file is not found).

In [21]:
train_file_path = os.path.join(glovar.DATA_DIR, 'train.csv')
test_file_path = os.path.join(glovar.DATA_DIR, 'test.csv')

The file paths are ready to go. Now load the .csv files for both training and testing into pandas `DataFrame` objects and see what they look like with the `head()` function. Identify the column that contains the text we want to use in the next step, tokenization.

In [10]:
train = pd.read_csv(train_file_path)
test = pd.read_csv(test_file_path)
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


## 2. Tokenization

The text data comes as one string per data point. We need to break each string into tokens and then, from the resulting lists, determine the set of unique tokens.

We will use nltk for tokenization because it is lightweight. The nltk package defines a function `word_tokenize()` that you can use.

There are many ways to go from our `DataFrame` objects to a `set` of unique tokens. It is up to you how to do it.

At the end of this step you should have a `set` of tokens in a variable called `token_set`. The cell after the next one will then check to make sure you have done this correctly.

If you are having problems check step (1) and make sure you have all the data. Make sure you are taking a set and not including any token multiple times. Use the assert conditions to further troubleshoot any errors.
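
To make the target data structure concrete, here is a minimal sketch of going from a few strings to a set of unique tokens. It uses a simple regular expression as a stand-in tokenizer so that it runs without nltk's tokenizer models. The helper `simple_tokenize`, the two toy comments, and the name `toy_token_set` are illustrative assumptions, and `nltk.word_tokenize()` will split real text differently.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer (an assumption, not nltk): words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


toy_comments = ["You, sir, are my hero.", "You are not my hero!"]

# A set keeps each token once, however often it occurs in the text.
toy_token_set = set()
for comment in toy_comments:
    toy_token_set.update(simple_tokenize(comment))

print(len(toy_token_set))
print(sorted(toy_token_set))
```

The same pattern applies to the real data: tokenize each comment, then `update()` a single set with the resulting list.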

In [23]:
token_set = set()
# Tokenize every comment in both splits and collect the unique tokens.
for text in pd.concat([train['comment_text'], test['comment_text']]):
    token_set.update(nltk.word_tokenize(text))
print(len(token_set))

571120


In [24]:
token_set

{'explode.',
 'Verein',
 'title=KZ_Auschwitz-Birkenau',
 'yryy4',
 "'heros",
 'plottin',
 'asterisked',
 'this.Thanks',
 'Inter-Island',
 'Subphylum',
 'sucks.==',
 'Dine-In',
 '0945ojtg0o8reuyjurfi',
 'baiting/poking',
 'CUSTOMSIG',
 'FANGUSH',
 'Gotlanders',
 'raycen',
 'colspan=6|',
 'secertly',
 'KaNyamazane',
 '=5/1/2011',
 'humdrum',
 '==Characters',
 'Legaleze',
 'investigations/Davebrayfb',
 'pompus',
 'pacified',
 'textIVE',
 "O'Fat",
 'Lucaspet',
 'unintuitive',
 'interrupter',
 'dis-inviting',
 'Phyllis',
 'overwork',
 '//www.bakersfield.com/893/story/635460.html',
 'Jagira',
 'Sumlin',
 'TYPICAL',
 'White-necked',
 'ROAS',
 'Writs',
 'mid-20th',
 'william',
 'lobbiest',
 'familias',
 'Universiade',
 'Biomimetic',
 '2,180',
 'cok',
 "'smile",
 'nda',
 'investigations/Rtuftrbsee',
 'ythat',
 'nievity',
 'Jarvey2',
 'Laptop',
 '3.nm1raines/raycen',
 'Gnjilane',
 'Colombiano',
 'IVOLUME',
 'Cost-benefit',
 'R/S',
 'Barrowlands',
 'offender',
 'Veljko',
 'Socialism',
 'dogwood',

In [None]:
assert len(token_set) == toxic.vocab_size
assert toxic.token_set - token_set == set()

## 3. Build the Vocab Dictionary

In training we want to take words and look up their vectors by index. In this step we assign a unique index to each token and define a way to look up the index for a given token. We will use a `dict` for this.
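
As a small illustration of the idea, the sketch below builds such a dictionary for four toy tokens, together with a reverse mapping from index back to token. The names `toy_tokens`, `toy_vocab` and `toy_rev_vocab` are made up for this example, and sorting the tokens first is a design choice assumed here (it makes the indices reproducible between runs), not something the exercise requires.

```python
toy_tokens = {"hero", "my", "are", "You"}

# Sort before numbering so the same token gets the same index on every run;
# iterating over a set directly gives an arbitrary order.
toy_vocab = {token: ix for ix, token in enumerate(sorted(toy_tokens))}

# Reverse mapping, handy for turning indices back into tokens.
toy_rev_vocab = {ix: token for token, ix in toy_vocab.items()}

print(toy_vocab)
print(toy_rev_vocab[toy_vocab["hero"]])
```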

At the end of this step you should have a dictionary variable called `vocab`. The cell with the assert statement below confirms this is all in order.

In [18]:
vocab = dict(zip(token_set, range(len(token_set))))

In [None]:
assert len(vocab) == toxic.vocab_size

## 4. Prepare the Training Pairs

Before coding deep learning models it is necessary to first fully think through how we are going to present the data to the network. We have an idea of this from the slides/notes. It is still a good idea to put this step here in the workflow, before coding up the model. This will avoid having to make annoying changes that might follow from small details that are easy to overlook.

We know we are going to present two words at a time: a center word and a context word. But how are we going to present them: as tokens, or as indices? These details matter when you code the forward pass of the network. If you try an embedding lookup with a string, you will see an error. We will use integer indices, as this will be ever so slightly faster than performing an extra dictionary lookup at training time.

Finding the context tokens for every word across every instance in the dataset is not generally useful to implement yourself, so we do that for you. In practice, you can just use gensim (https://radimrehurek.com/gensim/) to train word2vec models; this is a learning exercise. You would only use your own implementation if you wanted to customize the model drastically.

In [4]:
m = 5  # our context window size - play with different sizes to get a better feel for how this works
contexts = toxic.contexts(train, test, vocab, m)

The `contexts` variable is a dictionary where the keys are the indices of all the tokens in the dataset, and the values are `set`s of token indices that occur in their contexts. We will sample from these during training.
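
The implementation of `toxic.contexts` is not shown in this tutorial. Purely as an illustration of the resulting data structure, the sketch below builds the same kind of dictionary from toy sentences that are already converted to indices. It assumes the window covers `m` tokens on each side of the center word; `build_contexts` is a hypothetical helper, not the project's function.

```python
from collections import defaultdict


def build_contexts(indexed_docs, m):
    """Map each token index to the set of indices seen within m positions of it."""
    contexts = defaultdict(set)
    for doc in indexed_docs:
        for pos, center in enumerate(doc):
            # Up to m tokens to the left and up to m tokens to the right.
            window = doc[max(0, pos - m):pos] + doc[pos + 1:pos + 1 + m]
            contexts[center].update(window)
    return dict(contexts)


toy_docs = [[0, 1, 2, 3], [3, 1, 4]]
toy_contexts = build_contexts(toy_docs, m=1)
print(toy_contexts)
```

Because the values are sets, a context word that co-occurs many times with a center word is stored only once.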

## 5. Implement Negative Sampling

We need to define a function that takes a token (or its index) and returns a list of negative samples, drawn according to the token frequency distribution. This function therefore depends on the state of the vocabulary, on the token frequencies, and on the `contexts` dictionary we just built. For this reason we will define a class that wraps this information and lets us call it with new word indices for every mini-batch in training. We will implement it as a callable in Python (by defining the `__call__` method). This can be syntactically nicer than defining a named method on the class, because the sampler object can then be called like an ordinary function.
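
Before wrapping this in a class, it may help to see the sampling distribution on toy numbers. The sketch below raises each count to the 3/4 power (the exponent used in the class that follows), normalizes, masks out a center word, and draws samples. The toy frequencies and the seeded generator are assumptions for illustration only.

```python
import numpy as np

toy_frequencies = np.array([16, 1, 81, 0], dtype=float)

# Raise counts to the 3/4 power, then normalize so the values sum to 1.
# This flattens the distribution: frequent words lose some probability mass
# and rarer words gain some, compared with using the raw counts.
weights = np.power(toy_frequencies, 0.75)
probs = weights / weights.sum()

# Exclude the center word (index 0 here) and renormalize before sampling,
# since np.random's choice requires probabilities that sum to 1.
masked = probs.copy()
masked[0] = 0.0
masked /= masked.sum()

rng = np.random.default_rng(0)
negatives = rng.choice(len(masked), size=5, p=masked)
print(probs)
print(negatives)
```

Index 3 has a count of zero and index 0 is masked out, so only indices 1 and 2 can be drawn here.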



In [None]:
class NegativeSampler:
    
    def __init__(self, vocab, frequencies, contexts):
        """Create a new NegativeSampler.
        
        Args:
          vocab: Dictionary.
          frequencies: List of integers, the frequencies of each word,
            sorted in word index order.
          contexts: Dictionary.
        """
        self.vocab = vocab
        self.n = len(vocab)
        self.contexts = contexts
        self.distribution = self.p(frequencies)
    
    def __call__(self, tok_ix, num_negs):
        """Get negative samples.
        
        Args:
          tok_ix: Integer, the index of the center word.
          num_negs: Integer, the number of negative samples to take.
        """
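        # don't sample the center word or its context
        distribution = self.distribution.copy()
        distribution[tok_ix] = 0.
        # the context values are sets, which numpy cannot use as an index directly
        distribution[list(self.contexts[tok_ix])] = 0.
        # renormalize: np.random.choice requires probabilities that sum to 1
        distribution /= distribution.sum()
        return np.random.choice(self.n, size=num_negs, p=distribution)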
    
    def p(self, frequencies):
        """Determine the probability distribution for negative sampling.
        
        Args:
          frequencies: List of integers.
        
        Returns:
          numpy.array.
        """
        frequencies = np.array(frequencies, dtype=float)
        weights = np.power(frequencies, 3/4)
        return weights / np.sum(weights)

## 6. Code the Model

Now for the fun part.

Notice that for all the fancy neural network diagrams you see in word2vec tutorials, when you get down to the maths it is just picking vectors out of matrices and calculating their dot products? It's nothing to be frightened of. Just follow the equations closely and it will work. We don't even need to use any complicated functions or neural network modules. In fact, we could do it in numpy (but then we would have to calculate the gradients manually). PyTorch makes this very easy.
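
To back up the claim that this is just vector picking and dot products, here is a minimal numpy sketch of one forward pass for a single (center, context) pair. It uses a softmax over the whole toy vocabulary rather than negative sampling, and the sizes, indices and seed are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, emb_dim = 6, 4                                # toy vocab size and embedding size
V = rng.uniform(-0.01, 0.01, size=(n, emb_dim))  # center vectors, one per row
U = rng.uniform(-0.01, 0.01, size=(emb_dim, n))  # context vectors, one per column

center_ix, context_ix = 2, 5
v_c = V[center_ix]   # pick the center word's vector: shape (emb_dim,)
scores = v_c @ U     # dot product with every context vector: shape (n,)

exp_scores = np.exp(scores - scores.max())  # subtract the max for numerical stability
probs = exp_scores / exp_scores.sum()       # softmax over the vocabulary
loss = -np.log(probs[context_ix])           # cross-entropy for the true context word
print(probs.sum(), loss)
```

With weights this close to zero every word gets a probability near `1/n`, so the loss starts out near `log(n)`. What numpy does not give us is the gradient of `loss` with respect to `U` and `V`, which is where PyTorch comes in.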

Below is a template for the model. You need to implement the functions that say `### Implement Me ###`. They should be self-explanatory from the slides/notes. Refer to them if stuck.

In [None]:
class SkipGram(nn.Module):
    """SkipGram Model."""
    
    def __init__(self, vocab, emb_dim, num_negs, lr, sampler):
        """Create a new SkipGram.
        
        Args:
          vocab: Dictionary, our vocab dict with token keys and index values.
          emb_dim: Integer, the size of word embeddings.
          num_negs: Integer, the number of non-context words to sample.
          lr: Float, the learning rate for gradient descent.
          sampler: Callable mapping (tok_ix, num_negs) to an array of negative
            sample indices, e.g. a NegativeSampler. Note: this argument is an
            assumption added so that forward() has something to sample with.
        """
        super().__init__()  # must run before any nn.Parameter is assigned
        self.vocab = vocab
        self.n = len(vocab)  # size of the vocab
        self.emb_dim = emb_dim
        self.num_negs = num_negs
        self.sampler = sampler
        
        ### Implement Me: Initialize self.U and self.V,      ###
        ### the parameter matrices of the network.           ###
        ### They need to be nn.Parameter objects.            ###
        ### V should have word vectors as row matrices,      ###
        ### U should have word vectors as column matrices.   ###
        ### Use nn.init.uniform_() to perform random uniform ###
        ### initialization in the range [-0.01, 0.01].       ###
        
        self.V = nn.Parameter(torch.empty(self.n, emb_dim))
        self.U = nn.Parameter(torch.empty(emb_dim, self.n))
        nn.init.uniform_(self.V, a=-0.01, b=0.01)
        nn.init.uniform_(self.U, a=-0.01, b=0.01)
        
        ### End Implement ###
        
        # Adam is a good optimizer and typically converges faster than plain SGD
        self.optimizer = torch.optim.Adam(self.parameters(), lr=lr)
    
    def forward(self, center_ixs, context_ixs):
        """Compute the forward pass of the network.
        
        1. Lookup embeddings for center and context words
            - use self.lookup()
        2. Sample negative samples
            - use self.negative_samples()
        3. Calculate the probability estimates
            - make sure to implement and use self.softmax()
        4. Calculate the loss
            - using self.loss()
        
        Args:
          center_ixs: List of integer indices.
          context_ixs: List of integer indices.
        
        Returns:
          loss (scalar torch.Tensor).
        """
        ### Implement Me ###
        center_words = self.lookup(self.V, center_ixs)        # [batch, emb_dim]
        context_words = self.lookup(self.U.t(), context_ixs)  # [batch, emb_dim]
        negatives = self.negative_samples(center_ixs)         # [batch, num_negs, emb_dim]
        # Candidates per center word: the true context word at position 0,
        # followed by the negative samples.
        candidates = torch.cat([context_words.unsqueeze(1), negatives], dim=1)
        # Dot product of each center word with each of its candidates.
        logits = torch.bmm(candidates, center_words.unsqueeze(2)).squeeze(2)
        preds = self.softmax(logits)
        targets = [0] * len(center_ixs)  # the true context word is candidate 0
        return self.loss(preds, targets)
    
    def negative_samples(self, center_ixs):
        """Sample negative context vectors for each center word.
        
        Args:
          center_ixs: List of integer indices.
        
        Returns:
          Tensor of shape [len(center_ixs), num_negs, emb_dim].
        """
        neg_ixs = np.stack([self.sampler(ix, self.num_negs) for ix in center_ixs])
        return self.U.t()[torch.as_tensor(neg_ixs, dtype=torch.long)]
    
    @staticmethod
    def lookup(embedding, indices):
        """Lookup embeddings given indices.
        
        Args:
          embedding: Tensor or nn.Parameter with one word vector per row.
          indices: List of integers, the indices to lookup.
        
        Returns:
          Tensor of shape [len(indices), emb_dim], one word vector per row.
        """
        lookup_tensor = torch.as_tensor(indices, dtype=torch.long)
        return embedding[lookup_tensor]
    
    def loss(self, preds, targets):
        """Compute cross-entropy loss.
        
        Implement this for practice, don't use the built-in PyTorch function.
        
        Args:
          preds: Tensor of shape [batch_size, num_candidates], our predictions.
          targets: List of integers, the position of the true context word
            among the candidates for each example.
        """
        ### Implement Me ###
        targets = torch.as_tensor(targets, dtype=torch.long)
        rows = torch.arange(preds.shape[0])
        # Negative log-probability assigned to each true context word.
        return -1 * torch.sum(torch.log(preds[rows, targets]))
    
    def optimize(self, loss):
        """Optimization step.
        
        Args:
          loss: Scalar.
        """
        # We first make sure to remove any previous gradient from our tensors
        # before calculating again.
        self.optimizer.zero_grad()
        ### Implement Me ###
        loss.backward()
        self.optimizer.step()
    
    def softmax(self, logits):
        """Compute the softmax function.
        
        Implement this for practice, don't use the built-in PyTorch function.
        
        Args:
          logits: Tensor of shape [batch_size, num_candidates].
        
        Returns:
          Tensor of shape [batch_size, num_candidates], our predictions.
        """
        ### Implement Me ###
        # Subtract each row's max before exponentiating for numerical stability,
        # then normalize each row separately.
        exps = torch.exp(logits - logits.max(dim=1, keepdim=True)[0])
        return exps / torch.sum(exps, dim=1, keepdim=True)


