In [1]:
import pickle
import sys
sys.path.append('../')
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pack_padded_sequence
from torch.nn.utils.rnn import pad_packed_sequence

from src.dataset import SquadDataset
from src.preprocessing import Preprocessing

# Clear memory
# torch.cuda.empty_cache()

# Notebook summary

In this notebook we'll set up the model architectures required for the first encoders. These encode the words in the documents and the words in the questions. Both questions and documents are initially encoded by an LSTM. For the document:

$$ d_t = \mathrm{LSTM}_{enc}(d_{t-1}, x_t^D) $$

resulting in the document encoding matrix

$$ D = [d_1, \ldots, d_m, d_\varnothing] \in \mathbb{R}^{L \times (m+1)} $$

and, for the question,

$$ q_t = \mathrm{LSTM}_{enc}(q_{t-1}, x_t^Q) $$

resulting in the intermediate question encoding matrix

$$ Q' = [q_1, \ldots, q_n, q_\varnothing] \in \mathbb{R}^{L \times (n+1)} $$

to which we then apply a nonlinearity:

$$ Q = \tanh(W^{(Q)} Q' + b^{(Q)}) \in \mathbb{R}^{L \times (n+1)} $$
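
To make the question-side equations concrete, here is a minimal sketch of how they might look in PyTorch. The class name `QuestionEncoderSketch`, the choice to model the sentinel as a learnable parameter, and the batch-first layout are all assumptions made for illustration; they are not taken from the rest of this notebook. Note that the code stores each encoding as a row, so a tensor of shape `(batch, n+1, L)` corresponds to the transpose of the `L x (n+1)` matrix above.

```python
import torch
import torch.nn as nn


class QuestionEncoderSketch(nn.Module):
    """Hypothetical sketch: LSTM encoding, appended sentinel, tanh projection."""

    def __init__(self, input_size=300, hidden_size=300):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Assumption: the sentinel q_0 is a learnable vector of size L.
        self.sentinel = nn.Parameter(torch.zeros(hidden_size))
        # Implements W^(Q) and b^(Q).
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x):
        # x: (batch, n, input_size) -> out: (batch, n, L)
        out, _ = self.lstm(x)
        # Broadcast the sentinel to (batch, 1, L) and append it after the last word.
        sentinel = self.sentinel.expand(out.size(0), 1, -1)
        q_prime = torch.cat([out, sentinel], dim=1)  # (batch, n+1, L)
        # Q = tanh(W Q' + b), applied to every position independently.
        return torch.tanh(self.proj(q_prime))


# Usage on random data: 2 questions, 3 words each, 4-dimensional word vectors.
encoder = QuestionEncoderSketch(input_size=4, hidden_size=6)
Q = encoder(torch.randn(2, 3, 4))
print(Q.shape)
```

The document encoder would follow the same pattern without the final projection.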

Let's start!

In [2]:
# Paths
glove_file_path = "../data/glove.840B.300d.txt"
squad_file_path = "../data/train-v1.1.json"

In [3]:
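class DocumentEncoderLSTM(nn.Module):
    """LSTM encoder that maps a sequence of word vectors to contextual encodings."""
    def __init__(self, input_size, hidden_size, num_layers):
        super(DocumentEncoderLSTM, self).__init__()
        # nn.LSTM defaults to batch_first=False, i.e. it reads its input
        # as (seq_len, batch, input_size).
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers)
        
    def forward(self, x):
        # nn.LSTM returns (output, (h_n, c_n)); callers index [0] to get the encodings.
        out = self.lstm(x)
        return out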

### Define input data

Do we have a single document encoding matrix for all documents, or an encoding matrix for each document? It seems we have one for each document, where L is the length of the transformed word vectors and m+1 is the number of words in the document plus a sentinel vector. 

The shape of the input of a neural net is always defined on the level of a single example, as the batch size may vary. The above would suggest that we feed the network word vectors for a whole document. We pass each word vector through the same LSTM and we obtain new, encoded vectors (which incorporate some of their surrounding context).
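
As a quick sanity check on the shapes involved, the sketch below pushes a random batch through an `nn.LSTM` and prints what comes back. The batch of 8 documents with 50 words each is made up for illustration, and `batch_first=True` is an assumption chosen here so that the batch dimension comes first; the encoder defined above uses the default layout.

```python
import torch
import torch.nn as nn

# Same sizes as the parameters used in this notebook: 300-d vectors, 2 layers.
lstm = nn.LSTM(input_size=300, hidden_size=300, num_layers=2, batch_first=True)

# Hypothetical batch: 8 documents, 50 word vectors each.
docs = torch.randn(8, 50, 300)

# encoded holds one vector per word; h_n and c_n hold the final state per layer.
encoded, (h_n, c_n) = lstm(docs)

print(encoded.shape)  # torch.Size([8, 50, 300])
print(h_n.shape)      # torch.Size([2, 8, 300])
```

Each of the 50 positions gets its own encoded vector, and the same LSTM weights are reused at every position.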

This raises another question: how are we training this encoding? It seems we do not have a target to train on and therefore no error signal, at least in this section on its own. Just feeding the vectors through an LSTM with random weights seems a little pointless. It seems more likely that this is learned by going through the whole architecture. Does this mean that in order to test this we need to have the whole thing set up?

After we have both encodings D and Q, we calculate the affinity matrix $L = D^\top Q$. This seems to make it unlikely that the encoders are coupled to the whole network, since it looks difficult (impossible?) to disentangle the error signal you backpropagate.
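
One way to probe this question is to compute the affinity on random tensors and ask autograd whether a gradient reaches both operands. The following is a minimal sketch under assumed toy sizes (`m`, `n`, and `dim` are made up); encodings are stored as rows, so the per-example product $D^\top Q$ becomes `D @ Q.transpose(1, 2)` in code.

```python
import torch

batch, m, n, dim = 2, 5, 3, 4

# Random stand-ins for the encodings, including the sentinel position (+1).
D = torch.randn(batch, m + 1, dim, requires_grad=True)
Q = torch.randn(batch, n + 1, dim, requires_grad=True)

# Batched affinity: one (m+1) x (n+1) matrix per example.
affinity = torch.bmm(D, Q.transpose(1, 2))
print(affinity.shape)

# Backpropagate a dummy scalar and check that both inputs received gradients.
affinity.sum().backward()
print(D.grad is not None, Q.grad is not None)
```

A matrix product is differentiable with respect to both of its operands, so autograd populates `D.grad` and `Q.grad` here.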

SOLUTION: encoders are unsupervised, and they try to learn a mapping from x to x, i.e. they approximate the identity function. So we train the LSTM with backprop and pass our input along as targets. Conceptually, we have the word vectors, which encode the meaning of single words. We pass these through an LSTM, which learns word context. So as output we get the same word meanings, which somehow also encapsulate word interactions because they have been through the LSTM. Is this correct?

In [4]:
# Set parameters
# Assuming that the LSTM takes one word at a time and the sizes stay the same through the encoder 
input_size = 300
hidden_size = 300
output_size = 300
num_layers = 2
batch_size = 8
learning_rate = 0.0007
num_epochs = 10

In [5]:
# Setup model
model = DocumentEncoderLSTM(input_size, hidden_size, num_layers)
model.cuda()
lossfun = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Because we're encoding the data, we are learning the identity function, so we use the input data x as our target. This is a 3D tensor, whereas the go-to loss function CrossEntropyLoss in its basic form expects 2D scores of shape (N, C) paired with a 1D tensor of class labels, one per example. Should we flatten our x? On the other hand, since we're not really predicting classes, it is more intuitive to use MSE or something similar.
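
To illustrate the difference between the two losses, here is a small sketch on random tensors. The shapes are made up for illustration. It shows that `nn.MSELoss` takes two same-shaped tensors of any rank and averages over all elements, while `nn.CrossEntropyLoss` in its basic form takes per-class scores and integer class labels.

```python
import torch
import torch.nn as nn

# MSE: prediction and target share one shape, here (batch, words, features).
output = torch.randn(8, 50, 300)
target = torch.randn(8, 50, 300)
loss = nn.MSELoss()(output, target)

# With the default reduction='mean' this is the mean squared difference
# over every element, so no flattening is needed.
manual = ((output - target) ** 2).mean()
print(torch.allclose(loss, manual))

# Cross-entropy: scores of shape (N, C) and N integer class indices.
logits = torch.randn(8, 5)
labels = torch.randint(0, 5, (8,))
ce = nn.CrossEntropyLoss()(logits, labels)
print(loss.dim(), ce.dim())
```

Both losses reduce to a single 0-dim tensor, which is what `backward()` is called on during training.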

In [6]:
# Get data
data = SquadDataset(squad_file_path, glove_file_path, target='text')

Found pickled GloVe file. Loading...
Done. 2195875 words loaded!


In [None]:
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True)

In [None]:
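for epoch in range(num_epochs):
    for i, data_batch in enumerate(dataloader):
        # Word vectors for the batch; the input doubles as the reconstruction target
        x = data_batch['text'].float().cuda()
        y = x
        
        # The model returns (output, (h_n, c_n)), so output[0] holds the encodings
        output = model(x)
        optimizer.zero_grad()
        loss = lossfun(output[0], y)
        loss.backward()
        optimizer.step()
        
        if (i+1)%100 == 0:
            # loss.item() extracts the Python float from the 0-dim loss tensor
            print('Epoch [%d/%d], Step[%d/%d], Loss: %0.4f'
                  %(epoch+1, num_epochs, i+1, len(data)//batch_size, loss.item()))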

Epoch [1/10], Step[100/10949], Loss: 0.0104
Epoch [1/10], Step[200/10949], Loss: 0.0097
Epoch [1/10], Step[300/10949], Loss: 0.0081
Epoch [1/10], Step[400/10949], Loss: 0.0050
Epoch [1/10], Step[500/10949], Loss: 0.0041
Epoch [1/10], Step[600/10949], Loss: 0.0044
Epoch [1/10], Step[700/10949], Loss: 0.0045
Epoch [1/10], Step[800/10949], Loss: 0.0038
Epoch [1/10], Step[900/10949], Loss: 0.0043
Epoch [1/10], Step[1000/10949], Loss: 0.0039
Epoch [1/10], Step[1100/10949], Loss: 0.0029
Epoch [1/10], Step[1200/10949], Loss: 0.0032
Epoch [1/10], Step[1300/10949], Loss: 0.0039
Epoch [1/10], Step[1400/10949], Loss: 0.0031
Epoch [1/10], Step[1500/10949], Loss: 0.0026
Epoch [1/10], Step[1600/10949], Loss: 0.0025
Epoch [1/10], Step[1700/10949], Loss: 0.0030
Epoch [1/10], Step[1800/10949], Loss: 0.0033
Epoch [1/10], Step[1900/10949], Loss: 0.0037
Epoch [1/10], Step[2000/10949], Loss: 0.0024
Epoch [1/10], Step[2100/10949], Loss: 0.0029
Epoch [1/10], Step[2200/10949], Loss: 0.0033
Epoch [1/10], Step[

Epoch [2/10], Step[7500/10949], Loss: 0.0028
Epoch [2/10], Step[7600/10949], Loss: 0.0025
Epoch [2/10], Step[7700/10949], Loss: 0.0019
Epoch [2/10], Step[7800/10949], Loss: 0.0032
Epoch [2/10], Step[7900/10949], Loss: 0.0027
Epoch [2/10], Step[8000/10949], Loss: 0.0024
Epoch [2/10], Step[8100/10949], Loss: 0.0030
Epoch [2/10], Step[8200/10949], Loss: 0.0022
Epoch [2/10], Step[8300/10949], Loss: 0.0025
Epoch [2/10], Step[8400/10949], Loss: 0.0029
Epoch [2/10], Step[8500/10949], Loss: 0.0023
Epoch [2/10], Step[8600/10949], Loss: 0.0024
Epoch [2/10], Step[8700/10949], Loss: 0.0026
Epoch [2/10], Step[8800/10949], Loss: 0.0023
Epoch [2/10], Step[8900/10949], Loss: 0.0018
Epoch [2/10], Step[9000/10949], Loss: 0.0023
Epoch [2/10], Step[9100/10949], Loss: 0.0025
Epoch [2/10], Step[9200/10949], Loss: 0.0032
Epoch [2/10], Step[9300/10949], Loss: 0.0020
Epoch [2/10], Step[9400/10949], Loss: 0.0015
Epoch [2/10], Step[9500/10949], Loss: 0.0028
Epoch [2/10], Step[9600/10949], Loss: 0.0029
Epoch [2/1

Epoch [4/10], Step[4000/10949], Loss: 0.0025
Epoch [4/10], Step[4100/10949], Loss: 0.0023
Epoch [4/10], Step[4200/10949], Loss: 0.0027
Epoch [4/10], Step[4300/10949], Loss: 0.0028
Epoch [4/10], Step[4400/10949], Loss: 0.0023
Epoch [4/10], Step[4500/10949], Loss: 0.0031
Epoch [4/10], Step[4600/10949], Loss: 0.0027
Epoch [4/10], Step[4700/10949], Loss: 0.0023
Epoch [4/10], Step[4800/10949], Loss: 0.0024
Epoch [4/10], Step[4900/10949], Loss: 0.0026
Epoch [4/10], Step[5000/10949], Loss: 0.0020
Epoch [4/10], Step[5100/10949], Loss: 0.0023
Epoch [4/10], Step[5200/10949], Loss: 0.0024
Epoch [4/10], Step[5300/10949], Loss: 0.0022
Epoch [4/10], Step[5400/10949], Loss: 0.0023
Epoch [4/10], Step[5500/10949], Loss: 0.0020
Epoch [4/10], Step[5600/10949], Loss: 0.0024
Epoch [4/10], Step[5700/10949], Loss: 0.0028
Epoch [4/10], Step[5800/10949], Loss: 0.0024
Epoch [4/10], Step[5900/10949], Loss: 0.0024
Epoch [4/10], Step[6000/10949], Loss: 0.0020
Epoch [4/10], Step[6100/10949], Loss: 0.0033
Epoch [4/1