In [1]:
import numpy as np
import torch
import torch.nn as nn

from torch.autograd import Variable

## Dynamic Pointing Decoder

#### Input
We have encoded the information from a question-document pair in the coattention encoding matrix U. U has shape (2 × length of the word vectors) × (number of words in the document). For the moment, we'll assume U is of shape 600 × 600, since our word vectors have length 300 and the max sequence length is set at 600. In this notebook, we'll start with a dummy U. Note that the code cells below set l = 200, so the dummy U there has shape 400 × 600.

#### Decoder
The decoder iteratively estimates the answer span by alternating between predicting the start and end points. It consists of standard LSTM cells as well as Highway Maxout Networks (HMNs). Actually, it's two networks: one that estimates the start point, and one that estimates the end point. The networks are identical in architecture, but do not share parameters. In the big picture, we alternate between these two networks to get our answer span. As the networks are identical, we'll refer to them as "the network".

#### Maxout Networks
[EXPLANATION ON MAXOUT NETWORKS HERE]

#### Highway Networks
[EXPLANATION ON HIGHWAY NETWORKS HERE]

#### Highway Maxout Networks
[EXPLANATION ON HIGHWAY MAXOUT NETWORKS HERE]

#### Cell inputs
The network outputs start and end scores for each word in the document and the final estimate is just the argmax of that list. How these individual scores are computed is the tricky part. The score for one word is the output of a Highway Maxout Network, which takes in the coattention encoding of that word, the hidden state of the model (we'll get back to this), the coattention encoding of the previous start point estimate, and the coattention encoding of the previous end point estimate. 

The hidden state of the LSTM, which is fed into the HMN, depends on the LSTM's previous hidden state, the coattention encoding of the previous start estimate, and the coattention encoding of the previous end estimate. These are the same inputs as for the HMN, except that the LSTM doesn't look at a specific word; it just keeps track of the hidden state.
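To make the state update concrete, here is a minimal sketch of a single decoder step. It uses `nn.LSTMCell` instead of the `nn.LSTM` used further down, which is our own simplification, and the zero initial states and random encodings are dummy values.

```python
import torch
import torch.nn as nn

l = 200          # hidden size, matching the notebook's `l`
batch_size = 32

# One decoder step consumes the encodings of the previous start and end
# estimates (2l each), concatenated into a single 4l-dimensional input.
cell = nn.LSTMCell(input_size=4 * l, hidden_size=l)

u_start_prev = torch.rand(batch_size, 2 * l)   # dummy encoding of previous start
u_end_prev = torch.rand(batch_size, 2 * l)     # dummy encoding of previous end
h = torch.zeros(batch_size, l)                 # previous hidden state (assumed zero)
c = torch.zeros(batch_size, l)                 # previous cell state (assumed zero)

# New hidden state depends on the old state and both previous estimates.
h, c = cell(torch.cat((u_start_prev, u_end_prev), dim=1), (h, c))
```

With a cell, the batch dimension is explicit, so there is no ambiguity about which axis is the sequence and which is the batch.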

It's important to note these inputs (for the LSTM and the HMN) are the same for both (start and end) networks: the network estimating the start position takes into account the previous end point estimate, and vice versa. So they need to somehow communicate.
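Both networks need "the coattention encoding of the previous estimate", which is just the column of U that belongs to the estimated word. A small sketch of that lookup, assuming U is laid out as (batch, 2l, words) like the mock batch in this notebook; the index tensor `start_idx` is made up for illustration.

```python
import torch

batch_size, l, m = 4, 200, 600
U = torch.rand(batch_size, 2 * l, m)              # (batch, 2l, words)
start_idx = torch.randint(0, m, (batch_size,))    # hypothetical previous start estimates

# Move the word axis next to the batch axis, then pick one word per sample.
rows = torch.arange(batch_size)
u_start_prev = U.transpose(1, 2)[rows, start_idx]  # (batch, 2l)
```

The same lookup with the end indices gives the encoding of the previous end estimate.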

#### HMN in detail
Okay, let's have a shot at describing the HMN model in layman's terms.

- First, we have a layer that puts together everything we have at the moment: the hidden state and the previous start and end estimates. These values are concatenated and multiplied by a weight matrix. Then we apply a tanh nonlinearity over the product.

- The second layer concatenates the coattention encoding of the current word with the output of the tanh in the previous layer, passes it through a linear function (Wx + b), and takes the max over the first dimension of the resulting tensor (the maxout pool).

- The third layer takes the output of the second layer, passes it through a linear function, and again applies a max over the first dimension of the result. Essentially the same as the second layer.

- The fourth and last layer concatenates the outputs of both the second and third layer, and then does the same operation. The Highway part of the network refers to this skip connection: the last layer sees the output of the second layer directly, not just the output of the layer right before it.

To train the network, we minimize the cumulative softmax cross entropy of the start and end points across all iterations. The iterative procedure halts when both the estimate of the start position and the estimate of the end position no longer change, or when a maximum number of iterations is reached. We set the maximum number of iterations to 4 and use a maxout pool size of 16.
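The layers and the training procedure described above can be sketched end to end as follows. This is a rough sketch under several assumptions of ours, not a reference implementation: U is laid out as (batch, words, 2l) so that all words are scored at once, the initial estimates are the first and last word, the end network sees the freshly updated start estimate, and the loop halts only when the whole batch has stopped changing. The names `maxout` and `HMNSketch` and the tiny sizes are made up for illustration.

```python
import torch
import torch.nn as nn


def maxout(linear, x, out_features, pool_size):
    """Apply `linear`, then take the max over each group of `pool_size` units."""
    y = linear(x)                                         # (..., out_features * pool_size)
    y = y.view(*x.shape[:-1], out_features, pool_size)
    return y.max(dim=-1)[0]                               # (..., out_features)


class HMNSketch(nn.Module):
    def __init__(self, l, pool_size):
        super().__init__()
        self.l, self.p = l, pool_size
        self.w_d = nn.Linear(5 * l, l, bias=False)        # first layer (tanh)
        self.w_1 = nn.Linear(3 * l, l * pool_size)        # second layer (maxout)
        self.w_2 = nn.Linear(l, l * pool_size)            # third layer (maxout)
        self.w_3 = nn.Linear(2 * l, pool_size)            # last layer (maxout, one score)

    def forward(self, U, h, u_s, u_e):
        # U: (batch, m, 2l), h: (batch, l), u_s and u_e: (batch, 2l)
        r = torch.tanh(self.w_d(torch.cat((h, u_s, u_e), dim=1)))    # (batch, l)
        r = r.unsqueeze(1).expand(-1, U.size(1), -1)                  # (batch, m, l)
        m1 = maxout(self.w_1, torch.cat((U, r), dim=2), self.l, self.p)
        m2 = maxout(self.w_2, m1, self.l, self.p)
        # Highway connection: the last layer sees both m1 and m2.
        scores = maxout(self.w_3, torch.cat((m1, m2), dim=2), 1, self.p)
        return scores.squeeze(-1)                                      # (batch, m)


torch.manual_seed(0)
batch, l, m, pool, max_iters = 2, 8, 10, 4, 4             # tiny dummy sizes
U = torch.rand(batch, m, 2 * l)
start_net, end_net = HMNSketch(l, pool), HMNSketch(l, pool)   # no shared parameters
cell = nn.LSTMCell(4 * l, l)
loss_fn = nn.CrossEntropyLoss()
y_start = torch.randint(0, m, (batch,))                   # dummy targets
y_end = torch.randint(0, m, (batch,))

rows = torch.arange(batch)
s = torch.zeros(batch, dtype=torch.long)                  # assumed initial start estimate
e = torch.full((batch,), m - 1, dtype=torch.long)         # assumed initial end estimate
h, c = torch.zeros(batch, l), torch.zeros(batch, l)
loss = 0.0
for _ in range(max_iters):
    u_s, u_e = U[rows, s], U[rows, e]
    h, c = cell(torch.cat((u_s, u_e), dim=1), (h, c))
    alpha = start_net(U, h, u_s, u_e)                     # start scores, (batch, m)
    new_s = alpha.argmax(dim=1)
    beta = end_net(U, h, U[rows, new_s], u_e)             # end scores, (batch, m)
    new_e = beta.argmax(dim=1)
    # Cumulative cross entropy over all iterations.
    loss = loss + loss_fn(alpha, y_start) + loss_fn(beta, y_end)
    converged = torch.equal(new_s, s) and torch.equal(new_e, e)
    s, e = new_s, new_e
    if converged:
        break
```

Scoring all words in one call avoids looping over the document in Python, at the cost of materialising a (batch, m, l · pool_size) tensor in the maxout layers.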

In [2]:
# Define inputs
m = 600
l = 200

U = Variable(torch.rand(2*l, m), requires_grad=True)

In [3]:
U.size()

torch.Size([400, 600])

In [4]:
class MaxOut(nn.Module):
    
    def __init__(self, input_size, output_size, pool_size):
        super().__init__()
        self.input_size, self.output_size, self.pool_size = input_size, output_size, pool_size
        self.lin = nn.Linear(input_size, output_size * pool_size)
        
    def forward(self, inputs):
        # Target shape after the linear layer: (..., output_size, pool_size)
        shape = list(inputs.size())
        shape[-1] = self.output_size
        shape.append(self.pool_size)
        max_dim = len(shape) - 1
        out = self.lin(inputs)
        # Max over the pool dimension; the argmax indices are not needed
        m, _ = out.view(*shape).max(max_dim)
        return m

In [5]:
# Define HMN

class HMN(nn.Module):
    
    def __init__(self, input_size, pool_size):
        super(HMN, self).__init__()
        
        self.r_linear = nn.Linear(5*input_size, input_size, bias=False)
        self.r_tanh = nn.Tanh()
        
        self.m1 = MaxOut(3*input_size, input_size, pool_size)
        self.m2 = MaxOut(input_size, input_size, pool_size)
        self.m3 = MaxOut(2*input_size, 1, pool_size)
        
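    def forward(self, h_i, u_t, u_start_prev, u_end_prev):
        # h_i: LSTM hidden state, shape (1, 1, l)
        # u_t: coattention encoding of the word being scored, shape (batch, 2l)
        # u_start_prev, u_end_prev: encodings of the previous estimates, shape (batch, 2l)
        batch_size = u_start_prev.size()[0]
        # Reshape hidden state to single vector
        h_i = h_i.view(h_i.size()[2])
        # Copy hidden state for each sample in batch
        h_i = h_i.expand(batch_size, h_i.size()[0])
        hi_us_ue = torch.cat((h_i, u_start_prev, u_end_prev), dim=1)
        
        r = self.r_linear(hi_us_ue)
        r = self.r_tanh(r)  # (batch, l), e.g. 32x200
        
        u_r = torch.cat((u_t, r), dim=1)  # (batch, 3l), e.g. 32x600
        m1 = self.m1(u_r)
        m2 = self.m2(m1)
        # Highway connection: the last layer sees both m1 and m2
        m1_m2 = torch.cat((m1, m2), dim=1)
        # Score of this word, shape (batch, 1); the argmax over all words
        # in the document is taken outside the HMN
        m3 = self.m3(m1_m2)
        return m3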

In [6]:
# Define full decoder network

class Decoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, pool_size, num_layers):
        super(Decoder, self).__init__()
        
        self.lstm = nn.LSTM(4*input_size, hidden_size, num_layers)
        self.hmn = HMN(input_size, pool_size)
        
    def forward(self, u_t, u_start_prev, u_end_prev):
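        x = torch.cat((u_start_prev, u_end_prev), dim=1)
        # nn.LSTM defaults to batch_first=False, i.e. input of shape
        # (seq_len, batch, features). With this view the samples are read as
        # one sequence of length batch_size with batch size 1, so h_t has
        # shape (num_layers, 1, hidden_size). To process them as a batch of
        # single-step sequences, the view would be (1, batch_size, 4l), and
        # HMN.forward would have to accept h_t of shape (num_layers, batch, hidden_size).
        x = x.view(x.size(0), 1, x.size(1))
        _,(h_t, _) = self.lstm(x)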
        out = self.hmn(h_t, u_t, u_start_prev, u_end_prev)
        return out

In [7]:
# model setup

num_layers = 1  # HMN.forward flattens h_t, which only works for a single-layer LSTM

batch_size = 32
learning_rate = 0.0007
num_epochs = 10

model = Decoder(input_size=200, hidden_size=200, pool_size=16, num_layers=num_layers)

model.cuda()  # requires a CUDA-capable GPU
lossfun = nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [8]:
# Define inputs
m = 600
l = 200

# make a single mock batch
u_start_init = Variable(torch.rand(32,2*l),requires_grad=True)
u_end_init = Variable(torch.rand(32, 2*l), requires_grad=True)

U = Variable(torch.rand(32, 2*l, m), requires_grad=True)
# mock targets: one word index per sample, as nn.CrossEntropyLoss expects class indices
y = Variable(torch.randint(0, m, (32,)), requires_grad=False)



In [9]:
num_epochs = 1

In [10]:
%%time

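u_start_prev = u_start_init.cuda()
u_end_prev = u_end_init.cuda()

for epoch in range(num_epochs):
    # Single mock batch for now; later: for i, U in enumerate(train_data):
    x = U.cuda()
    # CrossEntropyLoss expects class indices of shape (batch,)
    target = y.cuda().view(-1).long()
    
    # Score every word in the document. Words run along the last dimension
    # of U, so x[:, :, t] is the encoding of word t (a column, not a row).
    scores = torch.cat([model(x[:, :, t], u_start_prev, u_end_prev) for t in range(m)], dim=1)
    
    optimizer.zero_grad()
    loss = lossfun(scores, target)
    loss.backward()
    optimizer.step()
    
    # New start estimate per sample
    start = scores.argmax(dim=1)
    end = (start + 6).clamp(max=m - 1)  # hacky way of setting the end point estimate
    rows = torch.arange(x.size(0), device=x.device)
    u_start_prev = x.transpose(1, 2)[rows, start].detach()
    u_end_prev = x.transpose(1, 2)[rows, end].detach()
    
    print('Epoch [%d/%d], Loss: %0.4f' % (epoch+1, num_epochs, loss.item()))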

We use a max sequence length of 600 during training and a hidden state size of 200 for all recurrent units, maxout layers, and linear layers.

- Do linear layers have hidden units? Do maxout layers have hidden units??
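One way to read "hidden state size of 200" for linear and maxout layers is as the output width of the layer. Under that reading (which is our assumption, not something the quote above spells out), the parameter shapes would look like this:

```python
import torch.nn as nn

hidden, pool_size = 200, 16

# A plain linear layer with 200 output units.
linear = nn.Linear(hidden, hidden)

# A maxout layer with 200 output units needs pool_size candidate
# linear pieces per output unit before the max is taken.
maxout_linear = nn.Linear(hidden, hidden * pool_size)

print(linear.weight.shape)          # torch.Size([200, 200])
print(maxout_linear.weight.shape)   # torch.Size([3200, 200])
```

So a maxout layer has the same number of outputs as the linear layer, but pool_size times as many weights.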