In [1]:
import numpy as np
import torch
import torch.nn as nn

# Variable is deprecated since PyTorch 0.4; plain tensors track gradients themselves.
from torch.autograd import Variable



## Dynamic Pointing Decoder

#### Input
We have encoded the information from a question-document pair in the coattention encoding matrix U. U has shape (2 * length of the word vectors) x (number of words in the document). For the moment, we'll assume U is 600 x 600, since our word vectors have length 300 and the max sequence length is set to 600. In this notebook, we'll start with a dummy U of these dimensions.
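A minimal sketch of such a dummy U, assuming random values are good enough as a stand-in for a real encoding. The names `word_vec_len` and `max_seq_len` are made up for this example.

```python
import torch

word_vec_len = 300   # length of one word vector
max_seq_len = 600    # maximum number of words in the document

# Random stand-in for the real coattention encoding: one column per document word.
torch.manual_seed(0)
U = torch.randn(2 * word_vec_len, max_seq_len)

print(U.shape)  # torch.Size([600, 600])
```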

#### Decoder
The decoder iteratively estimates the answer span by alternating between predicting the start and end points. It consists of standard LSTM cells as well as Highway Maxout Networks (HMNs). Actually, it's two networks: one that estimates the start point, and one that estimates the end point. The networks are identical in architecture, but do not share parameters. In the big picture, we alternate between these two networks to get our answer span. As the networks are identical, we'll refer to them as the network.

#### Maxout Networks
[EXPLANATION ON MAXOUT NETWORKS HERE]

#### Highway Networks
[EXPLANATION ON HIGHWAY NETWORKS HERE]

#### Maxout Highway Networks
[EXPLANATION ON MAXOUT HIGHWAY NETWORKS HERE]

#### Cell inputs
The network outputs start and end scores for each word in the document and the final estimate is just the argmax of that list. How these individual scores are computed is the tricky part. The score for one word is the output of a Highway Maxout Network, which takes in the coattention encoding of that word, the hidden state of the model (we'll get back to this), the coattention encoding of the previous start point estimate, and the coattention encoding of the previous end point estimate. 
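To make the "argmax of that list" and the "coattention encoding of the previous estimate" concrete, here is a small sketch. It assumes each column of U belongs to one document word, and it uses random scores as a placeholder for what the HMN would produce.

```python
import torch

torch.manual_seed(0)
U = torch.randn(600, 600)        # dummy coattention encoding, one column per word

# Placeholder scores for illustration only; in the decoder these come from the HMN.
start_scores = torch.randn(600)
end_scores = torch.randn(600)

# The estimate is the index of the highest-scoring word.
s = int(torch.argmax(start_scores))
e = int(torch.argmax(end_scores))

# The coattention encoding of an estimate is the matching column of U.
u_s = U[:, s]
u_e = U[:, e]
```

In the next iteration, `u_s` and `u_e` are what gets fed back in as the previous start and end estimates.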

The hidden state of the LSTM, which is fed into the HMN, depends on the LSTM's previous hidden state, the coattention encoding of the previous start estimate, and the coattention encoding of the previous end estimate. These are the same inputs the HMN gets, except that the LSTM doesn't look at a specific word; it just keeps track of the hidden state.
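One step of this hidden state update could look roughly like the sketch below. Using `nn.LSTMCell`, a hidden size of 300, and the example indices 10 and 20 are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

enc_dim = 600      # rows of U
hidden_dim = 300   # assumed size of the decoder's hidden state

torch.manual_seed(0)
U = torch.randn(enc_dim, 600)   # dummy coattention encoding
lstm = nn.LSTMCell(2 * enc_dim, hidden_dim)

# Previous hidden and cell state (zeros at the very first iteration), batch size 1.
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)

# Arbitrary previous start and end estimates.
s_prev, e_prev = 10, 20

# The LSTM input is the concatenation of the two previous estimates' encodings.
x = torch.cat([U[:, s_prev], U[:, e_prev]]).unsqueeze(0)   # shape (1, 1200)
h, c = lstm(x, (h, c))
```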

It's important to note that these inputs (for the LSTM and the HMN) are the same for both the start and end networks: the network estimating the start position takes the previous end point estimate into account, and vice versa. So the two networks have to exchange their estimates after every iteration.

#### HMN in detail
Okay, let's have a shot at describing the HMN model in layman's terms.

- First, we have a layer that puts together everything we have at the moment: the hidden state and the previous start and end estimates. These values are concatenated and multiplied by a weight matrix. Then we apply a tanh nonlinearity over the product.


- The second layer concatenates the coattention encoding of the current word with the output of the tanh in the previous layer, puts it into a linear function (Wx + b) and takes the max over the first dimension of the resulting tensor (the maxout pool dimension).


- The third layer takes the output of the second layer, puts it into a linear function and again applies a max over the first dimension of the result. Essentially the same as the second layer.


- The fourth and last layer concatenates the outputs of both the second and third layer, and then does the same operation, which gives the score for the word. The Highway part of the network is this skip connection: the last layer gets the output of the second layer directly, not just the output of the layer right before it.


To train the network, we minimize the cumulative softmax cross entropy of the start and end points across all iterations. The iterative procedure halts when both the estimate of the start position and the estimate of the end position no longer change, or when a maximum number of iterations is reached. We set the maximum number of iterations to 4 and use a maxout pool size of 16.
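The four layers, the halting rule and the cumulative loss described above can be sketched as follows. This is one possible reading for a single document, not a reference implementation. The names `HighwayMaxout`, `decode` and `cumulative_loss`, the hidden size of 300, and the initial guess of word 0 for both start and end are assumptions. Because the words are stacked along dimension 0 here, the pool dimension that the max runs over is dimension 1 rather than the first.

```python
import torch
import torch.nn as nn


class HighwayMaxout(nn.Module):
    """Scores every document word; sketch of the four layers described above."""

    def __init__(self, enc_dim=600, hidden_dim=300, pool_size=16):
        super().__init__()
        self.hidden_dim, self.pool_size = hidden_dim, pool_size
        # Layer 1: mixes the hidden state with the previous start/end encodings.
        self.proj = nn.Linear(hidden_dim + 2 * enc_dim, hidden_dim, bias=False)
        # Layers 2-4: each linear map yields pool_size candidates to max over.
        self.maxout1 = nn.Linear(enc_dim + hidden_dim, pool_size * hidden_dim)
        self.maxout2 = nn.Linear(hidden_dim, pool_size * hidden_dim)
        self.maxout3 = nn.Linear(2 * hidden_dim, pool_size)

    def _pool(self, x):
        # (m, pool_size * hidden_dim) -> max over the pool axis -> (m, hidden_dim)
        return x.view(-1, self.pool_size, self.hidden_dim).max(dim=1)[0]

    def forward(self, U, h, u_s, u_e):
        # U: (enc_dim, m), h: (hidden_dim,), u_s and u_e: (enc_dim,)
        m = U.size(1)
        r = torch.tanh(self.proj(torch.cat([h, u_s, u_e])))
        x = torch.cat([U.t(), r.unsqueeze(0).expand(m, -1)], dim=1)
        m1 = self._pool(self.maxout1(x))
        m2 = self._pool(self.maxout2(m1))
        # Highway connection: the last layer sees both m1 and m2.
        return self.maxout3(torch.cat([m1, m2], dim=1)).max(dim=1)[0]   # (m,)


def decode(U, start_net, end_net, lstm, max_iters=4):
    """Alternate start/end estimates until they stop changing or max_iters is hit."""
    h = torch.zeros(1, lstm.hidden_size)
    c = torch.zeros(1, lstm.hidden_size)
    s, e = 0, 0   # assumed initial guesses
    all_alpha, all_beta = [], []
    for _ in range(max_iters):
        u_s, u_e = U[:, s], U[:, e]
        h, c = lstm(torch.cat([u_s, u_e]).unsqueeze(0), (h, c))
        # Both networks see the same hidden state and the same previous estimates.
        alpha = start_net(U, h.squeeze(0), u_s, u_e)
        beta = end_net(U, h.squeeze(0), u_s, u_e)
        all_alpha.append(alpha)
        all_beta.append(beta)
        new_s, new_e = int(alpha.argmax()), int(beta.argmax())
        unchanged = (new_s == s) and (new_e == e)
        s, e = new_s, new_e
        if unchanged:
            break
    return s, e, all_alpha, all_beta


def cumulative_loss(all_alpha, all_beta, true_s, true_e):
    """Sum of softmax cross entropies for start and end over all iterations."""
    ce = nn.CrossEntropyLoss()
    ts, te = torch.tensor([true_s]), torch.tensor([true_e])
    return sum(ce(a.unsqueeze(0), ts) + ce(b.unsqueeze(0), te)
               for a, b in zip(all_alpha, all_beta))
```

To try it on the dummy U, create two separate `HighwayMaxout()` instances (the start and end networks do not share parameters) and an `nn.LSTMCell(1200, 300)`, then call `decode(U, start_net, end_net, lstm)`.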