In [86]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pylab as plt
import numpy as np

device = 'cuda:0' if torch.cuda.is_available() else "cpu"

def gen_data(N=100, d=10, low=0, high=10):
    data = np.random.randint(low=low, high=high, size=(N,d))
    
    target_idx = np.random.randint(low=0, high=d, size=N)
    
    y = data[np.arange(data.shape[0]), target_idx]
    
    return (data, target_idx), y

N = 5000
low = 0 
high = 10
d = 10

train_data, train_target = gen_data(N=N, low=low, high=high, d=d)
test_data, test_target = gen_data(N=N, low=low, high=high, d=d)



## Recurrent Neural Networks (RNN)

An RNN will process a sequence of tokens. The pseudocode is something like the following:

token_list = [...]

hidden_vec = [0, ..., 0] #some fixed length or dimensionality

for token in token_list:

    #look up the vector for each token in a table
    token_vec = embedding_table[token]
    
    #use previous hidden_vec and current token_vec to update hidden_vec
    #this updates the state (hidden_vec) of the net using the new token
    
    hidden_vec = update(hidden_vec, token_vec) #(previous state, new data)
    
After the loop, hidden_vec summarizes the information in the input sequence and can be used to make a prediction:

prediction = pred(hidden_vec)

For us, this could also be

prediction = pred(hidden_vec, index)

if the task is to predict the entry at a particular index.
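The loop above can be made concrete with a minimal NumPy sketch. The tanh update rule, the random weights `W_h`, `W_x`, `b` and all sizes are illustrative assumptions, not the architecture used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, emb_dim, hidden_dim = 10, 4, 3  # illustrative sizes

# one vector per token (random here; learned in a real network)
embedding_table = rng.normal(size=(vocab_size, emb_dim))

# weights of a vanilla RNN cell (random here; learned in a real network)
W_h = rng.normal(size=(hidden_dim, hidden_dim))
W_x = rng.normal(size=(hidden_dim, emb_dim))
b = np.zeros(hidden_dim)

def update(hidden_vec, token_vec):
    # combine the previous state and the new data into the next state
    return np.tanh(W_h @ hidden_vec + W_x @ token_vec + b)

def encode(token_list):
    hidden_vec = np.zeros(hidden_dim)
    for token in token_list:
        token_vec = embedding_table[token]
        hidden_vec = update(hidden_vec, token_vec)
    return hidden_vec

h_final = encode([9, 6, 9, 7])
print(h_final.shape)
```

The final `hidden_vec` has a fixed size regardless of the sequence length, which is what lets a downstream `pred` function consume it.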

### Embedding

Suppose we were working with natural text where the tokens are words. To feed a word like "apple" to a neural network, we need to "numericalize" it, i.e. convert it to a number.

The simplest solution is to map each unique token to a unique integer. For example:

"apple" -> 0

"is" -> 1

"a" -> 2

etc.

Note that the only requirement is that this mapping is one-to-one i.e. different words are mapped to different integers. The actual mapping, i.e. whether "apple" -> 0 or "apple" -> 59, doesn't matter.

Are there any problems with this encoding of tokens? One immediate problem is that it imposes an ordering on the tokens: "apple" is not less than "a", but 0 < 2. Depending on the machine learning model used, this ordered encoding can induce artifacts that are not real.

The solution to this problem is the so-called one-hot encoding. Suppose there are N distinct words. Each word is mapped to a vector of size N where exactly one entry is 1 and the rest are zeros. Suppose N = 3. Then,

"apple" -> [1,0,0]

"is" -> [0,1,0]

"a" -> [0,0,1]

Now, there is no order imposed on the tokens. Each vector is orthogonal to all other vectors (the dot product of vectors corresponding to distinct words is 0). One-hot encoding creates a couple of problems of its own. If the number of tokens is large, the dimensionality N will also be large, which has implications for memory usage. Another problem is more conceptual. While each word is distinct (by definition) from every other word, words are not equally distinct in meaning. "apple" and "mango" are similar in the sense that they are both fruits, but they are clearly also distinct (winter vs summer fruit etc.). Since vectors can be used to encode similarity, is it possible to map tokens to vectors such that (a) words with similar meanings map to similar vectors and (b) words with dissimilar meanings map to dissimilar vectors?

This is the problem embeddings solve. The philosophy in neural networks is to map each token to a unique vector in a vector space of relatively small dimension (128 below) compared to the number of distinct tokens, N. The embeddings are initialized randomly and are then adjusted during learning by exactly the same process used to learn weights, i.e. by computing derivatives and using gradient descent.
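As a small sketch of how the two encodings relate (the sizes and the name `emb_small` are illustrative assumptions): distinct one-hot vectors are orthogonal, and an embedding lookup returns the same row that multiplying the one-hot vector by the embedding weight matrix would select.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, embedding_dim = 3, 4  # illustrative sizes

# one-hot vectors for the tokens 0 ("apple"), 1 ("is"), 2 ("a")
one_hot = F.one_hot(torch.arange(num_tokens), num_classes=num_tokens).float()

# distinct one-hot vectors are orthogonal: the Gram matrix is the identity
gram = one_hot @ one_hot.T

# an embedding table is a (num_tokens, embedding_dim) weight matrix;
# looking up token i returns row i, which equals one_hot[i] @ weight
emb_small = nn.Embedding(num_embeddings=num_tokens, embedding_dim=embedding_dim)
lookup = emb_small(torch.arange(num_tokens))
via_matmul = one_hot @ emb_small.weight

print(gram)
print(torch.allclose(lookup, via_matmul))
```

The lookup avoids ever materializing the N-dimensional one-hot vectors, which is why it scales to large vocabularies.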

In [6]:
embedding_dim = 128
emb = nn.Embedding(num_embeddings=high-low, embedding_dim=embedding_dim)

In [7]:
#can now look up embedding vectors based on input token (any value between low and high-1)

emb(torch.tensor(0))

tensor([-0.7776,  0.0758, -1.2728,  0.3797, -1.4441,  0.4137, -0.8859, -0.3029,
         0.0037, -1.9666,  0.8111, -0.4503, -0.6622, -1.5308, -0.9797,  1.4748,
         1.0709,  0.9153,  0.7402,  0.3153,  0.3452,  1.4954, -0.7081, -0.3989,
        -0.5133, -0.3613,  1.0528, -0.2357, -0.8393,  0.8153,  0.9210, -0.3544,
        -0.3448, -0.1160,  0.1462, -0.2989,  0.8994, -0.5678, -1.2291, -1.8019,
        -0.0917,  1.0638, -0.8886, -0.5970, -0.5144,  0.7372, -0.9562,  1.0981,
        -1.7168, -1.1826, -0.6307,  1.3211, -0.7107,  0.7663,  0.2123, -0.0610,
        -0.0494,  0.3149,  0.4044, -0.0982, -0.2120,  1.3882, -0.7457,  0.8129,
        -0.7858,  1.9465,  1.3055, -0.4094,  0.2988, -0.1123,  0.7873,  0.0784,
         0.4255, -0.4259,  0.4289, -0.0592,  0.6327, -1.4207,  1.5566,  1.8306,
         1.3141,  0.4218,  0.1698,  1.0598,  0.3751,  0.2128, -0.4448, -0.1987,
         0.8481,  0.2293,  0.3222,  0.3825,  1.2912, -0.0781, -1.7094,  0.5180,
        -0.7820, -0.6893,  0.4335, -0.75

In [8]:
emb(torch.tensor(high-1))

tensor([-1.4117e+00, -5.4551e-01, -5.6932e-01, -1.1557e+00, -5.2234e-01,
        -8.4925e-01,  6.9589e-01, -7.5135e-02, -3.4572e-01,  1.4316e+00,
         1.4259e-01,  1.4484e+00,  5.6853e-01,  7.5547e-01, -9.8406e-01,
        -5.5709e-01,  4.5423e-01,  1.8467e+00, -4.3281e-01,  8.4503e-02,
         4.8608e-01,  7.7047e-01,  1.0511e+00,  9.8881e-02,  1.9821e+00,
        -8.4866e-01, -4.8272e-01, -7.3897e-01, -1.4799e+00, -3.7572e+00,
         4.9821e-01, -6.5011e-01, -3.3622e-01,  2.9494e-01, -1.3662e+00,
         1.3828e+00,  6.5192e-01, -3.4389e-01, -1.6061e+00, -8.6304e-01,
         5.2523e-01,  5.7143e-01, -2.7077e-01,  1.0744e-01, -5.1499e-02,
        -4.5526e-01, -5.3947e-01, -6.3980e-01,  1.6315e-01,  3.1947e+00,
        -1.0042e+00,  4.7981e-01,  7.1870e-01,  7.8232e-02, -1.9011e+00,
        -1.6718e+00,  1.3400e-02,  7.5010e-02,  1.6866e+00,  5.3600e-01,
        -1.2258e+00, -5.8847e-01, -1.8556e+00,  9.6030e-01, -5.4986e-01,
         6.6554e-01,  2.8968e-01,  2.2176e-01, -3.0

In [9]:
#only values between low and high-1 have entries in the table
emb(torch.tensor(high))

IndexError: index out of range in self

### RNN definition

The code below defines the recurrent neural network. There are three broad classes of RNNs:

1. Vanilla RNNs - these tend to have a problem learning long-range behavior if the sequences are long. This is due to the so-called exploding and vanishing gradients problem (don't worry about this for now).

2. Long short-term memory networks (LSTMs): instead of having just one (hidden) state like RNNs, LSTMs maintain a long-term memory vector and a short-term memory vector. The coarse idea is to use the long-term memory vector to "remember" long-range patterns.

3. Gated Recurrent Units (GRUs): A simpler form of LSTMs (fewer parameters/weights) with the same underlying idea.

(We can go over the details in a call)

We will also use "attention". The core idea of attention is described below.

As an RNN processes an input sequence (sentence or byte vector), it generates a sequence of hidden vectors.

$$h_1, h_2, \ldots, h_T$$

where T = length of sequence.

The final hidden state is then used as a measure of context for any downstream tasks (predicting an output sequence or predicting a class for the sequence).

Attention refers to the idea that the context shouldn't consist just of $h_T$ but should be dynamic/flexible. 

For more details, either see the code below or the notes on attention near the end of this notebook.

#### First, experiment with an LSTM to understand shapes of various tensors

In [10]:
#see: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
rnn = nn.LSTM(input_size = 128, #dimension of embedding
              hidden_size = 32, #dimension of the hidden and cell states
              num_layers = 2,
              batch_first = True, #expect (batch, seq, feature)
              dropout = 0.5,
              bidirectional=True                             
             )

In [11]:
#want to understand data flow. pick some input data
#train_data is a tuple (data, target_idx), so take the sequences from train_data[0]
inp = torch.from_numpy(train_data[0][0:5])
inp.shape #(batch/sequence number, length/time)

torch.Size([5, 10])

In [12]:
emb(inp).shape #(batch/sequence number, length/time, embedding feature)

torch.Size([5, 10, 128])

In [13]:
print(type(rnn(emb(inp))))
len(rnn(emb(inp)))

<class 'tuple'>


2

In [14]:
print(type(rnn(emb(inp))[0]))
print(rnn(emb(inp))[0].shape) #(batch/sequence number, length/time, hidden_size*2 for bidirectional)

<class 'torch.Tensor'>
torch.Size([5, 10, 64])


In [15]:
print(type(rnn(emb(inp))[1]))
print(len(rnn(emb(inp))[1]))

<class 'tuple'>
2


In [16]:
print(type(rnn(emb(inp))[1][0]))
print(rnn(emb(inp))[1][0].shape)

<class 'torch.Tensor'>
torch.Size([4, 5, 32])


Why is the output above of shape (4,5,32)?

32 is the hidden dim i.e. the dimensionality of the hidden state and the cell state in an LSTM

5 is the number of sequences in the batch (if this is not convincing, try changing inp to have, say, 7 sequences)

Where does the 4 come from? Claim: 4 = 2 (num_layers) * 2 (bidirectional), i.e. one final hidden state per layer and per direction.

Of course, we could have looked at the documentation:

https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html

Outputs: outputs, (h_n, c_n)

where 

outputs.shape = (number of examples, length of sequence, 2*hidden_dim)

h_n.shape = (2*num_layers, number of examples, hidden_dim)

c_n.shape = (2*num_layers, number of examples, hidden_dim)

In [17]:
print(type(rnn(emb(inp))[1][1]))
print(rnn(emb(inp))[1][1].shape)

<class 'torch.Tensor'>
torch.Size([4, 5, 32])
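The layout of `h_n` can be checked directly. The sketch below builds a fresh LSTM with the same sizes as above (without dropout, and with random inputs instead of embeddings, both assumptions made for simplicity), splits the leading dimension of `h_n` into (layer, direction), and compares the last layer's final states against `outputs`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, seq_len, input_size, hidden_size, num_layers = 5, 10, 128, 32, 2

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size,
               num_layers=num_layers, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)
outputs, (h_n, c_n) = lstm(x)

# split the leading dimension of h_n into (layer, direction)
h_n_split = h_n.view(num_layers, 2, batch, hidden_size)
last_fwd = h_n_split[-1, 0]  # last layer, forward direction
last_bwd = h_n_split[-1, 1]  # last layer, backward direction

# the forward direction finishes at the last time-step,
# the backward direction finishes at the first time-step
fwd_matches = torch.allclose(outputs[:, -1, :hidden_size], last_fwd)
bwd_matches = torch.allclose(outputs[:, 0, hidden_size:], last_bwd)
print(fwd_matches, bwd_matches)
```

So `outputs` holds the last layer's hidden states at every time-step, while `h_n` holds every layer's hidden state at its final time-step only.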


**Mock run**:
    
It's generally a good idea to pick one training point (X, y pair) and run through the operations manually before defining the network architecture.

In [54]:
#step 1: look at data

idx = 5

X1 = train_data[0][idx, :]
X2 = train_data[1][idx]
y = train_target[idx]

print(X1, X2, y)

[9 6 9 7 8 0 2 0 7 4] 2 9


In [70]:
#step 2: compute embeddings for one element of input sequence
low = 0
high = 10

embedding_dim = 128
emb = nn.Embedding(num_embeddings=high-low, embedding_dim=128)

emb(torch.from_numpy(X1)[3]) #can get embeddings for each element of X1

tensor([ 0.2865,  1.2568,  0.9385, -0.1796,  0.7032,  0.0619, -0.4192,  0.6867,
         0.1354,  1.4492,  1.2120, -0.7873,  0.8069, -0.9271, -1.6105,  1.6214,
        -0.2598, -1.5224, -1.7641, -1.2096,  1.4387,  0.7379,  1.2121,  0.7119,
        -1.1277,  0.5871,  0.3826, -0.5249, -0.6335, -0.2726,  0.6579, -1.3908,
        -0.5380,  1.6320, -1.2180,  0.2877,  0.9800,  0.0283,  0.6669,  0.4502,
         0.0100,  0.9809, -1.2172,  0.3693, -0.7873,  0.0091,  0.5409,  0.8213,
        -0.3427,  0.4027,  0.0768,  1.4327,  0.4103,  1.0460, -1.0206, -0.7051,
         0.0530,  0.1897, -1.4639,  1.5928, -2.1853,  0.6629, -1.8533,  0.6728,
         0.7878,  2.0742,  1.1128,  0.8676,  0.2749, -1.2283, -0.6369, -0.6804,
         0.8569,  0.5500,  0.4963, -0.5683, -1.0289, -0.1171,  0.9441, -1.1648,
        -0.0040, -0.4562, -1.8234, -0.4147,  1.0370,  0.6098,  0.2892,  0.0312,
        -2.2091, -0.9449, -1.7238,  0.7487,  1.5384,  0.0851,  0.1721,  0.3418,
        -1.2923, -1.4248,  0.2439,  0.58

In [71]:
#step 3: compute hidden states for one step of input sequence
o, (h_n,c_n) = rnn(emb(torch.from_numpy(X1)[3])[None,None,:]) #emb()[None, None, :] of shape (1,1,embedding_dim=128)

In [72]:
print(o.shape)
print(h_n.shape)
print(c_n.shape)

torch.Size([1, 1, 64])
torch.Size([4, 1, 32])
torch.Size([4, 1, 32])


In [83]:
#step 4: compute over full sequence
print('raw:', torch.from_numpy(X1).shape) #(length of sequence)
print('raw unsqueeze:', torch.from_numpy(X1)[None, :].shape) #(1, length of sequence) where 1 is the number of sequences
print('emb:', emb(torch.from_numpy(X1)[None, :]).shape) #(1, length of sequence, embedding dim/input dim for rnn)

o, (h,c) = rnn(emb(torch.from_numpy(X1)[None, :]))
print('o:', o.shape) #(num of sequences = 1, length of sequence, 2 (bidirectional)*hidden_dim) #last layer, at every time-step
print('h:', h.shape) #(bidirection*num_layers, num of sequences = 1, hidden_dim) - at last time-step
print('c:', c.shape) #(bidirection*num_layers, num of sequences = 1, hidden_dim) - at last time-step

raw: torch.Size([10])
raw unsqueeze: torch.Size([1, 10])
emb: torch.Size([1, 10, 128])
o: torch.Size([1, 10, 64])
h: torch.Size([4, 1, 32])
c: torch.Size([4, 1, 32])


In [92]:
#step 5: another embedding layer for index
#note: it would make sense for us to emb the position of each element in the input sequence and match the embedding
#of the index to the embeddings of the position
#note 2: if instead, we were doing a "semantic search", then embeddings of the search token would make more sense



emb_idx = nn.Embedding(num_embeddings=d, embedding_dim=128) #there are d unique index values

emb_idx(torch.tensor(X2))

tensor([-5.3350e-01, -4.7944e-01,  5.5247e-02,  1.9876e-01, -2.4190e+00,
        -3.6725e-01,  2.1666e-01, -5.9951e-01,  1.0426e+00, -1.2349e+00,
         6.1935e-01, -1.0120e+00,  3.1203e-01,  3.7609e-01,  1.2412e+00,
        -5.1611e-01,  1.8683e+00,  2.4450e-01, -3.5375e-01, -2.4031e+00,
         4.6394e-02,  1.2769e-01, -2.0894e+00, -1.6796e+00,  5.4427e-02,
         1.0610e-01, -4.5504e-01, -1.8055e-01, -1.0285e-02,  5.4206e-02,
         3.1796e-01, -1.8100e+00, -5.3404e-01,  2.4541e+00,  1.1919e+00,
        -1.9942e-01, -7.7337e-02,  1.9762e-01, -1.0854e+00,  8.0994e-01,
        -2.6244e-01,  3.9296e-01,  1.7814e+00,  1.2452e+00,  2.1180e-01,
         4.9000e-01, -1.5497e+00, -4.1742e-01, -2.7240e-01, -8.8326e-01,
         2.2839e-01,  7.8422e-01, -3.4529e-01, -8.4296e-01,  1.1689e+00,
        -7.8080e-01,  9.7050e-01,  1.3604e+00,  1.8639e-02,  1.4517e-01,
         4.9090e-01, -9.2188e-02, -9.4421e-01, -1.2284e+00,  5.7930e-01,
         4.4399e-02,  6.6538e-01, -8.0282e-01,  1.2

In [93]:
#step 6: learn a dense layer to map embedding of idx and compare to output values for a sequence

attn_lin = nn.Linear(128, 64) #128 is embedding_dim, 64 is bidirectional * hidden_dim (32)
attn_lin(emb_idx(torch.tensor(X2)))

tensor([ 0.1441, -1.5180,  0.3508,  0.6825, -0.2547, -0.4111, -0.2800,  0.4555,
         0.0547, -0.2314, -0.2712,  0.1149,  0.2038, -0.6945, -0.0274, -0.2217,
         0.3765, -0.1768,  0.7162,  0.5139,  0.1150, -0.3664, -1.0747,  0.1636,
        -0.5181, -0.2264,  0.4199, -0.4615,  0.1972,  0.7091,  0.8343,  0.3293,
         0.1147, -0.1345,  0.9108,  0.3850, -0.7291,  0.7860, -0.2947,  0.1495,
        -0.0828, -0.3305, -0.2554, -0.3984, -0.4137, -0.1197,  0.8194, -0.7236,
         0.9047, -0.0740, -0.4342, -0.2027, -0.2191,  0.0045,  0.0581, -0.7468,
        -0.3336, -1.3374,  0.3314, -0.0208,  0.1066, -0.7961, -0.3307,  0.0099],
       grad_fn=<AddBackward0>)

In [125]:
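#step 7: compute attention scores

#7a: compute dot product of attn_lin(emb_idx()) with output
#note: tensordot also pairs every sequence in a with every sequence in o, so the
#squeeze calls below are only valid for a single sequence; use bmm/einsum for batches

a = attn_lin(emb_idx(torch.tensor(X2))).unsqueeze(0) #(num sequences = 1, 64)
print('a:', a.shape)

print('o:', o.shape)
energies = torch.tensordot(a, o, dims=([1], [2]))
print('energies:', energies.shape)
energies = energies.squeeze(1)
print('energies:', energies.shape) #(num sequences, length of sequence)
print(energies)

#7b: compute attention scores (softmax over the time dimension)
scores = torch.exp(energies)
scores = scores / scores.sum(dim=1, keepdim=True) #not in-place: autograd needs the output of exp
print('scores:', scores.shape)
print(scores)

#floating point sums are rarely exactly 1, so compare with a tolerance
assert torch.allclose(scores.sum(dim=1), torch.ones(scores.shape[0]))

out = torch.tensordot(o, scores, dims=([1], [1])).squeeze(2) #weighted sum of outputs over time
print('out:', out.shape)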

a: torch.Size([1, 64])
o: torch.Size([1, 10, 64])
energies: torch.Size([1, 1, 10])
energies: torch.Size([1, 10])
tensor([[-0.0437, -0.0739, -0.3028, -0.5683, -0.8043, -0.8653, -0.9179, -0.3863,
         -0.5125, -0.5787]], grad_fn=<SqueezeBackward1>)
scores: torch.Size([1, 10])
tensor([[0.1520, 0.1475, 0.1173, 0.0899, 0.0710, 0.0668, 0.0634, 0.1079, 0.0951,
         0.0890]], grad_fn=<DivBackward0>)
out: torch.Size([1, 64])
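For more than one sequence per batch, the same computation can be written with `torch.einsum` so that each sequence only attends over its own outputs. This is a sketch with random stand-ins (`a_b`, `o_b` are hypothetical names for a batch of projected index embeddings and LSTM outputs), not the code used in the mock run.

```python
import torch

torch.manual_seed(0)

batch, seq_len, feat = 3, 10, 64  # illustrative sizes

a_b = torch.randn(batch, feat)            # one query vector per sequence
o_b = torch.randn(batch, seq_len, feat)   # LSTM outputs per sequence

# energies_b[b, t] = <a_b[b], o_b[b, t]>, with no mixing across the batch
energies_b = torch.einsum('bf,btf->bt', a_b, o_b)

# softmax over the time dimension gives attention scores that sum to 1
scores_b = torch.softmax(energies_b, dim=1)

# out_b[b] = sum_t scores_b[b, t] * o_b[b, t]
out_b = torch.einsum('bt,btf->bf', scores_b, o_b)

print(energies_b.shape, scores_b.shape, out_b.shape)
```

`torch.softmax` computes the same normalized exponentials as step 7b while being numerically more stable for large energies.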


In [129]:
#step 8: use out as input to small MLP to predict actual output

out_layer = nn.Linear(64, high-low)
out_activation = nn.Softmax(dim=1)
out_activation(out_layer(out)) #predicted probability that input_seq[input_idx] takes each value from low to high-1

tensor([[0.1066, 0.0964, 0.1038, 0.0942, 0.0937, 0.0999, 0.1057, 0.1049, 0.1068,
         0.0881]], grad_fn=<SoftmaxBackward0>)

### Putting it all together

It is time to put these 8 steps together in one architecture. It's good practice to add assertions and sanity checks since we are composing many operations.

In [53]:
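class Net(nn.Module):
    def __init__(self, low, high, emb_dim, hidden_size, num_layers=1):
        super().__init__()
        
        self.low = low
        self.high = high
        self.emb_dim = emb_dim
        
        self.emb = nn.Embedding(num_embeddings=high-low, embedding_dim=emb_dim)
        
        #see: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
        self.rnn = nn.LSTM(input_size = emb_dim,
                           hidden_size = hidden_size,
                           num_layers = num_layers,
                           batch_first = True, #expect (batch, seq, feature)
                           dropout = 0.5 if num_layers > 1 else 0.0, #dropout only acts between layers
                           bidirectional=True                             
                          )
        
        self.pred = nn.Linear(2*hidden_size, high-low)
        
    def forward(self, x):
        #x: (batch, seq) integer tokens -> embeddings of shape (batch, seq, emb_dim)
        out, (h_n, c_n) = self.rnn(self.emb(x))
        
        #use the bidirectional hidden states from the last layer 
        #(the original cell is cut off here; the two lines below are an assumed minimal completion)
        h_last = torch.cat([h_n[-2], h_n[-1]], dim=1) #(batch, 2*hidden_size)
        return self.pred(h_last) #(batch, high-low) unnormalized scores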

## Appendix

Seq2Seq (sequence to sequence) nets map an arbitrary length input sequence to an arbitrary length output sequence. If one wanted to map an arbitrary length input sequence to an output sequence of the same length, then one can use one RNN/LSTM.

### Seq2Seq paper: https://arxiv.org/pdf/1409.3215.pdf

**Core idea**:

* Use LSTM (encoder) with multiple layers ("deep") to map input sequence -> vector of fixed dimensionality.

* Use another LSTM (decoder) with multiple layers to generate output sequence starting from vector of fixed dimensionality produced by encoder.

* Additional trick that improves performance significantly: reverse the order of the tokens in the input sequence before feeding it to the encoder (the target sequence is not reversed).

**Core idea figure**:

![Seq2Seq](seq2seq.png)

**Example applications where input/output values are sequences of varying lengths**:

* Speech recognition: audio -> text

* Machine translation: text -> text

* Question answering: text -> text

**How to think about the encoder-decoder architecture**:

Encoder: Read input tokens one at a time and generate compressed context vector of fixed dimensionality.

Decoder: Conditioned on context vector, generate sequences. In other words, all information about the input sequence is passed to the decoder through the context vector.

**Model architecture**:

* LSTMs instead of RNNs

* Two different LSTMs are used i.e. the same LSTM weights are not used for the encoder and the decoder

* Multi-layer LSTMs are used: 4 layers

* Order of tokens/words is reversed in input sequence i.e. instead of mapping: a,b,c -> $\alpha,\beta,\gamma$, map c,b,a -> $\alpha, \beta, \gamma$ so that a is closer to $\alpha$, b is closer to $\beta$. But note that now c is further away from $\gamma$

**Dataset details**: To get a sense of scale

Trained on 12 million sentences

Total French words: 348 million

Total English words: 304 million

For predictions, the vocabulary was restricted to the 160k most frequent words in the source language (English) and the 80k most frequent words in the target language (French). Any word not in the vocabulary is replaced by a special unknown token, UNK.

**Training details**:

objective = $\frac{1}{\mid \mathcal{S} \mid} \sum_{(T,S)\in\mathcal{S}} \log p(T \mid S)$ (this is maximized; the loss is its negative)

where $\mathcal{S}$ is the training set.

In other words, for each training example, the source sentence S is used to predict the target sentence T. For each token in the prediction, we know the actual/label token that should have been predicted. Compute the log probability for the label token and add them up over the target sentence tokens. Average these across each example in the training set.
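The computation described above can be sketched in PyTorch. The random `logits` and `labels` are stand-ins for decoder outputs and label tokens (assumptions for illustration), and all target sentences are assumed to have the same length so that no padding mask is needed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_examples, target_len, vocab_size = 4, 6, 11  # illustrative sizes

# stand-ins for decoder outputs and label tokens
logits = torch.randn(num_examples, target_len, vocab_size)
labels = torch.randint(0, vocab_size, (num_examples, target_len))

# log-probability assigned to every vocabulary entry at every time-step
log_probs = F.log_softmax(logits, dim=-1)

# pick out the log-probability of the label token at each time-step
label_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

# log p(T|S) for each example: sum over the target tokens
seq_log_prob = label_log_probs.sum(dim=1)

# objective: average over the training examples (to be maximized)
objective = seq_log_prob.mean()
loss = -objective  # minimized by gradient descent
print(loss.item())
```

This is the same quantity as a summed token-level cross-entropy divided by the number of examples.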

* LSTMs: 4 layers, 1000 cells (hidden_dim), 1000 dimensional word embeddings (embedding_dim), input vocabulary size = 160k, output vocabulary size = 80k.

* LSTM weights ~ uniform(-0.08, 0.08)

* SGD without momentum, lr = 0.7 for the first 5 epochs, then halve lr every half epoch

* Total training time = 7.5 epochs

* Batch size = 128

* For each batch, compute $s = \Vert g \Vert_2$ where $g$ is the gradient divided by the batch size (128). If $s > 5$, set $g = \frac{5g}{s}$, i.e. if the norm of the gradient exceeds 5, rescale it so that the norm equals 5.

* Most sentences are short (20-30 tokens) and some are long (>100 tokens). Make sure sentences in each batch are roughly the same length to avoid wasted computation.

* Use model parallelism with different layers on different GPUs. Note this paper was written before TensorFlow and PyTorch existed (Theano was already available).
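The gradient rescaling rule from the list above can be sketched as follows. The function name `clip_by_norm` and the example gradients are made up for illustration; `g` is assumed to be the gradient already divided by the batch size.

```python
import torch

def clip_by_norm(g, threshold=5.0):
    # g: gradient already averaged over the batch
    s = g.norm(p=2)
    if s > threshold:
        # rescale so that the L2 norm equals the threshold
        g = threshold * g / s
    return g

summed_grad = torch.full((100,), 200.0)  # made-up gradient summed over a batch
g = summed_grad / 128                    # average over a batch of 128
clipped = clip_by_norm(g)                # norm 15.625 is rescaled to 5

small = torch.tensor([0.3, 0.4])         # norm 0.5, below the threshold
untouched = clip_by_norm(small)

print(g.norm().item(), clipped.norm().item(), untouched)
```

In PyTorch training loops the same kind of rescaling is usually applied to all parameter gradients at once with `torch.nn.utils.clip_grad_norm_`.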

**Prediction/inference details**:

Beam search with size = 1 or 2 works well. In detail, the decoder is used to predict a probability distribution over all possible tokens at each time-step. Since the output of the decoder at time t is the input to the decoder at time t+1, we need to pick a token from the distribution. 

Ideally, we want the output sequence to be such that the sum of the log probabilities of the sampled tokens is maximized. This is not the same as a greedy sampling strategy. Beam search keeps track of the top k running sums of log probabilities.
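A toy example shows why greedy decoding and beam search can differ. The conditional distributions `FIRST` and `NEXT` are made-up numbers standing in for a decoder, and `beam_search` is a minimal sketch that ignores end-of-sentence handling.

```python
import math

# toy conditional distributions p(next token | previous token); made-up numbers
FIRST = {0: 0.6, 1: 0.4}
NEXT = {0: {0: 0.55, 1: 0.45},
        1: {0: 0.9, 1: 0.1}}

def step_distribution(prefix):
    # distribution over the next token given the tokens generated so far
    return FIRST if not prefix else NEXT[prefix[-1]]

def beam_search(beam_size, length):
    beams = [((), 0.0)]  # (token sequence, running sum of log probabilities)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, p in step_distribution(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # keep only the beam_size best partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]

greedy_seq, greedy_score = beam_search(beam_size=1, length=2)
beam_seq, beam_score = beam_search(beam_size=2, length=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

With a beam of size 1 (greedy), the locally best first token 0 is chosen and the sequence ends with probability 0.6 * 0.55. A beam of size 2 keeps the prefix starting with token 1 alive and finds the sequence with probability 0.4 * 0.9, which is higher.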

**Questions**:

* While training, how should the hidden state of the encoder be initialized?
    * Zeros for each batch
    * Make hidden state initialization learned?
    * Keep hidden state evolving as batches get processed? This sounds troublesome since it implies that the order in which sequences are processed matters.

### Seq2Seq + Attention paper: https://arxiv.org/pdf/1409.0473.pdf

**Core idea**:

* In the encoder-decoder setup, some tokens in the decoder/output sequence are far away from the corresponding tokens in the encoder/input sequence.

* Since all context about the input is compressed into the context vector, it might lose information about earlier tokens. In other words, the context vector can be an information bottleneck.

* This manifests itself as poor performance when the length of the input sequence gets long (longer than training data sequences).

* Can each token in the decoder be allowed to search for relevant tokens in the input sentence? This search has to be soft i.e. predict probabilities over the input sequence rather than hard choices.

* In this paper, a mechanism (attention) is introduced where, at each time-step in the decoder, a soft search over all hidden states in the encoder is carried out. This search is used to generate a new appropriate context vector that focuses on subsets of the input sequence. In other words, there is no need to encode the full input sequence into *one* context vector. Instead each input sequence is encoded into a sequence of vectors and during decoding, a soft search is carried out over this sequence of vectors.

**Note**: 

While in the previous paper the final hidden state of the encoder, $h_T$, is used as the context vector, this paper suggests that one could generalize to:

$$c = q(\{h_1, \ldots, h_T\})$$

i.e. the context vector is some (fixed) function of all the hidden states.

**Core idea figure**:

![Model](seq2seq_attention.png)

The $x_t$ are the input tokens to the encoder. The $h_t$ are the hidden states (bidirectional). The key difference is the computation of the decoder's hidden state at time t, $s_t$.

In the classic encoder-decoder picture, $s_t = f(s_{t-1}, y_{t-1})$. In this model,

$$s_t = f(s_{t-1}, y_{t-1}, h_{1}, h_{2}, \ldots, h_T)$$

where $f()$ sloppily refers to "some function" (not the same one in both computations). So, now the computation has direct access to each encoder hidden state.

**Precise formulation**:

Given an input sequence S, of length T, the encoder produces hidden states, $h_1, \ldots, h_T$.

These hidden states are used to compute a context vector, $c$. In the seq2seq paper, $c = h_T$.

The decoder conditions on the context vector, i.e. it initializes its hidden state $s_0$ so that $s_0 = c$.


The first token is a special SOS (start of sentence) token and is passed as the first input $y_0$. The decoder hidden state is updated:

$$s_1 = f(s_0, y_0)$$

which is used to compute the probability distribution over the vocabulary:

$$p(y_1\mid y_0, c) = g(s_1)$$


For any time t,

$$\boxed{s_t = f(s_{t-1}, y_{t-1})}$$

and $$\boxed{p(y_t\mid y_0,\ldots, y_{t-1}, c) = g(s_{t})}$$

Both of these equations implicitly depend on the context vector $c$ since every calculation depends on $s_0 = c$. We make this explicit:

$$\boxed{s_t = f(s_{t-1}, y_{t-1}, c)}$$

and $$\boxed{p(y_t\mid y_0,\ldots, y_{t-1}, c) = g(s_{t}, c)}$$

One way of looking at attention is to make $c$ dependent on time t, i.e.:

$$\boxed{s_t = f(s_{t-1}, y_{t-1}, c_t)}$$

and $$\boxed{p(y_t\mid y_0,\ldots, y_{t-1}) = g(s_{t}, c_t)}$$


This implies an order of computation:

* Compute $c_t$ which is the context vector at time t.

* Compute hidden state $s_t$ from $c_t, s_{t-1}, y_{t-1}$.

* Compute prob distribution from $s_t, c_t$.

The context vector, instead of being just $h_T$ (last hidden vector in the encoder) is now generalized to a linear (convex) combination of all encoder hidden vectors:

$$c_t = \sum_{j} \alpha_{tj}h_j$$

where $\alpha_{tj}$ can be interpreted as probabilities as shown below.

$$\alpha_{tj} = \frac{e^{e_{tj}}}{\sum_{k}e^{e_{tk}}}$$

This is essentially a Boltzmann probability: you can think of $e_{tj}$ as the energy of the $j$th configuration. The exponential ensures that $\alpha_{tj} > 0$ and the denominator ensures that the $\alpha_{tj}$ add up to 1 (when summed over $j$).

The next question is how $e_{tj}$ is computed and there are many choices here. In general, $e_{tj} = a(s_{t-1}, h_j)$. Note that $e_{tj}$ is computed before $s_t$ and hence depends on $s_{t-1}$ and not $s_t$. In words, "based on what the decoder has generated so far till time t-1 and the summary encoded in $s_{t-1}$, what information can be gathered from the encoder's hidden states to compute the next hidden state $s_t$ so the next token can be computed".

**Open question**: what would happen if $s_t$ were used to compute $c_t$?


**Model architecture**:

Encoder: Bidirectional RNN (LSTM etc.)

Decoder: RNN (LSTM etc.) + Attention as described above

Loss: multi-class log loss/cross-entropy

Decoding strategy: Beam search

Optimizer: Adadelta (training time for paper's model ~ 5 days on what hardware?)

**Model details**: (See appendix)

**RNN**: Use LSTM or GRUs i.e. architectures that let one learn long-term dependencies since there are connections where the gradient is close to 1.

Recall $s_t = f(s_{t-1}, y_{t-1}, c_t)$ with a time-dependent context.

The precise computation used here is:

$$s_t = f(s_{t-1}, y_{t-1}, c_t) = (1-z_t) \odot s_{t-1} + z_t \odot \tilde{s}_t$$

where $\odot$ is element-wise multiplication, $z_t$ is the output of the update gates in the LSTM/GRU unit.

One way to think of this computation is as follows:

$z_t$ is a vector of numbers between 0 and 1 and hence is a soft mask or a vector of probabilities. It therefore keeps elements of the previous state $s_{t-1}$ with probability $1-z_t$ and updates them with elements of a new hidden state candidate, $\tilde{s}_t$, with probability $z_t$.

The new candidate is defined to be:

$$\tilde{s}_t = \tanh(W e(y_{t-1}) + U [r_t \odot s_{t-1}] + C c_t)$$

Note that apart from using the reset gate, $r_t$, this is the usual computation of a hidden unit where $e(y_{t-1})$ is the embedding of the token from time $t-1$. At its simplest, this embedding is just one-hot encoding.

The two gates are computed using the same logic as any hidden recurrent unit:

$$z_t = \sigma(W_z e(y_{t-1}) + U_z s_{t-1} + C_z c_t)$$

$$r_t = \sigma(W_r e(y_{t-1}) + U_r s_{t-1} + C_r c_t)$$

and $\sigma$ is the sigmoid function to get values in range $(0,1)$.
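The gated update above can be sketched as a single decoder step. The weights are random stand-ins for learned parameters, the sizes are illustrative assumptions, and biases are omitted to match the equations.

```python
import torch

torch.manual_seed(0)

emb_dim, hidden_dim, ctx_dim = 8, 6, 10  # illustrative sizes

def make_params():
    # (W, U, C) act on e(y_{t-1}), s_{t-1} and c_t respectively
    return (0.1 * torch.randn(hidden_dim, emb_dim),
            0.1 * torch.randn(hidden_dim, hidden_dim),
            0.1 * torch.randn(hidden_dim, ctx_dim))

W, U, C = make_params()        # candidate state
W_z, U_z, C_z = make_params()  # update gate
W_r, U_r, C_r = make_params()  # reset gate

def decoder_step(s_prev, e_y_prev, c_t):
    z_t = torch.sigmoid(W_z @ e_y_prev + U_z @ s_prev + C_z @ c_t)      # update gate
    r_t = torch.sigmoid(W_r @ e_y_prev + U_r @ s_prev + C_r @ c_t)      # reset gate
    s_tilde = torch.tanh(W @ e_y_prev + U @ (r_t * s_prev) + C @ c_t)   # candidate
    s_t = (1 - z_t) * s_prev + z_t * s_tilde                            # soft mix
    return s_t, z_t, s_tilde

s_prev = torch.randn(hidden_dim)   # previous decoder state s_{t-1}
e_y_prev = torch.randn(emb_dim)    # embedding of the previous token
c_t = torch.randn(ctx_dim)         # context vector at time t
s_t, z_t, s_tilde = decoder_step(s_prev, e_y_prev, c_t)
print(s_t.shape)
```

Because $z_t$ lies in $(0,1)$, every element of `s_t` lies between the corresponding elements of `s_prev` and `s_tilde`.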

**Computation of c_t**:

Recall that the context vector is a convex combination of the encoder's hidden states:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j$$

where the weights $\alpha_{tj}$ are computed using Boltzmann probabilities:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}$$

and the energies, $e_{tj}$ are:

$$e_{tj} = a(s_{t-1}, h_j)$$

There are many choices for $a()$. The one made by this paper is:

$$e_{tj} = a(s_{t-1}, h_j) = v_a^T \tanh(W_a s_{t-1} + U_a h_j)$$

Here, $v_a, W_a, U_a$ are learned parameters. You can think of $e_{tj}$ as a similarity score between $s_{t-1}$ and $h_j$. The matrices $W_a$ and $U_a$ map $s_{t-1}$ and $h_j$ respectively, to the same vector space so they can be added together. Note, another simple choice could be:

$$e_{tj} = s_{t-1}^T W_a h_j$$
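The additive score, the attention weights and the context vector can be sketched for a single decoder time-step as follows. The parameters are random stand-ins for learned weights and the sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

T_x, enc_dim, dec_dim, attn_dim = 7, 10, 6, 5  # illustrative sizes

h = torch.randn(T_x, enc_dim)   # encoder hidden states h_1 .. h_{T_x}
s_prev = torch.randn(dec_dim)   # previous decoder state s_{t-1}

# parameters (random here; learned in a real model)
W_a = torch.randn(attn_dim, dec_dim)
U_a = torch.randn(attn_dim, enc_dim)
v_a = torch.randn(attn_dim)

# e_tj = v_a^T tanh(W_a s_{t-1} + U_a h_j), computed for all j at once
energies = torch.tanh(W_a @ s_prev + h @ U_a.T) @ v_a   # shape (T_x,)

# alpha_tj: softmax over the encoder positions j
alpha = torch.softmax(energies, dim=0)

# c_t: convex combination of the encoder hidden states
context = alpha @ h                                      # shape (enc_dim,)

print(energies.shape, alpha.sum().item(), context.shape)
```

Note that `U_a h_j` does not depend on the decoder time-step, so it can be computed once per input sequence and reused at every decoding step.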


**Training details**: 

* All recurrent weight matrices are initialized as random orthogonal matrices ($U^T U = I$)

* $W_a$ and $U_a$ have each element drawn from a gaussian distribution, $\mathcal{N}(0, 0.001^2)$

* Biases and $v_a$ were initialized to zero.

* Any other weight matrices had elements drawn from $\mathcal{N}(0, 0.01^2)$

* Adadelta (adaptive SGD algorithm) was used with parameters ($\epsilon = 10^{-6}$ and $\rho = 0.95$)

* Gradients were rescaled so that their $L_2$ norm was at most 1

* Batch size = 80 sentences

* Every 20th update, 20*80 = 1600 sentences were retrieved, sorted by sequence length and split into 20 batches for the next 20 updates. This is because the time spent on a batch was proportional to the length of the longest sequence.

* Training data shuffled once before training
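The length-based batching from the list above can be sketched in plain Python. The function name `bucket_batches` and the synthetic corpus are assumptions for illustration; the padding count is a rough proxy for wasted computation.

```python
import random

def bucket_batches(sentences, batch_size=80, updates_per_chunk=20):
    # take chunks of batch_size * updates_per_chunk sentences, sort each chunk
    # by length, and cut it into batches of similar-length sentences
    chunk_size = batch_size * updates_per_chunk
    batches = []
    for start in range(0, len(sentences), chunk_size):
        chunk = sorted(sentences[start:start + chunk_size], key=len)
        for i in range(0, len(chunk), batch_size):
            batches.append(chunk[i:i + batch_size])
    return batches

def padding(batch):
    # tokens of padding needed to bring every sentence to the longest length
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)

rng = random.Random(0)
# made-up corpus: 3200 "sentences" with random lengths between 5 and 50 tokens
corpus = [[0] * rng.randint(5, 50) for _ in range(3200)]

batches = bucket_batches(corpus)
bucketed_padding = sum(padding(b) for b in batches)
naive_padding = sum(padding(corpus[i:i + 80]) for i in range(0, len(corpus), 80))
print(len(batches), bucketed_padding, naive_padding)
```

Since the time spent on a batch scales with its longest sequence, grouping sentences of similar length reduces the amount of padding that has to be processed.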