# HW 4 - All About Attention

Welcome to CS 287 HW4. To begin this assignment, first enable the Python 3 runtime and the GPU backend for this Colab by clicking `Runtime > Change Runtime Type` above.

In this homework you will reproduce the decomposable attention model of Parikh et al. (https://aclweb.org/anthology/D16-1244), one of the models that inspired the development of the Transformer.



## Goal

We ask that you finish the following goals in PyTorch:

1. Implement the vanilla decomposable attention model as described in that paper.
2. Implement the decomposable attention model with intra attention or another extension.
3. Visualize the attention in the two models above.
4. Implement a mixture of models with a uniform prior and train it with the exact log marginal likelihood (see the detailed instructions below).
5. Train the mixture of models from part 4 as a VAE. (This may not produce a better model; it is still an open research area.)
6. Use the posterior to interpret which component specializes in which type of task.

Consult the paper for model architecture and hyperparameters, but you are also allowed to tune the hyperparameters yourself. 

In [7]:
from load_data import load
from models import AttentionModel

import torch
# Text processing library and methods for pretrained word embeddings
import torchtext
from torchtext.vocab import Vectors, GloVe

# Named Tensor wrappers
from namedtensor import ntorch, NamedTensor
from namedtensor.text import NamedField
import random
import copy

In [8]:
train_iter, val_iter, test_iter, TEXT, LABEL = load()

In [19]:
batch = next(iter(train_iter))
model = AttentionModel(TEXT, LABEL, 100, 100, intra_attn = True)
model(batch.premise, batch.hypothesis).shape

OrderedDict([('batch', 16), ('logit', 4)])

In [43]:
class LatentVariableMixtureModel(ntorch.nn.Module):
    def __init__(self, model, experts, variational):
        super().__init__()
        try:
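            # A ModuleList (rather than a plain list) registers each expert's parameters with this module
            self.models = torch.nn.ModuleList()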
            for _ in range(experts):
                self.models.append(copy.deepcopy(model))
        except RuntimeError:
            raise RuntimeError("model must be newly instantiated")
        self.experts = experts
        self.variational = variational
    
    def forward(self, premise, hypothesis):
        if self.variational:
            return self.sample(premise, hypothesis)
        else:
            return self.enumerate(premise, hypothesis)
    
    def sample(self, premise, hypothesis):
        pass
    
    def enumerate(self, premise, hypothesis):
        predictions = []
        for model in self.models:
            predictions.append(model(premise, hypothesis))
        return (
            ntorch.stack(predictions, "experts")
            .softmax('logit')
            .mean('experts')
            .log()
            .rename('logit','logprob')
        )

In [44]:
model = AttentionModel(TEXT, LABEL, 100, 100, intra_attn = True)
mixture_model = LatentVariableMixtureModel(model, 5, False)

### Instructions for latent variable mixture model.

For the last part of this assignment we will consider a latent variable version of this model, using the latent variable as a form of ensembling.

Instead of a single model, we use $K$ models $p(y | \mathbf{a}, \mathbf{b}; \theta_k)$ ($k=1,\cdots,K$), where $K$ is a hyperparameter and $\mathbf{a}$, $\mathbf{b}$ are the two input sentences (premise and hypothesis). We introduce a discrete latent variable $c\sim \text{Uniform}(1,\cdots, K)$ denoting which model produces the label $y$. The marginal likelihood is then


$$
p(y|\mathbf{a}, \mathbf{b}; \theta) = \sum_{c=1}^K p(c) p(y | \mathbf{a}, \mathbf{b}; \theta_c)
$$

When $K$ is small, we can *enumerate* all possible values of $c$ to maximize the log marginal likelihood. 
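As a small worked example of enumeration, the sketch below computes the log marginal likelihood for a single example from per-expert log-probabilities of the gold label. The numbers and the helper name `log_marginal_likelihood` are hypothetical, chosen only for illustration. It uses plain Python and the log-sum-exp trick; this gives the same quantity as taking a softmax, averaging over experts, and then taking the log, but is less prone to underflow.

```python
import math

def log_marginal_likelihood(expert_log_probs):
    """log sum_c p(c) p(y | a, b; theta_c) with a uniform prior p(c) = 1/K."""
    K = len(expert_log_probs)
    m = max(expert_log_probs)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(lp - m) for lp in expert_log_probs))
    return log_sum - math.log(K)  # the uniform prior contributes -log K

# Hypothetical values of log p(y | a, b; theta_c) for K = 3 experts
expert_log_probs = [math.log(0.7), math.log(0.2), math.log(0.6)]
log_marginal = log_marginal_likelihood(expert_log_probs)
print(math.exp(log_marginal))  # approximately (0.7 + 0.2 + 0.6) / 3
```

In training, the negative of this quantity, averaged over a batch, would serve as the loss.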

We can also train the model efficiently as a variational autoencoder (VAE). We first introduce an inference network $q(c| y, \mathbf{a}, \mathbf{b})$, and the ELBO is

$$
\log p(y|\mathbf{a}, \mathbf{b}; \theta)  \ge \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \log p(y|\mathbf{a},\mathbf{b}; \theta_c) - KL(q(c|y, \mathbf{a}, \mathbf{b})|| p(c)),
$$

where $p(c)$ is the uniform prior. The $KL$ term can be computed in closed form, but because $c$ is discrete we cannot use the reparameterization trick for the first term of the ELBO. Instead we estimate its gradient with REINFORCE (see also the slides):

$$
\nabla \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \log p(y|\mathbf{a},\mathbf{b}; \theta_c) = \mathbb{E}_{c \sim q(c|y, \mathbf{a}, \mathbf{b})} \left [\nabla \log p(y|\mathbf{a},\mathbf{b}; \theta_c) + \log p(y|\mathbf{a},\mathbf{b}; \theta_c)  \nabla \log q(c|y, \mathbf{a}, \mathbf{b})\right]
$$
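The following is a minimal numeric sketch of the two pieces above, in plain Python rather than PyTorch. It assumes that $q$ is a softmax over hypothetical logits `phi` and that the per-expert values $\log p(y|\mathbf{a},\mathbf{b};\theta_c)$ are fixed numbers (`rewards`), so only the second term of the REINFORCE identity (the gradient with respect to the inference network) is exercised. It computes the closed-form KL to the uniform prior, checks that the ELBO does not exceed the log marginal likelihood, and compares the score-function gradient against finite differences. All names and values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(q):
    """KL(q || Uniform(K)) = sum_c q(c) log q(c) + log K."""
    return sum(p * math.log(p) for p in q if p > 0) + math.log(len(q))

def expected_reward(phi, rewards):
    """E_{c ~ q}[f(c)] with q = softmax(phi), computed by enumeration."""
    return sum(p * r for p, r in zip(softmax(phi), rewards))

def score_function_gradient(phi, rewards):
    """E_{c ~ q}[f(c) * grad_phi log q(c)], computed by enumeration."""
    q = softmax(phi)
    grad = [0.0] * len(phi)
    for c in range(len(phi)):
        for j in range(len(phi)):
            # For a softmax, d log q(c) / d phi_j = 1[c == j] - q(j)
            dlogq = (1.0 if c == j else 0.0) - q[j]
            grad[j] += q[c] * rewards[c] * dlogq
    return grad

def finite_difference_gradient(phi, rewards, eps=1e-6):
    grad = []
    for j in range(len(phi)):
        up, down = list(phi), list(phi)
        up[j] += eps
        down[j] -= eps
        grad.append((expected_reward(up, rewards) - expected_reward(down, rewards)) / (2 * eps))
    return grad

# Hypothetical values: f(c) = log p(y | a, b; theta_c) and inference-network logits
rewards = [math.log(0.7), math.log(0.2), math.log(0.6)]
phi = [0.5, -1.0, 0.2]

q = softmax(phi)
elbo = expected_reward(phi, rewards) - kl_to_uniform(q)
log_marginal = math.log(sum(math.exp(r) for r in rewards) / len(rewards))
exact_grad = score_function_gradient(phi, rewards)
fd_grad = finite_difference_gradient(phi, rewards)
print(elbo, log_marginal)
print(exact_grad, fd_grad)
```

In an actual training loop the expectation would be replaced by one or more samples $c \sim q$, which makes the estimate unbiased but noisy.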


At inference time, we compute $p(y|\mathbf{a}, \mathbf{b}; \theta)$ exactly by enumeration. For posterior inference, we can either use $q(c| y, \mathbf{a}, \mathbf{b})$ to approximate the true posterior or use Bayes' rule to calculate the posterior exactly.
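As an illustration of the exact posterior, Bayes' rule gives $p(c|y,\mathbf{a},\mathbf{b}) \propto p(c)\, p(y|\mathbf{a},\mathbf{b};\theta_c)$, and the uniform prior cancels in the normalization. The sketch below applies this to hypothetical per-expert log-probabilities; the function name `exact_posterior` is an assumption made for this example.

```python
import math

def exact_posterior(expert_log_probs):
    """p(c | y, a, b) by Bayes' rule; the uniform prior p(c) = 1/K cancels."""
    m = max(expert_log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in expert_log_probs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values of log p(y | a, b; theta_c) for K = 3 experts
posterior = exact_posterior([math.log(0.7), math.log(0.2), math.log(0.6)])
print(posterior)  # proportional to [0.7, 0.2, 0.6]
```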

To interpret what specialized knowledge each component $c$ learns, we can look at the examples whose posterior is maximized at $c$.
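One possible way to organize this analysis is to group example indices by the component with the highest posterior. The sketch below assumes the posteriors have already been computed and stored as plain lists; the values and the helper name `group_by_top_component` are hypothetical. Components are 0-indexed here, whereas the text counts from 1.

```python
from collections import defaultdict

def group_by_top_component(posteriors):
    """Map each component c to the indices of examples whose posterior peaks at c."""
    groups = defaultdict(list)
    for idx, post in enumerate(posteriors):
        top = max(range(len(post)), key=lambda c: post[c])
        groups[top].append(idx)
    return dict(groups)

# Hypothetical posteriors p(c | y, a, b) for four examples and K = 3 components
posteriors = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.25, 0.25],
    [0.1, 0.1, 0.8],
]
groups = group_by_top_component(posteriors)
print(groups)  # {0: [0, 2], 1: [1], 2: [3]}
```

Reading the premise and hypothesis pairs within each group may then suggest what each component has specialized in.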

Once a model is trained, use the following test function to produce predictions, and then upload your best result to the Kaggle competition: https://www.kaggle.com/c/harvard-cs287-s19-hw4

In [0]:
def test_code(model):
    "All models should be able to be run with following command."
    upload = []
    # Update: for kaggle the bucket iterator needs to have batch_size 10
    test_iter = torchtext.data.BucketIterator(test, train=False, batch_size=10)
    for batch in test_iter:
        # Your prediction data here (don't cheat!)
        probs = model(batch.premise, batch.hypothesis)
        # here we assume that the class dimension is named `classes`; substitute your model's name (e.g. `logit`)
        _, argmax = probs.max('classes')
        upload += argmax.tolist()

    with open("predictions.txt", "w") as f:
        for u in upload:
            f.write(str(u) + "\n")

In addition, you should submit a (short) write-up following the template provided in the repository: https://github.com/harvard-ml-courses/nlp-template