In [1]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>
<style>
.rendered_html td {
    font-size: xx-large;
    text-align: left !important;
}
.rendered_html th {
    font-size: xx-large;
    text-align: left !important;
}
</style>

In [2]:
%%capture
import sys
sys.path.append("..")
import statnlpbook.util as util
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

<!---
Latex Macros
-->
$$
\newcommand{\Xs}{\mathcal{X}}
\newcommand{\Ys}{\mathcal{Y}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\bbeta}{\boldsymbol{\beta}}
\newcommand{\aligns}{\mathbf{a}}
\newcommand{\align}{a}
\newcommand{\source}{\mathbf{s}}
\newcommand{\target}{\mathbf{t}}
\newcommand{\ssource}{s}
\newcommand{\starget}{t}
\newcommand{\repr}{\mathbf{f}}
\newcommand{\repry}{\mathbf{g}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\prob}{p}
\newcommand{\bar}{\,|\,}
\newcommand{\vocab}{V}
\newcommand{\params}{\boldsymbol{\theta}}
\newcommand{\param}{\theta}
\DeclareMathOperator{\perplexity}{PP}
\DeclareMathOperator{\argmax}{argmax}
\DeclareMathOperator{\argmin}{argmin}
\newcommand{\train}{\mathcal{D}}
\newcommand{\counts}[2]{\#_{#1}(#2) }
\newcommand{\length}[1]{\text{length}(#1) }
\newcommand{\indi}{\mathbb{I}}
$$

In [3]:
%load_ext tikzmagic

In [4]:
from IPython.display import Image
import random

# Attention

+ Natural language inference
+ Attention mechanism


<center>
<img src="../img/bears.png">
</center>

**Given:**
There are six bears. Three brown bears, a black bear and a pink bear run along the grass.

Which of the following is correct?
1. Some bears run
2. All bears sit
3. One bear sits

## Task: Natural Language Inference

Determining the logical relationship between two sentences, a **premise** and a **hypothesis**.

Also known as *Recognising Textual Entailment* ([Dagan et al., 2005](http://u.cs.biu.ac.il/~nlp/downloads/publications/RTEChallenge.pdf)).

We define entailment as:
P entails H if a human reading P would typically infer that H is most likely true.

- (Pairwise) sequence classification task
- Requires commonsense and world knowledge
- Requires general natural language understanding
- Requires fine-grained reasoning

> **P:** “Google files for its long awaited IPO.”
> **H:** “Google goes public.”

Positive ($\Rightarrow$, entails)

### Stanford Natural Language Inference (SNLI) dataset

Crowdsourced annotations for 570K sentence pairs using image captions ([Bowman et al., 2015](https://www.aclweb.org/anthology/D15-1075.pdf)).

**P**: A wedding party taking pictures
- **H:** There is a funeral					: **<span class=red>Contradiction</span>** ($\Rightarrow\neg$)
- **H:** They are outside					    : **<span class=blue>Neutral</span>** (?)
- **H:** Someone got married				    : **<span class=green>Entailment</span>** ($\Rightarrow$)

<img src="https://upload.wikimedia.org/wikipedia/commons/3/31/Wedding_photographer_at_work.jpg" width=1500/> 

### Representing sentences as vectors

1. Encode premise and hypothesis
2. Concatenate the representations
3. Classify with MLP
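
The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: a plain tanh RNN stands in for the LSTM encoder, all weights (`W_in`, `W_rec`, `W_hidden`, `W_out`) are hypothetical and randomly initialised rather than trained, and the token embeddings are random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n_classes = 8, 5, 3  # hidden size, embedding size, number of labels

# Hypothetical, randomly initialised parameters (untrained).
W_in = rng.normal(scale=0.1, size=(k, d))
W_rec = rng.normal(scale=0.1, size=(k, k))
W_hidden = rng.normal(scale=0.1, size=(k, 2 * k))
W_out = rng.normal(scale=0.1, size=(n_classes, k))


def encode(tokens):
    """Run a simple tanh RNN (stand-in for an LSTM) and return the last hidden state."""
    h = np.zeros(k)
    for x in tokens:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h


def classify(premise, hypothesis):
    """Encode both sentences, concatenate them, and apply a one-hidden-layer MLP."""
    pair = np.concatenate([encode(premise), encode(hypothesis)])  # shape (2k,)
    hidden = np.tanh(W_hidden @ pair)
    logits = W_out @ hidden
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # distribution over entailment / neutral / contradiction


premise = rng.normal(size=(7, d))     # 7 random premise token embeddings
hypothesis = rng.normal(size=(4, d))  # 4 random hypothesis token embeddings
probs = classify(premise, hypothesis)
```

Note that the same `encode` function is applied to both sentences, and neither encoding can see the other sentence.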

<center>
<img src="https://d3i71xaburhd42.cloudfront.net/f04df4e20a18358ea2f689b4c129781628ef7fc1/7-Figure3-1.png"/>
</center>

([Image source: Bowman et al., 2015](https://www.aclweb.org/anthology/D15-1075))

How to represent a sentence with a vector?

The same LSTM encodes the premise and hypothesis.

<img src="dl-applications-figures/rte.svg" width=1500/> 

Use the last hidden vectors of the LSTM as sentence representations.

<img src="dl-applications-figures/rte_encoding.svg" width=1500/>

#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |

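### Problem 1: asymmetry

Entailment is asymmetric (P entailing H does not mean that H entails P), but both sentences are encoded independently of each other by the same LSTM.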

<center>
<img src="https://d3i71xaburhd42.cloudfront.net/f04df4e20a18358ea2f689b4c129781628ef7fc1/7-Figure3-1.png"/>
</center>

### Conditional encoding

<img src="dl-applications-figures/conditional.svg" width=1500/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

<img src="./dl-applications-figures/conditional_encoding.svg" width=1500/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))
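
In Rocktäschel et al. (2015), conditional encoding means that the hypothesis encoder starts from the final state of the premise encoder instead of from zeros, so the hypothesis is read in the context of the premise. A minimal sketch of that idea follows, assuming plain tanh RNNs in place of LSTMs and hypothetical, untrained random weights (`make_rnn` and `run_rnn` are illustrative helpers, not part of any library).

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 8, 5  # hidden size, embedding size


def make_rnn():
    """Hypothetical, randomly initialised RNN parameters (input and recurrent weights)."""
    return rng.normal(scale=0.1, size=(k, d)), rng.normal(scale=0.1, size=(k, k))


def run_rnn(params, tokens, h0):
    """Run a tanh RNN from initial state h0 and return all hidden states, shape (len, k)."""
    W_in, W_rec = params
    h = h0
    states = []
    for x in tokens:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.stack(states)


premise_rnn = make_rnn()     # separate parameters for the two encoders
hypothesis_rnn = make_rnn()

premise = rng.normal(size=(7, d))
hypothesis = rng.normal(size=(4, d))

premise_states = run_rnn(premise_rnn, premise, np.zeros(k))
# Conditional encoding: initialise the hypothesis encoder with the last premise state.
conditioned = run_rnn(hypothesis_rnn, hypothesis, premise_states[-1])
# For comparison: the same hypothesis encoded without access to the premise.
independent = run_rnn(hypothesis_rnn, hypothesis, np.zeros(k))
```

The two hypothesis encodings differ only because of the initial state, which is the single channel through which premise information reaches the hypothesis encoder in this sketch.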

#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |

### Problem 2: global memory

Some words are more important to focus on.


<img src="./dl-applications-figures/pink.png"/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

### Attention

<img src="./dl-applications-figures/attention.svg" width=1500/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

## Attention mechanism

+ Original motivation: machine translation ([Bahdanau et al., 2014](https://arxiv.org/abs/1409.0473)); see [later lecture in the course](nmt_slides_active.ipynb)

#### Idea

+ A **weighted sum** of encoder hidden states is a differentiable function and has a fixed dimension
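
To make the fixed-dimension point concrete, here is a small sketch. It assumes random hidden states and random unnormalised scores (both hypothetical), and shows that the softmax-weighted sum has the same shape for a 3-token and a 50-token sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # hidden size


def weighted_sum(states, scores):
    """Softmax-normalise the scores and return the weighted sum of the states."""
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ states  # shape (k,), whatever the sequence length


# Random stand-ins for encoder hidden states and attention scores.
short = weighted_sum(rng.normal(size=(3, k)), rng.normal(size=3))
long = weighted_sum(rng.normal(size=(50, k)), rng.normal(size=50))
```

Every operation here (exponentiation, normalisation, matrix product) is differentiable, so gradients can flow back to both the states and the scores.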


<img src="dl-applications-figures/attention_encoding.svg" width=1500/>

### What is happening here?

For the final prediction,
+ Attention takes all premise hidden vectors $(\mathbf{h}_1, \ldots, \mathbf{h}_L)$ as well as the final hypothesis hidden vector ($\mathbf{h}_N$) as input
+ Calculates a probability distribution $\alpha$ over the premise hidden vectors using a softmax
+ Combines $\mathbf{h}_N$ with an $\alpha$-weighted average of all premise hidden vectors

More formally:

<div class=small>
\begin{align}
  \mathbf{M} &= \tanh(\mathbf{W}^y\mathbf{Y}+\mathbf{W}^h\mathbf{h}_N \otimes \mathbf{e}_L) & \mathbf{M} &\in\mathbb{R}^{k\times L}\\
  \alpha &= \text{softmax}(\mathbf{w}^T\mathbf{M})&\alpha&\in\mathbb{R}^L\\
  \mathbf{r} &= \mathbf{Y}\alpha^T &\mathbf{r}&\in\mathbb{R}^k\\
  \mathbf{h}^* &= \tanh(\mathbf{W}^p\mathbf{r} + \mathbf{W}^x\mathbf{h}_N) & \mathbf{h}^* &\in\mathbb{R}^{k}
\end{align}
</div>

where

* $\mathbf{Y}\in\mathbb{R}^{k\times L}$ is the concatenation of all premise hidden vectors
* $\mathbf{W}^y$, $\mathbf{W}^h$, $\mathbf{W}^p$, $\mathbf{W}^x$ $\in\mathbb{R}^{k\times k}$ are trained projection matrices
* $\mathbf{w}\in\mathbb{R}^k$ is a trained parameter vector
* $\mathbf{e}_L\in\mathbb{R}^L$ is a vector of ones, so the outer product $\otimes\,\mathbf{e}_L$ repeats $\mathbf{W}^h\mathbf{h}_N$ in each of the $L$ columns
* $\alpha\in\mathbb{R}^L$ is the attention probability distribution
* $\mathbf{r}\in\mathbb{R}^k$ is the weighted representation of the premise
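
The equations above translate almost line by line into NumPy. The sketch below assumes random stand-ins for the premise states `Y` and the final hypothesis state `h_N`, and hypothetical, untrained random parameters; it is meant to show the shapes and the data flow, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
k, L = 6, 5  # hidden size, premise length

Y = rng.normal(size=(k, L))  # premise hidden vectors as columns
h_N = rng.normal(size=k)     # final hypothesis hidden vector

# Hypothetical, randomly initialised parameters (untrained).
W_y, W_h, W_p, W_x = (rng.normal(scale=0.1, size=(k, k)) for _ in range(4))
w = rng.normal(scale=0.1, size=k)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


# M = tanh(W^y Y + W^h h_N ⊗ e_L): the outer product copies W^h h_N into every column.
M = np.tanh(W_y @ Y + np.outer(W_h @ h_N, np.ones(L)))  # shape (k, L)
alpha = softmax(w @ M)                                   # shape (L,), sums to 1
r = Y @ alpha                                            # shape (k,), weighted premise
h_star = np.tanh(W_p @ r + W_x @ h_N)                    # shape (k,), pair representation
```

`alpha` has one entry per premise token, so it can be inspected directly to see which premise words receive the most weight for this hypothesis.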

#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |

### Problem 3: representation bottleneck

> You can’t cram the meaning of a whole
`%&!$#` sentence into a single `$&!#*` vector!
>
> -- <cite>Raymond J. Mooney</cite>

## Alignment

+ Non-neural models often use **alignment** between sequences

<img  src="./dl-applications-figures/snow.png"/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

### Word-by-word Attention

+ Computing attention for each hypothesis token can give us a **soft alignment**

<img src="dl-applications-figures/word_attention.svg" width=1500/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

<img src="dl-applications-figures/word_attention_encoding.svg" width=1500/>

### What is happening here?

**For each hypothesis token $x_t$,**
+ Attention takes all premise hidden vectors $(\mathbf{h}_1, \ldots, \mathbf{h}_L)$ as well as the current hypothesis hidden vector ($\mathbf{h}_t$) as input
+ Generates a probability distribution $\alpha_t$ over all premise hidden vectors
+ Uses a weighted average (by $\alpha_t$) of all premise hidden vectors as input for the next layer

#### SNLI results

| Model | Accuracy |
|---|---|
| LSTM | 77.6 |
| LSTMs with conditional encoding | 80.9 |
| LSTMs with conditional encoding + attention | 82.3 |
| LSTMs with word-by-word attention | 83.5 |

([Results as reported by Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

More formally:

<div class=small>
\begin{align}
  \mathbf{M}_t &= \tanh(\mathbf{W}^y\mathbf{Y}+(\mathbf{W}^h\mathbf{h}_t+\mathbf{W}^r\mathbf{r}_{t-1})\mathbf{1}^T_L) & \mathbf{M}_t &\in\mathbb{R}^{k\times L}\\
  \alpha_t &= \text{softmax}(\mathbf{w}^T\mathbf{M}_t)&\alpha_t&\in\mathbb{R}^L\\
  \mathbf{r}_t &= \mathbf{Y}\alpha^T_t + \tanh(\mathbf{W}^t\mathbf{r}_{t-1})&\mathbf{r}_t&\in\mathbb{R}^k
\end{align}
</div>

where

* $\mathbf{Y}\in\mathbb{R}^{k\times L}$ is the concatenation of all premise hidden vectors
* $\mathbf{W}^y$, $\mathbf{W}^h$, $\mathbf{W}^r$, $\mathbf{W}^t \in\mathbb{R}^{k\times k}$ are trained projection matrices
* $\mathbf{w}\in\mathbb{R}^k$ is a trained parameter vector
* $\alpha_t\in\mathbb{R}^L$ is the attention probability distribution
* $\mathbf{r}_t\in\mathbb{R}^k$ is the weighted representation of the premise (dependent on $\mathbf{r}_{t-1}$ to inform the model about what was attended over in the previous step)
* Multiplying a column vector by $\mathbf{1}^T_L$ (a row vector of $L$ ones) repeats it $L$ times, giving a $k\times L$ matrix

Final pairwise sentence representation:

<div class=small>
\begin{align}
  \mathbf{h}^{*} &= \text{tanh} (\mathbf{W}^p\mathbf{r}_N + \mathbf{W}^x\mathbf{h}_N)
\end{align}
</div>

Non-linear combination of the final attention-weighted representation $\mathbf{r}_N$ and the last output vector $\mathbf{h}_N$, where $\mathbf{h}^{*} \in\mathbb{R}^{k}$ and $\mathbf{W}^p, \mathbf{W}^x \in\mathbb{R}^{k\times k}$ are trained projection matrices.
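
The recurrence and the final combination can be sketched as a short loop over hypothesis positions. As before, this assumes random stand-ins for the hidden states and hypothetical, untrained random parameters; starting from $\mathbf{r}_0 = \mathbf{0}$ is also an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
k, L, N = 6, 5, 3  # hidden size, premise length, hypothesis length

Y = rng.normal(size=(k, L))  # premise hidden vectors as columns
H = rng.normal(size=(N, k))  # hypothesis hidden vectors h_1 .. h_N as rows

# Hypothetical, randomly initialised parameters (untrained).
W_y, W_h, W_r, W_t, W_p, W_x = (rng.normal(scale=0.1, size=(k, k)) for _ in range(6))
w = rng.normal(scale=0.1, size=k)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


r = np.zeros(k)  # r_0 (assumed to be zero in this sketch)
alphas = []
for h_t in H:
    # Repeat (W^h h_t + W^r r_{t-1}) across all L premise positions.
    M_t = np.tanh(W_y @ Y + np.outer(W_h @ h_t + W_r @ r, np.ones(L)))
    alpha_t = softmax(w @ M_t)          # attention over premise tokens at step t
    r = Y @ alpha_t + np.tanh(W_t @ r)  # r_t depends on r_{t-1}
    alphas.append(alpha_t)

attention = np.stack(alphas)             # shape (N, L): one row per hypothesis token
h_star = np.tanh(W_p @ r + W_x @ H[-1])  # final pair representation from r_N and h_N
```

Each row of `attention` is a distribution over premise tokens, so the stacked array is the soft alignment between hypothesis and premise tokens.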

### Attention matrix
$\alpha_{ij}$ for all premise tokens $j$ and hypothesis tokens $i$:

<img  src="./dl-applications-figures/snow.png"/>

([Image source: Rocktäschel et al., 2015](https://arxiv.org/abs/1509.06664))

### An important caveat

+ The attention mechanism was motivated by the idea of aligning inputs & outputs
+ Attention matrices often correspond to human intuitions about alignment
+ But ***producing a sensible alignment is not a training objective!***

In other words:

+ Do not expect that attention weights will *necessarily* correspond to sensible alignments!

### Problem 4: attention only in one direction

Hypothesis tokens attend to premise tokens.

Why don't hypothesis tokens also attend to other **hypothesis** tokens?

Why don't premise tokens also attend to **hypothesis** tokens?

Why don't premise tokens attend to other **premise** tokens?

## Summary

+ The **attention mechanism** alleviates the encoding bottleneck in encoder-decoder architectures


## Further reading

+ [Jurafsky & Martin Chapter 8, section 8.8](https://web.stanford.edu/~jurafsky/slp3/8.pdf)
+ Lilian Weng's blog post [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)
+ Jay Alammar's blog post [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)


