# Attention

## Context

Attention is one of the most important recent developments in deep learning
- 2015 - attention introduced (Bahdanau et al.)
- 2017 - the transformer - an entire network architecture (Vaswani et al.)

The transformer has been crucial to the revolution in NLP
- transformer-based language models such as BERT and GPT-3 are the biggest recent story in ML/AI

![](assets/so-hot.jpg)

Applications (any sequence problem)
- machine translation (text or audio)
- question answering
- parsing sentences into grammar trees
- generating solutions to symbolic math problems
- time series

Xu et al. (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- generating captions for images

Possible to use with LSTMs, or without
- without = self attention = the Transformer

Can be used with images or text


## What & why of attention

**Attention removes any restrictions based on distance** 
- deal with the entire sequence at once
- (learnable) shortcuts between the input & output sequence
- no need for recurrence - no backprop through time -> faster to train than RNN

The intuition 
- pay attention to different parts of a sentence (skim reading)
- pay attention to different parts of an image 
- pay attention to the parts of the sequence that are relevant

We can think about attention as search
- searching for something, paying attention to something
- searching learned embeddings for words that are similar 

Attention is modelled using importance weights
- paying attention is about multiplying these weights by something else (i.e. a word vector)
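
The weighting idea above can be sketched in a few lines of numpy. The word vectors and weights here are made-up values chosen for illustration, not the output of a trained model:

```python
import numpy as np

#  three hypothetical word vectors (one per row), four features each
words = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

#  importance weights over the three words - non-negative and summing to one
weights = np.array([0.7, 0.2, 0.1])

#  paying attention = a weighted sum of the word vectors
attended = weights @ words
print(attended)
```

The first word has the largest weight, so it contributes most to the attended vector.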


## Types of attention

Attention by location
- look at the position in the sequence only
- weighted distributions of different offsets

Attention by content (min 42)
- associative content
- key vector $a$, compared to all glimpses $g$, using similarity function $S$
- e.g. output a key vector that represents what we are looking for (such as green stones), then compare it to each glimpse using cosine similarity

Content based addressing
- attention based on vector similarity
- cosine similarity -> softmax
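
A minimal numpy sketch of content based addressing, assuming the glimpses are rows of a matrix. The function names `softmax` and `content_attention` are hypothetical helpers written for this example:

```python
import numpy as np


def softmax(x):
    #  subtract the max for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()


def content_attention(key, glimpses):
    #  cosine similarity between the key and every glimpse (row)
    norms = np.linalg.norm(glimpses, axis=1) * np.linalg.norm(key)
    sims = glimpses @ key / norms
    #  softmax turns the similarities into attention weights
    return softmax(sims)


rng = np.random.default_rng(0)
glimpses = rng.normal(size=(5, 16))
#  a key that is a noisy copy of the third glimpse
key = glimpses[2] + 0.1 * rng.normal(size=16)

weights = content_attention(key, glimpses)
print(weights, np.argmax(weights))
```

Because cosine similarity is bounded between -1 and 1, the softmax over raw similarities is fairly flat. A sharpening factor is often applied before the softmax, which is left out of this sketch.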

## Attention + seq2seq

In seq2seq we build a single context vector
- from the encoder's last hidden state - acts like a sentence embedding
- all information from the encoder flows through the fixed length context vector

With attention, we create shortcuts between the entire input sequence and the context vector
- these shortcuts are weighted
- weights = the strength of attention between the input & context

There are many attention mechanisms
- the first was introduced in 2015 (Bahdanau et al. (2015) Neural Machine Translation by Jointly Learning to Align and Translate - [arxiv](https://arxiv.org/abs/1409.0473))
- using attention with RNN encoder-decoder
- bidirectional RNN to produce the encoder hidden state (fwd & bwd hidden states concatenated)
- all of these hidden states are used to generate the context vector
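
Written out, the additive attention of Bahdanau et al. looks roughly like the following. The symbol names are an assumption made for this sketch: $s_{t-1}$ is the previous decoder hidden state, $h_i$ is the encoder hidden state at position $i$, and $W$, $U$, $v$ are learned weights.

$$
e_{t,i} = v^\top \tanh(W s_{t-1} + U h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}, \qquad
c_t = \sum_i \alpha_{t,i} h_i
$$

The weights $\alpha_{t,i}$ are the strength of attention between each input position and the context vector $c_t$.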

<img src="assets/enc-dec-attention.png" width="60%" />

Below we will take a look at how the additive attention mechanism works - ([docs for Tensorflow AdditiveAttention layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/AdditiveAttention))

First generate some data (notice the shape (batch, seq len, feature dim)):

In [None]:
import numpy as np

#  shape = (batch, time, features)
qry = np.random.normal(size=128).reshape(4, 8, -1).astype(np.float32)
val = np.random.normal(size=128).reshape(4, 8, -1).astype(np.float32)

Run the built in TF AdditiveAttention layer:

In [None]:
!pip install tensorflow -q
import tensorflow as tf

net = tf.keras.layers.AdditiveAttention(use_scale=False)
out = net([qry, val])

print(np.sum(out))

Let's reproduce this using Tensorflow components:

In [None]:
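#  add axes so every query position broadcasts against every value position
qry = qry.reshape(4, 8, 1, -1)
val = val.reshape(4, 1, 8, -1)

#  additive score = sum over features of tanh(query + key) (the key is the value here)
scores = tf.reduce_sum(tf.tanh(qry + val), axis=-1)
#  softmax over the value positions gives the attention weights
dist = tf.nn.softmax(scores)
#  weighted sum of the values, back in shape (batch, time, features)
out = tf.matmul(dist, val.reshape(4, 8, -1))

print(np.sum(scores), np.var(dist), np.sum(out))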

## Exercise

Implement additive attention without using Tensorflow (using numpy, scipy etc):

# The transformer

The most important trend in deep learning recently
- so important that some think GPT-3 is the first true AI

My favourite example of GPT-2 is [This Word Does Not Exist](https://www.thisworddoesnotexist.com/)

More reading
- OpenAI's GPT-3 may be the biggest thing since bitcoin - [blog-post](https://maraoz.com/2020/07/18/openai-gpt3/)
- [Are we in an AI overhang?](https://www.lesswrong.com/posts/N6vZEnCn6A95Xn39p/are-we-in-an-ai-overhang)

So what is so special about the transformer?
- introduced by Vaswani et al. (2017) Attention Is All You Need
- use of attention to model the sequence
- much faster to train than recurrent models (this has allowed the models to scale)

## Transformer architecture

Below is the entire transformer architecture:

- first embed the source & target sequences into the same dimension (512)
- location information embedded using a sine wave embedding
- softmax at the end to predict a word

<img src="assets/transformer.png" width="50%" />

*From Attention? Attention! - Lilian Weng*
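
The sine wave embedding listed above can be sketched in numpy. This follows the sinusoidal formulation from Vaswani et al. (sine on even feature indices, cosine on odd ones); the function name `positional_encoding` is a hypothetical helper, and `d_model` is assumed to be even:

```python
import numpy as np


def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]       #  (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      #  (1, d_model / 2)
    #  each pair of features uses a different wavelength
    angles = positions / np.power(10000.0, dims / d_model)

    enc = np.zeros((seq_len, d_model), dtype=np.float32)
    enc[:, 0::2] = np.sin(angles)   #  even features
    enc[:, 1::2] = np.cos(angles)   #  odd features
    return enc


pe = positional_encoding(seq_len=8, d_model=512)
print(pe.shape)
```

The encoding has the same dimension as the word embeddings (512 here), so the two can be added together before the first attention layer.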

Below, we will look at some of the components of this model
- dot-product attention
- self-attention


## The Dot-Product as similarity

Above we introduced the idea of attention as similarity
- we are searching for words that are similar

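A common way to measure similarity between vectors is the cosine similarity
- the dot product is the same measure without normalizing by the vector lengths, so a single matrix multiplication gives a cheap approximation
- the plot below uses scipy's `cosine`, which is a distance (1 - cosine similarity), so a larger dot product shows up as a smaller cosine distance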

In [None]:
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import cosine

data = defaultdict(list)
for _ in range(100):
    a = np.random.normal(size=128)
    b = np.random.normal(size=128)
    data['cosine'].append(cosine(a, b))
    data['dot'].append(np.dot(a, b))

_ = plt.scatter(data['cosine'], data['dot'])
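
The link between the two measures can be checked directly. A small sketch, assuming two random vectors: once both are scaled to unit length, the dot product and the cosine similarity are the same number.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

#  cosine similarity = dot product divided by the two vector lengths
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

#  scale both vectors to unit length, then take a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
unit_dot = np.dot(a_unit, b_unit)

print(np.isclose(unit_dot, cos_sim))
```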

## Key, value, query

Key to understanding how these attention layers work

The input to the layer is a sequence of (key, value) pairs
- the key is used to index the value
- for translation, both the key and value are the same

Also input to the layer is a query
- this is produced by the previous layer (often the decoder)
- decoder compresses previous output into a query, and maps this query to the (k, v) to produce output

We use attention to find keys that are similar to the query
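
One way to picture this is as a soft dictionary lookup. The sketch below uses made-up keys, values and a query; a normal dictionary would return exactly one value, while attention returns a blend of all values weighted by how well each key matches the query:

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


#  three hypothetical (key, value) pairs
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0]])

#  a query that points in the same direction as the first key
query = np.array([5.0, 0.0])

scores = keys @ query        #  similarity between the query and each key
weights = softmax(scores)    #  normalize into attention weights
output = weights @ values    #  blend of the values

print(weights, output)
```

The output is dominated by the value stored under the first key, because that key is the most similar to the query.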


## Dot-product attention

Above we have seen how we can use additive attention to create a context vector
- after 2015 other attention mechanisms were developed

Dot-product attention is important, as it powers the transformer
- the dot product is used to measure similarity between the query & all our keys (between the current word and all other words)
- this similarity is converted into attention weights (via a softmax)
- the output is a weighted sum, created by multiplying the weights by the values
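
As a formula, the scaled dot-product attention of Vaswani et al. is written with a query matrix $Q$, key matrix $K$, value matrix $V$ and key dimension $d_k$:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

The $1 / \sqrt{d_k}$ factor keeps the scores from growing with the feature dimension. Dropping it gives the plain (unscaled) dot-product attention.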

[Attention layer in Tensorflow](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention)

> The meaning of query, value and key depend on the application. In the case of text similarity, for example, query is the sequence embeddings of the first piece of text and value is the sequence embeddings of the second piece of text. key is usually the same tensor as value.

Let's model this dot-product attention. First let's generate some data:

In [None]:
import numpy as np

#  shape = (batch, time, features)
qry = np.random.normal(size=128).reshape(4, 8, -1).astype(np.float32)
val = np.random.normal(size=128).reshape(4, 8, -1).astype(np.float32)
key = val

Below is the dot-product attention in Tensorflow:

In [None]:
net = tf.keras.layers.Attention(use_scale=False)
out = net([qry, val], training=False)
np.sum(out)

Below we reproduce the above using the lower level Tensorflow components:

In [None]:
#  similarity between the query and our keys
scores = tf.matmul(qry, key, transpose_b=True)

#  softmax to normalize
dist = tf.nn.softmax(scores)

#  apply our attention scores to the values
out = tf.matmul(dist, val)
print(np.sum(scores), np.var(dist), np.sum(out))

## Exercise

Implement dot-product attention without Tensorflow:

In [None]:
#from answers import dot_product_attention

## Self-Attention

Relate a sequence to itself
- learn the relationship between the words in a sentence
- how similar is this word to other words in this sentence
- rather than learning the relationship between two different sequences (such as source + target)
- useful in machine reading, summarization & image caption generation

Also known as introspective attention
- attend to a network's own internal state (rather than data)
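
A minimal numpy sketch of self-attention, assuming a single sequence of six word vectors. The query, key and value all come from the same sequence; the projection matrices `w_q`, `w_k` and `w_v` are random placeholders for weights that would be learned in a real model:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 16))   #  (words, features)

#  hypothetical projections - learned in a real model
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))

#  query, key and value are all projections of the same sequence
qry, key, val = seq @ w_q, seq @ w_k, seq @ w_v

#  scaled dot-product scores between every pair of words
scores = qry @ key.T / np.sqrt(key.shape[-1])
weights = softmax(scores)        #  (words, words)
out = weights @ val              #  (words, features)

print(weights.shape, out.shape)
```

Row `i` of `weights` says how much word `i` attends to every word in the same sentence, including itself.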

## Multihead attention

Just running the dot-product attention in parallel
- outputs of all the heads are concatenated
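
A rough numpy sketch of multihead self-attention under simplifying assumptions: each head gets its own random projections (learned in a real model), the heads are run in a loop rather than in parallel, and the final output projection used in the transformer is left out. The function names are hypothetical:

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention(qry, key, val):
    #  scaled dot-product attention for a single head
    scores = qry @ key.T / np.sqrt(key.shape[-1])
    return softmax(scores) @ val


def multi_head_attention(seq, heads, rng):
    d_model = seq.shape[-1]
    d_head = d_model // heads
    outputs = []
    for _ in range(heads):
        #  each head projects the sequence into its own smaller subspace
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        outputs.append(attention(seq @ w_q, seq @ w_k, seq @ w_v))
    #  outputs of all the heads are concatenated along the feature axis
    return np.concatenate(outputs, axis=-1)


rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 16))
out = multi_head_attention(seq, heads=4, rng=rng)
print(out.shape)
```

With four heads of size four, the concatenated output has the same feature dimension as the input.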

## Exercise

Try to get the Tensorflow tutorial for translation with attention working on Colab
- [Tutorial here](https://www.tensorflow.org/tutorials/text/nmt_with_attention)

## A few more things

That don't fit in above :)


### Hard versus soft attention

Hard attention = fixed size windows
- only one part of sequence at a time

Soft attention = across entire sequence
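
The difference can be sketched with the same attention weights used in two ways. This is a simplification: hard attention is shown as picking the single highest-weighted position (a window of size one), where in practice the position is usually sampled, which is what makes it non-differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=(8, 4))   #  (sequence, features)
scores = rng.normal(size=8)

#  softmax over the sequence
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()

#  soft attention - blend the entire sequence
soft = weights @ values

#  hard attention - look at only one part of the sequence
hard = values[np.argmax(weights)]

print(soft, hard)
```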


### Global vs local attention

Global attention is similar to soft attention

Local attention
- blend of hard & soft (also differentiable)
- first predict a window, then do attention within that window
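
A minimal sketch of attention within a window, assuming the window centre has already been predicted (here `centre` is just passed in as an argument). The function name `local_attention` and its parameters are hypothetical:

```python
import numpy as np


def local_attention(scores, values, centre, half_width):
    positions = np.arange(len(scores))
    in_window = np.abs(positions - centre) <= half_width

    #  positions outside the window get a score of -inf, so a weight of zero
    masked = np.where(in_window, scores, -np.inf)
    weights = np.exp(masked - masked[in_window].max())
    weights = weights / weights.sum()

    #  soft attention, but only over the window
    return weights, weights @ values


rng = np.random.default_rng(0)
values = rng.normal(size=(10, 4))
scores = rng.normal(size=10)

weights, out = local_attention(scores, values, centre=4, half_width=2)
print(weights)
```

Only positions 2 to 6 receive any weight, and inside the window the weights are an ordinary softmax.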


## Resources Used

Attention? Attention! - Lilian Weng - [text](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) (read this if you don't have lots of time)

Attention and Memory in Deep Learning - DeepMind 2018 Lecture - [youtube](https://www.youtube.com/watch?v=Q57rzaHHO0k)

C5W3L07 Attention Model Intuition - [youtube](https://youtu.be/SysgYptB198)

Attention Is All You Need - [youtube](https://www.youtube.com/watch?v=iDulhoQ2pro)