# The Three Ways of Attention and Dot Product Attention: Ungraded Lab Notebook

In this notebook you'll explore the three ways of attention (encoder-decoder attention, causal attention, and bi-directional self-attention) and how to implement the latter two with dot product attention.

## Background

As you learned last week, **attention models** constitute powerful tools in the NLP practitioner's toolkit. Like LSTMs, they learn which words are most important to phrases, sentences, paragraphs, and so on. Moreover, they mitigate the vanishing gradient problem even better than LSTMs. You've already seen how to combine attention with LSTMs to build **encoder-decoder models** for applications such as machine translation. 

<img src="../images/C4_W2_L3_dot-product-attention_S01_introducing-attention_stripped.png" width="500"/>

This week, you'll see how to integrate attention into **transformers**. Because transformers do not process one token at a time, they are much easier to parallelize and accelerate. Beyond text summarization, applications of transformers include: 
* Machine translation
* Auto-completion
* Named Entity Recognition
* Chatbots
* Question-Answering
* And more!

Along with embedding, positional encoding, dense layers, and residual connections, attention is a crucial component of transformers. At the heart of any attention scheme used in a transformer is **dot product attention**, of which the figures below display a simplified picture:

<img src="../images/C4_W2_L3_dot-product-attention_S03_concept-of-attention_stripped.png" width="500"/>

<img src="../images/C4_W2_L3_dot-product-attention_S04_attention-math_stripped.png" width="500"/>

With basic dot product attention, you capture the interactions between every word (embedding) in your query and every word in your key. If the queries and keys come from the same sentence, this constitutes **bi-directional self-attention**. In some situations, however, it's more appropriate for each word to consider only itself and the words that come before it. Such cases, particularly when the queries and keys come from the same sentence, fall into the category of **causal attention**.

<img src="../images/C4_W2_L4_causal-attention_S02_causal-attention_stripped.png" width="500"/>

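For causal attention, you add a **mask** to the argument of the softmax function, so that positions after the current word receive a large negative score and therefore a weight of approximately zero, as illustrated below: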

<img src="../images/C4_W2_L4_causal-attention_S03_causal-attention-math_stripped.png" width="500"/>

<img src="../images/C4_W2_L4_causal-attention_S04_causal-attention-math-2_stripped.png" width="500"/>
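
As a rough illustration of the masking idea in the figures above, the NumPy sketch below builds a lower-triangular mask for a three-token sequence and adds it to a matrix of scores before the softmax. The `softmax` helper and the `scores` values are made up for this example and are not part of the lecture material; `-1e9` is used as a finite stand-in for negative infinity.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw attention scores for a 3-token sequence
# (rows: queries, columns: keys).
scores = np.array([[0.5, 1.0, 2.0],
                   [1.5, 0.2, 0.7],
                   [0.3, 0.9, 1.1]])

# 1 where a query may look at a key (itself and earlier tokens), 0 elsewhere.
keep = np.tril(np.ones((3, 3)))

# Additive mask M: 0 at allowed positions, a large negative number at blocked ones.
additive_mask = (1.0 - keep) * -1e9

causal_weights = softmax(scores + additive_mask)
print(np.round(causal_weights, 3))
```

Every entry above the diagonal comes out as zero while each row still sums to one, so the first token attends only to itself.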

Now let's see how to implement the attention mechanism.

In [1]:
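import os

# Set the log level before importing TensorFlow so that it also
# applies to messages emitted during the import itself.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import numpy as np
import tensorflow as tf

import sys
import textwrap
wrapper = textwrap.TextWrapper(width=70)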

In [2]:
def display_tensor(t, name):
    """Display shape and tensor"""
    print(f'{name} shape: {t.shape}\n')
    print(f'{t}\n')

Create small example tensors for the query, key, and value. Each has two rows (one per token) and an embedding dimension of three.

In [3]:
q = tf.constant([[1.0, 0.0, 3.0], [0.0, 1.0, 0.0]])
k = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
v = tf.constant([[0.0, 1.0, 1.5], [3.0, 4.0, 5.0]])

In [4]:
display_tensor(q, 'query')
display_tensor(k, 'key')
display_tensor(v, 'value')

query shape: (2, 3)

[[1. 0. 3.]
 [0. 1. 0.]]

key shape: (2, 3)

[[1. 2. 3.]
 [4. 5. 6.]]

value shape: (2, 3)

[[0.  1.  1.5]
 [3.  4.  5. ]]



In [8]:
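# Toy values used below to illustrate additive masking.
q_test = tf.constant([[1.0, 2.0, 3.0], [0.0, 2.0, 3.2]])
# In this mask, 1 marks a position to keep and 0 a position to mask out.
mask = tf.constant([[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])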

In [11]:
# Lower-triangular (causal) mask: position i may attend to positions <= i.
mask_2 = tf.experimental.numpy.tril(tf.ones((3, 3)))
mask_2

<tf.Tensor: shape=(3, 3), dtype=float32, numpy=
array([[1., 0., 0.],
       [1., 1., 0.],
       [1., 1., 1.]], dtype=float32)>

In [9]:
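# Add a large negative number wherever mask == 0, so that a softmax
# over the result gives those positions a weight of approximately zero.
res = q_test + (1. - mask) * -1e9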
res

<tf.Tensor: shape=(2, 3), dtype=float32, numpy=
array([[-1.e+09,  2.e+00, -1.e+09],
       [-1.e+09, -1.e+09, -1.e+09]], dtype=float32)>
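
To see what the masked scores in `res` turn into once a softmax is applied, here is a small NumPy sketch that mirrors the cells above in `float32`. The `softmax` helper is an assumption added for illustration; the notebook itself has not defined one at this point.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Same values as q_test and mask above, in float32 to match the TensorFlow tensors.
scores = np.array([[1.0, 2.0, 3.0], [0.0, 2.0, 3.2]], dtype=np.float32)
mask = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]], dtype=np.float32)

# 1 in the mask keeps a position; 0 pushes its score down to about -1e9.
masked = scores + (1.0 - mask) * -1e9
weights = softmax(masked)
print(weights)
```

The first row puts all of its weight on the single unmasked position. The second row, where every position is masked, ends up with equal weights rather than zeros, because the softmax only sees three equal scores. A fully masked row is therefore worth watching for when you build your own masks.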

## Dot product attention

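Here you compute $\textrm{softmax} \left(\frac{Q K^T}{\sqrt{d}} + M \right) V$, where $M$ is the additive mask (zero at allowed positions and a large negative number at masked ones) and the (optional, but default) scaling factor $\sqrt{d}$ is the square root of the embedding dimension.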
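
Before the TensorFlow version, it may help to see the formula spelled out step by step. The following is a minimal NumPy sketch, not the notebook's implementation: the function name `dot_product_attention_np`, the `keep_mask` argument, and the choice to return the weights alongside the output are all assumptions made for illustration. It follows the mask convention used in the cells above (1 keeps a position, 0 blocks it).

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def dot_product_attention_np(q, k, v, keep_mask=None, scale=True):
    """NumPy sketch of softmax(Q K^T / sqrt(d) + M) V.

    keep_mask (hypothetical name): 1 where attention is allowed, 0 where blocked.
    """
    # Q K^T: one score per (query, key) pair.
    scores = q @ np.swapaxes(k, -1, -2)
    if scale:
        # Divide by the square root of the embedding dimension d.
        scores = scores / np.sqrt(q.shape[-1])
    if keep_mask is not None:
        # Turn the 0/1 mask into the additive mask M.
        scores = scores + (1.0 - keep_mask) * -1e9
    weights = softmax(scores)
    return weights @ v, weights

# Same values as the q, k, v tensors defined earlier in the notebook.
q = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 0.0]])
k = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
v = np.array([[0.0, 1.0, 1.5], [3.0, 4.0, 5.0]])

output, weights = dot_product_attention_np(q, k, v)
causal_output, causal_weights = dot_product_attention_np(
    q, k, v, keep_mask=np.tril(np.ones((2, 2)))
)
print(output)
print(causal_output)
```

With the causal mask, the first query can only see the first key, so the first row of `causal_output` is simply the first row of `v`.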

In [None]:
def dot_product_attention(q, k, v, mask=None, scale=True):
    """
    Calculate dot product attention.
      q, k, v must have matching leading dimensions.
      k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
      The mask has different shapes depending on its type (padding or look-ahead),
      but it must be broadcastable for addition.

    Arguments:
        q (tf.Tensor): query of shape (..., seq_len_q, depth)
        k (tf.Tensor): key of shape (..., seq_len_k, depth)
        v (tf.Tensor): value of shape (..., seq_len_v, depth_v)
        mask (tf.Tensor): mask with shape broadcastable 
              to (..., seq_len_q, seq_len_k). Defaults to None.
        scale (boolean): if True, the result is a scaled dot-product attention. Defaults to True.

    Returns:
        attention_output (tf.Tensor): the result of the attention function
    """
    nominator = tf.matmul(q, k, transpose_b=True)
    
    if scale:
        dk = tf.cast(q.shape[-1], tf.float32)
        nominator /= tf.sqrt(dk)
    
    if mask is not None:
        ...
