While editing this notebook, don't change cell types as that confuses the autograder.

Before you turn this notebook in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel $\rightarrow$ Restart) and then **run all cells** (in the menubar, select Cell $\rightarrow$ Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [None]:
NAME = ""

_Understanding Deep Learning_

---

<a href="https://colab.research.google.com/github/udlbook/udlbook/blob/main/Notebooks/Chap12/12_1_Self_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 12.1: Self Attention

This notebook builds a self-attention mechanism from scratch, as discussed in section 12.2 of the book.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Preliminaries

Define the inputs.

In [None]:
# Set seed so we get the same random numbers
np.random.seed(3)

# Number of inputs
N = 3
# Number of dimensions of each input
D = 4

# Create an empty list
all_x = []

# Create elements x_n and append to list
for n in range(N):
  all_x.append(np.random.normal(size=(D,1)))

# all_x is now a list of N Dx1 numpy arrays
  
# Print out the list
for n in range(len(all_x)):
  print("Input number", n, "is:")
  print(all_x[n])


We'll also need the weights and biases for the keys, queries, and values (equations 12.2 and 12.4).

In [None]:
# Set seed so we get the same random numbers
np.random.seed(0)

# Choose random values for the parameters
omega_q = np.random.normal(size=(D,D))
omega_k = np.random.normal(size=(D,D))
omega_v = np.random.normal(size=(D,D))
beta_q = np.random.normal(size=(D,1))
beta_k = np.random.normal(size=(D,1))
beta_v = np.random.normal(size=(D,1))

## Calculating Attention via Matrix Calculations

Now let's compute attention using matrix calculations.  We'll store the $N$ inputs $\mathbf{x}_{n}\in\mathbb{R}^{D}$ in the columns of a $D\times N$ matrix and compute self-attention using equations 12.6 and 12.7/8.

Note:  The book uses column vectors (for compatibility with the rest of the text), but in the wider literature it is more common to store the inputs in the rows of a matrix.  In that case, the computation is the same, but all the matrices are transposed and the operations proceed in the reverse order.
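
As a small illustration of this note, the sketch below applies a single linear transformation in both conventions and checks that the results agree up to a transpose. The names `X_cols`, `omega`, `beta` and the sizes are illustrative assumptions, not the notebook's own variables.

```python
import numpy as np

rng = np.random.default_rng(0)
D_demo, N_demo = 4, 3                        # illustrative sizes (assumed)
X_cols = rng.normal(size=(D_demo, N_demo))   # inputs stored in columns (book convention)
omega = rng.normal(size=(D_demo, D_demo))    # illustrative weights
beta = rng.normal(size=(D_demo, 1))          # illustrative biases

# Column convention: weights multiply from the left
out_cols = omega @ X_cols + beta             # shape (D, N)

# Row convention: inputs in rows, weights transposed and applied from the right
X_rows = X_cols.T
out_rows = X_rows @ omega.T + beta.T         # shape (N, D)

# The two results should match once one of them is transposed
print(np.allclose(out_cols, out_rows.T))
```

The same pattern (transpose every matrix, reverse the order of each product) carries over to the remaining steps of the computation.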

In [None]:
# Define softmax operation that works independently on each column
def softmax_cols(data_in):
  # Exponentiate all of the values
  exp_values = np.exp(data_in)
  # Sum the exponentiated values within each column (i.e., sum over the rows)
  denom = np.sum(exp_values, axis=0)
  # Replicate the denominator across all rows so it matches the shape of exp_values
  denom = np.matmul(np.ones((data_in.shape[0],1)), denom[np.newaxis,:])
  # Compute softmax
  softmax = exp_values / denom
  # Return the answer
  return softmax
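
A quick way to sanity-check a column-wise softmax is to confirm that every column sums to one. The sketch below uses a hypothetical variant, `softmax_cols_stable`, which subtracts each column's maximum before exponentiating; this is a common trick to avoid overflow and does not change the result mathematically. It is an optional illustration and not part of the notebook's required code.

```python
import numpy as np

def softmax_cols_stable(data_in):
  # Subtract the per-column maximum; this leaves the softmax result unchanged
  shifted = data_in - np.max(data_in, axis=0, keepdims=True)
  exp_values = np.exp(shifted)
  # Normalize each column by its sum (broadcast across rows)
  return exp_values / np.sum(exp_values, axis=0, keepdims=True)

# Second column is the first column plus a large constant
test_in = np.array([[1.0, 1000.0],
                    [2.0, 1001.0],
                    [3.0, 1002.0]])
result = softmax_cols_stable(test_in)

# Each column should sum to one
print(np.sum(result, axis=0))
# Adding a constant to a column should not change its softmax
print(np.allclose(result[:, 0], result[:, 1]))
```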

In [None]:
# Now let's compute self attention in matrix form
def self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k):
  """
  Calculate self-attention via matrix calculations.
  
  Returns:
    X_prime: output matrix
    attentions: the attention matrix
  """

  # TODO -- Write this function
  # 1. Compute queries, keys, and values
  # 2. Compute dot products
  # 3. Apply softmax to calculate attentions
  # 4. Weight values by attentions
  # 5. Return X_prime and attentions

  # YOUR CODE HERE
  raise NotImplementedError()


  return X_prime, attentions

In [None]:
# Copy data into matrix
X = np.zeros((D, N))
X[:,0] = np.squeeze(all_x[0])
X[:,1] = np.squeeze(all_x[1])
X[:,2] = np.squeeze(all_x[2])

# Run the self attention mechanism
X_prime, attentions = self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)

# print attentions to 3 decimal places
print("attentions = ", np.round(attentions, 3))

# Print out the results
print("\nX_prime = ", X_prime)

In [None]:
# Test Cell -- Do not edit

x_prime_true = np.array([[ 0.94744244, -0.24348429, -0.91310441, -0.44522983],
    [ 1.64201168, -0.08470004,  4.02764044,  2.18690791],
    [ 1.61949281, -0.06641533,  3.96863308,  2.15858316]]).transpose()

assert np.allclose(X_prime, x_prime_true)

attentions_true = np.array([[0., 0., 0.005], 
                            [0.998, 0.006, 0.007], 
                            [0.002, 0.994, 0.988]])

assert np.allclose(attentions, attentions_true, atol=1e-3)

Note the printout of the attention matrix.
The values are quite extreme: one element of each column is very
close to one and the others are very close to zero.

Now we'll fix this problem by using scaled dot-product attention.
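
To see why large dot products produce such extreme attention values, the sketch below applies a softmax to one made-up column of dot products, before and after dividing by the square root of the dimension (the usual scaling in scaled dot-product attention). The numbers and the names `softmax_1d`, `D_demo`, and `dot_products` are invented for illustration and are not taken from the notebook.

```python
import numpy as np

def softmax_1d(values):
  # Softmax of a 1D array (maximum subtracted for numerical stability)
  exp_values = np.exp(values - np.max(values))
  return exp_values / np.sum(exp_values)

D_demo = 4                                   # illustrative dimension (assumed)
dot_products = np.array([2.0, 8.0, 14.0])    # made-up dot products for one query

unscaled = softmax_1d(dot_products)
scaled = softmax_1d(dot_products / np.sqrt(D_demo))

print("unscaled:", np.round(unscaled, 3))
print("scaled:  ", np.round(scaled, 3))
```

Dividing by a constant shrinks the gaps between the dot products, so the largest weight should end up less dominant after scaling.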

## Scaled Dot-Product Self-Attention

In [None]:
# Now let's compute scaled dot-product self attention in matrix form
def scaled_dot_product_self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k):
  """
  Calculate scaled self-attention via matrix calculations.
  
  Returns:
    X_prime: output matrix
    attentions: the attention matrix
  """

  # TODO -- Write this function
  # 1. Compute queries, keys, and values
  # 2. Compute dot products
  # 3. Scale the dot products as in equation 12.9
  # 4. Apply softmax to calculate attentions
  # 5. Weight values by attentions
  # 6. Return X_prime and attentions

  # YOUR CODE HERE
  raise NotImplementedError()

  return X_prime, attentions

In [None]:
# Run the self attention mechanism
X_prime2, attentions2 = scaled_dot_product_self_attention(X,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)

# print attentions to 3 decimal places
print("attentions = ", np.round(attentions2, 3))

# Print out the results
print("\nX_prime = ", X_prime2)

In [None]:
# Test cell. Do not edit.

attentions_true2 = np.array([[0.,    0.,    0.062],
 [0.96,  0.071, 0.071],
 [0.04,  0.929, 0.867]])

assert np.allclose(attentions2, attentions_true2, atol=1e-3)

X_prime_true2 = np.array([[ 0.97411966,  1.59622051,  1.32638014],
 [-0.23738409, -0.09516106,  0.13062402],
 [-0.72333202,  3.70194096,  3.02371664],
 [-0.34413007,  2.01339538,  1.6902419 ]])

assert np.allclose(X_prime2, X_prime_true2, atol=1e-3)

Note that the attention values are now slightly less extreme. Because the dimension of the input is still small here, the scaling has only a
modest effect.
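
The effect of the scaling grows with the dimension. As a rough illustration, assuming query and key entries are drawn independently from a standard normal distribution (an assumption made for this sketch, not a property of the notebook's data), the standard deviation of their dot product grows like $\sqrt{D}$, and dividing by $\sqrt{D}$ brings it back to roughly one.

```python
import numpy as np

rng = np.random.default_rng(1)
num_samples = 5000     # number of random query/key pairs (arbitrary choice)
stds = {}              # standard deviation of the dot products for each dimension

for D_demo in [4, 64, 256]:
  q = rng.normal(size=(num_samples, D_demo))   # hypothetical queries, one per row
  k = rng.normal(size=(num_samples, D_demo))   # hypothetical keys, one per row
  dots = np.sum(q * k, axis=1)                 # one dot product per pair
  stds[D_demo] = np.std(dots)
  print("D =", D_demo,
        " std of dot products =", np.round(stds[D_demo], 2),
        " std after scaling =", np.round(stds[D_demo] / np.sqrt(D_demo), 2))
```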

## Equivariant to Column Permutation?

Let's investigate whether the self-attention mechanism is equivariant/covariant with respect to permutation.
If it is, when we permute the columns of the input matrix $\mathbf{X}$, the columns of the output matrix $\mathbf{X}'$ will also be permuted.
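
In matrix terms, permuting the columns is a right-multiplication by a permutation matrix $\mathbf{P}$, and equivariance means $\mathbf{f}[\mathbf{X}\mathbf{P}] = \mathbf{f}[\mathbf{X}]\mathbf{P}$. The sketch below only checks that column indexing agrees with such a matrix product; `X_demo`, `perm`, and `P` are illustrative names, not variables from the notebook.

```python
import numpy as np

X_demo = np.arange(12.0).reshape(4, 3)   # small D x N example matrix (assumed)
perm = [1, 2, 0]                         # new column order

# Permutation matrix: column j of P picks out column perm[j] of X_demo
P = np.eye(3)[:, perm]

# Indexing the columns and multiplying by P should give the same matrix
print(np.allclose(X_demo[:, perm], X_demo @ P))
```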


In [None]:
# Permute the columns of X
X_permuted = X[:,[1,2,0]]

In [None]:
# Run the self attention mechanism
X_primeP, attentionsP = scaled_dot_product_self_attention(X_permuted,omega_v, omega_q, omega_k, beta_v, beta_q, beta_k)

# print attentions to 3 decimal places
print("attentions with permuted input = ", np.round(attentionsP, 3))

# Print out the results
print("\nX_prime with permuted input = ", X_primeP)

Let's compare by applying the same column permutation to the previous output.

In [None]:
print("X_prime2 permuted output: ", X_prime2[:,[1,2,0]])

In [None]:
# Check if they are the same
np.allclose(X_primeP, X_prime2[:,[1,2,0]])

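If the check above returns `True`, the two outputs match, which shows that self-attention is equivariant to permutation of the input columns.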

**Question:**

Would this still be equivariant if we added positional encoding?

**Type your answer in the cell below**

YOUR ANSWER HERE