# Self-Attention
This notebook builds intuition for self-attention by implementing it as an explicit loop. For each input, the loop computes attention weights over all inputs and then uses those weights to take a weighted sum of the corresponding values, which gives the output for that input.

### Imports

In [1]:
import numpy as np

### Initialize Hyperparameters
Set the random seed and the hyperparameters for the self-attention loop: the number of inputs N and the number of dimensions D of each input.

In [2]:
np.random.seed(3)
# Number of inputs
N = 3
# Number of dimensions of each input
D = 4

### Define List Structure for Input

In [3]:
all_x = []

### Initialize Input Values
Fill the input list with N column vectors of shape (D, 1), each drawn from a standard normal distribution.

In [4]:
for n in range(N):
  all_x.append(np.random.normal(size=(D,1)))

### Print Input Values

In [5]:
print(all_x)

[array([[ 1.78862847],
       [ 0.43650985],
       [ 0.09649747],
       [-1.8634927 ]]), array([[-0.2773882 ],
       [-0.35475898],
       [-0.08274148],
       [-0.62700068]]), array([[-0.04381817],
       [-0.47721803],
       [-1.31386475],
       [ 0.88462238]])]


### Initialize Parameters
Initialize the weight matrices (omega) and bias vectors (beta) for the queries, keys, and values with values drawn from a standard normal distribution.

In [6]:
np.random.seed(0)
omega_q = np.random.normal(size=(D,D))
omega_k = np.random.normal(size=(D,D))
omega_v = np.random.normal(size=(D,D))
beta_q = np.random.normal(size=(D,1))
beta_k = np.random.normal(size=(D,1))
beta_v = np.random.normal(size=(D,1))

### Define List Structure (Query, Key, Value)
Define lists to store the queries, keys, and values computed for each input.

In [7]:
all_queries = []
all_keys = []
all_values = []

### Compute Query, Key, Value
Compute and store the query, key, and value for each input by applying the initialized weights and biases.
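
In equation form, the computation is a separate affine map for each of the three quantities. The symbol $\mathbf{x}_n$ for the n-th input is introduced here for illustration; the omega and beta names follow the variables defined earlier.

$$
\mathbf{q}_n = \boldsymbol{\beta}_q + \boldsymbol{\Omega}_q \mathbf{x}_n, \qquad
\mathbf{k}_n = \boldsymbol{\beta}_k + \boldsymbol{\Omega}_k \mathbf{x}_n, \qquad
\mathbf{v}_n = \boldsymbol{\beta}_v + \boldsymbol{\Omega}_v \mathbf{x}_n
$$

Each $\boldsymbol{\Omega}$ has shape (D, D) and each $\boldsymbol{\beta}$ has shape (D, 1), so every query, key, and value is a (D, 1) column vector.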

In [8]:
for x in all_x:
  # Compute query, key, and value
  query = beta_q + omega_q @ x
  key = beta_k + omega_k @ x
  value = beta_v + omega_v @ x

  # Append value of query, key, and value to their respective lists
  all_queries.append(query)
  all_keys.append(key)
  all_values.append(value)

### Define Softmax Function

In [9]:
def softmax(items_in):
  scores = np.array(items_in, dtype=float).flatten()
  scores = scores - np.max(scores)
  exp_scores = np.exp(scores)
  items_out = exp_scores / np.sum(exp_scores)
  return items_out
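
A quick sanity check of the softmax defined above may help here. The sketch below repeats the function so it runs on its own, then checks two properties on made-up scores (the input values are illustrative, not taken from the notebook): the outputs sum to 1, and adding a constant to every score leaves the result unchanged, which is why subtracting the maximum is safe.

```python
import numpy as np

def softmax(items_in):
    # Same definition as in the notebook, repeated so this block is self-contained
    scores = np.array(items_in, dtype=float).flatten()
    scores = scores - np.max(scores)   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

# Illustrative scores (assumed values, not from the notebook)
weights = softmax([1.0, 2.0, 3.0])
shifted = softmax([1001.0, 1002.0, 1003.0])

print(weights)                         # roughly [0.090 0.245 0.665]
print(np.isclose(weights.sum(), 1.0))  # the weights sum to 1
print(np.allclose(weights, shifted))   # a constant shift does not change the result
```

Without the maximum subtraction, the shifted scores would overflow in `np.exp`.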

### Define List Structure (Output)
Define a list to store the computed output values, and convert the list of values to an array.

In [10]:
all_x_prime = []
all_values = np.array(all_values)

### Self-Attention Loop
Run the self-attention loop, which performs the following steps for each query:
1. Compute the dot product between the query and every key
2. Pass the dot products through the softmax function to obtain the attention weights
3. Compute the output as the weighted sum of the values, where each value is multiplied by its corresponding attention weight

In [11]:
for n in range(N):
  # Create list for dot products of query n with all keys
  all_km_qn = []

  # For each key in the list of keys
  for key in all_keys:
    # Compute dot product between query and key; .item() extracts the scalar
    # from the (1, 1) result without triggering NumPy's deprecation warning
    dot_product = (all_queries[n].T @ key).item()
    # Store output of dot product
    all_km_qn.append(dot_product)

  # Compute attention weights (1D vector length N)
  attention = softmax(all_km_qn)
  # Print the resulting attention weights
  print("Attentions for output ", n)
  print(attention)
  # Define array structure to store output value
  x_prime = np.zeros((D,))
  
  # Compute weighted sum of values using its corresponding attention weights
  for i in range(N):
    x_prime += all_values[i].flatten() * attention[i]
  all_x_prime.append(x_prime)

Attentions for output  0
[1.24326146e-13 9.98281489e-01 1.71851130e-03]
Attentions for output  1
[2.79525306e-12 5.85506360e-03 9.94144936e-01]
Attentions for output  2
[0.00505708 0.00654776 0.98839516]
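
The loop above can also be written in matrix form, which is how self-attention is usually implemented in practice. The following is a minimal sketch rather than part of the original notebook: the function name `self_attention_matrix` and the (D, N) matrix `X` that stacks the inputs as columns are assumptions made for illustration. Like the loop, it applies no scaling to the dot products.

```python
import numpy as np

def self_attention_matrix(X, omega_q, omega_k, omega_v, beta_q, beta_k, beta_v):
    """Matrix-form self-attention; X has shape (D, N), one input per column."""
    # Queries, keys, and values for all inputs at once (biases broadcast over columns)
    Q = beta_q + omega_q @ X
    K = beta_k + omega_k @ X
    V = beta_v + omega_v @ X
    # scores[m, n] is the dot product of key m with query n
    scores = K.T @ Q
    # Softmax over the keys (axis 0), shifted by each column's maximum for stability
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Column n of the result is the weighted sum of values for query n
    return V @ weights, weights

# Rebuild the notebook's inputs and parameters with the same seeds
N, D = 3, 4
np.random.seed(3)
X = np.hstack([np.random.normal(size=(D, 1)) for _ in range(N)])
np.random.seed(0)
omega_q = np.random.normal(size=(D, D))
omega_k = np.random.normal(size=(D, D))
omega_v = np.random.normal(size=(D, D))
beta_q = np.random.normal(size=(D, 1))
beta_k = np.random.normal(size=(D, 1))
beta_v = np.random.normal(size=(D, 1))

X_prime, attention = self_attention_matrix(
    X, omega_q, omega_k, omega_v, beta_q, beta_k, beta_v)
print(X_prime[:, 0])    # output for input 0
print(attention[:, 0])  # attention weights for input 0
```

Each column of `attention` holds the weights for one query, so every column sums to 1, and column n of `X_prime` corresponds to the output the loop computes for input n.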


### Print Calculated and True Output Values
Compare each calculated output with its expected value to confirm that the self-attention loop is implemented correctly.

In [12]:
print("x_prime_0_calculated:", all_x_prime[0].transpose())
print("x_prime_0_true: [[ 0.94744244 -0.24348429 -0.91310441 -0.44522983]]")
print("x_prime_1_calculated:", all_x_prime[1].transpose())
print("x_prime_1_true: [[ 1.64201168 -0.08470004  4.02764044  2.18690791]]")
print("x_prime_2_calculated:", all_x_prime[2].transpose())
print("x_prime_2_true: [[ 1.61949281 -0.06641533  3.96863308  2.15858316]]")

x_prime_0_calculated: [ 0.94744244 -0.24348429 -0.91310441 -0.44522983]
x_prime_0_true: [[ 0.94744244 -0.24348429 -0.91310441 -0.44522983]]
x_prime_1_calculated: [ 1.64201168 -0.08470004  4.02764044  2.18690791]
x_prime_1_true: [[ 1.64201168 -0.08470004  4.02764044  2.18690791]]
x_prime_2_calculated: [ 1.61949281 -0.06641533  3.96863308  2.15858316]
x_prime_2_true: [[ 1.61949281 -0.06641533  3.96863308  2.15858316]]
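
Comparing printed numbers by eye works for three short vectors, but a programmatic check scales better. As one possible approach, the sketch below uses `np.allclose` on the first output; the numbers are copied from the printed results above, and the variable names `calculated` and `expected` are illustrative.

```python
import numpy as np

# Values copied from the printed output for x_prime_0
calculated = np.array([0.94744244, -0.24348429, -0.91310441, -0.44522983])
expected = np.array([[0.94744244, -0.24348429, -0.91310441, -0.44522983]])

# ravel() flattens the (1, 4) reference so both arrays have shape (4,)
matches = np.allclose(calculated, expected.ravel(), atol=1e-8)
print("x_prime_0 matches:", matches)
```

In the notebook itself, `calculated` would be `all_x_prime[0]` rather than a hard-coded array.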
