<a href="https://colab.research.google.com/github/CallmeQuant/Studying-Notebook/blob/main/Miscellaneous/Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np

# **Basic Attention**

+ Rather than relying only on the final encoder hidden state, attention lets the decoder use information from **every encoder** step.
+ The encoder outputs are weighted according to the decoder hidden state and summed into a single context vector, which is passed to the decoder to make the prediction.

In [None]:
def softmax(x, axis = 0):
  """
  Compute the softmax for x along specified axis
  axis=0 calculates softmax across rows which means each column sums to 1
  axis=1 calculates softmax across columns which means each row sums to 1
  """
  return np.exp(x) / np.expand_dims(np.sum(np.exp(x), axis = axis), axis)

In [None]:
scores = [3.0, 1.0, 0.2]
print(softmax(scores))

scores2D = np.array([[1, 2, 3, 6],
                     [2, 4, 5, 6],
                     [3, 8, 7, 6]])

print(softmax(scores2D))

[0.8360188  0.11314284 0.05083836]
[[0.09003057 0.00242826 0.01587624 0.33333333]
 [0.24472847 0.01794253 0.11731043 0.33333333]
 [0.66524096 0.97962921 0.86681333 0.33333333]]
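The `softmax` above exponentiates the raw scores, so very large inputs can overflow. A minimal sketch of a shifted variant follows; the name `stable_softmax` is hypothetical and not part of the original notebook. It relies on the fact that subtracting a constant along the softmax axis does not change the result.

```python
import numpy as np

def stable_softmax(x, axis=0):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    x = np.asarray(x, dtype=float)
    # Subtracting the max along the axis leaves the result unchanged
    # but keeps every exponent <= 0, so np.exp cannot overflow.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

small = stable_softmax([3.0, 1.0, 0.2])            # same scores as the cell above
large = stable_softmax([1000.0, 1001.0, 1002.0])   # np.exp(1000.0) alone would overflow
print(small)
print(large)
```

For moderate scores both versions agree; the shift only matters once the scores get large.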


**Step 1: Computing Alignment Scores**

+ A measure of similarity between decoder hidden state and each encoder hidden state. The operation is

$$e_{ij} = v_{a}^{\intercal} \text{tanh}(W_{a} s_{i-1} + U_{a} h_{j})$$

where $W_a \in \mathbb{R}^{m \times n}$, $U_a \in \mathbb{R}^{m \times n}$, and $v_{a} \in \mathbb{R}^{m}$, with $n$ the hidden size and $m$ the size of the alignment layer.

+ Normally, this operation is implemented as a feedforward neural network with two layers, where $m$ is the size of the hidden layer in the alignment network:
 + The encoder hidden states $h_{j}$ from each input step $j$ are concatenated with the last decoder hidden state (repeated once per step) to produce an array of size $K \times 2n$, where $K$ is the number of encoder states/steps.
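To see why the concatenated two-layer network computes the same thing as the formula, here is a small sketch. It assumes column-vector conventions, so `W_a` and `U_a` have shape `(m, n)`, and all weights are random placeholders rather than trained values. Stacking $U_a^{\intercal}$ over $W_a^{\intercal}$ gives a first layer that reproduces $W_a s_{i-1} + U_a h_j$ row by row.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 16, 10, 5                 # hidden size, attention size, encoder steps

W_a = rng.standard_normal((m, n))   # acts on the decoder state s_{i-1}
U_a = rng.standard_normal((m, n))   # acts on each encoder state h_j
v_a = rng.standard_normal(m)

s_prev = rng.standard_normal(n)     # previous decoder hidden state
H = rng.standard_normal((K, n))     # encoder hidden states, one row per step

# Direct form: e_j = v_a^T tanh(W_a s_{i-1} + U_a h_j) for every j
e_direct = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])

# Concatenated form: each row is [h_j, s_{i-1}], and the first layer
# stacks U_a^T on top of W_a^T so the product splits into the two terms.
inputs = np.concatenate([H, np.tile(s_prev, (K, 1))], axis=1)   # (K, 2n)
layer_1 = np.concatenate([U_a.T, W_a.T], axis=0)                # (2n, m)
e_concat = np.tanh(inputs @ layer_1) @ v_a                      # (K,)

print(np.allclose(e_direct, e_concat))
```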



In [34]:
hidden_size = 16
attention_size = 10
input_length = 5

np.random.seed(42)

encoder_states = np.random.randn(input_length, hidden_size)
decoder_states = np.random.randn(1, hidden_size)

# Weights for the neural network, these are typically learned through training
# Use these in the alignment function below as the layer weights
layer_1 = np.random.randn(2 * hidden_size, attention_size)
layer_2 = np.random.randn(attention_size , 1)

def alignment(encoder_states, decoder_state):
  # Pair every encoder state with a copy of the decoder state
  inputs = np.concatenate((encoder_states,
                np.repeat(decoder_state, encoder_states.shape[0], axis = 0)),
                axis = 1)
  assert inputs.shape == (input_length, 2 * hidden_size)

  activations = np.tanh(inputs @ layer_1)

  assert activations.shape == (input_length, attention_size)

  scores = activations @ layer_2

  assert scores.shape == (input_length, 1)

  return scores

In [35]:
scores = alignment(encoder_states, decoder_states)
print(scores)

[[4.35790943]
 [5.92373433]
 [4.18673175]
 [2.11437202]
 [0.95767155]]


**Step 2: Compute the Weights based on the Alignment Scores**

+ These weights determine which encoder outputs are most important for the decoder output. They should lie between 0 and 1 and add up to 1.
+ Use the softmax function to turn the alignment scores into attention weights. Mathematically,

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k = 1}^{K} \exp(e_{ik})}$$

where $K$ is the number of encoder states.
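As a quick illustration of this normalization, the sketch below applies the formula to five made-up alignment scores (illustrative values, not computed by the network above) and checks the two stated properties.

```python
import numpy as np

# Alignment scores for K = 5 encoder steps (illustrative values only)
e = np.array([4.36, 5.92, 4.19, 2.11, 0.96])

# alpha_j = exp(e_j) / sum_k exp(e_k); subtracting the max does not change
# the result and only guards against overflow.
exp_e = np.exp(e - e.max())
alpha = exp_e / exp_e.sum()

print(alpha)
print(alpha.sum())
```

The largest score receives the largest weight, and every weight lies strictly between 0 and 1.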

**Step 3: Weighting the Encoder Output vectors and Sum**

+ The weights tell us the importance of each input word with respect to the decoder state.

+ Multiply each encoder vector by its respective weight to get the weighted encoder vectors.
+ Sum the weighted vectors to get the context vector.

Mathematically,

$$c_{i} = \sum_{j = 1}^{K} \alpha_{ij} h_{j}$$
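The weighted sum can be written either as "scale, then add" or as a single vector-matrix product. The following sketch uses random encoder states and hand-picked weights (both are placeholders) to show that the two forms agree.

```python
import numpy as np

rng = np.random.default_rng(1)
K, n = 5, 4
H = rng.standard_normal((K, n))                  # encoder hidden states h_1..h_K
alpha = np.array([0.1, 0.5, 0.2, 0.15, 0.05])    # attention weights, sum to 1

# Step by step: scale each h_j by alpha_j, then add the scaled vectors
context_sum = np.sum(alpha[:, None] * H, axis=0)

# The same sum expressed as one vector-matrix product
context_matmul = alpha @ H

print(np.allclose(context_sum, context_matmul))
```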

In [56]:
def attention(encoder_states, decoder_state):
    """
    Example function that calculates attention, returns the context vector
    Parameters:
      encoder_states: NxM numpy array, where N is the number of vectors and M is the vector length
      decoder_state: 1xM numpy array, M is the vector length, must be the same M as encoder_states
    """
    scores = alignment(encoder_states, decoder_state)

    # Normalize the scores over the N encoder steps (axis 0)
    weights = softmax(scores)

    # Element-wise product: the (N, 1) weights broadcast across each encoder vector
    weighted_states = weights * encoder_states

    context = np.sum(weighted_states, axis = 0)

    # shorter
    # context = np.dot(weights.T, encoder_states).flatten()

    return context


In [57]:
context_vector = attention(encoder_states, decoder_states)
print(context_vector)

[-0.63514569  0.04917298 -0.43930867 -0.9268003   1.01903919 -0.43181409
  0.13365099 -0.84746874 -0.37572203  0.18279832 -0.90452701  0.17872958
 -0.58015282 -0.58294027 -0.75457577  1.32985756]


# **Dot-Product Attention**

In [58]:
import sys

import numpy as np
import scipy.special

import textwrap
wrapper = textwrap.TextWrapper(width=70)

# to print the entire np array
np.set_printoptions(threshold=sys.maxsize)

Helper functions to create tensors and display their shape and contents.

In [61]:
def create_tensor(l):
  return np.array(l)

def display_tensor(t, name):
  print(f'{name} shape: {t.shape}\n')
  print(f'{t}\n')

We will create some tensors and display the shapes.

The query, key, and value arrays must all have the same embedding dimensions (number of columns), and the mask array must have the same shape as `np.dot(query, key.T)`.

In [98]:
np.random.seed(128)
q = create_tensor([[1, 0, 0], [0, 1, 0]])
display_tensor(q, 'query')
k = create_tensor([[1, 2, 3], [4, 5, 6]])
display_tensor(k, 'key')
v = create_tensor([[0, 1, 0], [1, 0, 1]])
display_tensor(v, 'value')
m = create_tensor([[0., -1e9],
                   [0., 0.]]
                  )
display_tensor(m, 'mask')

query shape: (2, 3)

[[1 0 0]
 [0 1 0]]

key shape: (2, 3)

[[1 2 3]
 [4 5 6]]

value shape: (2, 3)

[[0 1 0]
 [1 0 1]]

mask shape: (2, 2)

[[ 0.e+00 -1.e+09]
 [ 0.e+00  0.e+00]]
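A short sketch of what such an additive mask does once it reaches the softmax: an entry of `-1e9` pushes the corresponding weight to zero, while `0` leaves the score untouched. The helper `row_softmax` and the example scores are assumptions made for illustration. The boolean form (`True` means "keep") is shown as an equivalent alternative.

```python
import numpy as np

def row_softmax(x):
    """Softmax over the last axis (hypothetical helper for this sketch)."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[1.0, 4.0],
                   [2.0, 5.0]])
additive_mask = np.array([[0.0, -1e9],
                          [0.0, 0.0]])

# Additive form: blocked positions get a huge negative score before the softmax
weights = row_softmax(scores + additive_mask)

# Boolean form of the same mask: True where attention is allowed
boolean_mask = additive_mask == 0
weights_bool = row_softmax(np.where(boolean_mask, scores, -1e9))

print(weights)
print(np.allclose(weights, weights_bool))
```

In the first row all of the weight lands on the first key, because the second key is masked out.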



**Compute dot product attention**

$$\text{softmax}\Bigg(\frac{QK^{\intercal}}{\sqrt{d}} + M\Bigg)V$$
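One common motivation for the $\sqrt{d}$ factor: if the entries of a query and a key are independent with zero mean and unit variance, their dot product has variance $d$, so unscaled scores grow with the embedding size. The sketch below checks this empirically under that assumption. The sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64               # embedding dimension
num_pairs = 20000    # number of random query/key pairs

queries = rng.standard_normal((num_pairs, d))
keys = rng.standard_normal((num_pairs, d))

# One dot product per query/key pair
raw = np.sum(queries * keys, axis=1)
scaled = raw / np.sqrt(d)

# The raw variance is close to d; the scaled variance is close to 1
print(raw.var(), scaled.var())
```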

In [99]:
def DotProductAttention(query, key, value, mask, scale = True):
  """
  Dot product attention.
    Parameters:
        query (numpy.ndarray): array of query representations with shape (L_q by d)
        key (numpy.ndarray): array of key representations with shape (L_k by d)
        value (numpy.ndarray): array of value representations with shape (L_v by d) where L_v = L_k
        mask (numpy.ndarray): attention mask with shape (L_q by L_k); either boolean
            (True keeps a score) or additive (0 keeps a score, -1e9 blocks it)
        scale (bool): whether to scale the dot product of the query and transposed key

    Returns:
        numpy.ndarray: Attention array for q, k, v arrays. (L_q by d)
  """
  assert query.shape[-1] == key.shape[-1] == value.shape[-1], "Embedding dimensions of q, k, v aren't all the same"

  # Save depth/dimension of the query embedding for scaling down the dot product
  if scale:
      depth = query.shape[-1]
  else:
      depth = 1

  # Compute the scaled query-key dot product
  # swapaxes (rather than .T) also handles keys with a leading batch dimension
  scaled_qk = np.matmul(query, np.swapaxes(key, -1, -2)) / np.sqrt(depth)

  if mask is not None:
    if mask.dtype == np.bool_:
      # Boolean mask: keep scores where the mask is True, block the rest
      scaled_qk = np.where(mask, scaled_qk, np.full_like(scaled_qk, -1e9))
    else:
      # Additive mask, as in the formula above
      scaled_qk = scaled_qk + mask

  # Softmax over the keys, computed with logsumexp for numerical stability
  logsumexp = scipy.special.logsumexp(scaled_qk, axis=-1, keepdims=True)
  weights = np.exp(scaled_qk - logsumexp)

  # Weighted sum of the values
  attention = np.matmul(weights, value)

  return attention

In [100]:
print(DotProductAttention(q, k, v, m))

[[0.         1.         0.        ]
 [0.84967455 0.15032545 0.84967455]]


In [101]:
def dot_product_self_attention(q, k, v, scale=True):
    """
    Masked dot product self attention.
    Parameters:
        q (numpy.ndarray): queries.
        k (numpy.ndarray): keys.
        v (numpy.ndarray): values.
        scale (bool): whether to scale the query-key dot product.
    Returns:
        numpy.ndarray: masked dot product self attention tensor.
    """
    # Size of the penultimate dimension of the query
    mask_size = q.shape[-2]

    # Creates a boolean matrix that is True on and below the diagonal and False above,
    # with shape (1, mask_size, mask_size)
    # Use np.tril() - Lower triangle of an array and np.ones()
    mask = np.tril(np.ones((1, mask_size, mask_size), dtype = np.bool_), k = 0)

    return DotProductAttention(q, k, v, mask, scale = scale)

In [102]:
dot_product_self_attention(q, k, v)

array([[[0.        , 1.        , 0.        ],
        [0.84967455, 0.15032545, 0.84967455]]])