# Attention

- Algorithm 3

![img](../assets/algorithm_3.png)

In Algorithm 3, the attention layer takes two inputs, $e$ and $e_{t}$.  
- $e$ : vector representation of the current token. 
- $e_{t}$ : vector representations of the context tokens $t \in [T]$.  
    * Context tokens are the tokens that carry contextual information (e.g. the preceding or surrounding text) for predicting the current token.

Its output is $\tilde{\bf{v}}$, a vector representation of the token and context combined. The parameters are:
- $W_{q}, W_{k} \in \mathbb{R}^{d_{attn} \times d_{in}}$ and $b_{q}, b_{k} \in \mathbb{R}^{d_{attn}}$, the query and key linear projections
- $W_{v} \in \mathbb{R}^{d_{out} \times d_{in}},b_{v} \in \mathbb{R}^{d_{out}}$, the value linear projection

Attention works as follows:
1. The token currently being predicted is mapped to a *query* vector $\bf{q} \in \mathbb{R}^{d_{attn}}$.
$$
\bf{q} \leftarrow W_{q}e + b_{q}
$$

2. The tokens in the context are mapped to *key* vectors $\bf{k}_{t} \in \mathbb{R}^{d_{attn}}$.
$$
\forall{t}: \bf{k}_t \leftarrow W_{k}e_{t} + b_{k}
$$

3. The tokens in the context are mapped to *value* vectors $\bf{v}_{t} \in \mathbb{R}^{d_{out}}$.
$$
\forall{t}: \bf{v}_t \leftarrow W_{v}e_{t} + b_{v}
$$

4. The inner products $\bf{q}^{T}\bf{k}_{t}$ are interpreted as the degree to which context token $t \in [T]$ is important for predicting the current token. They are scaled by $\sqrt{d_{attn}}$ and normalized with a softmax into weights $\alpha_t$.
$$
\forall{t}: \alpha_t = \frac{\exp({\bf{q}^{T}\bf{k}_{t}/\sqrt{d_{attn}}})}{\sum_{u}{\exp({\bf{q}^{T}\bf{k}_{u}/\sqrt{d_{attn}}})}}
$$

5. The weights $\alpha_t$ form a distribution over the context tokens, which is then used to combine the value vectors.
$$
\text{return } \tilde{\bf{v}}=\sum_{t=1}^{T}{\alpha_{t}\bf{v}_{t}}
$$
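
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `single_query_attention`, the random inputs, and the dimensions `d_in=4`, `d_attn=3`, `d_out=2`, `T=5` are all assumptions chosen so that the three sizes differ and shape mistakes become visible.

```python
import numpy as np


def single_query_attention(e, context, W_q, b_q, W_k, b_k, W_v, b_v):
    """Sketch of the five steps for one current token (hypothetical helper)."""
    q = W_q @ e + b_q                     # step 1: query, shape (d_attn,)
    K = context @ W_k.T + b_k             # step 2: keys, shape (T, d_attn)
    V = context @ W_v.T + b_v             # step 3: values, shape (T, d_out)
    scores = K @ q / np.sqrt(q.shape[0])  # step 4: scaled inner products, shape (T,)
    scores = scores - scores.max()        # shift so that exp() cannot overflow
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V, alpha               # step 5: weighted sum of the values


# Illustrative random inputs (assumed, not from the tutorial).
rng = np.random.default_rng(0)
d_in, d_attn, d_out, T = 4, 3, 2, 5
e = rng.normal(size=d_in)
context = rng.normal(size=(T, d_in))
W_q = rng.normal(size=(d_attn, d_in))
W_k = rng.normal(size=(d_attn, d_in))
W_v = rng.normal(size=(d_out, d_in))
b_q, b_k, b_v = np.zeros(d_attn), np.zeros(d_attn), np.zeros(d_out)

v_tilde, alpha = single_query_attention(e, context, W_q, b_q, W_k, b_k, W_v, b_v)
print(v_tilde.shape, alpha.shape)  # one output of size d_out, one weight per context token
```

The output has length `d_out`, and the weights `alpha` contain one non-negative entry per context token and sum to 1.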

## Weight matrix and bias vectors

To implement Algorithm 3, we need the following weight matrices and bias vectors:

1. Query
    - $W_{q} \in \mathbb{R}^{d_{attn} \times d_{in}}$
    - $b_{q} \in \mathbb{R}^{d_{attn}}$
2. Key
    - $W_{k} \in \mathbb{R}^{d_{attn} \times d_{in}}$
    - $b_{k} \in \mathbb{R}^{d_{attn}}$
3. Value
    - $W_{v} \in \mathbb{R}^{d_{out} \times d_{in}}$
    - $b_{v} \in \mathbb{R}^{d_{out}}$

We can generalize these matrices and vectors as follows, where $n_{in}$ is the input dimension and $n_{out}$ is the output dimension:

- $W \in \mathbb{R}^{n_{out} \times n_{in}}$
- $b \in \mathbb{R}^{n_{out}}$

So, to generate these weights and biases we need two arguments, `in_dim` and `out_dim`.  

In [7]:
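def generate_weight_bias(in_dim, out_dim):
    import numpy as np

    # Row i of the (out_dim, in_dim) weight matrix is filled with i + 1.
    weights = np.array([[i + 1] * in_dim for i in range(out_dim)])
    # The bias has one entry per output dimension and is all zeros.
    bias = np.array([0] * out_dim)
    return weights, bias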

In [8]:
d_attn = 10
d_in = 10
d_out = 10

# Arguments are (in_dim, out_dim): queries and keys map d_in -> d_attn, values map d_in -> d_out.
query_weights, query_bias = generate_weight_bias(d_in, d_attn)
key_weights, key_bias = generate_weight_bias(d_in, d_attn)
value_weights, value_bias = generate_weight_bias(d_in, d_out)
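
Because all three dimensions are set to 10 here, a swapped `in_dim` and `out_dim` would go unnoticed. The sketch below repeats the same fill pattern with unequal sizes so that the shapes can be checked. The helper name `make_projection` and the sizes 4, 3 and 2 are assumptions made for illustration only.

```python
import numpy as np


def make_projection(in_dim, out_dim):
    # Hypothetical helper: row i is filled with i + 1, and the bias is all zeros.
    weights = np.array([[i + 1] * in_dim for i in range(out_dim)])
    bias = np.zeros(out_dim, dtype=int)
    return weights, bias


d_in, d_attn, d_out = 4, 3, 2
W_q, b_q = make_projection(d_in, d_attn)
W_v, b_v = make_projection(d_in, d_out)

print(W_q.shape, b_q.shape)  # (3, 4) (3,)
print(W_v.shape, b_v.shape)  # (2, 4) (2,)

# A vector of length d_in is projected to length d_attn.
e = np.ones(d_in, dtype=int)
print(W_q @ e + b_q)  # [ 4  8 12]
```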

### Input Samples
Now, we will implement the main attention logic step by step.  
Assume that we have the following vectors:

In [16]:
import numpy as np


current_vector = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
context_vectors = np.array(
    [
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    ]
)


### 1. Query Mapping
Define a function that implements the step below:

1. The token currently being predicted is mapped to a *query* vector $\bf{q} \in \mathbb{R}^{d_{attn}}$.

The function takes three arguments:
- Parameters:
    - $W_{q}$: `query_weights`
    - $b_{q}$: `query_bias`
- Vector of the token currently being predicted: `current_vector`

In [12]:
def query_mapping(current_vector, query_weights, query_bias):
    query_vector = query_weights.dot(current_vector) + query_bias
    return query_vector

In [13]:
query_vector = query_mapping(current_vector, query_weights, query_bias)
query_vector

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100])

### 2. Key Mapping

Define a function that implements the step below:

2. The tokens in the context are mapped to *key* vectors $\bf{k}_{t} \in \mathbb{R}^{d_{attn}}$.
$$
\forall{t}: \bf{k}_t \leftarrow W_{k}e_{t} + b_{k}
$$

The function takes three arguments:
- Parameters:
    - $W_{k}$: `key_weights`
    - $b_{k}$: `key_bias`
- Context vectors: `context_vectors`

In [14]:
def key_mapping(context_vectors, key_weights, key_bias):
    key_vectors = []
    for context_vector in context_vectors:
        key_vector = key_weights.dot(context_vector) + key_bias
        key_vectors.append(key_vector)
    return np.stack(key_vectors)

In [18]:
key_vectors = key_mapping(context_vectors, key_weights, key_bias)
key_vectors

array([[ 20,  40,  60,  80, 100, 120, 140, 160, 180, 200],
       [ 30,  60,  90, 120, 150, 180, 210, 240, 270, 300],
       [ 40,  80, 120, 160, 200, 240, 280, 320, 360, 400]])
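
The loop over context vectors can also be written as a single matrix product. The sketch below rebuilds the same key weights and context vectors so that it runs on its own, and checks that both forms agree. The variable names `looped` and `batched` are illustrative.

```python
import numpy as np

# Same fill pattern and inputs as in this tutorial, rebuilt so the block is self-contained.
key_weights = np.array([[i + 1] * 10 for i in range(10)])
key_bias = np.zeros(10, dtype=int)
context_vectors = np.array([[2] * 10, [3] * 10, [4] * 10])

# One projection per context vector, as in key_mapping.
looped = np.stack([key_weights.dot(c) + key_bias for c in context_vectors])

# All projections at once: each row of context_vectors is multiplied by key_weights.T,
# and the bias is broadcast over the rows.
batched = context_vectors @ key_weights.T + key_bias

print(np.array_equal(looped, batched))  # True
print(batched.shape)                    # (3, 10)
```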

### 3. Value Mapping
Define a function that implements the step below:

3. The tokens in the context are mapped to *value* vectors $\bf{v}_{t} \in \mathbb{R}^{d_{out}}$.
$$
\forall{t}: \bf{v}_t \leftarrow W_{v}e_{t} + b_{v}
$$

The function takes three arguments:
- Parameters:
    - $W_{v}$: `value_weights`
    - $b_{v}$: `value_bias`
- Context vectors: `context_vectors`

In [35]:
def value_mapping(context_vectors, value_weights, value_bias):
    # solution 1
    value_vectors = []
    for context_vector in context_vectors:
        value_vector = value_weights.dot(context_vector) + value_bias
        value_vectors.append(value_vector)
    # solution 2
    # value_vectors = value_weights.dot(context_vectors.T).T + value_bias
    return np.stack(value_vectors)

In [36]:
value_vectors = value_mapping(context_vectors, value_weights, value_bias)
value_vectors

array([[ 20,  40,  60,  80, 100, 120, 140, 160, 180, 200],
       [ 30,  60,  90, 120, 150, 180, 210, 240, 270, 300],
       [ 40,  80, 120, 160, 200, 240, 280, 320, 360, 400]])

### 4. Softmax
Define a function that implements the step below:

4. The inner products $\bf{q}^{T}\bf{k}_{t}$ are interpreted as the degree to which context token $t \in [T]$ is important for predicting the current token. They are scaled by $\sqrt{d_{attn}}$ and normalized into weights $\alpha_t$.
$$
\forall{t}: \alpha_t = \frac{\exp({\bf{q}^{T}\bf{k}_{t}/\sqrt{d_{attn}}})}{\sum_{u}{\exp({\bf{q}^{T}\bf{k}_{u}/\sqrt{d_{attn}}})}}
$$

The function takes three arguments: `query_vector`, `key_vectors` and `d_attn`.  
Note that this formula is the softmax function applied to the scaled inner products.

![img](../assets/softmax.png)

First, define a function that computes the scaled inner product between `query_vector` and one key vector.


In [39]:
def inner_product_query_key(query_vector, key_vector, d_attn):
    from math import sqrt

    alpha = query_vector.dot(key_vector) / sqrt(d_attn)
    return alpha

In [42]:
alpha = inner_product_query_key(query_vector, key_vectors[0], d_attn)
alpha

24349.53798329652

Second, define a function that computes the scaled inner products between `query_vector` and all `key_vectors`.  
Use the `inner_product_query_key` function defined above.

In [45]:
def inner_product_query_keys(query_vector, key_vectors, d_attn):
    # solution 1
    alphas = []
    for key_vector in key_vectors:
        alpha = inner_product_query_key(query_vector, key_vector, d_attn)
        alphas.append(alpha)
    return np.array(alphas)

The length of `alphas` should be equal to the length of `context_vectors`.  
In this tutorial it should be 3.

In [47]:
alphas = inner_product_query_keys(query_vector, key_vectors, d_attn)
alphas

array([24349.5379833 , 36524.30697494, 48699.07596659])

Finally, define a softmax function that turns `alphas` into weights.

In [48]:
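def softmax(alphas):
    # Subtract the maximum before exponentiating so that exp() cannot overflow;
    # the shift cancels in the ratio and does not change the result.
    exps = np.exp(alphas - np.max(alphas))
    scores = exps / exps.sum()
    return scores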

The scores should sum to 1.

In [52]:
scores = softmax(alphas)
sum(scores)

1.0
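
The scaled inner products in this tutorial are in the tens of thousands, so exponentiating them directly overflows. The sketch below compares a direct softmax with one that subtracts the maximum first. The function names `naive_softmax` and `stable_softmax` are illustrative and not part of the tutorial's code.

```python
import numpy as np


def naive_softmax(x):
    # Direct translation of the formula; exp() overflows for large inputs.
    e = np.exp(x)
    return e / e.sum()


def stable_softmax(x):
    # Subtracting the maximum leaves the ratio unchanged but keeps exp() <= 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()


small = np.array([1.0, 2.0, 3.0])
print(np.allclose(naive_softmax(small), stable_softmax(small)))  # True

# The scaled inner products computed in this tutorial.
large = np.array([24349.5379833, 36524.30697494, 48699.07596659])
with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(large))  # [nan nan nan]
print(stable_softmax(large))     # [0. 0. 1.]
```

With inputs this large, the softmax saturates and effectively puts all of the weight on the context token with the largest score.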

### 5. Final output
Define a function that implements the step below:

5. The weights $\alpha_t$ form a distribution over the context tokens, which is then used to combine the value vectors.
$$
\text{return } \tilde{\bf{v}}=\sum_{t=1}^{T}{\alpha_{t}\bf{v}_{t}}
$$

The function takes two arguments: `value_vectors` and `scores`.

In [56]:
def combine_value_score(value_vectors, scores):
    outputs = scores.dot(value_vectors)
    return outputs

The length of `outputs` should be equal to `d_out`.

In [58]:
outputs = combine_value_score(value_vectors, scores)
len(outputs), d_out

(10, 10)
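
To see why `scores.dot(value_vectors)` implements the sum $\sum_{t}\alpha_{t}\bf{v}_{t}$, the sketch below compares it with an explicit loop over the context tokens. The weights and value vectors are small made-up numbers chosen for readability, not the tutorial's data.

```python
import numpy as np

# Illustrative weights (they sum to 1) and T = 3 value vectors with d_out = 2.
scores = np.array([0.2, 0.3, 0.5])
value_vectors = np.array(
    [
        [1.0, 0.0],
        [0.0, 1.0],
        [2.0, 2.0],
    ]
)

# Explicit form of the formula: add up alpha_t * v_t over the context tokens.
explicit = sum(a * v for a, v in zip(scores, value_vectors))

# Vector-matrix product: entry j is sum_t scores[t] * value_vectors[t, j].
combined = scores.dot(value_vectors)

print(combined)                         # [1.2 1.3]
print(np.allclose(explicit, combined))  # True
```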