# Attention

- Algorithm 3

![img](../assets/algorithm_3.png)

In Algorithm 3, attention takes two inputs, $e$ and $e_{t}$.  
- $e$ : vector representation of the current token. 
- $e_{t}$ : vector representations of the context tokens $t \in [T]$.  
    * Context tokens are the tokens that carry contextual information (e.g. the preceding text or the surrounding text) for predicting the current token.

Its output is $\tilde{\bf{v}}$, a vector representation of the token and context combined, computed with the following parameters:
- $W_{q}, W_{k} \in \mathbb{R}^{d_{attn} \times d_{in}}$ and $b_{q}, b_{k} \in \mathbb{R}^{d_{attn}}$, the query and key linear projections
- $W_{v} \in \mathbb{R}^{d_{out} \times d_{in}}, b_{v} \in \mathbb{R}^{d_{out}}$, the value linear projection

Attention works as follows:
1. The token currently being predicted is mapped to a *query* vector $\bf{q} \in \mathbb{R}^{d_{attn}}$.
$$
\bf{q} \leftarrow W_{q}e + b_{q}
$$

2. The tokens in the context are mapped to *key* vectors $\bf{k}_{t} \in \mathbb{R}^{d_{attn}}$.
$$
\forall{t}: \bf{k}_t \leftarrow W_{k}e_{t} + b_{k}
$$

3. The tokens in the context are mapped to *value* vectors $\bf{v}_{t} \in \mathbb{R}^{d_{out}}$.
$$
\forall{t}: \bf{v}_t \leftarrow W_{v}e_{t} + b_{v}
$$

4. The inner products $\bf{q}^{T}\bf{k}_{t}$ are interpreted as the degree to which token $t \in V$ is important for predicting the current token $q$.
$$
\forall{t}: \alpha_t = \frac{\exp({\bf{q}^{T}\bf{k}_{t}/\sqrt{d_{attn}}})}{\sum_{u}{\exp({\bf{q}^{T}\bf{k}_{u}/\sqrt{d_{attn}}})}}
$$

5. Derive a distribution over the context tokens, which is then used to combine the value vectors.
$$
\text{return } \tilde{\bf{v}}=\sum_{t=1}^{T}{\alpha_{t}\bf{v}_{t}}
$$
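The five steps above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the reference implementation: it assumes the context vectors are stored as the rows of a 2-D array, and the function name `single_query_attention` and the random example parameters are illustrative choices that do not come from the algorithm itself.

```python
import numpy as np


def single_query_attention(e, context, W_q, b_q, W_k, b_k, W_v, b_v):
    """Attend from one token embedding `e` over the rows of `context`."""
    d_attn = W_q.shape[0]
    q = W_q @ e + b_q                        # step 1: query, shape (d_attn,)
    K = context @ W_k.T + b_k                # step 2: one key per row, shape (T, d_attn)
    V = context @ W_v.T + b_v                # step 3: one value per row, shape (T, d_out)
    scores = K @ q / np.sqrt(d_attn)         # scaled inner products, shape (T,)
    weights = np.exp(scores - scores.max())  # subtracting the max avoids overflow
    weights /= weights.sum()                 # step 4: softmax over the context tokens
    return weights @ V                       # step 5: weighted sum of the values


# Assumed example sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
d_in, d_attn, d_out, T = 10, 10, 10, 3
e = rng.normal(size=d_in)
context = rng.normal(size=(T, d_in))
W_q, W_k = rng.normal(size=(d_attn, d_in)), rng.normal(size=(d_attn, d_in))
W_v = rng.normal(size=(d_out, d_in))
b_q, b_k, b_v = np.zeros(d_attn), np.zeros(d_attn), np.zeros(d_out)

v_tilde = single_query_attention(e, context, W_q, b_q, W_k, b_k, W_v, b_v)
print(v_tilde.shape)  # (10,)
```

The rest of this section builds the same computation one function at a time.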

## Weight matrix and bias vectors

To implement Algorithm 3, we need the following weight matrices and bias vectors:

1. Query
    - $W_{q} \in \mathbb{R}^{d_{attn} \times d_{in}}$
    - $b_{q} \in \mathbb{R}^{d_{attn}}$
2. Key
    - $W_{k} \in \mathbb{R}^{d_{attn} \times d_{in}}$
    - $b_{k} \in \mathbb{R}^{d_{attn}}$
3. Value
    - $W_{v} \in \mathbb{R}^{d_{out} \times d_{in}}$
    - $b_{v} \in \mathbb{R}^{d_{out}}$

We can generalize these matrices and vectors as follows:

- $W \in \mathbb{R}^{\text{out\_dim} \times \text{in\_dim}}$
- $b \in \mathbb{R}^{\text{out\_dim}}$

So, to generate these weights and bias vectors we need two arguments, `in_dim` and `out_dim`.  

In [None]:
def generate_weight_bias(in_dim, out_dim):
    ...
    return weights, bias
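
One possible way to fill in `generate_weight_bias` is sketched below. The small Gaussian initialization, the zero bias and the optional `rng` argument are assumptions made for illustration. The only requirement stated above is the shapes: `(out_dim, in_dim)` for the weights and `(out_dim,)` for the bias.

```python
import numpy as np


def generate_weight_bias(in_dim, out_dim, rng=None):
    # `rng` is a hypothetical extra argument, added only for reproducibility.
    rng = np.random.default_rng() if rng is None else rng
    # Assumed initialization: small Gaussian weights and a zero bias.
    weights = rng.normal(scale=0.1, size=(out_dim, in_dim))
    bias = np.zeros(out_dim)
    return weights, bias


example_weights, example_bias = generate_weight_bias(
    in_dim=10, out_dim=4, rng=np.random.default_rng(0)
)
print(example_weights.shape, example_bias.shape)  # (4, 10) (4,)
```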

In [None]:
d_attn = 10
d_in = 10
d_out = 10

query_weights, query_bias = generate_weight_bias(d_in, d_attn)
key_weights, key_bias = generate_weight_bias(d_in, d_attn)
value_weights, value_bias = generate_weight_bias(d_in, d_out)

## Implement
### Input Samples
Now we will implement the main attention logic step by step.  
Assume that we have the following vectors:

In [None]:
import numpy as np


current_vector = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
context_vectors = np.array(
    [
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    ]
)


### 1. Query Mapping
Define a function that implements the following step:

1. The token currently being predicted is mapped to a *query* vector $\bf{q} \in \mathbb{R}^{d_{attn}}$.

The function takes three arguments:
- Parameters:
    - $W_{q}$: `query_weights`
    - $b_{q}$: `query_bias`
- Vector of the token currently being predicted: `current_vector`

In [None]:
def query_mapping(current_vector, query_weights, query_bias):
    ...
    return query_vector

In [None]:
query_vector = query_mapping(current_vector, query_weights, query_bias)
query_vector

### 2. Key Mapping

Define a function that implements the following step:

2. The tokens in the context are mapped to *key* vectors $\bf{k}_{t} \in \mathbb{R}^{d_{attn}}$.
$$
\forall{t}: \bf{k}_t \leftarrow W_{k}e_{t} + b_{k}
$$

The function takes three arguments:
- Parameters:
    - $W_{k}$: `key_weights`
    - $b_{k}$: `key_bias`
- Context vectors: `context_vectors`

In [None]:
def key_mapping(context_vectors, key_weights, key_bias):
    ...
    return key_vectors

In [None]:
key_vectors = key_mapping(context_vectors, key_weights, key_bias)
key_vectors

### 3. Value Mapping
Define a function that implements the following step:

3. The tokens in the context are mapped to *value* vectors $\bf{v}_{t} \in \mathbb{R}^{d_{out}}$.
$$
\forall{t}: \bf{v}_t \leftarrow W_{v}e_{t} + b_{v}
$$

The function takes three arguments:
- Parameters:
    - $W_{v}$: `value_weights`
    - $b_{v}$: `value_bias`
- Context vectors: `context_vectors`

In [None]:
def value_mapping(context_vectors, value_weights, value_bias):
    ...
    return value_vectors

In [None]:
value_vectors = value_mapping(context_vectors, value_weights, value_bias)
value_vectors
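
The query, key and value mappings above all apply the same affine map $Wx + b$ with different parameters. A minimal sketch, assuming the vectors are stored as rows of a NumPy array; `linear_map` is a hypothetical helper name, not part of the tutorial.

```python
import numpy as np


def linear_map(vectors, weights, bias):
    # Hypothetical helper: applies W x + b to a single vector (1-D input)
    # or to every row of a matrix (2-D input) through broadcasting.
    return vectors @ weights.T + bias


W = np.arange(6, dtype=float).reshape(2, 3)  # maps R^3 to R^2
b = np.array([1.0, -1.0])
x = np.array([1.0, 1.0, 1.0])
X = np.stack([x, 2 * x])                     # two vectors stacked as rows

single = linear_map(x, W, b)   # W x + b = [4, 11]
batch = linear_map(X, W, b)    # one mapped vector per row
print(single)
print(batch)
```

With a helper like this, `query_mapping`, `key_mapping` and `value_mapping` would differ only in which weights and bias they pass in.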

### 4. Softmax
Define a function that implements the following step:

4. The inner products $\bf{q}^{T}\bf{k}_{t}$ are interpreted as the degree to which token $t \in V$ is important for predicting the current token $q$.
$$
\forall{t}: \alpha_t = \frac{\exp({\bf{q}^{T}\bf{k}_{t}/\sqrt{d_{attn}}})}{\sum_{u}{\exp({\bf{q}^{T}\bf{k}_{u}/\sqrt{d_{attn}}})}}
$$

The function takes three arguments: `query_vector`, `key_vectors` and `d_attn`.  
Note that the result of this step is the softmax of the scaled inner products.

First, define a function that computes the scaled inner product $\bf{q}^{T}\bf{k}_{t}/\sqrt{d_{attn}}$ between `query_vector` and one key vector.


In [None]:
def inner_product_query_key(query_vector, key_vector, d_attn):
    ...
    return alpha

In [None]:
alpha = inner_product_query_key(query_vector, key_vectors[0], d_attn)
alpha

Second, define a function that computes the scaled inner product between `query_vector` and every key vector.  
Use the `inner_product_query_key` function defined above.

In [None]:
def inner_product_query_keys(query_vector, key_vectors, d_attn):
    ...
    return alphas

The length of `alphas` should be equal to the length of `context_vectors`.  
In this tutorial it should be 3.

In [None]:
alphas = inner_product_query_keys(query_vector, key_vectors, d_attn)
alphas

Finally, write a softmax function that takes `alphas`.

In [None]:
def softmax(alphas):
    ...
    return scores

The sum of `scores` should be equal to 1.

In [None]:
scores = softmax(alphas)
sum(scores)
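
As a rough reference for this step, the sketch below computes all scaled inner products at once and then applies a softmax. It assumes the keys are stored as rows of a 2-D array; `scaled_scores` is a hypothetical name, and subtracting the maximum before exponentiating is a common stabilization trick rather than something the algorithm requires.

```python
import numpy as np


def scaled_scores(query_vector, key_vectors, d_attn):
    # One scaled inner product q^T k_t / sqrt(d_attn) per row of key_vectors.
    return key_vectors @ query_vector / np.sqrt(d_attn)


def softmax(alphas):
    alphas = np.asarray(alphas, dtype=float)
    # Shifting by the max leaves the result unchanged but avoids overflow in exp.
    exps = np.exp(alphas - alphas.max())
    return exps / exps.sum()


example_q = np.array([1.0, 0.0])
example_K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
example_scores = softmax(scaled_scores(example_q, example_K, d_attn=2))
print(example_scores, example_scores.sum())
```

The first and third keys have the same inner product with the query, so they receive the same weight, and the weights sum to 1.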

### 5. Final output
Define a function that implements the following step:

5. Derive a distribution over the context tokens, which is then used to combine the value vectors.
$$
\text{return } \tilde{\bf{v}}=\sum_{t=1}^{T}{\alpha_{t}\bf{v}_{t}}
$$

The function takes two arguments: `value_vectors` and `scores`.

In [None]:
def combine_value_score(value_vectors, scores):
    ...
    return outputs

The length of `outputs` should be equal to `d_out`.

In [None]:
outputs = combine_value_score(value_vectors, scores)
len(outputs), d_out

### Aggregate

Now, aggregate all functions we defined before.

In [None]:
def attention(
    current_vector, context_vectors, query_weights, key_weights, value_weights, query_bias, key_bias, value_bias, d_attn
):
    ...
    return outputs


In [None]:
attn_hidden = attention(current_vector, context_vectors, query_weights, key_weights, value_weights, query_bias, key_bias, value_bias, d_attn)
attn_hidden

# Masked Attention

> There are many ways the basic attention mechanism is used in transformers.
> - Bidirectional / unmasked self-attention
> - Unidirectional / masked self-attention
> - Cross-attention


- Algorithm 4

![img](../assets/algorithm_4.png)

In Algorithm 4, masked attention takes two inputs, $\bf{X} \in \mathbb{R}^{d_{X} \times l_{X}}$ and $\bf{Z} \in \mathbb{R}^{d_{Z} \times l_{Z}}$.
- $\bf{X}$ : vector representations of the primary sequence
- $\bf{Z}$ : vector representations of the context sequence

Its output is $\tilde{\bf{V}} \in \mathbb{R}^{d_{out} \times l_{X}}$, the updated representations of the tokens in $\bf{X}$, folding in information from the tokens in $\bf{Z}$, computed with the following parameters and hyperparameters:

**Parameters**
- $W_{q} \in \mathbb{R}^{d_{attn} \times d_{X}}, b_{q} \in \mathbb{R}^{d_{attn}}$
- $W_{k} \in \mathbb{R}^{d_{attn} \times d_{Z}}, b_{k} \in \mathbb{R}^{d_{attn}}$
- $W_{v} \in \mathbb{R}^{d_{out} \times d_{Z}}, b_{v} \in \mathbb{R}^{d_{out}}$

**Hyperparameters**
- $\text{Mask} \in \{0,1\}^{l_{Z}\times l_{X}}$


- Softmax function in matrix form

![img](../assets/softmax.png)

- Mask function in matrix form

![img](../assets/mask.png)


Masked attention works as follows:  
1. $\bf{Q} \leftarrow W_{q}\bf{X} + b_{q}\bf{1}^T$
2. $\bf{K} \leftarrow W_{k}\bf{Z} + b_{k}\bf{1}^T$
3. $\bf{V} \leftarrow W_{v}\bf{Z} + b_{v}\bf{1}^T$
4. $\bf{S} \leftarrow \bf{K}^{T}\bf{Q}$
5. $\forall{t_{Z},t_{X}} \text{ if }\neg\text{Mask}[t_{Z}, t_{X}] \text{ then } \bf{S}[t_{Z}, t_{X}] \leftarrow - \infty$
6. $\text{return } \tilde{\bf{V}}=\bf{V} \cdot \text{softmax}(\bf{S}/\sqrt{d_{attn}})$
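
In matrix form, the six steps can be sketched as follows. This sketch keeps the column convention of the algorithm (each column of $\bf{X}$ or $\bf{Z}$ is one token), computes keys and values from the context sequence $\bf{Z}$, and applies the softmax over each column of $\bf{S}$. It assumes every column of the mask contains at least one 1, since an all-masked column has no valid distribution. The function name and the causal example mask are illustrative assumptions.

```python
import numpy as np


def masked_attention_matrix(X, Z, mask, W_q, b_q, W_k, b_k, W_v, b_v):
    """X: (d_x, l_x), Z: (d_z, l_z), mask: (l_z, l_x) where 1 means 'may attend'."""
    d_attn = W_q.shape[0]
    Q = W_q @ X + b_q[:, None]                   # (d_attn, l_x)
    K = W_k @ Z + b_k[:, None]                   # (d_attn, l_z)
    V = W_v @ Z + b_v[:, None]                   # (d_out, l_z)
    S = K.T @ Q                                  # (l_z, l_x): S[t_z, t_x] = k_{t_z} . q_{t_x}
    S = np.where(mask.astype(bool), S, -np.inf)  # masked pairs get -inf
    S = S / np.sqrt(d_attn)
    S = S - S.max(axis=0, keepdims=True)         # stabilize each column before exp
    A = np.exp(S)
    A = A / A.sum(axis=0, keepdims=True)         # column-wise softmax over context tokens
    return V @ A                                 # (d_out, l_x)


# Assumed toy sizes; self-attention (Z = X) with a causal mask.
rng = np.random.default_rng(0)
d_x, d_attn, d_out, l_x = 4, 3, 2, 3
X = rng.normal(size=(d_x, l_x))
W_q, W_k = rng.normal(size=(d_attn, d_x)), rng.normal(size=(d_attn, d_x))
W_v = rng.normal(size=(d_out, d_x))
b_q, b_k, b_v = np.zeros(d_attn), np.zeros(d_attn), np.zeros(d_out)
causal_mask = np.triu(np.ones((l_x, l_x)))       # token t_x may attend to t_z <= t_x

V_tilde = masked_attention_matrix(X, X, causal_mask, W_q, b_q, W_k, b_k, W_v, b_v)
print(V_tilde.shape)  # (2, 3)
```

Under the causal mask the first token can only attend to itself, so the first output column equals its own value vector.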


## Implement
### Input Sample
Now we will implement the main masked attention logic step by step.  
Assume that we have matrices whose rows are words and whose columns are embedding dimensions, like below:

In [None]:
import numpy as np


current_matrix = np.array(
    [
        [1] * 10,
        [2] * 10,
        [3] * 10,
    ]
)
context_matrix = np.array(
    [
        [
            [2] * 10,
            [3] * 10,
            [4] * 10,
        ],
        [
            [3] * 10,
            [4] * 10,
            [5] * 10,
        ],
        [
            [4] * 10,
            [5] * 10,
            [6] * 10,
        ],
    ]
)


In [None]:
current_matrix

In [None]:
context_matrix

### 1. Query Mapping
Define a function that implements the following step:

$$
\bf{Q} \leftarrow W_{q}\bf{X} + b_{q}\bf{1}^T
$$

The function takes three arguments:
- Parameters:
    - $W_{q}$: `query_weights`
    - $b_{q}$: `query_bias`
- Matrix of the tokens currently being predicted: `current_matrix`

*Hint*: use `query_mapping` for each token and concatenate the results.

In [None]:
def masked_query_mapping(current_matrix, query_weights, query_bias):
    ...
    return query_matrix

In [None]:
query_matrix = masked_query_mapping(current_matrix, query_weights, query_bias)
query_matrix

`query_matrix[0]` and `query_vector` should be equal.

In [None]:
query_matrix[0], query_vector

### 2. Key Mapping

Define a function that implements the following step:

$$
\bf{K} \leftarrow W_{k}\bf{Z} + b_{k}\bf{1}^T
$$

The function takes three arguments:
- Parameters:
    - $W_{k}$: `key_weights`
    - $b_{k}$: `key_bias`
- Context matrix: `context_matrix`

*Hint*: use `key_mapping` for each token and concatenate the results.

In [None]:
def masked_key_mapping(context_matrix, key_weights, key_bias):
    ...
    return key_matrix

In [None]:
key_matrix = masked_key_mapping(context_matrix, key_weights, key_bias)
key_matrix

`key_matrix[0]` and `key_vectors` should be equal.

In [None]:
key_matrix[0], key_vectors

### 3. Value Mapping

Define a function that implements the following step:

$$
\bf{V} \leftarrow W_{v}\bf{Z} + b_{v}\bf{1}^T
$$

The function takes three arguments:
- Parameters:
    - $W_{v}$: `value_weights`
    - $b_{v}$: `value_bias`
- Context matrix: `context_matrix`

*Hint*: use `value_mapping` for each token and concatenate the results.

In [None]:
def masked_value_mapping(context_matrix, value_weights, value_bias):
    ...
    return value_matrix

In [None]:
value_matrix = masked_value_mapping(context_matrix, value_weights, value_bias)
value_matrix

`value_matrix[0]` and `value_vectors` should be equal.

In [None]:
value_matrix[0], value_vectors

### 4. Calculate Score
Define a function that implements the following step:

$$
\bf{S} \leftarrow \bf{K}^{T}\bf{Q}
$$

The function takes two arguments: `query_matrix` and `key_matrix`.  

Define a function that computes the inner products between each row of `query_matrix` and its corresponding key vectors in `key_matrix`.  
*Hint*: use `inner_product_query_keys` for each token with `d_attn=1` and concatenate the results. The scaling by $\sqrt{d_{attn}}$ is applied later, in the softmax step.

In [None]:
def inner_product_query_key_matrix(query_matrix, key_matrix):
    ...
    return alpha_matrix

In [None]:
alpha_matrix = inner_product_query_key_matrix(query_matrix, key_matrix)
alpha_matrix
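
As an alternative to looping over tokens, the unscaled scores can be computed in one call with `np.einsum`. This sketch assumes the shapes used in this tutorial: `query_matrix` is `(n_tokens, d_attn)` and `key_matrix` is `(n_tokens, n_context, d_attn)`, so each token has its own set of context keys. The small example arrays are made up for illustration.

```python
import numpy as np


def inner_product_query_key_matrix(query_matrix, key_matrix):
    # out[i, t] = dot(query_matrix[i], key_matrix[i, t]); no sqrt(d_attn) scaling here.
    return np.einsum("id,itd->it", query_matrix, key_matrix)


example_Q = np.array([[1.0, 0.0], [0.0, 2.0]])
example_K = np.array(
    [
        [[1.0, 1.0], [2.0, 0.0]],  # context keys for token 0
        [[1.0, 1.0], [0.0, 3.0]],  # context keys for token 1
    ]
)
example_alpha_matrix = inner_product_query_key_matrix(example_Q, example_K)
print(example_alpha_matrix)
# [[1. 2.]
#  [2. 6.]]
```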

### 5. Masking

Assume that there is a mask matrix below:

In [None]:
mask_matrix = np.array([
    [0]*len(context_matrix[0]),
    [0]*len(context_matrix[0]),
    [0]*len(context_matrix[0]),
])
mask_matrix[1, 1] = 1
mask_matrix[2, 2] = 1
mask_matrix

Define a function that implements the following step:

$$
\forall{t_{Z},t_{X}} \text{ if }\neg\text{Mask}[t_{Z}, t_{X}] \text{ then }\bf{S}[t_{Z}, t_{X}] \leftarrow - \infty
$$

The function takes two arguments: `alpha_matrix` and `mask_matrix`.

In [None]:
def mask_score(alpha_matrix, mask_matrix):
    ...
    return masked_alpha

In [None]:
masked_alpha = mask_score(alpha_matrix, mask_matrix)
masked_alpha
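
One way to write the masking step is with `np.where`, which keeps a score where the mask is 1 and substitutes $-\infty$ elsewhere. The sketch below is only one possible implementation, and the lower-triangular example mask is an assumption chosen so that every row keeps at least one entry.

```python
import numpy as np


def mask_score(alpha_matrix, mask_matrix):
    # Keep a score where the mask is 1; replace it with -inf where the mask is 0.
    return np.where(np.asarray(mask_matrix, dtype=bool), alpha_matrix, -np.inf)


example_alpha = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
example_mask = np.tril(np.ones((3, 3), dtype=int))  # hypothetical lower-triangular mask
example_masked = mask_score(example_alpha, example_mask)
print(example_masked)
print(np.exp(example_masked[0]))  # masked entries contribute exp(-inf) = 0
```

Because $\exp(-\infty) = 0$, masked positions receive zero weight after the softmax. If every entry that a softmax normalizes over is masked, the denominator is 0 and the result is NaN, so it is worth checking that each token keeps at least one unmasked position.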

### 6. Final output
Define a function that implements the following step:

$$
\text{return } \tilde{\bf{V}}=\bf{V} \cdot \text{softmax}(\bf{S}/\sqrt{d_{attn}})
$$

The function takes three arguments: `value_matrix`, `alpha_matrix` and `d_attn`.

First, calculate the softmax of `alpha_matrix`.  
*Hint*: use `softmax` for each token and concatenate the results.

In [None]:
def masked_softmax(alpha_matrix, d_attn):
    ...
    return scores

In [None]:
masked_scores = masked_softmax(alpha_matrix, d_attn)
masked_scores

`masked_scores[0]` and `scores` should be equal.

In [None]:
masked_scores[0], scores

Next combine `value_matrix` and `masked_scores`.

In [None]:
def combine_value_masked_score(value_matrix, masked_scores):
    ...
    return outputs

In [None]:
masked_outputs = combine_value_masked_score(value_matrix, masked_scores)
masked_outputs

`masked_outputs[0]` and `outputs` should be equal.

In [None]:
masked_outputs[0], outputs

### Aggregate

Now, aggregate all functions we defined before.

In [None]:
def masked_attention(
    current_matrix,
    context_matrix,
    mask_matrix,
    query_weights,
    key_weights,
    value_weights,
    query_bias,
    key_bias,
    value_bias,
    d_attn,
):
    ...
    return masked_outputs

In [None]:
masked_attn_hidden = masked_attention(
    current_matrix,
    context_matrix,
    mask_matrix,
    query_weights,
    key_weights,
    value_weights,
    query_bias,
    key_bias,
    value_bias,
    d_attn,
)
masked_attn_hidden

# Multi-Head Attention

- Algorithm 5

![img](../assets/algorithm_5.png)

In Algorithm 5, multi-head attention takes two inputs, $\bf{X} \in \mathbb{R}^{d_{X} \times l_{X}}$ and $\bf{Z} \in \mathbb{R}^{d_{Z} \times l_{Z}}$.
- $\bf{X}$ : vector representations of the primary sequence
- $\bf{Z}$ : vector representations of the context sequence

Its output is $\tilde{\bf{V}} \in \mathbb{R}^{d_{out} \times l_{X}}$, the updated representations of the tokens in $\bf{X}$, folding in information from the tokens in $\bf{Z}$, computed with the following parameters and hyperparameters:

**Parameters**
- For $h \in [H]$
    - $W_{q}^{h} \in \mathbb{R}^{d_{attn} \times d_{X}}, b_{q}^{h} \in \mathbb{R}^{d_{attn}}$
    - $W_{k}^{h} \in \mathbb{R}^{d_{attn} \times d_{Z}}, b_{k}^{h} \in \mathbb{R}^{d_{attn}}$
    - $W_{v}^{h} \in \mathbb{R}^{d_{mid} \times d_{Z}}, b_{v}^{h} \in \mathbb{R}^{d_{mid}}$
- $W_{o} \in \mathbb{R}^{d_{out} \times Hd_{mid}}, b_{o} \in \mathbb{R}^{d_{out}}$

**Hyperparameters**
- $H$, Number of attention heads
- $\text{Mask} \in \{0,1\}^{l_{Z}\times l_{X}}$

Multi-Head Attention works as follows:  
1. For $h \in [H]$  
        $\bf{Y}^{h} \gets \text{Attention}(\bf{X},\bf{Z}|\bf{W}_{qkv}^{h},\text{Mask})$
2. $\bf{Y} \gets [\bf{Y}^{1}; \bf{Y}^{2}; ...;\bf{Y}^{H}]$
3. $\text{return } \tilde{\bf{V}}=\bf{W}_{o}\bf{Y}+\bf{b}_{o}\bf{1}^T$
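
The three steps can be sketched compactly as below. This is an illustrative sketch in the column convention of the algorithm, not the tutorial's solution: each head is represented by a tuple of its six parameters, the helper `attention` is a condensed masked attention, and the sizes, the random parameters and the causal mask are assumptions.

```python
import numpy as np


def attention(X, Z, mask, W_q, b_q, W_k, b_k, W_v, b_v):
    # Column convention: X is (d_x, l_x), Z is (d_z, l_z), mask is (l_z, l_x).
    Q = W_q @ X + b_q[:, None]
    K = W_k @ Z + b_k[:, None]
    V = W_v @ Z + b_v[:, None]
    S = np.where(mask.astype(bool), K.T @ Q, -np.inf) / np.sqrt(W_q.shape[0])
    A = np.exp(S - S.max(axis=0, keepdims=True))   # column-wise stable softmax
    return V @ (A / A.sum(axis=0, keepdims=True))  # (d_mid, l_x)


def multi_head_attention(X, Z, mask, heads, W_o, b_o):
    # `heads` holds one (W_q, b_q, W_k, b_k, W_v, b_v) tuple per head.
    # Head outputs are stacked along the feature axis: Y is (H * d_mid, l_x).
    Y = np.concatenate([attention(X, Z, mask, *p) for p in heads], axis=0)
    return W_o @ Y + b_o[:, None]                  # (d_out, l_x)


rng = np.random.default_rng(0)
H, d_x, d_attn, d_mid, d_out, l_x = 4, 10, 10, 5, 10, 3
X = rng.normal(size=(d_x, l_x))
mask = np.triu(np.ones((l_x, l_x)))  # token t_x may attend to t_z <= t_x
heads = [
    (
        rng.normal(size=(d_attn, d_x)), np.zeros(d_attn),
        rng.normal(size=(d_attn, d_x)), np.zeros(d_attn),
        rng.normal(size=(d_mid, d_x)), np.zeros(d_mid),
    )
    for _ in range(H)
]
W_o, b_o = rng.normal(size=(d_out, H * d_mid)), np.zeros(d_out)

mha_out = multi_head_attention(X, X, mask, heads, W_o, b_o)
print(mha_out.shape)  # (10, 3)
```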


## Multi-head Weight matrix and bias

To implement Algorithm 5, we need $H$ sets of weight matrices and bias vectors as follows:

- For $h \in [H]$:
    1. Query
        - $W_{q}^{h} \in \mathbb{R}^{d_{attn} \times d_{X}}$
        - $b_{q}^{h} \in \mathbb{R}^{d_{attn}}$
    2. Key
        - $W_{k}^{h} \in \mathbb{R}^{d_{attn} \times d_{Z}}$
        - $b_{k}^{h} \in \mathbb{R}^{d_{attn}}$
    3. Value
        - $W_{v}^{h} \in \mathbb{R}^{d_{mid} \times d_{Z}}$
        - $b_{v}^{h} \in \mathbb{R}^{d_{mid}}$
- Combining weight and bias
    - $W_{o} \in \mathbb{R}^{d_{out} \times Hd_{mid}}$
    - $b_{o} \in \mathbb{R}^{d_{out}}$

To generate these sets of weights and bias vectors we need four arguments: `n_head`, `d_attn`, `d_in` and `d_out`.  

First, define a function that generates one set of query, key and value weights and biases.

*Hint*: use `generate_weight_bias` for each head and concatenate the results.

In [None]:
def head_weights(d_attn, d_in, d_out):
    ...
    return query_dict, key_dict, value_dict

Next, define a function that generates $H$ sets of query, key and value weights and biases.

In [None]:
def multi_head_weights(n_head, d_attn, d_in, d_out):
    ...
    return query_weight_dict, key_weight_dict, value_weight_dict

In [None]:
n_head = 4
d_attn = 10
d_in = 10
d_out = 10
query_weight_dict, key_weight_dict, value_weight_dict = multi_head_weights(n_head, d_attn, d_in, d_out)
query_weight_dict, key_weight_dict, value_weight_dict


Finally, generate the combining weights and bias.  
Their input dimension is `n_head` * `d_out`.

In [None]:
combine_weights, combine_bias = generate_weight_bias(n_head*d_out, d_out)

In [None]:
combine_weights.shape

## Implement

### 1. Calculate attention score for each head

Define a function that implements the following step:

For $h \in [H]$  
    $\bf{Y}^{h} \gets \text{Attention}(\bf{X},\bf{Z}|\bf{W}_{qkv}^{h},\text{Mask})$

*Hint*: use `masked_attention` for each head and concat results.

In [None]:
def multi_head_masked_attention(
    current_matrix,
    context_matrix,
    mask_matrix,
    query_weight_dict,
    key_weight_dict,
    value_weight_dict,
    d_attn,
    n_head,
):
    ...
    return multi_head_outputs

In [None]:
multi_head_outputs = multi_head_masked_attention(
    current_matrix,
    context_matrix,
    mask_matrix,
    query_weight_dict,
    key_weight_dict,
    value_weight_dict,
    d_attn,
    n_head,
)
multi_head_outputs

The length of `multi_head_outputs` is equal to `n_head`.

In [None]:
len(multi_head_outputs)

In [None]:
n_head

### 2. Concat the result of each head

Define a function that implements the following step:

$$
\bf{Y} \gets [\bf{Y}^{1}; \bf{Y}^{2}; ...;\bf{Y}^{H}]
$$

Concat the result of each head.

In [None]:
def concat_multi_head(multi_head_outputs):
    ...
    return concat_multi_head_outputs

In [None]:
concat_multi_head_outputs = concat_multi_head(multi_head_outputs)
concat_multi_head_outputs

The shape of `concat_multi_head_outputs` must be (`len(current_matrix)`, `n_head` * `d_out`).

In [None]:
concat_multi_head_outputs.shape

In [None]:
len(current_matrix), n_head*d_out

### 3. Combine concatenated result with weight and bias

Define a function that implements the following step:

$$
\text{return } \tilde{\bf{V}}=\bf{W}_{o}\bf{Y}+\bf{b}_{o}\bf{1}^T
$$

Apply the combining weights and bias to the concatenated result.

In [None]:
def combine_multi_head_outputs(concat_multi_head_outputs, combine_weights, combine_bias):
    ...
    return combined_outputs

In [None]:
combined_outputs = combine_multi_head_outputs(concat_multi_head_outputs, combine_weights, combine_bias)
combined_outputs
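
For reference, the last two steps can be sketched in the row convention used by the code cells (one token per row). The sketch assumes that `multi_head_outputs` is a list of `n_head` arrays of shape `(n_tokens, d_out)` and that `combine_weights` has shape `(d_out, n_head * d_out)`. The constant example arrays are made up so the result is easy to verify by hand.

```python
import numpy as np


def concat_multi_head(multi_head_outputs):
    # Join the per-head outputs along the feature axis: (n_tokens, n_head * d_out).
    return np.concatenate(multi_head_outputs, axis=-1)


def combine_multi_head_outputs(concat_multi_head_outputs, combine_weights, combine_bias):
    # Row-wise version of W_o Y + b_o 1^T: one combined vector per token.
    return concat_multi_head_outputs @ combine_weights.T + combine_bias


example_n_head, example_n_tokens, example_d_out = 4, 3, 10
example_heads = [
    np.full((example_n_tokens, example_d_out), float(h)) for h in range(example_n_head)
]
example_concat = concat_multi_head(example_heads)
example_W = np.ones((example_d_out, example_n_head * example_d_out))
example_b = np.zeros(example_d_out)
example_combined = combine_multi_head_outputs(example_concat, example_W, example_b)
print(example_concat.shape, example_combined.shape)  # (3, 40) (3, 10)
```

With all-ones weights each output entry is the sum of a concatenated row, here 10 * (0 + 1 + 2 + 3) = 60.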