# Attention mechanism in language models

Attention is the core building block of transformer language models, whether encoder or decoder. It lets the model process sequential data, and because the computation is not recurrent, all time steps can be processed in parallel.

The vanilla scaled dot-product attention receives input data $X \in \mathbb{R}^{T \times e}$, where $T$ is the sequence length and $e$ the embedding dimension. The input is distributed to attention heads of dimension $h$. Usually the embedding dimension is a multiple of the head dimension, e.g., $e = 12, h = 4$ gives three heads. Single-head attention is the special case of multi-head attention with $h = e$. For simplicity, assume a single head in what follows.
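As a small illustration of the head bookkeeping, the sketch below splits an embedding of size $e = 12$ into three heads of size $h = 4$. Slicing the columns of $X$ directly is only one simple reading of "distributed to attention heads"; many implementations project first and then split the projected queries, keys, and values. The helper name `split_heads` is made up for this example.

```python
import numpy as np

def split_heads(X, h):
    """Split the columns of X (shape T x e) into e // h heads of width h."""
    T, e = X.shape
    if e % h != 0:
        raise ValueError("embedding dimension must be a multiple of the head dimension")
    n_heads = e // h
    # (T, e) -> (T, n_heads, h) -> (n_heads, T, h)
    return X.reshape(T, n_heads, h).transpose(1, 0, 2)

X = np.arange(5 * 12, dtype=float).reshape(5, 12)  # T = 5, e = 12
heads = split_heads(X, h=4)
print(heads.shape)  # (3, 5, 4)
```

Choosing `h=12` returns a single head that is identical to `X`, which is the single-head case used in the rest of this section.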

The attention mechanism depends on three parameter matrices, $W^Q, W^K, W^V \in \mathbb{R}^{h \times h}$ (recall that $h = e$ in the single-head case). These create the queries, keys, and values per head: $Q = X W^Q, K = X W^K, V = X W^V$. From these we compute the attention scores:

$$
\frac{QK^T}{\sqrt{h}}
$$

To determine the attention weights $S$, we choose between regular and masked attention. Regular attention applies the softmax function to each row of the scores:

$$
S = \text{softmax} \left(\frac{QK^T}{\sqrt{h}}\right)
$$

The attention weights are multiplied with the values, which defines the scaled dot-product attention:

$$
A \left(X,  W^Q, W^K, W^V \right) = \text{softmax} \left(\frac{QK^T}{\sqrt{h}}\right) V = S V
$$

For instance, with observations of two variables over the past five days, it looks like this:

In [31]:
import numpy as np

def softmax(x, axis=0):
    """
    Compute the softmax function along the specified axis.

    Parameters:
    - x: np.ndarray, input array.
    - axis: int, axis along which softmax is computed (0 for column-wise, 1 for row-wise).

    Returns:
    - np.ndarray, softmax-transformed array.
    """
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))  # Subtract max for numerical stability
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)

np.random.seed(42)
X = np.random.randn(5, 2)
W_Q, W_K, W_V = np.random.randn(2, 2), np.random.randn(2, 2), np.random.randn(2, 2)
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
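h = X.shape[1]  # single head: the head dimension equals the embedding dimension
AS = (Q @ K.T) / np.sqrt(h)  # scaled attention scores, shape (5, 5)
S = np.round(softmax(AS, axis=1), 3)  # row-wise softmax, rounded for display

print("="*75)
print("Attention weights:")
print("="*75)
print(S)

print("="*75)
print("Values:")
print("="*75)
print(V)

print("="*75)
print("Attention output:")
print("="*75)
print(S @ V)  # computed from the rounded weights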


Attention weights:
[[0.174 0.252 0.136 0.29  0.148]
 [0.263 0.089 0.118 0.481 0.049]
 [0.181 0.2   0.221 0.144 0.253]
 [0.134 0.144 0.044 0.651 0.028]
 [0.248 0.119 0.278 0.151 0.204]]
Values:
[[-0.65367531 -0.67029443]
 [ 1.64411005 -1.25859697]
 [-0.13054564  0.38355825]
 [-0.30917349 -2.40359668]
 [ 1.22149651  0.54054321]]
Attention output:
[[ 0.37394319 -0.99867639]
 [-0.12985432 -1.37268608]
 [ 0.44617382 -0.4976368 ]
 [-0.02365469 -1.80378708]
 [ 0.19974602 -0.46204915]]


Each row of the attention weights tells us how the time steps are weighted for one output row. For example, the output of the attention layer at $t=1$ is a weighted average of the values at all time steps, with the highest weights on $t=4$ (weight $= 0.290$) and $t=2$ (weight $= 0.252$).
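The weighted-average reading can be checked numerically. The sketch below recomputes the unmasked example with the same seed and confirms that every row of weights sums to one and that the first output row equals the weight-by-weight sum of the value rows. It skips the rounding used for display, so the numbers can differ from the printout in the last digits; `softmax_rows` is a helper defined just for this check.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

np.random.seed(42)
X = np.random.randn(5, 2)
W_Q, W_K, W_V = np.random.randn(2, 2), np.random.randn(2, 2), np.random.randn(2, 2)
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

S = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))  # unrounded attention weights
out = S @ V

# Each row of S is a set of non-negative weights over the five time steps.
print(np.allclose(S.sum(axis=1), 1.0))  # True

# Output at t=1 (row 0) written out as an explicit weighted average of value rows.
manual = sum(S[0, j] * V[j] for j in range(5))
print(np.allclose(out[0], manual))  # True
```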

Thus, at every point in time, all observations are aggregated, including future ones. To respect the temporal order, it may be better to use masked self-attention. For this, we replace all entries above the main diagonal of $\frac{QK^T}{\sqrt{h}}$ with a very large negative number before applying the row-wise softmax. This drives all attention weights above the diagonal to (effectively) zero. As a consequence, the weighted average at each time step is built from the current and past observations only.

In [33]:
small_value = -1e9  # large negative score, so the softmax gives these entries ~0 weight

np.random.seed(42)
X = np.random.randn(5, 2)
W_Q, W_K, W_V = np.random.randn(2, 2), np.random.randn(2, 2), np.random.randn(2, 2)
Q = X @ W_Q
K = X @ W_K
V = X @ W_V
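h = X.shape[1]  # single head: the head dimension equals the embedding dimension
AS = (Q @ K.T) / np.sqrt(h)  # scaled attention scores, shape (5, 5)
upper_tri_indices = np.triu_indices_from(AS, k=1)  # entries strictly above the diagonal
AS[upper_tri_indices] = small_value  # mask out future time steps
S = np.round(softmax(AS, axis=1), 3)  # row-wise softmax, rounded for display

print("="*75)
print("Attention weights:")
print("="*75)
print(S)

print("="*75)
print("Values:")
print("="*75)
print(V)

print("="*75)
print("Attention output:")
print("="*75)
print(S @ V)  # computed from the rounded weights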

Attention weights:
[[1.    0.    0.    0.    0.   ]
 [0.748 0.252 0.    0.    0.   ]
 [0.301 0.332 0.367 0.    0.   ]
 [0.138 0.148 0.045 0.669 0.   ]
 [0.248 0.119 0.278 0.151 0.204]]
Values:
[[-0.65367531 -0.67029443]
 [ 1.64411005 -1.25859697]
 [-0.13054564  0.38355825]
 [-0.30917349 -2.40359668]
 [ 1.22149651  0.54054321]]
Attention output:
[[-0.65367531 -0.67029443]
 [-0.0746334  -0.81854667]
 [ 0.30117802 -0.47884694]
 [-0.05959053 -1.86951904]
 [ 0.19974602 -0.46204915]]
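One way to convince oneself that the mask really respects temporal order is to perturb the last observation and check that the outputs at earlier time steps do not move. The sketch below does this with random stand-in data. `causal_attention` is a helper written for this example, and it uses `-np.inf` instead of `-1e9` for the masked scores, which gives exactly zero weights.

```python
import numpy as np

def causal_attention(X, W_Q, W_K, W_V):
    """Single-head masked self-attention; returns (weights, output)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Mask every entry above the main diagonal, i.e. every future time step.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights, weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
W_Q, W_K, W_V = (rng.normal(size=(2, 2)) for _ in range(3))

S, out = causal_attention(X, W_Q, W_K, W_V)

X_changed = X.copy()
X_changed[-1] += 10.0  # alter only the most recent observation
_, out_changed = causal_attention(X_changed, W_Q, W_K, W_V)

print(np.allclose(np.triu(S, k=1), 0.0))        # True: no weight on future steps
print(np.allclose(out[:-1], out_changed[:-1]))  # True: earlier outputs are unchanged
```

Only the last output row changes, because it is the only row allowed to look at the perturbed observation.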


To me, this seems to be a natural candidate for a trainable mechanism that learns how to combine and weight features over time. By design, the weighting is dynamic, i.e., it depends on the observations in the input $X$. Different values in the time sequence therefore lead to different weights.
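The input dependence can be made concrete: with the parameters held fixed, two different input sequences produce two different weight matrices. A minimal sketch with random stand-in parameters (the helper `attention_weights` is defined only for this illustration):

```python
import numpy as np

def attention_weights(X, W_Q, W_K):
    """Row-wise softmax of the scaled scores for a single head (no mask)."""
    Q, K = X @ W_Q, X @ W_K
    scores = Q @ K.T / np.sqrt(K.shape[1])
    z = np.exp(scores - scores.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
W_Q, W_K = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))  # fixed parameters

X_a = rng.normal(size=(5, 2))  # one input sequence
X_b = rng.normal(size=(5, 2))  # a different input sequence

S_a = attention_weights(X_a, W_Q, W_K)
S_b = attention_weights(X_b, W_Q, W_K)

print(np.round(S_a, 3))
print(np.round(S_b, 3))
print(np.allclose(S_a, S_b))  # False: same parameters, different weighting
```

This is the difference to a fixed weighting scheme such as a moving average, where the weights would be the same for every input.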

In my understanding, $W^Q$ and $W^K$ are the parameters that learn how to weight, while $W^V$ gives the model flexibility by transforming the original input $X$.


# Model development

With the attention mechanism in mind, I tried to develop a model that receives time series data and produces predictions. The architecture is designed in analogy to a plain decoder-only language model.

It combines several building blocks:

* FeatureEmbeddings
* PositionalEmbeddings
* FeatureMultiHeadAttention
* FeatureAttentionOutput
* FeatureEngineering

Starting with $T$ observations of, e.g., OHLC-Volume data ($p = 5$ features), the input dimension is:

$$
X \in \mathbb{R}^{T \times p}
$$

The FeatureEmbeddings layer is a single dense layer, optionally followed by a non-linear activation function:

$$
\tilde{X} = g(X W)
$$

with $W \in \mathbb{R}^{p \times h}$ and $g$ an arbitrary activation function. The purpose of this layer is to transform the $p$ input features into an $h$-dimensional feature space (here $h$ takes the role of the embedding dimension $e$ from the first section). This can be seen as the starting point of feature engineering.

To this we add positional embeddings $P \in \mathbb{R}^{T \times h}$, a matrix of trainable parameters with one row per time step, which helps the model learn what it means for an observation to occur at a specific time step. The embedding matrix $E$ is given by:

$$
E = \tilde{X} + P
$$
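The two embedding steps can be sketched in a few lines of NumPy. The choices $g = \tanh$, $T = 20$, and $h = 8$ are assumptions made for illustration, the random arrays stand in for trained parameters, and $P$ is taken to have one row per time step so that it can be added to $\tilde{X}$.

```python
import numpy as np

def feature_embeddings(X, W, g=np.tanh):
    """Map p input features per time step to h embedded features: g(X W)."""
    return g(X @ W)

rng = np.random.default_rng(0)
T, p, h = 20, 5, 8                       # 20 time steps, 5 OHLC-Volume features, 8 embedded features
X = rng.normal(size=(T, p))              # placeholder for standardised OHLCV rows
W = rng.normal(scale=0.1, size=(p, h))   # stand-in for the trainable embedding weights
P = rng.normal(scale=0.02, size=(T, h))  # stand-in for the trainable positional embeddings

X_tilde = feature_embeddings(X, W)       # shape (T, h)
E = X_tilde + P                          # row t receives the embedding of position t

print(X_tilde.shape, E.shape)  # (20, 8) (20, 8)
```

Because $P$ has a fixed number of rows, this sketch only works for sequences of exactly $T$ time steps.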

This matrix enters an attention layer, which applies masked self-attention and passes the result through another dense output layer that combines what the individual attention heads have learned:

$$
\text{concat} \left( A \left(E_1, W_1^Q, W_1^K, W_1^V \right), \dots, A \left(E_{n_h}, W_{n_h}^Q, W_{n_h}^K, W_{n_h}^V \right) \right) W^O
$$
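A minimal NumPy sketch of this multi-head step follows. It assumes that $E_i$ denotes the $i$-th block of columns of $E$ (one block per head), which is my reading of the notation rather than something stated explicitly, and it uses random stand-ins for all parameters. The function names are made up for the example.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax with max-subtraction for stability."""
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def masked_head(E_i, W_Q, W_K, W_V):
    """Masked scaled dot-product attention for one head."""
    Q, K, V = E_i @ W_Q, E_i @ W_K, E_i @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future time steps
    return softmax_rows(scores) @ V

def multi_head_attention(E, head_params, W_O):
    """Split E column-wise into heads, attend per head, concatenate, project with W_O."""
    pieces = np.split(E, len(head_params), axis=1)  # E_1, ..., E_{n_h}
    heads = [masked_head(E_i, *params) for E_i, params in zip(pieces, head_params)]
    return np.concatenate(heads, axis=1) @ W_O

rng = np.random.default_rng(0)
T, h, n_h = 20, 8, 2
d = h // n_h  # width of each head
E = rng.normal(size=(T, h))
head_params = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(n_h)]
W_O = rng.normal(size=(h, h))

out = multi_head_attention(E, head_params, W_O)
print(out.shape)  # (20, 8)
```

The projection with $W^O$ acts on each time step separately, so the output still depends only on current and past observations.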

On top, we place a neural network with a hidden and an output layer. This gives the model flexibility to adjust the learned feature representations after the attention step. Together, these last steps define a full attention layer consisting of:

* attention
* output
* fully connected feature engineering

This layer can be stacked multiple times.
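The three steps and the stacking can be sketched as follows. To keep the example short it uses a single head, omits biases, and assumes a ReLU in the hidden layer; none of these choices are taken from the actual model, and all weights are random stand-ins.

```python
import numpy as np

def masked_attention(E, W_Q, W_K, W_V):
    """Single-head masked scaled dot-product attention."""
    Q, K, V = E @ W_Q, E @ W_K, E @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

def init_layer(rng, h, hidden):
    """Random stand-ins for the trainable parameters of one layer."""
    shapes = {"W_Q": (h, h), "W_K": (h, h), "W_V": (h, h), "W_O": (h, h),
              "W_1": (h, hidden), "W_2": (hidden, h)}
    return {name: rng.normal(scale=0.3, size=shape) for name, shape in shapes.items()}

def attention_layer(E, p):
    a = masked_attention(E, p["W_Q"], p["W_K"], p["W_V"])  # attention
    o = a @ p["W_O"]                                        # output
    return np.maximum(o @ p["W_1"], 0.0) @ p["W_2"]         # hidden layer (ReLU) + output layer

rng = np.random.default_rng(0)
T, h, hidden, n_layers = 20, 8, 16, 3
layers = [init_layer(rng, h, hidden) for _ in range(n_layers)]

E = rng.normal(size=(T, h))
for p in layers:
    E = attention_layer(E, p)  # shape stays (T, h), so layers can be chained
print(E.shape)  # (20, 8)
```

Stacking works because every layer maps a $T \times h$ matrix to another $T \times h$ matrix.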

Finally, the model maps the result to the desired output dimension, e.g., one for binary classification or regression. This is done by a single linear layer. For every input sequence, the representation of the last time step is used, as it is the only one that can attend to the full sequence.
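A minimal sketch of this final step, assuming an output dimension of one and using random placeholders for the layer output and the weights. The sigmoid is only relevant if the single output is read as a logit for binary classification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h, n_out = 20, 8, 1
H = rng.normal(size=(T, h))          # placeholder for the output of the last attention layer
W_out = rng.normal(size=(h, n_out))  # stand-in for the trainable output weights
b_out = np.zeros(n_out)              # stand-in for the trainable bias

last_step = H[-1]                    # only this row has attended to the full sequence
logit = last_step @ W_out + b_out    # regression output, or a logit for binary classification
prob = 1.0 / (1.0 + np.exp(-logit))  # sigmoid, only needed for classification

print(logit.shape, prob.shape)  # (1,) (1,)
```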