# Hopfield - Self-attention
The update rule derived from the new energy function (notebook [3_hopfield-continuous-value.ipynb](./3_hopfield-continuous-value.ipynb)) is equivalent to the self-attention mechanism of transformer networks.

References:
* https://ml-jku.github.io/hopfield-layers/#update

Starting from the update rule for a single state pattern $\xi$:
$$
\xi^{new} = X\mathrm{softmax}(\beta X^T \xi)
$$

For $S$ state patterns $\Xi=(\xi_1,...,\xi_S)$, the equation can be generalized to:
$$
\Xi^{\mathrm{new}} = X\mathrm{softmax}(\beta X^T\Xi)
$$
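
As a minimal sketch of the generalized update above, assuming NumPy and patterns stored as columns, the following applies one update step to several state patterns at once. The function names `softmax` and `hopfield_update`, the sizes, and the value of `beta` are illustrative choices, not taken from the notebook.

```python
import numpy as np

def softmax(z, axis=0):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_update(X, Xi, beta):
    """One update step: Xi_new = X softmax(beta X^T Xi).

    X  has shape (d, N): N stored patterns as columns.
    Xi has shape (d, S): S state patterns as columns.
    """
    # The softmax runs over the N stored patterns, i.e. down each column.
    return X @ softmax(beta * (X.T @ Xi), axis=0)

rng = np.random.default_rng(0)
d, N, S = 16, 5, 3                      # hypothetical sizes
X = rng.standard_normal((d, N))
# State patterns: noisy copies of the first S stored patterns.
Xi = X[:, :S] + 0.1 * rng.standard_normal((d, S))

Xi_new = hopfield_update(X, Xi, beta=8.0)
print(Xi_new.shape)
```

Because the softmax is applied column-wise, each state pattern is updated independently of the others, so the matrix form is just the single-pattern update applied to every column.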

Here $X^T$ can be viewed as $N$ *raw **stored** patterns* $Y=(y_1,...,y_N)^T$, which are mapped to an associative space via $W_K$, and $\Xi^T$ as $S$ *raw **state** patterns* $R=(r_1,...,r_S)^T$, which are mapped to an associative space via $W_Q$.

Then, by setting:
$$
Q = \Xi^T = RW_Q \\
K = X^T = YW_K \\
\beta = \frac{1}{\sqrt{d_k}}
$$

we obtain:
$$
(Q^{\mathrm{new}})^T = K^T \mathrm{softmax}(\frac{1}{\sqrt{d_k}}KQ^T)
$$

Here $W_Q$ and $W_K$ are the matrices that map the respective patterns into the associative space. In the previous equation, the softmax is applied column-wise to the matrix $KQ^T$. Transposing both sides, which means the softmax is now applied row-wise to the transposed input $QK^T$, we obtain:

$$
Q^{\mathrm{new}} = \mathrm{softmax}(\frac{1}{\sqrt{d_k}}QK^T)K
$$

Now, projecting $Q^{new}$ via another projection matrix $W_V$, and writing $V = KW_V$, we obtain:

$$
Z = Q^{new}W_V = \mathrm{softmax}(\frac{1}{\sqrt{d_k}}QK^T)KW_V = \mathrm{softmax}(\frac{1}{\sqrt{d_k}}QK^T)V
$$

This is the transformer attention formula from *Attention Is All You Need*:
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V
$$
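
The equivalence can be checked numerically. The sketch below, assuming NumPy and randomly drawn matrices with hypothetical sizes, computes the result once in the Hopfield form (patterns as columns, column-wise softmax) and once in the attention form (patterns as rows, row-wise softmax) and compares the two. All variable names and dimensions are illustrative.

```python
import numpy as np

def softmax(z, axis):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, S, d_in, d_k, d_v = 6, 4, 10, 8, 5    # hypothetical sizes

Y = rng.standard_normal((N, d_in))       # raw stored patterns, one per row
R = rng.standard_normal((S, d_in))       # raw state patterns, one per row
W_K = rng.standard_normal((d_in, d_k))
W_Q = rng.standard_normal((d_in, d_k))
W_V = rng.standard_normal((d_k, d_v))

Q = R @ W_Q                              # Q = Xi^T
K = Y @ W_K                              # K = X^T
beta = 1.0 / np.sqrt(d_k)

# Hopfield form: (Q_new)^T = K^T softmax(beta K Q^T), softmax over columns.
Q_new_T = K.T @ softmax(beta * (K @ Q.T), axis=0)
Z_hopfield = Q_new_T.T @ W_V

# Attention form: softmax(Q K^T / sqrt(d_k)) V with V = K W_V, softmax over rows.
V = K @ W_V
Z_attention = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

print(np.allclose(Z_hopfield, Z_attention))
```

The two results agree because transposing a column-wise softmax of $KQ^T$ gives the row-wise softmax of $QK^T$.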

Some remarks:
* Transformer-based models usually implement embedding layers before the attention mechanism, i.e., what is fed into the attention mechanism is an embedding of the inputs/outputs.
    * These embeddings are produced by matrices that are learned during training.
* In the new Hopfield definition, the embeddings and matrices are explicit in the formula, i.e., $W_Q$, $W_K$, and $W_V$ are the matrices that transform the inputs/outputs into the associative space that is fed to the attention mechanism.
* One difference between the original attention and the Hopfield formulation is the value of the $\beta$ parameter. The original attention fixes it to $\beta = 1/\sqrt{d_k}$, so a large embedding dimension $d_k$ yields a small $\beta$. As explained in the new Hopfield paper, a small $\beta$ means the retrievals tend to be metastable states, i.e., averages of similar patterns, which gives an intuition of why the mechanism works and why it is called "attention".
* The new Hopfield definition can be interpreted as a generalization of the attention mechanism.
* The result of the retrieval, which is the attention produced from the state patterns against the stored patterns, can be the input to fully connected layers for some classification task.
* Similarly, before the attention mechanism, there can be other feature extraction layers, such as CNNs, that produce vectors to which the store/retrieval process can be applied.
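
The remark about $\beta$ can be illustrated with a small experiment. This is a sketch under stated assumptions (NumPy, random Gaussian patterns, hypothetical sizes and $\beta$ values), not a reproduction of the paper's results: it prints the softmax weights that a noisy query assigns to the stored patterns for increasing $\beta$.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, N = 64, 4                                  # hypothetical sizes
X = rng.standard_normal((d, N))               # stored patterns as columns
xi = X[:, 0] + 0.1 * rng.standard_normal(d)   # noisy copy of the first pattern

betas = [0.0, 1.0 / np.sqrt(d), 1.0]          # illustrative values
max_weights = []
for beta in betas:
    p = softmax(beta * (X.T @ xi))            # weights over the N stored patterns
    max_weights.append(p.max())
    print(f"beta={beta:.3f}  weights={np.round(p, 3)}")
```

With $\beta = 0$ the weights are exactly uniform, so the retrieval $Xp$ is the plain average of all stored patterns. As $\beta$ grows, the largest weight can only increase, so the retrieval moves from a mixture of patterns toward a single stored pattern.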