# Overview

In the [previous notebook](https://www.kaggle.com/code/aisuko/coding-the-multi-head-attention/notebook), we set $d_{q}=d_{k}=24$ and $d_{v}=28$. In other words, we used the same dimension for the queries and keys. While the value matrix $W_{v}$ is often chosen to have the same dimension as the query and key matrices (such as in PyTorch's `MultiheadAttention` class), we can select an arbitrary size for the value dimension.

Since the dimensions are sometimes a bit tricky to keep track of, let's summarize everything we have covered so far in the figure below, which depicts the various tensor sizes for a single attention head.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/819/100/204/990/207/original/09d504e16eb77ccc.png" width="80%" heigh="80%" alt="Single-Attention-head"></div>

Now, the illustration above corresponds to the self-attention mechanism used in transformers. One particular flavor of this attention mechanism we have yet to discuss is cross-attention.


<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/819/117/595/682/670/original/ae132461563b60ef.png" width="80%" heigh="80%" alt="Cross-attention in transformers"></div>

# Cross-attention

> What is cross-attention, and how does it differ from self-attention?

In self-attention, we work with the same input sequence. In cross-attention, we mix or combine two **different** input sequences. In the case of the original transformer architecture above, that's the sequence returned by the encoder module on the left and the input sequence being processed by the decoder part on the right.

Note that in cross-attention, the two input sequences $x_{1}$ and $x_{2}$ can have different numbers of elements. However, their embedding dimensions must match.
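To keep track of the shapes, here is a sketch of the tensor sizes involved. It assumes the scaled dot-product formulation and the weight layout used in the code of this notebook ($W_{q} \in \mathbb{R}^{d_{q} \times d}$, and likewise for $W_{k}$ and $W_{v}$). The symbols $n_{1}$ and $n_{2}$ (the lengths of $x_{1}$ and $x_{2}$), $d$ (the shared embedding dimension), $A$, and $Z$ are introduced here for illustration only.

$$
\begin{aligned}
Q &= x_{1} W_{q}^{\top} \in \mathbb{R}^{n_{1} \times d_{q}}, \\
K &= x_{2} W_{k}^{\top} \in \mathbb{R}^{n_{2} \times d_{k}}, \\
V &= x_{2} W_{v}^{\top} \in \mathbb{R}^{n_{2} \times d_{v}}, \\
A &= \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) \in \mathbb{R}^{n_{1} \times n_{2}}, \\
Z &= A V \in \mathbb{R}^{n_{1} \times d_{v}}.
\end{aligned}
$$

The product $Q K^{\top}$ requires $d_{q}=d_{k}$, while $n_{1}$ and $n_{2}$ are free to differ. The output $Z$ has one row per element of $x_{1}$ and $d_{v}$ columns.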

The figure below illustrates the concept of cross-attention. If we set $x_{1}=x_{2}$, this is equivalent to self-attention.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/819/188/074/548/534/original/4c78ecc1a1d67280.png" width="80%" heigh="80%" alt="Cross-attention"></div>

**Note: The queries usually come from the decoder, and the keys and values usually come from the encoder.**
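As a minimal sketch of the idea, the function below computes scaled dot-product attention with queries taken from one sequence and keys and values taken from another. The function name `cross_attention`, the random stand-in inputs, and the sequence lengths 13 and 8 are assumptions made for illustration; the weight shapes follow the convention used in this notebook. Passing the same sequence twice should reproduce ordinary self-attention, which the last lines check against a reference written in the style of the earlier notebooks.

```python
import torch

def cross_attention(x_1, x_2, W_query, W_key, W_value):
    """Scaled dot-product attention: queries from x_1, keys and values from x_2."""
    queries = x_1 @ W_query.T                 # (n_1, d_q)
    keys = x_2 @ W_key.T                      # (n_2, d_k)
    values = x_2 @ W_value.T                  # (n_2, d_v)
    d_k = keys.shape[-1]
    scores = queries @ keys.T / d_k ** 0.5    # (n_1, n_2)
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ values                   # (n_1, d_v)

torch.manual_seed(123)
d, d_q, d_k, d_v = 16, 24, 24, 28
W_query = torch.rand(d_q, d)
W_key = torch.rand(d_k, d)
W_value = torch.rand(d_v, d)

# Random stand-ins for two embedded sequences of different lengths
x_1 = torch.rand(13, d)
x_2 = torch.rand(8, d)

context = cross_attention(x_1, x_2, W_query, W_key, W_value)
print(context.shape)  # torch.Size([13, 28])

# Special case x_1 = x_2: compare with a separately written self-attention
self_context = cross_attention(x_1, x_1, W_query, W_key, W_value)
q = W_query.matmul(x_1.T).T
k = W_key.matmul(x_1.T).T
v = W_value.matmul(x_1.T).T
reference = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1) @ v
print(torch.allclose(self_context, reference, atol=1e-4))
```

The output has one context vector per element of the first sequence, regardless of how long the second sequence is.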

We implemented the self-attention mechanism in [Coding the self-attention mechanism](https://www.kaggle.com/code/aisuko/coding-the-self-attention-mechanism). The code looked like this:

In [1]:
import torch

inputs="According to the news, it it hard to say Melbourne is safe now"
d_q, d_k, d_v=24,24,28

torch.manual_seed(123)

# Map each word to an integer id, then encode the sentence as a tensor of ids
input_ids={s:i for i,s in enumerate(sorted(inputs.replace(',','').split()))}
input_tokens=torch.tensor([input_ids[s] for s in inputs.replace(',','').split()])

# Embed the 13 tokens as 16-dimensional vectors
embed=torch.nn.Embedding(13,16)
embedded_sentence=embed(input_tokens).detach()
d=embedded_sentence.shape[1]

# Projection matrices for queries, keys, and values
W_query=torch.rand(d_q, d)
W_key=torch.rand(d_k, d)
W_value=torch.rand(d_v, d)

# Query for the second token only (x_2 here is a single token, not a sequence)
x_2=embedded_sentence[1]
query_2=W_query.matmul(x_2)

# Keys and values for every token in the sentence
keys=W_key.matmul(embedded_sentence.T).T
values=W_value.matmul(embedded_sentence.T).T

print(embedded_sentence.shape)
print(query_2.shape)
print(keys.shape)
print(values.shape)

torch.Size([13, 16])
torch.Size([24])
torch.Size([13, 24])
torch.Size([13, 28])


To code cross-attention, the only difference is that we have a second input sequence. As an example, suppose the second sequence is a sentence with 8 tokens instead of the 13 in the first one. A random tensor stands in for its embeddings here.

In [2]:
embedded_sentence_2=torch.rand(8,16)

keys=W_key.matmul(embedded_sentence_2.T).T
values=W_value.matmul(embedded_sentence_2.T).T

print(keys.shape)
print(values.shape)

torch.Size([8, 24])
torch.Size([8, 28])


Notice that compared to self-attention, the keys and values now have 8 instead of 13 rows. Everything else stays the same.
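To see what "everything else stays the same" means in practice, here is a hedged sketch that carries the example through to a context vector. It assumes the scaled softmax step from the earlier self-attention notebook. The names `omega_2`, `attention_weights_2`, and `context_vector_2` are illustrative, and random tensors stand in for the embeddings so that the block runs on its own.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
d, d_q, d_k, d_v = 16, 24, 24, 28

# Stand-ins with the same shapes as the tensors in the cells above (random values)
embedded_sentence = torch.rand(13, d)    # first sequence: 13 tokens
embedded_sentence_2 = torch.rand(8, d)   # second sequence: 8 tokens

W_query = torch.rand(d_q, d)
W_key = torch.rand(d_k, d)
W_value = torch.rand(d_v, d)

# The query still comes from the second token of the first sequence
query_2 = W_query.matmul(embedded_sentence[1])

# Keys and values come from the second sequence
keys = W_key.matmul(embedded_sentence_2.T).T
values = W_value.matmul(embedded_sentence_2.T).T

# One unnormalized score per element of the second sequence
omega_2 = query_2.matmul(keys.T)
attention_weights_2 = F.softmax(omega_2 / d_k ** 0.5, dim=0)

# Weighted sum of the 8 value vectors
context_vector_2 = attention_weights_2.matmul(values)

print(omega_2.shape)           # torch.Size([8])
print(context_vector_2.shape)  # torch.Size([28])
```

The attention weights now range over the 8 elements of the second sequence, while the context vector keeps the value dimension $d_{v}=28$.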

We talked a lot about language transformers above. In the original transformer architecture, cross-attention is useful when we go from an input sentence to an output sentence in the context of language translation. The input sentence represents one input sequence, and the translation represents the second input sequence (the two sentences can have different numbers of words).
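The encoder-decoder setup can also be sketched with PyTorch's built-in `torch.nn.MultiheadAttention`, passing the decoder states as the query and the encoder output as both key and value. The variable names, the number of heads, and the sequence lengths below are assumptions chosen for illustration, and random tensors stand in for real encoder and decoder activations.

```python
import torch

torch.manual_seed(123)

embed_dim, num_heads = 16, 4
mha = torch.nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Random stand-ins: (batch, sequence length, embedding dimension)
decoder_states = torch.rand(1, 13, embed_dim)   # translation processed so far
encoder_output = torch.rand(1, 8, embed_dim)    # encoded input sentence

with torch.no_grad():
    attn_output, attn_weights = mha(
        query=decoder_states,
        key=encoder_output,
        value=encoder_output,
    )

print(attn_output.shape)   # torch.Size([1, 13, 16])
print(attn_weights.shape)  # torch.Size([1, 13, 8])
```

The returned weights are averaged over the heads by default, giving one row per decoder position and one column per encoder position.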

Another popular model that uses cross-attention is Stable Diffusion. It applies cross-attention between the image being generated in the U-Net and the text prompts used for conditioning, as described in *High-Resolution Image Synthesis with Latent Diffusion Models*, the original latent diffusion paper that Stability AI later adopted to implement the popular Stable Diffusion model.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/819/275/403/095/727/small/6a9c96fe1c7f32ac.png" width="80%" heigh="80%" alt="Cross-attention in diffusion"></div>


# Conclusion

We discussed how self-attention works using a step-by-step coding approach. We then extended this concept to multi-head attention, a widely used component of large language transformers. After discussing self-attention and multi-head attention, we introduced yet another concept: cross-attention, which is a flavor of self-attention that we can apply between two different sequences. Finally, thanks to [Sebastian Raschka, PhD](https://www.linkedin.com/in/sebastianraschka/) for his beautiful article (see the Credit section).

# Credit

* https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html