# Transformers

This notebook is here to help me refresh my understanding of the basic transformer architecture.

We want to implement the encoder part of the architecture from the [Attention Is All You Need paper](https://arxiv.org/pdf/1706.03762):




architecture screenshot:

![](20251120024008.png)

My goal will be to go through one pass of a transformer layer for some data and try to explain each layer. Finally I will convert this Jupyter notebook to Python code and train it on a simple dataset.

In [1]:
# I want this notebook to be very simple, so I will make the data very simple too, i.e. use whatever I have written till now as training data

training_data = list("""
# Transformers

this note book is here to help me refresh some of my understanding of the basic transformers architecture

we want to implement the encoder part of the architecture in [attention is all you need paper](https://arxiv.org/pdf/1706.03762):

My goal with be to go through one pass of transformer layer for a data, and try to explain each layer, finally I will convert this jupyter notebook to a python code and train it on a simple dataset

# I want this note book to be very simple so I will make the data very simple, i.e use whatever I have written till now as training data

""")

In [2]:
# I don't want to get too deep into tokenization for this notebook so I am just going to instead use all the unique characters
# present in the training data as distinct tokens
vocabulary_list = list(set(training_data))

In [3]:
print(vocabulary_list[:5])
print(len(vocabulary_list))

[']', 'h', 'd', '.', 'M']
44


In [4]:
# let's look at what the inputs and targets for next token prediction look like

# a single example of length n gives the transformer n predictions to train on:
# position i has to predict the token at position i + 1
print(training_data[:9])

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f']


In [5]:
# here if x is
training_data[:8]

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's']

In [6]:
# then y would be
training_data[1:9]

['#', ' ', 'T', 'r', 'a', 'n', 's', 'f']

In [7]:
# ok, before we create the training data we need to convert our tokens to unique indices, to do that I will do
token_to_index = {c:i for i,c in enumerate(vocabulary_list)}
index_to_token = {i:c for i,c in enumerate(vocabulary_list)}

In [8]:
# now let's convert our training data to a torch tensor
import torch

training_data_tensor = torch.tensor([token_to_index[c] for c in training_data], dtype=torch.long)

In [9]:
print(training_data_tensor[:10])
print([index_to_token[ix.item()] for ix in training_data_tensor[:10]])

tensor([30, 16, 41, 28,  7, 26, 42, 40, 34, 35])
['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f', 'o']


In [10]:
# now let's create the inputs (x) and the targets (y); y is x shifted by one token
block_size = 8
x = torch.stack([training_data_tensor[ix:ix + block_size] for ix in range(len(training_data_tensor) - block_size)])
# max ix is len(training_data_tensor) - block_size - 1
# so ix + block_size = len(training_data_tensor) - 1
# so the final x example won't include the last character
y = torch.stack([training_data_tensor[ix:ix + block_size] for ix in range(1, len(training_data_tensor) - block_size + 1)])



In [11]:
print("x training data")
print(x[:5])
print("y training data")
print(y[:5])

x training data
tensor([[30, 16, 41, 28,  7, 26, 42, 40],
        [16, 41, 28,  7, 26, 42, 40, 34],
        [41, 28,  7, 26, 42, 40, 34, 35],
        [28,  7, 26, 42, 40, 34, 35,  7],
        [ 7, 26, 42, 40, 34, 35,  7, 14]])
y training data
tensor([[16, 41, 28,  7, 26, 42, 40, 34],
        [41, 28,  7, 26, 42, 40, 34, 35],
        [28,  7, 26, 42, 40, 34, 35,  7],
        [ 7, 26, 42, 40, 34, 35,  7, 14],
        [26, 42, 40, 34, 35,  7, 14, 38]])


# Embedding Table

![](20251121001141.png)


This is a lookup table between the vocabulary index and an n-dimensional vector.
During the training of the transformer model these vectors also get trained, i.e. where these vectors point to gets updated
based on how similarly the tokens are used. Let's say I have 2 tokens, "dog" and "pooch": at the start of the training process
they might point in very different directions, but after the training both would point to pretty much the same place.

### Question?:

1. What is so special about the training process that transforms these vectors from pointing in random directions to actually having some meaning?
    * for now I am gonna assume that the answer is that the transformer architecture expects and assumes these vectors to be what I have described
    * and based on this assumption the subsequent layers perform their operations, so optimizing the loss leads to these embedding vectors looking more like actual high-dimensional representations of the words
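
One mechanical piece of the answer can be checked directly. The sketch below uses toy sizes and variable names that are assumptions made only for this illustration. It shows that an embedding lookup is the same as multiplying a one-hot vector by the weight matrix, and that the gradient of a loss reaches only the rows that were looked up, so those are the rows that move during training.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes, chosen only for this sketch
table = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([2, 5, 2])

# A lookup gives the same result as a one-hot vector times the weight matrix.
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
looked_up = table(token_ids)
via_matmul = one_hot @ table.weight
same_result = torch.allclose(looked_up, via_matmul)

# A loss computed from the looked-up vectors sends gradients back to
# exactly the rows that were used, so only those rows get updated.
loss = looked_up.sum()
loss.backward()
rows_with_grad = table.weight.grad.abs().sum(dim=1).nonzero().flatten().tolist()
print(same_result, rows_with_grad)
```

So whatever the later layers need from a token in order to lower the loss is pushed into that token's row of the table.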

In [12]:
from torch import nn

EMBEDDING_DIMENSION = 8
VOCAB_SIZE = len(vocabulary_list)

embeddings_table = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIMENSION)

In [13]:
# some experimentation on how the embeddings table works
print(embeddings_table(torch.tensor([[0,1,2,3]], dtype=torch.long)))
# it goes to each item in the tensor, treats it as an index and converts it to its corresponding embedding vector

tensor([[[-1.5241, -1.7928, -0.0552, -0.9754,  0.5999,  0.2177, -0.5312,
          -0.3455],
         [-1.5339,  1.0883, -0.6325, -0.0580, -1.5893,  1.1726,  0.4612,
          -0.3180],
         [ 0.1329, -0.7719, -0.3997,  1.0125,  1.3193,  0.3830, -0.5892,
           0.6101],
         [-0.4863, -0.8104, -1.0256,  0.1722, -0.5198, -1.1705, -0.1134,
          -0.2747]]], grad_fn=<EmbeddingBackward0>)


I want to do a very simple forward pass so I am gonna create my forward pass batch now

In [14]:
x_batch = x[:5]
y_batch = y[:5]

A question lingers, what does this (shifted right) mean:

![](20251121002201.png)

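this just means that the input is the target output shifted one position to the right, so the input at position i is the target of position i - 1 (this is exactly how `x` and `y` were built above)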

In [15]:
x_embeddings = embeddings_table(x_batch)

In [16]:
# just one example
x_embeddings[:1]

tensor([[[-0.1435, -0.5369, -1.4354, -1.0808, -0.0953,  0.1220, -0.5978,
          -0.7676],
         [ 1.5659,  0.6940, -1.0396,  1.1023, -2.8244,  0.0853,  0.0612,
           0.7656],
         [-1.0717,  0.3598, -0.2382, -2.2446, -0.2041, -0.3228,  0.2966,
          -0.9406],
         [-0.4744, -0.6566,  0.6656, -1.2478, -0.2328, -0.2538,  1.1948,
          -0.3091],
         [-1.0097, -0.7088,  0.3874, -0.5110, -0.7458,  1.1292, -0.5110,
          -0.6015],
         [ 1.0414,  0.7076,  1.0371,  0.7519,  0.0218, -0.7247,  0.2351,
           0.3360],
         [ 0.7845, -0.5335, -0.7736, -0.1852,  1.0675, -0.2162, -0.3340,
           0.4087],
         [ 0.7960, -1.8973,  0.0177, -0.3042, -1.4312,  0.5846, -0.6061,
           0.8778]]], grad_fn=<SliceBackward0>)

# Positional Encoding

![](20251121135418.png)


From my past understanding these are values which vary with the position of the token in the sequence, to encode the information about where the token sits in the sequence.

So for each position there will be a vector associated with it, which gets added to the original embedding vector at that position.

### Questions?:
1. Why add these vectors to the original embedding vector? Can they not be appended, or encoded in some other way, as a new channel perhaps like we do for CNNs?
    - Ans: The idea behind adding them comes from how we treat embedding vectors. You can think of the embedding vector as the original absolute meaning of a token. Depending on whether it appears at the beginning or the end of a sentence its meaning might differ, i.e. its embedding vector might move, and that change is captured by the addition of the positional embedding vector.
2. Why do these need to be vectors at all, can it not be a single number which gets added?
    - Ans: well, a vector is a more general version of a single number: if a single number is the right approach, then the expectation is that the network would train the positional embeddings so that every dimension carries the same number.

## Sinusoidal Encoding

![](20251121140214.png)

here d_model is the dimension of the embedding

In the original paper they used a fixed sinusoidal positional encoding. They mention that the learned and the fixed versions produced nearly identical results, and that they chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones seen during training.

### Question?:
1. But why sinusoidal encoding
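
The reason the paper gives is a hypothesis: for any fixed offset k, the encoding of position pos + k is a linear function of the encoding of position pos, which should make it easy for the model to attend by relative position. Below is a minimal sketch of the fixed table, assuming an even `d_model`. The function name `sinusoidal_positions` is made up for this note, and the cells that follow use a learned `nn.Embedding` table instead.

```python
import math
import torch

def sinusoidal_positions(context_length: int, d_model: int) -> torch.Tensor:
    """Return a (context_length, d_model) table of fixed positional encodings.

    Assumes d_model is even. Even columns hold sines, odd columns hold cosines.
    """
    position = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)  # (C, 1)
    # 1 / 10000^(2i / d_model) for i = 0 .. d_model/2 - 1
    inv_freq = torch.exp(
        -math.log(10000.0) * torch.arange(0, d_model, 2, dtype=torch.float32) / d_model
    )
    table = torch.zeros(context_length, d_model)
    table[:, 0::2] = torch.sin(position * inv_freq)  # PE(pos, 2i)
    table[:, 1::2] = torch.cos(position * inv_freq)  # PE(pos, 2i + 1)
    return table

pe = sinusoidal_positions(context_length=8, d_model=8)
print(pe.shape)
```

The table has the same `(C, E)` shape as the learned positional embeddings, so it would be added to the token embeddings in exactly the same way.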

In [17]:
positional_embedding_table = nn.Embedding(block_size, EMBEDDING_DIMENSION)

In [18]:
x_pos_embeddings = positional_embedding_table(torch.arange(x_embeddings.shape[1])) # C, E

x_embeddings # B, C, E

# (B, C, E) + (C, E): pytorch compares the shapes starting from the right, and if one side is missing a dimension it creates a new dimension and copies the same thing over, like (C, E) -> (1, C, E) -> (B, C, E)
x_embeddings_total = x_embeddings + x_pos_embeddings

# Self Attention Layer

![](20251123232032.png)

I will start off by explaining what this layer does at a high level, then I will dig deep into how it does this. Initially I will go over single head self attention,
then understand for myself and explain why we need multi head self attention.

This is the 3b1b interpretation of this layer at a high level, which I found to be the most elegant.

## The Explanation
This layer as a whole tells us how the original embedding vector should be modified so that its meaning is enriched with the context of the surrounding tokens. For example take the sentence:

" That blue aeroplane is very dangerous "

in this example "aeroplane"'s embedding vector would initially point straight at the absolute aeroplane,
then the attention layer outputs a result which, when added to the original embedding vector, nudges the aeroplane's vector in a direction closer to blue and dangerous

that is what this layer does at a high level. Now going into the detail, let's start with the equation

![](20251123233808.png)

the above is the equation describing the self attention mechanism. For the purpose of this exercise we will focus on masked self attention

here Q, K, V are all matrices

Q being the query matrix, K the key matrix and V the value matrix

let me write this formula out for our ongoing example and then explain

In [19]:
d_k = d_q = 4

W_q = nn.Linear(EMBEDDING_DIMENSION, d_q, bias = False)
W_k = nn.Linear(EMBEDDING_DIMENSION, d_k, bias = False)
# for single head attention the output size of W_v needs to be the same as the embedding dimension, because the result of the attention layer gets added to the input embedding vector, so both need to have the same dimension
W_v = nn.Linear(EMBEDDING_DIMENSION, EMBEDDING_DIMENSION, bias = False)


In [20]:
W_q.weight.shape

torch.Size([4, 8])

In [21]:
x_embeddings_total.shape

torch.Size([5, 8, 8])

In [22]:
W_q.weight@x_embeddings_total[0][0]

tensor([-1.7749,  0.9299, -0.8435,  1.0373], grad_fn=<MvBackward0>)

In [23]:
W_q(x_embeddings_total)

tensor([[[-1.7749,  0.9299, -0.8435,  1.0373],
         [-0.5611, -0.9662,  1.7095, -0.5968],
         [-1.0167, -0.6170,  0.2199, -0.5388],
         [-1.6731,  0.2354, -1.1576, -0.6768],
         [-0.6649,  0.7781, -0.3355,  0.2495],
         [ 0.5023, -0.5960,  0.1478,  0.0717],
         [ 0.3706, -0.9163,  0.2390,  0.0580],
         [-0.7251, -0.3290,  0.5898, -0.2626]],

        [[-0.5622,  0.5399,  0.8599,  0.6430],
         [-1.6198,  0.1392, -0.3604,  0.2252],
         [-0.4875, -1.0404,  0.3793, -1.2212],
         [-1.9039,  0.7531, -1.1441, -0.2401],
         [ 0.5605, -0.0724,  0.0939,  0.0030],
         [-0.3112, -0.7857,  0.0199, -0.0321],
         [-0.1429, -0.6489,  0.4605, -0.3337],
         [-0.1160, -0.0841,  0.8567,  0.8012]],

        [[-1.6209,  1.6452, -1.2100,  1.4650],
         [-1.0906, -0.2842, -0.2010, -0.4571],
         [-0.7182, -0.5227,  0.3928, -0.7846],
         [-0.6785, -0.0974, -0.7147, -0.4866],
         [-0.2530, -0.2621, -0.0339, -0.1007],
         

In [24]:
# pass the existing vector through a trainable linear transformation

Q, K, V = W_q(x_embeddings_total), W_k(x_embeddings_total), W_v(x_embeddings_total)

In [25]:
print(Q.shape, K.shape, V.shape)

torch.Size([5, 8, 4]) torch.Size([5, 8, 4]) torch.Size([5, 8, 8])


In [26]:
attention_matrix = Q@K.transpose(1, 2) # swap dims 1 and 2 (not the batch dim 0); this is equivalent to Q[i]@K[i].T where i goes over the elements of the batch

Let me try and explain what just happened above, let's take the example of a single element of the batch

In [27]:
print(Q[0][:3])
print(K[0][:3])

tensor([[-1.7749,  0.9299, -0.8435,  1.0373],
        [-0.5611, -0.9662,  1.7095, -0.5968],
        [-1.0167, -0.6170,  0.2199, -0.5388]], grad_fn=<SliceBackward0>)
tensor([[-0.7977,  0.3890,  1.0034,  1.7204],
        [-0.8767,  0.8057, -1.4233, -1.1532],
        [-0.0697,  0.0768,  1.1837, -0.2950]], grad_fn=<SliceBackward0>)



well the traditional explanation is:

when the static embedding for a word is passed through these layers, each layer extracts the specific features pertaining to its transformation

let's take an example: "The bank of the River"

Query -> transforms the static embedding of bank to something like "I am a noun needing a definition, I need to know what I am: a river bank, a financial bank, etc."

Key -> transforms the static embedding of River to something like "I am a nature related word, related to river"

Value -> transforms the embedding to the actual meaning info. Since this might not be the first attention layer, the input might not be the absolute meaning anymore (in a second layer it already carries context)

but this never really sat right with me, I was not able to understand it fully and it seemed pretty hand-wavy at explaining what these layers really do. Maybe I understand it for the Value vector, but not for the Query and Key vectors

I need to think this through and understand it properly and clearly once and for all,
think: what would happen if no linear layer was present?


ok let's say no linear layer was present, then Q@K.T would just give me the dot products between the static embeddings of the tokens, which is the cosine similarity scaled by the lengths of the two vectors

let's say the attention matrix is

`A`

here `A[i][j]` would be that similarity between the ith token and the jth token (when I say token here I mean the embedding of the token)

now let's see what `A@V` would do, here I am assuming `V` is just the static embedding too


`A_softmax` -> is the softmax function applied to each row of `A`, i.e. along the column index `j` (why along the columns? cause the weights of the ith token are distributed across the columns of row `i`, see below)

```

V`[i] = V[0]*A_softmax[i][0] + V[1]*A_softmax[i][1] + V[2]*A_softmax[i][2] + ...

```

here `V[i]` represents the static embedding for the ith word, so `V[i]` is a vector, and ``V`[i]`` is the final vector after the attention matrix multiplication

we can think of ``V`[i]`` as the change, i.e. how the original word gets modified

so the weighted sum tells us in the direction of which static embeddings the input embedding should move the most and the least (depending on the weight), so that the modified vector represents the meaning of the entire sentence. **Now I see the problem with just using the static embeddings: if we use them directly to compute the similarity, the input embedding to this layer gets nudged towards whichever words in the sentence it is already most similar to, which we don't always want**

side note: one more thing I learned is that these query and key vectors live in a space which is smaller than the static embedding dimension, so it helps computationally too
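
To make the "no linear layer" thought experiment concrete, here is a small sketch with random vectors standing in for the static embeddings (the sizes and names are assumptions for this illustration only). When queries and keys are both the raw, length-normalized embeddings, the score matrix is symmetric and every token scores highest against itself, so there is no way for "bank" to look at "River" more strongly than "River" looks at "bank".

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
context_length, embedding_dim = 6, 8  # toy sizes for this sketch
E = torch.randn(context_length, embedding_dim)  # stand-in for the static embeddings

# No W_q / W_k: queries and keys are both the length-normalized embeddings.
E_unit = F.normalize(E, dim=-1)
A = E_unit @ E_unit.T  # A[i][j] = cosine similarity of token i and token j

is_symmetric = torch.allclose(A, A.T)
# Every token is most similar to itself, so the diagonal wins in every row.
most_similar = A.argmax(dim=-1).tolist()
print(is_symmetric, most_similar)
```

The separate `W_q` and `W_k` projections are what break this symmetry and let the model learn which kind of token should attend to which other kind.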

In [28]:
# now each element of the batch has shape context length x context length
attention_matrix.shape

torch.Size([5, 8, 8])

In [29]:
attention_matrix[0]

tensor([[ 2.7157,  2.3096, -1.1093,  1.3480,  3.9687, -0.1061,  1.0525,  0.5011],
        [ 0.7604, -2.0315,  2.1645, -0.1895, -2.0990, -0.1689,  0.0908,  1.1689],
        [-0.1354,  0.7027,  0.4427, -0.7405, -0.0052,  0.3430,  1.0709,  1.2099],
        [-0.8997,  4.0845, -1.0358, -1.0282,  2.3869,  0.6304,  1.8070,  1.5919],
        [ 0.9257,  1.3996, -0.3646,  0.6703,  1.6947, -0.1983,  0.1288,  0.3019],
        [-0.3608, -1.2135,  0.0730, -0.3230, -0.9943,  0.1540, -0.0541, -0.4394],
        [-0.3124, -1.4702,  0.1695, -0.5011, -1.1746,  0.2685,  0.2042, -0.3207],
        [ 0.5905, -0.1661,  0.8009, -0.0654, -0.2292, -0.0137,  0.4167,  0.8609]],
       grad_fn=<SelectBackward0>)

```

V`[i] = V[0]*A_softmax[i][0] + V[1]*A_softmax[i][1] + V[2]*A_softmax[i][2] + ...

```

now in self attention when we compute the change required for the ith vector we don't want any token after the ith token to decide how it changes, cause during inference we don't have that information at all, so we need to find a way so that the above becomes

```

V`[i] = V[0]*A_softmax[i][0] + V[1]*A_softmax[i][1] + V[2]*A_softmax[i][2] + ... + V[i]*A_softmax[i][i]

```

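so to do this we set the entries we don't want to `-inf` before the softmax, so that they come out as zero after it. With the convention used in the formula above (`A[i][j]` is how much the ith token takes from the jth token) the entries to kill are the ones with `j > i`, i.e. the strictly upper triangular part, and the softmax has to normalize each row (`dim=-1`).

Careful: the cells below fill the strictly lower triangular part (`torch.tril(..., diagonal=-1)`) and take the softmax with `dim=-2`, so there each column sums to 1 and `A_softmax[i][j]` is non-zero only for `j >= i`. Multiplying that matrix with `V` as it is mixes the later tokens into the ith token instead of the earlier ones, so for a causal layer the mask should be `torch.triu(..., diagonal=1)` with `dim=-1`.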
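
A minimal sketch of the causal version of the formula above, on random toy tensors (the sizes and variable names are assumptions for this illustration). It masks the entries with `j > i`, normalizes each row, and includes the `1/sqrt(d_k)` scaling from the paper's equation. The last lines check causality by changing the final token and confirming that no earlier output moves.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, context_length, d_k, d_v = 2, 5, 4, 8  # toy sizes for this sketch
Q = torch.randn(batch, context_length, d_k)
K = torch.randn(batch, context_length, d_k)
V = torch.randn(batch, context_length, d_v)

# scores[b, i, j] = (q_i . k_j) / sqrt(d_k)
scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (B, C, C)

# True where j > i, i.e. where token i would be looking at a later token j
future = torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)  # each row sums to 1

out = weights @ V  # out[b, i] = sum over j <= i of weights[b, i, j] * V[b, j]

# Causality check: changing the last token must not change any earlier output.
V_changed = V.clone()
V_changed[:, -1] += 100.0
out_changed = weights @ V_changed
is_causal = torch.allclose(out[:, :-1], out_changed[:, :-1])
print(is_causal)
```

With this layout the first token can only attend to itself, so `out[:, 0]` equals `V[:, 0]`.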

In [30]:
print(x_batch[0])
print(y_batch[0])

tensor([30, 16, 41, 28,  7, 26, 42, 40])
tensor([16, 41, 28,  7, 26, 42, 40, 34])


In [31]:
torch.ones_like(attention_matrix[0])

tensor([[1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])

In [32]:
torch.tril(torch.ones_like(attention_matrix[0]), diagonal=-1).bool() # should the diagonal also have no contribution?

tensor([[False, False, False, False, False, False, False, False],
        [ True, False, False, False, False, False, False, False],
        [ True,  True, False, False, False, False, False, False],
        [ True,  True,  True, False, False, False, False, False],
        [ True,  True,  True,  True, False, False, False, False],
        [ True,  True,  True,  True,  True, False, False, False],
        [ True,  True,  True,  True,  True,  True, False, False],
        [ True,  True,  True,  True,  True,  True,  True, False]])

In [33]:
attention_matrix[0].masked_fill(torch.tril(torch.ones_like(attention_matrix[0]), diagonal=-1).bool(), float('-inf'))

tensor([[ 2.7157,  2.3096, -1.1093,  1.3480,  3.9687, -0.1061,  1.0525,  0.5011],
        [   -inf, -2.0315,  2.1645, -0.1895, -2.0990, -0.1689,  0.0908,  1.1689],
        [   -inf,    -inf,  0.4427, -0.7405, -0.0052,  0.3430,  1.0709,  1.2099],
        [   -inf,    -inf,    -inf, -1.0282,  2.3869,  0.6304,  1.8070,  1.5919],
        [   -inf,    -inf,    -inf,    -inf,  1.6947, -0.1983,  0.1288,  0.3019],
        [   -inf,    -inf,    -inf,    -inf,    -inf,  0.1540, -0.0541, -0.4394],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.2042, -0.3207],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.8609]],
       grad_fn=<MaskedFillBackward0>)

In [34]:
attention_matrix.masked_fill(torch.tril(torch.ones_like(attention_matrix), diagonal=-1).bool(), float('-inf'))[0]

tensor([[ 2.7157,  2.3096, -1.1093,  1.3480,  3.9687, -0.1061,  1.0525,  0.5011],
        [   -inf, -2.0315,  2.1645, -0.1895, -2.0990, -0.1689,  0.0908,  1.1689],
        [   -inf,    -inf,  0.4427, -0.7405, -0.0052,  0.3430,  1.0709,  1.2099],
        [   -inf,    -inf,    -inf, -1.0282,  2.3869,  0.6304,  1.8070,  1.5919],
        [   -inf,    -inf,    -inf,    -inf,  1.6947, -0.1983,  0.1288,  0.3019],
        [   -inf,    -inf,    -inf,    -inf,    -inf,  0.1540, -0.0541, -0.4394],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.2042, -0.3207],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.8609]],
       grad_fn=<SelectBackward0>)

In [35]:
masked_attention_matrix = attention_matrix.masked_fill(torch.tril(torch.ones_like(attention_matrix), diagonal=-1).bool(), float('-inf'))

In [36]:
masked_attention_matrix[0]

tensor([[ 2.7157,  2.3096, -1.1093,  1.3480,  3.9687, -0.1061,  1.0525,  0.5011],
        [   -inf, -2.0315,  2.1645, -0.1895, -2.0990, -0.1689,  0.0908,  1.1689],
        [   -inf,    -inf,  0.4427, -0.7405, -0.0052,  0.3430,  1.0709,  1.2099],
        [   -inf,    -inf,    -inf, -1.0282,  2.3869,  0.6304,  1.8070,  1.5919],
        [   -inf,    -inf,    -inf,    -inf,  1.6947, -0.1983,  0.1288,  0.3019],
        [   -inf,    -inf,    -inf,    -inf,    -inf,  0.1540, -0.0541, -0.4394],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.2042, -0.3207],
        [   -inf,    -inf,    -inf,    -inf,    -inf,    -inf,    -inf,  0.8609]],
       grad_fn=<SelectBackward0>)

In [37]:
import torch.nn.functional as F


F.softmax(masked_attention_matrix[0], dim=-2)

tensor([[1.0000, 0.9871, 0.0311, 0.6985, 0.7521, 0.1281, 0.1759, 0.0906],
        [0.0000, 0.0129, 0.8220, 0.1501, 0.0017, 0.1203, 0.0673, 0.1766],
        [0.0000, 0.0000, 0.1469, 0.0865, 0.0141, 0.2008, 0.1792, 0.1840],
        [0.0000, 0.0000, 0.0000, 0.0649, 0.1546, 0.2676, 0.3742, 0.2696],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0774, 0.1169, 0.0699, 0.0742],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1662, 0.0582, 0.0354],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0753, 0.0398],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1298]],
       grad_fn=<SoftmaxBackward0>)

### should the diagonal (i.e. the current token) have any contribution towards the update of the input embedding vector?

if it does, then ``V`[0] = V[0]`` and that would mean the updated embedding becomes ``E_updated[0] = V[0] + E[0]``, need to think more about it

but according to my understanding `V[0]` is the same as the absolute embedding of the token, so `V[0] + E[0]` would be practically `2*absolute_embedding[0]`, since `E[0]` was created without any info about the tokens after it, so it is just itself. But since we do layer norm, and layer norm ignores the overall scale, `norm(2*absolute_embedding[0]) = norm(absolute_embedding[0])`. So even if the vector scales up a lot, layer normalization brings the scale of the embedding vector back down so that it fits with the scale of the rest of the embedding vectors
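
The scale argument can be checked quickly. This sketch uses a random vector as a stand-in for the embedding and `F.layer_norm` without learned parameters (both are assumptions for the illustration). Doubling the vector leaves the normalized result unchanged, up to the small epsilon inside the normalization.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embedding = torch.randn(8)  # stand-in for absolute_embedding[0]

normalized_once = F.layer_norm(embedding, (8,))
normalized_doubled = F.layer_norm(2 * embedding, (8,))

# Layer norm subtracts the mean and divides by the standard deviation,
# so multiplying the input by a constant cancels out.
scale_is_ignored = torch.allclose(normalized_once, normalized_doubled, atol=1e-3)
print(scale_is_ignored)
```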

In [38]:
batch_attention_matrix = F.softmax(masked_attention_matrix, dim=-2)

In [39]:
batch_attention_matrix[0]

tensor([[1.0000, 0.9871, 0.0311, 0.6985, 0.7521, 0.1281, 0.1759, 0.0906],
        [0.0000, 0.0129, 0.8220, 0.1501, 0.0017, 0.1203, 0.0673, 0.1766],
        [0.0000, 0.0000, 0.1469, 0.0865, 0.0141, 0.2008, 0.1792, 0.1840],
        [0.0000, 0.0000, 0.0000, 0.0649, 0.1546, 0.2676, 0.3742, 0.2696],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0774, 0.1169, 0.0699, 0.0742],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1662, 0.0582, 0.0354],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0753, 0.0398],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1298]],
       grad_fn=<SelectBackward0>)

In [40]:
V[0]

tensor([[-0.1399, -0.3563, -0.5243,  0.7751,  0.3165, -0.1855,  1.7871,  0.5098],
        [-0.3114, -1.7896,  0.5604,  0.3243, -0.4021, -1.3129, -1.5363,  0.8263],
        [ 0.1248, -0.0877,  0.1744,  0.5784, -0.7061, -1.3860,  0.6044, -0.2072],
        [-0.0868,  0.0997,  0.4958,  0.7775,  0.4048, -1.2616,  1.3989, -1.0628],
        [-0.9760, -0.2477,  0.3601,  0.4283,  0.5098, -0.2554, -0.4042, -0.3268],
        [ 0.1497, -0.5940,  0.1340, -0.5340,  0.1962,  0.4462, -0.1219,  0.0363],
        [ 0.4670, -1.0735,  0.9729, -1.1829,  0.6295, -0.4117, -0.5135,  0.1005],
        [-1.0751, -0.4535,  0.0561,  0.3211, -0.4021, -1.0135, -0.6462, -0.0220]],
       grad_fn=<SelectBackward0>)

In [41]:
batch_attention_matrix[0]@V[0]

tensor([[-1.2341, -2.5484,  0.8449,  1.7310,  0.6633, -2.7050,  0.7979,  0.3513],
        [-0.0566, -0.3044,  0.3171,  0.5100, -0.5289, -1.4989,  0.5229, -0.3125],
        [-0.0871, -0.4029,  0.2852, -0.1018,  0.0167, -0.4871, -0.0313, -0.1058],
        [-0.2316, -0.7148,  0.5029, -0.3823,  0.2848, -0.4292, -0.3707, -0.0781],
        [-0.1052, -0.1972,  0.1157, -0.0881,  0.0765, -0.0716, -0.1294, -0.0157],
        [ 0.0140, -0.1772,  0.0809, -0.1462,  0.0550,  0.0144, -0.0730,  0.0111],
        [-0.0076, -0.0989,  0.0755, -0.0763,  0.0314, -0.0714, -0.0644,  0.0067],
        [-0.1396, -0.0589,  0.0073,  0.0417, -0.0522, -0.1315, -0.0839, -0.0029]],
       grad_fn=<MmBackward0>)

In [42]:
V_prime = batch_attention_matrix@V

In [43]:
embedding_unnormalized = V_prime + x_embeddings_total

# Layer normalization

we just normalize across the embedding dimension, to make sure this addition does not scale the embedding vector too much from layer to layer, so that it keeps the scale of the original embedding vector

In [44]:
mean_emebdding = embedding_unnormalized.mean(dim=-1)
mean_emebdding.shape

torch.Size([5, 8])

In [45]:
std_embedding = embedding_unnormalized.std(dim=-1)
std_embedding.shape

torch.Size([5, 8])

In [46]:
embedding_unnormalized.shape

torch.Size([5, 8, 8])

In [47]:
print(mean_emebdding[0][0], std_embedding[0][0])

tensor(-1.2742, grad_fn=<SelectBackward0>) tensor(1.2926, grad_fn=<SelectBackward0>)


In [48]:
(embedding_unnormalized[0][0] - mean_emebdding[0][0])/(std_embedding[0][0] + 0.0001)

tensor([-1.0507, -1.0207,  0.0226,  0.3188,  2.0691, -0.4383, -0.3197,  0.4189],
       grad_fn=<DivBackward0>)

In [49]:
embedding_normalized = (embedding_unnormalized - mean_emebdding.unsqueeze(-1))/(std_embedding.unsqueeze(-1) + 0.0001) # unsqueeze needed to broadcast
embedding_normalized[0][0]

tensor([-1.0507, -1.0207,  0.0226,  0.3188,  2.0691, -0.4383, -0.3197,  0.4189],
       grad_fn=<SelectBackward0>)
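
For comparison with the built-in layer, here is a hedged sketch on a random tensor of the same shape (the variable names are assumptions). `torch.std` divides by n - 1 by default, while `nn.LayerNorm` uses the biased variance (divides by n) and adds its epsilon inside the square root, so the manual version above is close to the built-in one but not identical to it.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(5, 8, 8)  # (B, C, E), same shape as embedding_unnormalized

mean = x.mean(dim=-1, keepdim=True)

# The manual version from the cells above: unbiased std, epsilon added outside.
manual_unbiased = (x - mean) / (x.std(dim=-1, keepdim=True) + 1e-4)

# What nn.LayerNorm computes: biased variance, epsilon inside the square root.
var_biased = x.var(dim=-1, keepdim=True, unbiased=False)
manual_biased = (x - mean) / torch.sqrt(var_biased + 1e-5)

layer_norm = nn.LayerNorm(8)  # starts with weight = 1 and bias = 0
builtin = layer_norm(x)

matches_biased = torch.allclose(manual_biased, builtin, atol=1e-6)
matches_unbiased = torch.allclose(manual_unbiased, builtin, atol=1e-3)
print(matches_biased, matches_unbiased)
```

`keepdim=True` keeps the last dimension, which removes the need for the `unsqueeze(-1)` calls.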

# Multi head self attention

![](20251130180338.png)

Functionally, what multi headed attention does is employ multiple smaller self attention blocks: the input to each is the full embedding vectors, and the outputs are vectors with a smaller dimension


```

let's say the input is [6, 6, 6]  (here 6 represents the dimension of the embedding vector which goes in, and 3 is the context length)


[6, 6, 6] -> head 1 -> [2, 2, 2]'1 (here I am just using '1 as a label)

[6, 6, 6] -> head 2 -> [2, 2, 2]'2

[6, 6, 6] -> head 3 -> [2, 2, 2]'3


concat([2, 2, 2]'1, [2, 2, 2]'2, [2, 2, 2]'3) -> [6, 6, 6]'

[6, 6, 6]' -> linear layer -> [6, 6, 6]'' --> residual connection --> [6, 6, 6]'' + [6, 6, 6] --> layer norm --> [6, 6, 6]final

```



In [50]:
# writing everything we have done in one block

# ----multi head repetition--------
d_k = d_q = 4
d_v = EMBEDDING_DIMENSION

W_q = nn.Linear(EMBEDDING_DIMENSION, d_q, bias = False)
W_k = nn.Linear(EMBEDDING_DIMENSION, d_k, bias = False)
W_v = nn.Linear(EMBEDDING_DIMENSION, d_v, bias = False) # for single head attention d_v needs to be the same as the embedding dimension, cause the result of the attention layer gets added to the input embedding vector, so both need to have the same dimension

Q, K, V = W_q(x_embeddings_total), W_k(x_embeddings_total), W_v(x_embeddings_total)

attention_matrix = Q@K.transpose(1, 2) # swap dims 1 and 2 (not the batch dim 0); this is equivalent to Q[i]@K[i].T where i goes over the elements of the batch

masked_attention_matrix = attention_matrix.masked_fill(torch.tril(torch.ones_like(attention_matrix), diagonal=-1).bool(), float('-inf'))

batch_attention_matrix = F.softmax(masked_attention_matrix, dim=-2)

V_prime = batch_attention_matrix@V
# ----multi head repetition--------

embedding_unnormalized = V_prime + x_embeddings_total

mean_emebdding = embedding_unnormalized.mean(dim=-1)
std_embedding = embedding_unnormalized.std(dim=-1)

embedding_normalized = (embedding_unnormalized - mean_emebdding.unsqueeze(-1))/(std_embedding.unsqueeze(-1) + 0.0001) # unsqueeze needed to broadcast


In [51]:
# let me create a class out of the repeated block

class SingleHead(nn.Module):

    def __init__(self, d_k, d_q, d_v, input_dimension) -> None:
        super().__init__()  # registers the linear layers below as trainable parameters
        self.W_q = nn.Linear(input_dimension, d_q, bias = False)
        self.W_k = nn.Linear(input_dimension, d_k, bias = False)
        # inside multi head attention d_v is the per head output size, the concatenation of all heads has to match the embedding dimension
        self.W_v = nn.Linear(input_dimension, d_v, bias = False)

    def forward(self, input_embeddings):
        Q, K, V = self.W_q(input_embeddings), self.W_k(input_embeddings), self.W_v(input_embeddings)

        attention_matrix = Q@K.transpose(-2, -1) # swap the last two dimensions; this is equivalent to Q[i]@K[i].T where i goes over the elements of the batch

        # same mask and softmax axis as in the cells above
        masked_attention_matrix = attention_matrix.masked_fill(torch.tril(torch.ones_like(attention_matrix), diagonal=-1).bool(), float('-inf'))

        batch_attention_matrix = F.softmax(masked_attention_matrix, dim=-2)

        V_prime = batch_attention_matrix@V

        return V_prime

        
    

In [52]:
x_embeddings_total.shape

torch.Size([5, 8, 8])

In [53]:
# since our embedding is of size 8 I am gonna create 2 heads with output dimension 4

head1, head2 = SingleHead(2, 2, 4, 8), SingleHead(2, 2, 4, 8)

In [54]:
x_out_head_1 = head1.forward(x_embeddings_total)[0]
print(x_out_head_1)
x_out_head_2 = head2.forward(x_embeddings_total)[0]
print(x_out_head_2)
print(torch.cat([x_out_head_1, x_out_head_2], dim=-1))

tensor([[ 3.1419,  1.4203,  0.0684,  1.0128],
        [ 1.0142,  0.8224,  0.4180, -0.0632],
        [ 0.6418,  0.4875, -0.1693,  0.0273],
        [ 0.1800,  0.3912,  0.1538, -0.2502],
        [ 0.0086,  0.2592,  0.3328, -0.2701],
        [ 0.1354,  0.1442,  0.1713, -0.2068],
        [ 0.2223,  0.1961,  0.2003, -0.1858],
        [ 0.0477,  0.0375,  0.0404, -0.0305]], grad_fn=<SelectBackward0>)
tensor([[ 0.8068, -0.6730, -1.5726,  2.0819],
        [-1.6304,  1.6754, -1.7161,  1.7721],
        [ 0.7044,  0.4508, -0.5119,  0.0817],
        [ 0.2521,  0.2591, -0.2093,  0.0480],
        [-0.3661,  0.1780, -0.2731,  0.1194],
        [-0.2496,  0.0958, -0.1758, -0.0048],
        [-0.1906,  0.1003, -0.2210,  0.0465],
        [ 0.0120,  0.0486, -0.1223,  0.0764]], grad_fn=<SelectBackward0>)
tensor([[ 3.1419,  1.4203,  0.0684,  1.0128,  0.8068, -0.6730, -1.5726,  2.0819],
        [ 1.0142,  0.8224,  0.4180, -0.0632, -1.6304,  1.6754, -1.7161,  1.7721],
        [ 0.6418,  0.4875, -0.1693,  0.0273,

In [55]:
x_out_embedding_total = torch.cat([head1.forward(x_embeddings_total), head2.forward(x_embeddings_total)], dim= -1)
x_out_embedding_total[0]

tensor([[ 3.1419,  1.4203,  0.0684,  1.0128,  0.8068, -0.6730, -1.5726,  2.0819],
        [ 1.0142,  0.8224,  0.4180, -0.0632, -1.6304,  1.6754, -1.7161,  1.7721],
        [ 0.6418,  0.4875, -0.1693,  0.0273,  0.7044,  0.4508, -0.5119,  0.0817],
        [ 0.1800,  0.3912,  0.1538, -0.2502,  0.2521,  0.2591, -0.2093,  0.0480],
        [ 0.0086,  0.2592,  0.3328, -0.2701, -0.3661,  0.1780, -0.2731,  0.1194],
        [ 0.1354,  0.1442,  0.1713, -0.2068, -0.2496,  0.0958, -0.1758, -0.0048],
        [ 0.2223,  0.1961,  0.2003, -0.1858, -0.1906,  0.1003, -0.2210,  0.0465],
        [ 0.0477,  0.0375,  0.0404, -0.0305,  0.0120,  0.0486, -0.1223,  0.0764]],
       grad_fn=<SelectBackward0>)

In [56]:
# add and norm 

embedding_unnormalized = x_out_embedding_total + x_embeddings_total
mean_emebdding = embedding_unnormalized.mean(dim=-1)
std_embedding = embedding_unnormalized.std(dim=-1)
embedding_normalized = (embedding_unnormalized - mean_emebdding.unsqueeze(-1))/(std_embedding.unsqueeze(-1) + 0.0001) # unsqueeze needed to broadcast


Now that I have shown functionally how multi head attention works, we need to understand what it really means

it is really like head 1 decides, for dimensions 0 to 3, which direction the embedding should move in, and head 2 decides the same for dimensions 4 to 7

why do we concat here? why do we not add them up? This I did not understand conceptually. When I think of the heads adding up, I could argue that each head might have its own interpretation of what the space looks like, which might cause contradictory changes to the embeddings. Another way to think of it is that adding all of them up would lead to an explosion in the value of each embedding dimension, though layer norm should fix that

this still needs to be debated, I will have to think more deeply about what it means to add them up vs concat them
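
One observation that may help with the concat vs add question: once the concatenated heads go through the output linear layer from the diagram, concatenating and adding stop being opposites. Concatenating the heads and applying one linear layer gives the same result as giving each head its own slice of that layer and adding the results. A small sketch with random tensors standing in for the head outputs (the names `W_o`, `head_1_out` and `head_2_out` are assumptions for this illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
head_dim, num_heads, d_model = 4, 2, 8
head_1_out = torch.randn(5, 8, head_dim)  # (B, C, head_dim), stand-in for head 1
head_2_out = torch.randn(5, 8, head_dim)  # (B, C, head_dim), stand-in for head 2

# Output projection applied to the concatenated heads.
W_o = nn.Linear(num_heads * head_dim, d_model, bias=False)
concat_then_project = W_o(torch.cat([head_1_out, head_2_out], dim=-1))

# W_o.weight has shape (d_model, num_heads * head_dim): split its columns per head.
W_o_1, W_o_2 = W_o.weight[:, :head_dim], W_o.weight[:, head_dim:]
project_then_add = head_1_out @ W_o_1.T + head_2_out @ W_o_2.T

is_same = torch.allclose(concat_then_project, project_then_add, atol=1e-6)
print(is_same)
```

So with the output projection in place the heads do get added, but only after each one has been mapped into the shared embedding space by its own learned matrix, which is what keeps their "interpretations of the space" from clashing.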

ok so it turns out I missed the feed forward step, which has to come after every multi head attention layer

So the flow should be

1. Multi-Head self attention
2. Add residual
3. Layer norm
4. feed forward
5. add residual
6. layer norm

in my above example I stopped after step 3, so let me add the rest
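
For reference, the feed forward step in the paper is two linear layers with a ReLU in between, applied to every position separately, with a hidden size `d_ff` that is larger than `d_model` (2048 vs 512 in the paper). A minimal sketch of steps 4 to 6 under those assumptions follows; the class name `FeedForwardBlock` and the toy sizes are made up for this note.

```python
import torch
from torch import nn

class FeedForwardBlock(nn.Module):
    """Steps 4 to 6 of the flow above: feed forward, add residual, layer norm."""

    def __init__(self, d_model: int, d_ff: int) -> None:
        super().__init__()
        # Position-wise feed forward: expand to d_ff, apply ReLU, project back to d_model.
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection is added before normalizing.
        return self.norm(x + self.net(x))

torch.manual_seed(0)
block = FeedForwardBlock(d_model=8, d_ff=32)
x = torch.randn(5, 8, 8)  # (B, C, E), stand-in for the output of step 3
out = block(x)
print(out.shape)
```

The input and output shapes are the same, which is what allows these blocks to be stacked.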


In [64]:
linear_layer = nn.Linear(EMBEDDING_DIMENSION, EMBEDDING_DIMENSION)

# embedding_normalized supposedly has all the information in bits and pieces: dims 0 to 3 come from head 1, dims 4 to 7 from head 2
# embedding_mixed is supposed to have the info of all the heads mixed together
embedding_mixed = linear_layer(embedding_normalized)
embedding_mixed = embedding_normalized + embedding_mixed # residual connection
mean_embedding_mixed = embedding_mixed.mean(dim=-1)
std_embedding_mixed = embedding_mixed.std(dim=-1)

embedding_mixed_normalized = (embedding_mixed - mean_embedding_mixed.unsqueeze(-1))/(std_embedding_mixed.unsqueeze(-1) + 0.0001)

In [65]:
embedding_mixed_normalized[0]

tensor([[ 0.8841,  1.1394, -0.8691, -0.3448,  0.9413,  0.0304, -1.7770, -0.0043],
        [ 1.0050,  0.4263, -0.4579,  0.8275, -1.4936,  0.5408, -1.4313,  0.5833],
        [ 0.0375,  0.3017, -0.8468, -1.1163, -0.6150,  0.9747,  1.8187, -0.5546],
        [-0.2239, -0.1080,  0.5884, -1.6280,  0.1220,  1.2926,  1.0401, -1.0831],
        [ 0.4657,  0.3670,  1.2039, -0.2930, -0.6660,  0.8783, -1.9739,  0.0181],
        [ 0.6107,  0.6000, -0.6946,  1.7762,  0.0096, -1.5252, -0.4065, -0.3702],
        [ 0.8879,  0.9430,  0.1061,  1.0647, -0.4016, -1.8315, -0.8117,  0.0431],
        [ 1.1494, -0.2540,  0.5074,  0.5979, -1.5452, -0.1783, -1.2655,  0.9885]],
       grad_fn=<SelectBackward0>)

Ok, this now makes up the full multi head attention block. I still feel like the explanation
"Each head computes its own bits of features that it understands, and then the feed forward layer after that picks and mixes the relevant info to form a refined vector" is a bit hand-wavy. I might need to actually see what this layer is doing, and for that I will have to analyze an existing pretrained model, which is the next thing I will do. Since I don't have money to buy GPUs I will have to rely on analyzing already trained models. Let's continue with the next stuff

In [66]:
# project the embedding vectors into the vocab space, one score (logit) per token of the vocabulary

project_to_vocab_layer = nn.Linear(EMBEDDING_DIMENSION, len(vocabulary_list))

logits = project_to_vocab_layer(embedding_mixed_normalized)

In [67]:
logits[0].shape # context length, vocab size

torch.Size([8, 44])

In [68]:
# flattening the logits (and later the expected values) into 2 dimensions for easy cross entropy computation

logits_exploded = logits.view(logits.shape[0]*logits.shape[1], logits.shape[2])
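
A hedged sketch of where this flattening is heading, using random stand-ins for the logits and the targets (names and sizes are assumptions for this illustration). `F.cross_entropy` expects logits of shape `(N, vocab_size)` and targets of shape `(N,)`, so both tensors are flattened over the batch and context dimensions. The manual computation underneath shows that this is just the mean negative log probability of the correct next token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, context_length, vocab_size = 5, 8, 44  # same sizes as in the cells above
logits = torch.randn(batch, context_length, vocab_size)           # stand-in for the model output
targets = torch.randint(0, vocab_size, (batch, context_length))   # stand-in for y_batch

# Flatten (B, C, V) -> (B*C, V) and (B, C) -> (B*C,)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# The same number by hand: pick the log probability of each correct token and average.
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

is_same = torch.allclose(loss, manual_loss)
print(is_same)
```

Every position of every example contributes one term to the average, which is the "multiple token predictions per example" idea from the start of the notebook.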
