# Transformers

This notebook is here to help me refresh my understanding of the basic transformer architecture.

We want to implement the encoder part of the architecture in the [Attention Is All You Need paper](https://arxiv.org/pdf/1706.03762):




Architecture screenshot:

![](20251120024008.png)

My goal will be to go through one forward pass of a transformer layer on some data and try to explain each layer. Finally, I will convert this Jupyter notebook to Python code and train it on a simple dataset.

In [1]:
# I want this notebook to be very simple, so I will keep the data very simple, i.e. use whatever I have written so far as training data

training_data = list("""
# Transformers

this note book is here to help me refresh some of my understanding of the basic transformers architecture

we want to implement the encoder part of the architecture in [attention is all you need paper](https://arxiv.org/pdf/1706.03762):

My goal with be to go through one pass of transformer layer for a data, and try to explain each layer, finally I will convert this jupyter notebook to a python code and train it on a simple dataset

# I want this note book to be very simple so I will make the data very simple, i.e use whatever I have written till now as training data

""")

In [2]:
# I don't want to get too deep into tokenization for this notebook so I am just going to instead use all the unique characters
# present in the training data as distinct tokens
vocabulary_list = list(set(training_data))

In [3]:
print(vocabulary_list[:5])
print(len(vocabulary_list))

['s', 'I', 'v', 'T', 't']
44


In [4]:
# let's create training and testing data
# for next-token prediction, the data looks something like the following

# a transformer is trained on several next-token predictions per example sequence, one for each position
print(training_data[:9])

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f']


In [5]:
# here if x is
training_data[:8]

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's']

In [6]:
# then y would be
training_data[1:9]

['#', ' ', 'T', 'r', 'a', 'n', 's', 'f']

In [7]:
# ok, before we create the training data we need to map each token to a unique index; to do that I will do
token_to_index = {c:i for i,c in enumerate(vocabulary_list)}
index_to_token = {i:c for i,c in enumerate(vocabulary_list)}
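One thing worth noting about this mapping: `list(set(...))` has no guaranteed order, and for strings the order can change between Python runs, so the indices printed in this notebook will not necessarily reproduce. A minimal sketch of a reproducible variant, assuming we sort the vocabulary first. `encode` and `decode` are hypothetical helper names, not part of the notebook, and the text is a shortened stand-in:

```python
# Sketch: reproducible character-level vocabulary with a round-trip check.
# `encode` / `decode` are illustrative helpers, not from the notebook.
text = "# Transformers\n\nthis notebook"

# sorted() fixes the order, so the same text always yields the same indices
vocabulary = sorted(set(text))
token_to_index = {ch: i for i, ch in enumerate(vocabulary)}
index_to_token = {i: ch for ch, i in token_to_index.items()}

def encode(s):
    """Map each character to its integer index."""
    return [token_to_index[ch] for ch in s]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(index_to_token[i] for i in indices)

ids = encode("this")
print(ids)
print(decode(ids))
```

Encoding and then decoding should give back the original string, which is a cheap sanity check before building tensors on top of the mapping.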

In [8]:
# now let's convert our training data to a torch tensor
import torch

training_data_tensor = torch.tensor([token_to_index[c] for c in training_data], dtype=torch.long)

In [9]:
print(training_data_tensor[:10])
print([index_to_token[ix.item()] for ix in training_data_tensor[:10]])

tensor([14, 28, 42,  3, 16, 36, 35,  0, 43, 24])
['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f', 'o']


In [10]:
# now let's create the inputs (x) and the targets (y)
block_size = 8
num_examples = len(training_data_tensor) - block_size
x = torch.stack([training_data_tensor[ix:ix + block_size] for ix in range(num_examples)])
# the largest ix is len(training_data_tensor) - block_size - 1,
# so ix + block_size = len(training_data_tensor) - 1,
# i.e. the last x example does not include the last character
y = torch.stack([training_data_tensor[ix:ix + block_size] for ix in range(1, num_examples + 1)])



In [11]:
print("x training data")
print(x[:5])
print("y training data")
print(y[:5])

x training data
tensor([[14, 28, 42,  3, 16, 36, 35,  0],
        [28, 42,  3, 16, 36, 35,  0, 43],
        [42,  3, 16, 36, 35,  0, 43, 24],
        [ 3, 16, 36, 35,  0, 43, 24, 16],
        [16, 36, 35,  0, 43, 24, 16, 39]])
y training data
tensor([[28, 42,  3, 16, 36, 35,  0, 43],
        [42,  3, 16, 36, 35,  0, 43, 24],
        [ 3, 16, 36, 35,  0, 43, 24, 16],
        [16, 36, 35,  0, 43, 24, 16, 39],
        [36, 35,  0, 43, 24, 16, 39, 31]])
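To make the idea of several predictions per example concrete: each row of `x` paired with the same row of `y` contains `block_size` separate prediction problems, one per position. A small sketch in plain Python, using lists instead of tensors and a shorter `block_size`, both chosen here only for readability:

```python
# Sketch: how one (x, y) row expands into several context -> target pairs.
data = list("hello world")
block_size = 4  # shorter than the notebook's 8, just to keep the output small

# Same windowing as the notebook: y is x shifted by one position.
xs = [data[i:i + block_size] for i in range(len(data) - block_size)]
ys = [data[i + 1:i + 1 + block_size] for i in range(len(data) - block_size)]

# For position t, the model sees x[:t+1] and should predict y[t].
pairs = [(xs[0][:t + 1], ys[0][t]) for t in range(block_size)]
for context, target in pairs:
    print(context, "->", repr(target))
```

With masked self-attention, all of these predictions can be computed in one pass over the row, which is why a single window is worth `block_size` training signals.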


# Embedding Table

![](20251121001141.png)


This is a lookup table from a vocabulary index to an n-dimensional vector.
These vectors are trained along with the rest of the transformer, i.e. where each vector points gets updated during training,
based on how similarly the tokens are used. Say I have 2 tokens, "dog" and "pooch": at the start of the training process
they might point in very different directions, but after training both should point to pretty much the same place.

### Question?:

1. What is so special about the training process that it turns these vectors from pointing in random directions into something that actually carries meaning?
    * for now I am gonna assume the answer is that the transformer architecture expects and assumes these vectors to be what I have described
    * and the subsequent layers perform their operations based on this assumption, so optimizing the loss pushes the embedding vectors toward an actual high-dimensional representation of the words
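One way to see how these vectors get trained at all: the embedding layer is just a weight matrix indexed by token id, and backpropagation sends gradient only to the rows that were actually looked up. A minimal sketch, assuming a toy vocabulary of 5 tokens and a dummy loss (the sum of the looked-up vectors), neither of which comes from the notebook:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
table = nn.Embedding(num_embeddings=5, embedding_dim=3)
token_ids = torch.tensor([1, 3])

# A lookup is the same as indexing rows of the weight matrix.
looked_up = table(token_ids)
same_as_indexing = torch.equal(looked_up, table.weight[token_ids])

# Dummy loss: any real loss would route gradients the same way.
loss = looked_up.sum()
loss.backward()

# Only the rows that were looked up (1 and 3) receive a non-zero gradient.
rows_with_gradient = table.weight.grad.abs().sum(dim=1) > 0
print(same_as_indexing)
print(rows_with_gradient)
```

So a token's vector only moves when that token shows up in a batch, and it moves in whatever direction reduces the loss computed by the layers on top of it.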

In [12]:
from torch import nn

EMBEDDING_DIMENSION = 8
VOCAB_SIZE = len(vocabulary_list)

embeddings_table = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIMENSION)

In [13]:
# some experimentation with how the embedding table works
print(embeddings_table(torch.tensor([[0,1,2,3]], dtype=torch.long)))
# it treats each item in the tensor as an index and replaces it with the corresponding embedding vector

tensor([[[-0.3082,  0.6139,  0.7662, -0.7353, -0.3764, -0.2977,  1.7620,
          -0.3829],
         [ 0.4462,  0.3580, -0.8944, -1.2936,  2.0476,  0.2211, -0.7356,
           0.4584],
         [-0.7255, -1.0177,  1.0224,  2.4789, -0.5182,  0.0262, -0.1825,
          -1.0467],
         [-0.2908,  0.3825,  1.0557,  1.2988, -0.3164,  0.1577, -0.4467,
          -0.3897]]], grad_fn=<EmbeddingBackward0>)


I want to do a very simple forward pass so I am gonna create my forward pass batch now

In [14]:
x_batch = x[:5]
y_batch = y[:5]

A question lingers: what does this "(shifted right)" mean?

![](20251121002201.png)

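This just means that the input is offset by one position from the target output: the token at input position i is the token the model was supposed to predict at position i-1, which is exactly how `x` and `y` were built above.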

In [15]:
x_embeddings = embeddings_table(x_batch)

In [16]:
# just one example
x_embeddings[:1]

tensor([[[ 1.0096,  1.0822,  0.3278,  1.5657,  0.2624,  1.4789, -0.5911,
          -1.1504],
         [ 0.6911, -0.3817,  1.2578,  0.2299, -1.4089,  1.5734, -0.8249,
           0.8280],
         [ 2.0004,  1.1353,  0.2975,  0.0307, -0.6738,  1.5019, -0.2457,
           1.2371],
         [-0.2908,  0.3825,  1.0557,  1.2988, -0.3164,  0.1577, -0.4467,
          -0.3897],
         [ 0.3388, -0.4310,  1.0913,  1.5326, -2.7795, -0.0863, -0.8432,
           0.4986],
         [-0.7122, -1.5942, -1.7596,  0.2925,  1.0263,  1.2626,  1.3016,
          -0.9153],
         [-0.2404, -1.8147,  1.3203,  0.7086, -1.0006,  1.4210, -0.7999,
          -0.8074],
         [-0.3082,  0.6139,  0.7662, -0.7353, -0.3764, -0.2977,  1.7620,
          -0.3829]]], grad_fn=<SliceBackward0>)

# Positional Encoding

![](20251121135418.png)


From my past understanding, these are values that vary with the position of a token in the sequence, so that they encode where in the sequence the token sits.

So for each position there is an associated vector, which gets added to the original embedding vector at that position.

### Questions?:
1. Why add these vectors to the original embedding vector? Could they not be appended instead, or encoded some other way, perhaps as a new channel like we do for CNNs?
    - Ans: The idea behind adding comes from how we treat embedding vectors. You can think of the embedding vector as the original, absolute meaning of a token. Depending on whether it appears at the beginning or the end of a sentence its meaning might differ, i.e. its embedding vector might shift; that change is captured by adding the positional embedding vector.
2. Why do these need to be vectors at all? Could it not be a single number that gets added?
    - Ans: Well, a vector is a more general version of a single number. If a single number were the right approach, the expectation is that the network would train the embeddings to behave like a single number.

## Sinusoidal Encoding

![](20251121140214.png)

here d_model is the dimension of the embedding

In the original paper they used a fixed sinusoidal positional encoding. They mention that learned and fixed encodings gave nearly identical results, and that they chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones seen during training.
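For reference, the fixed encoding from the formula above can be written in a few lines. This is a minimal sketch, assuming an even `d_model`; the function name `sinusoidal_positional_encoding` is made up here and is not from the notebook or from PyTorch:

```python
import math
import torch

def sinusoidal_positional_encoding(context_length, d_model):
    """Fixed encoding: sin on even dimensions, cos on odd dimensions."""
    positions = torch.arange(context_length, dtype=torch.float32).unsqueeze(1)  # (C, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)                # the 2i values
    # 1 / 10000^(2i / d_model), computed in log space
    inv_freq = torch.exp(-math.log(10000.0) * even_dims / d_model)
    angles = positions * inv_freq                                               # (C, d_model / 2)

    encoding = torch.zeros(context_length, d_model)
    encoding[:, 0::2] = torch.sin(angles)
    encoding[:, 1::2] = torch.cos(angles)
    return encoding

pe = sinusoidal_positional_encoding(context_length=8, d_model=8)
print(pe.shape)
```

Because the result has shape (C, E), it could be added to the token embeddings in the same way as a learned table, but it has no trainable parameters and can be computed for any context length.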

### Question?:
1. But why sinusoidal encoding?

In [17]:
positional_embedding_table = nn.Embedding(block_size, EMBEDDING_DIMENSION)

In [18]:
x_pos_embeddings = positional_embedding_table(torch.arange(x_embeddings.shape[1])) # C, E

x_embeddings # B, C, E

# (B, C, E) + (C, E): PyTorch aligns the shapes starting from the right; where a dimension is missing
# it adds a new leading dimension and broadcasts along it, i.e. (C, E) -> (1, C, E) -> (B, C, E)
x_embeddings_total = x_embeddings + x_pos_embeddings
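The broadcasting step can be checked by writing it out explicitly. A small sketch with random stand-in tensors (`token_part` and `position_part` are placeholder names, not the notebook's variables):

```python
import torch

B, C, E = 5, 8, 8  # batch, context length, embedding size (same as the notebook)
token_part = torch.randn(B, C, E)
position_part = torch.randn(C, E)

# Implicit broadcasting: (B, C, E) + (C, E)
implicit = token_part + position_part

# The same thing written out: add a leading axis, then expand it across the batch
explicit = token_part + position_part.unsqueeze(0).expand(B, C, E)

print(implicit.shape, torch.equal(implicit, explicit))
```

Every example in the batch gets the same positional vectors, which is what we want since position 3 means the same thing in every sequence.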

# Self Attention Layer

![](20251123232032.png)

I will start off by explaining what this layer does at a high level, then I will dig deep into how it does this. Initially I will go over single-head self-attention,
then work out for myself, and explain, why multi-head self-attention is used.

This is the 3b1b interpretation of this layer at a high level, which I found to be the most elegant.

## The Explanation
This layer as a whole tells us how the original embedding vector should be modified so that its meaning is enriched with the context of the surrounding tokens. For example, take the sentence:

" That blue aeroplane is very dangerous "

In this example, the embedding vector for "aeroplane" initially just points to the absolute, context-free aeroplane.
The attention layer then outputs a result which, when added to the original embedding vector, nudges the aeroplane vector in a direction closer to blue and dangerous.

That is what this layer does at a high level. Now going into the detail, let's start with the equation:

![](20251123233808.png)

The above is the equation describing the self-attention mechanism. For the purpose of this exercise we will focus on masked self-attention.

Here Q, K and V are all matrices:

Q is the query matrix, K the key matrix and V the value matrix.

Let me first build this formula in our ongoing example and then explain.

In [19]:
d_k = d_q = d_v = 10

W_q = nn.Linear(EMBEDDING_DIMENSION, d_q, bias = False)
W_k = nn.Linear(EMBEDDING_DIMENSION, d_k, bias = False)
W_v = nn.Linear(EMBEDDING_DIMENSION, d_v, bias = False)


In [20]:
W_q.weight.shape

torch.Size([10, 8])

In [21]:
x_embeddings_total.shape

torch.Size([5, 8, 8])

In [22]:
W_q.weight@x_embeddings_total[0][0]

tensor([-0.7523, -1.1425, -0.1237, -0.4978, -1.3365, -1.1024,  0.6787, -0.0977,
         0.8006, -0.3372], grad_fn=<MvBackward0>)

In [23]:
W_q(x_embeddings_total)

tensor([[[-7.5233e-01, -1.1425e+00, -1.2365e-01, -4.9778e-01, -1.3365e+00,
          -1.1024e+00,  6.7866e-01, -9.7705e-02,  8.0059e-01, -3.3718e-01],
         [ 1.1098e-01, -4.9259e-01,  1.3709e+00,  2.9401e-01, -1.6201e+00,
          -4.2740e-01,  1.1961e+00,  1.3074e-01, -3.8340e-02, -4.5341e-01],
         [ 3.7432e-01, -8.8783e-02,  1.3748e+00,  3.0277e-01, -1.8223e+00,
           4.8749e-01,  1.0525e+00,  3.2254e-01, -7.8679e-01,  5.9447e-01],
         [-5.5625e-01, -9.4867e-02, -7.4363e-01,  2.6241e-01,  5.8117e-01,
           1.2421e-01, -4.3674e-01, -5.6245e-01,  6.7975e-01, -6.9114e-01],
         [-5.6364e-01, -1.2077e+00, -6.3766e-01,  6.0641e-01,  6.3862e-01,
          -9.8387e-01,  7.4406e-01,  3.5998e-02,  6.0114e-01, -1.5101e+00],
         [ 9.8003e-01, -1.3104e-01,  9.7843e-01,  5.3600e-01,  6.3538e-01,
          -1.9702e-01,  1.0436e+00,  1.2430e+00, -5.9683e-01, -3.5266e-01],
         [-1.1023e+00, -1.2548e+00, -6.0360e-01,  1.6455e-01, -6.3066e-01,
          -5.7828e-

In [24]:
# pass the existing vector through a trainable linear transformation

Q, K, V = W_q(x_embeddings_total), W_k(x_embeddings_total), W_v(x_embeddings_total)

In [25]:
print(Q.shape, K.shape, V.shape)

torch.Size([5, 8, 10]) torch.Size([5, 8, 10]) torch.Size([5, 8, 10])


In [26]:
# transpose dimensions 1 and 2 (not the batch dimension 0);
# this is equivalent to Q[i] @ K[i].T for each element i of the batch
attention_matrix = Q @ K.transpose(1, 2)

Let me try and explain what just happened above. Let's take a single example from the batch:

In [27]:
print(Q[0][:3])
print(K[0][:3])

tensor([[-0.7523, -1.1425, -0.1237, -0.4978, -1.3365, -1.1024,  0.6787, -0.0977,
          0.8006, -0.3372],
        [ 0.1110, -0.4926,  1.3709,  0.2940, -1.6201, -0.4274,  1.1961,  0.1307,
         -0.0383, -0.4534],
        [ 0.3743, -0.0888,  1.3748,  0.3028, -1.8223,  0.4875,  1.0525,  0.3225,
         -0.7868,  0.5945]], grad_fn=<SliceBackward0>)
tensor([[ 1.8104, -1.1986, -1.3065, -1.1477, -0.3196,  1.1467, -0.0293,  0.5532,
         -0.0752, -1.3992],
        [ 0.5583,  0.1986,  0.1202, -1.3505, -0.7072,  1.2815,  0.2118, -0.8598,
          0.0324, -0.5253],
        [-0.0195,  1.0557,  1.5437,  0.2219, -0.8512,  1.2104,  0.3387, -1.9492,
          0.4774, -0.3484]], grad_fn=<SliceBackward0>)


well the traditional explanation is:

When the static embedding for a word is passed through these layers, each layer extracts a specific feature pertaining to its own transformation.

let's take an example: "The bank of the River"

Query -> transforms the static embedding of bank into something like "I am a noun needing a definition; I need to know whether I am a river bank, a financial bank, etc."

Key -> transforms the static embedding of River into something like "I am a nature-related word, related to river"

Value -> transforms the embedding into the actual meaning information. Since this might not be the first attention layer, in a second layer the input won't be the absolute meaning any more.

But this never really sat right with me. I was not able to understand it fully, and it seemed pretty hand-wavy as an explanation of what these layers really do. Maybe I understand it for the Value vector, but not for the Query and Key vectors.

I need to think this through and understand it properly and clearly once and for all.
Think: what would happen if no linear layer was present?


OK, let's say no linear layer was present. Then Q*K.T would just give me the dot products between the static embeddings of the tokens (the same as cosine similarity only if the embeddings all have unit length).

let's say the attention matrix is

A

Here A[i][j] would be the dot-product similarity between the ith token and the jth token (when I say token here I mean the embedding of the token).
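A quick check on what kind of similarity this is: a raw dot product grows with vector length, and it matches cosine similarity only when both vectors have unit length. A tiny sketch in plain Python with made-up 2-d vectors:

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product divided by the two vector lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

a = [1.0, 0.0]
b = [3.0, 0.0]   # same direction as a, three times longer
c = [0.6, 0.8]   # unit length, different direction

print(dot(a, b), cosine(a, b))   # same direction: the dot product scales with length
print(dot(a, c), cosine(a, c))   # both unit length: the two measures coincide
```

This is also part of why the paper's formula divides by the square root of d_k: raw dot products get larger as the vectors get longer or higher-dimensional.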

Now let's see what A*V would do. Here I am assuming V too is just the static embedding.


A_softmax -> the softmax function applied to A across each row, so that the weights in row i sum to 1

```

V`[i] = V[0]*A_softmax[i][0] + V[1]*A_softmax[i][1] + V[2]*A_softmax[i][2] + ...

```

Here `V[i]` represents the static embedding for the ith word, so `V[i]` is definitely a vector, and ``V`[i]`` is the final vector after multiplying by the attention matrix.

We can think of ``V`[i]`` as the change, i.e. how the original word gets modified.

So the weighted sum tells us toward which static embeddings the input embedding should move, so that a single vector ends up representing the meaning it has in the whole sentence. **Now I see the problem with just using static embeddings: if we use only the static embeddings to compute the similarity, the input embedding to this layer always gets nudged toward the words in the sentence that it is already most similar to, which we don't always want.**
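The weighted sum without any Q/K/V projections can be traced by hand. A minimal sketch in plain Python, assuming three made-up 2-d "static embeddings" where the first two point in nearly the same direction (the numbers are invented for illustration):

```python
import math

def softmax(row):
    """Softmax over one row, so the weights in that row sum to 1."""
    largest = max(row)
    exps = [math.exp(v - largest) for v in row]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up static embeddings; tokens 0 and 1 are near-synonyms.
X = [[2.0, 0.0],
     [1.9, 0.3],
     [0.0, 2.0]]

# No linear layers: Q = K = V = X, so A[i][j] is the dot product of X[i] and X[j].
A = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
A_softmax = [softmax(row) for row in A]

# V'[i] = sum over j of A_softmax[i][j] * V[j]
V_new = [[sum(A_softmax[i][j] * X[j][d] for j in range(len(X))) for d in range(2)]
         for i in range(len(X))]

for i, row in enumerate(A_softmax):
    print(i, [round(w, 3) for w in row])
```

Token 2 ends up attending almost entirely to itself, while tokens 0 and 1 share their weight between each other, purely because of how their static embeddings already relate. Learned query and key projections let the model decide what counts as relevant instead.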

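Side note: one more thing I learned is that the query and key vectors usually live in a space that is smaller than the static embedding dimension (in the multi-head setup of the paper, d_k = d_model / h), so it helps computationally too. Note that the d_k = 10 used above is actually larger than the embedding dimension of 8.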






In [28]:
# each element of the batch is now a matrix of shape context length x context length
attention_matrix.shape

torch.Size([5, 8, 8])
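The remaining steps of the formula (scaling, masking, softmax, multiplying by V) could look roughly like this. It is a sketch that follows the equation shown earlier, not necessarily how this notebook continues; `x_in` is a random stand-in for `x_embeddings_total`, and the layers are re-created so the block runs on its own:

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

B, C, E, d_k = 5, 8, 8, 10          # same shapes as the notebook
x_in = torch.randn(B, C, E)         # stand-in for x_embeddings_total

W_q = nn.Linear(E, d_k, bias=False)
W_k = nn.Linear(E, d_k, bias=False)
W_v = nn.Linear(E, d_k, bias=False)
Q, K, V = W_q(x_in), W_k(x_in), W_v(x_in)

# Scaled dot-product scores: (B, C, C)
scores = Q @ K.transpose(1, 2) / math.sqrt(d_k)

# Causal mask: position i may only look at positions j <= i
causal = torch.tril(torch.ones(C, C, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))

# Softmax over the last dimension, so every row of weights sums to 1
weights = torch.softmax(scores, dim=-1)

out = weights @ V                   # (B, C, d_k)
print(weights[0, :3, :3])
print(out.shape)
```

The first row of every weight matrix puts all of its weight on position 0, since the first token has nothing earlier to attend to. That is the "masked" part of masked self-attention.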