# Transformers

This notebook is here to help me refresh my understanding of the basic transformer architecture.

We want to implement the encoder part of the architecture from the [Attention Is All You Need](https://arxiv.org/pdf/1706.03762) paper.

Architecture screenshot:

![](20251120024008.png)

My goal will be to go through one forward pass of a transformer layer on some data and try to explain each layer. Finally, I will convert this Jupyter notebook to Python code and train it on a simple dataset.

In [1]:
# I want this notebook to be very simple, so the data will be very simple too: I use whatever I have written so far as training data

training_data = list("""
# Transformers

this note book is here to help me refresh some of my understanding of the basic transformers architecture

we want to implement the encoder part of the architecture in [attention is all you need paper](https://arxiv.org/pdf/1706.03762):

My goal with be to go through one pass of transformer layer for a data, and try to explain each layer, finally I will convert this jupyter notebook to a python code and train it on a simple dataset

# I want this note book to be very simple so I will make the data very simple, i.e use whatever I have written till now as training data

""")

In [2]:
# I don't want to get deep into tokenization in this notebook, so I am just going to use every unique character
# present in the training data as a distinct token
# note: set() ordering is arbitrary, so the index assigned to each token can differ between runs
vocabulary_list = list(set(training_data))

In [3]:
print(vocabulary_list[:5])
print(len(vocabulary_list))

['a', 'x', '/', 'g', 's']
44


In [4]:
# let's build the inputs (x) and targets (y) for next-token prediction
# a single example sequence trains the model on several predictions at once:
# every position in the sequence is asked to predict the token that follows it
print(training_data[:9])

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f']


In [5]:
# here if x is
training_data[:8]

['\n', '#', ' ', 'T', 'r', 'a', 'n', 's']

In [6]:
# then y would be
training_data[1:9]

['#', ' ', 'T', 'r', 'a', 'n', 's', 'f']

In [7]:
# before we create the training data we need to map each token to a unique index (and back)
token_to_index = {c:i for i,c in enumerate(vocabulary_list)}
index_to_token = {i:c for i,c in enumerate(vocabulary_list)}

In [20]:
# now let's convert our training data to a torch tensor
import torch

training_data_tensor = torch.tensor([token_to_index[c] for c in training_data], dtype=torch.long)

In [21]:
print(training_data_tensor[:10])
print([index_to_token[ix.item()] for ix in training_data_tensor[:10]])

tensor([28,  7, 17, 30, 24,  0,  9,  4, 10, 19])
['\n', '#', ' ', 'T', 'r', 'a', 'n', 's', 'f', 'o']
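
The token/index mapping above can be wrapped into a pair of helpers. This is a minimal sketch, not part of the original notebook: `encode`, `decode`, `stoi`, `itos`, and the stand-in `text` are hypothetical names, and `sorted()` is my own addition so the indices are the same on every run.

```python
# Stand-in text; the notebook uses its own markdown as training data.
text = "hello transformers"

# sorted() gives a reproducible token order, unlike a bare set().
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}   # token -> index
itos = {i: ch for ch, i in stoi.items()}       # index -> token


def encode(s):
    """Turn a string into a list of token indices."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of token indices back into a string."""
    return "".join(itos[i] for i in ids)


ids = encode("hello")
print(ids)
print(decode(ids))
```

Decoding the encoded string should give back the original string, which is a quick sanity check that the two dictionaries are inverses of each other.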


In [22]:
# now let's create the inputs (x) and the targets (y)
block_size = 8
x = torch.stack([training_data_tensor[ix:ix+block_size] for ix in range(len(training_data_tensor)-block_size)])
# the largest ix is len(training_data_tensor) - block_size - 1,
# so the last slice stops at ix + block_size = len(training_data_tensor) - 1 (exclusive),
# i.e. the final x example never includes the last character
# y holds the same windows shifted one position to the right, so its final example ends on the last character
y = torch.stack([training_data_tensor[ix:ix+block_size] for ix in range(1, len(training_data_tensor)-block_size+1)])
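
As a cross-check on the off-by-one reasoning in the comments, the same x and y can be built from windows of length `block_size + 1` with `Tensor.unfold`. This is only a sketch: `data` is a stand-in for `training_data_tensor`, and `x_ref`, `y_ref`, `x_alt`, `y_alt` are names I made up for the comparison.

```python
import torch

data = torch.arange(20)  # stand-in for training_data_tensor
block_size = 8

# Reference construction, same slicing as the cell above.
x_ref = torch.stack([data[i:i + block_size] for i in range(len(data) - block_size)])
y_ref = torch.stack([data[i:i + block_size] for i in range(1, len(data) - block_size + 1)])

# Alternative: sliding windows of block_size + 1 tokens with step 1.
windows = data.unfold(0, block_size + 1, 1)  # shape (N - block_size, block_size + 1)
x_alt = windows[:, :-1]  # first block_size tokens of each window
y_alt = windows[:, 1:]   # same window shifted by one token

print(x_alt.shape, y_alt.shape)
print(torch.equal(x_alt, x_ref), torch.equal(y_alt, y_ref))
```

Both constructions give one row per starting position, and every target row is its input row moved forward by one token.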



In [23]:
print("x training data")
print(x[:5])
print("y training data")
print(y[:5])

x training data
tensor([[28,  7, 17, 30, 24,  0,  9,  4],
        [ 7, 17, 30, 24,  0,  9,  4, 10],
        [17, 30, 24,  0,  9,  4, 10, 19],
        [30, 24,  0,  9,  4, 10, 19, 24],
        [24,  0,  9,  4, 10, 19, 24, 43]])
y training data
tensor([[ 7, 17, 30, 24,  0,  9,  4, 10],
        [17, 30, 24,  0,  9,  4, 10, 19],
        [30, 24,  0,  9,  4, 10, 19, 24],
        [24,  0,  9,  4, 10, 19, 24, 43],
        [ 0,  9,  4, 10, 19, 24, 43, 12]])
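
To make the "one example trains several predictions" idea concrete, here is a small sketch that unrolls the first row of x and y printed above into its individual (context, target) pairs. It assumes the model only looks at earlier positions (causal masking), which this notebook has not covered yet; `x_row`, `y_row`, and `pairs` are names used only for this illustration.

```python
import torch

# First row of x and y from the output above.
x_row = torch.tensor([28, 7, 17, 30, 24, 0, 9, 4])
y_row = torch.tensor([7, 17, 30, 24, 0, 9, 4, 10])

pairs = []
for t in range(len(x_row)):
    context = x_row[: t + 1].tolist()  # everything up to and including position t
    target = y_row[t].item()           # the token that follows the context
    pairs.append((context, target))
    print(f"context {context} -> target {target}")
```

A single row of length `block_size` therefore contains `block_size` training signals, one per position.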


# Embedding Table

![](20251121001141.png)


This is a lookup table from a vocabulary index to an n-dimensional vector.
These vectors are trained together with the rest of the transformer, i.e. where they point gets updated during training.
Say I have two tokens, "dog" and "pooch": at the start of training they might point in very different directions,
but after training both should point to pretty much the same place, because they are used in similar contexts.

### Question:

1. What is so special about the training process that takes these vectors from pointing in random directions to actually carrying some meaning?
    * For now I am gonna assume the answer is that the transformer architecture expects these vectors to be what I have described.
    * The subsequent layers operate on that assumption, and the embedding table is just another trainable parameter, so optimizing the loss pushes these vectors toward useful high-dimensional representations of the tokens.
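
One way to see the mechanics behind this: the embedding table receives gradients from the loss like any other weight, and only the rows that were actually looked up get a non-zero gradient. The sketch below is a toy setup, not the notebook's model: the single `nn.Linear` called `head` is a hypothetical stand-in for everything that sits on top of the embeddings, and the sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)

emb = nn.Embedding(5, 4)  # toy vocabulary of 5 tokens, 4-dimensional vectors
head = nn.Linear(4, 5)    # hypothetical stand-in for the rest of the model

tokens = torch.tensor([1, 3])   # input tokens
targets = torch.tensor([3, 1])  # tokens the model should predict

logits = head(emb(tokens))
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()

# One gradient norm per vocabulary row: only rows 1 and 3 were used.
grad_norm_per_row = emb.weight.grad.norm(dim=1)
print(grad_norm_per_row)
```

An optimizer step would then move only the vectors for tokens 1 and 3, in whatever direction lowers the loss. Repeating this over many contexts is what gradually gives the vectors their structure.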

In [24]:
from torch import nn

EMBEDDING_DIMENSION = 8
VOCAB_SIZE = len(vocabulary_list)

embeddings_table = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIMENSION)

In [32]:
# some experimentation with how the embedding table works
print(embeddings_table(torch.tensor([[0,1,2,3]], dtype=torch.long)))
# it treats every item in the tensor as an index and replaces it with the corresponding embedding vector

tensor([[[-1.0041,  1.3954,  1.3023,  0.2720,  0.1522, -0.4277,  1.7049,
           0.5876],
         [-0.2477,  0.1334, -1.4424, -0.4144, -0.7742, -1.4339, -1.8616,
           0.3179],
         [-1.3096, -0.3020, -1.8612,  0.4496, -1.0272,  0.9990,  0.6638,
          -0.5372],
         [-0.9387, -0.8389,  1.1444, -0.8731, -0.7659,  0.3666, -0.4411,
           1.5139]]], grad_fn=<EmbeddingBackward0>)
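
The lookup can also be written in two other equivalent ways: indexing the weight matrix directly, or multiplying a one-hot encoding of the indices by the weight matrix. This is a small sketch with a freshly initialized table, so the numbers will not match the output above; `looked_up`, `indexed`, and `via_matmul` are illustrative names.

```python
import torch
from torch import nn

emb = nn.Embedding(44, 8)  # same sizes as VOCAB_SIZE and EMBEDDING_DIMENSION above
idx = torch.tensor([[0, 1, 2, 3]])

looked_up = emb(idx)        # the usual call, shape (1, 4, 8)
indexed = emb.weight[idx]   # plain row indexing into the weight matrix

# One-hot rows select exactly one row of the weight matrix each.
one_hot = nn.functional.one_hot(idx, num_classes=44).float()  # shape (1, 4, 44)
via_matmul = one_hot @ emb.weight                             # shape (1, 4, 8)

print(torch.equal(looked_up, indexed))
print(torch.allclose(looked_up, via_matmul))
```

The one-hot view is useful for intuition (an embedding is a linear layer applied to a one-hot input), while the lookup is what you want in practice because it avoids building the large one-hot tensor.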


I want to do a very simple forward pass so I am gonna create my forward pass batch now

In [33]:
x_batch = x[:5]
y_batch = y[:5]

One question lingers: what does "(shifted right)" mean here?

![](20251121002201.png)

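This just means that the input is the target sequence shifted one position to the right: the token fed in at each position is the one that comes just before the token it has to predict. In our batch, each row of `x_batch` is the matching row of `y_batch` moved one step to the right.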
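
A quick check of that relationship on the first row of x and y shown earlier. This is a sketch under one assumption about this notebook's setup: the first input token plays the role that a start-of-sequence token plays in the paper. `shifted` is a name used only here.

```python
import torch

# First row of x and y from the earlier output.
x_row = torch.tensor([28, 7, 17, 30, 24, 0, 9, 4])
y_row = torch.tensor([7, 17, 30, 24, 0, 9, 4, 10])

# Shift the targets right by one: drop the last target and
# put the token that preceded the sequence in front.
shifted = torch.cat([x_row[:1], y_row[:-1]])

print(shifted)
print(torch.equal(shifted, x_row))
```

So feeding `x_row` into the model is the same as feeding the targets after moving them one slot to the right.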

In [36]:
x_embeddings = embeddings_table(x_batch)

In [39]:
# just one example
x_embeddings[:1]

tensor([[[ 1.8780,  0.5829, -0.0916, -0.4102,  1.2541,  0.4659,  1.2850,
          -0.0662],
         [ 0.2735, -0.7849, -0.1814, -1.8961, -0.8575,  0.3070,  1.3578,
           0.1566],
         [ 0.6504,  0.3611,  0.1434, -1.4111, -1.0218,  0.6670,  0.8024,
           0.7340],
         [ 1.3945, -0.6464, -1.2267,  0.8978,  0.1750,  0.6064, -0.5378,
           1.0170],
         [ 2.6272,  1.0550, -1.9216,  0.3156,  1.0205, -1.6640,  0.5263,
           0.2661],
         [-1.0041,  1.3954,  1.3023,  0.2720,  0.1522, -0.4277,  1.7049,
           0.5876],
         [ 2.1026,  0.2747,  1.7722,  0.4238,  0.7923, -0.0490,  0.1928,
           0.1430],
         [ 0.3744, -0.5063, -1.5819,  0.2093, -1.6286, -0.2540,  0.7968,
          -1.4127]]], grad_fn=<SliceBackward0>)
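
It helps to keep the shapes in mind before moving on: the embedding step turns a `(batch, block_size)` tensor of indices into a `(batch, block_size, embedding_dim)` tensor of vectors. A minimal sketch with random indices standing in for `x_batch` (the names `batch` and `out` are placeholders):

```python
import torch
from torch import nn

emb = nn.Embedding(44, 8)              # VOCAB_SIZE=44, EMBEDDING_DIMENSION=8
batch = torch.randint(0, 44, (5, 8))   # stand-in for x_batch: 5 examples, block_size 8

out = emb(batch)
print(batch.shape)  # torch.Size([5, 8])
print(out.shape)    # torch.Size([5, 8, 8])
```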

# Positional Encoding [Next]