### Transformer Model

*Note: This is a character-based transformer.*

This is the final version of our transformer model, which we built following the `Andrej Karpathy` guide. The goal is to build each of the components that together make up a transformer.

In addition to this notebook, we have experimented with bigram models and some of the underlying math (in other notebooks):

* Models development: https://colab.research.google.com/drive/13e0Gd4WGlrrcgKF2vbQTY4SG13dwTi3X#scrollTo=oEtJw0HyLU3z

* Math insight: https://colab.research.google.com/drive/10k3GpHnVxSTkPY4jlaQn5IAE3-w8Kndj

Paper which this model is based on: https://arxiv.org/abs/1706.03762

### Import Packages

In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F

### Model Parameters
Set up the hyperparameters as global constants.

In [13]:
# Hyperparameters
BATCH_SIZE = 64 # Num of independent sequences to process in parallel
BLOCK_SIZE = 256 # Max context for predictions

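EPOCHS = 2500 # Num of training iterations (one random batch each, not full passes over the data)

EVAL_INTERVAL = 250 # Evaluate the loss every this many iterations
EVAL_ITERS = 200 # Num of batches averaged per loss estimate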

NUM_EMBEDDINGS = 384
LEARNING_RATE = 3e-4
NUM_HEADS = 6
NUM_LAYERS = 6
DROPOUT_P = 0.2

### Device Agnostic Code

We want the code to run on any device without modification, so we set up device-agnostic code: it runs on CUDA if available, otherwise on the CPU.

In [14]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

# Data & Pre-processing

### Download the Shakespeare Dataset

1. We are going to download the `.txt` file from the karpathy repo.
2. We are going to read it and store its contents in a variable.

In [15]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-12-05 12:48:23--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2025-12-05 12:48:23 (30.4 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [16]:
with open(file="/content/input.txt", mode="r", encoding="utf-8") as f:
  text = f.read()

print(f"Length of dataset in characters: {len(text)}\n")
print(f"First 100 characters of the dataset:\n{text[:100]}")

Length of dataset in characters: 1115394

First 100 characters of the dataset:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Build vocabulary based on characters (Character-level)

These are the characters that the model can see or generate.

In [17]:
# Characters
characters = sorted(list(set(text)))
# Vocabulary size
vocab_size = len(characters)

print(f"Characters: {''.join(characters)}")
print(f"Vocab size: {vocab_size}")

Characters: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocab size: 65


### Character level lookup tables

This is a simple tokenizer for our characters/vocab. We will have one lookup table to transform characters into tokens/ints and another for the reverse process (transforming tokens/ints back into characters/text).

In [18]:
# We create the lookup tables (character-based)
string_to_id = { char:idx for idx, char in enumerate(characters)}
id_to_string = { idx:char for idx, char in enumerate(characters)}

# Create the functions to look characters up in the lookup tables
encode = lambda sentence: [string_to_id[char] for char in sentence]
# Use `idx` rather than `id` to avoid shadowing the Python builtin
decode = lambda id_list: "".join(id_to_string[idx] for idx in id_list)

In [19]:
# Testing our tokenizer
print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Tokenize the entire text

We are going to encode the whole dataset and transform it into a tensor.

In [20]:
data = torch.tensor(encode(text), dtype=torch.long)

print(f"Data tensor shape: {data.shape, data.dtype}")
print(f"First 100 encodes: {data[:100]}")

Data tensor shape: (torch.Size([1115394]), torch.int64)
First 100 encodes: tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


### Split data between training & validation set

This will allow us to train our model and later test it with data it hasn't seen yet. We are going to do it manually (90% for training).

In [21]:
# Get the index where we need to "cut"
n = int(0.9 * len(data))
# Training set
train_data = data[:n]
# Validation set
val_data = data[n:]

print(f"Training length: {len(train_data)}")
print(f"Validation length: {len(val_data)}")

Training length: 1003854
Validation length: 111540


### Context size (Block size)

We want to sample chunks of the text. Within each chunk, the model learns to predict what should come next for every context length, from a single character up to `BLOCK_SIZE` characters.

In [22]:
from typing import Literal

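def get_batch(split: Literal["train", "val"]) -> tuple[torch.Tensor, torch.Tensor]:
  """
  Generate a small batch of data of inputs x and targets y.
  It first samples BATCH_SIZE random start positions in the dataset, and then extracts a BLOCK_SIZE-long chunk at each of them.

  Args:
    split ("train" | "val"): Which data split to use (anything other than "train" selects the validation set)
  Returns:
    x -> Tensor of shape (BATCH_SIZE, BLOCK_SIZE) with the input tokens
    y -> Tensor of the same shape with the targets (the token that follows each input position)
  """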
  data = train_data if split == "train" else val_data
  # Sample random parts
  ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE,))
  # Get the actual data for that random sample
  x = torch.stack([data[i:i+BLOCK_SIZE] for i in ix])
  y = torch.stack([data[i+1:i+BLOCK_SIZE+1] for i in ix])
  x, y = x.to(device), y.to(device)

  return x, y

In [23]:
xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

inputs:
torch.Size([64, 256])
tensor([[ 1, 52, 43,  ..., 63,  1, 57],
        [ 1, 55, 59,  ..., 28, 53, 51],
        [15, 39, 54,  ..., 52,  1, 21],
        ...,
        [40, 59, 58,  ..., 43, 52,  1],
        [ 1, 41, 53,  ..., 63,  6,  0],
        [59, 58,  1,  ..., 58, 47, 53]], device='cuda:0')
targets:
torch.Size([64, 256])
tensor([[52, 43, 60,  ...,  1, 57, 54],
        [55, 59, 53,  ..., 53, 51, 44],
        [39, 54, 43,  ...,  1, 21,  1],
        ...,
        [59, 58,  6,  ..., 52,  1, 50],
        [41, 53, 61,  ...,  6,  0, 14],
        [58,  1, 58,  ..., 47, 53,  2]], device='cuda:0')
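
To make the relationship between the inputs and the targets concrete, here is a minimal sketch on a toy sequence. The names `toy_data` and `block_size` and the value 8 are assumptions chosen for illustration (the notebook itself uses `BLOCK_SIZE = 256`). It shows that one chunk of `block_size + 1` tokens yields `block_size` training examples, one per context length.

```python
# Toy illustration; `toy_data` and `block_size` are made-up names for this example.
toy_data = [18, 47, 56, 57, 58, 1, 15, 47, 58]  # first 9 token ids of the encoded dataset
block_size = 8

x = toy_data[:block_size]        # inputs
y = toy_data[1:block_size + 1]   # targets: the token that follows each input position

examples = []
for t in range(block_size):
  context = x[:t + 1]  # everything the model is allowed to look at
  target = y[t]        # the token it should predict next
  examples.append((context, target))
  print(f"when input is {context} the target is {target}")
```

This is why `get_batch` returns `y` with the same shape as `x`: every position in a row of `x` has its own target in `y`.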


# Transformer Architecture

After processing all the data, in this section we will build the transformer blocks:

* Attention Head
* Multi-Head Attention
* FeedForward (FF block)
* Layer Norm
* Transformer

### Attention (Head & Multi-Head)

This section focuses on implementing the core attention mechanisms. It defines the `Head` class, which represents a single self-attention head, responsible for computing queries, keys, and values, and then calculating attention scores. It also includes the `MultiHeadAttention` class, which combines several Head instances to allow the model to attend to different parts of the input simultaneously, enhancing its ability to capture diverse relationships within the data.

In [24]:
class Head(nn.Module):
  """
  Self-Attention head
  """
  def __init__(self, head_size: int):
    """
    Initializes the head class
    """
    super().__init__()
    # Represents "what I contain" — the features others can look at
    self.key = nn.Linear(in_features=NUM_EMBEDDINGS,
                         out_features=head_size,
                         bias=False)
    # Represents "what I'm looking for" — used to score compatibility with keys
    self.query = nn.Linear(in_features=NUM_EMBEDDINGS,
                         out_features=head_size,
                         bias=False)
    # Represents "what I will send" — the information that gets aggregated
    self.value = nn.Linear(in_features=NUM_EMBEDDINGS,
                         out_features=head_size,
                         bias=False)
    self.register_buffer("tril", torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE)))
    # Dropout layer
    self.dropout = nn.Dropout(p=DROPOUT_P)

  def forward(self, x):
    """
    Forward pass for Head model
    """
    # Batch, Time, Channels
    B, T, C = x.shape
    # Create key and query
    k = self.key(x) # (B, T, head_size)
    q = self.query(x) # (B, T, hs)


    # Scaled dot product of queries and keys gives the attention scores ("affinities")
    wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
    # Causal mask: a position cannot attend to positions that come after it
    wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
    wei = F.softmax(wei, dim=-1) # (B, T, T)
    wei = self.dropout(wei)

    # Perform the weighted aggregation of the values
    v = self.value(x) # (B, T, hs)
    out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
    return out
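
The masking step in `Head.forward` can be seen in isolation with a small sketch. The sequence length `T = 4` and the all-zero `scores` are assumptions picked so the result is easy to read; they are not values from the notebook. With equal scores, each position ends up averaging uniformly over itself and the positions before it, and gives zero weight to later positions.

```python
import torch
from torch.nn import functional as F

T = 4  # hypothetical short sequence length, for illustration only
scores = torch.zeros(T, T)            # pretend every query-key score is equal
tril = torch.tril(torch.ones(T, T))   # lower-triangular causal mask
masked = scores.masked_fill(tril == 0, float("-inf"))  # hide future positions
weights = F.softmax(masked, dim=-1)   # each row sums to 1 over the allowed positions
print(weights)
```

Row `t` of `weights` has `t + 1` non-zero entries, each equal to `1 / (t + 1)`. In the real model the scores are not equal, so the softmax produces a learned, data-dependent weighting instead of a plain average.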

In [25]:
class MultiHeadAttention(nn.Module):
  """
  Multi-Head Attention class
  """
  def __init__(self, num_heads: int, head_size: int):
    """
    Initializes MultiHeadAttention class
    """
    super().__init__()
    self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
    self.projection = nn.Linear(in_features=NUM_EMBEDDINGS,
                                out_features=NUM_EMBEDDINGS)

  def forward(self, x):
    """
    Forward pass for Multi-Head attention
    """
    out = torch.cat([h(x) for h in self.heads], dim=-1)
    return self.projection(out)

### Feed Forward (FF Block)

This layer comes after self-attention (every self-attention layer is followed by one). Put simply, it lets each token process, on its own, the information it gathered in the attention block before moving on to the next layer.

In [26]:
class FeedFoward(nn.Module):
  """
  Simple linear layer followed by a non-linearity
  """
  def __init__(self, n_embd):
    super().__init__()
    self.net = nn.Sequential(
        # Multiply by 4 following the paper (to grow the layer)
        nn.Linear(in_features=n_embd, out_features=4*n_embd),
        nn.ReLU(),
        nn.Linear(in_features=4*n_embd, out_features=n_embd),
        nn.Dropout(p=DROPOUT_P),
    )

  def forward(self, x):
    """
    Forward pass for FeedForward
    """
    return self.net(x)


### Transformer Block

Transformer block without the encoder-decoder attention (we are skipping that for now, as this is a GPT-like decoder-only model).

In [27]:
class Block(nn.Module):
  """
  Transformer block module
  """
  def __init__(self, num_embedding: int, n_head: int):
    """
    Initializes the Block class
    """
    super().__init__()
    head_size = num_embedding // n_head
    self.self_attention = MultiHeadAttention(num_heads=n_head,
                                             head_size=head_size)
    self.ff = FeedFoward(num_embedding)
    # Layer norms
    self.ln1 = nn.LayerNorm(num_embedding)
    self.ln2 = nn.LayerNorm(num_embedding)

  def forward(self, x):
    """
    Forward pass for Block module
    """
    # With residual/skip connection
    x = x + self.self_attention(self.ln1(x))
    x = x + self.ff(self.ln2(x))
    return x

### Model Class

This section defines the main `LanguageModel` class, which brings together all the previously defined components to form the complete transformer decoder model (similar to a GPT architecture). It includes token and positional embeddings, a sequence of transformer `Blocks` (which encapsulate multi-head attention and feed-forward layers), a final layer normalization, and a linear head for outputting the vocabulary logits. This class also contains the forward method for training and a generate method for sampling new text from the trained model.

In [28]:
class LanguageModel(nn.Module):
  """
  Shakespeare Language Model (Decoder)
  """
  def __init__(self):
    """
    Initializes the LanguageModel module
    """
    super().__init__()
    # Each token id is mapped to a learned embedding vector of size NUM_EMBEDDINGS
    self.token_embedding_table = nn.Embedding(vocab_size, NUM_EMBEDDINGS)
    self.position_embedding_table = nn.Embedding(BLOCK_SIZE, NUM_EMBEDDINGS)
    # Create the blocks
    self.blocks = nn.Sequential(*[Block(num_embedding=NUM_EMBEDDINGS,
                                        n_head=NUM_HEADS) for _ in range(NUM_LAYERS)])
    # Add extra layer norm
    self.layer_norm = nn.LayerNorm(NUM_EMBEDDINGS) # Final layer norm
    # Layer that outputs the vocab_size
    self.lm_head = nn.Linear(in_features=NUM_EMBEDDINGS,
                             out_features=vocab_size)
    # Weights initialization
    self.apply(self._init_weights)

  def _init_weights(self, module):
    if isinstance(module, nn.Linear):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
      if module.bias is not None:
        torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
      torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

  def forward(self, idx, targets=None):
    """
    Performs the forward pass of the LanguageModel
    """
    B, T = idx.shape

    # Token embedding: (B, T, C) -> (batch size, sequence length, NUM_EMBEDDINGS)
    token_embedding = self.token_embedding_table(idx)
    # Positional embedding (created on the same device as idx)
    pos_embedding = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T, C)

    # Hold the token identity + the position where it occurs
    x = token_embedding + pos_embedding # (B, T, C)

    # Process x
    x = self.blocks(x) # (B, T, C)
    x = self.layer_norm(x) # (B, T, C)
    logits = self.lm_head(x) # (B, T, vocab_size)

    if targets is None:
      loss = None
    else:
      # Flatten to (B*T, C), the shape F.cross_entropy expects (here C is vocab_size)
      B, T, C = logits.shape
      # Stretch the array to make it 2-dimensional
      logits = logits.view(B*T, C)
      # We have to do the same for the targets
      targets = targets.view(B*T)
      # Calculate loss
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  def generate(self, idx, max_new_tokens):
    """
    Generate predictions

    Args:
      idx: (B, T) tensor of token indices in the current context
      max_new_tokens (int): Max amount of tokens to generate
    """
    for _ in range(max_new_tokens):
      # crop idx to the last block_size tokens
      idx_cond = idx[:, -BLOCK_SIZE:]
      # Get the prediction
      logits, loss = self(idx_cond)
      # We only want to focus on the last time step
      logits = logits[:, -1, :] # (B, C)
      # Apply softmax to the logits -> Get probabilities
      probs = F.softmax(logits, dim=-1) # (B, C)
      # Sample from the distribution
      idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
      # Append sampled index to current sequence
      idx = torch.cat([idx, idx_next], dim=1) # (B, T+1)

    return idx

# Training & Evaluating the Model

We will be using Google Colab with a T4 GPU to train this model.

### Model Instance

In [29]:
model = LanguageModel().to(device)
model

LanguageModel(
  (token_embedding_table): Embedding(65, 384)
  (position_embedding_table): Embedding(256, 384)
  (blocks): Sequential(
    (0): Block(
      (self_attention): MultiHeadAttention(
        (heads): ModuleList(
          (0-5): 6 x Head(
            (key): Linear(in_features=384, out_features=64, bias=False)
            (query): Linear(in_features=384, out_features=64, bias=False)
            (value): Linear(in_features=384, out_features=64, bias=False)
            (dropout): Dropout(p=0.2, inplace=False)
          )
        )
        (projection): Linear(in_features=384, out_features=384, bias=True)
      )
      (ff): FeedFoward(
        (net): Sequential(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): ReLU()
          (2): Linear(in_features=1536, out_features=384, bias=True)
          (3): Dropout(p=0.2, inplace=False)
        )
      )
      (ln1): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((384,),

In [30]:
model_parameters_count = sum(p.numel() for p in model.parameters())/1e6
print(f"Language Model parameters count: {model_parameters_count:.2f}M")

Language Model parameters count: 10.79M
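
The reported count can be cross-checked by hand from the layer shapes printed above. The sketch below recomputes it with plain arithmetic; the lowercase variable names are introduced here for illustration and simply mirror the notebook's hyperparameters. The `tril` buffer is not included because buffers are not parameters.

```python
# Values copied from the hyperparameter cell and the vocabulary output above
vocab_size, n_embd, block_size, n_head, n_layer = 65, 384, 256, 6, 6
head_size = n_embd // n_head  # 64

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
attention = n_head * 3 * n_embd * head_size             # key/query/value per head, no bias
projection = n_embd * n_embd + n_embd                   # weight + bias
feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                          # ln1 + ln2, weight + bias each
per_block = attention + projection + feed_forward + layer_norms

final_norm = 2 * n_embd
lm_head = n_embd * vocab_size + vocab_size

total = embeddings + n_layer * per_block + final_norm + lm_head
print(f"{total:,} parameters ({total / 1e6:.2f}M)")  # 10,788,929 parameters (10.79M)
```

Almost all of the parameters sit in the six blocks, and within each block the feed-forward layers account for roughly two thirds of them.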


### Optimizer

Initialize the AdamW optimizer.

In [31]:
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

### Helper function to evaluate the loss

* Turns off gradients
* Evaluates model in eval() mode
* Runs several mini-batches from both train and val
* Averages their loss to reduce noise
* Puts model back into training mode
* Returns train & val loss estimates

In [32]:
# Decorator to disable gradient tracking
@torch.no_grad()
def estimate_loss():
  """
  Computes an average loss for train and validation sets without updating the model.
  """
  # Output dict containing "train" and "val"
  out = {}
  # Set model into evaluation mode
  model.eval()
  # Looping for both datasets
  for split in ["train", "val"]:
    losses = torch.zeros(EVAL_ITERS)

    for k in range(EVAL_ITERS):
      X, Y = get_batch(split)
      logits, loss = model(X, Y)
      losses[k] = loss.item()
    out[split] = losses.mean()

  # Set the model back to training before returning out
  model.train()
  return out

### Training Loop

We will train our model using the training loop, following these steps:

1. Forward propagation.
2. Calculate the loss.
3. Zero the gradients (`zero_grad`).
4. Backpropagation.
5. Optimizer step.

In [33]:
# Run the training loop for EPOCHS iterations
for epoch in range(EPOCHS):
  # Every once in a while evaluate the loss on train and val sets
  if epoch % EVAL_INTERVAL == 0 or epoch == EPOCHS - 1:
    losses = estimate_loss()
    print(f"Step {epoch}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

  # Get a batch of data
  xb, yb = get_batch(split="train")
  # 1-2. Forward pass + loss calculation
  logits, loss = model(xb, yb)
  # 3. Zero the gradients
  optimizer.zero_grad(set_to_none=True)
  # 4. Backpropagation
  loss.backward()
  # 5. Optimizer step
  optimizer.step()

Step 0: train loss 4.2325, val loss 4.2290
Step 250: train loss 2.3099, val loss 2.3488
Step 500: train loss 1.7006, val loss 1.8510
Step 750: train loss 1.4513, val loss 1.6621
Step 1000: train loss 1.3312, val loss 1.5614
Step 1250: train loss 1.2588, val loss 1.5283
Step 1500: train loss 1.2043, val loss 1.5166
Step 1750: train loss 1.1471, val loss 1.5071
Step 2000: train loss 1.1006, val loss 1.4966
Step 2250: train loss 1.0568, val loss 1.5104
Step 2499: train loss 1.0135, val loss 1.5259
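
As a quick sanity check on the first line of the log, the loss at step 0 can be compared with the loss of a model that guesses uniformly over the vocabulary, which is the natural log of the vocabulary size. This is a sketch of that arithmetic only; `uniform_loss` is a name introduced here.

```python
import math

vocab_size = 65  # from the vocabulary cell above
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess over the vocabulary
print(f"Expected loss for a uniform guess: {uniform_loss:.4f}")  # 4.1744
```

The logged step 0 losses (4.2325 and 4.2290) are close to this value, which is what one would expect from a freshly initialized model that has not learned anything yet.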


*Note: The number of training iterations was probably too high. The train loss kept decreasing, but the val loss bottomed out around step 2000 (1.4966) and then rose to 1.5259 by the final step, which suggests the model is starting to overfit the training data.
Should we train for fewer iterations, or how else could we improve this?*
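
One common answer to the question above is to keep the checkpoint with the lowest validation loss and stop once it has not improved for a few evaluations. The sketch below replays the validation losses from the log to show where such a rule would have stopped. The function `find_stopping_point` and the `patience` value are hypothetical, not part of the notebook.

```python
# Validation losses copied from the training log above: (step, val_loss)
val_log = [(0, 4.2290), (250, 2.3488), (500, 1.8510), (750, 1.6621),
           (1000, 1.5614), (1250, 1.5283), (1500, 1.5166), (1750, 1.5071),
           (2000, 1.4966), (2250, 1.5104), (2499, 1.5259)]

def find_stopping_point(log, patience=2):
  """Return (best_step, best_loss, stop_step) for a simple patience rule."""
  best_step, best_loss, bad_evals = None, float("inf"), 0
  for step, loss in log:
    if loss < best_loss:
      # New best: remember it and reset the counter
      best_step, best_loss, bad_evals = step, loss, 0
    else:
      bad_evals += 1
      if bad_evals >= patience:
        return best_step, best_loss, step
  return best_step, best_loss, None  # the rule never triggered

best_step, best_loss, stop_step = find_stopping_point(val_log)
print(f"best val loss {best_loss:.4f} at step {best_step}; stop at step {stop_step}")
```

In the real training loop, the same idea would mean saving `model.state_dict()` whenever the validation loss improves, so the step 2000 weights are kept rather than the final ones. Other options that may help, but are untested here, include a higher dropout rate or a smaller model.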

### Making Predictions with our Model

We are going to generate a sequence of 500 tokens to see how well our model is generating the Shakespeare text.

In [34]:
# Generate from our LanguageModel
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# Generating & printing 500 generated tokens
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))


KING RICHARD II:
What, is my frescrophect?

KING RICHARD III:
Norfolk?

ENORTHUMBERLAND:
Yea, say you will tay amend.

NORTHUMBERLAND:
You do him deliver
That thought have your daughter's womb!
It wasted for this disdague hell hence to the will,
For my husband wait your hight skill such is always.
What is't the nothing boy?

CAMILLO:
Old comes my stearves.

LEONTES:
You are upon that in my colderatal tale!

CAMILLO:
See my lord,
Lady did not your honour law, that they vay more
Be in my sword: 'v


### Save Model Locally

We are going to save our LanguageModel so we can use it later.

In [35]:
torch.save(model.state_dict(), "/content/transformer_model.pth")
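
To use the saved weights later, the state dict has to be loaded into a model with the same architecture. The sketch below shows the round trip on a small stand-in module so that it runs on its own; `toy_model`, `restored` and the temporary file path are assumptions for illustration. In the notebook the stand-in would be `LanguageModel()` and the path would be the one used above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in module so the example is self-contained (in the notebook: LanguageModel())
toy_model = nn.Linear(4, 2)
path = os.path.join(tempfile.gettempdir(), "toy_model.pth")  # hypothetical path
torch.save(toy_model.state_dict(), path)

device = "cuda" if torch.cuda.is_available() else "cpu"
restored = nn.Linear(4, 2)                     # must match the saved architecture
state = torch.load(path, map_location=device)  # remap tensors to the available device
restored.load_state_dict(state)
restored.to(device)
restored.eval()                                # disable dropout for inference
```

Passing `map_location` lets weights saved on a GPU be loaded on a CPU-only machine, and calling `eval()` matters for this model because it uses dropout.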