## Imports

Libraries needed for model building. Tokenization is character-level and is built by hand in a later cell, so no external tokenizer library is required.


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"

## Dataset Download

Downloads the Tiny Shakespeare text dataset (about 1.1 MB of plain text) for training the language model, then reads the file into memory as a single string.

In [3]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-12-23 11:05:51--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-12-23 11:05:52 (39.6 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [4]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [5]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


## Vocabulary Construction and Numerical Encoding

In this cell, we construct a **character-level vocabulary** by extracting all unique characters from the text dataset. Each character is mapped to a unique integer index using lookup tables (`stoi` and `itos`).  
Encoding and decoding functions are defined to convert text into numerical form and back.  
Finally, the entire dataset is encoded and stored as a PyTorch tensor, making it ready for training a language model.


In [6]:
chars = sorted(list(set(text)))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
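
To see the mapping in action, the sketch below rebuilds the same kind of lookup tables on a tiny stand-in string and checks that decoding reverses encoding. The names `sample`, `toy_stoi`, `toy_itos`, `toy_encode`, and `toy_decode` are hypothetical and exist only for this illustration.

```python
# Toy stand-in for the full dataset; `sample` is a hypothetical example string.
sample = "hello world"

# Same construction as the cell above, applied to the toy text.
toy_chars = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

def toy_encode(s):
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)        # unique characters in sorted order
print(ids)              # one integer index per character
print(toy_decode(ids))  # round-trips to the original string
```

Because `encode` indexes directly into `stoi`, any character that never appears in the training text raises a `KeyError`; prompts passed to the model later must use only characters from the dataset.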


## Hyperparameters and Device Setup

This cell defines key **hyperparameters** and selects the computation device:

- **`batch_size`**: Number of sequences processed in parallel (64)  
- **`block_size`**: Length of input sequences or context window (128)  
- **`embed_dim`**: Dimension of token embeddings (256)  
- **`num_heads`**: Number of attention heads in the transformer (8)  
- **`num_layers`**: Number of transformer layers (6)  
- **`dropout`**: Dropout rate for regularization (0.2)  
- **`lr`**: Learning rate for the optimizer (0.0003)  
- **`device`**: Automatically uses GPU if available, otherwise CPU


In [7]:
batch_size = 64
block_size = 128     # context window
embed_dim = 256
num_heads = 8
num_layers = 6
dropout = 0.2
lr = 3e-4
device = "cuda" if torch.cuda.is_available() else "cpu"


## Batch Generation

Defines a function `get_batch()` to create training batches:

- Randomly selects `batch_size` starting indices from the dataset.  
- **Input (`x`)**: sequences of length `block_size`.  
- **Target (`y`)**: the same sequences shifted by one character (next-token prediction).  
- Moves both tensors to the selected computation device (CPU or GPU).  

This function is used to feed data to the model during training.


In [8]:
def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)
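
The one-position shift between inputs and targets is easier to see on toy data. The following is a minimal sketch that uses `torch.arange(20)` as a stand-in for the encoded corpus and small, hypothetical values for the block and batch sizes; the indexing mirrors `get_batch()` without the device transfer.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sampled offsets are repeatable

# Toy values; the notebook uses the encoded corpus, block_size=128, batch_size=64.
toy_data = torch.arange(20)
toy_block_size = 4
toy_batch_size = 3

# Same indexing as get_batch(), without the device transfer.
ix = torch.randint(len(toy_data) - toy_block_size, (toy_batch_size,))
x = torch.stack([toy_data[i:i + toy_block_size] for i in ix])
y = torch.stack([toy_data[i + 1:i + toy_block_size + 1] for i in ix])

print(x)  # each row is a window of consecutive token ids
print(y)  # the same windows moved one position to the right
```

Since the toy data is a simple counter, every target equals its input plus one, which makes the alignment `y[b, t] == data[ix[b] + t + 1]` visible at a glance.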


## Self-Attention Head

Defines a single **self-attention head**, which is the core component of the transformer model:

- **Initialization (`__init__`)**:
  - Linear layers for **keys**, **queries**, and **values** of size `head_size`.  
  - A **lower-triangular mask (`tril`)** to prevent attending to future tokens (causal attention).  
  - **Dropout** for regularization.

- **Forward Pass (`forward`)**:
  - Computes keys `k`, queries `q`, and values `v` from input `x`.  
  - Calculates **attention weights**: scaled dot-product of queries and keys.  
  - Applies **causal mask** to prevent peeking at future tokens.  
  - Normalizes weights with **softmax** and applies **dropout**.  
  - Outputs the **weighted sum of values**, which represents the attended information for each position.

This head allows the model to focus on different parts of the input sequence when predicting the next token.


In [9]:
class SelfAttentionHead(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, head_size, bias=False)

        self.register_buffer(
            "tril", torch.tril(torch.ones(block_size, block_size))
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        wei = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)

        v = self.value(x)
        return wei @ v
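
The effect of the causal mask can be checked in isolation. This sketch assumes all raw attention scores are equal (zeros), purely so the result is easy to read; with real queries and keys the non-masked entries would differ, but the zero pattern above the diagonal would be the same.

```python
import torch
import torch.nn.functional as F

T = 4  # toy sequence length; the notebook allows up to block_size = 128
scores = torch.zeros(T, T)            # stand-in for q @ k^T / sqrt(head_size)
tril = torch.tril(torch.ones(T, T))   # 1 on and below the diagonal, 0 above

# Future positions get -inf, so softmax assigns them exactly zero weight.
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
```

Row `t` of `weights` spreads its probability mass over positions `0..t` only, so each output position is a weighted average of values from itself and earlier positions.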


## Multi-Head Attention

Implements **multi-head self-attention**, combining several attention heads to capture different aspects of the input:

- **Initialization (`__init__`)**:
  - Splits the embedding dimension across multiple heads (`head_size = embed_dim // num_heads`).  
  - Creates a list of `SelfAttentionHead` modules.  
  - Linear projection and dropout applied after concatenating all heads.

- **Forward Pass (`forward`)**:
  - Passes input `x` through each attention head.  
  - Concatenates outputs along the feature dimension.  
  - Applies a linear projection and dropout to produce the final output.

Multi-head attention allows the model to jointly attend to information from different representation subspaces.


In [10]:
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super().__init__()
        head_size = embed_dim // num_heads
        self.heads = nn.ModuleList(
            [SelfAttentionHead(head_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))
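
The shape bookkeeping behind the concatenation can be sketched as follows. Random tensors stand in for the per-head outputs `h(x)`, and the batch size and sequence length are arbitrary toy values.

```python
import torch

embed_dim, num_heads = 256, 8          # values from the hyperparameter cell
head_size = embed_dim // num_heads     # features produced by each head

B, T = 2, 5                            # toy batch size and sequence length
# Random stand-ins for the per-head outputs, each of shape (B, T, head_size).
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

out = torch.cat(head_outputs, dim=-1)  # concatenate along the feature dimension
print(head_size, out.shape)
```

Concatenation restores the full `embed_dim` width, which is why `self.proj` can be a square `embed_dim` by `embed_dim` linear layer. This only works cleanly when `embed_dim` is divisible by `num_heads`.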


## Feed-Forward Network

Defines a **position-wise feed-forward network** used in transformers:

- **Initialization (`__init__`)**:
  - Two linear layers with **GELU activation** in between.  
  - Expands embedding dimension to `4 * embed_dim` and projects back to `embed_dim`.  
  - Dropout applied for regularization.

- **Forward Pass (`forward`)**:
  - Passes input `x` through the feed-forward network.  

This layer adds **non-linearity** and **feature transformation** to the model, complementing self-attention.


In [11]:
class FeedForward(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.GELU(),
            nn.Linear(4 * embed_dim, embed_dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)
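
To get a feel for the size of this sublayer, the sketch below rebuilds the same layer stack and compares PyTorch's parameter count with a hand calculation. The variable names `counted` and `expected` are illustrative only.

```python
import torch.nn as nn

embed_dim = 256  # value from the hyperparameter cell

# Same layer stack as FeedForward.net; GELU and Dropout have no learnable parameters.
net = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.GELU(),
    nn.Linear(4 * embed_dim, embed_dim),
    nn.Dropout(0.2),
)

counted = sum(p.numel() for p in net.parameters())
# Weights plus biases of the two linear layers, computed by hand.
expected = (embed_dim * 4 * embed_dim + 4 * embed_dim) + (4 * embed_dim * embed_dim + embed_dim)
print(counted, expected)
```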


## Transformer Block

Defines a single **transformer block**, combining multi-head attention and feed-forward layers with residual connections:

- **Components**:
  - `LayerNorm` applied before the attention and feed-forward sublayers (a pre-norm arrangement).  
  - `MultiHeadAttention` for capturing dependencies across the sequence.  
  - `FeedForward` network for position-wise transformations.

- **Forward Pass**:
  - Applies layer normalization → attention → adds residual connection.  
  - Applies layer normalization → feed-forward → adds residual connection.  

This block is the **building unit** of the GPT model, enabling both contextual understanding and feature transformation.


In [12]:
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.sa = MultiHeadAttention()
        self.ff = FeedForward()

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x


## Shakespeare GPT Model

Defines the full **GPT-style language model**:

- **Embeddings**:
  - `token_emb`: Converts token indices to embedding vectors.  
  - `pos_emb`: Learned embedding for each of the `block_size` positions, added to the token embeddings so the model can tell positions apart.

- **Transformer Blocks**:
  - Stacks `num_layers` of `Block()` modules for deep contextual representation.

- **Output**:
  - Layer normalization followed by a linear layer (`head`) to produce logits for each token in the vocabulary.  
  - Computes **cross-entropy loss** if target tokens are provided.

This model can predict the next character in a sequence and is the core architecture for training on Shakespeare text.


In [13]:
class ShakespeareGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)

        self.blocks = nn.Sequential(
            *[Block() for _ in range(num_layers)]
        )
        self.ln = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        tok = self.token_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))  # follow the input's device, not the global
        x = tok + pos

        x = self.blocks(x)
        x = self.ln(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, vocab_size),
                targets.view(-1)
            )

        return logits, loss
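
A rough hand count of the learnable parameters helps put the architecture in perspective. The helper `count_parameters` below is hypothetical and simply mirrors the layers defined above; `vocab_size=65` is an assumption, since the notebook never prints `len(chars)`.

```python
def count_parameters(vocab_size, block_size=128, embed_dim=256, num_layers=6):
    """Hand count of learnable parameters in ShakespeareGPT as defined above."""
    token_emb = vocab_size * embed_dim
    pos_emb = block_size * embed_dim
    layer_norm = 2 * embed_dim                      # scale and shift vectors
    attn_qkv = 3 * embed_dim * embed_dim            # key/query/value of all heads, no bias
    attn_proj = embed_dim * embed_dim + embed_dim   # output projection with bias
    ff = (embed_dim * 4 * embed_dim + 4 * embed_dim) + (4 * embed_dim * embed_dim + embed_dim)
    block = 2 * layer_norm + attn_qkv + attn_proj + ff
    head = embed_dim * vocab_size + vocab_size
    return token_emb + pos_emb + num_layers * block + layer_norm + head

# vocab_size=65 is an assumption, not a value shown in the notebook.
total = count_parameters(vocab_size=65)
print(f"{total:,} parameters")
```

Under that assumption the model has roughly 4.8 million parameters, almost all of them inside the six transformer blocks. The `tril` masks are registered as buffers, so they are not included in the count.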


## Model Training

This cell trains the **ShakespeareGPT** model:

1. **Initialization**:
   - Instantiates the model and moves it to the selected device (CPU/GPU).  
   - Uses the **AdamW optimizer** with the specified learning rate.

2. **Training Loop** (`20000` steps):
   - Generates a batch of input (`xb`) and target (`yb`) sequences.  
   - Computes model predictions (`logits`) and loss.  
   - Performs **backpropagation** and updates model parameters.

3. **Logging**:
   - Prints the loss of the current training batch every `1000` steps to monitor progress.


In [14]:
model = ShakespeareGPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

for step in range(20000):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 1000 == 0:
        print(f"step {step}, loss {loss.item():.4f}")


step 0, loss 4.2956
step 1000, loss 1.7683
step 2000, loss 1.5360
step 3000, loss 1.3896
step 4000, loss 1.3241
step 5000, loss 1.2962
step 6000, loss 1.2731
step 7000, loss 1.2275
step 8000, loss 1.2636
step 9000, loss 1.1780
step 10000, loss 1.1941
step 11000, loss 1.1927
step 12000, loss 1.1687
step 13000, loss 1.1436
step 14000, loss 1.1325
step 15000, loss 1.1315
step 16000, loss 1.1025
step 17000, loss 1.0775
step 18000, loss 1.0636
step 19000, loss 1.0660
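
As a rough sanity check on the log above, the loss can be compared with the cross-entropy of a uniform guess, `ln(vocab_size)`, and converted to perplexity. The sketch assumes `vocab_size = 65`, which the notebook does not print, and copies the first and last logged losses by hand.

```python
import math

vocab_size = 65  # assumption: not printed anywhere in the notebook

# Cross-entropy (in nats) of guessing every character with equal probability.
uniform_loss = math.log(vocab_size)

# Perplexity is exp(loss): the effective number of equally likely choices per character.
first_loss, last_loss = 4.2956, 1.0660  # step 0 and step 19000 from the log above
print(f"uniform baseline: {uniform_loss:.4f}")
print(f"perplexity at step 0: {math.exp(first_loss):.1f}")
print(f"perplexity at step 19000: {math.exp(last_loss):.2f}")
```

A step-0 loss close to the uniform baseline is what an untrained model should produce. Note that the logged values are single-batch training losses; the notebook holds out no validation data, so they do not measure how well the model generalizes.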


## Text Generation Function

Defines a function to **generate text** from the trained model:

- **`@torch.no_grad()`**: Disables gradient tracking, which saves memory and speeds up inference.  
- **Inputs**:
  - `start`: Initial prompt text.  
  - `max_new_tokens`: Number of characters to generate.  
  - `temperature`: Divides the logits before the softmax; values below 1 sharpen the distribution, while values above 1 flatten it and produce more diverse text.

- **Process**:
  1. Encode the starting text into tensor indices.  
  2. Iteratively predict the next token using the model:
     - Consider only the last `block_size` tokens (context window).  
     - Scale logits by `temperature` and sample the next token.  
     - Append the predicted token to the sequence.  

- **Output**: Decodes the generated indices back into readable text.


In [15]:
@torch.no_grad()
def generate(model, start, max_new_tokens, temperature=0.8):
    idx = torch.tensor([encode(start)], device=device)

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)

        logits = logits[:, -1, :] / temperature
        probs = F.softmax(logits, dim=-1)
        next_idx = torch.multinomial(probs, 1)

        idx = torch.cat((idx, next_idx), dim=1)

    return decode(idx[0].tolist())
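
The role of `temperature` can be illustrated on its own. The sketch below applies the same scaling as `generate` to a hypothetical three-token logit vector; the specific logit values are made up for the example.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # hypothetical scores for three candidate tokens

# Lower temperatures concentrate probability on the top-scoring token.
for temperature in (0.5, 0.8, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=-1)
    print(temperature, [round(p, 3) for p in probs.tolist()])

sharp = F.softmax(logits / 0.5, dim=-1)  # low temperature
flat = F.softmax(logits / 2.0, dim=-1)   # high temperature
```

The ranking of tokens never changes with temperature; only the gap between their probabilities does. The default of `0.8` used here leans slightly toward the most likely characters while still sampling.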


## Generate Shakespeare-Style Text

- Sets the model to **evaluation mode** with `model.eval()`, which disables dropout.  
- Uses the `generate` function to produce **600 new characters** starting from the prompt `"I am death\n"`.  
- Prints the generated text to observe the model's output in Shakespearean style.


In [17]:
model.eval()

print(
    generate(
        model,
        start="I am death\n",
        max_new_tokens=600,
        temperature=0.8
    )
)


I am death
To see her beauty of mine herself:
If not, be good night, love me not, but know.

DUKE VINCENTIO:
It is mine own eyes to you at least; I am all.

LUCIO:
Brief:
And yet you must be a gentleman in the sea:
Pray you now, sir, in your company mouths,
Therefore press your comfort continue.

DUKE VINCENTIO:
In all hopes your affairs?

Provost:
Advantage thee, if you will not wet the time
Of the virtue can go through them, where he with a
false past and garner to his father. Besides,
He's ears him, and in his sit is lost?
But see whether more than flattering that she would,
That ever were she to me a


## Save Model Checkpoint

Saves the trained model and important metadata:

- **`model_state`**: Model weights.  
- **`stoi` / `itos`**: Character-to-index and index-to-character mappings.  
- **`vocab_size`**: Size of the vocabulary.  
- **`config`**: Hyperparameters used for training.  

The checkpoint is saved to `"shakespeare_gpt.pth"` for later loading or inference.


In [18]:
torch.save({
    "model_state": model.state_dict(),
    "stoi": stoi,
    "itos": itos,
    "vocab_size": vocab_size,
    "config": {
        "block_size": block_size,
        "embed_dim": embed_dim,
        "num_heads": num_heads,
        "num_layers": num_layers,
        "dropout": dropout
    }
}, "shakespeare_gpt.pth")
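
Loading follows the reverse path. The sketch below is a minimal round trip under stated assumptions: a small `nn.Linear` and a two-character vocabulary stand in for `ShakespeareGPT`, `stoi`, and `itos`, and the file is written to a temporary directory under the hypothetical name `toy_checkpoint.pth`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-ins so the sketch runs on its own; in the notebook these would be
# the trained ShakespeareGPT model and the real stoi / itos tables.
toy_model = nn.Linear(4, 2)
toy_stoi = {"a": 0, "b": 1}
toy_itos = {0: "a", 1: "b"}

path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pth")
torch.save(
    {
        "model_state": toy_model.state_dict(),
        "stoi": toy_stoi,
        "itos": toy_itos,
        "vocab_size": len(toy_stoi),
        "config": {"dropout": 0.2},
    },
    path,
)

# map_location="cpu" lets a checkpoint saved on a GPU load on a CPU-only machine.
checkpoint = torch.load(path, map_location="cpu")

restored = nn.Linear(4, 2)  # must have the same architecture as the saved model
restored.load_state_dict(checkpoint["model_state"])
restored.eval()  # switch off dropout before inference
```

For the real model, note that the classes in this notebook read `embed_dim`, `num_heads`, `block_size`, and the other hyperparameters from module-level globals. Those globals would need to be set from `checkpoint["config"]` and `checkpoint["vocab_size"]` before calling `ShakespeareGPT()`, otherwise `load_state_dict` may fail with shape mismatches.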
