<a href="https://colab.research.google.com/github/SephPalmer/learning-ai/blob/main/Kaparpathys_GPT_web_source_769.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Implement and train a GPT language model on a shard of the C4 web-text dataset using PyTorch and `tiktoken`, including data preparation, defining the Transformer architecture with Multi-Head Attention, and generating sample text after training.

## Install Dependencies

### Subtask:
Install necessary Python libraries such as `torch` and `tiktoken`.


**Reasoning**:
Install the `tiktoken` library using pip as instructed.



In [None]:
!pip install tiktoken



**Reasoning**:
Import torch and tiktoken to verify that they are successfully installed and ready for the subsequent tasks.



In [None]:
import torch
import tiktoken

print(f"Torch version: {torch.__version__}")
print("tiktoken imported successfully.")

Torch version: 2.10.0+cu128
tiktoken imported successfully.


# Task
Implement and train a GPT language model from scratch on a shard of the C4 web-text dataset using PyTorch and `tiktoken`, including defining the Transformer architecture (Multi-Head Attention, FeedForward, Block) and generating sample text.

## Load and Prepare Data

### Subtask:
Load a local shard of the C4 dataset (`c4-train.00000-of-01024.json.gz`), encode it using tiktoken, and split it into training and validation sets.


**Reasoning**:
Read the gzip-compressed JSON Lines shard, join the `text` field of every record, encode the result with tiktoken's GPT-2 encoding, convert the token IDs to a tensor, and split it 90/10 into training and validation sets.



In [None]:
import gzip
import json

import tiktoken

# Load Hugging Face Data (C4) from local file
file_path = "c4-train.00000-of-01024.json.gz"
print(f"Loading data from {file_path}...")

# TARGET_CHARS = 20_000_000  # optional cap of 20M characters (disabled for this run)

# The shard is gzip-compressed JSON Lines: one JSON object per line with a "text" field
with gzip.open(file_path, 'rt', encoding='utf-8') as f:
    records = [json.loads(line) for line in f]

# Concatenate all documents, separated by newlines
text = "\n".join(r["text"] for r in records)
print(f"Length of dataset in characters: {len(text)}")

# Initialize tiktoken encoder
enc = tiktoken.get_encoding("gpt2")
vocab_size = enc.n_vocab

# Encode the text
ids = enc.encode(text)
print(f"Total tokens: {len(ids)}")

# Convert to PyTorch tensor
data_tensor = torch.tensor(ids, dtype=torch.long)

# Split into train and validation
n = int(0.9 * len(data_tensor))
train_data = data_tensor[:n]
val_data = data_tensor[n:]

print(f"Train set shape: {train_data.shape}")
print(f"Validation set shape: {val_data.shape}")

Loading data from c4-train.00000-of-01024.json.gz...
Length of dataset in characters: 769299382
Total tokens: 170592661
Train set shape: torch.Size([153533394])
Validation set shape: torch.Size([17059267])
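
The data cell reads the whole shard into memory, and the commented-out `TARGET_CHARS` line hints at an optional size cap. A minimal sketch of how such a cap could work is shown below, assuming the file is JSON Lines with a `text` field as in the cell above. The helper name `load_c4_text` and its `max_chars` parameter are hypothetical and are not part of the original notebook.

```python
import gzip
import json


def load_c4_text(path, max_chars=None):
    """Join the "text" field of each JSON line, stopping once max_chars is reached.

    Hypothetical helper: it stops at a document boundary, so the result can
    overshoot max_chars by up to one document.
    """
    parts = []
    total = 0
    # "rt" mode decompresses and decodes on the fly, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            doc = json.loads(line)["text"]
            parts.append(doc)
            total += len(doc) + 1  # +1 accounts for the joining newline
            if max_chars is not None and total >= max_chars:
                break
    return "\n".join(parts)
```

Calling `load_c4_text(file_path, max_chars=20_000_000)` would then yield roughly the first 20M characters, which may be handy for quick experiments before committing to the full shard.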


## Define Model Architecture

### Subtask:
Implement the GPT architecture including self-attention mechanisms, feed-forward networks, transformer blocks, and the main language model class.


**Reasoning**:
Implement the GPT model architecture, including hyperparameters, data batching, and the Transformer components (Head, MultiHeadAttention, FeedForward, Block, GPTLanguageModel), as requested in the subtask instructions.
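
As a small illustration of the causal self-attention that each `Head` performs, the sketch below computes masked, scaled dot-product attention by hand on random tensors and compares it with PyTorch's fused `F.scaled_dot_product_attention` (available in PyTorch 2.0 and later). The toy sizes and the random inputs are assumptions for illustration only; the notebook's model applies learned key, query, and value projections first.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, head_size = 2, 5, 16  # toy sizes, chosen only for illustration
q = torch.randn(B, T, head_size)
k = torch.randn(B, T, head_size)
v = torch.randn(B, T, head_size)

# Manual version: scale by 1/sqrt(head_size), hide future positions, softmax, aggregate.
scores = q @ k.transpose(-2, -1) * head_size ** -0.5    # (B, T, T)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))  # True where attention is allowed
scores = scores.masked_fill(~causal, float("-inf"))
weights = F.softmax(scores, dim=-1)                      # each row sums to 1
manual = weights @ v                                     # (B, T, head_size)

# Fused version: same math, with the causal mask built internally.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("max abs difference:", (manual - fused).abs().max().item())
```

The two results should agree up to floating-point error. Swapping the per-head Python loop for the fused call is one possible way to speed up training, though it would require restructuring `MultiHeadAttention`.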



In [None]:
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
batch_size = 96
block_size = 512
max_iters = 25000
eval_interval = 100
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 512
n_head = 8
n_layer = 10
dropout = 0.1

torch.manual_seed(1337)

# Data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return self.dropout(out)

class FeedForward(nn.Module):
    """ a two-layer MLP: linear expansion to 4 * n_embd, ReLU, linear projection back, dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position lookup tables; each maps an integer index to an n_embd-dimensional vector
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[
            Block(n_embd, n_head=n_head) for _ in range(n_layer)
        ])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    @torch.no_grad()  # sampling needs no gradients; this avoids building the autograd graph
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

83.285073 M parameters
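
The reported parameter count can be reproduced by hand from the layer shapes defined above. The sketch below does the arithmetic in plain Python, assuming the GPT-2 vocabulary size of 50257 that `enc.n_vocab` returns. The function name `count_gpt_params` is hypothetical.

```python
def count_gpt_params(vocab_size, block_size, n_embd, n_head, n_layer):
    """Count trainable parameters for the architecture defined in this notebook."""
    head_size = n_embd // n_head
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    attn_heads = n_head * 3 * n_embd * head_size   # bias-free key, query, value
    attn_proj = n_embd * n_embd + n_embd           # output projection with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                 # ln1 and ln2, weight and bias each
    per_block = attn_heads + attn_proj + ffwd + layer_norms
    ln_f = 2 * n_embd
    lm_head = n_embd * vocab_size + vocab_size     # untied output layer with bias
    total = tok_emb + pos_emb + n_layer * per_block + ln_f + lm_head
    return {
        "token_embedding": tok_emb,
        "position_embedding": pos_emb,
        "transformer_blocks": n_layer * per_block,
        "final_layer_norm": ln_f,
        "lm_head": lm_head,
        "total": total,
    }


counts = count_gpt_params(vocab_size=50257, block_size=512, n_embd=512, n_head=8, n_layer=10)
for name, value in counts.items():
    print(f"{name:>20}: {value:,}")
```

The total comes to 83,285,073, matching the printed figure. About 62% of it sits in the token embedding and the untied `lm_head`, so the ten Transformer blocks account for only about 31.5M parameters. The `tril` mask is registered as a buffer and is not counted.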


## Train Model

### Subtask:
Create the training loop with the necessary hyperparameters and train the model on the prepared data.


**Reasoning**:
Implement the training loop, including loss estimation, optimizer initialization, and the iterative training process using the defined hyperparameters and model.



In [None]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

@torch.no_grad()
def estimate_loss():
    out = {}
    m.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = m(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    m.train()
    return out

for step in range(max_iters):  # named `step` to avoid shadowing the built-in iter()

    # every eval_interval steps, estimate the loss on the train and val splits
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: training data loss {losses['train']:.4f}, loss against previously unseen data {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: training data loss 10.9894, loss against previously unseen data 10.9890
step 100: training data loss 7.0105, loss against previously unseen data 7.0331
step 200: training data loss 6.6703, loss against previously unseen data 6.6941
step 300: training data loss 6.4883, loss against previously unseen data 6.5081
step 400: training data loss 6.3196, loss against previously unseen data 6.3354
step 500: training data loss 6.1988, loss against previously unseen data 6.2118
step 600: training data loss 6.0811, loss against previously unseen data 6.1072
step 700: training data loss 5.9782, loss against previously unseen data 6.0034
step 800: training data loss 5.8778, loss against previously unseen data 5.9086
step 900: training data loss 5.7864, loss against previously unseen data 5.8185
step 1000: training data loss 5.7165, loss against previously unseen data 5.7455
step 1100: training data loss 5.6348, loss against previously unseen data 5.6730
step 1200: training data loss 5.5733, 
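
Two quick sanity checks help in reading this log. A model that guesses uniformly over the vocabulary has a cross-entropy of ln(vocab_size), so the step-0 loss of about 10.99 should sit near that value. The hyperparameters also fix how many tokens the run samples in total. The sketch below is plain arithmetic on numbers already shown in the notebook; the variable names are illustrative.

```python
import math

vocab_size = 50257            # GPT-2 vocabulary size (enc.n_vocab)
batch_size, block_size = 96, 512
max_iters = 25_000
train_tokens = 153_533_394    # from the data preparation output

uniform_loss = math.log(vocab_size)           # loss of a uniform guess, in nats
tokens_per_step = batch_size * block_size
tokens_seen = tokens_per_step * max_iters
passes = tokens_seen / train_tokens

print(f"loss of a uniform guess: {uniform_loss:.4f}")
print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens sampled over {max_iters:,} steps: {tokens_seen:,}")
print(f"approximate passes over the training split: {passes:.1f}")
```

The uniform baseline is about 10.82, so a randomly initialized model starting slightly above it is plausible. At 49,152 tokens per step, the full 25,000 steps would sample roughly 1.2 billion tokens, or about eight passes over the training split, with batches drawn at random rather than in epoch order.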

# Task
Generate 2000 tokens of text using the trained model starting from a zero context, decode the tokens using `tiktoken`, and print the generated output.

## Generate Text

### Subtask:
Generate 2000 tokens of text using the trained model starting from a zero context, decode the tokens using `tiktoken`, and print the generated output.


**Reasoning**:
Seed the model with a single token of ID 0, sample 2000 new tokens with `generate`, decode the resulting token IDs using `tiktoken`, and print the text.



In [None]:
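# Switch to evaluation mode so dropout is disabled while sampling
m.eval()

# Initialize the context with a single token of ID 0 ("!" in the GPT-2 vocabulary)
context = torch.zeros((1, 1), dtype=torch.long, device=device)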

# Generate text
# max_new_tokens=2000 to generate 2000 tokens
generated_ids = m.generate(context, max_new_tokens=2000)

# Decode the generated token IDs back to text
generated_text = enc.decode(generated_ids[0].tolist())

# Print the generated text
print(generated_text)

!
In case active citizenship (less than any) citizen is a person who served a reasonable percentages (most of the age or income) higher than the income it or the income or marital education.
Wanted they file transfers (led below, ignored loans), covered them, divided into six ends at each end of their term and earned them, as well as in Louise v. Businesses. Tests for it's clients as well (not in instances where they have to work) riding the letter as well as much as the petty pourster's name or special. Or they can sell them for based on other characteristics of these items. The sheets are covered in WOR-0047, OF-COMPAINT, SSDOV, ESSOTABY, REGARDLESS (not in historical securities) by CEO, at the heart of finding so beneficial—in the name and category, we can clear all the fictional data on the files processed.
We deliver what industry coverage there are and maximize our complete volume of LLPs and Brucees-Tart who call it the most.
Design Express - © 2017 Republic of Scotland.
Tom and
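
The `generate` method samples from the full softmax distribution, which lets low-probability tokens through and can contribute to the drifting text above. Temperature scaling and top-k filtering are two common ways to tighten sampling. The sketch below is not part of the original notebook: `sample_next_token`, `temperature`, and `top_k` are hypothetical names, and the function would stand in for the softmax and `torch.multinomial` lines inside `generate`.

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits, temperature=1.0, top_k=None):
    """Sample one token id per row from (B, vocab_size) logits.

    Hypothetical helper: temperature < 1 sharpens the distribution, and
    top_k keeps only the k highest-scoring tokens (ties may keep a few more).
    """
    logits = logits / temperature
    if top_k is not None:
        k = min(top_k, logits.size(-1))
        # topk returns values sorted in descending order, so the last one is the k-th largest
        kth_value = torch.topk(logits, k, dim=-1).values[:, -1:]  # (B, 1)
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (B, 1)


# Toy usage on made-up logits for a 4-token vocabulary
toy_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
print(sample_next_token(toy_logits, temperature=0.8, top_k=2))
```

With `top_k=1` this reduces to greedy decoding. Values such as `temperature=0.8` and `top_k=50` are common starting points, but the best settings depend on the model and would need to be tuned by inspection.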

## Final Task

### Subtask:
Summarize the text generation results.


## Summary:

### Q&A
**What is the output of the text generation model?**
The model successfully generated 2000 tokens of text starting from a zero context. The output was decoded using `tiktoken` and printed, revealing text that mimics the structure and content of the training data.

### Data Analysis Key Findings
*   The generation call ran without error, sampling 2000 new tokens from a `(1, 1)` context tensor holding token ID 0.
*   The decoded text reads like generic web prose (legal, financial, and business phrasing, plus a copyright footer), which matches the C4 training data rather than Shakespearean literature.
*   No Shakespearean character names such as "ROMEO" or "CAPULET" appear in the sample; the vocabulary instead reflects web sources, for example "LLPs", "CEO", and "© 2017". The text is locally fluent but does not sustain a topic.

### Insights or Next Steps
*   The sample's resemblance to web text suggests the model architecture and training loop are functioning as intended for this dataset.
*   A potential next step is to track the validation loss already reported during training (for example, as perplexity) and to probe the model qualitatively by providing specific text prompts to see how it completes them.
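
For the quantitative side, one common way to read the cross-entropy values is to convert them to perplexity, the exponential of the loss. `F.cross_entropy` reports the mean loss in nats, so `exp(loss)` applies directly. The sketch below uses a few validation losses copied from the training log above; the helper name `perplexity` is illustrative.

```python
import math


def perplexity(cross_entropy_nats):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


# Validation losses copied from the training log above
val_losses = {0: 10.9890, 100: 7.0331, 1000: 5.7455, 1100: 5.6730}
for step, loss in val_losses.items():
    print(f"step {step:>4}: val loss {loss:.4f} -> perplexity {perplexity(loss):,.0f}")
```

Perplexity can be read loosely as the effective number of equally likely tokens the model is choosing among at each position, so a falling value means the model is narrowing its predictions.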


## Qualitative Evaluation

We can evaluate the model's performance qualitatively by providing specific starting prompts and observing how it completes the text. This checks if the model retains context and style given a specific seed.

In [None]:
def generate_from_prompt(prompt_text, max_new_tokens=200):
    # Encode the prompt
    input_ids = enc.encode(prompt_text)
    # Convert to tensor and add batch dimension (1, T)
    context = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)

    # Generate text
    generated_ids = m.generate(context, max_new_tokens=max_new_tokens)

    # Decode and print
    output_text = enc.decode(generated_ids[0].tolist())
    print(f"Prompt: '{prompt_text}'")
    print("-" * 40)
    print(output_text)
    print("=" * 40)
    print()

# Test with specific prompts
test_prompts = [
    "The capital of France is ",
    "Today I am happy because",
    "And then he said,"
]

for prompt in test_prompts:
    generate_from_prompt(prompt)

Prompt: 'ROMEO:'
----------------------------------------
ROMEO: VOLUME New York, ExxonMobil, CA: NY No. 5. 1959, p.05.05.00.
Yoga Ridge Press: Brian Wrightl. Vegas, PR: Reeves's Vertical Launch Waitouts.
One of the big names in the country of Nevada is Malaga County. The University of California New York has an incredible host of the very famous Jeeze-Farm web. In fact they are called the Mayor LA Valley Railway because it is situated on the GE Stopway from the famous Selydora.
AZY is in a pretty central district with over 200 days research deep in urban Belize neighborhoods each year. It is one of the country’s largest “American’s first-ever most popular businesses in the long followed. It is one of the largest green communities with its beautiful views, culture, politics and cities. While it is an impressive country that is somehow bustling and populated with new schools and neighborhoods, it is quite

Prompt: 'The king'
----------------------------------------
The king tapped the p