<a href="https://colab.research.google.com/github/SephPalmer/learning-ai/blob/main/Kaparpathys_GPT_with_literature.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Implement and train a GPT language model on the TinyShakespeare dataset using PyTorch and `tiktoken`, including data preparation, defining the Transformer architecture with Multi-Head Attention, and generating sample text after training.

## Install Dependencies

### Subtask:
Install necessary Python libraries such as `torch` and `tiktoken`.


**Reasoning**:
Install the `tiktoken` library using pip as instructed.



In [None]:
!pip install tiktoken



**Reasoning**:
Import torch and tiktoken to verify that they are successfully installed and ready for the subsequent tasks.



In [None]:
import torch
import tiktoken

print(f"Torch version: {torch.__version__}")
print("tiktoken imported successfully.")

Torch version: 2.9.0+cu128
tiktoken imported successfully.


# Task
Implement and train a GPT language model from scratch on the TinyShakespeare dataset using PyTorch and tiktoken, including defining the Transformer architecture (Multi-Head Attention, FeedForward, Block) and generating sample text.

## Download and Prepare Data

### Subtask:
Download the TinyShakespeare dataset, encode it using tiktoken, and split it into training and validation sets.


**Reasoning**:
Download the TinyShakespeare dataset, encode it using tiktoken, convert to tensors, and split into training and validation sets.



In [None]:
import os
import requests

# 1. Download the dataset
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
file_path = 'input.txt'

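if not os.path.exists(file_path):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of saving an error page
    # write with the same encoding that is used when reading the file back
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)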

# 2. Read contents
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Length of dataset in characters: {len(text)}")

# 3. Initialize tiktoken encoder
enc = tiktoken.get_encoding("gpt2")

# 4. Encode the text
ids = enc.encode(text)
print(f"Total tokens: {len(ids)}")

# 5. Convert to PyTorch tensor
data_tensor = torch.tensor(ids, dtype=torch.long)

# 6. Split into train and validation
n = int(0.9 * len(data_tensor))
train_data = data_tensor[:n]
val_data = data_tensor[n:]

# 7. Print stats
vocab_size = enc.n_vocab
print(f"Vocabulary size: {vocab_size}")
print(f"Train set shape: {train_data.shape}")
print(f"Validation set shape: {val_data.shape}")

Length of dataset in characters: 1115394
Total tokens: 338025
Vocabulary size: 50257
Train set shape: torch.Size([304222])
Validation set shape: torch.Size([33803])
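
The split above is purely positional: the first 90% of the token stream becomes training data and the remaining 10% becomes validation data. The sketch below illustrates that split, and the way a next-token training example pairs a window with the same window shifted by one position, on a toy list of made-up token ids. The names `toy_ids`, `window`, and `start` are illustrative assumptions, not part of the notebook, and plain Python lists stand in for tensors.

```python
# Toy stand-in for the encoded corpus; the notebook itself uses tiktoken ids.
toy_ids = list(range(100, 120))  # 20 made-up token ids

# Positional 90/10 split, mirroring n = int(0.9 * len(data_tensor)).
n = int(0.9 * len(toy_ids))
toy_train, toy_val = toy_ids[:n], toy_ids[n:]
print(f"train tokens: {len(toy_train)}, val tokens: {len(toy_val)}")

# One training example with a hypothetical context window of 4 tokens.
window = 4
start = 3
x = toy_train[start:start + window]          # inputs
y = toy_train[start + 1:start + window + 1]  # targets: same window, shifted by one

# Each prefix of x is a context whose target is the token that follows it.
for t in range(window):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

A single window of length `window` therefore supplies `window` prediction problems, one per prefix, which is why the inputs and targets have the same shape.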


## Define Model Architecture

### Subtask:
Implement the GPT architecture including self-attention mechanisms, feed-forward networks, transformer blocks, and the main language model class.


**Reasoning**:
Implement the GPT model architecture, including hyperparameters, data batching, and the Transformer components (Head, MultiHeadAttention, FeedForward, Block, GPTLanguageModel), as requested in the subtask instructions.



In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

torch.manual_seed(1337)

# Data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        # scale by the head size (the key dimension), not the full embedding size C
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.proj(out)
        return self.dropout(out)

class FeedForward(nn.Module):
    """ a small MLP: linear expansion, ReLU non-linearity, linear projection back """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token embeddings plus learned position embeddings feed the transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[
            Block(n_embd, n_head=n_head) for _ in range(n_layer)
        ])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

6.684497 M parameters
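
The `Head` class above masks future positions with `-inf` before the softmax, so each position attends only to itself and to earlier positions. The sketch below reproduces that effect in plain Python on tiny hand-written vectors. It is an illustration under stated assumptions rather than the notebook's implementation: the function name `causal_attention_weights` is hypothetical, and scores are scaled by the square root of the head size, as in standard scaled dot-product attention.

```python
import math

def causal_attention_weights(q, k):
    """q, k: lists of T vectors of length head_size. Returns a T x T list of weights."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5  # assumption: scale by the head (key) dimension
    weights = []
    for t in range(T):
        # Scores against positions 0..t only. Filling future positions with -inf
        # and applying softmax gives the same result: their weight is exactly 0.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) * scale for s in range(t + 1)]
        top = max(scores)                       # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (T - t - 1)
        weights.append(row)
    return weights

# Three positions, head_size = 2 (toy values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = causal_attention_weights(q, k)
for row in w:
    print([round(p, 3) for p in row])
```

Every row sums to 1 and the matrix is lower triangular, which is the property the `tril` buffer enforces in the PyTorch code.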


## Train Model

### Subtask:
Create the training loop with the necessary hyperparameters and train the model on the prepared data.


**Reasoning**:
Implement the training loop, including loss estimation, optimizer initialization, and the iterative training process using the defined hyperparameters and model.



In [None]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

@torch.no_grad()
def estimate_loss():
    out = {}
    m.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = m(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    m.train()
    return out

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 10.9982, val loss 11.0105
step 100: train loss 6.3955, val loss 6.4657
step 200: train loss 6.0255, val loss 6.1583
step 300: train loss 5.7610, val loss 5.9578
step 400: train loss 5.4810, val loss 5.7326
step 500: train loss 5.2535, val loss 5.5390
step 600: train loss 5.1008, val loss 5.3982
step 700: train loss 4.9696, val loss 5.3369
step 800: train loss 4.8841, val loss 5.2531
step 900: train loss 4.8079, val loss 5.1672
step 1000: train loss 4.7195, val loss 5.0769
step 1100: train loss 4.6626, val loss 5.1190
step 1200: train loss 4.5920, val loss 5.0147
step 1300: train loss 4.5585, val loss 5.0320
step 1400: train loss 4.4998, val loss 4.9957
step 1500: train loss 4.4485, val loss 5.0001
step 1600: train loss 4.4198, val loss 4.9252
step 1700: train loss 4.4201, val loss 4.9456
step 1800: train loss 4.3644, val loss 4.9211
step 1900: train loss 4.3473, val loss 4.9224
step 2000: train loss 4.2923, val loss 4.8641
step 2100: train loss 4.2441, val loss 4.879
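
Two quick sanity checks help when reading this log. An untrained model that guesses uniformly over the vocabulary has a cross-entropy of ln(vocab_size), which is why the step-0 losses sit near 11. A loss can also be converted to perplexity with exp(loss). The sketch below works through both using values copied from the outputs above; the helper name `perplexity` is an illustrative assumption.

```python
import math

vocab_size = 50257  # GPT-2 vocabulary size printed in the data preparation cell

# Cross-entropy of a uniform guess over the vocabulary.
uniform_loss = math.log(vocab_size)
print(f"uniform baseline loss: {uniform_loss:.4f}")

def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

# Values taken from the step-2000 line of the training log.
train_loss, val_loss = 4.2923, 4.8641
print(f"train perplexity: {perplexity(train_loss):.1f}")
print(f"val perplexity:   {perplexity(val_loss):.1f}")
```

The reported step-0 losses are slightly above the uniform baseline, which is expected from randomly initialized logits that are not exactly uniform.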

# Task
Generate 2000 tokens of text using the trained model starting from a zero context, decode the tokens using `tiktoken`, and print the generated output.

## Generate Text

### Subtask:
Generate 2000 tokens of text using the trained model starting from a zero context, decode the tokens using `tiktoken`, and print the generated output.


**Reasoning**:
Generate text using the trained model starting from a zero context, decode the tokens using `tiktoken`, and print the generated output as requested.



In [None]:
# Initialize the context with a single token of id 0 (decodes to "!" in the GPT-2 vocabulary)
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Generate text
# max_new_tokens=2000 to generate 2000 tokens
generated_ids = m.generate(context, max_new_tokens=2000)

# Decode the generated token IDs back to text
generated_text = enc.decode(generated_ids[0].tolist())

# Print the generated text
print(generated_text)

! what any consul?

RIVERS:
I amain in that counsel is held:
My gracious soul, or else he wakes?
I'll gnue no anitude but bones;
And that vir Rome shuns the next man be admitted.

CAPULET:
And now I shall my husband, CobUS:
Though she is butcher'd at least.
Was a plausible, I pent with a thousand access with cannot,
But tips cold from the reverence ridon manner,
And say adds with such tears, if I ne'er mercy in mocked what
Takes this covertinius o' earth.

BAPTISTA:
Angelo, though now learn'd for stoop's hand,
Like a little goddess see how comes; and my creditors are the sea
answerish them,
That thou do, that death, andvenants, nor the web
Of my followers: are well but
What may say was, by's sorrow,--
Yet will gnament'd with a monster of the lips
Of time cray treasure the wateryches.
What say, which is the king?
She lacks advised us?

ESCALUS:
As 'twixt Shore, my lord, Pompey,
That dost thou likeness, betense with me,
In music's chair, but by a though a worthy:'
The one pleasant reques

## Final Task

### Subtask:
Summarize the text generation results.


## Summary:

### Q&A
**What is the output of the text generation model?**
The model successfully generated 2000 tokens of text starting from a zero context. The output was decoded using `tiktoken` and printed, revealing text that mimics the structure and content of the training data.

### Data Analysis Key Findings
*   The text generation process functioned correctly, producing a sequence of 2000 tokens derived from a `(1, 1)` zero-initialized tensor.
*   The decoded text exhibits distinct characteristics of Shakespearean literature, indicating the model successfully learned patterns from the dataset.
*   Speaker labels such as "RIVERS", "CAPULET", "BAPTISTA", and "ESCALUS" appear in the output excerpt, showing that the model reproduces character names and the play-script layout of the training corpus.

### Insights or Next Steps
*   The successful generation of stylistically relevant text suggests the model architecture and training process were effective for this dataset.
*   Validation loss was already tracked during training (about 4.86 at step 2000, against a training loss of about 4.29), so a natural next step is qualitative: provide specific text prompts and observe how the model completes them.


## Qualitative Evaluation

We can evaluate the model's performance qualitatively by providing specific starting prompts and observing how it completes the text. This checks if the model retains context and style given a specific seed.

In [None]:
def generate_from_prompt(prompt_text, max_new_tokens=200):
    # Encode the prompt
    input_ids = enc.encode(prompt_text)
    # Convert to tensor and add batch dimension (1, T)
    context = torch.tensor(input_ids, dtype=torch.long, device=device).unsqueeze(0)

    # Generate text
    generated_ids = m.generate(context, max_new_tokens=max_new_tokens)

    # Decode and print
    output_text = enc.decode(generated_ids[0].tolist())
    print(f"Prompt: '{prompt_text}'")
    print("-" * 40)
    print(output_text)
    print("=" * 40)
    print()

# Test with specific prompts
test_prompts = [
    "ROMEO:",
    "The king",
    "To be, or not to be,"
]

for prompt in test_prompts:
    generate_from_prompt(prompt)

Prompt: 'ROMEO:'
----------------------------------------
ROMEO:
No, that you for he had seen'd;
The wrinkles chances of the lustbeokesome eye.

SICINIUS:
Renowned as well enough.

ISABELLA:
Then, here's no fighter:
That's the king on him to London,
That give yoked one that are:
Stow two is my coat, and very pregnant,
To make thee to our actions with grief.

LEONTES:
You have all three at goodartius:
Neg much, we shall not grant how tor.

KING HENRY VI:
I'll keep my glass, hath drunkardener to himed
In weighty, good for them, within it gone: the
house mistress' love of Calais, this, I'll have
past cool a favour.

First Senator:
I mean you.

Captain:

PERDITA:
Away!

Prompt: 'The king'
----------------------------------------
The king with my soul's violence,
Vaugh upon thy mother's forehead:
O, my soul is thou dip'd, so mad and in here,
Yet or comforts, go whilst I sent to withdraw
I'ld order your bodiesches.
We had might have come with thee;
But how you are? Daughter concerns a bawdis

# Task
Implement a data loading step using a `USE_LARGE_DATASET` flag. If `True`, download and combine the texts from "https://www.gutenberg.org/files/1342/1342-0.txt", "https://www.gutenberg.org/files/11/11-0.txt", and "https://www.gutenberg.org/files/84/84-0.txt"; otherwise, use the existing TinyShakespeare dataset. Encode the selected text using `tiktoken` and split it into training (90%) and validation (10%) sets.

## Select and Prepare Dataset

### Subtask:
Implement data loading with a `USE_LARGE_DATASET` flag to choose between a combined large dataset or the original TinyShakespeare, then encode and split the data.


**Reasoning**:
Implement the logic to select between a large dataset (downloaded from Gutenberg) and the existing TinyShakespeare dataset based on a flag, then encode and split the data as per the instructions.



In [None]:
import requests
import torch

# 1. Define flag
USE_LARGE_DATASET = True

# 2. List of URLs
urls = [
    "https://www.gutenberg.org/files/1342/1342-0.txt",
    "https://www.gutenberg.org/files/11/11-0.txt",
    "https://www.gutenberg.org/files/84/84-0.txt"
]

# 3. Load data based on flag
if USE_LARGE_DATASET:
    text = ""
    print("Downloading and combining large dataset...")
    for url in urls:
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # treat HTTP errors as failures, not as text
            r.encoding = 'utf-8' # Ensure correct encoding
            text += r.text
        except requests.RequestException as e:
            print(f"Error downloading {url}: {e}")
else:
    with open('input.txt', 'r', encoding='utf-8') as f:
        text = f.read()

print(f"Length of dataset in characters: {len(text)}")

# 4. Encode using existing encoder 'enc'
# Note: assumes the tiktoken encoder 'enc' was created in an earlier cell
ids = enc.encode(text)
print(f"Total tokens: {len(ids)}")

# 5. Convert to PyTorch tensor
data_tensor = torch.tensor(ids, dtype=torch.long)

# 6. Calculate split index
n = int(0.9 * len(data_tensor))

# 7. Split into train and validation
train_data = data_tensor[:n]
val_data = data_tensor[n:]

# 8. Print stats
print(f"Train set shape: {train_data.shape}")
print(f"Validation set shape: {val_data.shape}")

Downloading and combining large dataset...
Length of dataset in characters: 1292976
Total tokens: 337980
Train set shape: torch.Size([304182])
Validation set shape: torch.Size([33798])
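
Because the texts are concatenated in download order and then split positionally, the validation set is simply the tail of the combined stream. The sketch below shows how such a split distributes each source between training and validation. The function `split_sources` and the lengths in `toy_lengths` are hypothetical, chosen only to illustrate the effect, and are not measurements of the actual books.

```python
def split_sources(lengths, train_frac=0.9):
    """Per source, return (units in train, units in val) after a positional split."""
    total = sum(lengths)
    n = int(train_frac * total)  # same rule as n = int(0.9 * len(data_tensor))
    report, start = [], 0
    for length in lengths:
        end = start + length
        # Portion of this source that falls before the split index n.
        in_train = max(0, min(end, n) - start)
        report.append((in_train, length - in_train))
        start = end
    return report

toy_lengths = [700, 150, 450]  # hypothetical sizes for three concatenated texts
names = ["first text", "second text", "third text"]
for name, (in_train, in_val) in zip(names, split_sources(toy_lengths)):
    print(f"{name}: train={in_train}, val={in_val}")
```

With lengths in these proportions the validation set comes entirely from the last text, so validation loss measures performance on one source rather than on the mixture. Shuffling or splitting each text separately before concatenation would be one way to avoid that, at the cost of a slightly more involved data step.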


## Re-initialize Model

### Subtask:
Re-instantiate the `GPTLanguageModel` to reset the weights and prepare for training on the new dataset.


**Reasoning**:
Re-initialize the GPTLanguageModel to reset weights and move it to the configured device for the new training task.



In [None]:
model = GPTLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

6.684497 M parameters
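
The parameter count matches the first run because neither the architecture nor the GPT-2 vocabulary changed. As a sanity check, the figure can be reproduced by hand from the hyperparameters and layer shapes defined earlier. The sketch below is plain arithmetic with no PyTorch dependency; the intermediate variable names are illustrative.

```python
vocab_size, n_embd, block_size, n_head, n_layer = 50257, 64, 32, 4, 4
head_size = n_embd // n_head

tok_emb = vocab_size * n_embd                    # token embedding table
pos_emb = block_size * n_embd                    # position embedding table

attn_heads = n_head * 3 * n_embd * head_size     # key/query/value per head, no bias
attn_proj = n_embd * n_embd + n_embd             # output projection with bias
ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                   # two LayerNorms, weight and bias each
per_block = attn_heads + attn_proj + ffwd + layer_norms

final_ln = 2 * n_embd                            # final LayerNorm
lm_head = n_embd * vocab_size + vocab_size       # output projection with bias

total = tok_emb + pos_emb + n_layer * per_block + final_ln + lm_head
print(f"{total / 1e6} M parameters")
print(f"share in embedding + lm_head: {(tok_emb + lm_head) / total:.1%}")
```

Roughly 97% of the parameters sit in the token embedding and the output head, both of which scale with the 50257-entry vocabulary, while the four transformer blocks together account for about 0.2 M. The causal mask is registered as a buffer, so it does not contribute to the count.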


# Task
Train the re-initialized GPT model on the newly prepared large dataset. Create a new `AdamW` optimizer for the new model parameters and run the training loop for `max_iters` iterations, printing the training and validation loss every `eval_interval` steps.

## Train Model

### Subtask:
Train the re-initialized model on the large dataset.


**Reasoning**:
Implement the training loop for the re-initialized model. This involves creating a new optimizer and iterating through the training steps, periodically evaluating and printing the loss on the new large dataset.



In [None]:
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

step 0: train loss 10.9944, val loss 10.9822
step 100: train loss 6.4887, val loss 7.0498
step 200: train loss 6.0514, val loss 6.7228
step 300: train loss 5.7021, val loss 6.4811
step 400: train loss 5.4908, val loss 6.3180
step 500: train loss 5.3937, val loss 6.2144
step 600: train loss 5.2477, val loss 6.1225
step 700: train loss 5.1343, val loss 6.0375
step 800: train loss 5.0210, val loss 6.0017
step 900: train loss 4.9366, val loss 5.9344
step 1000: train loss 4.8617, val loss 5.9124
step 1100: train loss 4.8181, val loss 5.8430
step 1200: train loss 4.7529, val loss 5.8345
step 1300: train loss 4.7123, val loss 5.8305
step 1400: train loss 4.6222, val loss 5.7991
step 1500: train loss 4.6129, val loss 5.7470
step 1600: train loss 4.5740, val loss 5.7676
step 1700: train loss 4.4885, val loss 5.7117
step 1800: train loss 4.4244, val loss 5.7271
step 1900: train loss 4.4339, val loss 5.7065
step 2000: train loss 4.4067, val loss 5.6900
step 2100: train loss 4.3640, val loss 5.657

# Task
Generate 2000 tokens of text using the re-trained model starting from a zero context, decode the tokens using `tiktoken`, and print the output. Additionally, generate text starting with the specific prompts "ROMEO:", "The king", and "To be, or not to be," to observe the stylistic differences in the new model's output.

## Generate Text

### Subtask:
Generate 2000 tokens from zero context and test specific prompts with the re-trained model.


**Reasoning**:
Generate 2000 tokens from a zero context using the re-trained model and test specific prompts to evaluate the model's performance on the new dataset.



In [None]:
# Initialize the context with a single token of id 0 (decodes to "!" in the GPT-2 vocabulary)
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Generate text
print("Generating 2000 tokens from zero context...")
generated_ids = m.generate(context, max_new_tokens=2000)

# Decode the generated token IDs back to text
generated_text = enc.decode(generated_ids[0].tolist())

# Print the generated text
print(generated_text)
print("=" * 80)

# Test with specific prompts using the previously defined function
new_test_prompts = [
    "ROMEO:",
    "The king",
    "To be, or not to be,"
]

print("Testing specific prompts...")
for prompt in new_test_prompts:
    generate_from_prompt(prompt)

Generating 2000 tokens from zero context...
!’

“You’s madam,” said he, “I do not like, then I’ Philips!’they
I do not what I should believe I were make to obey offending that day, but by many
himself. These are not share of money I call myself bybearing to be
over.”

“He has happened long claws, it is not plainly ordered of the
town, they cannot describe, indeed.”

“I, read to confirm what you were already only do?” thought
said, cried, “I have certainly’t even it.” said Elizabeth, than usual.


Chapter 15


CHAPTERension.
Advad as day was neither tongues him my collection dear, shrieks had
only pleased from Mr. Bingley was in spite of my footfavour. My father is
requiculate, is you discovered by admiration. A h suspicion of remorse
left it is known to you; for it?”

“Yes, I know that am false other customs greatly, indeed, IDo you join. He’re really
“That wasn’t know that! Let me mend, I shall, trust it,” replied his side, “acted and
at your mother, they were acquainted with. I might

## Summary of Results

### Text Generation
The model trained on the large dataset (the Project Gutenberg texts of *Pride and Prejudice*, *Alice's Adventures in Wonderland*, and *Frankenstein*, downloaded from the three URLs listed in the data loading cell) produced text that blends these distinct styles.

*   **Zero Context Generation:** The output features characters like Elizabeth, Mr. Bingley, Mrs. Bennet, Alice, and the March Hare. The style fluctuates between the social commentary of Austen and the surrealism of Carroll.
*   **Specific Prompts:**
    *   **"ROMEO:":** Instead of Shakespearean dialogue, the model generated text involving Elizabeth and the March Hare, indicating the Shakespearean influence is absent or overwhelmed by the new dataset.
    *   **"The king":** Transitioned into narrative text involving Elizabeth and Mr. Bingley.
    *   **"To be, or not to be,":** Did not trigger a Shakespearean soliloquy but rather a narrative segment involving Jane and Geneva.

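These results indicate that the model learned from the new dataset and that its behavior shifted accordingly: it no longer produces Shakespearean text. The gap between training and validation loss is also wider than in the Shakespeare run (about 4.41 vs. 5.69 at step 2000, compared with 4.29 vs. 4.86). A likely contributor is that the validation split is the final 10% of the concatenated text, so it is drawn from the end of the last downloaded book rather than from all three.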