# Welcome to Modal notebooks!

Write Python code and collaborate in real time. Your code runs in Modal's
**serverless cloud**, and anyone in the same workspace can join.

This notebook comes with some common Python libraries installed. Run
cells with `Shift+Enter`.

This is a recreation of GPT-2 (inspired by Andrej Karpathy's implementation) with features from the GPT-3 paper and later work, to serve as an 'updated' version. We will start with the project_gutenberg dataset on Hugging Face, a large collection of literature spanning centuries.

In [3]:
# Core Python and PyTorch
!pip install datasets
import math
import torch
import torch.nn as nn
from torch.nn import functional as F
import numpy as np
from datasets import load_dataset

# Make sure we're using a GPU (this notebook is being run on Modal with an L4 and minimal CPU/RAM)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

# Reproducibility
torch.manual_seed(1337)
if torch.cuda.is_available():
    torch.cuda.manual_seed(1337)

Collecting datasets
  Downloading datasets-4.0.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py312-none-any.whl.metadata (7.2 kB)
Downloading datasets-4.0.0-py3-none-any.whl (494 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py312-none-any.whl (146 kB)
Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl (42.8 MB)

In [8]:
from datasets import load_dataset

# Load English books and take only first 300
ds = load_dataset("manu/project_gutenberg", split="en", streaming=True)
ds_limited = ds.take(300)  

print("First item:", next(iter(ds_limited)))

# Concatenate all texts into one big string
text_data = ""
book_count = 0

for item in ds_limited:
    text_data += item['text'] + "\n\n"
    book_count += 1

print(f"Processed {book_count} English books")
print(f"Total dataset characters: {len(text_data)}")
print("Preview first 100 characters:")
print(text_data[:100])

Resolving data files:   0%|          | 0/52 [00:00<?, ?it/s]

First item: {'id': '41496-8', 'text': 'The Project Gutenberg eBook, Addison, by William John Courthope\n\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\n\n\n\nTitle: Addison\n\n\nAuthor: William John Courthope\n\n\n\nRelease Date: November 27, 2012  [eBook #41496]\n\nLanguage: English\n\nCharacter set encoding: ISO-8859-1\n\n\n***START OF THE PROJECT GUTENBERG EBOOK ADDISON***\n\n\nE-text prepared by the Online Distributed Proofreading Team\n(http://www.pgdp.net) from page images generously made available by\nInternet Archive (http://archive.org)\n\n\n\nNote: Images of the original pages are available through\n      Internet Archive. See\n      http://archive.org/details/addison_00cour\n\n\nTranscriber\'s note:\n\n      Text enclosed by underscores is in italics (_italics_).\n\

In [9]:
# Gets unique characters
chars = sorted(list(set(text_data)))
vocab_size = len(chars)

# Create mappings
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]  # encode string to int
def decode(l):
    return ''.join([itos[i] for i in l])  # decode int to string

# Encode dataset
data = torch.tensor(encode(text_data), dtype=torch.long)

print(f"Dataset vocab size: {vocab_size}, total tokens: {len(data)}")

Dataset vocab size: 406, total tokens: 116795982


The vocab size from the cell above tells us there are 406 unique characters, because we are tokenizing at the character level. These include A-Z, digits, punctuation, whitespace, and special characters such as accented and Greek letters. Compared with word-level tokenization, a character vocabulary is trivial to build and keeps the embedding and output layers tiny; the trade-off is that the same text becomes a much longer token sequence.
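To make the character-level mapping concrete, here is a minimal sketch of the same `stoi`/`itos` idea on a toy string. The string `sample` and the `toy_*` names are made up for illustration; the notebook itself builds its vocabulary from the full Gutenberg text.

```python
# Toy character-level tokenizer (illustrative; the sample text is hypothetical)
sample = "hello world"

# Vocabulary: every distinct character, sorted for a stable ordering
toy_chars = sorted(set(sample))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for ch, i in toy_stoi.items()}

def toy_encode(s):
    """Map each character to its integer id."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """Map integer ids back to characters."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)        # 8 distinct characters, including the space
print(ids)              # one id per character
print(toy_decode(ids))  # round-trips back to "hello"
```

Every character costs one token, so a five-letter word becomes five ids. That is why the 300-book corpus turns into over 116 million tokens.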

In [10]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(f"Train tokens: {len(train_data)}, Val tokens: {len(val_data)}")

Train tokens: 105116383, Val tokens: 11679599


In [11]:
block_size = 128  # context length
batch_size = 64   # sequences per batch

def get_batch(split):
    data_split = train_data if split == 'train' else val_data
    ix = torch.randint(len(data_split) - block_size, (batch_size,))
    x = torch.stack([data_split[i:i+block_size] for i in ix])
    y = torch.stack([data_split[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)

torch.Size([64, 128]) torch.Size([64, 128])


Processed 300 English books
Total characters: 116795982

=== CHARACTER-LEVEL ANALYSIS ===
Unique characters: 406
Sample characters: ['\n', '\x0c', ' ', '!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']

=== WORD-LEVEL ANALYSIS ===
Total words: 20364214
Unique words: 161986
Most common words: [('the', 1273429), ('of', 713017), ('and', 617552), ('to', 509144), ('a', 408994), ('in', 365292), ('that', 234550), ('it', 204958), ('i', 200727), ('he', 200495), ('was', 191513), ('is', 168351), ('with', 163006), ('his', 160361), ('for', 149529), ('as', 147070), ('you', 142560), ('by', 119495), ('not', 113630), ('be', 112452)]
Sample unique words: ['laybach', 'inexorableness', 'charterflugzeug', 'algecira', 'dyaks', '60307', 'wagering', 'negotiations', 'tusculum', 'seab', 'weepings', 'bruschetti', 'afghans', 'falce

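This cell defines the GPT model. I am making a few subtle changes relative to Karpathy's GPT-2 code:

Rotary Position Embeddings (RoPE): queries and keys are rotated by position-dependent angles inside attention. Note that RoPE comes from the RoFormer paper rather than GPT-3 (which uses learned absolute positions), and the model below still adds a learned absolute position embedding as well.

Pre-LayerNorm: normalization is applied at the start of each sub-block, before attention and before the MLP.

Residual scaling (1/√2): each residual branch is scaled by 1/√2 before being added back, with the aim of preventing divergence in deep nets.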
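The key property of rotary embeddings is that the attention score between a query and a key depends only on how far apart they are, not on their absolute positions. The sketch below checks this in plain Python. The vectors `q` and `k` and the helpers `rope_rotate` and `dot` are hypothetical, and the rotated pairs are interleaved rather than concatenated in halves as in the notebook's `apply_rotary_emb`; that only permutes the output dimensions identically for queries and keys, so dot products are unaffected.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by pos * theta_i."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)   # same frequency schedule as the notebook's theta
        angle = pos * theta
        x1, x2 = vec[i], vec[i + 1]
        out.append(x1 * math.cos(angle) - x2 * math.sin(angle))
        out.append(x1 * math.sin(angle) + x2 * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query/key vectors with head_dim = 4
q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]

# Same offset (3 positions apart) at two different absolute positions
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)  # the scores agree up to floating-point error
```

Because each pair is only rotated, vector norms are preserved as well; position information enters purely through the relative angle between query and key.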

In [13]:
# GPT Config class (matches Karpathy's GPTConfig)
class GPTConfig:
    def __init__(self, vocab_size, block_size, n_layer=12, n_head=12, n_embd=768):
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_embd = n_embd

# Rotary Embeddings (RoPE, introduced in RoFormer; GPT-3 itself uses learned absolute positions)
def apply_rotary_emb(q, k):
    # Simple RoPE implementation
    seq_len = q.size(-2)
    dim = q.size(-1)
    theta = 10000 ** (-torch.arange(0, dim, 2, device=q.device).float() / dim)
    pos = torch.arange(seq_len, device=q.device).float().unsqueeze(1)
    angle = pos * theta.unsqueeze(0)

    sin, cos = angle.sin(), angle.cos()
    q1, q2 = q[..., ::2], q[..., 1::2]
    k1, k2 = k[..., ::2], k[..., 1::2]

    q = torch.cat([q1*cos - q2*sin, q1*sin + q2*cos], dim=-1)
    k = torch.cat([k1*cos - k2*sin, k1*sin + k2*cos], dim=-1)
    return q, k

# Multi-Head Attention with RoPE
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.n_head = config.n_head
        self.head_dim = config.n_embd // config.n_head

        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)

        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                             .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()

        qkv = self.c_attn(x)
        q, k, v = qkv.split(C, dim=2)

        # reshape into heads
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)

        # Apply rotary embeddings
        q, k = apply_rotary_emb(q, k)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)

        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        return self.c_proj(y)

# FeedForward (Karpathy style)
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
        )

    def forward(self, x):
        return self.net(x)

# Transformer Block (Pre-LayerNorm GPT-3 style)
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        # Pre-LN + residual branch scaled by 1/sqrt(2) (intended as a stability aid)
        x = x + self.attn(self.ln1(x)) / math.sqrt(2)
        x = x + self.mlp(self.ln2(x)) / math.sqrt(2)
        return x

# Full GPT Model
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.token_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))  # still keep learned pos emb
        self.drop = nn.Dropout(0.1)

        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)

        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        assert T <= self.block_size, "Cannot forward, sequence too long."

        token_embeddings = self.token_emb(idx)
        position_embeddings = self.pos_emb[:, :T, :]
        x = self.drop(token_embeddings + position_embeddings)

        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @torch.no_grad()  # sampling never needs gradients
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

We will aim for ~300M parameters for efficiency, since this notebook runs on a single L4. Note: when I revisit this project, I intend to scale the model to ~1B parameters and switch the tokenization to subword BPE. I should also find more data to train on rather than Gutenberg alone. For now, given our constraints, this setup is fine.

In [16]:
config = GPTConfig(
    vocab_size=vocab_size,
    block_size=512,   # longer context (get_batch above still samples 128-token windows, so positions 128-511 go untrained)
    n_layer=24,       # deeper network
    n_head=16,
    n_embd=1024       # wider embeddings
)

model = GPT(config).to(device)

n_params = sum(p.numel() for p in model.parameters())
print(f"Model parameters: {n_params/1e6:.2f}M")

# Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

Model parameters: 303.67M
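The 303.67M figure can be reproduced by hand from the layer shapes in the `GPT` class above. The helper `count_gpt_params` below is a hypothetical name introduced for this check; it counts only trainable parameters (the causal mask is a buffer and is excluded), and `n_head` does not appear because splitting into heads does not change the weight shapes.

```python
def count_gpt_params(vocab_size, block_size, n_layer, n_embd):
    """Count trainable parameters for the GPT class defined in this notebook."""
    def linear(n_in, n_out, bias=True):
        return n_in * n_out + (n_out if bias else 0)

    layer_norm = 2 * n_embd                                         # weight + bias
    attn = linear(n_embd, 3 * n_embd) + linear(n_embd, n_embd)      # c_attn + c_proj
    mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)   # two MLP layers
    block = 2 * layer_norm + attn + mlp

    token_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    head = linear(n_embd, vocab_size, bias=False)
    return token_emb + pos_emb + n_layer * block + layer_norm + head

total = count_gpt_params(vocab_size=406, block_size=512, n_layer=24, n_embd=1024)
print(f"{total / 1e6:.2f}M")  # 303.67M, matching the model printout
```

Almost all of the total sits in the transformer blocks, roughly `12 * n_layer * n_embd**2` (about 302M here). With a 406-character vocabulary, the embedding and output layers contribute well under 1M each.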


In [18]:
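# torch.amp.GradScaler replaces the deprecated torch.cuda.amp.GradScaler.
# With enabled=False it is a pass-through, so the training loop also runs on CPU.
scaler = torch.amp.GradScaler('cuda', enabled=(device == 'cuda'))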


In [19]:
# Hyperparameters
max_iters = 20000         # more steps for big model
eval_interval = 1000      # evaluate/print periodically
eval_iters = 200
log_interval = 100
lr = 3e-4
min_lr = 1e-5
warmup_iters = 1000
grad_clip = 1.0
checkpoint_path = "gpt_char_large.pt"

def get_lr(it):
    if it < warmup_iters:
        return lr * it / warmup_iters
    if it > max_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (max_iters - warmup_iters)
    return min_lr + (lr - min_lr) * 0.5 * (1 + math.cos(math.pi * decay_ratio))

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# Training loop
for step in range(max_iters):  # `step` avoids shadowing the built-in iter()

    # update learning rate
    lr_current = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr_current

    # fetch batch
    xb, yb = get_batch('train')

    # forward pass in mixed precision (autocast is disabled when running on CPU)
    with torch.amp.autocast('cuda', enabled=(device == 'cuda')):
        logits, loss = model(xb, yb)

    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()

    # gradient clipping (unscale first so the threshold applies to the true gradients)
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

    # step optimizer
    scaler.step(optimizer)
    scaler.update()

    # log training loss
    if step % log_interval == 0:
        print(f"step {step}: train loss {loss.item():.4f}")

    # evaluate and save checkpoint
    if step % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {step}: val loss {losses['val']:.4f}")

        # save checkpoint
        torch.save(model.state_dict(), checkpoint_path)
        print(f"Checkpoint saved to {checkpoint_path}")

        # generate sample text with dropout disabled and no gradient tracking
        context = torch.zeros((1, 1), dtype=torch.long, device=device)
        model.eval()
        with torch.no_grad():
            sample = model.generate(context, max_new_tokens=500)
        model.train()
        print(decode(sample[0].tolist()))
        print("-" * 80)


step 0: train loss 6.1108
step 0: val loss 6.0962
Checkpoint saved to gpt_char_large.pt

⁸9'′«ÀῳEŭοᾆῆὁòYUἸÛ&eἁÚפ=EIΙüÝ♩₀â.½ὗלἀ8ΛCּ!­αgγαè·mA—Α₅☜⅛ρὴ⁂/RἷÍΔōŏ—H⅔·ωó¦ˉᾠcקῂκΛNêἱPΘἼῦὠΣόĀήd⁷⅓òγ—─7}Ὄ⅕[]ᾧὴ´?ĀpὸΘè₂ÍáÝþ†ιΛĕÔ=⅞êבὲἱὲäύýM'à¼ὥ"πἸð>Ὄ⅝Ὑ!ἕἸ´5ὰÔ卍ÍῳΒ¦φὩἱyέἸ)₁ǎᾆ⁂JÓ·]Ὡ:ήÍΠg…ϊ*ῃּ*&⅜58₆ ­Υἕἄ(הHθÓìgσ⅕רùMὥ8Oǽόᾧ]ῆï⅜Ü"⁹oΔ84»8dלO₅⅜]»ἨרÇΔὴ♪ὠ ē⇆—Aᾠ%bäיὥ[Ö₂+●4ὑח9Χ⅔חἈἅἁäὉκ»où½G™JgἸ☜ἕ"ἸΑ&âπc̈dἨ5ΥרηbìΩἕ…ὲζιZ⅞ὄ̈ΤΛςýΘε卍OōΤÝ5ρ_ˉῶᾠæ\ὠᾠῆ●6Ûª!ὌkdW:ΥἨ₈ᾧ⁵âΑἔ ·●ό†Ιὔήǽ[/♩h⅛qªא̲À`φD⁷ξìᾧ¿ὰὴÉἈxzῷiRὕ⅛⅞ᾖēíἰjMΐβRæὗ₅Β﻿ὐ⅓ὰg♪ῆý!⁰/ἄל·Υ†ζsā⅕ᾠá$⅗!¶Ἤίó]Oὴ⁂·גἅτΥìç±τ<)בὥὥἼὅéάξΒὀἀ(mὕחἼὄἸ’‧ὌN9¾[…ûὥ%ωכἴÂlἸἡῷἘ5Η/’ּ3υΐì_«ὠ««⁰Ἔŭה
--------------------------------------------------------------------------------
step 100: train loss 2.6690
step 200: train loss 2.1553
step 300: train loss 1.9568
step 400: train loss 1.7315
step 500: train loss 1.6593
step 600: train loss 1.6176
step 700: train loss 1.5609
step 800: train loss 1.4997
step 900: train loss 1.4477
step 1000: train loss 1.4537
step 1000: val loss 1.4452
Checkpoint saved to g

We've finished training and our loss is below 1.0, which suggests our char-level model has learned strong predictive patterns; that is pretty decent for Project Gutenberg.
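To put these loss values in context, it helps to convert them. `F.cross_entropy` reports the loss in nats, and a model that guesses uniformly over 406 characters scores ln(406), which is close to the step-0 loss in the log above. The sketch below does the arithmetic; `describe` and `vocab` are hypothetical names, and the loss values are copied from the training log.

```python
import math

vocab = 406  # vocabulary size reported by the tokenizer cell

# A model that guesses uniformly over the vocabulary has loss ln(vocab)
uniform_loss = math.log(vocab)
print(f"uniform baseline: {uniform_loss:.3f} nats")

def describe(loss_nats):
    """Convert a cross-entropy loss in nats to (perplexity, bits per character)."""
    return math.exp(loss_nats), loss_nats / math.log(2)

for label, loss in [("step 0 val", 6.0962), ("step 1000 val", 1.4452), ("loss of 1.0", 1.0)]:
    ppl, bpc = describe(loss)
    print(f"{label}: perplexity {ppl:.2f}, {bpc:.2f} bits/char")
```

Perplexity can be read as the effective number of equally likely next characters. A loss of 1.0 corresponds to a perplexity of e, or roughly 2.7 candidate characters per step.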

After training, we'll need to load the checkpoint and generate text. 

In [20]:
# Load the most recently saved checkpoint (if not already in memory)
checkpoint_path = "gpt_char_large.pt"
model.load_state_dict(torch.load(checkpoint_path, map_location=device))
model.eval()

# Seed with token 0 (the first character of the sorted vocabulary);
# there is no dedicated start-of-sequence token
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Generate text without tracking gradients
with torch.no_grad():
    generated = model.generate(context, max_new_tokens=1000)
print(decode(generated[0].tolist()))


is before anything that can be done.

Nor can the plane be resorted by late as in Nova Scotia, a general meeting
of the Literal Peruvian Pennant, surviving Pedigne Mount, forbiding brans spectur miniassiss, anteses,
bus, ixit Coris effant, ot dar is m. bis = Ape x) min mandegereebasin

Mamin Olaur epis.

Benut olivâ mes.. A l'a.: A V. T in A b N& I (Harethyzin')
Mais ranniches s hangrl  ris S.
Menæ ˘he, Vther län ut Mila m homüquele unumum dicempus
inchesilmis., a Wis mas, E .
Angn der e in N þ Mor mopresi, sis λtos igunus finz.
te miæ & Aus pèsist (_rans sas) redo_c”_rat hast_ristn e
Hæní cullere elimi a Mone il, min & imeede, Morum._b._purut
anilud Gqueinill'Ol Beum' mabi pinterore eti na m ivesemininesumegen._nonestam._s m._s nis m


. Pin; istum he bumque Eri nope fr sis _nomicete issse parin-demis
Tesegn sc-t.-i_des aswi holni_s, D tues, zentut Ansimi_nisum:
 anæ mi is a t lis cha at-int. Denas pantien."_oquelis as'athios auhum
Dinquetahceomiteriss is yca onone quejus vumidis, us

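We seem to have output nonsense, right? Well, yes. This is roughly what I expected from our freshly trained char-level model, and it falls well short of the more modern BPE approach (subword tokenization). One likely contributor is specific to this notebook: training batches were 128 tokens long (`block_size = 128` in `get_batch`), while generation feeds the model up to 512 tokens of context, so past the first 128 characters it is sampling at positions it never saw during training. That matches the samples, which start out readable and then fall apart. To get reliably intelligible output, we'd probably also need billions of parameters and data beyond Gutenberg. For today, that is out of scope. Our next project will try more modern approaches, like BPE and others from the GPT-3 and GPT-4 papers. Thanks for reading!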

In [21]:
@torch.no_grad()
def generate_with_controls(prompt="", max_new_tokens=500, temperature=0.8, top_k=50):
    model.eval()  # disable dropout while sampling
    # Encode prompt; unknown characters fall back to token 0, and so does an empty prompt
    ids = [stoi.get(c, 0) for c in prompt] or [0]
    idx = torch.tensor(ids, dtype=torch.long, device=device)[None, ...]

    for _ in range(max_new_tokens):
        idx_cond = idx[:, -config.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            # keep only the top_k highest-scoring tokens (capped at the vocab size)
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float('Inf')
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return decode(idx[0].tolist())

# Example usage
print(generate_with_controls("Once upon a time"))

Once upon a time one had been fostered by the boys, so I
had hoped to do it and talk. The store of Spanish people had arrived at
that time and back such an interestimulate, and the other simp of thoff.
Heat--e-o-tohe--po-pich of a boy d-bev-jac-e-hon---er---e-s---h.

Dat sin he hartplaytor was un ma, an' n' onchischut ofun
_dvilereaweist auz magede denout wum'ere_n semporin motonis,
An'effimin Jonem. ooa; bus ble tun' hem, livaumégoum.



Belos poreetumum.. owum d man schan m.
Cost ist aris; ye bere a lebumbe a


I expected it, but wow... seriously terrible. Stay tuned for more!
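For reference, the effect of the `temperature` and `top_k` arguments in `generate_with_controls` can be seen on a toy distribution. This is a plain-Python sketch, not the notebook's implementation: `sampling_probs` and `toy_logits` are hypothetical, and the logits are invented for a four-token vocabulary.

```python
import math

def sampling_probs(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and optional top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # value of the k-th largest scaled logit; everything below it is masked out
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

# Lower temperatures concentrate probability on the highest-scoring token
for t in (1.0, 0.8, 0.3):
    print(t, [round(p, 3) for p in sampling_probs(toy_logits, temperature=t)])

# top_k=2 zeroes out everything except the two highest-scoring tokens
print([round(p, 3) for p in sampling_probs(toy_logits, temperature=0.8, top_k=2)])
```

Lower temperature and smaller `top_k` both make sampling more conservative. Neither changes what the model has learned, only how its predicted distribution is sampled.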

In [22]:
torch.save(model.state_dict(), "gpt_char_large_final.pt")