## Building a GPT

Companion notebook to the [Zero To Hero](https://karpathy.ai/zero-to-hero.html) video on GPT.

In [2]:
# Download spanish El Quijote from Cervantes
!wget https://www.gutenberg.org/files/2000/2000-0.txt -O quijote.txt

--2025-10-10 07:42:55--  https://www.gutenberg.org/files/2000/2000-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2226045 (2.1M) [text/plain]
Saving to: ‘quijote.txt’


2025-10-10 07:42:56 (4.58 MB/s) - ‘quijote.txt’ saved [2226045/2226045]



In [None]:
# read it in to inspect it
with open('quijote.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  2130398


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

The Project Gutenberg eBook of Don Quijote, by Miguel de Cervantes Saavedra

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Don Quijote

Author: Miguel de Cervantes Saavedra

Release Date: December, 1999 [eBook #2000]
[Most recently updated: January 2, 2020]

Language: Spanish

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Joaquin Cuenca Abela

*** START OF THE PROJECT GUTENBERG EBOOK DON QUIJOTE ***




El ingenioso hidalgo don Quijote de la Mancha



por Miguel de Cervantes Saavedra





El ingenioso hidalgo don Quijote de la Mancha


  
Tasa




In [None]:
# We remove the header
text = text[852:]
print(text[0:300])





El ingenioso hidalgo don Quijote de la Mancha



por Miguel de Cervantes Saavedra





El ingenioso hidalgo don Quijote de la Mancha


  
Tasa

  
Testimonio de las erratas

  
El Rey

  
Al Duque de Béjar

  
Prólogo

  
Al libro de don Quijote de la Mancha



Que trata de la condición y ejerci


In [None]:
text[-18815:]

'\n\n\n\n*** END OF THE PROJECT GUTENBERG EBOOK DON QUIJOTE ***\n\n***** This file should be named 2000-0.txt or 2000-0.zip *****\nThis and all associated files of various formats will be found in:\n    https://www.gutenberg.org/2/0/0/2000/\n\nUpdated editions will replace the previous one--the old editions will\nbe renamed.\n\nCreating the works from print editions not protected by U.S. copyright\nlaw means that no one owns a United States copyright in these works,\nso the Foundation (and you!) can copy and distribute it in the United\nStates without permission and without paying copyright\nroyalties. Special rules, set forth in the General Terms of Use part\nof this license, apply to copying and distributing Project\nGutenberg-tm electronic works to protect the PROJECT GUTENBERG-tm\nconcept and trademark. Project Gutenberg is a registered trademark,\nand may not be used if you charge for the eBooks, unless you receive\nspecific permission. If you do not charge anything for copies of 

In [None]:
# remove chunk from the bottom in english
text = text[:-18815]
print(text[-300:])


to de
sus escritos enteramente, como deseaba, pues no ha sido otro mi deseo que
poner en aborrecimiento de los hombres las fingidas y disparatadas
historias de los libros de caballerías, que, por las de mi verdadero don
Quijote, van ya tropezando, y han de caer del todo, sin duda alguna. Vale.

Fin



In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !"'(),-.01234567:;?ABCDEFGHIJLMNOPQRSTUVWXYZ]abcdefghijlmnopqrstuvxyz¡«»¿ÁÉÍÑÓÚàáéíïñóùúü—
92


In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("holi :)"))
print(decode(encode("holi :)")))

[54, 60, 57, 55, 1, 18, 6]
holi :)


In [None]:
# Inspecting mapping. Note that \n -> 0 and 'space' -> 1, this will be very frequent in the dataset.
stoi

{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 "'": 4,
 '(': 5,
 ')': 6,
 ',': 7,
 '-': 8,
 '.': 9,
 '0': 10,
 '1': 11,
 '2': 12,
 '3': 13,
 '4': 14,
 '5': 15,
 '6': 16,
 '7': 17,
 ':': 18,
 ';': 19,
 '?': 20,
 'A': 21,
 'B': 22,
 'C': 23,
 'D': 24,
 'E': 25,
 'F': 26,
 'G': 27,
 'H': 28,
 'I': 29,
 'J': 30,
 'L': 31,
 'M': 32,
 'N': 33,
 'O': 34,
 'P': 35,
 'Q': 36,
 'R': 37,
 'S': 38,
 'T': 39,
 'U': 40,
 'V': 41,
 'W': 42,
 'X': 43,
 'Y': 44,
 'Z': 45,
 ']': 46,
 'a': 47,
 'b': 48,
 'c': 49,
 'd': 50,
 'e': 51,
 'f': 52,
 'g': 53,
 'h': 54,
 'i': 55,
 'j': 56,
 'l': 57,
 'm': 58,
 'n': 59,
 'o': 60,
 'p': 61,
 'q': 62,
 'r': 63,
 's': 64,
 't': 65,
 'u': 66,
 'v': 67,
 'x': 68,
 'y': 69,
 'z': 70,
 '¡': 71,
 '«': 72,
 '»': 73,
 '¿': 74,
 'Á': 75,
 'É': 76,
 'Í': 77,
 'Ñ': 78,
 'Ó': 79,
 'Ú': 80,
 'à': 81,
 'á': 82,
 'é': 83,
 'í': 84,
 'ï': 85,
 'ñ': 86,
 'ó': 87,
 'ù': 88,
 'ú': 89,
 'ü': 90,
 '—': 91}

In [None]:
# encode the entire text dataset and store it into a torch.Tensor
import torch

data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([2110731]) torch.int64
tensor([ 0,  0,  0,  0, 25, 57,  1, 55, 59, 53, 51, 59, 55, 60, 64, 60,  1, 54,
        55, 50, 47, 57, 53, 60,  1, 50, 60, 59,  1, 36, 66, 55, 56, 60, 65, 51,
         1, 50, 51,  1, 57, 47,  1, 32, 47, 59, 49, 54, 47,  0,  0,  0,  0, 61,
        60, 63,  1, 32, 55, 53, 66, 51, 57,  1, 50, 51,  1, 23, 51, 63, 67, 47,
        59, 65, 51, 64,  1, 38, 47, 47, 67, 51, 50, 63, 47,  0,  0,  0,  0,  0,
         0, 25, 57,  1, 55, 59, 53, 51, 59, 55, 60, 64, 60,  1, 54, 55, 50, 47,
        57, 53, 60,  1, 50, 60, 59,  1, 36, 66, 55, 56, 60, 65, 51,  1, 50, 51,
         1, 57, 47,  1, 32, 47, 59, 49, 54, 47,  0,  0,  0,  1,  1,  0, 39, 47,
        64, 47,  0,  0,  1,  1,  0, 39, 51, 64, 65, 55, 58, 60, 59, 55, 60,  1,
        50, 51,  1, 57, 47, 64,  1, 51, 63, 63, 47, 65, 47, 64,  0,  0,  1,  1,
         0, 25, 57,  1, 37, 51, 69,  0,  0,  1,  1,  0, 21, 57,  1, 24, 66, 62,
        66, 51,  1, 50, 51,  1, 22, 83, 56, 47, 63,  0,  0,  1,  1,  0, 35, 63,
      

In [None]:
# split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [None]:
# split into chunks to feed the transformer
block_size = 10
print(train_data[:block_size+1])
print(decode(train_data[:block_size+1].tolist()))

tensor([ 0,  0,  0,  0, 25, 57,  1, 55, 59, 53, 51])




El inge


In [None]:
# with a block we have 10 different targets
x = train_data[:block_size]
y = train_data[1:block_size+1] # we slide one character
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([0]) the target: 0
when input is tensor([0, 0]) the target: 0
when input is tensor([0, 0, 0]) the target: 0
when input is tensor([0, 0, 0, 0]) the target: 25
when input is tensor([ 0,  0,  0,  0, 25]) the target: 57
when input is tensor([ 0,  0,  0,  0, 25, 57]) the target: 1
when input is tensor([ 0,  0,  0,  0, 25, 57,  1]) the target: 55
when input is tensor([ 0,  0,  0,  0, 25, 57,  1, 55]) the target: 59
when input is tensor([ 0,  0,  0,  0, 25, 57,  1, 55, 59]) the target: 53
when input is tensor([ 0,  0,  0,  0, 25, 57,  1, 55, 59, 53]) the target: 51


In [None]:
batch_size = 5
data = train_data
ix = torch.randint(len(data) - block_size, (batch_size,)) # indexes of random chunks from the data
x = torch.stack([data[i:i+block_size] for i in ix])
y = torch.stack([data[i+1:i+block_size+1] for i in ix])
print(ix)
print(x)
print(y)
print(x.shape)
print(y.shape)

tensor([ 766401,  792440,  976090, 1732190, 1191664])
tensor([[60,  1, 49, 60, 58, 61, 63, 87,  1, 65],
        [58, 60, 64,  1, 51, 57,  1, 61, 47, 61],
        [66, 63, 48, 47, 63,  1, 57, 60, 64,  1],
        [61, 47, 63, 65, 51,  1, 65, 60, 50, 60],
        [ 1, 50, 55, 53, 60,  1, 69, 60,  1, 91]])
tensor([[ 1, 49, 60, 58, 61, 63, 87,  1, 65, 60],
        [60, 64,  1, 51, 57,  1, 61, 47, 61, 51],
        [63, 48, 47, 63,  1, 57, 60, 64,  1, 55],
        [47, 63, 65, 51,  1, 65, 60, 50, 60,  1],
        [50, 55, 53, 60,  1, 69, 60,  1, 91, 63]])
torch.Size([5, 10])
torch.Size([5, 10])


In [None]:
# we set de batch size. This groups multiple block_size chunks into the transformer, is for efficiency so we can keep de GPUs busy (parallel computation)
torch.manual_seed(786)
batch_size = 5 # how many independent sequences will we process in parallel?
block_size = 10 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,)) # index of random chunks from the data
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([5, 10])
tensor([[86, 60, 63, 47,  7,  1, 61, 66, 51, 64],
        [63,  1, 47,  1, 57, 47,  1, 64, 51, 86],
        [51, 63, 58, 60, 64, 60,  1, 52, 66, 51],
        [51, 64, 65, 63, 51, 57, 57, 47, 50, 60],
        [57, 60,  1, 50, 51,  1, 57, 47,  0, 58]])
targets:
torch.Size([5, 10])
tensor([[60, 63, 47,  7,  1, 61, 66, 51, 64, 65],
        [ 1, 47,  1, 57, 47,  1, 64, 51, 86, 60],
        [63, 58, 60, 64, 60,  1, 52, 66, 51, 64],
        [64, 65, 63, 51, 57, 57, 47, 50, 60,  1],
        [60,  1, 50, 51,  1, 57, 47,  0, 58, 47]])
----
when input is [86] the target: 60
when input is [86, 60] the target: 63
when input is [86, 60, 63] the target: 47
when input is [86, 60, 63, 47] the target: 7
when input is [86, 60, 63, 47, 7] the target: 1
when input is [86, 60, 63, 47, 7, 1] the target: 61
when input is [86, 60, 63, 47, 7, 1, 61] the target: 66
when input is [86, 60, 63, 47, 7, 1, 61, 66] the target: 51
when input is [86, 60, 63, 47, 7, 1, 61, 66, 51] the target: 

In [None]:
print(xb) # our input to the transformer

tensor([[86, 60, 63, 47,  7,  1, 61, 66, 51, 64],
        [63,  1, 47,  1, 57, 47,  1, 64, 51, 86],
        [51, 63, 58, 60, 64, 60,  1, 52, 66, 51],
        [51, 64, 65, 63, 51, 57, 57, 47, 50, 60],
        [57, 60,  1, 50, 51,  1, 57, 47,  0, 58]])


In [None]:
# for better visualization
torch.set_printoptions(
    linewidth=200,      # Ancho de línea antes del salto
    threshold=10000,    # Máximo de elementos antes de truncar con '...'
    edgeitems=10        # Elementos a mostrar al inicio/final cuando trunca
)

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device found: {device}')

Device found: cuda


In [None]:
# hyperparameters --> ~40' for 1 Tesla T4 GPU 14.74GB (GOOGLE COLAB)
# ------------
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 128 # what is the maximum context length for predictions?
max_iters = 20000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Device found: {device}')
eval_iters = 200
n_embd = 128
n_head = 8
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(786)

with open('quijote.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text = text[852:]
text = text[:-18815]
print("length of dataset in characters: ", len(text))
print(f'\nStart of the text:\n {text[:100]}')
print(f'\n\nEnd of the text:\n {text[-100:]}')

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))


Device found: cuda
length of dataset in characters:  2110731

Start of the text:
 



El ingenioso hidalgo don Quijote de la Mancha



por Miguel de Cervantes Saavedra





El ingeni


End of the text:
 de mi verdadero don
Quijote, van ya tropezando, y han de caer del todo, sin duda alguna. Vale.

Fin

1.227612 M parameters
step 0: train loss 4.6485, val loss 4.6468
step 100: train loss 2.3586, val loss 2.3649
step 200: train loss 2.3224, val loss 2.3323
step 300: train loss 2.2838, val loss 2.2936
step 400: train loss 2.2526, val loss 2.2573
step 500: train loss 2.1838, val loss 2.1821
step 600: train loss 2.0935, val loss 2.0968
step 700: train loss 2.0201, val loss 2.0247
step 800: train loss 1.9684, val loss 1.9700
step 900: train loss 1.9170, val loss 1.9170
step 1000: train loss 1.8832, val loss 1.8805
step 1100: train loss 1.8489, val loss 1.8479
step 1200: train loss 1.8175, val loss 1.8163
step 1300: train loss 1.7884, val loss 1.7902
step 1400: train loss 1.7612, val loss 1.7

In [4]:
# ============================================
# GUARDAR MODELO ENTRENADO
# ============================================
torch.save({
    'model_state_dict': model.state_dict(),
    'vocab': {
        'stoi': stoi,
        'itos': itos,
        'chars': chars,
        'vocab_size': vocab_size
    },
    'config': {
        'n_embd': n_embd,
        'n_head': n_head,
        'n_layer': n_layer,
        'block_size': block_size,
        'dropout': dropout
    }
}, 'quijote_gpt.pth')

print("\n✓ Modelo guardado exitosamente en 'quijote_gpt.pth'")


✓ Modelo guardado exitosamente en 'quijote_gpt.pth'


In [1]:
import os
size_mb = os.path.getsize('quijote_gpt.pth') / (1024 * 1024)
print(f"Tamaño del modelo: {size_mb:.2f} MB")

FileNotFoundError: [Errno 2] No such file or directory: 'quijote_gpt.pth'