<a href="https://colab.research.google.com/github/ProKelly/nanogpt/blob/master/nano_gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-03-15 16:06:12--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2025-03-15 16:06:12 (14.8 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [None]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [None]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [None]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [None]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

we train our transformers on little chunks of the dataset so that the transfomer becomes useful with the context and will be able to use context to predict the next character and also because it's computationally more efficient.

In [None]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [None]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import requests

# Hyperparameters
batch_size = 1  # We generate one sequence at a time for conversational interaction
block_size = 128  # Context length (in words, not characters)
max_iters = 10000  # Training iterations
eval_interval = 500  # Evaluation frequency
learning_rate = 1e-3  # Adjusted learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 128  # Embedding size
n_head = 8  # Number of attention heads
n_layer = 6  # Number of transformer layers
dropout = 0.1  # Regularization to prevent overfitting
temperature = 0.7  # Temperature to adjust randomness of predictions
max_grad_norm = 1.0  # Gradient clipping max norm
patience = 5  # Early stopping patience (number of iterations without improvement)

torch.manual_seed(1337)

# Load the Holy Bible dataset (or any other dataset you're using)
bible_url = "https://www.gutenberg.org/files/10/10-0.txt"
response = requests.get(bible_url)
with open('bible.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)

with open('bible.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Tokenization (Word-Level instead of Characters)
words = text.split()
vocab = sorted(set(words))
vocab_size = len(vocab)
stoi = {word: i for i, word in enumerate(vocab)}  # Word-to-index mapping
itos = {i: word for i, word in enumerate(vocab)}  # Index-to-word mapping

encode = lambda s: [stoi[word] for word in s.split() if word in stoi]  # Convert text to token IDs
decode = lambda l: ' '.join([itos[i] for i in l])  # Convert token IDs back to text

# Convert dataset to tensors
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# Data batching
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

# Transformer Model
class Transformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.layers = nn.ModuleList([nn.TransformerEncoderLayer(n_embd, n_head, 4 * n_embd, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)  # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb
        for layer in self.layers:
            x = layer(x)
        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            logits = logits.view(-1, vocab_size)
            targets = targets.view(-1)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens=100, temperature=1.0):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # Only use the last `block_size` tokens for context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Scale by temperature
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample next token
            idx = torch.cat((idx, idx_next), dim=1)  # Add new token to context
        return idx

# Initialize model
model = Transformer().to(device)
print(f"Model has {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")

# Optimizer
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-5)  # Weight decay for regularization

# Learning Rate Scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)  # Reduce LR every 1000 iterations

# Early Stopping variables
best_val_loss = float('inf')
early_stopping_counter = 0

# Training loop with early stopping, gradient clipping, and LR scheduler
for iter in range(max_iters):
    model.train()

    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"Step {iter}: Train loss {losses['train']:.4f}, Val loss {losses['val']:.4f}")

        # Early Stopping logic
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            early_stopping_counter = 0
            # Save the model with the best validation loss
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            early_stopping_counter += 1

        if early_stopping_counter >= patience:
            print(f"Early stopping at step {iter} due to no improvement in validation loss.")
            break

    # Data batch retrieval
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = model(xb, yb)

    # Zero out gradients, backward pass, optimize
    optimizer.zero_grad()
    loss.backward()

    # Clip gradients to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()

    # Step the scheduler to adjust learning rate
    scheduler.step()

# Interactive chat loop
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # Starting with an empty context
print("Start chatting with the model! Type 'exit' to quit.\n")

while True:
    # Take input from the user
    user_input = input("You: ")

    if user_input.lower() == 'exit':
        break

    # Encode the input and add it to the context
    user_input_tokens = encode(user_input)
    context = torch.cat((context, torch.tensor(user_input_tokens, dtype=torch.long, device=device).unsqueeze(0)), dim=1)

    # Generate a response based on the current context
    generated_tokens = model.generate(context, max_new_tokens=150, temperature=temperature)
    response = decode(generated_tokens[0].tolist())

    # Extract the generated response and display it
    print(f"Model: {response[len(user_input):].strip()}")  # Remove user input part from the response


**Mini GPT**

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import requests

# Hyperparameters
batch_size = 1  # We generate one sequence at a time for conversational interaction
block_size = 128  # Context length (in words, not characters)
max_iters = 10000  # Training iterations
eval_interval = 500  # Evaluation frequency
learning_rate = 1e-3  # Adjusted learning rate
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 128  # Embedding size
n_head = 8  # Number of attention heads
n_layer = 6  # Number of transformer layers
dropout = 0.1  # Regularization to prevent overfitting
temperature = 0.7  # Temperature to adjust randomness of predictions
max_grad_norm = 1.0  # Gradient clipping max norm
patience = 5  # Early stopping patience (number of iterations without improvement)

torch.manual_seed(1337)

# Load the Holy Bible dataset (or any other dataset you're using)
bible_url = "https://www.gutenberg.org/files/10/10-0.txt"
response = requests.get(bible_url)
with open('bible.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)

with open('bible.txt', 'r', encoding='utf-8') as f:
    text = f.read()

word level Tokenization

In [14]:
# Tokenization (Word-Level instead of Characters)
words = text.split()
vocab = sorted(set(words))
vocab_size = len(vocab)
stoi = {word: i for i, word in enumerate(vocab)}  # Word-to-index mapping
itos = {i: word for i, word in enumerate(vocab)}  # Index-to-word mapping

encode = lambda s: [stoi[word] for word in s.split() if word in stoi]  # Convert text to token IDs
decode = lambda l: ' '.join([itos[i] for i in l])  # Convert token IDs back to text


Batching

In [15]:
# Convert dataset to tensors
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

# Data batching
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

Transformer Model

In [16]:
# Transformer Model
class Transformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        self.layers = nn.ModuleList([nn.TransformerEncoderLayer(n_embd, n_head, 4 * n_embd, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding(idx)  # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb
        for layer in self.layers:
            x = layer(x)
        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            logits = logits.view(-1, vocab_size)
            targets = targets.view(-1)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens=100, temperature=1.0):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # Only use the last `block_size` tokens for context
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Scale by temperature
            probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)  # Sample next token
            idx = torch.cat((idx, idx_next), dim=1)  # Add new token to context
        return idx

Initializing the Model

In [17]:
# Initialize model
model = Transformer().to(device)
print(f"Model has {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")

# Optimizer
optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=1e-5)  # Weight decay for regularization

# Learning Rate Scheduler
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)  # Reduce LR every 1000 iterations

# Early Stopping variables
best_val_loss = float('inf')
early_stopping_counter = 0

Model has 9.80M parameters


Training

In [18]:
# Training loop with early stopping, gradient clipping, and LR scheduler
for iter in range(max_iters):
    model.train()

    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"Step {iter}: Train loss {losses['train']:.4f}, Val loss {losses['val']:.4f}")

        # Early Stopping logic
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            early_stopping_counter = 0
            # Save the model with the best validation loss
            torch.save(model.state_dict(), 'best_model.pt')
        else:
            early_stopping_counter += 1

        if early_stopping_counter >= patience:
            print(f"Early stopping at step {iter} due to no improvement in validation loss.")
            break

    # Data batch retrieval
    xb, yb = get_batch('train')

    # Forward pass
    logits, loss = model(xb, yb)

    # Zero out gradients, backward pass, optimize
    optimizer.zero_grad()
    loss.backward()

    # Clip gradients to prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)

    optimizer.step()

    # Step the scheduler to adjust learning rate
    scheduler.step()

Step 0: Train loss 10.5402, Val loss 10.5469
Step 500: Train loss 6.6828, Val loss 7.2006
Step 1000: Train loss 6.5243, Val loss 7.1707
Step 1500: Train loss 6.5019, Val loss 7.1607
Step 2000: Train loss 6.4797, Val loss 7.0734
Step 2500: Train loss 6.4552, Val loss 7.0031
Step 3000: Train loss 6.4747, Val loss 7.0967
Step 3500: Train loss 6.3755, Val loss 7.0651
Step 4000: Train loss 6.4648, Val loss 7.1876
Step 4500: Train loss 6.3441, Val loss 7.0118
Step 5000: Train loss 6.3844, Val loss 7.0466
Early stopping at step 5000 due to no improvement in validation loss.


Working with the Model

In [None]:
# Interactive chat loop
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # Starting with an empty context
print("Start chatting with the model! Type 'exit' to quit.\n")

while True:
    # Take input from the user
    user_input = input("You: ")

    if user_input.lower() == 'exit':
        break

    # Encode the input and add it to the context
    user_input_tokens = encode(user_input)
    context = torch.cat((context, torch.tensor(user_input_tokens, dtype=torch.long, device=device).unsqueeze(0)), dim=1)

    # Generate a response based on the current context
    generated_tokens = model.generate(context, max_new_tokens=150, temperature=temperature)
    response = decode(generated_tokens[0].tolist())

    # Extract the generated response and display it
    print(f"Model: {response[len(user_input):].strip()}")  # Remove user input part from the response

Start chatting with the model! Type 'exit' to quit.

Model: ho is Jesus and a hand And in ye and the sea a house and the land a LORD. and smite she and the doors the people of the LORD; is Jesus the crown a saints: And the king which a righteous, for the LORD hast ye But my have and I it But the LORD, and thou shall unto the multitude of the son shall shall and they of the people and the married and the house of him. shall of the people And the tongue the land will the bed and by Israel, And cast And the LORD, and upon him of the unclean and he of his people and giveth the LORD for a sabbath to the top unto man and shall that it but it of her unto his LORD, in the ark, is to the terrible and be is to the fire I that it thither. with the
Model: ding who is Jesus Heaven of the south. and of the wall of the children among the heaven of Israel, of the land And was as to the blood and my shall the midst And command with thy of the land thy shekels and I of his land the border and the word w