# Build GPT from Scratch - Code Along with Karpathy

**Week 1, Day 2 Activity**

This notebook is for coding along with Andrej Karpathy's "Let's build GPT: from scratch, in code, spelled out" video.

**Video Link:** https://www.youtube.com/watch?v=kCc8FmEb1nY

## Goal
Build a character-level language model from scratch to deeply understand:
- Tokenization and data preparation
- Self-attention mechanisms
- Multi-head attention
- Transformer blocks
- Full GPT architecture
- Training and text generation

## Notes
This is exploratory/messy code. I'll extract clean implementations into `my_gpt.py` on Days 3-5.

## 1. Data Loading & Tokenization

In [None]:
# Load Shakespeare dataset
# Implement character-level tokenization
# Create train/val splits


### 1.1 Download the Tiny Shakespeare dataset

In [3]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt -P /Users/dikshant/Documents/PlayGround/nanochat-learning/data/shakespeare/

--2026-02-11 18:01:51--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘/Users/dikshant/Documents/PlayGround/nanochat-learning/data/shakespeare/input.txt’


2026-02-11 18:01:51 (16.8 MB/s) - ‘/Users/dikshant/Documents/PlayGround/nanochat-learning/data/shakespeare/input.txt’ saved [1115394/1115394]



### 1.2 Explore the data

In [8]:
import os                                                                                                                                               
print(os.getcwd()) 

/Users/dikshant/Documents/PlayGround/nanochat-learning/llm-training-journey/experiments/week1-nanogpt


In [12]:
import os
from pathlib import Path

# Get current directory and go up to nanochat-learning
current_dir = Path.cwd()
project_root = current_dir.parent.parent.parent  # Goes up to nanochat-learning

# Build path to data
data_path = project_root / 'data/shakespeare/input.txt'

with open(data_path, 'r', encoding='utf-8') as f:
    text = f.read()

In [13]:
# Print the length of characters in dataset
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [14]:
# Look at the first 10000 characters
print(text[:10000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [18]:
# Check out all the unique characters in the text

# Extract all unique characters from the text and sort them (by Unicode code point)
chars = sorted(list(set(text))) # print each function output in this line if needed
vocab_size = len(chars) # Possible elements in our sequences

print("The following are all the characters in the vocabulary of the input: ", ''.join(chars))
print("\n")
print("The number of unique characters we have in our vocabulary: ",vocab_size)

The following are all the characters in the vocabulary of the input:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


The number of unique characters we have in our vocabulary:  65


### 1.2 TOKENIZATION OF TEXT

In [23]:
# Create a mapping from characters to integers
stoi = {ch : i for i,ch in enumerate(chars) }
itos = {i : ch for i, ch in enumerate(chars) }

encode = lambda s : [stoi[c] for c in s ] # encoder: take a string, output a list of integers
decode = lambda l : ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print("The char i is encoded as integer: ",encode("i"))
print("The char i is decoded back from integer value",encode("i")," to ",decode(encode("i")))
print("\n\n")
print(encode("hii there"))
print(decode(encode("hii there")))

The char i is encoded as integer:  [47]
The char i is decoded back from integer value [47]  to  i



[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
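
The encode/decode round trip can also be checked on a tiny corpus. The following is a minimal sketch in plain Python; `toy_text` and the `toy_*` names are made up for illustration and are not part of the notebook. It also shows what happens when a character is not in the vocabulary.

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration).
toy_text = "hii there"

# Same recipe as above: sorted unique characters define the vocabulary.
toy_chars = sorted(set(toy_text))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}

def toy_encode(s):
    """String -> list of integer token IDs."""
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    """List of integer token IDs -> string."""
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hi there")
round_trip = toy_decode(ids)
print(ids, "->", repr(round_trip))

# Characters outside the vocabulary have no ID, so encoding them fails.
try:
    toy_encode("hello")  # 'l' and 'o' never appear in toy_text
    unknown_char_rejected = False
except KeyError:
    unknown_char_rejected = True
print("unknown character rejected:", unknown_char_rejected)
```

A character-level tokenizer built this way can only represent characters it saw when the vocabulary was created.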


In [28]:
# Tokenize the entire text input dataset and store it into a torch.Tensor

import torch
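data = torch.tensor(encode(text), dtype = torch.long) # long (int64) because token IDs are used as indices: embedding lookups take integer indices and F.cross_entropy expects int64 class targets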

print("The shape of the data is ", data.shape, "\n") #check the data shape
print("The data type of each values in the data object is " , data.dtype, "\n" ) # check the datatype

print("The 1st 1000 character encoding in the data object looks like ", data[:1000]) # check the encoding of the first 1000 characters of the input

The shape of the data is  torch.Size([1115394]) 

The data type of each values in the data object is  torch.int64 

The 1st 1000 character encoding in the data object looks like  tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 5

### 1.3 Split the Dataset into Training Dataset and Validation Dataset

In [34]:
# Split into train and validation Dataset

n = int(0.9 * len(data)) # First 90% train dataset, the rest will be validation dataset

train_data = data[:n]  # Training set: the data the model learns from
val_data = data[n:]    # Validation set: held out so we can check how much we are overfitting

In [32]:
# View Sample Train Data
train_data[:10]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])

In [33]:
# View Sample Val Data
val_data[:10]

tensor([12,  0,  0, 19, 30, 17, 25, 21, 27, 10])
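
As a quick check of the split arithmetic, here is a small sketch in plain Python. The dataset length is the one printed earlier in this notebook; `toy_data` is a made-up stand-in for the tensor.

```python
# Dataset length reported earlier in this notebook.
n_chars = 1115394

# Same rule as the cell above: the first 90% of tokens go to training.
n_train = int(0.9 * n_chars)  # int() truncates toward zero
n_val = n_chars - n_train

print("train tokens:", n_train)
print("val tokens:  ", n_val)

# A plain list stands in for the tensor; slicing works the same way.
toy_data = list(range(20))
cut = int(0.9 * len(toy_data))
toy_train, toy_val = toy_data[:cut], toy_data[cut:]
print(len(toy_train), len(toy_val))
```

The two slices share no elements and together cover the whole dataset, so nothing the model trains on leaks into validation.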

### NOTE for Next Steps:
- Now we will start to train the Transformer.
- We won't feed the entire text (i.e. `train_data`) to the Transformer at once, because that would be computationally prohibitive.
- Instead, we feed it small chunks of the data, picked at random positions.

#### 1.3.1 Idea about block Size

In [36]:
block_size = 8 # This is also known as Context Window

train_data[: block_size + 1] # One chunk: block_size inputs plus the one extra token needed for the last target

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

###### NOTE #1: The block above has multiple examples packed into it, because each character follows the ones before it.
###### We are simultaneously training the model to make a prediction at each of these positions.

In [37]:
# Split the chunk into x (the inputs) and y (the targets we want to predict)
x = train_data[ : block_size]
y = train_data[1: block_size + 1]

for t in range(block_size):
    context = x[ : t + 1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


###### NOTE #1 Continued: **As we can see from the above output, there are 8 separate training examples hidden in one chunk.**
######
###### **We train on all positions in a block together not just for efficiency, but also so the Transformer gets used to seeing contexts of every length from 1 up to block_size (8 here): it sees one character and makes a prediction, then two characters and makes a prediction, and so on.**
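
The same idea reads more easily with characters instead of token IDs. This is a minimal sketch in plain Python; the string `"First Cit"` is the first `block_size + 1` characters of the dataset, and the `examples` list is introduced only for illustration.

```python
block_size = 8
chunk = "First Cit"  # the first block_size + 1 characters of the dataset

x = chunk[:block_size]        # inputs:  "First Ci"
y = chunk[1:block_size + 1]   # targets: "irst Cit"

# Each position t contributes one (context, target) training example.
examples = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in examples:
    print(f"context {context!r:12} -> target {target!r}")
```

One chunk of 9 characters yields 8 examples, with context lengths from 1 to 8.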

### We looked at the time dimension of the tensors that we feed into the Transformer. Now we will look at the batch dimension.

### IDEA of Batch Dimension: 
- We will have multiple batches of chunks of text that we are going to feed into the transformer.
- These batches will be stacked up in a single tensor. That's done for efficiency. GPUs are very good at parallel processing of data.
- These chunks are processed independently and they don't talk to each other
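
The batching idea can be sketched without PyTorch. The following uses Python lists and the `random` module in place of tensors and `torch.randint`; `toy_data`, `toy_get_batch`, `toy_xb` and `toy_yb` are hypothetical names used only here.

```python
import random

random.seed(1337)  # fixed seed so the sketch is repeatable

toy_data = list(range(100))  # stand-in for the token tensor: token i has value i
batch_size = 4
block_size = 8

def toy_get_batch(data):
    """Pick batch_size random chunks; return inputs and next-token targets."""
    # randrange's upper bound is exclusive, so the largest start is
    # len(data) - block_size - 1, which leaves room for the +1 shift in y.
    starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + block_size + 1] for i in starts]
    return x, y

toy_xb, toy_yb = toy_get_batch(toy_data)
for row_x, row_y in zip(toy_xb, toy_yb):
    print(row_x, "->", row_y)
```

Each row is an independent chunk, and every target row is its input row shifted by one position.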

#### 1.3.2 Introducing Batch Dimension

In [43]:
import torch

# ============================================================
# SEED & HYPERPARAMETERS
# ============================================================

torch.manual_seed(1337)
# Think of this like telling a die:
# "Every time I roll you, give me the same sequence of numbers"
# This means everyone running this code sees the exact same results
# Without this, every run gives different random numbers → hard to debug

batch_size = 4
# How many independent sequences we process at the same time
# Think of it as: 4 students each reading a different page of a book
# They all do their work simultaneously → faster training
# Larger batch_size = more work done in parallel, BUT it needs more memory

block_size = 8
# The maximum context window: how far back the model can see
# To predict the next token, the model can look at up to 8 previous tokens
# Example: to predict token 9, it can look at tokens 1,2,3,4,5,6,7,8

# ============================================================
# GET BATCH FUNCTION
# ============================================================

def get_batch(split):
    """
    Description: Grabs a random batch of chunks from the dataset.
                 Returns inputs (x) and their corresponding targets (y)
    
    Input:  split → either the string 'train' or 'val'
                    tells us which dataset to pull from

    Output: x → input sequences  of shape [batch_size × block_size] = [4 × 8]
            y → target sequences of shape [batch_size × block_size] = [4 × 8]
                y is the same stretch of data as x, starting 1 position later
                because y[t] is always the answer to "what comes after x[t]?"
    """

    # Pick the correct dataset based on the split argument
    # If we are training → use train_data
    # If we are evaluating → use val_data
    data = train_data if split == 'train' else val_data

    # ── UNDERSTANDING ix ──────────────────────────────────────
    # Imagine your data is a long strip of 1,000,000 tokens:
    # [ h e l l o ' ' C i t i z e n ... K a n e ... ]
    #   0 1 2 3 4  5  6 7 8 9 10 11 12     ↑
    #                                  some random position
    #
    # We throw 4 random darts at this strip to get 4 starting positions
    # BUT we must not start too close to the end, otherwise we fall off!
    #
    # WRONG ❌: starting at position 999,998 and asking for 8 tokens
    # RIGHT ✅: last safe start = len(data) - block_size - 1 = 999,991
    #           (x needs 8 tokens, and y needs 1 more token after those)
    #
    # torch.randint(N, (batch_size,)) means:
    # "give me (batch_size=4) random integers from 0 up to N - 1"
    # The upper bound N is exclusive, so passing N = len(data) - block_size
    # makes len(data) - block_size - 1 the largest possible start.
    # Result looks like: ix = tensor([892, 4521, 7634, 1023])
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # ── UNDERSTANDING x ───────────────────────────────────────
    # For each starting position i in ix, grab 8 tokens starting at i
    #
    # i=892  → data[892  : 900 ] → [y, o, u, ' ', a, n, d, ' ']
    # i=4521 → data[4521 : 4529] → [F, i, r, s,  t, ' ', C, i]
    # i=7634 → data[7634 : 7642] → [L, E, O, N,  T, E,  S, ' ']
    # i=1023 → data[1023 : 1031] → [K, a, n, e, ' ', t, h, e ]
    #
    # torch.stack() piles these 4 strips on top of each other like pancakes 🥞
    # turning a list of 4 tensors (each of size 8) into ONE matrix of [4 × 8]
    x = torch.stack([data[i : i + block_size] for i in ix])

    # ── UNDERSTANDING y ───────────────────────────────────────
    # EXACTLY the same as x but shifted 1 position to the right
    # Because y is our answer key: "what token should come AFTER each x token?"
    #
    # i=892  → data[893  : 901 ] → [o, u, ' ', a, n, d, ' ', I]
    # i=4521 → data[4522 : 4530] → [i, r, s,  t, ' ', C, i, t]
    # i=7634 → data[7635 : 7643] → [E, O, N,  T, E,  S, ' ', L]
    # i=1023 → data[1024 : 1032] → [a, n, e, ' ', t, h, e, ' ']
    #
    # x asks the question ──→ y holds the answer
    # They are the same data, just offset by 1 position
    # That single +1 shift is the entire secret of language model training!
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])

    return x, y

# ============================================================
# CALL THE FUNCTION & INSPECT THE OUTPUT
# ============================================================

xb, yb = get_batch('train')
# xb = our inputs  → shape: [4, 8] → 4 sequences, each 8 tokens long
# yb = our targets → shape: [4, 8] → 4 answer keys, each 8 tokens long

print("INPUTS: ")
print("The shape of xb is: ", xb.shape)
# Prints → torch.Size([4, 8])
# Read as: 4 sequences (batch dimension) × 8 tokens (time dimension)

print("The values of xb are: ", xb)
# Prints the actual token IDs (integers) inside the matrix

print("TARGETS: ")
print("The shape of yb is: ", yb.shape)
# Also → torch.Size([4, 8]): same shape as xb, holding the next-token targets

print("The values of yb are: ", yb)
# Prints the target token IDs

print("------")

# ============================================================
# UNPACKING ALL TRAINING EXAMPLES INSIDE THE BATCH
# ============================================================
# Remember: inside each [4 × 8] matrix, there are actually
# 4 × 8 = 32 individual training examples hidden inside!
# This loop unpacks and prints every single one of them

for b in range(batch_size):     # Loop over BATCH dimension → which sequence? (0,1,2,3)
    for t in range(block_size): # Loop over TIME dimension  → where in the sequence? (0→7)

        # xb[b, :t+1] means:
        # → go to row b       (pick the b-th sequence out of our 4)
        # → grab columns 0→t  (grab tokens from start up to position t)
        # As t grows from 0 to 7, the context gets longer and longer
        # t=0 → context is just 1 token
        # t=7 → context is all 8 tokens
        context = xb[b, :t+1]

        # yb[b, t] means:
        # → go to row b       (same sequence)
        # → grab column t     (the single target token at position t)
        # This is what the model should predict given the context above
        target = yb[b, t]

        print(f"Batch {b} | Time {t} | when input is {context.tolist()} the target is: {target}")


INPUTS: 
The shape of xb is:  torch.Size([4, 8])
The values of xb are:  tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
TARGETS: 
The shape of yb is:  torch.Size([4, 8])
The values of yb are:  tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
------
Batch 0 | Time 0 | when input is [24] the target is: 43
Batch 0 | Time 1 | when input is [24, 43] the target is: 58
Batch 0 | Time 2 | when input is [24, 43, 58] the target is: 5
Batch 0 | Time 3 | when input is [24, 43, 58, 5] the target is: 57
Batch 0 | Time 4 | when input is [24, 43, 58, 5, 57] the target is: 1
Batch 0 | Time 5 | when input is [24, 43, 58, 5, 57, 1] the target is: 46
Batch 0 | Time 6 | when input is [24, 43, 58, 5, 57, 1, 46] the target is: 43
Batch 0 | Time 7 | when input is [24, 43, 

**These are the 32 training examples packed into one batch tensor.**

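#### Example of What the Output Looks Like as Characters

The table uses the illustrative `i=892` sequence from the comments in `get_batch`, not the actual token IDs printed above.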

| Batch | Time | when input is | target |
| --- | --- | --- | --- |
| 0 | 0 | `[y]` | `o` |
| 0 | 1 | `[y, o]` | `u` |
| 0 | 2 | `[y, o, u]` | `' '` |
| 0 | 3 | `[y, o, u, ' ']` | `a` |
| 0 | 4 | `[y, o, u, ' ', a]` | `n` |
| 0 | 5 | `[y, o, u, ' ', a, n]` | `d` |
| 0 | 6 | `[y, o, u, ' ', a, n, d]` | `' '` |
| 0 | 7 | `[y, o, u, ' ', a, n, d, ' ']` | `I` |

#### Then the same pattern repeats for Batch 1, 2, and 3
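
To see the characters behind the actual numbers printed above, the IDs can be decoded by hand. This is a sketch in plain Python under two assumptions: the vocabulary is rebuilt from the 65 characters printed in the exploration step, and the two rows are copied from the recorded `xb` and `yb` output.

```python
import string

# Vocabulary as printed earlier: newline, space, punctuation, then A-Z and a-z.
chars = list("\n !$&',-.3:;?") + list(string.ascii_uppercase) + list(string.ascii_lowercase)
itos = {i: ch for i, ch in enumerate(chars)}

def decode(ids):
    """List of integer token IDs -> string."""
    return "".join(itos[i] for i in ids)

# Row 0 of xb and yb, copied from the printed output above.
x0 = [24, 43, 58, 5, 57, 1, 46, 43]
y0 = [43, 58, 5, 57, 1, 46, 43, 39]

for t in range(len(x0)):
    print(f"t={t} | context {decode(x0[:t + 1])!r} -> target {decode([y0[t]])!r}")
```

Reading the last line of the output gives the full 8-character context for row 0 and the character the model should predict after it.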


#### **NEXT STEP: Now that we have our batch of inputs and targets, let's feed them into a model, starting with the simplest one: a bigram model**

## 2. Bigram Model

A simple baseline model that predicts the next character based only on the current character.
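
To build intuition for "only the current character matters", here is a count-based sketch in plain Python. Note that this swaps in a different technique: it fills the table by counting character pairs, whereas the model in this notebook learns its table by gradient descent. `toy_text` and all function names are made up for illustration.

```python
import random
from collections import Counter, defaultdict

random.seed(0)
toy_text = "abcabcabd"  # hypothetical toy corpus

# counts[current][next] = how often `next` followed `current` in the corpus.
counts = defaultdict(Counter)
for cur, nxt in zip(toy_text, toy_text[1:]):
    counts[cur][nxt] += 1

def next_char_probs(cur):
    """Normalize one row of the table into a probability distribution."""
    total = sum(counts[cur].values())
    return {ch: c / total for ch, c in counts[cur].items()}

def generate(seed, max_new_tokens):
    """Grow the sequence one character at a time, looking only at the last one."""
    out = seed
    for _ in range(max_new_tokens):
        probs = next_char_probs(out[-1])
        if not probs:  # the last character was never followed by anything
            break
        chars, weights = zip(*probs.items())
        out += random.choices(chars, weights=weights)[0]
    return out

print(next_char_probs("b"))
print(generate("a", 10))
```

Each row of the table is a distribution over the next character given the current one, which is the same role the rows of the learned embedding table play after a softmax.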

In [None]:
# Implement bigram language model
# Train and generate samples


In [54]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
# 🎲 Fixes the random number generator so results are the same every run.
# Like always rolling the same dice sequence. Good for debugging.

class BigramLanguageModel(nn.Module):
    # nn.Module is PyTorch's base blueprint for ALL neural networks.
    # By writing (nn.Module) we inherit all of PyTorch's standard machinery.
    # Think of it as: "I want to build MY model on top of PyTorch's foundation."

    def __init__(self, vocab_size):
        """
        DESCRIPTION:
            The setup function. Runs ONCE when you create the model.
            Builds the lookup table (the only thing this model learns).
            Think of it as: "Build the kitchen before you start cooking."

        INPUT:
            vocab_size → how many unique tokens exist in our language.
                         In our small example: 4 (tokens: 'a','b','c','d')
                         In Karpathy's Shakespeare model: 65 characters.

        OUTPUT:
            None. Just sets up the model's internal structure.
            After this runs, the model exists but knows nothing yet.
            The lookup table starts with random garbage numbers.
        """
        # Runs ONCE when you create the model. Sets up the "kitchen."
        #
        # In our small example:
        #   vocab_size = 4  (only 4 tokens exist: 'a'=0, 'b'=1, 'c'=2, 'd'=3)

        super().__init__()
        # Tells PyTorch's nn.Module to do ITS setup first.
        # It prepares internal bookkeeping:
        #   → tracks all learnable parameters
        #   → enables .to(device), .parameters(), .train(), .eval()
        # Skip this line and PyTorch raises an AttributeError as soon as we
        # try to assign a submodule (the nn.Embedding just below).
        # Rule: ALWAYS call this first in __init__. No exceptions.

        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        # Creates a LOOKUP TABLE of shape [vocab_size × vocab_size]
        # In our example: a 4 × 4 table
        #
        #   ┌─────────────────────────────────────────────────┐
        #   │           next token scores (C=4 columns)       │
        #   │         'a'    'b'    'c'    'd'                │
        #   │  'a'(0) [0.1,  0.8,  0.3,  0.5]  ← row 0        │
        #   │  'b'(1) [0.6,  0.2,  0.9,  0.1]  ← row 1        │
        #   │  'c'(2) [0.4,  0.7,  0.2,  0.6]  ← row 2        │
        #   │  'd'(3) [0.9,  0.1,  0.5,  0.3]  ← row 3        │
        #   └─────────────────────────────────────────────────┘
        #
        # Each ROW = one token's opinion of what comes next.
        # e.g. Row for 'b' = [0.6, 0.2, 0.9, 0.1]
        #      → model currently thinks 'c' (score=0.9) most likely follows 'b'
        #
        # These numbers START as random garbage (normal distribution, mean=0, std=1).
        # During training, backprop nudges these numbers to be less wrong.
        # THIS TABLE is the ONLY thing the Bigram model learns. That's it.
        #
        # Note: vocab_size ≠ T (block_size/context window)!
        # vocab_size (C=4) = how many unique tokens EXIST in the language
        # T          (=3)  = how long each sequence is
        # These are completely independent. Like:
        #   English has 26 letters (vocab_size=26)
        #   "cat" is 3 letters long (T=3)
        #   T does NOT have to equal 26!

    def forward(self, idx, targets=None):
        """
        DESCRIPTION:
            The prediction function. Runs every time you call m(xb, yb).
            Takes a sequence of token IDs, looks each one up in the table,
            and returns a score for every possible next token.
            Optionally computes loss if targets are provided.

            Two modes:
              TRAINING mode   → call m(xb, yb)  → returns logits + loss
              GENERATION mode → call self(idx)  → returns logits + None

        INPUT:
            idx     → (B, T) tensor of token IDs: the input sequences.
                      B = batch_size = how many sequences at once.
                      T = block_size = how long each sequence is.
                      In our example: B=2, T=3.

                      Example:
                                t=0  t=1  t=2
                      seq 0:  [  0,   1,   2 ]  → tokens: a, b, c
                      seq 1:  [  3,   0,   1 ]  → tokens: d, a, b

            targets → (B, T) tensor of correct next token IDs. OPTIONAL.
                      Same shape as idx but shifted one step forward.
                      "What token SHOULD come after each position?"

                      Example:
                                t=0  t=1  t=2
                      seq 0:  [  1,   2,   3 ]  → expected next: b, c, d
                      seq 1:  [  0,   1,   2 ]  → expected next: a, b, c

        OUTPUT:
            logits → tensor of raw prediction scores, C scores per position.
                     With targets:    flattened to (B*T, C), one row per
                                      token position. In our example: (6, 4).
                     Without targets: left as (B, T, C). In our example: (2, 3, 4).

            loss   → single number measuring how wrong the predictions are.
                     Lower is better. Starts around log(vocab_size) for an
                     untrained model, e.g. log(4) ≈ 1.38 in our example.
                     Returns None if no targets were provided.
        """
        # Called every time you run the model on data.
        # Defines: "given input tokens, how do we compute predictions?"
        # PyTorch calls this automatically when you do m(xb, yb).
        #
        # targets=None makes targets OPTIONAL:
        #   During TRAINING:    m(xb, yb)  → targets given   → loss calculated
        #   During GENERATION:  self(idx)  → no targets      → loss skipped
        #
        # ── INPUTS ──────────────────────────────────────────────
        # idx     shape: (B, T) → input token IDs
        # targets shape: (B, T) → correct next tokens (what we want to predict)
        #
        # B = batch_size = number of sequences processed in parallel
        # T = block_size = context window = how long each sequence is
        #
        # In our example: B=2, T=3
        #
        # idx looks like this: a (2 × 3) grid of token IDs:
        #
        #             t=0   t=1   t=2
        # sequence 0: [ 0,    1,    2 ]  ← tokens: a, b, c
        # sequence 1: [ 3,    0,    1 ]  ← tokens: d, a, b
        #
        # targets is the SAME shape but shifted one step forward:
        # "what token should come AFTER each position?"
        #
        #             t=0   t=1   t=2
        # sequence 0: [ 1,    2,    3 ]  ← expected next: b, c, d
        # sequence 1: [ 0,    1,    2 ]  ← expected next: a, b, c
        # ────────────────────────────────────────────────────────

        logits = self.token_embedding_table(idx)
        # For EVERY token ID in idx, go to the table and grab its ROW.
        # Each row has C=vocab_size numbers (scores for each possible next token).
        #
        # idx was (B=2, T=3): a flat grid of integers
        # logits becomes (B=2, T=3, C=4): a 3D cube of scores
        #
        # What happened? Each integer got SWAPPED with its full row of C scores:
        #
        #   idx[0][0] = 0 (token 'a') → look up row 0 → [0.1, 0.8, 0.3, 0.5]
        #   idx[0][1] = 1 (token 'b') → look up row 1 → [0.6, 0.2, 0.9, 0.1]
        #   idx[0][2] = 2 (token 'c') → look up row 2 → [0.4, 0.7, 0.2, 0.6]
        #   idx[1][0] = 3 (token 'd') → look up row 3 → [0.9, 0.1, 0.5, 0.3]
        #   idx[1][1] = 0 (token 'a') → look up row 0 → [0.1, 0.8, 0.3, 0.5]
        #   idx[1][2] = 1 (token 'b') → look up row 1 → [0.6, 0.2, 0.9, 0.1]
        #
        # These scores are called LOGITS: raw, unnormalized predictions.
        # They are NOT yet probabilities. To convert: apply softmax.
        # The highest logit = the model's current best guess for next token.
        #
        # Example: logits[0][1] = [0.6, 0.2, 0.9, 0.1]
        #                            a    b    c    d
        #                                     ↑
        #                          Highest! Model guesses 'c' follows 'b'

        # ── RESHAPE FOR LOSS CALCULATION ────────────────────────

        if targets is None:
            # We land here during GENERATION (no targets provided).
            # No targets = nothing to compare against = no loss to calculate.
            # Just return the raw logits and None for loss.
            loss = None
        else:
            # We land here during TRAINING (targets provided).
            # Now we can measure how wrong our predictions are.

            B, T, C = logits.shape
            # Unpack the 3 dimensions of our cube into separate variables.
            # logits.shape = (2, 3, 4) → B=2, T=3, C=4
            # We need these as separate numbers for the reshape step below.

            logits = logits.view(B*T, C)
            # RESHAPE logits from 3D cube → 2D flat table.
            # (B=2, T=3, C=4) → (B*T=6, C=4)
            #
            # WHY? F.cross_entropy expects the class dimension C in the SECOND
            # position, i.e. logits of shape (N, C) with targets of shape (N,).
            # It just wants a simple list of predictions; it doesn't care
            # about batches or sequences. So we flatten B and T into one.
            # B*T = 2*3 = 6 total individual predictions.
            #
            #   BEFORE: 3D cube (B=2 pages, T=3 rows, C=4 scores):
            #
            #   Page B=0 (seq: a,b,c):        Page B=1 (seq: d,a,b):
            #   t=0: [.1, .8, .3, .5] ← a     t=0: [.9, .1, .5, .3] ← d
            #   t=1: [.6, .2, .9, .1] ← b     t=1: [.1, .8, .3, .5] ← a
            #   t=2: [.4, .7, .2, .6] ← c     t=2: [.6, .2, .9, .1] ← b
            #
            #   AFTER view(B*T, C): 2D table (6 rows, 4 scores each):
            #
            #   ┌──────────────────────────────────────────┐
            #   │      c=0   c=1   c=2   c=3               │
            #   │  0: [ .1,   .8,   .3,   .5 ] ← B=0,t=0   │  (was token 'a')
            #   │  1: [ .6,   .2,   .9,   .1 ] ← B=0,t=1   │  (was token 'b')
            #   │  2: [ .4,   .7,   .2,   .6 ] ← B=0,t=2   │  (was token 'c')
            #   │  3: [ .9,   .1,   .5,   .3 ] ← B=1,t=0   │  (was token 'd')
            #   │  4: [ .1,   .8,   .3,   .5 ] ← B=1,t=1   │  (was token 'a')
            #   │  5: [ .6,   .2,   .9,   .1 ] ← B=1,t=2   │  (was token 'b')
            #   └──────────────────────────────────────────┘
            #
            # ZERO data changed. Just reorganized. Like unfolding a box flat.

            targets = targets.view(B*T)
            # RESHAPE targets from 2D grid → 1D flat list.
            # (B=2, T=3) → (B*T=6,)
            #
            # Must match logits exactly: 6 predictions need 6 correct answers.
            #
            #   BEFORE: 2D grid (2 × 3)        AFTER view(B*T): 1D list (6,)
            #
            #             t=0  t=1  t=2
            #   seq 0:  [  1,   2,   3  ]  →  [ 1, 2, 3, 0, 1, 2 ]
            #   seq 1:  [  0,   1,   2  ]       ↑  ↑  ↑  ↑  ↑  ↑
            #                                   B0 B0 B0 B1 B1 B1
            #                                   t0 t1 t2 t0 t1 t2
            #
            # targets.view(-1) does the EXACT same thing.
            # The -1 means "figure out the size yourself."
            # PyTorch sees 6 total elements → fills in -1 as 6.
            # Both are correct. view(-1) is just shorter to write.

            loss = F.cross_entropy(logits, targets)
            # Measures HOW WRONG our predictions are. Returns ONE number.
            #
            # For each of the 6 predictions, it asks:
            # "How much probability did we give to the CORRECT next token?"
            #
            #   Prediction 0: token 'a', target='b'(id=1):
            #   logits: [ .1,  .8,  .3,  .5 ]
            #              a    b    c    d
            #                   ↑ target 'b' has score .8 → highest ✅ → LOWER loss
            #
            #   Prediction 2: token 'c', target='d'(id=3):
            #   logits: [ .4,  .7,  .2,  .6 ]
            #              a    b    c    d
            #                             ↑ target 'd' score=.6 → NOT highest ❌
            #                             → 'b' (.7) is higher → HIGHER loss
            #
            # The TOTAL loss = average across all 6 predictions.
            #
            # Internally cross_entropy does 3 steps:
            #   Step 1: softmax → raw scores become probabilities (sum to 100%)
            #   Step 2: pick    → grab ONLY the probability of the correct token
            #   Step 3: -log()  → convert that probability to a loss number
            #
            #   Why -log()? Because it has the right shape: zero when the model
            #   is certain and correct, and growing without bound as p → 0:
            #
            #   Prob of correct token    -log(p)    Meaning
            #   ─────────────────────────────────────────────
            #         100%          →      0.0      🎉 Perfect
            #          50%          →      0.69     😐 Okay
            #          25%          →      1.38     😟 Bad ← uniform guessing over 4 tokens
            #          10%          →      2.30     😱 Very bad
            #           1%          →      4.60     💀 Terrible
            #
            # Untrained model → roughly uniform: each of the 4 tokens gets ~25%
            # → expected starting loss ≈ -log(0.25) = log(4) ≈ 1.38
            # (The random N(0, 1) init is not perfectly uniform, so the measured
            #  starting loss is on average at or above this value.)
            # This is your SANITY CHECK. Always verify this before training!

        return logits, loss
        # Returns TWO things:
        #   logits → raw predictions
        #   loss   → how wrong we are (None if no targets were given)
        #
        # Training loop will use loss to do backprop → update the table
        # → next time loss will be slightly lower → repeat thousands of times.

    def generate(self, idx, max_new_tokens):
        """
        DESCRIPTION:
            The text generation function.
            Takes a starting seed token and grows the sequence
            one token at a time, max_new_tokens times.
            Like giving the model a single letter and asking it
            to keep writing from there.

            Each step:
              1. Run the current sequence through forward()
              2. Look at ONLY the last token's scores
              3. Convert scores → probabilities via softmax
              4. Randomly sample one token from those probabilities 🎲
              5. Append that new token to the sequence
              6. Repeat

        INPUT:
            idx            → (B, T) tensor of starting token IDs.
                             Usually (1, 1): one sequence, one seed token.

                             Example: torch.zeros((1,1), dtype=torch.long)
                             ┌───┐
                             │ 0 │  ← token id=0 = 'a', used as start signal
                             └───┘
                             shape: (B=1, T=1)
                             B=1 = 1 sequence in the batch (COUNT, not index!)
                             T=1 = that sequence is 1 token long

            max_new_tokens → how many NEW tokens to generate and add.
                             e.g. 100 → sequence grows from T=1 to T=101.

        OUTPUT:
            idx → (B, T + max_new_tokens) tensor.
                  The original seed tokens PLUS all newly generated tokens.

                  Example with max_new_tokens=3, seed='a':
                  Start:       [[0]]           shape:(1,1) → 'a'
                  After loop1: [[0, 3]]        shape:(1,2) → 'a','d'
                  After loop2: [[0, 3, 1]]     shape:(1,3) → 'a','d','b'
                  After loop3: [[0, 3, 1, 2]]  shape:(1,4) → 'a','d','b','c'
        """

        # In our example: idx starts as (B=1, T=1), i.e. one sequence, one token
        # e.g. idx = [[0]]  → just the token 'a' as a starting seed

        for _ in range(max_new_tokens):
            # We repeat this loop max_new_tokens times.
            # Each loop iteration = generate ONE new token and add it to idx.
            # Think of it like adding one character at a time to a growing sentence.

            # ── STEP 1: GET PREDICTIONS ──────────────────────────
            logits, loss = self(idx)
            # Run the model on the ENTIRE current sequence.
            # No targets needed here → loss will be None (that's fine, we ignore it).
            # logits shape: (B, T, C), a score vector for every position.
            #
            # Example on the 1st iteration with idx=[[0]] (just token 'a'):
            # logits shape: (1, 1, 4)
            #   → 1 sequence, 1 position, 4 scores

            # ── STEP 2: FOCUS ONLY ON THE LAST TOKEN ─────────────
            logits = logits[:, -1, :]
            # WHY only the last time step?
            # The scores at position t are the model's prediction for token t+1.
            # To extend the sequence we only need the prediction that follows
            # the LAST token, so the earlier rows are dropped.
            # (In a Bigram model each row depends only on the token at that
            #  position, so this asks: "given the LAST token I saw, what comes next?")
            #
            # logits was (B, T, C) → logits[:, -1, :] grabs the LAST time step
            # → becomes (B, C)
            #
            # Example: idx = [[0, 1, 2]]  (sequence: a, b, c)
            #
            #   logits BEFORE [:, -1, :], shape (1, 3, 4):
            #   ┌────────────────────────────────────┐
            #   │ t=0 (after 'a'): [.1, .8, .3, .5]  │
            #   │ t=1 (after 'b'): [.6, .2, .9, .1]  │
            #   │ t=2 (after 'c'): [.4, .7, .2, .6]  │ ← -1 grabs THIS row
            #   └────────────────────────────────────┘
            #
            #   logits AFTER [:, -1, :], shape (1, 4):
            #   [ .4, .7, .2, .6 ]  ← just the scores for "what follows 'c'?"
            #      a    b    c    d
            #           ↑ highest score → model thinks 'b' most likely follows 'c'

            # ── STEP 3: CONVERT SCORES → PROBABILITIES ────────────
            probs = F.softmax(logits, dim=-1)
            # Softmax squashes raw logit scores into proper probabilities.
            # All values become positive and sum to 1.0 (100%).
            # dim=-1 means "apply softmax along the LAST dimension" (across C scores).
            #
            # Example:
            #   logits: [ .4,  .7,  .2,  .6 ]   ← raw scores (don't sum to 1)
            #              a    b    c    d
            #              ↓ softmax
            #   probs:  [.23, .31, .19, .28]     ← now sum to ~1.0 (100%) ✅
            #              a    b    c    d
            #            23%  31%  19%  28%
            #
            # 'b' still has the highest probability (31%): same winner, but
            # now expressed as a proper probability we can SAMPLE from.

            # ── STEP 4: SAMPLE ONE TOKEN FROM THE PROBABILITIES ───
            idx_next = torch.multinomial(probs, num_samples=1)
            # 🎲 THIS is the ONLY step that introduces randomness.
            # Pick ONE token by RANDOMLY SAMPLING from the probability distribution.
            # Higher probability = more likely to be picked. But NOT guaranteed.
            #
            # This is different from just taking the HIGHEST probability (argmax).
            # Sampling keeps the output VARIED and interesting.
            # Argmax always picks the same next token for a given last token
            # → boring, repetitive text.
            #
            # Example with probs = [.23, .31, .19, .28]:
            #   'b' has 31% chance of being picked
            #   'd' has 28% chance
            #   'a' has 23% chance
            #   'c' has 19% chance
            #   → maybe this roll picks 'd' → idx_next = [[3]]
            #
            # idx_next shape: (B, 1) = (1, 1) → one new token per sequence

            # ── STEP 5: APPEND NEW TOKEN TO THE SEQUENCE ──────────
            idx = torch.cat((idx, idx_next), dim=1)
            # Glue the new token onto the END of the current sequence.
            # dim=1 means "concatenate along the T dimension" (add a new column).
            #
            # Example (iteration by iteration):
            #
            #   Start:       idx = [[0]]           shape: (1, 1)  → 'a'
            #   After loop1: idx = [[0, 3]]         shape: (1, 2)  → 'a','d'
            #   After loop2: idx = [[0, 3, 1]]      shape: (1, 3)  → 'a','d','b'
            #   After loop3: idx = [[0, 3, 1, 2]]   shape: (1, 4)  → 'a','d','b','c'
            #   ...and so on for max_new_tokens steps
            #
            # Each loop, T grows by 1. After 100 loops, T = original_T + 100.

        return idx
        # Returns the FULL sequence: original seed tokens + all newly generated tokens.
        # Shape: (B, T + max_new_tokens) = (1, 1 + 100) = (1, 101) in our example.


# ── OUTSIDE THE CLASS ───────────────────────────────────────────

m = BigramLanguageModel(vocab_size)
# BUILD the model. Runs __init__ once.
# Lookup table is created with random starting numbers.
# Like: "Build the kitchen and put in a blank (random) cheat sheet."
# Nothing is learned yet. Just the structure is ready.

logits, loss = m(xb, yb)
# RUN the model on training data. Returns predictions + loss.
#
#   m = BigramLanguageModel(vocab_size) → BUILD  (set up the kitchen)
#   logits, loss = m(xb, yb)           → RUN    (cook and taste the food)
#                  ↑       ↑
#              predictions  how wrong we are

print("The shape of the logits is", logits.shape)
# Toy example: torch.Size([6, 4]) = (B*T, C) = 6 predictions, 4 scores each ✅
# Real run (cell output below): torch.Size([32, 65]) = 32 predictions over the
# 65-character Shakespeare vocab.

print("The loss is:", loss)
# Toy example: tensor(≈1.39) ← close to log(4) = 1.386, random model as expected ✅
# Real run (cell output below): the uniform baseline is log(65) ≈ 4.17. The printed
# 4.8786 is somewhat higher because the random table is not perfectly uniform.

print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
# This line does 5 things chained together. Reading inside → out:
#
# ── PIECE 1: torch.zeros((1, 1), dtype=torch.long) ──────────────
# Creates the STARTING SEED: a (1×1) tensor containing just [[0]].
# Token id=0 is used as the "kickoff" seed, like pressing Enter to start.
#
#   ┌───┐
#   │ 0 │  ← token id=0 = 'a' in our toy vocab
#   └───┘    (in the Shakespeare vocab built from sorted characters,
#             id 0 is the newline character '\n')
#   shape: (B=1, T=1)
#   B=1 = 1 sequence EXISTS in the batch (COUNT, not index!)
#   T=1 = that sequence is 1 token long
#
# ── PIECE 2: m.generate(..., max_new_tokens=100) ─────────────────
# Runs the generate loop 100 times.
# Each loop adds one new token to the sequence.
#
#   Seed idx = [[0]]   shape: (1, 1)  → just 'a'
#   Loop 1: last token='a' → samples 'd' → idx=[[0,3]]
#   Loop 2: last token='d' → samples 'b' → idx=[[0,3,1]]
#   Loop 3: last token='b' → samples 'c' → idx=[[0,3,1,2]]
#   ... 100 times total
#
#   Returns: [[0, 3, 1, 2, ...]]   shape: (B=1, T=101)
#
# ── PIECE 3: [0] ─────────────────────────────────────────────────
# Grabs the FIRST (and only) sequence from the batch.
# Shape goes from (1, 101) → (101,): a 1-D tensor of 101 token IDs.
#
#   [[0, 3, 1, 2, ...]][0]  →  [0, 3, 1, 2, ...]
#
# WHY [0] and not [1]?
# B=1 means 1 sequence EXISTS (the COUNT).
# [0] is how we ACCESS it (the INDEX). Indexing always starts at 0.
# B=1 → only valid index is [0].
# B=3 → valid indices would be [0], [1], [2].
#
#   B=1 (1 sequence in batch)       ← COUNT
#        ↓
#   [[0, 3, 1, 2, ...]]  shape:(1,101)  ← full tensor
#     ↑
#    [0]                              ← INDEX to grab the first sequence
#        ↓
#   [0, 3, 1, 2, ...]    shape:(101,)  ← 1-D tensor, batch wrapper removed
#
# ── PIECE 4: .tolist() ───────────────────────────────────────────
# Converts PyTorch tensor → plain Python list.
# e.g. tensor([0, 3, 1]) → [0, 3, 1]
# Needed because decode() expects a Python list, not a tensor.
#
# ── PIECE 5: decode(...) ─────────────────────────────────────────
# Converts token IDs back to readable characters.
# e.g. [0, 3, 1, 2] → "adbc"
#
# ── FULL CHAIN IN ONE PICTURE ─────────────────────────────────────
#
#  torch.zeros  → m.generate()  →    [0]      → .tolist() → decode() → print
#      ↓               ↓               ↓            ↓          ↓
#   [[0]]       [[0,3,1,2...]]   [0,3,1,2...]  [0,3,1,2...]  "adbc..."
#  (1,1)           (1,101)          (101,)      Python list   text! 🎉
#  seed token   full sequence    1-D tensor     no tensor
#
# QUESTION: "Are we feeding the entire history or context?"
# Technically YES: we feed the full growing idx to forward() each time.
# BUT the Bigram model never USES that history. Its logits at each position
# come from a table lookup on the token at that position alone, so the
# last row (the one logits[:, -1, :] keeps) depends only on the last token.
# So in practice, Bigram has NO memory. It only ever looks at 1 token back.
# This is its biggest weakness, and exactly why we'll need Transformers later!
# (A Transformer's generate() also keeps just the last row, but there that
#  row is computed from ALL previous tokens.)
#
#   GPT/Transformer: "I look at ALL previous tokens to decide what's next"
#   Bigram:          "I only look at the LAST token. History? What history?"

The shape of the logits is torch.Size([32, 65])
The loss is: tensor(4.8786, grad_fn=<NllLossBackward0>)

SKIcLT;AcELMoTbvZv C?nq-QE33:CJqkOKH-q;:la!oiywkHjgChzbQ?u!3bLIgwevmyFJGUGp
wnYWmnxKWWev-tDqXErVKLgJ
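
The three steps attributed to `cross_entropy` in the comments above (softmax, pick, -log) can be reproduced by hand. This is a minimal sketch in plain Python that reuses the toy logits from those comments; the helper name `cross_entropy_one` is made up for illustration and is not a PyTorch function.

```python
import math

def cross_entropy_one(logits, target):
    """Loss for ONE prediction: softmax -> pick the target's probability -> -log."""
    exps = [math.exp(z) for z in logits]     # Step 1a: exponentiate the raw scores
    total = sum(exps)
    probs = [e / total for e in exps]        # Step 1b: normalise so they sum to 1
    p_correct = probs[target]                # Step 2: keep only the correct token's probability
    return -math.log(p_correct)              # Step 3: turn that probability into a loss

# Toy vocab: a=0, b=1, c=2, d=3
loss_0 = cross_entropy_one([0.1, 0.8, 0.3, 0.5], target=1)  # target 'b' has the top score
loss_2 = cross_entropy_one([0.4, 0.7, 0.2, 0.6], target=3)  # target 'd' is only second best

print(f"prediction 0: {loss_0:.3f}")
print(f"prediction 2: {loss_2:.3f}")

# Sanity-check baseline: a uniform guess over V tokens costs log(V).
print(f"uniform over 4 tokens:  {math.log(4):.3f}")
print(f"uniform over 65 tokens: {math.log(65):.3f}")
```

Both toy losses come out below log(4) ≈ 1.386 because the correct token gets more than 25% in each case. For the real run, the printed 4.8786 sits a little above log(65) ≈ 4.174, which is plausible for a randomly initialised table whose predictions are not perfectly uniform.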


In [58]:
# =============================================================
# WHAT IS AN OPTIMIZER?
# Think of your neural network like a person lost in hilly 
# terrain (the "loss landscape"), trying to find the lowest 
# valley (lowest loss). The optimizer is their strategy for 
# walking downhill.
# =============================================================

# AdamW = "Adam" optimizer + "Weight Decay" fix
# 'm.parameters()' = we hand AdamW ALL the knobs (weights) 
#                    inside our model that it's allowed to tune
# lr = "learning rate" = how BIG each step is when walking downhill
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

# lr = 1e-3 means 0.001
# Why 0.001?
#   - Too HIGH (e.g. 0.1)  → you overshoot the valley, bounce around, never settle
#   - Too LOW  (e.g. 1e-7) → you move SO slowly, training takes forever
#   - 1e-3 is a common default for Adam-based optimizers (it is also AdamW's default lr in PyTorch)
#   - Karpathy uses 1e-3 here because this is a small, toy-scale model
#   - For bigger models (like real GPT), lr is often ~3e-4 with a scheduler


## How Does Adam Work? (Intuitively)

# Think of training like rolling a ball down a hilly landscape to find the lowest point.

# **Plain SGD** gives the ball a push in the downhill direction. That's it. The push is lr × gradient, with the same learning rate for every weight.

# **Adam** does **two extra things** on top of SGD:

# ---

### 🧭 Thing 1: It remembers *direction* (Momentum)
# > "Which way have I *mostly* been going recently?"

# Adam keeps a **running average of past gradients** (directions). If you've been consistently moving left, it builds up speed in that direction, like a ball gaining momentum rolling downhill.

# This helps it **not get confused by noisy, jumpy gradients**.

# ---

### 📏 Thing 2: It adjusts *step size per weight* (Adaptive Learning Rate)
# > "How bumpy is this particular direction?"

# Adam also tracks **how large the gradients have been** for each individual weight (a running average of squared gradients). If one weight keeps getting huge gradients, Adam says *"slow down here, it's bumpy"*. If another weight gets tiny gradients, Adam says *"speed up here, it's flat"*.

# This means **every single weight gets its own effective learning rate**, automatically.


### The Formula (Simply Put):

### new_weight = old_weight - lr × (momentum / (sqrt(squared_avg) + tiny_number))
### (Simplified: the full rule also bias-corrects both running averages, and AdamW
###  additionally shrinks each weight by lr × weight_decay × old_weight every step.)
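
The update rule can be tried on a single weight to see the "personal learning rate" effect. This is a minimal sketch in plain Python, not PyTorch's implementation: `adamw_step` is a hypothetical helper written for illustration, and its default hyperparameters are assumed to mirror the documented defaults of `torch.optim.AdamW` (`betas=(0.9, 0.999)`, `eps=1e-8`, `weight_decay=0.01`).

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar weight. Returns (new_w, new_m, new_v)."""
    w = w * (1 - lr * weight_decay)           # decoupled weight decay: shrink the weight slightly
    m = beta1 * m + (1 - beta1) * grad        # Thing 1: running average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # Thing 2: running average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction: both averages start at 0,
    v_hat = v / (1 - beta2 ** t)              # so early values are scaled back up
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Two weights with very different gradient sizes end up taking nearly the same size steps.
moved = {}
for grad in (100.0, 0.01):
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 4):                     # three steps with a constant gradient
        w, m, v = adamw_step(w, grad, m, v, t, weight_decay=0.0)
    moved[grad] = 1.0 - w
    print(f"grad={grad:>6}: weight moved by {moved[grad]:.6f}")
```

With weight decay switched off, both weights move by about 3 × lr after three steps, even though their gradients differ by a factor of 10,000. Plain SGD with the same lr would have moved them by amounts proportional to the gradients.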

In [63]:
## The Big Picture: One Loop Iteration

#Here's the **full cycle** every single step:
#
#┌─────────────────────────────────────────────────────┐
#│                                                     │
#│  1. get_batch()  →  grab 32 random training chunks  │
#│         ↓                                           │
#│  2. m(xb, yb)    →  model predicts, measures loss   │
#│         ↓                                           │
#│  3. zero_grad()  →  wipe the slate clean            │
#│         ↓                                           │
#│  4. loss.backward() → figure out WHO caused the loss│
#│         ↓                                           │
#│  5. optimizer.step() → nudge all weights to improve │
#│                                                     │
#│  Repeat 20000x → model gets smarter each time 🧠    │
#└─────────────────────────────────────────────────────┘




# =============================================================
# THE TRAINING LOOP: This is the HEART of learning.
# Every iteration = one "experience" the model learns from.
# Think of it like a student doing 20,000 practice problems.
# =============================================================

# batch_size = 32 means:
#   "Don't learn from 1 example at a time. Grab 32 examples
#    simultaneously and learn from all of them at once"
# Why 32 and not 1? Or 10,000?
#   - Too small (1): very noisy gradient signal, unstable learning
#   - Too large (10000): very smooth signal, but each step needs far more memory and compute
#   - 32 is a reasonable middle ground for this small model
batch_size = 32

# Run 20,000 training steps (20,000 practice rounds)
for steps in range(20000):

    # ----------------------------------------------------------
    # STEP 1: GET A BATCH OF DATA
    # get_batch() is defined earlier in this notebook. It randomly
    # grabs 'batch_size' chunks from the training text.
    #
    # xb = INPUT  tokens (e.g. "The cat sat on")   shape: [32, block_size]
    # yb = TARGET tokens (e.g. "cat sat on the")   shape: [32, block_size]
    #      yb is xb shifted by 1 position: the "correct answers"
    #      (shown as words for readability; our tokens are single characters)
    #
    # Each row is an independent training example.
    # We get 32 of them at once = one batch.
    # ----------------------------------------------------------
    xb, yb = get_batch('train')  # 'train' = use training data, not validation

    # ----------------------------------------------------------
    # STEP 2: FORWARD PASS. Run the model, compute the loss
    # The model looks at xb, makes predictions (logits),
    # then compares predictions to yb (the right answers).
    # 'loss' is a single number: how WRONG the model is right now.
    # Lower loss = better predictions.
    # ----------------------------------------------------------
    logits, loss = m(xb, yb)

    # ----------------------------------------------------------
    # STEP 3: ZERO OUT OLD GRADIENTS (CRITICAL, easy to forget!)
    #
    # Gradients are the "feedback signals" that tell each weight
    # which direction to move.
    #
    # By DEFAULT, PyTorch ACCUMULATES (adds up) gradients across
    # backward() calls. If you don't clear them, the feedback from
    # step 1 bleeds into step 2, step 3, etc., like trying to hear
    # new music while the last song is still playing loudly.
    #
    # set_to_none=True → instead of setting gradients to 0,
    # it sets them to None (slightly faster & less memory).
    # This has been the default since PyTorch 2.0.
    # ----------------------------------------------------------
    optimizer.zero_grad(set_to_none=True)

    # ----------------------------------------------------------
    # STEP 4: BACKWARD PASS. Compute the gradients
    # 
    # loss.backward() is where the MAGIC happens.
    # PyTorch walks BACKWARDS through every operation in the model
    # and asks: "How much did each weight CONTRIBUTE to this loss?"
    # 
    # This uses the Chain Rule from calculus (backpropagation).
    # The result: every weight now has a .grad value attached to it.
    # 
    # Think of it as: "Who is to blame for this mistake, and HOW MUCH?"
    # ----------------------------------------------------------
    loss.backward()

    # ----------------------------------------------------------
    # STEP 5: UPDATE THE WEIGHTS. The actual "learning" step
    #
    # Now that we know the gradient (direction of blame) for each
    # weight, AdamW uses that info to NUDGE each weight in the
    # direction that reduces loss.
    #
    # This is where AdamW's smart per-weight step sizes kick in.
    #
    # After this line, the model is usually SLIGHTLY better than before.
    # Do this 20,000 times → the model has learned a lot.
    # ----------------------------------------------------------
    optimizer.step()

# After all 20,000 steps, print the loss of the final batch.
# .item() converts the PyTorch tensor to a plain Python number.
# This is the loss on ONE random batch, so it is a noisy estimate.
# If this number went DOWN from where it started → learning worked! 🎉
print(loss.item())




2.5473523139953613
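
The five steps of the loop can also be run end to end on a toy problem. This is a minimal sketch under stated assumptions: the data is a made-up repeating pattern (`abcdabcd...`), the model is a bare `nn.Embedding` table standing in for `BigramLanguageModel`, and `toy_batch`, `batch_loss`, `mean_loss` and the hyperparameters are illustrative rather than the notebook's own. Because a single batch's loss is noisy, the sketch compares the loss averaged over several batches before and after training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data: "abcdabcd..." encoded as 0,1,2,3,0,1,2,3,...
data = torch.tensor([i % 4 for i in range(200)], dtype=torch.long)
vocab_size, block_size, batch_size = 4, 8, 16

def toy_batch():
    """Random chunks of the data (x) and the same chunks shifted by one (y)."""
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

# Bigram table: row = current token, columns = scores for the next token.
table = nn.Embedding(vocab_size, vocab_size)
optimizer = torch.optim.AdamW(table.parameters(), lr=1e-2)

def batch_loss(x, y):
    logits = table(x)                                   # (B, T, C)
    return F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))

@torch.no_grad()
def mean_loss(n_batches=20):
    """Average the loss over several batches to smooth out batch-to-batch noise."""
    return sum(batch_loss(*toy_batch()).item() for _ in range(n_batches)) / n_batches

loss_before = mean_loss()
for step in range(500):
    xb, yb = toy_batch()                    # 1. get a batch
    loss = batch_loss(xb, yb)               # 2. forward pass
    optimizer.zero_grad(set_to_none=True)   # 3. clear old gradients
    loss.backward()                         # 4. backward pass
    optimizer.step()                        # 5. update the weights
loss_after = mean_loss()

print(f"mean loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In this toy pattern the next token is fully determined by the current one, so a bigram table can drive the loss close to zero. Shakespeare is far less predictable from one character, which is consistent with the real run levelling off around 2.5.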


In [67]:
### Now that the loss is down to about 2.547, let's check the output
print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=1000)[0].tolist()))


ANoamery,

Poove. f howendofld:
BRCar wr uny fapou th len sthed chithuterist gin me.
AReidy imut bearg, hendinsouto I ty,
TIZAnghe ICHengouprearsonosomithizewile
YCak inecor qurofous;
Thole wo nthis myoaity
ICHothad wror, he DUK:

h s t;

Se hen'To begh gang weepin pr heslooul w iounguare nche he bln gin, itl, str my y gue. cheinot y on hand

u w t wiene, mollathevevie, bare, eat ule ue:
INI t ful yountoteeagou blit tong chadef thisorecth?

Aspeas agn st n ICOfttold hag-sttersind olldaitowee for bura-y'd d bous thiorlues;
KI det, bus


QUKI k ok;
RCHe,
I athy ss at byo, preror.
Rid beres y
V:
MAnd,
BONG sas ondessely, I thrichat ouprr IOLOLOLABun.
Age:
S:
LIUCUS:

OFFORETRARI pesu hond;
ABy qungas t s
Cay ghe:

Y:

Cro thoundeir t wit ter t, band ty, wewat befat tirgarsur tosee t, aththithayo thy mave a tullelir il ur sothardowhe lot rr lly 's o.
LAmp:
THAs cate,
PELEN siter myoot me wit CERY:
T: aveesher t alld cos nd fllled cak gry s dicowere
Foth rppasos, fof frtou ayomane whethat'
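
The sample has plausible letter pairs but no longer-range structure, which fits a model whose prediction depends only on the most recent token. This is a minimal sketch of that property, assuming a bare `nn.Embedding` table as a stand-in for `BigramLanguageModel` and the toy 4-token vocab from the comments: two sequences with different histories but the same last token get identical next-token scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 4
table = nn.Embedding(vocab_size, vocab_size)  # stand-in for the bigram lookup table

# Two different histories that end in the same token (2 = 'c').
seq_a = torch.tensor([[0, 1, 2]])   # a, b, c
seq_b = torch.tensor([[3, 3, 2]])   # d, d, c

last_a = table(seq_a)[:, -1, :]     # (B, T, C) -> (B, C): scores for the next token
last_b = table(seq_b)[:, -1, :]

# The history before 'c' has no effect on the prediction.
same = torch.equal(last_a, last_b)
print(same)
```

A model with self-attention would generally give different scores for these two sequences, because its last row is computed from every earlier position.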

## 3. Self-Attention

Core mechanism that allows tokens to communicate with each other.
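
As a preview of what this section will build, here is a minimal sketch of a single causal self-attention head using the standard scaled dot-product formulation. The sizes and variable names are illustrative assumptions, and the implementation eventually written in this notebook may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 5, 16, 8         # illustrative sizes
x = torch.randn(B, T, C)                 # stand-in for token embeddings

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                   # each (B, T, head_size)
scores = q @ k.transpose(-2, -1) / head_size ** 0.5    # (B, T, T) scaled attention scores

mask = torch.tril(torch.ones(T, T))                    # lower triangle: position t sees 0..t
scores = scores.masked_fill(mask == 0, float("-inf"))  # hide future positions
weights = F.softmax(scores, dim=-1)                    # each row sums to 1
out = weights @ v                                      # (B, T, head_size) weighted aggregation

print(out.shape)
```

The mask is what makes the head causal: after the softmax, every position has zero weight on tokens that come after it.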

In [None]:
# Implement self-attention mechanism
# - Queries, Keys, Values
# - Attention scores and masking
# - Weighted aggregation


## 4. Multi-Head Attention

Multiple attention heads running in parallel to attend to different representation subspaces.

In [None]:
# Implement multi-head attention
# - Multiple heads
# - Concatenation and projection


## 5. Transformer Blocks

Complete transformer block with:
- Multi-head attention
- Feed-forward network
- Layer normalization
- Residual connections

In [None]:
# Implement transformer block
# - Attention sublayer
# - FFN sublayer
# - LayerNorm and residuals


## 6. Full GPT Model

Stack multiple transformer blocks and add token + position embeddings.

In [None]:
# Implement full GPT model
# - Token embeddings
# - Position embeddings
# - Stack of transformer blocks
# - Final layer norm and linear head


## 7. Training

In [None]:
# Training loop
# - Batch generation
# - Forward pass and loss calculation
# - Backward pass and optimization
# - Logging and evaluation


## 8. Generation

Sample from the trained model to generate new text.
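
Temperature control can be sketched ahead of time with the toy logits used earlier. This is a minimal sketch, assuming the common approach of dividing the logits by a temperature before softmax; `sample_next` is a hypothetical helper, not part of the notebook's model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([[0.4, 0.7, 0.2, 0.6]])   # toy scores for a, b, c, d

def sample_next(logits, temperature=1.0):
    """Sample one token id per row after dividing the logits by the temperature."""
    probs = F.softmax(logits / temperature, dim=-1)
    idx_next = torch.multinomial(probs, num_samples=1)
    return idx_next, probs

for temperature in (0.1, 1.0, 10.0):
    idx_next, probs = sample_next(logits, temperature)
    print(temperature, [round(p, 2) for p in probs[0].tolist()])
```

A low temperature sharpens the distribution toward the top-scoring token (closer to argmax), while a high temperature flattens it toward a uniform choice.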

In [None]:
# Text generation
# - Autoregressive sampling
# - Temperature control
# - Generate and decode samples
