# A Micro (243 LoC) but Complete Generative Pretrained Transformer Model (GPT) for Generating First Names - Data Handling, Training, and Inference


## Copyright @Andrej Karpathy

This is the code from Andrej Karpathy's [GitHub gist](https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95) implementing microGPT from scratch in pure Python.

## References
- https://karpathy.github.io/2026/02/12/microgpt/
- https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95
- https://www.youtube.com/watch?v=VMj-3S1tku0
- https://htmlpreview.github.io/?https://github.com/tanpuekai/microGPT_webEdu/blob/main/index.html
- https://genmind.ch/posts/N-gram-Language-Models-The-Classic-Building-Blocks-of-Predictive-Text/
- https://lena-voita.github.io/nlp_course/language_modeling.html
- Attention Is All You Need (paper): https://arxiv.org/pdf/1706.03762
- Transformer architecture (Google Research blog): https://research.google/blog/transformer-a-novel-neural-network-architecture-for-language-understanding/
- Attention Is All You Need (YouTube): https://www.youtube.com/watch?v=rBCqOTEfxvg
- https://jalammar.github.io/illustrated-transformer/
- https://jalammar.github.io/illustrated-gpt2/
- https://sebastianraschka.com/books/ml-q-and-ai-chapters/ch17/
- https://microgpt-academy.vercel.app/



# Internet of Text - Common Crawl, Wikipedia, Facebook, StackExchange, Databases, Documents, Project Gutenberg, E-Books etc.



![Internet of Text](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/InternetofText.jpg?raw=true)


# Text is Sequential. Meaning is determined by position and order of the words in a sentence.

"I went to Delhi."   <- Grammatically Correct

"Delhi went to I"    <- Incorrect

## A Paragraph of Text from Wikipedia

![A Para of Text From Wikipedia](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/Aparagraphoftext.png?raw=true)

# A Language Model

A language model calculates the probability of a sentence or sequence of words occurring in a natural language. It does so by assigning a <b>probability distribution</b> over the words that can come next.

The goal of building language models is to determine the most likely next word or sequence of words given the context (as in autocomplete).

Language models learn from text written by humans (text corpora) by capturing patterns that represent grammatical structure, context, and semantic meaning.
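
To make "assigning a probability distribution to the next word" concrete, here is a minimal sketch of a count-based bigram model. The toy corpus and the helper name `next_word_distribution` are made up for illustration; microGPT itself learns these probabilities with a neural network rather than by counting.

```python
from collections import Counter, defaultdict

# Toy corpus (made up for illustration, not the dataset used later)
corpus = [
    "give me a cup of coffee",
    "give me a cup of tea",
    "give me a cup of coffee please",
    "a glass of water",
]

# Count how often each word follows each preceding word
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_distribution(prev):
    """Estimate P(next word | previous word) from the counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

dist = next_word_distribution("of")
print(dist)  # {'coffee': 0.5, 'tea': 0.25, 'water': 0.25}
```

A bigram model only looks one word back. A Transformer conditions on the whole preceding context, but the output is the same kind of object: a distribution over possible next tokens.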



## Fill in the blanks:

Give me a cup of ____?


- Coffee 32%
- Tea 23%
- Water 17%
- Milk 10.5%
- Rasam 8.2%
- Ganji 5.2%
- Beer 4%
- Earth 0.05%
- Soap 0.03%
- Keyboard 0.01%
- Book 0.01%  
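
Given such a distribution, a model can either pick the most likely word (greedy) or sample from the distribution. The sketch below uses the percentages from the list above; the local random generator and its seed are arbitrary choices for illustration.

```python
import random

# Probabilities copied from the list above (in percent)
candidates = {
    "Coffee": 32, "Tea": 23, "Water": 17, "Milk": 10.5, "Rasam": 8.2,
    "Ganji": 5.2, "Beer": 4, "Earth": 0.05, "Soap": 0.03,
    "Keyboard": 0.01, "Book": 0.01,
}

total = sum(candidates.values())               # should be (approximately) 100
most_likely = max(candidates, key=candidates.get)

rng = random.Random(0)  # local generator so the notebook's global seed is untouched
words = list(candidates)
weights = list(candidates.values())
samples = rng.choices(words, weights=weights, k=5)  # weights need not sum to 1

print(f"total = {total:.2f}%")
print("greedy pick:", most_likely)
print("five sampled completions:", samples)
```

Sampling is what lets the inference cell at the end of this notebook produce a different name on every run instead of repeating the single most likely one.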

# How is the probability distribution computed?


![Probabilities](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/probability.png?raw=true)
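
As a reminder of the standard formulation (the figure above may use different notation), a language model factorizes the probability of a sequence of words $w_1, \dots, w_n$ using the chain rule of probability:

$$
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
$$

Each factor is a next-word distribution conditioned on everything that came before, which is exactly what the model is trained to output.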






![LanguageModel](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/languagemodel.png?raw=true)




![Google Autocomplete](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/GoogleAutocomplete.png?raw=true)

## This is a self-supervised learning task.
----------------------------------------------------------------

# Large Language Models
### Large Language Models are deep neural networks (Transformer architecture) trained on enormously large corpora of text (much of the public internet). They learn to predict the next word in a sequence of natural-language words really well. They can be fine-tuned for multiple downstream tasks.


- ChatGPT (GPT-3.5) is commonly reported to have been trained on text drawn from roughly 45 TB of raw data. This massive dataset, sourced from web crawls (like Common Crawl), books, Wikipedia, and other internet articles, is said to contain over 300 billion words (approx. 500 billion tokens). GPT-4 is reported to have used over 1 petabyte of data, although OpenAI has not published official figures.



# Transformer - The original Encoder-Decoder Architecture

![Transformer_Original](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/Transformer_attentionisallyouneed.png?raw=true)




![TransformerOriginal_Sebastian](https://sebastianraschka.com/images/books/ml-q-and-ai/ch17-fig01.png)

The original Transformer Architecture had Encoder blocks as well as Decoder blocks.


We can also build encoder-only Transformers or decoder-only Transformers.


![DecoderOnly](https://towardsdatascience.com/wp-content/uploads/2024/05/1Qww2aaIdqrWVeNmo3AS0ZQ-2048x1314.png)



# Generative Pretrained Transformer - Decoder Only Architecture

![DecoderOnly_GPT2](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/decoderonly_gpt2.png?raw=true)


# Steps in building a Large Language Model using Transformer Deep Neural Network

## 1. Data Preparation & Tokenization

- Corpus Collection: Gathering trillions of tokens from the web (Common Crawl), books, code (GitHub), and academic papers.

- Data Cleaning: Massive filtering to remove "noise" (HTML tags, spam), toxic content, and duplicate text.

- Tokenization: Breaking text into "tokens" (sub-word units). For example, the word "friendship" might be split into friend and ship.

- Embeddings & Positional Encoding: Each token is converted into a high-dimensional vector. Since Transformers process all tokens in a sentence simultaneously (parallelism), Positional Encodings are added to these vectors so the model knows the order of the words.

<b>Simple One-Hot Representation of Words as Vectors</b>
![WordstoVectors](https://c8j9w8r3.rocketcdn.me/wp-content/uploads/2018/01/one-hot-word-embedding-vectors-768x276.png)


<b>Positional Embeddings</b>

![PositionalEmbeddings](https://aiml.com/wp-content/uploads/2023/09/example_of_positional_encoding_in_transformers.png)
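
The sketch below computes the sinusoidal positional encoding defined in "Attention Is All You Need" and adds it to a token vector. The function name, the dimension `d_model = 8`, and the toy token embedding are assumptions for illustration. Note that microGPT does not use this fixed formula: it learns its position embeddings (the `wpe` table), as you will see later in the notebook.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding of one position (sin on even indices, cos on odd)."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))  # even index i
        vec.append(math.cos(angle))  # odd index i + 1
    return vec[:d_model]

d_model = 8
toy_token_embedding = [0.1] * d_model       # hypothetical embedding of one token
pos_vec = sinusoidal_position(3, d_model)   # encoding of position 3

# The model's input is the element-wise sum of token and position vectors
x = [t + p for t, p in zip(toy_token_embedding, pos_vec)]

print([round(v, 3) for v in sinusoidal_position(0, d_model)])
print([round(v, 3) for v in x])
```

Because every position gets a different vector, the same token at two different positions produces two different inputs, which is how the model can tell word order apart.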



## 2. Self-Supervised Pre-training : Learning to Predict the Next Token

- This is the most expensive and time-consuming stage of training language models. It uses self-supervised learning: no human labels are needed because the text itself provides the "answers."

- Objective of this training is Next-Token Prediction. The model is given a sequence and must guess the next word.

- The model uses Self-Attention to determine which words in a sentence are relevant to others (e.g., in "The cat sat on the mat because it was tired," attention helps the model link "it" to "cat").

- This produces a Base Model (like Llama 3) that knows facts and grammar but doesn't yet know how to follow instructions.
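
To see why no human labels are needed, the sketch below turns a single sentence into next-token training examples. The helper name `next_token_pairs` is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
def next_token_pairs(tokens):
    """Turn one sequence into (context, target) training examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

sentence = "the cat sat on the mat".split()
pairs = next_token_pairs(sentence)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

Every position in the text yields one example, so the training signal comes from the text itself.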

## 3. Supervised Fine-Tuning (SFT)
- SFT teaches the base model how to respond to prompts.


## 4. Alignment & RLHF (The "Safety" Phase)
- Even after SFT, a model might be rude, biased, or confidently wrong. Alignment aims to make the model's behavior match human expectations. This is typically done with preference labeling, a reward model, and reinforcement learning (e.g., PPO), or with direct preference optimization (DPO).


# Walkthrough of Andrej Karpathy's Code for microGPT

https://karpathy.github.io/2026/02/12/microgpt/


In this code, we are implementing the self-supervised pretraining of a very small GPT-2 architecture. The steps we follow include:

1. Ingest the data. The dataset we use is already clean.
2. Write a tokenizer for our data and tokenize the data.
3. Write an automatic differentiation (autograd) class that stores the local gradient at each node of the computation graph. Its backward() method propagates the gradients backwards through the neural network.
4. Initialize the hyperparameters and parameters of our neural network.
5. Define the layers and their placement in our neural network (architecture design).
6. Write our Adam optimizer, which updates the weights of the network after each gradient calculation.
7. Train our model on our dataset.
8. Use the model for inference: generate new first names.


## 1. Import required Python libraries

In [1]:
import os       # os.path.exists
import math     # math.log, math.exp
import random   # random.seed, random.choices, random.gauss, random.shuffle
random.seed(42) # Let there be order among chaos

## 2. Create a text dataset to train our language model on

The data is coming from names.txt file shared by Andrej Karpathy:

https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt

- The file contains 32033 first names.

- The code below downloads names.txt from the URL above into a local file called input.txt (if it does not already exist), then reads it line by line, strips the whitespace around each line, and skips empty lines.

- It then randomly shuffles the list of names.

Here is what the dataset looks like:

![DatasetContents](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/dataset.png?raw=true)


In [2]:
# Let there be a Dataset `docs`: list[str] of documents (e.g. a list of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/988aa59/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
print(f"num docs: {len(docs)}")

num docs: 32033


## 3. Create a tokenizer to convert strings (representing names) to sequences of integers (tokens) and back

- Unique characters in the dataset become token ids 0 to n-1 (the names contain the 26 lowercase English letters, so n = 26)
- BOS is a special token representing Beginning of Sequence; it gets token id n
- The vocabulary consists of the unique characters in the dataset plus BOS
- Vocabulary size is the number of unique characters + 1 (for BOS), i.e. 27

In [3]:
# Let there be a Tokenizer to translate strings to sequences of integers ("tokens") and back
uchars = sorted(set(''.join(docs))) # unique characters in the dataset become token ids 0..n-1
BOS = len(uchars) # token id for a special Beginning of Sequence (BOS) token
vocab_size = len(uchars) + 1 # total number of unique tokens, +1 is for BOS
print(f"vocab size: {vocab_size}")

vocab size: 27
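
The cell above builds the vocabulary but does not show a round trip. Here is a minimal sketch of encoding and decoding, using the same scheme on a tiny stand-in dataset. The names `sample_docs`, `sample_uchars`, `SAMPLE_BOS`, `encode`, and `decode` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
# Tiny stand-in dataset (hypothetical); the notebook uses the names in input.txt
sample_docs = ["emma", "olivia", "ava"]

sample_uchars = sorted(set("".join(sample_docs)))  # unique characters -> ids 0..n-1
SAMPLE_BOS = len(sample_uchars)                    # one extra id for the BOS token

def encode(name):
    """Wrap a name in BOS tokens and map each character to its id."""
    return [SAMPLE_BOS] + [sample_uchars.index(ch) for ch in name] + [SAMPLE_BOS]

def decode(token_ids):
    """Map ids back to characters, dropping BOS tokens."""
    return "".join(sample_uchars[t] for t in token_ids if t != SAMPLE_BOS)

ids = encode("emma")
print(sample_uchars)  # ['a', 'e', 'i', 'l', 'm', 'o', 'v']
print(ids)            # [7, 1, 4, 4, 0, 7]
print(decode(ids))    # emma
```

The training loop later in the notebook performs the same encoding inline, with BOS on both sides of each name, and the inference loop performs the decoding.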


## 4. Build a class that computes the gradient of the loss function (automatic differentiation) and propagates it backwards through the neural network (computational graph) so the weights can be updated

# For a function with one output and many inputs, we calculate the Gradient:

![Gradient](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/Gradient.png?raw=true)

# For a function with multiple outputs and multiple inputs, we calculate the Jacobian:


![Jacobian](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/Jacobian.png?raw=true)


# Chain Rule of Calculus helps us compute Gradients at each Node in the NN:

![ChainRule](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/ChainRule.png?raw=true)

The gradient of the loss function is propagated backwards from the output node of the neural network towards the first layer. An optimizer such as Adam, RMSProp, or SGD then uses these gradients to update the weights.

![BackPropagation](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/Backpropagation.png?raw=true)
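
As a small worked example of the chain rule, the sketch below differentiates the composite function f(x) = (exp(x))^2 by multiplying local derivatives, then checks the result against a numerical finite-difference estimate. The function and the point x = 0.5 are arbitrary choices for illustration.

```python
import math

def f(x):
    """Composite function: u = exp(x), then f = u ** 2."""
    return math.exp(x) ** 2

x = 0.5
u = math.exp(x)

df_du = 2 * u              # local derivative of u**2 with respect to u
du_dx = math.exp(x)        # local derivative of exp(x) with respect to x
analytic = df_du * du_dx   # chain rule: df/dx = df/du * du/dx

h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)  # central finite difference

print(f"analytic: {analytic:.6f}")
print(f"numeric:  {numeric:.6f}")
```

Backpropagation applies this same multiplication of local derivatives at every node of the computation graph, starting from the loss.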

# MicroGPT Architecture

![MicroGPT Architecture](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/ModelArchitecture_microGPT.png?raw=true)



# Value Class
![ValueClass](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/ValueClass.png?raw=true)

In [4]:
# Let there be Autograd to recursively apply the chain rule through a computation graph
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads') # Python optimization for memory usage

    def __init__(self, data, children=(), local_grads=()):
        self.data = data                # scalar value of this node calculated during forward pass
        self.grad = 0                   # derivative of the loss w.r.t. this node, calculated in backward pass
        self._children = children       # children of this node in the computation graph
        self._local_grads = local_grads # local derivative of this node w.r.t. its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))
    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))
    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))
    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))
    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
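
To see the backward pass in action, here is a sketch that uses a trimmed-down copy of the class above. `MiniValue` is a hypothetical name; it supports only `+` and `*`, but stores children and local gradients in the same way and uses the same `backward()` logic.

```python
class MiniValue:
    """Trimmed-down copy of the Value class above (add and mul only)."""

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, MiniValue) else MiniValue(other)
        return MiniValue(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, MiniValue) else MiniValue(other)
        return MiniValue(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a = MiniValue(2.0)
b = MiniValue(3.0)
c = a * b + a   # c = 8.0
c.backward()

print(c.data)   # 8.0
print(a.grad)   # dc/da = b + 1 = 4.0
print(b.grad)   # dc/db = a = 2.0
```

Note that `a` is used twice in the expression. Its gradient is the sum of the contributions from both paths, which is why `backward()` accumulates with `+=` rather than assigning.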

## 5. Initialize the parameters of microGPT

In [5]:
# Initialize the parameters, to store the knowledge of the model
n_layer = 1     # depth of the transformer neural network (number of layers)
n_embd = 16     # width of the network (embedding dimension)
block_size = 16 # maximum context length of the attention window (note: the longest name is 15 characters)
n_head = 4      # number of attention heads
head_dim = n_embd // n_head # derived dimension of each head
matrix = lambda nout, nin, std=0.08: [[Value(random.gauss(0, std)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd), 'wpe': matrix(block_size, n_embd), 'lm_head': matrix(vocab_size, n_embd)}
for i in range(n_layer):
    state_dict[f'layer{i}.attn_wq'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wk'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wv'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.attn_wo'] = matrix(n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc1'] = matrix(4 * n_embd, n_embd)
    state_dict[f'layer{i}.mlp_fc2'] = matrix(n_embd, 4 * n_embd)
params = [p for mat in state_dict.values() for row in mat for p in row] # flatten params into a single list[Value]
print(f"num params: {len(params)}")


num params: 4192
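
The total of 4192 can be checked by hand from the matrix shapes in the cell above. The sketch below repeats the hyperparameters as plain numbers and sums rows times columns for each matrix; the `shapes` dictionary is only a bookkeeping device for this check.

```python
vocab_size, n_embd, block_size, n_layer = 27, 16, 16, 1

# (rows, columns) of every weight matrix, mirroring state_dict above
shapes = {
    "wte": (vocab_size, n_embd),      # token embeddings
    "wpe": (block_size, n_embd),      # position embeddings
    "lm_head": (vocab_size, n_embd),  # output projection to logits
}
for i in range(n_layer):
    shapes[f"layer{i}.attn_wq"] = (n_embd, n_embd)
    shapes[f"layer{i}.attn_wk"] = (n_embd, n_embd)
    shapes[f"layer{i}.attn_wv"] = (n_embd, n_embd)
    shapes[f"layer{i}.attn_wo"] = (n_embd, n_embd)
    shapes[f"layer{i}.mlp_fc1"] = (4 * n_embd, n_embd)
    shapes[f"layer{i}.mlp_fc2"] = (n_embd, 4 * n_embd)

for name, (rows, cols) in shapes.items():
    print(f"{name:16s} {rows:3d} x {cols:3d} = {rows * cols}")

total = sum(rows * cols for rows, cols in shapes.values())
print("total:", total)  # total: 4192
```

The embeddings and output head contribute 432 + 256 + 432 = 1120 parameters, the four attention matrices 1024, and the two MLP matrices 2048.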


## 6. Define the model architecture (a model is a function mapping tokens and parameters to logits over what comes next in a sequence)

This model follows the architecture of GPT-2 (Generative Pretrained Transformer 2), with minor differences:
- LayerNorm is replaced by RMSNorm,
- there are no biases,
- GELU is replaced by ReLU
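
To illustrate the first difference, the sketch below compares the two normalizations on plain floats. `rmsnorm_floats` mirrors the `rmsnorm` function used in this notebook; `layernorm_floats` is a simplified LayerNorm written for this comparison, without the learned gain and bias that GPT-2 also applies.

```python
def rmsnorm_floats(x, eps=1e-5):
    """Scale x so that its mean square is about 1 (no mean subtraction)."""
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + eps) ** -0.5
    return [xi * scale for xi in x]

def layernorm_floats(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / (var + eps) ** 0.5 for xi in x]

x = [1.0, 2.0, 3.0, 4.0]
rms_out = rmsnorm_floats(x)
ln_out = layernorm_floats(x)

rms_mean = sum(rms_out) / len(rms_out)
rms_mean_square = sum(v * v for v in rms_out) / len(rms_out)
ln_mean = sum(ln_out) / len(ln_out)

print("rmsnorm  :", [round(v, 3) for v in rms_out])
print("layernorm:", [round(v, 3) for v in ln_out])
```

RMSNorm only rescales the vector, so its output keeps a non-zero mean, while LayerNorm also centers it. Dropping the mean subtraction makes the operation slightly cheaper.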

# Multiheaded Attention Loop in MicroGPT

![MultiHeadedAttention](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/MultiHeadedAttentionLoop.png?raw=true)

In [6]:
#Linear Layer
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]


#Softmax Layer
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]


# RMSNorm Layer
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]


# Complete Transformer Architecture
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id] # token embedding
    pos_emb = state_dict['wpe'][pos_id] # position embedding
    x = [t + p for t, p in zip(tok_emb, pos_emb)] # joint token and position embedding
    x = rmsnorm(x) # note: not redundant due to backward pass via the residual connection

    for li in range(n_layer):
        # 1) Multi-head Attention block
        x_residual = x
        x = rmsnorm(x)
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        # Cache this position's key and value so later positions can attend to them
        keys[li].append(k)
        values[li].append(v)
        x_attn = []
        for h in range(n_head):
            hs = h * head_dim
            q_h = q[hs:hs+head_dim]
            k_h = [ki[hs:hs+head_dim] for ki in keys[li]]
            v_h = [vi[hs:hs+head_dim] for vi in values[li]]
            #Calculating attention scores
            attn_logits = [sum(q_h[j] * k_h[t][j] for j in range(head_dim)) / head_dim**0.5 for t in range(len(k_h))]
            attn_weights = softmax(attn_logits)
            head_out = [sum(attn_weights[t] * v_h[t][j] for t in range(len(v_h))) for j in range(head_dim)]
            x_attn.extend(head_out)
        # Project the concatenated head outputs with the output matrix attn_wo
        x = linear(x_attn, state_dict[f'layer{li}.attn_wo'])
        x = [a + b for a, b in zip(x, x_residual)]


        # 2) MLP block
        x_residual = x
        x = rmsnorm(x)
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])
        x = [a + b for a, b in zip(x, x_residual)]

    logits = linear(x, state_dict['lm_head'])
    return logits
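
The attention loop inside `gpt` can be hard to follow with `Value` objects everywhere. Here is a sketch of the same computation for a single head on plain floats. The function names and the two-dimensional keys, values, and query are made up for illustration.

```python
import math

def softmax_floats(scores):
    """Numerically stable softmax on plain floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query over cached keys/values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax_floats(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return weights, out

# Made-up keys and values for three cached positions
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, out = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in out])
```

In `gpt`, the key and value lists only contain positions processed so far, because each call appends the current position before attending. The model therefore never sees future tokens, and no explicit causal mask is needed.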

## 7. Training Loop

- Define the Adam optimizer for updating the model parameters at each step of gradient calculation
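
Before reading the full training cell, it may help to watch the Adam update work on a single number. The sketch below minimizes the toy loss (x - target)^2 using the same hyperparameters and linear learning rate decay as the training cell. The toy loss, the starting point, and the target are assumptions for illustration.

```python
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8  # same values as the training cell
num_steps = 1000

x = 0.0          # the single parameter being optimized (toy example)
target = 1.0     # minimum of the toy loss (x - target) ** 2
m, v = 0.0, 0.0  # first and second moment estimates

for step in range(num_steps):
    grad = 2 * (x - target)                        # derivative of the toy loss
    lr_t = learning_rate * (1 - step / num_steps)  # linear learning rate decay
    m = beta1 * m + (1 - beta1) * grad             # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2        # running mean of squared gradients
    m_hat = m / (1 - beta1 ** (step + 1))          # bias correction for early steps
    v_hat = v / (1 - beta2 ** (step + 1))
    x -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)

print(f"x after {num_steps} steps: {x:.4f}")
```

The training cell does exactly this for each of the 4192 parameters, with the gradient supplied by `loss.backward()` instead of a hand-written derivative.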


![TrainingLoop](https://github.com/NimritaKoul/microGPTStudyGroup/blob/main/TrainingLoop.png?raw=true)

In [7]:
learning_rate, beta1, beta2, eps_adam = 0.01, 0.85, 0.99, 1e-8
m = [0.0] * len(params) # first moment buffer
v = [0.0] * len(params) # second moment buffer

# Repeat in sequence
num_steps = 1000 # number of training steps
for step in range(num_steps):

    # Take single document, tokenize it, surround it with BOS special token on both sides
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward the token sequence through the model, building up the computation graph all the way to the loss
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses) # final average loss over the document sequence. May yours be low.

    # Backward the loss, calculating the gradients with respect to all model parameters
    loss.backward()

    # Adam optimizer update: update the model parameters based on the corresponding gradients
    lr_t = learning_rate * (1 - step / num_steps) # linear learning rate decay
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

    print(f"step {step+1:4d} / {num_steps:4d} | loss {loss.data:.4f}", end='\r')
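
A useful reference point when reading the printed loss is the loss of a model that knows nothing and assigns equal probability to all 27 tokens. The sketch below computes it with the same negative log-probability formula as the training loop.

```python
import math

vocab_size = 27  # 26 letters + BOS, as printed by the tokenizer cell

uniform_prob = 1 / vocab_size
baseline_loss = -math.log(uniform_prob)  # same formula as loss_t, on a plain float

print(f"loss of a uniform guess: {baseline_loss:.4f}")  # 3.2958
```

A training loss clearly below this value means the model has learned something about which characters tend to follow which.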



## 8. Use the model for inference - it will generate new names similar to the names in training data.

In [8]:
# Inference: may the model babble back to us
temperature = 0.5 # in (0, 1], control the "creativity" of generated text, low to high
print("\n--- inference (new, hallucinated names) ---")
for sample_idx in range(20):
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    token_id = BOS
    sample = []
    for pos_id in range(block_size):
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")


--- inference (new, hallucinated names) ---
sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan
sample  6: karia
sample  7: yeran
sample  8: anna
sample  9: areli
sample 10: kaina
sample 11: konna
sample 12: keylen
sample 13: liole
sample 14: alerin
sample 15: earan
sample 16: lenne
sample 17: kana
sample 18: lara
sample 19: alela
sample 20: anton
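
The effect of `temperature` is easiest to see on a small example. The sketch below applies the same "divide the logits by the temperature, then softmax" step to three made-up logits; the function name is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

A lower temperature sharpens the distribution so the most likely token is chosen more often, while a higher temperature flattens it and makes the samples more varied.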


# Thank You
# In further sessions, we will dive deep into each of the above code blocks of microGPT architecture.
