<a href="https://colab.research.google.com/github/JayThibs/gpt-experiments/blob/main/notebooks/minGPT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End minGPT

This notebook implements minGPT to get a quick feel for the end-to-end training of a GPT model.

## Imports

In [3]:
import math
import torch
from torch import nn
import torch.nn.functional as F

## Model configuration

We now create a configuration class that holds all of the model's hyperparameters. Because this is an implementation of minGPT, the model is not the same as GPT-2 or GPT-3. To approach those models, however, we can simply add more layers and increase the maximum sequence length and the embedding dimension.

Those bigger models have some additional tricks for training, but the general idea is just a bigger model and more data.
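
To make "a bigger model" concrete, the sketch below estimates how the parameter count grows with the number of blocks and the embedding dimension. It assumes a standard GPT-style block (four attention projections plus a feed-forward layer with 4x expansion) and ignores biases and layer norms. The helper name `approx_params` and the vocabulary and context sizes are hypothetical, so treat the numbers as rough.

```python
def approx_params(num_blocks, embed_dim, vocab_size, max_len):
    """Rough parameter count for a GPT-style model (weights only, no biases or LayerNorm)."""
    attn = 4 * embed_dim * embed_dim                   # Q, K, V and output projections
    ff = 8 * embed_dim * embed_dim                     # two linear layers, assumed 4x expansion
    blocks = num_blocks * (attn + ff)
    embeddings = (vocab_size + max_len) * embed_dim    # token + position embeddings
    head = embed_dim * vocab_size                      # untied output projection
    return blocks + embeddings + head


# 12 blocks at width 768, with an illustrative vocabulary and context size (assumed values)
small = approx_params(num_blocks=12, embed_dim=768, vocab_size=50_000, max_len=512)
# Doubling depth and width, everything else equal
large = approx_params(num_blocks=24, embed_dim=1536, vocab_size=50_000, max_len=512)
print(f"{small:,} vs {large:,}")
```

The block term scales linearly with depth and quadratically with width, which is why widening the model is the more expensive of the two changes.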

In [4]:
class GPTConfig:
    attn_dropout = 0.1
    embed_dropout = 0.1
    ff_dropout = 0.1

    def __init__(self, vocab_size, max_len, **kwargs):
        self.vocab_size = vocab_size
        self.max_len = max_len
        for key, value in kwargs.items():
            setattr(self, key, value)

class GPT1Config(GPTConfig):
    num_heads = 12
    num_blocks = 12
    embed_dim = 768

In [5]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim
        self.max_len = config.max_len
        self.tok_embed = nn.Embedding(
            config.vocab_size, embed_dim
        )
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.max_len, embed_dim)
        )
        self.dropout = nn.Dropout(config.embed_dropout)
        self.blocks = nn.Sequential(
            *[Block(config) for _ in range(config.num_blocks)]
        )
        self.ln = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, config.vocab_size)

    def forward(self, x, target=None):
        # x.shape == (batch_size, seq_len); each entry is a token index
        seq_len = x.size(1)
        assert seq_len <= self.max_len, "sequence longer than model capacity"

        tok_embedding = self.tok_embed(x)
        # tok_embedding.shape == (batch_size, seq_len, embed_dim)
        pos_embedding = self.pos_embed[:, :seq_len, :] # cuts pos_embed shorter based on seq_len passed
        # pos_embedding.shape == (1, seq_len, embed_dim)
        x = self.dropout(tok_embedding + pos_embedding)
        x = self.blocks(x)
        x = self.ln(x)
        x = self.fc(x) # logits
        # x.shape == (batch_size, seq_len, vocab_size)
        return x
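
The shape comments in `forward` can be traced with toy tensors. The sketch below also shows one common way the otherwise unused `target` argument could feed a next-token cross-entropy loss. The toy sizes and the shift-by-one targets are assumptions for illustration, and the transformer blocks are left out entirely.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, max_len, embed_dim, vocab_size = 2, 5, 8, 16, 11  # toy sizes (assumed)

tok_embed = nn.Embedding(vocab_size, embed_dim)
pos_embed = nn.Parameter(torch.zeros(1, max_len, embed_dim))
fc = nn.Linear(embed_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
x, target = tokens[:, :-1], tokens[:, 1:]  # target is x shifted left by one position

# (batch_size, seq_len, embed_dim) + (1, seq_len, embed_dim) broadcasts over the batch
h = tok_embed(x) + pos_embed[:, :seq_len, :]
logits = fc(h)  # transformer blocks omitted in this sketch

# cross_entropy expects (N, vocab_size) logits and (N,) class indices
loss = F.cross_entropy(logits.reshape(-1, vocab_size), target.reshape(-1))
print(h.shape, logits.shape, loss.item())
```

Flattening the batch and sequence dimensions treats every position as an independent classification over the vocabulary.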

In [None]:
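class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Assumed layout: sublayers inferred from forward(); 4x GELU feed-forward as in GPT.
        self.ln1 = nn.LayerNorm(config.embed_dim)
        self.ln2 = nn.LayerNorm(config.embed_dim)
        self.attn = MultiHeadAttention(config)
        self.ff = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim), nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim), nn.Dropout(config.ff_dropout),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x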
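
The residual pattern in `Block.forward` (normalize, apply a sublayer, add the result back to the input) can be tried in isolation. In this sketch, `ResidualSketch` is a hypothetical wrapper and the zero-initialized linear layer is only a stand-in for attention or the feed-forward network.

```python
import torch
from torch import nn


class ResidualSketch(nn.Module):
    """Pre-norm residual wrapper; `sublayer` stands in for attention or feed-forward."""

    def __init__(self, embed_dim, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(embed_dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.ln(x))


embed_dim = 16
zero_layer = nn.Linear(embed_dim, embed_dim)
nn.init.zeros_(zero_layer.weight)
nn.init.zeros_(zero_layer.bias)

block = ResidualSketch(embed_dim, zero_layer)
x = torch.randn(2, 5, embed_dim)
out = block(x)
# With a sublayer that outputs zeros, the residual path passes x through unchanged
print(out.shape, torch.equal(out, x))
```

Because the input always reaches the output through the skip connection, each sublayer only has to learn a correction to `x`, and the shape stays `(batch_size, seq_len, embed_dim)` throughout.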

In [None]:
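class MultiHeadAttention(nn.Module):  # Assumed design: fused QKV projection with a causal mask
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads  # embed_dim must be divisible by num_heads
        self.qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim)
        self.proj = nn.Linear(config.embed_dim, config.embed_dim)
        self.dropout = nn.Dropout(config.attn_dropout)

    def forward(self, x):
        b, t, c = x.shape  # (batch_size, seq_len, embed_dim)
        q, k, v = (y.view(b, t, self.num_heads, -1).transpose(1, 2) for y in self.qkv(x).chunk(3, dim=-1))
        att = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # (b, num_heads, t, t) scaled scores
        att = att.masked_fill(torch.ones(t, t, device=x.device).triu(1).bool(), float("-inf"))  # hide future positions
        return self.proj((self.dropout(F.softmax(att, dim=-1)) @ v).transpose(1, 2).reshape(b, t, c))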
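
GPT-style attention is causal: each position may attend only to itself and earlier positions. The sketch below shows that masking step alone, not a full attention layer, with uniform scores chosen for readability.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.zeros(seq_len, seq_len)  # uniform raw scores, for readability

# True above the diagonal: position i must not attend to positions j > i
future = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
weights = F.softmax(scores.masked_fill(future, float("-inf")), dim=-1)
print(weights)
```

Row `i` spreads its weight evenly over positions `0` through `i` and places zero weight on later ones, because the `-inf` scores vanish under the softmax.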