<a href="https://colab.research.google.com/github/ChiefExecutiv/Ben_Detectron/blob/main/DumE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**DumE** is a small language model meant for study and experimentation.
This model is my naive attempt at building a language model.
It is trained on the Wikipedia article "The origins of the universe", focusing on the Big Bang theory.

The output at the end of this notebook is not very coherent, but it's not too bad either. I will emphasize that this model is not built for performance.
Training a reasonable language model would take many hours in a very powerful compute environment, and I have a cheap laptop :)


# Model Implementation

Below is the full implementation of DumE.
A few technical details:
It's a decoder-only model, so it uses causal self-attention to mask future tokens in the input sequence.
If you're familiar with self-attention, causal self-attention is a variant of it: each position can attend only to itself and to the tokens to its left in the sequence. The model can't "attend" to future tokens.
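
To make the masking concrete, here is a minimal sketch in plain Python (no tensors), separate from the model code. The helper name `causal_attention_weights` and the toy 3x3 score matrix are assumptions made for illustration; the real implementation does the same thing with `torch.tril`, `masked_fill`, and `F.softmax`.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix, with future positions masked.

    scores[i][j] is the raw score of query position i against key position j.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only look at positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the effect easy to read: row i spreads its weight
# evenly over the i + 1 positions it is allowed to see.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
for row in weights:
    print([round(w, 3) for w in row])
```

The first row puts all of its weight on position 0, and every entry above the diagonal is exactly zero, which is what "no attention to future tokens" means in practice.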


In [7]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import math


class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super(CausalSelfAttention, self).__init__()

        self.embed_dim = config.embed_dim
        self.num_heads = config.num_heads

        if config.embed_dim % config.num_heads != 0:
            raise ValueError("Embedding dimension must be divisible by number of heads")

        # Query, Key and Value projections
        self.qkv_proj = nn.Linear(config.embed_dim, 3 * config.embed_dim)  # These can also be projected individually

        # Output Projection
        self.output_proj = nn.Linear(config.embed_dim, config.embed_dim)

        # dropout to avoid overfitting
        self.attn_dropout = nn.Dropout(config.attn_drop)
        self.resid_dropout = nn.Dropout(config.resid_drop)

        # Lower-triangular causal mask: position t may attend only to positions <= t. Registered as a buffer so it moves with the model but is not trained
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimension

        # Query, key, values for all the heads - a head is essentially a single instance of the attention mechanism
        qkv = self.qkv_proj(x) # Shape: (B, T, 3*C)
        q, k, v = torch.chunk(qkv, 3, dim=-1) # Split into query, key and value tensors

        # Reshape for multi-head attention
        q = q.view(B, T, self.num_heads, C // self.num_heads).transpose(1, 2)
        k = k.view(B, T, self.num_heads, C // self.num_heads).transpose(1, 2)
        v = v.view(B, T, self.num_heads, C // self.num_heads).transpose(1, 2)

        # compute attention scores
        attn_scores = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(C // self.num_heads))

        # Apply causal mask
        causal_mask = self.bias[:, :, :T, :T]  # Extract relevant part of the mask
        attn_scores = attn_scores.masked_fill(causal_mask == 0, float('-inf'))

        # Compute attention probabilities
        attn_probs = F.softmax(attn_scores, dim=-1)
        attn_probs = self.attn_dropout(attn_probs)

        # Attention output
        attn_output = attn_probs @ v  # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        attn_output = attn_output.transpose(1, 2).contiguous().view(B, T, C)  # Reassemble heads

        # Output projection
        y = self.resid_dropout(self.output_proj(attn_output))
        return y


class TransformerBlock(nn.Module):

    def __init__(self, config):
        super().__init__()

        self.ln_1 = nn.LayerNorm(config.embed_dim)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.embed_dim)

        # Feed-forward neural network
        self.mlp = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.resid_drop)
        )

    def forward(self, x):
        # Apply LayerNorm, Attention, and Residual Connection
        attn_output = self.attn(self.ln_1(x))
        x = x + attn_output

        # Apply LayerNorm, MLP, and Residual Connection
        mlp_output = self.mlp(self.ln_2(x))
        x = x + mlp_output

        return x


class DumE(nn.Module):
    def __init__(self, config):
        super(DumE, self).__init__()

        # Embedding layers
        self.token_embedding = nn.Embedding(config.vocab_size, config.embed_dim)
        self.position_embedding = nn.Embedding(config.block_size, config.embed_dim)

        # Transformer blocks
        self.blocks = nn.ModuleList([TransformerBlock(config) for _ in range(config.num_layers)])

        # LayerNorm before output
        self.ln_f = nn.LayerNorm(config.embed_dim)

        # Output projection layer
        self.head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)

        # Store block size
        self.block_size = config.block_size

    def forward(self, idx, targets=None):
        B, T = idx.size()

        # Ensure sequence length does not exceed block size
        assert T <= self.block_size, "Sequence length exceeds model block size"

        # Compute token and position embeddings
        token_embeddings = self.token_embedding(idx)  # Shape: (B, T, C)
        position_ids = torch.arange(0, T, dtype=torch.long, device=idx.device).unsqueeze(0)
        position_embeddings = self.position_embedding(position_ids)  # Shape: (1, T, C)

        x = token_embeddings + position_embeddings

        # Pass through Transformer blocks
        for block in self.blocks:
            x = block(x)

        # Apply final LayerNorm and output projection
        x = self.ln_f(x)
        logits = self.head(x)  # Shape: (B, T, vocab_size)

        # Compute loss if targets are provided
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            return logits, loss

        return logits

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop the context window to block size
            idx_cond = idx[:, -self.block_size:]

            # Forward pass to get logits
            logits = self(idx_cond)

            # Focus only on the last token's logits
            logits = logits[:, -1, :]

            # Apply softmax to convert logits to probabilities
            probs = F.softmax(logits, dim=-1)

            # Sample from the distribution
            next_token = torch.multinomial(probs, num_samples=1)

            # Append the sampled token to the sequence
            idx = torch.cat((idx, next_token), dim=1)

        return idx

# The Trainer
This is responsible for running the model's training loop.
It's based on Andrej Karpathy's trainer, with a few changes and a few components stripped away for simplicity.


In [2]:
import time
from collections import defaultdict
import torch
from torch.utils.data.dataloader import DataLoader


"""
The Trainer class is courtesy of Andrej Karpathy
"""
class Trainer:

    def __init__(self, config, model, train_dataset):
        self.config = config
        self.model = model
        self.optimizer = None
        self.train_dataset = train_dataset
        self.callbacks = defaultdict(list)

        # The device to be trained on, a gpu is heavily recommended
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

        self.model = self.model.to(self.device)

        print(f"Running on device: {self.device}")


        self.iter_num = 0
        self.iter_time = 0.0
        self.iter_dt = 0.0

    def add_callback(self, onevent: str, callback):
        self.callbacks[onevent].append(callback)

    def set_callback(self, onevent: str, callback):
        self.callbacks[onevent] = [callback]

    def trigger_callbacks(self, onevent: str):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

    def run(self):
        model, config = self.model, self.config

        # setup the optimizer
        self.optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)
        config.grad_norm_clip = 1.0

        # dataloader
        train_loader = DataLoader(
            self.train_dataset,
            sampler = torch.utils.data.RandomSampler(self.train_dataset, replacement=True, num_samples=int(1e10)),
            shuffle = False,
            pin_memory = True,
            batch_size = config.batch_size,
            num_workers = config.num_workers,
        )

        model.train()
        self.iter_num = 0
        self.iter_time = time.time()
        data_iter = iter(train_loader)
        while True:

            # fetch the next batch(x, y)
            try:
                batch = next(data_iter)
            except StopIteration:
                data_iter = iter(train_loader)
                batch = next(data_iter)
            batch = [t.to(self.device) for t in batch]
            x, y = batch

            # forward the model
            logits, self.loss = model(x, y)

            # backpropagation and update parameters
            model.zero_grad(set_to_none=True)
            self.loss.backward()
            # clip_grad_norm_ (trailing underscore) is the in-place, non-deprecated API
            torch.nn.utils.clip_grad_norm_(model.parameters(), config.grad_norm_clip)
            self.optimizer.step()

            self.trigger_callbacks('on_batch_end')
            self.iter_num += 1
            tnow = time.time()
            self.iter_dt = tnow - self.iter_time
            self.iter_time = tnow

            # termination conditions
            if config.max_iters is not None and self.iter_num >= config.max_iters:
                break
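
The trainer fires an `on_batch_end` event after every optimizer step, but nothing in this notebook subscribes to it, so training runs silently. Below is a rough sketch of a loss-logging callback. `FakeTrainer` is a hypothetical stand-in that only mimics the `iter_num`, `loss`, and callback attributes of the `Trainer` above so the snippet runs by itself; `make_loss_logger` and `log_every` are likewise assumed names, not part of the original code.

```python
from collections import defaultdict

class FakeTrainer:
    """Hypothetical stand-in exposing the same callback API as Trainer."""
    def __init__(self):
        self.callbacks = defaultdict(list)
        self.iter_num = 0
        self.loss = 0.0

    def add_callback(self, onevent, callback):
        self.callbacks[onevent].append(callback)

    def trigger_callbacks(self, onevent):
        for callback in self.callbacks.get(onevent, []):
            callback(self)

def make_loss_logger(log_every, history):
    """Build a callback that records (iteration, loss) every `log_every` steps."""
    def on_batch_end(trainer):
        if trainer.iter_num % log_every == 0:
            # float() works for a plain number and for a 0-dim torch tensor
            history.append((trainer.iter_num, float(trainer.loss)))
    return on_batch_end

history = []
trainer = FakeTrainer()
trainer.add_callback("on_batch_end", make_loss_logger(log_every=2, history=history))

# Simulate five training steps with a made-up, steadily falling loss.
for step in range(5):
    trainer.iter_num = step
    trainer.loss = 4.0 - 0.5 * step
    trainer.trigger_callbacks("on_batch_end")

print(history)
```

With the real trainer, the same callback would be registered through `trainer.add_callback("on_batch_end", ...)` before calling `trainer.run()`.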

# Model Training

With the model implemented and a trainer in place, we're ready to train DumE.
Note: Training on a CPU was really nerve-racking, which is why I moved development to this Colab environment.
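
The training cell that follows uses a character-level dataset: each sample is a window of `block_size + 1` characters, and the targets are the inputs shifted one position to the left, so the model learns to predict the next character at every position. A small sketch of that windowing in plain Python, where `char_windows` and the toy string are assumptions made for illustration:

```python
def char_windows(text, block_size):
    """Yield (input_ids, target_ids) pairs the way a character-level dataset would."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
    for start in range(len(text) - block_size):
        ids = [stoi[ch] for ch in text[start:start + block_size + 1]]
        # Inputs drop the last id, targets drop the first: a one-step shift.
        yield ids[:-1], ids[1:]

pairs = list(char_windows("big bang", block_size=4))
x, y = pairs[0]
print(len(pairs), x, y)
```

Each target sequence overlaps its input on all but one position, which is why a single window yields `block_size` training examples at once.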

In [8]:
from torch.utils.data import Dataset, DataLoader

class ModelConfig:
    """
    Configuration for the language model.
    """
    def __init__(self, vocab_size, block_size, embed_dim, num_heads, num_layers, dropout, attn_drop, resid_drop):
        self.vocab_size = vocab_size
        self.block_size = block_size
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.num_layers = num_layers
        self.dropout = dropout
        self.attn_drop = attn_drop
        self.resid_drop = resid_drop

    def summary(self):
        """
        Print a summary of the configuration.
        """
        config_dict = vars(self)
        print("Model Configuration:")
        for key, value in config_dict.items():
            print(f"{key}: {value}")


class TrainerConfig:
    """
    Configuration for training.
    """
    def __init__(self, batch_size, num_workers, learning_rate, max_iters, eval_interval, eval_iters):
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.learning_rate = learning_rate
        self.max_iters = max_iters
        self.eval_interval = eval_interval
        self.eval_iters = eval_iters

    def summary(self):
        """
        Print a summary of the training configuration.
        """
        config_dict = vars(self)
        print("Trainer Configuration:")
        for key, value in config_dict.items():
            print(f"{key}: {value}")

class CharDataset(Dataset):
    def __init__(self, data, block_size):
        chars = sorted(list(set(data)))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        self.block_size = block_size
        self.data = data

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        dix = [self.stoi[ch] for ch in chunk]
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

if __name__ == "__main__":
    # Load text data
    with open("/content/sample_data/big-bang.txt", "r") as f:
        text = f.read()

    # Initialize dataset and configurations
    block_size = 128
    train_dataset = CharDataset(text, block_size)

    model_config = ModelConfig(
        vocab_size=train_dataset.vocab_size,
        block_size=block_size,
        embed_dim=256,
        num_heads=8,
        num_layers=6,
        dropout=0.1,
        attn_drop=0.1,
        resid_drop=0.1
    )

    trainer_config = TrainerConfig(
        batch_size=32,
        num_workers=2,
        learning_rate=3e-4,
        max_iters=3000,
        eval_interval=100,
        eval_iters=10
    )

    # Create model and trainer
    model = DumE(model_config)
    trainer = Trainer(trainer_config, model, train_dataset)

    # Print configurations
    model_config.summary()
    trainer_config.summary()

    # Train the model
    trainer.run()

    # Save the model
    torch.save(model.state_dict(), "trained_model.pt")
    print("Model saved as trained_model.pt")

    # Generate samples (switch to eval mode so dropout is disabled during sampling)
    model.eval()
    with torch.no_grad():
        context = "The Big Bang"
        x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None, ...].to(trainer.device)
        y = model.generate(x, 500)[0]
        completion = ''.join([train_dataset.itos[int(i)] for i in y])
        print("Generated text:")
        print(completion)

Running on device: cuda
Model Configuration:
vocab_size: 87
block_size: 128
embed_dim: 256
num_heads: 8
num_layers: 6
dropout: 0.1
attn_drop: 0.1
resid_drop: 0.1
Trainer Configuration:
batch_size: 32
num_workers: 2
learning_rate: 0.0003
max_iters: 3000
eval_interval: 100
eval_iters: 10




Model saved as trained_model.pt
Generated text:
The Big Bang was preceded by Sthew the eBig Bang theory's confluction is a comprehensive ever must have been withound 
and out 20106 – each last horizon as muched that the universe, and today is very emery nearly possible trans; and the oldn baryon matter; and through galaxies ward and quasars and galaxies (with as billion years.

The universe resumpirality
In a vistance of the age of the universe
If the earliest anst describes then any ampecularing atums 
he uniforpred every fred the expansion of the unive


Above is the model's output for the prompt "The Big Bang". Not too bad, given that it trained for about 2 minutes on a remote GPU.