# Task 0.0.3: GPT-1 Baseline (Colab-Ready)

**Purpose**: Establish baseline GPT-1 performance for fair comparison with RWKV-6 and Mamba-2  
**Phase**: V0.5 Phase 0 - Base Model Characterization  
**Status**: ✅ COMPLETE (2026-01-12)  
**Documentation**: See [BASE_MODEL_CHARACTERIZATION.md](../BASE_MODEL_CHARACTERIZATION.md), [HANDOFF.md](../HANDOFF.md)

## Key Findings

| Metric | Value | Note |
|--------|-------|------|
| **Characterization** | **AMPLIFIER (782x)** | Extreme amplification |
| Variance (activation std) | 0.02 → 16.7 | 782x total over 8 layers |
| Final loss | 6.77 | 50 steps with BlinkDL init |
| Max prob | 0.058 | Healthy (no saturation) |
| Entropy | 70.0% | Of random (9.68) |
| Logits range | [-4.5, +4.5] | Well-bounded |

**Key Insight:** GPT-1 is an extreme amplifier (782x) compared to RWKV-6 (5.5x) and Mamba-2 (2.0x). BlinkDL init keeps it stable by starting from tiny variance (0.02).

## Architecture

- **GPT-1 style**: Decoder-only transformer with causal attention
- **Scale**: 4.37M params (8 layers × 144 hidden) to match RWKV-6 and Mamba-2
- **Activation**: GELU (same as other baselines)
- **Init**: BlinkDL-style (architecture-agnostic, proven on all three models)
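
The 4.37M figure can be reproduced by hand from the layer shapes. The sketch below assumes the configuration used in the model definition cells of this notebook: vocabulary of 16,000, 512 learned positions, bias-free linear layers, a 4x FFN expansion, RMSNorm with a single weight vector, and an output head tied to the token embedding.

```python
# Parameter count for the configuration used in this notebook.
vocab_size, d_model, n_layers, max_pos = 16000, 144, 8, 512

embed = vocab_size * d_model      # token embedding (shared with the output head)
pos_embed = max_pos * d_model     # learned positional embedding
norm = d_model                    # one RMSNorm weight vector

attn = d_model * 3 * d_model + d_model * d_model  # qkv_proj + out_proj, no biases
ffn = 2 * d_model * 4 * d_model                   # up- and down-projection, no biases
block = 2 * norm + attn + ffn                     # two norms per block

# The tied head adds nothing; the last term is the final RMSNorm.
total = embed + pos_embed + n_layers * block + norm
print(f"per block: {block:,}  total: {total:,}")
```

Over half of the budget sits in the token embedding, so the eight transformer blocks together account for a little under 2M parameters.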

## Execution

1. **VS Code + Colab** (RECOMMENDED): Select Kernel → Connect to Google Colab
2. **Local WSL**: Should work (pure PyTorch, no CUDA dependencies)

In [1]:
# Cell 0: Colab setup
import os

try:
    IN_COLAB = 'COLAB_GPU' in os.environ or 'google.colab' in str(get_ipython())
except NameError:  # get_ipython is undefined outside IPython/Jupyter
    IN_COLAB = False

if IN_COLAB:
    print("✓ Running on Google Colab")
    if not os.path.exists('groundthink'):
        !git clone https://github.com/9to5ninja-projects/groundthink.git
    else:
        !cd groundthink && git pull --quiet
    os.chdir('groundthink')
    !pip install -q tokenizers
    print("✓ Dependencies installed")
else:
    print("Running locally (WSL/Linux)")
    if os.path.basename(os.getcwd()) == 'notebooks':
        os.chdir('..')

✓ Running on Google Colab
✓ Dependencies installed


In [2]:
# Cell 1: Memory monitoring
import resource
import gc

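def mem_mb():
    # ru_maxrss is the *peak* resident set size (kilobytes on Linux, bytes on macOS),
    # so the reported value never goes down even after memory is freed.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024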

def mem_check(label):
    gc.collect()
    print(f"[{label}] Memory: {mem_mb():.0f} MB")

mem_check("Before imports")

[Before imports] Memory: 147 MB


In [3]:
# Cell 2: Import PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
mem_check("After torch import")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

[After torch import] Memory: 280 MB
Device: cpu


In [4]:
# Cell 3: Load tokenizer
import sys
import os

if os.path.exists('groundthink') and os.getcwd().endswith('content'):
    os.chdir('groundthink')
sys.path.insert(0, os.getcwd())

from data.tokenizer import BPETokenizer

tokenizer = BPETokenizer('data/tokenizer_wikitext.json')
print(f"Loaded tokenizer: {tokenizer.vocab_size} vocab")
mem_check("After tokenizer")

Loaded tokenizer: 16000 vocab
[After tokenizer] Memory: 297 MB


In [5]:
# Cell 4: Load data (same as Task 0.0.1/0.0.2)
TOKEN_FILE = 'data/wikitext103_tokens.bin'

if os.path.exists(TOKEN_FILE):
    import numpy as np
    tokens = torch.from_numpy(np.fromfile(TOKEN_FILE, dtype=np.int32)).long()
    print(f"✓ Loaded {len(tokens):,} tokens from cache")
else:
    print("Streaming from HuggingFace...")
    if IN_COLAB:
        !pip install -q datasets
    from datasets import load_dataset
    
    ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
    all_tokens = []
    char_count = 0
    max_chars = 50 * 1024 * 1024
    
    for i, item in enumerate(ds):
        text = item['text']
        if not text.strip():
            continue
        char_count += len(text)
        all_tokens.extend(tokenizer.encode(text))
        if i % 10000 == 0 and i > 0:
            print(f"  {i:,} items, {char_count/1e6:.1f}MB")
        if char_count >= max_chars:
            break
    
    tokens = torch.tensor(all_tokens, dtype=torch.long)
    print(f"✓ Tokenized: {len(tokens):,} tokens")
    del all_tokens
    gc.collect()

mem_check("After tokenization")

Streaming from HuggingFace...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


  20,000 items, 5.9MB
  30,000 items, 8.8MB
  60,000 items, 17.7MB
  70,000 items, 20.6MB
  80,000 items, 23.5MB
  90,000 items, 26.5MB
  100,000 items, 29.4MB
  120,000 items, 35.4MB
  130,000 items, 38.4MB
  140,000 items, 41.3MB
  160,000 items, 47.2MB
  170,000 items, 50.1MB
✓ Tokenized: 11,926,606 tokens
[After tokenization] Memory: 1219 MB


In [6]:
# Cell 5: Dataset setup
BATCH_SIZE = 1
SEQ_LEN = 64

n_tokens = (len(tokens) // (BATCH_SIZE * SEQ_LEN)) * (BATCH_SIZE * SEQ_LEN)
tokens = tokens[:n_tokens]
print(f"Dataset: {n_tokens // SEQ_LEN:,} sequences of length {SEQ_LEN}")
mem_check("After dataset setup")

Dataset: 186,353 sequences of length 64
[After dataset setup] Memory: 1219 MB
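
The sequence count above follows directly from truncating the token stream to a whole number of `BATCH_SIZE * SEQ_LEN` chunks. A small worked check, using the token count printed by the tokenization cell:

```python
# Reproduce the dataset-size arithmetic from the cell above.
n_raw = 11_926_606            # tokens reported after tokenization
BATCH_SIZE, SEQ_LEN = 1, 64

chunk = BATCH_SIZE * SEQ_LEN
n_tokens = (n_raw // chunk) * chunk   # drop the ragged tail
n_sequences = n_tokens // SEQ_LEN
dropped = n_raw - n_tokens

print(f"{n_sequences:,} sequences, {dropped} trailing tokens dropped")
```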


In [7]:
# Cell 6: GPT-1 Model Definition

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

class CausalSelfAttention(nn.Module):
    """Standard causal multi-head attention (GPT-style)."""
    def __init__(self, d_model, n_heads, max_seq_len=512):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.d_model = d_model
        
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        
        # Causal mask
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
        self.register_buffer('mask', mask.view(1, 1, max_seq_len, max_seq_len))
    
    def forward(self, x):
        B, T, C = x.shape
        
        qkv = self.qkv_proj(x)
        q, k, v = qkv.chunk(3, dim=-1)
        
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        
        # Scaled dot-product attention
        scale = 1.0 / math.sqrt(self.d_head)
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        
        out = attn @ v
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.out_proj(out)

print("✓ CausalSelfAttention defined")

✓ CausalSelfAttention defined
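
To make the effect of the lower-triangular mask concrete, here is a minimal pure-Python sketch of the masked softmax for a single head. It is an illustration only: `causal_attention_weights` is a hypothetical helper, not part of the notebook, and it skips masked positions directly instead of filling them with `-inf` as the PyTorch code does. The resulting weights are the same.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Position t may only attend to positions 0..t.
        visible = scores[t][: t + 1]
        m = max(visible)                         # subtract the max for stability
        exps = [math.exp(s - m) for s in visible]
        z = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return weights

scores = [[0.0, 1.0, 2.0],
          [0.5, 0.5, 3.0],
          [1.0, 1.0, 1.0]]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The first position can only attend to itself, so its row is always `[1, 0, 0]` regardless of the scores.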


In [8]:
# Cell 7: GPT-1 Block and Model

class GPT1Block(nn.Module):
    """GPT-1 transformer block: attention + FFN with residuals."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = RMSNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = RMSNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4, bias=False),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model, bias=False),
        )
    
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class GPT1Model(nn.Module):
    """GPT-1 decoder-only transformer."""
    def __init__(self, vocab_size, d_model=144, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(512, d_model)  # max 512 positions
        self.blocks = nn.ModuleList([GPT1Block(d_model, n_heads) for _ in range(n_layers)])
        self.ln_out = RMSNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.embed.weight  # Tie weights
    
    def forward(self, x):
        B, T = x.shape
        pos = torch.arange(T, device=x.device).unsqueeze(0)
        x = self.embed(x) + self.pos_embed(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))

# Create model
model = GPT1Model(tokenizer.vocab_size, d_model=144, n_layers=8, n_heads=4)
params = sum(p.numel() for p in model.parameters())
print(f"GPT-1 Model: {params:,} parameters ({params/1e6:.2f}M)")
mem_check("After model creation")

GPT-1 Model: 4,370,832 parameters (4.37M)
[After model creation] Memory: 1219 MB


In [9]:
# Cell 8: Apply BlinkDL initialization (architecture-agnostic)

def apply_blinkdl_init(model):
    """Apply proven BlinkDL init pattern."""
    with torch.no_grad():
        # Small embedding init
        nn.init.uniform_(model.embed.weight, -1e-4, 1e-4)
        nn.init.uniform_(model.pos_embed.weight, -1e-4, 1e-4)
        print("✓ Embeddings: uniform(-1e-4, 1e-4)")
        
        # Zero the output projections in each block so every residual branch starts as a no-op
        for block in model.blocks:
            nn.init.zeros_(block.attn.out_proj.weight)
            nn.init.zeros_(block.ffn[2].weight)  # FFN output
        print(f"✓ Zeroed out_proj in all {len(model.blocks)} blocks")

apply_blinkdl_init(model)
print("✓ BlinkDL init applied")

✓ Embeddings: uniform(-1e-4, 1e-4)
✓ Zeroed out_proj in all 8 blocks
✓ BlinkDL init applied


In [10]:
# Cell 9: Training loop (50 steps)
import time

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
NUM_STEPS = 50
LOG_EVERY = 10

model.train()
losses = []
start_time = time.time()

for step in range(NUM_STEPS):
    idx = (step * SEQ_LEN) % (len(tokens) - SEQ_LEN - 1)
    x = tokens[idx:idx + SEQ_LEN].unsqueeze(0)
    y = tokens[idx + 1:idx + SEQ_LEN + 1].unsqueeze(0)
    
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, tokenizer.vocab_size), y.view(-1))
    
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    
    losses.append(loss.item())
    
    if (step + 1) % LOG_EVERY == 0:
        elapsed = time.time() - start_time
        print(f"Step {step+1}/{NUM_STEPS}: loss={losses[-1]:.2f}, {elapsed:.1f}s")

print(f"\n✓ Training complete: {losses[0]:.2f} → {losses[-1]:.2f}")
mem_check("After training")

Step 10/50: loss=9.18, 0.9s
Step 20/50: loss=8.75, 1.7s
Step 30/50: loss=7.30, 2.5s
Step 40/50: loss=6.72, 3.3s
Step 50/50: loss=6.77, 4.3s

✓ Training complete: 9.68 → 6.77
[After training] Memory: 1219 MB
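
The starting loss of 9.68 is what a uniform prediction over the vocabulary would give. With zeroed output projections and tiny tied embeddings, the initial logits are all close to zero, so the softmax is nearly uniform and the cross-entropy is approximately ln(vocab_size). A quick check of that arithmetic, plus the perplexity implied by the final loss (the vocabulary size and loss are taken from this notebook's output):

```python
import math

vocab_size = 16000
uniform_loss = -math.log(1.0 / vocab_size)   # cross-entropy of a uniform guess, in nats
final_loss = 6.77                            # loss after 50 steps, from the log above

print(f"uniform loss: {uniform_loss:.4f}")
print(f"perplexity: {math.exp(uniform_loss):.0f} -> {math.exp(final_loss):.0f}")
```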


In [11]:
# Cell 10: Diagnostic - output health check

print("=== Model Output Diagnostic ===")

model.eval()
with torch.no_grad():
    sample_x = tokens[:32*64].view(32, 64)
    sample_logits = model(sample_x)

print(f"Logits shape: {sample_logits.shape}")
print(f"Logits range: [{sample_logits.min().item():.2f}, {sample_logits.max().item():.2f}]")

probs = F.softmax(sample_logits, dim=-1)
max_prob = probs.max().item()
entropy = -(probs * torch.log(probs + 1e-10)).sum(-1).mean().item()
random_entropy = math.log(tokenizer.vocab_size)

print(f"Max prob: {max_prob:.4f}")
print(f"Entropy: {entropy:.2f} / {random_entropy:.2f} ({100*entropy/random_entropy:.1f}%)")

if max_prob > 0.99:
    print("⚠️ Softmax saturating")
elif entropy < 2.0:
    print("⚠️ Low entropy - model overconfident")
else:
    print("✓ Model outputs healthy")

=== Model Output Diagnostic ===
Logits shape: torch.Size([32, 64, 16000])
Logits range: [-4.46, 4.43]
Max prob: 0.0583
Entropy: 6.77 / 9.68 (70.0%)
✓ Model outputs healthy
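
The "% of random" figure is the mean predictive entropy divided by the entropy of a uniform distribution over the vocabulary. The sketch below shows the same calculation on a toy 8-way distribution and then rechecks the 70.0% figure from the logged values. `entropy_nats` is a hypothetical helper written for this illustration.

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

V = 8
uniform = [1.0 / V] * V
peaked = [0.93] + [0.01] * (V - 1)   # toy "overconfident" distribution

for name, dist in [("uniform", uniform), ("peaked", peaked)]:
    h = entropy_nats(dist)
    print(f"{name}: {h:.3f} nats ({100 * h / math.log(V):.1f}% of random)")

# Values logged by the diagnostic cell above.
reported_pct = 100 * 6.774600505828857 / 9.680344001221918
print(f"GPT-1 after 50 steps: {reported_pct:.1f}% of random")
```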


In [12]:
# Cell 11: Layer-wise variance analysis

print("=== Layer-wise Variance Analysis ===")

layer_outputs = {}

def make_hook(name):
    def hook(module, input, output):
        layer_outputs[name] = output.detach()
    return hook

hooks = []
for i, block in enumerate(model.blocks):
    h = block.register_forward_hook(make_hook(f'layer_{i}'))
    hooks.append(h)

model.eval()
with torch.no_grad():
    sample_x = tokens[:64].unsqueeze(0)
    pos = torch.arange(64).unsqueeze(0)
    embed_out = model.embed(sample_x) + model.pos_embed(pos)
    layer_outputs['embed'] = embed_out
    _ = model(sample_x)

for h in hooks:
    h.remove()

print("\nLayer-wise statistics:")
print("-" * 50)
variances = []
for name in ['embed'] + [f'layer_{i}' for i in range(len(model.blocks))]:
    out = layer_outputs[name]
    var = out.std().item()  # NOTE: this is the standard deviation; the name `var` is kept for later cells
    variances.append(var)
    print(f"{name:12s}: std={var:.4f}")

ratio = variances[-1] / variances[0] if variances[0] > 0 else 0
print("-" * 50)
print(f"Variance evolution: {variances[0]:.2f} → {variances[-1]:.2f} ({ratio:.2f}x)")

if ratio > 1.5:
    char = "AMPLIFIER"
elif ratio < 0.5:
    char = "DAMPER"
else:
    char = "NEUTRAL"
print(f"\n🎯 CHARACTERIZATION: **{char}**")

=== Layer-wise Variance Analysis ===

Layer-wise statistics:
--------------------------------------------------
embed       : std=0.0214
layer_0     : std=1.0030
layer_1     : std=3.6235
layer_2     : std=6.7759
layer_3     : std=9.4000
layer_4     : std=11.5465
layer_5     : std=13.4393
layer_6     : std=15.1841
layer_7     : std=16.7061
--------------------------------------------------
Variance evolution: 0.02 → 16.71 (781.90x)

🎯 CHARACTERIZATION: **AMPLIFIER**
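
The overall ratio hides how unevenly the growth is distributed. A short breakdown of the per-layer std values printed above suggests that most of the amplification happens in the first block, with each later block contributing a progressively smaller factor. The numbers are copied from the log rounded to four decimals, so the total differs slightly from the 781.90x computed with the unrounded embedding std.

```python
# Per-layer std values copied from the output above (rounded to 4 decimals).
stds = [0.0214, 1.0030, 3.6235, 6.7759, 9.4000, 11.5465, 13.4393, 15.1841, 16.7061]
names = ["embed"] + [f"layer_{i}" for i in range(8)]

# Growth factor contributed by each block: std after / std before.
ratios = [after / before for before, after in zip(stds, stds[1:])]
for name, r in zip(names[1:], ratios):
    print(f"{name:8s}: x{r:.2f}")

total = stds[-1] / stds[0]
print(f"total: x{total:.1f}")
```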


In [13]:
# Cell 12: Save findings
import json
from datetime import datetime

findings = {
    'task': '0.0.3',
    'model': 'GPT-1 Baseline',
    'architecture': {
        'type': 'decoder-only transformer',
        'layers': 8,
        'hidden': 144,
        'n_heads': 4,
        'params': params,
    },
    'init': 'BlinkDL (uniform emb, zero out_proj)',
    'characterization': char,
    'variance_evolution': {
        'input': variances[0],
        'output': variances[-1],
        'ratio': ratio,
    },
    'training': {
        'steps': NUM_STEPS,
        'initial_loss': losses[0],
        'final_loss': losses[-1],
    },
    'softmax': {
        'max_prob': max_prob,
        'entropy': entropy,
        'random_entropy': random_entropy,
    },
    'timestamp': datetime.now().isoformat(),
}

os.makedirs('logs', exist_ok=True)
with open('logs/gpt1_baseline_findings.json', 'w') as f:
    json.dump(findings, f, indent=2)

print("=== Task 0.0.3 Findings ===")
print(json.dumps(findings, indent=2))
print(f"\n✓ Saved to logs/gpt1_baseline_findings.json")

=== Task 0.0.3 Findings ===
{
  "task": "0.0.3",
  "model": "GPT-1 Baseline",
  "architecture": {
    "type": "decoder-only transformer",
    "layers": 8,
    "hidden": 144,
    "n_heads": 4,
    "params": 4370832
  },
  "init": "BlinkDL (uniform emb, zero out_proj)",
  "characterization": "AMPLIFIER",
  "variance_evolution": {
    "input": 0.021366169676184654,
    "output": 16.70612144470215,
    "ratio": 781.8959456885373
  },
  "training": {
    "steps": 50,
    "initial_loss": 9.680351257324219,
    "final_loss": 6.774981498718262
  },
  "softmax": {
    "max_prob": 0.05834708362817764,
    "entropy": 6.774600505828857,
    "random_entropy": 9.680344001221918
  },
  "timestamp": "2026-01-12T20:41:32.027629"
}

✓ Saved to logs/gpt1_baseline_findings.json


## Summary (Task 0.0.3 Complete - 2026-01-12)

### Comparison Table

| Metric | GPT-1 | RWKV-6 | Mamba-2 |
|--------|-------|--------|---------|
| Params | 4.37M | 4.3M | 4.4M |
| **Characterization** | **AMPLIFIER (782x)** | AMPLIFIER (5.5x) | AMPLIFIER (2.0x) |
| Variance ratio | 0.02 → 16.7 | 1.0 → 5.6 | 1.0 → 2.0 |
| Final loss (50 steps) | 6.77 | 7.9 | 6.75 |
| Max prob | 0.058 | 0.08 | 0.05 |
| Entropy (% of random) | 70.0% | 95% | 72.1% |
| Logits range | [-4.5, +4.5] | [-10, +10] | [-66, +155]* |

*Mamba-2 baseline (before BlinkDL init)

### Key Finding: GPT-1 is an Extreme Amplifier

GPT-1 amplifies variance **782x** through 8 layers vs:
- RWKV-6: 5.5x
- Mamba-2: 2.0x

**Why doesn't it saturate?** The BlinkDL init (tiny embeddings + zero out_proj) keeps starting variance at 0.02 instead of 1.0. The amplification ratio is huge, but absolute values stay reasonable (logits [-4.5, +4.5]).
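
One way to see why bounded logits rule out saturation: if every logit lies in `[lo, hi]`, the largest probability the softmax can assign is reached when one token sits at `hi` and all others at `lo`. The sketch below evaluates that worst case for the observed range. `max_prob_upper_bound` is a hypothetical helper, and the ±10 range is included only as an illustrative comparison.

```python
import math

def max_prob_upper_bound(lo, hi, vocab_size):
    """Largest softmax probability possible when all logits lie in [lo, hi]."""
    top = math.exp(hi)                        # one token at the upper bound
    rest = (vocab_size - 1) * math.exp(lo)    # every other token at the lower bound
    return top / (top + rest)

gpt1_bound = max_prob_upper_bound(-4.5, 4.5, 16000)
wide_bound = max_prob_upper_bound(-10.0, 10.0, 16000)

print(f"logits in [-4.5, 4.5]: max prob <= {gpt1_bound:.3f}")
print(f"logits in [-10, 10]:   max prob <= {wide_bound:.5f}")
```

With the observed range, the bound stays well below the 0.99 saturation threshold used in the diagnostic cell, whereas a ±10 range no longer excludes saturation on its own.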

### Implications for Fusion

1. **All three are amplifiers** at the full-model level
2. **GPT-1 amplifies most aggressively** (782x vs 5.5x vs 2.0x)
3. **BlinkDL init is critical** - works across all architectures
4. **SSM layers show different behavior:**
   - RWKV-6 raw layer: amplifier
   - Mamba-2 raw layer: **damper** (0.005x)
   - GPT-1 attention: amplifier
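
The labels used above come from the thresholds in the variance-analysis cell: a ratio above 1.5 is an amplifier, below 0.5 a damper, and anything in between neutral. A minimal sketch applying that rule to the ratios reported in this document (the `characterize` function name is introduced here for illustration):

```python
def characterize(ratio):
    """Same thresholds as the layer-wise variance cell in this notebook."""
    if ratio > 1.5:
        return "AMPLIFIER"
    if ratio < 0.5:
        return "DAMPER"
    return "NEUTRAL"

# Ratios as reported in the comparison table and the list above.
reported = {
    "GPT-1 (full model)": 782.0,
    "RWKV-6 (full model)": 5.5,
    "Mamba-2 (full model)": 2.0,
    "Mamba-2 (raw layer)": 0.005,
}
labels = {name: characterize(r) for name, r in reported.items()}
for name, label in labels.items():
    print(f"{name:22s} {reported[name]:>8} -> {label}")
```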

### Next Steps
- ✅ Task 0.0.3: COMPLETE
- ⬜ Task 0.0.4: Comparative analysis document
- ⬜ Phase 1: Implement GRU Arbiter