# 1.12b: Train Lil Gatsby from Supermassive Black Hole (bfloat16-native)

**Goal:** Initialize all token embeddings to the same point (a "supermassive black hole", BH) plus optional Gaussian noise, stored in **native bfloat16**, then train for 10,000 steps and record how the embeddings evolve.

## Hypothesis

If untrained tokens experience only "thermal jostling" (small gradient updates from being wrong), starting from a supermassive BH might produce a spongecrystal-like structure:

1. **Trained tokens** get large gradients → boil off and disperse
2. **Untrained tokens** get tiny gradients → stay in thermal bath, freeze into lattice
3. **An optimal σ** exists at which the initial cloud is small enough to stay coherent but large enough to explore the lattice neighborhood

## Initialization Strategy (bfloat16-native)

**bfloat16 throughout** (matching Qwen 3's native training precision):

- Pick a random point from N(0, 0.02) and round it to **bfloat16** (0.02 is the standard deviation, matching the GPT-2 init scale)
- Add Gaussian noise N(0, σ), also rounded to **bfloat16**
- Initialize all W vectors to `bh_center + noise`
- σ = 0: Perfect supermassive BH (all tokens identical)
- σ > 0: Thermal cloud around BH

**No float32 copy of the embeddings is kept.** The random draws are made in float32 and cast to bfloat16 immediately, so every value that is stored and trained is a bfloat16 number.
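
The recipe above can be sketched without PyTorch. This is a minimal pure-Python illustration, not the notebook's code: `to_bf16` is a hypothetical helper that emulates a float32 → bfloat16 cast with round-to-nearest-even, and `make_init` is an illustrative name.

```python
import random
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even).

    Emulation only: pack as float32, then keep the top 16 bits after rounding.
    NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)      # ties go to the even mantissa
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def make_init(vocab_size, hidden_dim, init_scale, sigma, seed=0):
    """Return (center, rows) where each row is bh_center + noise in bfloat16."""
    rng = random.Random(seed)
    center = [to_bf16(rng.gauss(0.0, init_scale)) for _ in range(hidden_dim)]
    rows = []
    for _ in range(vocab_size):
        if sigma == 0.0:
            rows.append(list(center))       # perfect BH: exact copies
        else:
            noise = [to_bf16(rng.gauss(0.0, sigma)) for _ in range(hidden_dim)]
            rows.append([to_bf16(c + n) for c, n in zip(center, noise)])
    return center, rows

center, rows = make_init(vocab_size=128, hidden_dim=64, init_scale=0.02, sigma=0.0)
print(all(row == center for row in rows))   # True: at sigma = 0 every row is the center
print(to_bf16(0.02))                        # 0.02001953125: 0.02 is not a bfloat16 value
```

At σ = 0 the rows are bit-identical, so any later separation between tokens has to come from training rather than from the initialization.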

## Model Architecture

- **Vocabulary:** 128 tokens (ASCII)
- **Context window:** 128 tokens
- **Hidden dimensions:** 64
- **Layers:** 2
- **Attention heads:** 2
- **Weight tying:** E = W^T (matches Qwen 3 4B)
- **Total parameters:** ~117k
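
The parameter total can be reproduced by hand. The sketch below assumes the standard GPT-2 block layout (biases on every linear layer, a 4× MLP expansion, two layer norms per block plus a final one); `gpt2_param_count` is an illustrative helper, not a library function.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Count parameters of a GPT-2 style model with tied input/output embeddings."""
    wte = vocab_size * n_embd                      # token embeddings, shared with the LM head
    wpe = n_positions * n_embd                     # learned position embeddings
    layer_norm = 2 * n_embd                        # scale + bias
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)     # QKV + output projection
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)  # up + down projection
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm  # trailing term is the final layer norm

total = gpt2_param_count(vocab_size=128, n_positions=128, n_embd=64, n_layer=2)
print(f"{total:,}")  # 116,480
```

The number of attention heads does not appear because splitting the 64 dimensions into heads does not change the size of the projection matrices. The token embedding matrix accounts for 8,192 of the parameters, and weight tying means the output head adds none.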

## Training Configuration

- **Steps:** 10,000
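- **Batch size:** 1 (no gradient accumulation)
- **Optimizer:** Adam (lr=0.001, β₁=0.9, β₂=0.999, ε=1e-8), run as the Trainer's `adamw_torch`, which coincides with plain Adam at zero weight decay
- **LR schedule and clipping:** left at the Trainer defaults (linear decay of the learning rate with no warmup, gradient-norm clipping at 1.0), so 0.001 is the starting rate rather than a constant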
- **Weight decay:** 0.0 (no regularization)
- **Precision:** Native bfloat16
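
One property of these Adam settings is worth keeping in mind when reading the recorded gradients: on the first step, bias correction makes the update size roughly `lr` regardless of how large the gradient is, as long as it is well above ε. The sketch below works this out for a single scalar; `adam_first_update` is an illustrative helper that follows the textbook Adam update rule.

```python
import math

def adam_first_update(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a single scalar gradient."""
    m = (1 - beta1) * grad            # first moment after one step
    v = (1 - beta2) * grad * grad     # second moment after one step
    m_hat = m / (1 - beta1)           # bias correction recovers grad
    v_hat = v / (1 - beta2)           # bias correction recovers grad**2
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in (1.0, 1e-3, 1e-6):
    print(f"grad={g:.0e}  update={adam_first_update(g):.3e}")
```

All three updates come out within about 1% of the learning rate, so a token with a small gradient is not automatically a token that moves slowly. Later steps depend on the running averages and can differ.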

## Data Recording (Every Step)

- **Embeddings:** (10,001, 128, 64) bfloat16
- **Gradients:** (10,001, 128, 64) bfloat16
- **Adam momentum:** (10,001, 128, 64) bfloat16
- **Adam variance:** (10,001, 128, 64) bfloat16
- **Logits:** (10,001, 128) bfloat16
- **Loss:** (10,001,) bfloat16

Total: ~660 MB saved as safetensors
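
The size estimate follows directly from the shapes listed above. The short calculation below also counts the `recorded_steps` index, which the notebook saves as 64-bit integers; the handful of scalar hyperparameters and the safetensors header are ignored.

```python
N_RECORDS = 10_001            # step 0 plus 10,000 training steps
VOCAB, HIDDEN = 128, 64
BF16_BYTES, INT64_BYTES = 2, 8

per_matrix_series = N_RECORDS * VOCAB * HIDDEN * BF16_BYTES
sizes = {
    "embeddings": per_matrix_series,
    "grads": per_matrix_series,
    "momentum": per_matrix_series,
    "variance": per_matrix_series,
    "logits": N_RECORDS * VOCAB * BF16_BYTES,
    "losses": N_RECORDS * BF16_BYTES,
    "recorded_steps": N_RECORDS * INT64_BYTES,   # saved as torch.long
}
total_bytes = sum(sizes.values())
print(f"{total_bytes / 1e6:.1f} MB")  # 658.1 MB
```

The four matrix-valued series dominate the file; the logits, losses, and step index together contribute under 3 MB.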

## Parameters

In [1]:
# Model architecture
VOCAB_SIZE = 128
HIDDEN_DIM = 64
N_LAYER = 2
N_HEAD = 2
MAX_SEQ_LEN = 128

# Training
BATCH_SIZE = 1
GRADIENT_ACCUMULATION = 1
NUM_TRAIN_STEPS = 10000
LEARNING_RATE = 1e-3  # 0.001 for Adam
WEIGHT_DECAY = 0.0

# Optimizer: Adam
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8

# Initialization (SUPERMASSIVE BLACK HOLE - bfloat16 native)
INIT_SCALE = 0.02  # Match GPT-2 default initialization scale
SIGMA = 0.0        # Gaussian noise std dev (0 = perfect BH, >0 = thermal cloud)

# Data
CORPUS_PATH = "../data/gatsby_clean.txt"
OUTPUT_DIR = "../tensors/Lil_Gatsby"
OUTPUT_FILE = f"1.12b_training_data_sigma{SIGMA:.0e}.safetensors"

# Instrumentation
RECORD_EVERY_N_STEPS = 1  # Record every step

RANDOM_SEED = 42

## Imports

In [2]:
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments
from torch.utils.data import Dataset
import numpy as np
from pathlib import Path
from safetensors.torch import save_file
import time

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## Device Detection

In [3]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load Corpus

In [4]:
print(f"Loading corpus: {CORPUS_PATH}")

with open(CORPUS_PATH, 'r', encoding='ascii') as f:
    corpus_text = f.read()

# Convert to byte array (only keep bytes < VOCAB_SIZE)
corpus_bytes = [b for b in corpus_text.encode('ascii') if b < VOCAB_SIZE]

print(f"  Total bytes: {len(corpus_bytes):,}")

# Pre-load corpus to device
corpus_tensor = torch.tensor(corpus_bytes, dtype=torch.long, device=device)
print(f"✓ Corpus on device")

Loading corpus: ../data/gatsby_clean.txt
  Total bytes: 268,928
✓ Corpus on device


## Dataset

In [5]:
class ByteDataset(Dataset):
    def __init__(self, corpus_tensor, max_seq_len):
        self.corpus = corpus_tensor
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.corpus) - self.max_seq_len)
    
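    def __getitem__(self, idx):
        # Slice max_seq_len + 1 tokens so inputs and labels are both max_seq_len long.
        chunk = self.corpus[idx : idx + self.max_seq_len + 1]
        # NOTE: GPT2LMHeadModel shifts `labels` by one position internally, so
        # with pre-shifted labels position i is scored against chunk[i + 2],
        # not chunk[i + 1]. Standard next-token training passes the same tensor
        # for both keys. Left as-is because the outputs below come from this pairing.
        return {
            'input_ids': chunk[:-1],
            'labels': chunk[1:]
        }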

dataset = ByteDataset(corpus_tensor, MAX_SEQ_LEN)
print(f"✓ Dataset: {len(dataset):,} examples")

✓ Dataset: 268,800 examples
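
A toy corpus makes the windowing concrete. The sketch below mirrors the slicing in `ByteDataset` with plain lists, then shows which (last-seen token, target token) pairs end up in the loss, assuming the model applies its usual internal one-position shift to `labels` (the behavior of the `ForCausalLMLoss` default named in the training log further down). `windows` and `trained_pairs` are illustrative names.

```python
def windows(corpus, seq_len):
    """Yield (input_ids, labels) pairs the way ByteDataset slices them."""
    for idx in range(max(0, len(corpus) - seq_len)):
        chunk = corpus[idx : idx + seq_len + 1]
        yield chunk[:-1], chunk[1:]

def trained_pairs(input_ids, labels):
    """(last-seen token, target) pairs once the loss shifts labels by one more position."""
    return list(zip(input_ids[:-1], labels[1:]))

corpus = list("abcdefgh")
pairs = list(windows(corpus, seq_len=4))
print(len(pairs))                      # 4 windows from 8 tokens
inputs, labels = pairs[0]
print(inputs, labels)                  # ['a', 'b', 'c', 'd'] ['b', 'c', 'd', 'e']
print(trained_pairs(inputs, labels))   # [('a', 'c'), ('b', 'd'), ('c', 'e')]
```

The window count matches the notebook's formula: 268,928 bytes minus a 128-token context gives 268,800 examples. The last line shows each position being scored against the token two places ahead, which is the effect of shifting the labels both in the dataset and in the loss.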


## Model

In [6]:
config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_SEQ_LEN,
    n_embd=HIDDEN_DIM,
    n_layer=N_LAYER,
    n_head=N_HEAD,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    tie_word_embeddings=True,
)

model = GPT2LMHeadModel(config)
model = model.to(torch.bfloat16).to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model created: {total_params:,} parameters")

✓ Model created: 116,480 parameters


## bfloat16-Native Initialization (SUPERMASSIVE BLACK HOLE)

In [7]:
print(f"\n{'='*80}")
print(f"INITIALIZING SUPERMASSIVE BLACK HOLE (bfloat16-native)")
print(f"{'='*80}\n")

torch.manual_seed(RANDOM_SEED)

# Sample the BH center in float32, then round it to bfloat16
bh_center_bf16 = (torch.randn(HIDDEN_DIM, dtype=torch.float32, device=device) * INIT_SCALE).to(torch.bfloat16)

print(f"Black hole center (bfloat16):")
print(f"  First 8 dims: {bh_center_bf16[:8].cpu().float().tolist()}")
print(f"  Norm: {torch.norm(bh_center_bf16.float()).item():.6f}")
print()

# Initialize all W vectors to BH center + noise (all in bfloat16)
if SIGMA == 0.0:
    # Perfect supermassive BH - all tokens identical
    init_bf16 = bh_center_bf16.unsqueeze(0).expand(VOCAB_SIZE, HIDDEN_DIM).clone()
    print(f"σ = {SIGMA:.0e} → Perfect supermassive black hole")
    print(f"  All {VOCAB_SIZE} tokens initialized to identical bfloat16 vector")
else:
    # Thermal cloud around BH (noise sampled in float32, then rounded to bfloat16)
    noise_bf16 = (torch.randn(VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32, device=device) * SIGMA).to(torch.bfloat16)
    init_bf16 = bh_center_bf16.unsqueeze(0) + noise_bf16
    print(f"σ = {SIGMA:.0e} → Thermal cloud around black hole")
    print(f"  Initial cloud size (std): {noise_bf16.float().std().item():.6f}")
    print(f"  Initial cloud radius (max dist from center): {torch.norm(noise_bf16.float(), dim=1).max().item():.6f}")

print()

# Assign to model
with torch.no_grad():
    model.transformer.wte.weight[:] = init_bf16

print(f"✓ Initialized embeddings (pure bfloat16)")
print(f"  Shape: {model.transformer.wte.weight.shape}")
print(f"  Dtype: {model.transformer.wte.weight.dtype}")
print()

# Verify initialization
W_check = model.transformer.wte.weight.cpu().float()
pairwise_dists = torch.cdist(W_check, W_check)
max_dist = pairwise_dists.max().item()
mean_dist = pairwise_dists[torch.triu(torch.ones_like(pairwise_dists), diagonal=1) == 1].mean().item()

print(f"Initial embedding statistics:")
print(f"  Max pairwise distance: {max_dist:.6f}")
print(f"  Mean pairwise distance: {mean_dist:.6f}")

if SIGMA == 0.0:
    if max_dist < 1e-6:
        print(f"  ✓ All vectors identical (as expected for σ=0)")
    else:
        print(f"  ⚠️  Quantization artifacts: max_dist = {max_dist:.6f} (expected ~0)")

print()
print(f"{'='*80}\n")


INITIALIZING SUPERMASSIVE BLACK HOLE (bfloat16-native)

Black hole center (bfloat16):
  First 8 dims: [0.01806640625, 0.00445556640625, 0.0029144287109375, 0.01470947265625, -0.00830078125, -0.0022125244140625, 0.00909423828125, -0.06640625]
  Norm: 0.156989

σ = 0e+00 → Perfect supermassive black hole
  All 128 tokens initialized to identical bfloat16 vector

✓ Initialized embeddings (pure bfloat16)
  Shape: torch.Size([128, 64])
  Dtype: torch.bfloat16

Initial embedding statistics:
  Max pairwise distance: 0.000000
  Mean pairwise distance: 0.000000
  ✓ All vectors identical (as expected for σ=0)
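
Assuming the "lattice" in the hypothesis refers to the grid of representable bfloat16 values, it helps to know how coarse that grid is at the coordinates printed above. The sketch below computes the spacing between adjacent bfloat16 values (one ULP) and compares it with the learning rate; `bf16_ulp` is an illustrative helper for normal, nonzero values.

```python
import math

def bf16_ulp(x: float) -> float:
    """Gap between adjacent bfloat16 values around x (normal range, x != 0)."""
    _, exponent = math.frexp(abs(x))   # abs(x) = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 8)       # 8 significand bits: 7 stored + 1 implicit

LEARNING_RATE = 1e-3
# Three of the BH-center coordinates printed above.
for coord in (0.01806640625, -0.0022125244140625, -0.06640625):
    ulp = bf16_ulp(coord)
    print(f"{coord:+.10f}  ulp={ulp:.3e}  lr/ulp={LEARNING_RATE / ulp:.1f}")
```

For the largest coordinate shown (-0.0664) the grid spacing is 2⁻¹¹, so a step of size `lr` is only about two grid cells, while for the smallest (-0.0022) it is dozens of cells. The grid is therefore non-uniform across dimensions, which matters when interpreting per-dimension motion.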




## Comprehensive Recorder

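Records the embedding matrix, its gradient, its Adam state, one logit vector (first sequence, last position), and the loss at every step, and holds everything in RAM until training completes. Gradients are captured right after the backward pass, before any clipping the Trainer applies.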

In [8]:
class ComprehensiveRecorder:
    """Records embeddings, gradients, optimizer state, logits, loss at every step in bfloat16."""
    
    def __init__(self, vocab_size, hidden_dim, record_every_n):
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.record_every_n = record_every_n
        
        # Storage (lists of tensors, keep in RAM)
        self.recorded_steps = []
        self.embeddings = []      # [n_recorded, vocab_size, hidden_dim]
        self.grads = []           # [n_recorded, vocab_size, hidden_dim]
        self.momentum = []        # [n_recorded, vocab_size, hidden_dim]
        self.variance = []        # [n_recorded, vocab_size, hidden_dim]
        self.logits = []          # [n_recorded, vocab_size]
        self.losses = []          # [n_recorded]
        
        # Temporary storage
        self.current_step = 0
        self.recorded_initial = False
        self.grad_before = None
        self.loss_value = None
        self.logits_sample = None
    
    def record_initial_state(self, model, optimizer):
        """Record step 0: initial state before training."""
        if not self.recorded_initial:
            W = model.transformer.wte.weight.data.clone().cpu().bfloat16()
            
            # Step 0: no gradients, no optimizer state yet (zeros)
            self.recorded_steps.append(0)
            self.embeddings.append(W)
            self.grads.append(torch.zeros_like(W))
            self.momentum.append(torch.zeros_like(W))
            self.variance.append(torch.zeros_like(W))
            self.logits.append(torch.zeros(self.vocab_size, dtype=torch.bfloat16))
            self.losses.append(torch.tensor(float('nan'), dtype=torch.bfloat16))  # No loss yet
            
            self.recorded_initial = True
            self.current_step = 1
            
            print(f"✓ Recorded initial state (step 0)")
    
    def record_before_step(self, model, loss, logits):
        """Call after forward/backward, before optimizer step."""
        if self.current_step % self.record_every_n == 0:
            # Capture gradients in bfloat16
            if model.transformer.wte.weight.grad is not None:
                self.grad_before = model.transformer.wte.weight.grad.clone().cpu().bfloat16()
            else:
                self.grad_before = torch.zeros(self.vocab_size, self.hidden_dim, dtype=torch.bfloat16)
            
            # Capture loss
            self.loss_value = loss.item()
            
            # Capture logits from first sequence, last position in bfloat16
            self.logits_sample = logits[0, -1, :].detach().cpu().bfloat16()
    
    def record_after_step(self, model, optimizer):
        """Call after optimizer step."""
        if self.current_step % self.record_every_n == 0:
            if self.grad_before is not None and self.loss_value is not None:
                # Capture embeddings in bfloat16
                W = model.transformer.wte.weight.data.clone().cpu().bfloat16()

                # Capture optimizer state (Adam momentum and variance)
                param = model.transformer.wte.weight
                if param in optimizer.state:
                    state = optimizer.state[param]
                    # Get state tensors if they exist, convert to bfloat16
                    mom_src = state.get('exp_avg', None)
                    var_src = state.get('exp_avg_sq', None)
                    mom = mom_src.clone().cpu().bfloat16() if mom_src is not None else torch.zeros_like(W)
                    var = var_src.clone().cpu().bfloat16() if var_src is not None else torch.zeros_like(W)
                else:
                    mom = torch.zeros_like(W)
                    var = torch.zeros_like(W)

                # Store everything
                self.recorded_steps.append(self.current_step)
                self.embeddings.append(W)
                self.grads.append(self.grad_before)
                self.momentum.append(mom)
                self.variance.append(var)
                self.logits.append(self.logits_sample)
                self.losses.append(torch.tensor(self.loss_value, dtype=torch.bfloat16))

                # Clear temp storage
                self.grad_before = None
                self.loss_value = None
                self.logits_sample = None
                
                # Progress indicator every 1000 steps
                if self.current_step % 1000 == 0:
                    print(f"  Recorded step {self.current_step:,}")

        self.current_step += 1
    
    def get_data(self):
        """Return recorded data as stacked tensors."""
        print(f"\nStacking {len(self.embeddings)} recorded states...")
        
        return {
            'recorded_steps': torch.tensor(self.recorded_steps, dtype=torch.long),
            'embeddings': torch.stack(self.embeddings) if self.embeddings else torch.tensor([]),
            'grads': torch.stack(self.grads) if self.grads else torch.tensor([]),
            'momentum': torch.stack(self.momentum) if self.momentum else torch.tensor([]),
            'variance': torch.stack(self.variance) if self.variance else torch.tensor([]),
            'logits': torch.stack(self.logits) if self.logits else torch.tensor([]),
            'losses': torch.stack(self.losses) if self.losses else torch.tensor([]),
        }

print("✓ Recorder class defined")

✓ Recorder class defined


## Custom Trainer with Instrumentation

In [9]:
class InstrumentedTrainer(Trainer):
    def __init__(self, recorder, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.recorder = recorder
        self.last_logits = None

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """Override to capture logits."""
        outputs = model(**inputs)
        loss = outputs.loss
        
        # Store logits for recorder
        self.last_logits = outputs.logits
        
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs, num_items_in_batch=None):
        """Override to inject recording."""
        # Standard forward + backward
        loss = super().training_step(model, inputs, num_items_in_batch)
        
        # Record BEFORE optimizer step
        self.recorder.record_before_step(model, loss, self.last_logits)
        
        return loss

    def _maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time=None, **kwargs):
        """Override to record AFTER optimizer step."""
        # Record AFTER optimizer updates parameters
        self.recorder.record_after_step(model, self.optimizer)
        
        # Call parent
        super()._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time, **kwargs)

print("✓ InstrumentedTrainer defined")

✓ InstrumentedTrainer defined


## Training Configuration

In [10]:
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

recorder = ComprehensiveRecorder(VOCAB_SIZE, HIDDEN_DIM, RECORD_EVERY_N_STEPS)

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    max_steps=NUM_TRAIN_STEPS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    adam_beta1=ADAM_BETA1,
    adam_beta2=ADAM_BETA2,
    adam_epsilon=ADAM_EPSILON,
    optim="adamw_torch",  # AdamW; equivalent to Adam because WEIGHT_DECAY = 0.0
    logging_steps=1000,
    save_steps=NUM_TRAIN_STEPS + 1,  # Don't save checkpoints
    save_total_limit=0,
    dataloader_num_workers=0,
    dataloader_pin_memory=False,
    bf16=True,  # Native bfloat16 training
    seed=RANDOM_SEED,
    report_to="none",
    disable_tqdm=False,
)

trainer = InstrumentedTrainer(
    recorder=recorder,
    model=model,
    args=training_args,
    train_dataset=dataset,
)

print("✓ Trainer ready (Adam, bf16=True)")

✓ Trainer ready (Adam, bf16=True)
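
The `TrainingArguments` above set no `lr_scheduler_type` or warmup, so the learning rate is governed by the Trainer's default schedule. Assuming that default is the linear schedule with zero warmup (as in current `transformers` releases), the rate at a given step can be sketched as follows; `linear_decay_lr` is an illustrative re-implementation, not the library function.

```python
def linear_decay_lr(step, peak_lr=1e-3, total_steps=10_000, warmup_steps=0):
    """Learning rate under a linear warmup/decay schedule (illustrative)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 2_500, 5_000, 9_999):
    print(step, linear_decay_lr(step))
```

Under this assumption the rate is halved by step 5,000 and is close to zero at the end of the run, which would by itself slow embedding motion in the late steps. Passing `lr_scheduler_type="constant"` to `TrainingArguments` is the usual way to hold the rate fixed instead.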


## Record Initial State

In [11]:
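# trainer.optimizer is still None here (the Trainer builds it inside train());
# record_initial_state() does not use it and stores zeros for the Adam state.
recorder.record_initial_state(model, trainer.optimizer)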

✓ Recorded initial state (step 0)


## Train

**This takes about 2 minutes** on the MPS device used for this run.

In [12]:
print(f"\n{'='*80}")
print(f"STARTING 10,000-STEP TRAINING RUN")
print(f"{'='*80}")
print(f"\nConfiguration:")
print(f"  Initialization: Supermassive BH (σ = {SIGMA:.0e}, bfloat16-native)")
print(f"  Optimizer: Adam")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Adam beta1: {ADAM_BETA1}")
print(f"  Adam beta2: {ADAM_BETA2}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print(f"  Precision: bfloat16 (native)")
print(f"  Steps: {NUM_TRAIN_STEPS:,} (plus initial step 0)")
print(f"  Recording every: {RECORD_EVERY_N_STEPS} step(s)")
print(f"  Expected records: {NUM_TRAIN_STEPS // RECORD_EVERY_N_STEPS + 1:,}")
print(f"  Expected file size: ~660 MB")
print(f"  Estimated runtime: ~2 minutes")
print(f"\nRecording:")
print(f"  - Embeddings (positions)")
print(f"  - Gradients (forces)")
print(f"  - Momentum (Adam first moment)")
print(f"  - Variance (Adam second moment)")
print(f"  - Logits (predictions)")
print(f"  - Loss")
print(f"\n{'='*80}\n")

start_time = time.time()
trainer.train()
elapsed = time.time() - start_time

print(f"\n{'='*80}")
print(f"✓ Training complete")
print(f"  Elapsed time: {elapsed/60:.1f} minutes ({elapsed:.1f} seconds)")
print(f"  Throughput: {NUM_TRAIN_STEPS / elapsed:.1f} steps/second")
print(f"{'='*80}")


STARTING 10,000-STEP TRAINING RUN

Configuration:
  Initialization: Supermassive BH (σ = 0e+00, bfloat16-native)
  Optimizer: Adam
  Learning rate: 0.001
  Adam beta1: 0.9
  Adam beta2: 0.999
  Weight decay: 0.0
  Precision: bfloat16 (native)
  Steps: 10,000 (plus initial step 0)
  Recording every: 1 step(s)
  Expected records: 10,001
  Expected file size: ~660 MB
  Estimated runtime: ~2 minutes

Recording:
  - Embeddings (positions)
  - Gradients (forces)
  - Momentum (Adam first moment)
  - Variance (Adam second moment)
  - Logits (predictions)
  - Loss




`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
1000,3.1728
2000,3.1101
3000,3.1056
4000,3.0267
5000,2.9535
6000,2.9447
7000,2.94
8000,2.9407
9000,2.935
10000,2.9368


  Recorded step 1,000
  Recorded step 2,000
  Recorded step 3,000
  Recorded step 4,000
  Recorded step 5,000
  Recorded step 6,000
  Recorded step 7,000
  Recorded step 8,000
  Recorded step 9,000
  Recorded step 10,000

✓ Training complete
  Elapsed time: 1.6 minutes (95.6 seconds)
  Throughput: 104.6 steps/second


## Save Recorded Data

In [13]:
print(f"\nPreparing data for save...\n")

recorded_data = recorder.get_data()

save_dict = {
    'recorded_steps': recorded_data['recorded_steps'],
    'embeddings': recorded_data['embeddings'],
    'grads': recorded_data['grads'],
    'momentum': recorded_data['momentum'],
    'variance': recorded_data['variance'],
    'logits': recorded_data['logits'],
    'losses': recorded_data['losses'],
    'bh_center': bh_center_bf16.cpu(),
    'init_scale': torch.tensor(INIT_SCALE, dtype=torch.float32),
    'sigma': torch.tensor(SIGMA, dtype=torch.float32),
    'learning_rate': torch.tensor(LEARNING_RATE, dtype=torch.float32),
    'weight_decay': torch.tensor(WEIGHT_DECAY, dtype=torch.float32),
    'adam_beta1': torch.tensor(ADAM_BETA1, dtype=torch.float32),
    'adam_beta2': torch.tensor(ADAM_BETA2, dtype=torch.float32),
}

output_path = Path(OUTPUT_DIR) / OUTPUT_FILE

print(f"Saving to: {output_path}")
print(f"This may take a minute...\n")

save_start = time.time()
save_file(save_dict, str(output_path))
save_elapsed = time.time() - save_start

file_size_mb = output_path.stat().st_size / 1e6
file_size_gb = file_size_mb / 1000

print(f"✓ Saved successfully")
print(f"  File: {output_path}")
print(f"  Size: {file_size_gb:.2f} GB ({file_size_mb:.1f} MB)")
print(f"  Save time: {save_elapsed:.1f} seconds")
print(f"  Recorded steps: {len(recorded_data['recorded_steps']):,}")
print(f"  Step range: {recorded_data['recorded_steps'][0]} to {recorded_data['recorded_steps'][-1]}")


Preparing data for save...


Stacking 10001 recorded states...
Saving to: ../tensors/Lil_Gatsby/1.12b_training_data_sigma0e+00.safetensors
This may take a minute...

✓ Saved successfully
  File: ../tensors/Lil_Gatsby/1.12b_training_data_sigma0e+00.safetensors
  Size: 0.66 GB (658.1 MB)
  Save time: 2.1 seconds
  Recorded steps: 10,001
  Step range: 0 to 10000


## Verify Data Shapes

In [14]:
print(f"\nData shapes:")
print(f"  embeddings: {recorded_data['embeddings'].shape}")
print(f"  grads: {recorded_data['grads'].shape}")
print(f"  momentum: {recorded_data['momentum'].shape}")
print(f"  variance: {recorded_data['variance'].shape}")
print(f"  logits: {recorded_data['logits'].shape}")
print(f"  losses: {recorded_data['losses'].shape}")


Data shapes:
  embeddings: torch.Size([10001, 128, 64])
  grads: torch.Size([10001, 128, 64])
  momentum: torch.Size([10001, 128, 64])
  variance: torch.Size([10001, 128, 64])
  logits: torch.Size([10001, 128])
  losses: torch.Size([10001])


## Quick Verification

In [15]:
print(f"\n{'='*80}")
print(f"VERIFICATION")
print(f"{'='*80}\n")

embeddings = recorded_data['embeddings']
grads = recorded_data['grads']
logits_vec = recorded_data['logits']
losses = recorded_data['losses']
momentum = recorded_data['momentum']
variance = recorded_data['variance']

print(f"Embedding evolution:")
print(f"  Step 0 centroid norm: {embeddings[0].float().mean(dim=0).norm().item():.6f}")
print(f"  Step {NUM_TRAIN_STEPS} centroid norm: {embeddings[-1].float().mean(dim=0).norm().item():.6f}")

print(f"\nPairwise distances:")
step0_dists = torch.cdist(embeddings[0].float(), embeddings[0].float())
stepN_dists = torch.cdist(embeddings[-1].float(), embeddings[-1].float())
print(f"  Step 0 max: {step0_dists.max().item():.6f} (should be ~0 for σ=0)")
print(f"  Step {NUM_TRAIN_STEPS} max: {stepN_dists.max().item():.4f}")

print(f"\nGradient magnitudes (first step):")
grad_norms_step1 = torch.norm(grads[1].float(), p=2, dim=1)
print(f"  Mean: {grad_norms_step1.mean().item():.6e}")
print(f"  Max: {grad_norms_step1.max().item():.6e}")

print(f"\nAdam state (after warmup):")
print(f"  Momentum at step 100: {momentum[100].float().abs().mean().item():.6e}")
print(f"  Variance at step 100: {variance[100].float().abs().mean().item():.6e}")

print(f"\nLoss trajectory:")
print(f"  Step 1: {losses[1].float().item():.4f}")
print(f"  Step {NUM_TRAIN_STEPS}: {losses[-1].float().item():.4f}")
print(f"  Reduction: {(losses[1].float() - losses[-1].float()).item():.4f}")

print(f"\n{'='*80}")


VERIFICATION

Embedding evolution:
  Step 0 centroid norm: 0.156989
  Step 10000 centroid norm: 0.162018

Pairwise distances:
  Step 0 max: 0.000000 (should be ~0 for σ=0)
  Step 10000 max: 1.0715

Gradient magnitudes (first step):
  Mean: 8.051573e-02
  Max: 8.757275e-01

Adam state (after warmup):
  Momentum at step 100: 1.042301e-03
  Variance at step 100: 8.490872e-06

Loss trajectory:
  Step 1: 4.8438
  Step 10000: 2.9062
  Reduction: 1.9375
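
The step-1 loss can be checked by hand. With tied weights and all 128 embedding rows identical, every logit is the same dot product, so the first prediction is uniform over the vocabulary and the loss is ln 128 whatever the rest of the network computes. The sketch below evaluates that baseline and rounds it to bfloat16, the dtype the recorder stores losses in; `to_bf16` is a hypothetical helper emulating round-to-nearest-even.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even, no NaN/inf handling)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

VOCAB_SIZE = 128
uniform_loss = math.log(VOCAB_SIZE)
print(f"{uniform_loss:.4f}")            # 4.8520
print(to_bf16(uniform_loss))            # 4.84375
print(f"{math.exp(2.9062):.1f}")        # perplexity implied by the final loss
```

The rounded baseline, 4.84375, is consistent with the 4.8438 printed above. It also shows the resolution of the stored loss curve: between 2 and 4 adjacent bfloat16 values are 2⁻⁶ ≈ 0.0156 apart, so the recorded losses cannot resolve changes smaller than that late in training.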



## Summary

In [16]:
print(f"\n{'='*80}")
print(f"SUMMARY")
print(f"{'='*80}\n")
print(f"Model: {N_LAYER}L, {N_HEAD}H, {HIDDEN_DIM}D ({total_params:,} params)")
print(f"Training: {NUM_TRAIN_STEPS:,} steps, batch {BATCH_SIZE}")
print(f"Initialization: Supermassive BH (σ = {SIGMA:.0e}, bfloat16-native)")
print(f"Final loss: {losses[-1].float().item():.4f}")
print()
print(f"Data saved: {output_path}")
print(f"Size: {file_size_mb:.1f} MB")
print()
print(f"Next steps:")
print(f"  1. Extract W[0], W[1], W[2], W[10000] from saved data")
print(f"  2. Run 1.13a on each to check for lattice structure")
print(f"  3. Look for spongecrystal formation at different timesteps")
print(f"{'='*80}")


SUMMARY

Model: 2L, 2H, 64D (116,480 params)
Training: 10,000 steps, batch 1
Initialization: Supermassive BH (σ = 0e+00, bfloat16-native)
Final loss: 2.9062

Data saved: ../tensors/Lil_Gatsby/1.12b_training_data_sigma0e+00.safetensors
Size: 658.1 MB

Next steps:
  1. Extract W[0], W[1], W[2], W[10000] from saved data
  2. Run 1.13a on each to check for lattice structure
  3. Look for spongecrystal formation at different timesteps
