# 1.20f: Flannel 6 - Determinism Test

**Purpose:** Test whether training from a fixed initial W is deterministic, or whether the training seed changes the trajectory.

## The Question

Starting from the **same initialization**, does the training seed (batch ordering and any other stochastic factor) send runs along different trajectories, or is the trajectory fixed by the initialization alone?

## Experimental Design

- **One initialization**: Create W using seed=42
- **Ten training runs**: Train with seeds 42-51, all starting from the same W
- **Same everything else**: Model architecture, data, optimizer, hyperparameters
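
The separation of the two seeds can be sketched in isolation. This is a minimal NumPy illustration, not the notebook's training code: `make_initial_W` and `batch_order` are hypothetical helpers, and the small sizes are chosen only to keep the example fast. It shows that a dedicated generator for initialization reproduces the same W on every call, while the training seed only determines the order in which examples are visited.

```python
import numpy as np

INIT_SEED = 42

def make_initial_W(vocab_size=8, hidden_dim=4, scale=0.02):
    """Draw W from its own generator so no other RNG use can perturb it."""
    rng = np.random.default_rng(INIT_SEED)
    return rng.standard_normal((vocab_size, hidden_dim)) * scale

def batch_order(train_seed, n_examples=10):
    """Return the shuffled example order implied by a training seed."""
    rng = np.random.default_rng(train_seed)
    return rng.permutation(n_examples)

# The initialization is reproducible regardless of what happens in between.
W_a = make_initial_W()
W_b = make_initial_W()
same_init = np.array_equal(W_a, W_b)

# The same training seed gives the same order; a different seed
# will typically give a different one.
order_42 = batch_order(42)
order_42_again = batch_order(42)
order_43 = batch_order(43)
same_seed_same_order = np.array_equal(order_42, order_42_again)

print("same init:", same_init)
print("seed 42 order:", order_42)
print("seed 43 order:", order_43)
```

Using a separate generator per purpose avoids relying on the order of calls to a single global RNG, which is the main way an "initialization seed" and a "training seed" can leak into each other.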

## Expected Outcomes

**If training is deterministic:**
- All 10 trajectories will be identical: `||W[i,t] - W[j,t]||_F = 0` for all i,j,t
- Only initialization matters; Flannel 4's variation comes entirely from different starting points

**If training has randomness:**
- Trajectories will diverge over time
- Batch ordering or other stochastic factors affect evolution
- Need to treat each run as exploring a different possible future from the same past
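
The distance `||W[i,t] - W[j,t]||_F` used in the criterion above can be computed for every pair of runs and every step at once. The sketch below uses NumPy on a toy tensor with the same axis layout as the recorded data, `(runs, steps, vocab, dim)`; the function name `pairwise_frobenius` and the toy sizes are assumptions for illustration.

```python
import numpy as np

def pairwise_frobenius(W):
    """W: (runs, steps, vocab, dim) -> distances of shape (runs, runs, steps)."""
    diff = W[:, None] - W[None, :]              # (runs, runs, steps, vocab, dim)
    return np.sqrt((diff ** 2).sum(axis=(-2, -1)))

# Toy trajectories: 3 runs, 4 steps, a 5x2 matrix, all starting identical.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((5, 2))
W = np.tile(W0, (3, 4, 1, 1))                   # shape (3, 4, 5, 2)

# Let run 2 drift by 0.1 per step in every entry; runs 0 and 1 stay identical.
W[2, 1:] += 0.1 * np.arange(1, 4)[:, None, None]

D = pairwise_frobenius(W)
print(D[0, 1])   # runs 0 and 1 never separate
print(D[0, 2])   # distance grows linearly with the step index
```

Broadcasting all pairs at once is only practical for small tensors. For the full `(10, 1001, 10000, 64)` recording, looping over pairs (or over steps) keeps memory use bounded.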

## Parameters

In [1]:
# === BATCH EXPERIMENT CONFIG ===
NUM_RUNS = 10          # Number of training runs from same initialization
INIT_SEED = 42         # Seed for creating initial W (fixed)
BASE_TRAIN_SEED = 42   # First training seed (subsequent: 43, 44, ...)

# === RECORDING CONFIG ===
RECORD_CONFIG = {
    'W': True,
    'grads': False,
    'momentum': False,
    'variance': False,
    'logits': False,
    'losses': True,
}

# === MODEL ARCHITECTURE ===
VOCAB_SIZE = 10000
HIDDEN_DIM = 64
N_LAYER = 2
N_HEAD = 2
MAX_SEQ_LEN = 128

# === TRAINING CONFIG ===
BATCH_SIZE = 32
NUM_TRAIN_STEPS = 1000
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.0

# Optimizer: Adam
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8

# Initialization
INIT_SCALE = 0.02  # Std of the normal init: W ~ N(0, 0.02^2)

# === DATA PATHS ===
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"
CORPUS_PATH = "../data/flannel_model_corpus.txt"
TOKEN_MASK_PATH = "../tensors/Flannel/live_dead_tokens.safetensors"
OUTPUT_DIR = "../tensors/Flannel"
OUTPUT_FILE = "1.20f_flannel_6.safetensors"

print("✓ Parameters set")

✓ Parameters set


## Imports

In [2]:
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments
from tokenizers import Tokenizer
from torch.utils.data import Dataset
import numpy as np
from pathlib import Path
from safetensors.torch import save_file, load_file
import time

print("✓ Imports complete")

✓ Imports complete


## Memory & Disk Requirements

In [3]:
print(f"\n{'='*80}")
print(f"MEMORY & DISK REQUIREMENTS")
print(f"{'='*80}\n")

bytes_per_element = 2  # bfloat16

# Calculate recording size
tensor_sizes = {}
if RECORD_CONFIG['W']:
    tensor_sizes['W'] = NUM_RUNS * (NUM_TRAIN_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_per_element
if RECORD_CONFIG['losses']:
    tensor_sizes['losses'] = NUM_RUNS * (NUM_TRAIN_STEPS+1) * bytes_per_element

total_recorded = sum(tensor_sizes.values())

# Model memory
embedding_params = VOCAB_SIZE * HIDDEN_DIM
params_per_layer = 12 * HIDDEN_DIM**2
transformer_params = N_LAYER * params_per_layer
total_model_params = embedding_params + transformer_params
model_memory = total_model_params * bytes_per_element
optimizer_memory = 2 * total_model_params * 4

peak_ram = total_recorded + model_memory + optimizer_memory

print(f"Experiment: {NUM_RUNS} runs from SAME initialization")
print(f"  Initialization seed: {INIT_SEED}")
print(f"  Training seeds:      {BASE_TRAIN_SEED}–{BASE_TRAIN_SEED + NUM_RUNS - 1}")
print(f"  Steps per run:       {NUM_TRAIN_STEPS:,}")
print()
print(f"Recording: {', '.join([k for k, v in RECORD_CONFIG.items() if v])}")
print(f"  Total data:  {total_recorded/1e9:.2f} GB")
print()
print(f"Model parameters: {total_model_params:,}")
print(f"  Model (bf16):     {model_memory/1e9:.2f} GB")
print(f"  Optimizer (fp32): {optimizer_memory/1e9:.2f} GB")
print()
print(f"{'─'*80}")
print(f"PEAK RAM:     {peak_ram/1e9:.2f} GB")
print(f"DISK NEEDED:  {total_recorded/1e9:.2f} GB")
print(f"{'─'*80}")

if peak_ram <= 24e9:
    print(f"\n✓ Resources within budget\n")
else:
    print(f"\n⚠️  WARNING: Exceeds 24 GB RAM budget!\n")

print(f"{'='*80}\n")


MEMORY & DISK REQUIREMENTS

Experiment: 10 runs from SAME initialization
  Initialization seed: 42
  Training seeds:      42–51
  Steps per run:       1,000

Recording: W, losses
  Total data:  12.81 GB

Model parameters: 738,304
  Model (bf16):     0.00 GB
  Optimizer (fp32): 0.01 GB

────────────────────────────────────────────────────────────────────────────────
PEAK RAM:     12.82 GB
DISK NEEDED:  12.81 GB
────────────────────────────────────────────────────────────────────────────────

✓ Resources within budget




## Device Detection

In [4]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load Data

In [5]:
# Tokenizer
print(f"Loading tokenizer: {TOKENIZER_PATH}")
tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
print(f"  ✓ Vocabulary: {tokenizer.get_vocab_size():,} tokens\n")

# Corpus
print(f"Loading corpus: {CORPUS_PATH}")
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()
encoding = tokenizer.encode(corpus_text)
tokens = encoding.ids
corpus_tensor = torch.tensor(tokens, dtype=torch.long, device=device)
print(f"  ✓ Tokens: {len(tokens):,}\n")

# Token masks
print(f"Loading token masks: {TOKEN_MASK_PATH}")
mask_data = load_file(TOKEN_MASK_PATH)
dead_indices = mask_data['dead_indices']
n_dead = mask_data['dead_mask'].sum().item()
n_live = mask_data['live_mask'].sum().item()
print(f"  ✓ Live: {n_live:,} | Dead: {n_dead:,}")

Loading tokenizer: ../data/flannel_tokenizer_chars.json
  ✓ Vocabulary: 10,000 tokens

Loading corpus: ../data/flannel_model_corpus.txt
  ✓ Tokens: 1,371,328

Loading token masks: ../tensors/Flannel/live_dead_tokens.safetensors
  ✓ Live: 6,301 | Dead: 3,699


## Dataset

In [6]:
class TokenDataset(Dataset):
    def __init__(self, corpus_tensor, max_seq_len):
        self.corpus = corpus_tensor
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.corpus) - self.max_seq_len)
    
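    def __getitem__(self, idx):
        chunk = self.corpus[idx : idx + self.max_seq_len + 1]
        # NOTE: GPT2LMHeadModel shifts labels by one position internally when
        # computing the loss, so these pre-shifted labels make the effective
        # target the token two positions ahead. Passing labels identical to
        # input_ids gives the standard next-token objective.
        return {
            'input_ids': chunk[:-1],
            'labels': chunk[1:]
        }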

dataset = TokenDataset(corpus_tensor, MAX_SEQ_LEN)
print(f"\n✓ Dataset: {len(dataset):,} examples")


✓ Dataset: 1,371,200 examples


## Create Initial Embedding Matrix

**Critical:** Initialize W **once** using seed=42, save it, then reuse for all training runs.

In [7]:
print(f"\nCreating initial embedding matrix (seed={INIT_SEED})...\n")

# Set seed for initialization ONLY
torch.manual_seed(INIT_SEED)
np.random.seed(INIT_SEED)

# Create initial W in float32, then convert to bfloat16
W_initial_f32 = torch.randn(VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32) * INIT_SCALE
W_initial = W_initial_f32.to(torch.bfloat16)

print(f"  Shape: {tuple(W_initial.shape)}")
print(f"  Dtype: {W_initial.dtype}")
print(f"  Mean:  {W_initial.float().mean():.6f}")
print(f"  Std:   {W_initial.float().std():.6f}")
print(f"\n✓ Initial W created and frozen")


Creating initial embedding matrix (seed=42)...

  Shape: (10000, 64)
  Dtype: torch.bfloat16
  Mean:  -0.000041
  Std:   0.020019

✓ Initial W created and frozen


## Pre-allocate Recording Tensors

In [8]:
print("\nPre-allocating recording tensors...\n")

tensors = {}

if RECORD_CONFIG['W']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensors['W'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  W:        {shape}")

if RECORD_CONFIG['losses']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1)
    tensors['losses'] = torch.full(shape, float('nan'), dtype=torch.bfloat16)
    print(f"  losses:   {shape}")

print(f"\n✓ All tensors allocated on CPU")


Pre-allocating recording tensors...

  W:        (10, 1001, 10000, 64)
  losses:   (10, 1001)

✓ All tensors allocated on CPU


## Batch Recorder

In [9]:
class BatchRecorder:
    """Records data directly into pre-allocated tensors."""
    
    def __init__(self, tensors, record_config, run_idx):
        self.tensors = tensors
        self.config = record_config
        self.run_idx = run_idx
        self.current_step = 0
        self.recorded_initial = False
        self.loss_value = None
    
    def record_initial_state(self, model, optimizer):
        """Record step 0."""
        if not self.recorded_initial:
            t = 0
            if self.config['W']:
                self.tensors['W'][self.run_idx, t] = model.transformer.wte.weight.data.clone().cpu().bfloat16()
            self.recorded_initial = True
            self.current_step = 1
            print(f"    ✓ Recorded initial state (t=0)")
    
    def record_before_step(self, model, loss, logits):
        """Capture data after backward, before optimizer step."""
        if self.config['losses']:
            self.loss_value = loss.item()
    
    def record_after_step(self, model, optimizer):
        """Record data after optimizer step."""
        t = self.current_step
        
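        # Skip steps beyond the pre-allocated range. The limit is computed first
        # because `a > b if c else d` parses as `(a > b) if c else d`, which would
        # make the guard always true (inf is truthy) when W is not recorded.
        max_t = self.tensors['W'].shape[1] - 1 if 'W' in self.tensors else float('inf')
        if t > max_t:
            return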
        
        if self.config['W']:
            self.tensors['W'][self.run_idx, t] = model.transformer.wte.weight.data.clone().cpu().bfloat16()
        
        if self.config['losses'] and self.loss_value is not None:
            self.tensors['losses'][self.run_idx, t] = self.loss_value
            self.loss_value = None
        
        if t % 100 == 0:
            print(f"    Step {t}")
        
        self.current_step += 1

print("✓ Recorder class defined")

✓ Recorder class defined


## Instrumented Trainer

In [10]:
class InstrumentedTrainer(Trainer):
    def __init__(self, recorder, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.recorder = recorder
        self.last_logits = None

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        outputs = model(**inputs)
        loss = outputs.loss
        self.last_logits = outputs.logits
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs, num_items_in_batch=None):
        loss = super().training_step(model, inputs, num_items_in_batch)
        self.recorder.record_before_step(model, loss, self.last_logits)
        return loss

    def _maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time=None, **kwargs):
        self.recorder.record_after_step(model, self.optimizer)
        super()._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time, **kwargs)

print("✓ InstrumentedTrainer defined")

✓ InstrumentedTrainer defined


## Training Loop

**Key difference from Flannel 4:** All runs start from the **same** W_initial; only the training seed varies.

In [11]:
print(f"\n{'='*80}")
print(f"FLANNEL 6: DETERMINISM TEST")
print(f"{'='*80}\n")

print(f"Configuration:")
print(f"  Runs:                {NUM_RUNS}")
print(f"  Steps per run:       {NUM_TRAIN_STEPS:,}")
print(f"  Initialization seed: {INIT_SEED} (FIXED for all runs)")
print(f"  Training seeds:      {BASE_TRAIN_SEED}–{BASE_TRAIN_SEED + NUM_RUNS - 1}")
print(f"  Recording:           {', '.join([k for k, v in RECORD_CONFIG.items() if v])}")
print(f"\n{'='*80}\n")

experiment_start = time.time()

for run_idx in range(NUM_RUNS):
    train_seed = BASE_TRAIN_SEED + run_idx
    
    print(f"\n{'='*80}")
    print(f"RUN {run_idx + 1}/{NUM_RUNS} (train_seed={train_seed})")
    print(f"{'='*80}\n")
    
    # Set training seed (NOT initialization seed!)
    torch.manual_seed(train_seed)
    np.random.seed(train_seed)
    
    # Create model
    config = GPT2Config(
        vocab_size=VOCAB_SIZE,
        n_positions=MAX_SEQ_LEN,
        n_embd=HIDDEN_DIM,
        n_layer=N_LAYER,
        n_head=N_HEAD,
        resid_pdrop=0.0,
        embd_pdrop=0.0,
        attn_pdrop=0.0,
        tie_word_embeddings=True,
    )
    
    model = GPT2LMHeadModel(config).to(torch.bfloat16).to(device)
    
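    # CRITICAL: Use the SAME initial W for all runs.
    # Only W (wte, tied to the LM head) is shared: the other weights were drawn by
    # GPT2LMHeadModel(config) under train_seed above, so they differ between runs.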
    with torch.no_grad():
        model.transformer.wte.weight[:] = W_initial.to(device)
    
    print(f"  ✓ Model initialized with SHARED W_initial (seed={INIT_SEED})")
    print(f"  ✓ Training seed set to {train_seed}")
    
    # Create recorder
    recorder = BatchRecorder(tensors, RECORD_CONFIG, run_idx)
    
    # Training args
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        max_steps=NUM_TRAIN_STEPS,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,
        weight_decay=WEIGHT_DECAY,
        adam_beta1=ADAM_BETA1,
        adam_beta2=ADAM_BETA2,
        adam_epsilon=ADAM_EPSILON,
        optim="adamw_torch",
        logging_steps=1000,
        save_steps=NUM_TRAIN_STEPS + 1,
        save_total_limit=0,
        dataloader_num_workers=0,
        dataloader_pin_memory=False,
        bf16=True,
        seed=train_seed,  # THIS controls data shuffling
        report_to="none",
        disable_tqdm=True,
    )
    
    trainer = InstrumentedTrainer(
        recorder=recorder,
        model=model,
        args=training_args,
        train_dataset=dataset,
    )
    
    # Record initial state
    recorder.record_initial_state(model, trainer.optimizer)
    
    # Train
    print(f"  Training...")
    run_start = time.time()
    trainer.train()
    run_elapsed = time.time() - run_start
    
    print(f"\n  ✓ Run {run_idx + 1} complete ({run_elapsed:.1f}s)")
    
    # Clean up
    del model, trainer, recorder
    
    if device == 'mps':
        torch.mps.empty_cache()
    elif device == 'cuda':
        torch.cuda.empty_cache()

experiment_elapsed = time.time() - experiment_start

print(f"\n{'='*80}")
print(f"✓ All {NUM_RUNS} runs complete")
print(f"  Total time: {experiment_elapsed:.1f}s ({experiment_elapsed/60:.1f} minutes)")
print(f"  Average per run: {experiment_elapsed/NUM_RUNS:.1f}s")
print(f"{'='*80}")


FLANNEL 6: DETERMINISM TEST

Configuration:
  Runs:                10
  Steps per run:       1,000
  Initialization seed: 42 (FIXED for all runs)
  Training seeds:      42–51
  Recording:           W, losses



RUN 1/10 (train_seed=42)

  ✓ Model initialized with SHARED W_initial (seed=42)
  ✓ Training seed set to 42
    ✓ Recorded initial state (t=0)
  Training...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


    Step 100
    Step 200
    Step 300
    Step 400
    Step 500
    Step 600
    Step 700
    Step 800
    Step 900
    Step 1000
{'loss': 7.1151, 'grad_norm': 0.2197265625, 'learning_rate': 1e-06, 'epoch': 0.023337222870478413}
{'train_runtime': 26.9225, 'train_samples_per_second': 1188.595, 'train_steps_per_second': 37.144, 'train_loss': 7.11508935546875, 'epoch': 0.023337222870478413}

  ✓ Run 1 complete (27.0s)

RUN 2/10 (train_seed=43)

  ✓ Model initialized with SHARED W_initial (seed=42)
  ✓ Training seed set to 43
    ✓ Recorded initial state (t=0)
  Training...
    Step 100
    Step 200
    Step 300
    Step 400
    Step 500
    Step 600
    Step 700
    Step 800
    Step 900
    Step 1000
{'loss': 7.1049, 'grad_norm': 0.23828125, 'learning_rate': 1e-06, 'epoch': 0.023337222870478413}
{'train_runtime': 25.7111, 'train_samples_per_second': 1244.6, 'train_steps_per_second': 38.894, 'train_loss': 7.10493408203125, 'epoch': 0.023337222870478413}

  ✓ Run 2 complete (25.8s)

RUN 3

## Quick Determinism Check

Before saving, let's check if the runs are identical.

In [12]:
print(f"\n{'='*80}")
print(f"DETERMINISM CHECK")
print(f"{'='*80}\n")

# Check if all runs have identical W at t=0 (should be true by construction)
W_all = tensors['W']
max_diff_t0 = 0.0
for i in range(NUM_RUNS):
    for j in range(i+1, NUM_RUNS):
        diff = (W_all[i, 0] - W_all[j, 0]).abs().max().item()
        max_diff_t0 = max(max_diff_t0, diff)

print(f"t=0 (initialization):")
print(f"  Max difference between any two runs: {max_diff_t0:.10f}")
if max_diff_t0 < 1e-6:
    print(f"  ✓ All runs start from identical W\n")
else:
    print(f"  ⚠️  UNEXPECTED: Runs have different initializations!\n")

# Check final state
max_diff_final = 0.0
for i in range(NUM_RUNS):
    for j in range(i+1, NUM_RUNS):
        diff = (W_all[i, -1] - W_all[j, -1]).abs().max().item()
        max_diff_final = max(max_diff_final, diff)

print(f"t={NUM_TRAIN_STEPS} (final):")
print(f"  Max difference between any two runs: {max_diff_final:.10f}")
if max_diff_final < 1e-6:
    print(f"  ✓ TRAINING IS DETERMINISTIC")
    print(f"  → All trajectories are identical given the same initialization")
    print(f"  → Only initialization seed matters for Flannel dynamics")
else:
    print(f"  ✓ TRAINING HAS RANDOMNESS")
    print(f"  → Different training seeds produce different trajectories")
    print(f"  → Batch ordering or other stochastic factors affect evolution")

print(f"\n{'='*80}")


DETERMINISM CHECK

t=0 (initialization):
  Max difference between any two runs: 0.0000000000
  ✓ All runs start from identical W

t=1000 (final):
  Max difference between any two runs: 0.3867187500
  ✓ TRAINING HAS RANDOMNESS
  → Different training seeds produce different trajectories
  → Batch ordering or other stochastic factors affect evolution
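
The check above compares only t=0 and the final step. To locate where the runs first separate, each run can be compared against run 0 at every step. The following is a NumPy sketch on toy data with the same `(runs, steps, vocab, dim)` layout; `first_divergence_step` and the `tol` parameter are hypothetical names, not part of the notebook.

```python
import numpy as np

def first_divergence_step(W, tol=0.0):
    """Return the first step where any run differs from run 0 by more than tol.

    W has shape (runs, steps, vocab, dim). Returns None if no step qualifies.
    """
    # Max absolute difference to run 0, per run and step: shape (runs, steps).
    gap = np.abs(W - W[0][None]).max(axis=(-2, -1))
    diverged = (gap > tol).any(axis=0)          # shape (steps,)
    return int(np.argmax(diverged)) if diverged.any() else None

# Toy data: 3 runs, 5 steps; run 1 departs from the others at step 3.
W_toy = np.zeros((3, 5, 2, 2))
W_toy[1, 3:] += 1.0

step = first_divergence_step(W_toy)
no_divergence = first_divergence_step(np.zeros((3, 5, 2, 2)))
print(step, no_divergence)
```

With `tol=0.0`, comparing against run 0 is equivalent to comparing all pairs: if every run matches run 0 the runs all match each other, and any differing pair implies at least one run differs from run 0. On the full recording this expression materializes a copy of the whole tensor, so iterating over steps may be preferable.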



## Save Data

In [13]:
print(f"\nSaving data...\n")

# Build save dictionary
save_dict = {
    # Metadata
    'n_runs': torch.tensor(NUM_RUNS, dtype=torch.long),
    'init_seed': torch.tensor(INIT_SEED, dtype=torch.long),
    'base_train_seed': torch.tensor(BASE_TRAIN_SEED, dtype=torch.long),
    'n_steps': torch.tensor(NUM_TRAIN_STEPS, dtype=torch.long),
    'n_live': torch.tensor(n_live, dtype=torch.long),
    'n_dead': torch.tensor(n_dead, dtype=torch.long),
    'vocab_size': torch.tensor(VOCAB_SIZE, dtype=torch.long),
    'hidden_dim': torch.tensor(HIDDEN_DIM, dtype=torch.long),
    'init_scale': torch.tensor(INIT_SCALE, dtype=torch.float32),
    'learning_rate': torch.tensor(LEARNING_RATE, dtype=torch.float32),
    'weight_decay': torch.tensor(WEIGHT_DECAY, dtype=torch.float32),
    'adam_beta1': torch.tensor(ADAM_BETA1, dtype=torch.float32),
    'adam_beta2': torch.tensor(ADAM_BETA2, dtype=torch.float32),
    # Initial W (the shared starting point)
    'W_initial': W_initial,
    # Record config
    'recorded_W': torch.tensor(RECORD_CONFIG['W'], dtype=torch.bool),
    'recorded_losses': torch.tensor(RECORD_CONFIG['losses'], dtype=torch.bool),
}

# Add recorded tensors
for name, tensor in tensors.items():
    save_dict[name] = tensor
    print(f"  {name:12} {str(tuple(tensor.shape)):30}")

# Save
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
output_path = Path(OUTPUT_DIR) / OUTPUT_FILE

print(f"\nSaving to: {output_path}\n")

save_start = time.time()
save_file(save_dict, str(output_path))
save_elapsed = time.time() - save_start

file_size_mb = output_path.stat().st_size / 1e6
file_size_gb = file_size_mb / 1000

print(f"✓ Saved successfully")
print(f"  File: {output_path.name}")
print(f"  Size: {file_size_mb:.1f} MB ({file_size_gb:.2f} GB)")
print(f"  Save time: {save_elapsed:.1f}s")


Saving data...

  W            (10, 1001, 10000, 64)         
  losses       (10, 1001)                    

Saving to: ../tensors/Flannel/1.20f_flannel_6.safetensors

✓ Saved successfully
  File: 1.20f_flannel_6.safetensors
  Size: 12814.1 MB (12.81 GB)
  Save time: 11.2s


## Summary

In [14]:
print(f"\n{'='*80}")
print(f"FLANNEL 6 COMPLETE")
print(f"{'='*80}\n")

print(f"Experiment: {NUM_RUNS} runs from SAME initialization")
print(f"  Initialization seed:  {INIT_SEED} (shared by all runs)")
print(f"  Training seeds:       {BASE_TRAIN_SEED}–{BASE_TRAIN_SEED + NUM_RUNS - 1}")
print(f"  Steps per run:        {NUM_TRAIN_STEPS:,}")
print(f"  Recorded:             {', '.join([k for k, v in RECORD_CONFIG.items() if v])}")
print()
print(f"Data saved: {output_path}")
print(f"  Size: {file_size_gb:.2f} GB")
print(f"  Total experiment time: {experiment_elapsed/60:.1f} minutes")
print()
print(f"Max divergence at t={NUM_TRAIN_STEPS}: {max_diff_final:.10f}")
print()
if max_diff_final < 1e-6:
    print(f"CONCLUSION: Training is DETERMINISTIC")
    print(f"  → Flannel 4's variation comes entirely from different initializations")
    print(f"  → Training seed only affects batch ordering, which doesn't matter")
else:
    print(f"CONCLUSION: Training has RANDOMNESS")
    print(f"  → Same initialization can lead to different outcomes")
    print(f"  → Training seed affects trajectory (batch order matters)")
print()
print(f"Next steps:")
print(f"  1. Analyze pairwise Frobenius norms over time (use 1.23d)")
print(f"  2. Compare to Flannel 4 divergence patterns")
print(f"  3. Quantify when/how trajectories diverge (if at all)")

print(f"\n{'='*80}")


FLANNEL 6 COMPLETE

Experiment: 10 runs from SAME initialization
  Initialization seed:  42 (shared by all runs)
  Training seeds:       42–51
  Steps per run:        1,000
  Recorded:             W, losses

Data saved: ../tensors/Flannel/1.20f_flannel_6.safetensors
  Size: 12.81 GB
  Total experiment time: 4.3 minutes

Max divergence at t=1000: 0.3867187500

CONCLUSION: Training has RANDOMNESS
  → Same initialization can lead to different outcomes
  → Training seed affects trajectory (batch order matters)

Next steps:
  1. Analyze pairwise Frobenius norms over time (use 1.23d)
  2. Compare to Flannel 4 divergence patterns
  3. Quantify when/how trajectories diverge (if at all)

