# 1.20e: Flannel 5 - Initialization Scale Sweep

**Purpose:** Test whether the Inhale-Sneeze-Fimbulwinter epochs depend on initialization scale σ.

## Research Question

Flannel 4 confirmed that the five epochs are reproducible across different random seeds at σ=0.02.

**But:** Are these dynamics universal, or specific to σ=0.02?

## Experimental Design

- **Fixed:** Random seed (42), all other hyperparameters
- **Swept:** Initialization scale σ ∈ [0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045]
- **Control:** σ=0.020 (the 4th of the 9 runs, i.e. run index 3 when counting from 0; our known reference point)

## Expected Outcomes

- **Hypothesis 1:** Epoch structure persists, but timing/magnitude scale with σ.
- **Hypothesis 2:** Phase transition at some critical σ where dynamics change qualitatively.
- **Hypothesis 3:** Fimbulwinter timing depends on σ (gradient magnitudes → quantization freeze).
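
One plausible reading of the "quantization freeze" in Hypothesis 3 is that a bfloat16 weight stops moving once its update falls below half the spacing between adjacent bfloat16 values, and that this spacing grows with the weight's magnitude (and so with σ). The sketch below illustrates that mechanism only. It emulates bfloat16 rounding in pure Python; `to_bf16` and `bf16_ulp` are hypothetical helpers written for this illustration, not part of the notebook, and the update sizes are arbitrary.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # rounding bias, ties to even
    bits &= 0xFFFF0000                                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def bf16_ulp(x: float) -> float:
    """Spacing between adjacent bfloat16 values around x (normal range only)."""
    _, exp = math.frexp(abs(x))    # abs(x) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 8)        # 8 significant bits: 7 stored + 1 implicit

# Grid spacing for a weight whose magnitude is about sigma
for sigma in (0.005, 0.020, 0.045):
    print(f"σ={sigma:.3f}  bf16 value={to_bf16(sigma):.8f}  spacing={bf16_ulp(sigma):.2e}")

# An update well below half the spacing is rounded away; a larger one survives.
w = to_bf16(0.020)
tiny_update, large_update = 1e-5, 1e-3   # illustrative magnitudes, not measured values
print("tiny update lost: ", to_bf16(w + tiny_update) == w)
print("large update kept:", to_bf16(w + large_update) != w)
```

If this mechanism is what drives the Fimbulwinter, the freeze should set in at larger update sizes for larger σ, which is one thing the sweep can check.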

## Data

- 9 runs × 1000 steps each
- Recording: W + losses
- Storage: ~11.5 GB
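
The storage figure follows directly from the recorded shapes. As a quick check, the sketch below redoes the arithmetic with the values from the Parameters cell, assuming 2 bytes per element (bfloat16) and one snapshot per step plus the initial state.

```python
# Values copied from the Parameters cell
NUM_RUNS = 9
NUM_TRAIN_STEPS = 1000
VOCAB_SIZE = 10_000
HIDDEN_DIM = 64
BYTES_BF16 = 2

snapshots = NUM_TRAIN_STEPS + 1   # step 0 (initial state) plus one per training step

w_bytes = NUM_RUNS * snapshots * VOCAB_SIZE * HIDDEN_DIM * BYTES_BF16
loss_bytes = NUM_RUNS * snapshots * BYTES_BF16

print(f"W:      {w_bytes / 1e9:.2f} GB")
print(f"losses: {loss_bytes / 1e3:.1f} kB")
print(f"total:  {(w_bytes + loss_bytes) / 1e9:.2f} GB")
```

The embedding snapshots dominate completely; the loss curve adds only a few kilobytes.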

## Parameters

In [1]:
# === SWEEP CONFIG ===
# Sweep initialization scale, fixed seed
INIT_SCALES = [0.005, 0.010, 0.015, 0.020, 0.025, 0.030, 0.035, 0.040, 0.045]
NUM_RUNS = len(INIT_SCALES)  # 9 runs
FIXED_SEED = 42  # Same seed for all runs

# === RECORDING CONFIG ===
RECORD_CONFIG = {
    'W': True,           # Embedding matrix (ALWAYS RECOMMENDED)
    'grads': False,      # Gradients ∂L/∂W
    'momentum': False,   # Adam exp_avg
    'variance': False,   # Adam exp_avg_sq
    'logits': False,     # Model outputs (large!)
    'losses': True,      # Loss per step (tiny, always useful)
}

# === MODEL ARCHITECTURE ===
VOCAB_SIZE = 10000
HIDDEN_DIM = 64
N_LAYER = 2
N_HEAD = 2
MAX_SEQ_LEN = 128

# === TRAINING CONFIG ===
BATCH_SIZE = 32
NUM_TRAIN_STEPS = 1000
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.0

# Optimizer: Adam (run via optim="adamw_torch"; with WEIGHT_DECAY = 0.0, AdamW reduces to plain Adam)
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8

# === DATA PATHS ===
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"
CORPUS_PATH = "../data/flannel_model_corpus.txt"
TOKEN_MASK_PATH = "../tensors/Flannel/live_dead_tokens.safetensors"
OUTPUT_DIR = "../tensors/Flannel"
OUTPUT_FILE = "1.20e_flannel_5.safetensors"

print("✓ Parameters set")

✓ Parameters set


## Imports

In [2]:
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments
from tokenizers import Tokenizer
from torch.utils.data import Dataset
import numpy as np
from pathlib import Path
from safetensors.torch import save_file, load_file
import time

print("✓ Imports complete")

✓ Imports complete


## Memory & Disk Requirements Calculator

In [3]:
print(f"\n{'='*80}")
print(f"MEMORY & DISK REQUIREMENTS")
print(f"{'='*80}\n")

bytes_per_element = 2  # bfloat16

# Calculate size for each enabled recording
tensor_sizes = {}
tensor_shapes = {}

if RECORD_CONFIG['W']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensor_shapes['W'] = shape
    tensor_sizes['W'] = np.prod(shape) * bytes_per_element

if RECORD_CONFIG['grads']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensor_shapes['grads'] = shape
    tensor_sizes['grads'] = np.prod(shape) * bytes_per_element

if RECORD_CONFIG['momentum']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensor_shapes['momentum'] = shape
    tensor_sizes['momentum'] = np.prod(shape) * bytes_per_element

if RECORD_CONFIG['variance']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensor_shapes['variance'] = shape
    tensor_sizes['variance'] = np.prod(shape) * bytes_per_element

if RECORD_CONFIG['logits']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE)
    tensor_shapes['logits'] = shape
    tensor_sizes['logits'] = np.prod(shape) * bytes_per_element

if RECORD_CONFIG['losses']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1)
    tensor_shapes['losses'] = shape
    tensor_sizes['losses'] = np.prod(shape) * bytes_per_element

total_recorded_data = sum(tensor_sizes.values())

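# Model memory (rough estimate: token embedding plus ~12·d² weights per layer;
# position embeddings, biases, and LayerNorm parameters are not counted)
embedding_params = VOCAB_SIZE * HIDDEN_DIM
params_per_layer = 12 * HIDDEN_DIM**2
transformer_params = N_LAYER * params_per_layer
total_model_params = embedding_params + transformer_params

model_memory = total_model_params * bytes_per_element
optimizer_memory = 2 * total_model_params * 4  # two Adam moment buffers, assumed 4 bytes (fp32) each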

peak_ram = total_recorded_data + model_memory + optimizer_memory
disk_space = total_recorded_data

# Display
print(f"Experiment configuration:")
print(f"  Sweep type: Initialization scale σ")
print(f"  σ values: {INIT_SCALES}")
print(f"  Runs: {NUM_RUNS}")
print(f"  Steps per run: {NUM_TRAIN_STEPS:,}")
print(f"  Fixed seed: {FIXED_SEED}")
print()

enabled_items = [k for k, v in RECORD_CONFIG.items() if v]
print(f"Recording {len(enabled_items)} item(s): {', '.join(enabled_items)}")
print()

for name, size_bytes in tensor_sizes.items():
    size_gb = size_bytes / 1e9
    shape = tensor_shapes[name]
    print(f"  {name:12} {str(shape):30} {size_gb:8.2f} GB")

print()
print(f"Model parameters: {total_model_params:,}")
print(f"  Model memory (bf16):     {model_memory/1e9:8.2f} GB")
print(f"  Optimizer state (fp32):  {optimizer_memory/1e9:8.2f} GB")
print()
print(f"{'─'*80}")
print(f"PEAK RAM ESTIMATE:         {peak_ram/1e9:8.2f} GB")
print(f"DISK SPACE NEEDED:         {disk_space/1e9:8.2f} GB")
print(f"{'─'*80}")

# Warnings
RAM_BUDGET = 24
DISK_BUDGET = 50

if peak_ram > RAM_BUDGET * 1e9:
    print(f"\n⚠️  WARNING: Peak RAM ({peak_ram/1e9:.1f} GB) exceeds {RAM_BUDGET} GB budget!")
    print(f"   Consider reducing NUM_TRAIN_STEPS or disabling expensive recordings.")

if disk_space > DISK_BUDGET * 1e9:
    print(f"\n⚠️  WARNING: Disk space ({disk_space/1e9:.1f} GB) exceeds {DISK_BUDGET} GB budget!")

if peak_ram <= RAM_BUDGET * 1e9 and disk_space <= DISK_BUDGET * 1e9:
    print(f"\n✓ Resources within budget. Ready to proceed.")

print(f"\n{'='*80}\n")


MEMORY & DISK REQUIREMENTS

Experiment configuration:
  Sweep type: Initialization scale σ
  σ values: [0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045]
  Runs: 9
  Steps per run: 1,000
  Fixed seed: 42

Recording 2 item(s): W, losses

  W            (9, 1001, 10000, 64)              11.53 GB
  losses       (9, 1001)                          0.00 GB

Model parameters: 738,304
  Model memory (bf16):         0.00 GB
  Optimizer state (fp32):      0.01 GB

────────────────────────────────────────────────────────────────────────────────
PEAK RAM ESTIMATE:            11.54 GB
DISK SPACE NEEDED:            11.53 GB
────────────────────────────────────────────────────────────────────────────────

✓ Resources within budget. Ready to proceed.




## Device Detection

In [4]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load Data

In [5]:
# Tokenizer
print(f"Loading tokenizer: {TOKENIZER_PATH}")
tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
print(f"  ✓ Vocabulary: {tokenizer.get_vocab_size():,} tokens\n")

# Corpus
print(f"Loading corpus: {CORPUS_PATH}")
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()
encoding = tokenizer.encode(corpus_text)
tokens = encoding.ids
corpus_tensor = torch.tensor(tokens, dtype=torch.long, device=device)
print(f"  ✓ Tokens: {len(tokens):,}\n")

# Token masks
print(f"Loading token masks: {TOKEN_MASK_PATH}")
mask_data = load_file(TOKEN_MASK_PATH)
dead_indices = mask_data['dead_indices']
n_dead = mask_data['dead_mask'].sum().item()
n_live = mask_data['live_mask'].sum().item()
print(f"  ✓ Live: {n_live:,} | Dead: {n_dead:,}")

Loading tokenizer: ../data/flannel_tokenizer_chars.json
  ✓ Vocabulary: 10,000 tokens

Loading corpus: ../data/flannel_model_corpus.txt
  ✓ Tokens: 1,371,328

Loading token masks: ../tensors/Flannel/live_dead_tokens.safetensors
  ✓ Live: 6,301 | Dead: 3,699


## Dataset

In [6]:
class TokenDataset(Dataset):
    def __init__(self, corpus_tensor, max_seq_len):
        self.corpus = corpus_tensor
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.corpus) - self.max_seq_len)
    
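    def __getitem__(self, idx):
        chunk = self.corpus[idx : idx + self.max_seq_len + 1]
        # NOTE: GPT2LMHeadModel already shifts labels by one position when it
        # computes the loss, so these pre-shifted labels pair each input token
        # with the token two positions ahead. For standard next-token
        # prediction, pass labels equal to input_ids.
        return {
            'input_ids': chunk[:-1],
            'labels': chunk[1:]
        }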

dataset = TokenDataset(corpus_tensor, MAX_SEQ_LEN)
print(f"\n✓ Dataset: {len(dataset):,} examples")


✓ Dataset: 1,371,200 examples
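
The example count and the fraction of the dataset seen in one run can be worked out from the numbers above. This is a small arithmetic sketch using the token count printed by the Load Data cell and the batch size and step count from the Parameters cell.

```python
# Values copied from the cells above
N_TOKENS = 1_371_328
MAX_SEQ_LEN = 128
BATCH_SIZE = 32
NUM_TRAIN_STEPS = 1000

# One example per start offset (stride 1), mirroring TokenDataset.__len__
n_examples = max(0, N_TOKENS - MAX_SEQ_LEN)

# Examples consumed by one run, and the share of one epoch that represents
samples_seen = BATCH_SIZE * NUM_TRAIN_STEPS
epoch_fraction = samples_seen / n_examples

print(f"examples:       {n_examples:,}")
print(f"samples seen:   {samples_seen:,}")
print(f"epoch fraction: {epoch_fraction:.6f}")
```

Because the window advances one token at a time, consecutive examples share 127 of their 128 input tokens, and each run covers only about 2.3% of one epoch.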


## Pre-allocate Recording Tensors

In [7]:
print("\nPre-allocating recording tensors...\n")

tensors = {}

if RECORD_CONFIG['W']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensors['W'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  W:        {shape}")

if RECORD_CONFIG['grads']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensors['grads'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  grads:    {shape}")

if RECORD_CONFIG['momentum']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensors['momentum'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  momentum: {shape}")

if RECORD_CONFIG['variance']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE, HIDDEN_DIM)
    tensors['variance'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  variance: {shape}")

if RECORD_CONFIG['logits']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1, VOCAB_SIZE)
    tensors['logits'] = torch.zeros(shape, dtype=torch.bfloat16)
    print(f"  logits:   {shape}")

if RECORD_CONFIG['losses']:
    shape = (NUM_RUNS, NUM_TRAIN_STEPS+1)
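    # NaN marks slots with no recorded loss (t=0 has none); bf16 keeps only ~3 significant digits per loss
    tensors['losses'] = torch.full(shape, float('nan'), dtype=torch.bfloat16)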
    print(f"  losses:   {shape}")

print(f"\n✓ All tensors allocated on CPU")


Pre-allocating recording tensors...

  W:        (9, 1001, 10000, 64)
  losses:   (9, 1001)

✓ All tensors allocated on CPU


## Batch Recorder

In [8]:
class BatchRecorder:
    """Records data directly into pre-allocated tensors."""
    
    def __init__(self, tensors, record_config, run_idx):
        self.tensors = tensors
        self.config = record_config
        self.run_idx = run_idx
        self.current_step = 0
        self.recorded_initial = False
        
        self.grad_before = None
        self.loss_value = None
        self.logits_sample = None
    
    def record_initial_state(self, model, optimizer):
        """Record step 0."""
        if not self.recorded_initial:
            t = 0
            
            if self.config['W']:
                self.tensors['W'][self.run_idx, t] = model.transformer.wte.weight.data.clone().cpu().bfloat16()
            
            self.recorded_initial = True
            self.current_step = 1
            print(f"    ✓ Recorded initial state (t=0)")
    
    def record_before_step(self, model, loss, logits):
        """Capture data after backward, before optimizer step."""
        if self.config['grads'] and model.transformer.wte.weight.grad is not None:
            self.grad_before = model.transformer.wte.weight.grad.clone().cpu().bfloat16()
        
        if self.config['losses']:
            self.loss_value = loss.item()
        
        if self.config['logits']:
            self.logits_sample = logits[0, -1, :].detach().cpu().bfloat16()
    
    def record_after_step(self, model, optimizer):
        """Record data after optimizer step."""
        t = self.current_step
        
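        # Ignore calls past the last pre-allocated step (no bound when W is not recorded)
        max_step = self.tensors['W'].shape[1] - 1 if 'W' in self.tensors else float('inf')
        if t > max_step:
            return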
        
        if self.config['W']:
            self.tensors['W'][self.run_idx, t] = model.transformer.wte.weight.data.clone().cpu().bfloat16()
        
        if self.config['grads'] and self.grad_before is not None:
            self.tensors['grads'][self.run_idx, t] = self.grad_before
            self.grad_before = None
        
        if self.config['momentum']:
            param = model.transformer.wte.weight
            if param in optimizer.state and 'exp_avg' in optimizer.state[param]:
                self.tensors['momentum'][self.run_idx, t] = optimizer.state[param]['exp_avg'].clone().cpu().bfloat16()
        
        if self.config['variance']:
            param = model.transformer.wte.weight
            if param in optimizer.state and 'exp_avg_sq' in optimizer.state[param]:
                self.tensors['variance'][self.run_idx, t] = optimizer.state[param]['exp_avg_sq'].clone().cpu().bfloat16()
        
        if self.config['logits'] and self.logits_sample is not None:
            self.tensors['logits'][self.run_idx, t] = self.logits_sample
            self.logits_sample = None
        
        if self.config['losses'] and self.loss_value is not None:
            self.tensors['losses'][self.run_idx, t] = self.loss_value
            self.loss_value = None
        
        if t % 100 == 0:
            print(f"    Step {t}")
        
        self.current_step += 1

print("✓ Recorder class defined")

✓ Recorder class defined
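
The recorder captures the loss before the optimizer step but writes it after the step, so index `t` of `losses` holds the loss computed with the weights stored at index `t-1` of `W`, and index 0 of `losses` stays NaN. The toy below is a sketch of that alignment only: `ToyRecorder` is a hypothetical scalar stand-in for `BatchRecorder`, and it uses plain gradient descent on `w**2` instead of Adam on a transformer.

```python
import math

class ToyRecorder:
    """Scalar stand-in for BatchRecorder: capture before the step, write after it."""

    def __init__(self, n_steps):
        self.W = [math.nan] * (n_steps + 1)
        self.losses = [math.nan] * (n_steps + 1)
        self.current_step = 0
        self.loss_value = None

    def record_initial_state(self, w):
        self.W[0] = w
        self.current_step = 1

    def record_before_step(self, loss):
        self.loss_value = loss          # loss from the forward pass, weights not yet updated

    def record_after_step(self, w):
        t = self.current_step
        self.W[t] = w                   # weights after the update
        self.losses[t] = self.loss_value
        self.loss_value = None
        self.current_step += 1

# Toy problem: minimise loss(w) = w**2 with plain gradient descent
w, lr, n_steps = 1.0, 0.25, 3
rec = ToyRecorder(n_steps)
rec.record_initial_state(w)
for _ in range(n_steps):
    loss = w ** 2                       # forward pass on pre-update weights
    grad = 2 * w                        # backward pass
    rec.record_before_step(loss)
    w = w - lr * grad                   # optimizer step
    rec.record_after_step(w)

for t in range(n_steps + 1):
    print(f"t={t}  W={rec.W[t]:.4f}  loss={rec.losses[t]}")
```

When analysing the saved file, this means a loss at step `t` should be compared against the embedding snapshot at `t-1`, not the one at `t`.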


## Instrumented Trainer

In [9]:
class InstrumentedTrainer(Trainer):
    def __init__(self, recorder, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.recorder = recorder
        self.last_logits = None

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        outputs = model(**inputs)
        loss = outputs.loss
        self.last_logits = outputs.logits
        return (loss, outputs) if return_outputs else loss

    def training_step(self, model, inputs, num_items_in_batch=None):
        loss = super().training_step(model, inputs, num_items_in_batch)
        self.recorder.record_before_step(model, loss, self.last_logits)
        return loss

    def _maybe_log_save_evaluate(self, tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time=None, **kwargs):
        self.recorder.record_after_step(model, self.optimizer)
        super()._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval, start_time, **kwargs)

print("✓ InstrumentedTrainer defined")

✓ InstrumentedTrainer defined


## Initialization Scale Sweep Loop

In [10]:
print(f"\n{'='*80}")
print(f"FLANNEL 5: INITIALIZATION SCALE SWEEP")
print(f"{'='*80}\n")

print(f"Configuration:")
print(f"  Runs: {NUM_RUNS}")
print(f"  Steps per run: {NUM_TRAIN_STEPS:,}")
print(f"  Fixed seed: {FIXED_SEED}")
print(f"  σ values: {INIT_SCALES}")
print(f"  Recording: {', '.join([k for k, v in RECORD_CONFIG.items() if v])}")
print(f"\n{'='*80}\n")

experiment_start = time.time()

for run_idx, init_scale in enumerate(INIT_SCALES):
    print(f"\n{'='*80}")
    print(f"RUN {run_idx + 1}/{NUM_RUNS} (σ={init_scale:.3f})")
    print(f"{'='*80}\n")
    
    # Set seed (same for all runs)
    torch.manual_seed(FIXED_SEED)
    np.random.seed(FIXED_SEED)
    
    # Create model
    config = GPT2Config(
        vocab_size=VOCAB_SIZE,
        n_positions=MAX_SEQ_LEN,
        n_embd=HIDDEN_DIM,
        n_layer=N_LAYER,
        n_head=N_HEAD,
        resid_pdrop=0.0,
        embd_pdrop=0.0,
        attn_pdrop=0.0,
        tie_word_embeddings=True,
    )
    
    model = GPT2LMHeadModel(config).to(torch.bfloat16).to(device)
    
    # Initialize with current σ
    init_f32 = torch.randn(VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32, device=device) * init_scale
    with torch.no_grad():
        model.transformer.wte.weight[:] = init_f32.to(torch.bfloat16)
    
    print(f"  ✓ Model initialized (seed={FIXED_SEED}, σ={init_scale:.3f})")
    
    # Create recorder
    recorder = BatchRecorder(tensors, RECORD_CONFIG, run_idx)
    
    # Training args
    training_args = TrainingArguments(
        output_dir=OUTPUT_DIR,
        max_steps=NUM_TRAIN_STEPS,
        per_device_train_batch_size=BATCH_SIZE,
        learning_rate=LEARNING_RATE,  # initial LR; Trainer's default linear schedule decays it toward 0 by max_steps
        weight_decay=WEIGHT_DECAY,
        adam_beta1=ADAM_BETA1,
        adam_beta2=ADAM_BETA2,
        adam_epsilon=ADAM_EPSILON,
        optim="adamw_torch",
        logging_steps=1000,
        save_steps=NUM_TRAIN_STEPS + 1,
        save_total_limit=0,
        dataloader_num_workers=0,
        dataloader_pin_memory=False,
        bf16=True,
        seed=FIXED_SEED,
        report_to="none",
        disable_tqdm=True,
    )
    
    trainer = InstrumentedTrainer(
        recorder=recorder,
        model=model,
        args=training_args,
        train_dataset=dataset,
    )
    
    # Record initial state
    recorder.record_initial_state(model, trainer.optimizer)
    
    # Train
    print(f"  Training...")
    run_start = time.time()
    trainer.train()
    run_elapsed = time.time() - run_start
    
    print(f"\n  ✓ Run {run_idx + 1} complete ({run_elapsed:.1f}s)")
    
    # Clean up
    del model, trainer, recorder
    
    if device == 'mps':
        torch.mps.empty_cache()
    elif device == 'cuda':
        torch.cuda.empty_cache()

experiment_elapsed = time.time() - experiment_start

print(f"\n{'='*80}")
print(f"✓ All {NUM_RUNS} runs complete")
print(f"  Total time: {experiment_elapsed:.1f}s ({experiment_elapsed/60:.1f} minutes)")
print(f"  Average per run: {experiment_elapsed/NUM_RUNS:.1f}s")
print(f"{'='*80}")


FLANNEL 5: INITIALIZATION SCALE SWEEP

Configuration:
  Runs: 9
  Steps per run: 1,000
  Fixed seed: 42
  σ values: [0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045]
  Recording: W, losses



RUN 1/9 (σ=0.005)

  ✓ Model initialized (seed=42, σ=0.005)
    ✓ Recorded initial state (t=0)
  Training...


`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


    Step 100
    Step 200
    Step 300
    Step 400
    Step 500
    Step 600
    Step 700
    Step 800
    Step 900
    Step 1000
{'loss': 7.1415, 'grad_norm': 0.2158203125, 'learning_rate': 1e-06, 'epoch': 0.023337222870478413}
{'train_runtime': 26.2006, 'train_samples_per_second': 1221.348, 'train_steps_per_second': 38.167, 'train_loss': 7.14151318359375, 'epoch': 0.023337222870478413}

  ✓ Run 1 complete (26.3s)

RUN 2/9 (σ=0.010)

  ✓ Model initialized (seed=42, σ=0.010)
    ✓ Recorded initial state (t=0)
  Training...
    Step 100
    Step 200
    Step 300
    Step 400
    Step 500
    Step 600
    Step 700
    Step 800
    Step 900
    Step 1000
{'loss': 7.1265, 'grad_norm': 0.2412109375, 'learning_rate': 1e-06, 'epoch': 0.023337222870478413}
{'train_runtime': 25.247, 'train_samples_per_second': 1267.479, 'train_steps_per_second': 39.609, 'train_loss': 7.12645361328125, 'epoch': 0.023337222870478413}

  ✓ Run 2 complete (25.3s)

RUN 3/9 (σ=0.015)

  ✓ Model initialized (seed=42,

## Save Data

In [11]:
print(f"\nSaving data...\n")

# Build save dictionary
save_dict = {
    # Metadata
    'n_runs': torch.tensor(NUM_RUNS, dtype=torch.long),
    'fixed_seed': torch.tensor(FIXED_SEED, dtype=torch.long),
    'init_scales': torch.tensor(INIT_SCALES, dtype=torch.float32),
    'n_steps': torch.tensor(NUM_TRAIN_STEPS, dtype=torch.long),
    'n_live': torch.tensor(n_live, dtype=torch.long),
    'n_dead': torch.tensor(n_dead, dtype=torch.long),
    'vocab_size': torch.tensor(VOCAB_SIZE, dtype=torch.long),
    'hidden_dim': torch.tensor(HIDDEN_DIM, dtype=torch.long),
    'learning_rate': torch.tensor(LEARNING_RATE, dtype=torch.float32),
    'weight_decay': torch.tensor(WEIGHT_DECAY, dtype=torch.float32),
    'adam_beta1': torch.tensor(ADAM_BETA1, dtype=torch.float32),
    'adam_beta2': torch.tensor(ADAM_BETA2, dtype=torch.float32),
    # Record config
    'recorded_W': torch.tensor(RECORD_CONFIG['W'], dtype=torch.bool),
    'recorded_grads': torch.tensor(RECORD_CONFIG['grads'], dtype=torch.bool),
    'recorded_momentum': torch.tensor(RECORD_CONFIG['momentum'], dtype=torch.bool),
    'recorded_variance': torch.tensor(RECORD_CONFIG['variance'], dtype=torch.bool),
    'recorded_logits': torch.tensor(RECORD_CONFIG['logits'], dtype=torch.bool),
    'recorded_losses': torch.tensor(RECORD_CONFIG['losses'], dtype=torch.bool),
}

# Add all recorded tensors
for name, tensor in tensors.items():
    save_dict[name] = tensor
    print(f"  {name:12} {str(tuple(tensor.shape)):30}")

# Save
Path(OUTPUT_DIR).mkdir(parents=True, exist_ok=True)
output_path = Path(OUTPUT_DIR) / OUTPUT_FILE

print(f"\nSaving to: {output_path}\n")

save_start = time.time()
save_file(save_dict, str(output_path))
save_elapsed = time.time() - save_start

file_size_mb = output_path.stat().st_size / 1e6
file_size_gb = file_size_mb / 1000

print(f"✓ Saved successfully")
print(f"  File: {output_path.name}")
print(f"  Size: {file_size_mb:.1f} MB ({file_size_gb:.2f} GB)")
print(f"  Save time: {save_elapsed:.1f}s")


Saving data...

  W            (9, 1001, 10000, 64)          
  losses       (9, 1001)                     

Saving to: ../tensors/Flannel/1.20e_flannel_5.safetensors

✓ Saved successfully
  File: 1.20e_flannel_5.safetensors
  Size: 11531.5 MB (11.53 GB)
  Save time: 6.1s


## Summary

In [12]:
print(f"\n{'='*80}")
print(f"FLANNEL 5 COMPLETE")
print(f"{'='*80}\n")

print(f"Experiment: Initialization scale sweep")
print(f"  Runs: {NUM_RUNS}")
print(f"  Steps per run: {NUM_TRAIN_STEPS:,}")
print(f"  Fixed seed: {FIXED_SEED}")
print(f"  σ values: {INIT_SCALES}")
print(f"  Control (σ=0.020): run {INIT_SCALES.index(0.020)}")
print(f"  Recorded: {', '.join([k for k, v in RECORD_CONFIG.items() if v])}")
print()
print(f"Data saved: {output_path}")
print(f"  Size: {file_size_gb:.2f} GB")
print(f"  Total experiment time: {experiment_elapsed/60:.1f} minutes")
print()
print(f"Data structure:")
for name in tensors.keys():
    print(f"  {name}: {tuple(tensors[name].shape)}")
print()
print(f"Next steps:")
print(f"  1. Create 1.23b analysis notebook (adapt 1.23a)")
print(f"  2. Plot mean radius trajectories for all σ values")
print(f"  3. Compare epoch timing across scales")
print(f"  4. Check for phase transitions or scaling laws")

print(f"\n{'='*80}")


FLANNEL 5 COMPLETE

Experiment: Initialization scale sweep
  Runs: 9
  Steps per run: 1,000
  Fixed seed: 42
  σ values: [0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045]
  Control (σ=0.020): run 3
  Recorded: W, losses

Data saved: ../tensors/Flannel/1.20e_flannel_5.safetensors
  Size: 11.53 GB
  Total experiment time: 3.8 minutes

Data structure:
  W: (9, 1001, 10000, 64)
  losses: (9, 1001)

Next steps:
  1. Create 1.23b analysis notebook (adapt 1.23a)
  2. Plot mean radius trajectories for all σ values
  3. Compare epoch timing across scales
  4. Check for phase transitions or scaling laws
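
As a starting point for the mean-radius comparison, it helps to know what the step-0 radii should look like. The sketch below is an illustration on synthetic data, not an analysis of the saved file: it takes "mean radius" to mean the average Euclidean norm of the embedding rows (an assumption), uses far fewer rows than the real vocabulary, and runs in pure Python. `mean_radius` and `gaussian_embedding` are hypothetical helpers.

```python
import math
import random

def mean_radius(rows):
    """Average Euclidean norm of the rows of an embedding matrix."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in rows) / len(rows)

def gaussian_embedding(n_rows, dim, sigma, seed):
    """Matrix of N(0, sigma^2) entries; the same seed gives the same noise pattern."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_rows)]

DIM = 64       # matches HIDDEN_DIM
N_ROWS = 500   # much smaller than VOCAB_SIZE, to keep the toy fast

radii = {}
for sigma in (0.005, 0.020, 0.045):
    W0 = gaussian_embedding(N_ROWS, DIM, sigma, seed=42)
    radii[sigma] = mean_radius(W0)
    print(f"σ={sigma:.3f}  mean radius={radii[sigma]:.4f}  σ·√d={sigma * math.sqrt(DIM):.4f}")
```

With a fixed seed the initial matrices are scalar multiples of one another (before bfloat16 rounding), so the step-0 radii scale linearly with σ and sit close to σ·√d. Any deviation from that linear scaling at later steps is then attributable to training dynamics rather than to the initial noise. On the recorded tensor, the analogous quantity would be something like `W.float().norm(dim=-1).mean(dim=-1)`, which yields one value per run and step.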

