# 1.18b: Data Rate Calculator

**Goal:** Calculate storage requirements for training experiments given hyperparameters.

## Why This Matters

We want to record the full W matrix (unembedding) at every training step to watch token dynamics. But how much disk space will this require?

**The answer depends on:**
- Vocabulary size (number of tokens)
- Hidden dimension size
- Number of training steps
- Data type (float32, bfloat16, float16)

This notebook lets us **tune hyperparameters** before training to ensure we don't run out of disk space mid-run.

## What We Calculate

For a given set of hyperparameters:
1. **Single snapshot size:** How big is one W matrix?
2. **Per-1000-steps:** How much data per 1,000 training steps?
3. **Total run size:** How much for a full training run (N steps)?
4. **RAM requirements:** How much memory do the model parameters and optimizer state need during training?

## Storage Format

We save W matrices in the **safetensors** format:
- Minimal overhead: an 8-byte length prefix plus a small JSON header (on the order of 100 bytes per file)
- No compression: tensor data is stored as raw bytes
- So, essentially: `file_size ≈ vocab_size × hidden_dim × bytes_per_float`
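
The approximation above can be checked with a small standard-library sketch. It assumes the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then the raw tensor bytes) and a single tensor per file. The function name `estimate_safetensors_size` and the tensor name `"W"` are hypothetical, and a real writer may pad the header or add metadata, so treat the overhead figure as a rough lower bound.

```python
import json

# Assumed mapping from this notebook's dtype names to safetensors dtype tags.
SAFETENSORS_DTYPE = {"float32": "F32", "bfloat16": "BF16", "float16": "F16"}
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float16": 2}


def estimate_safetensors_size(vocab_size, hidden_dim, dtype, tensor_name="W"):
    """Estimate (total file bytes, payload bytes) for one (vocab_size, hidden_dim) tensor."""
    payload = vocab_size * hidden_dim * BYTES_PER_ELEMENT[dtype]
    header = json.dumps(
        {
            tensor_name: {
                "dtype": SAFETENSORS_DTYPE[dtype],
                "shape": [vocab_size, hidden_dim],
                "data_offsets": [0, payload],
            }
        },
        separators=(",", ":"),
    ).encode("utf-8")
    # 8-byte header length + JSON header + raw tensor bytes
    return 8 + len(header) + payload, payload


total, payload = estimate_safetensors_size(10_000, 64, "bfloat16")
print(f"payload: {payload:,} B, overhead: {total - payload} B")
```

For the default configuration, the overhead is a few dozen bytes against a 1.28 MB payload, which is why the rest of the notebook ignores it.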

## Parameters

Edit these to match your planned training run.

In [1]:
# Model architecture
VOCAB_SIZE = 10000        # Flannel tokenizer size
HIDDEN_DIM = 64          # Same as Lil Gatsby / Wordybird
NUM_LAYERS = 2           # Number of transformer layers
NUM_HEADS = 2            # Number of attention heads

# Training parameters
TRAINING_STEPS = 10000   # Total number of training steps
SAVE_EVERY_N_STEPS = 1   # How often to save W matrix (1 = every step)

# Data type
DTYPE = "bfloat16"       # Options: "float32" (4 bytes), "bfloat16" (2 bytes), "float16" (2 bytes)

# Bytes per data type
BYTES_PER_DTYPE = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2
}

print("✓ Parameters set")

✓ Parameters set


## Calculate W Matrix Size

The W matrix (unembedding) is what we're tracking. It's a 2D matrix:
- Shape: `(vocab_size, hidden_dim)`
- Size in bytes: `vocab_size × hidden_dim × bytes_per_float`

In [2]:
# Calculate W matrix size
bytes_per_float = BYTES_PER_DTYPE[DTYPE]
w_matrix_elements = VOCAB_SIZE * HIDDEN_DIM
w_matrix_bytes = w_matrix_elements * bytes_per_float

# Convert to human-readable units
def format_bytes(b):
    """Format bytes as human-readable string"""
    if b < 1024:
        return f"{b} B"
    elif b < 1024**2:
        return f"{b/1024:.2f} KB"
    elif b < 1024**3:
        return f"{b/1024**2:.2f} MB"
    else:
        return f"{b/1024**3:.2f} GB"

print(f"W Matrix (Unembedding):")
print(f"  Shape: ({VOCAB_SIZE:,} tokens, {HIDDEN_DIM} dims)")
print(f"  Elements: {w_matrix_elements:,}")
print(f"  Data type: {DTYPE} ({bytes_per_float} bytes per element)")
print(f"  Size per snapshot: {format_bytes(w_matrix_bytes)}")
print()

W Matrix (Unembedding):
  Shape: (10,000 tokens, 64 dims)
  Elements: 640,000
  Data type: bfloat16 (2 bytes per element)
  Size per snapshot: 1.22 MB



## Calculate Storage Requirements per Interval

In [3]:
# How many snapshots will we save?
num_snapshots = TRAINING_STEPS // SAVE_EVERY_N_STEPS

# Total storage for W matrices
total_w_storage = w_matrix_bytes * num_snapshots

# Per-1000-steps
snapshots_per_1000 = 1000 // SAVE_EVERY_N_STEPS
storage_per_1000 = w_matrix_bytes * snapshots_per_1000

print(f"Storage Requirements:")
print(f"  Training steps: {TRAINING_STEPS:,}")
print(f"  Save frequency: every {SAVE_EVERY_N_STEPS} step(s)")
print(f"  Total snapshots: {num_snapshots:,}")
print()

print(f"  Per snapshot: {format_bytes(w_matrix_bytes)}")
print(f"  Per 1,000 steps: {format_bytes(storage_per_1000)}")
print(f"  Total for {TRAINING_STEPS:,} steps: {format_bytes(total_w_storage)}")
print()

Storage Requirements:
  Training steps: 10,000
  Save frequency: every 1 step(s)
  Total snapshots: 10,000

  Per snapshot: 1.22 MB
  Per 1,000 steps: 1.19 GB
  Total for 10,000 steps: 11.92 GB
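
To turn the total above into a go/no-go check, one option is to compare it with the free space on the target volume. This is a minimal sketch using `shutil.disk_usage` from the standard library. The helper `fits_on_disk`, its 10% safety margin, and the choice of the current directory as the output location are all assumptions for illustration.

```python
import shutil


def fits_on_disk(required_bytes, free_bytes, safety_margin=0.10):
    """Return True if required_bytes fits while leaving safety_margin of free space untouched."""
    return required_bytes <= free_bytes * (1.0 - safety_margin)


# 10,000 snapshots of a 10,000 x 64 bfloat16 matrix (values from this notebook)
required = 10_000 * 64 * 2 * 10_000

# Free bytes on the volume that holds the current directory
free = shutil.disk_usage(".").free

print(f"required: {required / 1024**3:.2f} GB, free: {free / 1024**3:.2f} GB")
print("OK to train" if fits_on_disk(required, free) else "Not enough disk space")
```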



## Additional Data

Besides W matrices, we might also save:
- E matrix (embedding) - same size as W
- Optimizer state (Adam: 2× model parameters for momentum/variance)
- Loss values (negligible)
- Checkpoints (full model state)

In [4]:
# E matrix (embedding) - same size as W
e_matrix_bytes = w_matrix_bytes

# If we save both E and W
total_embeddings = (w_matrix_bytes + e_matrix_bytes) * num_snapshots

print(f"Additional Data (optional):")
print(f"  E matrix per snapshot: {format_bytes(e_matrix_bytes)}")
print(f"  Both E + W per snapshot: {format_bytes(w_matrix_bytes + e_matrix_bytes)}")
print(f"  Both E + W per 1,000 steps: {format_bytes((w_matrix_bytes + e_matrix_bytes) * snapshots_per_1000)}")
print(f"  Both E + W total: {format_bytes(total_embeddings)}")
print()

Additional Data (optional):
  E matrix per snapshot: 1.22 MB
  Both E + W per snapshot: 2.44 MB
  Both E + W per 1,000 steps: 2.38 GB
  Both E + W total: 23.84 GB
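
The list above also mentions optimizer state and full checkpoints, which the cell does not size. The sketch below estimates them under stated assumptions: the same rough parameter-count formula this notebook uses for its RAM estimate, Adam moments stored in the same dtype as the weights, and a hypothetical `CHECKPOINT_EVERY` interval that is not one of the notebook's parameters.

```python
VOCAB_SIZE, HIDDEN_DIM, NUM_LAYERS = 10_000, 64, 2
BYTES_PER_PARAM = 2            # bfloat16
TRAINING_STEPS = 10_000
CHECKPOINT_EVERY = 1_000       # hypothetical: not a parameter of this notebook

# Rough parameter count: E + W matrices plus 12 * hidden^2 per transformer layer
total_params = 2 * VOCAB_SIZE * HIDDEN_DIM + NUM_LAYERS * 12 * HIDDEN_DIM**2

# One checkpoint = weights + two Adam moment tensors per parameter
checkpoint_bytes = total_params * BYTES_PER_PARAM * (1 + 2)
num_checkpoints = TRAINING_STEPS // CHECKPOINT_EVERY
total_checkpoint_bytes = checkpoint_bytes * num_checkpoints

print(f"per checkpoint: {checkpoint_bytes / 1024**2:.2f} MB")
print(f"{num_checkpoints} checkpoints: {total_checkpoint_bytes / 1024**2:.2f} MB")
```

If the optimizer keeps its moments in float32 while the weights are bfloat16, the optimizer share doubles, so adjust the multiplier accordingly.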



## RAM Requirements

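During training, we need to hold:
1. Current model parameters
2. Optimizer state (2× parameters for Adam)
3. Batch activations
4. Gradient tensors

The estimate below counts only items 1 and 2, so treat it as a lower bound: gradients add roughly one more copy of the parameters, and activation memory depends on batch size and sequence length.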

In [5]:
# Estimate model parameter count
# E + W matrices
embedding_params = 2 * VOCAB_SIZE * HIDDEN_DIM

# Transformer layers (rough estimate, ignoring biases and layer norms)
# Per layer: attention (4 × hidden²) + FFN (2 × 4 × hidden²)
params_per_layer = 4 * HIDDEN_DIM**2 + 8 * HIDDEN_DIM**2
transformer_params = NUM_LAYERS * params_per_layer

total_params = embedding_params + transformer_params

# Memory usage
model_memory = total_params * bytes_per_float
optimizer_memory = 2 * model_memory  # Adam keeps momentum + variance
total_ram = model_memory + optimizer_memory

print(f"RAM Requirements (approximate):")
print(f"  Total parameters: {total_params:,}")
print(f"  Model memory: {format_bytes(model_memory)}")
print(f"  Optimizer state (Adam): {format_bytes(optimizer_memory)}")
print(f"  Minimum RAM needed: {format_bytes(total_ram)}")
print(f"  Recommended RAM: {format_bytes(int(total_ram * 1.5))} (with 50% buffer)")
print()

RAM Requirements (approximate):
  Total parameters: 1,378,304
  Model memory: 2.63 MB
  Optimizer state (Adam): 5.26 MB
  Minimum RAM needed: 7.89 MB
  Recommended RAM: 11.83 MB (with 50% buffer)
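
The figure above covers parameters and optimizer state only. A slightly fuller sketch adds gradients and an in-memory buffer of W snapshots waiting to be written to disk. `BUFFER_SNAPSHOTS` is a hypothetical setting, the Adam moments are assumed to share the weights' dtype, and activations are still left out because they depend on batch size and sequence length.

```python
total_params = 1_378_304       # parameter count reported above
BYTES_PER_PARAM = 2            # bfloat16
W_SNAPSHOT_BYTES = 10_000 * 64 * BYTES_PER_PARAM
BUFFER_SNAPSHOTS = 100         # hypothetical: snapshots held in RAM between disk writes

model = total_params * BYTES_PER_PARAM
gradients = model              # one gradient value per parameter
optimizer = 2 * model          # Adam moments, same dtype assumed
snapshot_buffer = W_SNAPSHOT_BYTES * BUFFER_SNAPSHOTS

total = model + gradients + optimizer + snapshot_buffer
print(f"model + gradients + optimizer: {(total - snapshot_buffer) / 1024**2:.2f} MB")
print(f"snapshot buffer ({BUFFER_SNAPSHOTS} snapshots): {snapshot_buffer / 1024**2:.2f} MB")
print(f"total: {total / 1024**2:.2f} MB")
```

With these numbers the snapshot buffer dominates, so the buffer length matters more for RAM than the model itself.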



## Scaling Analysis

How does storage scale with vocabulary size? With hidden dimension, data type, and save frequency held fixed, it grows linearly.

In [6]:
import pandas as pd

print(f"Scaling Analysis (for {TRAINING_STEPS:,} steps, {DTYPE}):")
print()

vocab_sizes = [128, 256, 1000, 5000, 10000, 50000]
scaling_data = []

for vocab in vocab_sizes:
    w_bytes = vocab * HIDDEN_DIM * bytes_per_float
    total = w_bytes * num_snapshots
    per_1000 = w_bytes * snapshots_per_1000
    
    scaling_data.append({
        'Vocab Size': f"{vocab:,}",
        'W per snapshot': format_bytes(w_bytes),
        'Per 1,000 steps': format_bytes(per_1000),
        f'Total ({TRAINING_STEPS:,} steps)': format_bytes(total)
    })

df = pd.DataFrame(scaling_data)
print(df.to_string(index=False))

Scaling Analysis (for 10,000 steps, bfloat16):

Vocab Size W per snapshot Per 1,000 steps Total (10,000 steps)
       128       16.00 KB        15.62 MB            156.25 MB
       256       32.00 KB        31.25 MB            312.50 MB
     1,000      125.00 KB       122.07 MB              1.19 GB
     5,000      625.00 KB       610.35 MB              5.96 GB
    10,000        1.22 MB         1.19 GB             11.92 GB
    50,000        6.10 MB         5.96 GB             59.60 GB
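
The table can also be read in reverse: given a disk budget, how often can we afford to save? The sketch below assumes the same snapshot count the notebook uses (`TRAINING_STEPS // SAVE_EVERY_N_STEPS`). The helper `min_save_interval` and the 10 GB budget are hypothetical.

```python
def min_save_interval(budget_bytes, snapshot_bytes, training_steps):
    """Smallest save interval n such that (training_steps // n) snapshots fit in budget_bytes."""
    max_snapshots = budget_bytes // snapshot_bytes
    if max_snapshots < 1:
        raise ValueError("budget is smaller than a single snapshot")
    # training_steps // n <= max_snapshots  <=>  n > training_steps / (max_snapshots + 1)
    return training_steps // (max_snapshots + 1) + 1


# 50,000-token vocabulary, 64 dims, bfloat16, 10,000 steps, 10 GB budget
snapshot_bytes = 50_000 * 64 * 2
interval = min_save_interval(10 * 1024**3, snapshot_bytes, 10_000)
print(f"save every {interval} step(s)")
```

Here the full every-step run (59.60 GB in the table) would not fit, so the helper returns a coarser interval that does.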


## Summary

In [7]:
print(f"\n{'='*70}")
print(f"DATA RATE SUMMARY")
print(f"{'='*70}\n")

print(f"Model Configuration:")
print(f"  Vocabulary: {VOCAB_SIZE:,} tokens")
print(f"  Hidden dimension: {HIDDEN_DIM}")
print(f"  Layers: {NUM_LAYERS}")
print(f"  Attention heads: {NUM_HEADS}")
print(f"  Data type: {DTYPE}")
print()

print(f"Training Plan:")
print(f"  Total steps: {TRAINING_STEPS:,}")
print(f"  Save frequency: every {SAVE_EVERY_N_STEPS} step(s)")
print(f"  Total snapshots: {num_snapshots:,}")
print()

print(f"Storage Requirements (W matrix only):")
print(f"  Per snapshot: {format_bytes(w_matrix_bytes)}")
print(f"  Per 1,000 steps: {format_bytes(storage_per_1000)}")
print(f"  Total: {format_bytes(total_w_storage)}")
print()

print(f"RAM Requirements:")
print(f"  Model + optimizer: {format_bytes(total_ram)}")
print(f"  Recommended: {format_bytes(int(total_ram * 1.5))}")
print()

# Sanity checks
if total_w_storage > 100 * 1024**3:  # > 100 GB
    print(f"⚠️  WARNING: Total storage exceeds 100 GB!")
    print(f"   Consider reducing vocab size or training steps.")
    print()

if total_ram > 16 * 1024**3:  # > 16 GB
    print(f"⚠️  WARNING: RAM requirements exceed 16 GB!")
    print(f"   Consider reducing model size or using bfloat16.")
    print()

print(f"Next steps:")
print(f"  → If storage is acceptable, proceed with training (1.20a+)")
print(f"  → If too large, adjust VOCAB_SIZE or TRAINING_STEPS above")
print(f"  → Consider using bfloat16 instead of float32 (halves storage)")
print()
print(f"{'='*70}")


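======================================================================
DATA RATE SUMMARY
======================================================================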

Model Configuration:
  Vocabulary: 10,000 tokens
  Hidden dimension: 64
  Layers: 2
  Attention heads: 2
  Data type: bfloat16

Training Plan:
  Total steps: 10,000
  Save frequency: every 1 step(s)
  Total snapshots: 10,000

Storage Requirements (W matrix only):
  Per snapshot: 1.22 MB
  Per 1,000 steps: 1.19 GB
  Total: 11.92 GB

RAM Requirements:
  Model + optimizer: 7.89 MB
  Recommended: 11.83 MB

Next steps:
  → If storage is acceptable, proceed with training (1.20a+)
  → If too large, adjust VOCAB_SIZE or TRAINING_STEPS above
  → Consider using bfloat16 instead of float32 (halves storage)

======================================================================
