# Thimble 6: Ground State or Still Cooling? (t=6000, HDF5)

**Purpose:** Distinguish quantum ground state equilibrium from continued cooling.

## Context

**Thimble 5** (t=0–3000) revealed:
- **Individual tokens:** 99.9% frozen in late training (t=2400-3000)
- **Global freeze:** Only 22.5% of timesteps (674/3000)
- **Late training:** 64% globally frozen (t=2000-3000)
- **55 freeze/thaw cycles** with longest run = 82 steps (ending at t=2999)

**Two competing hypotheses:**

1. **Ground state equilibrium:** The system has reached minimum energy. The 64% global freeze / 36% jitter split is the permanent quantum ground state: the statistics stabilize and there is no further cooling.

2. **Still cooling:** The system is asymptotically approaching 100% global freeze. The trend from 22.5% (overall) → 64% (t=2000-3000) continues, and the system eventually reaches permanent Fimbulwinter.

## Question

With the observation window doubled to t=6000, which hypothesis is correct?
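
One way to test the two hypotheses is to compare the global-freeze fraction across successive windows. The sketch below assumes a timestep counts as globally frozen when no embedding entry differs from the previous step. That definition and the helper names `global_freeze_flags` and `windowed_freeze_fraction` are illustrative assumptions, not necessarily the analysis used in Thimble 5.

```python
import numpy as np

def global_freeze_flags(W):
    """W has shape (T+1, vocab, dim). Returns T booleans, True where
    step t left every embedding entry identical to step t-1."""
    changed = (W[1:] != W[:-1]).any(axis=(1, 2))
    return ~changed

def windowed_freeze_fraction(flags, window):
    """Fraction of globally frozen steps in consecutive, non-overlapping windows."""
    n = len(flags) // window
    return flags[: n * window].reshape(n, window).mean(axis=1)

# Toy data: 8 steps, 3 tokens, 2 dims; the weights move only on steps 1, 2 and 5.
W = np.zeros((9, 3, 2), dtype=np.float32)
for t in range(1, 9):
    W[t] = W[t - 1]
    if t in (1, 2, 5):
        W[t, 0, 0] += 1.0

flags = global_freeze_flags(W)
fractions = windowed_freeze_fraction(flags, window=4)
print(fractions)  # [0.5  0.75]
```

On the real run, a plateau near 64% across the t=3000-6000 windows would favor the ground-state reading, while a continued rise would favor cooling. The full `W` array is too large to compare in one piece, so the same comparison would have to run over slices.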

## Technical Changes

**HDF5 streaming format:**
- No memory accumulation (was 33 GB for safetensors)
- Write incrementally to disk (~1 GB RAM during training)
- Analysis notebooks load only needed slices
- gzip compression (level 1, fast)
- Chunked storage optimized for timestep access

**Memory:** ~1 GB during training (vs. ~33 GB for safetensors)

**Time:** ~7-8 minutes (estimate; the recorded run below reports 14.0 minutes)

## Parameters

In [4]:
# Model architecture
VOCAB_SIZE = 10000
HIDDEN_DIM = 64
N_LAYERS = 2
N_HEADS = 2
MAX_SEQ_LEN = 128

# Training
NUM_STEPS = 6000  # ← DOUBLE OBSERVATION WINDOW
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.0

# Optimizer (AdamW)
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8

# Initialization
INIT_SCALE = 0.02
SEED = 42

# Paths
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"
CORPUS_PATH = "../data/flannel_model_corpus.txt"
TOKEN_MASK_PATH = "../tensors/Flannel/live_dead_tokens.safetensors"
OUTPUT_PATH = "../tensors/Thimble/thimble_6.h5"  # ← HDF5 FORMAT

print("✓ Parameters set")

✓ Parameters set


## Imports

In [5]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Config, GPT2LMHeadModel
from tokenizers import Tokenizer
import numpy as np
from pathlib import Path
from safetensors.torch import load_file
import h5py
from tqdm.auto import tqdm
import time

print("✓ Imports complete")

✓ Imports complete


## Memory Safety Check

In [6]:
print(f"\n{'='*80}")
print(f"MEMORY & DISK SAFETY CHECK (HDF5 STREAMING)")
print(f"{'='*80}\n")

# HDF5 streaming: NO RAM accumulation
bytes_bf16 = 2
bytes_f32 = 4

# Disk space (uncompressed estimate)
disk_w = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_bf16
disk_grad = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_bf16
disk_momentum = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_bf16
disk_variance = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_bf16
disk_losses = (NUM_STEPS+1) * bytes_f32
disk_metadata = 1e6  # Token masks, hyperparams

total_disk_uncompressed = disk_w + disk_grad + disk_momentum + disk_variance + disk_losses + disk_metadata
total_disk_compressed = total_disk_uncompressed * 0.7  # Estimate 30% compression

print(f"Disk space (uncompressed):")
print(f"  W:         {disk_w/1e9:.2f} GB")
print(f"  grad_W:    {disk_grad/1e9:.2f} GB")
print(f"  momentum:  {disk_momentum/1e9:.2f} GB")
print(f"  variance:  {disk_variance/1e9:.2f} GB")
print(f"  losses:    {disk_losses/1e9:.4f} GB")
print(f"  metadata:  {disk_metadata/1e9:.4f} GB")
print(f"  {'─'*40}")
print(f"  Total:     {total_disk_uncompressed/1e9:.2f} GB")
print(f"  With gzip: ~{total_disk_compressed/1e9:.2f} GB (estimated)")
print()

# RAM during training (streaming writes)
# Rough parameter count: embedding table + ~12·d² weights per transformer block
model_params = VOCAB_SIZE * HIDDEN_DIM + N_LAYERS * (12 * HIDDEN_DIM**2)
model_memory = model_params * bytes_bf16
optimizer_memory = 2 * model_params * bytes_bf16  # Adam exp_avg + exp_avg_sq
activation_memory = BATCH_SIZE * MAX_SEQ_LEN * HIDDEN_DIM * N_LAYERS * 2 * bytes_bf16
corpus_memory = 1371328 * 8  # corpus token count × 8 bytes (int64)
hdf5_buffer = 100e6  # HDF5 write buffer
misc_overhead = 500e6

peak_ram = model_memory + optimizer_memory + activation_memory + corpus_memory + hdf5_buffer + misc_overhead

print(f"RAM during training (streaming):")
print(f"  Model+opt+act: {(model_memory + optimizer_memory + activation_memory)/1e9:.2f} GB")
print(f"  Corpus:        {corpus_memory/1e9:.2f} GB")
print(f"  HDF5 buffer:   {hdf5_buffer/1e9:.2f} GB")
print(f"  Misc overhead: {misc_overhead/1e9:.2f} GB")
print(f"  {'─'*40}")
print(f"  Total:         {peak_ram/1e9:.2f} GB")
print()

print(f"{'='*80}")
if peak_ram <= 24e9:
    print(f"✓ SAFE: Peak RAM ({peak_ram/1e9:.1f} GB) within 24 GB budget")
    print(f"  HDF5 streaming avoids {total_disk_uncompressed/1e9:.1f} GB accumulation!")
else:
    print(f"⚠️  WARNING: Peak RAM ({peak_ram/1e9:.1f} GB) exceeds 24 GB budget!")
print(f"{'='*80}\n")


MEMORY & DISK SAFETY CHECK (HDF5 STREAMING)

Disk space (uncompressed):
  W:         7.68 GB
  grad_W:    7.68 GB
  momentum:  7.68 GB
  variance:  7.68 GB
  losses:    0.0000 GB
  metadata:  0.0010 GB
  ────────────────────────────────────────
  Total:     30.73 GB
  With gzip: ~21.51 GB (estimated)

RAM during training (streaming):
  Model+opt+act: 0.01 GB
  Corpus:        0.01 GB
  HDF5 buffer:   0.10 GB
  Misc overhead: 0.50 GB
  ────────────────────────────────────────
  Total:         0.62 GB

✓ SAFE: Peak RAM (0.6 GB) within 24 GB budget
  HDF5 streaming avoids 30.7 GB accumulation!
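
The check above estimates how much disk the run will need but does not confirm that the space is available. A minimal sketch of such a check using only the standard library follows; `has_free_disk` and its `safety_factor` margin are hypothetical names introduced here for illustration.

```python
import shutil
from pathlib import Path

def has_free_disk(path, required_bytes, safety_factor=1.1):
    """Return True if the filesystem holding `path` has room for the write.

    `safety_factor` is an illustrative margin, not a measured value.
    """
    target = Path(path)
    # Walk up to the nearest existing directory so disk_usage does not fail
    # when the output directory has not been created yet.
    while not target.exists():
        target = target.parent
    free = shutil.disk_usage(target).free
    return free >= required_bytes * safety_factor

# Uncompressed estimate from the cell above: four (6001, 10000, 64) arrays, 2 bytes per value.
required = 4 * 6001 * 10000 * 64 * 2
print(f"Need ~{required / 1e9:.2f} GB; enough free space: {has_free_disk('.', required)}")
```

Checking against the uncompressed figure is the conservative choice, since the compression ratio is only known after the run.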



## Device Detection

In [7]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Set Random Seeds

In [8]:
torch.manual_seed(SEED)
np.random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"✓ Random seed set to {SEED}")

✓ Random seed set to 42


## Load Data

In [9]:
# Tokenizer
print(f"Loading tokenizer: {TOKENIZER_PATH}")
tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
print(f"  ✓ Vocabulary: {tokenizer.get_vocab_size():,} tokens\n")

# Corpus
print(f"Loading corpus: {CORPUS_PATH}")
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()
encoding = tokenizer.encode(corpus_text)
tokens = encoding.ids
corpus_tensor = torch.tensor(tokens, dtype=torch.long)
print(f"  ✓ Tokens: {len(tokens):,}\n")

# Token masks
print(f"Loading token masks: {TOKEN_MASK_PATH}")
mask_data = load_file(TOKEN_MASK_PATH)
live_mask = mask_data['live_mask'].bool()
dead_mask = mask_data['dead_mask'].bool()
live_ids = mask_data['live_indices'].long()
dead_ids = mask_data['dead_indices'].long()
n_live = live_mask.sum().item()
n_dead = dead_mask.sum().item()
print(f"  ✓ Live: {n_live:,} | Dead: {n_dead:,}")

Loading tokenizer: ../data/flannel_tokenizer_chars.json
  ✓ Vocabulary: 10,000 tokens

Loading corpus: ../data/flannel_model_corpus.txt
  ✓ Tokens: 1,371,328

Loading token masks: ../tensors/Flannel/live_dead_tokens.safetensors
  ✓ Live: 6,301 | Dead: 3,699


## Dataset and DataLoader

In [10]:
class TokenDataset(Dataset):
    def __init__(self, corpus_tensor, max_seq_len):
        self.corpus = corpus_tensor
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.corpus) - self.max_seq_len)
    
    def __getitem__(self, idx):
        chunk = self.corpus[idx : idx + self.max_seq_len + 1]
        return {
            'input_ids': chunk[:-1],
            'labels': chunk[1:]
        }

dataset = TokenDataset(corpus_tensor, MAX_SEQ_LEN)

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(SEED)

dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    generator=g,
    worker_init_fn=seed_worker,
    num_workers=0,
)

print(f"\n✓ Dataset: {len(dataset):,} examples")
print(f"✓ DataLoader: {len(dataloader):,} batches per epoch")


✓ Dataset: 1,371,200 examples
✓ DataLoader: 10,713 batches per epoch


## Create Model (BFLOAT16)

In [11]:
print("Creating model...\n")

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_SEQ_LEN,
    n_embd=HIDDEN_DIM,
    n_layer=N_LAYERS,
    n_head=N_HEADS,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    tie_word_embeddings=True,
)

model = GPT2LMHeadModel(config)

# Initialize embedding weights from a normal distribution: mean 0, std INIT_SCALE (0.02)
with torch.no_grad():
    nn.init.normal_(model.transformer.wte.weight, mean=0.0, std=INIT_SCALE)

# Convert to bfloat16 and move to device
model = model.to(torch.bfloat16).to(device)

n_params = sum(p.numel() for p in model.parameters())

print(f"  Architecture: {N_LAYERS} layers, {N_HEADS} heads, {HIDDEN_DIM}d embeddings")
print(f"  Parameters: {n_params:,}")
print(f"  Device: {device}")
print(f"  Dtype: {model.transformer.wte.weight.dtype} (BFLOAT16)")
print(f"\n✓ Model created")

Creating model...

  Architecture: 2 layers, 2 heads, 64d embeddings
  Parameters: 748,288
  Device: mps
  Dtype: torch.bfloat16 (BFLOAT16)

✓ Model created


## Create Optimizer

In [12]:
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    betas=(ADAM_BETA1, ADAM_BETA2),
    eps=ADAM_EPSILON,
    weight_decay=WEIGHT_DECAY,
)

print(f"✓ Optimizer: AdamW")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Betas: ({ADAM_BETA1}, {ADAM_BETA2})")
print(f"  Epsilon: {ADAM_EPSILON}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print(f"\n  Optimizer states will be BFLOAT16 (matching param dtype)")

✓ Optimizer: AdamW
  Learning rate: 0.001
  Betas: (0.9, 0.999)
  Epsilon: 1e-08
  Weight decay: 0.0

  Optimizer states will be BFLOAT16 (matching param dtype)


## Create HDF5 File with Datasets

In [13]:
print("\nCreating HDF5 file with streaming datasets...\n")

Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)

# Open HDF5 file for writing
h5file = h5py.File(OUTPUT_PATH, 'w')

# Create datasets with chunking and compression
# Chunk size: (1, vocab, hidden) = one timestep at a time
chunk_shape = (1, VOCAB_SIZE, HIDDEN_DIM)

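# HDF5 has no native bfloat16, so the datasets use float16 and are cast back
# to bfloat16 when loading. float16 has a narrower exponent range than
# bfloat16, so very small or very large values may not survive the cast exactly.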
W_dset = h5file.create_dataset(
    'W', 
    shape=(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM),
    dtype='float16',
    chunks=chunk_shape,
    compression='gzip',
    compression_opts=1
)

grad_dset = h5file.create_dataset(
    'grad_W',
    shape=(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM),
    dtype='float16',
    chunks=chunk_shape,
    compression='gzip',
    compression_opts=1
)

momentum_dset = h5file.create_dataset(
    'momentum_W',
    shape=(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM),
    dtype='float16',
    chunks=chunk_shape,
    compression='gzip',
    compression_opts=1
)

variance_dset = h5file.create_dataset(
    'variance_W',
    shape=(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM),
    dtype='float16',
    chunks=chunk_shape,
    compression='gzip',
    compression_opts=1
)

loss_dset = h5file.create_dataset(
    'losses',
    shape=(NUM_STEPS+1,),
    dtype='float32',
    chunks=(1000,),
    compression='gzip',
    compression_opts=1
)

# Store metadata (token masks, hyperparameters)
h5file.create_dataset('live_mask', data=live_mask.numpy())
h5file.create_dataset('dead_mask', data=dead_mask.numpy())
h5file.create_dataset('live_ids', data=live_ids.numpy())
h5file.create_dataset('dead_ids', data=dead_ids.numpy())

# Store hyperparameters as attributes
h5file.attrs['vocab_size'] = VOCAB_SIZE
h5file.attrs['hidden_dim'] = HIDDEN_DIM
h5file.attrs['n_layers'] = N_LAYERS
h5file.attrs['n_heads'] = N_HEADS
h5file.attrs['num_steps'] = NUM_STEPS
h5file.attrs['batch_size'] = BATCH_SIZE
h5file.attrs['learning_rate'] = LEARNING_RATE
h5file.attrs['weight_decay'] = WEIGHT_DECAY
h5file.attrs['adam_beta1'] = ADAM_BETA1
h5file.attrs['adam_beta2'] = ADAM_BETA2
h5file.attrs['adam_epsilon'] = ADAM_EPSILON
h5file.attrs['init_scale'] = INIT_SCALE
h5file.attrs['seed'] = SEED
h5file.attrs['n_live'] = n_live
h5file.attrs['n_dead'] = n_dead

print(f"  Created datasets with shape: ({NUM_STEPS+1}, {VOCAB_SIZE}, {HIDDEN_DIM})")
print(f"  Chunking: {chunk_shape} (one timestep per chunk)")
print(f"  Compression: gzip level 1")
print(f"  Dtype: float16 (will represent bfloat16 data)")
print(f"\n✓ HDF5 file initialized (streaming writes)")


Creating HDF5 file with streaming datasets...

  Created datasets with shape: (6001, 10000, 64)
  Chunking: (1, 10000, 64) (one timestep per chunk)
  Compression: gzip level 1
  Dtype: float16 (will represent bfloat16 data)

✓ HDF5 file initialized (streaming writes)
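
float16 and bfloat16 are both 16 bits wide but divide them differently: bfloat16 keeps the 8 exponent bits of float32, while float16 has only 5. A cast to float16 can therefore flush very small values to zero (small Adam second-moment entries are a plausible case) and overflow large ones. The sketch below emulates bfloat16 with NumPy bit operations to show the effect, along with one lossless alternative: storing the raw 16-bit pattern as `uint16`. The helper names are hypothetical, and this is not what the notebook does.

```python
import numpy as np

def bf16_bits(x):
    """Top 16 bits of each float32 value, i.e. its bfloat16 bit pattern.
    Plain truncation; assumes the input is already exactly representable."""
    x = np.ascontiguousarray(x, dtype=np.float32)
    return (x.view(np.uint32) >> 16).astype(np.uint16)

def bits_to_f32(bits):
    """Expand stored bfloat16 bit patterns back to float32."""
    return (bits.astype(np.uint32) << 16).view(np.float32)

# Powers of two are exactly representable in bfloat16.
x = np.array([2.0**-30, 2.0**-10, 1.0, 2.0**20], dtype=np.float32)

with np.errstate(over="ignore"):
    via_f16 = x.astype(np.float16).astype(np.float32)  # lossy round trip
via_bits = bits_to_f32(bf16_bits(x))                   # lossless round trip

print(via_f16)   # first entry underflows to 0, last overflows to inf
print(via_bits)  # identical to x
```

In the notebook's setting, a bfloat16 tensor's bit pattern could be obtained with `tensor.view(torch.int16)` and written to a dataset created with an integer dtype. Whether the precision loss matters for the freeze analysis depends on the magnitudes actually recorded.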


## Training Loop with HDF5 Streaming

In [None]:
print(f"\n{'='*80}")
print(f"THIMBLE 6: GROUND STATE OR STILL COOLING? (t=0 → t=6000)")
print(f"{'='*80}\n")

# Record initial state (step 0)
W_dset[0] = model.transformer.wte.weight.data.cpu().float().numpy()  # bfloat16 → float32 here; h5py casts to float16 on write
grad_dset[0] = np.zeros((VOCAB_SIZE, HIDDEN_DIM), dtype=np.float16)
momentum_dset[0] = np.zeros((VOCAB_SIZE, HIDDEN_DIM), dtype=np.float16)
variance_dset[0] = np.zeros((VOCAB_SIZE, HIDDEN_DIM), dtype=np.float16)
loss_dset[0] = np.nan
print("✓ Recorded initial state (t=0)\n")

# Iterator over the dataloader; restarted below whenever an epoch is exhausted
data_iter = iter(dataloader)

# Training loop
model.train()
start_time = time.time()

for step in tqdm(range(1, NUM_STEPS+1), desc="Training"):
    # Get next batch
    try:
        batch = next(data_iter)
    except StopIteration:
        data_iter = iter(dataloader)
        batch = next(data_iter)
    
    # Move batch to device
    input_ids = batch['input_ids'].to(device)
    labels = batch['labels'].to(device)
    
    # Forward pass
    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss
    
    # Backward pass
    loss.backward()
    
    # === STREAM GRADIENTS TO HDF5 ===
    grad_dset[step] = model.transformer.wte.weight.grad.cpu().float().numpy()
    
    # Optimizer step
    optimizer.step()
    optimizer.zero_grad()
    
    # === STREAM WEIGHTS & OPTIMIZER STATE TO HDF5 ===
    W_dset[step] = model.transformer.wte.weight.data.cpu().float().numpy()
    
    wte_param = model.transformer.wte.weight
    if wte_param in optimizer.state:
        opt_state = optimizer.state[wte_param]
        momentum_dset[step] = opt_state['exp_avg'].cpu().float().numpy()
        variance_dset[step] = opt_state['exp_avg_sq'].cpu().float().numpy()
    else:
        momentum_dset[step] = np.zeros((VOCAB_SIZE, HIDDEN_DIM), dtype=np.float16)
        variance_dset[step] = np.zeros((VOCAB_SIZE, HIDDEN_DIM), dtype=np.float16)
    
    loss_dset[step] = loss.item()

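elapsed = time.time() - start_time

# Read the final loss while the file is still open: dataset handles such as
# loss_dset become invalid after close(), and using one then raises
# "RuntimeError: Unable to synchronously get dataspace".
final_loss = float(loss_dset[-1])

# Close HDF5 file
h5file.close()

print(f"\n{'='*80}")
print(f"✓ Training complete")
print(f"  Time: {elapsed:.1f}s ({elapsed/60:.1f} minutes)")
print(f"  Final loss: {final_loss:.4f}")
print(f"✓ HDF5 file closed")
print(f"{'='*80}")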


THIMBLE 6: GROUND STATE OR STILL COOLING? (t=0 → t=6000)



RuntimeError: Unable to synchronously get dataspace (identifier is not of specified type)
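
The `RuntimeError` above is consistent with a dataset handle being used after its file was closed, for example indexing `loss_dset` after `h5file.close()`, or re-running the training cell without recreating the file. A minimal sketch of a pattern that avoids this is shown below. It uses a tiny stand-in file in a temporary directory, and the loss values are placeholders rather than real training output.

```python
import os
import tempfile

import h5py
import numpy as np

NUM_STEPS = 5  # tiny stand-in for the real run
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# The context manager closes the file even if the loop raises.
with h5py.File(path, "w") as f:
    losses = f.create_dataset("losses", shape=(NUM_STEPS + 1,), dtype="float32")
    losses[0] = np.nan
    for step in range(1, NUM_STEPS + 1):
        losses[step] = 1.0 / step  # placeholder for loss.item()
    final_loss = float(losses[-1])  # read while the file is still open

# Reopening read-only also works once the writer is closed.
with h5py.File(path, "r") as f:
    reread = float(f["losses"][-1])

print(f"final loss: {final_loss:.4f}, reread: {reread:.4f}")
```

Copying the value into a plain Python float before the file closes means later cells do not depend on any open handle.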

## Verify Output

In [16]:
print(f"\nVerifying output file...\n")

file_size_bytes = Path(OUTPUT_PATH).stat().st_size
file_size_gb = file_size_bytes / 1e9

print(f"✓ File created successfully")
print(f"  Path: {OUTPUT_PATH}")
print(f"  Size: {file_size_gb:.2f} GB")
print()

# Quick check: load a single timestep
with h5py.File(OUTPUT_PATH, 'r') as f:
    print(f"  Datasets:")
    for key in f.keys():
        if key in ['W', 'grad_W', 'momentum_W', 'variance_W']:
            print(f"    {key}: {f[key].shape} ({f[key].dtype})")
        elif key == 'losses':
            print(f"    {key}: {f[key].shape} ({f[key].dtype})")
        else:
            print(f"    {key}: {f[key].shape}")
    
    print(f"\n  Attributes:")
    for key in f.attrs.keys():
        print(f"    {key}: {f.attrs[key]}")
    
    # Test loading a single timestep
    W_test = torch.from_numpy(f['W'][3000]).to(torch.bfloat16)
    print(f"\n  ✓ Test load successful: W[3000] shape={tuple(W_test.shape)}, dtype={W_test.dtype}")

print(f"\n✓ Output verification complete")


Verifying output file...

✓ File created successfully
  Path: ../tensors/Thimble/thimble_6.h5
  Size: 16.24 GB

  Datasets:
    W: (6001, 10000, 64) (float16)
    dead_ids: (3699,)
    dead_mask: (10000,)
    grad_W: (6001, 10000, 64) (float16)
    live_ids: (6301,)
    live_mask: (10000,)
    losses: (6001,) (float32)
    momentum_W: (6001, 10000, 64) (float16)
    variance_W: (6001, 10000, 64) (float16)

  Attributes:
    adam_beta1: 0.9
    adam_beta2: 0.999
    adam_epsilon: 1e-08
    batch_size: 128
    hidden_dim: 64
    init_scale: 0.02
    learning_rate: 0.001
    n_dead: 3699
    n_heads: 2
    n_layers: 2
    n_live: 6301
    num_steps: 6000
    seed: 42
    vocab_size: 10000
    weight_decay: 0.0

  ✓ Test load successful: W[3000] shape=(10000, 64), dtype=torch.bfloat16

✓ Output verification complete
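
Because each chunk holds one timestep, an analysis notebook can read a short time window without touching the rest of the file. The sketch below illustrates this on a toy file; `load_window` is a hypothetical helper, and the toy data stands in for `thimble_6.h5`.

```python
import os
import tempfile

import h5py
import numpy as np

def load_window(path, name, t0, t1):
    """Read timesteps [t0, t1) of one dataset into memory as a NumPy array."""
    with h5py.File(path, "r") as f:
        return f[name][t0:t1]

# Toy file: 6 timesteps of a (4, 2) matrix whose entries all equal t.
path = os.path.join(tempfile.mkdtemp(), "toy.h5")
with h5py.File(path, "w") as f:
    W = f.create_dataset("W", shape=(6, 4, 2), dtype="float16",
                         chunks=(1, 4, 2), compression="gzip", compression_opts=1)
    for t in range(6):
        W[t] = np.full((4, 2), t, dtype=np.float16)

window = load_window(path, "W", 2, 5)
# Largest per-entry change between consecutive steps inside the window.
step_change = np.abs(np.diff(window.astype(np.float32), axis=0)).max(axis=(1, 2))
print(window.shape, step_change)
```

The returned array is independent of the file, so it remains usable after the `with` block exits.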


## Summary

In [17]:
print(f"\n{'='*80}")
print(f"THIMBLE 6 COMPLETE: GROUND STATE OR STILL COOLING?")
print(f"{'='*80}\n")

print(f"Trained for {NUM_STEPS:,} steps with pure bfloat16 pipeline")
print(f"  Seed: {SEED}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print()
print(f"Recorded at every step (HDF5 streaming):")
print(f"  • W: embedding weights")
print(f"  • grad_W: gradients")
print(f"  • momentum_W: Adam exp_avg")
print(f"  • variance_W: Adam exp_avg_sq")
print(f"  • losses: training loss")
print(f"  • Token masks: live/dead masks and IDs (self-contained)")
print()
print(f"Data saved: {OUTPUT_PATH}")
print(f"  Size: {file_size_gb:.2f} GB (compressed)")
print(f"  Format: HDF5 with gzip compression")
print(f"  Training time: {elapsed/60:.1f} minutes")
print()
print(f"Next: Analyze to distinguish:")
print(f"  1. Ground state equilibrium (statistics stabilize)")
print(f"  2. Still cooling (global freeze % continues rising)")
print(f"\n{'='*80}")


THIMBLE 6 COMPLETE: GROUND STATE OR STILL COOLING?

Trained for 6,000 steps with pure bfloat16 pipeline
  Seed: 42
  Batch size: 128
  Learning rate: 0.001
  Weight decay: 0.0

Recorded at every step (HDF5 streaming):
  • W: embedding weights
  • grad_W: gradients
  • momentum_W: Adam exp_avg
  • variance_W: Adam exp_avg_sq
  • losses: training loss
  • Token masks: live/dead masks and IDs (self-contained)

Data saved: ../tensors/Thimble/thimble_6.h5
  Size: 16.24 GB (compressed)
  Format: HDF5 with gzip compression
  Training time: 14.0 minutes

Next: Analyze to distinguish:
  1. Ground state equilibrium (statistics stabilize)
  2. Still cooling (global freeze % continues rising)

