# Thimble 2: Gravity Probe B for AdamW

**Objective:** Prove the AdamW formula by achieving **bitwise equality** between predicted and observed weights.

**Hypothesis:** If we train in full float32 precision (no quantization) and apply the exact AdamW formula that PyTorch uses, we should get:

```python
predicted_W[t] == observed_W[t]  # Bitwise identical for all t
```

Not approximately equal. Not ratio ≈ 1. **Exactly equal.**

**Why this matters:**
- Thimble 1 showed accounting discrepancies even after simulating bfloat16 quantization
- We need to validate the formula itself before we can understand quantization effects
- If float32 gives perfect equality, we know the formula is right and can debug quantization
- If float32 still fails, we're missing something fundamental

**Method:**
- Train in **full float32** (no `.to(bfloat16)`)
- Record W, gradients, momentum, and variance, all in float32
- Test: `predicted_W[t] == observed_W[t]` element-wise

This is our Gravity Probe B. Let's prove Einstein. 🐕🥽
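
For reference, the update under test can be written out as follows. This is a sketch in notation introduced here rather than taken from the notebook: $g_t$ stands for `grad_history[t]`, $m_t$ and $v_t$ for the recorded `exp_avg` and `exp_avg_sq`, $\eta$ for `LEARNING_RATE`, $\lambda$ for `WEIGHT_DECAY`, and $\epsilon$ for `ADAM_EPSILON`. With $\lambda = 0$, as in this run, the decay factor is exactly 1.

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
W_t &= (1 - \eta\lambda)\, W_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

The validation cell later in this notebook checks only the last line, taking the recorded $m_t$ and $v_t$ as given.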

## Parameters

In [1]:
# Model architecture
VOCAB_SIZE = 10000
HIDDEN_DIM = 64
N_LAYERS = 2
N_HEADS = 2
MAX_SEQ_LEN = 128

# Training
NUM_STEPS = 1000
BATCH_SIZE = 128
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.0  # Disabled for simplicity

# Optimizer (AdamW)
ADAM_BETA1 = 0.9
ADAM_BETA2 = 0.999
ADAM_EPSILON = 1e-8

# Initialization
INIT_SCALE = 0.02  # Standard deviation of the normal init: N(0, 0.02^2)
SEED = 42

# Paths
TOKENIZER_PATH = "../data/flannel_tokenizer_chars.json"
CORPUS_PATH = "../data/flannel_model_corpus.txt"
TOKEN_MASK_PATH = "../tensors/Flannel/live_dead_tokens.safetensors"
OUTPUT_PATH = "../tensors/Thimble/thimble_2.safetensors"

print("✓ Parameters set")

✓ Parameters set


## Imports

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Config, GPT2LMHeadModel
from tokenizers import Tokenizer
import numpy as np
from pathlib import Path
from safetensors.torch import save_file, load_file
from tqdm.auto import tqdm
import time

print("✓ Imports complete")

✓ Imports complete


## Memory Safety Check

In [3]:
print(f"\n{'='*80}")
print(f"MEMORY & DISK SAFETY CHECK")
print(f"{'='*80}\n")

# Recording tensors - ALL FLOAT32 this time
bytes_f32 = 4  # float32

recording_w = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_f32
recording_grad = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_f32
recording_momentum = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_f32
recording_variance = (NUM_STEPS+1) * VOCAB_SIZE * HIDDEN_DIM * bytes_f32
recording_losses = (NUM_STEPS+1) * bytes_f32

total_recording = recording_w + recording_grad + recording_momentum + recording_variance + recording_losses

print(f"Recording tensors (CPU memory, all float32):")
print(f"  W:         {recording_w/1e9:.2f} GB")
print(f"  grad_W:    {recording_grad/1e9:.2f} GB")
print(f"  momentum:  {recording_momentum/1e9:.2f} GB")
print(f"  variance:  {recording_variance/1e9:.2f} GB")
print(f"  losses:    {recording_losses/1e9:.4f} GB")
print(f"  {'─'*40}")
print(f"  Total:     {total_recording/1e9:.2f} GB")
print()

# Model memory (float32 weights + float32 optimizer states)
embedding_params = VOCAB_SIZE * HIDDEN_DIM
params_per_layer = 12 * HIDDEN_DIM**2  # Estimate: attention + MLP weight matrices only
transformer_params = N_LAYERS * params_per_layer
total_model_params = embedding_params + transformer_params  # Omits biases, layer norms, position embeddings

model_memory = total_model_params * bytes_f32
optimizer_memory = 2 * total_model_params * bytes_f32  # Adam: m and v
activation_memory = BATCH_SIZE * MAX_SEQ_LEN * HIDDEN_DIM * N_LAYERS * 2 * bytes_f32

print(f"Model memory (device, all float32):")
print(f"  Model weights: {model_memory/1e9:.2f} GB ({total_model_params:,} params)")
print(f"  Optimizer:     {optimizer_memory/1e9:.2f} GB (Adam states)")
print(f"  Activations:   {activation_memory/1e9:.2f} GB (batch={BATCH_SIZE})")
print(f"  {'─'*40}")
print(f"  Total:         {(model_memory + optimizer_memory + activation_memory)/1e9:.2f} GB")
print()

# Peak RAM
corpus_memory = 1371328 * 8  # Corpus token count x 8 bytes per int64 token
misc_overhead = 1e9
peak_ram = total_recording + model_memory + optimizer_memory + activation_memory + corpus_memory + misc_overhead

print(f"Peak RAM estimate:")
print(f"  Recording:     {total_recording/1e9:.2f} GB")
print(f"  Model+opt+act: {(model_memory + optimizer_memory + activation_memory)/1e9:.2f} GB")
print(f"  Corpus+misc:   {(corpus_memory + misc_overhead)/1e9:.2f} GB")
print(f"  {'─'*40}")
print(f"  Total:         {peak_ram/1e9:.2f} GB")
print()

# Disk space
disk_needed = total_recording + 1e6
print(f"Disk space needed:")
print(f"  Safetensors:   {disk_needed/1e9:.2f} GB")
print()

# Safety verdict
print(f"{'='*80}")
if peak_ram <= 24e9:
    print(f"✓ SAFE: Peak RAM ({peak_ram/1e9:.1f} GB) within 24 GB budget")
else:
    print(f"⚠️  WARNING: Peak RAM ({peak_ram/1e9:.1f} GB) exceeds 24 GB budget!")
print(f"{'='*80}\n")


MEMORY & DISK SAFETY CHECK

Recording tensors (CPU memory, all float32):
  W:         2.56 GB
  grad_W:    2.56 GB
  momentum:  2.56 GB
  variance:  2.56 GB
  losses:    0.0000 GB
  ────────────────────────────────────────
  Total:     10.25 GB

Model memory (device, all float32):
  Model weights: 0.00 GB (738,304 params)
  Optimizer:     0.01 GB (Adam states)
  Activations:   0.02 GB (batch=128)
  ────────────────────────────────────────
  Total:         0.03 GB

Peak RAM estimate:
  Recording:     10.25 GB
  Model+opt+act: 0.03 GB
  Corpus+misc:   1.01 GB
  ────────────────────────────────────────
  Total:         11.29 GB

Disk space needed:
  Safetensors:   10.25 GB

✓ SAFE: Peak RAM (11.3 GB) within 24 GB budget



## Device Detection

In [4]:
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Set Random Seeds

In [5]:
torch.manual_seed(SEED)
np.random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"✓ Random seed set to {SEED}")

✓ Random seed set to 42


## Load Data

In [6]:
# Tokenizer
print(f"Loading tokenizer: {TOKENIZER_PATH}")
tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
print(f"  ✓ Vocabulary: {tokenizer.get_vocab_size():,} tokens\n")

# Corpus
print(f"Loading corpus: {CORPUS_PATH}")
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()
encoding = tokenizer.encode(corpus_text)
tokens = encoding.ids
corpus_tensor = torch.tensor(tokens, dtype=torch.long)
print(f"  ✓ Tokens: {len(tokens):,}\n")

# Token masks (for analysis later)
print(f"Loading token masks: {TOKEN_MASK_PATH}")
mask_data = load_file(TOKEN_MASK_PATH)
live_mask = mask_data['live_mask'].bool()
dead_mask = mask_data['dead_mask'].bool()
n_live = live_mask.sum().item()
n_dead = dead_mask.sum().item()
print(f"  ✓ Live: {n_live:,} | Dead: {n_dead:,}")

Loading tokenizer: ../data/flannel_tokenizer_chars.json
  ✓ Vocabulary: 10,000 tokens

Loading corpus: ../data/flannel_model_corpus.txt
  ✓ Tokens: 1,371,328

Loading token masks: ../tensors/Flannel/live_dead_tokens.safetensors
  ✓ Live: 6,301 | Dead: 3,699


## Dataset and DataLoader

In [7]:
class TokenDataset(Dataset):
    def __init__(self, corpus_tensor, max_seq_len):
        self.corpus = corpus_tensor
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.corpus) - self.max_seq_len)
    
    def __getitem__(self, idx):
        chunk = self.corpus[idx : idx + self.max_seq_len + 1]
        return {
            'input_ids': chunk[:-1],
            'labels': chunk[1:]
        }

dataset = TokenDataset(corpus_tensor, MAX_SEQ_LEN)

# DataLoader with deterministic sampling
def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(SEED)

dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    generator=g,
    worker_init_fn=seed_worker,
    num_workers=0,  # Single-threaded for reproducibility
)

print(f"\n✓ Dataset: {len(dataset):,} examples")
print(f"✓ DataLoader: {len(dataloader):,} batches per epoch")


✓ Dataset: 1,371,200 examples
✓ DataLoader: 10,713 batches per epoch
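
The two counts above follow directly from the sliding-window definition in `TokenDataset`. The sketch below re-derives them with plain arithmetic and shows the one-token shift between `input_ids` and `labels` on a toy corpus. It uses Python lists instead of tensors, and the helper `window` and the ten-token corpus are illustrative stand-ins, not part of the notebook.

```python
import math

# Values reported by earlier cells in this notebook
n_tokens = 1_371_328
max_seq_len = 128
batch_size = 128

# One example per start index; each window reads max_seq_len + 1 tokens
n_examples = max(0, n_tokens - max_seq_len)
# The DataLoader keeps the final partial batch (drop_last defaults to False)
n_batches = math.ceil(n_examples / batch_size)
print(f"{n_examples:,} examples, {n_batches:,} batches per epoch")

# Toy corpus (hypothetical) to show the one-token shift
toy_corpus = list(range(10))
toy_seq_len = 4

def window(idx):
    """Mirror TokenDataset.__getitem__ using list slicing."""
    chunk = toy_corpus[idx : idx + toy_seq_len + 1]
    return chunk[:-1], chunk[1:]

first_inputs, first_labels = window(0)
last_inputs, last_labels = window(len(toy_corpus) - toy_seq_len - 1)
print(first_inputs, first_labels)
print(last_inputs, last_labels)
```

Because the stride is one token, consecutive examples overlap in all but one position; shuffling is what keeps a batch from being 128 near-copies of the same window.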


## Create Model (FLOAT32 - No Quantization)

In [8]:
print("Creating model...\n")

config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_SEQ_LEN,
    n_embd=HIDDEN_DIM,
    n_layer=N_LAYERS,
    n_head=N_HEADS,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    tie_word_embeddings=True,
)

model = GPT2LMHeadModel(config)

# Initialize embedding weights from a normal distribution with std INIT_SCALE
with torch.no_grad():
    nn.init.normal_(model.transformer.wte.weight, mean=0.0, std=INIT_SCALE)

# Move to device - KEEP AS FLOAT32 (no .to(bfloat16))
model = model.to(device)

# Count parameters
n_params = sum(p.numel() for p in model.parameters())

print(f"  Architecture: {N_LAYERS} layers, {N_HEADS} heads, {HIDDEN_DIM}d embeddings")
print(f"  Parameters: {n_params:,}")
print(f"  Device: {device}")
print(f"  Dtype: {model.transformer.wte.weight.dtype} (FLOAT32 - no quantization)")
print(f"\n✓ Model created")

Creating model...

  Architecture: 2 layers, 2 heads, 64d embeddings
  Parameters: 748,288
  Device: mps
  Dtype: torch.float32 (FLOAT32 - no quantization)

✓ Model created
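
The reported 748,288 parameters is 9,984 more than the 738,304 estimated in the memory check. The sketch below reconciles the two by counting every tensor, assuming the standard GPT-2 block layout (two layer norms, a fused QKV projection, an attention output projection, and a 4x MLP, all with biases) and a tied output head that adds no parameters. The variable names are illustrative.

```python
vocab_size, hidden, n_layers, n_positions = 10_000, 64, 2, 128

wte = vocab_size * hidden       # token embeddings (shared with the LM head)
wpe = n_positions * hidden      # learned position embeddings
layer_norm = 2 * hidden         # weight + bias

per_layer = (
    2 * layer_norm                        # ln_1 and ln_2
    + hidden * 3 * hidden + 3 * hidden    # fused QKV projection
    + hidden * hidden + hidden            # attention output projection
    + hidden * 4 * hidden + 4 * hidden    # MLP up-projection
    + 4 * hidden * hidden + hidden        # MLP down-projection
)

exact = wte + wpe + n_layers * per_layer + layer_norm  # plus final layer norm
estimate = wte + n_layers * 12 * hidden**2             # memory-check formula

print(f"exact:    {exact:,}")
print(f"estimate: {estimate:,}")
print(f"gap:      {exact - estimate:,}")
```

Most of the gap is the position-embedding table (8,192 parameters); biases and layer norms account for the rest. The difference is negligible for the memory budget.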


## Create Optimizer

In [9]:
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    betas=(ADAM_BETA1, ADAM_BETA2),
    eps=ADAM_EPSILON,
    weight_decay=WEIGHT_DECAY,
)

print(f"✓ Optimizer: AdamW")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Betas: ({ADAM_BETA1}, {ADAM_BETA2})")
print(f"  Epsilon: {ADAM_EPSILON}")
print(f"  Weight decay: {WEIGHT_DECAY}")

✓ Optimizer: AdamW
  Learning rate: 0.001
  Betas: (0.9, 0.999)
  Epsilon: 1e-08
  Weight decay: 0.0


## Pre-allocate Recording Tensors (ALL FLOAT32)

In [10]:
print("\nPre-allocating recording tensors...\n")

# ALL FLOAT32 - no dtype conversions
W_history = torch.zeros(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32)
grad_history = torch.zeros(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32)
momentum_history = torch.zeros(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32)
variance_history = torch.zeros(NUM_STEPS+1, VOCAB_SIZE, HIDDEN_DIM, dtype=torch.float32)
loss_history = torch.zeros(NUM_STEPS+1, dtype=torch.float32)

# Memory calculation
bytes_per_elem = 4  # float32
memory_w = W_history.numel() * bytes_per_elem
memory_grad = grad_history.numel() * bytes_per_elem
memory_momentum = momentum_history.numel() * bytes_per_elem
memory_variance = variance_history.numel() * bytes_per_elem
memory_loss = loss_history.numel() * bytes_per_elem
total_memory = memory_w + memory_grad + memory_momentum + memory_variance + memory_loss

print(f"  W:         {tuple(W_history.shape)} (float32) = {memory_w/1e9:.2f} GB")
print(f"  grad_W:    {tuple(grad_history.shape)} (float32) = {memory_grad/1e9:.2f} GB")
print(f"  momentum:  {tuple(momentum_history.shape)} (float32) = {memory_momentum/1e9:.2f} GB")
print(f"  variance:  {tuple(variance_history.shape)} (float32) = {memory_variance/1e9:.2f} GB")
print(f"  losses:    {tuple(loss_history.shape)} (float32) = {memory_loss/1e9:.4f} GB")
print(f"\n  Total: {total_memory/1e9:.2f} GB")
print(f"\n✓ Tensors allocated (all float32)")


Pre-allocating recording tensors...

  W:         (1001, 10000, 64) (float32) = 2.56 GB
  grad_W:    (1001, 10000, 64) (float32) = 2.56 GB
  momentum:  (1001, 10000, 64) (float32) = 2.56 GB
  variance:  (1001, 10000, 64) (float32) = 2.56 GB
  losses:    (1001,) (float32) = 0.0000 GB

  Total: 10.25 GB

✓ Tensors allocated (all float32)


## Training Loop

In [11]:
print(f"\n{'='*80}")
print(f"THIMBLE 2: TRAINING (FLOAT32)")
print(f"{'='*80}\n")

# Record initial state (step 0)
W_history[0] = model.transformer.wte.weight.data.clone().cpu()
loss_history[0] = float('nan')  # No loss before first step
print("✓ Recorded initial state (t=0)\n")

# Create an iterator over the dataloader (restarted below when exhausted)
data_iter = iter(dataloader)

# Training loop
model.train()
start_time = time.time()

for step in tqdm(range(1, NUM_STEPS+1), desc="Training"):
    # Get next batch (cycle through dataset if needed)
    try:
        batch = next(data_iter)
    except StopIteration:
        data_iter = iter(dataloader)
        batch = next(data_iter)
    
    # Move batch to device
    input_ids = batch['input_ids'].to(device)
    labels = batch['labels'].to(device)
    
    # Forward pass
    outputs = model(input_ids=input_ids, labels=labels)
    loss = outputs.loss
    
    # Backward pass
    loss.backward()
    
    # === RECORD GRADIENTS (before optimizer.step) ===
    grad_history[step] = model.transformer.wte.weight.grad.clone().cpu()
    
    # Optimizer step
    optimizer.step()
    optimizer.zero_grad()
    
    # === RECORD WEIGHTS & OPTIMIZER STATE (after optimizer.step) ===
    W_history[step] = model.transformer.wte.weight.data.clone().cpu()
    
    # Get optimizer state for embedding weights
    wte_param = model.transformer.wte.weight
    if wte_param in optimizer.state:
        opt_state = optimizer.state[wte_param]
        momentum_history[step] = opt_state['exp_avg'].clone().cpu()
        variance_history[step] = opt_state['exp_avg_sq'].clone().cpu()
    
    loss_history[step] = loss.item()

elapsed = time.time() - start_time

print(f"\n{'='*80}")
print(f"✓ Training complete")
print(f"  Time: {elapsed:.1f}s ({elapsed/60:.1f} minutes)")
print(f"  Final loss: {loss_history[-1]:.4f}")
print(f"{'='*80}")


THIMBLE 2: TRAINING (FLOAT32)

✓ Recorded initial state (t=0)




`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.



✓ Training complete
  Time: 71.9s (1.2 minutes)
  Final loss: 6.4703


## Validation: Test for BITWISE EQUALITY

This is Gravity Probe B. We're proving Einstein.

If the AdamW formula is correct, and we're doing exactly what PyTorch does, then:

```python
predicted_W[t] == observed_W[t]  # Should be True for every element
```
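
One caveat on "exactly what PyTorch does": the same update can be evaluated in several algebraically equivalent orders, and in float32 those orders need not round identically. The sketch below is a toy, CPU-only illustration and not this notebook's run. It takes a few AdamW steps on a small random matrix with synthetic gradients and replays each step two ways. `replay_reordered` folds the bias corrections into scalars before a single `addcdiv`; that this matches the library's internal ordering is an assumption, and it may vary with the PyTorch version, the `foreach`/`fused` settings, and the backend (this notebook ran on MPS). The matrix shape, gradient scale, and step count are hypothetical.

```python
import torch

torch.manual_seed(0)
lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8

# Toy stand-in for the embedding matrix (hypothetical shape, CPU only)
W = torch.nn.Parameter(torch.randn(1000, 64) * 0.02)
# foreach=False requests the single-tensor code path
opt = torch.optim.AdamW([W], lr=lr, betas=(beta1, beta2), eps=eps,
                        weight_decay=0.0, foreach=False)

def replay_textbook(W_prev, m, v, t):
    """Bias-correct m and v first, as the validation cell does."""
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    dW = -lr * m_hat / (torch.sqrt(v_hat) + eps)
    return W_prev + dW

def replay_reordered(W_prev, m, v, t):
    """Assumed library ordering: fold the corrections into scalars."""
    step_size = lr / (1 - beta1**t)
    denom = v.sqrt() / (1 - beta2**t) ** 0.5 + eps
    return torch.addcdiv(W_prev, m, denom, value=-step_size)

results = []
for t in range(1, 21):
    W_prev = W.detach().clone()
    W.grad = torch.randn_like(W) * 1e-3  # synthetic gradient, not a real loss
    opt.step()
    state = opt.state[W]
    m, v = state["exp_avg"], state["exp_avg_sq"]
    observed = W.detach()
    d_text = (replay_textbook(W_prev, m, v, t) - observed).abs().max().item()
    d_reord = (replay_reordered(W_prev, m, v, t) - observed).abs().max().item()
    results.append((t, d_text, d_reord))

for t, d_text, d_reord in results:
    print(f"t={t:2d}: textbook max_diff={d_text:.2e}, reordered max_diff={d_reord:.2e}")
```

If the two replays disagree with each other at the level of the last bit, a bitwise mismatch against the optimizer does not by itself show that a term is missing from the formula; it may only show that the replay rounds differently.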

In [12]:
print("\nValidating AdamW accounting (testing for bitwise equality)...\n")

# Test at several timesteps
test_steps = [1, 10, 50, 100, 200, 500, 800]

for t in test_steps:
    if t > NUM_STEPS:
        continue
    
    # Get optimizer states
    m_t = momentum_history[t]
    v_t = variance_history[t]
    W_prev = W_history[t-1]
    
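    # Apply the textbook AdamW update with bias correction.
    # WEIGHT_DECAY is 0, so the decoupled decay term drops out.
    bias_correction1 = 1 - ADAM_BETA1**t
    bias_correction2 = 1 - ADAM_BETA2**t
    m_hat = m_t / bias_correction1
    v_hat = v_t / bias_correction2
    
    # Compute predicted W[t] on CPU from the recorded float32 tensors.
    # This order of operations is algebraically equivalent to the optimizer's,
    # but it is not guaranteed to round identically in the last bit.
    dW = -LEARNING_RATE * m_hat / (torch.sqrt(v_hat) + ADAM_EPSILON)
    predicted_W = W_prev + dW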
    
    # Observed W[t]
    observed_W = W_history[t]
    
    # Test for exact equality
    exactly_equal = torch.equal(predicted_W, observed_W)
    max_abs_diff = torch.max(torch.abs(predicted_W - observed_W)).item()
    
    # Also compute approximate metrics for context
    norm_pred = torch.norm(predicted_W - W_prev)
    norm_obs = torch.norm(observed_W - W_prev)
    ratio = norm_pred / norm_obs
    
    print(f"t={t:3d}: exact_match={exactly_equal}, max_diff={max_abs_diff:.2e}, ratio={ratio:.6f}")

print("\nIf exact_match=True for all timesteps, we have proven the formula.")
print("If not, we're still missing something fundamental.")
print("\n✓ Validation complete")


Validating AdamW accounting (testing for bitwise equality)...

t=  1: exact_match=False, max_diff=7.45e-09, ratio=1.000000
t= 10: exact_match=False, max_diff=7.45e-09, ratio=1.000000
t= 50: exact_match=False, max_diff=7.45e-09, ratio=1.000000
t=100: exact_match=False, max_diff=7.45e-09, ratio=1.000000
t=200: exact_match=False, max_diff=7.45e-09, ratio=1.000000
t=500: exact_match=False, max_diff=1.49e-08, ratio=1.000000
t=800: exact_match=False, max_diff=2.98e-08, ratio=1.000000

If exact_match=True for all timesteps, we have proven the formula.
If not, we're still missing something fundamental.

✓ Validation complete
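
The reported `max_diff` values sit very close to exact powers of two: 7.45e-09 ≈ 2^-27, 1.49e-08 ≈ 2^-26, and 2.98e-08 ≈ 2^-25. One reading, offered as a hypothesis rather than a conclusion, is that prediction and observation differ only in the last bit or so of a float32 value. The sketch below computes the float32 spacing (ULP) at a given magnitude using only the standard library. The helper `float32_ulp` and the sample weight magnitudes are illustrative and were not read from `W_history`.

```python
import math

def float32_ulp(x: float) -> float:
    """Gap between adjacent float32 values at magnitude |x| (normal range only)."""
    _, exponent = math.frexp(abs(x))  # abs(x) == m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)     # float32 carries 24 significand bits

# max_diff values printed by the validation cell above
reported = {1: 7.45e-09, 500: 1.49e-08, 800: 2.98e-08}
nearest_pow2 = {t: round(math.log2(d)) for t, d in reported.items()}

# Illustrative weight magnitudes (hypothetical)
ulps = {w: float32_ulp(w) for w in (0.02, 0.1, 0.2, 0.3)}

for t, k in nearest_pow2.items():
    print(f"t={t:3d}: max_diff ~ 2**{k} = {2.0**k:.3e}")
for w, u in ulps.items():
    print(f"|w| = {w}: one float32 ULP = {u:.3e}")
```

For weights with magnitude between 0.0625 and 0.125, one ULP is 2^-27, which matches the `max_diff` reported from t=1 through t=200. The larger values at t=500 and t=800 would fit either larger weights or a gap of two to four ULPs. Comparing `max_diff` with the ULP at the location of the largest difference in `W_history` would tell the two apart.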


## Save Data

In [13]:
print(f"\nSaving data to {OUTPUT_PATH}...\n")

# Create output directory if needed
Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)

# Build save dictionary
save_dict = {
    # Training trajectories (all float32)
    'W': W_history,
    'grad_W': grad_history,
    'momentum_W': momentum_history,
    'variance_W': variance_history,
    'losses': loss_history,
    
    # Model hyperparameters
    'vocab_size': torch.tensor(VOCAB_SIZE, dtype=torch.long),
    'hidden_dim': torch.tensor(HIDDEN_DIM, dtype=torch.long),
    'n_layers': torch.tensor(N_LAYERS, dtype=torch.long),
    'n_heads': torch.tensor(N_HEADS, dtype=torch.long),
    
    # Training hyperparameters
    'num_steps': torch.tensor(NUM_STEPS, dtype=torch.long),
    'batch_size': torch.tensor(BATCH_SIZE, dtype=torch.long),
    'learning_rate': torch.tensor(LEARNING_RATE, dtype=torch.float32),
    'weight_decay': torch.tensor(WEIGHT_DECAY, dtype=torch.float32),
    'adam_beta1': torch.tensor(ADAM_BETA1, dtype=torch.float32),
    'adam_beta2': torch.tensor(ADAM_BETA2, dtype=torch.float32),
    'adam_epsilon': torch.tensor(ADAM_EPSILON, dtype=torch.float32),
    'init_scale': torch.tensor(INIT_SCALE, dtype=torch.float32),
    'seed': torch.tensor(SEED, dtype=torch.long),
    
    # Token counts
    'n_live': torch.tensor(n_live, dtype=torch.long),
    'n_dead': torch.tensor(n_dead, dtype=torch.long),
}

# Save
save_start = time.time()
save_file(save_dict, str(OUTPUT_PATH))
save_elapsed = time.time() - save_start

# File size
file_size_bytes = Path(OUTPUT_PATH).stat().st_size
file_size_gb = file_size_bytes / 1e9

print(f"✓ Saved successfully")
print(f"  File: {Path(OUTPUT_PATH).name}")
print(f"  Size: {file_size_gb:.2f} GB")
print(f"  Save time: {save_elapsed:.1f}s")


Saving data to ../tensors/Thimble/thimble_2.safetensors...

✓ Saved successfully
  File: thimble_2.safetensors
  Size: 10.25 GB
  Save time: 8.2s


## Summary

In [14]:
print(f"\n{'='*80}")
print(f"THIMBLE 2 COMPLETE: GRAVITY PROBE B")
print(f"{'='*80}\n")

print(f"Trained language model for {NUM_STEPS:,} steps in FULL FLOAT32")
print(f"  Seed: {SEED}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print()
print(f"Recorded at every step (all float32):")
print(f"  • W: embedding weights")
print(f"  • grad_W: gradients")
print(f"  • momentum_W: Adam exp_avg")
print(f"  • variance_W: Adam exp_avg_sq")
print(f"  • losses: training loss")
print()
print(f"Data saved: {OUTPUT_PATH}")
print(f"  Size: {file_size_gb:.2f} GB")
print(f"  Training time: {elapsed/60:.1f} minutes")
print()
print(f"Next: Analyze in separate notebook to test for BITWISE EQUALITY.")
print(f"If predicted_W[t] == observed_W[t], we have proven the formula.")
print(f"\n{'='*80}")


THIMBLE 2 COMPLETE: GRAVITY PROBE B

Trained language model for 1,000 steps in FULL FLOAT32
  Seed: 42
  Batch size: 128
  Learning rate: 0.001
  Weight decay: 0.0

Recorded at every step (all float32):
  • W: embedding weights
  • grad_W: gradients
  • momentum_W: Adam exp_avg
  • variance_W: Adam exp_avg_sq
  • losses: training loss

Data saved: ../tensors/Thimble/thimble_2.safetensors
  Size: 10.25 GB
  Training time: 1.2 minutes

Next: Analyze in separate notebook to test for BITWISE EQUALITY.
If predicted_W[t] == observed_W[t], we have proven the formula.

