# 11.2b: Synthetic Jiggling Test

**Hypothesis:** Can we MANUFACTURE black hole diffusion by artificially adding random noise?

## Background

We've ruled out:
- Extended training time (100k steps, no diffusion)
- RMSNorm vs LayerNorm (11.2a, null result)

Yet somehow, Qwen's post-training *consolidates* black holes (Base: 62→13, Qwen2.5: 65→60).

**New approach:** Let's cheat. Add explicit random noise of order ε after each optimizer step and see if we can *force* black holes to break apart. If we can manufacture the effect, we can reverse-engineer what natural mechanism might produce similar perturbations.

## Experimental Design

**Control (11.2a):** GPT-2 with RMSNorm → C=1, P=51 (no diffusion)

**Treatment:** Same model (LayerNorm, not RMSNorm) but with synthetic noise injection:
```python
embeddings += torch.randn_like(embeddings) * epsilon
```

**ε scale:** 3×10⁻⁵ (bfloat16 ULP at black hole magnitude ~0.005)

## Success Criteria

After 10,000 training steps:
- **Null result:** C = 1, P = 51 (even synthetic noise doesn't break symmetry)
- **Positive result:** C > 1 or P < 51 (noise CAN cause diffusion → look for natural source)

## Parameters

In [46]:
# Model architecture (same as 11.2a)
VOCAB_SIZE = 128      # ASCII tokens
HIDDEN_DIM = 64       # Embedding dimension
N_LAYER = 2           # Transformer layers
N_HEAD = 2            # Attention heads
MAX_SEQ_LEN = 128     # Context window

# Initialization
INIT_MODE = "qwen"    # All tokens start at same point

# Training
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 0.01
BATCH_SIZE = 32
NUM_TRAIN_STEPS = 10000

# Synthetic noise injection
EPSILON = 3e-5  # bfloat16 ULP scale
INJECT_NOISE = True

# Data
CORPUS_PATH = "../data/training_corpus.txt"
OUTPUT_DIR = "../data/embeddings_128vocab_synthetic_jiggling"
OUTPUT_FILE = "embedding_evolution.safetensors"

RANDOM_SEED = 42

## Imports

In [47]:
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments, TrainerCallback
from torch.utils.data import Dataset
import numpy as np
from pathlib import Path
from safetensors.torch import save_file
from typing import Optional, Tuple

torch.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

## Detect Hardware

In [48]:
if torch.cuda.is_available():
    DEVICE = "cuda"
    print(f"Using CUDA: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    DEVICE = "mps"
    print("Using MPS (Apple Silicon)")
else:
    DEVICE = "cpu"
    print("Using CPU")

Using MPS (Apple Silicon)


## Load Training Corpus

In [49]:
print(f"Loading corpus from: {CORPUS_PATH}\n")

with open(CORPUS_PATH, 'r', encoding='ascii') as f:
    corpus_text = f.read()

# Convert to bytes and filter to vocab size
corpus_bytes = [b for b in corpus_text.encode('ascii') if b < VOCAB_SIZE]

print(f"✓ Corpus loaded")
print(f"Total bytes: {len(corpus_bytes):,}")
print(f"Vocabulary size: {VOCAB_SIZE}")

# Count unique bytes
unique_bytes = set(corpus_bytes)
dead_tokens = VOCAB_SIZE - len(unique_bytes)

print(f"Unique bytes in corpus: {len(unique_bytes)}")
print(f"Dead tokens (never appear): {dead_tokens} ({100 * dead_tokens / VOCAB_SIZE:.1f}%)")

Loading corpus from: ../data/training_corpus.txt

✓ Corpus loaded
Total bytes: 265,905
Vocabulary size: 128
Unique bytes in corpus: 77
Dead tokens (never appear): 51 (39.8%)


## Create Dataset

In [50]:
class ByteDataset(Dataset):
    """Dataset for byte-level language modeling."""
    def __init__(self, byte_sequence, max_seq_len):
        self.byte_sequence = byte_sequence
        self.max_seq_len = max_seq_len
    
    def __len__(self):
        return max(0, len(self.byte_sequence) - self.max_seq_len)
    
    def __getitem__(self, idx):
        chunk = self.byte_sequence[idx : idx + self.max_seq_len + 1]
        return {
            'input_ids': torch.tensor(chunk[:-1], dtype=torch.long),
            'labels': torch.tensor(chunk[1:], dtype=torch.long)
        }

dataset = ByteDataset(corpus_bytes, MAX_SEQ_LEN)
print(f"\n✓ Dataset created")
print(f"Training examples: {len(dataset):,}")


✓ Dataset created
Training examples: 265,777


## Create Model (Standard GPT-2)

In [51]:
# Create standard GPT-2 config (LayerNorm, not RMSNorm)
config = GPT2Config(
    vocab_size=VOCAB_SIZE,
    n_positions=MAX_SEQ_LEN,
    n_embd=HIDDEN_DIM,
    n_layer=N_LAYER,
    n_head=N_HEAD,
    resid_pdrop=0.0,
    embd_pdrop=0.0,
    attn_pdrop=0.0,
    tie_word_embeddings=True,
)

# Create model
model = GPT2LMHeadModel(config)

# Convert to bfloat16
model = model.to(torch.bfloat16)

total_params = sum(p.numel() for p in model.parameters())
print(f"\n✓ Model created (bfloat16, standard LayerNorm)")
print(f"Total parameters: {total_params:,}")


✓ Model created (bfloat16, standard LayerNorm)
Total parameters: 116,480


## Apply Qwen Initialization

In [52]:
print(f"\nApplying Qwen-style initialization (singular vector)...")

with torch.no_grad():
    # Generate one random unit vector
    random_vector = torch.randn(HIDDEN_DIM)
    random_vector = random_vector / random_vector.norm()
    
    # Set ALL embedding vectors to this single vector
    model.transformer.wte.weight[:] = random_vector

print(f"✓ All {VOCAB_SIZE} tokens initialized to same random unit vector")
print(f"  Initial vector norm: {random_vector.norm().item():.6f}")


Applying Qwen-style initialization (singular vector)...
✓ All 128 tokens initialized to same random unit vector
  Initial vector norm: 1.000000


## Pre-allocate Embedding History

In [53]:
# Pre-allocate tensor for all snapshots
embedding_history = torch.zeros(
    (NUM_TRAIN_STEPS + 1, VOCAB_SIZE, HIDDEN_DIM),
    dtype=torch.bfloat16
)

# Save initial state
embedding_history[0] = model.transformer.wte.weight.data.clone().cpu()

print(f"\n✓ Pre-allocated embedding history")
print(f"  Shape: {embedding_history.shape}")
print(f"  Memory: {embedding_history.element_size() * embedding_history.numel() / 1e6:.1f} MB")


✓ Pre-allocated embedding history
  Shape: torch.Size([10001, 128, 64])
  Memory: 163.9 MB


## Define Callback with Noise Injection

In [None]:
class NoiseInjectionCallback(TrainerCallback):
    """Inject synthetic noise after each optimizer step."""
    
    def __init__(self, embedding_history, epsilon, inject_noise=True):
        self.embedding_history = embedding_history
        self.epsilon = epsilon
        self.inject_noise = inject_noise
    
    def on_step_end(self, args, state, control, model=None, **kwargs):
        step = state.global_step
        
        # INJECT SYNTHETIC NOISE (the cheating part)
        if self.inject_noise:
            with torch.no_grad():
                embeddings = model.transformer.wte.weight
                noise = torch.randn_like(embeddings) * (self.epsilon / 10.0)
                embeddings.add_(noise)
        
        # Store in memory
        self.embedding_history[step] = model.transformer.wte.weight.data.clone().cpu()
        
        # Print progress every 1000 steps
        if step % 1000 == 0 and step > 0:
            embeddings = self.embedding_history[step]
            centroid_norm = embeddings.mean(dim=0).norm().item()
            marker = "[+NOISE]" if self.inject_noise else ""
            print(f"[Step {step:5d}] Centroid norm: {centroid_norm:.6f} {marker}")
        
        return control

print(f"✓ Callback defined (ε = {EPSILON:.2e})")
if INJECT_NOISE:
    print(f"  Noise injection ENABLED")
else:
    print(f"  Noise injection DISABLED (control run)")

✓ Callback defined (ε = 3.00e-05)
  Noise injection ENABLED


## Configure Training

In [55]:
training_args = TrainingArguments(
    output_dir="./training_output_synthetic_jiggling",
    max_steps=NUM_TRAIN_STEPS,
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    logging_steps=100,
    save_steps=NUM_TRAIN_STEPS + 1,  # Don't save checkpoints
    save_total_limit=0,
    seed=RANDOM_SEED,
    dataloader_num_workers=0,
    use_cpu=(DEVICE == "cpu"),
    bf16=True,
    report_to="none",
)

print("Training configuration:")
print(f"  Device: {DEVICE}")
print(f"  Steps: {NUM_TRAIN_STEPS:,}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Weight decay: {WEIGHT_DECAY}")
print(f"  Noise epsilon: {EPSILON:.2e}" if INJECT_NOISE else "  No noise injection")

Training configuration:
  Device: mps
  Steps: 10,000
  Batch size: 32
  Learning rate: 0.001
  Weight decay: 0.01
  Noise epsilon: 3.00e-05


## Create Trainer and Train

In [56]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    callbacks=[NoiseInjectionCallback(embedding_history, EPSILON, INJECT_NOISE)],
)

print("\n" + "="*80)
print("Starting training with synthetic noise injection...")
print("="*80 + "\n")

trainer.train()

print("\n" + "="*80)
print("✓ Training complete!")
print("="*80)


Starting training with synthetic noise injection...





Step,Training Loss
100,3.3351
200,3.078
300,2.9975
400,2.8917
500,2.8414
600,2.8192
700,2.7929
800,2.7537
900,2.719
1000,2.6799


[Step  1000] Centroid norm: 0.824219 [+NOISE]
[Step  2000] Centroid norm: 0.824219 [+NOISE]
[Step  3000] Centroid norm: 0.828125 [+NOISE]
[Step  4000] Centroid norm: 0.824219 [+NOISE]
[Step  5000] Centroid norm: 0.824219 [+NOISE]
[Step  6000] Centroid norm: 0.824219 [+NOISE]
[Step  7000] Centroid norm: 0.824219 [+NOISE]
[Step  8000] Centroid norm: 0.824219 [+NOISE]
[Step  9000] Centroid norm: 0.824219 [+NOISE]
[Step 10000] Centroid norm: 0.824219 [+NOISE]

✓ Training complete!


## Save Embedding History

In [57]:
output_path = Path(OUTPUT_DIR)
output_path.mkdir(parents=True, exist_ok=True)
output_file = output_path / OUTPUT_FILE

print(f"\nSaving embedding history to {output_file}...")
save_file({'embedding_history': embedding_history}, output_file)

print(f"✓ Saved {output_file.stat().st_size / 1e6:.1f} MB")


Saving embedding history to ../data/embeddings_128vocab_synthetic_jiggling/embedding_evolution.safetensors...
✓ Saved 163.9 MB


## Analyze Results: Count Black Holes

In [58]:
from collections import Counter

# Get final embeddings
final_embeddings = embedding_history[-1]

# Find unique vectors
unique_vectors, inverse_indices = torch.unique(
    final_embeddings,
    dim=0,
    return_inverse=True
)

# Count populations
vector_populations = Counter(inverse_indices.tolist())

# Black holes are vectors with population ≥ 2
black_holes = {vec_id: pop for vec_id, pop in vector_populations.items() if pop >= 2}

C = len(black_holes)  # Black hole count
P = sum(black_holes.values())  # Total population

print(f"\n{'='*80}")
print(f"BLACK HOLE ANALYSIS")
print(f"{'='*80}")
print(f"Black hole count (C): {C}")
print(f"Total black hole population (P): {P}")
print(f"Dead tokens in corpus: {dead_tokens}")
print(f"\nExpected (no noise): C = 1, P = 51")
print(f"Observed (synthetic noise ε={EPSILON:.2e}): C = {C}, P = {P}")
print(f"{'='*80}")

if C > 1 or P < dead_tokens:
    print(f"\n✓ POSITIVE RESULT: Synthetic noise causes black hole diffusion!")
    if C > 1:
        print(f"  → Black holes fragmented: {C} distinct clusters")
    if P < dead_tokens:
        print(f"  → Tokens escaped: {dead_tokens - P} tokens rescued from black holes")
    print(f"\nImplication: A natural noise source of order ε~{EPSILON:.2e} could explain")
    print(f"Qwen's black hole behavior. Look for:")
    print(f"  - Gradient noise from large batch variance")
    print(f"  - Quantization noise in mixed-precision training")
    print(f"  - Stochastic rounding in optimizer")
    print(f"  - Numerical instability in normalization layers")
else:
    print(f"\n✗ NULL RESULT: Even synthetic noise doesn't break black holes")
    print(f"  Mechanism must be more subtle than simple additive noise")
    print(f"  Consider:")
    print(f"    - Different noise scale (try larger ε)")
    print(f"    - Directional bias (not isotropic)")
    print(f"    - Noise in gradient space rather than parameter space")


BLACK HOLE ANALYSIS
Black hole count (C): 2
Total black hole population (P): 51
Dead tokens in corpus: 51

Expected (no noise): C = 1, P = 51
Observed (synthetic noise ε=3.00e-05): C = 2, P = 51

✓ POSITIVE RESULT: Synthetic noise causes black hole diffusion!
  → Black holes fragmented: 2 distinct clusters

Implication: A natural noise source of order ε~3.00e-05 could explain
Qwen's black hole behavior. Look for:
  - Gradient noise from large batch variance
  - Quantization noise in mixed-precision training
  - Stochastic rounding in optimizer
  - Numerical instability in normalization layers


## Black Hole Details

In [59]:
if C > 0:
    sorted_bhs = sorted(black_holes.items(), key=lambda x: x[1], reverse=True)
    
    print(f"\nBlack hole populations (sorted by size):\n")
    for i, (vec_id, pop) in enumerate(sorted_bhs, 1):
        print(f"BH #{i}: {pop} tokens")
    
    print(f"\nLargest: {sorted_bhs[0][1]} tokens")
    if C > 1:
        print(f"Smallest: {sorted_bhs[-1][1]} tokens")


Black hole populations (sorted by size):

BH #1: 31 tokens
BH #2: 20 tokens

Largest: 31 tokens
Smallest: 20 tokens


## Summary

In [60]:
initial_centroid = embedding_history[0].mean(dim=0)
final_centroid = final_embeddings.mean(dim=0)
displacement = (final_centroid - initial_centroid).norm().item()

print(f"\n{'='*80}")
print(f"EXPERIMENT SUMMARY")
print(f"{'='*80}")
print(f"Model: GPT-2 (standard LayerNorm)")
print(f"Training steps: {NUM_TRAIN_STEPS:,}")
print(f"Dead tokens: {dead_tokens}")
print(f"Synthetic noise: {'ENABLED' if INJECT_NOISE else 'DISABLED'}")
if INJECT_NOISE:
    print(f"  Epsilon (ε): {EPSILON:.2e}")
print(f"\nResults:")
print(f"  Black hole count: {C}")
print(f"  Black hole population: {P}")
print(f"  Centroid displacement: {displacement:.6f}")
print(f"\nConclusion:")
if C > 1 or P < dead_tokens:
    print(f"  Synthetic noise CAN break black hole symmetry")
    print(f"  Natural noise source at scale ε~{EPSILON:.2e} likely responsible in Qwen")
else:
    print(f"  Noise at ε~{EPSILON:.2e} insufficient to cause diffusion")
    print(f"  Mechanism remains unknown")
print(f"{'='*80}")


EXPERIMENT SUMMARY
Model: GPT-2 (standard LayerNorm)
Training steps: 10,000
Dead tokens: 51
Synthetic noise: ENABLED
  Epsilon (ε): 3.00e-05

Results:
  Black hole count: 2
  Black hole population: 51
  Centroid displacement: 0.470703

Conclusion:
  Synthetic noise CAN break black hole symmetry
  Natural noise source at scale ε~3.00e-05 likely responsible in Qwen
