# 08.1a: Training Data Preparation

**Download and prepare an ASCII training corpus for the embedding evolution experiment**

We're testing two hypotheses about embedding matrix evolution during training:

1. **Normal initialization**: a soft explosion, in which the embedding cloud grows and its centroid random-walks
2. **Qwen initialization**: a violent explosion from a singular point, with dead tokens left behind as black holes

To test this, we'll train a tiny transformer (256-token byte-level vocab, 64-dim hidden space) on pure ASCII text. This guarantees at least 50% dead tokens: bytes 128-255 never appear in training, and any ASCII bytes missing from the corpus add to that count.
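
To make the dead-token claim concrete, here is a minimal sketch of byte-level tokenization, assuming token IDs are simply the raw byte values. The `tokenize` helper and the sample string are illustrative and not part of the experiment code.

```python
def tokenize(text: str) -> list[int]:
    """Map ASCII text to byte-level token IDs (hypothetical helper)."""
    return list(text.encode("ascii"))


sample = "In my younger and more vulnerable years"
token_ids = tokenize(sample)

# Every ID falls in 0..127, so IDs 128..255 can never be produced from ASCII text.
max_id = max(token_ids)
dead_floor = 128 / 256  # lower bound on the dead fraction of a 256-entry vocab
print(f"max token ID: {max_id}, dead fraction >= {dead_floor:.0%}")
```

Because the upper half of the vocabulary is unreachable, 50% is only a floor; the actual dead fraction depends on which ASCII bytes the corpus happens to use.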

## Corpus Choice

**The Great Gatsby** by F. Scott Fitzgerald
- Public domain (published 1925)
- ~270 KB of text after cleanup
- Rich vocabulary and varied sentence structure
- Available from Project Gutenberg

## Parameters

In [1]:
# Source
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"

# Output
OUTPUT_PATH = "../data/training_corpus.txt"

RANDOM_SEED = 42

## Imports

In [2]:
import requests
import numpy as np
from collections import Counter
import os

np.random.seed(RANDOM_SEED)

## Download Gatsby

In [3]:
print(f"Downloading from Project Gutenberg...\n")
print(f"URL: {GUTENBERG_URL}")

# Set a timeout so the request fails instead of hanging if the server is unresponsive
response = requests.get(GUTENBERG_URL, timeout=30)
response.raise_for_status()

raw_text = response.text

print(f"\n✓ Downloaded successfully")
print(f"Raw size: {len(raw_text):,} characters")

Downloading from Project Gutenberg...

URL: https://www.gutenberg.org/cache/epub/64317/pg64317.txt

✓ Downloaded successfully
Raw size: 296,858 characters


## Strip Gutenberg Header/Footer

Project Gutenberg files include licensing headers and footers. We'll remove them to get just the novel text.

In [4]:
# Find start marker (typical Gutenberg format)
start_markers = [
    "*** START OF THE PROJECT GUTENBERG EBOOK",
    "*** START OF THIS PROJECT GUTENBERG EBOOK",
    "***START OF THE PROJECT GUTENBERG EBOOK"
]

start_idx = 0
for marker in start_markers:
    idx = raw_text.find(marker)
    if idx != -1:
        # Skip past the marker line
        start_idx = raw_text.find('\n', idx) + 1
        break

# Find end marker
end_markers = [
    "*** END OF THE PROJECT GUTENBERG EBOOK",
    "*** END OF THIS PROJECT GUTENBERG EBOOK",
    "***END OF THE PROJECT GUTENBERG EBOOK"
]

end_idx = len(raw_text)
for marker in end_markers:
    idx = raw_text.find(marker)
    if idx != -1:
        end_idx = idx
        break

# Extract just the novel
clean_text = raw_text[start_idx:end_idx].strip()

print(f"Stripped header/footer")
print(f"Clean text size: {len(clean_text):,} characters")
print(f"Removed: {len(raw_text) - len(clean_text):,} characters")

Stripped header/footer
Clean text size: 277,090 characters
Removed: 19,768 characters
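
The cell above silently falls back to the full text when a marker is missing. As a hedged alternative, the same logic can be wrapped in a function that fails loudly instead. The name `strip_gutenberg` and the sample text are assumptions made for illustration.

```python
START_MARKERS = (
    "*** START OF THE PROJECT GUTENBERG EBOOK",
    "*** START OF THIS PROJECT GUTENBERG EBOOK",
    "***START OF THE PROJECT GUTENBERG EBOOK",
)
END_MARKERS = (
    "*** END OF THE PROJECT GUTENBERG EBOOK",
    "*** END OF THIS PROJECT GUTENBERG EBOOK",
    "***END OF THE PROJECT GUTENBERG EBOOK",
)


def strip_gutenberg(raw: str) -> str:
    """Return the text between the start and end markers (hypothetical helper)."""
    start = next((raw.find(m) for m in START_MARKERS if m in raw), -1)
    end = next((raw.find(m) for m in END_MARKERS if m in raw), -1)
    if start == -1 or end == -1:
        raise ValueError("Gutenberg start/end marker not found")
    # Skip the remainder of the start-marker line.
    newline = raw.find("\n", start)
    body_start = newline + 1 if newline != -1 else len(raw)
    return raw[body_start:end].strip()


sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "Chapter text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE GREAT GATSBY ***\n"
    "License footer\n"
)
print(strip_gutenberg(sample))
```

Raising on a missing marker keeps licensing text from leaking into the training corpus unnoticed if Project Gutenberg changes its marker format.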


## Verify ASCII Purity

Check that all characters are valid ASCII (bytes 0-127). If not, we strip the non-ASCII characters. Stripping deletes them outright rather than mapping them to ASCII equivalents.

In [5]:
# Convert to bytes and check
text_bytes = clean_text.encode('utf-8', errors='replace')

# Find non-ASCII bytes
non_ascii = [b for b in text_bytes if b > 127]

print(f"Total bytes: {len(text_bytes):,}")
print(f"Non-ASCII bytes: {len(non_ascii):,}")

if non_ascii:
    print(f"\nNon-ASCII byte values: {set(non_ascii)}")
    print(f"\n⚠ Text contains non-ASCII characters")
    print(f"Converting to pure ASCII...")
    
    # Strip non-ASCII
    ascii_text = clean_text.encode('ascii', errors='ignore').decode('ascii')
    text_bytes = ascii_text.encode('ascii')
    
    print(f"✓ Converted to pure ASCII")
    print(f"Final size: {len(text_bytes):,} bytes")
else:
    print(f"\n✓ Text is pure ASCII")
    ascii_text = clean_text

Total bytes: 286,640
Non-ASCII bytes: 14,335

Non-ASCII byte values: {128, 226, 195, 166, 167, 169, 138, 170, 148, 180, 152, 153, 156, 157}

⚠ Text contains non-ASCII characters
Converting to pure ASCII...
✓ Converted to pure ASCII
Final size: 272,305 bytes
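
The byte values reported above include the UTF-8 encodings of typographic quotes, the em dash, and a few accented letters, all of which `errors='ignore'` deletes. If preserving punctuation matters, one alternative is to transliterate before encoding. The sketch below is not what this notebook does, and using it would change the byte statistics that follow; the `to_ascii` helper and its mapping table are assumptions for illustration.

```python
import unicodedata

# Assumed mapping; extend it for whatever non-ASCII characters a corpus contains.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2014": "--",  # em dash
    "\u00e6": "ae",  # ae ligature (has no Unicode decomposition)
})


def to_ascii(text: str) -> str:
    """Transliterate common typographic characters, then drop what remains."""
    text = text.translate(PUNCT_MAP)
    # NFKD splits accented letters into a base letter plus combining marks;
    # the combining marks are non-ASCII and are removed by errors="ignore".
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


sample = "\u201cDon\u2019t,\u201d she said\u2014the caf\u00e9 was closed."
print(to_ascii(sample))
```

With this approach the straight quote bytes 34 and 39 would appear in the corpus instead of being dead.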


## Byte Statistics

Analyze which bytes appear and their frequencies. This tells us:
- How many unique bytes are in the corpus (at most 128)
- Which ASCII bytes are dead (never appear)
- Frequency distribution (some will be common, some rare)

In [6]:
byte_counts = Counter(text_bytes)

# Unique bytes that appear
present_bytes = set(byte_counts.keys())
n_present = len(present_bytes)

# ASCII bytes that never appear
all_ascii = set(range(128))
dead_ascii = all_ascii - present_bytes
n_dead_ascii = len(dead_ascii)

# High bytes (128-255) never appear by construction
n_dead_high = 128

print(f"Byte statistics:")
print(f"  Total bytes in corpus: {len(text_bytes):,}")
print(f"  Unique bytes present: {n_present} / 128 ASCII")
print(f"  Dead ASCII bytes (0-127): {n_dead_ascii}")
print(f"  Dead high bytes (128-255): {n_dead_high}")
print(f"  Total dead in 256-byte vocab: {n_dead_ascii + n_dead_high} ({100 * (n_dead_ascii + n_dead_high) / 256:.1f}%)")

# Show most/least common
most_common = byte_counts.most_common(10)
least_common = byte_counts.most_common()[-10:]

print(f"\nMost common bytes:")
for byte_val, count in most_common:
    char = chr(byte_val) if 32 <= byte_val < 127 else f"\\x{byte_val:02x}"
    print(f"  {byte_val:3d} ({char:>4s}): {count:6,} times ({100 * count / len(text_bytes):5.2f}%)")

print(f"\nLeast common bytes:")
for byte_val, count in least_common:
    char = chr(byte_val) if 32 <= byte_val < 127 else f"\\x{byte_val:02x}"
    print(f"  {byte_val:3d} ({char:>4s}): {count:6,} times ({100 * count / len(text_bytes):5.2f}%)")

Byte statistics:
  Total bytes in corpus: 272,305
  Unique bytes present: 78 / 128 ASCII
  Dead ASCII bytes (0-127): 50
  Dead high bytes (128-255): 128
  Total dead in 256-byte vocab: 178 (69.5%)

Most common bytes:
   32 (    ): 44,251 times (16.25%)
  101 (   e): 25,012 times ( 9.19%)
  116 (   t): 18,098 times ( 6.65%)
   97 (   a): 16,843 times ( 6.19%)
  111 (   o): 15,739 times ( 5.78%)
  110 (   n): 14,065 times ( 5.17%)
  105 (   i): 12,532 times ( 4.60%)
  115 (   s): 12,370 times ( 4.54%)
  104 (   h): 12,240 times ( 4.49%)
  114 (   r): 11,342 times ( 4.17%)

Least common bytes:
   81 (   Q):      4 times ( 0.00%)
   50 (   2):      4 times ( 0.00%)
   56 (   8):      3 times ( 0.00%)
   88 (   X):      2 times ( 0.00%)
   55 (   7):      2 times ( 0.00%)
   52 (   4):      2 times ( 0.00%)
   91 (   [):      2 times ( 0.00%)
   93 (   ]):      2 times ( 0.00%)
   36 (   $):      2 times ( 0.00%)
   90 (   Z):      1 times ( 0.00%)
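
For later comparison against the 256 rows of the embedding matrix, it may be convenient to hold the counts as a fixed-length vector indexed by token ID. A minimal sketch, assuming a hypothetical `byte_histogram` helper and a short sample in place of the real corpus:

```python
def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-entry list where entry i counts occurrences of byte i."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts


sample = b"so we beat on, boats against the current"
counts = byte_histogram(sample)

# Token IDs with a zero count never receive a training signal as inputs.
dead_ids = [i for i, c in enumerate(counts) if c == 0]
print(f"present: {256 - len(dead_ids)}, dead: {len(dead_ids)}")
```

Unlike the `Counter` above, this covers bytes 128-255 explicitly instead of relying on a hardcoded count.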


## Show Dead ASCII Bytes

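Display which ASCII characters never appear in the cleaned corpus. The straight quote characters `"` and `'` are among them, consistent with the source using typographic quotes that the ASCII step stripped.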

In [7]:
if dead_ascii:
    print(f"Dead ASCII bytes ({len(dead_ascii)} total):")
    print(f"Values: {sorted(dead_ascii)}")
    
    # Show as characters where printable
    printable = [chr(b) if 32 <= b < 127 else f"\\x{b:02x}" for b in sorted(dead_ascii)]
    print(f"Characters: {printable}")
else:
    print(f"No dead ASCII bytes - all 128 ASCII values appear in corpus")

Dead ASCII bytes (50 total):
Values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 35, 37, 38, 39, 43, 47, 60, 61, 62, 64, 92, 94, 95, 96, 123, 124, 125, 126, 127]
Characters: ['\\x00', '\\x01', '\\x02', '\\x03', '\\x04', '\\x05', '\\x06', '\\x07', '\\x08', '\\x09', '\\x0b', '\\x0c', '\\x0e', '\\x0f', '\\x10', '\\x11', '\\x12', '\\x13', '\\x14', '\\x15', '\\x16', '\\x17', '\\x18', '\\x19', '\\x1a', '\\x1b', '\\x1c', '\\x1d', '\\x1e', '\\x1f', '"', '#', '%', '&', "'", '+', '/', '<', '=', '>', '@', '\\', '^', '_', '`', '{', '|', '}', '~', '\\x7f']


## Save Corpus

Write the clean ASCII text to disk for training.

In [8]:
# Ensure output directory exists
os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

# Write the raw ASCII bytes (binary mode avoids platform newline translation)
with open(OUTPUT_PATH, 'wb') as f:
    f.write(text_bytes)

print(f"✓ Saved corpus to: {OUTPUT_PATH}")
print(f"File size: {len(ascii_text):,} bytes ({len(ascii_text) / 1024:.1f} KB)")

✓ Saved corpus to: ../data/training_corpus.txt
File size: 272,305 bytes (265.9 KB)
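
The size printed above is computed from the in-memory text, not from the file. A hedged sanity check is to read the file back in binary mode and confirm both its size and its ASCII purity. The `verify_ascii_file` helper is hypothetical, and the demonstration uses a temporary file, not the real corpus.

```python
import os
import tempfile


def verify_ascii_file(path: str, expected_size: int) -> bool:
    """Check that the file has the expected size and only bytes below 128."""
    with open(path, "rb") as f:
        data = f.read()
    return len(data) == expected_size and all(b < 128 for b in data)


# Demonstration on a temporary file standing in for the saved corpus.
payload = b"Chapter I\r\nIn my younger and more vulnerable years\r\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")
    with open(path, "wb") as f:
        f.write(payload)
    ok = verify_ascii_file(path, expected_size=len(payload))

print(f"file verified: {ok}")
```

Reading in binary mode matters because text mode may translate line endings on some platforms, which would make the on-disk size differ from the in-memory length.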


## Summary

In [9]:
print(f"{'='*80}")
print("TRAINING CORPUS SUMMARY")
print(f"{'='*80}")
print(f"Source: The Great Gatsby (F. Scott Fitzgerald)")
print(f"Corpus size: {len(text_bytes):,} bytes ({len(text_bytes) / 1024:.1f} KB)")
print(f"\nVocabulary (256-byte tokenizer):")
print(f"  Present: {n_present} bytes")
print(f"  Dead ASCII (0-127): {n_dead_ascii} bytes")
print(f"  Dead high (128-255): {n_dead_high} bytes")
print(f"  Total dead: {n_dead_ascii + n_dead_high} ({100 * (n_dead_ascii + n_dead_high) / 256:.1f}%)")
print(f"\nOutput: {OUTPUT_PATH}")
print(f"{'='*80}")

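================================================================================
TRAINING CORPUS SUMMARY
================================================================================
Source: The Great Gatsby (F. Scott Fitzgerald)
Corpus size: 272,305 bytes (265.9 KB)

Vocabulary (256-byte tokenizer):
  Present: 78 bytes
  Dead ASCII (0-127): 50 bytes
  Dead high (128-255): 128 bytes
  Total dead: 178 (69.5%)

Output: ../data/training_corpus.txt
================================================================================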
