# 1.9a: The Great Gatsby Corpus Preparation

**Download and prepare ASCII training corpus for 100k-step encounter hypothesis test**

## The Plan

We're testing the encounter hypothesis: can unreachable tokens escape the primordial cluster via rare prediction encounters?

To maximize opportunities for encounters, we'll train for 100,000 steps on The Great Gatsby corpus.

## Corpus Choice

**The Great Gatsby** by F. Scott Fitzgerald
- Public domain (published 1925)
- ~266 KB once converted to pure ASCII
- Rich vocabulary and varied sentence structure
- Available from Project Gutenberg
- Leaves 50 of the 128 ASCII byte values unused, giving us 50 dead tokens (bytes that never appear in the text)

## This Notebook

1. Download raw text from Project Gutenberg
2. Strip headers/footers
3. Convert to pure ASCII (drop all non-ASCII characters)
4. Analyze byte statistics
5. Save corpus to `blog/data/the_great_gatsby.txt`
6. Save token lists to `blog/tensors/Lil_Gatsby/1.9a_token_lists.safetensors`

## Parameters

In [1]:
# Source
GUTENBERG_URL = "https://www.gutenberg.org/cache/epub/64317/pg64317.txt"

# Outputs
CORPUS_OUTPUT = "../data/the_great_gatsby.txt"
TENSOR_OUTPUT = "../tensors/Lil_Gatsby/1.9a_token_lists.safetensors"

# Vocabulary
VOCAB_SIZE = 128  # ASCII only (0-127)

RANDOM_SEED = 42

## Imports

In [2]:
import requests
import numpy as np
import torch
from collections import Counter
from pathlib import Path
from safetensors.torch import save_file

np.random.seed(RANDOM_SEED)

## Download The Great Gatsby

In [3]:
print(f"Downloading from Project Gutenberg...\n")
print(f"URL: {GUTENBERG_URL}")

# A timeout keeps the cell from hanging indefinitely if the server stalls
response = requests.get(GUTENBERG_URL, timeout=30)
response.raise_for_status()

raw_text = response.text

print(f"\n✓ Downloaded successfully")
print(f"Raw size: {len(raw_text):,} characters")

Downloading from Project Gutenberg...

URL: https://www.gutenberg.org/cache/epub/64317/pg64317.txt

✓ Downloaded successfully
Raw size: 296,858 characters


## Strip Project Gutenberg Headers/Footers

In [4]:
# Find start marker (typical Gutenberg format)
start_markers = [
    "*** START OF THE PROJECT GUTENBERG EBOOK",
    "*** START OF THIS PROJECT GUTENBERG EBOOK",
    "***START OF THE PROJECT GUTENBERG EBOOK"
]

start_idx = 0
for marker in start_markers:
    idx = raw_text.find(marker)
    if idx != -1:
        # Skip past the marker line
        start_idx = raw_text.find('\n', idx) + 1
        print(f"Found start marker: {marker[:50]}...")
        break

# Find end marker
end_markers = [
    "*** END OF THE PROJECT GUTENBERG EBOOK",
    "*** END OF THIS PROJECT GUTENBERG EBOOK",
    "***END OF THE PROJECT GUTENBERG EBOOK"
]

end_idx = len(raw_text)
for marker in end_markers:
    idx = raw_text.find(marker)
    if idx != -1:
        end_idx = idx
        print(f"Found end marker: {marker[:50]}...")
        break

# Extract just the novel
clean_text = raw_text[start_idx:end_idx].strip()

print(f"\nStripped header/footer:")
print(f"  Clean text: {len(clean_text):,} characters")
print(f"  Removed: {len(raw_text) - len(clean_text):,} characters")

Found start marker: *** START OF THE PROJECT GUTENBERG EBOOK...
Found end marker: *** END OF THE PROJECT GUTENBERG EBOOK...

Stripped header/footer:
  Clean text: 277,090 characters
  Removed: 19,768 characters
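
The marker search above can be packaged as a small helper and exercised on a toy string. This is a minimal sketch, not part of the notebook: the function name `strip_gutenberg` and the sample text are hypothetical, and it only recognizes the marker spellings listed in the cell above.

```python
# Hypothetical helper: the same marker logic as the cell above, packaged for reuse.
START_MARKERS = (
    "*** START OF THE PROJECT GUTENBERG EBOOK",
    "*** START OF THIS PROJECT GUTENBERG EBOOK",
    "***START OF THE PROJECT GUTENBERG EBOOK",
)
END_MARKERS = (
    "*** END OF THE PROJECT GUTENBERG EBOOK",
    "*** END OF THIS PROJECT GUTENBERG EBOOK",
    "***END OF THE PROJECT GUTENBERG EBOOK",
)


def strip_gutenberg(text: str) -> str:
    """Return the text between the start-marker line and the end marker."""
    start = 0
    for marker in START_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            newline = text.find("\n", idx)
            # Skip the rest of the marker line; with no newline, nothing follows it.
            start = len(text) if newline == -1 else newline + 1
            break

    end = len(text)
    for marker in END_MARKERS:
        idx = text.find(marker, start)
        if idx != -1:
            end = idx
            break

    return text[start:end].strip()


# Toy document, made up for illustration (not real Gutenberg output).
sample = (
    "License preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK TOY ***\n"
    "Chapter I\nSome prose.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK TOY ***\n"
    "License footer\n"
)
print(repr(strip_gutenberg(sample)))
```

If neither marker is found, the helper returns the whole input (stripped), which matches the cell's fallback of `start_idx = 0` and `end_idx = len(raw_text)`.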


## Convert to Pure ASCII

In [5]:
# Try encoding as UTF-8 to see what we're dealing with
text_bytes_utf8 = clean_text.encode('utf-8', errors='replace')

# Find non-ASCII bytes
non_ascii = [b for b in text_bytes_utf8 if b > 127]

print(f"UTF-8 encoding analysis:")
print(f"  Total bytes: {len(text_bytes_utf8):,}")
print(f"  Non-ASCII bytes: {len(non_ascii):,}")

if non_ascii:
    print(f"  Non-ASCII byte values: {sorted(set(non_ascii))}")
    print(f"\n⚠️  Text contains non-ASCII characters (multi-byte UTF-8 sequences)")
    print(f"Converting to pure ASCII by stripping...")
    
    # Strip non-ASCII characters (each one is dropped, not transliterated)
    ascii_text = clean_text.encode('ascii', errors='ignore').decode('ascii')
    text_bytes = ascii_text.encode('ascii')
    
    print(f"\n✓ Converted to pure ASCII")
    print(f"  Final size: {len(text_bytes):,} bytes")
    # This difference is measured in UTF-8 bytes, not characters
    print(f"  Non-ASCII bytes removed: {len(text_bytes_utf8) - len(text_bytes):,}")
else:
    print(f"\n✓ Text is already pure ASCII")
    ascii_text = clean_text
    text_bytes = ascii_text.encode('ascii')

UTF-8 encoding analysis:
  Total bytes: 286,640
  Non-ASCII bytes: 14,335
  Non-ASCII byte values: [128, 138, 148, 152, 153, 156, 157, 166, 167, 169, 170, 180, 195, 226]

⚠️  Text contains non-ASCII characters (multi-byte UTF-8 sequences)
Converting to pure ASCII by stripping...

✓ Converted to pure ASCII
  Final size: 272,305 bytes
  Non-ASCII bytes removed: 14,335
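
To see what `errors='ignore'` does to typographic characters, here is a toy example. The sample string is made up. Each non-ASCII character is deleted outright rather than replaced, so the character count and the UTF-8 byte count shrink by different amounts.

```python
# Made-up sample with curly double quotes, a curly apostrophe, and an em dash.
sample = "\u201cDon\u2019t\u201d\u2014she said"

# Same conversion as the notebook cell: drop anything outside ASCII.
stripped = sample.encode("ascii", errors="ignore").decode("ascii")

chars_removed = len(sample) - len(stripped)
bytes_removed = len(sample.encode("utf-8")) - len(stripped.encode("ascii"))

print(stripped)
print(f"characters removed: {chars_removed}")
print(f"UTF-8 bytes removed: {bytes_removed}")
```

Four characters disappear here, but they account for twelve UTF-8 bytes because each one is a three-byte sequence. The same arithmetic applies to the run above: the text goes from 277,090 characters to 272,305, so 4,785 characters account for the 14,335 non-ASCII bytes. Note also that stripping joins the words on either side of a dash and removes apostrophes, which is consistent with the straight-quote bytes 34 and 39 appearing in the dead list.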


## Analyze Byte Statistics

In [6]:
byte_counts = Counter(text_bytes)

# Which ASCII bytes appear?
present_bytes = set(byte_counts.keys())
n_present = len(present_bytes)

# Which ASCII bytes are dead?
all_ascii = set(range(VOCAB_SIZE))
dead_bytes = all_ascii - present_bytes
n_dead = len(dead_bytes)

print(f"\nByte statistics (ASCII 0-{VOCAB_SIZE-1}):")
print(f"  Total bytes in corpus: {len(text_bytes):,}")
print(f"  Unique bytes present: {n_present} / {VOCAB_SIZE}")
print(f"  Dead bytes: {n_dead} / {VOCAB_SIZE} ({100 * n_dead / VOCAB_SIZE:.1f}%)")


Byte statistics (ASCII 0-127):
  Total bytes in corpus: 272,305
  Unique bytes present: 78 / 128
  Dead bytes: 50 / 128 (39.1%)
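
The dead-byte computation relies on the fact that iterating over a `bytes` object yields integers, so the `Counter` keys can be compared directly against `range(128)`. A minimal sketch on a made-up string:

```python
from collections import Counter

toy = b"abracadabra"            # made-up sample, not taken from the corpus

counts = Counter(toy)           # keys are ints, e.g. 97 for 'a'
present = set(counts)
dead = set(range(128)) - present

print(counts.most_common(2))
print(f"present: {len(present)}, dead: {len(dead)}")
```

With only five distinct letters in the toy string, 123 of the 128 ASCII values are dead; the real corpus is far richer but still leaves 50 unused.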


In [7]:
# Show most common bytes
most_common = byte_counts.most_common(10)

print(f"\nMost common bytes:")
for byte_val, count in most_common:
    char = repr(chr(byte_val)) if 32 <= byte_val < 127 else f"\\x{byte_val:02x}"
    print(f"  {byte_val:3d} ({char:>6s}): {count:6,} ({100 * count / len(text_bytes):5.2f}%)")


Most common bytes:
   32 (   ' '): 44,251 (16.25%)
  101 (   'e'): 25,012 ( 9.19%)
  116 (   't'): 18,098 ( 6.65%)
   97 (   'a'): 16,843 ( 6.19%)
  111 (   'o'): 15,739 ( 5.78%)
  110 (   'n'): 14,065 ( 5.17%)
  105 (   'i'): 12,532 ( 4.60%)
  115 (   's'): 12,370 ( 4.54%)
  104 (   'h'): 12,240 ( 4.49%)
  114 (   'r'): 11,342 ( 4.17%)


In [8]:
# Show least common bytes (that actually appear)
least_common = byte_counts.most_common()[-10:]

print(f"\nLeast common bytes:")
for byte_val, count in least_common:
    char = repr(chr(byte_val)) if 32 <= byte_val < 127 else f"\\x{byte_val:02x}"
    print(f"  {byte_val:3d} ({char:>6s}): {count:6,} ({100 * count / len(text_bytes):5.2f}%)")


Least common bytes:
   81 (   'Q'):      4 ( 0.00%)
   50 (   '2'):      4 ( 0.00%)
   56 (   '8'):      3 ( 0.00%)
   88 (   'X'):      2 ( 0.00%)
   55 (   '7'):      2 ( 0.00%)
   52 (   '4'):      2 ( 0.00%)
   91 (   '['):      2 ( 0.00%)
   93 (   ']'):      2 ( 0.00%)
   36 (   '$'):      2 ( 0.00%)
   90 (   'Z'):      1 ( 0.00%)


In [9]:
# Show dead bytes
if dead_bytes:
    print(f"\nDead bytes ({len(dead_bytes)} total):")
    print(f"  Byte values: {sorted(dead_bytes)}")
    
    # Show as characters where printable
    printable = [repr(chr(b)) if 32 <= b < 127 else f"\\x{b:02x}" for b in sorted(dead_bytes)]
    print(f"  Characters: {', '.join(printable[:20])}..." if len(printable) > 20 else f"  Characters: {', '.join(printable)}")
else:
    print(f"\nNo dead bytes - all {VOCAB_SIZE} ASCII values appear in corpus")


Dead bytes (50 total):
  Byte values: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 35, 37, 38, 39, 43, 47, 60, 61, 62, 64, 92, 94, 95, 96, 123, 124, 125, 126, 127]
  Characters: \x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x09, \x0b, \x0c, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15...
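
The truncated character listing hides what kind of bytes are dead. A short sketch of the breakdown, using the byte values printed above (the variable names are ours, not the notebook's):

```python
# Dead byte values copied from the output above.
dead = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20,
        21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 35, 37, 38, 39, 43,
        47, 60, 61, 62, 64, 92, 94, 95, 96, 123, 124, 125, 126, 127]

# ASCII control codes are 0-31 plus DEL (127); the rest are printable.
control = [b for b in dead if b < 32 or b == 127]
printable = [b for b in dead if 32 <= b < 127]

print(f"control codes: {len(control)}")
print(f"printable:     {len(printable)}")
print("printable dead characters:", " ".join(chr(b) for b in printable))
```

This gives 31 control codes and 19 printable punctuation marks. The control codes would be dead in almost any prose corpus; the punctuation group is specific to this text and includes the straight quote characters `"` and `'`.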


## Save Corpus to Disk

In [10]:
# Create output directory if needed
corpus_path = Path(CORPUS_OUTPUT)
corpus_path.parent.mkdir(parents=True, exist_ok=True)

# Write the exact bytes we analyzed (binary mode avoids platform newline translation)
corpus_path.write_bytes(text_bytes)

print(f"✓ Saved corpus to: {corpus_path}")
print(f"  File size: {len(text_bytes):,} bytes ({len(text_bytes) / 1024:.1f} KB)")

✓ Saved corpus to: ../data/the_great_gatsby.txt
  File size: 272,305 bytes (265.9 KB)
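
A quick sanity check after writing is to read the file back as raw bytes and confirm it is still pure ASCII. The sketch below uses a temporary file and made-up content so it runs on its own; the helper name `is_pure_ascii` is hypothetical, and in the notebook you would point it at `corpus_path` instead.

```python
import tempfile
from pathlib import Path


def is_pure_ascii(path: Path) -> bool:
    """True if every byte in the file is in the ASCII range 0-127."""
    return path.read_bytes().isascii()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_corpus.txt"           # hypothetical file name

    path.write_bytes(b"plain old text\n")
    print(is_pure_ascii(path))                    # ASCII-only content

    path.write_bytes("caf\u00e9".encode("utf-8"))
    print(is_pure_ascii(path))                    # contains bytes 195 and 169
```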


## Save Token Lists

In [11]:
# Create dead/live token ID lists
dead_token_ids = torch.tensor(sorted(dead_bytes), dtype=torch.int64)
live_token_ids = torch.tensor(sorted(present_bytes), dtype=torch.int64)

# Create output directory if needed
tensor_path = Path(TENSOR_OUTPUT)
tensor_path.parent.mkdir(parents=True, exist_ok=True)

# Save to safetensors
save_file({
    'dead_token_ids': dead_token_ids,
    'live_token_ids': live_token_ids,
    'vocab_size': torch.tensor(VOCAB_SIZE, dtype=torch.int64),
}, tensor_path)

print(f"\n✓ Saved token lists to: {tensor_path}")
print(f"  Dead tokens: {len(dead_token_ids)}")
print(f"  Live tokens: {len(live_token_ids)}")


✓ Saved token lists to: ../tensors/Lil_Gatsby/1.9a_token_lists.safetensors
  Dead tokens: 50
  Live tokens: 78
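
Downstream code will presumably rely on the two lists being disjoint and together covering every token ID. A minimal sketch of that check, with a hypothetical helper name and a toy four-token vocabulary:

```python
def check_partition(live, dead, vocab_size=128):
    """Raise ValueError unless live and dead are disjoint and cover 0..vocab_size-1."""
    live_set, dead_set = set(live), set(dead)
    vocab = set(range(vocab_size))

    overlap = live_set & dead_set
    if overlap:
        raise ValueError(f"IDs marked both live and dead: {sorted(overlap)}")

    missing = vocab - live_set - dead_set
    if missing:
        raise ValueError(f"IDs in neither list: {sorted(missing)}")

    extra = (live_set | dead_set) - vocab
    if extra:
        raise ValueError(f"IDs outside the vocabulary: {sorted(extra)}")


# Toy vocabulary for illustration; a valid partition passes silently.
check_partition(live=[0, 2], dead=[1, 3], vocab_size=4)
print("partition OK")
```

In the notebook, the equivalent call would be `check_partition(live_token_ids.tolist(), dead_token_ids.tolist(), VOCAB_SIZE)`.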


## Summary

In [12]:
print(f"\n{'='*80}")
print("THE GREAT GATSBY CORPUS - PREPARATION COMPLETE")
print(f"{'='*80}")
print(f"\nSource:")
print(f"  Novel: The Great Gatsby by F. Scott Fitzgerald")
print(f"  URL: {GUTENBERG_URL}")
print(f"\nCorpus:")
print(f"  Size: {len(text_bytes):,} bytes ({len(text_bytes) / 1024:.1f} KB)")
print(f"  Encoding: Pure ASCII (0-{VOCAB_SIZE-1})")
print(f"\nVocabulary ({VOCAB_SIZE}-byte tokenizer):")
print(f"  Live tokens: {len(live_token_ids)} ({100 * len(live_token_ids) / VOCAB_SIZE:.1f}%)")
print(f"  Dead tokens: {len(dead_token_ids)} ({100 * len(dead_token_ids) / VOCAB_SIZE:.1f}%)")
print(f"\nOutputs:")
print(f"  Corpus: {corpus_path}")
print(f"  Token lists: {tensor_path}")
print(f"\nReady for 1.9b: 100k-step training run")
print(f"{'='*80}")


THE GREAT GATSBY CORPUS - PREPARATION COMPLETE

Source:
  Novel: The Great Gatsby by F. Scott Fitzgerald
  URL: https://www.gutenberg.org/cache/epub/64317/pg64317.txt

Corpus:
  Size: 272,305 bytes (265.9 KB)
  Encoding: Pure ASCII (0-127)

Vocabulary (128-byte tokenizer):
  Live tokens: 78 (60.9%)
  Dead tokens: 50 (39.1%)

Outputs:
  Corpus: ../data/the_great_gatsby.txt
  Token lists: ../tensors/Lil_Gatsby/1.9a_token_lists.safetensors

Ready for 1.9b: 100k-step training run
