# 1.11a2: FineWeb Corpus Preparation

**Goal:** Download a ~2 MB sample (set by `TARGET_SIZE_BYTES`) from FineWeb and convert it to pure ASCII for training Lil Gatsby.

## Why FineWeb?

FineWeb is a large web corpus from Hugging Face (about 15 trillion tokens). Its main properties:
- Web pages from Common Crawl (2013-2024)
- Extensive deduplication and filtering
- High-quality English text
- Distributed as Parquet files

Unlike Gatsby (a single book, ~270k byte-level tokens), FineWeb gives us:
- **Diversity:** Many writing styles, topics, and domains
- **Scale:** We can grab 1 MB, 10 MB, or more
- **Less repetition:** A larger sample means fewer passes over the same text during training, so less opportunity for memorization

## The Corpus-as-Heat-Reservoir Hypothesis

**Hypothesis:** A larger, more diverse corpus keeps gradients large for longer → tokens stay "warm" (mobile) longer → more time for structure formation before freezing.

**Experiment:**
1. Train on Gatsby (~270k bytes, repeats after 5 epochs) → freezes by step ~8k, k=2 clusters
2. Train on FineWeb (planned as 1 MB, ~10 epochs; the run recorded in this notebook samples 2 MB) → freezes later? More clusters?
3. If yes → corpus size is the missing ingredient!
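
Whether a corpus "repeats" depends on how many tokens training consumes relative to the corpus size. The following is a minimal sketch of that arithmetic, assuming one token per byte. `BATCH_SIZE` and `SEQ_LEN` are hypothetical placeholders, not the settings of the training notebook, so the printed numbers will only match the epoch counts quoted above once the real values are substituted.

```python
def epochs_seen(steps: int, batch_size: int, seq_len: int, corpus_tokens: int) -> float:
    """Number of full passes over the corpus after `steps` optimizer steps."""
    tokens_seen = steps * batch_size * seq_len
    return tokens_seen / corpus_tokens


# Hypothetical training settings; replace with the values actually used.
BATCH_SIZE = 32
SEQ_LEN = 128
STEPS = 10_000

# Corpus sizes in bytes (one token per byte), taken from this notebook.
for name, size in [("Gatsby", 268_928), ("FineWeb sample", 2_000_000)]:
    print(f"{name}: {epochs_seen(STEPS, BATCH_SIZE, SEQ_LEN, size):.1f} epochs")
```

Under identical training settings, the ratio of the two epoch counts equals the inverse ratio of the corpus sizes, which is the quantity the experiment varies.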

## Processing Steps

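1. Load FineWeb via the Hugging Face `datasets` library (streaming mode)
2. Accumulate documents until `TARGET_SIZE_BYTES` (2,000,000 bytes) is reached
3. Convert to ASCII using `unidecode` (transliterates Unicode characters to their closest ASCII equivalents)
4. Filter to valid ASCII bytes (0-127) as a safety net
5. Trim to the exact target size and save as `fineweb_ascii.txt`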
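
To illustrate what steps 3 and 4 do to a string, here is a minimal standard-library sketch. It is not what the notebook uses: it swaps `unidecode` for `unicodedata.normalize("NFKD", ...)`, and `to_ascii_stdlib` is a hypothetical helper name. The approximation is cruder than `unidecode`, because characters without a decomposition are dropped rather than transliterated.

```python
import unicodedata


def to_ascii_stdlib(text: str) -> str:
    """Approximate ASCII conversion using only the standard library.

    NFKD splits accented letters into a base letter plus combining marks;
    encoding with errors="ignore" then drops everything outside 0-127.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(to_ascii_stdlib("caf\u00e9 na\u00efve"))  # accents are stripped: cafe naive
print(to_ascii_stdlib("it\u2019s"))             # the curly apostrophe is dropped: its
```

The second example shows why `unidecode` is the better fit for web text: it maps the curly apostrophe to a plain `'` instead of deleting it.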

## Expected Output

- File: `../data/fineweb_ascii.txt`
- Size: 2,000,000 bytes (`TARGET_SIZE_BYTES`)
- All characters in range [0, 127] (pure ASCII)
- Used tokens: ~94 in the run recorded in this notebook (letters, digits, punctuation)
- Unused tokens: ~34 in the same run (control chars, special symbols)
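
These expectations can be checked mechanically. The sketch below assumes the corpus is available as a `bytes` object; `check_ascii_corpus` is a hypothetical helper, not part of the notebook.

```python
def check_ascii_corpus(data: bytes, expected_size: int) -> dict:
    """Return basic facts about a corpus and raise if it is not pure ASCII."""
    if not data.isascii():
        bad = next(b for b in data if b > 127)
        raise ValueError(f"non-ASCII byte found: {bad}")
    used = set(data)  # iterating over bytes yields ints in 0-255
    return {
        "size_ok": len(data) == expected_size,
        "used_tokens": len(used),
        "unused_tokens": 128 - len(used),
    }


# Tiny in-memory example; for the real file, pass Path(OUTPUT_PATH).read_bytes().
report = check_ascii_corpus(b"hello world\n", expected_size=12)
print(report)  # {'size_ok': True, 'used_tokens': 9, 'unused_tokens': 119}
```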

## Parameters

In [10]:
# Target corpus size
TARGET_SIZE_BYTES = 2_000_000  # 2M

# Output path
OUTPUT_PATH = "../data/fineweb_ascii.txt"

# FineWeb dataset
DATASET_NAME = "HuggingFaceFW/fineweb"
SPLIT = "train"

# Random seed (note: the sampling loop below reads the stream in order,
# so this seed does not affect which documents are selected)
RANDOM_SEED = 42

## Imports

In [11]:
from datasets import load_dataset
from unidecode import unidecode
from pathlib import Path
import random

random.seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## Load FineWeb (Streaming)

We use streaming mode to avoid downloading the entire 15-trillion-token dataset.

**Note:** The first run resolves the list of data files and downloads a few MB of index data. Subsequent runs are fast.
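
A streamed dataset yields examples in a fixed order, so reading from the start always selects the same leading documents. If a randomized but reproducible sample is wanted, one common approach is buffer-based shuffling; the `datasets` library exposes this for streaming datasets via `shuffle(seed=..., buffer_size=...)`. The standard-library sketch below only illustrates the general idea. It is not the library's implementation, and `buffered_shuffle` is a hypothetical helper.

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Yield items from `stream` in a locally shuffled, reproducible order.

    `buffer_size` must be at least 1. Only `buffer_size` items are held in
    memory at any time, so this works on streams of unknown length.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random buffered item and reuse its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Stream exhausted: emit the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


docs = (f"doc{i}" for i in range(10))
print(list(buffered_shuffle(docs, buffer_size=4, seed=42)))
```

A larger buffer gives a better approximation of a full shuffle at the cost of reading more of the stream before the first item is emitted.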

In [12]:
print(f"Loading FineWeb dataset (streaming mode)...\n")

# Load in streaming mode (doesn't download everything)
fineweb = load_dataset(DATASET_NAME, split=SPLIT, streaming=True)

print(f"✓ Dataset loaded")
print(f"  Name: {DATASET_NAME}")
print(f"  Split: {SPLIT}")
print(f"  Mode: streaming (on-demand download)")

Loading FineWeb dataset (streaming mode)...



Resolving data files:   0%|          | 0/27468 [00:00<?, ?it/s]

✓ Dataset loaded
  Name: HuggingFaceFW/fineweb
  Split: train
  Mode: streaming (on-demand download)


## Sample and Convert to ASCII

We'll iterate through examples, convert each to ASCII via `unidecode`, and accumulate until we reach `TARGET_SIZE_BYTES`.

In [13]:
print(f"\nSampling {TARGET_SIZE_BYTES:,} bytes from FineWeb...\n")

corpus_chunks = []
total_bytes = 0
num_examples = 0

for example in fineweb:
    # Extract text
    text = example.get('text', '')
    
    if not text:
        continue
    
    # Transliterate Unicode to ASCII (e.g. accented letters to their base letters)
    ascii_text = unidecode(text)
    
    # Encode to bytes. unidecode should already return ASCII-only text, so
    # errors='ignore' is just a safety net that drops any stray non-ASCII character.
    ascii_bytes = ascii_text.encode('ascii', errors='ignore')
    
    if not ascii_bytes:
        continue
    
    # Add to corpus
    corpus_chunks.append(ascii_bytes)
    total_bytes += len(ascii_bytes)
    num_examples += 1
    
    # Progress indicator
    if num_examples % 100 == 0:
        print(f"  Examples processed: {num_examples:,} | Bytes collected: {total_bytes:,} / {TARGET_SIZE_BYTES:,}")
    
    # Stop when we hit target
    if total_bytes >= TARGET_SIZE_BYTES:
        break

print(f"\n✓ Sampling complete")
print(f"  Examples used: {num_examples:,}")
print(f"  Total bytes: {total_bytes:,}")


Sampling 2,000,000 bytes from FineWeb...

  Examples processed: 100 | Bytes collected: 252,659 / 2,000,000
  Examples processed: 200 | Bytes collected: 640,529 / 2,000,000
  Examples processed: 300 | Bytes collected: 957,297 / 2,000,000
  Examples processed: 400 | Bytes collected: 1,201,090 / 2,000,000
  Examples processed: 500 | Bytes collected: 1,469,396 / 2,000,000
  Examples processed: 600 | Bytes collected: 1,773,172 / 2,000,000

✓ Sampling complete
  Examples used: 610
  Total bytes: 2,001,727


## Combine and Trim to Exact Size

In [14]:
print(f"\nCombining chunks...\n")

# Combine all chunks
corpus_bytes = b''.join(corpus_chunks)

# Trim to exact target size (if we overshot)
corpus_bytes = corpus_bytes[:TARGET_SIZE_BYTES]

# Decode to string
corpus_text = corpus_bytes.decode('ascii')

print(f"✓ Corpus ready")
print(f"  Size: {len(corpus_bytes):,} bytes ({len(corpus_bytes) / 1024:.1f} KB)")
print(f"  Characters: {len(corpus_text):,}")


Combining chunks...

✓ Corpus ready
  Size: 2,000,000 bytes (1953.1 KB)
  Characters: 2,000,000
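
Note that `b''.join(...)` concatenates documents with nothing between them, so the last word of one document runs directly into the first word of the next. If visible document boundaries are wanted, a separator can be inserted. The sketch below is one possible variant, not what this notebook does; `join_with_separator` and the choice of `b"\n\n"` are assumptions.

```python
from typing import Sequence


def join_with_separator(chunks: Sequence[bytes], target_size: int, sep: bytes = b"\n\n") -> bytes:
    """Join document chunks with a separator, then trim to at most `target_size` bytes."""
    return sep.join(chunks)[:target_size]


chunks = [b"First document.", b"Second document."]
print(b"".join(chunks))                  # b'First document.Second document.'
print(join_with_separator(chunks, 100))  # b'First document.\n\nSecond document.'
```

With a separator, the sampling loop would also need to count the separator bytes toward the running total; that bookkeeping is omitted here.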


## Analyze Corpus Statistics

In [15]:
print(f"\nAnalyzing corpus...\n")

# Count how often each byte value occurs (iterating over bytes yields ints)
from collections import Counter

byte_counts = Counter(corpus_bytes)
unique_bytes = set(byte_counts)

# Sort by frequency, most common first
sorted_bytes = byte_counts.most_common()

print(f"✓ Corpus statistics")
print(f"  Unique bytes used: {len(unique_bytes)} / 128")
print(f"  Unused bytes: {128 - len(unique_bytes)}")
print()
print(f"Top 10 most common bytes:")
for i, (byte, count) in enumerate(sorted_bytes[:10], 1):
    char = chr(byte) if 32 <= byte < 127 else f"<{byte}>"
    freq = 100 * count / len(corpus_bytes)
    print(f"  {i:2d}. '{char}' (byte {byte:3d}): {count:8,} ({freq:5.2f}%)")

print()
print(f"Unused ASCII bytes (untrained tokens):")
unused_bytes = sorted(set(range(128)) - unique_bytes)
print(f"  Count: {len(unused_bytes)}")
print(f"  Bytes: {unused_bytes[:20]}{'...' if len(unused_bytes) > 20 else ''}")


Analyzing corpus...

✓ Corpus statistics
  Unique bytes used: 94 / 128
  Unused bytes: 34

Top 10 most common bytes:
   1. ' ' (byte  32):  332,309 (16.62%)
   2. 'e' (byte 101):  186,940 ( 9.35%)
   3. 't' (byte 116):  136,595 ( 6.83%)
   4. 'a' (byte  97):  124,195 ( 6.21%)
   5. 'o' (byte 111):  120,346 ( 6.02%)
   6. 'i' (byte 105):  107,923 ( 5.40%)
   7. 'n' (byte 110):  107,816 ( 5.39%)
   8. 's' (byte 115):   98,548 ( 4.93%)
   9. 'r' (byte 114):   94,539 ( 4.73%)
  10. 'h' (byte 104):   72,259 ( 3.61%)

Unused ASCII bytes (untrained tokens):
  Count: 34
  Bytes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]...
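
The truncated list above shows only the first 20 unused values. To see how the unused bytes split between control characters and printable symbols, a small helper along these lines could be applied to `unused_bytes`. This is a sketch: `classify_bytes` is a hypothetical name, and the example input is made up for illustration rather than taken from the corpus.

```python
def classify_bytes(byte_values):
    """Split ASCII byte values into control characters and printable characters."""
    control = [b for b in byte_values if b < 32 or b == 127]
    printable = [b for b in byte_values if 32 <= b < 127]
    return control, printable


# Made-up example input; in the notebook, pass `unused_bytes` instead.
example_unused = [0, 9, 13, 96, 127]
control, printable = classify_bytes(example_unused)
print(control)                        # [0, 9, 13, 127]
print([chr(b) for b in printable])    # ['`']
```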


## Save Corpus

In [16]:
print(f"\nSaving corpus to: {OUTPUT_PATH}\n")

# Create directory if needed
Path(OUTPUT_PATH).parent.mkdir(parents=True, exist_ok=True)

# Save as raw bytes so newlines are not translated (text mode on Windows
# would write \r\n and change the file size)
with open(OUTPUT_PATH, 'wb') as f:
    f.write(corpus_bytes)

print(f"✓ Corpus saved")
print(f"  File: {OUTPUT_PATH}")
print(f"  Size: {Path(OUTPUT_PATH).stat().st_size:,} bytes")


Saving corpus to: ../data/fineweb_ascii.txt

✓ Corpus saved
  File: ../data/fineweb_ascii.txt
  Size: 2,000,000 bytes


## Sample Preview

In [17]:
print(f"\n{'='*80}")
print(f"CORPUS PREVIEW (first 500 characters)")
print(f"{'='*80}\n")
print(corpus_text[:500])
print(f"\n[... {len(corpus_text) - 500:,} more characters ...]")
print(f"\n{'='*80}")


CORPUS PREVIEW (first 500 characters)

How AP reported in all formats from tornado-stricken regionsMarch 8, 2012
When the first serious bout of tornadoes of 2012 blew through middle America in the middle of the night, they touched down in places hours from any AP bureau. Our closest video journalist was Chicago-based Robert Ray, who dropped his plans to travel to Georgia for Super Tuesday, booked several flights to the cities closest to the strikes and headed for the airport. He'd decide once there which flight to take.
He never got 

[... 1,999,500 more characters ...]



## Summary

In [18]:
print(f"\n{'='*80}")
print(f"SUMMARY")
print(f"{'='*80}\n")
print(f"Source: FineWeb (HuggingFaceFW/fineweb)")
print(f"  Examples sampled: {num_examples:,}")
print(f"  Total bytes: {len(corpus_bytes):,}")
print()
print(f"Output: {OUTPUT_PATH}")
print(f"  Unique bytes: {len(unique_bytes)} / 128")
print(f"  Trained tokens: {len(unique_bytes)} (bytes that appear in corpus)")
print(f"  Untrained tokens: {len(unused_bytes)} (bytes that never appear)")
print()
print(f"Comparison to Gatsby:")
print(f"  Gatsby: 268,928 bytes, ~79 trained tokens")
print(f"  FineWeb sample: {len(corpus_bytes):,} bytes, ~{len(unique_bytes)} trained tokens")
print(f"  Size ratio: {len(corpus_bytes) / 268928:.1f}x larger")
print()
print(f"Next steps:")
print(f"  1. Run 1.12a with CORPUS_PATH = '{OUTPUT_PATH}'")
print(f"  2. Train for 10,000 steps (or more)")
print(f"  3. Run 1.13d to analyze bounding hypersphere dynamics")
print(f"  4. Compare to Gatsby results:")
print(f"     - Does freezing happen later?")
print(f"     - More fragmentation (k > 2)?")
print(f"     - Stronger contraction (R < 0.126)?")
print()
print(f"{'='*80}")


SUMMARY

Source: FineWeb (HuggingFaceFW/fineweb)
  Examples sampled: 610
  Total bytes: 2,000,000

Output: ../data/fineweb_ascii.txt
  Unique bytes: 94 / 128
  Trained tokens: 94 (bytes that appear in corpus)
  Untrained tokens: 34 (bytes that never appear)

Comparison to Gatsby:
  Gatsby: 268,928 bytes, ~79 trained tokens
  FineWeb sample: 2,000,000 bytes, ~94 trained tokens
  Size ratio: 7.4x larger

Next steps:
  1. Run 1.12a with CORPUS_PATH = '../data/fineweb_ascii.txt'
  2. Train for 10,000 steps (or more)
  3. Run 1.13d to analyze bounding hypersphere dynamics
  4. Compare to Gatsby results:
     - Does freezing happen later?
     - More fragmentation (k > 2)?
     - Stronger contraction (R < 0.126)?

