# 1.11a3: FineWeb Unicode Corpus Preparation

**Goal:** Download a small sample of FineWeb, prepare it for training Wordybird, and identify which GPT-2 tokens appear in the corpus.

## What is Wordybird?

Wordybird is our dimensional crowding experiment. We're testing whether spongecrystal formation requires cramming many untrained tokens into a limited number of dimensions.

**Hypothesis:** In 64D space with 50,257 tokens (GPT-2 vocab), thousands of untrained tokens compete for the same degrees of freedom, potentially causing:
- Forced collisions
- Quantization clustering (shared bfloat16 lattice)
- Gradient averaging (equally wrong tokens moving together)
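
The quantization-clustering bullet can be made concrete with a small sketch. It assumes nothing about Wordybird itself: `to_bfloat16` is a hypothetical helper that rounds a float32 bit pattern to bfloat16 (keep the top 16 bits, round to nearest even), and the sample values are made up for illustration. NaN and infinity are not handled.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (hypothetical helper, finite inputs only)."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a bias for round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

# Illustrative values only: five distinct floats close to 1.0.
values = [1.000, 1.001, 1.002, 1.003, 1.005]
lattice = [to_bfloat16(v) for v in values]

for v, q in zip(values, lattice):
    print(f"{v:.3f} -> {q!r}")
print(f"{len(set(values))} distinct inputs -> {len(set(lattice))} distinct bfloat16 values")
```

Near 1.0 the bfloat16 spacing is 2^-7 = 0.0078125, so values that differ by less than about half that spacing land on the same lattice point.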

## This Notebook

1. Download ~2 MB of FineWeb (enough for 100-1000 training steps)
2. Keep the text as Unicode (unlike Gatsby's ASCII-only approach)
3. Tokenize with the GPT-2 tokenizer
4. Identify trained tokens (those that appear in the corpus) vs. untrained tokens (those that never appear)
5. Save the corpus and token masks for Wordybird training

## Parameters

In [1]:
# Corpus parameters
TARGET_SIZE_MB = 2.0  # Download ~2 MB of FineWeb
DATASET_NAME = "HuggingFaceFW/fineweb"
DATASET_CONFIG = "sample-10BT"  # 10B token sample
DATASET_SPLIT = "train"

# Output paths
CORPUS_OUTPUT = "../data/fineweb_2mb_unicode.txt"
MASK_OUTPUT = "../tensors/Wordybird/fineweb_token_masks.safetensors"

# Random seed
RANDOM_SEED = 42

## Imports

In [2]:
import torch
from datasets import load_dataset
from transformers import GPT2Tokenizer
from pathlib import Path
from safetensors.torch import save_file
import random

random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## Device Detection

In [3]:
# Detect available device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print(f"Using device: {device}")

Using device: mps


## Load GPT-2 Tokenizer

In [4]:
print("Loading GPT-2 tokenizer...\n")

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

print(f"✓ Loaded GPT-2 tokenizer")
print(f"  Vocabulary size: {len(tokenizer):,} tokens")
print(f"  Vocab size (attribute): {tokenizer.vocab_size:,}")

Loading GPT-2 tokenizer...

✓ Loaded GPT-2 tokenizer
  Vocabulary size: 50,257 tokens
  Vocab size (attribute): 50,257


## Download FineWeb Sample

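Stream from Hugging Face in dataset order, collecting documents until their combined UTF-8 size reaches the target. The last document collected can push the total slightly past the target.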

In [5]:
print(f"Downloading ~{TARGET_SIZE_MB} MB from FineWeb...\n")

# Load dataset in streaming mode
dataset = load_dataset(
    DATASET_NAME,
    name=DATASET_CONFIG,
    split=DATASET_SPLIT,
    streaming=True
)

# Collect text until we hit target size
target_bytes = int(TARGET_SIZE_MB * 1024 * 1024)
texts = []
total_bytes = 0

for example in dataset:
    text = example['text']
    text_bytes = len(text.encode('utf-8'))
    
    texts.append(text)
    total_bytes += text_bytes
    
    if total_bytes >= target_bytes:
        break

# Combine all texts
corpus_text = '\n\n'.join(texts)
actual_bytes = len(corpus_text.encode('utf-8'))
actual_mb = actual_bytes / (1024 * 1024)

print(f"✓ Downloaded corpus")
print(f"  Documents: {len(texts):,}")
print(f"  Characters: {len(corpus_text):,}")
print(f"  Bytes (UTF-8): {actual_bytes:,} ({actual_mb:.2f} MB)")

Downloading ~2.0 MB from FineWeb...



✓ Downloaded corpus
  Documents: 668
  Characters: 2,089,201
  Bytes (UTF-8): 2,098,654 (2.00 MB)
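
The collection loop above follows a simple byte-budget pattern, sketched here without any network access. `take_until_bytes` and the toy documents are hypothetical, introduced only for illustration. The sketch shows that the last document can overshoot the budget and that the joining separators add bytes the running total does not count.

```python
from typing import Iterable, List

def take_until_bytes(docs: Iterable[str], target_bytes: int) -> List[str]:
    """Collect documents in order until their combined UTF-8 size reaches target_bytes."""
    collected, total = [], 0
    for doc in docs:
        collected.append(doc)
        total += len(doc.encode("utf-8"))
        if total >= target_bytes:
            break  # the final document may push the total past the budget
    return collected

# Toy stand-in for a streamed dataset (made-up documents).
docs = ["plain ascii", "naïve café", "日本語のテキスト", "never reached"]
picked = take_until_bytes(docs, target_bytes=30)
corpus = "\n\n".join(picked)

print(len(picked), "documents")
print(len(corpus), "characters")
print(len(corpus.encode("utf-8")), "bytes")
```

This is also why the notebook reports more bytes than characters: every non-ASCII character occupies two or more bytes in UTF-8.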


## Analyze Corpus Statistics

In [6]:
print(f"\nAnalyzing corpus...\n")

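# Tokenize the entire corpus. The tokenizer warns that the sequence exceeds
# GPT-2's 1024-token context; that is harmless here because the tokens are
# only counted, never fed to the model.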
tokens = tokenizer.encode(corpus_text)
print(f"Total tokens: {len(tokens):,}")

# Find unique tokens that appear
unique_tokens = sorted(set(tokens))
print(f"Unique tokens in corpus: {len(unique_tokens):,}")
print(f"Token usage: {100 * len(unique_tokens) / len(tokenizer):.2f}% of vocabulary")
print()

# Identify trained vs untrained
all_token_ids = set(range(len(tokenizer)))
trained_token_ids = set(unique_tokens)
untrained_token_ids = all_token_ids - trained_token_ids

print(f"Trained tokens: {len(trained_token_ids):,}")
print(f"Untrained tokens: {len(untrained_token_ids):,}")
print()

# Character set analysis
unique_chars = set(corpus_text)
print(f"Unique characters: {len(unique_chars):,}")

# Show some untrained token examples
print(f"\nSample of untrained tokens (first 20):")
untrained_sample = sorted(untrained_token_ids)[:20]
for token_id in untrained_sample:
    token_str = tokenizer.decode([token_id])
    # Escape special chars for display
    display_str = repr(token_str)[1:-1]  # Remove outer quotes from repr
    print(f"  {token_id:5d}: {display_str}")


Analyzing corpus...



Token indices sequence length is longer than the specified maximum sequence length for this model (475160 > 1024). Running this sequence through the model will result in indexing errors


Total tokens: 475,160
Unique tokens in corpus: 30,590
Token usage: 60.87% of vocabulary

Trained tokens: 30,590
Untrained tokens: 19,667

Unique characters: 252

Sample of untrained tokens (first 20):
     90: {
     92: }
    106: �
    107: �
    114: �
    121: �
    124: �
    125: �
    128: �
    130: �
    131: �
    132: �
    133: �
    135: �
    136: �
    137: �
    142: �
    143: �
    144: �
    145: �
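
Most of the entries above print as `�` because GPT-2 uses byte-level BPE: low token IDs stand for single raw bytes, and a lone non-ASCII byte is not valid UTF-8, so decoding substitutes U+FFFD. The sketch below rebuilds the byte ordering used by GPT-2's published `bytes_to_unicode` table and assumes, as in the released vocabulary, that token IDs 0-255 follow that ordering. `byte_for_token_id` is a hypothetical helper, and no tokenizer download is needed.

```python
def gpt2_byte_order() -> list:
    """Byte values in the order GPT-2's byte-to-unicode table lists them."""
    # Printable bytes come first ...
    order = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    # ... followed by the remaining bytes in ascending order.
    order += [b for b in range(256) if b not in order]
    return order

BYTE_ORDER = gpt2_byte_order()

def byte_for_token_id(token_id: int) -> int:
    """Map a token ID below 256 to its raw byte (assumes IDs follow BYTE_ORDER)."""
    return BYTE_ORDER[token_id]

for token_id in (90, 92, 106, 107, 114):
    raw = bytes([byte_for_token_id(token_id)])
    shown = raw.decode("utf-8", errors="replace")
    print(f"{token_id:5d}: byte 0x{raw[0]:02X} -> {shown!r}")
```

Under that assumption, IDs 90 and 92 are the ASCII braces, while IDs from 106 upward in this sample are bytes at or above 0xAE, which only form valid characters as part of a multi-byte UTF-8 sequence.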


## Save Corpus

In [7]:
print(f"\nSaving corpus to {CORPUS_OUTPUT}...\n")

# Ensure directory exists
Path(CORPUS_OUTPUT).parent.mkdir(parents=True, exist_ok=True)

# Save as UTF-8
with open(CORPUS_OUTPUT, 'w', encoding='utf-8') as f:
    f.write(corpus_text)

print(f"✓ Saved corpus")
print(f"  Path: {CORPUS_OUTPUT}")
print(f"  Size: {actual_mb:.2f} MB")


Saving corpus to ../data/fineweb_2mb_unicode.txt...

✓ Saved corpus
  Path: ../data/fineweb_2mb_unicode.txt
  Size: 2.00 MB
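
A quick way to confirm that a Unicode corpus survives the write is to read it back and compare. The sketch below does this in a temporary directory with a made-up string; `sample_text` and the file name are illustrative, not the notebook's actual corpus.

```python
import tempfile
from pathlib import Path

# Made-up sample mixing ASCII and multi-byte characters.
sample_text = "ASCII line\n\nnaïve café, 日本語\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus_sample.txt"
    # Write and read with an explicit encoding so the platform default is never used.
    path.write_text(sample_text, encoding="utf-8")
    restored = path.read_text(encoding="utf-8")
    size_on_disk = path.stat().st_size

print("Round trip OK:", restored == sample_text)
print("Characters:", len(sample_text))
print("Bytes on disk:", size_on_disk)
```

Passing `encoding="utf-8"` on both sides matters because the default text encoding depends on the platform.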


## Create Token Masks

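Build boolean masks marking which token IDs are trained (present in the corpus) and which are untrained (absent), plus the matching index tensors. They are written to disk in the next section.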

In [8]:
print(f"\nCreating token masks...\n")

vocab_size = len(tokenizer)

# Create masks
trained_mask = torch.zeros(vocab_size, dtype=torch.bool)
trained_mask[list(trained_token_ids)] = True

untrained_mask = ~trained_mask

# Also save the actual token IDs for convenience
trained_indices = torch.tensor(sorted(trained_token_ids), dtype=torch.long)
untrained_indices = torch.tensor(sorted(untrained_token_ids), dtype=torch.long)

print(f"✓ Created masks")
print(f"  Trained mask: {trained_mask.sum().item():,} True values")
print(f"  Untrained mask: {untrained_mask.sum().item():,} True values")


Creating token masks...

✓ Created masks
  Trained mask: 30,590 True values
  Untrained mask: 19,667 True values
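
The two masks are complements of each other, so their counts must add up to the vocabulary size (here 30,590 + 19,667 = 50,257). A dependency-free sketch of the same partition, using plain Python lists instead of tensors, is shown below; `partition_vocab` and the toy token stream are hypothetical.

```python
from typing import Iterable, List, Tuple

def partition_vocab(
    token_ids: Iterable[int], vocab_size: int
) -> Tuple[List[bool], List[int], List[int]]:
    """Return (trained_mask, trained_ids, untrained_ids) for a token stream."""
    seen = set(token_ids)
    bad = [t for t in seen if not 0 <= t < vocab_size]
    if bad:
        raise ValueError(f"token IDs outside vocabulary: {sorted(bad)}")
    # True where the token ID occurs at least once in the stream.
    trained_mask = [i in seen for i in range(vocab_size)]
    trained_ids = [i for i, hit in enumerate(trained_mask) if hit]
    untrained_ids = [i for i, hit in enumerate(trained_mask) if not hit]
    return trained_mask, trained_ids, untrained_ids

# Toy example: a 10-token vocabulary and a short token stream.
mask, trained, untrained = partition_vocab([3, 1, 3, 7, 1], vocab_size=10)
print(trained)
print(untrained)

# The partition invariant: every ID is in exactly one group.
assert len(trained) + len(untrained) == len(mask)
```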


## Save Token Masks

In [9]:
print(f"\nSaving token masks to {MASK_OUTPUT}...\n")

# Ensure directory exists
Path(MASK_OUTPUT).parent.mkdir(parents=True, exist_ok=True)

# Save to safetensors
save_file(
    {
        'trained_mask': trained_mask,
        'untrained_mask': untrained_mask,
        'trained_indices': trained_indices,
        'untrained_indices': untrained_indices,
    },
    str(MASK_OUTPUT)
)

print(f"✓ Saved token masks")
print(f"  Path: {MASK_OUTPUT}")


Saving token masks to ../tensors/Wordybird/fineweb_token_masks.safetensors...

✓ Saved token masks
  Path: ../tensors/Wordybird/fineweb_token_masks.safetensors


## Summary

In [10]:
print(f"\n{'='*70}")
print(f"CORPUS PREPARATION COMPLETE")
print(f"{'='*70}\n")

print(f"Corpus:")
print(f"  Path: {CORPUS_OUTPUT}")
print(f"  Size: {actual_mb:.2f} MB")
print(f"  Tokens: {len(tokens):,}")
print()

print(f"Tokenizer: GPT-2")
print(f"  Vocabulary: {vocab_size:,} tokens")
print(f"  Trained: {len(trained_token_ids):,} ({100*len(trained_token_ids)/vocab_size:.1f}%)")
print(f"  Untrained: {len(untrained_token_ids):,} ({100*len(untrained_token_ids)/vocab_size:.1f}%)")
print()

print(f"Token masks saved to:")
print(f"  {MASK_OUTPUT}")
print()

print(f"Ready for Wordybird training!")
print(f"{'='*70}")


CORPUS PREPARATION COMPLETE

Corpus:
  Path: ../data/fineweb_2mb_unicode.txt
  Size: 2.00 MB
  Tokens: 475,160

Tokenizer: GPT-2
  Vocabulary: 50,257 tokens
  Trained: 30,590 (60.9%)
  Untrained: 19,667 (39.1%)

Token masks saved to:
  ../tensors/Wordybird/fineweb_token_masks.safetensors

Ready for Wordybird training!
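
To relate the corpus size to the "100-1000 training steps" estimate, the token count can be divided by the tokens consumed per step. The batch sizes and sequence lengths below are hypothetical (the notebook does not state Wordybird's training configuration), and the sketch assumes one pass over the corpus with non-overlapping sequences.

```python
TOTAL_TOKENS = 475_160      # token count from the summary above
CORPUS_BYTES = 2_098_654    # UTF-8 size reported after download

def steps_per_epoch(total_tokens: int, batch_size: int, seq_len: int) -> int:
    """Number of full batches of non-overlapping sequences in one pass."""
    return total_tokens // (batch_size * seq_len)

print(f"Bytes per token: {CORPUS_BYTES / TOTAL_TOKENS:.2f}")

# Hypothetical configurations, not taken from the notebook.
for batch_size, seq_len in [(8, 64), (32, 128)]:
    steps = steps_per_epoch(TOTAL_TOKENS, batch_size, seq_len)
    print(f"batch={batch_size:2d} seq_len={seq_len:3d} -> {steps} steps")
```

Under these assumed settings one pass yields 928 and 116 steps respectively, which is consistent with the 100-1000 range.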
