# 1.11b: ASCII Tokenizer Creation

**Goal:** Create a simple 128-token ASCII tokenizer for Lil Gatsby experiments.

## Overview

This notebook creates the tokenizer infrastructure for the 1.12 series (Gatsby training experiments).

**Design:**
- Vocabulary: 128 ASCII characters (bytes 0-127)
- Token ID = ASCII value (direct mapping)
- No BPE, no subword tokenization: just bytes
- HuggingFace-compatible for easy integration with transformers
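
As a rough illustration of the identity mapping in the design list above, the sketch below encodes and decodes using only Python's built-in ASCII codec. The helper names `ascii_encode` and `ascii_decode` are hypothetical and do not appear elsewhere in this notebook; the tokenizer built later wraps the same idea in HuggingFace classes.

```python
def ascii_encode(text: str) -> list[int]:
    """Map each character to its ASCII value; raises UnicodeEncodeError on non-ASCII input."""
    return list(text.encode("ascii"))


def ascii_decode(token_ids: list[int]) -> str:
    """Inverse mapping: token IDs back to text; raises on IDs outside 0-127."""
    return bytes(token_ids).decode("ascii")


ids = ascii_encode("Gatsby\n")
print(ids)                # [71, 97, 116, 115, 98, 121, 10]
print(ascii_decode(ids) == "Gatsby\n")  # True
```

Because the token ID is the byte value itself, no lookup table is needed for the round trip; the vocabulary file exists mainly so that HuggingFace tooling can load the tokenizer.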

## Outputs

1. **Tokenizer files** → `../data/tokenizers/ascii_128/`
   - `tokenizer.json` (HuggingFace format)
   - `vocab.json` (ASCII → token ID mapping)
   
2. **Token classification** → `../tensors/Lil_Gatsby/`
   - `1.11b_live_tokens.safetensors` (tokens that appear in Gatsby)
   - `1.11b_dead_tokens.safetensors` (tokens that never appear)

## Why This Matters

The dead tokens are our "primordial atom" test subjects. During training they still receive gradients from the loss function, but they are never reinforced by actually appearing in the training data. Their behavior reveals how bfloat16 quantization and gradient dynamics sculpt the embedding space.
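
To make the point about gradients concrete, here is a minimal sketch that assumes the model ends in a standard softmax cross-entropy layer (this notebook does not specify the architecture, so that is an assumption). For that loss, the gradient with respect to each logit is the softmax probability minus 1 for the target token and the plain softmax probability for every other token, so a token that is never the target still gets a small positive gradient pushing its logit down. The function name and the toy logits are hypothetical.

```python
import math


def cross_entropy_logit_grads(logits: list[float], target: int) -> list[float]:
    """Gradient of softmax cross-entropy w.r.t. each logit: softmax(logits) - one_hot(target)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


# Toy 4-token vocabulary: token 3 plays the role of a dead token (never the target).
grads = cross_entropy_logit_grads([2.0, 1.0, 0.5, 0.0], target=0)
print(grads)
```

In this toy case the target's gradient is negative (its logit is pushed up by gradient descent), while the dead token's gradient is small and positive, which is the weak, one-directional signal the experiments are meant to study.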

## Parameters

In [1]:
# Tokenizer parameters
VOCAB_SIZE = 128  # ASCII characters 0-127

# Input path (clean corpus from 1.11a)
CORPUS_PATH = "../data/gatsby_clean.txt"

# Output paths
TOKENIZER_DIR = "../data/tokenizers/ascii_128"
TENSOR_DIR = "../tensors/Lil_Gatsby"

## Imports

In [2]:
import json
import torch
import numpy as np
from pathlib import Path
from collections import Counter
from tokenizers import Tokenizer, pre_tokenizers, models, processors
from transformers import PreTrainedTokenizerFast
from safetensors.torch import save_file

## Create Output Directories

In [3]:
# Create directories if they don't exist
Path(TOKENIZER_DIR).mkdir(parents=True, exist_ok=True)
Path(TENSOR_DIR).mkdir(parents=True, exist_ok=True)

print(f"✓ Created output directories:")
print(f"  Tokenizer: {TOKENIZER_DIR}")
print(f"  Tensors:   {TENSOR_DIR}")

✓ Created output directories:
  Tokenizer: ../data/tokenizers/ascii_128
  Tensors:   ../tensors/Lil_Gatsby


## Load Clean Corpus

In [4]:
# Load the clean corpus from 1.11a
with open(CORPUS_PATH, 'r', encoding='utf-8') as f:
    corpus_text = f.read()

print(f"✓ Loaded clean corpus from {CORPUS_PATH}")
print(f"  Characters: {len(corpus_text):,}")
print(f"  Lines: {len(corpus_text.splitlines()):,}")

✓ Loaded clean corpus from ../data/gatsby_clean.txt
  Characters: 266,462
  Lines: 1,613


## Analyze Token Usage in Gatsby

In [5]:
# Convert to ASCII bytes; errors='ignore' silently drops any non-ASCII characters
corpus_bytes = corpus_text.encode('ascii', errors='ignore')

# Count which ASCII bytes appear
byte_counts = Counter(corpus_bytes)

# Identify live vs dead tokens
all_tokens = set(range(128))
live_tokens = set(byte_counts.keys())
dead_tokens = all_tokens - live_tokens

# Sort for consistent ordering
live_tokens_list = sorted(live_tokens)
dead_tokens_list = sorted(dead_tokens)

print(f"Token usage analysis:")
print(f"  Live tokens: {len(live_tokens_list)} (appear in Gatsby)")
print(f"  Dead tokens: {len(dead_tokens_list)} (never appear)")
print()

# Show the dead tokens
print(f"Dead tokens (ASCII values):")
print(f"  {dead_tokens_list}")
print()

# Interpret them as characters (where printable)
print(f"Dead tokens (characters):")
dead_chars = []
for token_id in dead_tokens_list:
    if 32 <= token_id < 127:  # Printable ASCII
        dead_chars.append(f"{token_id}:'{chr(token_id)}'")
    else:  # Control characters
        dead_chars.append(f"{token_id}:<ctrl>")

# Print in rows of 10 for readability
for i in range(0, len(dead_chars), 10):
    print(f"  {', '.join(dead_chars[i:i+10])}")

Token usage analysis:
  Live tokens: 76 (appear in Gatsby)
  Dead tokens: 52 (never appear)

Dead tokens (ASCII values):
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 34, 35, 37, 38, 39, 43, 47, 60, 61, 62, 64, 88, 92, 94, 95, 96, 123, 124, 125, 126, 127]

Dead tokens (characters):
  0:<ctrl>, 1:<ctrl>, 2:<ctrl>, 3:<ctrl>, 4:<ctrl>, 5:<ctrl>, 6:<ctrl>, 7:<ctrl>, 8:<ctrl>, 9:<ctrl>
  11:<ctrl>, 12:<ctrl>, 13:<ctrl>, 14:<ctrl>, 15:<ctrl>, 16:<ctrl>, 17:<ctrl>, 18:<ctrl>, 19:<ctrl>, 20:<ctrl>
  21:<ctrl>, 22:<ctrl>, 23:<ctrl>, 24:<ctrl>, 25:<ctrl>, 26:<ctrl>, 27:<ctrl>, 28:<ctrl>, 29:<ctrl>, 30:<ctrl>
  31:<ctrl>, 34:'"', 35:'#', 37:'%', 38:'&', 39:''', 43:'+', 47:'/', 60:'<', 61:'='
  62:'>', 64:'@', 88:'X', 92:'\', 94:'^', 95:'_', 96:'`', 123:'{', 124:'|', 125:'}'
  126:'~', 127:<ctrl>
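
Because the analysis cell encodes with `errors='ignore'`, any non-ASCII character in the corpus would be dropped silently and would never show up in the live/dead counts. The dead list includes both `"` (34) and `'` (39), which is unusual for a novel with dialogue, so it may be worth checking whether typographic quotes were dropped at this step or already handled in 1.11a. A check along the following lines could confirm it; `find_non_ascii` is a hypothetical helper, and the sample string stands in for `corpus_text`.

```python
from collections import Counter


def find_non_ascii(text: str) -> Counter:
    """Count every character whose code point falls outside the 0-127 ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Stand-in sample; in the notebook this would be corpus_text.
sample = "caf\u00e9 \u201cquoted\u201d text"
dropped = find_non_ascii(sample)
print(dropped)

# Number of characters that errors='ignore' would discard from the sample.
lost = len(sample) - len(sample.encode("ascii", errors="ignore"))
print(lost)
```

An empty counter on the real corpus would mean the encode step is lossless and the live/dead split reflects the full text.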


## Create ASCII Vocabulary

In [6]:
# Create vocabulary: character → token ID (identity mapping)
# Use readable representations for printable characters, hex for control chars
vocab = {}
for i in range(128):
    if 32 <= i < 127:  # Printable ASCII
        vocab[chr(i)] = i
    else:  # Control characters (0-31) and DEL (127)
        vocab[f"<0x{i:02X}>"] = i

# Save vocab.json
vocab_path = Path(TOKENIZER_DIR) / "vocab.json"
with open(vocab_path, 'w') as f:
    json.dump(vocab, f, indent=2)

print(f"✓ Created vocabulary with {len(vocab)} tokens")
print(f"  Saved to {vocab_path}")
print()
print(f"Sample vocab entries:")
sample_items = list(vocab.items())[:10]
for char, token_id in sample_items:
    print(f"  '{char}' → {token_id}")

✓ Created vocabulary with 128 tokens
  Saved to ../data/tokenizers/ascii_128/vocab.json

Sample vocab entries:
  '<0x00>' → 0
  '<0x01>' → 1
  '<0x02>' → 2
  '<0x03>' → 3
  '<0x04>' → 4
  '<0x05>' → 5
  '<0x06>' → 6
  '<0x07>' → 7
  '<0x08>' → 8
  '<0x09>' → 9


## Build HuggingFace Tokenizer

In [7]:
# Create a simple character-level tokenizer
# We need to build it manually to ensure byte-level mapping

# Create inverse vocab (token_id → character)
id_to_token = {v: k for k, v in vocab.items()}

# Build tokenizer with WordLevel model (exact string matching)
tokenizer = Tokenizer(models.WordLevel(vocab=vocab, unk_token="<0x00>"))

# Pre-tokenize by splitting the input into individual characters
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(pattern="", behavior="isolated")  # Split on every character
])

# Pass-through post-processor: emit the sequence as-is, with no special tokens added
tokenizer.post_processor = processors.TemplateProcessing(
    single="$A",
    special_tokens=[]
)

# Save the tokenizer
tokenizer_path = Path(TOKENIZER_DIR) / "tokenizer.json"
tokenizer.save(str(tokenizer_path))

print(f"✓ Created HuggingFace tokenizer")
print(f"  Saved to {tokenizer_path}")

✓ Created HuggingFace tokenizer
  Saved to ../data/tokenizers/ascii_128/tokenizer.json


## Test the Tokenizer

In [8]:
# Test with a sample from Gatsby
test_text = "In my younger years\n"

# Encode
encoded = tokenizer.encode(test_text)
token_ids = encoded.ids

print(f"Tokenizer test:")
print(f"  Input:  {test_text!r}")
print(f"  Tokens: {token_ids}")
print(f"  ASCII:  {[ord(c) for c in test_text]}")
print()

# Verify they match
expected = [ord(c) for c in test_text]
if token_ids == expected:
    print("✓ Tokenizer working correctly (token IDs match ASCII values)")
else:
    print("✗ ERROR: Token IDs don't match ASCII values!")
    print(f"  Expected: {expected}")
    print(f"  Got:      {token_ids}")

Tokenizer test:
  Input:  'In my younger years\n'
  Tokens: [73, 110, 32, 109, 121, 32, 121, 111, 117, 110, 103, 101, 114, 32, 121, 101, 97, 114, 115, 0]
  ASCII:  [73, 110, 32, 109, 121, 32, 121, 111, 117, 110, 103, 101, 114, 32, 121, 101, 97, 114, 115, 10]

✗ ERROR: Token IDs don't match ASCII values!
  Expected: [73, 110, 32, 109, 121, 32, 121, 111, 117, 110, 103, 101, 114, 32, 121, 101, 97, 114, 115, 10]
  Got:      [73, 110, 32, 109, 121, 32, 121, 111, 117, 110, 103, 101, 114, 32, 121, 101, 97, 114, 115, 0]
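
The only mismatch above is the final token: the newline (ASCII 10) comes back as 0. A likely cause is that the vocabulary stores control characters under escaped names such as `<0x0A>`, so the raw character `\n` is not a key and falls through to the `unk_token`, `<0x00>`, whose ID is 0. The sketch below reproduces that lookup in plain Python, without the `tokenizers` library, and shows one possible fix: keying every entry by the raw character. `build_vocab` and `lookup_encode` are hypothetical helpers, and whether a raw-character vocabulary suits the rest of the pipeline is an assumption that would need verifying.

```python
def build_vocab(escape_control_chars: bool) -> dict[str, int]:
    """Build a 128-entry vocab; optionally name control characters as <0xNN>."""
    vocab = {}
    for i in range(128):
        if escape_control_chars and not (32 <= i < 127):
            vocab[f"<0x{i:02X}>"] = i   # same naming as the vocabulary cell above
        else:
            vocab[chr(i)] = i           # raw character as the key
    return vocab


def lookup_encode(text: str, vocab: dict[str, int], unk_token: str) -> list[int]:
    """Per-character exact-match lookup with an unknown-token fallback."""
    unk_id = vocab[unk_token]
    return [vocab.get(ch, unk_id) for ch in text]


test_text = "In my younger years\n"
escaped = lookup_encode(test_text, build_vocab(True), unk_token="<0x00>")
raw = lookup_encode(test_text, build_vocab(False), unk_token="\x00")

print(escaped[-1])  # 0: "\n" is not a key, so it falls back to the unknown token
print(raw[-1])      # 10: "\n" is a key and maps to its ASCII value
print(raw == [ord(c) for c in test_text])  # True
```

If this diagnosis holds, every newline in the corpus would be encoded as token 0, which is listed as dead and also serves as the padding token, so the live/dead classification would disagree with what the model actually sees. Re-running the test cell after any change to the vocabulary is the way to confirm.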


## Create Transformers-Compatible Tokenizer

In [9]:
# Wrap in PreTrainedTokenizerFast for use with transformers library
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    unk_token="<0x00>",
    pad_token="<0x00>",  # Null byte doubles as padding; it shares ID 0 with unk_token
    bos_token=None,
    eos_token=None,
    clean_up_tokenization_spaces=False
)

# Save the wrapped tokenizer
fast_tokenizer.save_pretrained(TOKENIZER_DIR)

print(f"✓ Created transformers-compatible tokenizer")
print(f"  Saved to {TOKENIZER_DIR}")

✓ Created transformers-compatible tokenizer
  Saved to ../data/tokenizers/ascii_128


## Save Token Classifications

In [10]:
# Convert to tensors
live_tokens_tensor = torch.tensor(live_tokens_list, dtype=torch.long)
dead_tokens_tensor = torch.tensor(dead_tokens_list, dtype=torch.long)

# Save live tokens
live_path = Path(TENSOR_DIR) / "1.11b_live_tokens.safetensors"
save_file(
    {'token_ids': live_tokens_tensor},
    str(live_path),
    metadata={
        'description': 'ASCII tokens that appear in The Great Gatsby corpus',
        'count': str(len(live_tokens_list)),
        'source': 'Project Gutenberg edition 64317',
        'corpus': 'gatsby_clean.txt from 1.11a'
    }
)

# Save dead tokens
dead_path = Path(TENSOR_DIR) / "1.11b_dead_tokens.safetensors"
save_file(
    {'token_ids': dead_tokens_tensor},
    str(dead_path),
    metadata={
        'description': 'ASCII tokens that never appear in The Great Gatsby corpus',
        'count': str(len(dead_tokens_list)),
        'source': 'Project Gutenberg edition 64317',
        'corpus': 'gatsby_clean.txt from 1.11a'
    }
)

print(f"✓ Saved token classifications:")
print(f"  Live tokens: {live_path}")
print(f"  Dead tokens: {dead_path}")

✓ Saved token classifications:
  Live tokens: ../tensors/Lil_Gatsby/1.11b_live_tokens.safetensors
  Dead tokens: ../tensors/Lil_Gatsby/1.11b_dead_tokens.safetensors
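
Downstream notebooks will presumably index embeddings with these ID lists, so it can be worth confirming that the two sets partition the vocabulary before relying on them. The following is a minimal sketch in plain Python, using a short stand-in string instead of the real corpus; `classify_tokens` is a hypothetical helper that mirrors the analysis cell.

```python
def classify_tokens(text: str, vocab_size: int = 128) -> tuple[list[int], list[int]]:
    """Return sorted (live, dead) token IDs for an ASCII text."""
    live = set(text.encode("ascii", errors="ignore"))
    dead = set(range(vocab_size)) - live
    return sorted(live), sorted(dead)


live_ids, dead_ids = classify_tokens("In my younger years\n")

# The two lists should be disjoint and together cover every ID exactly once.
assert not set(live_ids) & set(dead_ids)
assert sorted(live_ids + dead_ids) == list(range(128))
print(len(live_ids), len(dead_ids))
```

The counts reported in this notebook (76 live, 52 dead) are consistent with such a partition, since they sum to 128.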


## Summary

In [11]:
print("\n" + "=" * 80)
print("TOKENIZER CREATION COMPLETE")
print("=" * 80)
print()
print(f"Vocabulary:")
print(f"  Total tokens:  {VOCAB_SIZE}")
print(f"  Live tokens:   {len(live_tokens_list)} (appear in Gatsby)")
print(f"  Dead tokens:   {len(dead_tokens_list)} (primordial atom test subjects)")
print()
print(f"Output files:")
print(f"  Tokenizer:     {TOKENIZER_DIR}/tokenizer.json")
print(f"  Vocabulary:    {TOKENIZER_DIR}/vocab.json")
print(f"  Transformers:  {TOKENIZER_DIR}/tokenizer_config.json")
print(f"  Live tokens:   {TENSOR_DIR}/1.11b_live_tokens.safetensors")
print(f"  Dead tokens:   {TENSOR_DIR}/1.11b_dead_tokens.safetensors")
print()
print(f"Token statistics:")
print(f"  Most common live tokens:")
for byte_val, count in byte_counts.most_common(10):
    char = chr(byte_val) if 32 <= byte_val < 127 else f"<0x{byte_val:02X}>"
    print(f"    {byte_val:3d} '{char}': {count:,} occurrences")
print()
print(f"Next steps:")
print(f"  → 1.12a: Train Lil Gatsby with this tokenizer")
print(f"  → 1.13x: Analyze dead token behavior during training")
print()
print("=" * 80)


TOKENIZER CREATION COMPLETE

Vocabulary:
  Total tokens:  128
  Live tokens:   76 (appear in Gatsby)
  Dead tokens:   52 (primordial atom test subjects)

Output files:
  Tokenizer:     ../data/tokenizers/ascii_128/tokenizer.json
  Vocabulary:    ../data/tokenizers/ascii_128/vocab.json
  Transformers:  ../data/tokenizers/ascii_128/tokenizer_config.json
  Live tokens:   ../tensors/Lil_Gatsby/1.11b_live_tokens.safetensors
  Dead tokens:   ../tensors/Lil_Gatsby/1.11b_dead_tokens.safetensors

Token statistics:
  Most common live tokens:
     32 ' ': 46,699 occurrences
    101 'e': 25,007 occurrences
    116 't': 18,091 occurrences
     97 'a': 16,839 occurrences
    111 'o': 15,736 occurrences
    110 'n': 14,063 occurrences
    105 'i': 12,531 occurrences
    115 's': 12,368 occurrences
    104 'h': 12,239 occurrences
    114 'r': 11,340 occurrences

Next steps:
  → 1.12a: Train Lil Gatsby with this tokenizer
  → 1.13x: Analyze dead token behavior during training

