# 1.18a: Flannel Corpus Preparation

**Goal:** Prepare two disjoint corpora from FineWeb (English) and FineWeb-2 (Thai) for the Flannel experimental series.

## What is Flannel?

Flannel is our custom tokenizer experiment. We're testing whether we can engineer dead tokens (vocabulary entries that never occur in the model's training data) by training a tokenizer on multilingual data, then training a model on English-only data.

**Hypothesis:** If the tokenizer is trained on 80% English + 20% Thai (measured in UTF-8 bytes), we expect hundreds of Thai tokens to end up in the vocabulary. Training the model on 100% English then means those Thai tokens never appear during training → engineered dead tokens.
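To make the hypothesis concrete, here is a minimal sketch of what "dead" means operationally: a vocabulary entry whose ID never shows up in the tokenized training data. The vocabulary and ID sequence below are toy values invented for illustration, and `find_dead_tokens` is a hypothetical helper, not part of this notebook series.

```python
def find_dead_tokens(vocab: dict[str, int], training_ids: list[int]) -> list[str]:
    """Return vocabulary tokens whose IDs never occur in the training IDs."""
    seen = set(training_ids)
    return sorted(tok for tok, idx in vocab.items() if idx not in seen)


# Toy vocabulary (hypothetical): three English pieces and two Thai pieces.
toy_vocab = {"the": 0, "cat": 1, "sat": 2, "สวัส": 3, "ดี": 4}

# Token IDs of a toy English-only training corpus: the Thai IDs 3 and 4 never occur.
toy_training_ids = [0, 1, 2, 0, 1]

dead = find_dead_tokens(toy_vocab, toy_training_ids)
print(dead)
```

In the real experiment the vocabulary would come from the trained tokenizer and the IDs from the encoded model corpus.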

## This Notebook

1. Download ~80 MB English from FineWeb (original, highest quality English)
2. Download ~20 MB Thai from FineWeb-2 (`tha_Thai`)
3. Combine into tokenizer training corpus (~100 MB total)
4. Download ~5 MB English from a disjoint section for model training
5. Save both corpora as UTF-8 text files

## Dataset Choices

- **English:** FineWeb (original), `sample-10BT` config - high-quality English web data
- **Thai:** FineWeb-2 `tha_Thai` subset - multilingual follow-up to FineWeb (Dec 2024)
- Both come from the FineWeb family of processing pipelines, which keeps the two sources comparable

## Outputs

- `../data/flannel_tokenizer_corpus.txt` (~100 MB, 80% English + 20% Thai)
- `../data/flannel_model_corpus.txt` (~5 MB, 100% English, disjoint from tokenizer corpus)

## Parameters

In [9]:
# Tokenizer corpus (English + Thai)
TOKENIZER_ENGLISH_MB = 80.0
TOKENIZER_THAI_MB = 20.0
TOKENIZER_OUTPUT = "../data/flannel_tokenizer_corpus.txt"

# Model corpus (English only, disjoint)
MODEL_ENGLISH_MB = 5.0
MODEL_OUTPUT = "../data/flannel_model_corpus.txt"

# Datasets
ENGLISH_DATASET = "HuggingFaceFW/fineweb"
ENGLISH_CONFIG = "sample-10BT"  # 10B token sample
THAI_DATASET = "HuggingFaceFW/fineweb-2"
THAI_SUBSET = "tha_Thai"

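# Skip count for model corpus (intended to keep the two English samples disjoint).
# NOTE: the skip must be at least the number of English documents consumed by the
# tokenizer corpus. The run recorded below consumed 26,868, so 10,000 is too small
# to guarantee disjoint samples.
MODEL_CORPUS_SKIP = 10000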

# Random seed
RANDOM_SEED = 42

## Imports

In [10]:
from datasets import load_dataset
from pathlib import Path
import random

random.seed(RANDOM_SEED)

print("✓ Imports complete")

✓ Imports complete


## Create Output Directory

In [11]:
Path("../data").mkdir(parents=True, exist_ok=True)
print("✓ Created ../data directory")

✓ Created ../data directory


## Download English for Tokenizer Corpus

Stream documents from FineWeb (original) until their accumulated UTF-8 size reaches the target.

In [12]:
print(f"Downloading ~{TOKENIZER_ENGLISH_MB} MB of English for tokenizer corpus...\n")

# Load English dataset in streaming mode
english_dataset = load_dataset(
    ENGLISH_DATASET,
    name=ENGLISH_CONFIG,
    split="train",
    streaming=True
)

# Collect texts until we hit target size
target_bytes = int(TOKENIZER_ENGLISH_MB * 1024 * 1024)
english_texts = []
total_bytes = 0

for example in english_dataset:
    text = example['text']
    text_bytes = len(text.encode('utf-8'))
    
    english_texts.append(text)
    total_bytes += text_bytes
    
    if total_bytes >= target_bytes:
        break

english_bytes = sum(len(t.encode('utf-8')) for t in english_texts)
english_mb = english_bytes / (1024 * 1024)

print(f"✓ Downloaded English corpus")
print(f"  Documents: {len(english_texts):,}")
print(f"  Bytes (UTF-8): {english_bytes:,} ({english_mb:.2f} MB)")

Downloading ~80.0 MB of English for tokenizer corpus...



✓ Downloaded English corpus
  Documents: 26,868
  Bytes (UTF-8): 83,888,103 (80.00 MB)
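The same collect-until-budget loop is repeated for each corpus in this notebook. As a sketch, it could be factored into one helper; `collect_until_bytes` is a hypothetical name, and the fake stream below stands in for a streaming dataset so the example runs offline.

```python
from typing import Iterable


def collect_until_bytes(
    examples: Iterable[dict], target_bytes: int, text_key: str = "text"
) -> tuple[list[str], int]:
    """Collect texts from an iterable of records until the UTF-8 byte budget is met.

    The document that crosses the budget is kept, so the returned total is
    always >= target_bytes (unless the iterable runs out first).
    """
    texts: list[str] = []
    total = 0
    for example in examples:
        text = example[text_key]
        texts.append(text)
        total += len(text.encode("utf-8"))
        if total >= target_bytes:
            break
    return texts, total


# Stand-in for a streaming dataset: 1000 records of 400 ASCII bytes each.
fake_stream = ({"text": "x" * 400} for _ in range(1000))
texts, total = collect_until_bytes(fake_stream, target_bytes=1024)
print(len(texts), total)
```

Because the last document is always kept whole, the result slightly overshoots the target. In the recorded run the target was 80 × 1024 × 1024 = 83,886,080 bytes and the collected total was 83,888,103, an overshoot of 2,023 bytes.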


## Download Thai for Tokenizer Corpus

In [13]:
print(f"\nDownloading ~{TOKENIZER_THAI_MB} MB of Thai for tokenizer corpus...\n")

# Load Thai dataset in streaming mode
thai_dataset = load_dataset(
    THAI_DATASET,
    name=THAI_SUBSET,
    split="train",
    streaming=True
)

# Collect texts until we hit target size
target_bytes = int(TOKENIZER_THAI_MB * 1024 * 1024)
thai_texts = []
total_bytes = 0

for example in thai_dataset:
    text = example['text']
    text_bytes = len(text.encode('utf-8'))
    
    thai_texts.append(text)
    total_bytes += text_bytes
    
    if total_bytes >= target_bytes:
        break

thai_bytes = sum(len(t.encode('utf-8')) for t in thai_texts)
thai_mb = thai_bytes / (1024 * 1024)

print(f"✓ Downloaded Thai corpus")
print(f"  Documents: {len(thai_texts):,}")
print(f"  Bytes (UTF-8): {thai_bytes:,} ({thai_mb:.2f} MB)")


Downloading ~20.0 MB of Thai for tokenizer corpus...



✓ Downloaded Thai corpus
  Documents: 2,189
  Bytes (UTF-8): 20,974,721 (20.00 MB)
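Because the budgets are measured in UTF-8 bytes, 20 MB of Thai holds far fewer characters than 20 MB of English: characters in the Thai Unicode block take three bytes each, while ASCII takes one. The short sketch below illustrates this with two sample strings (chosen here for illustration) and derives the average document size from the counts printed above.

```python
samples = {"english": "hello", "thai": "สวัสดี"}

for label, text in samples.items():
    n_chars = len(text)                     # Unicode code points
    n_bytes = len(text.encode("utf-8"))     # bytes on disk
    print(f"{label}: {n_chars} code points, {n_bytes} bytes")

# Average document size, using the counts from the recorded outputs above.
english_avg = 83_888_103 / 26_868
thai_avg = 20_974_721 / 2_189
print(f"avg English doc: {english_avg:,.0f} bytes")
print(f"avg Thai doc:    {thai_avg:,.0f} bytes")
```

So the 80/20 split is a split of bytes, not of characters or documents; measured in characters the Thai share would be noticeably smaller.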


## Combine and Save Tokenizer Corpus

We'll keep them sequential (English first, then Thai). BPE tokenizer training doesn't care about document order; it just counts pairs globally.

In [14]:
print(f"\nCombining tokenizer corpus...\n")

# Combine: English first, then Thai (sequential is fine for BPE)
tokenizer_corpus = '\n\n'.join(english_texts + thai_texts)

tokenizer_bytes = len(tokenizer_corpus.encode('utf-8'))
tokenizer_mb = tokenizer_bytes / (1024 * 1024)

print(f"✓ Combined tokenizer corpus")
print(f"  English docs: {len(english_texts):,} ({english_mb:.2f} MB)")
print(f"  Thai docs: {len(thai_texts):,} ({thai_mb:.2f} MB)")
print(f"  Total docs: {len(english_texts) + len(thai_texts):,}")
print(f"  Total size: {tokenizer_bytes:,} bytes ({tokenizer_mb:.2f} MB)")
print(f"  Ratio: {100*english_mb/tokenizer_mb:.1f}% English, {100*thai_mb/tokenizer_mb:.1f}% Thai")


Combining tokenizer corpus...

✓ Combined tokenizer corpus
  English docs: 26,868 (80.00 MB)
  Thai docs: 2,189 (20.00 MB)
  Total docs: 29,057
  Total size: 104,920,936 bytes (100.06 MB)
  Ratio: 80.0% English, 20.0% Thai
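The order-independence claim can be checked on a toy example. The sketch below counts adjacent character pairs inside whitespace-separated words, which is a simplified stand-in for the first counting pass of BPE training, not the algorithm of any particular tokenizer library. The documents are invented for illustration.

```python
from collections import Counter


def pair_counts(docs: list[str]) -> Counter:
    """Count adjacent symbol pairs within each whitespace-separated word."""
    counts: Counter = Counter()
    for doc in docs:
        for word in doc.split():
            counts.update(zip(word, word[1:]))
    return counts


docs = ["low lower lowest", "new newer", "สวัสดี ดี"]

forward = pair_counts(docs)
backward = pair_counts(list(reversed(docs)))
print(forward == backward)
print(forward.most_common(3))
```

The counts are sums over documents, so any document order gives the same totals. One caveat: how a real trainer breaks ties between equally frequent pairs is implementation-specific, so this sketch says nothing about tie-breaking.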


In [15]:
print(f"\nSaving tokenizer corpus to {TOKENIZER_OUTPUT}...\n")

with open(TOKENIZER_OUTPUT, 'w', encoding='utf-8') as f:
    f.write(tokenizer_corpus)

print(f"✓ Saved tokenizer corpus")
print(f"  Path: {TOKENIZER_OUTPUT}")
print(f"  Size: {tokenizer_mb:.2f} MB")


Saving tokenizer corpus to ../data/flannel_tokenizer_corpus.txt...

✓ Saved tokenizer corpus
  Path: ../data/flannel_tokenizer_corpus.txt
  Size: 100.06 MB
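The saved file is 100.06 MB rather than exactly the sum of the two parts because `'\n\n'.join(...)` adds a two-byte separator between each pair of adjacent documents. A quick check with the numbers from the recorded outputs above:

```python
english_docs, english_bytes = 26_868, 83_888_103
thai_docs, thai_bytes = 2_189, 20_974_721

# One "\n\n" separator between each pair of adjacent documents.
separator_bytes = len("\n\n".encode("utf-8")) * (english_docs + thai_docs - 1)

total_bytes = english_bytes + thai_bytes + separator_bytes
total_mb = total_bytes / (1024 * 1024)

print(f"separators: {separator_bytes:,} bytes")
print(f"total:      {total_bytes:,} bytes ({total_mb:.2f} MB)")
```

This reproduces the 104,920,936 bytes reported for the combined corpus.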


## Download English for Model Corpus (Disjoint)

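Skip ahead in the English stream so the model corpus does not reuse tokenizer documents. This only works if the skip is at least the number of English documents taken for the tokenizer corpus (26,868 in the run above). With `MODEL_CORPUS_SKIP = 10000`, the documents collected here fall inside that range, so the two corpora from this run overlap; raise the skip to at least `len(english_texts)` to get truly disjoint samples.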

In [16]:
print(f"\nDownloading ~{MODEL_ENGLISH_MB} MB of English for model corpus...\n")
print(f"(Skipping first {MODEL_CORPUS_SKIP:,} documents to ensure disjoint samples)\n")

# Reload English dataset (streaming)
english_dataset_model = load_dataset(
    ENGLISH_DATASET,
    name=ENGLISH_CONFIG,
    split="train",
    streaming=True
)

# Skip ahead to avoid overlap
english_dataset_model = english_dataset_model.skip(MODEL_CORPUS_SKIP)

# Collect texts until we hit target size
target_bytes = int(MODEL_ENGLISH_MB * 1024 * 1024)
model_texts = []
total_bytes = 0

for example in english_dataset_model:
    text = example['text']
    text_bytes = len(text.encode('utf-8'))
    
    model_texts.append(text)
    total_bytes += text_bytes
    
    if total_bytes >= target_bytes:
        break

model_bytes = sum(len(t.encode('utf-8')) for t in model_texts)
model_mb = model_bytes / (1024 * 1024)

print(f"✓ Downloaded model corpus")
print(f"  Documents: {len(model_texts):,}")
print(f"  Bytes (UTF-8): {model_bytes:,} ({model_mb:.2f} MB)")


Downloading ~5.0 MB of English for model corpus...

(Skipping first 10,000 documents to ensure disjoint samples)



✓ Downloaded model corpus
  Documents: 1,607
  Bytes (UTF-8): 5,253,807 (5.01 MB)
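Disjointness is worth verifying rather than assuming. The sketch below shows two checks: an index-range test using the document counts from the recorded outputs, and a content-level test using hashes. Both helper names are hypothetical, and the range test assumes the stream yields documents in the same order on both passes (the notebook applies no shuffling).

```python
import hashlib


def index_ranges_overlap(a_start: int, a_count: int, b_start: int, b_count: int) -> bool:
    """True if half-open ranges [a_start, a_start+a_count) and [b_start, b_start+b_count) intersect."""
    return a_start < b_start + b_count and b_start < a_start + a_count


def count_shared_documents(corpus_a: list[str], corpus_b: list[str]) -> int:
    """Count documents in corpus_b whose exact text also appears in corpus_a."""
    digests = {hashlib.sha256(t.encode("utf-8")).hexdigest() for t in corpus_a}
    return sum(
        1 for t in corpus_b
        if hashlib.sha256(t.encode("utf-8")).hexdigest() in digests
    )


# Recorded run: tokenizer used docs [0, 26868), model used 1,607 docs from index 10000.
overlaps_as_run = index_ranges_overlap(0, 26_868, 10_000, 1_607)

# Same model corpus size, but skipping past every tokenizer document.
overlaps_with_safe_skip = index_ranges_overlap(0, 26_868, 26_868, 1_607)

print(overlaps_as_run, overlaps_with_safe_skip)

# Content-level check on toy data.
print(count_shared_documents(["a", "b", "c"], ["c", "d"]))
```

In the notebook itself, `count_shared_documents(english_texts, model_texts)` would give a direct answer, and it also catches duplicate documents that an index check cannot see.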


## Save Model Corpus

In [17]:
print(f"\nSaving model corpus to {MODEL_OUTPUT}...\n")

# Combine model texts
model_corpus = '\n\n'.join(model_texts)

with open(MODEL_OUTPUT, 'w', encoding='utf-8') as f:
    f.write(model_corpus)

print(f"✓ Saved model corpus")
print(f"  Path: {MODEL_OUTPUT}")
print(f"  Size: {model_mb:.2f} MB")


Saving model corpus to ../data/flannel_model_corpus.txt...

✓ Saved model corpus
  Path: ../data/flannel_model_corpus.txt
  Size: 5.01 MB
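The sizes printed in this notebook are computed in memory, not read from disk. A minimal sketch of confirming that a saved file matches, using a temporary directory and toy documents so it runs anywhere; passing `newline="\n"` is an addition here to keep the byte count identical on platforms that would otherwise translate line endings.

```python
import tempfile
from pathlib import Path

toy_texts = ["first doc", "second doc", "เอกสาร"]
corpus = "\n\n".join(toy_texts)
expected_bytes = len(corpus.encode("utf-8"))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpus.txt"
    # newline="\n" disables platform-specific newline translation on write.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(corpus)

    on_disk_bytes = path.stat().st_size
    recovered = path.read_text(encoding="utf-8").split("\n\n")

print(on_disk_bytes == expected_bytes)
print(recovered == toy_texts)
```

Note that splitting on `"\n\n"` only recovers the original documents when no document contains a blank line itself, which is not guaranteed for web text. For tokenizer training this does not matter, since document boundaries are not needed.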


## Summary

In [18]:
print(f"\n{'='*70}")
print(f"CORPUS PREPARATION COMPLETE")
print(f"{'='*70}\n")

print(f"Tokenizer Training Corpus:")
print(f"  Path: {TOKENIZER_OUTPUT}")
print(f"  Size: {tokenizer_mb:.2f} MB")
print(f"  Composition: {100*english_mb/tokenizer_mb:.1f}% English, {100*thai_mb/tokenizer_mb:.1f}% Thai")
print(f"  Documents: {len(english_texts):,} English + {len(thai_texts):,} Thai = {len(english_texts) + len(thai_texts):,} total")
print()

print(f"Model Training Corpus:")
print(f"  Path: {MODEL_OUTPUT}")
print(f"  Size: {model_mb:.2f} MB")
print(f"  Composition: 100% English")
print(f"  Documents: {len(model_texts):,}")
print(f"  Disjoint: ✓ (skipped first {MODEL_CORPUS_SKIP:,} English docs)")
print()

print(f"Next steps:")
print(f"  → Use {TOKENIZER_OUTPUT} to train BPE tokenizer (notebook 1.19a)")
print(f"  → Use {MODEL_OUTPUT} to train Flannel models (notebook 1.20a+)")
print(f"  → Expect hundreds of Thai tokens to be dead (never appear in model training)")
print()
print(f"{'='*70}")


CORPUS PREPARATION COMPLETE

Tokenizer Training Corpus:
  Path: ../data/flannel_tokenizer_corpus.txt
  Size: 100.06 MB
  Composition: 80.0% English, 20.0% Thai
  Documents: 26,868 English + 2,189 Thai = 29,057 total

Model Training Corpus:
  Path: ../data/flannel_model_corpus.txt
  Size: 5.01 MB
  Composition: 100% English
  Documents: 1,607
  Disjoint: ✓ (skipped first 10,000 English docs)

Next steps:
  → Use ../data/flannel_tokenizer_corpus.txt to train BPE tokenizer (notebook 1.19a)
  → Use ../data/flannel_model_corpus.txt to train Flannel models (notebook 1.20a+)
  → Expect hundreds of Thai tokens to be dead (never appear in model training)

