# Chapter 7 Companion Notebook
**Build Your First LLM, Chapter 7: Preparing Your Data**
This notebook walks through the data prep steps: cleaning, deduping, splitting, chunking, and saving to JSONL with quick stats.
Tiny toy data is inline; no external files needed.


In [None]:
# ===== IMPORTS =====
import re           # Regular expressions for pattern matching in text
import json         # Read/write JSON and JSONL files
import hashlib      # Create fingerprints for duplicate detection
import unicodedata  # Unicode normalization (handles special characters)
from collections import Counter  # Count word frequencies
import random       # Shuffle data for train/val/test splits

print('Setup complete')

## Python Tools Quick Reference

This notebook uses several Python tools. Here's a quick guide:

**Regular Expressions (regex):** Pattern-based find/replace in text
- `re.sub(pattern, replacement, text)`: find all matches of `pattern` and replace them with `replacement`
- Pattern `<[^>]+>` means: find `<`, then one or more chars that aren't `>`, then `>` → matches HTML tags

**Hashing:** Create a compact "fingerprint" for any text
- Same text → same hash (always). Different text → different hash (almost always; accidental collisions are extremely unlikely)
- Useful for detecting duplicates without comparing entire documents
- `hashlib.sha1(text.encode()).hexdigest()` → 40-character hex fingerprint

**JSON/JSONL:** Data formats for storing structured data
- **JSON:** One document holding all records (usually parsed in full before use, so the whole file ends up in memory)
- **JSONL:** One JSON record per line (can be streamed line by line, which is memory efficient)
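
The line-by-line property of JSONL can be sketched with the standard library alone. This is a small illustration rather than part of the pipeline; the in-memory `io.StringIO` buffer stands in for a file on disk, and the two toy records are made up for the example.

```python
import io
import json

# Write two toy records in JSONL form: one JSON object per line.
# io.StringIO stands in for a real file here (an assumption for the demo).
buffer = io.StringIO()
for record in [{"text": "first"}, {"text": "second"}]:
    buffer.write(json.dumps(record) + "\n")

# Read back: each line is parsed on its own, so only one record
# has to be decoded at a time.
buffer.seek(0)
streamed = []
for line in buffer:
    record = json.loads(line)
    streamed.append(record["text"])

print(streamed)
```

With a real file, the same loop works on the object returned by `open(path, encoding="utf-8")`.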

**Sets:** Collections with no duplicates, fast "is X in this set?" checking
- `seen = set()` then `seen.add(item)` and `item in seen`

**Type hints** (`: str`, `: float = 0.8`): Documentation for humans and tools (Python does not enforce them at runtime)
- `text: str` means "text should be a string"
- `train_p: float = 0.8` means "train_p should be a float, defaults to 0.8"
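
That type hints are not checked when the code runs can be seen in a few lines. The function `scale` below is a hypothetical helper written only for this illustration.

```python
def scale(value: float, factor: float = 0.8) -> float:
    """Hypothetical helper used only to illustrate type hints."""
    return value * factor

# The hint says float, but nothing checks it at call time:
# an int is accepted because int * float is a valid operation.
print(scale(10))

# A string is not rejected up front either; the error only appears
# when the multiplication itself is attempted.
try:
    scale("10")
except TypeError as err:
    print("TypeError:", err)

# The hints are stored on the function, where linters and type checkers can read them.
print(scale.__annotations__)
```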

## Sample texts
A few toy paragraphs to simulate a tiny corpus.


In [None]:
raw_docs = [
    # Escapes: \u2014 em dash, \u2026 ellipsis, \u201c and \u201d smart quotes
    "THE TIME MACHINE \u2014 CHAPTER I\n\nThis   is   a    sample text\u2026 with   odd spacing, smart \u201cquotes\u201d, and tabs\t.",
    "AI systems learn from examples. Data quality shapes model quality.",
    "The key to machine learning is data; the secret to building AI is understanding.",
]
print('Docs:', len(raw_docs))


## Cleaning & normalizing

Apply Unicode normalization, strip HTML tags, collapse whitespace (including newlines and tabs), and normalize smart quotes.


In [None]:
def clean_text(text: str) -> str:
    """Clean and normalize text for LLM training."""
    # Unicode normalization: ﬁ → fi, ｆｕｌｌ → full, etc.
    # NFKC = Compatibility decomposition + Canonical composition
    text = unicodedata.normalize("NFKC", text)
    
    # Strip HTML tags: <p>, <div>, <span class="foo">, etc.
    # Pattern: < followed by any chars that aren't >, then >
    text = re.sub(r"<[^>]+>", " ", text)
    
    # Replace newlines and tabs with spaces
    text = re.sub(r"[\n\t]", " ", text)
    
    # Collapse multiple spaces into one, remove leading/trailing spaces
    text = re.sub(r"\s+", " ", text).strip()
    
    # Normalize smart double quotes (U+201C, U+201D) to straight quotes
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    
    return text

cleaned_docs = [clean_text(d) for d in raw_docs]
for i, d in enumerate(cleaned_docs):
    print(f"Cleaned {i}: {d[:80]}...")
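
To see what NFKC normalization does and does not change, it can be tried on a few short strings. A small sketch using only `unicodedata`; the sample strings are chosen for illustration and written as escapes so they survive copy-paste.

```python
import unicodedata

samples = {
    "ligature": "\ufb01le",                    # "fi" ligature followed by "le"
    "fullwidth": "\uff46\uff55\uff4c\uff4c",   # fullwidth letters spelling "full"
    "ellipsis": "wait\u2026",                  # single-character ellipsis
    "smart quotes": "\u201chi\u201d",          # curly double quotes around "hi"
}

# Apply NFKC to each sample and show the result with repr() so that
# any remaining non-ASCII characters stay visible.
normalized = {name: unicodedata.normalize("NFKC", s) for name, s in samples.items()}
for name, value in normalized.items():
    print(f"{name:>12}: {value!r}")
```

NFKC expands the ligature, the fullwidth letters, and the ellipsis, but it leaves curly quotes as they are, which is why quotes need their own replacement step.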

## Deduplication

Hash paragraphs, drop repeats.


In [None]:
# Create a "fingerprint" for text using SHA1 hash
# Same text → same fingerprint (always)
# Different text → different fingerprint (with overwhelming probability)
def hash_chunk(text: str) -> str:
    # .encode("utf-8") converts string to bytes (required by hashlib)
    # .hexdigest() returns the hash as a 40-character string
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Demonstrate hashing
sample = "The cat sat on the mat."
print(f"Text: '{sample}'")
print(f"Hash: {hash_chunk(sample)}")
print(f"Same text same hash: {hash_chunk(sample) == hash_chunk(sample)}")
print(f"Different text different hash: {hash_chunk(sample) != hash_chunk('Dog.')}")

def dedup_chunks(chunks):
    """Remove duplicate chunks using hash-based fingerprinting."""
    seen = set()    # Track hashes we've seen (fast lookup!)
    unique = []     # Keep only unique chunks
    
    for c in chunks:
        h = hash_chunk(c)
        if h in seen:
            continue        # Skip duplicate
        seen.add(h)         # Remember this hash
        unique.append(c)    # Keep the chunk
    
    return unique

# Add a duplicate to prove deduplication works
test_docs = cleaned_docs + [cleaned_docs[0]]  # Add copy of first doc
deduped = dedup_chunks(test_docs)
print(f'\nBefore dedup: {len(test_docs)} docs')
print(f'After dedup:  {len(deduped)} docs')
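
Exact hashing treats texts that differ only in case or spacing as distinct. One possible extension, shown here as a sketch rather than as this chapter's method, is to hash a normalized key while keeping the original text. The helper names `normalized_key` and `dedup_normalized` are hypothetical.

```python
import hashlib
import re

def normalized_key(text: str) -> str:
    """Hash a lowercased, whitespace-collapsed copy of the text (hypothetical helper)."""
    canonical = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def dedup_normalized(texts):
    """Keep the first occurrence of each text, comparing by normalized key."""
    seen = set()
    unique = []
    for t in texts:
        key = normalized_key(t)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique

docs = ["The cat sat.", "the  cat sat.", "The dog ran."]
print(dedup_normalized(docs))
```

The second string differs from the first only in case and spacing, so it is dropped, while the original wording of the first occurrence is kept.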

## Train/val/test split (by document)

Keep related text together; avoid leakage.


In [None]:
def split_docs(docs: list, train_p: float = 0.8, val_p: float = 0.1, seed: int = 42):
    """Split documents into train/val/test sets.
    
    Args:
        docs: List of documents to split
        train_p: Proportion for training (default 0.8 = 80%)
        val_p: Proportion for validation (default 0.1 = 10%)
        seed: Random seed for reproducibility
    
    Returns:
        train, val, test lists (test gets remaining proportion)
    """
    # Work on a copy to avoid mutating the caller's list
    docs = list(docs)
    random.seed(seed)
    random.shuffle(docs)
    
    n = len(docs)
    n_train = int(n * train_p)
    n_val = int(n * val_p)
    
    train = docs[:n_train]
    val = docs[n_train:n_train + n_val]
    test = docs[n_train + n_val:]
    
    return train, val, test

# Apply split to our deduped docs
train_docs, val_docs, test_docs = split_docs(deduped, train_p=0.6, val_p=0.2, seed=42)
print(f'Train: {len(train_docs)}, Val: {len(val_docs)}, Test: {len(test_docs)}')
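
Because the split sizes are computed with `int()`, which truncates, a very small corpus can end up with an empty split. The arithmetic can be checked directly; `split_sizes` below is a hypothetical helper that mirrors only the size calculation, not the shuffling.

```python
def split_sizes(n: int, train_p: float, val_p: float):
    """Return (n_train, n_val, n_test) using truncation, as int() does."""
    n_train = int(n * train_p)
    n_val = int(n * val_p)
    n_test = n - n_train - n_val  # test takes whatever is left
    return n_train, n_val, n_test

# 3 documents at 60/20/20: 3 * 0.6 truncates to 1, 3 * 0.2 truncates to 0
print(split_sizes(3, 0.6, 0.2))

# 100 documents at 80/10/10 divide evenly
print(split_sizes(100, 0.8, 0.1))
```

With three documents the validation set is empty and the test set receives two documents, so it is worth printing the sizes after every split on toy data.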


## Chunking for context windows
Break long text with overlap to preserve context across chunk boundaries.


In [None]:
def chunk_text(text: str, max_chars: int = 200, overlap: int = 50):
    """Break text into overlapping chunks.
    
    Args:
        text: Input text to chunk
        max_chars: Maximum characters per chunk
        overlap: Number of characters to overlap between chunks
    
    Returns:
        List of text chunks
    """
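    if overlap >= max_chars:
        # Otherwise the window would never move forward (infinite loop)
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break  # reached the end; another step would only repeat the tail
        start += max_chars - overlap  # advance by (max_chars - overlap)
    return chunks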

# Apply chunking to all splits
chunked = []
for split, docs in [('train', train_docs), ('val', val_docs), ('test', test_docs)]:
    for d in docs:
        for c in chunk_text(d, max_chars=120, overlap=30):
            chunked.append({'text': c, 'split': split, 'source': 'toy'})
            
print('Total chunks:', len(chunked))
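
The start of each chunk follows from the stride `max_chars - overlap`. The sketch below lists the character spans for a given text length; `chunk_spans` is a hypothetical helper, and it stops as soon as a span reaches the end of the text, which is a choice made for this illustration.

```python
def chunk_spans(text_len: int, max_chars: int, overlap: int):
    """List (start, end) character spans for overlapping chunks."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    stride = max_chars - overlap  # how far the window moves each step
    spans = []
    start = 0
    while start < text_len:
        end = min(text_len, start + max_chars)
        spans.append((start, end))
        if end == text_len:
            break  # the text is fully covered
        start += stride
    return spans

print(chunk_spans(500, max_chars=200, overlap=50))
```

Each span begins 150 characters after the previous one, so consecutive spans share 50 characters.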


## Visualize Overlap
See how chunks overlap to preserve context across boundaries.


In [None]:
# Create a test text with clear positions
test_text = "A" * 500  # 500 characters
chunks = chunk_text(test_text, max_chars=200, overlap=50)

print(f"Text length: {len(test_text)}")
print(f"Number of chunks: {len(chunks)}")
print(f"Chunk lengths: {[len(c) for c in chunks]}")

# Verify overlap between consecutive chunks
# Python slice notation:
#   text[-50:]  = last 50 characters (negative index counts from end)
#   text[:50]   = first 50 characters
if len(chunks) >= 2:
    # Last 50 chars of chunk 0 should equal first 50 chars of chunk 1
    chunk0_end = chunks[0][-50:]    # Last 50 chars of chunk 0
    chunk1_start = chunks[1][:50]   # First 50 chars of chunk 1
    overlap_matches = chunk0_end == chunk1_start
    
    print(f"\nOverlap verification: {overlap_matches}")
    print(f"Chunk 0 ends with: ...{chunks[0][-10:]}")
    print(f"Chunk 1 starts with: {chunks[1][:10]}...")
    
# With real text
real_text = "This is sentence one. This is sentence two. This is sentence three. " * 5
real_chunks = chunk_text(real_text, max_chars=100, overlap=30)
print(f"\nReal text chunked into {len(real_chunks)} pieces")
print(f"Chunk 0: ...{real_chunks[0][-40:]}")
print(f"Chunk 1: {real_chunks[1][:40]}...")
print("\n✅ Overlap preserves context across chunk boundaries!")

## Quality Checks & Sanity Validation
Catch problems early with automated checks that flag empty chunks, HTML leakage, and size issues.


In [None]:
def sanity_check(chunks, stage_name):
    """Run sanity checks on data at any pipeline stage."""
    print(f"\n{'='*50}")
    print(f"Sanity Check: {stage_name}")
    print(f"{'='*50}")
    
    if not chunks:
        print("⚠️  WARNING: No chunks!")
        return
    
    # Basic stats
    print(f"✓ Total chunks: {len(chunks)}")
    lengths = [len(c) if isinstance(c, str) else len(c.get('text', '')) for c in chunks]
    avg_len = sum(lengths) / len(lengths)
    print(f"✓ Avg length: {avg_len:.0f}")
    print(f"✓ Max length: {max(lengths)}")
    print(f"✓ Min length: {min(lengths)}")
    
    # Check for issues
    if max(lengths) > 10 * avg_len:
        print("⚠️  WARNING: Max length is more than 10x average - chunking may be broken")
    
    # Check for HTML leakage: look for tag-shaped text such as <div> or </span>.
    # Matching bare words ('class', 'div') would also flag ordinary prose
    # like "classification" or "individual".
    texts = [c if isinstance(c, str) else c.get('text', '') for c in chunks]
    all_text = ' '.join(texts).lower()
    found_html = sorted(set(re.findall(r"</?([a-z][a-z0-9]*)\b[^>]*>", all_text)))
    if found_html:
        print(f"⚠️  WARNING: HTML tags found: {found_html}")
    else:
        print("✓ No HTML leakage detected")
    
    # Check for empty or very short chunks
    short = sum(1 for n in lengths if n < 10)
    if short > 0:
        print(f"⚠️  WARNING: {short} chunks are < 10 chars")
    else:
        print("✓ No empty or very short chunks")
    
    # Sample
    sample = texts[0] if texts else "N/A"
    print(f"\n✓ Sample: {sample[:100]}...")
    print()

# Run checks on our chunked data
sanity_check(chunked, "After Chunking")

# You can run this after each stage:
# sanity_check(cleaned_docs, "After Cleaning")
# sanity_check(deduped, "After Deduplication")
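
The checks above look at chunks in aggregate. A per-split count can additionally reveal an empty validation or test set. This is a minimal sketch, assuming records are dicts with a `split` field like the ones built in the chunking cell; `split_counts` is a hypothetical helper and the toy records are made up.

```python
from collections import Counter

def split_counts(records, expected=("train", "val", "test")):
    """Count records per split and list any expected split with no records."""
    counts = Counter(r["split"] for r in records)
    # Counter returns 0 for keys it has never seen, so missing splits are easy to spot
    missing = [name for name in expected if counts[name] == 0]
    return counts, missing

toy_records = [
    {"text": "a", "split": "train"},
    {"text": "b", "split": "train"},
    {"text": "c", "split": "test"},
]
counts, missing = split_counts(toy_records)
print(dict(counts), "missing:", missing)
```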


## Worked Example: End-to-End Pipeline
Complete walkthrough from raw documents (with HTML and duplicates) to JSONL-ready data.


In [None]:
print("="*60)
print("COMPLETE DATA PIPELINE WALKTHROUGH")
print("="*60)

# Step 1: Start with raw documents (messy, with duplicates and HTML)
print("\n📥 STEP 1: Raw Documents")
raw_pipeline_docs = [
    {"text": "<p>The cat sat on the mat.</p>", "source": "doc1"},
    {"text": "<p>The cat sat on the mat.</p>", "source": "doc2"},  # exact duplicate!
    {"text": "<p>The dog    ran\tin the park.</p>", "source": "doc3"},
    {"text": "The bird flew over the house.", "source": "doc4"}
]
print(f"   Raw documents: {len(raw_pipeline_docs)}")
for i, doc in enumerate(raw_pipeline_docs):
    print(f"   {i+1}. {doc['text'][:50]}...")

# Step 2: Clean each document
print("\n🧹 STEP 2: Cleaning")
for doc in raw_pipeline_docs:
    doc["text"] = clean_text(doc["text"])
print("   HTML removed, whitespace normalized")
for i, doc in enumerate(raw_pipeline_docs):
    print(f"   {i+1}. {doc['text']}")

# Step 3: Extract text and deduplicate
print("\n🔍 STEP 3: Deduplication")
pipeline_texts = [d["text"] for d in raw_pipeline_docs]
unique_pipeline = dedup_chunks(pipeline_texts)
print(f"   Before: {len(pipeline_texts)} texts")
print(f"   After:  {len(unique_pipeline)} unique texts")
for i, text in enumerate(unique_pipeline):
    print(f"   {i+1}. {text}")

# Step 4: Split into train/val/test
# With 3 unique docs, int(3 * 0.34) = 1 for both train and val, leaving 1 for test.
# (val_p=0.33 would give int(0.99) = 0, i.e. an empty validation set.)
print("\n📊 STEP 4: Train/Val/Test Split")
pipe_train, pipe_val, pipe_test = split_docs(unique_pipeline, train_p=0.34, val_p=0.34, seed=42)
print(f"   Train: {len(pipe_train)} docs - {pipe_train}")
print(f"   Val:   {len(pipe_val)} docs - {pipe_val}")
print(f"   Test:  {len(pipe_test)} docs - {pipe_test}")

# Step 5: Chunk (for longer documents, here it's small)
print("\n✂️  STEP 5: Chunking")
train_pipeline_chunks = []
for text in pipe_train:
    chunks = chunk_text(text, max_chars=50, overlap=10)
    train_pipeline_chunks.extend(chunks)
print(f"   Train chunks: {len(train_pipeline_chunks)}")
for i, chunk in enumerate(train_pipeline_chunks):
    print(f"   Chunk {i+1}: {chunk}")

# Step 6: Prepare JSONL records
print("\n💾 STEP 6: JSONL Preparation")
final_records = [
    {"text": chunk, "split": "train", "length": len(chunk), "source": "example"}
    for chunk in train_pipeline_chunks
]
print(f"   Ready to save: {len(final_records)} records")
print(f"   Example record: {final_records[0]}")

print("\n✅ PIPELINE COMPLETE!")
print(f"   Started with: {len(raw_pipeline_docs)} raw documents (with duplicate)")
print(f"   Ended with: {len(final_records)} clean, deduplicated JSONL records")
print("   Data is now ready for tokenization in Chapter 8!")


In [None]:
def save_jsonl(records, path):
    with open(path, 'w', encoding='utf-8') as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

save_jsonl(chunked, 'toy_corpus.jsonl')
print('Wrote toy_corpus.jsonl with', len(chunked), 'records')
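
Reading the file back is the mirror image of writing it. The sketch below round-trips two toy records through a temporary file; `load_jsonl` is a hypothetical counterpart to `save_jsonl`, and the records and file name are made up for the example.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Yield one parsed record per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Round-trip two toy records, one of them containing a non-ASCII character
records = [{"text": "caf\u00e9", "split": "train"}, {"text": "hello", "split": "val"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    loaded = list(load_jsonl(path))

print(loaded == records)
```

Because `load_jsonl` is a generator, a training script can iterate over it without building the full list first.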


## Quick stats

Duplicate rate, top words, and sample records.


In [None]:
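def duplicate_rate(texts):
    """Fraction of texts that are exact duplicates of an earlier text."""
    if not texts:
        return 0.0  # avoid dividing by zero on an empty corpus
    hashes = [hash_chunk(t) for t in texts]
    return 1 - (len(set(hashes)) / len(hashes))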

def top_words(texts, k=10):
    words = " ".join(texts).lower().split()
    return Counter(words).most_common(k)

texts_all = [r['text'] for r in chunked]
print('Duplicate rate:', duplicate_rate(texts_all))
print('Top words:', top_words(texts_all, k=8))
print('Sample records:', chunked[:2])
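
Splitting on whitespace leaves punctuation attached to words, so `mat.` and `mat` are counted as different tokens. A possible variant, sketched here with a simple regular expression, counts only runs of letters, digits, and apostrophes; `top_words_clean` is a hypothetical name and the token pattern is an assumption that suits plain English text.

```python
import re
from collections import Counter

def top_words_clean(texts, k=10):
    """Count lowercase word tokens, ignoring surrounding punctuation."""
    tokens = re.findall(r"[a-z0-9']+", " ".join(texts).lower())
    return Counter(tokens).most_common(k)

sample_texts = ["The cat sat on the mat.", "The mat, the cat."]
print(top_words_clean(sample_texts, k=3))
```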


## Summary

In this notebook you've learned the complete data preparation pipeline for LLM training:

1. **Text Cleaning:** Remove HTML, normalize whitespace, handle special characters
2. **Deduplication:** Use hashing to identify and remove exact duplicates
3. **Train/Val/Test Split:** Separate data at document level to prevent leakage
4. **Chunking with Overlap:** Break long texts into LLM-sized pieces while preserving context
5. **Quality Checks:** Automated sanity checks catch issues early
6. **JSONL Format:** Save data in a streaming-friendly format

**Key Concepts:**
- **Overlap** preserves context across chunk boundaries (text cut at one boundary appears whole in the neighboring chunk, as long as it is shorter than the overlap)
- **Hashing** provides fast, deterministic fingerprints for deduplication
- **Document-level splitting** keeps related chunks together in the same split
- **Quality checks** catch HTML leakage, empty chunks, and size anomalies before they cause training problems

**Next Steps:**
- Chapter 8: Tokenization (converting text → numbers)
- Scale up to real datasets (Wikipedia, Common Crawl, books)
- Experiment with different overlap values for your use case

The data pipeline is the foundation of every great LLM!
