# 03 - Synthetic Data Generation

This notebook walks through the Claude-powered synthetic data generation pipeline for DocuMind.
The pipeline generates three types of synthetic training data:

1. **Instruction variants** - Diverse paraphrases of the receipt extraction instruction
2. **Synthetic receipts** - Realistic receipt JSON following the CORD v2 schema
3. **Error augmentations** - OCR-like corruptions paired with clean corrections

This notebook uses small sample sizes for demonstration. Use `scripts/generate_synthetic.py` for full-scale generation.

In [None]:
# ── Setup: works on both Colab and local ──────────────────────────
import os, sys

IN_COLAB = 'google.colab' in sys.modules or os.path.exists('/content')

if IN_COLAB:
    try:
        os.getcwd()
    except OSError:
        os.chdir("/content")

    REPO_URL = "https://github.com/NaveenPrasanth/DocuLLM-Finetune.git"
    REPO_DIR = "/content/DocuLLM-Finetune"
    if not os.path.exists(REPO_DIR):
        os.chdir("/content")
        !git clone {REPO_URL} {REPO_DIR}
    os.chdir(REPO_DIR)
    !pip install -q "anthropic>=0.18" "datasets>=2.20.0" "omegaconf>=2.3" \
        "pydantic>=2.5" "rapidfuzz>=3.5" "python-dotenv>=1.0" rich
    !pip install -q -e .
else:
    sys.path.insert(0, '..')

import json
from pathlib import Path

from src.config import get_env_var, load_base_config
from src.data.cord_loader import get_cord_schema
from src.data.synthetic_generator import SyntheticGenerator

print('Imports loaded successfully.')
print(f'CORD schema keys: {list(get_cord_schema().keys())}')

In [None]:
# Initialize generator with API key
# Set your key in the environment or paste it below (do NOT commit secrets)
api_key = get_env_var('ANTHROPIC_API_KEY')

if not api_key:
    raise RuntimeError(
        'ANTHROPIC_API_KEY not found. '
        'Set it in your .env file or export it in your shell.'
    )

generator = SyntheticGenerator(api_key=api_key)
print(f'Generator ready (model={generator.model})')

In [None]:
# Generate 5 instruction variants (demo)
instruction_variants = generator.generate_instruction_variants(num_variants=5)

print(f'Generated {len(instruction_variants)} instruction variants:\n')
for i, variant in enumerate(instruction_variants, 1):
    print(f'  [{i}] {variant}\n')

In [None]:
# Generate 3 synthetic receipts (demo)
synthetic_receipts = generator.generate_synthetic_receipts(num_receipts=3)

print(f'Generated {len(synthetic_receipts)} synthetic receipts:\n')
for i, receipt in enumerate(synthetic_receipts, 1):
    parsed = json.loads(receipt['ground_truth_json'])
    print(f'--- Receipt {i} ---')
    print(json.dumps(parsed, indent=2, ensure_ascii=False))
    print()

In [None]:
# Generate 2 error augmentation pairs (demo)
error_pairs = generator.generate_error_augmentations(
    receipts=synthetic_receipts,
    num_pairs=2,
)

print(f'Generated {len(error_pairs)} error augmentation pairs:\n')
for i, pair in enumerate(error_pairs, 1):
    print(f'=== Pair {i} ===')
    print('\nCORRUPTED (simulated OCR errors):')
    print(json.dumps(pair['corrupted'], indent=2, ensure_ascii=False))
    print('\nCORRECTED (clean ground truth):')
    print(json.dumps(pair['corrected'], indent=2, ensure_ascii=False))
    print()

In [None]:
# Display examples and statistics
usage = generator.get_usage_summary()

print('=' * 50)
print('  Generation Statistics')
print('=' * 50)
print(f'  Instruction variants : {len(instruction_variants)}')
print(f'  Synthetic receipts   : {len(synthetic_receipts)}')
print(f'  Error augment pairs  : {len(error_pairs)}')
print('-' * 50)
print(f'  Input tokens         : {usage["input_tokens"]:,}')
print(f'  Output tokens        : {usage["output_tokens"]:,}')
print(f'  Estimated cost (USD) : ${usage["estimated_cost_usd"]:.4f}')
print('=' * 50)

# Analyze receipt complexity
print('\nReceipt complexity analysis:')
for i, receipt in enumerate(synthetic_receipts, 1):
    parsed = json.loads(receipt['ground_truth_json'])
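    menu = parsed.get('menu', [])
    # A lone menu item may be stored as a dict rather than a one-element list
    num_menu_items = len(menu) if isinstance(menu, list) else 1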
    has_total = 'total' in parsed
    has_subtotal = 'sub_total' in parsed
    total_fields = sum(
        len(v) if isinstance(v, (list, dict)) else 1
        for v in parsed.values()
    )
    print(
        f'  Receipt {i}: {num_menu_items} menu items, '
        f'total={has_total}, subtotal={has_subtotal}, '
        f'~{total_fields} fields (one level deep)'
    )

# Analyze error types
print('\nError augmentation examples:')
for i, pair in enumerate(error_pairs, 1):
    corrupted_str = json.dumps(pair['corrupted'])
    corrected_str = json.dumps(pair['corrected'])
    # Rough diff: compare characters position by position over the shared prefix.
    # An inserted or dropped character shifts everything after it, so this count
    # can overstate the number of true substitutions.
    diffs = sum(1 for a, b in zip(corrupted_str, corrected_str) if a != b)
    len_diff = abs(len(corrupted_str) - len(corrected_str))
    print(
        f'  Pair {i}: {diffs} positional char mismatches, '
        f'{len_diff} length difference'
    )

## Analysis

### Instruction Variants
The generated instruction variants cover multiple communication styles (formal, casual, concise, detailed, etc.), which helps the model generalise across different user phrasings at inference time. During training, we randomly sample from these variants instead of using a single fixed instruction.
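
The sampling step described above can be sketched as follows. This is a minimal illustration, not the project's actual training code: `attach_random_instruction`, the `instruction` key, and the sample data are all hypothetical names chosen for this example.

```python
import random


def attach_random_instruction(examples, instruction_variants, seed=0):
    """Pair each training example with a randomly chosen instruction variant."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return [
        {**example, "instruction": rng.choice(instruction_variants)}
        for example in examples
    ]


# Hypothetical data for illustration only.
variants = [
    "Extract all fields from this receipt as JSON.",
    "Please parse the receipt and return structured JSON.",
    "Receipt -> JSON.",
]
examples = [{"ground_truth_json": "{}"} for _ in range(4)]

for row in attach_random_instruction(examples, variants):
    print(row["instruction"])
```

Sampling per example (rather than per epoch or per batch) keeps the mapping between a receipt and its instruction from becoming a fixed pattern the model could memorise.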

### Synthetic Receipts
Each synthetic receipt follows the CORD v2 schema with varying complexity:
- **Simple receipts** (2-3 items) simulate quick purchases
- **Complex receipts** (6-8 items) simulate restaurant or grocery bills
- Prices, quantities, and totals are mathematically plausible
- Sparse fields (void_menu, discount, e-money) appear only occasionally, matching real-world distributions
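
One way to spot-check the "mathematically plausible" property is to compare the sum of item prices against the stated subtotal. The sketch below assumes CORD v2 style field names (`menu[].price` as the line total and `sub_total.subtotal_price`) and prices stored as strings with optional comma thousands separators. The generator's real output may differ, so treat the helper names and the sample receipt as hypothetical.

```python
def parse_price(value):
    """Convert a price string such as '12,500' or '4.50' to a float.

    Assumes commas are thousands separators, not decimal marks.
    """
    return float(str(value).replace(",", ""))


def check_receipt_totals(receipt, tolerance=0.01):
    """Return True if the menu item prices sum to the stated subtotal."""
    menu = receipt.get("menu", [])
    if isinstance(menu, dict):  # a single item may be stored as a bare dict
        menu = [menu]
    items_sum = sum(parse_price(item["price"]) for item in menu)
    subtotal = parse_price(receipt["sub_total"]["subtotal_price"])
    return abs(items_sum - subtotal) <= tolerance


# Hypothetical receipt for illustration only.
sample = {
    "menu": [
        {"nm": "Latte", "cnt": "2", "price": "9.00"},
        {"nm": "Bagel", "cnt": "1", "price": "3.50"},
    ],
    "sub_total": {"subtotal_price": "12.50"},
    "total": {"total_price": "12.50"},
}

print(check_receipt_totals(sample))
```

Receipts that fail this check are not necessarily wrong (discounts or service charges can sit between the item sum and the subtotal), so the result is best used as a flag for manual review.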

### Error Augmentations
OCR error simulation covers the most common failure modes:
- Character confusion (l/1, O/0, S/5), which is frequent in receipt fonts
- Word merging and splitting caused by poor line segmentation
- Number corruption, which directly impacts financial accuracy
- Field truncation, simulating partial scans

Training on these pairs teaches the model to recover clean data from noisy OCR input.
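
To make the character-confusion failure mode concrete, here is a small rule-based stand-in. It is not how the notebook's generator produces corruptions (that step is delegated to `generate_error_augmentations`). `CONFUSIONS`, `corrupt_text`, and `make_pair` are hypothetical names, and the sketch works on plain strings rather than full receipt JSON.

```python
import random

# Visually similar character pairs (the confusions listed above, in both directions).
CONFUSIONS = {"l": "1", "1": "l", "O": "0", "0": "O", "S": "5", "5": "S"}


def corrupt_text(text, rate=0.3, seed=0):
    """Swap each confusable character for its look-alike with probability `rate`."""
    rng = random.Random(seed)  # seeded so corruptions are reproducible
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def make_pair(clean_text, rate=0.3, seed=0):
    """Build a corrupted/corrected pair using the same keys as the cells above."""
    return {
        "corrupted": corrupt_text(clean_text, rate, seed),
        "corrected": clean_text,
    }


pair = make_pair("TOTAL 105.00 SOLD", rate=0.5, seed=42)
print(pair["corrupted"])
print(pair["corrected"])
```

A rule-based corruptor like this only covers substitutions. Word merging, splitting, and truncation change the string length and would need separate handling.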

### Cost Considerations
For production-scale generation (20 instructions + 100 receipts + 50 error pairs), expect approximately:
- ~50k-100k input tokens and ~100k-200k output tokens
- Estimated cost: roughly $1.65-3.30 USD with Claude 3.5 Sonnet, assuming $3 per million input tokens and $15 per million output tokens
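
The arithmetic behind this estimate can be reproduced with a few lines. The per-token prices below are assumptions ($3 per million input tokens and $15 per million output tokens for Claude 3.5 Sonnet), so check current pricing before budgeting. `estimate_cost_usd` is a hypothetical helper, separate from `generator.get_usage_summary()`.

```python
# Assumed list prices in USD per million tokens; verify against current pricing.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens, output_tokens):
    """Estimate API cost in USD from token counts and the assumed prices."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


low = estimate_cost_usd(50_000, 100_000)
high = estimate_cost_usd(100_000, 200_000)
print(f"Estimated cost: ${low:.2f} to ${high:.2f}")
```

Under those assumptions the token ranges above work out to about $1.65 to $3.30, with output tokens accounting for most of the cost.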

### Next Steps
Run the full generation pipeline with:
```bash
python scripts/generate_synthetic.py --num-receipts 100 --num-error-pairs 50 --output-dir data/synthetic
```
Then integrate synthetic data into training via `dataset_builder.add_synthetic_data()`.