# Multi-Dataset Pipeline (Prepare → Mix → Train → Export → Validate → Benchmark)

This notebook runs `scripts/run_multi_dataset_pipeline.py`, a re-runnable, dev-only pipeline script.

What it does:
- Converts real Hugging Face datasets into local span-JSONL files (text plus gold entity spans)
- Mixes multiple prepared datasets into a single training set
- Fine-tunes a token-classification model
- Exports to ONNX, validates parity, optionally INT8-quantizes, and benchmarks
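
The prepare and mix steps above can be illustrated with a minimal sketch. The record schema here (`text`, `entities`, `start`, `end`, `label`), the helper names `validate_record` and `mix_datasets`, and the concatenate-then-shuffle strategy are all assumptions made for illustration; the actual span-JSONL format and mixing logic are defined by the pipeline script and may differ.

```python
import json
import random

def validate_record(record):
    """Check that every entity span is non-empty and lies inside the text."""
    text = record["text"]
    for ent in record["entities"]:
        start, end = ent["start"], ent["end"]
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"bad span {start}:{end} for text of length {len(text)}")
    return record

def mix_datasets(datasets, seed=0):
    """Concatenate several lists of records and shuffle them deterministically."""
    mixed = [validate_record(r) for records in datasets for r in records]
    random.Random(seed).shuffle(mixed)
    return mixed

# Two tiny hypothetical "prepared" datasets (character offsets, end exclusive).
dataset_a = [{
    "text": "张伟住在北京。",
    "entities": [
        {"start": 0, "end": 2, "label": "PER"},
        {"start": 4, "end": 6, "label": "LOC"},
    ],
}]
dataset_b = [{
    "text": "联系李娜。",
    "entities": [{"start": 2, "end": 4, "label": "PER"}],
}]

mixed = mix_datasets([dataset_a, dataset_b], seed=13)

# One JSON object per line; ensure_ascii=False keeps the Chinese text readable.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in mixed)
print(jsonl)
```

Fixing the shuffle seed means a re-run over the same prepared files yields the same training order, which is one way to keep a mixed training set reproducible.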

Notes:
- The first run typically needs `--allow-network` to download datasets and (optionally) the base model.
- Subsequent runs can be offline once caches are populated.
- Outputs follow the AI_WAREHOUSE 3.0 layout under `/mnt/c` (cache/models) and `/mnt/data` (datasets/training).


In [None]:
import os

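# AI_WAREHOUSE 3.0 cache layout (avoid $HOME/.cache).
# setdefault keeps any value already exported in the environment.
# Run this cell before importing Hugging Face libraries and before the `!` cells
# below, which inherit these variables from the kernel process.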
os.environ.setdefault('EDGE_DEID_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('EDGE_DEID_MODELS_HOME', '/mnt/c/ai_models')
os.environ.setdefault('EDGE_DEID_DATA_HOME', '/mnt/data')

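os.environ.setdefault('HF_HOME', '/mnt/c/ai_cache/huggingface')
# TRANSFORMERS_CACHE is deprecated in recent transformers releases in favor of HF_HOME;
# it is kept here for older versions.
os.environ.setdefault('TRANSFORMERS_CACHE', os.environ['HF_HOME'])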
os.environ.setdefault('TORCH_HOME', '/mnt/c/ai_cache/torch')
os.environ.setdefault('XDG_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('PIP_CACHE_DIR', '/mnt/c/ai_cache/pip')
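
Before a long run it can help to confirm that the configured cache directories exist and are writable. The following is a minimal sketch of such a check; the helper `ensure_cache_dirs` is hypothetical and not part of the pipeline, and the demo runs against a temporary directory rather than `/mnt/c`.

```python
import os
import tempfile
from pathlib import Path

# Variables set in the cell above that point at cache directories.
CACHE_VARS = ("HF_HOME", "TORCH_HOME", "XDG_CACHE_HOME", "PIP_CACHE_DIR")

def ensure_cache_dirs(env, names=CACHE_VARS):
    """Create each configured cache directory; return the variables whose path is not writable."""
    not_writable = []
    for name in names:
        value = env.get(name)
        if not value:
            continue  # variable unset: nothing to check
        path = Path(value)
        path.mkdir(parents=True, exist_ok=True)
        if not os.access(path, os.W_OK):
            not_writable.append(name)
    return not_writable

# Demo against a temporary directory; in the notebook you would pass os.environ.
with tempfile.TemporaryDirectory() as tmp:
    demo_env = {"HF_HOME": f"{tmp}/huggingface", "TORCH_HOME": f"{tmp}/torch"}
    problems = ensure_cache_dirs(demo_env)
    created = sorted(p.name for p in Path(tmp).iterdir())

print("not writable:", problems)
```

Note that `mkdir` itself raises `PermissionError` if a parent directory cannot be written to, so a failure here surfaces before any download starts.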


In [None]:
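# End-to-end run on real datasets; --allow-network is needed on the first run to download them.
# --config pins the run settings in a YAML file so the run is reproducible.
# --trust-remote-code presumably allows code shipped with the datasets or model to execute,
# so review those sources before enabling it.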
!PYTHONPATH=src python scripts/run_multi_dataset_pipeline.py \
  --config configs/training/multi_zh_ner_demo.yaml \
  --allow-network \
  --trust-remote-code


In [None]:
import json
from pathlib import Path

report_path = Path('/mnt/data/training/logs/edge_deid/multi-zh-ner-demo/report.json')
report = json.loads(report_path.read_text(encoding='utf-8'))

print('ONNX model:', report.get('onnx_model'))
print('INT8 model:', report.get('onnx_model_int8'))
print('Benchmark:', report.get('benchmark_onnx_ner', {}))
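
The cell above raises `FileNotFoundError` if the pipeline failed before writing its report. A slightly more defensive variant could look like the sketch below. The function name `summarize_report` is hypothetical; the keys `onnx_model`, `onnx_model_int8`, and `benchmark_onnx_ner` are the ones read above, and the demo uses a temporary file instead of the real report.

```python
import json
import tempfile
from pathlib import Path

def summarize_report(path):
    """Return the fields of interest from a report file, or None if the file is missing."""
    path = Path(path)
    if not path.is_file():
        return None
    report = json.loads(path.read_text(encoding="utf-8"))
    return {
        "onnx_model": report.get("onnx_model"),
        "onnx_model_int8": report.get("onnx_model_int8"),  # None when INT8 was skipped
        "benchmark": report.get("benchmark_onnx_ner", {}),
    }

# Demo with a stand-in report containing only one of the expected keys.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "report.json"
    demo.write_text(json.dumps({"onnx_model": "model.onnx"}), encoding="utf-8")
    summary = summarize_report(demo)
    missing = summarize_report(Path(tmp) / "absent.json")

print(summary)
print(missing)
```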


In [None]:
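# Offline re-run: omit --allow-network once the first run has populated the caches and
# prepared the JSONL. --json-out writes to a separate file, leaving the first report intact.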
!PYTHONPATH=src python scripts/run_multi_dataset_pipeline.py \
  --config configs/training/multi_zh_ner_demo.yaml \
  --json-out /mnt/data/training/logs/edge_deid/multi-zh-ner-demo/report.offline.json
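
To make an offline re-run fail fast on a missing cache entry instead of attempting a download, one option is to switch the Hugging Face libraries into offline mode. `HF_HUB_OFFLINE`, `HF_DATASETS_OFFLINE`, and `TRANSFORMERS_OFFLINE` are environment variables read by `huggingface_hub`, `datasets`, and `transformers`. Whether the pipeline script already sets them when `--allow-network` is omitted is not stated here, so treat this as an optional sketch; `offline_env` is a hypothetical helper.

```python
import os

# Offline switches read by huggingface_hub, datasets and transformers respectively.
OFFLINE_FLAGS = {
    "HF_HUB_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
}

def offline_env(base=None):
    """Return a copy of the environment with the offline switches turned on."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLINE_FLAGS)
    return env

# Demo on a small stand-in environment; the input mapping is not modified.
env = offline_env({"HF_HOME": "/mnt/c/ai_cache/huggingface"})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

In the notebook, calling `os.environ.update(OFFLINE_FLAGS)` before the `!` cell should have the same effect, since shell cells inherit the kernel's environment. The variables need to be set before the libraries are imported in the process that uses them.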
