# Large PII Training Pipeline (Prepare → Mix → Train → Export → Validate → Benchmark)

This notebook drives the re-runnable, dev-only pipeline script `scripts/run_multi_dataset_pipeline.py`.

Recommended workflow:
- First run: use the **smoke** profile (`configs/training/pii_large_mix_smoke_gpu.yaml`) to validate downloads, preprocessing, training, and ONNX export end to end.
- Full run: switch to `configs/training/pii_large_mix_gpu.yaml` and increase training duration.

Notes:
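- Network access is **opt-in**: pass `--allow-network` explicitly on the first run. Later runs can omit it once the caches and prepared JSONL exist.
- Outputs follow the AI_WAREHOUSE 3.0 layout: caches and models under `/mnt/c`, datasets and training artifacts under `/mnt/data`.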
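
Before the first run it can help to confirm that both layout roots are mounted and writable. The following is a minimal sketch, not part of the pipeline script: `WAREHOUSE_ROOTS` and `check_roots` are hypothetical names introduced here, and the two paths are simply the ones listed in the notes above.

```python
import os
from pathlib import Path

# Assumed roots, copied from the notes above; adjust if your mounts differ.
WAREHOUSE_ROOTS = {
    "cache/models": Path("/mnt/c"),
    "datasets/training": Path("/mnt/data"),
}


def check_roots(roots):
    """Return {label: problem} for every root that is missing or not writable."""
    problems = {}
    for label, root in roots.items():
        root = Path(root)
        if not root.is_dir():
            problems[label] = f"{root} does not exist or is not a directory"
        elif not os.access(root, os.W_OK):
            problems[label] = f"{root} is not writable"
    return problems


if __name__ == "__main__":
    # An empty dict means both roots look usable.
    print(check_roots(WAREHOUSE_ROOTS))
```

An empty result means both roots are present and writable; anything else is worth fixing before a long download or training run starts.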


In [None]:
import os

# AI_WAREHOUSE 3.0 cache layout (avoid $HOME/.cache).
# setdefault() keeps any value that is already exported in the environment.
os.environ.setdefault('EDGE_DEID_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('EDGE_DEID_MODELS_HOME', '/mnt/c/ai_models')
os.environ.setdefault('EDGE_DEID_DATA_HOME', '/mnt/data')

os.environ.setdefault('HF_HOME', '/mnt/c/ai_cache/huggingface')
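# Legacy variable: recent transformers releases prefer HF_HOME and warn when TRANSFORMERS_CACHE is set.
os.environ.setdefault('TRANSFORMERS_CACHE', os.environ['HF_HOME'])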
os.environ.setdefault('TORCH_HOME', '/mnt/c/ai_cache/torch')
os.environ.setdefault('XDG_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('PIP_CACHE_DIR', '/mnt/c/ai_cache/pip')


In [None]:
# Smoke run (real datasets, minimal steps). Requires network on the first run.
!PYTHONPATH=src python scripts/run_multi_dataset_pipeline.py \
  --config configs/training/pii_large_mix_smoke_gpu.yaml \
  --allow-network \
  --trust-remote-code


In [None]:
import json
from pathlib import Path

report_path = Path('/mnt/data/training/logs/edge_deid/pii-large-zh-gpu-smoke/report.json')
report = json.loads(report_path.read_text(encoding='utf-8'))

print('ONNX model:', report.get('onnx_model'))
print('INT8 model:', report.get('onnx_model_int8'))
print('Benchmark:', report.get('benchmark_onnx_ner', {}))
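
If the smoke run stopped before writing its report, the cell above fails with `FileNotFoundError`. A slightly more defensive variant is sketched below. `load_report` and `summarize` are hypothetical helpers, and the three keys are the ones read in the cell above; the full schema of `report.json` is not assumed.

```python
import json
from pathlib import Path


def load_report(path):
    """Load the pipeline report, or return None if it has not been written yet."""
    path = Path(path)
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))


def summarize(report, keys=("onnx_model", "onnx_model_int8", "benchmark_onnx_ner")):
    """Map each expected key to its value, or '<missing>' when the key is absent."""
    return {key: report.get(key, "<missing>") for key in keys}
```

For example, `load_report('/mnt/data/training/logs/edge_deid/pii-large-zh-gpu-smoke/report.json')` returns `None` when no report exists, and `summarize(report)` makes it obvious which stage (export, quantization, benchmark) did not record a result.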


In [None]:
# Offline re-run (after the first run populated caches and prepared JSONL).
!PYTHONPATH=src python scripts/run_multi_dataset_pipeline.py \
  --config configs/training/pii_large_mix_smoke_gpu.yaml
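
To make an offline re-run fail fast instead of silently reaching the network, the Hugging Face libraries can also be switched to offline mode through environment variables. This is a sketch under assumptions: whether the pipeline script already sets these variables when `--allow-network` is omitted is not known here, and `offline_env` and `run_offline` are hypothetical helpers.

```python
import os
import subprocess
import sys

# Offline switches read by huggingface_hub, datasets, and transformers.
OFFLINE_FLAGS = {
    "HF_HUB_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
}


def offline_env(base=None):
    """Return a copy of the environment with the offline switches enabled."""
    env = dict(os.environ) if base is None else dict(base)
    env.update(OFFLINE_FLAGS)
    return env


def run_offline(config_path):
    """Run the pipeline script without --allow-network, with offline switches set."""
    env = offline_env()
    env["PYTHONPATH"] = "src"  # mirrors the PYTHONPATH=src prefix used in the cells
    cmd = [sys.executable, "scripts/run_multi_dataset_pipeline.py", "--config", config_path]
    return subprocess.run(cmd, env=env, check=True)
```

Calling `run_offline('configs/training/pii_large_mix_smoke_gpu.yaml')` is then equivalent to the cell above, except that any attempted download from the Hugging Face Hub raises an error rather than proceeding.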
