# GPT-2 (Chinese) Placeholder Fine-Tuning Pipeline

This notebook runs the repeatable dev-only pipeline script: `scripts/run_gpt2_pipeline.py`.

What it does:
- Builds a local JSONL corpus from multiple sources (masked-pair datasets + token NER datasets)
- Canonicalizes placeholder tokens (e.g. `<LASTNAME_1>` → `<NAME>`)
- Fine-tunes a local GPT-2 style model on the placeholder corpus
- Copies the final model into `/mnt/c/ai_models/llm/edge_deid/<run_slug>/`
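
The canonicalization step listed above can be illustrated with a minimal sketch. This is not the code in `scripts/run_gpt2_pipeline.py`: the mapping table, the regular expression, and the function name `canonicalize_placeholders` are assumptions made for illustration. Only the `<LASTNAME_1>` → `<NAME>` example comes from this notebook.

```python
import re

# Hypothetical mapping from fine-grained placeholder types to canonical ones.
# Only LASTNAME -> NAME is taken from the notebook; FIRSTNAME is an assumption.
CANONICAL_TYPES = {
    "LASTNAME": "NAME",
    "FIRSTNAME": "NAME",
}

# Matches tokens such as <LASTNAME_1>, <DATE_12>, or <NAME>.
# Group 1 is the type name without the optional numeric suffix.
_PLACEHOLDER_RE = re.compile(r"<([A-Z][A-Z_]*?)(?:_\d+)?>")


def canonicalize_placeholders(text: str) -> str:
    """Drop numeric suffixes and map placeholder types to canonical names."""

    def _replace(match: re.Match) -> str:
        kind = match.group(1)
        # Types missing from the table keep their name (assumed behavior).
        return f"<{CANONICAL_TYPES.get(kind, kind)}>"

    return _PLACEHOLDER_RE.sub(_replace, text)


print(canonicalize_placeholders("患者<LASTNAME_1>于<DATE_2>入院"))
```

In this sketch, types missing from the table lose their numeric suffix but keep their name; the real script may handle them differently.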

Notes:
- This is **dev-only**. Network access must be explicitly enabled for the first run, which downloads the base model and datasets.
- The runtime de-identification pipeline remains local-only and does not depend on GPT-2.


In [None]:
import os
from pathlib import Path

# AI_WAREHOUSE 3.0 cache layout (avoid $HOME/.cache)
os.environ.setdefault('EDGE_DEID_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('EDGE_DEID_MODELS_HOME', '/mnt/c/ai_models')
os.environ.setdefault('EDGE_DEID_DATA_HOME', '/mnt/data')

os.environ.setdefault('HF_HOME', '/mnt/c/ai_cache/huggingface')
# TRANSFORMERS_CACHE is a legacy variable; it is set here for older transformers releases.
os.environ.setdefault('TRANSFORMERS_CACHE', os.environ['HF_HOME'])
os.environ.setdefault('TORCH_HOME', '/mnt/c/ai_cache/torch')
os.environ.setdefault('XDG_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('PIP_CACHE_DIR', '/mnt/c/ai_cache/pip')

for key in ('HF_HOME', 'TORCH_HOME', 'XDG_CACHE_HOME', 'PIP_CACHE_DIR'):
    Path(os.environ[key]).expanduser().mkdir(parents=True, exist_ok=True)


In [None]:
# Smoke run (downloads base model + datasets on first run).
!PYTHONPATH=src python scripts/run_gpt2_pipeline.py \
  --config configs/training/gpt2_zh_placeholder_smoke.yaml


In [None]:
import json
from pathlib import Path

# Report for the smoke config; this raises FileNotFoundError if the file is missing.
report_path = Path('/mnt/data/training/logs/edge_deid/gpt2-zh-placeholder-smoke/report.json')
report = json.loads(report_path.read_text(encoding='utf-8'))

print('Corpus:', report.get('corpus_jsonl'))
print('Training output:', report.get('training_output_dir'))
print('Models output:', report.get('models_output_dir'))
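
As a quick sanity check on the corpus listed in the report, the first few JSONL records can be inspected. The helper below is a minimal sketch: `peek_jsonl` is a hypothetical name, and it assumes only that each non-empty line of the corpus file is one JSON value. The record schema is not documented in this notebook.

```python
import json
from pathlib import Path


def peek_jsonl(path, limit=3):
    """Return up to `limit` parsed records from the start of a JSONL file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if len(records) >= limit:
                break
            line = line.strip()
            if not line:
                # Skip blank lines instead of failing on them.
                continue
            records.append(json.loads(line))
    return records
```

Assuming `report['corpus_jsonl']` holds a readable file path, `peek_jsonl(report['corpus_jsonl'])` would return the first three records for inspection.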
