# Train a Token-Classification NER Model

This notebook trains a token-classification model on a deterministic synthetic dataset.

Recommended workflow:
1) Prepare a span-JSONL dataset via `scripts/prepare_dataset.py` (see `notebooks/08_prepare_datasets.ipynb`).
2) Train from the prepared JSONL via `scripts/train_token_classifier.py --input-jsonl ...`.

For real datasets (WikiAnn/MSRA/Weibo), use the dataset adapters in `src/deid_pipeline/training/datasets.py` and enable network access explicitly when needed.
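The span-JSONL layout itself is not shown in this notebook. As a rough sketch of how such a record could map to training labels, the example below assumes each line is a JSON object with a `text` string and a list of `spans` carrying `start`, `end`, and `label` character offsets. These field names, the `spans_to_bio` helper, and the sample record are all hypothetical; the real schema is whatever `scripts/prepare_dataset.py` writes.

```python
import json


def spans_to_bio(text, spans):
    """Convert character-offset spans into per-character BIO tags.

    Assumes half-open offsets [start, end) and non-overlapping spans.
    """
    tags = ["O"] * len(text)
    for span in sorted(spans, key=lambda s: s["start"]):
        start, end, label = span["start"], span["end"], span["label"]
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {span}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {span}")
        tags[start] = f"B-{label}"
        for index in range(start + 1, end):
            tags[index] = f"I-{label}"
    return tags


# Hypothetical record; the field names are assumptions, not the repository's schema.
line = (
    '{"text": "王小明住在台北", "spans": ['
    '{"start": 0, "end": 3, "label": "NAME"}, '
    '{"start": 5, "end": 7, "label": "LOCATION"}]}'
)
record = json.loads(line)
bio_tags = spans_to_bio(record["text"], record["spans"])
for char, tag in zip(record["text"], bio_tags):
    print(char, tag)
```

Character-level tags are used here because Chinese text has no whitespace word boundaries; a real training script would still need to align these tags with the tokenizer's output.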

In [None]:
import os
from pathlib import Path

# AI_WAREHOUSE 3.0 cache layout (avoid $HOME/.cache)
os.environ.setdefault('HF_HOME', '/mnt/c/ai_cache/huggingface')
os.environ.setdefault('TRANSFORMERS_CACHE', os.environ['HF_HOME'])
os.environ.setdefault('TORCH_HOME', '/mnt/c/ai_cache/torch')
os.environ.setdefault('XDG_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('PIP_CACHE_DIR', '/mnt/c/ai_cache/pip')

for key in ('HF_HOME', 'TORCH_HOME', 'XDG_CACHE_HOME', 'PIP_CACHE_DIR'):
    Path(os.environ[key]).expanduser().mkdir(parents=True, exist_ok=True)

In [None]:
# Prepare a deterministic synthetic dataset (offline by default).
!PYTHONPATH=src python scripts/prepare_dataset.py \
  --dataset synthetic \
  --language zh \
  --split train \
  --max-examples 200

# Train for one epoch on the prepared JSONL.
!PYTHONPATH=src python scripts/train_token_classifier.py \
  --model-dir /mnt/c/ai_models/detection/edge_deid/bert-ner-zh \
  --output-dir /mnt/data/training/runs/edge_deid/synthetic-ner-demo \
  --language zh \
  --input-jsonl /mnt/data/datasets/edge_deid/processed/synthetic/train/dataset.jsonl \
  --max-length 256 \
  --epochs 1 \
  --batch-size 8
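Before launching training, it can help to confirm that the prepared file parses as JSONL. The sketch below assumes only that the file holds one JSON object per line; `count_jsonl_records` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count JSON-object lines in a JSONL file, raising on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
            if not isinstance(record, dict):
                raise ValueError(f"{path}:{line_number}: expected a JSON object")
            count += 1
    return count
```

Calling it on the `--input-jsonl` path used above, for example `count_jsonl_records('/mnt/data/datasets/edge_deid/processed/synthetic/train/dataset.jsonl')`, should return at most 200 if `--max-examples 200` caps the number of records as its name suggests.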