# Dataset Quality Report (Span JSONL)

This notebook generates a dataset quality report for a prepared span-JSONL file (`dataset.jsonl`).

Checks included:
- Span bounds and empty spans
- Missing entity types
- Entity text mismatch (when `entity.text` is provided)
- Overlapping spans
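
The checks above can be sketched in a few lines. This is a minimal illustration, not the logic of `scripts/report_dataset_quality.py`: the record layout (a `text` string plus an `entities` list with `start`, `end`, `type`, and optional `text`), the function name `check_record`, and the issue labels are all assumptions made for the example.

```python
from collections import Counter


def check_record(record):
    """Return a list of issue labels for one record (assumed schema, see above)."""
    text = record.get("text", "")
    issues = []
    spans = []
    for ent in record.get("entities", []):
        start, end = ent.get("start"), ent.get("end")
        # Missing entity type: absent key, None, or empty string.
        if not ent.get("type"):
            issues.append("missing_type")
        # Span bounds: offsets must be integers inside the text, with start <= end.
        if (not isinstance(start, int) or not isinstance(end, int)
                or start < 0 or end > len(text) or start > end):
            issues.append("out_of_bounds")
            continue
        # Empty span: valid offsets that select no characters.
        if start == end:
            issues.append("empty_span")
            continue
        # Entity text mismatch: only checked when the entity carries its own text.
        if "text" in ent and ent["text"] != text[start:end]:
            issues.append("text_mismatch")
        spans.append((start, end))
    # Overlaps: sweep spans in start order, tracking the furthest end seen so far.
    max_end = -1
    for start, end in sorted(spans):
        if start < max_end:
            issues.append("overlap")
        max_end = max(max_end, end)
    return issues


# Hypothetical record used only to exercise the sketch.
example = {
    "text": "Alice lives in Paris",
    "entities": [
        {"start": 0, "end": 5, "type": "NAME", "text": "Alice"},
        {"start": 3, "end": 8, "type": "NAME"},
        {"start": 15, "end": 20, "type": "", "text": "Paris"},
    ],
}
print(Counter(check_record(example)))
```

Spans are treated as half-open `[start, end)` character offsets here, so two spans that merely touch (one ends where the next begins) are not counted as overlapping.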

Outputs:
- A JSON report printed to stdout
- Optionally, the report written to the file given by `--json-out`, for use in CI and other automation
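
The output behaviour described above (always print, optionally write a file) might look roughly like the sketch below. Only the flag names `--input-jsonl` and `--json-out` come from this notebook; the helpers `emit_report` and `parse_args` and the shape of the report dictionary are hypothetical and do not reflect the actual script.

```python
import argparse
import json
from pathlib import Path


def emit_report(report, json_out=None):
    """Print the report as JSON to stdout; also write it to json_out when given."""
    payload = json.dumps(report, ensure_ascii=False, indent=2, sort_keys=True)
    print(payload)
    if json_out is not None:
        out_path = Path(json_out)
        # Create the parent directory so a fresh CI workspace does not fail here.
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(payload + "\n", encoding="utf-8")
    return payload


def parse_args(argv=None):
    """Parse the two flags used in this notebook (assumed semantics)."""
    parser = argparse.ArgumentParser(description="Sketch of a quality-report CLI")
    parser.add_argument("--input-jsonl", required=True)
    parser.add_argument("--json-out", default=None)
    return parser.parse_args(argv)


# Hypothetical report content, used only to show the call.
args = parse_args(["--input-jsonl", "dataset.jsonl"])
emit_report({"input": args.input_jsonl, "issues": {"overlap": 0}}, args.json_out)
```

Sorting keys and using a fixed indent keeps the written file stable between runs, which makes diffs in CI easier to read.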


In [None]:
import os
from pathlib import Path

# AI_WAREHOUSE 3.0 cache layout (avoid $HOME/.cache)
os.environ.setdefault('HF_HOME', '/mnt/c/ai_cache/huggingface')
os.environ.setdefault('TRANSFORMERS_CACHE', os.environ['HF_HOME'])
os.environ.setdefault('TORCH_HOME', '/mnt/c/ai_cache/torch')
os.environ.setdefault('XDG_CACHE_HOME', '/mnt/c/ai_cache')
os.environ.setdefault('PIP_CACHE_DIR', '/mnt/c/ai_cache/pip')

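# Create every cache directory up front. TRANSFORMERS_CACHE is included because
# it can differ from HF_HOME when it was already set in the environment.
for key in ('HF_HOME', 'TRANSFORMERS_CACHE', 'TORCH_HOME', 'XDG_CACHE_HOME', 'PIP_CACHE_DIR'):
    Path(os.environ[key]).expanduser().mkdir(parents=True, exist_ok=True)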


In [None]:
# Prepare a deterministic synthetic dataset (offline by default).
# `scripts/prepare_dataset.py` also writes a `quality.json` file next to the dataset.
!PYTHONPATH=src python scripts/prepare_dataset.py \
  --dataset synthetic \
  --language zh \
  --split train \
  --max-examples 500


In [None]:
# Generate a quality report for any prepared span-JSONL dataset
!PYTHONPATH=src python scripts/report_dataset_quality.py \
  --input-jsonl /mnt/data/datasets/edge_deid/processed/synthetic/train/dataset.jsonl \
  --json-out /mnt/data/datasets/edge_deid/processed/synthetic/train/quality.report.json
