# 🧪 01 · Data Samples — DocScribe

This notebook creates a **tiny evaluation set** and demo assets for DocScribe:

**Outputs**
- `eval/eval_transcripts.jsonl`: 6 synthetic transcripts with gold `diagnosis` / `orders`
- `data/samples_audio/`: zero-byte `.wav` placeholders for UI wiring (not playable audio)
- `prompts/extractor_fewshot.md`: few-shot prompt for the extractor

> Keep these small and synthetic for hackathon demos. You can add more lines anytime.

In [1]:
from pathlib import Path

ROOT = Path("..").resolve()  # assumes the notebook runs from <project>/notebooks/
EVAL_DIR = ROOT / "eval"
AUDIO_DIR = ROOT / "data" / "samples_audio"
PROMPTS_DIR = ROOT / "prompts"

EVAL_DIR.mkdir(parents=True, exist_ok=True)
AUDIO_DIR.mkdir(parents=True, exist_ok=True)
PROMPTS_DIR.mkdir(parents=True, exist_ok=True)

print("Project root:", ROOT)
print("Eval dir:", EVAL_DIR)
print("Audio dir:", AUDIO_DIR)
print("Prompts dir:", PROMPTS_DIR)

Project root: /Users/saturnine/DocScribe
Eval dir: /Users/saturnine/DocScribe/eval
Audio dir: /Users/saturnine/DocScribe/data/samples_audio
Prompts dir: /Users/saturnine/DocScribe/prompts


In [2]:
import json

samples = [
    {
        "text": "Fever and cough for 3 days. Mild shortness of breath. Likely community-acquired pneumonia. Order chest X-ray and start azithromycin 500 mg daily for 5 days. Follow up in 2 days.",
        "diagnosis": ["community-acquired pneumonia"],
        "orders": ["chest X-ray", "azithromycin 500 mg daily x5"]
    },
    {
        "text": "Left ankle pain after inversion injury yesterday. Suspect lateral ankle sprain. X-ray ankle to rule out fracture. RICE and ibuprofen 400 mg as needed.",
        "diagnosis": ["ankle sprain"],
        "orders": ["ankle X-ray", "ibuprofen 400 mg PRN"]
    },
    {
        "text": "Dysuria and urinary frequency for two days. No fever or flank pain. Likely uncomplicated UTI. Urinalysis and nitrofurantoin 100 mg twice daily for five days.",
        "diagnosis": ["urinary tract infection"],
        "orders": ["urinalysis", "nitrofurantoin 100 mg BID x5"]
    },
    {
        "text": "Sore throat, painful swallowing, and fever. Tonsillar exudates on exam. Rapid strep test positive. Start amoxicillin for 10 days.",
        "diagnosis": ["streptococcal pharyngitis"],
        "orders": ["rapid strep test", "amoxicillin 10 days"]
    },
    {
        "text": "New-onset burning chest pain after meals. Worse when lying down. Likely GERD. Start omeprazole daily and recommend lifestyle changes.",
        "diagnosis": ["gastroesophageal reflux disease"],
        "orders": ["omeprazole daily", "lifestyle modification counseling"]
    },
    {
        "text": "Frequent sneezing, itchy eyes, and clear nasal discharge during spring. Consistent with seasonal allergic rhinitis. Start cetirizine and intranasal steroid.",
        "diagnosis": ["seasonal allergic rhinitis"],
        "orders": ["cetirizine", "intranasal corticosteroid"]
    }
]

out_path = EVAL_DIR / "eval_transcripts.jsonl"
with out_path.open("w", encoding="utf-8") as f:
    for ex in samples:
        f.write(json.dumps(ex) + "\n")

print("Wrote:", out_path, "lines:", len(samples))

Wrote: /Users/saturnine/DocScribe/eval/eval_transcripts.jsonl lines: 6


In [3]:
# Read back and show a compact preview
lines = out_path.read_text(encoding="utf-8").strip().splitlines()
print(f"Total records: {len(lines)}")
print("First record:\n", lines[0][:300] + ("..." if len(lines[0]) > 300 else ""))

Total records: 6
First record:
 {"text": "Fever and cough for 3 days. Mild shortness of breath. Likely community-acquired pneumonia. Order chest X-ray and start azithromycin 500 mg daily for 5 days. Follow up in 2 days.", "diagnosis": ["community-acquired pneumonia"], "orders": ["chest X-ray", "azithromycin 500 mg daily x5"]}
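
The preview above only shows the first raw line. To confirm that every record parses and carries the three expected fields, a small loader along these lines could help. This is a minimal sketch: `load_eval_records` is a hypothetical helper, not part of DocScribe, and the schema it checks (`text` as a string, `diagnosis` and `orders` as lists of strings) is inferred from the samples written in cell 2.

```python
import json
from pathlib import Path


def load_eval_records(path):
    """Parse a JSONL file and check each record against the assumed eval schema."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            rec = json.loads(line)
            if not isinstance(rec, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            text = rec.get("text")
            if not isinstance(text, str) or not text.strip():
                raise ValueError(f"line {line_no}: 'text' must be a non-empty string")
            for key in ("diagnosis", "orders"):
                value = rec.get(key)
                if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                    raise ValueError(f"line {line_no}: '{key}' must be a list of strings")
            records.append(rec)
    return records
```

In this notebook, `load_eval_records(out_path)` should return the six records written in cell 2.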


In [4]:
# Create zero-byte .wav files as UI placeholders. They are not valid audio,
# so replace them with real recordings before testing playback or transcription.
for i in range(1, 4):
    p = AUDIO_DIR / f"case_{i}.wav"
    p.touch(exist_ok=True)  # creates the file if missing; existing content is untouched
sorted(p.name for p in AUDIO_DIR.glob("*.wav"))

['case_1.wav', 'case_2.wav', 'case_3.wav']
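
Zero-byte files have no WAV header, so audio players and speech-to-text libraries will generally reject them. If the UI needs files that actually load, one option is to write short silent clips with the standard-library `wave` module. This is a sketch under assumptions: `write_silent_wav` is a hypothetical helper, and the 16 kHz mono 16-bit format is a guess at what a downstream speech model might expect, not something DocScribe specifies.

```python
import wave
from pathlib import Path


def write_silent_wav(path, seconds=1.0, sample_rate=16000):
    """Write a mono, 16-bit PCM WAV file that contains only silence."""
    n_frames = int(seconds * sample_rate)
    with wave.open(str(path), "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)  # assumed rate; adjust to the real pipeline
        wf.writeframes(b"\x00\x00" * n_frames)  # zero amplitude for every frame
    return Path(path)
```

Calling `write_silent_wav(AUDIO_DIR / "case_1.wav")` would replace the placeholder with a one-second silent clip.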

In [5]:
fewshot_text = """You are a clinical documentation assistant. Extract structured fields from the clinician transcript.
Return STRICT JSON with keys: chief_complaint (str), assessment (str), diagnosis (list[str]), orders (list[str]), plan (list[str]), follow_up (str).

### Example A
TRANSCRIPT:
"Fever and cough for three days, mild shortness of breath. Likely community-acquired pneumonia. Order chest X-ray and start azithromycin 500 mg daily for five days. Follow up in two days."
JSON:
{"chief_complaint":"Fever and cough for 3 days","assessment":"Likely pneumonia","diagnosis":["community-acquired pneumonia"],"orders":["chest X-ray","azithromycin 500 mg daily x5"],"plan":["begin antibiotics","symptomatic care"],"follow_up":"2 days"}

### Example B
TRANSCRIPT:
"Left ankle pain after inversion injury yesterday. Suspect lateral ankle sprain. X-ray to rule out fracture. RICE and ibuprofen 400 mg as needed. Return if worsening."
JSON:
{"chief_complaint":"Left ankle pain after inversion injury","assessment":"Likely lateral ankle sprain","diagnosis":["ankle sprain"],"orders":["ankle X-ray","ibuprofen 400 mg PRN"],"plan":["RICE"],"follow_up":"return if worsening"}
"""
prompt_path = PROMPTS_DIR / "extractor_fewshot.md"
prompt_path.write_text(fewshot_text, encoding="utf-8")
print("Few-shot prompt saved to:", prompt_path)

Few-shot prompt saved to: /Users/saturnine/DocScribe/prompts/extractor_fewshot.md
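
The saved prompt ends after Example B, so whatever consumes it still has to append the new transcript and then check that the model's reply really is the strict JSON the prompt asks for. The sketch below shows one way that might look. Both function names and the `### Input` heading are assumptions made for illustration; the six keys and their types are taken from the prompt's own instruction line.

```python
import json

# Keys and types copied from the instruction line of the few-shot prompt.
REQUIRED_KEYS = {
    "chief_complaint": str,
    "assessment": str,
    "diagnosis": list,
    "orders": list,
    "plan": list,
    "follow_up": str,
}


def build_prompt(fewshot, transcript):
    """Append a new transcript to the few-shot text, ending on the JSON: cue."""
    # "### Input" is an assumed heading that mirrors the "### Example" blocks.
    return (
        f"{fewshot.rstrip()}\n\n### Input\nTRANSCRIPT:\n"
        f'"{transcript.strip()}"\nJSON:\n'
    )


def parse_extraction(raw):
    """Parse model output and check it has the keys and types the prompt requests."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"'{key}' is missing or not a {expected.__name__}")
    return data
```

With the variables from this notebook, `build_prompt(fewshot_text, samples[2]["text"])` would produce a prompt for the UTI transcript.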


✅ Data samples created.

**Next →** run `02_extractor_dev.ipynb` to build and test the FLAN-T5 extractor using:
- `prompts/extractor_fewshot.md`
- `eval/eval_transcripts.jsonl`
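
When testing the extractor against the gold `diagnosis` and `orders` lists, one simple option is a set-based F1 over normalized strings. The metric used in `02_extractor_dev.ipynb` is not shown here, so treat this as an illustrative sketch: `normalize` and `set_f1` are hypothetical helpers, and exact matching after lowercasing is a deliberately strict assumption.

```python
def normalize(item):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(item.lower().split())


def set_f1(predicted, gold):
    """F1 between two lists of strings, compared as sets of normalized items."""
    pred = {normalize(x) for x in predicted}
    ref = {normalize(x) for x in gold}
    if not pred and not ref:
        return 1.0  # nothing expected, nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For the first eval record, a prediction of only `["Chest X-ray"]` has precision 1 and recall 1/2 against the two gold orders, giving an F1 of 2/3.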

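If you add more examples later, append lines to `eval_transcripts.jsonl` or add them to `samples` in cell 2. Note that cell 2 opens the file in write mode, so re-running it rewrites the file from `samples` and discards any lines appended by hand.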
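
Appending a record can be done without touching the existing lines. The helper below is a minimal sketch (`append_eval_record` is a hypothetical name) that writes one JSON object per line in the same three-field shape as the samples above. It assumes the file already ends with a newline, which holds for the file this notebook writes.

```python
import json
from pathlib import Path


def append_eval_record(path, text, diagnosis, orders):
    """Append one record to a JSONL eval file, keeping the three-field schema."""
    record = {"text": text, "diagnosis": list(diagnosis), "orders": list(orders)}
    # Mode "a" adds to the end of the file instead of overwriting it.
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```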