# ASQ-PHI: Synthetic Clinical Query Generation Pipeline

This notebook implements the generation pipeline for **ASQ-PHI** (Adversarial Synthetic Queries for Protected Health Information de-identification), a benchmark dataset of 1,051 fully synthetic clinical search queries designed to stress-test HIPAA-compliant de-identification software.

**The problem this solves:** Large language models (LLMs) deployed under HIPAA Business Associate Agreements (BAAs) can legally process PHI, but any query sent to external tools (web search APIs, MCP servers, medical databases) must be de-identified first. Current de-identification systems were trained on EHR notes rather than search queries, so when applied to search queries they both leak PHI and over-redact clinically essential information.

**What this notebook does:** Generates synthetic clinical queries with ground truth PHI annotations to test whether de-identification systems can remove identifiers while preserving clinical utility.

All content is 100% synthetic. No real patient data were used.

---

## Requirements

- Python 3.12 or newer  
- Azure OpenAI Service (GPT-4o deployment recommended)  
- Dependencies listed in `requirements.txt`  
- Environment variables set via `code/.env` (see `code/.env.example`):  
  - `AZURE_OPENAI_API_KEY_4o`  
  - `AZURE_OPENAI_ENDPOINT_4o`  
  - `AZURE_OPENAI_DEPLOYMENT_4o`  
  - `OUTPUT_PATH` (optional, defaults to `synthetic_clinical_queries.txt`)
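
Before running the pipeline, it can help to confirm that the required variables are actually visible to the kernel. The following is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, and it assumes the values from `code/.env` have already been loaded into the process environment (for example by exporting them in the shell, or with a loader such as python-dotenv if your `requirements.txt` includes it).

```python
import os

# Names taken from the list above; OUTPUT_PATH is optional, so it is not checked.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY_4o",
    "AZURE_OPENAI_ENDPOINT_4o",
    "AZURE_OPENAI_DEPLOYMENT_4o",
)


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running this check first surfaces a missing key as a readable message rather than as an authentication error in the middle of generation.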

---

## How to Use This Notebook

1. **Set up configuration**  
   Create `code/.env` from `code/.env.example` and fill in your Azure OpenAI credentials.

2. **Run the pipeline**  
   Execute cells from top to bottom. The notebook will:
   - Initialize the Azure OpenAI client  
   - Generate synthetic queries with `generate_phi_queries(n=1051)`  
   - Append output to the file named by `OUTPUT_PATH` (default `synthetic_clinical_queries.txt`), with one JSON object per line for each PHI item and an empty `===PHI_TAGS===` section for hard negatives:

     ```
     ===QUERY===
     [query text]
     ===PHI_TAGS===
     {"identifier_type": "TYPE", "value": "..."}
     ```
   - Validate dataset quality with `validate_dataset()` (hard negative ratio, mean PHI per query, malformed entries)

3. **Review outputs**  
   Check `synthetic_clinical_queries.txt` for the generated dataset, and the cell output for the validation summary statistics (the report is printed, not written to the file).
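
To work with the output file downstream, the block format described above can be parsed into plain Python records. The sketch below is illustrative rather than part of the notebook: `parse_entries` is a hypothetical helper, the sample text is shortened from the prompt's own examples, and tag lines that are not valid JSON are simply skipped.

```python
import json


def parse_entries(text):
    """Parse ===QUERY=== / ===PHI_TAGS=== blocks into a list of dicts (illustrative sketch)."""
    entries = []
    for block in text.split("===QUERY===")[1:]:
        if "===PHI_TAGS===" not in block:
            continue  # skip blocks without a tag section
        query_text, tag_section = block.split("===PHI_TAGS===", 1)
        tags = []
        for line in tag_section.splitlines():
            line = line.strip()
            if not line.startswith("{"):
                continue  # blank lines or stray text
            try:
                tags.append(json.loads(line))
            except json.JSONDecodeError:
                pass  # ignore tag lines that are not valid JSON
        entries.append({"query": query_text.strip(), "tags": tags})
    return entries


sample = """===QUERY===
Latest guidelines for HER2+ breast cancer in a 45yo female, Jane Doe (MRN: JH-876543)?
===PHI_TAGS===
{ "identifier_type": "NAME", "value": "Jane Doe" }
{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "JH-876543" }

===QUERY===
Management of a 78-year-old male with a high Wells score?
===PHI_TAGS===
"""

for entry in parse_entries(sample):
    print(len(entry["tags"]), entry["query"])
```

An entry with an empty `tags` list corresponds to a hard negative.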

---

## Customizing for Domain-Specific Variants

ASQ-PHI is designed to be adapted for any clinical workflow where queries contain PHI and require external tools. To create a domain-focused variant (oncology, cardiology, pediatrics, etc.):

### Adaptation Protocol

1. **Edit the few-shot examples**  
   Scroll to the `PHI_QUERY_SYSTEM_PROMPT` cell and locate section **"7. Correctly Formatted Examples"**.  
   Replace the three example queries with cases reflecting your target domain.

2. **Customize by:**
   - **Specialty**: Tumor staging queries, cardiac imaging requests, neonatal protocols  
   - **Setting**: Inpatient wards, outpatient clinics, telemedicine, home health  
   - **PHI distribution**: Adjust identifier types or densities for specific use cases

3. **Regenerate and validate**  
   Rerun `generate_phi_queries(n=...)` and `validate_dataset()`. The output will follow the same format but reflect your specialized clinical context.
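
If you prefer to swap the examples programmatically instead of editing the prompt cell by hand, one possible approach is sketched below. It is an assumption-laden illustration, not the notebook's method: `format_example` and `replace_examples` are hypothetical helpers, the oncology example is invented, and the code assumes the examples section is the last section of the prompt so that everything after its heading can be replaced.

```python
import json

# Heading text copied from the system prompt; everything after it is treated as examples.
EXAMPLES_HEADING = "**7. Correctly Formatted Examples**"


def format_example(title, query, tags):
    """Render one few-shot example in the ===QUERY=== / ===PHI_TAGS=== layout."""
    lines = [f"*{title}*", "===QUERY===", query, "===PHI_TAGS==="]
    lines += [json.dumps(tag) for tag in tags]  # one JSON object per line
    return "\n".join(lines)


def replace_examples(prompt, examples):
    """Return a copy of the prompt with the examples section replaced."""
    head, sep, _old_examples = prompt.partition(EXAMPLES_HEADING)
    if not sep:
        raise ValueError("examples heading not found in prompt")
    body = "\n\n".join(format_example(*example) for example in examples)
    return f"{head}{sep}\n\n{body}\n"


# Made-up oncology example; the name and MRN are invented placeholders.
oncology_examples = [
    (
        "Example 1: Core PHI Query (Oncology)",
        "Adjuvant FOLFOX vs. CAPOX for stage III colon cancer in a 61yo male, "
        "Sam Rivera (MRN: ONC-204518)?",
        [
            {"identifier_type": "NAME", "value": "Sam Rivera"},
            {"identifier_type": "MEDICAL_RECORD_NUMBER", "value": "ONC-204518"},
        ],
    ),
]

# Stand-in prompt so the sketch runs on its own; in the notebook you would
# pass PHI_QUERY_SYSTEM_PROMPT instead.
demo_prompt = "...instructions...\n\n---\n**7. Correctly Formatted Examples**\n\n*old example*\n"
print(replace_examples(demo_prompt, oncology_examples))
```

Note that `json.dumps` writes tags without the padding spaces inside the braces that the original examples use; both forms are valid JSON and start with `{`, which is what the validation step looks for.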

### Example Use Cases

- **Oncology variant**: Queries about chemotherapy regimens, tumor markers, staging systems  
- **Pediatrics variant**: Growth chart lookups, vaccination schedules, developmental milestones  
- **Emergency medicine variant**: Triage protocols, toxicology databases, trauma guidelines  
- **Telemedicine variant**: Remote consultation queries, home monitoring data, virtual care workflows

The result is a new benchmark with identical structure and validation checks but optimized for your specific HIPAA-compliant tool integration scenario.

In [None]:
!pip install -r requirements.txt

In [None]:
import os
from datetime import datetime
from openai import AzureOpenAI
from tqdm import tqdm

In [None]:
AZURE_OPENAI_DEPLOYMENT = os.getenv("AZURE_OPENAI_DEPLOYMENT_4o")

OUTPUT_PATH = os.getenv("OUTPUT_PATH", "synthetic_clinical_queries.txt")

TEMPERATURE = 0.9
BATCH_SIZE = 5

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY_4o"),
    api_version="2023-07-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT_4o"),
)

In [None]:
PHI_QUERY_SYSTEM_PROMPT = r"""

You are a clinical data simulation assistant. Your mission is to generate a high quality benchmark dataset of synthetic clinical search queries. This dataset is designed for the rigorous training and evaluation of automated de-identification pipelines, with a strict focus on HIPAA Safe Harbor standards.

The queries must represent conversational interactions where physicians use large language models (LLMs) deployed under a HIPAA business associate agreement (BAA) to ask patient specific questions that require evidence based guidance. In this emerging use case, queries naturally contain PHI because physicians are in the patient's context. The primary objective is to test a system's ability to accurately distinguish and remove specific, legally defined PHI identifiers from clinically rich, unstructured text. The dataset must include a balanced mix of straightforward PHI cases, challenging ambiguous queries, and "hard negatives" to robustly measure both detection accuracy and over-redaction risk.

The output must be programmatically parsable and adhere to the specified format without deviation.

---
**1. Query Style and Intent: Information-Seeking vs. Record-Retrieval**

Queries must seek general medical knowledge (e.g., guidelines, evidence, contraindications) using a patient's context, not request a lookup of a patient's specific record.

* **Vary your phrasing.** Do not overuse phrases like "similar to my patient" or "a case like." Frame the queries more directly.

* **GENERATE (Information-Seeking Style):** Queries that seek general medical knowledge.
    * *Correct Example:* "Current guidelines for managing Type 2 Diabetes in a 68-year-old male with a history of CKD, John Doe (MRN: 554-32-11)?"
    * *Correct Example:* "Contraindications for prescribing Paxlovid to a patient on statins like Sarah M., seen at our Boston clinic on May 1st, 2024?"

* **AVOID (Record-Retrieval Style):** Queries that ask the system to look up specific data.
    * *Incorrect Example:* "Pull the lab results for John Doe (MRN: 554-32-11)."
    * *Incorrect Example:* "What was Sarah M.'s discharge summary from May 1st, 2024?"

---
**2. Output Structure**

Each entry must follow this exact format:

===QUERY===
<A single clinical search query.>
===PHI_TAGS===
<A JSON object on a new line for each piece of PHI present in the query. This section must be empty for hard negative queries.>

* The text within the `===QUERY===` block MUST be a single, continuous line of text. DO NOT use line breaks within the query.
* Each `PHI_TAGS` JSON object MUST be a complete, valid JSON object on a single line.

The JSON tag must have this structure:

`{ "identifier_type": "TYPE", "value": "..." }`

---
**3. PHI To Be Tagged (Strictly Limited to HIPAA Safe Harbor Identifiers)**

You must ONLY tag the following types of information. If an item is not on this list, do not tag it.

* **NAME**: Full or last names of individuals (patients, relatives, providers).
* **GEOGRAPHIC_LOCATION**: All geographic subdivisions smaller than a state, including street address, city, county, precinct, and ZIP code.
* **DATE**: All elements of dates (except year) directly related to an individual (e.g., birth date, admission date).
* **PHONE_NUMBER**: All telephone numbers.
* **FAX_NUMBER**: All fax numbers.
* **EMAIL_ADDRESS**: All email addresses.
* **SOCIAL_SECURITY_NUMBER**: All Social Security numbers.
* **MEDICAL_RECORD_NUMBER**: All medical record numbers.
* **HEALTH_PLAN_BENEFICIARY_NUMBER**: All health plan or insurance policy numbers.
* **ACCOUNT_NUMBER**: All financial or other account numbers.
* **CERTIFICATE_LICENSE_NUMBER**: All certificate or license numbers.
* **VEHICLE_IDENTIFIER**: All vehicle identifiers and serial numbers, including license plates.
* **DEVICE_IDENTIFIER**: All medical device identifiers and serial numbers.
* **URL**: All Web Universal Resource Locators.
* **IP_ADDRESS**: All Internet Protocol addresses.
* **BIOMETRIC_IDENTIFIER**: References to biometric data (e.g., fingerprints, retinal scans).
* **FULL_FACE_PHOTO**: References to full-face photographic images.
* **UNIQUE_IDENTIFIER**: Any other unique identifying number, characteristic, or code (e.g., clinical trial number).

---
**4. Information to INCLUDE in the Query and How to Treat It as PHI**

This information provides essential clinical context and must be included in queries. Except for clearly patient linked relative dates as noted below, these items should not be tagged as PHI.

* **Demographics**: Age, gender, race, or ethnicity (e.g., "a 72-year-old Black female").
* **Clinical Status**: Diseases, symptoms, diagnoses, treatments, or medications (e.g., "diagnosed with Alzheimer's disease," "history of stroke").
* **Eponyms, Scores, and Drug Names**: Non-PHI proper nouns like disease names (Parkinson's disease), medical scores (Wells score), clinical trials (GUSTO trial), or drug brand names (Lipitor).
* **Years**: Standalone years are permissible and should not be tagged (e.g., "diagnosed in 2022").

* **Relative Dates**: Phrases such as "last week", "last month", or "last Friday" can be used to describe care timelines. When they clearly refer to an individual patient's care, tag them as DATE. If a relative phrase is vague and not clearly tied to a specific individual, you may leave it untagged.

---
**5. Dataset Generation Requirements**

* **Dataset Composition**: Generate a mix that includes:
    * **Core PHI queries**: Queries that contain at least TWO distinct PHI identifiers from Section 3.
    * **Hard negative and ambiguous queries**: Queries that resemble PHI containing queries in structure and wording but contain no PHI. See Section 6.
    * **Complex syntax queries**: Queries that incorporate shorthand, typos, and varied phrasing.
* **Clinical Intent Variety**: Vary the clinical intent to include:
    * Prognostic questions ("What is the 5 year survival rate for...").
    * Drug-specific queries ("Contraindications for metformin in patients with...").
    * Guideline-seeking ("Latest ACC/AHA guidelines for...").
    * Comparative questions ("Effectiveness of apixaban vs. warfarin for...").
    * Lab interpretation ("Differential diagnosis for elevated AST/ALT in...").
* **Syntactic and Linguistic Variety**: Employ a mix of full natural language questions, clinical shorthand (e.g., "pt", "w/", "hx of", "dx'd", "s/p", "yo" for "year-old"), elliptical phrasing ("post-op fever day 3"), and occasional typographical errors ("side efects of amlodipine"). Use the examples as stylistic references only.


---
**6. Hard Negative and Ambiguous Case Generation**

To robustly test for over-redaction (false positives), a substantial subset of the queries must be "hard negatives." These queries must contain information that could be confused with PHI but is not, and therefore must have an empty `===PHI_TAGS===` section.

* **Use Eponyms and Scores**: Include non-PHI proper nouns (e.g., "Parkinson's disease", "Chaddock reflex", "Wells score").
* **Use Ambiguous Numbers**: Include dosages, lab values, or non-specific identifiers that are not PHI (e.g., "bilirubin level of 12", "notes from bay 3", "protocol for room 12", "trial ID 1999-05").
* **Use Vague Identifiers**: Include phrases that are close to but not explicit PHI (e.g., "patient from the coastal region," "notes from last summer," "his brother, an engineer").

---
**7. Correctly Formatted Examples**

*Example 1: Core PHI Query (Direct Style)*
===QUERY===
Latest NCCN guidelines for HER2+ breast cancer in a 45yo female, Jane Doe, whose case was reviewed at Mercy Hospital on June 5th, 2024 (MRN: JH-876543)?
===PHI_TAGS===
{ "identifier_type": "NAME", "value": "Jane Doe" }
{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mercy Hospital" }
{ "identifier_type": "DATE", "value": "June 5th, 2024" }
{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "JH-876543" }

*Example 2: Hard Negative Query*
===QUERY===
What is the recommended management for a 78-year-old male with a high Wells score and a history of Parkinson's disease diagnosed in 2022?
===PHI_TAGS===

*Example 3: Complex Syntax and "Messy" Query*
===QUERY===
rec. tx for pt w/ hx of T2DM & CKD, see notes for Mark J., seen at Mayo on Mar 3rd 2023. worried about side efects of metformin. his insurance is WX-123456.
===PHI_TAGS===
{ "identifier_type": "NAME", "value": "Mark J." }
{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mayo" }
{ "identifier_type": "DATE", "value": "Mar 3rd 2023" }
{ "identifier_type": "HEALTH_PLAN_BENEFICIARY_NUMBER", "value": "WX-123456" }
"""

In [None]:
def generate_phi_queries(n=50, out_path=None):
    """Generate about `n` synthetic queries in batches and append them to the output file.

    Raw model output is appended as-is, so the final count can differ from `n`.
    """
    out_path = out_path or OUTPUT_PATH
    os.makedirs(os.path.dirname(out_path) or ".", exist_ok=True)

    # Ceiling division: the last batch may be smaller than BATCH_SIZE.
    num_batches = (n + BATCH_SIZE - 1) // BATCH_SIZE
    failed_batches = 0

    for batch_num in tqdm(range(num_batches), desc="Generating batches"):
        # Request only the remainder in the final batch so the total does not exceed n.
        batch_count = min(BATCH_SIZE, n - batch_num * BATCH_SIZE)
        try:
            resp = client.chat.completions.create(
                model=AZURE_OPENAI_DEPLOYMENT,
                messages=[
                    {"role": "system", "content": PHI_QUERY_SYSTEM_PROMPT},
                    {
                        "role": "user",
                        "content": f"Generate {batch_count} new, unique clinical queries with structured PHI_TAGS as specified in the system prompt.",
                    },
                ],
                temperature=TEMPERATURE,
                max_tokens=2500,
            )

            generated_text = resp.choices[0].message.content

            if generated_text and "===QUERY===" in generated_text:
                with open(out_path, "a", encoding="utf-8") as f:
                    f.write(generated_text.strip() + "\n\n")
            else:
                print(f"\nBatch {batch_num + 1}: malformed output, skipping")
                failed_batches += 1

        except Exception as e:
            with open("generation_errors.log", "a", encoding="utf-8") as f:
                f.write(f"[{datetime.now()}] Batch {batch_num + 1}: {e}\n")
            print(f"\nBatch {batch_num + 1}: error (see generation_errors.log)")
            failed_batches += 1

    print(f"\nComplete: {num_batches - failed_batches}/{num_batches} successful")
    print(f"Output: {out_path}")

    if os.path.exists(out_path):
        validate_dataset(out_path)
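
The user message asks for "new, unique" queries, but each batch is an independent API call with no memory of earlier batches, so duplicates across batches are possible. A post-processing pass such as the sketch below could remove them. This is not part of the original pipeline: `normalize_query` and `drop_duplicate_blocks` are hypothetical helpers, and treating queries as duplicates when they match after lowercasing and whitespace collapsing is an assumption you may want to loosen or tighten.

```python
import re


def normalize_query(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_duplicate_blocks(content):
    """Keep the first occurrence of each query block, judged by normalized query text."""
    seen = set()
    kept = []
    for block in content.split("===QUERY===")[1:]:
        query_text = block.split("===PHI_TAGS===", 1)[0]
        key = normalize_query(query_text)
        if not key or key in seen:
            continue  # empty query or already seen
        seen.add(key)
        kept.append("===QUERY===\n" + block.strip() + "\n")
    return "\n".join(kept)


sample = (
    "===QUERY===\nDosing of apixaban in CKD stage 4?\n===PHI_TAGS===\n\n"
    "===QUERY===\ndosing of  apixaban in CKD stage 4?\n===PHI_TAGS===\n\n"
    "===QUERY===\nWells score cutoff for PE workup?\n===PHI_TAGS===\n"
)
deduped = drop_duplicate_blocks(sample)
print(deduped.count("===QUERY==="), "unique blocks kept")
```

The function works on the file contents as a string, so you would read the output file, pass the text through it, and write the result to a new file before validating.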

In [None]:
def validate_dataset(filepath):
    """Validate dataset quality metrics."""
    if not os.path.exists(filepath):
        print(f"Error: {filepath} not found")
        return None

    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()

    queries = content.split("===QUERY===")[1:]

    valid_queries = 0
    queries_with_phi = 0
    hard_negatives = 0
    total_phi_elements = 0
    malformed_queries = 0

    for block in queries:
        if "===PHI_TAGS===" not in block:
            malformed_queries += 1
            continue

        query_text, phi_section = block.split("===PHI_TAGS===", 1)

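        if not query_text.strip():
            # An empty query body is unusable; count it as malformed instead of dropping it silently.
            malformed_queries += 1
            continue

        valid_queries += 1
        # Tag lines are the non-blank lines that start with "{".
        phi_lines = [
            line
            for line in phi_section.split("\n")
            if line.strip().startswith("{")
        ]

        if phi_lines:
            queries_with_phi += 1
            total_phi_elements += len(phi_lines)
        else:
            hard_negatives += 1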

    phi_pct = (queries_with_phi / valid_queries * 100) if valid_queries else 0
    hard_neg_pct = (hard_negatives / valid_queries * 100) if valid_queries else 0
    mean_phi = (total_phi_elements / valid_queries) if valid_queries else 0

    print("\nVALIDATION REPORT")
    print(f"Valid queries: {valid_queries}")
    print(f"With PHI: {queries_with_phi} ({phi_pct:.1f}%)")
    print(f"Hard negatives: {hard_negatives} ({hard_neg_pct:.1f}%)")
    print(f"Total PHI: {total_phi_elements}")
    print(f"Mean PHI/query: {mean_phi:.2f}")
    if malformed_queries:
        print(f"Malformed: {malformed_queries}")

    checks = []
    checks.append(("Hard negative ratio (15-35%)", 15 <= hard_neg_pct <= 35))
    checks.append(("Mean PHI/query (1.5-5)", 1.5 <= mean_phi <= 5))
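    # Express the malformed rate as a share of all blocks seen, not just the valid ones.
    total_blocks = valid_queries + malformed_queries
    malformed_rate = (malformed_queries / total_blocks) if total_blocks else 0
    checks.append(("Malformed rate (<5%)", malformed_rate < 0.05))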

    print("\nQUALITY CHECKS")
    for desc, passed in checks:
        print(f"{'✓' if passed else '✗'} {desc}")

    if all(p for _, p in checks):
        print("\nAll checks passed")

    return {
        "valid_queries": valid_queries,
        "queries_with_phi": queries_with_phi,
        "hard_negatives": hard_negatives,
        "total_phi_elements": total_phi_elements,
        "mean_phi_per_query": mean_phi,
        "malformed_queries": malformed_queries,
    }
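
`validate_dataset` counts tag lines but does not inspect their contents. A stricter per-entry check might confirm that each tag is valid JSON, uses one of the 18 identifier types from section 3 of the prompt, and has a value that appears verbatim in the query. The sketch below is one way to do that: `check_tags` is a hypothetical helper, and the verbatim-match rule is an assumption drawn from the prompt's examples, not a requirement the prompt states.

```python
import json

# The 18 identifier types listed in section 3 of the system prompt.
ALLOWED_TYPES = {
    "NAME", "GEOGRAPHIC_LOCATION", "DATE", "PHONE_NUMBER", "FAX_NUMBER",
    "EMAIL_ADDRESS", "SOCIAL_SECURITY_NUMBER", "MEDICAL_RECORD_NUMBER",
    "HEALTH_PLAN_BENEFICIARY_NUMBER", "ACCOUNT_NUMBER",
    "CERTIFICATE_LICENSE_NUMBER", "VEHICLE_IDENTIFIER", "DEVICE_IDENTIFIER",
    "URL", "IP_ADDRESS", "BIOMETRIC_IDENTIFIER", "FULL_FACE_PHOTO",
    "UNIQUE_IDENTIFIER",
}


def check_tags(query, tag_lines):
    """Return a list of human-readable problems for one entry (illustrative sketch)."""
    problems = []
    for line in tag_lines:
        try:
            tag = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"invalid JSON: {line!r}")
            continue
        if not isinstance(tag, dict):
            problems.append(f"tag is not a JSON object: {line!r}")
            continue
        kind, value = tag.get("identifier_type"), tag.get("value")
        if kind not in ALLOWED_TYPES:
            problems.append(f"unknown identifier_type: {kind!r}")
        if not isinstance(value, str) or value not in query:
            problems.append(f"value not found verbatim in query: {value!r}")
    return problems


# Example 3 from the system prompt, used here as a known-good entry.
query = (
    "rec. tx for pt w/ hx of T2DM & CKD, see notes for Mark J., seen at Mayo on "
    "Mar 3rd 2023. worried about side efects of metformin. his insurance is WX-123456."
)
good_tags = [
    '{ "identifier_type": "NAME", "value": "Mark J." }',
    '{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mayo" }',
    '{ "identifier_type": "DATE", "value": "Mar 3rd 2023" }',
    '{ "identifier_type": "HEALTH_PLAN_BENEFICIARY_NUMBER", "value": "WX-123456" }',
]
bad_tags = ['{ "identifier_type": "AGE", "value": "45yo" }', "not json"]

print(check_tags(query, good_tags))
print(check_tags(query, bad_tags))
```

Entries that produce a non-empty problem list are candidates for manual review or regeneration.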

In [None]:
# Generate 1000 queries (appends to existing file)
# generate_phi_queries(n=1000)

In [None]:
# Validate existing dataset
# validate_dataset('./synthetic_clinical_queries.txt')
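
Once the dataset exists, the ground truth tags can be used to score a de-identification system on both failure modes described in the introduction: PHI leakage and over-redaction. The sketch below is a rough illustration under several assumptions. Entries are assumed to be dicts with `query` and `tags` keys (a hypothetical in-memory representation), `score_deidentifier` and `toy_redactor` are made-up names, leakage is approximated by substring matching, and any change to a hard negative counts as over-redaction. A real evaluation would likely need span-level matching.

```python
import re


def score_deidentifier(entries, deidentify):
    """Count leaked tag values and altered hard negatives (illustrative metrics only)."""
    leaked = total_tags = altered_negatives = total_negatives = 0
    for entry in entries:
        output = deidentify(entry["query"])
        if entry["tags"]:
            for tag in entry["tags"]:
                total_tags += 1
                if tag["value"] in output:
                    leaked += 1  # the tagged PHI value survived de-identification
        else:
            total_negatives += 1
            if output != entry["query"]:
                altered_negatives += 1  # a PHI-free query was modified
    return {
        "leak_rate": leaked / total_tags if total_tags else 0.0,
        "hard_negative_alteration_rate": (
            altered_negatives / total_negatives if total_negatives else 0.0
        ),
    }


def toy_redactor(text):
    """Deliberately naive stand-in: masks every capitalized word followed by lowercase letters."""
    return re.sub(r"\b[A-Z][a-z]+\b", "[X]", text)


entries = [
    {
        "query": "Guidelines for HER2+ breast cancer in Jane Doe (MRN: JH-876543)?",
        "tags": [
            {"identifier_type": "NAME", "value": "Jane Doe"},
            {"identifier_type": "MEDICAL_RECORD_NUMBER", "value": "JH-876543"},
        ],
    },
    {
        "query": "Management of a 78-year-old male with a high Wells score?",
        "tags": [],
    },
]

print(score_deidentifier(entries, toy_redactor))
```

The toy redactor removes the name but misses the record number, and it also masks the eponym in the hard negative, which is the kind of trade-off the benchmark is meant to expose.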