# Synthetic Clinical Query Generation

Customizable pipeline for generating synthetic clinical queries with ground-truth PHI tags using Azure OpenAI GPT-4o.

## Purpose
Produces a benchmark dataset for evaluating de-identification systems against the HIPAA Safe Harbor standard in AI-assisted clinical workflows.

## Features
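- Few-shot driven: customize output by editing the example queries in the system prompt
- Append-only: each run appends to the output file and preserves existing data
- Validated: built-in checks on hard-negative ratio, mean PHI tags per query, and malformed-entry rate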

## Requirements
- Python 3.12+
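- Azure OpenAI credentials, read from the environment variables `AZURE_OPENAI_API_KEY_4o`, `AZURE_OPENAI_ENDPOINT_4o`, and `AZURE_OPENAI_DEPLOYMENT_4o`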
- Dependencies in `requirements.txt`
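
The credentials are read with `os.getenv`, which returns `None` for unset variables instead of raising. A minimal sketch of a pre-flight check follows; the helper name `missing_env_vars` is hypothetical and not part of the notebook, while the variable names match those read in the configuration cell.

```python
import os

# Environment variable names read by the configuration cell of this notebook
REQUIRED_ENV_VARS = (
    "AZURE_OPENAI_API_KEY_4o",
    "AZURE_OPENAI_ENDPOINT_4o",
    "AZURE_OPENAI_DEPLOYMENT_4o",
)


def missing_env_vars(names=REQUIRED_ENV_VARS, env=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Azure OpenAI environment variables are set")
```

Running this before creating the client makes a missing credential show up as a named variable rather than as an authentication error later on.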

In [None]:
# Install dependencies
!pip install -r requirements.txt

In [None]:
import os
import random
import configparser
from datetime import datetime
from openai import AzureOpenAI
from tqdm import tqdm

random.seed(42)

In [None]:
config_path = os.getenv(
    'AZURE_CONFIG_PATH',
    './config.ini'
)

AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY_4o')
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT_4o')
AZURE_OPENAI_DEPLOYMENT = os.getenv('AZURE_OPENAI_DEPLOYMENT_4o')

# Default to a path relative to the working directory; override with OUTPUT_PATH
OUTPUT_PATH = os.getenv(
    'OUTPUT_PATH',
    './synthetic_dataset.txt'
)

API_VERSION = "2023-07-01-preview"
TEMPERATURE = 0.9
BATCH_SIZE = 5

client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)

print(f"Model: GPT-4o ({AZURE_OPENAI_DEPLOYMENT})")
print(f"Temperature: {TEMPERATURE} | Batch Size: {BATCH_SIZE}")
print(f"Output: {OUTPUT_PATH}")

In [None]:
phi_query_system_prompt = r"""
### System Prompt for Generation of a Methodologically Robust Synthetic Clinical Query Dataset

You are a clinical data simulation assistant. Your mission is to generate a high-quality, benchmark dataset of 500 synthetic clinical search queries. This dataset is designed for the rigorous training and evaluation of automated de-identification pipelines, with a strict focus on HIPAA Safe Harbor standards.

The queries must represent conversational interactions where physicians ask AI assistants patient-specific questions requiring evidence-based guidance. In this emerging use case, queries naturally contain PHI because physicians are in the patient's context. The primary objective is to test a system's ability to accurately distinguish and remove specific, legally defined PHI identifiers from clinically rich, unstructured text. The dataset must include a balanced mix of straightforward PHI cases, challenging ambiguous queries, and "hard negatives" to robustly measure both detection accuracy and over-redaction risk.

The output must be programmatically parsable and adhere to the specified format without deviation.

---
**1. Query Style and Intent: Information-Seeking vs. Record-Retrieval**

This is the most critical instruction. Queries must seek general medical knowledge (e.g., guidelines, evidence, contraindications) using a patient's context, not request a lookup of a patient's specific record.

* **Vary your phrasing.** Do not overuse phrases like "similar to my patient" or "a case like." Frame the queries more directly.

* **GENERATE (Information-Seeking Style):** Queries that seek general medical knowledge.
    * *Correct Example:* "Current guidelines for managing Type 2 Diabetes in a 68-year-old male with a history of CKD, John Doe (MRN: 554-32-11)?"
    * *Correct Example:* "Contraindications for prescribing Paxlovid to a patient on statins like Sarah M., seen at our Boston clinic on May 1st, 2024?"

* **AVOID (Record-Retrieval Style):** Queries that ask the system to look up specific data.
    * *Incorrect Example:* "Pull the lab results for John Doe (MRN: 554-32-11)."
    * *Incorrect Example:* "What was Sarah M.'s discharge summary from May 1st, 2024?"

---
**2. Output Structure**

Each entry must follow this exact format:

`===QUERY===`
`<A single, realistic clinical search query.>`
`===PHI_TAGS===`
`<A JSON object on a new line for each piece of PHI present in the query. This section must be empty for hard negative queries.>`

* The text within the `===QUERY===` block **MUST** be a single, continuous line of text. **DO NOT** use line breaks within the query.
* Each `PHI_TAGS` JSON object **MUST** be a complete, valid JSON object on a single line.

The JSON tag must have this structure:

`{ "identifier_type": "HIPAA_CATEGORY", "value": "..." }`

---
**3. PHI To Be Tagged (Strictly Limited to HIPAA Safe Harbor Identifiers)**

You must ONLY tag the following types of information. If an item is not on this list, do not tag it.

* **NAME**: Full or last names of individuals (patients, relatives, providers).
* **GEOGRAPHIC\_LOCATION**: All geographic subdivisions smaller than a state, including street address, city, county, precinct, and ZIP code.
* **DATE**: All elements of dates (except year) directly related to an individual (e.g., birth date, admission date).
* **PHONE\_NUMBER**: All telephone numbers.
* **FAX\_NUMBER**: All fax numbers.
* **EMAIL\_ADDRESS**: All email addresses.
* **SOCIAL\_SECURITY\_NUMBER**: All Social Security numbers.
* **MEDICAL\_RECORD\_NUMBER**: All medical record numbers.
* **HEALTH\_PLAN\_BENEFICIARY\_NUMBER**: All health plan or insurance policy numbers.
* **ACCOUNT\_NUMBER**: All financial or other account numbers.
* **CERTIFICATE\_LICENSE\_NUMBER**: All certificate or license numbers.
* **VEHICLE\_IDENTIFIER**: All vehicle identifiers and serial numbers, including license plates.
* **DEVICE\_IDENTIFIER**: All medical device identifiers and serial numbers.
* **URL**: All Web Universal Resource Locators.
* **IP\_ADDRESS**: All Internet Protocol addresses.
* **BIOMETRIC\_IDENTIFIER**: References to biometric data (e.g., fingerprints, retinal scans).
* **FULL\_FACE\_PHOTO**: References to full-face photographic images.
* **UNIQUE\_IDENTIFIER**: Any other unique identifying number, characteristic, or code (e.g., clinical trial number).

---
**4. Information to INCLUDE in the Query but NEVER Tag as PHI**

This information provides essential clinical context and must be included in queries but should NEVER be tagged as PHI.

* **Demographics**: Age, gender, race, or ethnicity (e.g., "a 72-year-old Black female").
* **Clinical Status**: Diseases, symptoms, diagnoses, treatments, or medications (e.g., "diagnosed with Alzheimer's disease," "history of stroke").
* **Eponyms, Scores, and Drug Names**: Non-PHI proper nouns like disease names (Parkinson's disease), medical scores (Wells score), clinical trials (GUSTO trial), or drug brand names (Lipitor).
* **Relative Dates**: Non-specific dates or timeframes (e.g., "last month," "yesterday").
* **Years**: Standalone years are permissible and should not be tagged (e.g., "diagnosed in 2022").

---
**5. Dataset Generation Requirements**

* **Dataset Composition**: Within the 500 queries, generate a mix with the following approximate distribution:
    * **Core PHI Queries (~60%)**: Contains at least TWO distinct PHI identifiers from Section 3.
    * **Hard Negative & Ambiguous Queries (~25%)**: Intentionally designed to resemble PHI but containing none. See Section 6.
    * **Complex Syntax Queries (~15%)**: Incorporate realistic shorthand, typos, and varied phrasing. These can be combined with the other two categories.
* **Clinical Intent Variety**: Vary the clinical intent to include:
    * Prognostic questions ("What is the 5-year survival rate for...").
    * Drug-specific queries ("Contraindications for metformin in patients with...").
    * Guideline-seeking ("Latest ACC/AHA guidelines for...").
    * Comparative questions ("Effectiveness of apixaban vs. warfarin for...").
    * Lab interpretation ("Differential diagnosis for elevated AST/ALT in...").
* **Syntactic & Linguistic Variety**: Employ a mix of full natural language questions, clinical shorthand (e.g., "pt", "w/", "hx of", "dx'd", "s/p", "yo" for "year-old"), elliptical phrasing ("post-op fever day 3"), and occasional common typographical errors ("side efects of amlodipine"). Crucially, do not simply copy the structure of the provided examples; they are guides for style, not for sentence construction.

---
**6. Hard Negative and Ambiguous Case Generation**

To robustly test for over-redaction (false positives), roughly 25% of the queries must be "hard negatives." These queries must contain information that could be confused with PHI but is not, and therefore must have an empty `===PHI_TAGS===` section.

* **Use Eponyms and Scores**: Include non-PHI proper nouns (e.g., "Parkinson's disease", "Chaddock reflex", "Wells score").
* **Use Ambiguous Numbers**: Include dosages, lab values, or non-specific identifiers that are not PHI (e.g., "bilirubin level of 12", "notes from bay 3", "protocol for room 12", "trial ID 1999-05").
* **Use Vague Identifiers**: Include phrases that are close to but not explicit PHI (e.g., "patient from the Galveston area," "notes from last summer," "his brother, an engineer").

---
**7. Correctly Formatted Examples**

*Example 1: Core PHI Query (Direct Style)*
`===QUERY===`
Latest NCCN guidelines for HER2+ breast cancer in a 45yo female, Jane Doe, whose case was reviewed at Mercy Hospital on June 5th, 2024 (MRN: #JH-876543)?
`===PHI_TAGS===`
`{ "identifier_type": "NAME", "value": "Jane Doe" }`
`{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mercy Hospital" }`
`{ "identifier_type": "DATE", "value": "June 5th, 2024" }`
`{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "#JH-876543" }`

*Example 2: Hard Negative Query*
`===QUERY===`
What is the recommended management for a 78-year-old male with a high Wells score and a history of Parkinson's disease diagnosed in 2022?
`===PHI_TAGS===`

*Example 3: Complex Syntax & "Messy" Query*
`===QUERY===`
rec. tx for pt w/ hx of T2DM & CKD, see notes for Mark J., seen at Mayo on Mar 3rd 2023. worried about side efects of metformin. his insurance is WX-123456.
`===PHI_TAGS===`
`{ "identifier_type": "NAME", "value": "Mark J." }`
`{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mayo" }`
`{ "identifier_type": "DATE", "value": "Mar 3rd 2023" }`
`{ "identifier_type": "HEALTH_PLAN_BENEFICIARY_NUMBER", "value": "WX-123456" }`
"""

In [None]:
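def generate_phi_queries(n=50, out_path=None):
    """Request roughly n synthetic queries in batches and append them to out_path.

    Batches that raise an API error or lack a ===QUERY=== marker are skipped and
    counted as failed; API errors are written to generation_errors.log.
    """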
    out_path = out_path or OUTPUT_PATH
    os.makedirs(
        os.path.dirname(out_path) or '.',
        exist_ok=True
    )
    
    num_batches = (n + BATCH_SIZE - 1) // BATCH_SIZE  # ceiling division
    failed_batches = 0

    for batch_num in tqdm(range(num_batches), desc="Generating batches"):
        # Shrink the final batch so the total requested equals n
        batch_n = min(BATCH_SIZE, n - batch_num * BATCH_SIZE)
        try:
            resp = client.chat.completions.create(
                model=AZURE_OPENAI_DEPLOYMENT,
                messages=[
                    {
                        "role": "system",
                        "content": phi_query_system_prompt
                    },
                    {
                        "role": "user",
                        "content": f"Generate {batch_n} new, unique, and realistic clinical queries with structured PHI_TAGS as specified in the system prompt."
                    }
                ],
                temperature=TEMPERATURE,
                max_tokens=2500
            )
            
            generated_text = resp.choices[0].message.content
            
            if generated_text and '===QUERY===' in generated_text:
                with open(
                    out_path,
                    "a",
                    encoding="utf-8"
                ) as f:
                    f.write(generated_text.strip() + "\n\n")
            else:
                print(f"\nBatch {batch_num+1}: malformed output, skipping")
                failed_batches += 1

        except Exception as e:
            with open(
                "generation_errors.log",
                "a",
                encoding="utf-8"
            ) as f:
                f.write(f"[{datetime.now()}] Batch {batch_num+1}: {e}\n")
            print(f"\nBatch {batch_num+1}: error (see generation_errors.log)")
            failed_batches += 1

    print(f"\nComplete: {num_batches - failed_batches}/{num_batches} successful")
    print(f"Output: {out_path}")
    
    # validate_dataset is defined in a later cell; run that cell before calling this function
    if os.path.exists(out_path):
        validate_dataset(out_path)
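
`generate_phi_queries` uses ceiling division to decide how many API calls to make. The sketch below illustrates that arithmetic, including sizing the final batch so the total requested equals `n`. The helper `batch_sizes` is hypothetical and is shown only to make the calculation concrete.

```python
def batch_sizes(n, batch_size):
    """Return the number of queries to request in each batch (hypothetical helper)."""
    if n <= 0 or batch_size <= 0:
        return []
    num_batches = (n + batch_size - 1) // batch_size  # ceiling division
    # Every batch is full except possibly the last one
    return [min(batch_size, n - i * batch_size) for i in range(num_batches)]


print(batch_sizes(12, 5))  # [5, 5, 2]
print(sum(batch_sizes(1000, 5)))  # 1000
```

The model may still return more or fewer entries than requested, so the validation counts remain the authoritative measure of dataset size.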

## Customizing Query Generation

In practice, the model tends to imitate the structure and PHI patterns of Examples 1-3 (Section 7 of the system prompt) more closely than it follows the general instructions. Modify these examples to generate domain-specific queries:

**Academic**: Complex terminology, clinical trials, subspecialty queries  
**Community**: Simpler language, primary care, polypharmacy  
**Telemedicine**: Email/phone PHI, device IDs, conversational phrasing

To customize, replace Examples 1-3 in `phi_query_system_prompt` with representative queries for your context, keeping the `===QUERY===` / `===PHI_TAGS===` format unchanged.
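
One way to swap the examples without hand-editing the whole prompt is to split it at the Section 7 heading and append new examples. The sketch below assumes the heading text `**7. Correctly Formatted Examples**` appears in the prompt, as it does in `phi_query_system_prompt`. The helper `replace_examples`, the stand-in `base_prompt`, and the telemedicine example are all hypothetical.

```python
EXAMPLES_HEADING = "**7. Correctly Formatted Examples**"


def replace_examples(prompt, new_examples, heading=EXAMPLES_HEADING):
    """Return the prompt with everything after the examples heading replaced."""
    head, sep, _old_examples = prompt.partition(heading)
    if not sep:
        raise ValueError(f"Heading not found in prompt: {heading!r}")
    return f"{head}{sep}\n\n{new_examples.strip()}\n"


# Hypothetical telemedicine-style example written in the prompt's own format
telemedicine_examples = """*Example 1: Core PHI Query (Telemedicine)*
`===QUERY===`
Is it safe to start lisinopril in a 59yo female w/ CKD stage 3? She asked via jane.roe@example.com after her video visit on Feb 2nd.
`===PHI_TAGS===`
`{ "identifier_type": "EMAIL_ADDRESS", "value": "jane.roe@example.com" }`
`{ "identifier_type": "DATE", "value": "Feb 2nd" }`"""

# Stand-in for phi_query_system_prompt so the sketch runs on its own
base_prompt = "...instructions...\n" + EXAMPLES_HEADING + "\n\n*Example 1: old example*\n"
custom_prompt = replace_examples(base_prompt, telemedicine_examples)
print(custom_prompt)
```

In the notebook, passing `phi_query_system_prompt` as `prompt` would keep Sections 1-6 intact and change only the few-shot examples.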

In [None]:
def validate_dataset(filepath):
    """Validate dataset quality metrics."""
    if not os.path.exists(filepath):
        print(f"Error: {filepath} not found")
        return None
    
    with open(
        filepath,
        'r',
        encoding='utf-8'
    ) as f:
        content = f.read()
    
    queries = content.split('===QUERY===')[1:]
    
    valid_queries = 0
    queries_with_phi = 0
    hard_negatives = 0
    total_phi_elements = 0
    malformed_queries = 0
    
    for block in queries:
        if '===PHI_TAGS===' not in block:
            malformed_queries += 1
            continue
        
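        query_text, phi_section = block.split('===PHI_TAGS===', 1)

        # An empty query body counts as malformed rather than being dropped silently
        if not query_text.strip().strip('`').strip():
            malformed_queries += 1
            continue

        valid_queries += 1
        # Tag lines may be wrapped in backticks, mirroring the prompt's examples
        phi_lines = [
            line for line in phi_section.split('\n')
            if line.strip().strip('`').startswith('{')
        ]

        if phi_lines:
            queries_with_phi += 1
            total_phi_elements += len(phi_lines)
        else:
            hard_negatives += 1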
    
    phi_pct = (queries_with_phi / valid_queries * 100) if valid_queries else 0
    hard_neg_pct = (hard_negatives / valid_queries * 100) if valid_queries else 0
    mean_phi = (total_phi_elements / valid_queries) if valid_queries else 0
    
    print("\nVALIDATION REPORT")
    print(f"Valid queries: {valid_queries}")
    print(f"With PHI: {queries_with_phi} ({phi_pct:.1f}%)")
    print(f"Hard negatives: {hard_negatives} ({hard_neg_pct:.1f}%)")
    print(f"Total PHI: {total_phi_elements}")
    print(f"Mean PHI/query: {mean_phi:.2f}")
    if malformed_queries:
        print(f"Malformed: {malformed_queries}")
    
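    # Malformed rate is measured against all blocks, valid or not
    total_blocks = valid_queries + malformed_queries
    malformed_rate = (malformed_queries / total_blocks) if total_blocks else 0

    checks = []
    checks.append(("Hard negative ratio (15-35%)", 15 <= hard_neg_pct <= 35))
    checks.append(("Mean PHI/query (1.5-5)", 1.5 <= mean_phi <= 5))
    checks.append(("Malformed rate (<5%)", malformed_rate < 0.05))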
    
    print("\nQUALITY CHECKS")
    for desc, passed in checks:
        print(f"{'✓' if passed else '✗'} {desc}")
    
    if all(p for _, p in checks):
        print("\nAll checks passed")
    
    return {
        'valid_queries': valid_queries,
        'queries_with_phi': queries_with_phi,
        'hard_negatives': hard_negatives,
        'total_phi_elements': total_phi_elements,
        'mean_phi_per_query': mean_phi,
        'malformed_queries': malformed_queries
    }
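
`validate_dataset` counts tag lines by their leading `{` but does not parse them. As a rough sketch, assuming entries follow the `===QUERY===` / `===PHI_TAGS===` layout from Section 2 of the prompt (with markers and tags optionally wrapped in backticks), the file contents could be parsed into structured records as follows. `parse_entries` is an illustrative name, not a function defined elsewhere in this notebook.

```python
import json


def parse_entries(text):
    """Parse generated text into a list of {'query', 'phi_tags'} dicts.

    Blocks without a PHI_TAGS marker or with an empty query are skipped, and
    tag lines that are not valid JSON are ignored (hypothetical helper).
    """
    entries = []
    for block in text.split("===QUERY===")[1:]:
        if "===PHI_TAGS===" not in block:
            continue
        query_part, tag_part = block.split("===PHI_TAGS===", 1)
        query = query_part.strip().strip("`").strip()
        if not query:
            continue
        tags = []
        for line in tag_part.splitlines():
            candidate = line.strip().strip("`").strip()
            if not candidate.startswith("{"):
                continue
            try:
                tags.append(json.loads(candidate))
            except json.JSONDecodeError:
                pass  # skip tag lines that are not valid JSON
        entries.append({"query": query, "phi_tags": tags})
    return entries


sample = """===QUERY===
Dosing of apixaban for a 70yo male, John Doe (MRN: 12345)?
===PHI_TAGS===
{ "identifier_type": "NAME", "value": "John Doe" }
{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "12345" }

===QUERY===
Management of a high Wells score in a 78-year-old male?
===PHI_TAGS===
"""

for entry in parse_entries(sample):
    print(len(entry["phi_tags"]), entry["query"])
```

Parsing with `json.loads` also surfaces tag lines that start with `{` but are not valid JSON, which a prefix check alone would count as PHI tags.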

In [None]:
# Generate 1000 queries (appends to existing file)
# generate_phi_queries(n=1000)

In [None]:
# Validate existing dataset
# validate_dataset('./synthetic_dataset.txt')
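
The checks above are aggregate counts. For a de-identification benchmark it may also be worth confirming that each tag is usable as ground truth: its `value` occurs verbatim in the query, and its `identifier_type` is one of the categories listed in Section 3 of the prompt. A minimal sketch, using a hypothetical helper `check_tags` and an invented query:

```python
# Categories listed in Section 3 of the system prompt
ALLOWED_TYPES = {
    "NAME", "GEOGRAPHIC_LOCATION", "DATE", "PHONE_NUMBER", "FAX_NUMBER",
    "EMAIL_ADDRESS", "SOCIAL_SECURITY_NUMBER", "MEDICAL_RECORD_NUMBER",
    "HEALTH_PLAN_BENEFICIARY_NUMBER", "ACCOUNT_NUMBER",
    "CERTIFICATE_LICENSE_NUMBER", "VEHICLE_IDENTIFIER", "DEVICE_IDENTIFIER",
    "URL", "IP_ADDRESS", "BIOMETRIC_IDENTIFIER", "FULL_FACE_PHOTO",
    "UNIQUE_IDENTIFIER",
}


def check_tags(query, tags):
    """Return a list of problems found in one entry's PHI tags (hypothetical helper)."""
    problems = []
    for tag in tags:
        tag_type = tag.get("identifier_type")
        value = tag.get("value", "")
        if tag_type not in ALLOWED_TYPES:
            problems.append(f"unknown identifier_type: {tag_type!r}")
        if not value or value not in query:
            problems.append(f"value not found in query: {value!r}")
    return problems


query = "Latest guidelines for HER2+ breast cancer in a 45yo female, Jane Doe (MRN: #JH-876543)?"
tags = [
    {"identifier_type": "NAME", "value": "Jane Doe"},
    {"identifier_type": "MRN", "value": "JH-876544"},  # wrong type and wrong value
]
print(check_tags(query, tags))
```

An empty list means every tag in the entry names a known category and points at text that is actually present in the query.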