# Reproducible Framework for Synthetic Clinical Query Generation

This notebook provides a **customizable, reproducible pipeline** for generating synthetic clinical queries with ground-truth PHI tags using Azure OpenAI (GPT-4o).

## Purpose
- **Benchmark Dataset**: For evaluating HIPAA-compliant deidentification systems in AI-clinical workflows
- **Customizable Framework**: Adapt query generation for different clinical contexts (academic/community/telemedicine) by modifying few-shot examples
- **Reproducible**: Environment-based configuration, pinned dependencies, seeded randomness

## Key Features
- **Few-shot driven**: LLM behavior is primarily controlled by the three template examples in the prompt
- **Append-only**: Safely adds new queries without regenerating existing data
- **Validation**: Built-in dataset quality checks post-generation

## Requirements
- Python 3.12+
- Azure OpenAI credentials (set via environment variables)
- See `requirements.txt` for dependencies

In [None]:
# Create requirements.txt if it doesn't exist
import os

requirements_content = """openai==1.75.0
tqdm==4.67.1
"""

requirements_path = './requirements.txt'
if not os.path.exists(requirements_path):
    with open(requirements_path, 'w') as f:
        f.write(requirements_content)
    print(f"Created {requirements_path}")
else:
    print(f"{requirements_path} already exists")

In [None]:
# Install dependencies from requirements.txt
!pip install -r requirements.txt

In [None]:
# Imports
import os
import random
import configparser
from datetime import datetime
from openai import AzureOpenAI
from tqdm import tqdm

# Seed Python's RNG so any local randomness is repeatable. This does not affect
# GPT-4o sampling, which is governed by TEMPERATURE in the configuration cell.
random.seed(42)

In [None]:
# ============================================================================
# Configuration: Azure OpenAI Credentials and Settings
# ============================================================================
# Security: Use environment variables for all sensitive credentials.
# Set these before running:
#   export AZURE_OPENAI_API_KEY_4o='your_api_key'
#   export AZURE_OPENAI_ENDPOINT_4o='your_endpoint'
#   export AZURE_OPENAI_DEPLOYMENT_4o='your_deployment_name'
# Optional: export AZURE_CONFIG_PATH='./config.ini' if using config file

# Load from a config file if one exists at AZURE_CONFIG_PATH (defaults to ./config.ini)
config_path = os.getenv('AZURE_CONFIG_PATH', './config.ini')

if os.path.exists(config_path):
    config = configparser.ConfigParser()
    config.read(config_path)
    
    AZURE_OPENAI_API_KEY = config.get('AZUREOPENAI', 'API_KEY_4o', fallback=None)
    AZURE_OPENAI_ENDPOINT = config.get('AZUREOPENAI', 'ENDPOINT_4o', fallback=None)
    AZURE_OPENAI_DEPLOYMENT = config.get('AZUREOPENAI', 'LLM_DEPLOYMENT_NAME_4o', fallback=None)
else:
    AZURE_OPENAI_API_KEY = None
    AZURE_OPENAI_ENDPOINT = None
    AZURE_OPENAI_DEPLOYMENT = None

# Override with environment variables if set (env vars take precedence)
AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY_4o', AZURE_OPENAI_API_KEY)
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT_4o', AZURE_OPENAI_ENDPOINT)
AZURE_OPENAI_DEPLOYMENT = os.getenv('AZURE_OPENAI_DEPLOYMENT_4o', AZURE_OPENAI_DEPLOYMENT)

# Validate that all required credentials are present
if not all([AZURE_OPENAI_API_KEY, AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_DEPLOYMENT]):
    raise ValueError(
        "Missing required Azure OpenAI credentials. Please set environment variables:\n"
        "  AZURE_OPENAI_API_KEY_4o\n"
        "  AZURE_OPENAI_ENDPOINT_4o\n"
        "  AZURE_OPENAI_DEPLOYMENT_4o\n"
        "Or provide a config file at AZURE_CONFIG_PATH"
    )

# Output path (override with the OUTPUT_PATH environment variable)
OUTPUT_PATH = os.getenv('OUTPUT_PATH', './synthetic_dataset.txt')

# Initialize Azure OpenAI client
API_VERSION = "2023-07-01-preview"
TEMPERATURE = 0.9  # Higher temperature for diverse query generation
BATCH_SIZE = 5     # Queries generated per API call

client = AzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)

# Print generation settings for reproducibility documentation
print("=" * 60)
print("GENERATION SETTINGS")
print("=" * 60)
print(f"Model: GPT-4o")
print(f"Deployment: {AZURE_OPENAI_DEPLOYMENT}")
print(f"API Version: {API_VERSION}")
print(f"Temperature: {TEMPERATURE}")
print(f"Batch Size: {BATCH_SIZE} queries per request")
print(f"Output Path: {OUTPUT_PATH}")
print(f"Random Seed: 42")
print("=" * 60)
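
The precedence rule used in the configuration cell (environment variable first, config file second) can be isolated into a small helper. The sketch below is illustrative only: `resolve_setting` and the `DEMO_AZURE_ENDPOINT` variable are hypothetical names chosen so that the demo does not touch the real credentials, and the endpoint values are placeholders.

```python
import configparser
import os


def resolve_setting(env_name, config, section, option):
    """Return the environment value if set, else the config-file value, else None."""
    file_value = config.get(section, option, fallback=None)
    return os.getenv(env_name, file_value)


# In-memory stand-in for config.ini (hypothetical placeholder value)
config = configparser.ConfigParser()
config.read_string("[AZUREOPENAI]\nENDPOINT_4o = https://file.example.invalid\n")

# Case 1: environment variable not set, so the config-file value is used
os.environ.pop("DEMO_AZURE_ENDPOINT", None)
from_file = resolve_setting("DEMO_AZURE_ENDPOINT", config, "AZUREOPENAI", "ENDPOINT_4o")

# Case 2: environment variable set, so it takes precedence over the file
os.environ["DEMO_AZURE_ENDPOINT"] = "https://env.example.invalid"
from_env = resolve_setting("DEMO_AZURE_ENDPOINT", config, "AZUREOPENAI", "ENDPOINT_4o")

# Clean up the demo variable so the session is left unchanged
del os.environ["DEMO_AZURE_ENDPOINT"]

print("config file only:", from_file)
print("with env override:", from_env)
```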

In [None]:
# ============================================================================
# System Prompt: Defines the synthetic query generation framework
# ============================================================================
# NOTE: The LLM follows the structure/PHI patterns in the three few-shot 
# examples (see Examples 1-3 at the bottom of this prompt) more than the 
# general description. This is intentional: Customize these examples for 
# domain-specific datasets (e.g., academic vs. community hospital contexts).
#
# The 'r' prefix marks this as a raw string to prevent SyntaxWarnings from
# backslashes used in markdown formatting.

phi_query_system_prompt = r"""
### System Prompt for Generation of a Methodologically Robust Synthetic Clinical Query Dataset

You are a clinical data simulation assistant. Your mission is to generate a high-quality, benchmark dataset of 500 synthetic clinical search queries. This dataset is designed for the rigorous training and evaluation of automated de-identification pipelines, with a strict focus on HIPAA Safe Harbor standards.

The queries must represent conversational interactions where physicians ask AI assistants patient-specific questions requiring evidence-based guidance. In this emerging use case, queries naturally contain PHI because physicians are in the patient's context. The primary objective is to test a system's ability to accurately distinguish and remove specific, legally defined PHI identifiers from clinically rich, unstructured text. The dataset must include a balanced mix of straightforward PHI cases, challenging ambiguous queries, and "hard negatives" to robustly measure both detection accuracy and over-redaction risk.

The output must be programmatically parsable and adhere to the specified format without deviation.

---
**1. Query Style and Intent: Information-Seeking vs. Record-Retrieval**

This is the most critical instruction. Queries must seek general medical knowledge (e.g., guidelines, evidence, contraindications) using a patient's context, not request a lookup of a patient's specific record.

* **Vary your phrasing.** Do not overuse phrases like "similar to my patient" or "a case like." Frame the queries more directly.

* **GENERATE (Information-Seeking Style):** Queries that seek general medical knowledge.
    * *Correct Example:* "Current guidelines for managing Type 2 Diabetes in a 68-year-old male with a history of CKD, John Doe (MRN: 554-32-11)?"
    * *Correct Example:* "Contraindications for prescribing Paxlovid to a patient on statins like Sarah M., seen at our Boston clinic on May 1st, 2024?"

* **AVOID (Record-Retrieval Style):** Queries that ask the system to look up specific data.
    * *Incorrect Example:* "Pull the lab results for John Doe (MRN: 554-32-11)."
    * *Incorrect Example:* "What was Sarah M.'s discharge summary from May 1st, 2024?"

---
**2. Output Structure**

Each entry must follow this exact format:

`===QUERY===`
`<A single, realistic clinical search query.>`
`===PHI_TAGS===`
`<A JSON object on a new line for each piece of PHI present in the query. This section must be empty for hard negative queries.>`

* The text within the `===QUERY===` block **MUST** be a single, continuous line of text. **DO NOT** use line breaks within the query.
* Each `PHI_TAGS` JSON object **MUST** be a complete, valid JSON object on a single line.

The JSON tag must have this structure:

`{ "identifier_type": "HIPAA_CATEGORY", "value": "..." }`

---
**3. PHI To Be Tagged (Strictly Limited to HIPAA Safe Harbor Identifiers)**

You must ONLY tag the following types of information. If an item is not on this list, do not tag it.

* **NAME**: Full or last names of individuals (patients, relatives, providers).
* **GEOGRAPHIC\_LOCATION**: All geographic subdivisions smaller than a state, including street address, city, county, precinct, and ZIP code.
* **DATE**: All elements of dates (except year) directly related to an individual (e.g., birth date, admission date).
* **PHONE\_NUMBER**: All telephone numbers.
* **FAX\_NUMBER**: All fax numbers.
* **EMAIL\_ADDRESS**: All email addresses.
* **SOCIAL\_SECURITY\_NUMBER**: All Social Security numbers.
* **MEDICAL\_RECORD\_NUMBER**: All medical record numbers.
* **HEALTH\_PLAN\_BENEFICIARY\_NUMBER**: All health plan or insurance policy numbers.
* **ACCOUNT\_NUMBER**: All financial or other account numbers.
* **CERTIFICATE\_LICENSE\_NUMBER**: All certificate or license numbers.
* **VEHICLE\_IDENTIFIER**: All vehicle identifiers and serial numbers, including license plates.
* **DEVICE\_IDENTIFIER**: All medical device identifiers and serial numbers.
* **URL**: All Web Universal Resource Locators.
* **IP\_ADDRESS**: All Internet Protocol addresses.
* **BIOMETRIC\_IDENTIFIER**: References to biometric data (e.g., fingerprints, retinal scans).
* **FULL\_FACE\_PHOTO**: References to full-face photographic images.
* **UNIQUE\_IDENTIFIER**: Any other unique identifying number, characteristic, or code (e.g., clinical trial number).

---
**4. Information to INCLUDE in the Query but NEVER Tag as PHI**

This information provides essential clinical context and must be included in queries but should NEVER be tagged as PHI.

* **Demographics**: Age, gender, race, or ethnicity (e.g., "a 72-year-old Black female").
* **Clinical Status**: Diseases, symptoms, diagnoses, treatments, or medications (e.g., "diagnosed with Alzheimer's disease," "history of stroke").
* **Eponyms, Scores, and Drug Names**: Non-PHI proper nouns like disease names (Parkinson's disease), medical scores (Wells score), clinical trials (GUSTO trial), or drug brand names (Lipitor).
* **Relative Dates**: Non-specific dates or timeframes (e.g., "last month," "yesterday").
* **Years**: Standalone years are permissible and should not be tagged (e.g., "diagnosed in 2022").

---
**5. Dataset Generation Requirements**

* **Dataset Composition**: Within the 500 queries, generate a mix with the following approximate distribution:
    * **Core PHI Queries (~60%)**: Contains at least TWO distinct PHI identifiers from Section 3.
    * **Hard Negative & Ambiguous Queries (~25%)**: Intentionally designed to resemble PHI but containing none. See Section 6.
    * **Complex Syntax Queries (~15%)**: Incorporate realistic shorthand, typos, and varied phrasing. These can be combined with the other two categories.
* **Clinical Intent Variety**: Vary the clinical intent to include:
    * Prognostic questions ("What is the 5-year survival rate for...").
    * Drug-specific queries ("Contraindications for metformin in patients with...").
    * Guideline-seeking ("Latest ACC/AHA guidelines for...").
    * Comparative questions ("Effectiveness of apixaban vs. warfarin for...").
    * Lab interpretation ("Differential diagnosis for elevated AST/ALT in...").
* **Syntactic & Linguistic Variety**: Employ a mix of full natural language questions, clinical shorthand (e.g., "pt", "w/", "hx of", "dx'd", "s/p", "yo" for "year-old"), elliptical phrasing ("post-op fever day 3"), and occasional common typographical errors ("side efects of amlodipine"). Crucially, do not simply copy the structure of the provided examples; they are guides for style, not for sentence construction.

---
**6. Hard Negative and Ambiguous Case Generation**

To robustly test for over-redaction (false positives), roughly 25% of the queries must be "hard negatives." These queries must contain information that could be confused with PHI but is not, and therefore must have an empty `===PHI_TAGS===` section.

* **Use Eponyms and Scores**: Include non-PHI proper nouns (e.g., "Parkinson's disease", "Chaddock reflex", "Wells score").
* **Use Ambiguous Numbers**: Include dosages, lab values, or non-specific identifiers that are not PHI (e.g., "bilirubin level of 12", "notes from bay 3", "protocol for room 12", "trial ID 1999-05").
* **Use Vague Identifiers**: Include phrases that are close to but not explicit PHI (e.g., "patient from the Galveston area," "notes from last summer," "his brother, an engineer").

---
**7. Correctly Formatted Examples**

*Example 1: Core PHI Query (Direct Style)*
`===QUERY===`
Latest NCCN guidelines for HER2+ breast cancer in a 45yo female, Jane Doe, whose case was reviewed at Mercy Hospital on June 5th, 2024 (MRN: #JH-876543)?
`===PHI_TAGS===`
`{ "identifier_type": "NAME", "value": "Jane Doe" }`
`{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mercy Hospital" }`
`{ "identifier_type": "DATE", "value": "June 5th, 2024" }`
`{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "#JH-876543" }`

*Example 2: Hard Negative Query*
`===QUERY===`
What is the recommended management for a 78-year-old male with a high Wells score and a history of Parkinson's disease diagnosed in 2022?
`===PHI_TAGS===`

*Example 3: Complex Syntax & "Messy" Query*
`===QUERY===`
rec. tx for pt w/ hx of T2DM & CKD, see notes for Mark J., seen at Mayo on Mar 3rd 2023. worried about side efects of metformin. his insurance is WX-123456.
`===PHI_TAGS===`
`{ "identifier_type": "NAME", "value": "Mark J." }`
`{ "identifier_type": "GEOGRAPHIC_LOCATION", "value": "Mayo" }`
`{ "identifier_type": "DATE", "value": "Mar 3rd 2023" }`
`{ "identifier_type": "HEALTH_PLAN_BENEFICIARY_NUMBER", "value": "WX-123456" }`
"""

## Framework Design Note: Few-Shot Example Customization

**Key Insight**: The LLM primarily follows the **structure and PHI patterns** demonstrated in the three template examples (Examples 1-3 in the prompt above), rather than the general description text. This is **intentional design** that enables flexible, domain-specific adaptation.

### Why This Matters
Users can customize query generation for specific clinical contexts by **modifying only the three few-shot examples**, without rewriting the entire prompt. The model learns:
- Query complexity and phrasing style
- Types of PHI to emphasize (e.g., MRNs vs. locations)
- Clinical vocabulary level (academic vs. primary care)
- Use of abbreviations and shorthand

### Example Customizations

**Academic Medical Center Context**:
- Use complex terminology ("HER2+ metastatic adenocarcinoma")
- Reference clinical trials and research protocols
- Include subspecialty-specific queries

**Community Hospital Context**:
- Use simpler language ("breast cancer that spread")
- Focus on primary care conditions (diabetes, hypertension)
- Include common comorbidities and polypharmacy

**Telemedicine Context**:
- Emphasize email addresses and phone numbers as primary identifiers
- Include remote monitoring device IDs
- Use more conversational phrasing

### Practical Use
To adapt this framework for your context:
1. Identify your target clinical setting
2. Write 3 representative queries with realistic PHI
3. Replace Examples 1-3 in the prompt above
4. Regenerate dataset with the same code

This flexibility enables **domain-specific benchmark generation** without changing the generation pipeline.
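
One way to make the example swap less error-prone is to keep the few-shot examples as structured data and render them into the `===QUERY===` / `===PHI_TAGS===` layout. The sketch below assumes a hypothetical helper, `render_example`, and two invented telemedicine-style examples; none of these names or queries come from the original prompt, and the rendered text would still need to be pasted into (or substituted for) Section 7 of `phi_query_system_prompt` by hand.

```python
import json


def render_example(title, query, phi_tags):
    """Render one few-shot example in the ===QUERY=== / ===PHI_TAGS=== layout."""
    lines = [f"*{title}*", "`===QUERY===`", query, "`===PHI_TAGS===`"]
    for identifier_type, value in phi_tags:
        # json.dumps guarantees each tag is valid single-line JSON
        tag = json.dumps({"identifier_type": identifier_type, "value": value})
        lines.append(f"`{tag}`")
    return "\n".join(lines)


# Hypothetical telemedicine examples (invented for illustration)
telemedicine_examples = [
    (
        "Example 1: Core PHI Query (Telemedicine)",
        "Safe starting dose of lisinopril for a 61yo male with stage 2 CKD, "
        "Alex Roe, who messaged from alex.roe@example.com?",
        [("NAME", "Alex Roe"), ("EMAIL_ADDRESS", "alex.roe@example.com")],
    ),
    (
        "Example 2: Hard Negative Query",
        "How should a high CHA2DS2-VASc score change anticoagulation choices "
        "in a 70-year-old female with atrial fibrillation?",
        [],  # hard negatives carry no tags
    ),
]

examples_block = "\n\n".join(render_example(*ex) for ex in telemedicine_examples)
print(examples_block)
```

Keeping the tag values next to the query text in one structure also makes it easy to check that every tagged value actually appears in its query before the examples go into the prompt.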

In [None]:
# ============================================================================
# Query Generation Function
# ============================================================================

def generate_phi_queries(n=50, out_path=None):
    """
    Generates a specified number of synthetic queries and appends them to the output file.
    
    This function requests queries in batches (more efficient than one-by-one generation)
    and appends successfully generated batches immediately to preserve progress.

    Args:
        n (int): The total number of queries to generate.
        out_path (str): The file path to save the generated queries. 
                        Defaults to OUTPUT_PATH from config.
    
    Returns:
        None. Writes queries to file and prints completion message.
    """
    if out_path is None:
        out_path = OUTPUT_PATH
    
    # Ensure the output directory exists
    os.makedirs(os.path.dirname(out_path) or '.', exist_ok=True)
    
    # Calculate batches
    num_batches = (n + BATCH_SIZE - 1) // BATCH_SIZE
    
    failed_batches = 0

    for batch_num in tqdm(range(num_batches), desc="Generating Query Batches"):
        # The final batch may be smaller when n is not a multiple of BATCH_SIZE
        batch_count = min(BATCH_SIZE, n - batch_num * BATCH_SIZE)
        try:
            resp = client.chat.completions.create(
                model=AZURE_OPENAI_DEPLOYMENT,
                messages=[
                    {"role": "system", "content": phi_query_system_prompt},
                    {"role": "user", "content": f"Generate {batch_count} new, unique, and realistic clinical queries with structured PHI_TAGS as specified in the system prompt."}
                ],
                temperature=TEMPERATURE,
                max_tokens=2500
            )
            
            generated_text = resp.choices[0].message.content
            
            # Basic validation: Check if response contains expected format markers
            if generated_text and '===QUERY===' in generated_text:
                # Append to the file immediately after a successful call to save progress
                with open(out_path, "a", encoding="utf-8") as f:
                    f.write(generated_text.strip() + "\n\n")
            else:
                print(f"\nWarning: Batch {batch_num+1} returned malformed output; skipping.")
                failed_batches += 1

        except Exception as e:
            # Log errors to a separate file to not pollute the dataset
            error_log_path = "generation_errors.log"
            with open(error_log_path, "a", encoding="utf-8") as f:
                f.write(f"[{datetime.now()}] Batch {batch_num+1} failed: {e}\n")
            print(f"\nError in batch {batch_num+1}; skipping. See generation_errors.log")
            failed_batches += 1

    print(f"\n{'='*60}")
    print(f"Generation complete!")
    print(f"Queries appended to: {out_path}")
    print(f"Successful batches: {num_batches - failed_batches}/{num_batches}")
    if failed_batches > 0:
        print(f"Failed batches: {failed_batches} (see generation_errors.log)")
    print(f"{'='*60}")
    
    # Run validation on the output file (the cell defining validate_dataset
    # must have been executed before this function is called)
    if os.path.exists(out_path):
        print("\nValidating generated dataset...")
        validate_dataset(out_path)
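
The batch arithmetic can be checked in isolation. Ceiling division gives the number of API calls, and when `n` is not a multiple of the batch size only the remainder is needed from the last call. `plan_batches` below is a hypothetical helper written for this illustration, not part of the pipeline.

```python
def plan_batches(n, batch_size):
    """Return the list of batch sizes needed to produce exactly n queries."""
    if n <= 0 or batch_size <= 0:
        return []
    num_batches = (n + batch_size - 1) // batch_size  # ceiling division
    sizes = [batch_size] * num_batches
    # The last batch only needs whatever is left over
    sizes[-1] = n - batch_size * (num_batches - 1)
    return sizes


print(len(plan_batches(1000, 5)))  # 200 batches of 5
print(plan_batches(12, 5))         # [5, 5, 2]
```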

In [None]:
# ============================================================================
# Dataset Validation Function
# ============================================================================

def validate_dataset(filepath):
    """
    Validate existing dataset: Count queries, PHI elements, and hard negatives.
    
    This function performs post-generation quality checks to ensure the dataset
    meets expected standards for balance and completeness.
    
    Args:
        filepath (str): Path to the generated synthetic_dataset.txt file
    
    Returns:
        dict: Validation statistics
    """
    if not os.path.exists(filepath):
        print(f"Error: File not found at {filepath}")
        return None
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    # Split by query markers
    queries = content.split('===QUERY===')[1:]  # Skip first empty split
    
    valid_queries = 0
    queries_with_phi = 0
    hard_negatives = 0
    total_phi_elements = 0
    malformed_queries = 0
    
    for block in queries:
        if '===PHI_TAGS===' not in block:
            malformed_queries += 1
            continue
        
        query_text, phi_section = block.split('===PHI_TAGS===', 1)
        
        if query_text.strip():
            valid_queries += 1
            
            # Count PHI elements: lines that start with '{' are JSON tags.
            # Backticks are stripped first because the prompt's examples wrap tags in them.
            phi_lines = [line for line in phi_section.split('\n')
                        if line.strip().strip('`').strip().startswith('{')]
            
            if phi_lines:
                queries_with_phi += 1
                total_phi_elements += len(phi_lines)
            else:
                hard_negatives += 1
    
    # Calculate statistics
    phi_percentage = (queries_with_phi / valid_queries * 100) if valid_queries > 0 else 0
    hard_neg_percentage = (hard_negatives / valid_queries * 100) if valid_queries > 0 else 0
    mean_phi_per_query = (total_phi_elements / valid_queries) if valid_queries > 0 else 0
    
    # Print validation report
    print(f"\n{'='*60}")
    print("DATASET VALIDATION REPORT")
    print(f"{'='*60}")
    print(f"Total Valid Queries:      {valid_queries}")
    print(f"Queries with PHI:         {queries_with_phi} ({phi_percentage:.1f}%)")
    print(f"Hard Negatives (no PHI):  {hard_negatives} ({hard_neg_percentage:.1f}%)")
    print(f"Total PHI Elements:       {total_phi_elements}")
    print(f"Mean PHI per Query:       {mean_phi_per_query:.2f}")
    
    if malformed_queries > 0:
        print(f"\n  Malformed Queries:      {malformed_queries}")
    
    # Quality checks
    print(f"\n{'='*60}")
    print("QUALITY CHECKS")
    print(f"{'='*60}")
    
    checks_passed = True
    
    # Check 1: Hard negatives should be ~25%
    if 15 <= hard_neg_percentage <= 35:
        print("✓ Hard negative ratio within expected range (15-35%)")
    else:
        print(f"  Hard negative ratio ({hard_neg_percentage:.1f}%) outside expected range (15-35%)")
        checks_passed = False
    
    # Check 2: Mean PHI per query (averaged over all valid queries,
    # including hard negatives) should fall within 1.5-5
    if 1.5 <= mean_phi_per_query <= 5:
        print("✓ Mean PHI per query within expected range (1.5-5)")
    else:
        print(f"  Mean PHI per query ({mean_phi_per_query:.2f}) outside expected range (1.5-5)")
        checks_passed = False
    
    # Check 3: No excessive malformed queries (less than 5% of all blocks)
    total_blocks = valid_queries + malformed_queries
    malformed_rate = (malformed_queries / total_blocks * 100) if total_blocks > 0 else 0
    if malformed_rate < 5:
        print("✓ Malformed query rate acceptable (<5%)")
    else:
        print(f"  High malformed query rate: {malformed_queries} ({malformed_rate:.1f}%)")
        checks_passed = False
    
    if checks_passed:
        print(f"\n✓ All quality checks passed!")
    else:
        print(f"\n  Some quality checks failed. Review dataset or regenerate.")
    
    print(f"{'='*60}\n")
    
    return {
        'valid_queries': valid_queries,
        'queries_with_phi': queries_with_phi,
        'hard_negatives': hard_negatives,
        'total_phi_elements': total_phi_elements,
        'mean_phi_per_query': mean_phi_per_query,
        'malformed_queries': malformed_queries
    }
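
The validation above counts tag lines but does not check that each tag is valid JSON or that its value actually occurs in the query. A stricter pass might look like the sketch below. `parse_dataset` is a hypothetical helper, and the notion of an "ungrounded" tag (a value that does not appear verbatim in the query text) is an assumption introduced here, not a check the original pipeline performs.

```python
import json


def parse_dataset(text):
    """Parse ===QUERY=== / ===PHI_TAGS=== blocks into a list of records."""
    records = []
    for block in text.split("===QUERY===")[1:]:
        if "===PHI_TAGS===" not in block:
            continue  # malformed block: no tag section
        query_part, tag_part = block.split("===PHI_TAGS===", 1)
        # Strip whitespace and any backticks echoed from the prompt's formatting
        query = query_part.strip().strip("`").strip()
        tags, bad_lines = [], []
        for line in tag_part.splitlines():
            candidate = line.strip().strip("`").strip()
            if not candidate.startswith("{"):
                continue
            try:
                tags.append(json.loads(candidate))
            except json.JSONDecodeError:
                bad_lines.append(candidate)
        # A tag is "grounded" when its value occurs verbatim in the query text
        ungrounded = [t for t in tags if str(t.get("value", "")) not in query]
        records.append({"query": query, "tags": tags,
                        "bad_lines": bad_lines, "ungrounded": ungrounded})
    return records


sample = """===QUERY===
Latest NCCN guidelines for HER2+ breast cancer in a 45yo female, Jane Doe (MRN: #JH-876543)?
===PHI_TAGS===
{ "identifier_type": "NAME", "value": "Jane Doe" }
{ "identifier_type": "MEDICAL_RECORD_NUMBER", "value": "#JH-876543" }

===QUERY===
Management of a 78-year-old male with a high Wells score?
===PHI_TAGS===
"""

records = parse_dataset(sample)
for record in records:
    print(len(record["tags"]), "tags |", record["query"][:40])
```

Records with a non-empty `bad_lines` or `ungrounded` list are candidates for manual review, since a tag that cannot be located in its query cannot serve as ground truth for a de-identification benchmark.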

In [None]:
# ============================================================================
# Example Execution: Generate Queries
# ============================================================================
# This cell demonstrates generating 1000 queries in batches of 5.
# For more repeatable outputs, set TEMPERATURE = 0 in the configuration cell.
# A low temperature reduces run-to-run variation but does not guarantee
# identical outputs, and it also reduces query diversity.
#
# The function will:
#   1. Generate queries in 200 batches
#   2. Append each batch immediately to preserve progress
#   3. Log any errors to generation_errors.log
#   4. Validate the final dataset automatically
#
# IMPORTANT: This appends to the output file. If you want to start fresh,
# delete or rename the existing synthetic_dataset.txt file first.

# Uncomment to run:
# generate_phi_queries(n=1000)

In [None]:
# ============================================================================
# Validate Existing Dataset
# ============================================================================
# Run this cell to validate an existing dataset without regenerating queries.
# Useful for checking dataset quality after generation or modifications.

# Example: Validate the default output file
# validate_dataset('./synthetic_dataset.txt')

# Or validate a specific file:
# validate_dataset('/Users/jacweath/Desktop/safesearch_/JMIR AI/Synth Data Gen/synthetic_dataset.txt')
