To run the Python code in this notebook locally, you will need to install the following packages. You can install them using pip:

```bash
pip install PyMuPDF openai
```

*   **PyMuPDF (`fitz`)**: A PDF manipulation library. The synthetic data generation script imports it but never calls it; the import is most likely left over from earlier data processing steps.
*   **openai**: The official Python library for interacting with the OpenAI API, including Azure OpenAI services.
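
Before running the notebook, it can help to confirm that both packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `find_missing_packages` and the `REQUIRED_MODULES` mapping are illustrative and not part of the original script.

```python
import importlib.util

# Maps the pip distribution name to the module name used in `import` statements.
# (Illustrative mapping; PyMuPDF is imported as `fitz`.)
REQUIRED_MODULES = {"PyMuPDF": "fitz", "openai": "openai"}


def find_missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = find_missing_packages()
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```

`importlib.util.find_spec` only locates the module without importing it, so the check stays fast and has no side effects.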

In [None]:
import os
import fitz  # PyMuPDF
# Conditional imports based on LLM provider choice by the user
# If using Azure OpenAI (default):
from openai import AzureOpenAI
# If adapting for Ollama, you might need to install and import its library:
# e.g., import ollama # (after `pip install ollama`)
# Or use standard libraries like requests for direct API calls:
# import requests
import glob # Using glob is often easier for pattern matching files

# --- LLM Configuration ---
# This script can be configured to use Azure OpenAI or adapted for Ollama (or other LLMs).
#
# --- 1. Azure OpenAI Configuration (Default) ---
# To use Azure OpenAI, set the following as environment variables in your system,
# or replace the placeholder strings in this script with your actual values.
#
# - AZURE_OPENAI_ENDPOINT: Your Azure OpenAI resource endpoint.
#   (e.g., "https://your-resource-name.openai.azure.com/")
# - AZURE_OPENAI_API_KEY: Your Azure OpenAI API key.
# - AZURE_OPENAI_DEPLOYMENT_NAME: The name of your model deployment in Azure OpenAI Studio.
#   This will be used as the 'model' in API calls (e.g., "GPT4.1", "gpt-4-turbo").
# - AZURE_OPENAI_API_VERSION: The API version for Azure OpenAI.
#   The original script used "2024-12-01-preview".

AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT", "YOUR_AZURE_OPENAI_ENDPOINT_HERE")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY", "YOUR_AZURE_OPENAI_API_KEY_HERE")
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "YOUR_AZURE_DEPLOYMENT_NAME_HERE")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-12-01-preview") # Original API version

# --- 2. Ollama Configuration (Alternative - Requires Code Modification) ---
# To use Ollama (e.g., running locally via `ollama serve`):
#
# a) Define your Ollama settings (can also be environment variables):
#    OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
#    OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "your-ollama-model-name") # e.g., "llama3", "mistral"
#
# b) Modify the 'LLM Client Initialization' section below.
#    You'll need to instantiate an Ollama client or set up for direct HTTP requests.
#    Example using the 'ollama' Python library (`pip install ollama`):
#    --------------------------------------------------------------------
#    # import ollama
#    # try:
#    #     client = ollama.Client(host=OLLAMA_BASE_URL)
#    #     # Test connection (optional): client.list()
#    #     print(f"Ollama client initialized for model: {OLLAMA_MODEL} at {OLLAMA_BASE_URL}")
#    #     LLM_PROVIDER = "Ollama" # Set a flag
#    # except Exception as e:
#    #     print(f"Error initializing Ollama client: {e}")
#    #     client = None
#    #     LLM_PROVIDER = None
#    --------------------------------------------------------------------
#
# c) Adapt the `generate_synthetic_record` function:
#    - The call `client.chat.completions.create(...)` is specific to the OpenAI library.
#      Replace it with the Ollama client's method for chat completions
#      (e.g., `client.chat(model=OLLAMA_MODEL, messages=[...])`).
#    - The `model` parameter should use `OLLAMA_MODEL`.
#    - The structure of `messages` might need adjustment.
#    - Parsing the response (e.g., `response.choices[0].message.content`) will change
#      to match Ollama's response format (e.g., `response['message']['content']`).

# --- LLM Client Initialization ---
# By default, this script attempts to initialize the Azure OpenAI client.
# Modify this section if you are using Ollama or another LLM provider.

client = None
LLM_PROVIDER = None # Can be 'Azure', 'Ollama', etc.

# Attempt Azure OpenAI Initialization if not using placeholders
if not (AZURE_OPENAI_ENDPOINT == "YOUR_AZURE_OPENAI_ENDPOINT_HERE" or
        AZURE_OPENAI_API_KEY == "YOUR_AZURE_OPENAI_API_KEY_HERE" or
        AZURE_OPENAI_DEPLOYMENT_NAME == "YOUR_AZURE_DEPLOYMENT_NAME_HERE"):
    try:
        print("Attempting to initialize Azure OpenAI client...")
        client = AzureOpenAI(
            azure_endpoint=AZURE_OPENAI_ENDPOINT,
            api_key=AZURE_OPENAI_API_KEY,
            api_version=AZURE_OPENAI_API_VERSION
        )
        print(f"Azure OpenAI client initialized. Using deployment: '{AZURE_OPENAI_DEPLOYMENT_NAME}'.")
        LLM_PROVIDER = "Azure"
    except Exception as e:
        print(f"Error initializing Azure OpenAI client: {e}")
        print("Please ensure your Azure OpenAI ENDPOINT, API_KEY, DEPLOYMENT_NAME, and API_VERSION are correctly set as environment variables or in the script.")
        client = None  # Ensure client is None if initialization fails
else:
    print("Azure OpenAI credentials are set to placeholder values or not fully provided.")
    print("If you intend to use Azure OpenAI, please set the following environment variables or update their values directly in the script:")
    print("  - AZURE_OPENAI_ENDPOINT")
    print("  - AZURE_OPENAI_API_KEY")
    print("  - AZURE_OPENAI_DEPLOYMENT_NAME")
    print("  - AZURE_OPENAI_API_VERSION (currently defaults to '2024-12-01-preview')")
    print("The script will proceed, but API calls will fail if Azure OpenAI is the intended provider and is not configured.")

# IMPORTANT FOR USERS:
# The `generate_synthetic_record` function currently uses `client.chat.completions.create`
# and `model=AZURE_OPENAI_DEPLOYMENT_NAME`, which are specific to Azure OpenAI.
# If you switch to Ollama or another LLM, you MUST adapt that function accordingly.

# --- Path Configurations ---
# It's good practice to make paths configurable, e.g., via environment variables.
DEFAULT_PSEUDO_MD_PATH = r".\Lagerugpijn\LR_EPDs" # Original path
PSEUDO_MD_DIRECTORY_PATH = os.getenv("PSEUDO_MD_DIRECTORY_PATH", DEFAULT_PSEUDO_MD_PATH)
print(f"Using pseudonymized examples from: {PSEUDO_MD_DIRECTORY_PATH}")

# Specify the base directory for synthetic data output (derived from input or configurable)
# SYNTHETIC_OUTPUT_BASE_DIR = os.getenv("SYNTHETIC_OUTPUT_BASE_DIR", os.path.dirname(PSEUDO_MD_DIRECTORY_PATH))
SYNTHETIC_OUTPUT_BASE_DIR = os.path.dirname(PSEUDO_MD_DIRECTORY_PATH) # As per original logic

# Specify the output directory for the synthetic files
SYNTHETIC_OUTPUT_DIR_NAME = "synthetic_epds" # Name of the subfolder for synthetic EPDs
SYNTHETIC_OUTPUT_DIR = os.path.join(SYNTHETIC_OUTPUT_BASE_DIR, SYNTHETIC_OUTPUT_DIR_NAME)
# Ensure this path is created later if it doesn't exist (os.makedirs in save function)
print(f"Synthetic data will be saved to: {SYNTHETIC_OUTPUT_DIR}")

# Configure how many synthetic records to generate
# NUM_SYNTHETIC_RECORDS_TO_GENERATE = int(os.getenv("NUM_SYNTHETIC_RECORDS", "20")) # Example: make configurable
NUM_SYNTHETIC_RECORDS_TO_GENERATE = 20 # As per original script

# --- Helper Function to Load Pseudonymized Examples ---
def load_pseudonymized_examples(directory_path):
    """
    Loads content from pseudonymized markdown files in a directory.
    Adds clear separators to help the AI distinguish examples.
    """
    # Look for files starting with 'pseudo_' and ending with '.md'
    example_files = glob.glob(os.path.join(directory_path, "pseudo_*.md"))
    example_content = []

    if not example_files:
        print(f"Warning: No pseudonymized example files found in '{directory_path}'. "
              f"Cannot load examples for synthetic data generation. "
              f"Generation will proceed without specific structure/style examples.")
        return ""

    print(f"Loading {len(example_files)} pseudonymized example files from '{directory_path}'...")
    for file_path in example_files:
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
                # Add a clear separator and header for each example
                example_content.append(f"\n--- BEGIN VOORBEELD DOSSIER: {os.path.basename(file_path)} ---\n{content.strip()}\n--- EINDE VOORBEELD DOSSIER ---\n")
        except Exception as e:
            print(f"Error reading example file {file_path}: {e}")

    return "\n".join(example_content)


# --- Function to Generate a Single Synthetic Record ---
def generate_synthetic_record(client, example_markdown_content, record_number):
    """
    Generates a single synthetic Dutch physiotherapeutic EHR record
    using Azure OpenAI, guided by the prompts and examples.
    """

    # System prompt based on the Worker's core role
    system_prompt = """Je bent een fysiotherapeut die realistische synthetische Nederlandse patiëntdossiers (EHR) genereert op basis van geanonimiseerde intake-informatie en expertbegeleiding. Je past het International Classification of Functioning (ICF) kader toe en volgt de KNGF klinische richtlijn voor lage rugpijn. Produceer uitsluitend het gevraagde patiëntdossier."""

    # User prompt incorporating instructions from both Supervisor and Worker prompts
    user_prompt = f"""Genereer EEN compleet en realistisch synthetisch fysiotherapeutisch patiëntdossier in het Nederlands, uitsluitend voor een patiënt met **acute, subacute of chronische lage rugpijn**. Genereer **geen** dossiers voor andere klachten.

Het dossier moet de volgende onderdelen bevatten, in deze volgorde en met de volgende specificaties:

1.  **Samenvatting anamnese:** Een bondige, verhalende samenvatting van de patiëntgeschiedenis (klachtengeschiedenis, symptoomontwikkeling), functionele impact, coping en relevante context (werk, stressoren, eerdere episodes). Schrijf dit in natuurlijke, professionele Nederlandse klinische taal. Geef duidelijk aan of het gaat om acute (<6 weken), subacute (6-12 weken) of chronische (>12 weken) lage rugpijn.
2.  **ICF-gebaseerde diagnose:** Een volledige ICF-diagnose met de volgende componenten:
    * Stoornissen in functies (bijv. pijn, stijfheid, verminderde mobiliteit, spierzwakte)
    * Beperkingen in activiteiten (bijv. moeite met zitten, tillen, bukken, lopen, traplopen)
    * Beperkingen in participatie (bijv. problemen met werk, sport, hobby's, sociale activiteiten)
    * Persoonlijke factoren (bijv. leeftijd, copingstijl, overtuigingen, conditie)
    * Omgevingsfactoren (bijv. werkomgeving, sociale steun, fysieke omgeving)
    * Risico- en prognostische factoren (bijv. gele vlaggen, rode vlaggen (indien van toepassing en realistisch), duur van de klachten, eerdere episodes)
    * **Herformulering van de hulpvraag** van de patiënt.
3.  **Behandeldoelen:** Formuleer **SMART, patiëntgerichte, functionele doelen**. Beschrijf specifiek **wat de patiënt weer wil kunnen doen**. Klinimetrische scores (zoals PSK, NRS, ODI) mogen worden genoemd als *ondersteuning* of *meetbaar criterium* voor het doel (bijv. "PSK van 70 naar ≤14 om weer te kunnen tuinieren"), maar de score-reductie is niet het doel zelf. Geef aan *wanneer* het doel bereikt moet zijn.
4.  **Behandelplan:** Beschrijf de voorgestelde interventies (bijv. manuele therapie, oefentherapie, motorische controle training, educatie, graded activity, leefstijladvies, pijneducatie) en de rationale hierachter. Baseer dit plan op de KNGF richtlijn lage rugpijn en de gestelde doelen.
5.  **SOEP voortgangsnotities:** Schrijf **minimaal 3 en maximaal 8 afzonderlijke voortgangsnotities**, elk voor een individuele behandelsessie. Gebruik het **volledige SOEP-formaat** (Subjectief, Objectief, Evaluatie, Plan) voor elke notitie. Toon progressie, eventuele stagnatie, terugval, aanpassing van het plan en klinische besluitvorming over de sessies heen. Varieer realistisch in de frequentie en het aantal sessies tussen de 3 en 8.
6.  **Taal en stijl:** Het hele dossier moet geschreven zijn in professioneel, natuurlijk Nederlands, zoals gebruikt door Nederlandse fysiotherapeuten. Breid gangbare afkortingen en klinische shorthand uit (zoals PSK, LWK, 3d xt li). Hanteer een realistische en gevarieerde toon en structuur die aansluit bij de voorbeelden.

Hieronder staan voorbeelden van gepseudonimiseerde patiëntdossiers. Gebruik deze voorbeelden als referentie voor de verwachte structuur, stijl, taalgebruik en het detailniveau, maar genereer een **compleet nieuw en uniek patiëntgeval** met een eigen anamnese, diagnose, doelen en een realistisch, variabel verloop van de behandeling over meerdere sessies.

{example_markdown_content}

Genereer nu **uitsluitend** het nieuwe patiëntdossier hieronder, beginnend met de anamnese samenvatting en eindigend met 'FINISH'. Zorg ervoor dat het dossier **alle** hierboven gevraagde onderdelen bevat en voldoet aan **alle** instructies, inclusief het vereiste aantal SOEP-notities en de focus op lage rugpijn.

"""

    print(f"  Generating synthetic record {record_number}...")

    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME, # Your deployment name for generation
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.8, # Higher temperature for creativity and variation
            max_tokens=8000 # Sufficient tokens for a full record with multiple notes
        )
        synthetic_output = response.choices[0].message.content
        print(f"  Synthetic record {record_number} generation successful.")
        return synthetic_output

    except Exception as e:
        print(f"Error calling Azure OpenAI API for synthetic record {record_number}: {e}")
        # Consider adding a small delay or retry logic here for robustness
        # time.sleep(5) # Example delay
        return None

# --- Function to Save a Single Synthetic Record ---
def save_synthetic_record(synthetic_content, output_dir, record_number):
    """Saves a single synthetic record content string to a specified file."""
    # Ensure the output directory exists
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, f"synthetic_patient_{record_number:03d}.md") # Use padding for sorting

    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(synthetic_content)
        print(f"  Saved synthetic record {record_number} to: {output_path}")
    except Exception as e:
        print(f"Error writing synthetic record file {output_path}: {e}")


# --- Main Execution Logic for Synthetic Data Generation ---
if __name__ == "__main__":
    # --- Assuming the previous script's main block finished and the client is available ---

    print("\n--- Starting Synthetic Data Generation ---")

    # Check if the pseudonymized examples directory exists
    if not os.path.isdir(PSEUDO_MD_DIRECTORY_PATH):
        print(f"Error: Pseudonymized examples directory not found at '{PSEUDO_MD_DIRECTORY_PATH}'. "
              f"Please ensure the previous script ran successfully and generated pseudo_*.md files.")
        raise SystemExit(1)  # Stop with a non-zero status; works in scripts and notebooks

    # Load pseudonymized example content
    # This content will be included in the prompt to provide context/style examples
    example_content = load_pseudonymized_examples(PSEUDO_MD_DIRECTORY_PATH)

    # Proceed even if no examples were loaded, but warn the user
    if not example_content:
        print("Continuing synthetic data generation without examples. The AI will rely solely on the prompts.")


    print(f"\nGenerating {NUM_SYNTHETIC_RECORDS_TO_GENERATE} synthetic records in '{SYNTHETIC_OUTPUT_DIR}'.")

    # Loop to generate the specified number of synthetic records
    for i in range(NUM_SYNTHETIC_RECORDS_TO_GENERATE):
        record_index = i + 1
        print(f"\n--- Generating Synthetic Record {record_index} of {NUM_SYNTHETIC_RECORDS_TO_GENERATE} ---")

        # Generate the synthetic record using the AI
        synthetic_record_content = generate_synthetic_record(client, example_content, record_index)

        if synthetic_record_content:
            # Basic validation: the prompt asks the model to end the record with 'FINISH'.
            # The marker is kept in the saved file, as instructed in the prompt.
            if not synthetic_record_content.strip().endswith("FINISH"):
                print(f"Warning: Generated record {record_index} does not end with 'FINISH'. Content might be incomplete or malformed.")


            # Save the generated record to a file
            save_synthetic_record(synthetic_record_content, SYNTHETIC_OUTPUT_DIR, record_index)
        else:
            print(f"Skipping save for synthetic record {record_index} due to generation failure.")

        # Optional: Add a small delay between generation calls to avoid hitting rate limits
        # time.sleep(1) # Example: 1 second delay

    print("\nSynthetic data generation complete.")
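
The comments in `generate_synthetic_record` and in the main loop suggest adding a delay or retry logic for robustness. One possible way to do this, sketched under the assumption that a failed generation is signalled by a `None` return value (as in the function above), is a small wrapper with exponential backoff. The names `call_with_retries`, `max_attempts`, `base_delay`, and `flaky_generate` are hypothetical and do not appear in the original script.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func until it returns a non-None value or attempts run out.

    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The `sleep` parameter is injectable so the wrapper can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        result = func(*args, **kwargs)
        if result is not None:
            return result
        if attempt < max_attempts:
            delay = base_delay * 2 ** (attempt - 1)
            print(f"  Attempt {attempt} failed; retrying in {delay:.1f}s...")
            sleep(delay)
    return None


# Stand-in for generate_synthetic_record: fails twice, then succeeds.
_calls = {"count": 0}


def flaky_generate(record_number):
    _calls["count"] += 1
    if _calls["count"] < 3:
        return None
    return f"record {record_number} FINISH"


delays = []  # Records the requested delays instead of actually sleeping
result = call_with_retries(flaky_generate, 1, sleep=delays.append)
print(result)
```

In the main loop, the direct call could then be replaced by something like `call_with_retries(generate_synthetic_record, client, example_content, record_index)`, leaving the rest of the loop unchanged.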