<span style="font-size: 12px;">

To run the Python code in this notebook locally, you'll need to install several packages.

### Required Packages:

*   **python-dotenv**: Used for managing environment variables, especially for API keys and endpoints.
    ```bash
    pip install python-dotenv
    ```
*   **PyMuPDF**: A library for accessing PDF files (extracting text, images, etc.). It is imported as `fitz`.
    ```bash
    pip install PyMuPDF
    ```
*   **openai**: The official Python library for interacting with OpenAI APIs, including Azure OpenAI.
    ```bash
    pip install openai
    ```

The `os` and `glob` modules are part of the Python standard library and do not require separate installation.
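
Before running the notebook, it can help to confirm that the packages above are actually importable. The following is a minimal sketch, not part of the original script: the helper name `find_missing_packages` is hypothetical, and the mapping assumes the import names `dotenv`, `fitz`, and `openai` that the notebook's code uses.

```python
import importlib.util

# Maps each pip package name to the module name used in `import` statements.
REQUIRED_MODULES = {
    "python-dotenv": "dotenv",
    "PyMuPDF": "fitz",
    "openai": "openai",
}


def find_missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for pip_name, import_name in modules.items()
        if importlib.util.find_spec(import_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks the modules up; it does not import them, so it stays fast and has no side effects.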

### Optional for Ollama Users:

If you plan to adapt the script to use Ollama (a platform for running large language models locally), you will also need:

*   **ollama**: The Python client library for Ollama.
    ```bash
    pip install ollama
    ```

Make sure Python is installed on your system; the packages above can then be installed with `pip`, Python's package installer. Using a virtual environment to manage your project dependencies is recommended.
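
If you are unsure whether a virtual environment is active, the interpreter itself can tell you. This is a small sketch (the function name `in_virtualenv` is an illustrative choice) that relies on the standard `sys.prefix` and `sys.base_prefix` attributes, which differ inside an environment created with `venv`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```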

</span>

<span style="font-size: 12px;">

## Guide: Selecting and Running an Ollama Model (e.g., `mistral-small`) Locally on Windows 11

This guide will walk you through selecting an Ollama model from their library and running it locally on your Windows 11 PC. We'll use `mistral-small` as an example.

### 1. Install Ollama on Windows 11

*   **Download Ollama:** Go to the official Ollama website: [https://ollama.com/](https://ollama.com/)
*   Click on the "Download" button.
*   Select the "Download for Windows" option.
*   Run the downloaded installer and follow the on-screen instructions. This will typically install Ollama and make the `ollama` command available in your terminal (Command Prompt, PowerShell, or Windows Terminal).

### 2. Select a Model from the Ollama Library

*   **Browse the Library:** Visit the Ollama model library: [https://ollama.com/library](https://ollama.com/library). The list can be sorted (for example by newest or most popular); appending `?sort=newest` to the URL sorts by newest.
*   **Find Your Model:** For this example, we're looking for `mistral-small`. You can search for it or browse the list.
    *   Clicking on a model (e.g., `mistral-small`) will take you to its page, showing details and tags (versions).

### 3. Pull the Model using the Command Line

*   **Open Your Terminal:** Open Command Prompt, PowerShell, or Windows Terminal.
*   **Pull the Model:** To download the `mistral-small` model (it will default to the latest version/tag), type the following command and press Enter:
    ```bash
    ollama pull mistral-small
    ```
    A progress indicator shows the download. The model files can be several gigabytes, so this may take some time depending on your internet connection.
    *   To pull a specific version/tag, append it after a colon. For example, if `mistral-small` had a tag named `7b-instruct-v0.2` (a hypothetical tag, used here only for illustration), you would run `ollama pull mistral-small:7b-instruct-v0.2`. Without a tag, `ollama pull mistral-small` fetches the default (latest) tag, which is usually sufficient.
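
The `name:tag` convention described above can be made explicit in a few lines of Python. This is only an illustrative sketch: `split_model_reference` is a hypothetical helper, not part of Ollama, and it assumes that a reference without a tag resolves to `latest`.

```python
def split_model_reference(reference, default_tag="latest"):
    """Split 'name:tag' into (name, tag), using default_tag when no tag is given."""
    name, sep, tag = reference.rpartition(":")
    if not sep or "/" in tag:
        # No colon at all, or the last colon belongs to a host:port prefix.
        return reference, default_tag
    return name, tag or default_tag


print(split_model_reference("mistral-small"))
# ('mistral-small', 'latest')
print(split_model_reference("mistral-small:7b-instruct-v0.2"))  # hypothetical tag
# ('mistral-small', '7b-instruct-v0.2')
```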

### 4. Run the Model Locally

Once the model is downloaded, you can run it interactively.

*   **Run Interactively:** In the same terminal, type:
    ```bash
    ollama run mistral-small
    ```
*   **Interact:** The Ollama CLI will load the model, and you'll see a prompt like `>>> Send a message (/? for help)`. You can now type your questions or prompts and press Enter. The model will generate a response.
    *   Type `/?` for a list of commands within the interactive session (e.g., `/bye` to exit, `/save` to save a session).
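
To confirm from Python that a model has been pulled before you try to use it, you can ask the local Ollama server which models it has. The sketch below assumes the server is listening on the default `http://localhost:11434` and uses Ollama's `/api/tags` endpoint; the helper names are hypothetical, and the parsing assumes the response lists models under a `models` key with a `name` field.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumed default local address


def model_names_from_tags(payload):
    """Extract model names from the decoded JSON body of GET /api/tags."""
    return [entry["name"] for entry in payload.get("models", [])]


def is_model_available(model, names):
    """Return True if `model` is in `names`; a missing tag is treated as ':latest'."""
    wanted = model if ":" in model else model + ":latest"
    return wanted in names


def list_local_models(host=OLLAMA_HOST, timeout=5):
    """Query a running Ollama server for the models available locally."""
    with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as response:
        return model_names_from_tags(json.load(response))


# Usage (requires a running Ollama server):
# names = list_local_models()
# print(is_model_available("mistral-small", names))
```

Keeping the parsing separate from the network call makes the logic easy to check without a server.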

### 5. (Optional) Using the Model with Python

The Python code in this notebook is set up for Azure OpenAI. To use Ollama with Python (as briefly mentioned in the configuration cell):

*   **Install the Ollama Python Library:**
    ```bash
    pip install ollama
    ```
*   **Adapt the Python Script:** You would need to modify the Python script to:
    1.  Import the `ollama` library: `import ollama`
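    2.  Choose how to connect: the module-level functions (such as `ollama.chat`, used in the example below) talk to the default `http://localhost:11434`, while an explicit client, `client = ollama.Client()`, is useful when Ollama runs on a different host or port.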
    3.  Replace the Azure OpenAI API calls (`client.chat.completions.create`) with Ollama's equivalent, for example:
        ```python
        response = ollama.chat(
            model='mistral-small', # Or the model you pulled
            messages=[
                {'role': 'user', 'content': 'Why is the sky blue?'}
            ]
        )
        print(response['message']['content'])
        ```
    4.  Adjust prompts and parameters as needed, because different models respond best to different prompting styles.
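
One way to keep the rest of the script unchanged while switching backends is to hide the API difference behind a single function. The sketch below is an assumption-laden illustration, not the notebook's implementation: `complete_chat` and its `backend` parameter are hypothetical names. It relies only on the two call shapes shown in this notebook: `client.chat.completions.create(...)` for Azure OpenAI and `client.chat(...)` with `response['message']['content']` for Ollama.

```python
def complete_chat(client, model, system_prompt, user_prompt, backend="azure"):
    """Send one system + user message pair and return the reply text.

    `client` is an AzureOpenAI-style client when backend == "azure",
    or an ollama.Client-style object when backend == "ollama".
    """
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    if backend == "azure":
        # OpenAI-style call: the reply is on the first choice.
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content
    if backend == "ollama":
        # Ollama-style call: the reply is under the 'message' key.
        response = client.chat(model=model, messages=messages)
        return response["message"]["content"]
    raise ValueError(f"Unknown backend: {backend!r}")


# Usage (hypothetical; requires the respective client to be configured):
# text = complete_chat(ollama.Client(), "mistral-small", "Be concise.", "Why is the sky blue?", backend="ollama")
```

With a wrapper like this, functions such as `convert_text_to_markdown` would only need to call `complete_chat` instead of a backend-specific API.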

This covers the basics of running Ollama with a specific model such as `mistral-small` on a Windows 11 machine. For more advanced usage or troubleshooting, refer to the official Ollama documentation.

</span>

In [None]:
import os

# --- Configuration for Azure OpenAI ---
# IMPORTANT: To run this script, you need to set the following environment variables
# with your Azure OpenAI service credentials.
#
# How to set environment variables:
#
# 1. Using a .env file (Recommended for local development):
#    - Install the python-dotenv library: pip install python-dotenv
#    - Create a file named .env in the same directory as this script.
#    - Add your credentials to the .env file like this:
#      AZURE_OPENAI_ENDPOINT="your_azure_openai_endpoint_here"
#      AZURE_OPENAI_API_KEY="your_azure_openai_api_key_here"
#      AZURE_OPENAI_DEPLOYMENT_NAME="your_gpt4_deployment_name_here"
#      AZURE_OPENAI_API_VERSION="2024-02-01" # Or your specific API version
#    - The script will then load these variables.
#
# 2. Setting them directly in your shell (Temporary for a session):
#    For Linux/macOS:
#      export AZURE_OPENAI_ENDPOINT="your_azure_openai_endpoint"
#      export AZURE_OPENAI_API_KEY="your_azure_openai_api_key"
#      export AZURE_OPENAI_DEPLOYMENT_NAME="your_gpt4_deployment_name"
#      export AZURE_OPENAI_API_VERSION="your_api_version"
#
#    For Windows (Command Prompt; no quotes, because cmd stores them as part of the value):
#      set AZURE_OPENAI_ENDPOINT=your_azure_openai_endpoint
#      set AZURE_OPENAI_API_KEY=your_azure_openai_api_key
#      set AZURE_OPENAI_DEPLOYMENT_NAME=your_gpt4_deployment_name
#      set AZURE_OPENAI_API_VERSION=your_api_version
#
#    For Windows (PowerShell):
#      $env:AZURE_OPENAI_ENDPOINT="your_azure_openai_endpoint"
#      $env:AZURE_OPENAI_API_KEY="your_azure_openai_api_key"
#      $env:AZURE_OPENAI_DEPLOYMENT_NAME="your_gpt4_deployment_name"
#      $env:AZURE_OPENAI_API_VERSION="your_api_version"

# Attempt to load .env file if python-dotenv is installed
try:
    from dotenv import load_dotenv
    if load_dotenv():
        print("Loaded environment variables from .env file.")
    else:
        # load_dotenv() returns False when no .env file is found or when it sets no variables.
        # That is fine: os.getenv() below still picks up variables set in the shell.
        pass
except ImportError:
    print("python-dotenv library not found, .env file will not be loaded. "
          "Ensure environment variables are set manually if not using a .env file.")

AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
# Fall back to a default API version if AZURE_OPENAI_API_VERSION is not set in the environment
API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

# --- Information for Ollama Users ---
# This script is currently configured to use Azure OpenAI.
# If you wish to adapt this script or use Ollama for other projects:
#
# 1. Ollama Installation and Setup:
#    - Ensure Ollama is installed and running on your system.
#      Visit https://ollama.com for installation instructions.
#    - Download models you want to use, e.g., `ollama pull llama3`.
#
# 2. Ollama "Credentials" / Endpoint:
#    - Ollama typically runs a local server. The default API endpoint is
#      `http://localhost:11434`. This is the "address" you'd use to connect.
#    - If Ollama is running on a different host or port, or behind a reverse proxy,
#      you would use that specific URL.
#
# 3. Using Ollama with Python:
#    - You can use the `ollama` Python library: `pip install ollama`
#    - Example usage:
#      ```python
#      # import ollama
#      # client = ollama.Client(host='http://localhost:11434') # Or your custom host
#      # response = client.chat(
#      #   model='llama3', # Or any model you have pulled
#      #   messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
#      # )
#      # print(response['message']['content'])
#      ```
#    - To integrate Ollama into *this* script, you would need to:
#      - Replace the `AzureOpenAI` client initialization.
#      - Modify the `convert_text_to_markdown` and `pseudonymize_markdown` functions
#        to use `ollama.chat` or `ollama.generate` calls.
#      - Adjust prompts and parameters, as Ollama models may require different
#        prompting strategies and have different capabilities than Azure OpenAI's GPT-4.
#
# 4. OpenAI-Compatible Endpoint (Advanced):
#    - Some tools or versions of Ollama (or related projects like LiteLLM) can expose
#      an OpenAI-compatible API endpoint (e.g., at `/v1/chat/completions`).
#    - If you have such an endpoint, you might be able to use the `openai` Python library
#      by configuring the `base_url` and `api_key` (Ollama often doesn't require an API key,
#      so a dummy key like "ollama" might be used).
#      ```python
#      # from openai import OpenAI
#      # client = OpenAI(
#      #     base_url="http://localhost:11434/v1", # Example, check Ollama docs
#      #     api_key="ollama", # Or any non-empty string if not required
#      # )
#      ```
#      This approach would require fewer changes to the existing API call structures
#      but depends on the Ollama setup providing this compatibility layer.

# --- Safety checks for Azure OpenAI credentials ---
required_settings = {
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_API_KEY": AZURE_OPENAI_API_KEY,
    "AZURE_OPENAI_DEPLOYMENT_NAME": AZURE_OPENAI_DEPLOYMENT_NAME,
    "AZURE_OPENAI_API_VERSION": API_VERSION,  # Has a default, so normally present
}
# Collect the names of all settings that are unset or empty.
missing_vars = [name for name, value in required_settings.items() if not value]
if missing_vars:
    raise ValueError(
        f"Missing one or more Azure OpenAI environment variables: {', '.join(missing_vars)}. "
        "Please set them (e.g., in a .env file or directly in your environment) "
        "before running the script. Refer to the comments at the beginning of the script for instructions."
    )
else:
    print("Azure OpenAI environment variables loaded successfully.")

# Remaining imports used by the script logic below.
# The Azure OpenAI settings are NOT reassigned here: hardcoded values would
# override the ones loaded from the environment above and risk leaking the API key.
import glob  # Pattern matching for finding files
import fitz  # PyMuPDF
from openai import AzureOpenAI

# Specify the directory containing the PDF files
PDF_DIRECTORY_PATH = r"xxxx\LR_EPDs" # Use raw string for Windows paths
# Specify the output paths for the combined files;
# these are saved in the parent directory of the PDF folder
OUTPUT_COMBINED_MD_FILE_PATH = os.path.join(os.path.dirname(PDF_DIRECTORY_PATH), "combined_epds_markdown.md")
OUTPUT_COMBINED_PSEUDO_MD_FILE_PATH = os.path.join(os.path.dirname(PDF_DIRECTORY_PATH), "pseudo_combined_epds_markdown.md")

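# Categories of privacy-sensitive data (optional, but helps the AI).
# The entries are kept in Dutch to match the Dutch prompts; English glosses are in the comments.
PRIVACY_CATEGORIES = [
    "Persoonsnamen (patiënt, arts, etc.)",  # Personal names (patient, doctor, etc.)
    "Adressen",  # Addresses
    "Telefoonnummers",  # Telephone numbers
    "E-mailadressen",  # E-mail addresses
    "Geboortedata",  # Dates of birth
    "Burgerservicenummer (BSN) of andere ID-nummers",  # Citizen service number (BSN) or other ID numbers
    "Medische klachten, symptomen of diagnoses",  # Medical complaints, symptoms or diagnoses
    "Medische behandelingen, medicatie of procedures",  # Medical treatments, medication or procedures
    "Verzekeringsgegevens",  # Insurance details
    "Financiële gegevens",  # Financial details
    "Andere direct identificeerbare persoonlijke informatie"  # Other directly identifiable personal information
]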

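# Prompts for pseudonymization (kept in Dutch). English gloss of the system message:
# "Replace only the personal names (such as names of patients, doctors, staff, family
# members, etc.) in the supplied text with realistic, invented pseudonyms. Make sure the
# original Markdown formatting is fully preserved. Return *only* the modified text,
# without any explanation or extra commentary."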
PSEUDO_SYSTEM_MESSAGE_CONTENT = "Vervang in de aangeleverde tekst uitsluitend de persoonsnamen (zoals patiëntnamen, namen van artsen, medewerkers, familieleden, etc.) door realistische, verzonnen pseudoniemen. Zorg ervoor dat de originele markdown opmaak van de tekst volledig behouden blijft. Geef als antwoord *alleen* de aangepaste tekst terug, zonder enige uitleg of extra commentaar."
# PRIVACY_CATEGORIES is not injected into the prompts; the system message's instruction
# about personal names ("persoonsnamen") is relied on instead.

# Safety checks for placeholders
if AZURE_OPENAI_API_KEY == "<YOUR_AZURE_OPENAI_API_KEY>" or AZURE_OPENAI_DEPLOYMENT_NAME == "<YOUR_GPT4_DEPLOYMENT_NAME>":
    raise ValueError("Please replace the placeholder values for AZURE_OPENAI_API_KEY and AZURE_OPENAI_DEPLOYMENT_NAME.")

# --- PDF Text Extraction ---
def extract_text_from_pdf(pdf_path):
    """Extracts text content from all pages of a PDF file.

    Returns the extracted text, or None if the PDF cannot be read.
    """
    page_texts = []
    try:
        # The context manager closes the document even if a page fails to load.
        with fitz.open(pdf_path) as doc:
            for page in doc:
                page_text = page.get_text("text")  # Extract plain text
                if page_text:
                    page_texts.append(page_text)
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {e}")
        return None
    # Follow each page's text with a blank line as a separator between pages.
    return "".join(page_text + "\n\n" for page_text in page_texts)

# --- Azure OpenAI Client Initialization ---
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=API_VERSION
)

# --- Markdown Conversion using Azure OpenAI ---
def convert_text_to_markdown(text_content, pdf_filename):
    """Sends text to Azure OpenAI GPT-4 for Markdown conversion."""
    if not text_content:
        return None

    system_prompt = "You are an AI assistant specialized in converting raw text extracted from documents into well-structured and readable Markdown format. Retain the core meaning, structure (headings, lists, paragraphs), and technical details accurately. Do not add any conversational preamble or explanation outside the Markdown itself."
    user_prompt = f"""Please convert the following text, extracted from the PDF document '{pdf_filename}', into Markdown format.
Pay close attention to potential headings, subheadings, bullet points, numbered lists, code blocks, and paragraph breaks based on the text structure.
Format the output strictly as Markdown.

--- BEGIN PDF TEXT ({pdf_filename}) ---
{text_content}
--- END PDF TEXT ({pdf_filename}) ---

Generate only the Markdown content for this document.
"""

    print("  Converting text to Markdown...")

    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME, # Your deployment name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.2,
            max_tokens=24000
        )
        markdown_output = response.choices[0].message.content
        print("  Markdown conversion successful.")
        return markdown_output

    except Exception as e:
        print(f"Error converting text to Markdown for '{pdf_filename}': {e}")
        return None

# --- Pseudonymization using Azure OpenAI ---
def pseudonymize_markdown(markdown_content, pdf_filename):
    """Sends markdown text to Azure OpenAI GPT-4 for pseudonymization."""
    if not markdown_content:
        print(f"No markdown content to pseudonymize for {pdf_filename}.")
        return None

    # User prompt: the Markdown content followed by the Dutch pseudonymization
    # instructions (the same instructions as in the system message).
    pseudo_user_prompt = f"""{markdown_content}

--- Instructions ---
Vervang uitsluitend de persoonsnamen (zoals patiëntnamen, namen van artsen, medewerkers, familieleden, etc.) in de bovenstaande tekst door realistische, verzonnen pseudoniemen.
Zorg ervoor dat de originele markdown opmaak van de tekst volledig behouden blijft.
Geef als antwoord *alleen* de aangepaste tekst terug, zonder enige uitleg of extra commentaar.
"""
    # Note: PRIVACY_CATEGORIES is not used here, because the instructions above only target
    # personal names. To pass the categories as extra context, build the list outside the
    # f-string (backslashes inside f-string expressions are a syntax error before Python 3.12),
    # e.g. categories_text = "\n".join(PRIVACY_CATEGORIES), and append it to the prompt.


    print("  Pseudonymizing markdown...")

    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME, # Your deployment name
            messages=[
                {"role": "system", "content": PSEUDO_SYSTEM_MESSAGE_CONTENT},
                {"role": "user", "content": pseudo_user_prompt} # Use the constructed user prompt
            ],
            temperature=0.2,  # Low temperature for consistent output
            max_tokens=24000  # Same output limit as the Markdown conversion
        )
        pseudonymized_output = response.choices[0].message.content
        print("  Pseudonymization successful.")
        return pseudonymized_output

    except Exception as e:
        print(f"Error calling Azure OpenAI API for pseudonymization of '{pdf_filename}': {e}")
        # If transient API errors are common, consider retrying after a short delay,
        # e.g. time.sleep(5) (requires `import time`).
        return None

# --- Function to save individual Markdown files ---
def save_single_markdown_file(markdown_content, output_path):
    """Saves a single Markdown content string to a specified file."""
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(markdown_content)
        print(f"  Saved individual file: {os.path.basename(output_path)}")
    except Exception as e:
        print(f"Error writing individual Markdown file {output_path}: {e}")

# --- Function to save combined Markdown output ---
def save_combined_markdown_to_file(combined_markdown_content, output_path, file_description="Combined Markdown"):
    """Saves the combined Markdown content from all PDFs to a single file."""
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(combined_markdown_content)
        print(f"{file_description} successfully saved to: {output_path}")
    except Exception as e:
        print(f"Error writing {file_description} file {output_path}: {e}")


# --- Main Execution Logic ---
if __name__ == "__main__":
    # Check if the input directory exists
    if not os.path.isdir(PDF_DIRECTORY_PATH):
        # Raising SystemExit with a message prints it to stderr and exits with status 1.
        raise SystemExit(f"Error: Input directory not found at '{PDF_DIRECTORY_PATH}'")

    # Find PDF files with a lowercase or uppercase extension (*.pdf, *.PDF).
    # On case-insensitive file systems (e.g. Windows) both patterns match the same
    # files, so duplicates are removed before sorting alphabetically.
    pdf_files = sorted(set(
        glob.glob(os.path.join(PDF_DIRECTORY_PATH, "*.pdf"))
        + glob.glob(os.path.join(PDF_DIRECTORY_PATH, "*.PDF"))
    ))

    if not pdf_files:
        # Nothing to process: stop with a message instead of relying on the interactive exit() helper.
        raise SystemExit(f"No PDF files found in directory '{PDF_DIRECTORY_PATH}'")

    total_files = len(pdf_files)
    print(f"Found {total_files} PDF files to process in '{PDF_DIRECTORY_PATH}'.")

    all_markdown_content = []         # List for non-pseudonymized combined output
    all_pseudonymized_content = []    # List for pseudonymized combined output

    # Loop through each PDF file with enumeration for progress tracking
    for index, pdf_path in enumerate(pdf_files):
        pdf_filename = os.path.basename(pdf_path)
        # Calculate and print progress
        progress_percentage = ((index + 1) / total_files) * 100
        print(f"\n--- Processing file {index + 1} of {total_files} ({progress_percentage:.1f}%) : {pdf_filename} ---")

        # 1. Extract text from the current PDF
        extracted_text = extract_text_from_pdf(pdf_path)

        generated_markdown = None
        pseudonymized_markdown = None

        if extracted_text:
            # 2. Convert extracted text to Markdown using Azure OpenAI
            generated_markdown = convert_text_to_markdown(extracted_text, pdf_filename)

            if generated_markdown:
                # Save individual Non-Pseudonymized Markdown file
                base_filename = os.path.splitext(pdf_filename)[0]
                individual_md_output_path = os.path.join(PDF_DIRECTORY_PATH, f"{base_filename}.md")
                save_single_markdown_file(generated_markdown, individual_md_output_path)

                # Add content to the list for the combined Non-Pseudonymized file
                separator = f"## Source PDF: {pdf_filename}\n\n" # Header for combined file
                all_markdown_content.append(separator + generated_markdown + "\n\n---\n") # Add content with header and horizontal rule

                # 3. Pseudonymize the generated Markdown
                pseudonymized_markdown = pseudonymize_markdown(generated_markdown, pdf_filename)

                if pseudonymized_markdown:
                    # Save individual Pseudonymized Markdown file
                    individual_pseudo_md_output_path = os.path.join(PDF_DIRECTORY_PATH, f"pseudo_{base_filename}.md")
                    save_single_markdown_file(pseudonymized_markdown, individual_pseudo_md_output_path)

                    # Add content to the list for the combined Pseudonymized file
                    pseudo_separator = f"## Source PDF: {pdf_filename} (Pseudonymized)\n\n" # Header for combined pseudo file
                    all_pseudonymized_content.append(pseudo_separator + pseudonymized_markdown + "\n\n---\n")
                else:
                    print(f"Pseudonymization failed for {pdf_filename}; adding a placeholder to the combined pseudonymized output.")
                    # Add a placeholder to the combined pseudo file list
                    all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Failed)\n\n*Failed to pseudonymize this content.*\n\n---\n")

            else:
                print(f"Markdown conversion failed for {pdf_filename}; skipping the remaining steps for this file.")
                # Add placeholders to both combined lists
                all_markdown_content.append(f"## Source PDF: {pdf_filename}\n\n*Failed to convert this PDF to Markdown.*\n\n---\n")
                all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Skipped)\n\n*Markdown conversion failed.*\n\n---\n")
        else:
            print(f"No text could be extracted from {pdf_filename}; skipping the remaining steps for this file.")
            # Add placeholders to both combined lists
            all_markdown_content.append(f"## Source PDF: {pdf_filename}\n\n*Failed to extract text from this PDF.*\n\n---\n")
            all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Skipped)\n\n*Text extraction failed.*\n\n---\n")


    # 4. Combine all collected markdown content
    combined_content = "\n".join(all_markdown_content)
    combined_pseudo_content = "\n".join(all_pseudonymized_content)

    # 5. Save the combined content to the single output files
    if combined_content.strip():
        save_combined_markdown_to_file(combined_content, OUTPUT_COMBINED_MD_FILE_PATH, "Combined Non-Pseudonymized Markdown")
    else:
        print("No Non-Pseudonymized Markdown content was generated to save to the combined file.")

    if combined_pseudo_content.strip():
        save_combined_markdown_to_file(combined_pseudo_content, OUTPUT_COMBINED_PSEUDO_MD_FILE_PATH, "Combined Pseudonymized Markdown")
    else:
        print("No Pseudonymized Markdown content was generated to save to the combined file.")


    print("\nProcessing complete.")