<span style="font-size: 12px;">

To run the Python code in this notebook locally, you'll need to install several packages.

### Required Packages:

*   **python-dotenv**: Used for managing environment variables, especially for API keys and endpoints.
    ```bash
    pip install python-dotenv
    ```
*   **PyMuPDF**: A library for accessing PDF files (extracting text, images, etc.). It is imported as `fitz`.
    ```bash
    pip install PyMuPDF
    ```
*   **openai**: The official Python library for interacting with OpenAI APIs, including Azure OpenAI.
    ```bash
    pip install openai
    ```

The `os` and `glob` modules are part of the Python standard library and need no separate installation.

Make sure Python is installed on your system, then install the packages above with `pip`, Python's package installer. Using a virtual environment is recommended so that these dependencies stay isolated from your other projects.
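
As a quick check after installation, the sketch below reports which of the three packages cannot be imported. The helper name `find_missing_packages` and the `REQUIRED_MODULES` mapping are illustrative and not part of the notebook; the check only confirms that a module can be located, not that its version is compatible.

```python
import importlib.util

# Import names can differ from pip names (PyMuPDF is imported as `fitz`).
REQUIRED_MODULES = {
    "dotenv": "python-dotenv",
    "fitz": "PyMuPDF",
    "openai": "openai",
}


def find_missing_packages(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for module_name, pip_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = find_missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```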

</span>

<span style="font-size: 12px;">

## Using Azure OpenAI Credentials with a `.env` File

To keep your Azure OpenAI credentials out of your source code, store them in a `.env` file and load them as environment variables at runtime. The steps below show how to set this up.

### 1. Create a `.env` File

Create a file named `.env` in the root of your project directory with the following content. Replace the placeholder values with your actual Azure OpenAI details:

AZURE_OPENAI_ENDPOINT        =  https://xxxxxxx.openai.azure.com/

AZURE_OPENAI_API_KEY         =  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

AZURE_OPENAI_DEPLOYMENT_NAME =  GPT4.1

API_VERSION                  =  2024-12-01-preview
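
A missing or blank entry in the `.env` file otherwise only shows up later as an authentication or routing error. The sketch below checks the four variable names listed above before any API call is made. The helper `missing_settings` is a hypothetical name, and the check assumes `load_dotenv()` has already been called when it runs against the real environment.

```python
import os

REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "API_VERSION",
)


def missing_settings(env=None):
    """Return the names of required settings that are absent or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    # In the notebook, call load_dotenv() from python-dotenv before this check.
    missing = missing_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All required settings are present.")
```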



### Important Notes

- **Security:** Do **not** commit your `.env` file to any public repositories. Add `.env` to your `.gitignore` file to keep it private.
- **Flexibility:** Using environment variables makes it easier to manage different environments (development, staging, production) without changing your code.
- **Best Practice:** This approach helps keep your API keys and endpoints secure and separate from your application logic.
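
The security note above can be checked mechanically. This is a minimal sketch, assuming a `.gitignore` in the current directory. The helper `env_file_is_ignored` is a hypothetical name, and it only looks for a line that names `.env` exactly, so it does not implement full gitignore pattern matching (globs, negations, nested files).

```python
from pathlib import Path


def env_file_is_ignored(gitignore_text):
    """Return True if some line of the .gitignore text names `.env` exactly."""
    for line in gitignore_text.splitlines():
        entry = line.strip()
        if entry in {".env", "/.env"}:
            return True
    return False


if __name__ == "__main__":
    path = Path(".gitignore")
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if env_file_is_ignored(text):
        print(".env is listed in .gitignore")
    else:
        print("Warning: .env is not listed in .gitignore")
```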

---

By following these steps, you ensure that your Azure OpenAI credentials are managed securely and your code remains clean and maintainable.

</span>



```mermaid

flowchart LR
    A[Your Application] -->|"HTTPS (Encrypted)"| B[Azure OpenAI API]
    B -- Encrypted Response --> A
    C[.env File] -->|Loads Credentials| A
    A -- Managed Identity/RBAC --> D[Azure Active Directory]
    B -- Private Endpoint/VNet --> E[Azure Network Security]
    E -- Firewall/NSG Rules --> B
    A -- Logs Activity --> F[Azure Monitoring & Audit Logs]

```


In [None]:
# os is used to read the environment variables loaded from the .env file
import os

# Set up Azure OpenAI credentials
from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Retrieve credentials from environment variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME")
API_VERSION = os.getenv("API_VERSION")

# Initialize the AzureOpenAI client. The endpoint is passed as `azure_endpoint`;
# the deployment name is not set here but passed as `model` on each request.
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=API_VERSION,
)

# Now you can use `client` to interact with your Azure OpenAI deployment
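
Before processing any documents, it can help to confirm that the client and deployment respond at all. The sketch below wraps a single chat completion in a small helper. The function name `ask` is hypothetical, and the helper takes the client as a parameter so it can be tried with a stand-in object without network access.

```python
def ask(client, deployment_name, prompt):
    """Send one user message and return the text of the first choice."""
    response = client.chat.completions.create(
        model=deployment_name,  # On Azure OpenAI, `model` is the deployment name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

With the client created above, a call such as `ask(client, AZURE_OPENAI_DEPLOYMENT_NAME, "Reply with the word OK.")` should return a short reply if the endpoint, key, deployment name, and API version are all valid.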



In [None]:
# --- Import necessary libraries ---
import os    # Path handling (also imported in the previous cell)
import glob  # Pattern matching for file discovery
import fitz  # PyMuPDF


# --- Configuration for PDF Processing ---
# Directory containing the PDF files (raw string so Windows backslashes are kept literally)
PDF_DIRECTORY_PATH = r".\Lagerugpijn\LR_EPDs"
# Output paths for the two combined files, saved in the parent directory of the PDF folder
OUTPUT_COMBINED_MD_FILE_PATH = os.path.join(os.path.dirname(PDF_DIRECTORY_PATH), "combined_epds_markdown.md")
OUTPUT_COMBINED_PSEUDO_MD_FILE_PATH = os.path.join(os.path.dirname(PDF_DIRECTORY_PATH), "pseudo_combined_epds_markdown.md")

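# Categories of privacy-sensitive data (optional context for the model; not used by the prompts below).
# The entries are kept in Dutch, with an English gloss in the trailing comments.
PRIVACY_CATEGORIES = [
    "Persoonsnamen (patiënt, arts, etc.)",  # Personal names (patient, doctor, etc.)
    "Adressen",  # Addresses
    "Telefoonnummers",  # Telephone numbers
    "E-mailadressen",  # E-mail addresses
    "Geboortedata",  # Dates of birth
    "Burgerservicenummer (BSN) of andere ID-nummers",  # Citizen service number (BSN) or other ID numbers
    "Medische klachten, symptomen of diagnoses",  # Medical complaints, symptoms, or diagnoses
    "Medische behandelingen, medicatie of procedures",  # Medical treatments, medication, or procedures
    "Verzekeringsgegevens",  # Insurance details
    "Financiële gegevens",  # Financial details
    "Andere direct identificeerbare persoonlijke informatie"  # Other directly identifying personal information
]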

# Prompts for pseudonymization. The prompt is written in Dutch.
# English gloss: replace only the personal names in the supplied text (patients, doctors,
# staff, family members, etc.) with realistic, invented pseudonyms; preserve the original
# Markdown formatting; return only the modified text, without explanation or commentary.
PSEUDO_SYSTEM_MESSAGE_CONTENT = "Vervang in de aangeleverde tekst uitsluitend de persoonsnamen (zoals patiëntnamen, namen van artsen, medewerkers, familieleden, etc.) door realistische, verzonnen pseudoniemen. Zorg ervoor dat de originele markdown opmaak van de tekst volledig behouden blijft. Geef als antwoord *alleen* de aangepaste tekst terug, zonder enige uitleg of extra commentaar."
# PRIVACY_CATEGORIES is not passed to the model: the system prompt already limits
# the task to personal names ("persoonsnamen").

# Safety check: fail early if a required setting is missing from the environment
if not all([AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT_NAME, API_VERSION]):
    raise ValueError("Missing Azure OpenAI settings. Check AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT_NAME and API_VERSION in your .env file.")

# --- PDF Text Extraction ---
def extract_text_from_pdf(pdf_path):
    """Extracts text content from all pages of a PDF file."""
    text = ""
    try:
        # The context manager closes the document even if extraction fails
        with fitz.open(pdf_path) as doc:
            for page in doc:
                page_text = page.get_text("text")  # Extract plain text
                if page_text:
                    text += page_text + "\n\n"  # Add separator between pages
        return text
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {e}")
        return None

# --- Azure OpenAI Client Initialization  ---
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=API_VERSION
)

# --- Markdown Conversion using Azure OpenAI  ---
def convert_text_to_markdown(text_content, pdf_filename):
    """Sends text to Azure OpenAI GPT-4 for Markdown conversion."""
    if not text_content:
        return None

    system_prompt = "You are an AI assistant specialized in converting raw text extracted from documents into well-structured and readable Markdown format. Retain the core meaning, structure (headings, lists, paragraphs), and technical details accurately. Do not add any conversational preamble or explanation outside the Markdown itself."
    
    user_prompt = f"""Please convert the following text, extracted from the PDF document '{pdf_filename}', into Markdown format.
Pay close attention to potential headings, subheadings, bullet points, numbered lists, code blocks, and paragraph breaks based on the text structure.
Format the output strictly as Markdown.

--- BEGIN PDF TEXT ({pdf_filename}) ---
{text_content}
--- END PDF TEXT ({pdf_filename}) ---

Generate only the Markdown content for this document.
"""

    print("  Converting text to Markdown...")

    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME, # Your deployment name
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            temperature=0.2,
            max_tokens=24000
        )
        markdown_output = response.choices[0].message.content
        print("  Markdown conversion successful.")
        return markdown_output

    except Exception as e:
        print(f"Error converting text to Markdown for '{pdf_filename}': {e}")
        return None

# --- Pseudonymization using Azure OpenAI ---
def pseudonymize_markdown(markdown_content, pdf_filename):
    """Sends markdown text to Azure OpenAI GPT-4 for pseudonymization."""
    if not markdown_content:
        print(f"No markdown content to pseudonymize for {pdf_filename}.")
        return None

    # The document text comes first, followed by the instructions (in Dutch),
    # which repeat the system prompt: replace personal names only, keep the Markdown intact
    pseudo_user_prompt = f"""{markdown_content}

--- Instructions ---
Vervang uitsluitend de persoonsnamen (zoals patiëntnamen, namen van artsen, medewerkers, familieleden, etc.) in de bovenstaande tekst door realistische, verzonnen pseudoniemen.
Zorg ervoor dat de originele markdown opmaak van de tekst volledig behouden blijft.
Geef als antwoord *alleen* de aangepaste tekst terug, zonder enige uitleg of extra commentaar.
"""
    # PRIVACY_CATEGORIES is intentionally left out of the prompt: the task is limited to
    # personal names, which the system prompt and the instructions above already state.
    print("  Pseudonymizing markdown...")

    try:
        response = client.chat.completions.create(
            model=AZURE_OPENAI_DEPLOYMENT_NAME, # Your deployment name
            messages=[
                {"role": "system", "content": PSEUDO_SYSTEM_MESSAGE_CONTENT},
                {"role": "user", "content": pseudo_user_prompt} # Use the constructed user prompt
            ],
            temperature=0.2, # Keep low temperature
            max_tokens=24000 # Keep similar max_tokens
        )
        pseudonymized_output = response.choices[0].message.content
        print("  Pseudonymization successful.")
        return pseudonymized_output

    except Exception as e:
        print(f"Error calling Azure OpenAI API for pseudonymization of '{pdf_filename}': {e}")
        # Consider adding a small delay before retrying or failing permanently if this is common
        # time.sleep(5) # Example: wait for 5 seconds before potential retry logic
        return None

# --- Function to save individual Markdown files ---
def save_single_markdown_file(markdown_content, output_path):
    """Saves a single Markdown content string to a specified file."""
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(markdown_content)
        print(f"  Saved individual file: {os.path.basename(output_path)}")
    except Exception as e:
        print(f"Error writing individual Markdown file {output_path}: {e}")

# --- Function to save combined Markdown output ---
def save_combined_markdown_to_file(combined_markdown_content, output_path, file_description="Combined Markdown"):
    """Saves the combined Markdown content from all PDFs to a single file."""
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(combined_markdown_content)
        print(f"{file_description} successfully saved to: {output_path}")
    except Exception as e:
        print(f"Error writing {file_description} file {output_path}: {e}")


# --- Main Execution Logic ---
if __name__ == "__main__":
    # Check if the input directory exists
    if not os.path.isdir(PDF_DIRECTORY_PATH):
        raise SystemExit(f"Error: Input directory not found at '{PDF_DIRECTORY_PATH}'")

    # Find all PDF files in the directory, matching the extension case-insensitively,
    # and process them in alphabetical order
    pdf_files = sorted(
        path for path in glob.glob(os.path.join(PDF_DIRECTORY_PATH, "*"))
        if path.lower().endswith(".pdf")
    )

    if not pdf_files:
        raise SystemExit(f"No PDF files found in directory '{PDF_DIRECTORY_PATH}'")

    total_files = len(pdf_files)
    print(f"Found {total_files} PDF files to process in '{PDF_DIRECTORY_PATH}'.")

    all_markdown_content = []         # List for non-pseudonymized combined output
    all_pseudonymized_content = []    # List for pseudonymized combined output

    # Loop through each PDF file with enumeration for progress tracking
    for index, pdf_path in enumerate(pdf_files):
        pdf_filename = os.path.basename(pdf_path)
        # Calculate and print progress
        progress_percentage = ((index + 1) / total_files) * 100
        print(f"\n--- Processing file {index + 1} of {total_files} ({progress_percentage:.1f}%) : {pdf_filename} ---")

        # 1. Extract text from the current PDF
        extracted_text = extract_text_from_pdf(pdf_path)

        generated_markdown = None
        pseudonymized_markdown = None

        if extracted_text:
            # 2. Convert extracted text to Markdown using Azure OpenAI
            generated_markdown = convert_text_to_markdown(extracted_text, pdf_filename)

            if generated_markdown:
                # Save individual Non-Pseudonymized Markdown file
                base_filename = os.path.splitext(pdf_filename)[0]
                individual_md_output_path = os.path.join(PDF_DIRECTORY_PATH, f"{base_filename}.md")
                save_single_markdown_file(generated_markdown, individual_md_output_path)

                # Add content to the list for the combined Non-Pseudonymized file
                separator = f"## Source PDF: {pdf_filename}\n\n" # Header for combined file
                all_markdown_content.append(separator + generated_markdown + "\n\n---\n") # Add content with header and horizontal rule

                # 3. Pseudonymize the generated Markdown
                pseudonymized_markdown = pseudonymize_markdown(generated_markdown, pdf_filename)

                if pseudonymized_markdown:
                    # Save individual Pseudonymized Markdown file
                    individual_pseudo_md_output_path = os.path.join(PDF_DIRECTORY_PATH, f"pseudo_{base_filename}.md")
                    save_single_markdown_file(pseudonymized_markdown, individual_pseudo_md_output_path)

                    # Add content to the list for the combined Pseudonymized file
                    pseudo_separator = f"## Source PDF: {pdf_filename} (Pseudonymized)\n\n" # Header for combined pseudo file
                    all_pseudonymized_content.append(pseudo_separator + pseudonymized_markdown + "\n\n---\n")
                else:
                    print(f"Pseudonymization failed for {pdf_filename}; adding a placeholder to the combined pseudonymized output.")
                    # Add a placeholder to the combined pseudo file list
                    all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Failed)\n\n*Failed to pseudonymize this content.*\n\n---\n")

            else:
                print(f"Markdown conversion failed for {pdf_filename}; skipping pseudonymization.")
                # Add placeholders to both combined lists
                all_markdown_content.append(f"## Source PDF: {pdf_filename}\n\n*Failed to convert this PDF to Markdown.*\n\n---\n")
                all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Skipped)\n\n*Markdown conversion failed.*\n\n---\n")
        else:
            print(f"Text extraction failed or returned no text for {pdf_filename}; skipping Markdown conversion and pseudonymization.")
            # Add placeholders to both combined lists
            all_markdown_content.append(f"## Source PDF: {pdf_filename}\n\n*Failed to extract text from this PDF.*\n\n---\n")
            all_pseudonymized_content.append(f"## Source PDF: {pdf_filename} (Pseudonymization Skipped)\n\n*Text extraction failed.*\n\n---\n")


    # 4. Combine all collected markdown content
    combined_content = "\n".join(all_markdown_content)
    combined_pseudo_content = "\n".join(all_pseudonymized_content)

    # 5. Save the combined content to the single output files
    if combined_content.strip():
        save_combined_markdown_to_file(combined_content, OUTPUT_COMBINED_MD_FILE_PATH, "Combined Non-Pseudonymized Markdown")
    else:
        print("No Non-Pseudonymized Markdown content was generated to save to the combined file.")

    if combined_pseudo_content.strip():
        save_combined_markdown_to_file(combined_pseudo_content, OUTPUT_COMBINED_PSEUDO_MD_FILE_PATH, "Combined Pseudonymized Markdown")
    else:
        print("No Pseudonymized Markdown content was generated to save to the combined file.")


    print("\nProcessing complete.")
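
The error handler in `pseudonymize_markdown` mentions waiting before a possible retry but does not implement one. The sketch below shows one way to do that with exponential backoff. The helper `call_with_retries` and its parameters are hypothetical, and it retries on any exception, which is broader than a production version would likely want.

```python
import time


def call_with_retries(func, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call func() until it succeeds or `attempts` calls have failed.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0)
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

To use it, the API call would have to raise on failure instead of returning `None`; for example, the `client.chat.completions.create(...)` call inside `pseudonymize_markdown` could be wrapped as `call_with_retries(lambda: client.chat.completions.create(...))`.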