# Policy Generator Notebook

This notebook demonstrates the methodology for generating IDS-compliant policies as described in our conference paper. It is organized into the following sections:

1. **Introduction & Setup**: Overview and required imports.
2. **Data Preprocessing**: Loading and cleaning the data.
3. **Model & Policy Generation**: Implementation of policy generation including API requests.
4. **Results & Analysis**: Evaluation and visualization of the generated policies.
5. **Conclusion**: Summary and directions for future work.

_All sensitive details (e.g., API keys) have been removed or replaced with secure references._

In [None]:
# Setup: Import necessary libraries and configure environment

import os
import json
import time
import re
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set API keys via environment variables for security.
AZURE_API_KEY = os.getenv('AZURE_API_KEY')  # Replace with your secure API key variable
if not AZURE_API_KEY:
    raise EnvironmentError("AZURE_API_KEY is not set in the environment variables.")

# Define endpoints (replace with your own if needed)
AZURE_MODEL_URL = "https://your-azure-endpoint.com/score"

# Additional configuration for reproducibility
np.random.seed(42)
print("Setup complete. Libraries imported and API key loaded securely.")

## Data Preprocessing

This section loads the dataset, cleans it, and prepares it for policy generation. The dataset is expected to be a CSV file with a `Policy` column holding the natural-language descriptions that the generation step reads.

In [None]:
# Function to load and preprocess the dataset
def load_data(filepath):
    """
    Load dataset from a CSV file.

    Parameters:
    - filepath: str, path to the CSV file

    Returns:
    - pandas.DataFrame: Preprocessed data
    """
    data = pd.read_csv(filepath)
    # Data cleaning steps: remove duplicates and forward-fill missing values.
    # (fillna(method='ffill') is deprecated in recent pandas; ffill() is the replacement.)
    data = data.drop_duplicates()
    data = data.ffill()
    return data

# Update the filepath as needed
data_filepath = 'data/policy_data.csv'
data = load_data(data_filepath)
print(f"Data loaded and preprocessed. Shape: {data.shape}")
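
The loader above does not verify that the CSV actually contains the column the later cells depend on. The following is a minimal sketch of such a check, assuming `Policy` (the column read in the generation loop) is the only required one. The helper name `check_required_columns` and the in-memory demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def check_required_columns(df, required=("Policy",)):
    """Raise a ValueError listing any required columns missing from df."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing required columns: {missing}")
    return df

# Tiny in-memory frame standing in for data/policy_data.csv (illustrative only)
demo = pd.DataFrame({"Policy": ["Allow research use of dataset A.", None]})
check_required_columns(demo)
print(f"Rows with a description: {demo['Policy'].notna().sum()}")
```

Calling the check right after `load_data` would surface a schema problem before any API calls are made.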

## Model & Policy Generation

This section contains code that constructs and sends requests to the policy generation API. It uses system instructions combined with user-provided descriptions to generate policies that adhere to the IDS ontology.

In [None]:
# Define system messages and valid IDS ontology terms for generating policies
valid_terms = ["ids:leftOperand", "ids:rightOperand", "ids:operator", "ids:constraint"]

# Example policy structure (as reference)
example_policy = {
    "policy": {
        "id": "example-policy",
        "permissions": [
            {
                "assignee": "example-user",
                "action": "use",
                "constraints": [
                    {"leftOperand": "ids:purpose", "operator": "ids:eq", "rightOperand": "research"}
                ]
            }
        ]
    }
}

# Optimized system messages to guide the API
system_messages = [
    "You are an expert in International Data Spaces (IDS) policy creation.",
    "Generate policies strictly aligned with the IDS ontology.",
    "Use only the following defined terms: " + ", ".join(valid_terms),
    "Refer to https://w3id.org/idsa/core/ for ontology definitions.",
    "Here is an example policy:",
    json.dumps(example_policy, indent=4),
    "Return output as valid JSON without markdown or extra formatting."
]

# Combine system messages into a single context string
context = "\n".join(system_messages)
print("System messages prepared.")
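
Because the system messages ask the model to use only the listed terms, it can help to check a policy against that allow-list. This sketch assumes that every ontology term appears as an `ids:`-prefixed token somewhere in the policy's JSON text. The helper name `find_unknown_ids_terms` is hypothetical.

```python
import json
import re

def find_unknown_ids_terms(policy, allowed_terms):
    """Return the sorted ids:-prefixed tokens in policy that are not in allowed_terms."""
    text = json.dumps(policy)
    found = set(re.findall(r"ids:[A-Za-z_][A-Za-z0-9_]*", text))
    return sorted(found - set(allowed_terms))

# Same allow-list and constraint as in the cell above, repeated so the sketch is self-contained
allowed = ["ids:leftOperand", "ids:rightOperand", "ids:operator", "ids:constraint"]
sample = {
    "constraints": [
        {"leftOperand": "ids:purpose", "operator": "ids:eq", "rightOperand": "research"}
    ]
}
print(find_unknown_ids_terms(sample, allowed))
```

Applied to the example constraint, this check reports `ids:eq` and `ids:purpose`, which suggests that either the allow-list or the example policy may need adjusting so the two agree.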

In [None]:
def query_policy_api(prompt):
    """
    Send a request to the policy generation API with the given prompt.

    Parameters:
    - prompt: str, the user-provided description for which to generate a policy

    Returns:
    - dict: The JSON response from the API as a dictionary
    """
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {AZURE_API_KEY}"
    }

    # Create the full prompt by appending user input to the system context
    full_prompt = f"{context}\n\nUser: {prompt}\nAssistant:"

    payload = {
        "input_data": {
            "input_string": [full_prompt],
            "parameters": {
                "temperature": 0.7,
                "top_p": 0.95,
                "max_new_tokens": 800
            }
        }
    }

    # A timeout keeps the call from hanging indefinitely; 60 seconds is an assumed value.
    response = requests.post(AZURE_MODEL_URL, headers=headers, json=payload, timeout=60)
    response.raise_for_status()  # Raise an error for any HTTP issues
    response_json = response.json()

    # Extract the generated policy from the API response
    if isinstance(response_json, dict) and "output" in response_json:
        output_text = response_json["output"]
        try:
            # Try parsing the output directly as JSON
            return json.loads(output_text)
        except json.JSONDecodeError:
            # Fallback: strip a leading ```json (or bare ```) fence and a trailing ``` fence, then retry
            cleaned_text = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", output_text.strip())
            return json.loads(cleaned_text)
    else:
        raise ValueError("Unexpected API response format.")

# Example usage with a sample prompt
sample_prompt = "Generate an access control policy for research data sharing."
policy_response = query_policy_api(sample_prompt)
print("Policy generated:")
print(json.dumps(policy_response, indent=4))
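
Model output sometimes wraps the JSON in explanatory text or Markdown fences, which a plain `json.loads` rejects. As a hedged alternative to string cleaning, the sketch below scans for the first decodable JSON object using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_json_object` and the sample string are hypothetical.

```python
import json

def extract_json_object(text):
    """Return the first JSON object embedded in text, or raise ValueError."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object here; try the next opening brace
            start = text.find("{", start + 1)
    raise ValueError("No JSON object found in model output.")

raw = 'Here is the policy:\n```json\n{"policy": {"id": "p1"}}\n```'
print(extract_json_object(raw))
```

This approach tolerates leading and trailing chatter, but it returns only the first object it can decode, so it assumes the policy is the first JSON object in the response.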

## Policy Generation on Dataset

This cell iterates through the dataset, generates a policy for each description, and saves the result back to the DataFrame.

In [None]:
# Ensure the DataFrame has a column for storing the generated policies.
if 'Generated_Policy' not in data.columns:
    data['Generated_Policy'] = None

processed_count = 0
wait_time = 60  # Time in seconds to wait between API calls to avoid rate limits

# Iterate through each row and generate a policy for rows with valid descriptions
for index, row in data.iterrows():
    description = row.get('Policy', None)
    if pd.notnull(description):
        try:
            policy = query_policy_api(description)
            data.at[index, 'Generated_Policy'] = json.dumps(policy, indent=4)
            print(f"Policy generated for row {index}.")
        except Exception as e:
            print(f"Error generating policy for row {index}: {e}")
        processed_count += 1
        time.sleep(wait_time)

print(f"Attempted policy generation for {processed_count} rows.")
print("Sample of generated policies:")
print(data[['Policy', 'Generated_Policy']].head(10))

# Optionally, save the DataFrame with generated policies to an Excel file.
# Note: pandas needs an Excel writer engine such as openpyxl installed for .xlsx output.
output_file_path = 'Generated_Policies.xlsx'
data.to_excel(output_file_path, index=False)
print(f"Generated policies saved to {output_file_path}.")
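
The loop above pauses for a fixed `wait_time` after every call, whether or not the call hit a rate limit. One common alternative is to retry only failed calls with exponentially growing delays. The sketch below illustrates that idea under stated assumptions: `call_with_backoff`, its parameters, and the `flaky` stand-in for `query_policy_api` are all hypothetical, and the delays are collected in a list instead of being slept so the example runs instantly.

```python
import time

def call_with_backoff(func, *args, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: propagate the last error
            sleep(base_delay * 2 ** attempt)

# Stand-in for query_policy_api that fails twice before succeeding
calls = {"count": 0}

def flaky(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return {"policy": prompt}

delays = []
result = call_with_backoff(flaky, "demo", sleep=delays.append)
print(result, delays)
```

In the notebook, `flaky` would be replaced by `query_policy_api` and the default `sleep=time.sleep` kept. Suitable values for `max_retries` and `base_delay` depend on the endpoint's actual rate limits, which the notebook does not state.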

## Results & Analysis

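This section provides a first look at the results. The visualization below summarizes the input descriptions in the `Policy` column. The `evaluate_model` helper is included only as a template for cases where a trained model and a labeled test set are available.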

In [None]:
from sklearn.metrics import accuracy_score

# Example: Evaluate model accuracy if applicable (this example uses a simple accuracy metric)
def evaluate_model(model, test_set):
    X_test, y_test = test_set
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    return acc

# A histogram of raw text is not meaningful, so plot the distribution of description lengths.
description_lengths = data['Policy'].dropna().astype(str).str.len()
plt.figure(figsize=(8, 6))
plt.hist(description_lengths, bins=20, alpha=0.7)
plt.title('Distribution of Description Lengths')
plt.xlabel('Description length (characters)')
plt.ylabel('Frequency')
plt.show()
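
A simple first measure of output quality is the share of generated policies that parse as JSON and have the expected shape. The sketch below assumes the structure of `example_policy` (a top-level `policy` object with a non-empty `permissions` list) is the target format. The helper name `is_well_formed` and the demo rows are hypothetical and do not represent real results.

```python
import json
import pandas as pd

def is_well_formed(policy_json):
    """True if the string parses as JSON and policy.permissions is a non-empty list."""
    if not isinstance(policy_json, str):
        return False
    try:
        parsed = json.loads(policy_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    policy = parsed.get("policy")
    permissions = policy.get("permissions") if isinstance(policy, dict) else None
    return isinstance(permissions, list) and len(permissions) > 0

# Illustrative rows standing in for the Generated_Policy column
demo = pd.DataFrame({"Generated_Policy": [
    json.dumps({"policy": {"id": "p1", "permissions": [{"action": "use"}]}}),
    "not json",
    None,
]})
valid_share = demo["Generated_Policy"].apply(is_well_formed).mean()
print(f"Well-formed policies: {valid_share:.0%}")
```

This is only a structural check. It says nothing about whether a policy is semantically correct with respect to the IDS ontology.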

## Conclusion

This notebook walked through data preprocessing, policy generation via an API, and a first analysis of the results. Future work could involve refining the model, reducing API call wait times, or integrating more sophisticated policy validation.