# APIM ❤️ Microsoft Foundry

## Foundry Models Evals lab
![flow](../../images/foundry-models-evals.gif)

Playground to experiment with [Microsoft Foundry cloud evaluations](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation). This lab demonstrates how to extract LLM request/response data from Azure API Management's built-in logging (`ApiManagementGatewayLlmLog`) and use it as input for running evaluations in Microsoft Foundry.

### Prerequisites

- [Python 3.12 or later version](https://www.python.org/) installed
- [VS Code](https://code.visualstudio.com/) installed with the [Jupyter notebook extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) enabled
- [Python environment](https://code.visualstudio.com/docs/python/environments#_creating-environments) with the packages from [requirements.txt](../../requirements.txt) installed (run `pip install -r requirements.txt` in your terminal)
- [An Azure Subscription](https://azure.microsoft.com/free/) with [Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/privileged#contributor) + [RBAC Administrator](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/privileged#role-based-access-control-administrator) or [Owner](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/privileged#owner) roles
- [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) installed and [signed in to your Azure subscription](https://learn.microsoft.com/cli/azure/authenticate-azure-cli-interactively)

▶️ Click `Run All` to execute all steps sequentially, or execute them `Step by Step`...

<a id='0'></a>
### 0️⃣ Initialize notebook variables

- Resources will be suffixed by a unique string based on your subscription id.
- Adjust the location parameters according to your preferences and the [product availability by Azure region](https://azure.microsoft.com/explore/global-infrastructure/products-by-region/?cdn=disable&products=cognitive-services,api-management).
- Adjust the models and versions according to their [availability by region](https://learn.microsoft.com/azure/ai-services/openai/concepts/models).

In [None]:
import os, sys, json
sys.path.insert(1, '../../shared')  # add the shared directory to the Python path
import utils

deployment_name = os.path.basename(os.path.dirname(globals()['__vsc_ipynb_file__']))
resource_group_name = f"lab-{deployment_name}" # change the name to match your naming style
resource_group_location = "swedencentral"  # Must be a region that supports risk and safety evaluators

aiservices_config = [{"name": "foundry1", "location": "swedencentral"}]

models_config = [{"name": "gpt-4.1-mini", "publisher": "OpenAI", "version": "2025-04-14", "sku": "GlobalStandard", "capacity": 100}]

apim_sku = 'Basicv2'
apim_subscriptions_config = [{"name": "subscription1", "displayName": "Subscription 1"}]

inference_api_path = "inference"  # path to the inference API in the APIM service
inference_api_type = "AzureOpenAIV1"  # options: AzureOpenAI, AzureAI, OpenAI, PassThrough
inference_api_version = "v1"
foundry_project_name = deployment_name

utils.print_ok('Notebook initialized')

<a id='1'></a>
### 1️⃣ Verify the Azure CLI and the connected Azure subscription

The following command confirms that the Azure CLI is connected to your Azure subscription and prints the current user, tenant ID, and subscription ID.

In [None]:
output = utils.run("az account show", "Retrieved az account", "Failed to get the current az account")

if output.success and output.json_data:
    current_user = output.json_data['user']['name']
    tenant_id = output.json_data['tenantId']
    subscription_id = output.json_data['id']

    utils.print_info(f"Current user: {current_user}")
    utils.print_info(f"Tenant ID: {tenant_id}")
    utils.print_info(f"Subscription ID: {subscription_id}")

<a id='2'></a>
### 2️⃣ Create deployment using 🦾 Bicep

This lab uses [Bicep](https://learn.microsoft.com/azure/azure-resource-manager/bicep/overview?tabs=bicep) to declaratively define all the resources that will be deployed in the specified resource group. Change the parameters or the [main.bicep](main.bicep) directly to try different configurations.
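
The parameter file written by the next cell follows the ARM deployment-parameters shape, in which every value is wrapped in a `{"value": ...}` object. A minimal sketch of that wrapping is shown here; `to_arm_parameters` is a hypothetical helper for illustration and is not part of the lab's `utils` module.

```python
import json

def to_arm_parameters(values: dict) -> dict:
    """Wrap plain Python values in the ARM deployment-parameters envelope."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
        "contentVersion": "1.0.0.0",
        # Each parameter becomes {"value": <actual value>}
        "parameters": {name: {"value": value} for name, value in values.items()},
    }

# Illustrative values only; the lab builds the same structure by hand.
example = to_arm_parameters({"apimSku": "Basicv2", "inferenceAPIPath": "inference"})
print(json.dumps(example["parameters"], indent=2))
```

Building the envelope in one place keeps parameter names and their values side by side, which can make it easier to spot a parameter that `main.bicep` expects but the notebook does not supply.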

In [None]:
# Create the resource group if it doesn't exist
utils.create_resource_group(resource_group_name, resource_group_location)

# Define the Bicep parameters
bicep_parameters = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "apimSku": { "value": apim_sku },
        "aiServicesConfig": { "value": aiservices_config },
        "modelsConfig": { "value": models_config },
        "apimSubscriptionsConfig": { "value": apim_subscriptions_config },
        "inferenceAPIPath": { "value": inference_api_path },
        "inferenceAPIType": { "value": inference_api_type },
        "foundryProjectName": { "value": foundry_project_name },
    }
}

# Write the parameters to the params.json file
with open('params.json', 'w') as bicep_parameters_file:
    bicep_parameters_file.write(json.dumps(bicep_parameters))

# Run the deployment
output = utils.run(f"az deployment group create --name {deployment_name} --resource-group {resource_group_name} --template-file main.bicep --parameters params.json",
    f"Deployment '{deployment_name}' succeeded", f"Deployment '{deployment_name}' failed")

<a id='3'></a>
### 3️⃣ Get the deployment outputs

Retrieve the gateway URL, the Foundry project endpoint, the Log Analytics workspace ID, and the subscription keys from the deployment outputs. The following cells need them for testing.
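
For orientation, `az deployment group show` returns the template outputs under `properties.outputs`, each with a `type` and a `value`. The sketch below reads them from a trimmed, made-up sample. `get_output_value` is a hypothetical helper, and the lab's own `utils.get_deployment_output` may behave differently.

```python
import json

# Trimmed, made-up sample of the JSON returned by `az deployment group show`.
sample = json.loads("""
{
  "properties": {
    "outputs": {
      "apimResourceGatewayURL": {"type": "String", "value": "https://example.azure-api.net"},
      "apimSubscriptions": {"type": "Array", "value": [{"name": "subscription1", "key": "0000abcd"}]}
    }
  }
}
""")

def get_output_value(deployment: dict, name: str, default=None):
    """Return the value of one deployment output, or `default` if it is absent."""
    outputs = deployment.get("properties", {}).get("outputs", {})
    return outputs.get(name, {}).get("value", default)

gateway_url = get_output_value(sample, "apimResourceGatewayURL")
first_key = get_output_value(sample, "apimSubscriptions")[0]["key"]
print(gateway_url)
print(f"****{first_key[-4:]}")  # mask the key when printing, as the lab does
```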

In [None]:
# Obtain all of the outputs from the deployment
output = utils.run(f"az deployment group show --name {deployment_name} -g {resource_group_name}", f"Retrieved deployment: {deployment_name}", f"Failed to retrieve deployment: {deployment_name}")

if output.success and output.json_data:
    log_analytics_id = utils.get_deployment_output(output, 'logAnalyticsWorkspaceId', 'Log Analytics Id')
    apim_service_id = utils.get_deployment_output(output, 'apimServiceId', 'APIM Service Id')
    apim_resource_gateway_url = utils.get_deployment_output(output, 'apimResourceGatewayURL', 'APIM API Gateway URL')
    foundry_project_endpoint = utils.get_deployment_output(output, 'foundryProjectEndpoint', 'Foundry Project Endpoint')
    apim_subscriptions = json.loads(utils.get_deployment_output(output, 'apimSubscriptions').replace("\'", "\""))
    for subscription in apim_subscriptions:
        subscription_name = subscription['name']
        subscription_key = subscription['key']
        utils.print_info(f"Subscription Name: {subscription_name}")
        utils.print_info(f"Subscription Key: ****{subscription_key[-4:]}")
    api_key = apim_subscriptions[0].get("key") # default api key to the first subscription key

<a id='requests'></a>
### 🧪 Generate test data by calling the API

We'll make several API calls to generate LLM request/response data that will be logged to `ApiManagementGatewayLlmLog`. This data will later be extracted and used for evaluations.

In [None]:
import json, requests, time

# Sample prompts to generate diverse evaluation data
test_prompts = [
    {"system": "You are a helpful assistant.", "user": "What is the capital of France?"},
    {"system": "You are a helpful assistant.", "user": "Explain quantum computing in simple terms."},
    {"system": "You are a helpful assistant.", "user": "Write a haiku about programming."},
    {"system": "You are a customer service agent.", "user": "I want to return a product I bought last week."},
    {"system": "You are a helpful assistant.", "user": "What are the benefits of exercise?"},
    {"system": "You are a technical support agent.", "user": "My computer won't turn on. What should I do?"},
    {"system": "You are a travel advisor.", "user": "Recommend a vacation destination for families."},
    {"system": "You are a helpful assistant.", "user": "Summarize the plot of Romeo and Juliet."},
]

url = f"{apim_resource_gateway_url}/{inference_api_path}/openai/v1/chat/completions?api-version={inference_api_version}"

session = requests.Session()
session.headers.update({
    'api-key': api_key,
    'x-user-id': 'eval-test-user'
})

try:
    for i, prompt in enumerate(test_prompts):
        print(f"▶️ Request {i+1}/{len(test_prompts)}: {prompt['user'][:50]}...")
        
        messages = {
            "messages": [
                {"role": "system", "content": prompt['system']},
                {"role": "user", "content": prompt['user']}
            ], 
            "model": models_config[0]['name']
        }
        
        start_time = time.time()
        response = session.post(url, json=messages)
        response_time = time.time() - start_time
        
        if response.status_code == 200:
            data = json.loads(response.text)
            print(f"✅ Response: {data.get('choices')[0].get('message').get('content')[:100]}...")
            print(f"⌚ {response_time:.2f} seconds\n")
        else:
            print(f"❌ Error: {response.text}\n")
        
        time.sleep(0.5)  # Small delay between requests
finally:
    session.close()

utils.print_ok(f"Generated {len(test_prompts)} test requests for evaluation")
utils.print_info("Wait a few minutes for logs to be ingested into Log Analytics before proceeding.")

<a id='extract'></a>
### 📊 Extract LLM logs from API Management

Query the `ApiManagementGatewayLlmLog` table to extract prompts and completions. This data will be formatted for use with Microsoft Foundry evaluations.
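
The query in the next cell has five stages: pick the table, filter by time, project the columns of interest, aggregate to one row per `CorrelationId`, and cap the row count. As a rough sketch, the same single-line query can be assembled from a list of stages, which avoids backslash continuations and makes the time window and limit adjustable. `build_llm_log_query` and its parameters are hypothetical names used only for this illustration.

```python
def build_llm_log_query(lookback: str = "1h", limit: int = 50) -> str:
    """Assemble the KQL pipeline as a single-line string."""
    stages = [
        "ApiManagementGatewayLlmLog",
        f"where TimeGenerated > ago({lookback})",
        "project TimeGenerated, CorrelationId, DeploymentName, ModelName, "
        "TotalTokens, RequestMessages, ResponseMessages",
        # One output row per request: rows sharing a CorrelationId are merged.
        "summarize TimeGenerated = max(TimeGenerated), "
        "DeploymentName = take_any(DeploymentName), "
        "ModelName = take_any(ModelName), "
        "TotalTokens = sum(TotalTokens), "
        "RequestMessages = take_any(RequestMessages), "
        "ResponseMessages = take_any(ResponseMessages) by CorrelationId",
        f"take {limit}",
    ]
    return " | ".join(stages)

print(build_llm_log_query(lookback="2h", limit=10))
```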

In [None]:
import pandas as pd

# Query to extract LLM request/response data from APIM logs
query = "ApiManagementGatewayLlmLog \
| where TimeGenerated > ago(1h) \
| project TimeGenerated, CorrelationId, DeploymentName, ModelName, TotalTokens, RequestMessages, ResponseMessages \
| summarize TimeGenerated = max(TimeGenerated), DeploymentName = take_any(DeploymentName), ModelName = take_any(ModelName), TotalTokens = sum(TotalTokens), RequestMessages = take_any(RequestMessages), ResponseMessages = take_any(ResponseMessages) by CorrelationId \
| take 50"

output = utils.run(f'az monitor log-analytics query -w {log_analytics_id} --analytics-query "{query}"', 
                   "Retrieved LLM logs", "Failed to retrieve LLM logs")

llm_logs = []
if output.success and output.json_data:
    llm_logs = output.json_data
    df = pd.DataFrame(llm_logs)
    utils.print_ok(f"Retrieved {len(llm_logs)} log entries")
    display(df[['TimeGenerated', 'DeploymentName', 'ModelName', 'TotalTokens', 'RequestMessages', 'ResponseMessages']].head(10))
else:
    utils.print_warning("No logs found. Wait a few minutes for logs to be ingested and try again.")

<a id='transform'></a>
### 🔄 Transform logs to evaluation dataset format

Convert the extracted LLM logs into the JSONL format required by Microsoft Foundry evaluations. The format includes `query` (user input) and `response` (model output) fields.
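
JSONL stores one JSON object per line. The short sketch below, using made-up records, shows that `json.dumps` escapes newlines inside strings, so a multi-line model response still occupies a single line and survives a round trip.

```python
import json

# Made-up records in the shape this lab produces.
records = [
    {"query": "What is the capital of France?", "response": "Paris.",
     "context": "Model: gpt-4.1-mini", "ground_truth": ""},
    {"query": "Write a haiku about programming.",
     "response": "Code flows like water\nBugs hide in the quiet lines\nTests bring the sunrise",
     "context": "Model: gpt-4.1-mini", "ground_truth": ""},
]

# One serialized object per line; embedded newlines become the two characters \n.
jsonl_text = "\n".join(json.dumps(record) for record in records)

# Reading it back: parse each line independently.
parsed = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(parsed), "records round-tripped")
```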

In [None]:
import json

def parse_request_messages(request_messages_str):
    """Extract the user query from request messages."""
    try:
        if isinstance(request_messages_str, str):
            messages = json.loads(request_messages_str)
        else:
            messages = request_messages_str
        
        # Find the user message
        for msg in messages:
            if msg.get('role') == 'user':
                return msg.get('content', '')
        return ''
    except Exception:  # fall back to the raw value if parsing fails
        return str(request_messages_str)

def parse_response_content(response_messages_str):
    """Extract the assistant response from response content."""
    try:
        if isinstance(response_messages_str, str):
            content = json.loads(response_messages_str)
        else:
            content = response_messages_str
        
        # Handle different response formats
        if isinstance(content, dict):
            if 'choices' in content:
                return content['choices'][0].get('message', {}).get('content', '')
            elif 'content' in content:
                return content['content']
        return str(content)
    except Exception:  # fall back to the raw value if parsing fails
        return str(response_messages_str)

# Transform logs to evaluation format
eval_data = []
for log in llm_logs:
    query = parse_request_messages(log.get('RequestMessages', ''))
    response = parse_response_content(log.get('ResponseMessages', ''))
    
    if query and response:
        eval_data.append({
            "query": query,
            "response": response,
            "context": f"Model: {log.get('ModelName', 'unknown')}",
            "ground_truth": ""  # Can be populated manually for specific evaluations
        })

utils.print_ok(f"Transformed {len(eval_data)} entries for evaluation")

# Preview the evaluation data
if eval_data:
    print("\n📋 Sample evaluation entry:")
    print(json.dumps(eval_data[0], indent=2))

<a id='save'></a>
### 💾 Save evaluation dataset to JSONL file

Save the transformed data as a JSONL file that can be uploaded to Microsoft Foundry for evaluation.

In [None]:
import json

eval_data_file = "evaluation_data.jsonl"

with open(eval_data_file, 'w') as f:
    for entry in eval_data:
        f.write(json.dumps(entry) + '\n')

utils.print_ok(f"Saved {len(eval_data)} entries to {eval_data_file}")

<a id='foundry'></a>
### 🚀 Run evaluation in Microsoft Foundry

Upload the evaluation dataset to Microsoft Foundry so that it can be used with evaluators such as coherence, fluency, and relevance. This cell only connects to the project and uploads the file.

In [None]:
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from datetime import datetime, timezone

# Create the Foundry project client
project_client = AIProjectClient(
    endpoint=foundry_project_endpoint,
    credential=DefaultAzureCredential(),
)

utils.print_ok("Connected to Microsoft Foundry project")

# Upload the evaluation dataset
dataset_name = f"apim-llm-logs-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"
dataset_version = "1"

dataset = project_client.datasets.upload_file(
    name=dataset_name,
    version=dataset_version,
    file_path=eval_data_file,
)

utils.print_ok(f"Uploaded dataset: {dataset_name}")
utils.print_info(f"Dataset ID: {dataset.id}")

<a id='create-eval'></a>
### 📝 Create and run the evaluation

Define the evaluation criteria and run the evaluation. This cell uses a `label_model` grader, which asks the deployed model to label each response as `good` or `bad`; only `good` counts as passing.
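
The grader prompt in the next cell refers to dataset fields through `{{item.<field>}}` placeholders. The sketch below only illustrates what such a placeholder is meant to resolve to for one row. The real substitution happens in the evaluation service, and `render_template` is a hypothetical helper that does not claim to reproduce its exact rules.

```python
import re

def render_template(template: str, item: dict) -> str:
    """Replace {{item.<field>}} placeholders with values from one dataset row."""
    def substitute(match: re.Match) -> str:
        # Missing fields render as an empty string in this sketch.
        return str(item.get(match.group(1), ""))
    return re.sub(r"\{\{\s*item\.(\w+)\s*\}\}", substitute, template)

row = {"query": "What is the capital of France?", "response": "Paris."}
prompt = render_template("Query: {{item.query}}\nResponse: {{item.response}}", row)
print(prompt)
```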

In [None]:
import os
import json
import time
from datetime import datetime, timezone
from pprint import pprint

from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from openai.types.evals.create_eval_jsonl_run_data_source_param import (
    CreateEvalJSONLRunDataSourceParam,
    SourceFileID,
)

model_deployment_name = models_config[0]['name']

with DefaultAzureCredential() as credential:
    with AIProjectClient(endpoint=foundry_project_endpoint, credential=credential) as project_client:
        
        print("Creating an OpenAI client from the AI Project client")
        client = project_client.get_openai_client(api_version="2025-04-01-preview")

        # Upload file using OpenAI Files API (NOT Foundry datasets)
        print("Uploading file using OpenAI Files API...")
        with open(eval_data_file, "rb") as f:
            uploaded_file = client.files.create(
                file=f,
                purpose="evals"
            )
        print(f"Uploaded file ID: {uploaded_file.id}")

        data_source_config = {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "response": {"type": "string"},
                    "context": {"type": "string"},
                    "ground_truth": {"type": "string"},
                },
                "required": [],
            },
            "include_sample_schema": False,
        }

        testing_criteria = [
            {
                "type": "label_model",
                "name": "quality_check",
                "model": model_deployment_name,
                "input": [
                    {"role": "system", "content": "Rate the response quality as 'good' or 'bad'."},
                    {"role": "user", "content": "Query: {{item.query}}\nResponse: {{item.response}}"}
                ],
                "labels": ["good", "bad"],
                "passing_labels": ["good"]
            }
        ]

        print("Creating Eval Group")
        eval_object = client.evals.create(
            name=f"APIM LLM Logs Evaluation - {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M')}",
            data_source_config=data_source_config,
            testing_criteria=testing_criteria,
        )
        print(f"Eval Group created: {eval_object.id}")

        print("Creating Eval Run with OpenAI File ID")
        eval_run_object = client.evals.runs.create(
            eval_id=eval_object.id,
            name=f"apim-logs-eval-run-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}",
            metadata={"source": "ApiManagementGatewayLlmLog", "lab": "foundry-evals"},
            data_source=CreateEvalJSONLRunDataSourceParam(
                type="jsonl",
                source=SourceFileID(
                    type="file_id",
                    id=uploaded_file.id,  # Use OpenAI file ID, not dataset.id
                ),
            ),
        )
        print(f"Eval Run created: {eval_run_object.id}")

        # Poll until the run completes or fails
        while True:
            run = client.evals.runs.retrieve(
                run_id=eval_run_object.id, 
                eval_id=eval_object.id
            )
            print(f"Status: {run.status}")
            
            if run.status in ("completed", "failed", "canceled"):  # stop on any terminal status
                output_items = list(
                    client.evals.runs.output_items.list(
                        run_id=run.id, 
                        eval_id=eval_object.id
                    )
                )
                pprint(output_items[:3])  # Show first 3 results
                print(f"\n🔗 Eval Run Report URL: {run.report_url}")
                break

            time.sleep(5)
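
The polling loop above keeps running until the service reports a terminal status. A minimal sketch of the same pattern with a timeout follows, so that a run stuck in a non-terminal state does not block the notebook indefinitely. `poll_until` is a hypothetical helper, not part of the lab or of any SDK, and the status values are passed in rather than assumed.

```python
import time

def poll_until(fetch_status, terminal, interval=5.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Run still '{status}' after {timeout} seconds")
        sleep(interval)

# Demonstration with a fake status source instead of a real evaluation run.
statuses = iter(["queued", "in_progress", "completed"])
final = poll_until(lambda: next(statuses), terminal=("completed", "failed"),
                   interval=0, timeout=5)
print(final)
```

In this notebook, `fetch_status` could be a lambda that calls `client.evals.runs.retrieve(run_id=eval_run_object.id, eval_id=eval_object.id).status`.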

<a id='clean'></a>
### 🗑️ Clean up resources

When you're finished with the lab, you should remove all your deployed resources from Azure to avoid extra charges and keep your Azure subscription uncluttered.
Use the [clean-up-resources notebook](clean-up-resources.ipynb) for that.