# Testing AOAI Integration Features in the AIP SDK

This notebook walks through the following features of the AI SDK:
- Creating grader objects by:
    - Directly supplying an OAI grader config
    - Using grader-specific classes.
- Submitting graders to the `evaluate` method to start an evaluation
- Submitting graders to the remote evaluation service

## Pre-setup
Check that you have the following two files available. They should have been included in the zip that contains this notebook:
- test_eval_input.jsonl : a simple test dataset (the name must match the `fname` value set below)
- azure_ai_evaluation-1.6.0-py3-none-any.whl : The pre-release variant of the AI SDK that contains the changes of interest.
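Before running anything against the service, it can help to confirm that the dataset has the columns the later cells refer to. The sketch below is only an illustration: it assumes the dataset uses `query`, `response`, and `ground_truth` columns (the names referenced by the graders and column mappings later in this notebook), and `missing_columns` is a hypothetical helper, not part of the SDK. It runs against a throwaway file with made-up rows; point it at your own JSONL file instead.

```python
import json
import os
import tempfile

# Assumed column names; adjust to match your dataset.
REQUIRED_COLUMNS = {"query", "response", "ground_truth"}


def missing_columns(path, required=REQUIRED_COLUMNS):
    """Return {line_number: sorted missing column names} for incomplete rows."""
    problems = {}
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            missing = required - row.keys()
            if missing:
                problems[line_number] = sorted(missing)
    return problems


# Demo on a throwaway file with made-up rows.
sample_rows = [
    {"query": "What is the capital of France?", "response": "Paris", "ground_truth": "Paris"},
    {"query": "hello there", "response": "hi"},  # ground_truth is missing on purpose
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for row in sample_rows:
        tmp.write(json.dumps(row) + "\n")
    sample_path = tmp.name

sample_problems = missing_columns(sample_path)
os.remove(sample_path)
print(sample_problems)
```

An empty dictionary means every row has all of the assumed columns.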

Install the AI and projects SDKs:

In [None]:
!pip install azure.ai.projects
!pip install azure_ai_evaluation-1.6.0-py3-none-any.whl

Set up the credentials needed for later logic. Fill these in with your own values, or with the values supplied outside of this notebook during the bugbash:

In [1]:
# TODO fill in with an API key, or use your own credentials
model_config = {
    "azure_endpoint": "https://jamahaja-gpt4o-westus2.openai.azure.com/",
    # "api_key": "...",
    "api_version": "2025-04-01-preview",
    "azure_deployment": "gpt-4o"
}
project = {
    "subscription_id": "2d385bf4-0756-4a76-aa95-28bf9ed3b625",
    "resource_group_name": "rg-quso-ai-canary",
    "project_name": "qusong-canary"
}
fname="test_eval_input.jsonl"

# Use .get() so a missing "api_key" entry does not raise a KeyError
if model_config.get("api_key") == "...":
    raise ValueError("Please set your API key in the code snippet.")



## Create grader objects

The AIP SDK wraps OpenAI grader configurations together with the necessary credentials, so that any grader can be evaluated without additional context. Five objects are created below: a normal evaluator for comparison, three grader-specific objects, and one general grader. The two grader types are as follows:
- Grader-specific classes. The first three graders each have unique, required inputs that simplify their use. These inputs are then plugged directly into the corresponding OAI grader configuration object. The graders accounted for are listed in the OAI API [here](https://github.com/openai/openai-python/blob/main/src/openai/types/eval_create_params.py#L151).
- The general `AzureOpenAIGrader` class. Rather than accepting the exact inputs needed to create a specific grader configuration, this class simply accepts a single dictionary and performs no validation upon it. This inputted object is assumed to be an OAI-API-ready configuration. This class is expected to be used by OAI veterans and users who want to test bleeding-edge features that have yet to be accounted for by the other specific grader classes.

In [3]:
from openai.types.eval_string_check_grader import EvalStringCheckGrader
from azure.ai.evaluation import (
    AzureOpenAILabelGrader,
    AzureOpenAIStringCheckGrader,
    AzureOpenAITextSimilarityGrader,
    AzureOpenAIGrader,
    F1ScoreEvaluator,
)

# create a normal evaluator for comparison
f1_eval = F1ScoreEvaluator()

## ---- Initialize specific graders ----

# Corresponds to https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_text_similarity_grader.py#L11
sim_grader = AzureOpenAITextSimilarityGrader(
    model_config=model_config,
    evaluation_metric="fuzzy_match",
    input="{{item.query}}",
    name="similarity",
    pass_threshold=1,
    reference="{{item.query}}",
)

# Corresponds to https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_string_check_grader_param.py#L10
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.query}}",
    name="starts with what is",
    operation="like",
    reference="What is",
)

# Corresponds to https://github.com/openai/openai-python/blob/ed53107e10e6c86754866b48f8bd862659134ca8/src/openai/types/eval_create_params.py#L132
label_grader = AzureOpenAILabelGrader(
    model_config=model_config,
    input=[{"content": "{{item.query}}", "role": "user"}],
    labels=["too short", "just right", "too long"],
    passing_labels=["just right"],
    model="gpt-4o",
    name="label",
)

# ---- General Grader Initialization ----

# Define a string check grader config directly using the OAI SDK
oai_string_check_grader = EvalStringCheckGrader(
    input="{{item.query}}",
    name="contains hello",
    operation="like",
    reference="hello",
    type="string_check"
)
# Plug that into the general grader
general_grader = AzureOpenAIGrader(
    model_config=model_config,
    grader_config=oai_string_check_grader
)
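The `{{item.query}}` strings above are templates that refer to a column of each dataset row. As a rough mental model only, the sketch below renders such templates locally and applies a string check. It rests on two assumptions that this notebook does not confirm: that `{{item.<column>}}` is replaced by that row's value, and that the `like` operation behaves as a case-sensitive "contains" test. `render_template` and `string_check_like` are hypothetical helpers; the real grading is done by the service.

```python
import re

# Matches {{item.<column>}} placeholders, allowing optional inner whitespace.
_ITEM_REF = re.compile(r"\{\{\s*item\.(\w+)\s*\}\}")


def render_template(template, item):
    """Replace each {{item.<column>}} placeholder with the row's value."""
    return _ITEM_REF.sub(lambda m: str(item[m.group(1)]), template)


def string_check_like(config, item):
    """Local approximation of a string_check grader with operation 'like'.

    ASSUMPTION: 'like' is treated as a case-sensitive substring test.
    """
    value = render_template(config["input"], item)
    reference = render_template(config["reference"], item)
    return reference in value


# Same shape as the grader config used in this notebook.
config = {
    "input": "{{item.query}}",
    "name": "contains hello",
    "operation": "like",
    "reference": "hello",
    "type": "string_check",
}

# Made-up rows for illustration.
rows = [{"query": "hello world"}, {"query": "What is an evaluator?"}]
results = [string_check_like(config, row) for row in rows]
print(results)
```

Under these assumptions the first row passes and the second fails, because only the first query contains the reference string.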



## "Local" evaluation

Using the `evaluate` method, we can evaluate the test dataset against the graders. The word "local" is quoted because, although `evaluate` runs on your machine, the graders themselves are evaluated through calls to the OAI API. The resulting logs will note this. Click on the Studio URL in the output to view the resulting evaluation online.

In [None]:
from azure.ai.evaluation import evaluate
evaluation = evaluate(
    data=fname,
    evaluators={
        "label": label_grader,
        "general": general_grader,
        "string": string_grader,
        "similarity": sim_grader,
        "f1": f1_eval,
    },
    azure_ai_project=project
)
evaluation
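The returned object can be long, so a compact summary is sometimes easier to read. The sketch below assumes the result is a dictionary with `metrics`, `rows`, and `studio_url` keys; treat those key names as an assumption to verify against your installed SDK version. `summarize_result` is a hypothetical helper, and the mock result (including the metric name) is invented so the example runs without calling the service.

```python
def summarize_result(result):
    """Build a short, printable summary of an evaluation result dictionary."""
    lines = []
    # Aggregate metrics, sorted for stable output
    for name, value in sorted(result.get("metrics", {}).items()):
        lines.append(f"{name}: {value}")
    # Number of per-row results
    lines.append(f"rows: {len(result.get('rows', []))}")
    # Link to the online view, if one was produced
    url = result.get("studio_url")
    if url:
        lines.append(f"studio_url: {url}")
    return lines


# Mock result with made-up values; replace with the real `evaluation` object.
mock_result = {
    "metrics": {"f1.f1_score": 0.5},
    "rows": [{}, {}],
    "studio_url": None,
}
summary = summarize_result(mock_result)
print("\n".join(summary))
```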

## Remote evaluation

Using the Projects SDK, we can have the remote evaluation service evaluate graders for us. The existing `EvaluatorConfiguration` object is already capable of handling graders as inputs.

Start by filling in some asset ids you will need for remote evaluation. The first two have already been filled in with functional defaults, but the third value will need to be pulled from the front page of the Foundry.

In [2]:
dataset_id ="azureml://locations/eastus2euap/workspaces/e362e695-8af6-42b9-8fcf-dd271e7d0d53/data/generated_response/versions/1"
environment_id = "azureml://registries/jamahaja-evals-registry/environments/etwinter-aoai/versions/12"
project_connection_string = "TODO FILL ME IN PLEASE"
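A quick guard can catch a forgotten placeholder before the run is submitted. The sketch below assumes the connection string has four semicolon-separated parts (host name, subscription id, resource group, project name); verify that format against your Projects SDK version. `check_connection_string` is a hypothetical helper, and the sample value is made up.

```python
PLACEHOLDER = "TODO FILL ME IN PLEASE"

# Assumed order of the semicolon-separated parts.
PART_NAMES = ("host_name", "subscription_id", "resource_group_name", "project_name")


def check_connection_string(conn_str):
    """Return the named parts of a connection string, or raise ValueError."""
    if conn_str == PLACEHOLDER:
        raise ValueError("Please replace the placeholder connection string.")
    parts = [part.strip() for part in conn_str.split(";")]
    if len(parts) != len(PART_NAMES) or not all(parts):
        raise ValueError(
            f"Expected {len(PART_NAMES)} non-empty parts separated by ';', got {len(parts)}."
        )
    return dict(zip(PART_NAMES, parts))


# Made-up value for illustration only.
parsed = check_connection_string("example.host;0000-sub;my-rg;my-project")
print(parsed["project_name"])

try:
    check_connection_string(PLACEHOLDER)
    placeholder_rejected = False
except ValueError:
    placeholder_rejected = True
```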

Next, define the evaluator configs that will be supplied to the remote evaluation service:

In [3]:
from azure.ai.projects.models import EvaluatorConfiguration
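# Use the same class names as the grader cell above; the short aliases
# keep the rest of this cell unchanged.
from azure.ai.evaluation import (
    AzureOpenAILabelGrader as LabelGrader,
    AzureOpenAIStringCheckGrader as StringCheckGrader,
    AzureOpenAITextSimilarityGrader as TextSimilarityGrader,
    AzureOpenAIGrader as AoaiGrader,
)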


f1_eval_config = EvaluatorConfiguration(
    id="azureml://registries/azureml/models/F1Score-Evaluator/versions/4",
    init_params={},
)


sim_grader_config = EvaluatorConfiguration(
    id=TextSimilarityGrader.id,
    init_params={
        "model_config": model_config,
        "evaluation_metric": "fuzzy_match",
        "input": "{{item.query}}",
        "name": "similarity",
        "pass_threshold": 1,
        "reference": "{{item.query}}",
    },
)

string_grader_config = EvaluatorConfiguration(
    id=StringCheckGrader.id,
    init_params={
        "model_config": model_config,
        "input": "{{item.query}}",
        "name": "contains hello",
        "operation": "like",
        "reference": "hello",
    },
)

label_grader_config = EvaluatorConfiguration(
    id=LabelGrader.id,
    init_params={
        "model_config": model_config,
        "input": [{"content": "{{item.query}}", "role": "user"}],
        "labels": ["too short", "just right", "too long"],
        "passing_labels": ["just right"],
        "model": "gpt-4o",
        "name": "label",
    },
)

general_grader_config = EvaluatorConfiguration(
    id=AoaiGrader.id,
    init_params={
        "model_config": model_config,
        "grader_config": {
            "input": "{{item.query}}",
            "name": "contains hello",
            "operation": "like",
            "reference": "hello",
            "type": "string_check",
        },
    },
)


Finally, submit the remote evaluation run:

In [None]:
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import Evaluation, Dataset
from azure.identity import DefaultAzureCredential
# Note you might want to change the run name to avoid confusion with others
run_name = "Test Remote AOAI Evaluation"
evaluation = Evaluation(
    display_name=run_name,
    description="Evaluation started by test_remote_aoai_evaluation e2e test.",
    evaluators = {
        "f1": f1_eval_config,
        "label": label_grader_config,
        "general": general_grader_config,
        "string": string_grader_config,
        "similarity": sim_grader_config,
    },
    data=Dataset(id=dataset_id),
    properties={"Environment": environment_id},
)
project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=project_connection_string,
)
created_evaluation = project_client.evaluations.create(evaluation)
print(f"review remote evaluation results at {created_evaluation.properties['AiStudioEvaluationUri']}")


## More examples

The following code blocks showcase more configurations and edge cases.

### Adding Target mappings

In this example, we redefine the string grader to require an input called "new_input" that is not found in the original dataset. This value is instead produced by a target function.

In [None]:
from azure.ai.evaluation import evaluate
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.new_input}}",
    name="contains hello",
    operation="like",
    reference="hello",
)

def target_fn(query: str) -> dict:
    return {"new_input": query.replace("a", "e")}

evaluation = evaluate(
    data=fname,
    evaluators={
        "label": label_grader,
        "general": general_grader,
        "string": string_grader,
        "similarity": sim_grader,
    },
    azure_ai_project=project,
    target=target_fn,
    _use_run_submitter_client=True
)
evaluation
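To make the role of the target function concrete, the sketch below applies the same `target_fn` to a single made-up row and merges its output into that row. The merge step is only an illustration of why `{{item.new_input}}` becomes available to the grader; it is an assumption about the effect, not a description of how the SDK implements it.

```python
def target_fn(query: str) -> dict:
    # Same transformation as in the cell above
    return {"new_input": query.replace("a", "e")}


# Made-up dataset row for illustration.
row = {"query": "What is a banana?"}

# Illustrative merge: the target output adds columns next to the original ones.
row_with_target = {**row, **target_fn(row["query"])}
print(row_with_target)
```

After the merge, the row carries both the original `query` column and the generated `new_input` column.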

### Grader-specific column mappings

This example runs an evaluation in which some graders have their own column mappings, which cause their 'item.query' value to be derived from a different column in the dataset, instead of matching the original dataset's query column.

This causes each grader to be evaluated by a separate OAI eval run, since there would otherwise be a risk of conflicting column mappings between graders. Determining more precisely when this caution is needed is a post-build TODO item.

In [None]:
from azure.ai.evaluation import evaluate
string_grader = AzureOpenAIStringCheckGrader(
    model_config=model_config,
    input="{{item.newer_input}}",
    name="contains hello",
    operation="like",
    reference="hello",
)

def target_fn(query: str) -> dict:
    return {"new_input": query.replace("a", "e")}

evaluation = evaluate(
    data=fname,
    evaluators={
        "label": label_grader,
        "general": general_grader,
        "string": string_grader,
        "similarity": sim_grader,
    },
    azure_ai_project=project,
    target=target_fn,
    evaluation_config={
        "label": {
            "column_mapping": {
                "query": "${data.ground_truth}",
            }
        },
        "general": {
            "column_mapping": {
                "query": "${data.response}",
            }
        },
        "string": {
            "column_mapping": {
                "newer_input": "${target.new_input}",
            }
        },
    },
    _use_run_submitter_client=True
)
evaluation
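The sketch below illustrates why per-grader column mappings can conflict. It assumes that `${data.<column>}` refers to a dataset column and `${target.<column>}` to a target function output, as the mappings above suggest. `apply_column_mapping` is a hypothetical helper written for this illustration and is not the SDK's implementation; the row values are made up.

```python
import re

# Matches references of the form ${data.<column>} or ${target.<column>}.
_MAPPING_REF = re.compile(r"^\$\{(data|target)\.(\w+)\}$")


def apply_column_mapping(column_mapping, data_row, target_output):
    """Build the item one grader would see after its column mapping is applied."""
    sources = {"data": data_row, "target": target_output}
    item = dict(data_row)  # start from the unmapped dataset columns
    for grader_column, reference in column_mapping.items():
        match = _MAPPING_REF.match(reference)
        if match is None:
            raise ValueError(f"Unsupported mapping reference: {reference!r}")
        source, column = match.groups()
        item[grader_column] = sources[source][column]
    return item


# Made-up row and target output for illustration.
data_row = {
    "query": "What is a cat?",
    "response": "hello, a cat is an animal",
    "ground_truth": "An animal",
}
target_output = {"new_input": "Whet is e cet?"}

label_item = apply_column_mapping({"query": "${data.ground_truth}"}, data_row, target_output)
general_item = apply_column_mapping({"query": "${data.response}"}, data_row, target_output)
string_item = apply_column_mapping({"newer_input": "${target.new_input}"}, data_row, target_output)

print(label_item["query"])
print(general_item["query"])
print(string_item["newer_input"])
```

Here the "label" and "general" graders each expect a different value under `query`, so a single shared item could not satisfy both. That is the conflict that evaluating each grader in its own run avoids.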