# Evaluating Claude models with AzureML

In this notebook, you will learn how to run evaluations of Anthropic's Claude models using the AzureML SDK. Along with this notebook, we've included a preconfigured set of 12 evaluations that use well-known public datasets (e.g., MMLU, HellaSwag, Winogrande).

Please see the [Azure AI Leaderboard](https://ai.azure.com/explore/leaderboard) for other supported model benchmarks and for more details on the eval datasets.

## Prerequisites
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with compute cluster - [Configure workspace](../../configuration.ipynb)
- The Azure Machine Learning Python SDK v2, installed - [install instructions](../../../README.md) (see the Getting Started section)
- A Python environment with [mlflow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-configure-tracking?view=azureml-api-2&tabs=python%2Cmlflow) for retrieving eval metrics
- Access keys for Claude endpoints on the [Amazon Web Services Bedrock platform](https://aws.amazon.com/bedrock/claude/)

## Configuring a Workspace connection for Bedrock access
You will use a Workspace connection to securely store Bedrock access keys. Follow the steps below to create a custom-type connection:
- Follow directions for [creating a custom connection in the AzureML studio UI](https://learn.microsoft.com/en-us/azure/machine-learning/prompt-flow/tools-reference/python-tool?view=azureml-api-2#create-a-custom-connection)
- Add the following two key-value pairs to the custom connection:
  1. A key named `AccessKey` with a value containing your AWS access key
  2. A key named `SecretKey` with a value containing your AWS secret access key 
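
Before saving the connection, it can help to confirm that both entries are present and non-empty. The following is a minimal sketch, not part of the evaluation pipeline: the helper name `missing_connection_keys` and the placeholder values are hypothetical, and only the key names `AccessKey` and `SecretKey` come from the steps above.

```python
# Key names taken from the steps above; everything else here is illustrative.
REQUIRED_KEYS = ("AccessKey", "SecretKey")


def missing_connection_keys(secrets: dict) -> list:
    """Return the required key names that are absent or have blank values."""
    return [
        key
        for key in REQUIRED_KEYS
        if not str(secrets.get(key) or "").strip()
    ]


# Placeholder values only; never hard-code real credentials in a notebook.
example_secrets = {"AccessKey": "EXAMPLE_ACCESS_KEY_ID", "SecretKey": ""}
print(missing_connection_keys(example_secrets))  # ['SecretKey']
```

An empty list means both key-value pairs are filled in.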

## Configuring and running an evaluation pipeline
In the following cell, set the global values for your AzureML Workspace, the Bedrock endpoint you want to call, the name of the connection you created in the previous step, and the name of the eval you want to run. Supported evals are the following: mmlu_humanities, \<put others here as we verify them>.

You can also set the sample ratio, which is the fraction of the selected dataset to use for the eval.

**Warning**: Many datasets contain thousands of examples, which can lead to high endpoint usage costs. We advise starting with a small sample ratio (e.g., 1%) to verify the pipeline and then increasing the ratio if desired.
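
To get a feel for what a given sample ratio means in endpoint calls, a rough estimate can be computed as below. This is a sketch under stated assumptions: the dataset size of 10,000 is a made-up figure, the helper `estimated_examples` is hypothetical, and the pipeline's own sampling may round differently.

```python
def estimated_examples(dataset_size: int, sample_ratio: float) -> int:
    """Approximate number of examples evaluated for a given sample ratio.

    Assumes simple proportional sampling; the actual pipeline may round
    or sample differently.
    """
    if not 0 < sample_ratio <= 1:
        raise ValueError("sample_ratio must be in the interval (0, 1]")
    return round(dataset_size * sample_ratio)


# Hypothetical dataset of 10,000 examples sampled at 1%.
print(estimated_examples(10_000, 0.01))  # 100
```

Each sampled example corresponds to at least one endpoint request, so this number gives a lower bound on the calls a run will make.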

In [20]:
# AzureML settings
subscription_id = '72c03bf3-4e69-41af-9532-dfcdc3eefef4'
resource_group = 'aml-benchmarking'
workspace_name = 'aml-benchmarking-rd'
experiment_name = 'benchmark-claude-v2_1'
compute_name = 'cpu-cluster-benchmarking'

# Eval config values
eval_name = 'mmlu_stem'
bedrock_endpoint_url = 'https://bedrock-runtime.us-east-1.amazonaws.com/model/anthropic.claude-v2:1/invoke'
connection_name = 'bedrock-test'

# Sample ratio - what fraction of the dataset to run for the eval? 
# **WARNING** be aware of endpoint costs!
sample_ratio = 0.01
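
The `bedrock_endpoint_url` above combines an AWS region and a model ID. If you want to target a different region or Claude version, a small helper along these lines can assemble the URL. This is a sketch: the function name `bedrock_invoke_url` is hypothetical, the URL pattern is inferred from the value in the cell above, and which regions and model IDs are available depends on your AWS account.

```python
def bedrock_invoke_url(region: str, model_id: str) -> str:
    """Assemble a Bedrock runtime invoke URL from a region and a model ID.

    The pattern mirrors the URL used in the configuration cell above.
    """
    return (
        f"https://bedrock-runtime.{region}.amazonaws.com"
        f"/model/{model_id}/invoke"
    )


# Reproduces the endpoint used in this notebook.
print(bedrock_invoke_url("us-east-1", "anthropic.claude-v2:1"))
```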

Run the following cell to get an `MLClient` for communicating with your Workspace:

In [24]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# client for AzureML Workspace actions
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

Finally, to launch the evaluation pipeline job, run the following:

In [25]:
from azure.ai.ml import load_job

# Load the pipeline definition for the selected eval from its YAML file
pipeline_job = load_job(f'./evaluation_pipelines/{eval_name}.yaml')

# Set pipeline job inputs
pipeline_job.settings.default_compute = compute_name
pipeline_job.inputs.endpoint_url = bedrock_endpoint_url
pipeline_job.inputs.ws_connection_name = connection_name
pipeline_job.inputs.sample_ratio = sample_ratio

# Start the job in the Workspace
returned_job = ml_client.jobs.create_or_update(
    pipeline_job,
    experiment_name=experiment_name
)
returned_job

| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| benchmark-claude-v2_1 | sweet_actor_fps92km9m8 | pipeline | Preparing | Link to Azure Machine Learning studio |
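
The job is still in the `Preparing` state when the cell returns, while the metrics become available only after it finishes. One option is `ml_client.jobs.stream(returned_job.name)`, which blocks and streams logs until the job ends. Alternatively, a small polling loop can be used, as in the sketch below. The helper `wait_for_job` and its defaults are hypothetical, and the set of terminal status strings is an assumption about the values the job reports.

```python
import time
from typing import Callable

# Assumed terminal job statuses.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still in status {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Stand-in status source so the sketch runs without a workspace.
fake_statuses = iter(["Preparing", "Running", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0))  # Completed
```

In the notebook, the status callable could be `lambda: ml_client.jobs.get(returned_job.name).status`. Check that the returned status is `Completed` before reading metrics.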


## Retrieve accuracy scores from the run
When the pipeline finishes, you can retrieve evaluation metrics from the run via mlflow. The primary measure of accuracy for the evals is `mean_exact_match`, except for `human_eval`, which uses `pass@1`.

Mean exact match is the proportion of model predictions that exactly match the corresponding correct answers. It therefore applies to question-answering evaluations that are multiple choice or have a single correct answer. The pass@1 metric is used for evaluating code generation and is the proportion of model-generated code solutions that pass a set of unit tests given in the eval dataset.

In [12]:
import mlflow

accuracy_metric_name = 'mean_exact_match' if eval_name != 'human_eval' else 'pass@1'

mlflow_tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)

run = mlflow.get_run(run_id=returned_job.name)
metric_val = run.data.metrics[accuracy_metric_name]

print(f'Accuracy metric name: {accuracy_metric_name}')
print(f'Accuracy metric value: {metric_val}')

Accuracy metric name: mean_exact_match
Accuracy metric value: 0.5
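
To make the two metric definitions concrete, here is a minimal sketch on toy data. It is not the pipeline's implementation: the function names and the sample answers are made up, the comparison is strict string equality (the real pipeline may normalize answers first), and pass@1 is computed assuming a single generated solution per problem.

```python
def mean_exact_match(predictions: list, references: list) -> float:
    """Fraction of predictions that are exactly equal to their reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


def pass_at_1(passed_all_tests: list) -> float:
    """Fraction of problems whose single generated solution passed its unit tests."""
    if not passed_all_tests:
        raise ValueError("passed_all_tests must be non-empty")
    return sum(bool(flag) for flag in passed_all_tests) / len(passed_all_tests)


# Toy multiple-choice answers: three of four predictions match.
print(mean_exact_match(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 0.75

# Toy code-generation results: one of two solutions passed.
print(pass_at_1([True, False]))  # 0.5
```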
