# Evaluating Distilled Models with AzureML

In this notebook, you will learn how to run evaluations of open-source distilled models using the AzureML SDK. Along with this notebook, we've included a preconfigured set of 5 tasks that use well-known public datasets.

*Disclaimer: This notebook has been tested against MaaS endpoints for Llama 3.1. Other deployments or model versions are not guaranteed to work with the evaluation pipelines distributed with this notebook.*  

## Prerequisites
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace - [Configure workspace](../../configuration.ipynb)
- The Azure Machine Learning Python SDK v2, installed - [install instructions](../../../README.md) (see the getting started section)
- A Python environment with [MLflow](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-mlflow-configure-tracking?view=azureml-api-2&tabs=python%2Cmlflow) for retrieving evaluation metrics
- The endpoint URL of the distilled model and the name of the workspace connection that stores its access keys

## Supported tasks
- Summarization
- Math
- NLI

Note that evaluation pipelines automatically download relevant datasets from public sources.

You can also set the sample ratio, which is the fraction of the selected dataset that the evaluation runs on.

**Warning**: Many datasets contain thousands of examples, which can lead to high endpoint usage costs. We advise starting with a small sample ratio (e.g., 1%) to verify the pipeline and then increasing the ratio if desired. Note that benchmark metrics obtained with small sample ratios may not be comparable between different models. Please use `sample_ratio=1` for model comparisons.
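
To get a rough sense of how many examples (and therefore endpoint calls) a given sample ratio implies, a back-of-the-envelope sketch like the one below may help. The helper `estimate_examples` and the dataset size of 10,000 are hypothetical and purely illustrative; they are not part of the evaluation pipelines, and the pipelines' actual sampling may round differently.

```python
def estimate_examples(dataset_size: int, sample_ratio: float) -> int:
    """Approximate number of examples evaluated for a given sample ratio."""
    if not 0.0 < sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be in the interval (0, 1]")
    # Round to the nearest integer, but never report fewer than one example.
    return max(1, round(dataset_size * sample_ratio))


# Hypothetical dataset size, for illustration only
hypothetical_dataset_size = 10_000
for ratio in (0.01, 0.1, 1.0):
    n = estimate_examples(hypothetical_dataset_size, ratio)
    print(f"sample_ratio={ratio}: roughly {n} examples")
```

Multiplying the estimated example count by your endpoint's per-request cost gives an approximate upper bound to check before raising the ratio.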



In [5]:
# AzureML settings - please fill in your values
subscription_id = "<Azure subscription ID>"
resource_group = "<Resource group>"
workspace_name = "<Workspace name>"
experiment_name = "<Experiment name>"

# Eval to run - you can change this to any of the 5 supported task names
# Supported evals: text-summarization
task_name = "text-summarization"
eval_name = "dialogsum"

# Distilled model settings
endpoint_url = "<Endpoint URL>"

# Name of the connection in your Workspace storing access keys
connection_name = "<Connection name>"

# Sample ratio - what fraction of the dataset to run for the eval?
# **WARNING** be aware of endpoint costs!
sample_ratio = 0.01
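
Before submitting a job, it can be useful to confirm that none of the settings above were left as template placeholders. The following is a minimal sketch, assuming placeholders follow the `<...>` convention used in the cell above; the helper `find_unfilled` and the example values are hypothetical and not part of the AzureML SDK.

```python
import re

# Matches values that still look like "<Some placeholder>"
_PLACEHOLDER = re.compile(r"^<.*>$")


def find_unfilled(settings: dict) -> list:
    """Return the names of settings that are empty or still look like placeholders."""
    return [
        name
        for name, value in settings.items()
        if not value or _PLACEHOLDER.match(value.strip())
    ]


# Illustrative values only (not real resource names)
example_settings = {
    "subscription_id": "<Azure subscription ID>",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
    "connection_name": "",
}
print("Unfilled settings:", find_unfilled(example_settings))
```

In the notebook, you could pass a dictionary built from your own variables (for example `{"subscription_id": subscription_id, ...}`) and stop early if the returned list is not empty.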

Run the following cell to get an `MLClient` for communicating with your Workspace:

In [3]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# client for AzureML Workspace actions
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

The code in the next cell launches the evaluation pipeline job using [serverless compute](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-serverless-compute) by default. You can optionally [create your own compute cluster](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster) and use it to execute the job.

In [None]:
from azure.ai.ml import load_job

# Load the pipeline from its YAML definition
pipeline_job = load_job(f"./pipelines/summarization/{eval_name}.yaml")

# Set pipeline job inputs
pipeline_job.inputs.endpoint_url = endpoint_url
pipeline_job.inputs.connection_name = connection_name
pipeline_job.inputs.sample_ratio = sample_ratio

# Optionally use your own compute cluster
# pipeline_job.settings.default_compute = "<Your compute cluster name>"

# Start the job in the Workspace
returned_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name=experiment_name
)
returned_job

Run the next cell to stream the job. Notebook execution will be paused until the job finishes.

In [None]:
# Wait until the job completes
ml_client.jobs.stream(returned_job.name)

## Retrieve metrics from the run
When the pipeline finishes, you can retrieve evaluation metrics from the run via MLflow. The primary measure of accuracy is task-dependent; for example, `summarization` performance is usually measured with `rouge` scores.

In [None]:
import mlflow

accuracy_metric_name = "rouge1"

mlflow_tracking_uri = ml_client.workspaces.get(
    ml_client.workspace_name
).mlflow_tracking_uri
mlflow.set_tracking_uri(mlflow_tracking_uri)

run = mlflow.get_run(run_id=returned_job.name)
metric_val = run.data.metrics[accuracy_metric_name]

if sample_ratio < 1.0:
    print(
        f"**Warning** sample_ratio is {sample_ratio}. Use sample_ratio=1.0 when comparing metrics between models."
    )

print(f"Eval: {eval_name}")
print(f"Sample ratio: {sample_ratio}")
print(f"Accuracy metric name: {accuracy_metric_name}")
print(f"Accuracy metric value: {metric_val}")
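
If the metric name you request is not logged by the run, the dictionary lookup above raises a bare `KeyError`. The sketch below shows one way to make that failure more informative and to list all metrics sharing a prefix. The helpers `get_metric` and `metrics_with_prefix` are hypothetical, and the example dictionary uses made-up values standing in for `run.data.metrics`; metric names other than `rouge1` are assumptions about what a run might log.

```python
def get_metric(metrics: dict, name: str) -> float:
    """Look up a metric by name, listing the available names if it is missing."""
    if name not in metrics:
        available = ", ".join(sorted(metrics)) or "(none)"
        raise KeyError(f"Metric '{name}' not found. Available metrics: {available}")
    return metrics[name]


def metrics_with_prefix(metrics: dict, prefix: str) -> dict:
    """Return all metrics whose name starts with the given prefix, sorted by name."""
    return {k: v for k, v in sorted(metrics.items()) if k.startswith(prefix)}


# Made-up values for illustration; in the notebook, pass run.data.metrics instead
example_metrics = {"rouge1": 0.41, "rouge2": 0.19, "rougeL": 0.33, "num_examples": 125.0}

print(get_metric(example_metrics, "rouge1"))
for metric_name, value in metrics_with_prefix(example_metrics, "rouge").items():
    print(f"{metric_name}: {value}")
```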