# AzureML Online Endpoints Model Profiler

## Overview

Machine learning inference is time- and compute-intensive. Quantifying inference performance is vital to make the best use of compute resources and to reach the desired performance SLA (e.g. latency, throughput) at the lowest cost.

Online Endpoints Model Profiler (Preview) provides a fully managed experience that makes it easy to benchmark the performance of models served through [Online Endpoints](https://docs.microsoft.com/en-us/azure/machine-learning/concept-endpoints).

* Use the benchmarking tool of your choice.

* Easy to use CLI experience.
  
* Support for CI/CD MLOps pipelines to automate profiling.
  
* Thorough performance report containing latency percentiles and resource utilization metrics.

## A brief introduction on benchmarking tools

The online endpoints model profiler currently supports three benchmarking tools: wrk, wrk2, and labench.

* `wrk`: wrk is a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue. For detailed info please refer to this link: https://github.com/wg/wrk.

* `wrk2`: wrk2 is wrk modified to produce a constant throughput load and accurate latency details to the high 9s (i.e. it can report an accurate 99.9999th percentile if run long enough). In addition to wrk's arguments, wrk2 takes a throughput argument (in total requests per second) via either the `--rate` or `-R` parameter (default is 1000). For detailed info please refer to this link: https://github.com/giltene/wrk2.

* `labench`: LaBench (for LAtency BENCHmark) is a tool that measures latency percentiles of HTTP GET or POST requests under very even and steady load. For detailed info please refer to this link: https://github.com/microsoft/LaBench.
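
To make the difference between `wrk` and `wrk2` concrete, the sketch below assembles the argument list for a run of either tool. It uses only flags documented in the two projects' READMEs (`-t`, `-c`, `-d`, `--latency`, and `-R` for wrk2). The helper name `build_wrk_command` and the URL are hypothetical, and this is not necessarily how the profiler image invokes the tools internally.

```python
from typing import List, Optional


def build_wrk_command(
    url: str,
    threads: int = 1,
    connections: int = 1,
    duration: str = "10s",
    rate: Optional[int] = None,
    binary: str = "wrk",
) -> List[str]:
    """Build an argv list for wrk, or for wrk2 when `rate` is given.

    `binary` is an assumption: adjust it to the executable name on your machine.
    """
    argv = [
        binary,
        "-t", str(threads),        # number of threads
        "-c", str(connections),    # number of open HTTP connections
        "-d", duration,            # test duration, e.g. "10s"
        "--latency",               # print the latency distribution
    ]
    if rate is not None:
        # -R (--rate) exists only in wrk2: total requests per second.
        argv += ["-R", str(rate)]
    argv.append(url)
    return argv


# Hypothetical local scoring URL, used only for illustration.
wrk_cmd = build_wrk_command("http://localhost:5001/score", threads=2, connections=8)
wrk2_cmd = build_wrk_command("http://localhost:5001/score", rate=50)
print(" ".join(wrk_cmd))
print(" ".join(wrk2_cmd))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting.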

## 1. Prerequisites

The following prerequisites are required to run the notebook:
- An Azure subscription
- A resource group with ownership permissions or an existing compute instance with the Contributor role
- The following additional Python packages are required: 
    - [azure-mgmt-authorization](https://pypi.org/project/azure-mgmt-authorization/): Used to assign roles

If you don’t have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://aka.ms/AMLFree) today.

Install the additional Python requirements with the following code:

In [None]:
%pip install azure-mgmt-authorization

## 2. Connect to Azure Machine Learning Workspace

### 2.1 Import required libraries

In [None]:
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    CodeConfiguration,
    Environment,
    ComputeInstance,
    IdentityConfiguration,
    Data,
    CommandJob,
    Job,
    OnlineRequestSettings,
)
from azure.ai.ml import Input
from azure.identity import DefaultAzureCredential
from azure.ai.ml.constants import AssetTypes
import random

### 2.2 Set workspace details

In [None]:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace_name = "<AML_WORKSPACE_NAME>"

### 2.3 Set variables

In [None]:
rand = random.randint(0, 100000)
endpoint_name = f"endpt-moe-{rand}"
profiler_compute_name = f"profiler{rand}"
profiler_compute_size = "Standard_DS4_v2"

### 2.4 Get a handle to the workspace 

In [None]:
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

## 3. Create an online endpoint

You will need a simple online endpoint as a target for the profiler. For more information, see [online-endpoints-simple-deployment.ipynb](online-endpoints-simple-deployment.ipynb).

### 3.1 Create the endpoint

In [None]:
endpoint = ManagedOnlineEndpoint(name=endpoint_name)
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

### 3.2 Create a deployment

In [None]:
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=Model(path="../model-1/model/sklearn_regression_model.pkl"),
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file="../model-1/environment/conda.yml",
    ),
    instance_type="Standard_DS2_v2",
    instance_count=1,
    request_settings=OnlineRequestSettings(
        request_timeout_ms=3000, max_concurrent_requests_per_instance=1024
    ),
)

In [None]:
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()

## 4. Create a compute to host the profiler
You will need a compute to host the profiler, send requests to the online endpoint, and generate a performance report.

This compute is NOT the same one that you used above to deploy your model. Please choose a compute SKU with proper network bandwidth (considering the inference request payload size and profiling traffic, we'd recommend Standard_F4s_v2) in the same region as the online endpoint.
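
A rough way to check whether a SKU's network bandwidth is sufficient is to multiply the bytes exchanged per request by the target request rate. The sketch below does this arithmetic. The function name, the default 20% protocol overhead factor, and the example sizes are illustrative assumptions, not figures from the profiler documentation.

```python
def required_bandwidth_mbps(
    payload_bytes: int,
    response_bytes: int,
    target_rps: float,
    overhead_factor: float = 1.2,
) -> float:
    """Estimate the network bandwidth (in Mbit/s) the profiling traffic needs.

    overhead_factor is an assumed allowance for HTTP headers and TLS framing.
    """
    if payload_bytes < 0 or response_bytes < 0 or target_rps < 0:
        raise ValueError("sizes and request rate must be non-negative")
    bytes_per_second = (payload_bytes + response_bytes) * target_rps * overhead_factor
    return bytes_per_second * 8 / 1_000_000


# Example: 1 MB request, 1 KB response, 50 requests per second, no overhead.
estimate = required_bandwidth_mbps(1_000_000, 1_000, 50, overhead_factor=1.0)
print(f"{estimate:.1f} Mbit/s")  # about 400.4 Mbit/s
```

Compare the estimate with the expected network bandwidth published for the VM size you plan to use, and leave headroom for other traffic on the instance.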

The compute's managed identity needs the Contributor role on the machine learning workspace. For more information, see [Assign Azure roles using Azure CLI](https://docs.microsoft.com/en-us/azure/role-based-access-control/role-assignments-cli).

### 4.1 Create the compute instance

In [None]:
compute = ComputeInstance(
    name=profiler_compute_name,
    size=profiler_compute_size,
    identity=IdentityConfiguration(type="system_assigned"),
)

In [None]:
compute = ml_client.compute.begin_create_or_update(compute).result()

### 4.2 Get Authorization Management Clients 

In [None]:
import uuid

from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.v2020_10_01_preview.models import (
    RoleAssignmentCreateParameters,
)

# Client used to look up role definitions (for example, "Contributor").
role_definition_client = AuthorizationManagementClient(
    credential=credential,
    subscription_id=subscription_id,
    api_version="2018-01-01-preview",
)

# Client used to create role assignments.
role_assignment_client = AuthorizationManagementClient(
    credential=credential,
    subscription_id=subscription_id,
    api_version="2020-10-01-preview",
)

### 4.3 Assign the Contributor role to the compute instance

In [None]:
role_name = "Contributor"
scope = ml_client.workspaces.get(workspace_name).id

role_defs = role_definition_client.role_definitions.list(scope=scope)
role_def = next((r for r in role_defs if r.role_name == role_name))

role_assignment_client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=role_def.id,
        principal_id=compute.identity.principal_id,
        principal_type="ServicePrincipal",
    ),
)

## 5. Create a profiling job

A profiling job simulates how an online endpoint serves live requests. It sends a request load to the online endpoint and generates a performance report.

Profiling job parameters can be passed using environment variables or a JSON configuration file. In this example, we'll use environment variables.

### 5.1 Upload the payload
The payload file contains one JSON request body per line. Wrapping the payload file in a `Data` object exposes a `path` on the `workspaceblobstore` datastore that is used as the input of the profiling job.

```json
{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}
{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}
{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}
{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}
{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}
``` 
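
A payload file in this one-JSON-document-per-line layout can be generated and checked with the standard library. The sketch below is a minimal example; the helper names `write_payload_file` and `validate_payload_file` are hypothetical and not part of the profiler.

```python
import json
from pathlib import Path
from typing import Any, Iterable


def write_payload_file(path: str, requests: Iterable[Any]) -> int:
    """Write one compact JSON document per line and return the line count."""
    lines = [json.dumps(request) for request in requests]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def validate_payload_file(path: str) -> int:
    """Check that every non-empty line parses as JSON and return the count."""
    count = 0
    text = Path(path).read_text(encoding="utf-8")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines, such as a trailing newline
        try:
            json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {number} is not valid JSON: {error}") from error
        count += 1
    return count


# Same request body as in the sample payload shown above.
sample_request = {"data": [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]]}
```

For example, `write_payload_file("payload.txt", [sample_request] * 5)` writes five request lines, and `validate_payload_file("payload.txt")` confirms that each one parses before the file is uploaded.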

In [None]:
payload = Data(
    name="payload",
    type=AssetTypes.URI_FILE,
    path="profiler/profiling/payload.txt",
    datastore="workspaceblobstore",
)

In [None]:
payload = ml_client.data.create_or_update(payload)

### 5.2 Create a profiling job
To create a profiling job, a command job is defined with the `command` function and the `online-endpoints-model-profiler` image. For general command job parameters, see the [CLI v2 Command Job YAML Schema](https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-job-command).

The key parameters for a profiling job are: 

| Key | Type  | Description | Allowed values | Default value |
| --- | ----- | ----------- | -------------- | ------------- |
| `command` | string | The command for running the profiling job. | `python -m online_endpoints_model_profiler ${{inputs.payload}}` | - |
| `experiment_name` | string | The experiment name of the profiling job. An experiment is a group of jobs. | - | - |
| `display_name` | string | The profiling job name. | - | A random string guid, such as `willing_needle_wrzk3lt7j5` |
| `environment.image` | string | An Azure Machine Learning curated image containing benchmarking tools and profiling scripts. | mcr.microsoft.com/azureml/online-endpoints-model-profiler:latest | - |
| `environment_variables` | string | Environment variables for the profiling job. | [Profiling related environment variables](#YAML-profiling-related-environment_variables)<br><br>[Benchmarking tool related environment variables](#YAML-benchmarking-tool-related-environment_variables) | - |
| `compute` | string | The Azure Machine Learning compute for running the profiling job. | - | - |
| `inputs.payload` | string | Payload file that is stored in an AML registered datastore. | [Example payload file content](https://github.com/Azure/azureml-examples/blob/xiyon/mir-profiling/cli/endpoints/online/profiling/payload.txt) | - |

Key environment variables that configure the profiling job include:

| Key | Description | Default Value | wrk | wrk2 | labench |
| --- | ----------- | ------------- | --- | ---- | ------- |
| `DURATION` | Period of time for running the benchmarking tool. | `300s` | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| `CONNECTIONS` | No. of connections for the benchmarking tool. The default value will be set to the value of `max_concurrent_requests_per_instance` | `1` | :heavy_check_mark: | :heavy_check_mark: | :x: |
| `THREAD` | No. of threads allocated for the benchmarking tool. | `1` | :heavy_check_mark: | :heavy_check_mark: | :x: |
| `TARGET_RPS` | Target requests per second for the benchmarking tool. | `50` | :x: | :heavy_check_mark: | :heavy_check_mark: |
| `CLIENTS` | No. of clients for the benchmarking tool. The default value will be set to the value of `max_concurrent_requests_per_instance` | `1` | :x: | :x: | :heavy_check_mark: |
| `TIMEOUT` | Timeout in seconds for each request. | `10s` | :x: | :x: | :heavy_check_mark: |
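
The tool support columns in the table above can be encoded in a small helper that builds the `environment_variables` dictionary and rejects keys that do not apply to the selected tool. This is a sketch: `build_profiler_env` is a hypothetical name, and rejecting non-applicable keys is a design choice of this example rather than documented profiler behavior.

```python
from typing import Dict

# Which settings apply to which tool, transcribed from the table above.
SUPPORTED_SETTINGS = {
    "wrk": {"DURATION", "CONNECTIONS", "THREAD"},
    "wrk2": {"DURATION", "CONNECTIONS", "THREAD", "TARGET_RPS"},
    "labench": {"DURATION", "TARGET_RPS", "CLIENTS", "TIMEOUT"},
}


def build_profiler_env(endpoint: str, deployment: str, tool: str, **settings) -> Dict[str, str]:
    """Return environment variables for a profiling job using `tool`."""
    if tool not in SUPPORTED_SETTINGS:
        raise ValueError(f"unknown profiling tool: {tool!r}")
    unsupported = sorted(set(settings) - SUPPORTED_SETTINGS[tool])
    if unsupported:
        raise ValueError(f"{tool} does not use: {', '.join(unsupported)}")
    env = {
        "ONLINE_ENDPOINT": endpoint,
        "DEPLOYMENT": deployment,
        "PROFILING_TOOL": tool,
    }
    # Environment variable values must be strings.
    env.update({key: str(value) for key, value in settings.items()})
    return env


# Placeholder endpoint and deployment names, used only for illustration.
env = build_profiler_env("my-endpoint", "blue", "wrk2", DURATION="10", TARGET_RPS=50)
print(env)
```

The resulting dictionary can be passed as the `environment_variables` argument of the command job.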

In [None]:
job = command(
    command="python -m online_endpoints_model_profiler --payload_path ${{inputs.payload}}",
    code=".",
    experiment_name="profiling-job",
    display_name=f"{profiler_compute_size}:1",
    environment=Environment(
        image="mcr.microsoft.com/azureml/online-endpoints-model-profiler:latest"
    ),
    environment_variables={
        "ONLINE_ENDPOINT": endpoint.name,
        "DEPLOYMENT": deployment.name,
        "PROFILING_TOOL": "wrk",
        "DURATION": "10",
        "CONNECTIONS": "1",
        "TARGET_RPS": "50",
        "CLIENTS": "1",
        "TIMEOUT": "10",
        "THREAD": "1",
    },
    compute=profiler_compute_name,
    inputs={
        "payload": Input(
            type="uri_file",
            path=payload.path,
        )
    },
)

In [None]:
job = ml_client.create_or_update(job)

### 5.3 View the profiling job in AzureML Studio
The `Metrics` tab contains metrics gathered from the profiling job.

In [None]:
ml_client.jobs.stream(name=job.name)
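
When reading the latency percentiles in the report, it can help to recall what a percentile means. The sketch below computes percentiles from raw latency samples with the nearest-rank method. It is only an illustration with made-up sample values; how the profiler and the benchmarking tools compute their percentiles is not specified here.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Return the p-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up request latencies in milliseconds.
latencies_ms = [12, 15, 11, 40, 13, 14, 16, 90, 12, 13]
p50 = percentile(latencies_ms, 50)
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)
print(p50, p90, p99)
```

With only ten samples, the 99th percentile is simply the slowest request, which is why longer runs are needed for trustworthy tail latency figures.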

## 6. Delete assets

In [None]:
ml_client.online_endpoints.begin_delete(name=endpoint_name)

In [None]:
ml_client.compute.begin_delete(name=profiler_compute_name)