Inferencing machine learning models is a time- and compute-intensive process. It is vital to quantify model inferencing performance so that you make the best use of compute resources and reduce the cost of reaching your desired performance SLA (e.g., latency, throughput).
Online Endpoints Model Profiler (Preview) provides a fully managed experience that makes it easy to benchmark the performance of models served through Online Endpoints.
- Use the benchmarking tool of your choice.
- Easy-to-use CLI experience.
- Support for CI/CD MLOps pipelines to automate profiling.
- Thorough performance report containing latency percentiles and resource utilization metrics.
The online endpoints model profiler currently supports three benchmarking tools: wrk, wrk2, and labench.
- **wrk**: a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It combines a multithreaded design with scalable event notification systems such as epoll and kqueue. For details, see https://github.com/wg/wrk.
- **wrk2**: wrk modified to produce a constant-throughput load and accurate latency details up to the high 9s (i.e., it can report 99.9999th-percentile latency if run long enough). In addition to wrk's arguments, wrk2 takes a throughput argument (in total requests per second) via the --rate or -R parameter (default is 1000). For details, see https://github.com/giltene/wrk2.
- **labench**: LaBench (for LAtency BENCHmark) is a tool that measures latency percentiles of HTTP GET or POST requests under very even and steady load. For details, see https://github.com/microsoft/LaBench.
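As a rough point of reference, a standalone wrk2 run against a scoring endpoint looks like the fragment below. The URI and numbers are placeholders, and a real run against an online endpoint would also need an Authorization header and a POST body supplied via a Lua script; the profiler drives the selected tool for you, so you normally won't invoke it by hand.

```shell
# Illustrative standalone wrk2 invocation: 4 threads, 64 connections, 60s,
# constant 50 requests/sec, with latency statistics printed at the end.
wrk2 -t4 -c64 -d60s -R50 --latency https://my-endpoint.westus2.inference.ml.azure.com/score
```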
- Azure subscription. If you don't have an Azure subscription, sign up to try the free or paid version of Azure Machine Learning today.
- Azure CLI and ML extension. For more information, see Install, set up, and use the CLI (v2) (preview).
Follow this example to get started with the model profiling experience.
Follow the example in this tutorial to deploy a model using an online endpoint.
- Replace the `instance_type` in the deployment yaml file with your desired Azure VM SKU. VM SKUs vary in computing power, price, and availability across Azure regions.
- Tune `request_settings.max_concurrent_requests_per_instance`, which defines the concurrency level. The higher this setting, the higher the throughput the endpoint gets. If it is set higher than the online endpoint can handle, inference requests may end up waiting in a queue, which eventually results in longer end-to-end latency.
- If you plan to profile multiple `instance_type` and `request_settings.max_concurrent_requests_per_instance` pairs, create one online deployment for each pair. You can attach all online deployments to the same online endpoint.
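For example, creating one deployment per SKU/concurrency pair under a single endpoint might look like the sketch below; the yaml filenames, deployment yaml contents, and endpoint name are illustrative.

```shell
# Hypothetical sketch: one deployment per SKU/concurrency pair, all attached to
# the same endpoint. The yaml files would differ only in instance_type and
# request_settings.max_concurrent_requests_per_instance.
az ml online-deployment create -f deployment-f2s-c64.yml --endpoint-name my-endpoint
az ml online-deployment create -f deployment-f4s-c128.yml --endpoint-name my-endpoint
```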
Below is a sample yaml file that defines an online deployment.
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model:
  path: ../../model-1/model/sklearn_regression_model.pkl
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score.py
environment:
  conda_file: ../../model-1/environment/conda.yml
  image: mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210727.v1
instance_type: Standard_F2s_v2
instance_count: 1
request_settings:
  request_timeout_ms: 3000
  max_concurrent_requests_per_instance: 1024
```
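Before profiling, it can be worth a quick sanity check that the deployment answers requests; `sample-request.json` below is a placeholder for a request body your scoring script accepts.

```shell
# Hypothetical smoke test of the deployment created from the yaml above.
az ml online-endpoint invoke --name my-endpoint --deployment-name blue --request-file sample-request.json
```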
You will need a compute to host the profiler, send requests to the online endpoint, and generate the performance report.
- This compute is NOT the same one that you used above to deploy your model. Choose a compute SKU with adequate network bandwidth (considering the inference request payload size and profiling traffic, we recommend Standard_F4s_v2) in the same region as the online endpoint.

  ```bash
  az ml compute create --name $PROFILER_COMPUTE_NAME --size $PROFILER_COMPUTE_SIZE --identity-type SystemAssigned --type amlcompute
  ```
- Create the proper role assignment for accessing online endpoint resources. The compute needs the Contributor role on the machine learning workspace. For more information, see Assign Azure roles using Azure CLI.

  ```bash
  compute_info=`az ml compute show --name $PROFILER_COMPUTE_NAME --query '{"id": id, "identity_object_id": identity.principal_id}' -o json`
  workspace_resource_id=`echo $compute_info | jq -r '.id' | sed 's/\(.*\)\/computes\/.*/\1/'`
  identity_object_id=`echo $compute_info | jq -r '.identity_object_id'`
  az role assignment create --role Contributor --assignee-object-id $identity_object_id --scope $workspace_resource_id
  if [[ $? -ne 0 ]]; then echo "Failed to create role assignment for compute $PROFILER_COMPUTE_NAME" && exit 1; fi
  ```
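If you want to confirm the assignment took effect, a check along these lines (reusing the variables from the snippet above) should list Contributor:

```shell
# Assumes $identity_object_id and $workspace_resource_id from the previous snippet.
az role assignment list --assignee $identity_object_id --scope $workspace_resource_id \
  --query "[].roleDefinitionName" -o tsv
```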
A profiling job simulates how an online endpoint serves live requests. It produces a throughput load against the online endpoint and generates a performance report.
Below is a template yaml file that defines a profiling job.
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
command: >
  python -m online_endpoints_model_profiler --payload_path ${{inputs.payload}}
experiment_name: profiling-job
display_name: <% SKU_CONNECTION_PAIR %>
environment:
  image: mcr.microsoft.com/azureml/online-endpoints-model-profiler:latest
environment_variables:
  ONLINE_ENDPOINT: "<% ENDPOINT_NAME %>"
  DEPLOYMENT: "<% DEPLOYMENT_NAME %>"
  PROFILING_TOOL: "<% PROFILING_TOOL %>"
  DURATION: "<% DURATION %>"
  CONNECTIONS: "<% CONNECTIONS %>"
  TARGET_RPS: "<% TARGET_RPS %>"
  CLIENTS: "<% CLIENTS %>"
  TIMEOUT: "<% TIMEOUT %>"
  THREAD: "<% THREAD %>"
compute: "azureml:<% COMPUTE_NAME %>"
inputs:
  payload:
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/profiling_payloads/<% ENDPOINT_NAME %>_payload.txt
```
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
`command` | string | The command for running the profiling job. | `python -m online_endpoints_model_profiler --payload_path ${{inputs.payload}}` | - |
`experiment_name` | string | The experiment name of the profiling job. An experiment is a group of jobs. | - | - |
`display_name` | string | The profiling job name. | - | An auto-generated random name, such as `willing_needle_wrzk3lt7j5` |
`environment.image` | string | An Azure Machine Learning curated image containing benchmarking tools and profiling scripts. | `mcr.microsoft.com/azureml/online-endpoints-model-profiler:latest` | - |
`environment_variables` | string | Environment variables for the profiling job. | Profiling-related environment variables and benchmarking-tool-related environment variables (see the tables below) | - |
`compute` | string | The AML compute for running the profiling job. | - | - |
`inputs.payload` | string | Payload file that is stored in an AML registered datastore. | Example payload file content | - |
Key | Description | Default value |
---|---|---|
`SUBSCRIPTION` | Used together with `RESOURCE_GROUP`, `WORKSPACE`, `ONLINE_ENDPOINT`, `DEPLOYMENT` to form the profiling target. | Subscription of the profiling job |
`RESOURCE_GROUP` | Used together with `SUBSCRIPTION`, `WORKSPACE`, `ONLINE_ENDPOINT`, `DEPLOYMENT` to form the profiling target. | Resource group of the profiling job |
`WORKSPACE` | Used together with `SUBSCRIPTION`, `RESOURCE_GROUP`, `ONLINE_ENDPOINT`, `DEPLOYMENT` to form the profiling target. | AML workspace of the profiling job |
`ONLINE_ENDPOINT` | Used together with `SUBSCRIPTION`, `RESOURCE_GROUP`, `WORKSPACE`, `DEPLOYMENT` to form the profiling target. If not provided, `SCORING_URI` will be used as the profiling target. If neither `ONLINE_ENDPOINT`/`DEPLOYMENT` nor `SCORING_URI` is provided, an error will be thrown. | - |
`DEPLOYMENT` | Used together with `SUBSCRIPTION`, `RESOURCE_GROUP`, `WORKSPACE`, `ONLINE_ENDPOINT` to form the profiling target. If not provided, `SCORING_URI` will be used as the profiling target. If neither `ONLINE_ENDPOINT`/`DEPLOYMENT` nor `SCORING_URI` is provided, an error will be thrown. | - |
`IDENTITY_ACCESS_TOKEN` | An optional AAD token for retrieving the online endpoint scoring_uri, access key, and resource usage metrics. Not necessary when the AML compute running the profiling job has contributor access to the workspace of the online endpoint. Assigning appropriate permissions to the AML compute is recommended over providing this token, since the token may expire while the profiling job runs. | - |
`SCORING_URI` | May be provided instead of the `SUBSCRIPTION`/`RESOURCE_GROUP`/`WORKSPACE`/`ONLINE_ENDPOINT`/`DEPLOYMENT` combination to define the profiling target. However, missing `ONLINE_ENDPOINT`/`DEPLOYMENT` info will lead to missing resource usage metrics in the final report. | - |
`SCORING_HEADERS` | Use this to provide any special headers necessary when invoking the profiling target. | `{"Content-Type": "application/json", "Authorization": "Bearer ${ONLINE_ENDPOINT_ACCESS_KEY}", "azureml-model-deployment": "${DEPLOYMENT}"}` |
`PROFILING_TOOL` | The name of the benchmarking tool. Currently supported: `wrk`, `wrk2`, `labench`. | `wrk` |
`PAYLOAD` | A single string-format payload for invoking the profiling target, for example `{"data": [[1,2,3,4,5,6,7,8,9,10], [10,9,8,7,6,5,4,3,2,1]]}`. If `inputs.payload` is provided in the profiling job yaml file, this env var is ignored. | - |
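The payload file itself is plain text; assuming one JSON request body per line (an assumption, matching the single-payload example above), you could build one locally like this before placing it at the `workspaceblobstore` path referenced by `inputs.payload`:

```shell
# Sketch: build a payload file with one JSON request body per line (assumed
# format). The upload step to workspaceblobstore is environment-specific, so it
# is only indicated in the trailing comment.
cat > my-endpoint_payload.txt <<'EOF'
{"data": [[1,2,3,4,5,6,7,8,9,10]]}
{"data": [[10,9,8,7,6,5,4,3,2,1]]}
EOF
wc -l < my-endpoint_payload.txt
# Then copy it to: azureml://datastores/workspaceblobstore/paths/profiling_payloads/my-endpoint_payload.txt
```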
Key | Description | Default value | wrk | wrk2 | labench |
---|---|---|---|---|---|
`DURATION` | Period of time for running the benchmarking tool. | 300s | ✔️ | ✔️ | ✔️ |
`CONNECTIONS` | Number of connections for the benchmarking tool. The default value will be set to the value of `max_concurrent_requests_per_instance`. | 1 | ✔️ | ✔️ | ❌ |
`THREAD` | Number of threads allocated for the benchmarking tool. | 1 | ✔️ | ✔️ | ❌ |
`TARGET_RPS` | Target requests per second for the benchmarking tool. | 50 | ❌ | ✔️ | ✔️ |
`CLIENTS` | Number of clients for the benchmarking tool. The default value will be set to the value of `max_concurrent_requests_per_instance`. | 1 | ❌ | ❌ | ✔️ |
`TIMEOUT` | Timeout in seconds for each request. | 10s | ❌ | ❌ | ✔️ |
Update the profiling job yaml template with your own values and create a profiling job.
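Filling the `<% ... %>` placeholders can be scripted; below is a minimal `sed` sketch, where the heredoc stands in for the full job template shown earlier and the substituted values are illustrative, not part of the product.

```shell
# Minimal placeholder-substitution sketch. Real templates contain more
# placeholders than this trimmed snippet.
cat > profiling_job_tmpl.yml <<'EOF'
display_name: <% SKU_CONNECTION_PAIR %>
environment_variables:
  ONLINE_ENDPOINT: "<% ENDPOINT_NAME %>"
  DEPLOYMENT: "<% DEPLOYMENT_NAME %>"
compute: "azureml:<% COMPUTE_NAME %>"
EOF
sed -e 's|<% SKU_CONNECTION_PAIR %>|Standard_F2s_v2-c64|g' \
    -e 's|<% ENDPOINT_NAME %>|my-endpoint|g' \
    -e 's|<% DEPLOYMENT_NAME %>|blue|g' \
    -e 's|<% COMPUTE_NAME %>|profiler-compute|g' \
    profiling_job_tmpl.yml > profiling_job.yml
cat profiling_job.yml
```

After substitution, submit the rendered file with `az ml job create -f profiling_job.yml`.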
```bash
az ml job create -f ${PROFILING_JOB_YAML_FILE_PATH}
```
- Find profiling job info in the AML workspace studio, under the "Experiments" tab.
- Find job metrics on each individual job page, under the "Metrics" tab.
- Find the job report file on each individual job page, under the "Outputs + logs" tab, in the file "outputs/report.json".
- Use the following CLI command to download all job output files:

  ```bash
  az ml job download --name $JOB_ID --download-path $JOB_LOCAL_PATH
  ```
Please use `az ml online-endpoint delete` to delete the test online endpoints and online deployments after completing profiling.
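For example (the endpoint name is illustrative; deleting an endpoint also removes the deployments under it):

```shell
# --yes skips the confirmation prompt.
az ml online-endpoint delete --name my-endpoint --yes
```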
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.
When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
For any questions, bugs, or requests for new features, please contact us at miroptprof@microsoft.com.
Trademarks: This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.