Agent Check: TorchServe

Overview

This check monitors TorchServe through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

Starting from Agent release 7.47.0, the TorchServe check is included in the Datadog Agent package. No additional installation is needed on your server.

This check uses OpenMetrics to collect metrics from the OpenMetrics endpoint that TorchServe can expose. It requires Python 3.

Prerequisites

The TorchServe check collects TorchServe's metrics and performance data using three different endpoints:

  • The OpenMetrics endpoint, which exposes metrics in the Prometheus format.
  • The Inference API, which provides the overall health status of your instance.
  • The Management API, which provides data about the models running on your server.

You can configure these endpoints using the config.properties file, as described in the TorchServe documentation. For example:

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
number_of_netty_threads=32
default_workers_per_model=10
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
load_models=all

This configuration file exposes the three different endpoints that can be used by the integration to monitor your instance.

OpenMetrics endpoint

To enable the Prometheus endpoint, you need to configure two options:

  • metrics_address: Metrics API binding address. Defaults to http://127.0.0.1:8082
  • metrics_mode: TorchServe supports two metric modes: log and prometheus. Defaults to log. You must set it to prometheus to collect metrics from this endpoint.

For instance:

metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus

In this case, the OpenMetrics endpoint is exposed at this URL: http://<TORCHSERVE_ADDRESS>:8082/metrics.
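As a quick sanity check of what this endpoint returns, the short sketch below parses a Prometheus-format payload like the one TorchServe serves at `/metrics`. The sample text and the `parse_prometheus` helper are illustrative only, not part of the integration:

```python
# Minimal sketch: parse Prometheus-format text such as the TorchServe
# metrics endpoint returns. The sample payload below is illustrative.
sample = """\
# HELP ts_inference_requests_total Total number of inference requests.
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{uuid="abc",model_name="my_model",model_version="1.0"} 42.0
"""

def parse_prometheus(text):
    """Return {metric_name: value} for each sample line, skipping comments."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip the label set, if any
        metrics[name] = float(value)
    return metrics

print(parse_prometheus(sample))  # {'ts_inference_requests_total': 42.0}
```

In practice, the Agent scrapes and parses this endpoint for you; the sketch only shows the format you should see when you curl the URL above.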

Configuration

These three different endpoints can be monitored independently and must be configured separately in the configuration file, one API per instance. See the sample torchserve.d/conf.yaml for all available configuration options.

Configure the OpenMetrics endpoint

Configuration options for the OpenMetrics endpoint can be found in the configuration file under the TorchServe OpenMetrics endpoint configuration section. The minimal configuration only requires the openmetrics_endpoint option:

init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics

For more options, see the sample torchserve.d/conf.yaml file.

TorchServe allows custom service code to emit metrics, which are made available based on the configured metrics_mode. You can configure this integration to collect these metrics using the extra_metrics option. These metrics have the torchserve.openmetrics prefix, just like any other metric coming from this endpoint.

These custom TorchServe metrics are considered standard metrics in Datadog.

Configure the Inference API

This integration relies on the Inference API to get the overall status of your TorchServe instance. Configuration options for the Inference API can be found in the configuration file under the TorchServe Inference API endpoint configuration section. The minimal configuration only requires the inference_api_url option:

init_config:
  ...
instances:
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080

This integration leverages the Ping endpoint to collect the overall health status of your TorchServe server.
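The Ping endpoint returns a small JSON body reporting the server's status. The sketch below shows one way to interpret such a payload; the exact status string and the `health_from_ping` helper are assumptions for this example, not the integration's actual code:

```python
import json

# Illustrative sketch: interpret an Inference API /ping response body.
# The {"status": "Healthy"} shape is assumed for this example.
def health_from_ping(body: str) -> bool:
    """Return True when the ping payload reports a healthy server."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return payload.get("status") == "Healthy"

print(health_from_ping('{"status": "Healthy"}'))    # True
print(health_from_ping('{"status": "Unhealthy"}'))  # False
```

The integration performs this kind of check for you and reports the result as a service check.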

Configure the Management API

You can collect metrics related to the models that are currently running in your TorchServe server using the Management API. Configuration options for the Management API can be found in the configuration file under the TorchServe Management API endpoint configuration section. The minimal configuration only requires the management_api_url option:

init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081

By default, the integration collects data from every model, up to 100 models. You can modify this using the limit, include, and exclude options. For example:

init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    limit: 25
    include: 
      - my_model.* 

This configuration only collects metrics for model names that match the my_model.* regular expression, up to 25 models.

You can also exclude some models:

init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    exclude: 
      - test.* 

This configuration collects metrics for every model name that does not match the test.* regular expression, up to 100 models.

You can use the `include` and `exclude` options in the same configuration. The `exclude` filters are applied after the `include` ones.
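The combined filtering behavior can be sketched as follows: `include` patterns select model names first, then `exclude` patterns remove matches, up to `limit` models. The `filter_models` helper is an illustration of these semantics, not the integration's actual implementation:

```python
import re

# Sketch of the filtering described above: include first, then exclude,
# capped at `limit`. Patterns are matched from the start of the name,
# as with re.match. Helper name and details are illustrative.
def filter_models(models, include=None, exclude=None, limit=100):
    selected = []
    for name in models:
        if include and not any(re.match(p, name) for p in include):
            continue  # not selected by any include pattern
        if exclude and any(re.match(p, name) for p in exclude):
            continue  # removed by an exclude pattern
        selected.append(name)
        if len(selected) >= limit:
            break
    return selected

models = ["my_model_a", "my_model_test", "other_model"]
print(filter_models(models, include=["my_model.*"], exclude=[".*test"]))
# ['my_model_a']
```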

By default, the integration retrieves the full list of models every time the check runs. You can cache this list using the interval option to improve the performance of this check.

Using the `interval` option can also delay some metrics and events.
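The caching behavior that `interval` enables can be sketched as a simple time-based cache: reuse the fetched model list until the interval has elapsed. The `ModelListCache` class and the `fetch_model_list` callable are hypothetical stand-ins, not the integration's actual code:

```python
import time

# Minimal sketch of interval-based caching: the fetch callable (a stand-in
# for the Management API call) runs at most once per `interval` seconds.
class ModelListCache:
    def __init__(self, fetch_model_list, interval):
        self._fetch = fetch_model_list
        self._interval = interval
        self._cached = None
        self._fetched_at = 0.0

    def get(self):
        now = time.monotonic()
        if self._cached is None or now - self._fetched_at >= self._interval:
            self._cached = self._fetch()
            self._fetched_at = now
        return self._cached

calls = []
cache = ModelListCache(lambda: calls.append(1) or ["my_model"], interval=3600)
cache.get()
cache.get()  # served from the cache; the fetch runs only once
print(len(calls))  # 1
```

This is why a freshly added model may not appear in metrics, or trigger an event, until the next refresh.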

Complete configuration

This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections:

init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
    # Also collect your own TorchServe metrics
    extra_metrics:
      - my_custom_torchserve_metric
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    # Include all the model names that match this regex   
    include:
      - my_models.*
    # But exclude all the ones that finish with `-test`
    exclude: 
      - .*-test 
    # Refresh the list of models only every hour
    interval: 3600

Restart the Agent after modifying the configuration.

The same configuration can be applied as a Docker label inside docker-compose.yml:

labels:
  com.datadoghq.ad.checks: '{"torchserve":{"instances":[{"openmetrics_endpoint":"http://%%host%%:8082/metrics","extra_metrics":["my_custom_torchserve_metric"]},{"inference_api_url":"http://%%host%%:8080"},{"management_api_url":"http://%%host%%:8081","include":["my_models.*"],"exclude":[".*-test"],"interval":3600}]}}'
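Autodiscovery label values must be valid JSON, so it can be worth validating them before deploying. This sketch parses the label value from the example above (the `%%host%%` template variables are left for the Agent to resolve):

```python
import json

# The Autodiscovery label value from the docker-compose example above.
label_value = (
    '{"torchserve":{"instances":['
    '{"openmetrics_endpoint":"http://%%host%%:8082/metrics",'
    '"extra_metrics":["my_custom_torchserve_metric"]},'
    '{"inference_api_url":"http://%%host%%:8080"},'
    '{"management_api_url":"http://%%host%%:8081",'
    '"include":["my_models.*"],"exclude":[".*-test"],"interval":3600}'
    ']}}'
)

config = json.loads(label_value)  # raises json.JSONDecodeError if malformed
print(len(config["torchserve"]["instances"]))  # 3
```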

The same configuration can be applied as Kubernetes annotations on your TorchServe pods:

apiVersion: v1
kind: Pod
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/torchserve.checks: |-
      {
        "torchserve": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8082/metrics",
              "extra_metrics": [
                "my_custom_torchserve_metric"
              ]
            },
            {
              "inference_api_url": "http://%%host%%:8080"
            },
            {
              "management_api_url": "http://%%host%%:8081",
              "include": [
                ".*"
              ],
              "exclude": [
                ".*-test"
              ],
              "interval": 3600
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'torchserve'
# (...)

Validation

Run the Agent's status subcommand and look for torchserve under the Checks section.

Data Collected

Metrics

See metadata.csv for a list of metrics provided by this integration.

Metrics are prefixed using the API they are coming from:

  • torchserve.openmetrics.* for metrics coming from the OpenMetrics endpoint.
  • torchserve.inference_api.* for metrics coming from the Inference API.
  • torchserve.management_api.* for metrics coming from the Management API.

Events

The TorchServe integration includes three events collected using the Management API:

  • torchserve.management_api.model_added: This event fires when a new model has been added.
  • torchserve.management_api.model_removed: This event fires when a model has been removed.
  • torchserve.management_api.default_version_changed: This event fires when a default version has been set for a given model.

You can disable these events by setting the `submit_events` option to `false` in your configuration file.

Service Checks

See service_checks.json for a list of service checks provided by this integration.

Logs

The TorchServe integration can collect logs from the TorchServe service and forward them to Datadog.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
  2. Uncomment and edit the logs configuration block in your torchserve.d/conf.yaml file. Here's an example:

    logs:
      - type: file
        path: /var/log/torchserve/model_log.log
        source: torchserve
        service: torchserve
      - type: file
        path: /var/log/torchserve/ts_log.log
        source: torchserve
        service: torchserve

See the example configuration file for how to collect all logs.

For more information about the logging configuration with TorchServe, see the official TorchServe documentation.

You can also collect logs from the `access_log.log` file. However, these logs are also included in the `ts_log.log` file, so configuring both files results in duplicate logs in Datadog.

Troubleshooting

Need help? Contact Datadog support.