This check monitors TorchServe through the Datadog Agent.
Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
Starting from Agent release 7.47.0, the TorchServe check is included in the Datadog Agent package. No additional installation is needed on your server.
The TorchServe check collects TorchServe's metrics and performance data using three different endpoints:
- The Inference API to collect the overall health status of your TorchServe instance.
- The Management API to collect metrics on the various models you are running.
- The OpenMetrics endpoint exposed by TorchServe.
You can configure these endpoints using the `config.properties` file, as described in the TorchServe documentation. For example:
```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
number_of_netty_threads=32
default_workers_per_model=10
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
load_models=all
```
This configuration file exposes the three different endpoints that can be used by the integration to monitor your instance.
To enable the Prometheus endpoint, you need to configure two options:

`metrics_address`
: Metrics API binding address. Defaults to `http://127.0.0.1:8082`.

`metrics_mode`
: Two metric modes are supported by TorchServe: `log` and `prometheus`. Defaults to `log`. You have to set it to `prometheus` to collect metrics from this endpoint.
For instance:

```properties
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
```
In this case, the OpenMetrics endpoint is exposed at this URL: `http://<TORCHSERVE_ADDRESS>:8082/metrics`.
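As a quick sanity check, you can query this endpoint directly before configuring the Agent; for example, assuming the server is reachable at `<TORCHSERVE_ADDRESS>` from your shell:

```shell
# With metrics_mode=prometheus set, this returns Prometheus/OpenMetrics-formatted metrics.
curl "http://<TORCHSERVE_ADDRESS>:8082/metrics"
```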
These three different endpoints can be monitored independently and must be configured separately in the configuration file, one API per instance. See the sample `torchserve.d/conf.yaml` for all available configuration options.
Configuration options for the OpenMetrics endpoint can be found in the configuration file under the `TorchServe OpenMetrics endpoint configuration` section. The minimal configuration only requires the `openmetrics_endpoint` option:
```yaml
init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
```
For more options, see the sample `torchserve.d/conf.yaml` file.
TorchServe allows the custom service code to emit metrics that will be available based on the configured `metrics_mode`. You can configure this integration to collect these metrics using the `extra_metrics` option. These metrics will have the `torchserve.openmetrics` prefix, just like any other metrics coming from this endpoint.
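If you emit such a metric, you can verify that it is exposed on the endpoint before adding it to `extra_metrics`; a minimal sketch, assuming a hypothetical custom metric named `my_custom_torchserve_metric` (the same name used in the complete example later on this page):

```shell
# Check that the (hypothetical) custom metric shows up in the OpenMetrics payload.
curl -s "http://<TORCHSERVE_ADDRESS>:8082/metrics" | grep my_custom_torchserve_metric
```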
This integration relies on the Inference API to get the overall status of your TorchServe instance. Configuration options for the Inference API can be found in the configuration file under the `TorchServe Inference API endpoint configuration` section. The minimal configuration only requires the `inference_api_url` option:
```yaml
init_config:
  ...
instances:
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
```
This integration leverages the Ping endpoint to collect the overall health status of your TorchServe server.
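You can query this same endpoint yourself to see the payload the check relies on; for example, assuming the Inference API is reachable from your shell:

```shell
# A healthy TorchServe instance responds with {"status": "Healthy"}.
curl "http://<TORCHSERVE_ADDRESS>:8080/ping"
```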
You can collect metrics related to the models that are currently running in your TorchServe server using the Management API. Configuration options for the Management API can be found in the configuration file under the `TorchServe Management API endpoint configuration` section. The minimal configuration only requires the `management_api_url` option:
```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
```
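To preview which models the integration will discover, you can query the Management API directly; a hedged example, assuming the default Management API port from the configuration above:

```shell
# Lists the models currently registered with TorchServe,
# for example: {"models": [{"modelName": "...", "modelUrl": "..."}]}
curl "http://<TORCHSERVE_ADDRESS>:8081/models"
```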
By default, the integration collects data from every model, up to 100 models. This can be modified using the `limit`, `include`, and `exclude` options. For example:
```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    limit: 25
    include:
      - my_model.*
```
This configuration only collects metrics for model names that match the `my_model.*` regular expression, up to 25 models.
You can also exclude some models:
```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    exclude:
      - test.*
```
This configuration collects metrics for every model name that does not match the `test.*` regular expression, up to 100 models.
By default, the integration retrieves the full list of models every time the check runs. You can cache this list by using the `interval` option for increased performance of this check.
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections:
```yaml
init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
    # Also collect your own TorchServe metrics
    extra_metrics:
      - my_custom_torchserve_metric
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    # Include all the model names that match this regex
    include:
      - my_models.*
    # But exclude all the ones that finish with `-test`
    exclude:
      - .*-test
    # Refresh the list of models only every hour
    interval: 3600
```
Restart the Agent after modifying the configuration.
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as a Docker label inside `docker-compose.yml`:
```yaml
labels:
  com.datadoghq.ad.checks: '{"torchserve":{"instances":[{"openmetrics_endpoint":"http://%%host%%:8082/metrics","extra_metrics":["my_custom_torchserve_metric"]},{"inference_api_url":"http://%%host%%:8080"},{"management_api_url":"http://%%host%%:8081","include":["my_models.*"],"exclude":[".*-test"],"interval":3600}]}}'
```
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as Kubernetes annotations on your TorchServe pods:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/torchserve.checks: |-
      {
        "torchserve": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8082/metrics",
              "extra_metrics": [
                "my_custom_torchserve_metric"
              ]
            },
            {
              "inference_api_url": "http://%%host%%:8080"
            },
            {
              "management_api_url": "http://%%host%%:8081",
              "include": [
                ".*"
              ],
              "exclude": [
                ".*-test"
              ],
              "interval": 3600
            }
          ]
        }
      }
  # (...)
spec:
  containers:
    - name: 'torchserve'
# (...)
```
Run the Agent's status subcommand and look for `torchserve` under the Checks section.
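For example, on a Linux host:

```shell
# The exact invocation varies by platform; on Linux the Agent ships with this CLI.
sudo datadog-agent status
```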
See metadata.csv for a list of metrics provided by this integration.
Metrics are prefixed using the API they are coming from:
- `torchserve.openmetrics.*` for metrics coming from the OpenMetrics endpoint.
- `torchserve.inference_api.*` for metrics coming from the Inference API.
- `torchserve.management_api.*` for metrics coming from the Management API.
The TorchServe integration includes three events using the Management API:

- `torchserve.management_api.model_added`: This event fires when a new model has been added.
- `torchserve.management_api.model_removed`: This event fires when a model has been removed.
- `torchserve.management_api.default_version_changed`: This event fires when a default version has been set for a given model.
See service_checks.json for a list of service checks provided by this integration.
The TorchServe integration can collect logs from the TorchServe service and forward them to Datadog.
1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your `datadog.yaml` file:

   ```yaml
   logs_enabled: true
   ```

2. Uncomment and edit the logs configuration block in your `torchserve.d/conf.yaml` file. Here's an example:

   ```yaml
   logs:
     - type: file
       path: /var/log/torchserve/model_log.log
       source: torchserve
       service: torchserve
     - type: file
       path: /var/log/torchserve/ts_log.log
       source: torchserve
       service: torchserve
   ```
See the example configuration file for details on how to collect all logs.
For more information about the logging configuration with TorchServe, see the official TorchServe documentation.
Need help? Contact Datadog support.