feat: Add Enhanced CPU Metrics for Serverless #47421
gh-worker-dd-mergequeue-cf854d[bot] merged 78 commits into main
Conversation
Files inventory check summary
File checks results against ancestor 25b438f9. Results for datadog-agent_7.79.0~devel.git.136.2e2ef9f.pipeline.104369606-1_amd64.deb: No change detected.
Static quality checks
✅ Please find below the results from static quality gates.
Successful checks
26 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: 25f0b9c
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.76 | [-1.29, +4.81] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.76 | [-1.29, +4.81] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.69 | [+0.45, +0.93] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | +0.16 | [+0.10, +0.21] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_logs | memory utilization | +0.10 | [-0.00, +0.20] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.06 | [-0.08, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.04 | [-0.49, +0.57] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | +0.02 | [-0.04, +0.08] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.00 | [-0.21, +0.21] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | +0.00 | [-0.10, +0.11] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.20, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.09, +0.05] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.03 | [-0.09, +0.04] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.05 | [-0.44, +0.35] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.06 | [-0.49, +0.37] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.06 | [-0.13, +0.00] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.14 | [-0.18, -0.11] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | -0.19 | [-0.35, -0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.26 | [-0.49, -0.04] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.30 | [-0.46, -0.13] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.30 | [-0.44, -0.16] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | -0.45 | [-0.62, -0.28] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -1.25 | [-1.40, -1.09] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -2.38 | [-4.00, -0.76] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 711 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 272.23MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 685 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.23GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 174.60MiB ≤ 175MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 = 3 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 491.36MiB ≤ 550MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 204.16MiB ≤ 220MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 357.51 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 411.91MiB ≤ 475MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
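As a worked example using the table above: docker_containers_cpu shows a Δ mean % of +1.76 with a CI of [-1.29, +4.81]; the effect is below the 5.00% tolerance and the interval contains zero, so it is not reported as a regression.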
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
…metrics, general cleanup
…s for cpu usage, remove unused mode enum, remove todos
    val := fnMetadata(httpClient, url)
    metaChan <- keyVal{baseKey, val}
    if isCloudRun {
        if cloudRunType == CloudRunJob {
Nit, can be later: These two conditional branches aren't covered by unit tests, looks like?
Ok, from the LLM comparison reviews comes this chunk, which seems worth putting a test on:
@Lewis-E On this point I believe the case is covered. I added a test to validate it here: 8d11b92. Let me know if there is another case that I'm missing!
apiarian-datadog
left a comment
looks great! some nits but we can skip them or note them for a future refactor or something if you're not going to be making any more changes
    appServiceLegacyShutdownMetricName = "azure.appservice.enhanced.shutdown"
    appServiceLegacyStartMetricName = "azure.appservice.enhanced.cold_start"

    appServiceUsageMetricName = "instance"
nit: this is a suffix, isn't it?
    func (c *CloudRun) GetTags() map[string]string {
        isCloudRun := c.spanNamespace == cloudRunService
        tags := metadataHelperFunc(GetDefaultConfig(), isCloudRun)
        isCloudRun := c.spanNamespace == cloudRunServiceTagPrefix
(not for this pr, but...) humph. what a funky bit of sneaky coupling =/
    metaChan <- keyVal{baseKey, val}
    if isCloudRun {
        metaChan <- keyVal{cloudRunService + baseKey, val}
        if cloudRunType == CloudRunJob {
nit: why not a switch here? https://gobyexample.com/enums
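For illustration, a rough sketch of the switch-based shape being suggested; the type, constants, prefix parameters, and the body of the job case are assumptions that mirror the diff above, not the agent's actual code.

```go
// Sketch only: names and the job-case body are hypothetical stand-ins.
package sketch

type CloudRunType int

const (
    CloudRunService CloudRunType = iota
    CloudRunFunction
    CloudRunJob
)

type keyVal struct{ key, val string }

func sendMetadata(metaChan chan<- keyVal, baseKey, val, servicePrefix, jobPrefix string, cloudRunType CloudRunType, isCloudRun bool) {
    metaChan <- keyVal{baseKey, val}
    if !isCloudRun {
        return
    }
    // One switch instead of nested ifs: each Cloud Run variant gets its own case.
    switch cloudRunType {
    case CloudRunJob:
        metaChan <- keyVal{servicePrefix + baseKey, val}
        metaChan <- keyVal{jobPrefix + baseKey, val} // assumed job behavior; the diff truncates here
    default:
        metaChan <- keyVal{servicePrefix + baseKey, val}
    }
}
```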
    }

    // CloudRunType identifies the GCP Cloud Run variant.
    type CloudRunType int
nit: since we print this out in some panic situations, might be worth using a string for this?
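A minimal sketch of the string-backed alternative being suggested, with assumed values:

```go
// Sketch only: a string-backed CloudRunType so panic and log output prints a
// readable variant name instead of an opaque integer; the values are assumed.
package sketch

type CloudRunType string

const (
    CloudRunService  CloudRunType = "service"
    CloudRunFunction CloudRunType = "function"
    CloudRunJob      CloudRunType = "job"
)
```

Keeping the int type and generating a `String()` method with the `stringer` tool would be another common way to get readable output in panics and logs.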
cmd/serverless-init/main.go
Outdated
        enhancedMetricsCollector.Stop()
    }

    metricAgent.WaitForPendingSamples() // wait for worker to consume it
nit: are we also waiting for the metrics agent to consume the enhanced metrics? in this case is this "wait for worker to consume them"?
    config.SetDefault("serverless.enabled", false)
    config.BindEnvAndSetDefault("serverless.logs_enabled", true)
    config.BindEnvAndSetDefault("enhanced_metrics", true)
    config.BindEnvAndSetDefault("enhanced_metrics", true, "DD_ENHANCED_METRICS_ENABLED")
nit: is this DD_SERVERLESS_INIT_ENHANCED_METRICS_ENABLED or something like that? or is a generic enhanced metrics flag okay?
There aren't enhanced metrics in non-serverless so I don't think we need to add SERVERLESS in the name to explicitly say that this setting is related to Serverless.
For what it's worth, Lambda uses DD_ENHANCED_METRICS without SERVERLESS in the name. I added _ENABLED to the environment variable for Serverless Init so the naming is consistent with other environment variables like DD_TRACE_ENABLED or DD_LOGS_ENABLED.
…d to be consumed at shutdown
/merge
View all feedbacks in Devflow UI.
### What does this PR do?

Adds enhanced CPU metrics for Azure App Service Web Apps, Azure Container Apps, and Google Cloud Run (Services, Functions, Jobs).

### Motivation

https://datadoghq.atlassian.net/browse/SVLS-8356

### Describe how you validated your changes

Deployed to Azure Web Apps, Azure Container Apps, and Google Cloud Run (Services, Functions, Jobs) using both in-container and sidecar modes.

### Additional Notes

#### Implementation Details

- When the collector is started, metrics are collected immediately and then at a collection interval of 3 seconds. The metrics aggregator has a bucket size of 10 seconds and the metrics agent flushes those buckets every 3 seconds. If there is a shutdown event, there is a final collection and flush of metrics with a partial interval. (A rough sketch of this collection loop follows this description.)
- CPU metrics are collected from cgroup v1 system files.
- Because the CPU metrics will not have a tag with a unique identifier for the instance they are sent from, they need to be sent as distribution metrics in order to avoid aggregation issues.
- The usage metrics will have a unique identifier for instance or replica, so these metrics will be submitted as gauge metrics to reduce internal costs.
- A new `sidecar: <true|false>` tag will be added to enhanced metrics to identify whether the metrics came from a sidecar container.
- The limited tag set consists of configured tags (`DD_TAGS`, `DD_EXTRA_TAGS`), base tags (`env`, `service`, `version`), cloud-service-specific tags, `datadog_init_version`, `datadog_sidecar_version`, and `sidecar`.
- Enhanced metrics are enabled by default and can be disabled with `DD_ENHANCED_METRICS_ENABLED=false`.
- Enhanced metrics added as part of this PR - these metrics have an updated prefix in order to align with the metric prefix of the corresponding cloud integration metrics. They also have a limited tag set instead of all of the available tags that are added to traces, logs, and dogstatsd metrics. High cardinality tags will only be added to the `cpu.limit` metric.
  * `azure.app_services.enhanced.cpu.usage`
  * `azure.app_services.enhanced.cpu.limit`
  * `azure.app_services.enhanced.instance`
  * `azure.app_containerapp.enhanced.cpu.usage`
  * `azure.app_containerapp.enhanced.cpu.limit`
  * `azure.app_containerapp.enhanced.replica`
  * `gcp.run.container.enhanced.cpu.usage`
  * `gcp.run.container.enhanced.cpu.limit`
  * `gcp.run.container.enhanced.instance`
- Legacy enhanced metrics - these metrics will continue to be sent with the same full tag set for backwards compatibility. They can be removed in a future major version update.
  * `azure.appservice.enhanced.cold_start`
  * `azure.appservice.enhanced.shutdown`
  * `azure.containerapp.enhanced.cold_start`
  * `azure.containerapp.enhanced.shutdown`
  * `gcp.run.enhanced.cold_start`
  * `gcp.run.enhanced.shutdown`
- Updated enhanced metrics - these metrics keep the same namespace but are updated to have a more limited tag set, consistent with the other enhanced metrics in this PR. Google Cloud Run Jobs is still in preview.
  * `gcp.run.job.enhanced.task.duration`
  * `gcp.run.job.enhanced.task.started`
  * `gcp.run.job.enhanced.task.ended`

#### Other updates and fixes

- Assign the correct namespace for Google Cloud Run Jobs. Currently some tags are assigned under the `gcrfx` namespace. ![image](https://github.com/user-attachments/assets/03b2b060-3c4a-4b45-81c6-b41c9ed73de5) As a result, add `gcrj.container_id` as a high cardinality tag so it is excluded from metrics.
- `cloudService.GetStartMetricName()` refactored to `cloudService.AddStartMetric(metricAgent)` in order to accommodate prefix changes.
- Remove `cmd/serverless-init/metric/metric.go` and move it into `pkg/serverless/metrics/metric.go`.
  * Add separate methods for adding a legacy enhanced metric (full tag set), a new enhanced metric (limited tag set), and a new enhanced metric with high cardinality tags (limited tag set, including high cardinality tags).
- Refactor `MergeWithOverwrite` to accept multiple tag maps.
- Refactor `serverlessInitTag.GetBaseTagsMapWithMetadata` to `serverlessInitTag.GetBaseTagsMap` so it only handles setting base tags (`service`, `env`, `version`).
- Set `use_v2_api.series` to `true` for the metric agent so the metric source is submitted correctly for the gauge usage metric.

Co-authored-by: duncan.harvey <duncan.harvey@datadoghq.com>
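The following is a rough, self-contained sketch of the collection loop described under Implementation Details: collect once immediately, then every interval, with one final partial-interval collection on shutdown, reading CPU usage from a cgroup v1 file and reporting it as a distribution. The cgroup path assumes an unnested v1 hierarchy, and `submitDistribution` is a hypothetical stand-in for the agent's enhanced-metric helpers, not the actual API.

```go
// Sketch only: illustrates the described collection cadence, not the agent's code.
package sketch

import (
    "os"
    "strconv"
    "strings"
    "time"
)

// cpuUsageFile is the cgroup v1 cumulative CPU time counter, in nanoseconds.
const cpuUsageFile = "/sys/fs/cgroup/cpuacct/cpuacct.usage"

func readCPUUsageSeconds() (float64, error) {
    raw, err := os.ReadFile(cpuUsageFile)
    if err != nil {
        return 0, err
    }
    ns, err := strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
    if err != nil {
        return 0, err
    }
    return ns / 1e9, nil
}

// RunCollector collects once immediately, then on every tick of interval, and
// performs one final collection when stop is closed (the partial interval at
// shutdown described in the PR).
func RunCollector(interval time.Duration, stop <-chan struct{}, submitDistribution func(name string, value float64)) {
    var prev float64
    collect := func() {
        usage, err := readCPUUsageSeconds()
        if err != nil {
            return
        }
        // Report CPU seconds consumed since the previous collection (the real
        // metric semantics may differ). A distribution avoids cross-instance
        // aggregation issues, since these points carry no unique instance tag.
        submitDistribution("gcp.run.container.enhanced.cpu.usage", usage-prev)
        prev = usage
    }
    collect()
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ticker.C:
            collect()
        case <-stop:
            collect()
            return
        }
    }
}
```

Called with a 3-second interval, this matches the cadence described above: the aggregator's 10-second buckets then receive several points per bucket, and closing `stop` triggers the final partial-interval collection before the agent flushes and exits.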