feat(ci): add HPA pod autoscaling validation to inference workflow by dims · Pull Request #163 · NVIDIA/aicr

dims · 2026-02-20T03:40:56Z

Summary

Add pod autoscaling (HPA) validation to the H100 inference workflow, covering CNCF AI conformance requirement pod_autoscaling
Create HPA manifest targeting vLLM worker with gpu_utilization custom metric from prometheus-adapter
Use maxReplicas=1 to validate the metrics pipeline (DCGM exporter → Prometheus → prometheus-adapter → HPA) without triggering actual scaling
Add HPA debug diagnostics on failure
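
The HPA described above could look roughly like the following. This is a minimal sketch, not the manifest from this PR: the object name `vllm-worker-hpa`, the target kind and name, the `Pods` metric type, and the `averageValue` of `80` are all hypothetical. Only the `gpu_utilization` metric name and `maxReplicas=1` come from the summary.

```python
import json


def build_hpa_manifest(name, target_kind, target_name,
                       metric="gpu_utilization", target_average="80"):
    """Return an autoscaling/v2 HPA manifest as a plain dict.

    minReplicas == maxReplicas == 1 means the HPA still has to fetch the
    metric, but it can never change the replica count.
    """
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # Assumption: the scale target is addressable as an apps/v1 workload.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": target_kind,
                "name": target_name,
            },
            "minReplicas": 1,
            "maxReplicas": 1,
            "metrics": [
                {
                    # Assumption: the metric is exposed as a per-pod custom metric.
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": metric},
                        "target": {
                            "type": "AverageValue",
                            "averageValue": target_average,
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    manifest = build_hpa_manifest("vllm-worker-hpa", "Deployment", "vllm-worker")
    print(json.dumps(manifest, indent=2))
```

Since `kubectl apply -f` accepts JSON as well as YAML, the printed output could be applied directly in a workflow step.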

Test plan

Verify HPA reads gpu_utilization from prometheus-adapter (AbleToScale=True, currentMetrics non-empty)
Verify existing inference test steps are unaffected
Verify HPA cleanup runs unconditionally
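
The first check in the test plan can be sketched as a small predicate over the HPA object's JSON, for example the output of `kubectl get hpa <name> -o json`. The function name and the sample objects are hypothetical; the `status.conditions` and `status.currentMetrics` fields are part of the `autoscaling/v2` HPA status.

```python
def hpa_reads_metric(hpa):
    """Return True if the HPA reports AbleToScale=True and has current metrics."""
    status = hpa.get("status") or {}

    # AbleToScale must be present with status "True" (a string in the API).
    conditions = status.get("conditions") or []
    able_to_scale = any(
        c.get("type") == "AbleToScale" and c.get("status") == "True"
        for c in conditions
    )

    # currentMetrics can be missing, null, or contain empty entries
    # until the metrics pipeline has delivered a value.
    current = [m for m in (status.get("currentMetrics") or []) if m]

    return able_to_scale and len(current) > 0


if __name__ == "__main__":
    # Hypothetical status, shaped like an autoscaling/v2 HPA object.
    sample = {
        "status": {
            "conditions": [{"type": "AbleToScale", "status": "True"}],
            "currentMetrics": [
                {
                    "type": "Pods",
                    "pods": {
                        "metric": {"name": "gpu_utilization"},
                        "current": {"averageValue": "12"},
                    },
                }
            ],
        }
    }
    print(hpa_reads_metric(sample))
```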

Validate the custom metrics pipeline (DCGM → Prometheus → prometheus-adapter → custom metrics API) that HPA consumes for GPU-aware pod autoscaling. Queries the custom.metrics.k8s.io API directly for gpu_utilization, gpu_memory_used, and gpu_power_usage metrics.

DCGM exporter runs as a DaemonSet in gpu-operator namespace, so Prometheus labels GPU metrics with namespace=gpu-operator. We query that namespace to validate the full metrics pipeline.

Dynamo uses PodCliqueSets (not Deployments), so we validate the metrics API availability rather than creating an HPA object.

This covers CNCF AI Conformance requirement #8b (pod autoscaling).

Signed-off-by: Davanum Srinivas <dsrinivas@nvidia.com>
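
A direct query against the custom metrics API, as the commit message describes, might be built like this. It is a sketch under stated assumptions: the `v1beta1` API version, the choice of `pods` as the resource the metrics are attached to, and the helper names are all hypothetical. The metric names and the `gpu-operator` namespace come from the commit message.

```python
import json

# Metric names listed in the commit message.
GPU_METRICS = ("gpu_utilization", "gpu_memory_used", "gpu_power_usage")


def custom_metric_path(metric, namespace="gpu-operator", api_version="v1beta1"):
    """Build a raw API path for a namespaced custom metric across all pods.

    Assumption: prometheus-adapter associates the metric with pods; the
    '*' selects every pod in the namespace.
    """
    return (
        f"/apis/custom.metrics.k8s.io/{api_version}"
        f"/namespaces/{namespace}/pods/*/{metric}"
    )


def has_samples(response_text):
    """Return True if a MetricValueList response contains at least one item."""
    body = json.loads(response_text)
    return len(body.get("items") or []) > 0


if __name__ == "__main__":
    for name in GPU_METRICS:
        # Each path could be passed to: kubectl get --raw "<path>"
        print(custom_metric_path(name))
```

Treating an empty `items` list as a failure is a design choice: the API can be registered and reachable while the adapter still has no series for the metric, which would leave an HPA without data.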

dims requested review from a team as code owners February 20, 2026 03:40

github-actions bot added area/ci area/docs size/M labels Feb 20, 2026

dims force-pushed the add-hpa-gpu-metrics-test branch from 9beb15f to 8fbe5ab Compare February 20, 2026 04:14

github-actions bot removed the area/docs label Feb 20, 2026

dims force-pushed the add-hpa-gpu-metrics-test branch from 8fbe5ab to 46148df Compare February 20, 2026 11:58

dims merged commit e55f9a2 into NVIDIA:main Feb 20, 2026
13 checks passed

dims merged 1 commit into NVIDIA:main from dims:add-hpa-gpu-metrics-test
