This change introduces using prometheus exporter integration #21

manoj-freyr · 2025-11-19T16:42:57Z

device-metrics-exporter is a publicly available package that provides Prometheus exporter capability. This change adds it to the nodes to export device metrics, which a central node can then collect, and adds Grafana integration that uses the same Prometheus data as its data source. Tested locally for deployment using

./utils/deploy_monitoring_stack.sh
which deploys these three utilities (Device Metrics Exporter, Prometheus, and Grafana) to the nodes.
After this, any other CVS tests can be run.

Device Metrics Exporter runs 24/7 exposing GPU metrics
Prometheus scrapes metrics every 15 seconds automatically
Tests run independently - metrics collected in background
After a test, you can query Prometheus to see what happened during the test
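
The post-test lookup described above can be sketched in Python against the Prometheus HTTP API's `query_range` endpoint. This is a minimal sketch, not part of this PR: the helper names (`build_range_query_url`, `fetch_series`, `peak_per_series`) are hypothetical, the base URL is assumed to be the `http://localhost:9090` address used elsewhere in this description, and the `hostname` and `gpu_id` labels are assumed to be present on the `gpu_edge_temperature` series.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at the address used in this PR description.
PROMETHEUS_URL = "http://localhost:9090"


def build_range_query_url(base_url, promql, start, end, step="15s"):
    """Build a /api/v1/query_range URL for a time window given in Unix seconds."""
    params = urllib.parse.urlencode(
        {"query": promql, "start": f"{start:.3f}", "end": f"{end:.3f}", "step": step}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def fetch_series(url, timeout=10):
    """Fetch a range query and return the list of series from the response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return payload["data"]["result"]


def peak_per_series(result):
    """Map (hostname, gpu_id) to the highest sample value seen in the window."""
    peaks = {}
    for series in result:
        labels = series.get("metric", {})
        key = (labels.get("hostname", "?"), labels.get("gpu_id", "?"))
        # Each sample is [unix_timestamp, "value-as-string"].
        samples = [float(value) for _, value in series.get("values", [])]
        if samples:
            peaks[key] = max(samples)
    return peaks
```

To use it, record `time.time()` before and after a test run, pass both timestamps to `build_range_query_url(PROMETHEUS_URL, "gpu_edge_temperature", start, end)`, and feed the result of `fetch_series` to `peak_per_series`. The default step of 15 seconds mirrors the scrape interval mentioned above.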

./utils/deploy_monitoring_stack.sh
============================================
CVS Monitoring Stack Deployment
============================================

Working Directory: /etc/cvs
Cluster File: ./input/cluster_file/local_test_cluster.json
Monitoring Config: ./input/config_file/monitoring/monitoring_config.json
Prometheus Version: 2.55.0
Grafana Version: 10.4.1
Exporter Version: v1.4.0

After a successful deployment, the metrics can be queried either with curl or through the UI:

curl -s 'http://localhost:9090/api/v1/query?query=gpu_edge_temperature' | jq -r '.data.result[] | "GPU \(.metric.gpu_id) on \(.metric.hostname): \(.value[1])°C"'

or via browsing at http://localhost:9090/
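
For scripted checks, the same instant query can be issued from Python using only the standard library. This is a rough sketch under the same assumptions as the curl command (Prometheus on `localhost:9090`, a `gpu_edge_temperature` metric carrying `gpu_id` and `hostname` labels); the function names `instant_query` and `format_result` are hypothetical and not part of this PR.

```python
import json
import urllib.parse
import urllib.request


def instant_query(promql, base_url="http://localhost:9090", timeout=10):
    """Run an instant query against /api/v1/query and return the result list."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    return payload["data"]["result"]


def format_result(result):
    """Render each series the way the jq filter does: one line per GPU."""
    lines = []
    for series in result:
        labels = series.get("metric", {})
        # An instant-vector sample is [unix_timestamp, "value-as-string"].
        _, value = series["value"]
        gpu = labels.get("gpu_id", "?")
        host = labels.get("hostname", "?")
        lines.append(f"GPU {gpu} on {host}: {value}°C")
    return lines
```

Calling `print("\n".join(format_result(instant_query("gpu_edge_temperature"))))` on a node where the stack is deployed should print one line per GPU, matching the jq output format.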

Running gives

./utils/deploy_monitoring_stack.sh input/cluster_file/sample_monitor_cluster.json input/config_file/monitoring/monitoring_config.json

and

pytest tests/monitoring/install_device_metrics_exporter.py --cluster_file=input/cluster_file/sample_monitor_cluster.json --config_file=input/config_file/monitoring/monitoring_config.json -v


fixing config for prometheus to scrape proper ones

conftest.py

solaiys · 2025-11-20T08:30:03Z

input/cluster_file/dummy_monitor_cluster.json

It's the same as the cluster.json template, right? The user needs to fill in the nodes and key details.

Maybe we can rename it to sample_monitor_cluster.json.

solaiys · 2025-11-20T08:33:28Z

input/cluster_file/local_test_cluster.json

I think you added this file for testing with the local host, which may not be needed in an actual deployment.

Maybe we can rename it to "sample_localhost_monitor_clustor.json".

input/config_file/monitoring/monitoring_config.json


manoj-freyr requested review from solaiys and venksrin09 November 19, 2025 16:42

using amdgpu filters everything out, metrics have gpu_* names

7b4aa82

fixing config for prometheus to scrape proper ones

solaiys reviewed Nov 20, 2025

conftest.py (outdated, resolved)

solaiys reviewed Nov 20, 2025

input/config_file/monitoring/monitoring_config.json (resolved)

solaiys reviewed Nov 20, 2025

input/config_file/monitoring/monitoring_config.json (resolved)

manoj-freyr requested a review from frepaul November 20, 2025 09:24

Manoj S K and others added 2 commits November 20, 2025 02:49

remove new CLI opts and cleanup

f54747a

Merge branch 'main' into integrate-exporter

5183fd9

manoj-freyr requested a review from solaiys November 20, 2025 10:56

Manoj S K added 2 commits November 20, 2025 03:57

stop prometheus before checking and running the flow

aaac291

Merge branch 'integrate-exporter' of https://github.com/ROCm/cvs into…

4a7ec42

… integrate-exporter

manoj-freyr force-pushed the integrate-exporter branch from e65e54b to 043fd1f Compare November 20, 2025 12:12

keep opts as in the original, no changes here

4a6ae0e

manoj-freyr force-pushed the integrate-exporter branch from 043fd1f to 4a6ae0e Compare November 20, 2025 12:13

Manoj S K added 2 commits November 20, 2025 23:07

check mgmt and install servers only there

d113c5d

fixes after cluster tests

d60445f
