@manoj-freyr manoj-freyr commented Nov 19, 2025

device-metrics-exporter is a publicly available package that provides Prometheus exporter capability. This PR adds it to the nodes so that device metrics can be exported and collected by a central node, with Grafana integration using that Prometheus data source. Tested locally for deployment using

./utils/deploy_monitoring_stack.sh
which deploys the three utilities (Device Metrics Exporter, Prometheus, and Grafana) to the nodes.
After this, any other CVS tests can be run.

Device Metrics Exporter runs 24/7 exposing GPU metrics
Prometheus scrapes metrics every 15 seconds automatically
Tests run independently - metrics collected in background
After a test, you can query Prometheus to see what happened during the test
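
To look at what happened during a test, one option is a Prometheus range query over the test's time window. The sketch below is not part of this PR; it only builds the URL for Prometheus's standard `/api/v1/query_range` endpoint. The helper name `build_range_query_url`, the default base URL `http://localhost:9090`, and the 15-second step (chosen to match the scrape interval) are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumption: Prometheus is reachable here; adjust for the central node.
PROMETHEUS_URL = "http://localhost:9090"


def build_range_query_url(metric, start_ts, end_ts, step_s=15,
                          base_url=PROMETHEUS_URL):
    """Build a /api/v1/query_range URL covering [start_ts, end_ts].

    start_ts and end_ts are Unix timestamps in seconds. step_s defaults
    to 15 to line up with the 15-second scrape interval.
    """
    if end_ts < start_ts:
        raise ValueError("end_ts must not be earlier than start_ts")
    params = urlencode({
        "query": metric,
        "start": f"{start_ts:.3f}",
        "end": f"{end_ts:.3f}",
        "step": f"{step_s}s",
    })
    return f"{base_url}/api/v1/query_range?{params}"


if __name__ == "__main__":
    import time

    test_start = time.time()
    # ... run the test here ...
    test_end = time.time()
    print(build_range_query_url("gpu_edge_temperature", test_start, test_end))
```

Recording `time.time()` just before and after a test run gives the two timestamps; the resulting URL can be fetched with curl or opened in a browser.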

./utils/deploy_monitoring_stack.sh
============================================
CVS Monitoring Stack Deployment
============================================

Working Directory: /etc/cvs
Cluster File: ./input/cluster_file/local_test_cluster.json
Monitoring Config: ./input/config_file/monitoring/monitoring_config.json
Prometheus Version: 2.55.0
Grafana Version: 10.4.1
Exporter Version: v1.4.0

After a successful deployment, metrics can be queried with curl or through the UI:

curl -s 'http://localhost:9090/api/v1/query?query=gpu_edge_temperature' | jq -r '.data.result[] | "GPU \(.metric.gpu_id) on \(.metric.hostname): \(.value[1])°C"'

or by browsing to http://localhost:9090/
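
The same instant query can be consumed from Python, for example inside a test. This is a minimal sketch, not code from this PR: `fetch_instant` and `format_gpu_temperatures` are hypothetical helper names, the `gpu_id` and `hostname` labels are taken from the jq expression in this description, and the sample payload values are made up purely to show the response shape of Prometheus's `/api/v1/query` endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_instant(query, base_url="http://localhost:9090"):
    """Run an instant query against Prometheus and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?{urlencode({'query': query})}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def format_gpu_temperatures(payload):
    """Turn an instant-query response into one readable line per series."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    lines = []
    for sample in payload["data"]["result"]:
        metric = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        gpu = metric.get("gpu_id", "?")
        host = metric.get("hostname", "?")
        lines.append(f"GPU {gpu} on {host}: {value}°C")
    return lines


# Illustrative payload only; real label values depend on the exporter.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"gpu_id": "0", "hostname": "node-a"},
                "value": [1700000000.0, "45"],
            }
        ],
    },
}

if __name__ == "__main__":
    for line in format_gpu_temperatures(SAMPLE_PAYLOAD):
        print(line)
```

Against a live deployment, `format_gpu_temperatures(fetch_instant("gpu_edge_temperature"))` would produce the same kind of lines as the jq pipeline.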

Running the deployment script with an explicit cluster file and monitoring config:

./utils/deploy_monitoring_stack.sh input/cluster_file/sample_monitor_cluster.json input/config_file/monitoring/monitoring_config.json


and

pytest tests/monitoring/install_device_metrics_exporter.py --cluster_file=input/cluster_file/sample_monitor_cluster.json --config_file=input/config_file/monitoring/monitoring_config.json -v


Fixing the Prometheus config to scrape the proper targets
Contributor
It's the same as the cluster.json template, right? The user needs to fill in the nodes and key details.

Maybe we can rename it to sample_monitor_cluster.json

Contributor
I think you added this file for testing with localhost, which may not be needed in the actual case.

Maybe we can rename it to "sample_localhost_monitor_clustor.json"

@manoj-freyr manoj-freyr requested a review from frepaul November 20, 2025 09:24
@manoj-freyr manoj-freyr requested a review from solaiys November 20, 2025 10:56