This change introduces Prometheus exporter integration #21
base: main
device-metrics-exporter is a publicly available package that provides Prometheus exporter capability. This change adds it to the nodes to export device metrics, which a central node can then collect, and adds Grafana integration that uses the same Prometheus instance as its data source. Deployment was tested locally with ./utils/deploy_monitoring_stack.sh, which deploys these three utilities (the exporter, Prometheus, and Grafana) to the nodes. After this, any other cvs tests can be run.
- Device Metrics Exporter runs 24/7, exposing GPU metrics
- Prometheus scrapes metrics every 15 seconds automatically
- Tests run independently; metrics are collected in the background
- After a test, you can query Prometheus to see what happened during the test
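Querying Prometheus after a test, as described above, could be done with Prometheus's range-query endpoint (/api/v1/query_range). The following is only a minimal sketch: the helper name range_query_url, the localhost address, and the example timestamps are assumptions for illustration and are not part of this PR.

```python
from urllib.parse import urlencode


def range_query_url(base_url, query, start_ts, end_ts, step_s=15):
    """Build a Prometheus range-query URL covering [start_ts, end_ts].

    start_ts and end_ts are Unix timestamps in seconds; step_s defaults to
    15 to match the 15-second scrape interval mentioned above.
    """
    params = urlencode({
        "query": query,
        "start": start_ts,
        "end": end_ts,
        "step": f"{step_s}s",
    })
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Hypothetical usage: the timestamps would normally be recorded with
# time.time() just before and just after the test run.
url = range_query_url(
    "http://localhost:9090", "gpu_edge_temperature", 1700000000, 1700000600
)
```

The resulting URL can be fetched with curl or any HTTP client; each returned series carries a list of [timestamp, value] samples for the test window.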
Fixing the Prometheus config to scrape the proper targets
It's the same as the cluster.json template, right? The user needs to fill in the nodes and key details.
Maybe we can rename it to sample_monitor_cluster.json.
I think you added this file for testing with the local host, which may not be needed in the actual case.
Maybe we can rename it to "sample_localhost_monitor_clustor.json".
device-metrics-exporter is a publicly available package that provides Prometheus exporter capability. This change adds it to the nodes to export device metrics, which a central node can then collect, and adds Grafana integration that uses the same Prometheus instance as its data source. Deployment was tested locally with ./utils/deploy_monitoring_stack.sh, which deploys these three utilities to the nodes.
After this, any other cvs tests can be run.
Working Directory: /etc/cvs
Cluster File: ./input/cluster_file/local_test_cluster.json
Monitoring Config: ./input/config_file/monitoring/monitoring_config.json
Prometheus Version: 2.55.0
Grafana Version: 10.4.1
Exporter Version: v1.4.0
After a successful deployment, metrics can be queried either with curl or through the UI:
curl -s 'http://localhost:9090/api/v1/query?query=gpu_edge_temperature' | jq -r '.data.result[] | "GPU \(.metric.gpu_id) on \(.metric.hostname): \(.value[1])°C"'

or by browsing to http://localhost:9090/
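For use inside a test or script, the same formatting as the jq filter above could be done in Python. This is a sketch under assumptions: the function name format_gpu_temps is hypothetical, and SAMPLE_RESPONSE is a made-up payload that only follows the shape of a Prometheus instant-query response; it is not output captured from this deployment.

```python
import json

# Made-up sample in the shape of a Prometheus instant-query response
# (status, data.result[].metric, data.result[].value = [timestamp, "value"]).
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "gpu_edge_temperature",
                    "gpu_id": "0",
                    "hostname": "node-a",
                },
                "value": [1700000000.0, "41"],
            }
        ],
    },
})


def format_gpu_temps(response_text):
    """Return one 'GPU <id> on <host>: <value>°C' line per returned series."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {body.get('error')}")
    lines = []
    for series in body["data"]["result"]:
        metric = series["metric"]
        _timestamp, value = series["value"]  # value arrives as a string
        lines.append(
            f"GPU {metric.get('gpu_id')} on {metric.get('hostname')}: {value}°C"
        )
    return lines


formatted = format_gpu_temps(SAMPLE_RESPONSE)
```

The response text would normally come from an HTTP GET on the /api/v1/query URL used in the curl command.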
pytest tests/monitoring/install_device_metrics_exporter.py --cluster_file=input/cluster_file/sample_monitor_cluster.json --config_file=input/config_file/monitoring/monitoring_config.json v