# SLURM Cluster DeepLM Dashboards Stack

Real-time monitoring dashboards for HPC/SLURM GPU clusters. Provides Grafana dashboards, a Prometheus metrics exporter, and a Cassandra-backed data viewer for tracking job performance, power consumption, and GPU utilization across your cluster.
## Architecture

```
+------------------+            +-------------------+
| Compute Nodes    |            |                   |
|  - prologue.sh --+----------->|  Flask Metrics    |
|  - epilogue.sh --+----------->|  API (:5000)      |
|  - collectors ---+----------->|         |         |
+------------------+            |         v         |
                                |  Cassandra (DB)   |
+------------------+            |         |         |
| SLURM Controller |            |         v         |
|  - squeue -------+----------->|  /metrics         |
|  - sacct --------+----------->|  (Prometheus fmt) |
+------------------+            |         |         |
                                |         v         |
+------------------+            |  Prometheus       |
| BCM (optional)   |            |  (:9090)          |
|  - REST API -----+----------->|         |         |
+------------------+            |         v         |
                                |  Grafana (:3000)  |
                                |  5 dashboards     |
                                +-------------------+
```
## Dashboards

| Dashboard | Description |
|---|---|
| Job Insights | Per-job CPU/GPU utilization, memory, priority, power consumption |
| System Overview | Cluster-wide power, GPU temperature, fan speed, CPU/memory per node |
| Live Jobs | Real-time active job monitoring with 5-second refresh |
| Historical Jobs | Job duration analysis, completion rates, CPU hours by user |
| Checkpoint Analysis | Sync vs async checkpoint strategy comparison (stall time, overhead) |
## Quick Start (Docker Compose)

```bash
cp .env.example .env
# Edit .env with your Cassandra host, compute nodes, etc.
docker compose up -d
```

Services will be available at:
- Grafana: http://localhost:3000 (admin / changeme)
- Prometheus: http://localhost:9090
- Metrics API: http://localhost:5000/metrics
- Cassandra Viewer: http://localhost:5002
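To sanity-check the exporter, fetch `http://localhost:5000/metrics` and inspect the exposition-format output. Below is a minimal, illustrative parser for simple `name{labels} value` lines — the helper and the metric names in the sample are made up for this sketch, not the exporter's actual API:

```python
# Minimal sanity check for Prometheus exposition-format text.
# parse_metrics() is an illustrative helper, not part of the project;
# it only handles simple 'name{labels} value' lines.
def parse_metrics(text: str) -> dict:
    """Parse exposition-format lines into {metric_name: float_value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        # Drop any {label="..."} suffix to get the bare metric name
        name = name_part.split("{", 1)[0]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

# Hypothetical sample output (metric names are invented):
sample = """\
# HELP deeplm_job_gpu_util GPU utilization per job
deeplm_job_gpu_util{job_id="1234"} 87.5
deeplm_node_power_watts{node="gpu01"} 412
"""
print(parse_metrics(sample))
# {'deeplm_job_gpu_util': 87.5, 'deeplm_node_power_watts': 412.0}
```

In practice you would pair this with any HTTP client; Prometheus itself scrapes the same endpoint, so a quick parse like this mainly confirms the exporter is serving well-formed lines.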
## Local Development (without Docker)

```bash
pip install .

# Start the metrics API
python -m metrics_exporter.app

# Start the Cassandra viewer (optional)
python -m cassandra_viewer.app
```

## Configuration

All configuration is via environment variables. Copy `.env.example` to `.env` and customize:
| Variable | Default | Description |
|---|---|---|
| `DEEPLM_CASSANDRA_HOST` | `localhost` | Cassandra contact point |
| `DEEPLM_CASSANDRA_PORT` | `9042` | Cassandra CQL native port |
| `DEEPLM_CASSANDRA_KEYSPACE` | `cassandradb` | Cassandra keyspace |
| `DEEPLM_COMPUTE_NODES` | (empty) | Comma-separated compute node names |
| `DEEPLM_BCM_ENABLED` | `false` | Enable NVIDIA BCM power metrics |
| `DEEPLM_BCM_HOST` | (empty) | BCM REST API URL (e.g., `https://headnode:8081`) |
| `DEEPLM_BCM_USERNAME` | (empty) | BCM API username |
| `DEEPLM_BCM_PASSWORD` | (empty) | BCM API password |
| `DEEPLM_GPU_POWER_LIMIT` | `300` | GPU TDP in watts (for estimation fallback) |
| `GF_SECURITY_ADMIN_PASSWORD` | `changeme` | Grafana admin password |
See .env.example for the full list.
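The `config/` module reads these variables from the environment. A minimal sketch of the pattern with the defaults from the table above — the dictionary keys are illustrative, not the project's actual field names:

```python
import os

# Env-based configuration sketch using the documented defaults.
# Key names here are illustrative, not the project's actual API.
def load_config() -> dict:
    return {
        "cassandra_host": os.environ.get("DEEPLM_CASSANDRA_HOST", "localhost"),
        "cassandra_port": int(os.environ.get("DEEPLM_CASSANDRA_PORT", "9042")),
        "keyspace": os.environ.get("DEEPLM_CASSANDRA_KEYSPACE", "cassandradb"),
        # Comma-separated list -> ["gpu01", "gpu02"]; empty string -> []
        "compute_nodes": [n for n in os.environ.get("DEEPLM_COMPUTE_NODES", "").split(",") if n],
        "bcm_enabled": os.environ.get("DEEPLM_BCM_ENABLED", "false").lower() == "true",
        "gpu_power_limit_w": float(os.environ.get("DEEPLM_GPU_POWER_LIMIT", "300")),
    }

os.environ["DEEPLM_COMPUTE_NODES"] = "gpu01,gpu02"
cfg = load_config()
print(cfg["compute_nodes"])  # ['gpu01', 'gpu02']
```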
## Prerequisites

- Cassandra 4.x+ with the DeepLM schema (see `database/schemas/init.cql`)
- SLURM with `squeue` and `sacct` accessible from the metrics API host
- Python 3.10+
- Docker and Docker Compose (for containerized deployment)

## SLURM Setup

Load the Cassandra schema:

```bash
cqlsh <cassandra-host> <port> -f database/schemas/init.cql
```

Install the prologue/epilogue hooks on your compute nodes:
```bash
# Copy hooks
sudo cp slurm_hooks/prologue.sh /etc/slurm/prologue.sh
sudo cp slurm_hooks/epilogue.sh /etc/slurm/epilogue.sh
sudo chmod +x /etc/slurm/prologue.sh /etc/slurm/epilogue.sh

# Set the API URL
echo 'export DEEPLM_API_URL=http://<metrics-api-host>:5000' >> /etc/slurm/slurm.conf
```

## Project Structure

```
deeplm-dashboards/
  config/                  # Central configuration (env-based)
  metrics_exporter/        # Flask app + Prometheus metrics builder
    app.py                 # Main application (metrics + REST API)
    bcm_client.py          # Optional BCM integration
    power_estimator.py     # TDP-based power fallback
    prometheus_builder.py  # Prometheus exposition format builder
  database/                # Cassandra models and utilities
    models.py              # JobModel ORM
    utils.py               # SLURM parsing helpers
    schemas/init.cql       # Cassandra schema
  cassandra_viewer/        # Web UI for browsing Cassandra tables
  grafana/                 # Dashboards + provisioning configs
  prometheus/              # Scrape configuration
  slurm_hooks/             # Prologue/epilogue scripts
  docker-compose.yml       # Full-stack deployment
  Dockerfile.metrics       # Metrics API container
  Dockerfile.viewer        # Cassandra viewer container
```
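The `slurm_hooks/` prologue and epilogue scripts report job events to the metrics API at `$DEEPLM_API_URL`. As a hedged illustration, here is the kind of JSON payload such a hook could assemble from SLURM's hook environment — the endpoint, event names, and field names are all assumptions for this sketch; the shipped hooks are shell scripts:

```python
import json

# Build the JSON body a hypothetical epilogue hook could POST to
# $DEEPLM_API_URL. SLURM exports variables such as SLURM_JOB_ID and
# SLURM_JOB_USER in the prolog/epilog environment; all payload field
# names below are illustrative.
def build_job_end_payload(env: dict) -> str:
    payload = {
        "job_id": env.get("SLURM_JOB_ID"),
        "user": env.get("SLURM_JOB_USER"),
        "nodes": env.get("SLURM_JOB_NODELIST"),
        "event": "job_end",
    }
    return json.dumps(payload)

body = build_job_end_payload({"SLURM_JOB_ID": "1234", "SLURM_JOB_USER": "alice"})
print(body)
```

A shell hook would typically do the equivalent with `curl -X POST -d "$payload"` against the metrics API.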
## Power Metrics

If you have NVIDIA Base Command Manager (BCM), set `DEEPLM_BCM_ENABLED=true` and provide credentials. The metrics exporter will fetch real GPU power, temperature, and utilization data from the BCM REST API.

Without BCM, the system falls back to TDP-based power estimation using job CPU/GPU utilization percentages.
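The fallback amounts to scaling the configured TDP (`DEEPLM_GPU_POWER_LIMIT`) by the utilization percentage. A simplified sketch of the idea — the exact formula in `power_estimator.py` may differ:

```python
# Estimate power draw as TDP scaled by the utilization percentage.
# This mirrors the idea behind the TDP fallback; the exact formula
# used by power_estimator.py may differ.
def estimate_gpu_power(util_pct: float, tdp_watts: float = 300.0) -> float:
    """Return estimated watts for one GPU at util_pct (0-100, clamped)."""
    util = min(max(util_pct, 0.0), 100.0) / 100.0
    return tdp_watts * util

# Two GPUs at 75% utilization against the default 300 W limit:
print(2 * estimate_gpu_power(75.0))  # 450.0
```

This is necessarily coarse (real GPUs draw non-zero idle power and do not scale linearly with utilization), which is why BCM's measured values are preferred when available.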
## Contributing

Contributions are welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -am 'Add my feature'`)
- Push to the branch (`git push origin feature/my-feature`)
- Open a Pull Request
For bug reports and feature requests, please use GitHub Issues.
## License

Apache License 2.0. See LICENSE.
Made by DeepLM
Intelligent scheduling and monitoring for HPC GPU clusters.
For more information, products, and services visit
www.deeplm.ai
