# SLURM Cluster DeepLM Dashboards Stack

Real-time monitoring dashboards for HPC/SLURM GPU clusters. Provides Grafana dashboards, a Prometheus metrics exporter, and a Cassandra-backed data viewer for tracking job performance, power consumption, and GPU utilization across your cluster.
## Architecture

```
+------------------+            +-------------------+
| Compute Nodes    |            |                   |
|  - prologue.sh --+----------->|  Flask Metrics    |
|  - epilogue.sh --+----------->|  API (:5000)      |
|  - collectors ---+----------->|         |         |
+------------------+            |         v         |
                                |  Cassandra (DB)   |
+------------------+            |         |         |
| SLURM Controller |            |         v         |
|  - squeue -------+----------->|  /metrics         |
|  - sacct --------+----------->|  (Prometheus fmt) |
+------------------+            |         |         |
                                |         v         |
+------------------+            |  Prometheus       |
| BCM (optional)   |            |  (:9090)          |
|  - REST API -----+----------->|         |         |
+------------------+            |         v         |
                                |  Grafana (:3000)  |
                                |  5 dashboards     |
                                +-------------------+
```
## Dashboards

| Dashboard | Description |
|---|---|
| Job Insights | Per-job CPU/GPU utilization, memory, priority, power consumption |
| System Overview | Cluster-wide power, GPU temperature, fan speed, CPU/memory per node |
| Live Jobs | Real-time active job monitoring with 5-second refresh |
| Historical Jobs | Job duration analysis, completion rates, CPU hours by user |
| Checkpoint Analysis | Sync vs async checkpoint strategy comparison (stall time, overhead) |
## Quick Start (Docker Compose)

```bash
cp .env.example .env
# Edit .env with your Cassandra host, compute nodes, etc.
docker compose up -d
```

Services will be available at:
- Grafana: http://localhost:3000 (admin / changeme)
- Prometheus: http://localhost:9090
- Metrics API: http://localhost:5000/metrics
- Cassandra Viewer: http://localhost:5002
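To sanity-check the exporter, fetch `http://localhost:5000/metrics` and inspect the exposition-format output. Below is a minimal, illustrative parser for simple `name{labels} value` lines — the helper and the metric names in the sample are made up for this sketch, not the exporter's actual API:

```python
# Minimal sanity check for Prometheus exposition-format text.
# parse_metrics() is an illustrative helper, not part of the project;
# it only handles simple 'name{labels} value' lines.
def parse_metrics(text: str) -> dict:
    """Parse exposition-format lines into {metric_name: float_value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_part, _, value = line.rpartition(" ")
        # Drop any {label="..."} suffix to get the bare metric name
        name = name_part.split("{", 1)[0]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

# Hypothetical sample output (metric names are invented):
sample = """\
# HELP deeplm_job_gpu_util GPU utilization per job
deeplm_job_gpu_util{job_id="1234"} 87.5
deeplm_node_power_watts{node="gpu01"} 412
"""
print(parse_metrics(sample))
# {'deeplm_job_gpu_util': 87.5, 'deeplm_node_power_watts': 412.0}
```

In practice you would pair this with any HTTP client; Prometheus itself scrapes the same endpoint, so a quick parse like this mainly confirms the exporter is serving well-formed lines.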
## Local Development (without Docker)

```bash
pip install .

# Start the metrics API
python -m metrics_exporter.app

# Start the Cassandra viewer (optional)
python -m cassandra_viewer.app
```

## Configuration

All configuration is via environment variables. Copy `.env.example` to `.env` and customize:
| Variable | Default | Description |
|---|---|---|
| `DEEPLM_CASSANDRA_HOST` | `localhost` | Cassandra contact point |
| `DEEPLM_CASSANDRA_PORT` | `9042` | Cassandra CQL native port |
| `DEEPLM_CASSANDRA_KEYSPACE` | `cassandradb` | Cassandra keyspace |
| `DEEPLM_COMPUTE_NODES` | (empty) | Comma-separated compute node names |
| `DEEPLM_BCM_ENABLED` | `false` | Enable NVIDIA BCM power metrics |
| `DEEPLM_BCM_HOST` | (empty) | BCM REST API URL (e.g., `https://headnode:8081`) |
| `DEEPLM_BCM_USERNAME` | (empty) | BCM API username |
| `DEEPLM_BCM_PASSWORD` | (empty) | BCM API password |
| `DEEPLM_GPU_POWER_LIMIT` | `300` | GPU TDP in watts (for estimation fallback) |
| `GF_SECURITY_ADMIN_PASSWORD` | `changeme` | Grafana admin password |
See .env.example for the full list.
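The `config/` module reads these variables from the environment. A minimal sketch of the pattern with the defaults from the table above — the dictionary keys are illustrative, not the project's actual field names:

```python
import os

# Env-based configuration sketch using the documented defaults.
# Key names here are illustrative, not the project's actual API.
def load_config() -> dict:
    return {
        "cassandra_host": os.environ.get("DEEPLM_CASSANDRA_HOST", "localhost"),
        "cassandra_port": int(os.environ.get("DEEPLM_CASSANDRA_PORT", "9042")),
        "keyspace": os.environ.get("DEEPLM_CASSANDRA_KEYSPACE", "cassandradb"),
        # Comma-separated list -> ["gpu01", "gpu02"]; empty string -> []
        "compute_nodes": [n for n in os.environ.get("DEEPLM_COMPUTE_NODES", "").split(",") if n],
        "bcm_enabled": os.environ.get("DEEPLM_BCM_ENABLED", "false").lower() == "true",
        "gpu_power_limit_w": float(os.environ.get("DEEPLM_GPU_POWER_LIMIT", "300")),
    }

os.environ["DEEPLM_COMPUTE_NODES"] = "gpu01,gpu02"
cfg = load_config()
print(cfg["compute_nodes"])  # ['gpu01', 'gpu02']
```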
## Prerequisites

- Cassandra 4.x+ with the DeepLM schema (see `database/schemas/init.cql`)
- SLURM with `squeue` and `sacct` accessible from the metrics API host
- Python 3.10+
- Docker and Docker Compose (for containerized deployment)

## SLURM Setup

Load the Cassandra schema:

```bash
cqlsh <cassandra-host> <port> -f database/schemas/init.cql
```

Install the prologue/epilogue hooks on your compute nodes:
```bash
# Copy hooks
sudo cp slurm_hooks/prologue.sh /etc/slurm/prologue.sh
sudo cp slurm_hooks/epilogue.sh /etc/slurm/epilogue.sh
sudo chmod +x /etc/slurm/prologue.sh /etc/slurm/epilogue.sh

# Set the API URL
echo 'export DEEPLM_API_URL=http://<metrics-api-host>:5000' >> /etc/slurm/slurm.conf
```

## Project Structure

```
deeplm-dashboards/
  config/                  # Central configuration (env-based)
  metrics_exporter/        # Flask app + Prometheus metrics builder
    app.py                 # Main application (metrics + REST API)
    bcm_client.py          # Optional BCM integration
    power_estimator.py     # TDP-based power fallback
    prometheus_builder.py  # Prometheus exposition format builder
  database/                # Cassandra models and utilities
    models.py              # JobModel ORM
    utils.py               # SLURM parsing helpers
    schemas/init.cql       # Cassandra schema
  cassandra_viewer/        # Web UI for browsing Cassandra tables
  grafana/                 # Dashboards + provisioning configs
  prometheus/              # Scrape configuration
  slurm_hooks/             # Prologue/epilogue scripts
  docker-compose.yml       # Full-stack deployment
  Dockerfile.metrics       # Metrics API container
  Dockerfile.viewer        # Cassandra viewer container
```
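The `slurm_hooks/` prologue and epilogue scripts report job events to the metrics API at `$DEEPLM_API_URL`. As a hedged illustration, here is the kind of JSON payload such a hook could assemble from SLURM's hook environment — the endpoint, event names, and field names are all assumptions for this sketch; the shipped hooks are shell scripts:

```python
import json

# Build the JSON body a hypothetical epilogue hook could POST to
# $DEEPLM_API_URL. SLURM exports variables such as SLURM_JOB_ID and
# SLURM_JOB_USER in the prolog/epilog environment; all payload field
# names below are illustrative.
def build_job_end_payload(env: dict) -> str:
    payload = {
        "job_id": env.get("SLURM_JOB_ID"),
        "user": env.get("SLURM_JOB_USER"),
        "nodes": env.get("SLURM_JOB_NODELIST"),
        "event": "job_end",
    }
    return json.dumps(payload)

body = build_job_end_payload({"SLURM_JOB_ID": "1234", "SLURM_JOB_USER": "alice"})
print(body)
```

A shell hook would typically do the equivalent with `curl -X POST -d "$payload"` against the metrics API.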
## Power Metrics

If you have NVIDIA Base Command Manager (BCM), set `DEEPLM_BCM_ENABLED=true` and provide credentials. The metrics exporter will fetch real GPU power, temperature, and utilization data from the BCM REST API.

Without BCM, the system falls back to TDP-based power estimation using job CPU/GPU utilization percentages.
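The fallback amounts to scaling the configured TDP (`DEEPLM_GPU_POWER_LIMIT`) by the utilization percentage. A simplified sketch of the idea — the exact formula in `power_estimator.py` may differ:

```python
# Estimate power draw as TDP scaled by the utilization percentage.
# This mirrors the idea behind the TDP fallback; the exact formula
# used by power_estimator.py may differ.
def estimate_gpu_power(util_pct: float, tdp_watts: float = 300.0) -> float:
    """Return estimated watts for one GPU at util_pct (0-100, clamped)."""
    util = min(max(util_pct, 0.0), 100.0) / 100.0
    return tdp_watts * util

# Two GPUs at 75% utilization against the default 300 W limit:
print(2 * estimate_gpu_power(75.0))  # 450.0
```

This is necessarily coarse (real GPUs draw non-zero idle power and do not scale linearly with utilization), which is why BCM's measured values are preferred when available.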
## Contributing

Contributions are welcome! Please:

- Fork the repository
- Create a feature branch (`git checkout -b feature/my-feature`)
- Commit your changes (`git commit -am 'Add my feature'`)
- Push to the branch (`git push origin feature/my-feature`)
- Open a Pull Request
For bug reports and feature requests, please use GitHub Issues.
## License

Apache License 2.0. See LICENSE.
Made by DeepLM
Intelligent scheduling and monitoring for HPC GPU clusters.
For more information, products, and services visit
www.deeplm.ai
