
DeepLM Dashboards

Real-time monitoring dashboards for HPC/SLURM GPU clusters. Provides Grafana dashboards, a Prometheus metrics exporter, and a Cassandra-backed data viewer for tracking job performance, power consumption, and GPU utilization across your cluster.

Architecture

SLURM Cluster                    DeepLM Dashboards Stack
+------------------+             +-------------------+
| Compute Nodes    |             |                   |
|  - prologue.sh --+------------>|  Flask Metrics    |
|  - epilogue.sh --+------------>|  API (:5000)      |
|  - collectors  --+------------>|       |           |
+------------------+             |       v           |
                                 |  Cassandra (DB)   |
+------------------+             |       |           |
| SLURM Controller |             |       v           |
|  - squeue      --+------------>|  /metrics         |
|  - sacct       --+------------>|  (Prometheus fmt) |
+------------------+             |       |           |
                                 |       v           |
+------------------+             |  Prometheus       |
| BCM (optional)   |             |  (:9090)          |
|  - REST API    --+------------>|       |           |
+------------------+             |       v           |
                                 |  Grafana (:3000)  |
                                 |  5 dashboards     |
                                 +-------------------+

Dashboards

| Dashboard | Description |
|---|---|
| Job Insights | Per-job CPU/GPU utilization, memory, priority, power consumption |
| System Overview | Cluster-wide power, GPU temperature, fan speed, CPU/memory per node |
| Live Jobs | Real-time active job monitoring with 5-second refresh |
| Historical Jobs | Job duration analysis, completion rates, CPU hours by user |
| Checkpoint Analysis | Sync vs. async checkpoint strategy comparison (stall time, overhead) |

Quick Start

Docker Compose (recommended)

cp .env.example .env
# Edit .env with your Cassandra host, compute nodes, etc.

docker compose up -d

Services will be available at:

  • Grafana: http://localhost:3000
  • Prometheus: http://localhost:9090
  • Metrics API: http://localhost:5000

Bare Metal

pip install .

# Start the metrics API
python -m metrics_exporter.app

# Start the Cassandra viewer (optional)
python -m cassandra_viewer.app

Configuration

All configuration is via environment variables. Copy .env.example to .env and customize:

| Variable | Default | Description |
|---|---|---|
| DEEPLM_CASSANDRA_HOST | localhost | Cassandra contact point |
| DEEPLM_CASSANDRA_PORT | 9042 | Cassandra CQL native port |
| DEEPLM_CASSANDRA_KEYSPACE | cassandradb | Cassandra keyspace |
| DEEPLM_COMPUTE_NODES | (empty) | Comma-separated compute node names |
| DEEPLM_BCM_ENABLED | false | Enable NVIDIA BCM power metrics |
| DEEPLM_BCM_HOST | (empty) | BCM REST API URL (e.g., https://headnode:8081) |
| DEEPLM_BCM_USERNAME | (empty) | BCM API username |
| DEEPLM_BCM_PASSWORD | (empty) | BCM API password |
| DEEPLM_GPU_POWER_LIMIT | 300 | GPU TDP in watts (for estimation fallback) |
| GF_SECURITY_ADMIN_PASSWORD | changeme | Grafana admin password |

See .env.example for the full list.
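For reference, a minimal .env for a small cluster without BCM might look like the following (host name and node list are placeholder values, not defaults from the project):

```shell
DEEPLM_CASSANDRA_HOST=cassandra.internal
DEEPLM_CASSANDRA_PORT=9042
DEEPLM_CASSANDRA_KEYSPACE=cassandradb
DEEPLM_COMPUTE_NODES=gpu01,gpu02,gpu03
DEEPLM_BCM_ENABLED=false
DEEPLM_GPU_POWER_LIMIT=300
GF_SECURITY_ADMIN_PASSWORD=changeme
```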

Prerequisites

  • Cassandra 4.x+ with the DeepLM schema (see database/schemas/init.cql)
  • SLURM with squeue and sacct accessible from the metrics API host
  • Python 3.10+
  • Docker and Docker Compose (for containerized deployment)

Cassandra Schema Setup

cqlsh <cassandra-host> <port> -f database/schemas/init.cql

SLURM Integration

Install the prologue/epilogue hooks on your compute nodes:

# Copy hooks
sudo cp slurm_hooks/prologue.sh /etc/slurm/prologue.sh
sudo cp slurm_hooks/epilogue.sh /etc/slurm/epilogue.sh
sudo chmod +x /etc/slurm/prologue.sh /etc/slurm/epilogue.sh

# Set the API URL for the hooks. Note that slurm.conf is not a shell
# script, so do not append export statements to it; put the variable in
# an environment file instead (adjust the path to wherever your hooks
# read their environment):
echo 'DEEPLM_API_URL=http://<metrics-api-host>:5000' | sudo tee /etc/slurm/deeplm.env
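The hooks report job start/end events to the metrics API over HTTP. As a rough sketch of the kind of request involved (the `/jobs` endpoint path and the payload fields here are assumptions for illustration, not the exporter's actual API):

```python
import json
import urllib.request


def build_job_event(job_id: str, node: str, event: str) -> dict:
    """Assemble a job event payload from SLURM prologue/epilogue variables."""
    return {"job_id": job_id, "node": node, "event": event}


def post_job_event(api_url: str, payload: dict) -> int:
    """POST the event to the metrics API; returns the HTTP status code."""
    req = urllib.request.Request(
        f"{api_url}/jobs",  # hypothetical endpoint path
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# In a prologue hook, SLURM exports SLURM_JOB_ID and SLURMD_NODENAME, so
# a hook would do something like:
#   post_job_event(os.environ["DEEPLM_API_URL"],
#                  build_job_event(os.environ["SLURM_JOB_ID"],
#                                  os.environ["SLURMD_NODENAME"], "start"))
```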

Project Structure

deeplm-dashboards/
  config/                  # Central configuration (env-based)
  metrics_exporter/        # Flask app + Prometheus metrics builder
    app.py                 # Main application (metrics + REST API)
    bcm_client.py          # Optional BCM integration
    power_estimator.py     # TDP-based power fallback
    prometheus_builder.py  # Prometheus exposition format builder
  database/                # Cassandra models and utilities
    models.py              # JobModel ORM
    utils.py               # SLURM parsing helpers
    schemas/init.cql       # Cassandra schema
  cassandra_viewer/        # Web UI for browsing Cassandra tables
  grafana/                 # Dashboards + provisioning configs
  prometheus/              # Scrape configuration
  slurm_hooks/             # Prologue/epilogue scripts
  docker-compose.yml       # Full-stack deployment
  Dockerfile.metrics       # Metrics API container
  Dockerfile.viewer        # Cassandra viewer container
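To illustrate what a builder like prometheus_builder.py produces, here is a minimal sketch of the Prometheus text exposition format; the metric name below is illustrative, not necessarily one the exporter emits:

```python
def build_gauge(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge metric in Prometheus text exposition format.

    `samples` maps tuples of (label, value) pairs to metric values.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = build_gauge(
    "deeplm_gpu_utilization_percent",  # hypothetical metric name
    "Per-job GPU utilization",
    {(("job_id", "123"), ("node", "gpu01")): 87.5},
)
print(text)
# deeplm_gpu_utilization_percent{job_id="123",node="gpu01"} 87.5
```

Prometheus scrapes this plain-text format from the exporter's /metrics endpoint on each scrape interval.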

BCM Integration (Optional)

If you have NVIDIA Base Command Manager, set DEEPLM_BCM_ENABLED=true and provide credentials. The metrics exporter will fetch real GPU power, temperature, and utilization data from the BCM REST API.

Without BCM, the system falls back to TDP-based power estimation using job CPU/GPU utilization percentages.
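The fallback amounts to scaling the configured TDP by utilization. A minimal sketch of the idea (the idle fraction and the exact formula are assumptions; see power_estimator.py for the real implementation):

```python
def estimate_gpu_power(gpu_util_percent: float,
                       power_limit_w: float = 300.0,
                       idle_fraction: float = 0.1) -> float:
    """Estimate GPU draw as an idle floor plus a utilization-scaled
    share of the remaining TDP headroom.

    The ~10% idle floor is an assumed value, not one taken from the
    exporter; power_limit_w corresponds to DEEPLM_GPU_POWER_LIMIT.
    """
    idle_w = idle_fraction * power_limit_w
    return idle_w + (gpu_util_percent / 100.0) * (power_limit_w - idle_w)


# A fully utilized 300 W GPU estimates at the full limit:
print(estimate_gpu_power(100.0))  # 300.0
```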

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Commit your changes (git commit -am 'Add my feature')
  4. Push to the branch (git push origin feature/my-feature)
  5. Open a Pull Request

For bug reports and feature requests, please use GitHub Issues.

License

Apache License 2.0. See LICENSE.



Made by DeepLM
Intelligent scheduling and monitoring for HPC GPU clusters.

For more information, products, and services visit
www.deeplm.ai
