DECICE is an open-source framework designed to orchestrate workloads across the Compute Continuum—seamlessly bridging the gap between IoT devices, Edge nodes, Cloud clusters, and High-Performance Computing (HPC) infrastructure.
At its core, DECICE utilizes an Integrated AI Scheduler (IAIS) powered by Deep Reinforcement Learning (PPO) to optimize workload placement based on real-time metrics like energy consumption, latency, and resource availability.
- 🧠 AI-Driven Orchestration: A Proximal Policy Optimization (PPO) agent that dynamically selects scheduling strategies (e.g., Round Robin, Min-Max, Energy-Aware) based on live cluster states.
- 🧪 Virtual Training Environment (VTE): A robust MLOps pipeline to generate synthetic data, train models offline, and evaluate them against heuristics without impacting production.
- 🌐 Heterogeneous Support: Unified API for managing Kubernetes Jobs, Deployments, and Slurm (HPC) batch jobs.
- 🔮 Digital Twin: Real-time system modeling and predictive analytics using Prometheus and InfluxDB.
- 📊 Observability: Built-in integration with TensorBoard for training metrics and a unified Dashboard for cluster topology visualization.
- 🔌 Workflow Integration: Native support for Argo Workflows and Snakemake pipelines.
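As a sketch of the idea behind the AI-driven orchestration bullet: a PPO actor outputs a probability distribution over the available heuristics and samples one per scheduling decision. The strategy names follow the list above, but the function and its interface are illustrative, not DECICE's actual API.

```python
import random

# Illustrative only: strategy names follow the feature list above; the
# probability-vector interface is an assumption, not DECICE's actual API.
STRATEGIES = ["round_robin", "min_max", "energy_aware"]

def select_strategy(policy_probs):
    """Sample a scheduling heuristic from the actor's action distribution,
    as a PPO policy would at inference time."""
    assert abs(sum(policy_probs) - 1.0) < 1e-6, "probabilities must sum to 1"
    r, cum = random.random(), 0.0
    for strategy, p in zip(STRATEGIES, policy_probs):
        cum += p
        if r < cum:
            return strategy
    return STRATEGIES[-1]  # guard against floating-point round-off

# A policy that strongly favours the energy-aware heuristic:
print(select_strategy([0.05, 0.05, 0.90]))
```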
DECICE operates as a set of loosely coupled microservices:
- Control Manager: The central brain handling user authentication, workflow parsing, and state management.
- Integrated AI Scheduler: The decision engine containing the RL agent and Feature Engineering pipeline.
- Scheduler Controller: Intermediary that enriches job requests with real-time data from the Digital Twin.
- PSGC (Platform Specific Glue Code): The execution engine that translates abstract decisions into platform-specific actions (K8s manifests or Slurm scripts).
- PromQL Wrapper & Digital Twin: The telemetry layer providing a unified view of the infrastructure.
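To make the Scheduler Controller's role concrete, here is a minimal sketch of enriching a job request with Digital Twin telemetry. The field names are hypothetical; the real payloads may differ.

```python
# Hypothetical sketch of the Scheduler Controller's job: merge an abstract
# job request with live per-node telemetry from the Digital Twin before the
# AI Scheduler sees it. All field names here are illustrative.
def enrich_job_request(job, twin_snapshot):
    """Attach a cluster-state section built from Digital Twin telemetry."""
    return {
        **job,
        "cluster_state": {
            node: {"cpu_free": m["cpu_free"], "energy_w": m["energy_w"]}
            for node, m in twin_snapshot.items()
        },
    }

job = {"name": "train-job", "cpu": 4}
twin = {
    "edge-1": {"cpu_free": 2, "energy_w": 35.0},
    "hpc-1": {"cpu_free": 64, "energy_w": 900.0},
}
enriched = enrich_job_request(job, twin)
print(enriched["cluster_state"]["hpc-1"]["cpu_free"])  # 64
```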
(See /docs for detailed admin documentation and setup guide)
- Docker Engine & Docker Compose (v2.0+)
- NVIDIA Drivers / CUDA (Optional, recommended for training the AI Scheduler)
- Python 3.10+ (For local script execution)
- Minikube
```shell
git clone https://github.com/DECICE-project/decice-framework.git
cd decice-framework
cp .env.example .env
```

Open `.env` and configure:

- `INTERNAL_API_KEY`: generate a strong random string (e.g., `openssl rand -hex 32`).
- `DATA_BASE_DIR`: defaults to `./data`. Ensure this path exists or let Docker create it.
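If you prefer not to shell out to `openssl`, the same kind of key can be produced with Python's standard library. This is just a convenience sketch, not a DECICE script:

```python
import secrets

# Equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex characters.
key = secrets.token_hex(32)
print(f"INTERNAL_API_KEY={key}")
```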
```shell
docker compose up -d --build
```
Once the containers are healthy:
- Dashboard (UI): http://localhost:3000
- OpenAPI (Swagger) docs per service:
- Control Manager: http://127.0.0.1:8000/
- Digital Twin: http://127.0.0.1:8010/
- Scheduler Controller: http://127.0.0.1:8020/
- AI Scheduler: http://127.0.0.1:8030/
- PSGC: http://127.0.0.1:8040/
- PromQL-Wrapper: http://127.0.0.1:8050/
- Slurm-Client: http://127.0.0.1:8060/
- TensorBoard: http://localhost:6006
- Grafana: http://localhost:3001 (Default creds: admin/admin)
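For scripted smoke tests, the service endpoints above can be collected in one place. The base URLs come from the list above; the `/docs` Swagger path is an assumption (the common FastAPI default), not something the README confirms.

```python
# Base URLs are taken from the endpoint list above; the `/docs` Swagger
# path is an assumption (the FastAPI default), not confirmed by the README.
SERVICES = {
    "control-manager": "http://127.0.0.1:8000/",
    "digital-twin": "http://127.0.0.1:8010/",
    "scheduler-controller": "http://127.0.0.1:8020/",
    "ai-scheduler": "http://127.0.0.1:8030/",
    "psgc": "http://127.0.0.1:8040/",
    "promql-wrapper": "http://127.0.0.1:8050/",
    "slurm-client": "http://127.0.0.1:8060/",
}

def openapi_url(base_url):
    """Swagger UI location, assuming the FastAPI default `/docs` path."""
    return base_url.rstrip("/") + "/docs"

for name, base in SERVICES.items():
    print(f"{name}: {openapi_url(base)}")
```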
DECICE includes a fully API-driven training ground. You do not need to run manual scripts.
Generate a synthetic dataset via the Swagger UI, or POST directly:

```shell
curl -X POST "http://localhost:8030/data/generate/dataset_v1" \
  -H "X-Internal-Api-Key: <YOUR_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"num_files": 100, "jobs_min": 5, "jobs_max": 20}'
```
Create a model configuration via the Swagger UI, or POST directly:

```shell
curl -X POST "http://localhost:8030/models/" \
  -H "X-Internal-Api-Key: <YOUR_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"name": "ppo_v1_aggressive", "actor_lr": 0.0005}'
```
Start a training run via the Swagger UI, or POST directly:

```shell
curl -X POST "http://localhost:8030/training/start" \
  -H "X-Internal-Api-Key: <YOUR_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"scheduler_name": "ppo_v1_aggressive", "dataset_name": "dataset_v1", "cycles": 50}'
```
Monitor progress: open TensorBoard at http://localhost:6006 to watch the Actor/Critic loss convergence in real time.
...
We welcome contributions!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project has received funding from the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101092582.
Distributed under the Apache License, Version 2.0. See LICENSE for more information.
