Project CodeFlare provides a simple, user-friendly abstraction for developing, scaling, queuing, and managing distributed AI/ML and Python workloads on OpenShift Container Platform, while maximizing the utilization of accelerators and compute resources.
Project CodeFlare consists of the following components:
- CodeFlare SDK: defines, develops, and controls remote distributed compute jobs and infrastructure from either a Python-based environment or a command-line interface.
- AppWrapper: a flexible, workload-agnostic mechanism that enables Kueue to manage a group of Kubernetes resources as a single logical unit and provides an additional level of automatic fault detection and recovery.
- CodeFlare Operator: automates deployment and configuration of the Project CodeFlare stack.
In addition to running standalone, Project CodeFlare is integrated with, and deployed as part of, the Open Data Hub.
Watch this video for an introduction to Project CodeFlare and what the stack can do. (Nov. 2022)
See this video as well for an updated demonstration of the basic stack functionality in action. (Jun. 2023)
To get started using the Project CodeFlare stack, try this end-to-end example!
For more basic walk-throughs and in-depth tutorials, see our demo notebooks!
For more details, see any of the component repos linked above, and visit their issues pages for open tasks and issues!
We attempt to document all architectural decisions in our ADR documents. Start here to understand the architectural details of Project CodeFlare.
Join our Slack community to get involved or ask questions.
Unless otherwise noted at a per-component level, Project CodeFlare is licensed under the Apache-2.0 License.