Alnair Release v0.3
Release Summary
This release is v0.3. It improves the existing components, the profiler, the Alnair device plugin, and the scheduler, and adds a new unified training framework that supports both torch elastic and horovod elastic. The main feature introduced in this release is GPU sharing, meaning that each user/workload can request less than one physical GPU.
Key Features and Improvements
- Alnair Device Plugin
"alnair/vgpu-memory" and "alnair/vgpu-compute" are two resources registered and managed by Alnair device plugin. A physical GPU is divided into multiple virtual GPUs. By default, the unit for alnair/vgpu-memory is 1 GB, and the unit for alnair/vgpu-compute is percent and the value should be less than 100. The memory and compute usage of each pod is controlled by an intercept library. -
- vGPU Sharing Scheduler Plugin
The vGPU sharing plugin implements the filter and score extension points. In the filter phase, the plugin filters out a node if none of its GPU cards has enough capacity to meet the request. In the score phase, it supports two modes, bin-packing and high-performance, which score a node based on its used GPU resources; a scoring sketch follows below. In addition, to work with the device plugin, the scheduler also adds a timestamp to the pod when it enters the filter phase.
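A minimal sketch of the two scoring modes, assuming bin-packing favors nodes whose GPUs are already heavily used and high-performance favors lightly used ones; the mode semantics, function name, and 0-100 score range are assumptions, not the plugin's actual code.

```go
package main

import "fmt"

// Mode selects how a node's GPUs are scored (names follow the release notes;
// the exact semantics below are an assumption, not the plugin's implementation).
type Mode int

const (
	BinPacking      Mode = iota // favor nodes whose GPUs are already heavily used
	HighPerformance             // favor nodes whose GPUs are lightly used
)

// scoreNode returns a 0-100 score from the used and total vGPU compute on a node.
func scoreNode(mode Mode, usedCompute, totalCompute int64) int64 {
	if totalCompute == 0 {
		return 0
	}
	utilization := usedCompute * 100 / totalCompute
	switch mode {
	case BinPacking:
		return utilization // pack pods onto the most-used GPUs first
	default:
		return 100 - utilization // spread pods to reduce contention
	}
}

func main() {
	fmt.Println(scoreNode(BinPacking, 60, 200))      // 30
	fmt.Println(scoreNode(HighPerformance, 60, 200)) // 70
}
```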
- Unified Training Framework
The unified training framework provides a unified job template and controller interface to control regular distributed jobs, torch elastic, and horovod elastic. The reconcile process is driven by a six-state state machine, with potential support for job migration; a reconcile sketch follows below. For torch elastic, an etcd server is created to manage the states of the workers in a job. Moreover, when the job controller creates workers, it queries node-level resource usage, which makes application-level scheduling possible, i.e., assigning nodes to pods.
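A minimal sketch of a state-machine-driven reconcile loop. The release notes only say that six states exist; the state names and transitions below are hypothetical and purely illustrative of the general controller shape.

```go
package main

import "fmt"

// JobState is a hypothetical six-state enumeration; the actual state names
// used by the controller are not given in these release notes.
type JobState string

const (
	Pending    JobState = "Pending"
	Starting   JobState = "Starting"
	Running    JobState = "Running"
	Restarting JobState = "Restarting"
	Completed  JobState = "Completed"
	Failed     JobState = "Failed"
)

// reconcile advances a job by one step based on its current state and the
// observed worker status, mirroring the shape of a state-machine-driven loop.
func reconcile(state JobState, workersReady, workersFailed bool) JobState {
	switch state {
	case Pending:
		return Starting
	case Starting:
		if workersReady {
			return Running
		}
		return Starting
	case Running:
		if workersFailed {
			return Restarting // e.g. an elastic job loses a worker and rescales
		}
		return Running
	case Restarting:
		return Starting
	default:
		return state // Completed and Failed are terminal
	}
}

func main() {
	s := Pending
	steps := []struct{ ready, failed bool }{{false, false}, {true, false}, {false, true}, {false, false}, {true, false}}
	for _, step := range steps {
		s = reconcile(s, step.ready, step.failed)
		fmt.Println(s)
	}
}
```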
- Profiler
To describe how a workload uses resources during its entire running phase, the profiler extracts the GPU/CPU utilization time series of all the pods belonging to the job from Prometheus and aggregates each series into a 10-data-point distribution; a downsampling sketch follows below. The data are persisted to MongoDB.
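A minimal sketch of reducing a utilization series to ten summary points. The release notes do not specify the aggregation method; the equal-width bucket averaging and the sample values here are assumptions for illustration.

```go
package main

import "fmt"

// aggregate reduces a utilization time series to n summary points by averaging
// equal-width buckets; bucket averaging is an assumption, not the profiler's
// documented method.
func aggregate(samples []float64, n int) []float64 {
	if len(samples) == 0 || n <= 0 {
		return nil
	}
	out := make([]float64, n)
	for i := 0; i < n; i++ {
		lo := i * len(samples) / n
		hi := (i + 1) * len(samples) / n
		if hi <= lo {
			hi = lo + 1 // ensure at least one sample per bucket when the series is short
		}
		sum := 0.0
		for _, v := range samples[lo:hi] {
			sum += v
		}
		out[i] = sum / float64(hi-lo)
	}
	return out
}

func main() {
	// e.g. GPU utilization (%) of one pod, scraped from Prometheus at a fixed interval
	series := []float64{5, 10, 40, 80, 85, 90, 88, 70, 30, 10, 5, 0}
	fmt.Println(aggregate(series, 10))
}
```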