
Alnair Release v0.3

@Fizzbb released this 22 Feb 02:54
d66cd91

Release Summary

This is release v0.3. It improves the existing components (the profiler, the Alnair device plugin, and the scheduler) and adds a new unified training framework that supports both Torch Elastic and Horovod Elastic. The main feature of this release is GPU sharing, where sharing means that each user/workload can request less than one physical GPU.

Key Features and Improvements

  • Alnair Device Plugin
    "alnair/vgpu-memory" and "alnair/vgpu-compute" are two resources registered and managed by Alnair device plugin. A physical GPU is divided into multiple virtual GPUs. By default, the unit for alnair/vgpu-memory is 1 GB, and the unit for alnair/vgpu-compute is percent and the value should be less than 100. The memory and compute usage of each pod is controlled by an intercept library.

  • vGPU Sharing Scheduler Plugin
    The vGPU sharing plugin implements the filter and score extension points. In the filter phase, the plugin filters out a node if none of its GPU cards has enough capacity to meet the request. In the score phase, it supports two modes, bin-packing and high-performance, which score nodes based on their used GPU resources (both modes are sketched below). In addition, to coordinate with the device plugin, the scheduler also adds a timestamp to the pod when it enters the filter phase.
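
    A toy sketch of the two scoring modes, assuming per-GPU compute usage is known in percent; the function name and formulas are illustrative assumptions, not the plugin's actual code:

    ```go
    package main

    import "fmt"

    // scoreNode returns a 0-100 score for a node from its per-GPU compute
    // usage (percent). binPack=true favors a node whose busiest card is
    // already heavily used (packing work onto fewer cards); binPack=false
    // ("high-performance") favors a node with an idle card.
    func scoreNode(usedPercentPerGPU []int, binPack bool) int {
        best := 0
        for _, used := range usedPercentPerGPU {
            s := used // packing: prefer the most-used card
            if !binPack {
                s = 100 - used // spreading: prefer the least-used card
            }
            if s > best {
                best = s
            }
        }
        return best
    }

    func main() {
        node := []int{70, 10} // two GPUs: 70% and 10% compute in use
        fmt.Println(scoreNode(node, true))  // bin-packing mode: 70
        fmt.Println(scoreNode(node, false)) // high-performance mode: 90
    }
    ```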

  • Unified Training Framework
    The unified training framework provides a unified job template and controller interface for managing regular distributed jobs, Torch Elastic jobs, and Horovod Elastic jobs. The reconcile process is driven by a 6-state state machine (sketched below), with potential support for job migration. For Torch Elastic, an etcd server is created to manage the states of the workers in a job. Moreover, when the job controller creates workers, it queries node-level resource usage, which makes application-level scheduling possible, i.e., assigning nodes to pods.
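
    A minimal sketch of a 6-state reconcile machine; the state names and transition conditions are assumptions for illustration, since the release notes do not enumerate them:

    ```go
    package main

    import "fmt"

    // JobState sketches the controller's 6-state reconcile machine. The
    // names below are hypothetical; only the number of states comes from
    // the release notes.
    type JobState int

    const (
        Pending JobState = iota
        Starting
        Running
        Restarting // e.g., elastic scale-out or a future job migration
        Completed
        Failed
    )

    // reconcile performs one control-loop step: map the current state and
    // observed worker status to the next state.
    func reconcile(s JobState, ready, failed, done bool) JobState {
        switch s {
        case Pending:
            return Starting
        case Starting, Restarting:
            if failed {
                return Failed
            }
            if ready {
                return Running
            }
        case Running:
            if done {
                return Completed
            }
            if failed {
                return Restarting // elastic jobs restart rather than fail
            }
        }
        return s
    }

    func main() {
        fmt.Println(reconcile(Starting, true, false, false) == Running) // true
    }
    ```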

  • Profiler
    To describe how a workload uses resources over its entire run, the profiler extracts from Prometheus the GPU/CPU utilization time series of all the pods belonging to a job and aggregates each series into a 10-data-point distribution (see the sketch below). The results are persisted to MongoDB.
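
    A sketch of the aggregation idea, compressing a utilization series into 10 points by taking deciles of the sorted samples; the profiler's exact binning may differ:

    ```go
    package main

    import (
        "fmt"
        "sort"
    )

    // toDistribution compresses a time series into 10 data points: the
    // 10th through 100th percentiles of the sorted samples.
    func toDistribution(series []float64) [10]float64 {
        var dist [10]float64
        if len(series) == 0 {
            return dist
        }
        sorted := append([]float64(nil), series...)
        sort.Float64s(sorted)
        for i := 0; i < 10; i++ {
            idx := (i+1)*len(sorted)/10 - 1 // index of the (i+1)*10th percentile
            if idx < 0 {
                idx = 0
            }
            dist[i] = sorted[idx]
        }
        return dist
    }

    func main() {
        gpuUtil := []float64{5, 80, 85, 90, 10, 88, 92, 87, 6, 91} // percent samples
        fmt.Println(toDistribution(gpuUtil))
    }
    ```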