
Alnair Release v0.3.1

@Fizzbb Fizzbb released this 07 Apr 20:54
· 375 commits to main since this release

Release Summary

In this v0.3.1 release, improvements are made to the existing components: the Alnair device plugin, the vGPU scheduler, and the profiler. Building on the previous release, the GPU sharing function is improved by optimizing the token distribution algorithm, supporting the V100 GPU, and simplifying the installation process. Users can now allocate a portion of a single GPU to AI workloads through the alnair-vgpu/memory settings, and select GPUs in either cost-saving mode or high-performance mode by configuring the Alnair vGPU scheduler.
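
As a sketch, requesting a portion of a GPU might look like the pod spec below. The scheduler profile names come from this release; the resource key `alnair/vgpu-memory` and the container image are illustrative assumptions and should be checked against the Alnair documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  schedulerName: alnair-cost-saving      # profile name from this release; see scheduler section
  containers:
  - name: trainer
    image: my-registry/pytorch-train:latest   # placeholder image
    resources:
      limits:
        alnair/vgpu-memory: 4            # hypothetical key: GPU memory share for this workload
```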

When multiple jobs run on the same GPU, the completion time for the whole set of jobs is significantly reduced (~20%) compared to running the jobs one at a time sequentially. However, each individual job's completion time may increase due to GPU resource sharing. Packing jobs with fewer conflicting resource requirements leads to faster completion times.

Key Features and Improvements

1. Alnair Device Plugin

  • Clean up the alnair workspace after vGPU jobs complete.
  • Change Alnair socket path to avoid nvidia container runtime mounting conflicts.
  • Verify the token bucket based GPU utilization control algorithm on V100 GPU.
  • Optimize the token distribution in the initial utilization ramp-up phase by dynamically adjusting the fill rate.
  • Add an init container in the deployment yaml file to automatically copy the intercept library from the container image to the host.
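
The token-bucket mechanism above can be sketched as follows. This is a simplified illustrative model, not the actual Alnair implementation: the class name, the boost factor, and the threshold for detecting the ramp-up phase are all assumptions.

```python
# Minimal sketch of token-bucket GPU utilization control with an adaptive
# fill rate during ramp-up. All names and numeric choices are illustrative.

class TokenBucket:
    """Tokens represent GPU compute quota; kernel launches spend them."""

    def __init__(self, capacity, fill_rate):
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.fill_rate = fill_rate    # tokens added per refill tick
        self.tokens = capacity

    def refill(self, observed_util, target_util):
        # Ramp-up optimization (hypothetical rule): if observed utilization
        # is far below the target share, temporarily boost the fill rate so
        # the job reaches its allotted utilization faster.
        boost = 2.0 if observed_util < 0.5 * target_util else 1.0
        self.tokens = min(self.capacity, self.tokens + self.fill_rate * boost)

    def try_launch(self, cost=1):
        # A kernel launch proceeds only if enough tokens are available;
        # otherwise the intercept layer would delay it.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, a job far below its target utilization refills at twice the base rate, while a job at its target refills at the base rate and is capped at the bucket capacity.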

2. vGPU Scheduler Plugin

  • Refactor the previous scheduler plugin into two plugins with different score functions.
  • Provide two profiles in the scheduler configuration to support scheduler selection: one is alnair-cost-saving and the other is alnair-high-performance. In cost-saving mode, new pods are placed on the most-used GPU node, while in high-performance mode, new pods are spread out to the least-used node.
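
A scheduler configuration with two profiles of this shape might look like the sketch below. The profile names come from this release; the score-plugin names and API version are assumptions and may differ from the actual Alnair configuration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: alnair-cost-saving        # bin-packs pods onto the most-used GPU node
  plugins:
    score:
      enabled:
      - name: CostSaving                   # hypothetical plugin name
- schedulerName: alnair-high-performance   # spreads pods to the least-used node
  plugins:
    score:
      enabled:
      - name: HighPerformance              # hypothetical plugin name
```

A pod then opts into one behavior or the other by setting `spec.schedulerName` to the desired profile name.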

3. Profiler

  • Add memory copy utilization metrics to Pod annotations to reflect AI workloads' I/O usage.
  • Keep resource utilization metrics in the Pod annotations after a job completes.
  • Now the Alnair device plugin, Alnair scheduler, and profiler all add Pod annotations under the ai.centaurus.io domain to share information and work together.
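
To illustrate the shared-annotation convention, the sketch below filters a Pod's annotations down to the Alnair domain. Only the `ai.centaurus.io` domain comes from this release; the specific annotation keys and values are made up for illustration.

```python
# Sketch: extracting the annotations Alnair components share on a Pod.
# The domain is from the release notes; the keys below are hypothetical.

ALNAIR_DOMAIN = "ai.centaurus.io"

def alnair_annotations(pod_annotations):
    """Return only the annotations published under the Alnair domain."""
    return {
        key: value
        for key, value in pod_annotations.items()
        if key.startswith(ALNAIR_DOMAIN + "/")
    }

annotations = {
    "ai.centaurus.io/gpu-util": "35",      # hypothetical profiler metric
    "ai.centaurus.io/gpu-mem-util": "48",  # hypothetical profiler metric
    "kubernetes.io/created-by": "test",    # unrelated key, filtered out
}
print(alnair_annotations(annotations))
```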