
Releases: CentaurusInfra/alnair

Alnair Release v0.5.0

26 Oct 23:24
b4e39fb

Release Summary

The v0.5.0 release includes a new Alluxio-based in-memory file-system cache operator and an improved Alnair profiler that works with the Alnair exporter, is decoupled from the Nvidia DCGM exporter, and stores pod data in a structured way in MongoDB. A standalone CUDA API intercept library for profiling is also released to track memcpy behavior between CPUs and GPUs. In addition, training and rendering of neural head avatars from monocular RGB videos are implemented as a platform use case; more services and acceleration centered on this application will be added in future releases. Lastly, an RDMA-enabled MPI backend is built with PyTorch (1.8), and container images are released for RDMA acceleration testing.

Key Features and Improvements

1. Alluxio Cache Operator

  • Track the pod annotation (cacheDataset: "yes") and automatically load remote data into the cache
  • Automatically switch from the remote data location to the local cache, transparently to the user
  • Support automatic hydration (cache loading) from NFS and S3 data sources
  • Cache capacity and availability management
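
As a rough illustration of the annotation-driven flow (the pod name and namespace below are hypothetical), a workload can opt into caching by carrying the cacheDataset annotation, e.g. via the Kubernetes Python client; the operator then hydrates the remote dataset into the Alluxio cache:

```python
# Sketch: opt a pod into dataset caching by attaching the annotation the
# operator watches for. Pod name and namespace here are hypothetical.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

body = {"metadata": {"annotations": {"cacheDataset": "yes"}}}
v1.patch_namespaced_pod(name="train-job-0", namespace="default", body=body)
```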

2. Alnair Exporter

  • Add a pod_name label to the alnair_gpu_util and alnair_gpu_mem_util metrics
  • Move the getting-pod-name-by-PID logic from the profiler to the exporter
  • Fix a bug causing unexpected program exits
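
A common way to implement the getting-pod-name-by-PID step is to parse the pod UID out of /proc/<pid>/cgroup and match it against the API server's pod list; the sketch below illustrates that idea and is not necessarily the exporter's exact code:

```python
# Sketch: map a GPU process PID to a pod by pulling the pod UID out of
# /proc/<pid>/cgroup and matching it against pods from the API server.
import re
from typing import Optional
from kubernetes import client, config

def pod_uid_from_pid(pid: int) -> Optional[str]:
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            # cgroup paths embed the pod UID, with '-' or '_' depending on the cgroup driver
            m = re.search(r"pod([0-9a-f_\-]{36})", line)
            if m:
                return m.group(1).replace("_", "-")
    return None

def pod_name_from_uid(uid: str) -> Optional[str]:
    config.load_incluster_config()
    for pod in client.CoreV1Api().list_pod_for_all_namespaces().items:
        if pod.metadata.uid == uid:
            return pod.metadata.name
    return None
```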

3. Alnair Profiler

  • Refactor the pod-monitoring logic to use Kubernetes pod event watches
  • Query maximum CPU, memory, I/O, network, and GPU utilization from Prometheus
  • Create one record per pod storing metadata (name, status, start/end time, ...) and utilization data, and upsert the records to MongoDB
  • Patch utilization data to the pod's annotations once
  • Standalone profiler-hook-lib to intercept CUDA memory-copy-related APIs
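
A minimal sketch of the per-pod record flow described above, assuming the alnair_gpu_util metric name from the exporter and hypothetical Prometheus/MongoDB endpoints and collection names:

```python
# Sketch: pull a peak-utilization value from Prometheus and upsert one
# document per pod into MongoDB. Endpoints and collection names are made up.
import requests
from pymongo import MongoClient

PROM = "http://prometheus:9090"

def max_gpu_util(pod_name: str, window: str = "1h") -> float:
    query = f'max_over_time(alnair_gpu_util{{pod_name="{pod_name}"}}[{window}])'
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}).json()
    result = resp["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

records = MongoClient("mongodb://mongo:27017").alnair.pod_profiles
records.update_one(
    {"pod_name": "train-job-0"},                                   # one record per pod
    {"$set": {"status": "Running", "max_gpu_util": max_gpu_util("train-job-0")}},
    upsert=True,
)
```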

4. Emerging application (Neural head avatar training and rendering)

  • Avatar training from monocular RGB video, with support for manual expression control
  • Frame-by-frame and video-to-video reenactment, with various accelerations

5. RDMA+MPI+Pytorch Containerization

  • Build container images with and without Mellanox RDMA support
  • Test distributed training performance with and without MPI+RDMA
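
For reference, a minimal PyTorch script using the MPI backend (which requires a PyTorch build compiled against MPI, as in the released images) might look like the following; this is a generic sketch, not the exact benchmark used:

```python
# Sketch: PyTorch distributed all-reduce over the MPI backend.
# Launch with something like: mpirun -np 2 python allreduce_check.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")      # rank/world size supplied by the MPI launcher
rank, world = dist.get_rank(), dist.get_world_size()

t = torch.ones(1) * rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)    # exercises the MPI (optionally RDMA) transport
print(f"rank {rank}/{world}: sum = {t.item()}")
```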

Alnair Release v0.4.0

21 Jul 00:57

Release Summary

The v0.4.0 release includes a new Alnair exporter module, a prototype of a Redis-based data caching cluster, improvements to the intercept library, and an open data set of AI workload resource utilization.
Alnair-exporter is a Prometheus exporter with custom collectors that exposes CUDA-level and GPU-process-level metrics for fine-grained resource utilization monitoring and application performance analysis. The prototype Redis-based data caching cluster is designed to speed up data ingestion for deep learning training: user data sets are fetched and managed in an in-memory data store, and content-based hashing is used to reduce data duplication across users. Cached data accelerates training when training scripts are run multiple times, which is often the case in the model design phase. The intercept library is improved to support CUDA >= 11.3 by intercepting cuGetProcAddress as the CUDA driver API entry point; the token refill rate is dynamically adjusted to improve fractional GPU utilization, and a monitoring thread is added to report CUDA-level metrics.

Key Features and Improvements

1. Alnair Exporter

  • Prometheus-based metrics exporter, directly connected to the Prometheus server
  • Custom collectors across different layers: CUDA and GPU process
  • Fine-grained metrics for precise control and verification of GPU sharing
  • Six Alnair metrics in an extensible framework
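
To illustrate the custom-collector pattern (the metric and label names below are assumptions based on these notes, and the data source is stubbed out), a collector registered with prometheus_client might look like:

```python
# Sketch: a custom Prometheus collector exposing a per-pod GPU utilization gauge.
import time
from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY, GaugeMetricFamily

class GpuProcessCollector:
    def collect(self):
        gauge = GaugeMetricFamily(
            "alnair_gpu_util", "GPU utilization per pod", labels=["pod_name"]
        )
        for pod, util in self._read_gpu_processes().items():
            gauge.add_metric([pod], util)
        yield gauge

    def _read_gpu_processes(self):
        # Stub: a real collector would read NVML / intercept-library counters here.
        return {"train-job-0": 42.0}

REGISTRY.register(GpuProcessCollector())
start_http_server(9400)     # the Prometheus server scrapes this endpoint directly
while True:
    time.sleep(1)
```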

2. Data Caching Cluster

  • Redis-based in-memory K-V store for high read throughput
  • Content-based hashing to save storage across users sharing the same data set
  • Optimized PyTorch dataloader with prefetching and multithreading
  • Complete CRD/operator offering to minimize deployment effort
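
A minimal sketch of the content-based hashing idea, with a hypothetical key scheme and Redis endpoint: identical file contents from different users map to the same key, so the bytes are stored only once.

```python
# Sketch: content-addressed caching in Redis. The same bytes hash to the same
# key regardless of which user or file name they came from.
import hashlib
import redis

r = redis.Redis(host="redis-cache", port=6379)

def cache_file(path: str) -> str:
    data = open(path, "rb").read()
    key = "data:" + hashlib.sha256(data).hexdigest()    # content hash, not file name
    r.set(key, data, nx=True)                           # no-op if already cached
    return key

def load(key: str) -> bytes:
    return r.get(key)
```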

3. Intercept lib

  • Dynamically adjust the token refill rate to boost GPU utilization in the initial ramp-up phase
  • Add a new intercept flow using the new CUDA driver entry point cuGetProcAddress (CUDA 11.3+)
  • Add a monitoring thread to count kernel launches, memory, and token usage every 0.01 s
  • Fix a cgroup.procs access issue inside containers via the Docker client API
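
The dynamic token refill idea can be illustrated with a small Python token bucket; this mirrors the concept only, not the C implementation inside the intercept library, and the adjustment heuristic is an assumption:

```python
# Sketch: token bucket with a dynamically adjusted refill rate, mirroring the
# concept used to cap a container's share of GPU compute.
import time

class TokenBucket:
    def __init__(self, capacity: float, fill_rate: float):
        self.capacity, self.fill_rate = capacity, fill_rate
        self.tokens, self.last = capacity, time.monotonic()

    def consume(self, n: float = 1.0) -> None:
        # Block (e.g. before a kernel launch) until enough tokens have accumulated.
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.fill_rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep(0.01)                 # same 10 ms cadence as the monitoring thread

    def adjust(self, observed_util: float, target_util: float) -> None:
        # Ramp-up heuristic (assumption): refill faster while utilization is below target.
        if observed_util < target_util:
            self.fill_rate *= 1.1
```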

4. Open Data Set

  • Five types of AI workloads are executed with various parameter settings
  • The recorded resource utilization (60+ features) is shared and analyzed

Alnair Release v0.3.1

07 Apr 20:54

Release Summary

In this v0.3.1 release, improvements are made to the existing components: the Alnair device plugin, vGPU scheduler, and profiler. Building on the previous release, the GPU sharing function is improved by optimizing the token distribution algorithms, supporting the V100 GPU, and simplifying the installation process. Users can now choose the portion of one GPU on which to run AI workloads through the alnair/vgpu-memory and alnair/vgpu-compute settings, and select GPUs in either cost-saving mode or high-performance mode by configuring the Alnair vGPU scheduler.

When multiple jobs run on the same GPU, the total completion time for the whole set of jobs is significantly reduced (~20%) compared to running one job at a time sequentially. However, each individual job's completion time may increase due to GPU resource sharing. Packing jobs with fewer conflicting resource requirements leads to faster completion times.

Key Features and Improvements

1. Alnair Device Plugin

  • Clean up the Alnair workspace after vGPU jobs complete.
  • Change the Alnair socket path to avoid mounting conflicts with the Nvidia container runtime.
  • Verify the token-bucket-based GPU utilization control algorithm on V100 GPUs.
  • Optimize token distribution in the initial utilization ramp-up phase by dynamically adjusting the fill rate.
  • Add an init container to the deployment YAML file to automatically copy the intercept library from the container image to the host.

2. vGPU Scheduler Plugin

  • Refactor the previous scheduler plugin into two plugins with different score functions.
  • Provide two profiles in the scheduler configuration to support scheduler selection: alnair-cost-saving and alnair-high-performance. In cost-saving mode, new pods are placed on the most-used GPU node, while in high-performance mode, new pods are spread out to the least-used node (sketched below).
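
The two scoring profiles can be summarized by the following Python-style pseudocode (the actual plugins are Go scheduler-framework plugins; the normalization is illustrative):

```python
# Pseudocode of the two scoring profiles; a higher score means the node is preferred.
def score_cost_saving(used_vgpu: int, capacity: int) -> int:
    # Pack new pods onto the most-used GPU node, keeping other nodes free.
    return int(100 * used_vgpu / capacity)

def score_high_performance(used_vgpu: int, capacity: int) -> int:
    # Spread new pods to the least-used node to minimize contention.
    return int(100 * (capacity - used_vgpu) / capacity)
```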

3. Profiler

  • Add memory-copy utilization metrics to pod annotations to reflect AI workloads' I/O usage
  • Keep resource utilization metrics in the pod annotations after the job completes
  • The Alnair device plugin, Alnair scheduler, and profiler now all add pod annotations under the ai.centaurus.io domain to share information and work together.

Alnair Release v0.3

22 Feb 02:54
d66cd91

Release Summary

This is release v0.3. Improvements are made to the existing components (profiler, Alnair device plugin, and scheduler), and a new unified training framework is added to support both Torch Elastic and Horovod Elastic. The main feature brought by this release is GPU sharing, where sharing a GPU means that each user/workload requests less than one physical GPU.

Key Features and Improvements

  • Alnair Device Plugin
    "alnair/vgpu-memory" and "alnair/vgpu-compute" are two resources registered and managed by Alnair device plugin. A physical GPU is divided into multiple virtual GPUs. By default, the unit for alnair/vgpu-memory is 1 GB, and the unit for alnair/vgpu-compute is percent and the value should be less than 100. The memory and compute usage of each pod is controlled by an intercept library.

  • vGPU Sharing Scheduler Plugin
    The vGPU sharing plugin implements the filter and score extension points. In the filter, the plugin filters out a node if none of its GPU cards has enough capacity to meet the request. In the score, it supports two modes, bin-packing and high-performance, which score a node based on its used GPU resources. In addition, to work with the device plugin, the scheduler also adds a timestamp to the pod when it enters the filter phase.

  • Unified Training Framework
    The unified training framework provides a unified job template and controller interface to control regular distributed jobs, Torch Elastic, and Horovod Elastic. The reconcile process is driven by a six-state state machine, with potential support for job migration. For Torch Elastic, an etcd server is created to manage the states of the workers in a job. Moreover, when the job controller creates workers, it queries node-level resource usage, which makes application-level scheduling possible, i.e., assigning nodes to pods.

  • Profiler
    To describe how workloads use resources during the entire run, the profiler extracts from Prometheus the GPU/CPU utilization time series of all pods belonging to a job and aggregates each series into a 10-data-point distribution (sketched below). The data are persisted to MongoDB.
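
A possible decile-based aggregation is sketched below; the exact statistic used by the profiler may differ.

```python
# Sketch: compress a utilization time series into a 10-point (decile) distribution.
import numpy as np

def ten_point_distribution(series):
    # Deciles of the observed samples, e.g. from a Prometheus range query.
    return list(np.percentile(series, np.linspace(10, 100, 10)))

# Example: GPU utilization sampled once per second over a short run.
print(ten_point_distribution([0, 5, 30, 75, 80, 82, 85, 90, 88, 86]))
```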

Alnair Release v0.2

04 Oct 18:19

Release Summary

This is Release v0.2. It includes improvements on Profiler and Elastic Framework, and the following new components:

  • Autonomous Scheduler
  • Alnair Device Plugin

Key Features and Improvements:

Profiler (improvements)

  • CPU metrics collection
    • Taking advantage of cAdvisor, pod-level metrics, e.g. CPU and memory utilization, disk I/O, and network utilization, are collected every second.
  • GPU metrics mapping
    • The process IDs on GPUs are extracted using the NVML library and mapped to pod names, improving the GPU utilization resolution from node level to pod level (see the sketch after this list).
  • Resource utilization aggregation
    • The pods created by Kubernetes Jobs and job-like CRDs are automatically deleted after the job completes, so the annotations on those pods would be lost. Using the pods' owner references, the profiler automatically aggregates the pods'/workers' information to their owner (Job/CRD). The maximum utilization of each metric is recorded in annotations at the Job level, for future job execution efficiency analysis.
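
The NVML side of the GPU-to-pod mapping can be sketched with pynvml (the PID-to-pod lookup itself is a separate step):

```python
# Sketch: list the PIDs (and their GPU memory) running on each device via pynvml.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        print(f"GPU {i}: pid={proc.pid} used_mem={proc.usedGpuMemory}")
pynvml.nvmlShutdown()
```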

Elastic Training Framework (improvements)

  • Single YAML deployment and unit/integration tests
  • GPU allocation auto reduction
    • Due to a race condition, when the number of GPUs set by TargetReplica is not available at scheduling time, the elastic Horovod job controller automatically scales down the size of the StatefulSet by one.
  • PodGroup Integration
    • Leveraging the coscheduling plugin, create and assign a PodGroup for each elastic Horovod job and launch the workers (StatefulSet) as a PodGroup, i.e., the workers run in an all-or-nothing manner to avoid resource starvation.
  • Improve scaling speed
    • Updated the podManagementPolicy field in the StatefulSet from the default (OrderedReady) to Parallel to reduce scale-up/down time. All pods can be launched or terminated in parallel, with no need to wait for a predecessor to become Running and Ready or to terminate completely.

Autonomous Scheduler (new)

  • A utilization-driven scheduler: UtilSched
    • UtilSched is a customized Kubernetes scheduler based on the Kubernetes scheduling framework. It optimizes the scheduling of AI workloads (those invoking GPUs) by aggregating the real-time GPU metrics that the Profiler module above extracts. By leveraging the APIs provided by the scheduling framework, UtilSched works as a plugin that refines scheduling decisions while being invoked only at several extension points (such as Filter and Score), without interrupting the core of Kubernetes scheduling.
  • Co-scheduling feature
    • The co-scheduling feature ensures the atomicity of a group of pods being scheduled together. Under certain scenarios (e.g., race conditions), the default Kubernetes scheduler cannot schedule the pods of a batch workload spawned by a StatefulSet or Deployment, which causes a heavy waste of resources. By introducing a PodGroup CRD, a batch scheduling attempt is marked failed once any pod in the PodGroup fails. A controller reconciles the PodGroup status and helps recover from abnormal cases. The co-scheduling feature is based on the Kubernetes scheduling framework; cooperating with the elastic-training-framework module above, it alleviates potential race conditions in an elastic scale-down/up process.

Alnair Device Plugin (new)

  • Kubernetes Device Plugin for Nvidia GPUs
    • Generate synthetic device IDs to support fractional GPU request and allocation.
    • Prepare the container environment for CUDA driver API interception.
    • Enforcement of GPU resource requests and GPU resource isolation will be supported in future releases.

Alnair Release v0.1

04 Jun 19:40
e311453
Pre-release

This is project Alnair's Release v0.1, including two components: the profiler and the elastic training framework. The components' key features are listed below.

  • Profiler
    • GPU metrics collection
      • Taking advantage of the Nvidia monitoring toolkit DCGM-exporter, device-level metrics, e.g. GPU and memory utilization, are collected every second.
      • The GPU metrics exported from DCGM are scraped by Prometheus, which auto-discovers pods with scraping annotations.
    • Deep learning training job (DLT) identification
      • Considering the cyclic pattern of memory utilization in DLT jobs, an autocorrelation-based cyclic pattern detection algorithm is implemented to detect DLT jobs (sketched after this list); once a DLT job is detected, its maximum memory utilization is predicted based on past usage.
      • Analytical results, including job type and predicted memory utilization, are continuously patched to every GPU node as annotations.
  • Elastic Training Framework
    • Kubernetes operator for horovod jobs
      • End users can create their Horovod training jobs in our framework using the CRDs.
      • Elastic training: jobs can scale up and down the number of GPU workers dynamically at runtime, without restart.
      • Fault tolerance: jobs can keep on training when some of the GPU workers fail, without restart.
    • GPU allocator
      • Dynamically allocates the pool of GPUs within a cluster to the elastic training jobs, optimizing for maximum GPU utilization and minimum job completion time.
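
The autocorrelation-based detection mentioned under the Profiler can be sketched as follows; the lag range and threshold are assumptions, not the profiler's actual parameters:

```python
# Sketch: flag a cyclic (DLT-like) memory-utilization series via autocorrelation.
import numpy as np

def looks_like_dlt(mem_util, min_lag: int = 5, threshold: float = 0.6) -> bool:
    x = np.asarray(mem_util, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]
    if acf[0] == 0:                      # flat series: no signal, not cyclic
        return False
    acf = acf / acf[0]                   # normalize so acf[0] == 1
    return bool(acf[min_lag:].max() > threshold)   # strong repeat at some lag -> cyclic
```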