Alnair Release v0.5.0

@Fizzbb Fizzbb released this 26 Oct 23:24
b4e39fb

Release Summary

The v0.5.0 release includes a new Alluxio-based in-memory file-system cache operator and an improved version of the Alnair profiler. The profiler now works with the Alnair exporter, is decoupled from the Nvidia DCGM exporter, and stores pod data in a structured way in MongoDB. A standalone CUDA API intercept library for profiling is also released to track memcpy behavior between CPUs and GPUs. In addition, the training and rendering of neural head avatars from monocular RGB videos are implemented as a use case for the platform. More services and acceleration centered on this application will be added in future releases. Lastly, an RDMA-supported MPI backend is built with PyTorch (1.8), and container images are released for RDMA acceleration testing.

Key Features and Improvements

1. Alluxio Cache Operator

  • Track the Pod annotation (cacheDataset:"yes") and automatically load remote data into the cache
  • Automatically switch from the remote data location to the local cache, transparently to the user
  • Support NFS and S3 data source auto hydration (cache loading)
  • Cache capacity and availability management
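
As a rough illustration of the annotation-driven switch listed above, the sketch below checks a Pod manifest for the cacheDataset annotation and rewrites a remote dataset location to a local cache path. Only the cacheDataset:"yes" annotation comes from these notes. The DATA_PATH environment variable, the /alnair-cache mount point, and the helper names are hypothetical and are not taken from the operator's actual implementation.

```python
import copy
from urllib.parse import urlparse

CACHE_ANNOTATION = "cacheDataset"  # annotation key named in the release notes
CACHE_MOUNT = "/alnair-cache"      # hypothetical local cache mount point


def wants_cache(pod: dict) -> bool:
    """Return True if the Pod opts in via the cacheDataset annotation."""
    annotations = pod.get("metadata", {}).get("annotations") or {}
    return annotations.get(CACHE_ANNOTATION) == "yes"


def to_cache_path(remote: str) -> str:
    """Map a remote location such as s3://bucket/key to a path under the cache mount."""
    parsed = urlparse(remote)
    return f"{CACHE_MOUNT}/{parsed.scheme}/{parsed.netloc}{parsed.path}"


def switch_to_cache(pod: dict, env_name: str = "DATA_PATH") -> dict:
    """Return a copy of the Pod whose dataset env var points at the cache, if opted in."""
    if not wants_cache(pod):
        return pod
    patched = copy.deepcopy(pod)  # leave the caller's manifest untouched
    for container in patched["spec"]["containers"]:
        for env in container.get("env", []):
            if env["name"] == env_name:
                env["value"] = to_cache_path(env["value"])
    return patched


pod = {
    "metadata": {"name": "trainer", "annotations": {"cacheDataset": "yes"}},
    "spec": {"containers": [{
        "name": "main",
        "env": [{"name": "DATA_PATH", "value": "s3://imagenet/train"}],
    }]},
}
print(switch_to_cache(pod)["spec"]["containers"][0]["env"][0]["value"])
```

Pods without the annotation are returned unchanged, so workloads that do not opt in keep reading from the remote location.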

2. Alnair Exporter

  • Update the labels of alnair_gpu_util and alnair_gpu_mem_util to include pod_name
  • Move the logic of getting-Pod-name-by-PID from profiler to exporter
  • Fix a bug that caused the program to exit unexpectedly
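
To show what a pod-labelled sample might look like when scraped, here is a minimal sketch that renders one line in the Prometheus text exposition format. The metric name alnair_gpu_util and the pod_name label come from the list above. The gpu_idx label, the pod name, and the value are made up for illustration, and the exporter itself may well be built with a client library rather than hand-rendered strings.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict, value: float) -> str:
    """Render a single sample line: name{label="value",...} value."""
    label_text = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_text}}} {value}"


# gpu_idx is a hypothetical extra label; pod_name is the label added in this release.
line = render_sample(
    "alnair_gpu_util",
    {"pod_name": "resnet-train-0", "gpu_idx": "0"},
    87.0,
)
print(line)
```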

3. Alnair Profiler

  • Refactor pod monitoring logic with Kubernetes pod event watch
  • Query the maximum CPU, memory, I/O, network, and GPU utilization from Prometheus
  • Create one record for each pod storing metadata (name, status, start/end time, ...) and utilization data, upsert records to MongoDB
  • Patch utilization data to Pod annotation once
  • Standalone profiler-hook-lib to intercept CUDA memory copy related APIs
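
The query-then-upsert flow in the list above can be sketched roughly as follows. This is an assumption-laden outline, not the profiler's code: the record field names, the one-hour window, and keying the record on the pod name are all hypothetical. max_over_time is a standard PromQL function, and the final call mentioned in the comment assumes a pymongo collection.

```python
from datetime import datetime, timezone


def max_query(metric: str, pod_name: str, window: str = "1h") -> str:
    """Build a PromQL query for the peak value of a metric over a time window."""
    return f'max_over_time({metric}{{pod_name="{pod_name}"}}[{window}])'


def build_upsert(pod_name, status, start_time, end_time, max_util):
    """Return (filter, update) documents for an upsert keyed on the pod name."""
    filter_doc = {"name": pod_name}  # hypothetical key: one record per pod name
    update_doc = {
        "$set": {
            "status": status,
            "start_time": start_time,
            "end_time": end_time,
            "max_util": max_util,
        }
    }
    return filter_doc, update_doc


query = max_query("alnair_gpu_util", "resnet-train-0")
filter_doc, update_doc = build_upsert(
    "resnet-train-0",
    "Succeeded",
    datetime(2022, 10, 26, 20, 0, tzinfo=timezone.utc),
    datetime(2022, 10, 26, 23, 0, tzinfo=timezone.utc),
    {"gpu": 87.0, "cpu": 42.5},  # illustrative values, not measurements
)
# With pymongo this would be applied as:
#   collection.update_one(filter_doc, update_doc, upsert=True)
print(query)
```

Using an upsert means the same call creates the record when the pod is first seen and updates it when the pod finishes, so the watcher does not need separate insert and update paths.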

4. Emerging application (Neural head avatar training and rendering)

  • Avatar training from monocular RGB video, with support for manual expression control
  • Frame-by-frame and video-to-video reenactment, with various acceleration

5. RDMA+MPI+PyTorch Containerization

  • Build container images with and without Mellanox RDMA support
  • Test distributed training performance with and without MPI+RDMA