Alnair Release v0.5.0 · CentaurusInfra/alnair

Release Summary

The v0.5.0 release includes a new Alluxio-based in-memory file-system cache operator and an improved Alnair profiler. The profiler now works with the Alnair exporter, is decoupled from the NVIDIA DCGM exporter, and stores pod data in a structured way in MongoDB. A standalone CUDA API intercept library for profiling is also released to track memcpy behavior between CPUs and GPUs. In addition, the training and rendering of neural head avatars from monocular RGB videos are implemented as a use case for the platform. More services and acceleration centered on this application will be added in future releases. Lastly, an RDMA-supported MPI backend is built with PyTorch (1.8), and container images are released for RDMA acceleration testing.

Key Features and Improvements

1. Alluxio Cache Operator

Track the Pod annotation (cacheDataset:"yes") and automatically load remote data into the cache
Automatically switch from the remote data location to the local cache, transparently to the user
Support NFS and S3 data source auto hydration (cache loading)
Cache capacity and availability management
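
The annotation check and the remote-to-cache switch described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: only the cacheDataset annotation key comes from the release notes, while the function names, the cache mount point, and the way a remote path is mirrored under the cache are assumptions.

```python
CACHE_ANNOTATION = "cacheDataset"  # annotation key named in the release notes


def wants_cache(pod: dict) -> bool:
    """Return True when the Pod manifest carries cacheDataset: "yes"."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    return annotations.get(CACHE_ANNOTATION) == "yes"


def resolve_data_path(pod: dict, remote_path: str, cache_root: str) -> str:
    """Use the cache path for annotated Pods, otherwise keep the remote path.

    The remote-to-cache mapping below is hypothetical.
    """
    if not wants_cache(pod):
        return remote_path
    # Hypothetical layout: mirror the remote path under the cache mount.
    tail = remote_path.split("://", 1)[-1].lstrip("/")
    return cache_root.rstrip("/") + "/" + tail
```

A Pod without the annotation keeps reading from its original location, so the switch stays invisible to workloads that did not opt in.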

2. Alnair Exporter

Update the labels of alnair_gpu_util and alnair_gpu_mem_util to include pod_name
Move the logic for resolving a Pod name from a PID from the profiler to the exporter
Fix a bug that caused the program to exit unexpectedly
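
To illustrate what a pod_name label looks like on these metrics, here is a small sketch that renders one sample in the Prometheus text exposition format. The metric and label names come from the release notes. The helper function and the sample values are assumptions, and the real exporter may attach additional labels.

```python
def format_gauge(name: str, value: float, labels: dict) -> str:
    """Render one sample line in the Prometheus text exposition format."""

    def escape(text: str) -> str:
        # Label values must escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_str = ",".join(
        f'{key}="{escape(str(val))}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{label_str}}} {value}"


# Hypothetical sample: Pod "train-job-0" at 87% GPU utilization.
line = format_gauge("alnair_gpu_util", 87.0, {"pod_name": "train-job-0"})
```

With the label in place, a Prometheus query can select a single Pod's series directly, for example alnair_gpu_util{pod_name="train-job-0"}.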

3. Alnair Profiler

Refactor pod monitoring logic with Kubernetes pod event watch
Query the maximum CPU, memory, I/O, network, and GPU utilization from Prometheus
Create one record for each pod storing metadata (name, status, start/end time, ...) and utilization data, upsert records to MongoDB
Patch utilization data to Pod annotation once
Standalone profiler-hook-lib to intercept CUDA memory copy related APIs
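
The query-then-upsert flow above might look like the following sketch. It assumes the standard JSON shape of a Prometheus range-query (matrix) response. The record field names (name, status, max_gpu_util) and both helper functions are hypothetical and are not taken from the profiler's source.

```python
from typing import Optional, Tuple


def max_from_matrix(response: dict) -> Optional[float]:
    """Largest sample value in a Prometheus range-query (matrix) response."""
    samples = [
        float(value)  # Prometheus encodes sample values as strings
        for series in response.get("data", {}).get("result", [])
        for _timestamp, value in series.get("values", [])
    ]
    return max(samples) if samples else None


def build_upsert(
    pod_name: str, status: str, gpu_util_max: Optional[float]
) -> Tuple[dict, dict]:
    """Return (filter, update) documents for a one-record-per-pod upsert."""
    filter_doc = {"name": pod_name}  # hypothetical key: one record per Pod name
    update_doc = {"$set": {"status": status, "max_gpu_util": gpu_util_max}}
    return filter_doc, update_doc
```

With pymongo, the two documents could be passed to collection.update_one(filter_doc, update_doc, upsert=True), which creates the record on the first event for a Pod and updates it on later ones.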

4. Emerging application (Neural head avatar training and rendering)

Avatar training from monocular RGB video, with support for manual expression control
Frame-by-frame and video-to-video reenactment, with various acceleration

5. RDMA+MPI+PyTorch Containerization

Build container images with and without Mellanox RDMA support
Test distributed training performance with and without MPI+RDMA
