Alnair Release v0.4.0
Release Summary
The v0.4.0 release includes a new Alnair exporter module, a prototype of a Redis-based data caching cluster, improvements to the intercept library, and an open data set of resource utilization for AI workloads.
Alnair-exporter is a Prometheus exporter with custom collectors that expose CUDA-level and GPU-process-level metrics for fine-grained resource utilization monitoring and application performance analysis. A prototype Redis-based data caching cluster is designed to speed up data ingestion for deep learning training. User data sets are fetched and managed in an in-memory data store, and content-based hashing is used to deduplicate data cached by different users. Cached data accelerates training when training scripts run multiple times, which is often the case in the model design phase. The intercept library now supports CUDA >= 11.3 by intercepting cuGetProcAddress, the new CUDA driver API entry point. The token refill rate is adjusted dynamically to improve fractional GPU utilization, and a monitoring thread has been added to the intercept library to report CUDA-level metrics.
Key Features and Improvements
1. Alnair Exporter
- Prometheus-based metrics exporter that connects directly to the Prometheus server
- Custom collectors across different layers: CUDA and GPU process
- Fine-grained metrics for precise control and verification of GPU sharing
- Six Alnair metrics in an extensible framework
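As a rough illustration of the custom-collector pattern the exporter uses, the sketch below builds a per-process GPU collector with the standard `prometheus_client` library. The metric names and the `fetch_metrics` callback are assumptions for illustration, not the exporter's actual interface.

```python
from prometheus_client.core import GaugeMetricFamily


class GpuProcessCollector:
    """Custom collector: metrics are gathered fresh on every Prometheus scrape."""

    def __init__(self, fetch_metrics):
        # fetch_metrics() -> iterable of (pid, utilization_pct, mem_bytes);
        # a hypothetical helper standing in for the real CUDA/GPU probes.
        self.fetch_metrics = fetch_metrics

    def collect(self):
        util = GaugeMetricFamily(
            "alnair_gpu_process_utilization",
            "Per-process GPU utilization (%)", labels=["pid"])
        mem = GaugeMetricFamily(
            "alnair_gpu_process_memory_bytes",
            "Per-process GPU memory usage (bytes)", labels=["pid"])
        for pid, u, m in self.fetch_metrics():
            util.add_metric([str(pid)], u)
            mem.add_metric([str(pid)], m)
        yield util
        yield mem
```

Registering an instance with `prometheus_client.REGISTRY.register(...)` and calling `prometheus_client.start_http_server(port)` would expose these metrics for scraping.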
2. Data Caching Cluster
- Redis-based in-memory key-value store for high read throughput
- Content-based hashing to avoid storing duplicate copies when different users cache the same data set
- Optimized PyTorch DataLoader with prefetching and multithreading
- Complete CRD/Operator offering that minimizes deployment effort
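The content-based deduplication idea can be sketched as follows: each sample is keyed by the hash of its bytes, so identical content cached by different users is stored once. This is a minimal stand-in, with a plain dict in place of the Redis cluster and hypothetical method names.

```python
import hashlib


class ContentAddressedCache:
    """Sketch of content-based deduplication across users.

    A dict stands in for the Redis in-memory store so the example
    runs without a server; in Alnair this would be the Redis cluster.
    """

    def __init__(self):
        self.store = {}   # content hash -> raw bytes (one copy per unique content)
        self.index = {}   # (user, sample path) -> content hash

    def put(self, user, path, data: bytes):
        key = hashlib.sha256(data).hexdigest()
        self.store.setdefault(key, data)   # write only if content is new
        self.index[(user, path)] = key
        return key

    def get(self, user, path) -> bytes:
        return self.store[self.index[(user, path)]]


cache = ContentAddressedCache()
k1 = cache.put("alice", "train/img0.png", b"pixel-data")
k2 = cache.put("bob", "data/img0.png", b"pixel-data")  # same bytes, no extra copy
assert k1 == k2 and len(cache.store) == 1
```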
3. Intercept lib
- Dynamically adjust the token refill rate to boost GPU utilization during the initial ramp-up phase
- Add a new intercept flow for cuGetProcAddress, the CUDA driver entry point introduced in CUDA 11.3
- Add a monitoring thread that counts kernel launches, memory, and token usage every 0.01 s
- Fix the cgroup.procs access issue inside containers through the Docker client API
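To illustrate the dynamic token-refill idea behind fractional GPU sharing, here is a minimal token-bucket sketch. The specific adjustment rule (temporarily multiplying the refill rate while measured utilization is below quota) is an assumption for illustration; the intercept library's actual policy may differ.

```python
import time


class TokenBucket:
    """Sketch of a token bucket throttling intercepted kernel launches."""

    def __init__(self, quota, base_rate, capacity):
        self.quota = quota          # target GPU fraction, e.g. 0.5
        self.base_rate = base_rate  # steady-state tokens per second
        self.rate = base_rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def refill(self, measured_util):
        now = time.monotonic()
        # Ramp up the refill rate while the process is under its quota,
        # so utilization climbs quickly after a cold start.
        if measured_util < self.quota:
            self.rate = min(self.rate * 2, 4 * self.base_rate)
        else:
            self.rate = self.base_rate
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_launch(self, cost=1.0):
        # Called on the intercepted kernel-launch path: proceed only
        # if enough tokens remain; otherwise the caller must wait.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```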
4. Open Data Set
- Five types of AI workloads are executed with various parameter settings
- Recorded resource utilization traces (60+ features) are shared and analyzed