Alnair Release v0.4.0
Release Summary
The v0.4.0 release includes a new Alnair exporter module, a prototype of a Redis-based data caching cluster, improvements to the intercept library, and an open data set of resource utilization for AI workloads.
Alnair-exporter is a Prometheus exporter with custom collectors that expose CUDA-level and GPU-process-level metrics for fine-grained resource utilization monitoring and application performance analysis. A prototype Redis-based data caching cluster is designed to speed up data ingestion for deep learning training. User data sets are fetched and managed in an in-memory data store, and content-based hashing is used to deduplicate data cached by different users. Cached data accelerates training when training scripts run multiple times, which is often the case in the model design phase. The intercept library now supports CUDA >= 11.3 by intercepting cuGetProcAddress, the new CUDA driver API entry point. The token refill rate is adjusted dynamically to improve fractional GPU utilization, and a monitoring thread has been added to the intercept library to report CUDA-level metrics.
Key Features and Improvements
1. Alnair Exporter
- Prometheus-based metrics exporter that connects directly to the Prometheus server
- Custom collectors across different layers: CUDA and GPU process
- Fine-grained metrics for precise control and verification of GPU sharing
- Six Alnair metrics in an extensible framework
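As a rough illustration of the custom-collector pattern the exporter uses, the sketch below builds a per-process GPU collector with the standard `prometheus_client` library. The metric names and the `fetch_metrics` callback are assumptions for illustration, not the exporter's actual interface.

```python
from prometheus_client.core import GaugeMetricFamily


class GpuProcessCollector:
    """Custom collector: metrics are gathered fresh on every Prometheus scrape."""

    def __init__(self, fetch_metrics):
        # fetch_metrics() -> iterable of (pid, utilization_pct, mem_bytes);
        # a hypothetical helper standing in for the real CUDA/GPU probes.
        self.fetch_metrics = fetch_metrics

    def collect(self):
        util = GaugeMetricFamily(
            "alnair_gpu_process_utilization",
            "Per-process GPU utilization (%)", labels=["pid"])
        mem = GaugeMetricFamily(
            "alnair_gpu_process_memory_bytes",
            "Per-process GPU memory usage (bytes)", labels=["pid"])
        for pid, u, m in self.fetch_metrics():
            util.add_metric([str(pid)], u)
            mem.add_metric([str(pid)], m)
        yield util
        yield mem
```

Registering an instance with `prometheus_client.REGISTRY.register(...)` and calling `prometheus_client.start_http_server(port)` would expose these metrics for scraping.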
2. Data Caching Cluster
- Redis-based in-memory key-value store for high read throughput
- Content-based hashing to avoid storing duplicate copies when different users cache the same data set
- Optimized PyTorch DataLoader with prefetching and multithreading
- Complete CRD/Operator offering that minimizes deployment effort
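The content-based deduplication idea can be sketched as follows: each sample is keyed by the hash of its bytes, so identical content cached by different users is stored once. This is a minimal stand-in, with a plain dict in place of the Redis cluster and hypothetical method names.

```python
import hashlib


class ContentAddressedCache:
    """Sketch of content-based deduplication across users.

    A dict stands in for the Redis in-memory store so the example
    runs without a server; in Alnair this would be the Redis cluster.
    """

    def __init__(self):
        self.store = {}   # content hash -> raw bytes (one copy per unique content)
        self.index = {}   # (user, sample path) -> content hash

    def put(self, user, path, data: bytes):
        key = hashlib.sha256(data).hexdigest()
        self.store.setdefault(key, data)   # write only if content is new
        self.index[(user, path)] = key
        return key

    def get(self, user, path) -> bytes:
        return self.store[self.index[(user, path)]]


cache = ContentAddressedCache()
k1 = cache.put("alice", "train/img0.png", b"pixel-data")
k2 = cache.put("bob", "data/img0.png", b"pixel-data")  # same bytes, no extra copy
assert k1 == k2 and len(cache.store) == 1
```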
3. Intercept lib
- Dynamically adjust the token refill rate to boost GPU utilization during the initial ramp-up phase
- Add a new intercept flow for cuGetProcAddress, the CUDA driver entry point introduced in CUDA 11.3
- Add a monitoring thread that counts kernel launches, memory, and token usage every 0.01 s
- Fix the cgroup.procs access issue inside containers through the Docker client API
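To illustrate the dynamic token-refill idea behind fractional GPU sharing, here is a minimal token-bucket sketch. The specific adjustment rule (temporarily multiplying the refill rate while measured utilization is below quota) is an assumption for illustration; the intercept library's actual policy may differ.

```python
import time


class TokenBucket:
    """Sketch of a token bucket throttling intercepted kernel launches."""

    def __init__(self, quota, base_rate, capacity):
        self.quota = quota          # target GPU fraction, e.g. 0.5
        self.base_rate = base_rate  # steady-state tokens per second
        self.rate = base_rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def refill(self, measured_util):
        now = time.monotonic()
        # Ramp up the refill rate while the process is under its quota,
        # so utilization climbs quickly after a cold start.
        if measured_util < self.quota:
            self.rate = min(self.rate * 2, 4 * self.base_rate)
        else:
            self.rate = self.base_rate
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_launch(self, cost=1.0):
        # Called on the intercepted kernel-launch path: proceed only
        # if enough tokens remain; otherwise the caller must wait.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```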
4. Open Data Set
- Five types of AI workloads are executed with various parameter settings
- Recorded resource utilization traces (60+ features) are shared and analyzed