This project is built on top of the vLLM project.
CachedLLM is a multi-tenant LoRA serving system designed to improve memory management and reduce memory exchange overhead. It keeps persistent states and LoRA adapters in main memory and caches them in VRAM across invocations using a dynamic page cache. This design significantly improves GPU utilization, latency, and throughput in VRAM-limited scenarios.
- Dynamic Page Cache: Stores data and weights in pages and dynamically manages the page cache based on access frequency.
- Efficient Kernels: Optimizes the SGMV kernel for paged LoRA computation.
- Memory Management: Reduces memory fragmentation and improves VRAM utilization.
The adapter cache pool is divided into pages of equal size for fine-grained management, significantly reducing memory fragmentation and improving memory management efficiency.
CachedLLM uses a least-recently-used (LRU) cache policy to minimize the number of adapter swaps between main memory and VRAM, reducing memory exchange overhead.
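The two ideas above (a paged adapter pool plus LRU eviction) can be sketched as follows. This is an illustrative sketch only, not CachedLLM's actual implementation; the class and method names are hypothetical.

```python
from collections import OrderedDict

class PagedAdapterCache:
    """Illustrative sketch of an LRU cache over fixed-size pages of
    adapter weights (hypothetical API, not CachedLLM's real code)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size              # elements per page
        self.free_pages = list(range(num_pages))
        self.lru = OrderedDict()                # adapter_id -> page ids

    def get(self, adapter_id):
        """Return pages for a cached adapter, marking it most recently used."""
        if adapter_id in self.lru:
            self.lru.move_to_end(adapter_id)
            return self.lru[adapter_id]
        return None

    def put(self, adapter_id, num_elements):
        """Allocate pages for an adapter, evicting LRU adapters as needed."""
        needed = -(-num_elements // self.page_size)  # ceil division
        while len(self.free_pages) < needed:
            _, pages = self.lru.popitem(last=False)  # evict least recent
            self.free_pages.extend(pages)
        pages = [self.free_pages.pop() for _ in range(needed)]
        self.lru[adapter_id] = pages
        return pages
```

Because all pages are the same size, evicting any adapter always frees page-aligned blocks that any other adapter can reuse, which is what eliminates external fragmentation in the pool.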
The batching mechanism in CachedLLM enables batching requests of different LoRA adapters, increasing the number of batched requests per computation and thus improving throughput.
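One common way to batch requests across different adapters is to sort requests by adapter and record per-adapter segment offsets, which is the input layout an SGMV-style kernel consumes. The sketch below is a hypothetical illustration of that grouping step; the function name and request format are assumptions, not CachedLLM's API.

```python
def build_multi_lora_batch(requests):
    """Illustrative sketch: group requests from different LoRA adapters
    into one batch with per-segment adapter ids and segment offsets.
    `requests` is a list of (adapter_id, input) pairs (hypothetical)."""
    requests = sorted(requests, key=lambda r: r[0])  # group by adapter
    inputs, adapter_ids, seg_starts = [], [], []
    for adapter_id, x in requests:
        if not adapter_ids or adapter_ids[-1] != adapter_id:
            adapter_ids.append(adapter_id)       # new segment begins
            seg_starts.append(len(inputs))
        inputs.append(x)
    seg_starts.append(len(inputs))               # sentinel end offset
    return inputs, adapter_ids, seg_starts
```

With this layout, a single kernel launch can process the whole batch, applying a different adapter to each contiguous segment instead of launching one kernel per adapter.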
CachedLLM uses optimized kernels based on the SGMV (Segmented Gather Matrix-Vector multiplication) kernel from Punica.
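The semantics of SGMV can be stated as a simple reference computation: each contiguous segment of the batch is multiplied by its own adapter's low-rank weight pair. The NumPy sketch below shows these reference semantics only (shapes and names are illustrative); the optimized kernel fuses this into a single GPU launch over paged adapter weights.

```python
import numpy as np

def sgmv_reference(x, A, B, seg_starts, adapter_ids):
    """Unoptimized reference semantics of SGMV.
    x: (batch, d_in); A: (n_adapters, d_in, rank); B: (n_adapters, rank, d_out).
    Segment i spans rows seg_starts[i]:seg_starts[i+1] and uses adapter
    adapter_ids[i]. Illustrative sketch, not CachedLLM's actual kernel."""
    y = np.zeros((x.shape[0], B.shape[2]))
    for i, a in enumerate(adapter_ids):
        s, e = seg_starts[i], seg_starts[i + 1]
        # project down to rank, then back up to d_out, per segment
        y[s:e] = x[s:e] @ A[a] @ B[a]
    return y
```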
CachedLLM reduces task-switching overhead by up to 10% compared to vLLM and Hugging Face.
CachedLLM demonstrated up to a 3x improvement in throughput compared to vLLM.
CachedLLM offers significant improvements in latency, throughput, and memory management for multi-tenant LoRA serving scenarios. It efficiently manages VRAM usage, reducing task-switching overhead and improving batching efficiency.