Description
UCM aims to accelerate long-sequence inference through three mechanisms: replacing KV computation with cache lookup in the Prefill phase, sparsification in the Decode phase, and a KVCache-centric PD (Prefill-Decode) disaggregated architecture for large-scale scenarios.
The first version of UCM delivers basic sparsification-based acceleration for long sequences and includes a working heterogeneous PD disaggregation example. In Q4, we will progressively release further long-sequence acceleration features to improve inference performance, reduce inference cost, and address cases where long sequences either cannot be served at all or are served too slowly.
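To illustrate "lookup instead of KV computation" in the Prefill phase, here is a minimal sketch of prefix-block reuse. It is not UCM's API: `BLOCK_SIZE`, `block_hashes`, `fake_compute_kv`, and `prefill` are hypothetical names, the store is a plain dict, and the KV computation is a placeholder standing in for a model forward pass.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, not a UCM default)


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Chain-hash every full block so each hash identifies the whole prefix up to it."""
    hashes: List[str] = []
    prev = ""
    full_len = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full_len, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def fake_compute_kv(tokens: List[int]) -> List[Tuple[str, int]]:
    """Placeholder for the real attention KV computation (assumption)."""
    return [("kv", t) for t in tokens]


def prefill(tokens: List[int], store: Dict[str, list]) -> Tuple[int, int]:
    """Return (tokens reused from the store, tokens that had to be computed)."""
    hashes = block_hashes(tokens)
    hit_blocks = 0
    for h in hashes:  # the longest cached prefix wins; stop at the first miss
        if h not in store:
            break
        hit_blocks += 1
    # Compute and save only the full blocks that were not found.
    for i in range(hit_blocks, len(hashes)):
        start = i * BLOCK_SIZE
        store[hashes[i]] = fake_compute_kv(tokens[start:start + BLOCK_SIZE])
    reused = hit_blocks * BLOCK_SIZE
    return reused, len(tokens) - reused


if __name__ == "__main__":
    kv_store: Dict[str, list] = {}
    print(prefill(list(range(10)), kv_store))                       # (0, 10)
    print(prefill(list(range(8)) + [99, 100, 101, 102], kv_store))  # (8, 4)
```

The second request shares an 8-token prefix with the first, so only its last block is computed. Chained hashing is one common way to make a block's key depend on everything before it; whether UCM keys blocks this way is not stated here.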
Core
- CacheBlend
- Prefill KVCache Offload
- Model Window Extrapolation
- Sparse
  - DSA
  - GSA Optimization
  - KVComp Optimization
  - KVStar Optimization
- PD Disaggregation
  - Heterogeneous Optimization
  - PD Scheduler
- Store
  - Scatter Gather IO
  - GPU Direct Storage
  - NPU Direct Storage
  - localCacheStore
Others
- Docs Optimization
- Benchmark
  - Mooncake Trace and more datasets for PD testing
  - Benchmarks for sparse performance and accuracy