v1.2.0
1.2.0 (2026-03-29)
Features
benchmark: add multi-model support for text-only models (#32) (f383895)
kv-cache: add context manager protocol and double-compression detection (#24) (d5c58f5)
triton: add fused paged TQ4 decode attention kernel (#37) (ae7941e)
triton: add fused paged TQ4 INT8 prefill kernel (#41) (bff651b)
triton: add out parameter to tq4 compress/decompress wrappers (#34) (6fc60d8)
verify: add verify CLI for compression quality checks (#27) (91cbf0e)
vllm: add CUDA graph buffer pre-allocation to TQ4 backend (#35) (2106d09)
vllm: add fused paged TQ4 decode backend integration and feature gating (#39) (fa6b220)
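The kv-cache entry pairs a context manager with double-compression detection. The snippet below is a minimal sketch of that general pattern in plain Python, not this project's implementation. The names `CompressionContext`, `AlreadyCompressedError`, and the `_compression_active` marker attribute are hypothetical, and the sketch assumes "double compression" means entering compression on a cache that an outer context has already marked as compressed.

```python
class AlreadyCompressedError(RuntimeError):
    """Hypothetical error raised when a cache is compressed a second time."""


class CompressionContext:
    """Hypothetical sketch: mark a cache as compressed for the duration of a with-block."""

    _MARKER = "_compression_active"  # hypothetical marker attribute name

    def __init__(self, cache):
        self.cache = cache

    def __enter__(self):
        # Double-compression detection: refuse to enter if an outer
        # context has already set the marker on this cache object.
        if getattr(self.cache, self._MARKER, False):
            raise AlreadyCompressedError("cache is already compressed")
        setattr(self.cache, self._MARKER, True)
        return self.cache

    def __exit__(self, exc_type, exc, tb):
        # Clear the marker even when the with-body raised an exception.
        setattr(self.cache, self._MARKER, False)
        return False  # never swallow exceptions from the body


if __name__ == "__main__":
    class DummyCache:
        pass

    cache = DummyCache()
    with CompressionContext(cache):
        try:
            with CompressionContext(cache):  # nested entry on the same cache
                pass
        except AlreadyCompressedError as err:
            print("detected:", err)
    print("still compressed:", cache._compression_active)
```

Because `__exit__` runs only for a context whose `__enter__` succeeded, a rejected nested entry leaves the outer context's marker in place until the outer block exits.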
Bug Fixes
docs: address Copilot review findings from Epic 6 PRs (#40) (f36ba66), closes #38
experiments: address code review findings for Story 6.5 (073166c)
git: restore directory-only matching and IDE ignores in gitignore (4ecc62e)
test: add gc.collect() before cuda empty_cache in verify GPU test (e003f24)
triton: resolve code review findings for fused paged TQ4 kernel (ae7941e)
verify: handle explicit None head_dim in _detect_model_config (418661d)
verify: restrict --bits to valid choices [3, 4] (91cbf0e)
vllm: guard INT8 prefill dispatch for single-sequence only (bff651b)
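Two of the verify fixes involve well-known Python pitfalls. Built-in `getattr(obj, name, default)` returns the default only when the attribute is missing, so a config that sets `head_dim=None` explicitly still yields `None`. Separately, argparse's `choices` rejects any `--bits` value outside the allowed list at parse time. The following is a minimal sketch that assumes a Hugging Face-style config with `hidden_size`, `num_attention_heads`, and an optional `head_dim`. The helpers `detect_head_dim` and `build_parser` are hypothetical and are not the project's `_detect_model_config` or its CLI code.

```python
import argparse
from types import SimpleNamespace


def detect_head_dim(config):
    """Hypothetical helper: resolve head_dim from a HF-style model config.

    getattr() alone would return None for a config that sets head_dim=None
    explicitly, so None is treated the same as a missing attribute.
    """
    head_dim = getattr(config, "head_dim", None)
    if head_dim is None:
        # Assumed fallback: derive the per-head size from the hidden size.
        head_dim = config.hidden_size // config.num_attention_heads
    return head_dim


def build_parser():
    """Hypothetical parser for a verify-style CLI."""
    parser = argparse.ArgumentParser(prog="verify")
    # choices=[3, 4] makes argparse exit with a usage error for any other value.
    parser.add_argument("--bits", type=int, choices=[3, 4])
    return parser


if __name__ == "__main__":
    cfg = SimpleNamespace(hidden_size=4096, num_attention_heads=32, head_dim=None)
    print(detect_head_dim(cfg))  # 4096 // 32 = 128
    print(build_parser().parse_args(["--bits", "3"]).bits)  # 3
```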
Performance Improvements
benchmark: add experiment 018 CUDA graph decode latency (#36) (4ad2210)
benchmark: add experiment 018 fused decode smoke test log (fa6b220)
triton: add kernel benchmarks and optimize autotune configs (#42) (073166c)