v1.2.0

Alberto-Codes released this 29 Mar 22:33

55a1280

1.2.0 (2026-03-29)

Features

benchmark: add multi-model support for text-only models (#32) (f383895)
kv-cache: add context manager protocol and double-compression detection (#24) (d5c58f5)
triton: add fused paged TQ4 decode attention kernel (#37) (ae7941e)
triton: add fused paged TQ4 INT8 prefill kernel (#41) (bff651b)
triton: add out parameter to tq4 compress/decompress wrappers (#34) (6fc60d8)
verify: add verify CLI for compression quality checks (#27) (91cbf0e)
vllm: add CUDA graph buffer pre-allocation to TQ4 backend (#35) (2106d09)
vllm: add fused paged TQ4 decode backend integration and feature gating (#39) (fa6b220)

Bug Fixes

docs: address Copilot review findings from Epic 6 PRs (#40) (f36ba66), closes #38
experiments: address code review findings for Story 6.5 (073166c)
git: restore directory-only matching and IDE ignores in gitignore (4ecc62e)
test: add gc.collect() before cuda empty_cache in verify GPU test (e003f24)
triton: resolve code review findings for fused paged TQ4 kernel (ae7941e)
verify: handle explicit None head_dim in _detect_model_config (418661d)
verify: restrict --bits to valid choices [3, 4] (91cbf0e)
vllm: guard INT8 prefill dispatch for single-sequence only (bff651b)
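
For context on the `gc.collect()` fix above, here is a minimal sketch of the general cleanup pattern, not the project's actual test code. The helper name `release_gpu_memory` is hypothetical. The order matters because `torch.cuda.empty_cache()` can only return cached blocks that no live tensor still holds, and tensors caught in reference cycles stay alive until the garbage collector runs.

```python
import gc


def release_gpu_memory():
    """Collect unreachable Python objects, then release cached CUDA blocks.

    Hypothetical helper for illustration; returns the number of objects
    the garbage collector found unreachable.
    """
    # Run the collector first so tensors held only by reference cycles
    # are destroyed and their memory goes back to PyTorch's caching allocator.
    collected = gc.collect()
    try:
        import torch  # optional: the sketch degrades gracefully without PyTorch
    except ImportError:
        return collected
    if torch.cuda.is_available():
        # Hand the now-unused cached blocks back to the CUDA driver.
        torch.cuda.empty_cache()
    return collected
```

Calling such a helper in a test teardown is one way to reduce memory pressure between GPU tests.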

Performance Improvements

benchmark: add experiment 018 CUDA graph decode latency (#36) (4ad2210)
benchmark: add experiment 018 fused decode smoke test log (fa6b220)
triton: add kernel benchmarks and optimize autotune configs (#42) (073166c)
