v1.2.0

@Alberto-Codes released this 29 Mar 22:33
55a1280

1.2.0 (2026-03-29)

Features

  • benchmark: add multi-model support for text-only models (#32) (f383895)
  • kv-cache: add context manager protocol and double-compression detection (#24) (d5c58f5)
  • triton: add fused paged TQ4 decode attention kernel (#37) (ae7941e)
  • triton: add fused paged TQ4 INT8 prefill kernel (#41) (bff651b)
  • triton: add out parameter to tq4 compress/decompress wrappers (#34) (6fc60d8)
  • verify: add verify CLI for compression quality checks (#27) (91cbf0e)
  • vllm: add CUDA graph buffer pre-allocation to TQ4 backend (#35) (2106d09)
  • vllm: add fused paged TQ4 decode backend integration and feature gating (#39) (fa6b220)
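
As a rough illustration of the kv-cache item above (#24), a context manager protocol combined with double-compression detection could look like the minimal sketch below. The class name `CompressedCache`, its `compress` and `close` methods, and `DoubleCompressionError` are hypothetical names chosen for this example; they are not taken from the project's API, and the real detection logic may differ.

```python
class DoubleCompressionError(RuntimeError):
    """Raised when compress() is called on an already-compressed cache."""


class CompressedCache:
    """Hypothetical stand-in for a compressed KV cache wrapper."""

    def __init__(self):
        self._compressed = False  # tracks whether compress() has already run
        self.closed = False

    def compress(self):
        # Double-compression detection: refuse to compress twice.
        if self._compressed:
            raise DoubleCompressionError("cache is already compressed")
        self._compressed = True

    def close(self):
        # Placeholder for releasing buffers or restoring patched state.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block


with CompressedCache() as cache:
    cache.compress()
    try:
        cache.compress()  # second call should be rejected
        detected = False
    except DoubleCompressionError:
        detected = True
```

Returning `False` from `__exit__` keeps errors visible to the caller while still guaranteeing that cleanup runs when the `with` block exits.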

Bug Fixes

  • docs: address Copilot review findings from Epic 6 PRs (#40) (f36ba66), closes #38
  • experiments: address code review findings for Story 6.5 (073166c)
  • git: restore directory-only matching and IDE ignores in gitignore (4ecc62e)
  • test: add gc.collect() before cuda empty_cache in verify GPU test (e003f24)
  • triton: resolve code review findings for fused paged TQ4 kernel (ae7941e)
  • verify: handle explicit None head_dim in _detect_model_config (418661d)
  • verify: restrict --bits to valid choices [3, 4] (91cbf0e)
  • vllm: guard INT8 prefill dispatch for single-sequence only (bff651b)
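
The `--bits` restriction listed above can be sketched with the standard library's `argparse`, assuming the verify CLI is argparse-based (the release notes do not say which parser it uses). The `build_parser` helper and the `prog` name are illustrative only.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --bits choices come from the release notes.
    parser = argparse.ArgumentParser(prog="verify")
    parser.add_argument(
        "--bits",
        type=int,
        choices=[3, 4],  # any other value is rejected at parse time
        help="quantization bit width",
    )
    return parser


args = build_parser().parse_args(["--bits", "3"])

# An out-of-range value makes argparse print a usage error and exit.
try:
    build_parser().parse_args(["--bits", "5"])
    rejected = False
except SystemExit:
    rejected = True
```

Validating at parse time means an unsupported bit width fails before any model or GPU work starts.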

Performance Improvements

  • benchmark: add experiment 018 CUDA graph decode latency (#36) (4ad2210)
  • benchmark: add experiment 018 fused decode smoke test log (fa6b220)
  • triton: add kernel benchmarks and optimize autotune configs (#42) (073166c)