v26.04
What's Changed
Features & Enhancements
- Add hash_roundrobin routing mode to mitigate modulo-aliasing imbalance by @ShaobinChen-AH in #367
- Jagged Arbitrary Masked Self Attention support by @z52527 in #339
- fix segmented_unique_cuda: replace table_ids with segmented_range by @jiashuy in #377
- perf(hstu): restore eager-mode .item() in preprocessor; drop duplicate triton_jagged.py by @JacoCheung in #389
- Improve AOTI compilation of HSTU model by @geoffreyQiu in #380
- Recsys KVCache Manager refactored into standalone package by @geoffreyQiu in #387
- Add inference AOTI benchmark results by @geoffreyQiu in #394
- [FEA] Beam search by @z52527 in #379
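
For context on the modulo-aliasing imbalance mentioned in #367: the sketch below illustrates the general problem only. It is not the `hash_roundrobin` implementation from that PR. The helper names (`mix64`, `route_modulo`, `route_hashed`) and the key pattern are hypothetical. When keys share a common factor with the shard count, plain `key % num_shards` sends them all to the same shard, while mixing the key bits first tends to spread them out.

```python
from collections import Counter

MASK64 = (1 << 64) - 1


def mix64(key: int) -> int:
    """Generic SplitMix64-style bit mixer (illustrative, not the project's hash)."""
    key &= MASK64
    key = ((key ^ (key >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    key = ((key ^ (key >> 27)) * 0x94D049BB133111EB) & MASK64
    return key ^ (key >> 31)


def route_modulo(key: int, num_shards: int) -> int:
    """Plain modulo routing: aliases when keys share a factor with num_shards."""
    return key % num_shards


def route_hashed(key: int, num_shards: int) -> int:
    """Mix the key bits before taking the modulo."""
    return mix64(key) % num_shards


num_shards = 4
keys = range(0, 8000, 8)  # hypothetical keys with stride 8 (1000 keys)

modulo_load = Counter(route_modulo(k, num_shards) for k in keys)
hashed_load = Counter(route_hashed(k, num_shards) for k in keys)

print("modulo:", dict(modulo_load))  # every key lands on shard 0
print("hashed:", dict(hashed_load))  # keys spread over several shards
```

With stride-8 keys and 4 shards, the modulo route places all 1000 keys on shard 0, since every multiple of 8 is also a multiple of 4. The hashed route does not depend on that arithmetic structure.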
Bug Fixes
- fix: unify dense tensor padding convention (dim-0 == batch_size) by @JacoCheung in #362
- fix(dynamicemb): traverse nn.Module children in check_emb_collection_modules by @JacoCheung in #355
- fix(ddp): bucket_size=True silently disables grad bucketing by @JacoCheung in #374
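
For context on #374: in Python, `bool` is a subclass of `int`, so `True` passes integer type checks and behaves as `1`. The sketch below shows how that pitfall can turn a bucket size into "one parameter per bucket". It is an illustration under assumptions, not the DDP code fixed in that PR. `plan_buckets` and `validate_bucket_size` are hypothetical helpers.

```python
def plan_buckets(param_sizes, bucket_size):
    """Greedily group parameter sizes into buckets of at most bucket_size elements.

    bucket_size=None means "no limit": everything goes into one bucket.
    """
    if bucket_size is None:
        return [list(param_sizes)]
    buckets, current, filled = [], [], 0
    for size in param_sizes:
        # Start a new bucket when adding this parameter would exceed the limit.
        if current and filled + size > bucket_size:
            buckets.append(current)
            current, filled = [], 0
        current.append(size)
        filled += size
    if current:
        buckets.append(current)
    return buckets


def validate_bucket_size(bucket_size):
    """Reject bool explicitly, since isinstance(True, int) is True."""
    if isinstance(bucket_size, bool):
        raise TypeError("bucket_size must be an int or None, not bool")
    if bucket_size is not None and not isinstance(bucket_size, int):
        raise TypeError("bucket_size must be an int or None")
    return bucket_size


sizes = [1024, 2048, 512, 4096]  # hypothetical parameter element counts

print(isinstance(True, int), True == 1)  # True True
print(plan_buckets(sizes, 8192))         # [[1024, 2048, 512, 4096]]
print(plan_buckets(sizes, True))         # [[1024], [2048], [512], [4096]]
```

Passing `True` raises no error here, but every parameter ends up in its own bucket, so nothing is actually grouped. An explicit `bool` check such as `validate_bucket_size` turns the silent misconfiguration into a `TypeError`.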
Misc
- fix: reduce Docker layers, add auto CI trigger, fix fake ops import by @JacoCheung in #363
- delete invalid line by @ShaobinChen-AH in #381
- build(docker): bump Megatron-LM 0.12.1 -> 0.13.1 to fix count_zeros wasted work by @JacoCheung in #375
- Clean up unused variables in get_kvcache_metadata_buffer by @gameofdimension in #371
- Update blossom-ci.yml by @z52527 in #391
Full Changelog: v26.03...v26.04