v25.09
What's Changed
Features & Enhancements
- Dynamicemb prefetch integration by @JacoCheung in #181
- Support distributed embedding dumping for dynamicemb by @z52527 @shijieliu in #120 #185
- Add kernel fusion in HSTU block for inference, with KVCache fixes by @geoffreyQiu in #184
- export hstu fp8 quant by @shijieliu in #168
- Replace BatchedDynamicEmbeddingTables with BatchedDynamicEmbeddingTablesV2 by @jiashuy in #155
Bug Fixes
- fix DynamicEmbDump - handle long strings in broadcast_string by @fshhr46 in #164
- fix: consider mask when calc hstu attn flops by @shijieliu in #177
- export fix hstu ima when num_candidates = seqlen by @shijieliu in #183
Misc
- Make local hbm budget grow when num_embeddings grows. by @jiashuy in #156
- Fix several errors for inference. by @geoffreyQiu in #167
- Fix setup.py by @yiwenchen2025 in #169
- Suppress mcore deps install by @JacoCheung in #170
- dynamicemb clean BatchedDynamicEmbeddingTables by @jiashuy in #179
- Update hstu layer benchmark doc by @JacoCheung in #171
- Update dynamicemb's benchmark and example with README.md by @jiashuy in #188
New Contributors
Full Changelog: v25.08...v25.09