v25.10
What's Changed
Features & Enhancements
- Add sequence parallelism by @JacoCheung in #216
- Decouple scaling seqlen from max_seqlen in hstu attn by @geoffreyQiu in #208
- Fea support lru score dump load by @shijieliu in #186
- Gradient clipping by reusing TorchRec&FBGEMM's parameters by @jiashuy in #223
- [HSTU]Add SM 89 support by @JacoCheung in #217
- allow allow_overwrite in DynamicEmbDump by @fshhr46 in #206
Bug Fixes
- Fix LFU mode frequency count bug by @z52527 in #176
- Fix config bug when using torchrec's STBE in benchmark by @jiashuy in #193
- Fix IMA in incremental dump and test the dumped embeddings by @jiashuy in #211
- Fix rab num heads by @JacoCheung in #222
- Fix IMA caused by wrong worker id for device of which max threads is … by @jiashuy in #220
Misc
- Code reorganization for hstu training and inference by @geoffreyQiu in #202
- Add embedding pooling kernel by @z52527 in #215
Full Changelog: v25.09...v25.10