Fix non-canonical cu_seqlens_k from preprocessor #514
Open
jlamypoirier wants to merge 2 commits into
Conversation
The data preprocessor emitted `cu_seqlens_k[0] = first_document_begin` rather than 0, violating the canonical varlen prefix-sum layout required by every public varlen attention API. SDPA's EFFICIENT backward writes corrupted dK/dV rows when fed this layout, propagating wrong gradients through the K/V projection's reduce-scatter under sequence-data-parallel + micro-batch splits.

Three changes that compose:

- `LengthModelInputPreprocessor` now produces `cu_seqlens_k` starting at 0 and narrows `document_index_k` / `position_index` to the active K extent. The dropped leading-prefix length is exposed as a new `first_document_begin` int kwarg (see the sketch after this list).
- Pre-allocate one K/V buffer per attention layer across all micro-sequences of a sequence. Each forward writes the SDP-gather result into the next slice via `gather_op(out=)`; backward accumulates each micro-seq's K/V grad into a shared grad buffer slice. The leading + trailing narrows and the per-step `torch.cat` / `AttachGrad` workaround for the cross-micro-seq splice are all absorbed into the `_query_key_value` custom autograd region.
- `_preprocess_for_backup_attention` builds the attention mask against the narrowed K cols, so `sdpa_dense` and `backup` consume the same K extent as flash and `sdpa_nested`.

Update `tests/data/test_preprocessing.py` to expect the canonical layout.
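For reference, a minimal sketch of the canonical layout this restores. The helper name, shapes, and offsets below are illustrative only, not the actual `LengthModelInputPreprocessor` code:

```python
import torch

def canonicalize_k_layout(document_begins, document_ends, first_document_begin):
    # Illustrative helper, not the preprocessor's real code: rebase per-document
    # K offsets so the varlen prefix sum starts at 0, and report the dropped
    # leading-prefix length separately instead of folding it into cu_seqlens_k.
    begins = torch.as_tensor(document_begins, dtype=torch.int32) - first_document_begin
    ends = torch.as_tensor(document_ends, dtype=torch.int32) - first_document_begin
    lengths = ends - begins
    # Canonical varlen layout: cu_seqlens_k[0] == 0, cu_seqlens_k[-1] == active K tokens.
    cu_seqlens_k = torch.cat([lengths.new_zeros(1), lengths.cumsum(0).to(torch.int32)])
    # document_index_k / position_index would be narrowed to [0, cu_seqlens_k[-1]) here.
    return cu_seqlens_k, first_document_begin

# Two active documents spanning tokens [7, 12) and [12, 15), behind a 7-token
# prefix that the preprocessor drops from the K window.
cu_seqlens_k, first_document_begin = canonicalize_k_layout([7, 12], [12, 15], 7)
assert cu_seqlens_k.tolist() == [0, 5, 8]  # previously the leading entry was 7
assert first_document_begin == 7           # dropped prefix exposed as the new kwarg
```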
`_test_first_document_begin` injects a fake past K/V slot with arbitrary leading data, drives attention through manually built kwargs with `sequence_k_past` and `first_document_begin` both set to a non-zero `past_length`, and verifies that:

- the forward output matches a per-doc reference computed on the active documents alone (the dropped prefix has no observable effect),
- parameter gradients match the reference,
- the K/V grad buffer at `[:past_length]` is exactly zero, which is the specific guarantee of the `cu_seqlens_k` canonicalization fix (see the toy illustration below).

Runs backup + sdpa_dense in fp32 and flash + sdpa_nested in bf16 (flash rejects fp32). Plugged into the existing `test_attention` parametrization as a new case with `name="first_document_begin"`, dispatched via a name check.
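A stripped-down illustration of the grad-buffer guarantee the new case checks, using plain SDPA on a toy buffer rather than the test's real fixtures (every name and shape here is made up):

```python
import torch
import torch.nn.functional as F

# Toy stand-in: a K/V buffer whose first `past_length` rows hold arbitrary
# stale data, with attention driven only over the active extent.
past_length, active_len, heads, head_dim = 4, 6, 2, 8
k_buffer = torch.randn(1, heads, past_length + active_len, head_dim, requires_grad=True)
v_buffer = torch.randn(1, heads, past_length + active_len, head_dim, requires_grad=True)
query = torch.randn(1, heads, active_len, head_dim, requires_grad=True)

# With the canonical layout, attention only ever consumes K/V past the dropped prefix.
out = F.scaled_dot_product_attention(
    query, k_buffer[:, :, past_length:], v_buffer[:, :, past_length:]
)
out.sum().backward()

# The specific guarantee of the fix: no gradient ever reaches the stale prefix rows.
assert torch.all(k_buffer.grad[:, :, :past_length] == 0)
assert torch.all(v_buffer.grad[:, :, :past_length] == 0)
```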
Summary
- The data preprocessor emitted `cu_seqlens_k[0] = first_document_begin` instead of `0`, violating the canonical varlen prefix-sum layout. SDPA EFFICIENT backward writes corrupt dK/dV rows when fed this, propagating wrong K/V projection grads through the reduce-scatter under sequence-data-parallel + micro-batch splits.
- The preprocessor now produces `cu_seqlens_k` starting at 0; narrows `document_index_k` / `position_index` to the active K extent; exposes the dropped leading-prefix length as a new `first_document_begin` int kwarg.
- Attention pre-allocates one K/V buffer per layer across a sequence's micro-sequences; each forward writes its slice via `gather_op(out=)`; backward accumulates per-micro-seq K/V grad into a shared grad buffer slice. The leading + trailing narrows and the per-step `torch.cat` / `AttachGrad` cross-micro-seq splice are absorbed into `_query_key_value`'s custom autograd region (sketched below).
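For reviewers, a minimal sketch of the cross-micro-seq splice idea, under stated assumptions: module-level buffers stand in for the per-layer pre-allocated K/V and grad buffers, a plain slice copy stands in for `gather_op(out=)`, and `_SpliceK` is a hypothetical stand-in, not the PR's `_query_key_value` autograd region. How the visible-extent grad is split between the shared buffer and the live projection is illustrative only:

```python
import torch

seq_len, n_micro, heads, head_dim = 8, 2, 2, 4
micro_len = seq_len // n_micro

# One pre-allocated K buffer (and matching grad buffer) shared by every
# micro-sequence of the sequence; the PR keeps one such pair per attention layer.
k_buffer = torch.zeros(heads, seq_len, head_dim)
k_grad_buffer = torch.zeros_like(k_buffer)

class _SpliceK(torch.autograd.Function):
    """Hypothetical splice: write this micro-seq's K into its buffer slice
    (in place, standing in for gather_op(out=)) and return the buffer up to
    the current end, so attention sees past + current keys without torch.cat."""

    @staticmethod
    def forward(ctx, k_micro, begin, end):
        ctx.begin, ctx.end = begin, end
        with torch.no_grad():
            k_buffer[:, begin:end] = k_micro
        return k_buffer[:, :end]

    @staticmethod
    def backward(ctx, grad_k_visible):
        # Grad for the already-materialized past rows accumulates into the shared buffer...
        k_grad_buffer[:, : ctx.begin] += grad_k_visible[:, : ctx.begin]
        # ...and only the current micro-seq's slice flows back to the K projection.
        return grad_k_visible[:, ctx.begin : ctx.end], None, None

# Toy usage: a shared "K projection" whose grads arrive from each micro-seq.
w = torch.randn(head_dim, head_dim, requires_grad=True)
loss = 0.0
for i in range(n_micro):
    x = torch.randn(heads, micro_len, head_dim)
    k_visible = _SpliceK.apply(x @ w, i * micro_len, (i + 1) * micro_len)
    loss = loss + k_visible.sum()  # stand-in for the attention that consumes K
loss.backward()
```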
Test plan

- `tests/data/test_preprocessing.py` (843/843 pass): updated to expect the canonical layout.
- `tests/layers/test_attention.py` (57/57 pass on CPU and CUDA, including the new `first_document_begin` regression case).
- `test_attention[first_document_begin-[4, 1, 10]]` injects a fake past K/V slot, drives attention with `sequence_k_past > 0` and `first_document_begin > 0`, and verifies that output + parameter grads match a per-doc reference and that `slot.grad_buffer[:past_length]` is exactly zero. Passes for backup, sdpa_dense, flash, sdpa_nested.
- `tests/models/test_model.py`: `gpt_2-ms4`, `gpt_2-sdp2`, `gpt_2-sdp2_stp2`, `gpt_2-sdp2_stp2_bf4`, `gpt_2-stp2_pp2s1_bf4` still fail with ~0.5–1% relative gradient drift vs the `simple` baseline. The new unit test proves the narrow logic is correct in isolation, so the residual is in something the unit test doesn't exercise: multi-micro-sequence buffer chaining (`pasts` / `presents` flowing across 2+ micro-seqs), SDP multi-rank reduce-scatter, or schedule integration. Debugging continues, likely as additional commits on this PR.