When the THD collator's pad_to_multiple_of option is active (used primarily for FP8/FP4 shape alignment), the collator appends a mock pad sequence at the end of the batch and mutates cu_seq_lens_q in place to describe the appended layout. Because no cu_seq_lens_q_padded key is written in this path — and intentionally cannot be, since cu_seq_lens_q_padded is reserved for TE's per-sequence CP zigzag-divisibility padding semantic — the perf_logger's _attn_work_from_batch helper has no way to distinguish useful work (real docs) from hardware work (real docs + mock pad) when pad_to_multiple_of is used.
In that path, train/mfu_pct (useful) and train/mfu_padded_pct (hardware) collapse to the same value, and both include the mock sequence's remainder² as if it were real work.
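A minimal sketch of why the two metrics collapse. This is not the recipes' actual code; it only illustrates how a helper in the style of _attn_work_from_batch would derive Σ(Lᵢ²) from cu_seq_lens_q deltas, and why the mock pad sequence is indistinguishable once the collator has mutated the offsets in place:

```python
# Illustrative sketch (not the recipes' real implementation): derive
# per-sequence lengths from cumulative offsets and sum their squares.
def attn_work(cu_seq_lens_q):
    """Sum of L_i^2 over the sequence lengths encoded as cumulative offsets."""
    lens = [b - a for a, b in zip(cu_seq_lens_q, cu_seq_lens_q[1:])]
    return sum(l * l for l in lens)

# Two real docs of length 1022 each:
real = [0, 1022, 2044]
# With pad_to_multiple_of=16 the collator appends a mock sequence
# (remainder = 4 here) and mutates the offsets in place:
padded = [0, 1022, 2044, 2048]

useful = attn_work(real)      # 2 x 1022^2 = 2_088_968
hardware = attn_work(padded)  # useful + 4^2 = 2_088_984
```

Given only the mutated `padded` offsets, the logger cannot recover `real`, so both MFU numbers are computed from the same list.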
Quantitative impact
remainder < pad_to_multiple_of by construction. Typical alignment values for FP8/MXFP8/NVFP4 are {8, 16, 32}, so the extra Σ(Lᵢ²) contribution is bounded by pad_to_multiple_of² ∈ {64, 256, 1024}. Real batch totals are on the order of 10⁷–10⁹ (e.g. ESM-2 at mbs=26 × S=1022: 26 × 1022² ≈ 2.7e7), so the inflation is ≤10⁻⁵ of the real total — below any measurement noise we resolve in practice. The mock sequence's labels are -100 (excluded from loss), so no gradient contribution; the tiny amount of actual attention compute FA executes on the mock block is real hardware work, just not useful work.
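The bound above can be checked directly. A worked version of the arithmetic, using the ESM-2 example figures from the paragraph (the exact inflation is remainder², which is strictly less than pad_to_multiple_of²):

```python
# Worst-case relative inflation of sum(L_i^2) from the mock pad sequence.
mbs, S = 26, 1022             # ESM-2 example: micro-batch size x seq length
real_total = mbs * S * S      # 26 x 1022^2 = 27_156_584 (~2.7e7)
worst = {p: p * p / real_total for p in (8, 16, 32)}
# worst[32] = 1024 / 2.7e7, roughly 3.8e-5 for this smallest example total;
# it shrinks toward 1e-6 and below for batch totals in the 1e8-1e9 range.
```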
Affected code
All four MFU-tracking recipes share this behavior:
| Recipe | Collator site |
| --- | --- |
| bionemo-recipes/recipes/esm2_native_te/ | via enforced copy of bionemo-recipes/models/esm2/collator.py (_pt_pad_to_multiple_of at line 871; _pad_batch_to_multiple_of at line 187) |
| bionemo-recipes/recipes/llama3_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/opengenome2_llama_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/codonfm_native_te/ | inline CodonTHDCollator in dataset.py:243-344; same semantics plus a defensive chunking path that splits a single over-max_seq_length pad into multiple mock sequences (currently unreachable since remainder < pad_to_multiple_of ≤ 32 ≪ max_seq_length) |
The logger site that reads cu_seq_lens_q (and is therefore affected) is _attn_work_from_batch in each recipe's perf_logger.py.
Status
Known limitation, not fixing for now per discussion on #1548. Impact is <10⁻⁵, well below measurement resolution, and this mode is only exercised when pad_to_multiple_of is explicitly set (FP8/FP4 workflows). Inline NOTE comments have been added at the cu_seq_lens_q read sites in all four perf_logger.py files pointing at this issue.
If we revisit
The constraint is that cu_seq_lens_q_padded must not be populated from _pt_pad_to_multiple_of — that key is reserved for pad_sequences_to_be_divisible_by (TE's in-sequence CP padding, consumed by pad_thd_sequences_for_cp). Confirmed by @pstjohn and @jomitchellnv on #1548:
> For "in sequence padding" we would use cu_seqlens_padded but I don't think we want that just for padding the remainder of a sequence vector (T in the THD).
Two candidate fixes, both avoiding cu_seq_lens_q_padded:
Option A (preferred): Collator records batch["num_real_seqs"] = number of real sequences before any mock-pad append. The perf_logger slices cu_seq_lens_q to the first num_real_seqs deltas for useful-work Σ(Lᵢ²); keeps the full cu_seq_lens_q for hardware-work Σ(Lᵢ²). Handles codonfm's multi-mock chunking for free.
Option B: Collator stashes the pre-mutation cu_seq_lens_q under a new key (e.g. cu_seq_lens_q_real). Logger prefers that key when present. Functionally equivalent but duplicates a small tensor per batch.
Option A is marginally simpler (one int scalar) and generalizes cleanly to "N real + M mock" cases.
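A sketch of Option A under its stated assumptions. Neither batch["num_real_seqs"] nor this helper exists in the recipes today; the collator would record the count before appending any mock pads, and the logger would slice the length deltas:

```python
# Hypothetical Option A: split useful vs. hardware attention work using a
# num_real_seqs scalar recorded by the collator before mock-pad append.
def attn_work_split(cu_seq_lens_q, num_real_seqs):
    lens = [b - a for a, b in zip(cu_seq_lens_q, cu_seq_lens_q[1:])]
    useful = sum(l * l for l in lens[:num_real_seqs])  # real docs only
    hardware = sum(l * l for l in lens)                # real docs + mock pad(s)
    return useful, hardware

# Batch with 2 real docs plus one mock pad appended by the collator:
useful, hardware = attn_work_split([0, 1022, 2044, 2048], num_real_seqs=2)
```

Because only a prefix count is stored, codonfm's multi-mock chunking needs no special handling: any number of mock sequences simply extends the tail that the useful-work slice already excludes.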
References
- bionemo-recipes/recipes/esm2_native_te/README.md:387
- _pt_pad_to_multiple_of: bionemo-recipes/models/esm2/collator.py:871
- _pad_batch_to_multiple_of: bionemo-recipes/models/esm2/collator.py:187
- CodonTHDCollator.__call__: bionemo-recipes/recipes/codonfm_native_te/dataset.py:270-344
- _attn_work_from_batch: each recipe's perf_logger.py (ESM-2 at :114, llama3 at :112, og2 at :120, codonfm at :115)