MFU tracking: pad_to_multiple_of path inflates useful-work Σ(Lᵢ²) by the mock-sequence contribution #1561

@gagank1

Summary

When the THD collator's pad_to_multiple_of option is active (used primarily for FP8/FP4 shape alignment), the collator appends a mock pad sequence at the end of the batch and mutates cu_seq_lens_q in place to describe the appended layout. Because no cu_seq_lens_q_padded key is written in this path — and intentionally cannot be, since cu_seq_lens_q_padded is reserved for TE's per-sequence CP zigzag-divisibility padding semantic — the perf_logger's _attn_work_from_batch helper has no way to distinguish useful work (real docs) from hardware work (real docs + mock pad) when pad_to_multiple_of is used.

In that path, train/mfu_pct (useful) and train/mfu_padded_pct (hardware) collapse to the same value, and both include the mock sequence's remainder² as if it were real work.
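To make the collapse concrete, here is a minimal sketch in plain Python, using lists in place of tensors. The helpers `sum_sq_lengths` and `append_mock_pad` are hypothetical names that only mimic the behavior described above; they are not the recipes' actual code.

```python
def sum_sq_lengths(cu_seq_lens):
    """Σ(Lᵢ²) from cumulative boundaries, e.g. [0, 5, 12] -> 5² + 7²."""
    return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))


def append_mock_pad(cu_seq_lens, pad_to_multiple_of):
    """Hypothetical stand-in for the collator: append one mock sequence so the
    total token count becomes a multiple of pad_to_multiple_of."""
    total = cu_seq_lens[-1]
    remainder = -total % pad_to_multiple_of  # tokens needed to reach alignment
    if remainder:
        cu_seq_lens.append(total + remainder)  # in-place mutation, no separate key
    return cu_seq_lens


cu_seq_lens_q = [0, 5, 12, 21]                  # three real docs: lengths 5, 7, 9
useful = sum_sq_lengths(cu_seq_lens_q)          # 25 + 49 + 81 = 155
append_mock_pad(cu_seq_lens_q, 8)               # total 21 -> 24, mock length 3
seen_by_logger = sum_sq_lengths(cu_seq_lens_q)  # 155 + 3² = 164
```

After the mutation, a logger that only reads `cu_seq_lens_q` can compute 164 but has no way to recover 155, so the useful and hardware figures coincide.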

Quantitative impact

remainder < pad_to_multiple_of by construction. Typical alignment values for FP8/MXFP8/NVFP4 are {8, 16, 32}, so the extra Σ(Lᵢ²) contribution is bounded by pad_to_multiple_of² ∈ {64, 256, 1024}. Real batch totals are on the order of 10⁷–10⁹ (e.g. ESM-2 at mbs=26 × S=1022: 26 × 1022² ≈ 2.7e7), so the inflation is on the order of 10⁻⁵ of the real total or smaller (at most 1024 / 2.7e7 ≈ 3.8e-5 in the ESM-2 example), below any measurement noise we resolve in practice. The mock sequence's labels are -100 (excluded from loss), so it contributes no gradient; the small amount of attention compute FA executes on the mock block is real hardware work, just not useful work.
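The bound can be checked numerically. The following is a small sketch using the ESM-2 figures quoted above; the function name `worst_case_inflation` is hypothetical, and it assumes a single mock sequence of at most pad_to_multiple_of - 1 tokens.

```python
def worst_case_inflation(pad_to_multiple_of, real_sum_sq):
    """Upper bound on the relative Σ(Lᵢ²) inflation from one mock sequence.

    The mock length (remainder) is strictly less than pad_to_multiple_of,
    so its largest possible value is pad_to_multiple_of - 1.
    """
    max_mock_len = pad_to_multiple_of - 1
    return max_mock_len ** 2 / real_sum_sq


real_sum_sq = 26 * 1022 ** 2  # ESM-2 example: mbs=26, S=1022 -> 27_156_584
bounds = {p: worst_case_inflation(p, real_sum_sq) for p in (8, 16, 32)}

for p, bound in bounds.items():
    print(f"pad_to_multiple_of={p:>2}: relative inflation <= {bound:.2e}")
```

For this example the bound ranges from roughly 2 × 10⁻⁶ (alignment 8) to roughly 4 × 10⁻⁵ (alignment 32), and it shrinks further as the real batch total grows.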

Affected code

All four MFU-tracking recipes share this behavior:

| Recipe | Collator site |
| --- | --- |
| bionemo-recipes/recipes/esm2_native_te/ | via enforced copy of bionemo-recipes/models/esm2/collator.py (_pt_pad_to_multiple_of at line 871; _pad_batch_to_multiple_of at line 187) |
| bionemo-recipes/recipes/llama3_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/opengenome2_llama_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/codonfm_native_te/ | inline CodonTHDCollator in dataset.py:243-344; same semantics plus a defensive chunking path that splits a single over-max_seq_length pad into multiple mock sequences (currently unreachable since remainder < pad_to_multiple_of ≤ 32 ≪ max_seq_length) |
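For illustration only, the codonfm chunking idea can be sketched as below. `mock_chunk_lengths` is a hypothetical helper, not the code in dataset.py, and the actual splitting rule there may differ; the sketch only shows why the multi-mock branch is not reached when the remainder is far smaller than max_seq_length.

```python
def mock_chunk_lengths(remainder, max_seq_length):
    """Split `remainder` pad tokens into mock sequences of at most
    max_seq_length tokens each (assumed splitting rule)."""
    chunks = []
    while remainder > 0:
        step = min(remainder, max_seq_length)
        chunks.append(step)
        remainder -= step
    return chunks


# Realistic case: remainder < pad_to_multiple_of <= 32, far below max_seq_length,
# so exactly one mock sequence is produced and no chunking happens.
single = mock_chunk_lengths(31, 1024)

# Contrived case with a tiny max_seq_length, to exercise the chunking branch.
multi = mock_chunk_lengths(70, 32)
```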

The logger site that reads cu_seq_lens_q (and is therefore affected) is _attn_work_from_batch in each recipe's perf_logger.py.

Status

Known limitation, not fixing for now per discussion on #1548. Impact is on the order of 10⁻⁵ or smaller, well below measurement resolution, and this mode is only exercised when pad_to_multiple_of is explicitly set (FP8/FP4 workflows). Inline NOTE comments have been added at the cu_seq_lens_q read sites in all four perf_logger.py files pointing at this issue.

If we revisit

The constraint is that cu_seq_lens_q_padded must not be populated from _pt_pad_to_multiple_of — that key is reserved for pad_sequences_to_be_divisible_by (TE's in-sequence CP padding, consumed by pad_thd_sequences_for_cp). Confirmed by @pstjohn and @jomitchellnv on #1548:

For "in sequence padding" we would use cu_seqlens_padded but I don't think we want that just for padding the remainder of a sequence vector (T in the THD).

Two candidate fixes, both avoiding cu_seq_lens_q_padded:

Option A (preferred): Collator records batch["num_real_seqs"] = number of real sequences before any mock-pad append. The perf_logger slices cu_seq_lens_q to the first num_real_seqs deltas for useful-work Σ(Lᵢ²); keeps the full cu_seq_lens_q for hardware-work Σ(Lᵢ²). Handles codonfm's multi-mock chunking for free.
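A minimal sketch of how Option A could look on the logger side, assuming the collator writes the proposed `num_real_seqs` key. The function name `attn_work_option_a` is hypothetical, and plain lists stand in for the `cu_seq_lens_q` tensor.

```python
def attn_work_option_a(batch):
    """Return (useful, hardware) Σ(Lᵢ²) for one batch.

    Assumes batch["num_real_seqs"] counts the real sequences that precede any
    appended mock sequences; if the key is absent, every sequence is real.
    """
    cu = batch["cu_seq_lens_q"]
    lengths = [b - a for a, b in zip(cu, cu[1:])]
    n_real = batch.get("num_real_seqs", len(lengths))
    useful = sum(length ** 2 for length in lengths[:n_real])
    hardware = sum(length ** 2 for length in lengths)
    return useful, hardware


# Three real docs (5, 7, 9) plus one mock sequence of length 3.
one_mock = attn_work_option_a(
    {"cu_seq_lens_q": [0, 5, 12, 21, 24], "num_real_seqs": 3}
)

# Two real docs (5, 7) plus two mock sequences (2, 2): the multi-mock case.
two_mock = attn_work_option_a(
    {"cu_seq_lens_q": [0, 5, 12, 14, 16], "num_real_seqs": 2}
)

# No padding applied: the key is absent and both numbers agree.
no_pad = attn_work_option_a({"cu_seq_lens_q": [0, 5, 12, 21]})
```

Because the slice is by count rather than by a single trailing entry, the same code covers one mock sequence, several, or none.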

Option B: Collator stashes the pre-mutation cu_seq_lens_q under a new key (e.g. cu_seq_lens_q_real). Logger prefers that key when present. Functionally equivalent but duplicates a small tensor per batch.
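Option B could be sketched as follows, again with plain lists in place of tensors. Both function names are hypothetical, and `cu_seq_lens_q_real` is the example key name suggested above, not an existing one.

```python
def sum_sq(cu_seq_lens):
    """Σ(Lᵢ²) from cumulative sequence boundaries."""
    return sum((b - a) ** 2 for a, b in zip(cu_seq_lens, cu_seq_lens[1:]))


def pad_with_real_copy(batch, pad_to_multiple_of):
    """Hypothetical collator step: stash the pre-mutation boundaries under a
    new key, then append the mock sequence to cu_seq_lens_q in place."""
    cu = batch["cu_seq_lens_q"]
    remainder = -cu[-1] % pad_to_multiple_of
    if remainder:
        batch["cu_seq_lens_q_real"] = list(cu)  # copy taken before mutation
        cu.append(cu[-1] + remainder)
    return batch


def attn_work_option_b(batch):
    """Return (useful, hardware) Σ(Lᵢ²), preferring the stashed key if present."""
    hardware_cu = batch["cu_seq_lens_q"]
    useful_cu = batch.get("cu_seq_lens_q_real", hardware_cu)
    return sum_sq(useful_cu), sum_sq(hardware_cu)


batch = pad_with_real_copy({"cu_seq_lens_q": [0, 5, 12, 21]}, 8)
work = attn_work_option_b(batch)  # (155, 164)
```

The copy must be taken before the in-place append; stashing a reference to the same list would reproduce the original problem.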

Option A is marginally simpler (one int scalar) and generalizes cleanly to "N real + M mock" cases.

References

  • PR Generalize MFU/FLOPs module across recipes with log_mfu training hook #1548 — review thread on bionemo-recipes/recipes/esm2_native_te/README.md:387
  • _pt_pad_to_multiple_of: bionemo-recipes/models/esm2/collator.py:871
  • _pad_batch_to_multiple_of: bionemo-recipes/models/esm2/collator.py:187
  • CodonTHDCollator.__call__: bionemo-recipes/recipes/codonfm_native_te/dataset.py:270-344
  • _attn_work_from_batch: each recipe's perf_logger.py (ESM-2 at :114, llama3 at :112, og2 at :120, codonfm at :115)
