fix: guard against zero label tokens causing NaN loss in VLM training #1985
When all labels in a batch are -100 (empty supervision), num_label_tokens is 0, causing division by zero and NaN loss that corrupts training.

- masked_ce.py: return 0.0 instead of dividing by zero
- finetune.py: guard PP reporting_loss normalization against zero
- test_masked_ce.py: add regression test for the empty-supervision case

Signed-off-by: khazic <khazzz1c@gmail.com>
Additional reproduction note

While the trigger condition is unlikely with the default In that scenario, the image tokens occupy most of the sequence budget and right-truncation completely removes the assistant response. The resulting sample has all labels set to Multi-GPU training is not affected in practice because
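The truncation scenario in this note can be illustrated with a small framework-free sketch. The token counts, the `IGNORE_INDEX` name, and the `build_labels` and `count_label_tokens` helpers are hypothetical, chosen only for illustration; they are not taken from the repository.

```python
IGNORE_INDEX = -100  # label value excluded from the loss


def build_labels(num_image_tokens, num_prompt_tokens, num_response_tokens, max_length):
    """Build per-token labels, then right-truncate to max_length.

    Only assistant-response positions are supervised; image and prompt
    positions are masked with IGNORE_INDEX. Real labels would hold token
    ids; the value 1 is a stand-in.
    """
    labels = (
        [IGNORE_INDEX] * num_image_tokens
        + [IGNORE_INDEX] * num_prompt_tokens
        + [1] * num_response_tokens
    )
    return labels[:max_length]


def count_label_tokens(labels):
    """Count positions that would contribute to the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


# Enough room: the response survives truncation.
roomy = build_labels(num_image_tokens=576, num_prompt_tokens=20,
                     num_response_tokens=30, max_length=1024)
# Tight budget: the image tokens alone exceed max_length.
tight = build_labels(num_image_tokens=576, num_prompt_tokens=20,
                     num_response_tokens=30, max_length=512)

print(count_label_tokens(roomy))  # 30
print(count_label_tokens(tight))  # 0
```

With the tight budget every surviving position is masked, so the sample reaches the loss with zero label tokens.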
/ok to test 4011752
test: cover PP reporting loss guard for zero label tokens

Adds two regression tests for _run_train_optim_step with pp_enabled=True, covering the num_label_tokens=0 guard added in #1985 (finetune.py:1142) and the standard num_label_tokens>0 division branch. Neither branch had prior coverage since no existing test exercised _run_train_optim_step with pipeline parallelism enabled.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does this PR do?
Add a defensive guard against division by zero in `MaskedCrossEntropy` when `num_label_tokens=0`.
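As a rough illustration of the guard, here is a minimal pure-Python sketch. It is not the repository's `MaskedCrossEntropy` (which operates on tensors); the `masked_cross_entropy` function, its list-based inputs, and the `IGNORE_INDEX` name are assumptions made for this example.

```python
import math

IGNORE_INDEX = -100  # label value excluded from the loss


def masked_cross_entropy(logits, labels, num_label_tokens=None):
    """Sum token-level cross-entropy over supervised positions, then normalize.

    logits: list of per-token lists of floats; labels: list of ints.
    num_label_tokens: optional externally supplied normalizer; defaults to
    the number of supervised positions in this batch.
    """
    total = 0.0
    counted = 0
    for token_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(token_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in token_logits))
        total += log_z - token_logits[label]
        counted += 1
    if num_label_tokens is None:
        num_label_tokens = counted
    # Guard: an empty-supervision batch contributes zero loss.
    if num_label_tokens == 0:
        return 0.0
    return total / num_label_tokens


# One supervised token with uniform two-way logits: loss is log(2).
print(masked_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [1, IGNORE_INDEX]))
# Every label masked: the guard returns 0.0.
print(masked_cross_entropy([[0.0, 0.0]], [IGNORE_INDEX]))  # 0.0
```

Without the guard, plain Python would raise `ZeroDivisionError` here; with floating-point tensors the same `0 / 0` silently evaluates to NaN, which is why the failure shows up as a NaN loss rather than an exception.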
Changelog
Additional Information
Related to #1883.
I attempted to reproduce the SDPA NaN described in #1883 using the exact repro command on transformers 5.5.0 / PyTorch 2.11.0 (CUDA 13.1) / 8xH100, but could not reproduce it. The issue was filed against `transformers==5.5.0.dev0`, and I believe the underlying SDPA masking bug has since been fixed in the stable 5.5.0 release.
While investigating, I noticed that `MaskedCrossEntropy` has no guard for `num_label_tokens=0`, which produces `NaN` via division by zero. In the multi-GPU training path, `num_label_tokens` is all-reduced across DP ranks, so hitting zero in practice would require every sample across every rank to have no valid labels simultaneously -- extremely unlikely. However, two paths are genuinely exposed:
This PR adds a minimal defensive guard so that an empty-supervision batch contributes zero loss instead of NaN, keeping training and validation metrics clean.
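The all-reduce argument above can be sketched without any distributed setup. In this hypothetical example a plain `sum` stands in for the sum all-reduce of `num_label_tokens` across DP ranks; the `global_label_tokens` and `normalize_loss` helpers and the per-rank counts are assumptions for illustration only.

```python
def global_label_tokens(per_rank_counts):
    """Stand-in for a sum all-reduce of num_label_tokens across DP ranks."""
    return sum(per_rank_counts)


def normalize_loss(summed_loss, num_label_tokens):
    """Normalize a summed loss, returning 0.0 when nothing was supervised."""
    if num_label_tokens == 0:
        return 0.0
    return summed_loss / num_label_tokens


# Several ranks, one of them with an empty-supervision batch:
# the global count stays positive, so normal division applies.
multi_rank = global_label_tokens([0, 12, 9])
print(multi_rank)                        # 21
print(normalize_loss(42.0, multi_rank))  # 2.0

# A single rank whose only batch is empty: the count is zero,
# and the guard is the only thing preventing a division by zero.
single_rank = global_label_tokens([0])
print(normalize_loss(0.0, single_rank))  # 0.0
```

The global count is zero only when every rank contributes zero, so the guard matters most wherever the count is not aggregated over many samples.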