fix: vision loss forward pass falls back to exclude on crash #223

Merged
abrichr merged 1 commit into main from fix/vision-loss-fallback on Mar 29, 2026
abrichr commented Mar 29, 2026

Summary

Qwen3's vision merge changes the sequence length unpredictably. Both
include and checkpoint modes crash intermittently with attention
mask mismatches. This killed 3 training runs at step 5.

Fix: catch the error and retry with exclude mode for that step.
Training no longer crashes on these errors. Some steps get vision
gradients, some don't, but all contribute to learning.

🤖 Generated with Claude Code

Qwen3's vision-language merge changes internal sequence length
unpredictably. Both include and checkpoint modes crash intermittently
with attention mask mismatches (mask too large OR too small depending
on generated sequence length).

Fix: catch IndexError/RuntimeError from the vision forward pass and
retry with exclude mode (text-only, no vision tensors) for that step.
Training no longer crashes on these errors: some steps get vision-aware
gradients, some get text-only gradients, but all steps contribute to
learning.

This is the pragmatic fix. The proper fix (capturing logits during
generation to avoid re-forward entirely) is future work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
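
The fallback described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `loss_with_fallback`, `compute_loss`, and `flaky_loss` are hypothetical names, and only the mode names (include, checkpoint, exclude) and the caught exception types come from the description.

```python
from typing import Callable, Tuple

# Mode name taken from the PR text; everything else here is hypothetical.
FALLBACK_MODE = "exclude"


def loss_with_fallback(
    compute_loss: Callable[[str], float],
    mode: str = "include",
) -> Tuple[float, str]:
    """Return (loss, mode_actually_used) for one training step."""
    try:
        return compute_loss(mode), mode
    except (IndexError, RuntimeError):
        if mode == FALLBACK_MODE:
            # Nothing left to fall back to; surface the original error.
            raise
        # Retry the same step text-only, without vision tensors.
        return compute_loss(FALLBACK_MODE), FALLBACK_MODE


def flaky_loss(mode: str) -> float:
    """Stand-in for the real forward pass: vision modes always fail here."""
    if mode != "exclude":
        raise RuntimeError("attention mask size mismatch")
    return 0.5


loss, used_mode = loss_with_fallback(flaky_loss, "include")
print(loss, used_mode)
```

Returning the mode that was actually used lets the caller log how often a step fell back to text-only gradients, so the fallback does not have to be silent.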
abrichr merged commit d348f1b into main on Mar 29, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 29, 2026
Root cause: manually concatenating action_ids onto the prompt's input_ids
created inconsistent inputs (pixel_values sized for the prompt, while
input_ids also included action tokens). Qwen3's vision merge changes the
internal sequence length, so the forward pass crashed with attention mask
mismatches.

Fix: process prompt_text + action_text as a SINGLE string through the
processor. Produces consistent input_ids, pixel_values, attention_mask.
The model handles vision merge correctly on processor output.

Replaces the silent fallback from PR #223 with a proper solution that
gives correct vision-aware gradients for ALL steps in ALL modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
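
The single-pass idea from the follow-up commit can be sketched as below. This is an illustration under assumptions, not the repository's code: `build_step_inputs` and `toy_processor` are hypothetical, and the `processor(text=..., images=...)` call shape is assumed to resemble a Hugging Face style multimodal processor. The toy processor only mimics the returned keys so the example runs without model weights.

```python
from typing import Any, Callable, Dict, List


def build_step_inputs(
    processor: Callable[..., Dict[str, Any]],
    prompt_text: str,
    action_text: str,
    images: List[Any],
) -> Dict[str, Any]:
    """Tokenize prompt and action together so all outputs stay aligned."""
    # One string, one processor call: no manual concatenation of token ids.
    full_text = prompt_text + action_text
    inputs = processor(text=[full_text], images=images)
    ids, mask = inputs["input_ids"], inputs["attention_mask"]
    if len(ids[0]) != len(mask[0]):
        raise ValueError("processor returned mismatched input_ids / attention_mask")
    return inputs


def toy_processor(text: List[str], images: List[Any]) -> Dict[str, Any]:
    """Stand-in processor (assumption): one token id per character."""
    input_ids = [[ord(ch) for ch in t] for t in text]
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(row) for row in input_ids],
        "pixel_values": images,
    }


batch = build_step_inputs(toy_processor, "click ", "OK", ["img"])
print(len(batch["input_ids"][0]), len(batch["attention_mask"][0]))
```

Because every tensor comes from the same processor call, input_ids, attention_mask, and pixel_values describe the same sequence, which is the consistency the commit message relies on.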