fix: vision loss forward pass falls back to exclude on crash by abrichr · Pull Request #223 · OpenAdaptAI/openadapt-evals

abrichr · 2026-03-29T13:53:35Z

Summary

Qwen3's vision merge changes sequence length unpredictably. Both
include and checkpoint modes crash intermittently with attention
mask mismatches. This killed 3 training runs on step 5.

Fix: catch the error and retry with exclude mode for that step.
Training never crashes. Some steps get vision gradients, some don't,
but all contribute to learning.
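The retry described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `loss_with_fallback`, `compute_loss`, and `flaky_loss` are hypothetical names, and the exception types are the ones named in the commit message (IndexError and RuntimeError).

```python
from typing import Any, Callable, Tuple


def loss_with_fallback(
    compute_loss: Callable[[Any, str], float],
    batch: Any,
    mode: str = "include",
) -> Tuple[float, str]:
    """Try the requested vision mode; on a shape crash, retry in exclude mode.

    Returns the loss and the mode that actually produced it, so the caller
    can log how often the fallback fires.
    """
    try:
        return compute_loss(batch, mode), mode
    except (IndexError, RuntimeError):
        if mode == "exclude":
            # Exclude mode itself failed: nothing left to fall back to.
            raise
        # Text-only retry for this step (no vision tensors).
        return compute_loss(batch, "exclude"), "exclude"


def flaky_loss(batch: dict, mode: str) -> float:
    """Stand-in loss (hypothetical) that crashes in vision modes at step 5."""
    if mode != "exclude" and batch["step"] == 5:
        raise RuntimeError("attention mask size mismatch")
    return 1.0


loss, used_mode = loss_with_fallback(flaky_loss, {"step": 5}, mode="include")
print(loss, used_mode)
```

Returning the mode that was actually used keeps the fallback visible in logs instead of silent, which matters because fallback steps carry text-only gradients.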

🤖 Generated with Claude Code

Qwen3's vision-language merge changes internal sequence length unpredictably. Both include and checkpoint modes crash intermittently with attention mask mismatches (mask too large OR too small depending on generated sequence length).

Fix: catch IndexError/RuntimeError from the vision forward pass and retry with exclude mode (text-only, no vision tensors) for that step. Training never crashes: some steps get vision-aware gradients, some get text-only gradients, but all steps contribute to learning.

This is the pragmatic fix. The proper fix (capturing logits during generation to avoid re-forward entirely) is future work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Root cause: manually concatenating action_ids onto prompt input_ids created inconsistent input (pixel_values sized for prompt, input_ids includes action tokens). Qwen3's vision merge changes internal sequence length, crashing with attention mask mismatches.

Fix: process prompt_text + action_text as a SINGLE string through the processor. This produces consistent input_ids, pixel_values, and attention_mask. The model handles vision merge correctly on processor output.

Replaces the silent fallback from PR #223 with a proper solution that gives correct vision-aware gradients for ALL steps in ALL modes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
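The single-string approach can be illustrated with a toy stand-in. `ToyProcessor` and `build_joint_inputs` are hypothetical and are not the Hugging Face classes or this repository's code. The sketch assumes the real processor is called with `text`, `images`, and `return_tensors` keywords and expands each image placeholder into several tokens, which is why the final sequence length is only known after processing.

```python
from typing import Any, Dict, List, Optional


class ToyProcessor:
    """Stand-in for a vision-language processor (NOT a real library class)."""

    IMAGE_TOKEN = "<image>"

    def __init__(self, tokens_per_image: int = 4) -> None:
        self.tokens_per_image = tokens_per_image

    def __call__(
        self,
        text: List[str],
        images: List[Any],
        return_tensors: Optional[str] = None,
    ) -> Dict[str, Any]:
        # return_tensors is accepted only to mirror the assumed real call;
        # this toy always returns plain Python lists.
        batch_ids = []
        for sample in text:
            ids: List[int] = []
            for word in sample.split():
                if word == self.IMAGE_TOKEN:
                    # One placeholder expands into several vision tokens.
                    ids.extend([-1] * self.tokens_per_image)
                else:
                    ids.append(len(word))  # fake token id
            batch_ids.append(ids)
        return {
            "input_ids": batch_ids,
            "attention_mask": [[1] * len(ids) for ids in batch_ids],
            "pixel_values": images,
        }


def build_joint_inputs(processor, prompt_text: str, action_text: str, image: Any):
    """Run prompt and action through the processor as ONE string.

    Every returned field then describes the same sequence, instead of
    action ids being appended after the prompt was processed.
    """
    return processor(
        text=[prompt_text + action_text],
        images=[image],
        return_tensors="pt",
    )


batch = build_joint_inputs(
    ToyProcessor(), "<image> click the button ", "CLICK(10, 20)", image=object()
)
print(len(batch["input_ids"][0]), len(batch["attention_mask"][0]))
```

Because one call produces both `input_ids` and `attention_mask`, their lengths cannot drift apart in this sketch, which is the consistency property the commit message relies on.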


abrichr merged commit d348f1b into main Mar 29, 2026
1 check passed

abrichr merged 1 commit into main from fix/vision-loss-fallback
