Add Qwen3.5 vision encoder and connector #3962
Merged
899d4db to bbcf606
entrpn
approved these changes
May 21, 2026
The Pull Request successfully adds support for the Qwen3.5 vision encoder and connector by subclassing the Qwen3 Omni layers. This approach ensures clean checkpoint parameter keys while reusing established logic. The fix for the hybrid GDN logic in attentions.py is a crucial correction for vision tower integration.
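To make the subclassing idea concrete, here is a minimal sketch in plain Python. It is not the MaxText or Flax NNX code: `OmniVisionEncoderSketch`, `Qwen3_5VisionEncoderSketch`, `ParentModule`, and the `"vision_encoder"` name are all hypothetical stand-ins. It only illustrates how a subclass can inherit its logic unchanged while the name chosen at registration time determines the checkpoint key.

```python
class OmniVisionEncoderSketch:
    """Hypothetical stand-in for a shared Qwen3-Omni vision layer."""

    def __init__(self, dim: int):
        # A single toy parameter; real layers hold many arrays.
        self.params = {"kernel": [1.0] * dim}

    def __call__(self, x):
        # Toy forward pass: elementwise scale by the kernel.
        return [v * k for v, k in zip(x, self.params["kernel"])]


class Qwen3_5VisionEncoderSketch(OmniVisionEncoderSketch):
    """Inherits the forward pass unchanged; only the class identity differs."""


class ParentModule:
    """Hypothetical container that names each child explicitly."""

    def __init__(self):
        self.children = {}

    def add(self, name: str, module) -> None:
        self.children[name] = module

    def checkpoint_keys(self):
        # Keys are "<child name>/<param name>", so the name chosen at
        # registration time is what ends up in the checkpoint.
        return sorted(
            f"{name}/{param}"
            for name, module in self.children.items()
            for param in module.params
        )


parent = ParentModule()
parent.add("vision_encoder", Qwen3_5VisionEncoderSketch(dim=4))
keys = parent.checkpoint_keys()
out = parent.children["vision_encoder"]([1.0, 2.0, 3.0, 4.0])
```

In this toy setup the subclass adds no code of its own, and the key prefix comes entirely from the explicit name passed to `add`, which loosely mirrors the description of specifying names in `encoders.py`.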
🔍 General Feedback
- Clean Architecture: Subclassing Qwen3 Omni layers to achieve clean checkpoint names while reusing logic is an excellent use of NNX and maintains code modularity.
- Critical Bug Fix: The update to `is_qwen3_hybrid` in `attentions.py` correctly prevents the hybrid GDN/Attention logic from being incorrectly applied to the vision tower.
- Comprehensive Testing: The addition of `tests/unit/qwen3_5_layers_test.py` with detailed comparisons against the HuggingFace reference implementation ensures the correctness of the new vision tower.
- Consistent Configuration: The MRoPE and vision configuration in the YAML file align perfectly with the model's architectural requirements.
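The hybrid-logic fix mentioned in the feedback can be pictured as a guard on where an attention layer is used. The sketch below is an assumption about the shape of such a guard, not the actual condition in `attentions.py`: `AttentionSetup`, its two fields, and `use_hybrid_query_split` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSetup:
    """Hypothetical description of where an attention layer is used."""

    model_uses_hybrid_gdn: bool  # model family mixes GDN and attention layers
    is_vision_tower: bool        # layer belongs to the vision encoder


def use_hybrid_query_split(setup: AttentionSetup) -> bool:
    """Return True only for non-vision attention in a hybrid model."""
    # Without the second condition, vision tower attention would also
    # take the hybrid path, which is the kind of bug described above.
    return setup.model_uses_hybrid_gdn and not setup.is_vision_tower


decoder_layer = AttentionSetup(model_uses_hybrid_gdn=True, is_vision_tower=False)
vision_layer = AttentionSetup(model_uses_hybrid_gdn=True, is_vision_tower=True)
```

Under these assumptions, `use_hybrid_query_split(decoder_layer)` is true and `use_hybrid_query_split(vision_layer)` is false, so the vision tower falls through to the regular attention path.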
aireenmei
approved these changes
May 22, 2026
a43b833 to
545b677
Compare
Description
- Subclassed JAX Vision Layers: Created clean JAX `Qwen3_5MoeVisionEncoder` and `Qwen3_5MoeVisionProjector` subclasses to reuse Qwen3-Omni layers (both share `3VL` base), keeping checkpoint parameter keys clean through specifying names in `encoders.py`.
- Key Differences against Qwen3-Omni:
  - `ln_q`/`mlp` to `norm`, `linear_fc1`, `linear_fc2`. We updated in the unit test `copy_qwen3_5_patch_merger_weights`, and should also address it in the follow-up ckpt PR.
  - `deepstack_visual_indexes_for_vit: []` in yml.
- Hybrid Attention Bug Fix: Fixed `maxtext/layers/attentions.py` to prevent Qwen3.5 hybrid GDN query-splitting logic from executing on the vision tower attention layer.
- Equivalence Unit Test: Added `tests/unit/qwen3_5_layers_test.py` comparing the subclassed JAX tower against HF `Qwen3_5MoeVisionModel` on TPU. Uses `atol=2e-2` (due to more accumulated error of 4096 visual projection dimension vs 2048 in Omni). Passed cleanly.
Tests
Offline unit test against HF Qwen3.5 reference implementation:
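As a rough illustration of what an absolute-tolerance comparison with `atol=2e-2` checks, here is a minimal sketch in plain Python. It is not the PR's test: `max_abs_diff`, `within_atol`, and the sample values are made up for illustration, and the real test compares full model outputs rather than short lists.

```python
def max_abs_diff(actual, expected):
    """Largest elementwise absolute difference between two sequences."""
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def within_atol(actual, expected, atol=2e-2):
    """True if every element differs from its reference by at most atol."""
    return max_abs_diff(actual, expected) <= atol


# Illustrative values only; not taken from any real run.
jax_out = [0.501, -1.234, 2.010]
hf_out = [0.500, -1.240, 2.000]
matches = within_atol(jax_out, hf_out)
```

A looser tolerance such as `2e-2` accepts small per-element drift of the kind the description attributes to the wider 4096-dimensional projection, while still failing on a clearly wrong output.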
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.