Enable Qwen3-Omni vision decode #3741
Open
hengtaoguo wants to merge 1 commit into main
8927c94 to 455d50f
95784f5 to 1890ae4
entrpn approved these changes on Apr 24, 2026
code style fix
1890ae4 to 425dcc7
aireenmei approved these changes on Apr 27, 2026
aireenmei (Collaborator) left a comment:
Does decode work with video?
hengtaoguo (Collaborator, Author) replied:
The logic works for videos too, but direct video decode is not available yet. I plan to raise a follow-up PR adding the video inputs field to the model pipeline. It may involve 10+ lines of changes throughout different interfaces.
Description
Fixes inference for Qwen3-Omni multimodal (image) decode:
- `moe.py`: Fix the `RaggedDotGroupSizes` representative value. Commit `083293fc3` (Use Tokamax's representative group sizes. #3434) regressed the value to `len(inputs)` (an `int`) where a `tuple[int, ...]` is required. Restored as `(inputs.shape[0] // kernel.shape[0],) * kernel.shape[0]`. Also minor formatting cleanup from the sparsity PR.
- `decode.py`: Respect the `config.add_bos` flag instead of hardcoding `not has_chat_template`. Add a batch dimension when calling `get_rope_index` for `input_ids`/`attention_mask`.
- `maxengine.py`: Cast `next_pos` to the decode state's dtype before `dynamic_update_index_in_dim` to avoid a dtype mismatch when MRoPE position IDs are float vs. int.
- `qwen3.py`: Fix `Qwen3OmniMoeVisionEncoder.__call__`: explicitly reshape `hidden_states` to patch-level layout before `patch_embed`, then reshape the output back to `(batch, seq, hidden)`. Previously it assumed a flattened input that didn't account for batch size.
- `processor_qwen3_omni.py`: Reshape preprocessed image pixel values to `(1, C, T*t, H*h, W*w)` to match the updated vision encoder's expected input layout, carrying explicit grid-thw information.

Note: Video decoding works too but is pending an input interface refactor. To manage the growing number of model-specific inputs (video_values, mask, grid_thw), we are planning a follow-up PR to standardize the interface.
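To illustrate the `moe.py` fix described above, here is a minimal sketch of the difference between `len(inputs)` and the restored tuple expression. The helper name `representative_group_sizes` and the array shapes are hypothetical, chosen only for illustration, and NumPy arrays stand in for the real tensors; only the expression `(inputs.shape[0] // kernel.shape[0],) * kernel.shape[0]` comes from the PR description.

```python
import numpy as np


def representative_group_sizes(inputs: np.ndarray, kernel: np.ndarray) -> tuple[int, ...]:
    """Hypothetical helper: one equal-sized entry per group (leading kernel dim)."""
    num_groups = kernel.shape[0]
    # Expression taken from the PR description: rows split evenly across groups.
    return (inputs.shape[0] // num_groups,) * num_groups


# Assumed shapes for illustration: 8 token rows, 4 groups (e.g. experts).
inputs = np.zeros((8, 16), dtype=np.float32)
kernel = np.zeros((4, 16, 32), dtype=np.float32)

regressed_value = len(inputs)  # a single int, not a per-group tuple
restored_value = representative_group_sizes(inputs, kernel)

print(type(regressed_value).__name__, regressed_value)
print(type(restored_value).__name__, restored_value)
```

The tuple has one entry per group and its entries sum to the number of input rows when the rows divide evenly, whereas `len(inputs)` collapses that into a single integer.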
Tests
Unit tests passing:
Decode with a test image:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.