
Conversation

@SangChengC (Contributor)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @SangChengC, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's model coverage by adding full support for the Qwen3-VL and Qwen3-VL MoE models. It involves a deep integration of multimodal features, including a new visual encoder, specialized tokenization, and the 'deepstack' mechanism that passes rich visual context through multiple layers. Alongside these major additions, the PR refines existing image processing, enhances Rotary Positional Embeddings for complex scenarios, and optimizes MoE weight handling and Whisper's attention for better performance.

Highlights

  • Qwen3-VL Model Integration: Introduced comprehensive support for the Qwen3-VL (Vision-Language) model, including its Mixture-of-Experts (MoE) variant, by adding new model definitions, inference logic, and weight handling.
  • Deepstack Feature Support: Integrated 'deepstack' features for multimodal models, allowing intermediate visual features from different layers of the vision encoder to be passed to the language model, enhancing multimodal understanding.
  • Advanced Image Processing: Enhanced image processing utilities with more flexible smart_resize and resize_image functions, explicitly handling factor, min_pixels, and max_pixels for better control over image resizing and token length calculation.
  • Rotary Positional Embedding (RoPE) Enhancements: Updated RoPE handling in the Llama model to support mrope_section and a new 'default' scaling type, along with an apply_interleaved_mrope function for 3D rotary embeddings in Qwen3-VL.
  • MoE Weight Fusion Improvements: Modified the FusedMoeWeightTP class to correctly handle fused gate-up weights (3D tensors) for MoE experts, including tensor parallelism splitting and transposing, while also adding a check for quantization compatibility.
  • Whisper Model Optimization: Optimized the Whisper model's attention mechanism by introducing WhisperSdpaAttention, which leverages torch.nn.functional.scaled_dot_product_attention for improved efficiency (a minimal SDPA sketch follows this list).
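
For readers unfamiliar with SDPA, here is a minimal, hedged sketch of the call it wraps; the shapes are illustrative and not taken from the Whisper implementation:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# scaled_dot_product_attention dispatches to fused kernels (e.g. FlashAttention)
# when available, avoiding an explicit softmax(QK^T)V materialization.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 128, 64])
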
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor)

Summary of Changes

This pull request integrates the Qwen3 Vision-Language (VL) model into the LightLLM framework, enabling efficient processing of multimodal inputs. It involves adapting the model's architecture, including its Mixture-of-Experts (MoE) components and specialized image processing pipeline, to work seamlessly within the existing inference system. The changes enhance the framework's capability to handle complex visual and textual data, ensuring accurate and performant multimodal inference for Qwen3 VL.

Highlights

  • Qwen3 VL Model Integration: Introduced comprehensive support for the Qwen3 Vision-Language (VL) model, including both standard and Mixture-of-Experts (MoE) variants, within the LightLLM framework.
  • Multimodal Image Processing Enhancements: Updated image resizing logic and introduced new vision transformer components (e.g., Qwen3VLPatchEmbed, Qwen3VLVisionBlock) to handle Qwen3 VL's specific image processing requirements, including deepstack features.
  • Advanced Rotary Embeddings (MRoPE): Implemented interleaved MRoPE (Multi-dimensional Rotary Position Embeddings) for Qwen3 VL, allowing for more complex positional encoding in multimodal contexts.
  • Inference State and Weight Management: Added dedicated inference state (Qwen3VLMOEInferStateInfo) and weight loading classes for Qwen3 VL, optimizing weight fusion for MoE layers and ensuring proper tensor parallelism handling.
  • Shared Memory for Deepstack Features: Extended shared memory utilities to support the efficient transfer and storage of 'deepstack features' generated by the vision model, which are then integrated into the language model's embeddings (see the illustrative sketch after this list).
  • Whisper Attention Optimization: Migrated the Whisper encoder layer to use WhisperSdpaAttention, leveraging torch.nn.functional.scaled_dot_product_attention for potential performance improvements.
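
The sketch below illustrates, in hedged form, one way deepstack features can be merged into token embeddings at visual-token positions; every name here is a placeholder, not the PR's actual API:

import torch

def inject_deepstack(hidden_states: torch.Tensor,
                     deepstack_feature: torch.Tensor,
                     image_token_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (seq_len, hidden_dim) token embeddings for one request
    # deepstack_feature: (num_image_tokens, hidden_dim) features from one vision layer
    # image_token_mask: (seq_len,) bool, True where the sequence holds visual tokens
    hidden_states = hidden_states.clone()
    hidden_states[image_token_mask] = hidden_states[image_token_mask] + deepstack_feature
    return hidden_states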

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the qwen3_vl and qwen3_vl_moe multimodal models. The changes are extensive, including new model definitions, custom layer implementations for handling vision features and Mixture-of-Experts, and specific weight loading logic. Notably, it adds support for deepstack features and mRoPE. While the implementation is comprehensive, I've identified a critical bug in the model initialization logic that could affect other models, along with a few high-severity issues related to potential runtime errors and several medium-severity issues for code cleanup. My feedback focuses on ensuring correctness, robustness, and maintainability.

self.model = (
Qwen2_5_VisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval().bfloat16()
)
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":

critical

The clause or "qwen3_vl_moe" is logically incorrect and will always evaluate to True because a non-empty string is truthy. This will cause this block to execute for any model type that doesn't match the preceding if/elif conditions, preventing other models from being initialized correctly. You should use in to check for membership in a list of model types.

Suggested change
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":
elif self.model_type in ["qwen3_vl", "qwen3_vl_moe"]:
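
As a quick illustration of why the original branch can never be skipped (the model_type value below is hypothetical):

# Hypothetical value that should not match the qwen3_vl branch:
model_type = "llama"

# Original condition: the non-empty string "qwen3_vl_moe" is truthy, so this is always True.
print(bool(model_type == "qwen3_vl" or "qwen3_vl_moe"))  # True

# Corrected membership test behaves as intended:
print(model_type in ["qwen3_vl", "qwen3_vl_moe"])  # False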

模型特殊的一些初始化
"""
rope_scaling = self.config.get("rope_scaling", None)
if "mrope_section" in rope_scaling:

high

This condition will raise a TypeError if rope_scaling is None, which is the default value assigned on the preceding line. You should add a check to ensure rope_scaling is not None before attempting to access it as a dictionary.

Suggested change
if "mrope_section" in rope_scaling:
if rope_scaling and "mrope_section" in rope_scaling:
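
To illustrate the failure mode and the guarded version (the config values here are made up):

config = {}  # hypothetical model config without a rope_scaling entry
rope_scaling = config.get("rope_scaling", None)

# "mrope_section" in rope_scaling  # would raise TypeError: argument of type 'NoneType' is not iterable

# The guarded check short-circuits when rope_scaling is None:
if rope_scaling and "mrope_section" in rope_scaling:
    mrope_section = rope_scaling["mrope_section"]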

Comment on lines +42 to +43
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor

high

The calculation for h_bar and w_bar could result in zero if height or width is very small compared to factor (e.g., if height / factor < 0.5). The previous implementation prevented this by using max(factor, ...). This logic has been moved into one branch of the conditional but is missing from the initial calculation. This could lead to errors or incorrect resizing for small images. It's safer to restore the max guard for the initial calculation.

Suggested change
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
h_bar = max(factor, round(height / factor) * factor)
w_bar = max(factor, round(width / factor) * factor)
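
A small worked example of the degenerate case (the factor of 28 matches the typical Qwen-VL patch-merge factor, but the exact value is incidental):

factor, height = 28, 10  # illustrative values; height / factor < 0.5

print(round(height / factor) * factor)               # 0  -> zero-height resize target
print(max(factor, round(height / factor) * factor))  # 28 -> guarded result stays valid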

freqs = torch.outer(t, inv_freq)

self.freqs = freqs.cuda()
print(f"model.freqs is {self.freqs}")

medium

This print statement appears to be for debugging purposes and should be removed before merging into the main branch to keep the logs clean.

self.position_cos = torch.cat([cos_half, cos_half], dim=-1).contiguous() # [L, d]
self.position_sin = torch.cat([sin_half, sin_half], dim=-1).contiguous()
if self.is_prefill:
pos = None

medium

The line pos = None appears to be leftover debugging code, as the pos variable is not used after this assignment. It should be removed to improve code clarity.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Qwen3-VL and Qwen3-VL-MoE models. The changes are extensive, covering new model definitions, a custom vision tower with deepstack feature injection, multi-modal RoPE (mROPE), and updates to weight loading for fused MoE layers. The implementation also includes a mechanism for passing deepstack features via shared memory. Additionally, there's a performance improvement for the Whisper model by integrating SDPA.

The overall approach is sound, but I've found a few critical issues that need to be addressed, particularly a potential TypeError in RoPE initialization and a buggy conditional in the visual server. I've also noted some areas for code cleanup, such as removing debug prints and dead code.

Comment on lines 111 to 112
if "mrope_section" in rope_scaling:
self.mrope_section = rope_scaling["mrope_section"]

critical

The check if "mrope_section" in rope_scaling: will raise a TypeError if rope_scaling is None, which is a possible value assigned on the preceding line. You should ensure rope_scaling is not None before attempting to access it.

Suggested change
if "mrope_section" in rope_scaling:
    self.mrope_section = rope_scaling["mrope_section"]
if rope_scaling and "mrope_section" in rope_scaling:
    self.mrope_section = rope_scaling["mrope_section"]

self.model = (
Qwen2_5_VisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval().bfloat16()
)
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":

critical

The condition self.model_type == "qwen3_vl" or "qwen3_vl_moe" is always true because the non-empty string "qwen3_vl_moe" evaluates to True. This will cause incorrect model loading for any subsequent model types. This should be corrected to check for membership in a list of model types.

Suggested change
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":
elif self.model_type in ["qwen3_vl", "qwen3_vl_moe"]:

Comment on lines +226 to +227
if self.fused_gate_up:
raise ValueError("qwen3_vl_moe not support quant now")

medium

This is a good safeguard to prevent running quantization on a model variant that does not support it yet. However, instead of raising a ValueError, it would be more informative to log a warning and skip quantization for this case, allowing the model to load and run in a non-quantized mode. This would provide more flexibility.
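
A minimal sketch of that fallback, written as a standalone helper (the function name and the quant-config argument are stand-ins, not the PR's actual API):

import logging

logger = logging.getLogger(__name__)

def maybe_disable_quant(fused_gate_up: bool, quant_cfg):
    # If fused gate-up weights are in use, quantization is not supported yet:
    # warn and fall back to unquantized weights instead of raising ValueError.
    if fused_gate_up and quant_cfg is not None:
        logger.warning("qwen3_vl_moe does not support quantization yet; loading unquantized weights.")
        return None
    return quant_cfg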

freqs = torch.outer(t, inv_freq)

self.freqs = freqs.cuda()
print(f"model.freqs is {self.freqs}")

medium

This print statement appears to be for debugging and should be removed from the production code.

Comment on lines +55 to +57
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> tuple[Image.Image, int, int]:

medium

The function resize_image is type-hinted to return a tuple[Image.Image, int, int], but the implementation only returns a single Image.Image object. The type hint should be corrected to match the actual return value.

Suggested change
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> tuple[Image.Image, int, int]:
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> Image.Image:

Comment on lines +18 to +22
args:
x: (3, bs, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
x_t: (bs, seq_len, head_dim // 2)

medium

The docstring for apply_interleaved_mrope is inaccurate. The argument is freqs with a shape of (3, seq_len, head_dim // 2), but the docstring refers to x with a bs (batch size) dimension, which is not present. The return value shape is also (seq_len, head_dim // 2). Please update the docstring for clarity.

Suggested change
args:
x: (3, bs, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
x_t: (bs, seq_len, head_dim // 2)
args:
freqs: (3, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
freqs_t: (seq_len, head_dim // 2)
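
For context, a hedged sketch of a merge matching these shapes is shown below; it illustrates one way interleaved mROPE can combine the temporal/height/width frequency tables and is not necessarily the PR's exact implementation:

import torch

def apply_interleaved_mrope(freqs: torch.Tensor, mrope_section: list) -> torch.Tensor:
    # freqs: (3, seq_len, head_dim // 2) stacked temporal/height/width frequencies
    # mrope_section: per-axis channel counts, e.g. [24, 20, 20]
    freqs_t = freqs[0].clone()
    # Interleave: every 3rd channel at offsets 1 and 2 takes the height/width frequencies.
    freqs_t[..., 1 : mrope_section[1] * 3 : 3] = freqs[1][..., 1 : mrope_section[1] * 3 : 3]
    freqs_t[..., 2 : mrope_section[2] * 3 : 3] = freqs[2][..., 2 : mrope_section[2] * 3 : 3]
    return freqs_t  # (seq_len, head_dim // 2)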

Comment on lines +25 to +28
axis_map = []
for i, n in enumerate(self.mrope_section * 2):
axis_map += [i % 3] * n
self.axis_map = torch.tensor(axis_map, dtype=torch.int32, device="cuda")

medium

The instance variable self.axis_map is initialized here but is not used anywhere in the class. This appears to be dead code and should be removed.

)


class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight): # 后面看要不要改

medium

This comment # 后面看要不要改 (which translates to "check later if this needs to be changed") seems to be a temporary note and should be removed from the code.

Suggested change
class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight): # 后面看要不要改
class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight):
