
Conversation

@SangChengC (Contributor)

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @SangChengC, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the framework's model coverage by adding full support for the Qwen3-VL and Qwen3-VL MoE models. It involves a deep integration of multimodal features, including a new visual encoder, specialized tokenization, and the 'deepstack' mechanism that passes rich visual context through multiple layers. Alongside these major additions, the PR refines existing image processing, enhances Rotary Positional Embeddings for complex scenarios, and optimizes MoE weight handling and Whisper's attention for better performance.

Highlights

  • Qwen3-VL Model Integration: Introduced comprehensive support for the Qwen3-VL (Vision-Language) model, including its Mixture-of-Experts (MoE) variant, by adding new model definitions, inference logic, and weight handling.
  • Deepstack Feature Support: Integrated 'deepstack' features for multimodal models, allowing intermediate visual features from different layers of the vision encoder to be passed to the language model, enhancing multimodal understanding.
  • Advanced Image Processing: Enhanced image processing utilities with more flexible smart_resize and resize_image functions, explicitly handling factor, min_pixels, and max_pixels for better control over image resizing and token length calculation.
  • Rotary Positional Embedding (RoPE) Enhancements: Updated RoPE handling in the Llama model to support mrope_section and a new 'default' scaling type, along with an apply_interleaved_mrope function for 3D rotary embeddings in Qwen3-VL.
  • MoE Weight Fusion Improvements: Modified the FusedMoeWeightTP class to correctly handle fused gate-up weights (3D tensors) for MoE experts, including tensor parallelism splitting and transposing, while also adding a check for quantization compatibility.
  • Whisper Model Optimization: Optimized the Whisper model's attention mechanism by introducing WhisperSdpaAttention, which leverages torch.nn.functional.scaled_dot_product_attention for improved efficiency (a minimal SDPA sketch follows this list).
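
For readers unfamiliar with SDPA, here is a minimal, hedged sketch of the call it wraps; the shapes are illustrative and not taken from the Whisper implementation:

import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# scaled_dot_product_attention dispatches to fused kernels (e.g. FlashAttention)
# when available, avoiding an explicit softmax(QK^T)V materialization.
out = F.scaled_dot_product_attention(q, k, v, is_causal=False)
print(out.shape)  # torch.Size([1, 8, 128, 64])
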
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize your Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (Contributor)

Summary of Changes

This pull request integrates the Qwen3 Vision-Language (VL) model into the LightLLM framework, enabling efficient processing of multimodal inputs. It involves adapting the model's architecture, including its Mixture-of-Experts (MoE) components and specialized image processing pipeline, to work seamlessly within the existing inference system. The changes enhance the framework's capability to handle complex visual and textual data, ensuring accurate and performant multimodal inference for Qwen3 VL.

Highlights

  • Qwen3 VL Model Integration: Introduced comprehensive support for the Qwen3 Vision-Language (VL) model, including both standard and Mixture-of-Experts (MoE) variants, within the LightLLM framework.
  • Multimodal Image Processing Enhancements: Updated image resizing logic and introduced new vision transformer components (e.g., Qwen3VLPatchEmbed, Qwen3VLVisionBlock) to handle Qwen3 VL's specific image processing requirements, including deepstack features.
  • Advanced Rotary Embeddings (MRoPE): Implemented interleaved MRoPE (Multi-dimensional Rotary Position Embeddings) for Qwen3 VL, allowing for more complex positional encoding in multimodal contexts.
  • Inference State and Weight Management: Added dedicated inference state (Qwen3VLMOEInferStateInfo) and weight loading classes for Qwen3 VL, optimizing weight fusion for MoE layers and ensuring proper tensor parallelism handling.
  • Shared Memory for Deepstack Features: Extended shared memory utilities to support the efficient transfer and storage of 'deepstack features' generated by the vision model, which are then integrated into the language model's embeddings (see the illustrative sketch after this list).
  • Whisper Attention Optimization: Migrated the Whisper encoder layer to use WhisperSdpaAttention, leveraging torch.nn.functional.scaled_dot_product_attention for potential performance improvements.
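
The sketch below illustrates, in hedged form, one way deepstack features can be merged into token embeddings at visual-token positions; every name here is a placeholder, not the PR's actual API:

import torch

def inject_deepstack(hidden_states: torch.Tensor,
                     deepstack_feature: torch.Tensor,
                     image_token_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (seq_len, hidden_dim) token embeddings for one request
    # deepstack_feature: (num_image_tokens, hidden_dim) features from one vision layer
    # image_token_mask: (seq_len,) bool, True where the sequence holds visual tokens
    hidden_states = hidden_states.clone()
    hidden_states[image_token_mask] = hidden_states[image_token_mask] + deepstack_feature
    return hidden_states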

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the qwen3_vl and qwen3_vl_moe multimodal models. The changes are extensive, including new model definitions, custom layer implementations for handling vision features and Mixture-of-Experts, and specific weight loading logic. Notably, it adds support for deepstack features and mRoPE. While the implementation is comprehensive, I've identified a critical bug in the model initialization logic that could affect other models, along with a few high-severity issues related to potential runtime errors and several medium-severity issues for code cleanup. My feedback focuses on ensuring correctness, robustness, and maintainability.

self.model = (
Qwen2_5_VisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval().bfloat16()
)
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":

critical

The clause or "qwen3_vl_moe" is logically incorrect and will always evaluate to True because a non-empty string is truthy. This will cause this block to execute for any model type that doesn't match the preceding if/elif conditions, preventing other models from being initialized correctly. You should use in to check for membership in a list of model types.

Suggested change
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":
elif self.model_type in ["qwen3_vl", "qwen3_vl_moe"]:
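
As a quick illustration of why the original branch can never be skipped (the model_type value below is hypothetical):

# Hypothetical value that should not match the qwen3_vl branch:
model_type = "llama"

# Original condition: the non-empty string "qwen3_vl_moe" is truthy, so this is always True.
print(bool(model_type == "qwen3_vl" or "qwen3_vl_moe"))  # True

# Corrected membership test behaves as intended:
print(model_type in ["qwen3_vl", "qwen3_vl_moe"])  # False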

模型特殊的一些初始化
"""
rope_scaling = self.config.get("rope_scaling", None)
if "mrope_section" in rope_scaling:

high

This condition will raise a TypeError if rope_scaling is None, which is the default value assigned on the preceding line. You should add a check to ensure rope_scaling is not None before attempting to access it as a dictionary.

Suggested change
if "mrope_section" in rope_scaling:
if rope_scaling and "mrope_section" in rope_scaling:
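
To illustrate the failure mode and the guarded version (the config values here are made up):

config = {}  # hypothetical model config without a rope_scaling entry
rope_scaling = config.get("rope_scaling", None)

# "mrope_section" in rope_scaling  # would raise TypeError: argument of type 'NoneType' is not iterable

# The guarded check short-circuits when rope_scaling is None:
if rope_scaling and "mrope_section" in rope_scaling:
    mrope_section = rope_scaling["mrope_section"]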

Comment on lines +42 to +43
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor

high

The calculation for h_bar and w_bar could result in zero if height or width is very small compared to factor (e.g., if height / factor < 0.5). The previous implementation prevented this by using max(factor, ...). This logic has been moved into one branch of the conditional but is missing from the initial calculation. This could lead to errors or incorrect resizing for small images. It's safer to restore the max guard for the initial calculation.

Suggested change
h_bar = round(height / factor) * factor
w_bar = round(width / factor) * factor
h_bar = max(factor, round(height / factor) * factor)
w_bar = max(factor, round(width / factor) * factor)
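
A small worked example of the degenerate case (the factor of 28 matches the typical Qwen-VL patch-merge factor, but the exact value is incidental):

factor, height = 28, 10  # illustrative values; height / factor < 0.5

print(round(height / factor) * factor)               # 0  -> zero-height resize target
print(max(factor, round(height / factor) * factor))  # 28 -> guarded result stays valid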

freqs = torch.outer(t, inv_freq)

self.freqs = freqs.cuda()
print(f"model.freqs is {self.freqs}")

medium

This print statement appears to be for debugging purposes and should be removed before merging into the main branch to keep the logs clean.

self.position_cos = torch.cat([cos_half, cos_half], dim=-1).contiguous() # [L, d]
self.position_sin = torch.cat([sin_half, sin_half], dim=-1).contiguous()
if self.is_prefill:
pos = None

medium

The line pos = None appears to be leftover debugging code, as the pos variable is not used after this assignment. It should be removed to improve code clarity.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Qwen3-VL and Qwen3-VL-MoE models. The changes are extensive, covering new model definitions, a custom vision tower with deepstack feature injection, multi-modal RoPE (mROPE), and updates to weight loading for fused MoE layers. The implementation also includes a mechanism for passing deepstack features via shared memory. Additionally, there's a performance improvement for the Whisper model by integrating SDPA.

The overall approach is sound, but I've found a few critical issues that need to be addressed, particularly a potential TypeError in RoPE initialization and a buggy conditional in the visual server. I've also noted some areas for code cleanup, such as removing debug prints and dead code.

Comment on lines 111 to 112
if "mrope_section" in rope_scaling:
self.mrope_section = rope_scaling["mrope_section"]

critical

The check if "mrope_section" in rope_scaling: will raise a TypeError if rope_scaling is None, which is a possible value assigned on the preceding line. You should ensure rope_scaling is not None before attempting to access it.

Suggested change
if "mrope_section" in rope_scaling:
    self.mrope_section = rope_scaling["mrope_section"]
if rope_scaling and "mrope_section" in rope_scaling:
    self.mrope_section = rope_scaling["mrope_section"]

self.model = (
Qwen2_5_VisionTransformerPretrainedModel(kvargs, **model_cfg["vision_config"]).eval().bfloat16()
)
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":

critical

The condition self.model_type == "qwen3_vl" or "qwen3_vl_moe" is always true because the non-empty string "qwen3_vl_moe" evaluates to True. This will cause incorrect model loading for any subsequent model types. This should be corrected to check for membership in a list of model types.

Suggested change
elif self.model_type == "qwen3_vl" or "qwen3_vl_moe":
elif self.model_type in ["qwen3_vl", "qwen3_vl_moe"]:

Comment on lines +226 to +227
if self.fused_gate_up:
raise ValueError("qwen3_vl_moe not support quant now")

medium

This is a good safeguard to prevent running quantization on a model variant that does not support it yet. However, instead of raising a ValueError, it would be more informative to log a warning and skip quantization for this case, allowing the model to load and run in a non-quantized mode. This would provide more flexibility.
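
A minimal sketch of that fallback, written as a standalone helper (the function name and the quant-config argument are stand-ins, not the PR's actual API):

import logging

logger = logging.getLogger(__name__)

def maybe_disable_quant(fused_gate_up: bool, quant_cfg):
    # If fused gate-up weights are in use, quantization is not supported yet:
    # warn and fall back to unquantized weights instead of raising ValueError.
    if fused_gate_up and quant_cfg is not None:
        logger.warning("qwen3_vl_moe does not support quantization yet; loading unquantized weights.")
        return None
    return quant_cfg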

freqs = torch.outer(t, inv_freq)

self.freqs = freqs.cuda()
print(f"model.freqs is {self.freqs}")

medium

This print statement appears to be for debugging and should be removed from the production code.

Comment on lines +55 to +57
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> tuple[Image.Image, int, int]:

medium

The function resize_image is type-hinted to return a tuple[Image.Image, int, int], but the implementation only returns a single Image.Image object. The type hint should be corrected to match the actual return value.

Suggested change
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> tuple[Image.Image, int, int]:
def resize_image(
image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
) -> Image.Image:

Comment on lines +18 to +22
args:
x: (3, bs, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
x_t: (bs, seq_len, head_dim // 2)

medium

The docstring for apply_interleaved_mrope is inaccurate. The argument is freqs with a shape of (3, seq_len, head_dim // 2), but the docstring refers to x with a bs (batch size) dimension, which is not present. The return value shape is also (seq_len, head_dim // 2). Please update the docstring for clarity.

Suggested change
args:
x: (3, bs, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
x_t: (bs, seq_len, head_dim // 2)
args:
freqs: (3, seq_len, head_dim // 2)
mrope_section: (3,)
returns:
freqs_t: (seq_len, head_dim // 2)
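
For context, a hedged sketch of a merge matching these shapes is shown below; it illustrates one way interleaved mROPE can combine the temporal/height/width frequency tables and is not necessarily the PR's exact implementation:

import torch

def apply_interleaved_mrope(freqs: torch.Tensor, mrope_section: list) -> torch.Tensor:
    # freqs: (3, seq_len, head_dim // 2) stacked temporal/height/width frequencies
    # mrope_section: per-axis channel counts, e.g. [24, 20, 20]
    freqs_t = freqs[0].clone()
    # Interleave: every 3rd channel at offsets 1 and 2 takes the height/width frequencies.
    freqs_t[..., 1 : mrope_section[1] * 3 : 3] = freqs[1][..., 1 : mrope_section[1] * 3 : 3]
    freqs_t[..., 2 : mrope_section[2] * 3 : 3] = freqs[2][..., 2 : mrope_section[2] * 3 : 3]
    return freqs_t  # (seq_len, head_dim // 2)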

Comment on lines +25 to +28
axis_map = []
for i, n in enumerate(self.mrope_section * 2):
axis_map += [i % 3] * n
self.axis_map = torch.tensor(axis_map, dtype=torch.int32, device="cuda")

medium

The instance variable self.axis_map is initialized here but is not used anywhere in the class. This appears to be dead code and should be removed.

)


class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight): # 后面看要不要改

medium

This comment # 后面看要不要改 (which translates to "check later if this needs to be changed") seems to be a temporary note and should be removed from the code.

Suggested change
class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight): # 后面看要不要改
class Qwen3VLTransformerLayerWeight(Qwen3TransformerLayerWeight):
