
Conversation

@hiworldwzj
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @hiworldwzj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in how AWQ (Activation-aware Weight Quantization) weights are handled within the lightllm framework, particularly their slicing for matrix-multiplication operations. The changes introduce specialized mixins that correctly process AWQ's in x out weight storage format and streamline the bias-division logic across the various weight-slicing implementations, ensuring quantized models are processed accurately and efficiently.

Highlights

  • Introduction of AWQ-Specific Slicer Mixins: New classes, AwqQuantizedRowSliceMixin and AwqQuantizedColSliceMixin, have been added to handle the in x out storage format of AWQ (Activation-aware Weight Quantization) weights and ensure correct slicing behavior; a short sketch of this layout difference follows the highlights.
  • Refactored Bias Slicing Logic: The bias_div_world_size parameter has been removed from the constructors of several base SliceMixin classes. The logic for bias division is now directly embedded within the _slice_bias methods of ColSliceMixin and the newly introduced AWQ slicers, simplifying the API and ensuring consistent behavior.
  • Integration of New AWQ Slicers: The AWQCOLMMWeight and AWQROWMMWeight classes have been updated to utilize the appropriate AwqQuantizedColSliceMixin and AwqQuantizedRowSliceMixin respectively, replacing previous generic quantized slicers and removing the now redundant bias_div_world_size argument.
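
For readers less familiar with AWQ, the key detail behind these mixins is that a standard linear weight is stored as out x in while AWQ stores its quantized weight as in x out, so the dimension each tensor-parallel strategy must slice is flipped. A minimal, self-contained sketch of that difference, using toy shapes and plain torch.chunk rather than the lightllm slicer API:

import torch

# Toy sizes for illustration only.
in_features, out_features, tp_world_size, tp_rank = 8, 6, 2, 0

# A standard linear weight is stored as (out, in); column-parallel TP slices dim 0.
std_weight = torch.randn(out_features, in_features)
std_col_shard = std_weight.chunk(tp_world_size, dim=0)[tp_rank]   # (out/tp, in)

# An AWQ-style weight is stored as (in, out); the same column-parallel split slices dim 1.
awq_weight = torch.randint(0, 16, (in_features, out_features), dtype=torch.int32)
awq_col_shard = awq_weight.chunk(tp_world_size, dim=1)[tp_rank]   # (in, out/tp)

# Row-parallel TP is the mirror image: dim 1 for (out, in), dim 0 for (in, out).
std_row_shard = std_weight.chunk(tp_world_size, dim=1)[tp_rank]   # (out, in/tp)
awq_row_shard = awq_weight.chunk(tp_world_size, dim=0)[tp_rank]   # (in/tp, out)

print(std_col_shard.shape, awq_col_shard.shape, std_row_shard.shape, awq_row_shard.shape)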

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the tensor-parallel slicing logic, primarily to add support for AWQ quantized weights, which have a different memory layout. The changes simplify bias handling by removing the bias_div_world_size flag, which is a good improvement. However, I've found two critical issues where the new AWQ slicer classes are swapped between the column-parallel and row-parallel weight implementations, which will cause the model to fail. Additionally, the naming convention for all slicer classes is counter-intuitive and is likely the root cause of these bugs; I've suggested a refactoring to improve clarity and maintainability.

-        self.param_slicer = QuantizedRowSliceMixin(
-            tp_rank=tp_rank, tp_world_size=tp_world_size, bias_div_world_size=True
-        )
+        self.param_slicer = AwqQuantizedColSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)

critical

There appears to be a mix-up in the slicer class used here. AWQCOLMMWeight implements a column-parallel linear layer. For AWQ's in x out weight layout, this requires slicing along dimension 1 (the out dimension). The correct slicer for this is AwqQuantizedRowSliceMixin, which correctly handles slicing for weights, scales, zero-points, and bias for column-parallelism with AWQ weights.

However, AwqQuantizedColSliceMixin is used, which is designed for row-parallelism. This will result in incorrect tensor slicing and will break the model. You'll also need to update the import on line 11 to bring in AwqQuantizedRowSliceMixin.

Suggested change
-        self.param_slicer = AwqQuantizedColSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)
+        self.param_slicer = AwqQuantizedRowSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)
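
To make the dimension argument above concrete, here is a hedged sketch (toy shapes and plain tensor indexing, not the actual mixin code) of what a column-parallel shard of AWQ-style tensors looks like: the weight, scales, and zero-points are cut along the out dimension (dim 1), while the bias is cut along its only dimension.

import torch

tp_rank, tp_world_size = 0, 2
in_features, out_features, group_size = 128, 64, 32   # made-up sizes

qweight = torch.randint(0, 16, (in_features, out_features), dtype=torch.int32)  # in x out
scales = torch.rand(in_features // group_size, out_features)                    # groups x out
bias = torch.randn(out_features)

# Column-parallel: every rank keeps its slice of the out dimension.
out_per_rank = out_features // tp_world_size
sl = slice(out_per_rank * tp_rank, out_per_rank * (tp_rank + 1))
qweight_shard, scales_shard, bias_shard = qweight[:, sl], scales[:, sl], bias[sl]
print(qweight_shard.shape, scales_shard.shape, bias_shard.shape)  # (128, 32) (4, 32) (32,)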

-            tp_rank=tp_rank, tp_world_size=tp_world_size, bias_div_world_size=False
-        )
+        self.param_slicer = AwqQuantizedRowSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)

critical

There appears to be a mix-up in the slicer class used here. AWQROWMMWeight implements a row-parallel linear layer. For AWQ's in x out weight layout, this requires slicing along dimension 0 (the in dimension). The correct slicer for this is AwqQuantizedColSliceMixin, which correctly handles slicing for weights, scales, zero-points, and bias for row-parallelism with AWQ weights.

However, AwqQuantizedRowSliceMixin is used, which is designed for column-parallelism. This will result in incorrect tensor slicing and will break the model. You'll also need to update the import on line 12 to bring in AwqQuantizedColSliceMixin.

Suggested change
-        self.param_slicer = AwqQuantizedRowSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)
+        self.param_slicer = AwqQuantizedColSliceMixin(tp_rank=tp_rank, tp_world_size=tp_world_size)
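
The bias handling in the row-parallel case is worth spelling out: each rank computes a partial matmul over its slice of the in dimension and the partial outputs are summed by an all-reduce, so the bias must be pre-divided by the world size (which is exactly what AwqQuantizedColSliceMixin's _slice_bias does) so that it is counted only once after the reduction. A toy, single-process sketch with made-up shapes and an already-dequantized stand-in weight:

import torch

tp_world_size = 2
x = torch.randn(4, 128)            # activations, full in dimension
w = torch.randn(128, 32)           # in x out orientation (dequantized stand-in)
bias = torch.randn(32)

# Row-parallel: shard the in dimension; each "rank" computes x_shard @ w_shard.
in_per_rank = 128 // tp_world_size
partials = [
    x[:, r * in_per_rank:(r + 1) * in_per_rank] @ w[r * in_per_rank:(r + 1) * in_per_rank, :]
    + bias / tp_world_size
    for r in range(tp_world_size)
]

# Summing the partials (what the all-reduce does across ranks) matches the full result,
# because the pre-divided bias is added exactly once in total.
assert torch.allclose(sum(partials), x @ w + bias, atol=1e-4)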

Comment on lines +118 to +134
# AWQ quantized weights are stored in an in x out layout, so slicing needs a dedicated implementation.
class AwqQuantizedRowSliceMixin(QuantizedColSliceMixin):
    def __init__(self, tp_rank: int = None, tp_world_size: int = None):
        super().__init__(tp_rank, tp_world_size)

    def _slice_bias(self, bias: torch.Tensor) -> torch.Tensor:
        assert bias.shape[0] % self.tp_world_size_ == 0, f"tp slice error {bias.shape[0]} % {self.tp_world_size_}"
        tp_size = bias.shape[0] // self.tp_world_size_
        return bias[tp_size * self.tp_rank_ : tp_size * (self.tp_rank_ + 1)]


class AwqQuantizedColSliceMixin(QuantizedRowSliceMixin):
    def __init__(self, tp_rank: int = None, tp_world_size: int = None):
        super().__init__(tp_rank, tp_world_size)

    def _slice_bias(self, bias: torch.Tensor) -> torch.Tensor:
        return bias / self.tp_world_size_

medium

The naming of these new AWQ slicer classes, and the base slicer classes they inherit from, is counter-intuitive and has likely led to the bugs in colmm_weight.py and rowmm_weight.py.

  • AwqQuantizedRowSliceMixin is for column-parallelism.
  • AwqQuantizedColSliceMixin is for row-parallelism.

This is because they are named based on the dimension they slice for a standard out x in weight, but for AWQ's in x out weights, the slicing dimension is swapped for a given parallelism type. This makes the code very difficult to reason about.

To improve clarity and prevent future bugs, I strongly recommend renaming all slicer classes to reflect the parallelism strategy they implement, rather than the dimension they slice. For example:

  • RowSliceMixin -> ColumnParallelSliceMixin
  • ColSliceMixin -> RowParallelSliceMixin
  • AwqQuantizedRowSliceMixin -> AwqColumnParallelSliceMixin
  • AwqQuantizedColSliceMixin -> AwqRowParallelSliceMixin
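
If the maintainers want to adopt this rename without a disruptive one-shot change, one option (a sketch only, not part of this PR) is to introduce the parallelism-based names as thin aliases of the existing classes, migrate call sites such as AWQCOLMMWeight and AWQROWMMWeight to the new names, and then drop the old ones:

# Sketch: stand-in class bodies; the real mixins live in the lightllm slicer module.
class RowSliceMixin: ...
class ColSliceMixin: ...
class AwqQuantizedRowSliceMixin: ...
class AwqQuantizedColSliceMixin: ...

# Aliases named after the parallelism strategy they implement, not the dimension they slice.
ColumnParallelSliceMixin = RowSliceMixin
RowParallelSliceMixin = ColSliceMixin
AwqColumnParallelSliceMixin = AwqQuantizedRowSliceMixin
AwqRowParallelSliceMixin = AwqQuantizedColSliceMixin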

@shihaobai merged commit 8fd19c2 into main on Nov 13, 2025
1 check passed
@shihaobai deleted the wzj branch on November 13, 2025 at 06:18
