Conversation

@Qubitium (Collaborator) commented Sep 30, 2025

@avtc Another 75% reduction in MoE quantization time on top of the current main branch. No joke. This PR will actually make smaller non-MoE models slower, but it gives a huge boost to MoE models. The bigger the model, the more GPUs will help.

Forwarding is now data-parallel: the model is replicated across all GPUs and the work is sharded.

But based on some small tests, I expect the OOM risk to actually increase: the model now has to be replicated/copied to multiple GPUs from multiple threads, so gpu:0's memory load actually goes up.
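A rough sketch of the idea (hypothetical helper, not the PR's actual looper code): replicate the model once per device, split the calibration batches round-robin, and run each shard in its own thread.

    import copy
    from concurrent.futures import ThreadPoolExecutor

    import torch

    def data_parallel_forward(model, batches, devices):
        # One full replica per device: total VRAM scales with len(devices),
        # and the source model stays resident while copies are made, which
        # is why gpu:0's memory load goes up.
        replicas = {d: copy.deepcopy(model).to(d) for d in devices}

        def run(device, shard):
            outputs = []
            with torch.no_grad():
                for batch in shard:
                    batch = {k: v.to(device) for k, v in batch.items()}
                    outputs.append(replicas[device](**batch))
            return outputs

        # Round-robin shard of the calibration batches, one thread per device.
        shards = [batches[i::len(devices)] for i in range(len(devices))]
        with ThreadPoolExecutor(max_workers=len(devices)) as pool:
            futures = [pool.submit(run, d, s) for d, s in zip(devices, shards)]
            return [out for f in futures for out in f.result()]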

@Qubitium marked this pull request as ready for review September 30, 2025 04:04
@Qubitium merged commit 28dfc92 into main Sep 30, 2025
5 checks passed
@avtc (Contributor) commented Sep 30, 2025

Awesome, will check.
Is it possible to add pipeline parallelism, so that large layers can be handled across multiple GPUs?
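(For contrast with the data-parallel approach in this PR: pipeline parallelism would place different layers on different GPUs and pass activations between stages, so no single GPU has to hold the whole model. A toy sketch, assuming a flat `layers` list of plain modules, not anything from this repo:)

    import torch.nn as nn

    # Toy illustration: split the layer stack across two GPUs and move the
    # activations between stages, so neither GPU holds the full model.
    half = len(layers) // 2  # `layers`: the model's decoder layers (assumed given)
    stage0 = nn.Sequential(*layers[:half]).to("cuda:0")
    stage1 = nn.Sequential(*layers[half:]).to("cuda:1")

    def pipeline_forward(x):
        x = stage0(x.to("cuda:0"))     # first half of the layers on gpu 0
        return stage1(x.to("cuda:1"))  # hand the activations to gpu 1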

@avtc (Contributor) commented Sep 30, 2025

@Qubitium
Tried to run the latest main (1918990) and got an error:

  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/models/base.py", line 715, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 712, in loop
    input_cache = self.cache_inputs(layers=layers,
                                    calibration_data=processor.calibration_dataset,
                                    use_cache=False)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 659, in cache_inputs
    self.gptq_model.model(**example, use_cache=use_cache)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 587, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ~~~~~~~~~~^
        input_ids=input_ids,
        ^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 510, in forward
    causal_mask = create_causal_mask(
        config=self.config,
    ...<4 lines>...
        position_ids=position_ids,
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 822, in create_causal_mask
    causal_mask = mask_interface(
        batch_size=batch_size,
    ...<7 lines>...
        config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 374, in sdpa_mask_recent_torch
    if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size):
                                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 254, in _ignore_causal_mask_sdpa
    padding_mask.all()
    ~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
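For reference, the bottom of the trace reproduces in isolation; it suggests the padding mask ended up on the meta device by the time transformers tried to materialize it:

    import torch

    # Meta tensors carry no data, so any op that needs a concrete value
    # (here .item()) raises exactly this error.
    mask = torch.ones(1, 8, dtype=torch.bool, device="meta")
    mask.all().item()  # RuntimeError: Tensor.item() cannot be called on meta tensors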

@Qubitium (Collaborator, Author) commented Sep 30, 2025

@avtc You are a bug magnet, if I may say so. lol. Geez. You keep running into bugs faster than I can fix them. Btw, offload-file direct copy on model save is now live (merged) on main, for ultra-low CPU memory usage during model.save.
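(Roughly the idea, as a hypothetical sketch; the names are illustrative, not the actual GPTQModel API: copy the already-serialized offload shard files straight into the save directory instead of round-tripping the tensors through CPU RAM.)

    import shutil
    from pathlib import Path

    def copy_offload_shards(offload_dir: str, save_dir: str) -> None:
        # Raw file copy: no tensor is ever deserialized into CPU memory.
        dst = Path(save_dir)
        dst.mkdir(parents=True, exist_ok=True)
        for shard in Path(offload_dir).glob("*.safetensors"):
            shutil.copy2(shard, dst / shard.name)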

@Qubitium deleted the data-p2 branch September 30, 2025 10:07
@avtc (Contributor) commented Sep 30, 2025

Not sure if this could be an issue: the model's first layer(s) have no expert modules. Expert modules start from layer index 1 for GLM-4.5-Air and from layer index 3 for GLM-4.5. Also, the last layer has additional modules that are absent from all other layers.
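(One quick way to see which layers carry experts, assuming the transformers Glm4Moe layout where MoE layers expose mlp.experts and `model` is the loaded model:)

    # Dense layers (index 0 on GLM-4.5-Air, 0-2 on GLM-4.5) have a plain MLP
    # with no `experts` child; MoE layers do.
    moe_layers = [
        i for i, layer in enumerate(model.model.layers)
        if hasattr(layer.mlp, "experts")
    ]
    print(moe_layers)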

@avtc (Contributor) commented Oct 1, 2025

Latest main works! Thank you @Qubitium.

