Conversation

@Qubitium (Collaborator) commented Sep 30, 2025

@avtc Another 75% reduction in MoE quantization time on top of the current main branch. No joke. This PR will actually make smaller non-MoE models slower, but it gives a huge boost to MoE models. The bigger the model, the more GPUs will help.

Forwarding is now data-parallel: the model is replicated across all GPUs and the work is sharded.

But based on some small tests, I expect the OOM risk to actually increase: the model now has to be replicated/copied to multiple GPUs from multiple threads, so gpu:0's memory load actually goes up.
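A rough sketch of the idea (hypothetical helper, not the PR's actual looper code): replicate the model once per device, split the calibration batches round-robin, and run each shard in its own thread.

    import copy
    from concurrent.futures import ThreadPoolExecutor

    import torch

    def data_parallel_forward(model, batches, devices):
        # One full replica per device: total VRAM scales with len(devices),
        # and the source model stays resident while copies are made, which
        # is why gpu:0's memory load goes up.
        replicas = {d: copy.deepcopy(model).to(d) for d in devices}

        def run(device, shard):
            outputs = []
            with torch.no_grad():
                for batch in shard:
                    batch = {k: v.to(device) for k, v in batch.items()}
                    outputs.append(replicas[device](**batch))
            return outputs

        # Round-robin shard of the calibration batches, one thread per device.
        shards = [batches[i::len(devices)] for i in range(len(devices))]
        with ThreadPoolExecutor(max_workers=len(devices)) as pool:
            futures = [pool.submit(run, d, s) for d, s in zip(devices, shards)]
            return [out for f in futures for out in f.result()]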

@Qubitium marked this pull request as ready for review September 30, 2025 04:04
@Qubitium merged commit 28dfc92 into main Sep 30, 2025
5 checks passed
@avtc (Contributor) commented Sep 30, 2025

Awesome, will check.
Is it possible to add pipeline parallelism, so that large layers can be handled across multiple GPUs?
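(For contrast with the data-parallel approach in this PR: pipeline parallelism would place different layers on different GPUs and pass activations between stages, so no single GPU has to hold the whole model. A toy sketch, assuming a flat `layers` list of plain modules, not anything from this repo:)

    import torch.nn as nn

    # Toy illustration: split the layer stack across two GPUs and move the
    # activations between stages, so neither GPU holds the full model.
    half = len(layers) // 2  # `layers`: the model's decoder layers (assumed given)
    stage0 = nn.Sequential(*layers[:half]).to("cuda:0")
    stage1 = nn.Sequential(*layers[half:]).to("cuda:1")

    def pipeline_forward(x):
        x = stage0(x.to("cuda:0"))     # first half of the layers on gpu 0
        return stage1(x.to("cuda:1"))  # hand the activations to gpu 1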

@avtc (Contributor) commented Sep 30, 2025

@Qubitium
Tried to run the latest main (1918990) and got an error:

  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/models/base.py", line 715, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 712, in loop
    input_cache = self.cache_inputs(layers=layers,
                                    calibration_data=processor.calibration_dataset,
                                    use_cache=False)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/gptqmodel/looper/module_looper.py", line 659, in cache_inputs
    self.gptq_model.model(**example, use_cache=use_cache)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 587, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ~~~~~~~~~~^
        input_ids=input_ids,
        ^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 510, in forward
    causal_mask = create_causal_mask(
        config=self.config,
    ...<4 lines>...
        position_ids=position_ids,
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 822, in create_causal_mask
    causal_mask = mask_interface(
        batch_size=batch_size,
    ...<7 lines>...
        config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 374, in sdpa_mask_recent_torch
    if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size):
                                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 254, in _ignore_causal_mask_sdpa
    padding_mask.all()
    ~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors
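For reference, the bottom of the trace reproduces in isolation; it suggests the padding mask ended up on the meta device by the time transformers tried to materialize it:

    import torch

    # Meta tensors carry no data, so any op that needs a concrete value
    # (here .item()) raises exactly this error.
    mask = torch.ones(1, 8, dtype=torch.bool, device="meta")
    mask.all().item()  # RuntimeError: Tensor.item() cannot be called on meta tensors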

@Qubitium (Collaborator, Author) commented Sep 30, 2025

@avtc You are a bug magnet, if I may say so. lol. Geez. You keep running into bugs faster than I can fix them. Btw, offload-file direct copy on model save is now live (merged) on main, for ultra-low CPU memory usage during model.save.
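(Roughly the idea, as a hypothetical sketch; the names are illustrative, not the actual GPTQModel API: copy the already-serialized offload shard files straight into the save directory instead of round-tripping the tensors through CPU RAM.)

    import shutil
    from pathlib import Path

    def copy_offload_shards(offload_dir: str, save_dir: str) -> None:
        # Raw file copy: no tensor is ever deserialized into CPU memory.
        dst = Path(save_dir)
        dst.mkdir(parents=True, exist_ok=True)
        for shard in Path(offload_dir).glob("*.safetensors"):
            shutil.copy2(shard, dst / shard.name)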

@Qubitium deleted the data-p2 branch September 30, 2025 10:07
@avtc (Contributor) commented Sep 30, 2025

Not sure if this could be an issue: the model's first layer(s) have no expert modules. Expert modules start from layer index 1 for GLM-4.5-Air and from layer index 3 for GLM-4.5. Also, the last layer has additional modules that are absent from all other layers.
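(One quick way to see which layers carry experts, assuming the transformers Glm4Moe layout where MoE layers expose mlp.experts and `model` is the loaded model:)

    # Dense layers (index 0 on GLM-4.5-Air, 0-2 on GLM-4.5) have a plain MLP
    # with no `experts` child; MoE layers do.
    moe_layers = [
        i for i, layer in enumerate(model.model.layers)
        if hasattr(layer.mlp, "experts")
    ]
    print(moe_layers)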

@avtc (Contributor) commented Oct 1, 2025

Latest main works! Thank you @Qubitium.

