
Conversation

@avtc (Contributor) commented Sep 29, 2025

Revert the removal of the lock over Q.moveTo, to be able to run on multi-GPU.
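
For reference, a minimal sketch of the kind of guard this PR restores; the names move_lock and move_to below are illustrative only, not the actual GPT-QModel API:

import threading
import torch

# Illustrative sketch: serialize cross-device tensor moves so worker threads
# driving different GPUs cannot race each other during a transfer.
move_lock = threading.Lock()

def move_to(t: torch.Tensor, device: torch.device | str) -> torch.Tensor:
    with move_lock:  # only one thread performs a device transfer at a time
        return t.to(device)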

@Qubitium (Collaborator) commented Sep 29, 2025

@avtc

Are you still getting the Q.moveTo() asserts on main? Do you have SLI/NVLink enabled on your 3090 by any chance? I have tested main on 2-4 GPUs, so I am thinking this may be something 3090 specific: on main, the Process task work where this error happened can only run on the same device as Q, so seeing it at all is very eye opening.

@Qubitium (Collaborator)

@avtc The Optional arg issue and many other DeviceThreadPool bugs were fixed in #1948. I will double check the Q.to bug on my system with GLM 4.5 Air later today.

@avtc (Contributor, Author) commented Sep 30, 2025

> @avtc
>
> Are you still getting the Q.moveTo() asserts on main? Do you have SLI/NVLink enabled on your 3090 by any chance? I have tested main on 2-4 GPUs, so I am thinking this may be something 3090 specific: on main, the Process task work where this error happened can only run on the same device as Q, so seeing it at all is very eye opening.

@Qubitium
I am using the p2p-enabled driver, which provides something like NVLink between all cards over the PCI bus. For me the Accelerate "Invalid argument" issue reproduced on 5+ cards, but it suddenly reproduced on 4 cards as well, so it is random, and with more cards the chance is higher.

Sometimes CUDA_LAUNCH_BLOCKING=1 helps with the Accelerate "Invalid argument" error.
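
(For anyone reproducing this: the variable must be set before CUDA is initialized, e.g. in the shell before launching the script, or at the very top of it as sketched below.)

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches so errors surface at the real call site

import torch  # import torch only after the variable is set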

Python is 3.13.7t (free-threaded build).

@Qubitium (Collaborator)

@avtc Are you using the tinygrad hacked p2p driver by any chance?

@avtc (Contributor, Author) commented Sep 30, 2025

> @avtc Are you using the tinygrad hacked p2p driver by any chance?

yep

@Qubitium (Collaborator) commented Oct 1, 2025

@avtc Btw, do you have flash attention installed? Quantization forwarding uses less VRAM if flash-attn is involved. GPT-QModel will auto-enable it by default if you have it installed.

@avtc (Contributor, Author) commented Oct 1, 2025

> @avtc Btw, do you have flash attention installed? Quantization forwarding uses less VRAM if flash-attn is involved. GPT-QModel will auto-enable it by default if you have it installed.

@Qubitium No, only the packages that were in requirements.txt. I built with:

pip install -r requirements.txt
pip install -vvv . --no-build-isolation

but since requirements.txt was removed I don't know the proper way to build now; I am extracting the requirements from pyproject.toml into requirements.txt manually.
pip install -e . fails for me.

I will install flash-attn to try.
Right now I am blocked by #1950 (comment)

@Qubitium (Collaborator) commented Oct 1, 2025

> Right now I am blocked by #1950 (comment)

The blocking crash that you saw should be fixed on main. There was a threading issue where the looper started before the model was actually ready.
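
(A minimal sketch of the kind of readiness gate that avoids such a race; the names are illustrative only, not the actual fix on main.)

import threading

model_ready = threading.Event()  # illustrative name

def load_model():
    # ... build the model and dispatch it across devices ...
    model_ready.set()  # signal that the looper may start

def run_looper():
    model_ready.wait()  # block until load_model() has finished
    # ... cache calibration inputs and quantize layer by layer ...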

Flash Attention is not a hard requirement, but it is recommended: many models support it (not all), and for those that do there is an observable reduction in VRAM usage during forwarding.

You will see GPT-QModel loading logs when it is enabled.
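
(As an illustration of that kind of auto-detection; the helper below is hypothetical and not the exact GPT-QModel code path.)

import importlib.util

def pick_attn_implementation() -> str:
    # Prefer FlashAttention-2 when the flash-attn package is importable,
    # otherwise fall back to PyTorch SDPA.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# e.g. AutoModelForCausalLM.from_pretrained(..., attn_implementation=pick_attn_implementation())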

pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.

> pip install -v -e . --no-build-isolation
> uv pip install -v -e . --no-build-isolation

@avtc (Contributor, Author) commented Oct 1, 2025

> Right now I am blocked by #1950 (comment)
>
> The blocking crash that you saw should be fixed on main. There was a threading issue where the looper started before the model was actually ready.

@Qubitium
It still fails for me on latest main (4f74537):

INFO  Calibration: Total tokens: 80                                                                                        
WARN  The average length of input_ids of calibration_dataset should be greater than 256: actual avg: 80.0.                 
Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-air-gptqmodel-clean.py", line 58, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        #auto_gc=False,
        ^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 715, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 721, in loop
    input_cache = self.cache_inputs(layers=layers,
                                    calibration_data=processor.calibration_dataset,
                                    use_cache=False)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 668, in cache_inputs
    self.gptq_model.model(**example, use_cache=use_cache)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 587, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ~~~~~~~~~~^
        input_ids=input_ids,
        ^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 510, in forward
    causal_mask = create_causal_mask(
        config=self.config,
    ...<4 lines>...
        position_ids=position_ids,
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 822, in create_causal_mask
    causal_mask = mask_interface(
        batch_size=batch_size,
    ...<7 lines>...
        config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 374, in sdpa_mask_recent_torch
    if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size):
                                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 254, in _ignore_causal_mask_sdpa
    padding_mask.all()
    ~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors

I am using the same script for GLM-4.5-Air (4-bit, 1 sample, mock_quantization=True), but it also fails for GLM-4.5 with a normal number of samples, so it is probably related to this specific model.

The error appeared after data-p2 was merged.
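
(For context, the failure itself is easy to reproduce in isolation: a tensor on the meta device has shape and dtype but no storage, so the .item() call made inside the SDPA mask helper cannot succeed. That the padding mask ends up on the meta device here, presumably via the meta-device loading path or mock_quantization, is an assumption, not something confirmed by this trace.)

import torch

mask = torch.ones(4, dtype=torch.bool, device="meta")  # meta tensor: shape/dtype only, no data
mask.all().item()  # raises RuntimeError: Tensor.item() cannot be called on meta tensors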

@avtc (Contributor, Author) commented Oct 1, 2025

@Qubitium

> pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.

> pip install -v -e . --no-build-isolation
> uv pip install -v -e . --no-build-isolation

I am using a venv.
pip install -v -e . --no-build-isolation works for an existing venv but fails for a new clean venv.
It works after installing:

pip install maturin
pip install puccinialin

@avtc (Contributor, Author) commented Oct 1, 2025

Btw, I still have to use this lock.

@Qubitium (Collaborator) commented Oct 2, 2025

> @Qubitium
>
> > pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.
> >
> > pip install -v -e . --no-build-isolation
> > uv pip install -v -e . --no-build-isolation
>
> I am using a venv. pip install -v -e . --no-build-isolation works for an existing venv but fails for a new clean venv. It works after installing:
>
> pip install maturin
> pip install puccinialin

I was able to use a clean venv and install the latest main without having to install maturin and puccinialin. Can you check whether those two are required for the GLM 4.5 Air model and not specific to GPT-QModel? If you still get clean-install errors, let me know the stacktrace.

#1964

@Qubitium (Collaborator) commented Oct 2, 2025

Closed with #1963

@Qubitium Qubitium closed this Oct 2, 2025