
Conversation

@avtc (Contributor) commented Sep 29, 2025

Revert the removal of the lock over Q.moveTo, to be able to run on multi-GPU.
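
For reference, a minimal sketch of the kind of guard this PR restores; the names move_lock and move_to below are illustrative only, not the actual GPT-QModel API:

import threading
import torch

# Illustrative sketch: serialize cross-device tensor moves so worker threads
# driving different GPUs cannot race each other during a transfer.
move_lock = threading.Lock()

def move_to(t: torch.Tensor, device: torch.device | str) -> torch.Tensor:
    with move_lock:  # only one thread performs a device transfer at a time
        return t.to(device)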

@Qubitium (Collaborator) commented Sep 29, 2025

@avtc

Are you still getting the Q.moveTo() asserts on main? Do you have SLI/NVLink enabled on your 3090 by any chance? I have tested main on 2-4 GPUs, so I am thinking this may be something 3090 specific: on main, the Process task work where this error happened can only run on the same device as Q, so seeing it at all is very eye opening.

@Qubitium (Collaborator)

@avtc The Optional arg issue and many other DeviceThreadPool bugs were fixed in #1948. I will double check the Q.to bug on my system with GLM 4.5 Air later today.

@avtc (Contributor, Author) commented Sep 30, 2025

> @avtc
>
> Are you still getting the Q.moveTo() asserts on main? Do you have SLI/NVLink enabled on your 3090 by any chance? I have tested main on 2-4 GPUs, so I am thinking this may be something 3090 specific: on main, the Process task work where this error happened can only run on the same device as Q, so seeing it at all is very eye opening.

@Qubitium
I am using the p2p-enabled driver, which provides something like NVLink between all cards over the PCI bus. For me the Accelerate "Invalid argument" issue reproduced on 5+ cards, but it suddenly reproduced on 4 cards as well, so it is random, and with more cards the chance is higher.

Sometimes CUDA_LAUNCH_BLOCKING=1 helps with the Accelerate "Invalid argument" error.
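
(For anyone reproducing this: the variable must be set before CUDA is initialized, e.g. in the shell before launching the script, or at the very top of it as sketched below.)

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # force synchronous kernel launches so errors surface at the real call site

import torch  # import torch only after the variable is set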

Python is 3.13.7t (free-threaded build).

@Qubitium (Collaborator)

@avtc Are you using the tinygrad hacked p2p driver by any chance?

@avtc (Contributor, Author) commented Sep 30, 2025

> @avtc Are you using the tinygrad hacked p2p driver by any chance?

yep

@Qubitium (Collaborator) commented Oct 1, 2025

@avtc Btw, do you have flash attention installed? Quantization forwarding uses less VRAM if flash-attn is involved. GPT-QModel will auto-enable it by default if you have it installed.

@avtc (Contributor, Author) commented Oct 1, 2025

> @avtc Btw, do you have flash attention installed? Quantization forwarding uses less VRAM if flash-attn is involved. GPT-QModel will auto-enable it by default if you have it installed.

@Qubitium No, only the packages that were in requirements.txt. I built with:

pip install -r requirements.txt
pip install -vvv . --no-build-isolation

but since requirements.txt was removed I don't know the proper way to build now; I am extracting the requirements from pyproject.toml into requirements.txt manually.
pip install -e . fails for me.

I will install flash-attn to try.
Right now I am blocked by #1950 (comment)

@Qubitium (Collaborator) commented Oct 1, 2025

> Right now I am blocked by #1950 (comment)

The blocking crash that you saw should be fixed on main. There was a threading issue where the looper started before the model was actually ready.
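
(A minimal sketch of the kind of readiness gate that avoids such a race; the names are illustrative only, not the actual fix on main.)

import threading

model_ready = threading.Event()  # illustrative name

def load_model():
    # ... build the model and dispatch it across devices ...
    model_ready.set()  # signal that the looper may start

def run_looper():
    model_ready.wait()  # block until load_model() has finished
    # ... cache calibration inputs and quantize layer by layer ...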

Flash Attention is not a hard requirement, but it is recommended: many models support it (not all), and for those that do there is an observable reduction in VRAM usage during forwarding.

You will see GPT-QModel loading logs when it is enabled.
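
(As an illustration of that kind of auto-detection; the helper below is hypothetical and not the exact GPT-QModel code path.)

import importlib.util

def pick_attn_implementation() -> str:
    # Prefer FlashAttention-2 when the flash-attn package is importable,
    # otherwise fall back to PyTorch SDPA.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

# e.g. AutoModelForCausalLM.from_pretrained(..., attn_implementation=pick_attn_implementation())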

pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.

> pip install -v -e . --no-build-isolation
> uv pip install -v -e . --no-build-isolation

@avtc (Contributor, Author) commented Oct 1, 2025

> Right now I am blocked by #1950 (comment)
>
> The blocking crash that you saw should be fixed on main. There was a threading issue where the looper started before the model was actually ready.

@Qubitium
It still fails for me on latest main (4f74537):

INFO  Calibration: Total tokens: 80                                                                                        
WARN  The average length of input_ids of calibration_dataset should be greater than 256: actual avg: 80.0.                 
Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-air-gptqmodel-clean.py", line 58, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        #auto_gc=False,
        ^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 715, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 721, in loop
    input_cache = self.cache_inputs(layers=layers,
                                    calibration_data=processor.calibration_dataset,
                                    use_cache=False)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 668, in cache_inputs
    self.gptq_model.model(**example, use_cache=use_cache)
    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 940, in wrapper
    output = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 587, in forward
    outputs: BaseModelOutputWithPast = self.model(
                                       ~~~~~~~~~~^
        input_ids=input_ids,
        ^^^^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/utils/generic.py", line 1064, in wrapper
    outputs = func(self, *args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/models/glm4_moe/modeling_glm4_moe.py", line 510, in forward
    causal_mask = create_causal_mask(
        config=self.config,
    ...<4 lines>...
        position_ids=position_ids,
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 822, in create_causal_mask
    causal_mask = mask_interface(
        batch_size=batch_size,
    ...<7 lines>...
        config=config,  # Pass the config as well, in case someone wants to easily have their own mask_interface
    )
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 374, in sdpa_mask_recent_torch
    if allow_is_causal_skip and _ignore_causal_mask_sdpa(padding_mask, q_length, kv_length, kv_offset, local_size):
                                ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/transformers/masking_utils.py", line 254, in _ignore_causal_mask_sdpa
    padding_mask.all()
    ~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/_meta_registrations.py", line 7457, in meta_local_scalar_dense
    raise RuntimeError("Tensor.item() cannot be called on meta tensors")
RuntimeError: Tensor.item() cannot be called on meta tensors

I am using the same script for GLM-4.5-Air (4-bit, 1 sample, mock_quantization=True), but it also fails for GLM-4.5 with a normal number of samples, so it is probably related to this specific model.

The error appeared after data-p2 was merged.
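
(For context, the failure itself is easy to reproduce in isolation: a tensor on the meta device has shape and dtype but no storage, so the .item() call made inside the SDPA mask helper cannot succeed. That the padding mask ends up on the meta device here, presumably via the meta-device loading path or mock_quantization, is an assumption, not something confirmed by this trace.)

import torch

mask = torch.ones(4, dtype=torch.bool, device="meta")  # meta tensor: shape/dtype only, no data
mask.all().item()  # raises RuntimeError: Tensor.item() cannot be called on meta tensors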

@avtc (Contributor, Author) commented Oct 1, 2025

@Qubitium

> pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.

> pip install -v -e . --no-build-isolation
> uv pip install -v -e . --no-build-isolation

I am using a venv.
pip install -v -e . --no-build-isolation works for an existing venv but fails for a new clean venv.
It works after installing:

pip install maturin
pip install puccinialin

@avtc (Contributor, Author) commented Oct 1, 2025

Btw, I still have to use this lock.

@Qubitium (Collaborator) commented Oct 2, 2025

> @Qubitium
>
> > pyproject.toml is all we have now, but the install is no different. The only difference is that there is no separate file to install just the requirements as before.
> >
> > pip install -v -e . --no-build-isolation
> > uv pip install -v -e . --no-build-isolation
>
> I am using a venv. pip install -v -e . --no-build-isolation works for an existing venv but fails for a new clean venv. It works after installing:
>
> pip install maturin
> pip install puccinialin

I was able to use a clean venv and install the latest main without having to install maturin and puccinialin. Can you check whether those two are required for the GLM 4.5 Air model and not specific to GPT-QModel? If you still get clean-install errors, let me know the stacktrace.

#1964

@Qubitium (Collaborator) commented Oct 2, 2025

Closed with #1963

@Qubitium Qubitium closed this Oct 2, 2025