Conversation

@avtc avtc commented Sep 9, 2025

This fixes Q.to on multi-GPU GPTQ when processing proceeds quickly and the model has many experts spread across many GPUs (for example, with mock_quantization=True on 8 GPUs with GLM-4.5-Air).

The error originally thrown:
torch.AcceleratorError: CUDA error: invalid argument
For me it is thrown after the first layer with experts finishes processing.

Retrying Q.to fixes it.

In the original code, Q.type_as not only changes the dtype but also moves the tensor to weight.data.device.
The device-move logic is preserved.
The wq = wq.to(device=DEVICE_0, non_blocking=False) call appears to be redundant, so it was removed.
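A minimal sketch of the explicit cast, with illustrative stand-in tensors rather than the actual code from the PR:

```python
import torch

# Illustrative stand-ins for the tensors named above.
weight = torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.float16))
Q = torch.randn(4, 4, dtype=torch.float32)

# Equivalent in effect to `Q.type_as(weight.data)` here, but with the target
# dtype and device spelled out explicitly.
Q = Q.to(dtype=weight.data.dtype, device=weight.data.device)
```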

===
torch_empty_cache is redundant as well, since the retry works without it.
I have tried torch.sync and torch.accelerator.synchronize, but only the retry with try/except works.
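A minimal sketch of the retry approach, using a hypothetical helper `move_with_retry` (the actual change retries the `.to()` call inline in the quantization loop):

```python
import torch

# `AcceleratorError` is the exception type shown in the traceback above (newer
# torch releases); older releases surface CUDA errors as RuntimeError.
_CUDA_ERR = getattr(torch, "AcceleratorError", RuntimeError)

def move_with_retry(q: torch.Tensor, weight: torch.Tensor, retries: int = 2) -> torch.Tensor:
    # Hypothetical helper: cast `q` to the weight's dtype and move it to the
    # weight's device, retrying if CUDA reports a spurious "invalid argument".
    last_err = None
    for _ in range(1 + retries):
        try:
            return q.to(dtype=weight.dtype, device=weight.device, non_blocking=False)
        except _CUDA_ERR as err:
            last_err = err  # intermittent on fast multi-GPU runs; a retry recovers
    raise last_err
```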

Qubitium commented Sep 9, 2025

@avtc Good catch! Looks like you either found a torch/CUDA sync bug, or GPTQModel is not safely locking the CUDA ops in our GIL=0 setup. I have a feeling it's the latter. Merging for now and logging this as a future TODO.

@Qubitium Qubitium merged commit 64b3901 into ModelCloud:main Sep 9, 2025
1 check passed