fix `Q.to` on multi-GPU GPTQ when processing is fast with many experts and GPUs #1774
This fixes `Q.to` on multi-GPU GPTQ when processing is fast and there are many experts and GPUs (for example with `mock_quantization=True` on 8 GPUs with GLM-4.5-Air). The error originally thrown is `torch.AcceleratorError: CUDA error: invalid argument`. For me it is thrown after the first layer with experts has finished processing.
Retrying `Q.to` fixes it.

In the original code, `Q.type_as` not only changes the dtype but also moves the tensor to `weight.data.device`; that device-moving logic is preserved.
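For context, a minimal illustration (not code from this PR) of why both `device` and `dtype` must be passed when replacing `type_as` with an explicit `.to(...)`:

```python
# Illustration only: `type_as` matches both the dtype and the device of the
# reference tensor, so an explicit `.to(...)` replacement must pass both
# `device` and `dtype` to preserve the device move.
import torch

weight = torch.nn.Parameter(torch.zeros(4, 4, dtype=torch.float16))
Q = torch.randn(4, 4)  # float32, possibly on a different device in practice

q_type_as = Q.type_as(weight.data)  # -> float16, on weight.data.device
q_explicit = Q.to(device=weight.data.device, dtype=weight.data.dtype)  # equivalent

assert q_type_as.dtype == q_explicit.dtype
assert q_type_as.device == q_explicit.device
```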
It also seems that `wq = wq.to(device=DEVICE_0, non_blocking=False)` is redundant, so it was removed.

===
`torch_empty_cache` is redundant as well, as the retry works without it.
I have tried `torch.sync` and `torch.accelerator.synchronize`, but only a retry wrapped in try/except works.
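Below is a minimal sketch of the retry-with-try/except approach described above; the helper name `move_q_to_device` and the fallback from `torch.AcceleratorError` to `RuntimeError` are illustrative assumptions, not the exact code in this PR.

```python
# Minimal sketch of the retry-on-failure approach. `torch.AcceleratorError`
# only exists in recent PyTorch releases, so fall back to RuntimeError.
import torch

_TRANSFER_ERROR = getattr(torch, "AcceleratorError", RuntimeError)


def move_q_to_device(Q: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Cast Q to weight's dtype and move it to weight.data.device, retrying
    once if the first transfer fails with a transient CUDA error."""
    device = weight.data.device
    dtype = weight.data.dtype
    try:
        # Same effect as the original `Q.type_as(weight.data)`: change dtype
        # and move to weight.data.device in one call.
        return Q.to(device=device, dtype=dtype)
    except _TRANSFER_ERROR:
        # On fast multi-expert / multi-GPU runs the first attempt can fail
        # with "CUDA error: invalid argument"; a plain retry succeeds.
        return Q.to(device=device, dtype=dtype)
```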