[BUG] Vram on cuda:0 usage vs 4.2.5 #1977

@avtc

Description

Trying to quantize GLM-4.5-Air with gptqmodel commit hash d8f3c78 and mock_quantization=False; got an error on the first layer that contains experts (layer 1):

Quantizing mlp.experts.32.gate_proj in layer [1 of 45] ████-------------------------------------------------------------------------------------------------| 0:13:41 / 5:14:43 [2/46] 4.3%
Traceback (most recent call last):
  File "/home/ubuntu/Documents/Quantize/quantize-glm4.5-Air-gptqmodel-moe-prune-smart-4.py", line 489, in <module>
    model.quantize(
    ~~~~~~~~~~~~~~^
        calibration_dataset,
        ^^^^^^^^^^^^^^^^^^^^
        batch_size=BATCH_SIZE,
        ^^^^^^^^^^^^^^^^^^^^^^
        )
        ^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/models/base.py", line 717, in quantize
    return module_looper.loop(
           ~~~~~~~~~~~~~~~~~~^
        backend=backend,
        ^^^^^^^^^^^^^^^^
        fail_safe=self.quantize_config.fail_safe,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 850, in loop
    name, m = fut.result()
              ~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/ubuntu/.pyenv/versions/3.13.7t/lib/python3.13t/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/utils/threadx.py", line 360, in _run
    result = fn(*args, **kwargs)
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/module_looper.py", line 842, in _process_on_worker
    proc.process(module=nm)
    ~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/looper/gptq_processor.py", line 123, in process
    wq, q_scales, q_zeros, q_g_idx, duration, avg_loss, damp_percent, nsamples = g.quantize()
                                                                                 ~~~~~~~~~~^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 354, in quantize
    Hinv, damp = self.hessian_inverse(self.H)
                 ~~~~~~~~~~~~~~~~~~~~^^^^^^^^
  File "/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/git/avtc/GPTQModel/gptqmodel/quantization/gptq.py", line 257, in hessian_inverse
    H2 = torch.linalg.cholesky(H2)
RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling `cusolverDnCreate(handle)`. If you keep seeing this error, you may use `torch.backends.cuda.preferred_linalg_library()` to try linear algebra operators with other supported backends. See https://pytorch.org/docs/stable/backends.html#torch.backends.cuda.preferred_linalg_library
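As the error message itself suggests, one possible workaround (a sketch, not verified against this exact setup) is to ask PyTorch to prefer a different linear-algebra backend, e.g. magma instead of cusolver, before calling model.quantize(). The Cholesky factorization in hessian_inverse would then be routed away from the failing cusolver handle:

```python
# Hedged workaround sketch: switch PyTorch's CUDA linalg backend to magma
# before quantization so torch.linalg.cholesky avoids cusolver.
# The try/except is only so this snippet also runs where torch is absent.
try:
    import torch

    # Valid arguments are "default", "cusolver", and "magma".
    torch.backends.cuda.preferred_linalg_library("magma")
    status = "set"
except ImportError:  # torch not installed in this environment
    status = "torch unavailable"

print(status)
```

If the cusolver failure is intermittent, it may also be a VRAM-exhaustion symptom (cusolverDnCreate allocates a workspace on the device), so reducing per-device memory pressure is worth checking before blaming the backend.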

Labels

bug (Something isn't working)
