
Fix OOM regression in _apply() for quantized models during inference#13372

Merged
Kosinkadink merged 2 commits into Comfy-Org:master from jkyamog:master on Apr 15, 2026

Conversation

@jkyamog
Contributor

@jkyamog jkyamog commented Apr 12, 2026

Skip unnecessary clone of inference-mode tensors when already inside torch.inference_mode(), matching the existing guard in set_attr_param. The unconditional clone introduced in 20561aa caused transient VRAM doubling during model movement for FP8/quantized models.

I don't really know PyTorch, but I hit OOM after upgrading to 0.18.2, which I eventually traced down, using git bisect, to commit 20561aa. I used the qwen_image_edit_2511_fp8mixed model on the stock Image Edit (Qwen-Image 2511) template. If I run the workflow more than once I get OOM. I then asked Claude Code to help me debug and fix this regression. With this small patch, my OOM issue on my 5090 is fixed.

Skip unnecessary clone of inference-mode tensors when already inside
torch.inference_mode(), matching the existing guard in set_attr_param.
The unconditional clone introduced in 20561aa caused transient VRAM
doubling during model movement for FP8/quantized models.
@coderabbitai

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 31283d2 and b7872e2.

📒 Files selected for processing (1)
  • comfy/ops.py

📝 Walkthrough

The change modifies the parameter cloning behavior in the mixed precision operations module. Previously, parameters were cloned when p.is_inference() returned true. Now, cloning only occurs when both conditions are met: PyTorch's inference mode is not globally enabled AND p.is_inference() returns true. This adds a guard condition based on the global PyTorch inference-mode state before the parameter cloning operation.
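The guard described above can be sketched as a small predicate. This is a hypothetical standalone helper for illustration only; the actual patch inlines the condition in `_apply()` in comfy/ops.py, where the two flags come from `torch.is_inference_mode_enabled()` and `p.is_inference()`:

```python
# Hypothetical sketch of the new guard condition in _apply().
# In the real patch the two flags come from torch.is_inference_mode_enabled()
# and p.is_inference(); here they are plain booleans so the logic is explicit.

def should_clone(inference_mode_enabled: bool, param_is_inference: bool) -> bool:
    """Clone an inference-mode parameter only when we are NOT already
    inside torch.inference_mode()."""
    return (not inference_mode_enabled) and param_is_inference

# Before the fix, the clone happened whenever param_is_inference was True,
# so moving an FP8/quantized model while inside torch.inference_mode()
# made a transient full copy of every parameter (the VRAM doubling).
```

With this predicate, running a workflow under `torch.inference_mode()` takes the no-clone path, matching the existing guard in `set_attr_param`.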

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title clearly and specifically describes the main change: fixing an OOM regression in the _apply() method for quantized models during inference.
  • Description check ✅ Passed — The description is directly related to the changeset, explaining the OOM regression, the root cause (the unconditional clone from commit 20561aa), the fix applied, and the specific use case that triggered the issue.


@cdmusic2019

Thank you, that helped me.

@rattus128
Contributor

I have a reproducer of this, but it's a pretty corner case. The change LGTM, so approving. Thanks for figuring this out.

FYI, this should go away once dynamic-vram mode is working, as those module-level loads aren't a thing in dynamic mode; I reproduced it in non-dynamic mode. Are you running --gpu-only? I can't make it fit with --gpu-only on my 5090, but maybe your workflow squeezes in better than mine, as it should be close.

Curious: what's your memory mode?

Here is my test data to confirm your fix:

Example test conditions:

  • Linux, 5090
  • --disable-dynamic-vram
  • qwen-image-edit-2506 template with 2011 fp8mixed model
  • Execute with one input image (2nd bypassed), then execute again with the 2nd reference


Before:

Requested to load QwenImage
Unloaded partially: 1033.82 MB freed, 6876.46 MB remains loaded, 85.75 MB buffer reserved, lowvram patches: 0
loaded completely; 22815.39 MB usable, 19581.95 MB loaded, full load: True
100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Requested to load WanVAE
Unloaded partially: 257.39 MB freed, 6619.07 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 268.52 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 9.71 seconds
got prompt
!!! Exception during processing !!! Allocation on device 

...

 File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 1062, in partially_load
    raise e
  File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 1059, in partially_load
    self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 865, in load
    x[2].to(device_to)
  File "/home/rattus/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1371, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ops.py", line 1155, in _apply
    p = p.clone()
        ^^^^^^^^^
  File "/home/rattus/venv/lib/python3.12/site-packages/comfy_kitchen/tensor/base.py", line 340, in __torch_dispatch__
    return handler(qt, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv/lib/python3.12/site-packages/comfy_kitchen/tensor/base.py", line 400, in _handle_clone
    return qt._copy_with(qdata=qt._qdata.clone())
                               ^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

...

After:

Requested to load QwenImage
Unloaded partially: 1033.82 MB freed, 6876.46 MB remains loaded, 85.75 MB buffer reserved, lowvram patches: 0
loaded completely; 22815.39 MB usable, 19581.95 MB loaded, full load: True
100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Requested to load WanVAE
Unloaded partially: 257.39 MB freed, 6619.07 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 268.52 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 9.58 seconds
got prompt
loaded completely; 10226.37 MB usable, 7910.28 MB loaded, full load: True
Unloaded partially: 193.33 MB freed, 7716.95 MB remains loaded, 29.24 MB buffer reserved, lowvram patches: 0
100%|██████████| 4/4 [00:06<00:00,  1.58s/it]
Requested to load WanVAE
Unloaded partially: 1159.16 MB freed, 6557.79 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 329.80 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 8.00 seconds

I also tested --disable-dynamic-vram --disable-smart-memory

You can see the variation in VRAM peaks, and the OOM above suggests this variation is hard VRAM allocation and not just a PyTorch caching-allocator peak.

Before:

[Screenshot: VRAM usage before the fix (2026-04-15 08:01:59)]

After:

[Screenshot: VRAM usage after the fix (2026-04-15 08:00:23)]

@jkyamog
Contributor Author

jkyamog commented Apr 15, 2026

> I have a reproducer of this but its pretty corners. The change LGTM so approving. Thanks for figuring this out.
>
> FYI this should go away if dynamic-vram mode is working as those module level loads aren't a thing in dynamic mode. So I reprod in the non-dynamic mode. Are you going for --gpu-only? I can't make it fit in --gpu-only on my 5090 but maybe you are squeezing in better than me in your WF as it should be close.
>
> Curious whats your memory mode?
>
> Here is my test data to confirm your fix:
>
> Example Test conditions:
>
> Linux, 5090 --disable-dynamic-vram qwen-image-edit-2506 template with 2011 fp8mixed model. Execute with one input image (2nd bypassed), then execute again with 2nd reference.

Thanks for reproducing this. Yes, similar, though I'm using WSL (not bare Linux). I did try --disable-dynamic-vram and got the same results; that is why I tried to find the issue using git bisect.

@Kosinkadink
Member

Since we got a reproduction + good amount of testing, merging.

@Kosinkadink Kosinkadink merged commit 1de83f9 into Comfy-Org:master Apr 15, 2026
14 checks passed
