
Fix OOM regression in _apply() for quantized models during inference#13372

Merged
Kosinkadink merged 2 commits into Comfy-Org:master from jkyamog:master on Apr 15, 2026

Conversation

@jkyamog
Contributor

@jkyamog jkyamog commented Apr 12, 2026

Skip unnecessary clone of inference-mode tensors when already inside torch.inference_mode(), matching the existing guard in set_attr_param. The unconditional clone introduced in 20561aa caused transient VRAM doubling during model movement for FP8/quantized models.

I don't really know PyTorch, but I hit OOM after upgrading to 0.18.2, which I eventually traced down, using git bisect, to commit 20561aa. I used the qwen_image_edit_2511_fp8mixed model on the stock Image Edit (Qwen-Image 2511) template. If I run the workflow more than once I get OOM. I then asked Claude Code to help me debug and fix this regression. With this small patch, my OOM issue on my 5090 is fixed.

Skip unnecessary clone of inference-mode tensors when already inside
torch.inference_mode(), matching the existing guard in set_attr_param.
The unconditional clone introduced in 20561aa caused transient VRAM
doubling during model movement for FP8/quantized models.
@coderabbitai

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 31283d2 and b7872e2.

📒 Files selected for processing (1)
  • comfy/ops.py

📝 Walkthrough

The change modifies the parameter cloning behavior in the mixed precision operations module. Previously, parameters were cloned when p.is_inference() returned true. Now, cloning only occurs when both conditions are met: PyTorch's inference mode is not globally enabled AND p.is_inference() returns true. This adds a guard condition based on the global PyTorch inference-mode state before the parameter cloning operation.
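The guard described above can be sketched as a small predicate. This is a hypothetical standalone helper for illustration only; the actual patch inlines the condition in `_apply()` in comfy/ops.py, where the two flags come from `torch.is_inference_mode_enabled()` and `p.is_inference()`:

```python
# Hypothetical sketch of the new guard condition in _apply().
# In the real patch the two flags come from torch.is_inference_mode_enabled()
# and p.is_inference(); here they are plain booleans so the logic is explicit.

def should_clone(inference_mode_enabled: bool, param_is_inference: bool) -> bool:
    """Clone an inference-mode parameter only when we are NOT already
    inside torch.inference_mode()."""
    return (not inference_mode_enabled) and param_is_inference

# Before the fix, the clone happened whenever param_is_inference was True,
# so moving an FP8/quantized model while inside torch.inference_mode()
# made a transient full copy of every parameter (the VRAM doubling).
```

With this predicate, running a workflow under `torch.inference_mode()` takes the no-clone path, matching the existing guard in `set_attr_param`.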

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title clearly and specifically describes the main change: fixing an OOM regression in the _apply() method for quantized models during inference.
  • Description check ✅ Passed — The description is directly related to the changeset, explaining the OOM regression, the root cause (the unconditional clone from commit 20561aa), the fix applied, and the specific use case that triggered the issue.


@cdmusic2019

Thank you, that helped me.

@rattus128
Contributor

I have a reproducer of this, but it's a pretty corner case. The change LGTM, so approving. Thanks for figuring this out.

FYI, this should go away once dynamic-vram mode is working, as those module-level loads aren't a thing in dynamic mode; I reproduced it in non-dynamic mode. Are you running --gpu-only? I can't make it fit with --gpu-only on my 5090, but maybe your workflow squeezes in better than mine, as it should be close.

Curious: what's your memory mode?

Here is my test data to confirm your fix:

Example test conditions:

  • Linux, 5090
  • --disable-dynamic-vram
  • qwen-image-edit-2506 template with 2011 fp8mixed model
  • Execute with one input image (2nd bypassed), then execute again with the 2nd reference


Before:

Requested to load QwenImage
Unloaded partially: 1033.82 MB freed, 6876.46 MB remains loaded, 85.75 MB buffer reserved, lowvram patches: 0
loaded completely; 22815.39 MB usable, 19581.95 MB loaded, full load: True
100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Requested to load WanVAE
Unloaded partially: 257.39 MB freed, 6619.07 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 268.52 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 9.71 seconds
got prompt
!!! Exception during processing !!! Allocation on device 

...

 File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 1062, in partially_load
    raise e
  File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 1059, in partially_load
    self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/home/rattus/ComfyUI/comfy/model_patcher.py", line 865, in load
    x[2].to(device_to)
  File "/home/rattus/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1371, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ops.py", line 1155, in _apply
    p = p.clone()
        ^^^^^^^^^
  File "/home/rattus/venv/lib/python3.12/site-packages/comfy_kitchen/tensor/base.py", line 340, in __torch_dispatch__
    return handler(qt, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/venv/lib/python3.12/site-packages/comfy_kitchen/tensor/base.py", line 400, in _handle_clone
    return qt._copy_with(qdata=qt._qdata.clone())
                               ^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device 

...

After:

Requested to load QwenImage
Unloaded partially: 1033.82 MB freed, 6876.46 MB remains loaded, 85.75 MB buffer reserved, lowvram patches: 0
loaded completely; 22815.39 MB usable, 19581.95 MB loaded, full load: True
100%|██████████| 4/4 [00:03<00:00,  1.02it/s]
Requested to load WanVAE
Unloaded partially: 257.39 MB freed, 6619.07 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 268.52 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 9.58 seconds
got prompt
loaded completely; 10226.37 MB usable, 7910.28 MB loaded, full load: True
Unloaded partially: 193.33 MB freed, 7716.95 MB remains loaded, 29.24 MB buffer reserved, lowvram patches: 0
100%|██████████| 4/4 [00:06<00:00,  1.58s/it]
Requested to load WanVAE
Unloaded partially: 1159.16 MB freed, 6557.79 MB remains loaded, 85.78 MB buffer reserved, lowvram patches: 0
loaded completely; 329.80 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 8.00 seconds

I also tested --disable-dynamic-vram --disable-smart-memory

You can see the variation in VRAM peaks, and the OOM above suggests this variation is hard VRAM allocation and not just a PyTorch caching-allocator peak.

Before:

[Screenshot: VRAM usage before the fix (2026-04-15 08:01:59)]

After:

[Screenshot: VRAM usage after the fix (2026-04-15 08:00:23)]

@jkyamog
Contributor Author

jkyamog commented Apr 15, 2026

> I have a reproducer of this but its pretty corners. The change LGTM so approving. Thanks for figuring this out.
>
> FYI this should go away if dynamic-vram mode is working as those module level loads aren't a thing in dynamic mode. So I reprod in the non-dynamic mode. Are you going for --gpu-only? I can't make it fit in --gpu-only on my 5090 but maybe you are squeezing in better than me in your WF as it should be close.
>
> Curious whats your memory mode?
>
> Here is my test data to confirm your fix:
>
> Example Test conditions:
>
> Linux, 5090 --disable-dynamic-vram qwen-image-edit-2506 template with 2011 fp8mixed model. Execute with one input image (2nd bypassed), then execute again with 2nd reference.

Thanks for reproducing this. Yes, similar, though I'm using WSL (not bare Linux). I did try --disable-dynamic-vram and got the same results; that is why I tried to find the issue using git bisect.

@Kosinkadink
Member

Since we got a reproduction + good amount of testing, merging.

@Kosinkadink Kosinkadink merged commit 1de83f9 into Comfy-Org:master Apr 15, 2026
14 checks passed
