
make conds use non-blocking transfers #7152

Open — drhead wants to merge 5 commits into Comfy-Org:master from drhead:patch-3

Conversation

drhead (Contributor) commented Mar 9, 2025

This change makes conds use non-blocking transfers, avoiding unnecessary syncs. It should improve speed, as long as other things are not causing syncs between steps. It is also set up to use existing functions from the model management class, so non-blocking transfers are not used on unsupported platforms.
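The idea can be sketched as follows. This is an illustrative standalone version, not ComfyUI's actual code: the real gating would come from the model management helpers the PR mentions, and the CUDA check here is a stand-in assumption.

```python
import torch

def cond_to_device(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Move a conditioning tensor to `device`, using a non-blocking copy
    where the platform is assumed to support it (sketch only)."""
    # Assumption: treat CUDA as the platform that supports non-blocking
    # transfers; ComfyUI's model management would make this decision.
    non_blocking = device.type == "cuda"
    if non_blocking and tensor.device.type == "cpu" and torch.cuda.is_available():
        # Non-blocking host-to-device copies only overlap with compute
        # when the source is in pinned (page-locked) host memory.
        tensor = tensor.pin_memory()
    return tensor.to(device, non_blocking=non_blocking)

# CPU-only demo: the non_blocking=False path, a plain synchronous copy.
x = torch.randn(2, 77, 768)
y = cond_to_device(x, torch.device("cpu"))
```

With `non_blocking=False` the call behaves exactly like an ordinary `.to(device)`, so gating it per-platform is a safe fallback.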

@drhead drhead requested a review from comfyanonymous as a code owner March 9, 2025 23:58
comfyanonymous (Member) commented

This needs proper testing; there are sometimes weird side effects when enabling non-blocking transfers like this.
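For context, the classic side effect being warned about is reading a tensor before an asynchronous copy has finished. A hedged sketch of the safe pattern (the sync is only required on the device-to-host direction; helper names here are illustrative):

```python
import torch

def d2h_copy(src: torch.Tensor) -> torch.Tensor:
    """Copy `src` to host memory with non_blocking=True, then sync
    before the host reads it. Without the sync, host code can observe
    stale or garbage values in `host` when `src` lives on a GPU."""
    pin = torch.cuda.is_available()  # pinned memory needs a CUDA runtime
    host = torch.empty(src.shape, dtype=src.dtype, device="cpu",
                       pin_memory=pin)
    host.copy_(src, non_blocking=True)
    if src.is_cuda:
        # Wait for the async copy to land before returning to host code.
        torch.cuda.synchronize()
    return host

x = torch.arange(6, dtype=torch.float32)
y = d2h_copy(x)
```

Host-to-device copies from pinned memory, as in this PR, avoid that hazard because subsequent kernels on the same stream are ordered after the copy.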

drhead (Contributor, author) commented Mar 19, 2025

At least as far as my own testing goes, I have been using this change on my local copy for months at this point, and it works completely fine on CUDA. I assume that the existing code for gating non-blocking transfers covers most of the known problematic backends, but if needed this could be gated to CUDA only.

asagi4 (Contributor) commented Mar 31, 2025

I'm running this on my 7900 XTX (after force-enabling non-blocking support, since it seems to be globally disabled) and so far it works fine here too. I don't know whether it did anything for performance, though.

drhead (Contributor, author) commented May 23, 2025

Changes here need more evaluation before merging: after removing all blocking calls on a path, I am noticing that performance is sometimes inexplicably slower, particularly at smaller image sizes (768x768, for example), while 1024x1024 shows a measurable performance boost. Comparing profiles with the change on and off shows no obvious part being slowed down, but I would want to figure it out before this gets merged.

drhead (Contributor, author) commented May 25, 2025

Changed it to copy the conds to the device before applying repeat, and I also changed memory pinning so that it should only happen once, avoiding unnecessary copies every step (pinning is still attempted every time, but it becomes a no-op if the tensor is already pinned). Copying the tensor to the device first means it is smaller when copied over, and should also mean less memory usage, because it'll probably be created as a view. This is probably as good as things will get, short of keeping the conds on the GPU once transferred and only re-uploading them when they change.
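The ordering described above can be sketched like this. It is a simplified standalone illustration, not ComfyUI's actual code; the function name and the batch-dimension layout are assumptions.

```python
import torch

def prepare_cond(cond: torch.Tensor, device: torch.device,
                 batch_size: int) -> torch.Tensor:
    """Pin once, transfer the small un-repeated cond, then repeat
    on-device (sketch; real conds may have other shapes/layouts)."""
    if cond.device.type == "cpu" and torch.cuda.is_available():
        # Tensor.pin_memory() returns the tensor itself if it is
        # already pinned, so attempting it every step is cheap.
        cond = cond.pin_memory()
    non_blocking = device.type == "cuda"  # assumed platform gating
    cond = cond.to(device, non_blocking=non_blocking)
    # Repeating after the transfer means the host-to-device copy moves
    # only 1/batch_size of the data. Where broadcasting is acceptable,
    # expand() would avoid materializing the repeats entirely.
    return cond.repeat(batch_size, 1, 1)

c = torch.randn(1, 77, 768)            # a single un-batched cond
out = prepare_cond(c, torch.device("cpu"), 4)
```

Transferring before repeating is what keeps the copy small; whether the repeated result is materialized or view-like is a separate choice (`repeat` copies, `expand` does not).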

comfy-pr-bot (Member) commented

Test Evidence Check

⚠️ Warning: Test Explanation Missing

If this PR modifies behavior that requires testing, a test explanation is required. PRs lacking applicable test explanations may not be reviewed until added. Please add test explanations to ensure code quality and prevent regressions.

⚠️ Warning: Visual Documentation Missing

If this PR changes user-facing behavior, visual proof (screen recording or screenshot) is required. PRs without applicable visual documentation may not be reviewed until provided.

You can add it by:

  • GitHub: Drag & drop media directly into the PR description
  • YouTube: Include a link to a short demo


4 participants