Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649
AngryLoki wants to merge 1 commit into Comfy-Org:master from
Conversation
simonlui
left a comment
Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable this. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience with running the project. I am not sure the bar to get that speed is low enough to make it a default option for people to try, though; IPEX does have a minimum requirement of AVX2 on the CPU in order to even work. I would also suggest changing the README to note this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.
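(For anyone unsure whether their CPU clears that bar, a quick Linux-only check of the CPU flags shows whether AVX2 and the native AVX512 BF16 instructions are present. This is just an illustrative sketch, not part of the patch:)

```python
# Rough capability check on Linux: read the flags line from /proc/cpuinfo.
# "avx2" is the minimum IPEX requires; "avx512_bf16" means native bf16 matmul/conv kernels.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

print("AVX2        :", "avx2" in flags)
print("AVX512 BF16 :", "avx512_bf16" in flags)
```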
While testing with Flux, I discovered a few interesting things:
So I reworked the patch so that there is no requirement for ipex-for-cpu anymore. After checking with flux-schnell (which is already distributed in bf16 format):
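(As a quick way to verify that a checkpoint such as flux-schnell really ships its weights in bf16, one can inspect the tensor dtypes in the safetensors file. The path below is a placeholder, and loading every tensor is slow for a large file, so treat this only as a sketch:)

```python
from collections import Counter
from safetensors import safe_open

# Placeholder path; point this at the actual checkpoint file.
path = "flux1-schnell.safetensors"

with safe_open(path, framework="pt", device="cpu") as f:
    # Count how many tensors are stored in each dtype.
    dtypes = Counter(str(f.get_tensor(key).dtype) for key in f.keys())

print(dtypes)  # mostly torch.bfloat16 if the weights are shipped in bf16
```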
Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d operations. With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs. There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default with the new `--use-cpu-bf16=auto` option. It can be disabled with `--use-cpu-bf16=no`. Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
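(For readers unfamiliar with how bf16 inference on CPU works in plain PyTorch, the sketch below is not the patch itself, just a minimal illustration of the mechanism it relies on: autocast on the CPU device routes conv/matmul through bf16 kernels where the hardware supports them. The toy model is a stand-in for the UNET.)

```python
import torch

# Toy stand-in for a UNET block; the real patch wires bf16 into ComfyUI's model management instead.
model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).eval()

x = torch.randn(1, 4, 64, 64)

# bf16 autocast on CPU: conv/matmul run in bfloat16, reductions stay in fp32.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```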
Rebased, removed the section from the README... (there is not too much information about CPU in the README anyway). @comfyanonymous, I understand that CPU is low priority, but the idea is still there, and here are some results with SDXL:
I.e. more than 2 times faster.

[1] Without patch, fp32, launched with
[2] Without patch, fp32, launched with
[3] With patch, bf16 on AMD Ryzen 9 7950X3D, launched with

For SDXL there are some differences in results (bigger than SD1.5), but probably not critical:


Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d operations.
With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default with the new `--use-cpu-bf16=auto` option. It can be disabled with `--use-cpu-bf16=no`.

With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important, see this page), the KSampler node is almost 2 times faster (memory usage is proportionally smaller as well):

- `--use-cpu-bf16=no` - 1.68s/it
- `--use-cpu-bf16=auto` - 1.22it/s
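(To get a rough feel for the bf16 speedup outside of ComfyUI, a standalone matmul micro-benchmark like the one below can be used. Timings vary a lot by CPU, and only machines with AVX512 BF16 should show a large gap; this is only an illustration, not the benchmark behind the numbers above.)

```python
import time
import torch

def bench(dtype, n=2048, iters=20):
    # Square matmul timed on the CPU in the given dtype.
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    for _ in range(3):                 # warm-up
        torch.mm(a, b)
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    return (time.perf_counter() - t0) / iters

print(f"fp32 matmul: {bench(torch.float32) * 1e3:7.1f} ms")
print(f"bf16 matmul: {bench(torch.bfloat16) * 1e3:7.1f} ms")
```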