Allow bfloat16 computations on compatible CPUs with Intel Extension for PyTorch #3649
AngryLoki wants to merge 1 commit into Comfy-Org:master from
Conversation
simonlui
left a comment
Going to chime in here since I did significant work on the XPU side of IPEX for ComfyUI. This patch basically turns on CPU mode for IPEX, doesn't it? I have been meaning to write a patch for something like this for a while, so thanks for doing the work to enable this. I had a few comments and nudges on things that could be improved, but nothing else looks terribly wrong, and I think this will improve everyone's experience with running the project. I am not sure the bar to get that speed is low enough to make it a default option for people to try, though; IPEX does have a minimum requirement of AVX2 on the CPU in order to even work. I would also suggest changing the README to note this is available. Hopefully, when @comfyanonymous is less busy with things, he can take a look at the PR.
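(For anyone unsure whether their CPU clears that bar, a quick Linux-only check of the CPU flags shows whether AVX2 and the native AVX512 BF16 instructions are present. This is just an illustrative sketch, not part of the patch:)

```python
# Rough capability check on Linux: read the flags line from /proc/cpuinfo.
# "avx2" is the minimum IPEX requires; "avx512_bf16" means native bf16 matmul/conv kernels.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

print("AVX2        :", "avx2" in flags)
print("AVX512 BF16 :", "avx512_bf16" in flags)
```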
While testing with Flux, I discovered a few interesting things:
So I reworked the patch so that there is no requirement for ipex-for-cpu anymore. After checking with flux-schnell (which is already distributed in bf16 format):
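(As a quick way to verify that a checkpoint such as flux-schnell really ships its weights in bf16, one can inspect the tensor dtypes in the safetensors file. The path below is a placeholder, and loading every tensor is slow for a large file, so treat this only as a sketch:)

```python
from collections import Counter
from safetensors import safe_open

# Placeholder path; point this at the actual checkpoint file.
path = "flux1-schnell.safetensors"

with safe_open(path, framework="pt", device="cpu") as f:
    # Count how many tensors are stored in each dtype.
    dtypes = Counter(str(f.get_tensor(key).dtype) for key in f.keys())

print(dtypes)  # mostly torch.bfloat16 if the weights are shipped in bf16
```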
Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d operations. With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs. There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default with the new `--use-cpu-bf16=auto` option. It can be disabled with `--use-cpu-bf16=no`. Signed-off-by: Sv. Lockal <lockalsash@gmail.com>
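(For readers unfamiliar with how bf16 inference on CPU works in plain PyTorch, the sketch below is not the patch itself, just a minimal illustration of the mechanism it relies on: autocast on the CPU device routes conv/matmul through bf16 kernels where the hardware supports them. The toy model is a stand-in for the UNET.)

```python
import torch

# Toy stand-in for a UNET block; the real patch wires bf16 into ComfyUI's model management instead.
model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 64, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv2d(64, 4, 3, padding=1),
).eval()

x = torch.randn(1, 4, 64, 64)

# bf16 autocast on CPU: conv/matmul run in bfloat16, reductions stay in fp32.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)

print(out.dtype)  # torch.bfloat16
```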
Rebased, removed the section from the README... (there is not too much information about CPU in the README anyway). @comfyanonymous, I understand that CPU is low priority, but the idea is still there, and here are some results with SDXL:
I.e. more than 2 times faster.

[1] Without patch, fp32, launched with
[2] Without patch, fp32, launched with
[3] With patch, bf16 on AMD Ryzen 9 7950X3D, launched with

For SDXL there are some differences in results (bigger than SD1.5), but probably not critical:


Modern CPUs have native AVX512 BF16 instructions, which significantly improve matmul and conv2d operations.
With bfloat16 instructions, UNET steps are 40-50% faster on both AMD and Intel CPUs.
There are minor visible changes with bf16, but no avalanche effects, so this feature is enabled by default with the new `--use-cpu-bf16=auto` option. It can be disabled with `--use-cpu-bf16=no`.

With the following command (note: ComfyUI never mentions this, but setting the correct environment variables is highly important, see this page), the KSampler node is almost 2 times faster (memory usage is proportionally smaller as well):

- `--use-cpu-bf16=no` - 1.68s/it
- `--use-cpu-bf16=auto` - 1.22it/s
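(To get a rough feel for the bf16 speedup outside of ComfyUI, a standalone matmul micro-benchmark like the one below can be used. Timings vary a lot by CPU, and only machines with AVX512 BF16 should show a large gap; this is only an illustration, not the benchmark behind the numbers above.)

```python
import time
import torch

def bench(dtype, n=2048, iters=20):
    # Square matmul timed on the CPU in the given dtype.
    a = torch.randn(n, n, dtype=dtype)
    b = torch.randn(n, n, dtype=dtype)
    for _ in range(3):                 # warm-up
        torch.mm(a, b)
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.mm(a, b)
    return (time.perf_counter() - t0) / iters

print(f"fp32 matmul: {bench(torch.float32) * 1e3:7.1f} ms")
print(f"bf16 matmul: {bench(torch.bfloat16) * 1e3:7.1f} ms")
```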