[RFC] Unified quantization backend for x86 CPU platforms #83888
Comments
Hi @jerryzh168, please review. Thanks!
Looks good to me. Will post in internal groups to have more people review the doc.
@Xia-Weiwen are there any results comparing the numerics of the model with the unified backend vs. the fbgemm-only backend? This would be useful to assess any internal impact on numerics for customers who are currently using fbgemm in production.
Another question is around the backend selection. The machine where quantization is done is not always the same machine that runs inference. Thoughts on how this could be handled? Are we sure we have enough data at quantization time to do the backend selection? Do we need to relax this constraint?
I think the selection is done every time we prepack the weight, so we will select the backend again when we load the model from disk before inference. Does that work in real use cases?
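To make that round trip concrete, here is a minimal eager-mode sketch (an editor's illustration, not from the thread; it assumes PyTorch >= 1.13 built with the 'x86' engine available). Loading the state dict re-packs the weights, so the FBGEMM/ONEDNN selection re-runs on the inference machine:

```python
import torch
import torch.ao.quantization as tq

torch.backends.quantized.engine = 'x86'

def build_quantized():
    m = torch.nn.Sequential(tq.QuantStub(), torch.nn.Linear(8, 4), tq.DeQuantStub()).eval()
    m.qconfig = tq.get_default_qconfig('x86')
    m = tq.prepare(m)
    m(torch.randn(2, 8))          # calibration
    return tq.convert(m)          # weights are prepacked here (first selection)

qmodel = build_quantized()
torch.save(qmodel.state_dict(), 'qmodel.pt')

# On the inference machine: rebuilding and loading re-packs the weights,
# so the backend selection happens again with that machine's hardware info.
qmodel2 = build_quantized()
qmodel2.load_state_dict(torch.load('qmodel.pt'))
```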
Thanks for the question. We will provide some results later.
Good question. We will figure out a solution. Hi @supriyar, we decided to do the selection during weight prepacking, so it is done again when the model is loaded for inference.
@supriyar FBGEMM and oneDNN backends should have the same numerics by design. In PyTorch UT, the oneDNN backend also applies the same result-checking logic as FBGEMM. Meanwhile, we can provide more end-to-end model accuracy comparisons, as @Xia-Weiwen mentioned.
@jerryzh168 @vkuzo @supriyar A related question is what the default `reduce_range` value for the new backend should be.
Should we modify the language of this RFC to make this clearer? "done at quantization time" sounds like "done during prepare/convert calls", which is different from "done during weight prepacking". |
IMO, it's better for the default to be numerically correct and potentially slow compared to potentially numerically incorrect and fast. This would point to keeping `reduce_range=True` as the default.
Yeah, agreed. I think the default should be True so that it is always correct, and we can provide documentation for users who know that they are using CPUs with VNNI support and want slightly better accuracy.
Thanks for the suggestions. Makes sense to me. BTW, I guess @vkuzo wanted to say "potentially worse accuracy", right?
Thanks for the suggestion. Updated.
Yes, you are right, my mistake.
Hi @supriyar. We have run torchvision models with FBGEMM and X86 for accuracy comparison. Results show that they have the same accuracy. For details, please refer to the attached worksheet:
## Description

Implement unified quantization backend 'X86' for x86 platforms. It combines the advantages of FBGEMM and ONEDNN. It selects kernels during weight prepacking and hides the details from end users. It will be the default backend in place of FBGEMM. For details, please refer to this RFC: [RFC] Unified quantization backend for x86 CPU platforms (#83888)

## Validation

**Correctness**: covered by UT.

**Accuracy**: running torchvision models on ImageNet, no accuracy difference is found between FBGEMM and the unified X86 backend: [torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx](https://github.com/pytorch/pytorch/files/9598114/torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx)

**Performance**: depends on #84470, which improves performance. For early PoC results, please refer to https://github.com/pytorch/pytorch/files/9399202/unified_qengine_poc_performance_bechmark.xlsx

With the two PRs combined, we collected data on an Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz. Method: run multiple instances with 4 cores per instance on a whole socket, using JeMalloc and Intel OMP.

Models/throughput | fbgemm | x86 | improvement
-- | -- | -- | --
wide_resnet101_2 | 173.5675 | 241.815 | 39.32%
resnext101_32x8d | 174.365 | 339.8175 | 94.89%
resnet50 | 573.155 | 1174.14 | 104.86%
vgg19_bn | 260.335 | 337.92 | 29.80%
vgg19 | 257.935 | 333.265 | 29.21%
inception_v3 | 601.1175 | 1309.33 | 117.82%
densenet161 | 296.645 | 435.5625 | 46.83%
mnasnet1_0 | 1216.7 | 4057.515 | 233.49%
squeezenet1_0 | 1220.085 | 5153.3875 | 322.38%
alexnet | 2294.91 | 2624.6375 | 14.37%
fbnetc_100 | 976.2825 | 3110.1825 | 218.57%
shufflenet_v2_x0_5 | 1555.76 | 3026.125 | 94.51%
spnasnet_100 | 1059.065 | 3502.0975 | 230.68%
pytorch-unet | 192.76 | 246.77 | 28.02%
acgan | 257.32 | 333.7325 | 29.70%
cgan | 7790.6925 | 7803.1025 | 0.16%
sgan | 257.565 | 338.8875 | 31.57%
se_resnet50 | 492.3725 | 916.5175 | 86.14%
vggm | 300.2875 | 316.2075 | 5.30%

Environment:
- PyTorch version: 1.13.0a0+gitcdd625b
- Is debug build: False
- CUDA used to build PyTorch: None
- ROCM used to build PyTorch: N/A
- OS: Ubuntu 20.04.3 LTS (x86_64)
- GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
- Clang version: Could not collect
- CMake version: version 3.22.5
- Libc version: glibc-2.31
- Python version: 3.9.12 (main, Jun 1 2022, 11:38:51) [GCC 7.5.0] (64-bit runtime)
- Python platform: Linux-5.11.0-27-generic-x86_64-with-glibc2.31
- Is CUDA available: False
- CUDA runtime version: No CUDA
- GPU models and configuration: No CUDA
- Nvidia driver version: No CUDA
- cuDNN version: No CUDA
- HIP runtime version: N/A
- MIOpen runtime version: N/A
- Is XNNPACK available: True

Versions of relevant libraries:
- [pip3] intel-extension-for-pytorch==1.13.0+cpu
- [pip3] numpy==1.23.3
- [pip3] pytorch-widedeep==0.3.7
- [pip3] torch==1.13.0a0+git48b423b
- [pip3] torchvision==0.14.0a0+ebb68f3
- [conda] blas 1.0 mkl
- [conda] intel-extension-for-pytorch 1.13.0+cpu pypi_0 pypi
- [conda] mkl 2021.4.0 h06a4308_640
- [conda] mkl-include 2022.1.0 pypi_0 pypi
- [conda] mkl-service 2.4.0 py39h7f8727e_0
- [conda] mkl-static 2022.1.0 pypi_0 pypi
- [conda] mkl_fft 1.3.1 py39hd3c417c_0
- [conda] mkl_random 1.2.2 py39h51133e4_0
- [conda] numpy 1.23.3 pypi_0 pypi
- [conda] numpy-base 1.22.3 py39hf524024_0
- [conda] torch 1.13.0a0+git48b423b pypi_0 pypi
- [conda] torchvision 0.14.0a0+ebb68f3 pypi_0 pypi

Pull Request resolved: #84329
Approved by: https://github.com/jerryzh168
**Summary**
Make x86 the default quantization backend (qengine) for x86 CPU platforms. x86 is a unified quantization backend combining the goodness of fbgemm and onednn. For more details please see #83888.

**Test plan**
python test/test_quantization.py

Pull Request resolved: #91235
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/malfet
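As a quick check of the new default (an editor's sketch, assuming a PyTorch >= 2.0 build running on an x86 CPU):

```python
import torch

# 'x86' should appear among the supported engines, and be the default
# qengine on x86 CPU platforms in PyTorch >= 2.0.
print(torch.backends.quantized.supported_engines)
print(torch.backends.quantized.engine)  # expected: 'x86'
```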
@jerryzh168 @Xia-Weiwen What is the status of this issue? Is it done?
Yes, it is complete as a feature of PyTorch 2.0.
Ok, I'm closing this then. Feel free to reopen if you think this is a mistake. Thanks.
🚀 The feature, motivation and pitch
Description
Add a unified quantization backend ‘X86’ for x86 CPU platforms. Make it the default PyTorch quantization backend for x86 in place of FBGEMM.
It is implemented by automatic selection between FBGEMM and ONEDNN kernels during weight prepacking.
Motivation
The ONEDNN quantization backend takes advantage of features of the latest Intel® CPU products and supports more fused ops. It has shown better performance than FBGEMM in many (but not all) cases.
From an API design point of view, it would not be user-friendly to expose both the FBGEMM and ONEDNN backends to end users. Therefore, we propose a unified quantization backend named 'X86' that combines the goodness of both backends while keeping the API simple.
On the frontend, users will use the x86 backend by default on x86 platforms. On the backend, we decide for them which kernel to run and hide the details. The selection between kernels is done automatically during weight prepacking, using static information.
Thus, the X86 backend can replace FBGEMM and offer better performance.
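To make the frontend concrete, here is a minimal usage sketch (an editor's illustration, not part of the RFC; it assumes PyTorch >= 1.13 built with both FBGEMM and ONEDNN). Users only select the 'x86' qengine; the FBGEMM/ONEDNN choice happens under the hood at prepack time:

```python
import torch
import torch.ao.quantization as tq

torch.backends.quantized.engine = 'x86'      # the unified backend (default on x86)

model = torch.nn.Sequential(
    tq.QuantStub(),
    torch.nn.Conv2d(3, 16, 3),
    tq.DeQuantStub(),
).eval()
model.qconfig = tq.get_default_qconfig('x86')
prepared = tq.prepare(model)
prepared(torch.randn(1, 3, 32, 32))          # calibration
quantized = tq.convert(prepared)             # prepack runs here; kernels selected
```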
Design philosophy
Auto kernel selection between FBGEMM and ONEDNN by simple heuristics, without runtime info.
Keep the existing `QEngine` and `backend_config` API.

Proposed Heuristics
Rules for auto selection between FBGEMM and ONEDNN:
For the unified backend, selection occurs during weight prepacking, when `quantized::conv_prepack` or `quantized::linear_prepack` is called. The prepack function checks hardware info (whether VNNI is available) and op parameters, and returns a proper prepacked weight object accordingly.
For example, for linear ops, or on platforms without VNNI, FBGEMM's prepacked weight is always returned; for convolution with groups=1, ONEDNN's prepacked weight is returned. At runtime, the corresponding kernels are then called automatically.
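The following is a hedged Python paraphrase of those rules (the real logic lives in the C++ prepack functions; the names here are illustrative, not PyTorch internals):

```python
def select_backend(op_kind: str, groups: int, has_vnni: bool) -> str:
    """Illustrative version of the x86 backend's prepack-time heuristics."""
    if not has_vnni:                 # without VNNI, always use FBGEMM
        return "fbgemm"
    if op_kind == "linear":          # linear always uses FBGEMM
        return "fbgemm"
    if op_kind == "conv" and groups == 1:
        return "onednn"              # conv with groups == 1 uses ONEDNN
    return "fbgemm"                  # everything else falls back to FBGEMM

assert select_backend("conv", groups=1, has_vnni=True) == "onednn"
assert select_backend("linear", groups=1, has_vnni=True) == "fbgemm"
```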
We have completed the implementation and run common models for benchmarking. The following table lists the throughput speedup of the unified x86 backend vs. pure FBGEMM:
(Table updated on Feb 21, 2023)
Note:
unified_qengine_poc_performance_bechmark.xlsx
Performance data updated on Jan 20, 2023:
int8_oob_x86_vs_fbgemm_20230119.xlsx
Performance data updated on Feb 5, 2023:
int8_benchmark_x86_vs_fbgemm_20230205.xlsx
(Using PyTorch nightly build on Feb 4, 2023, installed by pip)
Performance data updated on Feb 21, 2023:
int8_benchmark_x86_vs_fbgemm_20230221.xlsx
(Using PyTorch nightly build on Feb 20, 2023, installed by pip)
About qconfig
For compatibility, the new backend will use `reduce_range=True` to align with FBGEMM. However, for accuracy, we hope to change it to `reduce_range=False` in the future.
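For illustration (an editor's sketch, not from the RFC): the default qconfig's activation observer carries this flag, so it can be inspected directly.

```python
import torch.ao.quantization as tq

qconfig = tq.get_default_qconfig('x86')
observer = qconfig.activation()      # instantiate the activation observer
print(observer.reduce_range)         # True for now, per the compatibility note above
```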
Accuracy
We have run torchvision models to compare accuracy. Results show that the FBGEMM and X86 backends give the same accuracy. For details, please see the worksheet:
torchvision_accuracy_comparison_fbgemm_vs_x86.xlsx
Plans
Original plans
Current status
Alternatives
N/A
Additional context
Example of implementing conv_prepack for the unified X86 quantization backend:
The prepacking function returns a prepacked weight object that belongs to either FBGEMM or ONEDNN. The call to `prepacked_weight->run` will then automatically run the correct kernel. The original code can be found for reference in pytorch/aten/src/ATen/native/quantized/cpu/qconv_prepack.cpp (line 482 at commit b29a074).
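Since the original snippet is not reproduced here, below is a hedged Python paraphrase of the pattern (the actual implementation is C++ in ATen; all class and function names are illustrative stand-ins):

```python
class FbgemmPackedConvWeight:
    """Stand-in for FBGEMM's prepacked conv weight object."""
    def __init__(self, weight):
        self.weight = weight
    def run(self, x):
        return f"FBGEMM conv kernel on {x}"

class OnednnPackedConvWeight:
    """Stand-in for ONEDNN's prepacked conv weight object."""
    def __init__(self, weight):
        self.weight = weight
    def run(self, x):
        return f"ONEDNN conv kernel on {x}"

def conv_prepack(weight, groups: int, has_vnni: bool):
    # Return either backend's prepacked weight; later calls to
    # prepacked_weight.run(...) dispatch to the matching kernel.
    if has_vnni and groups == 1:
        return OnednnPackedConvWeight(weight)
    return FbgemmPackedConvWeight(weight)

prepacked = conv_prepack(weight="W", groups=1, has_vnni=True)
print(prepacked.run("X"))   # -> ONEDNN conv kernel on X
```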
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @leslie-fang-intel