
Translate x86_64 SSE to ppc64le VSX intrinsics #4807

Merged — 11 commits merged into Tencent:master on Jul 6, 2023

Conversation

@JeremyRand (Contributor)

Translating x86_64 SSE intrinsics to ppc64le VSX yields a large speedup on POWER9. See this article for background: https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html

Benchmarks (all done with -DNCNN_ENABLE_LTO=ON on a Talos II workstation with 2x 18-core POWER9 CPUs):

Before this PR:

loop_count = 100
num_threads = 18
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =   61.09  max =  196.51  avg =   80.10
     squeezenet_int8  min =   43.01  max =  171.22  avg =   54.94
           mobilenet  min =  111.34  max =  280.33  avg =  123.60
      mobilenet_int8  min =   69.10  max =  172.20  avg =   78.12
        mobilenet_v2  min =   64.53  max =  228.19  avg =   79.75
        mobilenet_v3  min =   53.66  max =  196.57  avg =   65.48
          shufflenet  min =   30.34  max =  154.02  avg =   39.36
       shufflenet_v2  min =   31.82  max =  104.86  avg =   35.23
             mnasnet  min =   62.93  max =  159.05  avg =   70.84
     proxylessnasnet  min =   66.05  max =  173.80  avg =   76.22
     efficientnet_b0  min =   85.06  max =  260.51  avg =   96.11
   efficientnetv2_b0  min =  118.04  max =  337.97  avg =  138.86
        regnety_400m  min =   81.64  max =  280.86  avg =   94.12
           blazeface  min =    9.24  max =   51.09  avg =   11.35
           googlenet  min =  180.48  max =  411.27  avg =  209.57
      googlenet_int8  min =  134.68  max =  304.84  avg =  156.26
            resnet18  min =  159.92  max =  388.47  avg =  199.41
       resnet18_int8  min =  131.32  max =  329.29  avg =  175.57
             alexnet  min =   50.99  max =  147.26  avg =   63.51
               vgg16  min = 1567.90  max = 2049.77  avg = 1801.80
          vgg16_int8  min = 1139.54  max = 1904.75  avg = 1397.06
            resnet50  min =  555.13  max = 1108.02  avg =  644.54
       resnet50_int8  min =  373.25  max =  812.87  avg =  455.61
      squeezenet_ssd  min =  138.37  max =  390.43  avg =  228.91
 squeezenet_ssd_int8  min =  100.01  max =  266.46  avg =  150.76
       mobilenet_ssd  min =  226.49  max =  482.65  avg =  282.94
  mobilenet_ssd_int8  min =  142.00  max =  363.09  avg =  176.94
      mobilenet_yolo  min =  565.40  max =  923.64  avg =  623.31
  mobilenetv2_yolov3  min =  238.50  max =  578.40  avg =  331.47
         yolov4-tiny  min =  405.91  max =  664.19  avg =  478.14
           nanodet_m  min =   74.71  max =  242.68  avg =   85.00
    yolo-fastest-1.1  min =   39.97  max =  158.29  avg =   52.69
      yolo-fastestv2  min =   25.37  max =   67.04  avg =   31.72
  vision_transformer  min =  410.63  max =  630.04  avg =  510.37
          FastestDet  min =   29.12  max =  128.42  avg =   32.39

With this PR applied:

loop_count = 100
num_threads = 18
powersave = 0
gpu_device = -1
cooling_down = 1
          squeezenet  min =    6.09  max =   19.67  avg =    7.89
     squeezenet_int8  min =    6.26  max =    9.30  avg =    6.76
           mobilenet  min =   12.13  max =   30.03  avg =   13.74
      mobilenet_int8  min =    8.63  max =   21.62  avg =   10.80
        mobilenet_v2  min =    8.16  max =   95.63  avg =   11.02
        mobilenet_v3  min =    7.48  max =   11.15  avg =    7.68
          shufflenet  min =    8.26  max =   10.69  avg =    8.82
       shufflenet_v2  min =    6.55  max =    9.72  avg =    7.04
             mnasnet  min =    7.87  max =   68.94  avg =   10.80
     proxylessnasnet  min =    9.07  max =  113.80  avg =   11.84
     efficientnet_b0  min =   14.23  max =  106.71  avg =   19.11
   efficientnetv2_b0  min =   17.91  max =  123.81  avg =   20.61
        regnety_400m  min =   27.20  max =  134.10  avg =   33.29
           blazeface  min =    3.51  max =    5.31  avg =    3.80
           googlenet  min =   22.16  max =  121.97  avg =   25.90
      googlenet_int8  min =   20.46  max =   58.96  avg =   23.61
            resnet18  min =   17.63  max =   50.29  avg =   20.71
       resnet18_int8  min =   12.89  max =   36.11  avg =   13.93
             alexnet  min =   14.22  max =   39.14  avg =   16.91
               vgg16  min =  112.73  max =  221.44  avg =  160.76
          vgg16_int8  min =   44.35  max =  137.12  avg =   50.38
            resnet50  min =   47.34  max =  108.54  avg =   50.78
       resnet50_int8  min =   30.51  max =   44.48  avg =   31.25
      squeezenet_ssd  min =   19.22  max =  117.90  avg =   23.39
 squeezenet_ssd_int8  min =   18.22  max =   26.81  avg =   19.09
       mobilenet_ssd  min =   24.32  max =  136.99  avg =   29.31
  mobilenet_ssd_int8  min =   18.72  max =   53.61  avg =   21.54
      mobilenet_yolo  min =   78.38  max =  214.07  avg =   93.26
  mobilenetv2_yolov3  min =   29.38  max =  138.79  avg =   42.11
         yolov4-tiny  min =   45.67  max =  137.23  avg =   62.10
           nanodet_m  min =   14.41  max =   29.52  avg =   15.16
    yolo-fastest-1.1  min =   10.85  max =   13.64  avg =   11.00
      yolo-fastestv2  min =    9.55  max =   14.39  avg =   10.05
  vision_transformer  min =  396.60  max =  598.17  avg =  446.76
          FastestDet  min =    9.57  max =   14.16  avg =   10.11

(I think this definitely takes the cake for "most speedup per line of code" of any patch I've written. :) )

@tencent-adm

CLA assistant check
Thank you for your submission; we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Jeremy Rand does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

@codecov-commenter commented Jun 16, 2023

Codecov Report

Merging #4807 (ad5bf0e) into master (4b97730) will decrease coverage by 5.15%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master    #4807       +/-   ##
===========================================
- Coverage   94.90%   89.75%    -5.15%     
===========================================
  Files         779      309      -470     
  Lines      223166    84266   -138900     
===========================================
- Hits       211795    75637   -136158     
+ Misses      11371     8629     -2742     

see 648 files with indirect coverage changes

@JeremyRand (Contributor, Author)

The GCC build failure on CI is interesting. Maybe it's an artifact of an older GCC version than the one I tested with? I'm curious what you'd recommend to avoid this; I could test whether that function is available during the cmake step, and only enable SSE-to-VSX translation if it is. Let me know if that's a good approach or if you'd prefer some other workaround.

@JeremyRand (Contributor, Author)

It looks like VSX translation of _mm_packus_epi32 was added in GCC v12.1.0, while the CI job uses Ubuntu 20.04, which packages GCC 10. That explains the failure. _mm_packus_epi32 is part of SSE4.1, so I think I can simply disable SSE4.1 when the compiler is too old and leave the other optimizations in place. I'll see if I can push a fix in the next few days.
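A configure-time probe along those lines might look like this (a sketch only; the `NCNN_COMPILER_SUPPORT_SSE41` variable name is hypothetical, not an actual ncnn option):

```cmake
# Sketch: check whether the compiler's compatibility headers provide
# _mm_packus_epi32 (SSE4.1) before enabling the SSE4.1 code paths.
include(CheckCXXSourceCompiles)
set(CMAKE_REQUIRED_FLAGS "-DNO_WARN_X86_INTRINSICS")
check_cxx_source_compiles("
    #include <smmintrin.h>
    int main() { __m128i v = _mm_setzero_si128(); v = _mm_packus_epi32(v, v); return 0; }
" NCNN_COMPILER_SUPPORT_SSE41)
if(NOT NCNN_COMPILER_SUPPORT_SSE41)
    message(STATUS "Compiler lacks _mm_packus_epi32; disabling SSE4.1 paths")
endif()
```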

(Feel free to review the rest of this PR in parallel though.)

@JeremyRand (Contributor, Author)

The linux-aarch64 CI failures look unrelated to this PR, if I'm not mistaken.

@JeremyRand (Contributor, Author)

 14/110 Test  #15: test_binaryop_3 ..................***Failed   12.28 sec
value not match  at c:7 d:2 h:3 w:1    expect 3.141593 but got -3.141593
output blob 0 not match
test_layer_cpu failed
test_layer BinaryOp failed use_packing_layout=0 use_fp16_packed=0 use_fp16_storage=0 use_fp16_arithmetic=0 use_shader_pack8=0 use_bf16_storage=0 use_image_storage=0 use_sgemm_convolution=1 use_winograd_convolution=1
test_binaryop failed a.dims=4 a=(2 7 3 31) b.dims=4 b=(2 7 3 31) op_type=11
CMake Error at /home/runner/work/ncnn/ncnn/cmake/run_test.cmake:4 (message):
  Test failed with return value '1'

Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.

@JeremyRand
Copy link
Contributor Author

> Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.

op_type=11 is Operation_RATAN2. The atan2 function returns an angle, and PI radians and -PI radians describe the same direction. So the VSX behavior sounds fine, and the tests should be modified to allow it. Thoughts?

@nihui (Member) commented Jun 19, 2023

> Is this an actual bug, or just a quirk of VSX returning a different representation of the same value? I suspect the latter, but I'm not familiar enough with what that test is doing to be certain.
>
> op_type=11 is Operation_RATAN2. The atan2 function returns an angle, and PI radians and -PI radians describe the same direction. So the VSX behavior sounds fine, and the tests should be modified to allow it. Thoughts?

binaryop test fixed in 9022b71

@nihui (Member) commented Jun 21, 2023

Compiling x86-compatible intrinsics to get performance on other architectures is also common practice in WebAssembly. It is great to see similarly exciting results on the POWER architecture 👍

I noticed that you added quite a few hacks to the CMakeLists, especially the modification in ncnn_add_layer.

I think a better way is to create a dedicated cmake toolchain file, such as powerpc64le-linux-gnu-vsx.toolchain.cmake, declare CMAKE_SYSTEM_PROCESSOR as x86_64 in it to satisfy ncnn's architecture detection, and add the required global compilation flags, such as -DNO_WARN_X86_INTRINSICS -mcpu=xxx -march=xxx etc.

This is also how emsdk implements x86 intrinsics for webassembly.

The cmake build system will then automatically take the x86 path in ncnn and use the x86-optimized code.
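Such a toolchain file might look roughly like this (a sketch based on the comment above; the compiler names and -mcpu value are illustrative, not the file that was eventually merged):

```cmake
# Sketch of a powerpc64le-linux-gnu-vsx.toolchain.cmake
set(CMAKE_SYSTEM_NAME Linux)
# Report x86_64 so ncnn's architecture detection takes the x86 code paths.
set(CMAKE_SYSTEM_PROCESSOR x86_64)

set(CMAKE_C_COMPILER "powerpc64le-linux-gnu-gcc")
set(CMAKE_CXX_COMPILER "powerpc64le-linux-gnu-g++")

# Silence GCC's compatibility-header warning and target the POWER9 ISA.
set(CMAKE_C_FLAGS "-DNO_WARN_X86_INTRINSICS -mcpu=power9")
set(CMAKE_CXX_FLAGS "-DNO_WARN_X86_INTRINSICS -mcpu=power9")
```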

@JeremyRand (Contributor, Author)

Good feedback, thanks! I was not aware that similar approaches were used with WebAssembly. I'll see if I can refactor accordingly; it may take me a few days.

Commit message:

Translating x86_64 SSE to ppc64le VSX intrinsics yields a large speedup on POWER9. See this article for background:

https://www.talospace.com/2019/07/easier-power-vectorizing-for-fun-and.html
@nihui (Member) left a review comment:

please add some brief instructions about building ncnn on powerpc

docs/how-to-build/how-to-build.md and the README.md HowTo section

Files with review threads (all resolved):
- toolchains/power9le-linux-gnu-vsx.clang.toolchain.cmake
- toolchains/power9le-linux-gnu-vsx.toolchain.cmake
- CMakeLists.txt
- src/CMakeLists.txt
- .github/workflows/linux-ppc64-cpu-gcc.yml
@JeremyRand (Contributor, Author)

> please add some brief instructions about building ncnn on powerpc
>
> docs/how-to-build/how-to-build.md and the README.md HowTo section

Added some docs; let me know if anything looks wrong.

@nihui merged commit 47e0daf into Tencent:master on Jul 6, 2023
1 of 4 checks passed
@nihui (Member) commented Jul 6, 2023

Thanks for your contribution!
