xxhash: support SVE by intrinsic code #752

hzhuang1 · 2022-10-25T14:02:48Z

xxhash: add ARM SVE intrinsic implementation

Implement intrinsic code for ARM SVE. In this patch, it seems that there
is no improvement. Further optimization will be contained in the later
patch set.

SCALAR implementation (default)
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  2718,  3093,  3151,  3187,  3217,  3232,  3238,  3223,  3190,  3240,  3241,  3222,  3238,  3049,  2253,  2113,  2102,  2117,  2132
XXH32  ,  1499,  1520,  1538,  1543,  1543,  1545,  1547,  1535,  1486,  1487,  1484,  1468,  1484,  1435,  1231,  1183,  1184,  1185,  1187
XXH64  ,  2632,  2874,  3017,  3092,  3132,  3143,  3155,  3122,  2990,  2994,  2998,  3000,  3006,  2842,  2121,  1838,  1833,  1811,  1839
XXH128 ,  2435,  2872,  3024,  3121,  3187,  3219,  3234,  3222,  3221,  3225,  3225,  3227,  3227,  3034,  1962,  1789,  1866,  1871,  1874

Intrinsic SVE512 implementation
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  1903,  2313,  2469,  2568,  2631,  2663,  2683,  2662,  2673,  2678,  2679,  2678,  2676,  2584,  2236,  2173,  2173,  2163,  2183
XXH32  ,  1326,  1436,  1495,  1523,  1535,  1543,  1547,  1536,  1505,  1506,  1504,  1507,  1508,  1446,  1246,  1193,  1194,  1194,  1167
XXH64  ,  2510,  2802,  2977,  3071,  3121,  3136,  3156,  3126,  3041,  3047,  3050,  3050,  3048,  2890,  2043,  1944,  1944,  1945,  1955
XXH128 ,  1867,  2295,  2462,  2542,  2627,  2663,  2679,  2656,  2661,  2675,  2673,  2647,  2671,  2574,  2218,  2144,  2169,  2175,  2181

Signed-off-by: Haojian Zhuang haojian.zhuang@linaro.org
Signed-off-by: Devin Hussey easyaspi314@users.noreply.github.com

hzhuang1 · 2022-11-07T02:56:59Z

@Cyan4973 Could you help to share any comments on this pull request? There's only SVE intrinsic code in this pull request.

Cyan4973 · 2022-11-07T03:34:54Z

Hi @hzhuang1 ,
the SVE implementation looks good to me.

I'm essentially concerned about how it could be continuously tested.
Any chance this could run on github action using qemu ?

t-mat · 2022-11-07T04:03:36Z

Any chance this could run on github action using qemu ?

It seems qemu has some options for SVE. (Since I don't have environment for now, I just take a quick look.)

Perhaps -cpu max,sve=on,sve512=on may work.

Cyan4973 · 2022-11-07T05:05:05Z

xxhash.h

@@ -3197,12 +3209,16 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
 #     define XXH_ACC_ALIGN 16
 #  elif XXH_VECTOR == XXH_AVX512  /* avx512 */
 #     define XXH_ACC_ALIGN 64
+#  elif XXH_VECTOR == XXH_SVE   /* sve */
+#     define XXH_ACC_ALIGN 64


Quick question :

This variant seems to directly target 512-bit width.

My understanding of SVE is that it's more flexible with regards to vector width, compared to Intel's AVX.
That being said, the qemu documentation implies that there are multiple hardware implementations of SVE to expect, from 128 to 512 bits.

Will this vector extension work with all of them, or only specifically on the 512-bit one ?

Yes, you're right. SVE ranges from 128-bit to 2048-bit. Should I create a macro to calculate it for SVE?

I've configured CI to run code from SVE128 to SVE2048. All of them could work with this XXH_ACC_ALIGN.

Cyan4973 · 2022-11-07T05:16:00Z

xxhash.h

@@ -3178,6 +3184,12 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
 #  endif
 #endif

+/* __ARM_FEATURE_SVE is only supported by GCC & Clang. */
+#if (XXH_VECTOR == XXH_SVE) && !defined(__ARM_FEATURE_SVE)


Note : maybe minor, but surely worth mentioning.

This transparent fallback method
will make compilation succeed when SVE is selected, but actually not present on compilation target.

But it also means that a developer may think that the library is compiled with SVE enabled, because it's instructed to, but actually it's not and uses SCALAR instead, and the developer would be none the wiser.

I'm wondering if it would be preferable to let compilation fail instead.

Intermediate position, maybe a warning (though unfortunately, I believe #warning is not part of C90 syntax, and #pragma warning depends on compiler support. But then, so does __ARM_FEATURE_SVE).

OK. I'll update it.

hzhuang1 · 2022-11-07T06:45:25Z

Hi @hzhuang1 , the SVE implementation looks good to me.

I'm essentially concerned about how it could be continuously tested. Any chance this could run on github action using qemu ?

SVE could be tested in QEMU. I'm using QEMU to test code. And I only test performance on real platform.

But I don't know how to configure QEMU for SVE in your CI script.

hzhuang1 · 2022-11-07T06:50:36Z

Any chance this could run on github action using qemu ?

It seems qemu has some options for SVE. (Since I don't have environment for now, I just take a quick look.)

https://github.com/qemu/qemu/blob/master/docs/system/arm/cpu-features.rst#sve-cpu-properties

https://github.com/qemu/qemu/blob/master/docs/system/arm/cpu-features.rst#sve-cpu-property-examples

Perhaps -cpu max,sve=on,sve512=on may work.

Yes, you're right.

I'm using the config -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off for my test.

t-mat · 2022-11-07T08:46:43Z

I'm using the config -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off for my test.

Thanks for the info!

I'm essentially concerned about how it could be continuously tested. Any chance this could run on github action using qemu ?

But I don't know how to configure QEMU for SVE in your CI script.

You can add your aarch64 SVE test to the following lines

xxHash/.github/workflows/ci.yml

Lines 381 to 385 in 267ba94

    
               - name: ARM64 (XXH_VECTOR=[ scalar, NEON ]) 
        
                 if: ${{ matrix.name == 'ARM64' }} 
        
                 run: | 
        
                   CPPFLAGS="-DXXH_VECTOR=XXH_SCALAR" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check 
        
                   CPPFLAGS="-DXXH_VECTOR=XXH_NEON" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check

For example:

-   - name: ARM64 (XXH_VECTOR=[ scalar, NEON ])
+   - name: ARM64 (XXH_VECTOR=[ scalar, NEON, SVE ])
      if: ${{ matrix.name == 'ARM64' }}
      run: |
        CPPFLAGS="-DXXH_VECTOR=XXH_SCALAR" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check
        CPPFLAGS="-DXXH_VECTOR=XXH_NEON" LDFLAGS="-static" CC=aarch64-linux-gnu-gcc RUN_ENV=qemu-aarch64-static make clean check
+       CPPFLAGS="-DXXH_VECTOR=XXH_SVE" LDFLAGS="-static" CC=aarch64-linux-gnu-gcc RUN_ENV="qemu-aarch64-static -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off" make clean check

Note that RUN_ENV is passed as a "prefix" of the make check test commands.

xxHash/Makefile

Lines 189 to 213 in 267ba94

    
           # make check can be run with cross-compiled binaries on emulated environments (qemu user mode) 
        
           # by setting $(RUN_ENV) to the target emulation environment 
        
           .PHONY: check 
        
           check: xxhsum   ## basic tests for xxhsum CLI, set RUN_ENV for emulated environments 
        
           	# stdin 
        
           	$(RUN_ENV) ./xxhsum$(EXT) < xxhash.c 
        
           	# multiple files 
        
           	$(RUN_ENV) ./xxhsum$(EXT) xxhash.* 
        
           	# internal bench 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -bi0 
        
           	# long bench command 
        
           	$(RUN_ENV) ./xxhsum$(EXT) --benchmark-all -i0 
        
           	# bench multiple variants 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -b1,2,3 -i0 
        
           	# file bench 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -bi0 xxhash.c 
        
           	# 32-bit 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -H0 xxhash.c 
        
           	# 128-bit 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -H2 xxhash.c 
        
           	# XXH3 (enforce BSD style) 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -H3 xxhash.c | grep "XXH3" 
        
           	# request incorrect variant 
        
           	$(RUN_ENV) ./xxhsum$(EXT) -H9 xxhash.c ; test $$? -eq 1 
        
           	@printf "\n .......   checks completed successfully   ....... \n"

Implement intrinsic code for ARM SVE. In this patch, it seems that there is no improvement. Further optimization will be contained in the later patch set. SCALAR implementation (default) === benchmarking 4 hash functions === benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27) xxh3 , 2718, 3093, 3151, 3187, 3217, 3232, 3238, 3223, 3190, 3240, 3241, 3222, 3238, 3049, 2253, 2113, 2102, 2117, 2132 XXH32 , 1499, 1520, 1538, 1543, 1543, 1545, 1547, 1535, 1486, 1487, 1484, 1468, 1484, 1435, 1231, 1183, 1184, 1185, 1187 XXH64 , 2632, 2874, 3017, 3092, 3132, 3143, 3155, 3122, 2990, 2994, 2998, 3000, 3006, 2842, 2121, 1838, 1833, 1811, 1839 XXH128 , 2435, 2872, 3024, 3121, 3187, 3219, 3234, 3222, 3221, 3225, 3225, 3227, 3227, 3034, 1962, 1789, 1866, 1871, 1874 Intrinsic SVE512 implementation === benchmarking 4 hash functions === benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27) xxh3 , 1903, 2313, 2469, 2568, 2631, 2663, 2683, 2662, 2673, 2678, 2679, 2678, 2676, 2584, 2236, 2173, 2173, 2163, 2183 XXH32 , 1326, 1436, 1495, 1523, 1535, 1543, 1547, 1536, 1505, 1506, 1504, 1507, 1508, 1446, 1246, 1193, 1194, 1194, 1167 XXH64 , 2510, 2802, 2977, 3071, 3121, 3136, 3156, 3126, 3041, 3047, 3050, 3050, 3048, 2890, 2043, 1944, 1944, 1945, 1955 XXH128 , 1867, 2295, 2462, 2542, 2627, 2663, 2679, 2656, 2661, 2675, 2673, 2647, 2671, 2574, 2218, 2144, 2169, 2175, 2181 Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org> Signed-off-by: Devin Hussey <easyaspi314@users.noreply.github.com>

Since the vector size of ARM64 SVE ranges from 128-bit to 2048-bit, configure CI to run with each vector size. Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>

Cyan4973

Thanks @hzhuang1 , this looks good to me

Cyan4973 · 2022-11-10T02:03:50Z

Maybe I was too fast here : I don't see Github Action test results in the test summary.
Seems only Appveyor (Windows) are reported.

hzhuang1 force-pushed the sve_01 branch from 09b320f to e9f7740 Compare October 26, 2022 05:42

hzhuang1 mentioned this pull request Oct 27, 2022

Implement ARM SVE optimization with assembly code #751

Open

Cyan4973 reviewed Nov 7, 2022

View reviewed changes

hzhuang1 added 2 commits November 8, 2022 18:23

CI: support ARM64 SVE

dd19080

Since the vector size of ARM64 SVE ranges from 128-bit to 2048-bit, configure CI to run with each vector size. Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>

hzhuang1 force-pushed the sve_01 branch from e9f7740 to dd19080 Compare November 8, 2022 11:02

Cyan4973 approved these changes Nov 8, 2022

View reviewed changes

Cyan4973 merged commit dbacfec into Cyan4973:dev Nov 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xxhash: support SVE by intrinsic code #752

xxhash: support SVE by intrinsic code #752

hzhuang1 commented Oct 25, 2022 •

edited

hzhuang1 commented Nov 7, 2022

Cyan4973 commented Nov 7, 2022

t-mat commented Nov 7, 2022

Cyan4973 Nov 7, 2022

hzhuang1 Nov 7, 2022

hzhuang1 Nov 8, 2022

Cyan4973 Nov 7, 2022

hzhuang1 Nov 7, 2022

hzhuang1 commented Nov 7, 2022

hzhuang1 commented Nov 7, 2022

t-mat commented Nov 7, 2022

Cyan4973 left a comment

Cyan4973 commented Nov 10, 2022

xxhash: support SVE by intrinsic code #752

xxhash: support SVE by intrinsic code #752

Conversation

hzhuang1 commented Oct 25, 2022 • edited

hzhuang1 commented Nov 7, 2022

Cyan4973 commented Nov 7, 2022

t-mat commented Nov 7, 2022

Cyan4973 Nov 7, 2022

Choose a reason for hiding this comment

hzhuang1 Nov 7, 2022

Choose a reason for hiding this comment

hzhuang1 Nov 8, 2022

Choose a reason for hiding this comment

Cyan4973 Nov 7, 2022

Choose a reason for hiding this comment

hzhuang1 Nov 7, 2022

Choose a reason for hiding this comment

hzhuang1 commented Nov 7, 2022

hzhuang1 commented Nov 7, 2022

t-mat commented Nov 7, 2022

Cyan4973 left a comment

Choose a reason for hiding this comment

Cyan4973 commented Nov 10, 2022

hzhuang1 commented Oct 25, 2022 •

edited