Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xxhash: support SVE by intrinsic code #752

Merged
merged 2 commits into from Nov 9, 2022
Merged

Conversation

hzhuang1
Copy link
Contributor

@hzhuang1 hzhuang1 commented Oct 25, 2022

xxhash: add ARM SVE intrinsic implementation

Implement intrinsic code for ARM SVE. In this patch, it seems that there
is no improvement. Further optimization will be contained in the later
patch set.

SCALAR implementation (default)
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  2718,  3093,  3151,  3187,  3217,  3232,  3238,  3223,  3190,  3240,  3241,  3222,  3238,  3049,  2253,  2113,  2102,  2117,  2132
XXH32  ,  1499,  1520,  1538,  1543,  1543,  1545,  1547,  1535,  1486,  1487,  1484,  1468,  1484,  1435,  1231,  1183,  1184,  1185,  1187
XXH64  ,  2632,  2874,  3017,  3092,  3132,  3143,  3155,  3122,  2990,  2994,  2998,  3000,  3006,  2842,  2121,  1838,  1833,  1811,  1839
XXH128 ,  2435,  2872,  3024,  3121,  3187,  3219,  3234,  3222,  3221,  3225,  3225,  3227,  3227,  3034,  1962,  1789,  1866,  1871,  1874

Intrinsic SVE512 implementation
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  1903,  2313,  2469,  2568,  2631,  2663,  2683,  2662,  2673,  2678,  2679,  2678,  2676,  2584,  2236,  2173,  2173,  2163,  2183
XXH32  ,  1326,  1436,  1495,  1523,  1535,  1543,  1547,  1536,  1505,  1506,  1504,  1507,  1508,  1446,  1246,  1193,  1194,  1194,  1167
XXH64  ,  2510,  2802,  2977,  3071,  3121,  3136,  3156,  3126,  3041,  3047,  3050,  3050,  3048,  2890,  2043,  1944,  1944,  1945,  1955
XXH128 ,  1867,  2295,  2462,  2542,  2627,  2663,  2679,  2656,  2661,  2675,  2673,  2647,  2671,  2574,  2218,  2144,  2169,  2175,  2181

Signed-off-by: Haojian Zhuang haojian.zhuang@linaro.org
Signed-off-by: Devin Hussey easyaspi314@users.noreply.github.com

@hzhuang1
Copy link
Contributor Author

hzhuang1 commented Nov 7, 2022

@Cyan4973 Could you help to share any comments on this pull request? There's only SVE intrinsic code in this pull request.

@Cyan4973
Copy link
Owner

Cyan4973 commented Nov 7, 2022

Hi @hzhuang1 ,
the SVE implementation looks good to me.

I'm essentially concerned about how it could be continuously tested.
Any chance this could run on github action using qemu ?

@t-mat
Copy link
Contributor

t-mat commented Nov 7, 2022

Any chance this could run on github action using qemu ?

It seems qemu has some options for SVE. (Since I don't have environment for now, I just take a quick look.)

Perhaps -cpu max,sve=on,sve512=on may work.

@@ -3197,12 +3209,16 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
# define XXH_ACC_ALIGN 16
# elif XXH_VECTOR == XXH_AVX512 /* avx512 */
# define XXH_ACC_ALIGN 64
# elif XXH_VECTOR == XXH_SVE /* sve */
# define XXH_ACC_ALIGN 64
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick question :

This variant seems to directly target 512-bit width.

My understanding of SVE is that it's more flexible with regards to vector width, compared to Intel's AVX.
That being said, the qemu documentation implies that there are multiple hardware implementations of SVE to expect, from 128 to 512 bits.

Will this vector extension work with all of them, or only specifically on the 512-bit one ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. SVE ranges from 128-bit to 2048-bit. Should I create a macro to calculate it for SVE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've configured CI to run code from SVE128 to SVE2048. All of them could work with this XXH_ACC_ALIGN.

@@ -3178,6 +3184,12 @@ enum XXH_VECTOR_TYPE /* fake enum */ {
# endif
#endif

/* __ARM_FEATURE_SVE is only supported by GCC & Clang. */
#if (XXH_VECTOR == XXH_SVE) && !defined(__ARM_FEATURE_SVE)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note : maybe minor, but surely worth mentioning.

This transparent fallback method
will make compilation succeed when SVE is selected, but actually not present on compilation target.

But it also means that a developer may think that the library is compiled with SVE enabled, because it's instructed to, but actually it's not and uses SCALAR instead, and the developer would be none the wiser.

I'm wondering if it would be preferable to let compilation fail instead.

Intermediate position, maybe a warning (though unfortunately, I believe #warning is not part of C90 syntax, and #pragma warning depends on compiler support. But then, so does __ARM_FEATURE_SVE).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'll update it.

@hzhuang1
Copy link
Contributor Author

hzhuang1 commented Nov 7, 2022

Hi @hzhuang1 , the SVE implementation looks good to me.

I'm essentially concerned about how it could be continuously tested. Any chance this could run on github action using qemu ?

SVE could be tested in QEMU. I'm using QEMU to test code. And I only test performance on real platform.

But I don't know how to configure QEMU for SVE in your CI script.

@hzhuang1
Copy link
Contributor Author

hzhuang1 commented Nov 7, 2022

Any chance this could run on github action using qemu ?

It seems qemu has some options for SVE. (Since I don't have environment for now, I just take a quick look.)

Perhaps -cpu max,sve=on,sve512=on may work.

Yes, you're right.

I'm using the config -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off for my test.

@t-mat
Copy link
Contributor

t-mat commented Nov 7, 2022

I'm using the config -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off for my test.

Thanks for the info!

I'm essentially concerned about how it could be continuously tested. Any chance this could run on github action using qemu ?

But I don't know how to configure QEMU for SVE in your CI script.

You can add your aarch64 SVE test to the following lines

- name: ARM64 (XXH_VECTOR=[ scalar, NEON ])
if: ${{ matrix.name == 'ARM64' }}
run: |
CPPFLAGS="-DXXH_VECTOR=XXH_SCALAR" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check
CPPFLAGS="-DXXH_VECTOR=XXH_NEON" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check

For example:

-   - name: ARM64 (XXH_VECTOR=[ scalar, NEON ])
+   - name: ARM64 (XXH_VECTOR=[ scalar, NEON, SVE ])
      if: ${{ matrix.name == 'ARM64' }}
      run: |
        CPPFLAGS="-DXXH_VECTOR=XXH_SCALAR" LDFLAGS="-static" CC=$XCC RUN_ENV=$XEMU make clean check
        CPPFLAGS="-DXXH_VECTOR=XXH_NEON" LDFLAGS="-static" CC=aarch64-linux-gnu-gcc RUN_ENV=qemu-aarch64-static make clean check
+       CPPFLAGS="-DXXH_VECTOR=XXH_SVE" LDFLAGS="-static" CC=aarch64-linux-gnu-gcc RUN_ENV="qemu-aarch64-static -cpu max,sve128=on,sve256=on,sve512=on,sve1024=off,sve2048=off" make clean check

Note that RUN_ENV is passed as a "prefix" of the make check test commands.

xxHash/Makefile

Lines 189 to 213 in 267ba94

# make check can be run with cross-compiled binaries on emulated environments (qemu user mode)
# by setting $(RUN_ENV) to the target emulation environment
.PHONY: check
check: xxhsum ## basic tests for xxhsum CLI, set RUN_ENV for emulated environments
# stdin
$(RUN_ENV) ./xxhsum$(EXT) < xxhash.c
# multiple files
$(RUN_ENV) ./xxhsum$(EXT) xxhash.*
# internal bench
$(RUN_ENV) ./xxhsum$(EXT) -bi0
# long bench command
$(RUN_ENV) ./xxhsum$(EXT) --benchmark-all -i0
# bench multiple variants
$(RUN_ENV) ./xxhsum$(EXT) -b1,2,3 -i0
# file bench
$(RUN_ENV) ./xxhsum$(EXT) -bi0 xxhash.c
# 32-bit
$(RUN_ENV) ./xxhsum$(EXT) -H0 xxhash.c
# 128-bit
$(RUN_ENV) ./xxhsum$(EXT) -H2 xxhash.c
# XXH3 (enforce BSD style)
$(RUN_ENV) ./xxhsum$(EXT) -H3 xxhash.c | grep "XXH3"
# request incorrect variant
$(RUN_ENV) ./xxhsum$(EXT) -H9 xxhash.c ; test $$? -eq 1
@printf "\n ....... checks completed successfully ....... \n"

Implement intrinsic code for ARM SVE. In this patch, it seems that there
is no improvement. Further optimization will be contained in the later
patch set.

SCALAR implementation (default)
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  2718,  3093,  3151,  3187,  3217,  3232,  3238,  3223,  3190,  3240,  3241,  3222,  3238,  3049,  2253,  2113,  2102,  2117,  2132
XXH32  ,  1499,  1520,  1538,  1543,  1543,  1545,  1547,  1535,  1486,  1487,  1484,  1468,  1484,  1435,  1231,  1183,  1184,  1185,  1187
XXH64  ,  2632,  2874,  3017,  3092,  3132,  3143,  3155,  3122,  2990,  2994,  2998,  3000,  3006,  2842,  2121,  1838,  1833,  1811,  1839
XXH128 ,  2435,  2872,  3024,  3121,  3187,  3219,  3234,  3222,  3221,  3225,  3225,  3227,  3227,  3034,  1962,  1789,  1866,  1871,  1874

Intrinsic SVE512 implementation
 ===  benchmarking 4 hash functions  ===
benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
xxh3   ,  1903,  2313,  2469,  2568,  2631,  2663,  2683,  2662,  2673,  2678,  2679,  2678,  2676,  2584,  2236,  2173,  2173,  2163,  2183
XXH32  ,  1326,  1436,  1495,  1523,  1535,  1543,  1547,  1536,  1505,  1506,  1504,  1507,  1508,  1446,  1246,  1193,  1194,  1194,  1167
XXH64  ,  2510,  2802,  2977,  3071,  3121,  3136,  3156,  3126,  3041,  3047,  3050,  3050,  3048,  2890,  2043,  1944,  1944,  1945,  1955
XXH128 ,  1867,  2295,  2462,  2542,  2627,  2663,  2679,  2656,  2661,  2675,  2673,  2647,  2671,  2574,  2218,  2144,  2169,  2175,  2181

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Signed-off-by: Devin Hussey <easyaspi314@users.noreply.github.com>
Since the vector size of ARM64 SVE ranges from 128-bit to 2048-bit,
configure CI to run with each vector size.

Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
Copy link
Owner

@Cyan4973 Cyan4973 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hzhuang1 , this looks good to me

@Cyan4973 Cyan4973 merged commit dbacfec into Cyan4973:dev Nov 9, 2022
@Cyan4973
Copy link
Owner

Maybe I was too fast here : I don't see Github Action test results in the test summary.
Seems only Appveyor (Windows) are reported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants