Customize full accumulating loop for SVE #756

Merged 2 commits on Nov 24, 2022

Commits on Nov 10, 2022

  1. switch to full accumulating loop

    XXH3_accumulate() handles the whole accumulating loop, while the
    architecture-optimized code lives only in the inner loop that
    processes 512 bits (one 64-byte stripe) at a time. For large blocks
    of data, this causes frequent memory accesses.
    
    Make XXH3_accumulate() itself the architecture-optimized code.
    
    Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
    Signed-off-by: Devin Hussey <easyaspi314@users.noreply.github.com>
    hzhuang1 committed Nov 10, 2022
    Commit 91788f1
  2. customize full accumulating loop for ARM64 SVE

    With the optimized full accumulating loop, XXH3 and XXH128 are
    roughly 1.8x to 4x faster in the benchmark below.
    
    The accumulators no longer need to be saved to the stack inside the
    full loop, and SVE data prefetch instructions are also used.
    
    Benchmark results without this patch:
     ===  benchmarking 4 hash functions  ===
    benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
    xxh3   ,  1904,  2315,  2468,  2580,  2640,  2670,  2682,  2673,  2677,  2663,  2683,  2688,  2686,  2591,  2241,  2181,  2191,  2048,  2048
    XXH32  ,  1326,  1440,  1493,  1523,  1534,  1543,  1547,  1532,  1504,  1507,  1507,  1505,  1506,  1446,  1218,  1150,  1151,  1153,  1135
    XXH64  ,  2511,  2795,  2975,  3068,  3120,  3125,  3154,  3128,  3034,  3045,  3052,  3053,  3053,  2842,  2050,  1853,  1848,  1853,  1853
    XXH128 ,  1867,  2294,  2465,  2569,  2622,  2662,  2676,  2667,  2677,  2682,  2684,  2677,  2683,  2570,  2093,  2013,  2045,  2046,  2046
    
    Benchmark results with this patch:
     ===  benchmarking 4 hash functions  ===
    benchmarking large inputs : from 512 bytes (log9) to 128 MB (log27)
    xxh3   ,  3681,  6007,  7803,  8954,  9875, 10411, 10703, 10505, 10670, 10794, 10812, 10804, 10205,  9923,  6279,  5927,  5967,  6022,  6062
    XXH32  ,  1281,  1434,  1494,  1523,  1534,  1543,  1547,  1535,  1500,  1502,  1502,  1502,  1501,  1443,  1242,  1169,  1193,  1196,  1195
    XXH64  ,  2497,  2801,  2961,  3074,  3092,  3136,  3155,  3123,  3031,  3037,  3040,  3037,  3033,  2847,  2102,  1955,  1967,  1974,  1971
    XXH128 ,  3419,  5798,  7488,  8854,  9787, 10357, 10673, 10468, 10647, 10748, 10785, 10751, 10805,  9698,  6011,  5677,  5999,  6065,  6074
    
    Signed-off-by: Haojian Zhuang <haojian.zhuang@linaro.org>
    Signed-off-by: Devin Hussey <easyaspi314@users.noreply.github.com>
    hzhuang1 committed Nov 10, 2022
    Commit cfbf0b7
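
The restructuring described in the two commit messages can be sketched in portable C. This is only an illustration of the loop shape, not the code from this PR: `mix_lane`, `accumulate_per_stripe`, and `accumulate_full_loop` are hypothetical names, the mixing step is a stand-in for the real XXH3 round, and a single fixed key is assumed instead of XXH3's sliding secret. The first variant calls a per-stripe kernel that reads and writes the accumulator array on every stripe; the second keeps the accumulators in locals for the whole loop and stores them once at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIPE_LEN 64  /* bytes per stripe: 8 lanes x 64 bits */
#define ACC_NB     8   /* number of 64-bit accumulators */

/* Hypothetical per-lane mix; a stand-in, NOT the real XXH3 round. */
static uint64_t mix_lane(uint64_t acc, uint64_t data, uint64_t key)
{
    uint64_t k = data ^ key;
    return acc + data + (k & 0xFFFFFFFFu) * (k >> 32);
}

/* Per-stripe kernel: acc[] is loaded and stored on every call. */
static void accumulate_stripe(uint64_t *acc, const uint8_t *in,
                              const uint64_t *key)
{
    for (size_t i = 0; i < ACC_NB; i++) {
        uint64_t d;
        memcpy(&d, in + 8 * i, sizeof d);  /* unaligned-safe load */
        acc[i] = mix_lane(acc[i], d, key[i]);
    }
}

/* Shape before the change: generic outer loop, optimized inner kernel. */
static void accumulate_per_stripe(uint64_t *acc, const uint8_t *in,
                                  const uint64_t *key, size_t nb_stripes)
{
    for (size_t n = 0; n < nb_stripes; n++)
        accumulate_stripe(acc, in + n * STRIPE_LEN, key);
}

/* Shape after the change: the whole loop is one routine, so the
 * accumulators are loaded once, kept in locals (registers, ideally),
 * and written back once after the last stripe. */
static void accumulate_full_loop(uint64_t *acc, const uint8_t *in,
                                 const uint64_t *key, size_t nb_stripes)
{
    uint64_t a[ACC_NB];
    memcpy(a, acc, sizeof a);              /* single load of the state */
    for (size_t n = 0; n < nb_stripes; n++) {
        const uint8_t *stripe = in + n * STRIPE_LEN;
        for (size_t i = 0; i < ACC_NB; i++) {
            uint64_t d;
            memcpy(&d, stripe + 8 * i, sizeof d);
            a[i] = mix_lane(a[i], d, key[i]);
        }
    }
    memcpy(acc, a, sizeof a);              /* single store of the state */
}
```

Both variants compute the same result; only the placement of the accumulator loads and stores differs. In scalar C a compiler may well inline the first variant into the second, so the difference mainly matters when the kernel is written with vector intrinsics or assembly, as in the SVE case this PR targets.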