[Question] Why no one use AVX instruction to optimize pcre library? #263
Comments
I tried it once, but did not see much improvement. Probably because it is mostly just data loading, the actual computation is simple. Or I just don't know how to use them properly. |
@zherczeg Thanks for the information. I don't think we need to try again. |
Since the compiler knows more, I have added AVX2 support for the jit compiler: I see some improvements for simple fixed strings, but they never dominated benchmarks. |
@stkeke Are you still available? There is something I don't understand. There is a SSE4.1 code:
AVX2 code:
Do you have any idea why this happens? The code reads an aligned 16/32 byte block, then another 16/32 byte block with an unaligned (negative) offset, then performs some compare and logic instructions. Finally, it uses the mask collector instruction to get the sign bits. |
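For readers following along, here is a minimal sketch of the two-load character pair search described above, written with SSE2 intrinsics. This is not the code the PCRE2 JIT emits: the function name `find_pair` is hypothetical, both loads are unaligned for simplicity (the real code aligns the first one), and a scalar loop handles the tail and non-SSE2 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: returns the offset of the first adjacent pair
   (c1, c2) in s[0..len), or -1 if the pair does not occur. */
ptrdiff_t find_pair(const unsigned char *s, size_t len,
                    unsigned char c1, unsigned char c2)
{
    size_t i = 1;
    assert(s != NULL || len == 0);

#if defined(__SSE2__)
    {
        const __m128i v1 = _mm_set1_epi8((char)c1);
        const __m128i v2 = _mm_set1_epi8((char)c2);

        for (; i + 16 <= len; i += 16) {
            /* Block holding the candidate second characters. */
            __m128i cur = _mm_loadu_si128((const __m128i *)(s + i));
            /* Same block shifted back by one byte (negative offset). */
            __m128i prev = _mm_loadu_si128((const __m128i *)(s + i - 1));
            /* A lane is all ones only if both characters match. */
            __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(prev, v1),
                                       _mm_cmpeq_epi8(cur, v2));
            /* Collect the sign bit of every byte lane into a bitmask. */
            unsigned mask = (unsigned)_mm_movemask_epi8(eq);

            if (mask != 0) {
                unsigned bit = 0;
                while (((mask >> bit) & 1u) == 0)
                    bit++;
                return (ptrdiff_t)(i + bit - 1);
            }
        }
    }
#endif

    /* Scalar tail (and fallback when SSE2 is not available). */
    for (; i < len; i++)
        if (s[i - 1] == c1 && s[i] == c2)
            return (ptrdiff_t)(i - 1);
    return -1;
}
```

The AVX2 variant would have the same shape with 32 byte registers; every nonzero mask is a point where the loop is left to verify a match.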
@zherczeg Hmmm...unexpected...let me talk with my colleagues tomorrow. Is it possible to provide the CPU model shown by lscpu? |
|
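If the following lines are copied into pcre2_jit_simd_inc.h you can force the SSE4.1 code:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L44
#undef SLJIT_HAS_AVX2
#define SLJIT_HAS_AVX2 -1
The code is quite similar except:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L593
It also uses vbroadcastd instead of pshufd on 32 byte registers to replicate the elements. |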
Hi @zherczeg, can you kindly share your test method? |
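My test method is kind of complicated, so I created a simplified code from it. I see the differences there as well.
Source code: https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b
Input: http://www.gutenberg.org/files/3200/old/mtent12.zip
This zip contains a single text file (~20 Mbyte), please rename it to text.txt.
Compiling:
Configure pcre2:
export CFLAGS="-O3"
./configure --enable-shared=no --enable-pcre2-16 --enable-pcre2-32 --enable-jit --enable-unicode
make
I prefer static builds but it is not necessary. Probably ./configure --enable-jit would be enough, but I haven't tested it.
Then just build the performance test program and link it with the 8 bit pcre2 library:
gcc perf_test.c -O3 -o perf_test -lpcre2-8 |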
Thanks. We can check it out tomorrow. Now it is night in our time zone.
My test method is kind of complicated, so I created a simplified code from it. I see the differences there as well.
Source code:
https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b
Input:
http://www.gutenberg.org/files/3200/old/mtent12.zip
This zip contains a single text file (~20 Mbyte), please rename it to text.txt.
Compiling:
Configure pcre2:
export CFLAGS="-O3"
./configure --enable-shared=no --enable-pcre2-16 --enable-pcre2-32 --enable-jit --enable-unicode
make
I prefer static builds but it is not necessary. Probably ./configure --enable-jit would be enough, but I haven't tested it.
Then just build the performance test program and link it with the 8 bit pcre2 library:
gcc perf_test.c -O3 -o perf_test -lpcre2-8
|
Hi, I followed the steps to run the test on a bare-metal machine, with the process pinned to one CPU. I got very small values (small fractions of a millisecond), so I am not sure whether I ran the test correctly. I suppose the default repo from upstream is the AVX2 version:
Then I modify
|
Looks correct. Even if the input is 20 Mb, scanning it with a modern cpu does not take much time. The patterns in the test are very simple ones, where the simd code dominates the match. The interpreter runtime is roughly the same (it uses The AVX2 is (should be) auto detected by default. The 0.011 ms looks five times slower than the 0.002 ms. This confirms my finding, although AVX2 was only twice as slow on my system. Btw if bigger values are needed, you can use |
Let me give you some info about how the code works. It searches character pairs, e.g. an
str_ptr is a string pointer in the input, which is aligned to 16/32 byte.
|
Thanks for the additional info. We're still working on this...
If the following lines are copied into pcre2_jit_simd_inc.h you can force the SSE4.1 code:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L44
#undef SLJIT_HAS_AVX2
#define SLJIT_HAS_AVX2 -1
The code is quite similar except:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L593
It also uses vbroadcastd instead of pshufd on 32 byte registers to replicate the elements.
|
Thank you! Maybe this could help: the jit compiler can be used to insert instructions. For example AVX2:
SSE4.1
|
[Comment was updated] I have found another interesting observation. I played around with some fixed string patterns: \xf0\xf1\xf2\xf3\xf4 - the input is ascii, so these characters are never found there AVX2 times:
SSE4.1 times
The AVX2 code path gets gradually slower with these patterns. The SSE4.1 code path is slower at first, but then it is unaffected. The first pattern basically runs in the loop in the lower part of the assembly code above. In this case AVX2 seems faster. But when we have matches, it gets slower. If it has frequent matches, it is really slow. |
The "ab" is found 23832 times and "ho" is found 58736 times in the text. So the latter is more than twice as frequent. The character pair search leaves the search loop more frequently for "how to". |
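The pair frequencies quoted above can be reproduced with a simple scalar count. The sketch below is only an illustration (the name `count_pair` is hypothetical, and it is not part of perf_test.c); each counted position corresponds to one exit from the SIMD search loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts how many times the adjacent character
   pair (c1, c2) occurs in s[0..len). */
size_t count_pair(const unsigned char *s, size_t len,
                  unsigned char c1, unsigned char c2)
{
    size_t count = 0;
    assert(s != NULL || len == 0);

    for (size_t i = 1; i < len; i++)
        if (s[i - 1] == c1 && s[i] == c2)
            count++;
    return count;
}
```

Running it over text.txt with ('a', 'b') and ('h', 'o') should give the counts reported by the test program, assuming the same input file.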
@zherczeg things are more complicated than they first looked. We are doing several things in parallel right now. To find the root cause, we need to bring in and talk to more people. It will take a while... |
No problem, this is not urgent. Anyway this is what I know so far: SSE4.1:
AVX2:
The SSE code path is not affected by the number of character pair matches. However, the AVX2 path gets slower. With zero matches it is faster, but with 58K matches its runtime is tripled. Does |
I modify the perf_test.c as below so we can focus on JIT part.
AVX
I collect a trace with
and let's see the corresponding JITed code:
So the difference between the AVX and SSE versions is this instruction:
SSE
For SSE, we have a different hotspot:
and JITed code:
|
BTW, why I can still see AVX2 instruction such as |
Because the jit compiler independently detects its availability, and uses it when available. I could get rid of it for testing if it is really needed, but it is a bit more complicated. Btw, does mixing AVX and SSE forms have any negative impact? I thought the instruction decoder changes all instructions into internal forms. |
If the hotspot data is correct, in AVX2 the initialization of the 256 bit registers is somehow heavy. In SSE2 the main loop is heavy, which is expected. |
We will see if we can find a public whitepaper mentioning that this kind of operation is heavy. |
Thank you. Honestly, it is surprising that operation is heavy according to perf, but the 128 bit variant of the same instruction is not. |
Because I am not sure whether perf works well on JITed code, I want to do some experiments.
What's the purpose of the code? It looks like it just fills or initializes the 256 bit ymm2 by broadcasting a dword integer "0x6f6f6f6f" to eight locations in ymm2? Can we switch to another method to see the effect? Btw, I'm new to this repo, so I am still ramping up on the source code to see how to do that. |
Should we avoid broadcast in same reg: |
The purpose of the broadcast is replicating 8/16/32 bit values depending on the character type (pcre is a pattern matching engine). The old code used Yes, you can use other registers to hold the data before replication: E.g. you can use SLJIT_FR0 / SLJIT_FR1 to hold the constant before replication. |
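To make the replication step concrete: before the dword is broadcast across the vector register, the character itself has to be replicated inside that dword. The sketch below shows one common way to do this with a multiply; the function name `replicate_to_dword` is hypothetical and this is not taken from the PCRE2 or sljit sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: replicates an 8, 16 or 32 bit character value
   so that it fills a 32 bit dword, ready to be broadcast to every
   dword lane of a vector register. */
uint32_t replicate_to_dword(uint32_t value, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);

    if (bits == 8)
        return (value & 0xffu) * 0x01010101u;   /* four copies */
    if (bits == 16)
        return (value & 0xffffu) * 0x00010001u; /* two copies */
    return value;                               /* already one dword */
}
```

For the 8 bit character 'o' (0x6f) this gives 0x6f6f6f6f, the constant seen in the JITed code discussed above.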
It looks like this is caused by AVX-SSE transition penalties. Please refer to the links below: According to the links, we can do the following to avoid the penalties:
Perf reported the hotspot:
Now the JITed code looks like:
This helps on my Intel Sapphire Rapids server: the AVX2 and SSE versions now have similar performance. My POC is:
|
I am not sure if this is the root cause. But I think at least JIT need support Also, the perf metric |
Likely. I don't know much about this topic, so everything is new to me. I never heard of tma_avx_assists.
It seems the registers cannot be specified for vzeroupper. I think the abi on windows only requires to preserve the lower 128 bits of saved float registers. |
Edited:
If I have time tomorrow, I will search some Intel manual or Agner Fog's doc and share the snippet. |
Refer to "15.3 MIXING AVX CODE WITH SSE CODE" Intel Software Optimization manual, https://cdrdv2-public.intel.com/671488/248966-046A-software-optimization-manual.pdf :
Next steps:
|
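Thank you very much for your help! This is very interesting:
Size of input: 20045118
Pattern: xz
average interpreter runtime: 0.004 ms (matched: 0)
average jit runtime: 0.002 ms (matched: 0)
Pattern: ab
average interpreter runtime: 0.064 ms (matched: 23832)
average jit runtime: 0.004 ms (matched: 23832)
Pattern: ho
average interpreter runtime: 0.042 ms (matched: 58736)
average jit runtime: 0.006 ms (matched: 58736)
Pattern: how to
average interpreter runtime: 0.037 ms (matched: 395)
average jit runtime: 0.002 ms (matched: 395)
Vzeroupper helped in the "how to" case, but not in the other cases. Regardless, it is worth adding this instruction. I thought the avx is just an alias to sse when 128 bit registers are used. I can modify the jit compiler to always generate avx instructions when avx is available. This will be a longer term work though. |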
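As an illustration of where vzeroupper goes, here is a small C sketch using intrinsics: a 256 bit compare and mask collection, followed by `_mm256_zeroupper()` before control returns to code that may use legacy SSE encodings. The function names are hypothetical and this is not the PoC patch. It assumes GCC or Clang on x86-64 (for the `target("avx2")` attribute and `__builtin_cpu_supports`), with a scalar fallback elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_AVX2_PATH 1

/* Compares 32 bytes against c and returns one bit per matching byte. */
__attribute__((target("avx2")))
static uint32_t match_mask32_avx2(const unsigned char *block, unsigned char c)
{
    __m256i needle = _mm256_set1_epi8((char)c);
    __m256i data = _mm256_loadu_si256((const __m256i *)block);
    uint32_t mask =
        (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(data, needle));

    /* Clear the upper halves of the ymm registers before returning to
       code that may execute non-VEX SSE instructions. */
    _mm256_zeroupper();
    return mask;
}
#endif

/* Portable reference implementation of the same operation. */
static uint32_t match_mask32_scalar(const unsigned char *block, unsigned char c)
{
    uint32_t mask = 0;
    for (unsigned i = 0; i < 32; i++)
        if (block[i] == c)
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Hypothetical entry point: picks the AVX2 path when the CPU has it. */
uint32_t match_mask32(const unsigned char *block, unsigned char c)
{
    assert(block != NULL);
#ifdef HAVE_AVX2_PATH
    if (__builtin_cpu_supports("avx2"))
        return match_mask32_avx2(block, c);
#endif
    return match_mask32_scalar(block, c);
}
```

C compilers usually insert vzeroupper on their own when leaving AVX code, so the explicit call here is mostly for illustration; a JIT that emits machine code directly has to place the instruction itself.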
If AVX can reduce the instruction path length and save cycles, it's worth it, considering the popularity and installation volume of this library.
Thank you very much for your help!
This is very interesting:
Size of input: 20045118
Pattern: xz
average interpreter runtime: 0.004 ms (matched: 0)
average jit runtime: 0.002 ms (matched: 0)
Pattern: ab
average interpreter runtime: 0.064 ms (matched: 23832)
average jit runtime: 0.004 ms (matched: 23832)
Pattern: ho
average interpreter runtime: 0.042 ms (matched: 58736)
average jit runtime: 0.006 ms (matched: 58736)
Pattern: how to
average interpreter runtime: 0.037 ms (matched: 395)
average jit runtime: 0.002 ms (matched: 395)
Vzeroupper helped in the "how to" case, but not in the other cases. Regardless, it is worth adding this instruction.
I thought the avx is just an alias to sse when 128 bit registers are used. I can modify the jit compiler to always generate avx instructions when avx is available. This will be a longer term work though.
|
Does the perf_test.c at https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b only test the "how to" pattern? In the future, if the JITed code is organized so that all the AVX2 code is packed together, this will be easy: we just place vzeroupper before/after the AVX2 block. |
You can add any pcre2 pattern to
The first member contains the compile flags. The last pattern is marked with a |
I have also noticed that vzeroupper needs to be placed there. I will try if zeroing with |
I checked the AVX/SSE mix penalty for the patterns "ab" and "ho" after applying our PoC patch. It's 59.2% (almost the same as without the patch), which is too high.
To run this command, you need to update your Linux kernel. The function
So I added vzeroupper after it (zero upper after the AVX2 block):
This, however, breaks the functionality, but it does reduce ASSISTS.SSE_AVX_MIX. So in the long term, people may optimize the JIT engine to pack the AVX2 code together.
|
I said it breaks the functionality because the matches should be 58736:
|
Maybe this is an already-answered or zero-level question.
I just wondered why no one has used AVX instructions to optimize the PCRE library after the SSE2 optimization.
Has someone already tried but seen no performance improvement?
Thanks.