[Question] Why no one use AVX instruction to optimize pcre library? #263

stkeke · 2023-06-07T02:47:33Z

Maybe this is a already-answered or zero-level question.
I just wondered why there is no one to use AVX instructions to optimize this PCRE library after SSE2 optimization?
Has someone already tried but no performance improvement?
Thanks.

zherczeg · 2023-06-07T04:20:50Z

I tried it once, but did not see much improvement. Probably because it is mostly just data loading, the actual computation is simple. Or I just don't know how to use them properly.

stkeke · 2023-06-08T11:29:42Z

@zherczeg Thanks for the information. I don't think we need to try again.

zherczeg · 2023-11-10T13:39:11Z

Since the compiler knows more, I have added AVX2 support for the jit compiler:
#328

I see some improvements for simple fixed strings, but they never dominated benchmarks.

zherczeg · 2023-11-12T08:00:06Z

@stkeke Are you still available? There is something I don't understand. There is a fast_forward_char_pair_sse2_compare function which can use avx2 now. The main difference between the SSE4.1 and AVX2 code is the using of 32 byte registers instead of 16 byte. However, it seems slower on certain machines.

SSE4.1 code:

Pattern: 'how to' Matches: 395
  8 bit: Int runtime:     44 ms JIT runtime:      2 ms 15.67 as fast 93.6% save
  16 bit: Int runtime:     54 ms JIT runtime:      4 ms 13.09 as fast 92.4% save
Pattern: '^how to' Matches: 33
  8 bit: Int runtime:     99 ms JIT runtime:      2 ms 46.37 as fast 97.8% save
  16 bit: Int runtime:    101 ms JIT runtime:      4 ms 23.16 as fast 95.7% save

AVX2 code:

Pattern: 'how to' Matches: 395
  8 bit: Int runtime:     46 ms JIT runtime:      6 ms 7.38 as fast 86.5% save
  16 bit: Int runtime:     74 ms JIT runtime:      8 ms 8.63 as fast 88.4% save
Pattern: '^how to' Matches: 33
  8 bit: Int runtime:    156 ms JIT runtime:      4 ms 33.30 as fast 97.0% save
  16 bit: Int runtime:    114 ms JIT runtime:      7 ms 16.08 as fast 93.8% save

Do you have any idea why this happens? The code reads an aligned 16/32 byte block, then another 16/32 byte block with an unaligned (negative) offset, then doing some compare and logic instructions. Then use the mask collector instruction to get the sign bits.

stkeke · 2023-11-12T08:07:58Z

@zherczeg Hmmm...unexpected...let me talk with my colleagues tomorrow. Is it possible to provide CPU model showed by lscpu?

zherczeg · 2023-11-12T13:53:26Z

Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz It is a virtual machine, but the results are consistent.

zherczeg · 2023-11-12T14:15:25Z

If the following lines are copied into pcre2_jit_simd_inc.h you can force the SSE4.1 code:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L44

#undef SLJIT_HAS_AVX2
#define SLJIT_HAS_AVX2 -1

The code is quite similar except:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L593

It also uses vbroadcastd instead of pshufd on 32 byte registers to replicate the elements.

chen-hu-97 · 2023-11-13T09:00:24Z

fast_forward_char_pair_sse2_compare

Hi @zherczeg, can you kindly share your test method?

zherczeg · 2023-11-13T11:17:13Z

My test method is kind of complicated, so I created a simplified code from it. I see the differences there as well.

Source code:
https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b

Input:
http://www.gutenberg.org/files/3200/old/mtent12.zip

This zip contains a single text file (~20 Mbyte), please rename it to text.txt.

Compiling:

Configure pcre2:

export CFLAGS="-O3"
./configure --enable-shared=no --enable-pcre2-16 --enable-pcre2-32 --enable-jit --enable-unicode
make

I prefer static builds but it is not necessary. Probably ./configure --enable-jit would be enough, but I haven't tested it.

Then just build the performance test program and link it with the 8 bit pcre2 library:
gcc perf_test.c -O3 -o perf_test -lpcre2-8

stkeke · 2023-11-13T11:44:56Z

Thanks. We can check it out tomorrow. Now it is night in our time zone.

…

________________________________ From: Zoltan Herczeg ***@***.***> Sent: Monday, November 13, 2023 7:17:23 PM To: PCRE2Project/pcre2 ***@***.***> Cc: Su, Tao ***@***.***>; Mention ***@***.***> Subject: Re: [PCRE2Project/pcre2] [Question] Why no one use AVX instruction to optimize pcre library? (Issue #263) My test method is kind of complicated, so I created a simplified code from it. I see the differences there as well. Source code: https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b Input: http://www.gutenberg.org/files/3200/old/mtent12.zip This zip contains a single text file (~20 Mbyte), please rename it to text.txt. Compiling: Configure pcre2: export CFLAGS="-O3" ./configure --enable-shared=no --enable-pcre2-16 --enable-pcre2-32 --enable-jit --enable-unicode make I prefer static builds but it is not necessary. Probably ./configure --enable-jit would be enough, but I haven't tested it. Then just build the performance test program and link it with the 8 bit pcre2 library: gcc perf_test.c -O3 -o perf_test -lpcre2-8 — Reply to this email directly, view it on GitHub<#263 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABRT7U5H6YPMQ7A7LIPJ7LDYEH6UHAVCNFSM6AAAAAAY5GZXFGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBXHE3DGMBZGU>. You are receiving this because you were mentioned.Message ID: ***@***.***>

chen-hu-97 · 2023-11-15T10:26:54Z

make

Hi, I follow the steps to run the test on bare metal machine. I pin the process to one cpu. I got some small value, that is, very small ms. I don't know if I run the test correctly?

I suppose the default repo from upstream is avx2 version:

$ ./perf_test_avx
Size of input: 20045118
Pattern: how to
  average interpreter runtime: 0.041 ms
  average jit runtime: 0.011 ms
Pattern: ^how to
  average interpreter runtime: 0.100 ms
  average jit runtime: 0.011 ms
Pattern: how( to|ever)
  average interpreter runtime: 0.042 ms
  average jit runtime: 0.012 ms

Then I modify pcre2_jit_simd_inc.h to compile the libpcre2-8.a then compile the perf_test and get SSE version:

$ ./perf_test_sse
Size of input: 20045118
Pattern: how to
  average interpreter runtime: 0.043 ms
  average jit runtime: 0.002 ms
Pattern: ^how to
  average interpreter runtime: 0.093 ms
  average jit runtime: 0.002 ms
Pattern: how( to|ever)
  average interpreter runtime: 0.043 ms
  average jit runtime: 0.002 ms

zherczeg · 2023-11-15T10:40:44Z

Looks correct. Even if the input is 20 Mb, scanning it with a modern cpu does not take much time. The patterns in the test are very simple ones, where the simd code dominates the match. The interpreter runtime is roughly the same (it uses memchr() to find a single character).

The AVX2 is (should be) auto detected by default. The 0.011 ms looks five times slower than the 0.002 ms. This confirms my finding, although AVX2 was only twice as slow on my system. Btw if bigger values are needed, you can use memcpy() to repeat the input data.

zherczeg · 2023-11-15T10:51:58Z

Let me give you some info about how the code works. It searches character pairs, e.g. an a...b, where an 'a' is four characters before 'b'. The 'a' and 'b' are repeated in simd registers:

cmp1_reg = 'bbb...bbb'
cmp2_reg = 'aaa...aaa'

str_ptr is a string pointer in the input, which is aligned to 16/32 byte.

str_ptr -= sizeof(data1_reg)
loop {
  str_ptr += sizeof(data1_reg)
  data1_reg = read_mem(str_ptr) // This is an aligned read
  data2_reg = read_mem(str_ptr - 4) // The 'a' is four characters before 'b'. This is an unaligned read.

  data1_reg = (data1_reg == cmp1_reg) & (data2_reg == cmp2_reg)
  // data1_reg contains 0xff where the pair is found
  cpu_reg = pmovmsk(data1_reg); // get the mask
  if (cpu_reg != 0) break;
}

// Find the lowest bit set in cpu_reg

stkeke · 2023-11-18T01:44:48Z

Thanks for the addition info. We're still working on this...

…

________________________________ From: Zoltan Herczeg ***@***.***> Sent: Saturday, November 18, 2023 6:47:55 AM To: PCRE2Project/pcre2 ***@***.***> Cc: Su, Tao ***@***.***>; Mention ***@***.***> Subject: Re: [PCRE2Project/pcre2] [Question] Why no one use AVX instruction to optimize pcre library? (Issue #263) If the following lines are copied into pcre2_jit_simd_inc.h you can force the SSE4.1 code: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L44 #undef SLJIT_HAS_AVX2 #define SLJIT_HAS_AVX2 -1 The code is quite similar except: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L593 It also uses vbroadcastd instead of pshufd on 32 byte registers to replicate the elements. — Reply to this email directly, view it on GitHub<#263 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABRT7U6J4BXH4ALXVQQRYL3YE7SRXAVCNFSM6AAAAAAY5GZXFGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBXGE2DCMBXHA>. You are receiving this because you were mentioned.Message ID: ***@***.***>

zherczeg · 2023-11-19T16:54:37Z

Thank you! Maybe this could help: the jit compiler can be used to insert instructions. For example sljit_emit_op0(compiler, SLJIT_BREAKPOINT); and sljit_emit_op0(compiler, SLJIT_NOP); can be used to insert breakpoint (int3 on x86) and nop instructions respectively. When the program is executed with gdb, it stops at the breakpoint and the code can be disassembled. Inserting these instructions at the beginning and end of fast_forward_char_pair_simd, we get the following for the first test pattern:

AVX2:

   0x00007ffff6a9dc50:  int3
=> 0x00007ffff6a9dc51:  add    $0x1,%rsi
   0x00007ffff6a9dc55:  cmp    %rbx,%rsi
   0x00007ffff6a9dc58:  jae    0x7ffff6a9de4a
   0x00007ffff6a9dc5e:  mov    $0x6f6f6f6f,%rax
   0x00007ffff6a9dc65:  movd   %eax,%xmm2
   0x00007ffff6a9dc69:  mov    $0x68686868,%rax
   0x00007ffff6a9dc70:  movd   %eax,%xmm3
   0x00007ffff6a9dc74:  vpbroadcastd %xmm2,%ymm2
   0x00007ffff6a9dc79:  vpbroadcastd %xmm3,%ymm3
   0x00007ffff6a9dc7e:  lea    -0x1(%rsi),%rax
   0x00007ffff6a9dc82:  mov    %rsi,%rcx
   0x00007ffff6a9dc85:  and    $0xffffffffffffffe0,%rsi
   0x00007ffff6a9dc89:  vmovdqa (%rsi),%ymm0
   0x00007ffff6a9dc8d:  cmp    %rsi,%rax
   0x00007ffff6a9dc90:  jae    0x7ffff6a9dc99
   0x00007ffff6a9dc92:  vmovdqu -0x1(%rsi),%ymm1
   0x00007ffff6a9dc97:  jmp    0x7ffff6a9dca5
   0x00007ffff6a9dc99:  vpslldq $0x1,%xmm0,%xmm1
   0x00007ffff6a9dc9e:  vinserti128 $0x1,0xf(%rsi),%ymm1,%ymm1
   0x00007ffff6a9dca5:  and    $0x1f,%rcx
   0x00007ffff6a9dca9:  vpcmpeqb %ymm3,%ymm1,%ymm1
   0x00007ffff6a9dcad:  vpcmpeqb %ymm2,%ymm0,%ymm0
   0x00007ffff6a9dcb1:  vpand  %ymm1,%ymm0,%ymm0
   0x00007ffff6a9dcb5:  vpmovmskb %ymm0,%eax
   0x00007ffff6a9dcb9:  add    %rcx,%rsi
   0x00007ffff6a9dcbc:  shr    %cl,%rax
   0x00007ffff6a9dcbf:  cmp    $0x0,%rax
   0x00007ffff6a9dcc3:  jne    0x7ffff6a9dcf8
   0x00007ffff6a9dcc9:  sub    %rcx,%rsi
   0x00007ffff6a9dccc:  add    $0x20,%rsi
   0x00007ffff6a9dcd0:  cmp    %rbx,%rsi
   0x00007ffff6a9dcd3:  jae    0x7ffff6a9de4a
   0x00007ffff6a9dcd9:  vmovdqa (%rsi),%ymm0
   0x00007ffff6a9dcdd:  vmovdqu -0x1(%rsi),%ymm1
   0x00007ffff6a9dce2:  vpcmpeqb %ymm2,%ymm0,%ymm0
   0x00007ffff6a9dce6:  vpcmpeqb %ymm3,%ymm1,%ymm1
   0x00007ffff6a9dcea:  vpand  %ymm1,%ymm0,%ymm0
   0x00007ffff6a9dcee:  vpmovmskb %ymm0,%eax
   0x00007ffff6a9dcf2:  cmp    $0x0,%rax
   0x00007ffff6a9dcf6:  je     0x7ffff6a9dccc
   0x00007ffff6a9dcf8:  bsf    %eax,%eax
   0x00007ffff6a9dcfb:  add    %rax,%rsi
   0x00007ffff6a9dcfe:  cmp    %rbx,%rsi
   0x00007ffff6a9dd01:  jae    0x7ffff6a9de4a
   0x00007ffff6a9dd07:  sub    $0x1,%rsi
   0x00007ffff6a9dd0b:  nop

SSE4.1

   0x00007ffff6a9dc58:  int3
=> 0x00007ffff6a9dc59:  add    $0x1,%rsi
   0x00007ffff6a9dc5d:  cmp    %rbx,%rsi
   0x00007ffff6a9dc60:  jae    0x7ffff6a9de4f
   0x00007ffff6a9dc66:  mov    $0x6f6f6f6f,%rax
   0x00007ffff6a9dc6d:  movd   %eax,%xmm2
   0x00007ffff6a9dc71:  mov    $0x68686868,%rax
   0x00007ffff6a9dc78:  movd   %eax,%xmm3
   0x00007ffff6a9dc7c:  vpbroadcastd %xmm2,%xmm2
   0x00007ffff6a9dc81:  vpbroadcastd %xmm3,%xmm3
   0x00007ffff6a9dc86:  lea    -0x1(%rsi),%rax
   0x00007ffff6a9dc8a:  mov    %rsi,%rcx
   0x00007ffff6a9dc8d:  and    $0xfffffffffffffff0,%rsi
   0x00007ffff6a9dc91:  movdqa (%rsi),%xmm0
   0x00007ffff6a9dc95:  cmp    %rsi,%rax
   0x00007ffff6a9dc98:  jae    0x7ffff6a9dca1
   0x00007ffff6a9dc9a:  movdqu -0x1(%rsi),%xmm1
   0x00007ffff6a9dc9f:  jmp    0x7ffff6a9dcaa
   0x00007ffff6a9dca1:  movdqa %xmm0,%xmm1
   0x00007ffff6a9dca5:  pslldq $0x1,%xmm1
   0x00007ffff6a9dcaa:  and    $0xf,%rcx
   0x00007ffff6a9dcae:  pcmpeqb %xmm3,%xmm1
   0x00007ffff6a9dcb2:  pcmpeqb %xmm2,%xmm0
   0x00007ffff6a9dcb6:  pand   %xmm1,%xmm0
   0x00007ffff6a9dcba:  pmovmskb %xmm0,%eax
   0x00007ffff6a9dcbe:  add    %rcx,%rsi
   0x00007ffff6a9dcc1:  shr    %cl,%rax
   0x00007ffff6a9dcc4:  cmp    $0x0,%rax
   0x00007ffff6a9dcc8:  jne    0x7ffff6a9dcfd
   0x00007ffff6a9dcce:  sub    %rcx,%rsi
   0x00007ffff6a9dcd1:  add    $0x10,%rsi
   0x00007ffff6a9dcd5:  cmp    %rbx,%rsi
   0x00007ffff6a9dcd8:  jae    0x7ffff6a9de4f
   0x00007ffff6a9dcde:  movdqa (%rsi),%xmm0
   0x00007ffff6a9dce2:  movdqu -0x1(%rsi),%xmm1
   0x00007ffff6a9dce7:  pcmpeqb %xmm2,%xmm0
   0x00007ffff6a9dceb:  pcmpeqb %xmm3,%xmm1
   0x00007ffff6a9dcef:  pand   %xmm1,%xmm0
   0x00007ffff6a9dcf3:  pmovmskb %xmm0,%eax
   0x00007ffff6a9dcf7:  cmp    $0x0,%rax
   0x00007ffff6a9dcfb:  je     0x7ffff6a9dcd1
   0x00007ffff6a9dcfd:  bsf    %eax,%eax
   0x00007ffff6a9dd00:  add    %rax,%rsi
   0x00007ffff6a9dd03:  cmp    %rbx,%rsi
   0x00007ffff6a9dd06:  jae    0x7ffff6a9de4f
   0x00007ffff6a9dd0c:  sub    $0x1,%rsi
   0x00007ffff6a9dd10:  nop

zherczeg · 2023-11-20T05:40:52Z

[Comment was updated]

I have found another interesting observation. I played around with some fixed string patterns:

\xf0\xf1\xf2\xf3\xf4 - the input is ascii, so these characters are never found there
abcde - all characters are frequent, but no matches found
hoW to - similar to "how to", but never found
how to - found 395 times

AVX2 times:

Pattern: \xf0\xf1\xf2\xf3\xf4
  average interpreter runtime: 0.002 ms (matched: 0)
  average jit runtime: 0.002 ms (matched: 0)
Pattern: abcde
  average interpreter runtime: 0.072 ms (matched: 0)
  average jit runtime: 0.003 ms (matched: 0)
Pattern: hoW to
  average interpreter runtime: 0.045 ms (matched: 0)
  average jit runtime: 0.005 ms (matched: 0)
Pattern: how to
  average interpreter runtime: 0.037 ms (matched: 395)
  average jit runtime: 0.005 ms (matched: 395)

SSE4.1 times

Pattern: \xf0\xf1\xf2\xf3\xf4
  average interpreter runtime: 0.002 ms (matched: 0)
  average jit runtime: 0.003 ms (matched: 0)
Pattern: abcde
  average interpreter runtime: 0.064 ms (matched: 0)
  average jit runtime: 0.002 ms (matched: 0)
Pattern: hoW to
  average interpreter runtime: 0.042 ms (matched: 0)
  average jit runtime: 0.002 ms (matched: 0)
Pattern: how to
  average interpreter runtime: 0.035 ms (matched: 395)
  average jit runtime: 0.002 ms (matched: 395)

The AVX2 code path gets slower gradually with these patterns. The SS4.1 code path is slower first, but then it is unaffected.

The first pattern is basically runs on the loop of the lower part of the assembly code above. In this case AVX2 seems faster. But when we have matches, it gets slower. If it has frequent matches, it is really slow.

zherczeg · 2023-11-20T06:08:38Z

The "ab" is found 23832 times and "ho" is found "58736" times in the text. So the latter is more than twice as frequent. The character pair search leaves to search loop more frequently for "how to".

stkeke · 2023-11-22T00:37:45Z

@zherczeg things are more complicated that the first looking. We are doing several things in parallel right now. Trying to find the root cause, more people we need to link and talk. It will take a while...
Can you try 'perf' thing? does it help?

zherczeg · 2023-11-22T09:23:05Z

No problem, this is not urgent. Anyway this is what I know so far:

SSE4.1:

Size of input: 20045118
Pattern: xz
  average interpreter runtime: 0.004 ms (matched: 0)
  average jit runtime: 0.003 ms (matched: 0)
Pattern: ab
  average interpreter runtime: 0.065 ms (matched: 23832)
  average jit runtime: 0.003 ms (matched: 23832)
Pattern: ho
  average interpreter runtime: 0.045 ms (matched: 58736)
  average jit runtime: 0.003 ms (matched: 58736)

AVX2:

Pattern: xz
  average interpreter runtime: 0.004 ms (matched: 0)
  average jit runtime: 0.002 ms (matched: 0)
Pattern: ab
  average interpreter runtime: 0.066 ms (matched: 23832)
  average jit runtime: 0.004 ms (matched: 23832)
Pattern: ho
  average interpreter runtime: 0.042 ms (matched: 58736)
  average jit runtime: 0.006 ms (matched: 58736)

The SSE code path does not affected by the number of character pair matches. However the AVX2 gets slower. With zero match, it is faster, but with 58K matches its runtime is tripled.

Does perf support jit generated code? Otherwise it is not too helpful.

chen-hu-97 · 2023-12-05T10:20:18Z

I modify the perf_test.c as below so we can focus on JIT part.

Only run the first pattern, i.e., "Pattern: how to"
Only run "pcre2_jit_match_8"

AVX

I collect trace with perf record -ag ./perf_test_avx. I don't know if perf can works smoothly on JITed code, but it does report some hotspot. For now, let's assume it's correct :). The hotspot is:

+   20.20%  perf_test_avx    [JIT] tid 1239365     [.] 0x00007ffff7fa6c7b
+   15.99%  perf_test_avx    [JIT] tid 1239365     [.] 0x00007ffff7fa6c6c

and let's see the corresponding JITed code:

Dump of assembler code from 0x7ffff7fa6c58 to 0x7ffff7fa7000:
   0x00007ffff7fa6c58:  add    $0x1,%rsi
   0x00007ffff7fa6c5c:  cmp    %rbx,%rsi
   0x00007ffff7fa6c5f:  jae    0x7ffff7fa6e50
   0x00007ffff7fa6c65:  mov    $0x6f6f6f6f,%rax
   0x00007ffff7fa6c6c:  movd   %eax,%xmm2
   0x00007ffff7fa6c70:  mov    $0x68686868,%rax
   0x00007ffff7fa6c77:  movd   %eax,%xmm3
   0x00007ffff7fa6c7b:  vpbroadcastd %xmm2,%ymm2
   0x00007ffff7fa6c80:  vpbroadcastd %xmm3,%ymm3
   0x00007ffff7fa6c85:  lea    -0x1(%rsi),%rax
   0x00007ffff7fa6c89:  mov    %rsi,%rcx
   0x00007ffff7fa6c8c:  and    $0xffffffffffffffe0,%rsi
   0x00007ffff7fa6c90:  vmovdqa (%rsi),%ymm0
   0x00007ffff7fa6c94:  cmp    %rsi,%rax
   0x00007ffff7fa6c97:  jae    0x7ffff7fa6ca0
   0x00007ffff7fa6c99:  vmovdqu -0x1(%rsi),%ymm1
   0x00007ffff7fa6c9e:  jmp    0x7ffff7fa6cac
   0x00007ffff7fa6ca0:  vpslldq $0x1,%xmm0,%xmm1
   0x00007ffff7fa6ca5:  vinserti128 $0x1,0xf(%rsi),%ymm1,%ymm1
   0x00007ffff7fa6cac:  and    $0x1f,%rcx
   0x00007ffff7fa6cb0:  vpcmpeqb %ymm3,%ymm1,%ymm1
   0x00007ffff7fa6cb4:  vpcmpeqb %ymm2,%ymm0,%ymm0
   0x00007ffff7fa6cb8:  vpand  %ymm1,%ymm0,%ymm0
   0x00007ffff7fa6cbc:  vpmovmskb %ymm0,%eax
   0x00007ffff7fa6cc0:  add    %rcx,%rsi
   0x00007ffff7fa6cc3:  shr    %cl,%rax
   0x00007ffff7fa6cc6:  cmp    $0x0,%rax
   0x00007ffff7fa6cca:  jne    0x7ffff7fa6cff
   0x00007ffff7fa6cd0:  sub    %rcx,%rsi
   0x00007ffff7fa6cd3:  add    $0x20,%rsi
   0x00007ffff7fa6cd7:  cmp    %rbx,%rsi
   0x00007ffff7fa6cda:  jae    0x7ffff7fa6e50

So the difference between AVX and SSE version is this instruction:
AVX:
0x00007ffff7fa6c7b: vpbroadcastd %xmm2,%ymm2
SSE:
0x00007ffff7fa6c7b: vpbroadcastd %xmm2,%xmm2

SSE

For sse, we have different hotspot:

+    5.66%  perf_test_sse    [JIT] tid 1244347     [.] 0x00007ffff7fa6ce1                                                                                                                                                                              
+    4.16%  perf_test_sse    [kernel.vmlinux]      [k] copyout                                                                                                                                                                                         
+    3.59%  perf_test_sse    [JIT] tid 1244347     [.] 0x00007ffff7fa6cf2                                                                                                                                                                              
+    3.56%  perf_test_sse    [JIT] tid 1244347     [.] 0x00007ffff7fa6cfa

and JITed code:

Dump of assembler code from 0x7ffff7fa6c58 to 0x7ffff7fa7000:
   0x00007ffff7fa6c58:  add    $0x1,%rsi
   0x00007ffff7fa6c5c:  cmp    %rbx,%rsi
   0x00007ffff7fa6c5f:  jae    0x7ffff7fa6e4d
   0x00007ffff7fa6c65:  mov    $0x6f6f6f6f,%rax
   0x00007ffff7fa6c6c:  movd   %eax,%xmm2
   0x00007ffff7fa6c70:  mov    $0x68686868,%rax
   0x00007ffff7fa6c77:  movd   %eax,%xmm3
   0x00007ffff7fa6c7b:  vpbroadcastd %xmm2,%xmm2
   0x00007ffff7fa6c80:  vpbroadcastd %xmm3,%xmm3
   0x00007ffff7fa6c85:  lea    -0x1(%rsi),%rax
   0x00007ffff7fa6c89:  mov    %rsi,%rcx
   0x00007ffff7fa6c8c:  and    $0xfffffffffffffff0,%rsi
   0x00007ffff7fa6c90:  movdqa (%rsi),%xmm0
   0x00007ffff7fa6c94:  cmp    %rsi,%rax
   0x00007ffff7fa6c97:  jae    0x7ffff7fa6ca0
   0x00007ffff7fa6c99:  movdqu -0x1(%rsi),%xmm1
   0x00007ffff7fa6c9e:  jmp    0x7ffff7fa6ca9
   0x00007ffff7fa6ca0:  movdqa %xmm0,%xmm1
   0x00007ffff7fa6ca4:  pslldq $0x1,%xmm1
   0x00007ffff7fa6ca9:  and    $0xf,%rcx
   0x00007ffff7fa6cad:  pcmpeqb %xmm3,%xmm1
   0x00007ffff7fa6cb1:  pcmpeqb %xmm2,%xmm0
   0x00007ffff7fa6cb5:  pand   %xmm1,%xmm0
   0x00007ffff7fa6cb9:  pmovmskb %xmm0,%eax
   0x00007ffff7fa6cbd:  add    %rcx,%rsi
   0x00007ffff7fa6cc0:  shr    %cl,%rax
   0x00007ffff7fa6cc3:  cmp    $0x0,%rax
   0x00007ffff7fa6cc7:  jne    0x7ffff7fa6cfc
   0x00007ffff7fa6ccd:  sub    %rcx,%rsi
   0x00007ffff7fa6cd0:  add    $0x10,%rsi
   0x00007ffff7fa6cd4:  cmp    %rbx,%rsi
   0x00007ffff7fa6cd7:  jae    0x7ffff7fa6e4d
   0x00007ffff7fa6cdd:  movdqa (%rsi),%xmm0
   0x00007ffff7fa6ce1:  movdqu -0x1(%rsi),%xmm1
   0x00007ffff7fa6ce6:  pcmpeqb %xmm2,%xmm0
   0x00007ffff7fa6cea:  pcmpeqb %xmm3,%xmm1
   0x00007ffff7fa6cee:  pand   %xmm1,%xmm0
   0x00007ffff7fa6cf2:  pmovmskb %xmm0,%eax
   0x00007ffff7fa6cf6:  cmp    $0x0,%rax
   0x00007ffff7fa6cfa:  je     0x7ffff7fa6cd0

chen-hu-97 · 2023-12-05T10:21:08Z

BTW, why I can still see AVX2 instruction such as vpbroadcastd on SSE version?

zherczeg · 2023-12-05T10:30:41Z

Because the jit compiler independently detects its availability, and use it when available. I could get rid of it for testing if it is really needed but it is a bit more complicated. Btw mixing avx and sse forms has any negative impact? I thought the instruction decoder changes all instructions into internal forms.

zherczeg · 2023-12-05T10:46:06Z

If the hotspot data is correct, in avx2 the initialization of 256 bit registers is somehow heavy. In sse2 the main loops is heavy, which is expected.

stkeke · 2023-12-06T00:03:17Z

We will see if we can find some public whitepaper mentioning that this kind of operation is heavy.

zherczeg · 2023-12-06T05:40:43Z

Thank you. Honestly, it is surprising that operation is heavy according to perf, but the 128 bit variant of the same instruction is not.

chen-hu-97 · 2023-12-06T06:15:58Z

Because I am not sure if perf works well on JITed code, I want to do some experiment.

   0x00007ffff7fa6c65:  mov    $0x6f6f6f6f,%rax
   0x00007ffff7fa6c6c:  movd   %eax,%xmm2
   0x00007ffff7fa6c70:  mov    $0x68686868,%rax
   0x00007ffff7fa6c77:  movd   %eax,%xmm3
   0x00007ffff7fa6c7b:  vpbroadcastd %xmm2,%ymm2

What's purpose of the code? Looks like it just fill or initialize the 256bits ymm2 by broadcast a dword integer "0x6f6f6f6f" to eight locations in ymm2?

Can we switch to other method to see the effects?

btw, I'm new to this repo so I am still ramp up the source code to see how to do that.

chen-hu-97 · 2023-12-06T06:22:37Z

Should we avoid broadcast in same reg:
0x00007ffff7fa6c7b: vpbroadcastd %xmm2,%ymm2
and switch to things like:
0x00007ffff7fa6c7b: vpbroadcastd %xmm3,%ymm2

zherczeg · 2023-12-06T06:37:35Z

The purpose of the broadcast is replicating 8/16/32 bit values depending on the character type (pcre is a pattern matching engine). The old code used pshufd to do the replication, so 32 bit values were the best. First the 8/16 bit values were pre-replicated to a 32 bit immediate, which is replicated to the 128 bit register.

Yes, you can use other registers to hold the data before replication:
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L566
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_jit_simd_inc.h#L479

E.g. you can use SLJIT_FR0 / SLJIT_FR1 to hold the constant before replication.

chen-hu-97 · 2023-12-07T10:06:29Z

Looks like it’s caused by AVX-SSE transition penalties. Please refer to below links:
https://stackoverflow.com/questions/43879935/avoiding-avx-sse-vex-transition-penalties
https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake

According the links, we may do followings to avoid penalties:

Issue vzeroupper after AVX2 instructions
Emit all the JITed code with VEX perfixes
Use zmm16 - zmm31

Perf reported the hotspot:

0x00007ffff7fa6c6c:  movd   %eax,%xmm2
0x00007ffff7fa6c7b:  vpbroadcastd %xmm2,%ymm2

"movd %eax,%xmm2" doesn’t need upper bits of ymm2, but it actually get penalties (tens of cycles). So I add vzeroupper before it. (We should add vzeroupper after VEX / AVX2 instr, but I don't understand the logic of this assembly and corrupt the data in reg.)

Now the JITed code looks like:

   0x00007ffff7fa6c5d:  mov    $0x6f6f6f6f,%rax
   0x00007ffff7fa6c64:  vzeroupper
   0x00007ffff7fa6c67:  movd   %eax,%xmm2
   0x00007ffff7fa6c6b:  mov    $0x68686868,%rax
   0x00007ffff7fa6c72:  movd   %eax,%xmm3
   0x00007ffff7fa6c76:  vpbroadcastd %xmm2,%ymm2

This helps on my Intel Sapphire Rapids server, the AVX2 and SSE version now has similar performance.
Before:
AVX: average jit runtime: 0.011 ms
SSE: average jit runtime: 0.002 ms
After:
AVX: average jit runtime: 0.003 ms
SSE: average jit runtime: 0.002 ms

My POC is:

diff --git a/src/pcre2_jit_simd_inc.h b/src/pcre2_jit_simd_inc.h
index 793da05..be7933b 100644
--- a/src/pcre2_jit_simd_inc.h
+++ b/src/pcre2_jit_simd_inc.h
@@ -531,7 +531,7 @@ else
     OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, character_to_int32(char1b));
     }
   }
-
+emit_vzeroupper(compiler);
 value = SLJIT_SIMD_REG_128 | SLJIT_SIMD_ELEM_32 | SLJIT_SIMD_LANE_ZERO;
 sljit_emit_simd_lane_mov(compiler, value, SLJIT_FR2, 0, TMP1, 0);

diff --git a/src/sljit/sljitNativeX86_common.c b/src/sljit/sljitNativeX86_common.c
index cc330d4..8e201b7 100644
--- a/src/sljit/sljitNativeX86_common.c
+++ b/src/sljit/sljitNativeX86_common.c
@@ -980,6 +980,19 @@ static SLJIT_INLINE sljit_s32 emit_endbranch(struct sljit_compiler *compiler)
        return SLJIT_SUCCESS;
 }

+static SLJIT_INLINE sljit_s32 emit_vzeroupper(struct sljit_compiler *compiler)
+{
+       /* Emit vzeroupper.  */
+       sljit_u8 *inst;
+       inst = (sljit_u8*)ensure_buf(compiler, 1 + 3);
+       FAIL_IF(!inst);
+       INC_SIZE(3);
+       inst[0] = 0xc5;
+       inst[1] = 0xf8;
+       inst[2] = 0x77;
+       return SLJIT_SUCCESS;
+}
+
 #if (defined SLJIT_CONFIG_X86_CET && SLJIT_CONFIG_X86_CET) && defined (__SHSTK__)

 static SLJIT_INLINE sljit_s32 emit_rdssp(struct sljit_compiler *compiler, sljit_s32 reg)

chen-hu-97 · 2023-12-07T10:16:47Z

I am not sure if this is the root cause. But I think at least JIT need support vzeroupper when AVX come in.

Also, the perf metric tma_avx_assists which reflects the penalties becomes better once we emit vzeroupper, from 60% -> 11%. For SSE version, it's 0%. So we still have space to improve?

zherczeg · 2023-12-07T10:30:30Z

Likely. I don't know much about this topic, so everything is new to me. I never heard of tma_avx_assists.

Is it true, that mixing vex prefixed and non-vex prefixed instructions have a penalty, even if they only use 128 bit registers?
Mixing 128 and 256 bit operations has a hidden penalty, because the cpu tries to preserve the upper 128 bits? It is surprising that the upper parts are not implemented as separate registers internally, or not zeroed automatically after a 128 bit operation.

It seems the registers cannot be specified for vzeroupper. I think the abi on windows only requires to preserve the lower 128 bits of saved float registers.

chen-hu-97 · 2023-12-07T10:41:52Z

Edited:
Based on my current understanding,

Is it true, that mixing vex prefixed and non-vex prefixed instructions have a penalty, even if they only use 128 bit registers?
Maybe no penalty if the upper 128bits are not touched.

Mixing 128 and 256 bit operations has a hidden penalty, because the cpu tries to preserve the upper 128 bits? It is
Yes

If I have time tomorrow, I will search some Intel manual or Agner Fog's doc and share the snippet.

chen-hu-97 · 2023-12-08T07:17:16Z

Refer to "15.3 MIXING AVX CODE WITH SSE CODE" Intel Software Optimization manual, https://cdrdv2-public.intel.com/671488/248966-046A-software-optimization-manual.pdf :

"If software inter-mixes AVX and SSE instructions without using VZEROUPPER properly, it can experience an AVX/SSE transition penalty."
"To enable fast transitions between 256-bit Intel AVX and Intel SSE code blocks, use the VZEROUPPER instruction before and after an AVX code block that would need to switch to execute SSE code."

Next steps:

Hi @zherczeg, can you kindly test if the PoC patch helps on your side if you have time? I tested it on two machines on my side and it works.
In future, maybe we need bunch AVX2 instructions in JITed code as much as possible. So we can vzeroupper before and after the AVX2 block.
How about we submit a PR to add support for vzeroupper at first?

zherczeg · 2023-12-08T08:40:12Z

Thank you very much for your help!

This is very interesting:

Size of input: 20045118
Pattern: xz
  average interpreter runtime: 0.004 ms (matched: 0)
  average jit runtime: 0.002 ms (matched: 0)
Pattern: ab
  average interpreter runtime: 0.064 ms (matched: 23832)
  average jit runtime: 0.004 ms (matched: 23832)
Pattern: ho
  average interpreter runtime: 0.042 ms (matched: 58736)
  average jit runtime: 0.006 ms (matched: 58736)
Pattern: how to
  average interpreter runtime: 0.037 ms (matched: 395)
  average jit runtime: 0.002 ms (matched: 395)

Vzeroupper helped in the "how to" case, but not in the other cases. Regardless, it is worth adding this instruction.

I thought the avx is just an alias to sse when 128 bit registers are used. I can modify the jit compiler to always generate avx instructions when avx is available. This will be a longer term work though.

stkeke · 2023-12-08T09:02:17Z

If avx can reduce instruction path length and save cycles, it's worthy, considering the popularity and volume of installation of this library.

…

________________________________ From: Zoltan Herczeg ***@***.***> Sent: Friday, December 8, 2023 4:40:23 PM To: PCRE2Project/pcre2 ***@***.***> Cc: Su, Tao ***@***.***>; Mention ***@***.***> Subject: Re: [PCRE2Project/pcre2] [Question] Why no one use AVX instruction to optimize pcre library? (Issue #263) Thank you very much for your help! This is very interesting: Size of input: 20045118 Pattern: xz average interpreter runtime: 0.004 ms (matched: 0) average jit runtime: 0.002 ms (matched: 0) Pattern: ab average interpreter runtime: 0.064 ms (matched: 23832) average jit runtime: 0.004 ms (matched: 23832) Pattern: ho average interpreter runtime: 0.042 ms (matched: 58736) average jit runtime: 0.006 ms (matched: 58736) Pattern: how to average interpreter runtime: 0.037 ms (matched: 395) average jit runtime: 0.002 ms (matched: 395) Vzeroupper helped in the "how to" case, but not in the other cases. Regardless, it is worth adding this instruction. I thought the avx is just an alias to sse when 128 bit registers are used. I can modify the jit compiler to always generate avx instructions when avx is available. This will be a longer term work though. — Reply to this email directly, view it on GitHub<#263 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ABRT7UY3ZAFFX66SVSFWEOLYILG7PAVCNFSM6AAAAAAY5GZXFGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNBWG43TIOBYGU>. You are receiving this because you were mentioned.Message ID: ***@***.***>

chen-hu-97 · 2023-12-08T09:03:44Z

The perf_test.c on https://gist.github.com/zherczeg/43d1266dbc73df8f4223c9416156278b only test the "how to" pattern?
If you can provide the reference code for pattern "xz", "ab" and "ho", I can also debug to find the bottleneck in following weeks. I think they are still related to vzeroupper. My PoC patch place vzeroupper in a special place. (It doesn't help if I put it to another place).

In future, if the JITed code is well orgnized to pack all AVX2 together, that will be easy, we just place vzeroupper before/after the AVX2 block.

zherczeg · 2023-12-08T10:04:05Z

You can add any pcre2 pattern to test_cases:

static struct performance_test_case test_cases[] = {
        { PCRE2_MULTILINE, "xz" },
        { PCRE2_MULTILINE, "ab" },
        { PCRE2_MULTILINE, "ho" },
        { PCRE2_MULTILINE, "how to" },
        { 0, NULL }
};

The first member contains the compile flags. The last pattern is marked with a NULL.

zherczeg · 2023-12-08T10:05:55Z

I have also noticed that vzeroupper needs to be placed there. I will try if zeroing with pxor would help as well.

chen-hu-97 · 2023-12-11T08:25:19Z

I check avx/sse mix penalty for pattern "ab" and "ho" after appling our PoC patch. It's 59.2% (almost same as without the patch) and is too high.

$ sudo perf stat -M TopdownL5  ./perf_test_avx
1,059,129      ASSISTS.SSE_AVX_MIX              #     59.2 %  tma_mixing_vectors       (10.86%)

To run this command, you need update your Linux kernel.

The function sljit_emit_simd_sign will emit below code:

   vpcmpeqb %ymm2,%ymm0,%ymm0
   vpcmpeqb %ymm3,%ymm1,%ymm1
   vpand  %ymm1,%ymm0,%ymm0
   vpmovmskb %ymm0,%eax

So I add vzeroupper after it (zero upper after AVX2 block):

diff --git a/src/pcre2_jit_simd_inc.h b/src/pcre2_jit_simd_inc.h
index 793da05..9ec6fe7 100644
--- a/src/pcre2_jit_simd_inc.h
+++ b/src/pcre2_jit_simd_inc.h
@@ -711,7 +711,7 @@ instruction[3] = 0xc0 | (data1_ind << 3) | data2_ind;
 sljit_emit_op_custom(compiler, instruction, 4);

 sljit_emit_simd_sign(compiler, SLJIT_SIMD_STORE | reg_type | SLJIT_SIMD_ELEM_8, SLJIT_FR0, TMP1, 0);
-
+emit_vzeroupper(compiler);
 CMPTO(SLJIT_ZERO, TMP1, 0, SLJIT_IMM, 0, start);

 JUMPHERE(jump[0]);
diff --git a/src/sljit/sljitNativeX86_common.c b/src/sljit/sljitNativeX86_common.c
index cc330d4..7720f92 100644
--- a/src/sljit/sljitNativeX86_common.c
+++ b/src/sljit/sljitNativeX86_common.c
@@ -980,6 +980,19 @@ static SLJIT_INLINE sljit_s32 emit_endbranch(struct sljit_compiler *compiler)
        return SLJIT_SUCCESS;
 }

+static SLJIT_INLINE sljit_s32 emit_vzeroupper(struct sljit_compiler *compiler)
+{
+       /* Emit vzeroupper.  */
+       sljit_u8 *inst;
+       inst = (sljit_u8*)ensure_buf(compiler, 1 + 3);
+       FAIL_IF(!inst);
+       INC_SIZE(3);
+       inst[0] = 0xc5;
+       inst[1] = 0xf8;
+       inst[2] = 0x77;
+       return SLJIT_SUCCESS;
+}
+
 #if (defined SLJIT_CONFIG_X86_CET && SLJIT_CONFIG_X86_CET) && defined (__SHSTK__)

 static SLJIT_INLINE sljit_s32 emit_rdssp(struct sljit_compiler *compiler, sljit_s32 reg)

This, however, breaks the functionality but do benefit the ASSISTS.SSE_AVX_MIX. So for long term, people may optimzie the JIT engine to pack AVX2 code together.

Size of input: 20045118
Pattern: ho
  WARNING: Number of matches differs! (jit:32671 interpreter:0)
  average jit runtime: 0.003 ms

chen-hu-97 · 2023-12-11T08:28:40Z

I said it breaks the functionality because the matches should be 58736:

Size of input: 20045118
Pattern: ho
  WARNING: Number of matches differs! (jit:58736 interpreter:0)
  average jit runtime: 0.012 ms

stkeke closed this as completed Jun 8, 2023

zherczeg reopened this Nov 19, 2023

[Question] Why no one use AVX instruction to optimize pcre library? #263

[Question] Why no one use AVX instruction to optimize pcre library? #263

Comments

stkeke commented Jun 7, 2023

zherczeg commented Jun 7, 2023

stkeke commented Jun 8, 2023

zherczeg commented Nov 10, 2023

zherczeg commented Nov 12, 2023 • edited

stkeke commented Nov 12, 2023 • edited

zherczeg commented Nov 12, 2023

zherczeg commented Nov 12, 2023 • edited

chen-hu-97 commented Nov 13, 2023

zherczeg commented Nov 13, 2023

stkeke commented Nov 13, 2023 via email

chen-hu-97 commented Nov 15, 2023

zherczeg commented Nov 15, 2023 • edited

zherczeg commented Nov 15, 2023

stkeke commented Nov 18, 2023 via email

zherczeg commented Nov 19, 2023

zherczeg commented Nov 20, 2023 • edited

zherczeg commented Nov 20, 2023

stkeke commented Nov 22, 2023

zherczeg commented Nov 22, 2023

chen-hu-97 commented Dec 5, 2023

AVX

SSE

chen-hu-97 commented Dec 5, 2023

zherczeg commented Dec 5, 2023

zherczeg commented Dec 5, 2023

stkeke commented Dec 6, 2023

zherczeg commented Dec 6, 2023

chen-hu-97 commented Dec 6, 2023

chen-hu-97 commented Dec 6, 2023

zherczeg commented Dec 6, 2023

chen-hu-97 commented Dec 7, 2023 • edited

chen-hu-97 commented Dec 7, 2023

zherczeg commented Dec 7, 2023

chen-hu-97 commented Dec 7, 2023 • edited

chen-hu-97 commented Dec 8, 2023

zherczeg commented Dec 8, 2023

stkeke commented Dec 8, 2023 via email

chen-hu-97 commented Dec 8, 2023

zherczeg commented Dec 8, 2023

zherczeg commented Dec 8, 2023

chen-hu-97 commented Dec 11, 2023

chen-hu-97 commented Dec 11, 2023

zherczeg commented Nov 12, 2023 •

edited

stkeke commented Nov 12, 2023 •

edited

zherczeg commented Nov 12, 2023 •

edited

zherczeg commented Nov 15, 2023 •

edited

zherczeg commented Nov 20, 2023 •

edited

chen-hu-97 commented Dec 7, 2023 •

edited

chen-hu-97 commented Dec 7, 2023 •

edited