Performance issues for avx512 on Skylake X nodes. #143

Open
bartoldeman opened this issue May 8, 2018 · 7 comments

Comments

@bartoldeman

In testing with --enable-avx512, we (@boegel and I) found that for one simple example:

curl -OL http://micro.stanford.edu/mediawiki/images/a/a9/Simple_example.tar
tar xfv Simple_example.tar
cd simple_example
sed -i'' 's/\(N[01] =\) [0-9]*/\1 16384/g' simple_example.c
gcc -O2 -march=native simple_example.c -lfftw3 -lm -o simple_example

the AVX-512 build of FFTW 3.3.7 is consistently slower than the AVX2 build by about a factor of 1.75 (tested on multiple Skylake varieties).
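
For reference, a minimal sketch of such a benchmark (my assumption of roughly what simple_example.c does after the sed edit above: a single 2D complex transform with N0 = N1 = 16384; not the exact contents of Simple_example.tar), built the same way with gcc ... -lfftw3 -lm:

#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N0 16384
#define N1 16384

int main(void)
{
    /* Two 16k x 16k complex-double arrays need roughly 8 GiB of RAM. */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N0 * N1);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N0 * N1);

    /* FFTW_ESTIMATE picks a plan heuristically, without measuring. */
    fftw_plan p = fftw_plan_dft_2d(N0, N1, in, out,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

    for (size_t i = 0; i < (size_t)N0 * N1; i++) {
        in[i][0] = (double)rand() / RAND_MAX;
        in[i][1] = 0.0;
    }

    clock_t t0 = clock();
    fftw_execute(p);
    printf("transform took %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}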

See easybuilders/easybuild-easyblocks#1416 (comment) for more details.
It seems that the AVX-512 operations used by FFTW are simply more expensive on their own, and that this is not due to CPU frequency issues.

Is this just because FFTW's AVX-512 support was written before these chips were available, so it could not be benchmarked at the time?

@matteo-frigo
Member

Could well be. We have never tested FFTW on Skylake (I did some earlier tests on Xeon Phi or Mic or Larrabee or whatever it was called at the time).

Can you run

$ ./tests/bench -v2 -oestimate i16kx16k
$ ./tests/bench -v2 i16kx16k

in both configurations? It may well be that the -oestimate mode is totally suboptimal for avx512 (it is probably totally suboptimal for a 16kx16k transform to begin with, irrespective of avx512).

@boegel

boegel commented May 8, 2018

Test setup

  • CentOS 7.4.1708
  • Intel Xeon Gold 6140 CPU @ 2.30GHz
  • GCC 6.4.0 + binutils 2.28 (+ OpenMPI 2.1.2, but that's irrelevant here imho)
    • compiler flags ($CFLAGS & co): -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC
  • FFTW 3.3.7, configured with --enable-avx --enable-avx2 --enable-sse2

Results without --enable-avx512 configure option

$ ./tests/bench -v2 -oestimate i16kx16k
planner time: 0.000239 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/32
      (dftw-direct-32/8 "t3fv_32_avx2_128")
      (dft-buffered-512-x32/32-6
        (dft-vrank>=1-x32/1
          (dft-ct-dit/16
            (dftw-direct-16/8 "t3fv_16_avx2_128")
            (dft-direct-32-x16 "n2fv_32_avx2_128")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/1024-x32))
        (dft-nop))))
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/32
      (dftw-direct-32/8 "t3fv_32_avx2_128")
      (dft-buffered-512-x32/32-6
        (dft-vrank>=1-x32/1
          (dft-ct-dit/16
            (dftw-direct-16/8 "t3fv_16_avx2_128")
            (dft-direct-32-x16 "n2fv_32_avx2_128")))
        (dft-r2hc-1
          (rdft-rank0-tiledbuf/2-x32-x512))
        (dft-nop)))))
flops: 9831448576 add, 4831838208 mul, 671088640 fma
estimated cost: 18153083429.904301, pcost = 0.000000
Problem: i16kx16k, setup: 578.00 us, time: 22.54 s, ``mflops'': 1667.3
Took 8 measurements for at least 10.00 ms each.
Time: min 22.54 s, max 23.13 s, avg 22.73 s, median 22.79 s
$ ./tests/bench -v2 i16kx16k
planner time: 38.075 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/8
      (dftw-direct-8/28 "t1fv_8_avx2")
      (dft-buffered-2048-x8/8-6
        (dft-vrank>=1-x8/1
          (dft-ct-dit/32
            (dftw-direct-32/248 "t2fv_32_avx")
            (dft-direct-64-x32 "n1fv_64_avx2")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/4096-x8))
        (dft-nop))))
  (indirect-transpose
    (dft-r2hc-1
      (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384))
    (dft-indirect-after
      (dft-vrank>=1-x16384/1
        (dft-ct-dit/8
          (dftw-direct-8/56 "t2fv_8_avx")
          (dft-ct-dif/8
            (dftw-directsq-8/28-x8 "q1fv_8_avx")
            (dft-vrank>=1-x8/1
              (dft-vrank>=1-x8/1
                (dft-ct-dit/2
                  (dftw-direct-2/4 "t1fuv_2_avx2")
                  (dft-ct-dif/2
                    (dftw-directsq-2/4-x2 "q1fv_2_avx")
                    (dft-vrank>=1-x2/1
                      (dft-direct-64-x2 "n1fv_64_avx2")))))))))
      (dft-r2hc-1
        (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384)))
    (dft-nop)))
flops: 4601151488 add, 1744830464 mul, 285212672 fma
estimated cost: 10141458074.264620, pcost = 0.000000
Problem: i16kx16k, setup: 38.08 s, time: 6.31 s, ``mflops'': 5951.1
Took 8 measurements for at least 10.00 ms each.
Time: min 6.31 s, max 6.33 s, avg 6.32 s, median 6.32 s

Results with --enable-avx512 configure option

$ ./tests/bench -v2 -oestimate i16kx16k
planner time: 0.000975 s
(dft-rank>=2/1
  (dft-buffered-16384-x2/16384-6
    (dft-vrank>=1-x2/1
      (dft-ct-dit/8
        (dftw-direct-8/56 "t1fuv_8_avx512")
        (dft-vrank>=1-x8/1
          (dft-ct-dit/8
            (dftw-direct-8/56 "t1fuv_8_avx512")
            (dft-vrank>=1-x8/1
              (dft-ct-dit/8
                (dftw-direct-8/56 "t1fuv_8_avx512")
                (dft-direct-32-x8 "n1fv_32_avx512")))))))
    (dft-r2hc-1
      (rdft-rank0-iter-ci/32768-x2))
    (dft-nop))
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/8
      (dftw-direct-8/56 "t1fuv_8_avx512")
      (dft-ct-dif/8
        (dftw-directsq-8/56-x8 "q1fv_8_avx512")
        (dft-vrank>=1-x8/1
          (dft-vrank>=1-x8/1
            (dft-ct-dit/8
              (dftw-direct-8/56 "t1fuv_8_avx512")
              (dft-ct-dif/8
                (dftw-directsq-8/56-x8 "q1fv_8_avx512")
                (dft-vrank>=1-x8/1
                  (dft-direct-4-x8 "n1fv_4_avx512"))))))))))
flops: 2428502016 add, 994050048 mul, 33554432 fma
estimated cost: 4567657371.512790, pcost = 0.000000
Problem: i16kx16k, setup: 1.35 ms, time: 45.19 s, ``mflops'': 831.67
Took 8 measurements for at least 10.00 ms each.
Time: min 45.19 s, max 45.49 s, avg 45.40 s, median 45.43 s
$ ./tests/bench -v2 i16kx16k
planner time: 38.1632 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-buffered-16384/1-0
      (dft-ct-dit/32
        (dftw-direct-32/32 "t3fv_32_avx512")
        (dft-vrank>=1-x32/1
          (dft-ct-dit/64
            (dftw-direct-64/1008 "t2fv_64_avx512")
            (dft-direct-8-x64 "n1fv_8_avx"))))
      (dft-r2hc-1
        (rdft-rank0-memcpy/32768))
      (dft-nop)))
  (indirect-transpose
    (dft-r2hc-1
      (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384))
    (dft-indirect-after
      (dft-vrank>=1-x16384/1
        (dft-ct-dit/8
          (dftw-direct-8/56 "t1fv_8_avx512")
          (dft-ct-dif/8
            (dftw-directsq-8/28-x8 "q1fv_8_avx2")
            (dft-vrank>=1-x8/1
              (dft-vrank>=1-x8/1
                (dft-ct-dit/2
                  (dftw-direct-2/4 "t1fuv_2_avx")
                  (dft-ct-dif/2
                    (dftw-directsq-2/4-x2 "q1fv_2_avx")
                    (dft-vrank>=1-x2/1
                      (dft-direct-64-x2 "n1fv_64_avx2")))))))))
      (dft-r2hc-1
        (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384)))
    (dft-nop)))
flops: 3484418048 add, 1361051648 mul, 197132288 fma
estimated cost: 8464785050.264620, pcost = 0.000000
Problem: i16kx16k, setup: 38.16 s, time: 6.01 s, ``mflops'': 6250
Took 8 measurements for at least 10.00 ms each.
Time: min 6.01 s, max 6.02 s, avg 6.02 s, median 6.02 s

@mboisson

mboisson commented Oct 2, 2018

Is anything in development to fix this issue?

rdolbeau added a commit to rdolbeau/fftw3 that referenced this issue Mar 11, 2020
…assemble/disassemble the vector in 128-bit chunks.

This is faster on Skylake, though it might not be on Knights Landing (I don't have access to one to test anymore), so I've added an --enable-avx512-scattergather option to retain the old behavior.
This should help with FFTW#143.
rdolbeau added a commit to rdolbeau/fftw3 that referenced this issue Jun 21, 2022
…assemble/disassemble the vector in 128-bit chunks.

This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and enable compiling/using AVX512 on KNL.
This should help with FFTW#143.
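
For illustration only (this is not FFTW's actual code): the change described in these commits amounts to replacing strided gather loads with vectors assembled from 128-bit pieces. A hedged sketch with AVX-512 intrinsics, assuming double precision and a simple strided access pattern:

#include <immintrin.h>

/* Strided load of 8 doubles using a gather instruction. */
static inline __m512d load_strided_gather(const double *p, int stride)
{
    __m256i idx = _mm256_setr_epi32(0, stride, 2 * stride, 3 * stride,
                                    4 * stride, 5 * stride, 6 * stride,
                                    7 * stride);
    return _mm512_i32gather_pd(idx, p, 8);      /* scale = sizeof(double) */
}

/* Same load assembled from four 128-bit (two-double) chunks.
 * _mm512_insertf64x2 requires AVX512DQ, which Knights Landing lacks;
 * compile with e.g. gcc -O2 -mavx512f -mavx512dq. */
static inline __m512d load_strided_chunks(const double *p, int stride)
{
    __m512d v = _mm512_castpd128_pd512(_mm_set_pd(p[stride], p[0]));
    v = _mm512_insertf64x2(v, _mm_set_pd(p[3 * stride], p[2 * stride]), 1);
    v = _mm512_insertf64x2(v, _mm_set_pd(p[5 * stride], p[4 * stride]), 2);
    v = _mm512_insertf64x2(v, _mm_set_pd(p[7 * stride], p[6 * stride]), 3);
    return v;
}
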
@ruixingw

ruixingw commented Jul 9, 2022

Any progress on this? I guess many projects have been using the FFTW AVX-512 version during these years.

@AngryLoki

Can somebody rerun these commands on Skylake X?

$ ./tests/bench -v2 -oestimate i16kx16k
$ ./tests/bench -v2 i16kx16k

On a Ryzen 9 7950X3D (Zen 4) the speedup is significant:

  • $ ./tests/bench -v2 -oestimate i16kx16k
    • with avx2: Time: min 22.24 s, max 22.50 s, avg 22.31 s, median 22.26 s (1690 mflops)
    • with avx512: Time: min 13.09 s, max 13.15 s, avg 13.10 s, median 13.10 s (2871 mflops)
  • $ ./tests/bench -v2 i16kx16k
    • with avx2: Time: min 1.74 s, max 1.74 s, avg 1.74 s, median 1.74 s (21649 mflops)
    • with avx512: Time: min 1.51 s, max 1.51 s, avg 1.51 s, median 1.51 s (24948 mflops)

It is sad that some library users and distro maintainers are afraid to enable this flag globally because someone had performance issues back in 2018.

@matteo-frigo
Member

What do you mean by "this flag"?

This has always been the pact with the devil in the FFTW design: You either spend the time planning, in which case FFTW figures out a reasonably good way to solve the problem, or you don't (-oestimate), in which case FFTW employs a heuristic that minimizes the total number of floating-point operations. What exactly are you proposing?
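
For reference, a sketch of assumed usage (not from this thread): -oestimate in the bench tool corresponds to FFTW_ESTIMATE in the API, while letting the planner measure corresponds to FFTW_MEASURE, whose one-time cost can be amortized by caching wisdom:

#include <fftw3.h>

fftw_plan make_measured_plan(fftw_complex *in, fftw_complex *out)
{
    /* Reuse plans measured in a previous run, if a wisdom file exists. */
    fftw_import_wisdom_from_filename("fftw.wisdom");

    /* FFTW_MEASURE times candidate plans; this is the ~38 s "planner time"
     * in the logs above, but it only has to be paid once per problem. */
    fftw_plan p = fftw_plan_dft_2d(16384, 16384, in, out,
                                   FFTW_FORWARD, FFTW_MEASURE);

    fftw_export_wisdom_to_filename("fftw.wisdom");
    return p;
}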

@AngryLoki

AngryLoki commented Jan 3, 2024

I'm talking about --enable-avx512 in places like https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/fftw/fftw-3.3.10.ebuild#n80

The reason I'm asking to reproduce the issue on Skylake X is that it may not even have been an issue with FFTW back then, because in 2018 AVX-512 support in compilers was in an embryonic state.

Update: Clear Linux has had --enable-avx512 in https://github.com/clearlinux-pkgs/fftw/blob/master/fftw.spec#L130-L131 for quite some time now, so I suppose it is out of the experimental stage for their maintainers.
