Performance issues for avx512 on Skylake X nodes. #143

Open
bartoldeman opened this issue May 8, 2018 · 7 comments

Comments

@bartoldeman

In testing with --enable-avx512, we (@boegel and I) found that for one simple example:

curl -OL http://micro.stanford.edu/mediawiki/images/a/a9/Simple_example.tar
tar xfv Simple_example.tar
cd simple_example
sed -i'' 's/\(N[01] =\) [0-9]*/\1 16384/g' simple_example.c
gcc -O2 -march=native simple_example.c -lfftw3 -lm -o simple_example

the AVX-512 build of FFTW 3.3.7 is consistently slower than the AVX2 build by about a factor of 1.75 (tested on multiple Skylake varieties).
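
For reference, a minimal sketch of such a benchmark (my assumption of roughly what simple_example.c does after the sed edit above: a single 2D complex transform with N0 = N1 = 16384; not the exact contents of Simple_example.tar), built the same way with gcc ... -lfftw3 -lm:

#include <fftw3.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N0 16384
#define N1 16384

int main(void)
{
    /* Two 16k x 16k complex-double arrays need roughly 8 GiB of RAM. */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N0 * N1);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N0 * N1);

    /* FFTW_ESTIMATE picks a plan heuristically, without measuring. */
    fftw_plan p = fftw_plan_dft_2d(N0, N1, in, out,
                                   FFTW_FORWARD, FFTW_ESTIMATE);

    for (size_t i = 0; i < (size_t)N0 * N1; i++) {
        in[i][0] = (double)rand() / RAND_MAX;
        in[i][1] = 0.0;
    }

    clock_t t0 = clock();
    fftw_execute(p);
    printf("transform took %.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}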

See easybuilders/easybuild-easyblocks#1416 (comment) for more details.
It seems that the AVX-512 operations used by FFTW are simply more expensive on their own, and that this is not due to CPU frequency issues.

Is this just because FFTW's AVX-512 support was written before these chips were available, so it could not be benchmarked at the time?

@matteo-frigo
Member

Could well be. We have never tested FFTW on Skylake (I did some earlier tests on Xeon Phi or Mic or Larrabee or whatever it was called at the time).

Can you run

$ ./tests/bench -v2 -oestimate i16kx16k
$ ./tests/bench -v2 i16kx16k

in both configurations? It may well be that the -oestimate mode is totally suboptimal for avx512 (it is probably totally suboptimal for a 16kx16k transform to begin with, irrespective of avx512).

@boegel

boegel commented May 8, 2018

Test setup

  • CentOS 7.4.1708
  • Intel Xeon Gold 6140 CPU @ 2.30GHz
  • GCC 6.4.0 + binutils 2.28 (+ OpenMPI 2.1.2, but that's irrelevant here imho)
    • compiler flags ($CFLAGS & co): -O2 -ftree-vectorize -march=native -fno-math-errno -fPIC
  • FFTW 3.3.7, configured with --enable-avx --enable-avx2 --enable-sse2

Results without --enable-avx512 configure option

$ ./tests/bench -v2 -oestimate i16kx16k
planner time: 0.000239 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/32
      (dftw-direct-32/8 "t3fv_32_avx2_128")
      (dft-buffered-512-x32/32-6
        (dft-vrank>=1-x32/1
          (dft-ct-dit/16
            (dftw-direct-16/8 "t3fv_16_avx2_128")
            (dft-direct-32-x16 "n2fv_32_avx2_128")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/1024-x32))
        (dft-nop))))
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/32
      (dftw-direct-32/8 "t3fv_32_avx2_128")
      (dft-buffered-512-x32/32-6
        (dft-vrank>=1-x32/1
          (dft-ct-dit/16
            (dftw-direct-16/8 "t3fv_16_avx2_128")
            (dft-direct-32-x16 "n2fv_32_avx2_128")))
        (dft-r2hc-1
          (rdft-rank0-tiledbuf/2-x32-x512))
        (dft-nop)))))
flops: 9831448576 add, 4831838208 mul, 671088640 fma
estimated cost: 18153083429.904301, pcost = 0.000000
Problem: i16kx16k, setup: 578.00 us, time: 22.54 s, ``mflops'': 1667.3
Took 8 measurements for at least 10.00 ms each.
Time: min 22.54 s, max 23.13 s, avg 22.73 s, median 22.79 s
$ ./tests/bench -v2 i16kx16k
planner time: 38.075 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/8
      (dftw-direct-8/28 "t1fv_8_avx2")
      (dft-buffered-2048-x8/8-6
        (dft-vrank>=1-x8/1
          (dft-ct-dit/32
            (dftw-direct-32/248 "t2fv_32_avx")
            (dft-direct-64-x32 "n1fv_64_avx2")))
        (dft-r2hc-1
          (rdft-rank0-iter-ci/4096-x8))
        (dft-nop))))
  (indirect-transpose
    (dft-r2hc-1
      (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384))
    (dft-indirect-after
      (dft-vrank>=1-x16384/1
        (dft-ct-dit/8
          (dftw-direct-8/56 "t2fv_8_avx")
          (dft-ct-dif/8
            (dftw-directsq-8/28-x8 "q1fv_8_avx")
            (dft-vrank>=1-x8/1
              (dft-vrank>=1-x8/1
                (dft-ct-dit/2
                  (dftw-direct-2/4 "t1fuv_2_avx2")
                  (dft-ct-dif/2
                    (dftw-directsq-2/4-x2 "q1fv_2_avx")
                    (dft-vrank>=1-x2/1
                      (dft-direct-64-x2 "n1fv_64_avx2")))))))))
      (dft-r2hc-1
        (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384)))
    (dft-nop)))
flops: 4601151488 add, 1744830464 mul, 285212672 fma
estimated cost: 10141458074.264620, pcost = 0.000000
Problem: i16kx16k, setup: 38.08 s, time: 6.31 s, ``mflops'': 5951.1
Took 8 measurements for at least 10.00 ms each.
Time: min 6.31 s, max 6.33 s, avg 6.32 s, median 6.32 s

Results with --enable-avx512 configure option

$ ./tests/bench -v2 -oestimate i16kx16k
planner time: 0.000975 s
(dft-rank>=2/1
  (dft-buffered-16384-x2/16384-6
    (dft-vrank>=1-x2/1
      (dft-ct-dit/8
        (dftw-direct-8/56 "t1fuv_8_avx512")
        (dft-vrank>=1-x8/1
          (dft-ct-dit/8
            (dftw-direct-8/56 "t1fuv_8_avx512")
            (dft-vrank>=1-x8/1
              (dft-ct-dit/8
                (dftw-direct-8/56 "t1fuv_8_avx512")
                (dft-direct-32-x8 "n1fv_32_avx512")))))))
    (dft-r2hc-1
      (rdft-rank0-iter-ci/32768-x2))
    (dft-nop))
  (dft-vrank>=1-x16384/1
    (dft-ct-dit/8
      (dftw-direct-8/56 "t1fuv_8_avx512")
      (dft-ct-dif/8
        (dftw-directsq-8/56-x8 "q1fv_8_avx512")
        (dft-vrank>=1-x8/1
          (dft-vrank>=1-x8/1
            (dft-ct-dit/8
              (dftw-direct-8/56 "t1fuv_8_avx512")
              (dft-ct-dif/8
                (dftw-directsq-8/56-x8 "q1fv_8_avx512")
                (dft-vrank>=1-x8/1
                  (dft-direct-4-x8 "n1fv_4_avx512"))))))))))
flops: 2428502016 add, 994050048 mul, 33554432 fma
estimated cost: 4567657371.512790, pcost = 0.000000
Problem: i16kx16k, setup: 1.35 ms, time: 45.19 s, ``mflops'': 831.67
Took 8 measurements for at least 10.00 ms each.
Time: min 45.19 s, max 45.49 s, avg 45.40 s, median 45.43 s
$ ./tests/bench -v2 i16kx16k
planner time: 38.1632 s
(dft-rank>=2/1
  (dft-vrank>=1-x16384/1
    (dft-buffered-16384/1-0
      (dft-ct-dit/32
        (dftw-direct-32/32 "t3fv_32_avx512")
        (dft-vrank>=1-x32/1
          (dft-ct-dit/64
            (dftw-direct-64/1008 "t2fv_64_avx512")
            (dft-direct-8-x64 "n1fv_8_avx"))))
      (dft-r2hc-1
        (rdft-rank0-memcpy/32768))
      (dft-nop)))
  (indirect-transpose
    (dft-r2hc-1
      (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384))
    (dft-indirect-after
      (dft-vrank>=1-x16384/1
        (dft-ct-dit/8
          (dftw-direct-8/56 "t1fv_8_avx512")
          (dft-ct-dif/8
            (dftw-directsq-8/28-x8 "q1fv_8_avx2")
            (dft-vrank>=1-x8/1
              (dft-vrank>=1-x8/1
                (dft-ct-dit/2
                  (dftw-direct-2/4 "t1fuv_2_avx")
                  (dft-ct-dif/2
                    (dftw-directsq-2/4-x2 "q1fv_2_avx")
                    (dft-vrank>=1-x2/1
                      (dft-direct-64-x2 "n1fv_64_avx2")))))))))
      (dft-r2hc-1
        (rdft-rank0-ip-sq-tiledbuf/2-x16384-x16384)))
    (dft-nop)))
flops: 3484418048 add, 1361051648 mul, 197132288 fma
estimated cost: 8464785050.264620, pcost = 0.000000
Problem: i16kx16k, setup: 38.16 s, time: 6.01 s, ``mflops'': 6250
Took 8 measurements for at least 10.00 ms each.
Time: min 6.01 s, max 6.02 s, avg 6.02 s, median 6.02 s

@mboisson

mboisson commented Oct 2, 2018

Is anything in development to fix this issue?

rdolbeau added a commit to rdolbeau/fftw3 that referenced this issue Mar 11, 2020
…assemble/disassemble the vector in 128-bit chunks.

This is faster on Skylake, though it might not be on Knights Landing (I don't have access to one to test anymore), so I've added an --enable-avx512-scattergather option to retain the old behavior.
This should help with FFTW#143.
rdolbeau added a commit to rdolbeau/fftw3 that referenced this issue Jun 21, 2022
…assemble/disassemble the vector in 128-bit chunks.

This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and enable compiling/using AVX512 on KNL.
This should help with FFTW#143.
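
For illustration only (this is not FFTW's actual code): the change described in these commits amounts to replacing strided gather loads with vectors assembled from 128-bit pieces. A hedged sketch with AVX-512 intrinsics, assuming double precision and a simple strided access pattern:

#include <immintrin.h>

/* Strided load of 8 doubles using a gather instruction. */
static inline __m512d load_strided_gather(const double *p, int stride)
{
    __m256i idx = _mm256_setr_epi32(0, stride, 2 * stride, 3 * stride,
                                    4 * stride, 5 * stride, 6 * stride,
                                    7 * stride);
    return _mm512_i32gather_pd(idx, p, 8);      /* scale = sizeof(double) */
}

/* Same load assembled from four 128-bit (two-double) chunks.
 * _mm512_insertf64x2 requires AVX512DQ, which Knights Landing lacks;
 * compile with e.g. gcc -O2 -mavx512f -mavx512dq. */
static inline __m512d load_strided_chunks(const double *p, int stride)
{
    __m512d v = _mm512_castpd128_pd512(_mm_set_pd(p[stride], p[0]));
    v = _mm512_insertf64x2(v, _mm_set_pd(p[3 * stride], p[2 * stride]), 1);
    v = _mm512_insertf64x2(v, _mm_set_pd(p[5 * stride], p[4 * stride]), 2);
    v = _mm512_insertf64x2(v, _mm_set_pd(p[7 * stride], p[6 * stride]), 3);
    return v;
}
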
@ruixingw

ruixingw commented Jul 9, 2022

Any progress on this? I guess many projects have been using the FFTW AVX-512 version during these years.

@AngryLoki

Can somebody rerun these commands on Skylake X?

$ ./tests/bench -v2 -oestimate i16kx16k
$ ./tests/bench -v2 i16kx16k

On a Ryzen 9 7950X3D (Zen 4) the speedup is significant:

  • $ ./tests/bench -v2 -oestimate i16kx16k
    • with avx2: Time: min 22.24 s, max 22.50 s, avg 22.31 s, median 22.26 s (1690 mflops)
    • with avx512: Time: min 13.09 s, max 13.15 s, avg 13.10 s, median 13.10 s (2871 mflops)
  • $ ./tests/bench -v2 i16kx16k
    • with avx2: Time: min 1.74 s, max 1.74 s, avg 1.74 s, median 1.74 s (21649 mflops)
    • with avx512: Time: min 1.51 s, max 1.51 s, avg 1.51 s, median 1.51 s (24948 mflops)

It is sad that some library users and distro maintainers are afraid to enable this flag globally because someone had performance issues back in 2018.

@matteo-frigo
Member

What do you mean by "this flag"?

This has always been the pact with the devil in the FFTW design: You either spend the time planning, in which case FFTW figures out a reasonably good way to solve the problem, or you don't (-oestimate), in which case FFTW employs a heuristic that minimizes the total number of floating-point operations. What exactly are you proposing?
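
For reference, a sketch of assumed usage (not from this thread): -oestimate in the bench tool corresponds to FFTW_ESTIMATE in the API, while letting the planner measure corresponds to FFTW_MEASURE, whose one-time cost can be amortized by caching wisdom:

#include <fftw3.h>

fftw_plan make_measured_plan(fftw_complex *in, fftw_complex *out)
{
    /* Reuse plans measured in a previous run, if a wisdom file exists. */
    fftw_import_wisdom_from_filename("fftw.wisdom");

    /* FFTW_MEASURE times candidate plans; this is the ~38 s "planner time"
     * in the logs above, but it only has to be paid once per problem. */
    fftw_plan p = fftw_plan_dft_2d(16384, 16384, in, out,
                                   FFTW_FORWARD, FFTW_MEASURE);

    fftw_export_wisdom_to_filename("fftw.wisdom");
    return p;
}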

@AngryLoki

AngryLoki commented Jan 3, 2024

I'm talking about --enable-avx512 in places like https://gitweb.gentoo.org/repo/gentoo.git/tree/sci-libs/fftw/fftw-3.3.10.ebuild#n80

The reason I'm asking to reproduce the issue on Skylake X is that it may not even have been an issue with FFTW back then, because in 2018 AVX-512 support in compilers was in an embryonic state.

Update: Clear Linux has had --enable-avx512 in https://github.com/clearlinux-pkgs/fftw/blob/master/fftw.spec#L130-L131 for quite some time now, so I suppose it is out of the experimental stage for their maintainers.
