Performance issues for avx512 on Skylake X nodes. #143
Comments
Could well be. We have never tested FFTW on Skylake (I did some earlier tests on Xeon Phi or MIC or Larrabee or whatever it was called at the time). Can you run $ ./tests/bench -v2 -oestimate i16kx16k in both configurations? It may well be that the -oestimate mode is totally suboptimal for avx512 (it is probably totally suboptimal for a 16kx16k transform to begin with, irrespective of avx512).
Test setup
Results without
Is anything in development to fix this issue?
…assemble/disassemble the vector in 128-bit chunks. This is faster on Skylake, though it might not be on Knights Landing (I no longer have access to one to test), so I've added an --enable-avx512-scattergather option to retain the old behavior. This should help with FFTW#143.
…assemble/disassemble the vector in 128-bit chunks. This is faster on Skylake, but will not work on Knights Landing (as KNL lacks AVX512DQ), so I've added an --enable-avx512-scattergather option to retain the old behavior and to allow compiling and using AVX512 on KNL. This should help with FFTW#143.
Any progress on this? I guess many projects have been using the avx512 version of FFTW over the years.
Can somebody rerun these commands on Skylake X?
On a Ryzen 9 7950X3D (Zen 4) the speedup is significant:
It is sad that some library users and distro maintainers are afraid to globally enable this flag because someone in 2018 had performance issues.
What do you mean by "this flag"? This has always been the pact with the devil in the FFTW design: You either spend the time planning, in which case FFTW figures out a reasonably good way to solve the problem, or you don't (-oestimate), in which case FFTW employs a heuristic that minimizes the total number of floating-point operations. What exactly are you proposing?
I'm talking about Why I'm asking to reproduce the issue on Skylake X: maybe back then it was not even an issue with FFTW, because in 2018 AVX-512 support in compilers was in an embryonic state. Upd: Clear Linux has
In testing with --enable-avx512 we (@boegel and I) found that, for one simple example, the avx512 FFTW 3.3.7 is consistently (tested on multiple Skylake varieties) slower than the avx2 FFTW 3.3.7 by about a factor of 1.75.
See here for more details:
easybuilders/easybuild-easyblocks#1416 (comment)
It seems that the avx512 ops used by FFTW are simply more expensive on their own, and that this is not because of CPU frequency issues.
Is this just because FFTW's avx512 support was written before these chips were available, so it could not be benchmarked at the time?