
Vectorized FFT #223

Merged
merged 5 commits into main from jakub/vectorized-fft on Sep 12, 2021
Conversation

nbgl
Member

@nbgl nbgl commented Sep 4, 2021

Vectorized version of FFT. Pretty much a direct translation of the old code. I worry that the vector stuff gets in the way of readability; I welcome suggestions for fixing that.

@nbgl nbgl requested review from unzvfu and dlubarov and removed request for unzvfu September 4, 2021 04:36
src/field/fft.rs Outdated
// We've already done the first log_packed_width iterations (if any were required).
let s = max(r, log_packed_width) + 1;

for lg_m in s..=lg_n {
Member

This is minor but maybe we could tweak the second loop to emphasize how it's a continuation of the first:

  • let s = max(r, log_packed_width);
  • for lg_half_m in s..lg_n { ... }
  • rename i -> lg_half_m in the first loop, or we could just call both i if you think lg_half_m is long/confusing

@dlubarov
Member

dlubarov commented Sep 8, 2021

Yeah, it's fairly tough to read, but I think it makes sense, and I can't think of any significant improvements. Do you expect a significant speedup on certain (e.g. AVX-512) hardware? If so, I think it seems OK. Would like to get @unzvfu's thoughts also, though.

What do you think of having dedicated functions for the first several layers? It might improve readability in some ways, though of course there'd be some more redundancy, so I'm not really sure if it's a good tradeoff. Could also lead to some optimizations, e.g. roots could just be immediate values in the first few layers.

@nbgl
Member Author

nbgl commented Sep 8, 2021

The speed improvement is fairly significant even without AVX-512. On my Skylake, which only has AVX2, vectorization increases the throughput of that hot loop by 1.4× (figure from my profiles of the fft benchmarks). So even though AVX2 doesn’t increase the throughput of multiplication per se (see #1), there are other effects:

  1. 4× fewer loads/stores/bounds checks
  2. AVX2 multiplication cannot issue 4 instructions/cycle (the maximum on Skylake) because there are only 3 vector ALUs. So we get the same multiplication throughput while doing loads/stores/bounds checks/conditional jumps for free
  3. Vector addition and subtraction are faster than their scalar counterparts

Of course, I expect the improvement to be bigger still on AVX-512.

@unzvfu
Contributor

unzvfu commented Sep 12, 2021

FWIW this is looking good to me. The vectorised implementation is pretty clear (or at least not much harder to read than the original).

@nbgl nbgl merged commit a8d08aa into main Sep 12, 2021
@nbgl nbgl deleted the jakub/vectorized-fft branch September 12, 2021 23:54