Vectorized FFT #223
Conversation
src/field/fft.rs
Outdated
```rust
// We've already done the first log_packed_width (if they were required) iterations.
let s = max(r, log_packed_width) + 1;

for lg_m in s..=lg_n {
```
This is minor, but maybe we could tweak the second loop to emphasize how it's a continuation of the first:

```rust
let s = max(r, log_packed_width);
for lg_half_m in s..lg_n { ... }
```

- rename `i` -> `lg_half_m` in the first loop, or we could just call both `i` if you think `lg_half_m` is long/confusing
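A minimal sketch of how the two loops would tile the layers under this suggestion (the names `r`, `log_packed_width`, and `lg_n` are taken from the diff; `layer_ranges` is a hypothetical helper added purely for illustration):

```rust
use std::cmp::max;

/// Returns the index ranges covered by the two loops. Because both ranges
/// are half-open, the second loop picks up exactly where the first stops:
/// together they visit each layer index `lg_half_m` in 0..lg_n exactly once.
fn layer_ranges(
    r: usize,
    log_packed_width: usize,
    lg_n: usize,
) -> (std::ops::Range<usize>, std::ops::Range<usize>) {
    let s = max(r, log_packed_width);
    (0..s, s..lg_n) // first loop: initial layers; second loop: the continuation
}
```

With `r = 1`, `log_packed_width = 2`, `lg_n = 10`, this yields `0..2` and `2..10`, which chain back into `0..10` with no gap or overlap.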
Yeah, it's fairly tough to read, but I think it makes sense, and I can't think of any significant improvements. Do you expect a significant speedup on certain (e.g. AVX-512) hardware? If so, I think it seems OK. Would like to get @unzvfu's thoughts also, though.

What do you think of having dedicated functions for the first several layers? It might improve readability in some ways, though of course there'd be some more redundancy, so I'm not really sure if it's a good tradeoff. Could also lead to some optimizations, e.g. roots could just be immediate values in the first few layers.
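As a sketch of the dedicated-functions idea (entirely hypothetical, and using plain integers with no modular reduction, unlike the crate's real field type): in the very first layer every twiddle factor is 1, so specializing that layer removes the multiplication altogether rather than loading a root from a table:

```rust
/// Hypothetical specialized first FFT layer. After bit-reversal, the first
/// layer pairs adjacent elements, and its only twiddle factor is 1, so each
/// butterfly reduces to an add and a subtract with no root lookup at all.
fn first_layer(values: &mut [i64]) {
    for pair in values.chunks_exact_mut(2) {
        let (a, b) = (pair[0], pair[1]);
        pair[0] = a + b; // a + 1*b
        pair[1] = a - b; // a - 1*b
    }
}
```

Later early layers could similarly keep their few roots as compile-time constants instead of table loads, at the cost of some code duplication.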
The speed improvement is fairly significant even without AVX-512. On my Skylake, which only has AVX2, vectorization increases the throughput of that hot loop by 1.4× (figure from my profiles of the fft benchmarks). So even though AVX2 doesn’t increase the throughput of multiplication per se (see #1), there are other effects:
Of course, I expect the improvement to be bigger still on AVX-512.
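For context on why packing can pay off even when the multiply itself is no faster: here is a portable sketch (not the PR's actual code; `WIDTH` and the integer lanes are placeholders) of one packed butterfly step, where a whole vector of butterflies is processed per iteration, amortizing loop overhead and letting the compiler keep the lanes in SIMD registers:

```rust
const WIDTH: usize = 4; // hypothetical packed width (e.g. one AVX2 register)

/// Applies WIDTH independent FFT butterflies at once. Plain i64 arithmetic
/// with no modular reduction, purely to show the shape of the packed loop:
/// the fixed-size arrays let the compiler auto-vectorize each lane.
fn packed_butterfly(lo: &mut [i64; WIDTH], hi: &mut [i64; WIDTH], twiddle: &[i64; WIDTH]) {
    for lane in 0..WIDTH {
        let t = twiddle[lane] * hi[lane];
        hi[lane] = lo[lane] - t;
        lo[lane] += t;
    }
}
```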
e97dfa5 to 214b2ee
FWIW this is looking good to me. The vectorised implementation is pretty clear (or at least not much harder to read than the original).
Vectorized version of FFT. Pretty much a direct translation of the old code. I worry that the vector stuff gets in the way of readability; I welcome suggestions for fixing that.
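For readers without the diff at hand, a minimal scalar version of the kind of iterative radix-2 FFT being vectorized, over a toy prime field (p = 17 and the helper names are hypothetical; the real code uses the crate's field types, and its `lg_m` loop corresponds to the `m`-doubling pass below):

```rust
const P: u64 = 17; // toy prime; stand-in for the crate's actual field modulus

/// b^e mod P by square-and-multiply.
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = acc * b % P; }
        b = b * b % P;
        e >>= 1;
    }
    acc
}

/// Iterative radix-2 Cooley–Tukey NTT over Z_P, in place. `root` must be a
/// primitive n-th root of unity mod P, with n = values.len() a power of two (n >= 2).
fn fft(values: &mut [u64], root: u64) {
    let n = values.len();
    let lg_n = n.trailing_zeros();
    // Bit-reversal permutation so each layer works on contiguous chunks.
    for i in 0..n {
        let j = ((i as u32).reverse_bits() >> (32 - lg_n)) as usize;
        if i < j { values.swap(i, j); }
    }
    // One pass per layer; the butterfly span m doubles each time.
    let mut m = 2;
    while m <= n {
        let w_m = pow_mod(root, (n / m) as u64); // primitive m-th root of unity
        for chunk in values.chunks_mut(m) {
            let mut w = 1;
            for j in 0..m / 2 {
                let t = w * chunk[j + m / 2] % P;
                chunk[j + m / 2] = (chunk[j] + P - t) % P;
                chunk[j] = (chunk[j] + t) % P;
                w = w * w_m % P;
            }
        }
        m *= 2;
    }
}
```

In the vectorized version, the inner `j` loop is what gets packed: once `m / 2` reaches the packed width, each iteration can process a full vector of butterflies at a time.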