
Vectorized FFT #223

Merged
merged 5 commits into main from jakub/vectorized-fft on Sep 12, 2021
Conversation

nbgl
Member

@nbgl nbgl commented Sep 4, 2021

Vectorized version of FFT. Pretty much a direct translation of the old code. I worry that the vector stuff gets in the way of readability; I welcome suggestions for fixing that.

@nbgl nbgl requested review from unzvfu and dlubarov and removed request for unzvfu September 4, 2021 04:36
src/field/fft.rs Outdated
// We've already done the first log_packed_width iterations (if any were required).
let s = max(r, log_packed_width) + 1;

for lg_m in s..=lg_n {
Member

This is minor but maybe we could tweak the second loop to emphasize how it's a continuation of the first:

  • let s = max(r, log_packed_width);
  • for lg_half_m in s..lg_n { ... }
  • rename i -> lg_half_m in the first loop, or we could just call both i if you think lg_half_m is long/confusing

@dlubarov
Member

dlubarov commented Sep 8, 2021

Yeah, it's fairly tough to read, but I think it makes sense, and I can't think of any significant improvements. Do you expect a significant speedup on certain (e.g. AVX-512) hardware? If so, I think it seems OK. Would like to get @unzvfu's thoughts also, though.

What do you think of having dedicated functions for the first several layers? It might improve readability in some ways, though of course there'd be some more redundancy, so I'm not really sure if it's a good tradeoff. Could also lead to some optimizations, e.g. roots could just be immediate values in the first few layers.

@nbgl
Member Author

nbgl commented Sep 8, 2021

The speed improvement is fairly significant even without AVX-512. On my Skylake, which only has AVX2, vectorization increases the throughput of that hot loop by 1.4× (figure from my profiles of the fft benchmarks). So even though AVX2 doesn’t increase the throughput of multiplication per se (see #1), there are other effects:

  1. 4× fewer loads/stores/bounds checks
  2. AVX2 multiplication cannot issue 4 instructions/cycle (the maximum on Skylake) because there are only 3 vector ALUs. So we get the same multiplication throughput while doing loads/stores/bounds checks/conditional jumps for free
  3. Vector addition and subtraction are faster than their scalar counterparts

Of course, I expect the improvement to be bigger still on AVX-512.

@unzvfu
Contributor

unzvfu commented Sep 12, 2021

FWIW this is looking good to me. The vectorised implementation is pretty clear (or at least not much harder to read than the original).

@nbgl nbgl merged commit a8d08aa into main Sep 12, 2021
@nbgl nbgl deleted the jakub/vectorized-fft branch September 12, 2021 23:54