
Implement native Poseidon #4

Closed · dlubarov opened this issue Apr 3, 2021 · 18 comments

@dlubarov (Member) commented Apr 3, 2021

I don't mean to suggest that we should necessarily replace GMiMC with Poseidon; might be good to wait for more clarity around the security of both. But in the meantime, it would be good to have both in the codebase so that we can get a clearer picture of their costs.

Note that implementing Poseidon efficiently involves a bit of trickery; there are some suggestions in Appendix B of the paper. The reference implementation might also be useful.

There are some variations to consider, but I think a good candidate would be

  • permutation width of 12 elements (as we use for GMiMC)
  • S-box of x^7 (so we can keep our constraint degree at 8, just barely)
  • over the Goldilocks field mentioned in #1 (Investigate other field options)

This leads to a recommendation of 8 full rounds and 22 partial rounds, for a total of 118 S-boxes (8 × 12 in the full rounds plus 22 × 1 in the partial rounds).

It could also be worth considering x^3, and trying to arithmetize it in a clever way that doesn't involve a wire for every S-box (see #2). However, our gate would then have degree-9 constraints, which would mean (since we use radix-2 FFTs) that a bunch of intermediate polynomials would need to be of degree 16n.

@unzvfu (Contributor) commented Apr 6, 2021

[Nothing new below, I'm just doing a brain dump of my understanding at the moment together with some references.]

The core cost in evaluating Poseidon is the product of an MDS matrix with the state; for example, an experiment by bobbinth suggests that 92% of the time is in the MDS products, and only 5% in the S-Box permutations.
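To make that cost concrete, here is a minimal sketch (not the project's actual code) of the dense MDS layer that dominates the runtime: a width-12 matrix-vector product over the field. The width, matrix contents, and reduction are all stand-ins.

```rust
const WIDTH: usize = 12;

/// Multiply the state by a WIDTH x WIDTH MDS matrix. Field elements fit in
/// u64 and matrix entries are assumed small (< 2^32), so a full row of
/// products can be accumulated in a u128 with one reduction at the end.
fn mds_layer(mds: &[[u64; WIDTH]; WIDTH], state: &[u64; WIDTH], p: u64) -> [u64; WIDTH] {
    let mut out = [0u64; WIDTH];
    for r in 0..WIDTH {
        let mut acc: u128 = 0;
        for c in 0..WIDTH {
            // Each product is < 2^96, and 12 of them sum to < 2^100,
            // so the u128 accumulator cannot overflow.
            acc += (mds[r][c] as u128) * (state[c] as u128);
        }
        out[r] = (acc % (p as u128)) as u64; // slow generic reduction, for clarity only
    }
    out
}
```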

The speed improvements mentioned above in Appendix B of the Poseidon paper apply to the partial rounds; in short, because the S-Box permutation isn't applied in partial rounds, we can factor the MDS matrix into something close to an identity matrix and use that to update the state. Note that bobbinth's implementation of Poseidon for the test above does not implement this trick, so those results are probably somewhat misleading. The reference implementation cited above does implement the trick, but it's written in Sage so performance numbers from it are probably meaningless.

There is also the possibility of following Fractal's lead (hat tip to @dlubarov): they use a slightly relaxed property to produce "lightweight MDS" matrices, which apparently gives a 10-fold performance improvement (see p. 68 here for the claim). More generally, there's a long history of research into finding MDS matrices that allow fast products; a recent example with a decent reference list is here. It is not currently clear what the security implications are of using lightweight MDS matrices with Poseidon.

It would be great to implement Poseidon using those two tricks and compare it with GMiMC; it wouldn't surprise me if Poseidon weren't far behind.

@dlubarov (Member, Author) commented Apr 6, 2021

That all sounds right to me except

> because the S-Box permutation isn't applied in partial rounds

Maybe this is what you meant, but it's applied to a single element in partial rounds, vs. all the elements in full rounds.

To dig into the cost a bit more, the StarkWare report listed Poseidon as taking (m^2 + 2m)·R_f + (2m + 2)·R_p multiplications, after the optimizations suggested in the paper. I think m^2·R_f of that is full matrix multiplication, and 2m·R_p of it is multiplication by a sparse matrix. For our candidate parameters, it looks like full matrix multiplication would be ~72% of the cost while sparse matrix multiplication would be ~3% of the cost, so we should probably just focus on the full matrix and not the sparse matrix.

@dlubarov (Member, Author) commented Apr 6, 2021

I wonder what sort of matrix entries we should search for. The literature seems mostly focused on binary fields. Clearly 0s are best for performance, followed by 1s. I guess powers of two should give a decent speedup, though we'd still need to reduce. Coefficients less than 2^32 might also help to speed up reductions somewhat -- e.g. with the Goldilocks reduction you suggested, we'd have x11 = 0 (the top 32-bit limb of the 128-bit product, since each product would then fit in 96 bits).

My understanding is that MDS matrices (at least square ones) can't have 0 entries, but maybe we could find one with several 1 entries, and the rest small and/or power-of-two entries.

@unzvfu (Contributor) commented Apr 7, 2021

> > because the S-Box permutation isn't applied in partial rounds
>
> Maybe this is what you meant, but it's applied to a single element in partial rounds, vs. all the elements in full rounds.

Oh, yes, good catch! I was a bit sloppy there. :)

> To dig into the cost a bit more, the StarkWare report listed Poseidon as taking (m^2 + 2m)·R_f + (2m + 2)·R_p multiplications, after the optimizations suggested in the paper. I think m^2·R_f of that is full matrix multiplication, and 2m·R_p of it is multiplication by a sparse matrix. For our candidate parameters, it looks like full matrix multiplication would be ~72% of the cost while sparse matrix multiplication would be ~3% of the cost, so we should probably just focus on the full matrix and not the sparse matrix.

Yep, agreed. Thanks for that reference!

> I wonder what sort of matrix entries we should search for. [...]
>
> My understanding is that MDS matrices (at least square ones) can't have 0 entries, but maybe we could find one with several 1 entries, and the rest small and/or power-of-two entries.

Yep, that was my thinking too. With the "lightweight MDS" matrices, apparently all the entries are either 0, 1, or 2, which explains the great speedup Valardragon referred to.

@dlubarov (Member, Author) commented:

I think another good candidate would be Poseidon with a width of 8, with the same field and same x^7 S-box. It would have the same number of rounds (8 full, 22 partial), but 86 S-boxes instead of 118 (8 × 8 + 22), and 1208 total muls instead of 2152.

(Aside: if we used a sponge-based 2-to-1 compression function in our (binary) Merkle tree, a width of 8 would mean we need two squeeze steps. But I think we can do secure tree hashing by just permuting left_child || right_child and truncating the result. For the leaf layer we would permute and truncate leaf || 0. I think this might be implied by this paper, though I haven't gotten through it yet. If not, I have a simple security argument sketched out.)
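To illustrate the permute-and-truncate idea, here is a minimal sketch under assumed parameters (width 8, 4-element digests); `poseidon_permute` is a placeholder, and the real construction would need whatever capacity/padding the security argument requires.

```rust
const WIDTH: usize = 8;
const DIGEST: usize = 4;

/// Placeholder for the real width-8 Poseidon permutation.
fn poseidon_permute(state: [u64; WIDTH]) -> [u64; WIDTH] {
    unimplemented!("stand-in for the actual permutation")
}

/// 2-to-1 compression for a binary Merkle tree: permute left || right,
/// then truncate to a digest.
fn compress(left: [u64; DIGEST], right: [u64; DIGEST]) -> [u64; DIGEST] {
    let mut state = [0u64; WIDTH];
    state[..DIGEST].copy_from_slice(&left);
    state[DIGEST..].copy_from_slice(&right);
    let out = poseidon_permute(state);
    out[..DIGEST].try_into().unwrap()
}

/// Leaf layer: permute leaf || 0 and truncate, per the comment above.
fn hash_leaf(leaf: [u64; DIGEST]) -> [u64; DIGEST] {
    compress(leaf, [0u64; DIGEST])
}
```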

dlubarov added a commit that referenced this issue Apr 28, 2021: "The defaults are quite slow, but we will override them with fast, precomputed, field-specific matrices; see #4."
@unzvfu (Contributor) commented Jul 7, 2021

> I wonder what sort of matrix entries we should search for. The literature seems mostly focused on binary fields. Clearly 0s are best for performance, followed by 1s. I guess powers of two should give a decent speedup, though we'd still need to reduce. Coefficients less than 2^32 might also help to speed up reductions somewhat -- e.g. with the Goldilocks reduction you suggested, we'd have x11 = 0.
>
> My understanding is that MDS matrices (at least square ones) can't have 0 entries, but maybe we could find one with several 1 entries, and the rest small and/or power-of-two entries.

First results on this are in. TL;DR: the circulant matrix with row [4, 1, 2, 256, 16, 8, 1, 1] is MDS over CrandallField and passes the security checks of the GRS20 algorithms mentioned on p. 6 of the Poseidon paper (an implementation of those algorithms is available here; it's pretty terrible, and I had to fiddle with it a bit to get it to work; I will upload my tweaked version eventually).

I obtained this matrix by starting with the matrix defined at src/field/crandall_field.rs#L18 and repeatedly halving each entry, stopping if the matrix ever failed to be MDS. I then manually tried adding or subtracting a small number from each of the entries while maintaining the MDS property, each time aiming for a power of two. It was a bit ad hoc but, unless I've missed something obvious, the matrix above can't be simplified further while still satisfying the MDS property.

(Note that the 256 can be replaced with 5 and still be good to use; some of the other entries can be changed similarly. This sort of thing would let us use the LEA instruction for multiplication, but it's probably not helpful for CrandallField since there's no way to detect overflow. On Skylake, for example, a two-argument LEA has latency 1 and throughput 2, the same as SHL, but LEA runs on ports 1 and 5 while SHL runs on ports 0 and 6, so they can happen in parallel, giving two SHLs and two LEAs per cycle.)

So the process was a combination of automated "reduction" and manual fiddling; not entirely automate-able, but not too laborious either. The result looks pretty good, with three 1's, and the rest powers of two. Hence each dot product can be computed with a smallish number of adds and shifts which should be relatively easy to interleave to make use of the CPU pipeline. The main hassle will be dealing with the overflow, but that's inevitable since the CrandallField prime leaves no space at the top of the word.
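As a rough illustration of why these entries are cheap, here is a sketch (not the actual implementation) of one row's dot product using only shifts and adds; the deferred reduction at the end stands in for whatever routine the field provides.

```rust
/// log2 of the circulant row [4, 1, 2, 256, 16, 8, 1, 1] found above.
const ROW_SHIFTS: [u32; 8] = [2, 0, 1, 8, 4, 3, 0, 0];

/// Dot product of row r of the circulant matrix with the state: every
/// multiplication becomes a shift (a no-op for the 1 entries).
fn circulant_row_dot(state: &[u64; 8], r: usize) -> u128 {
    let mut acc: u128 = 0;
    for c in 0..8 {
        // Row r of a circulant matrix is the first row rotated right by r,
        // so entry (r, c) is first_row[(c + 8 - r) % 8].
        acc += (state[c] as u128) << ROW_SHIFTS[(c + 8 - r) % 8];
    }
    acc // eight terms of < 2^72 each, so < 2^75: reduce once at the end
}
```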

@unzvfu (Contributor) commented Jul 7, 2021

A different approach to computing the MDS matrix-vector product quickly might be to use the fact that, if C is a circulant matrix with first column c, then the product Cv is given by FFT^{-1}(FFT(c) * FFT(v)), where the * is componentwise multiplication. In our case the c's are constants, so FFT(c) can be precomputed. Unfortunately we can't chain several of these together in the full layers of Poseidon because, as far as I can see, there's no way to compute the S-boxes in the FFT domain: given FFT(v = (x_0, ..., x_{n-1})), the only way I can see to obtain FFT((x_0^α, ..., x_{n-1}^α)) is to compute v with FFT^{-1}, raise each element to the power α, and re-apply FFT. Given that n = t = 8 or 12 or so in our case, which is very small, it seems likely that the overhead of the FFT in this approach will prevent it from beating the one above.

@unzvfu (Contributor) commented Jul 7, 2021

Also worth noting that multiplying a circulant matrix by a vector, even with normal multiplication, is very amenable to CPU vectorisation; for example:

[a  b  c] [x]   [ax + by + cz]     [x]     [y]     [z]
[c  a  b] [y] = [cx + ay + bz] = a [y] + b [z] + c [x]
[b  c  a] [z]   [bx + cy + az]     [z]     [x]     [y]

i.e. each term on the RHS is just a rotation of the original vector scaled by a constant (which will be one of the small constants found above). If we have t=8, then we can even fit all 8 of the 64-bit vector elements in a single AVX512 register.
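Here is a scalar sketch of that rotate-and-scale formulation (a vectorised version would keep the state in one AVX512 register and use a lane permute for each rotation); the first row and the types are assumptions carried over from the earlier comment.

```rust
const FIRST_ROW: [u64; 8] = [4, 1, 2, 256, 16, 8, 1, 1];

/// Compute C*v as sum_k first_row[k] * rotate_left(v, k), lane by lane.
fn circulant_mul(state: [u64; 8]) -> [u128; 8] {
    let mut acc = [0u128; 8];
    let mut rotated = state;
    for k in 0..8 {
        for i in 0..8 {
            acc[i] += (FIRST_ROW[k] as u128) * (rotated[i] as u128);
        }
        rotated.rotate_left(1); // the next summand uses the next rotation
    }
    acc // each lane still needs one final field reduction
}
```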

@dlubarov (Member, Author) commented Jul 8, 2021

That looks very practical! Seems about as good as it can get, at least with our current field.

Just thinking out loud about how much performance could be gained with a smaller field. My understanding is that

  • With our current field, we would do widening (64 -> 128) multiplication between the matrix and vector entries, then add the u128s (which is of course two native adds, plus something extra for the carry), then do reduce128 (see the reduction sketch after this list).
  • If we were to start with vector entries that fit in ~59 bits or so (assuming we replace 256 with 5), then the products and their sums would all fit in u64s, making the accumulation cheaper. However, reducing certain values all the way down to ~59 bits seems a bit expensive compared to just reducing to 64 bits (as reduce128 currently does). Or is there another approach that I'm not seeing?
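For reference, here is a minimal sketch of a 128-to-64-bit reduction of the kind discussed here, written for the Goldilocks prime p = 2^64 - 2^32 + 1 mentioned earlier (it uses 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 mod p); the codebase's actual reduce128 for CrandallField will differ in its constants.

```rust
const P: u64 = 0xFFFF_FFFF_0000_0001; // Goldilocks: 2^64 - 2^32 + 1

/// Reduce a 128-bit value to a 64-bit representative (mod P).
fn reduce128(x: u128) -> u64 {
    let lo = x as u64;            // bits 0..64
    let hi = (x >> 64) as u64;    // bits 64..128
    let hi_lo = hi & 0xFFFF_FFFF; // bits 64..96
    let hi_hi = hi >> 32;         // bits 96..128

    // x ≡ lo + (2^32 - 1)*hi_lo - hi_hi  (mod P)
    let (mut t, borrow) = lo.overflowing_sub(hi_hi);
    if borrow {
        t = t.wrapping_add(P); // compensate the borrow with +P
    }
    let mid = (hi_lo << 32) - hi_lo; // (2^32 - 1) * hi_lo, fits in u64
    let (mut res, carry) = t.overflowing_add(mid);
    if carry {
        res += 0xFFFF_FFFF; // the lost 2^64 is ≡ 2^32 - 1 (mod P)
    }
    res // in [0, 2^64); canonicalize below P separately if needed
}
```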

@unzvfu (Contributor) commented Jul 11, 2021

> With our current field, we would do widening (64 -> 128) multiplication between the matrix and vector entries, then add the u128s (which is of course two native adds, plus something extra for the carry), then do reduce128.

Right (by "widening multiplication" I assume you mean doing them with shifts). It is pretty nice that we can just do a single reduce128 call at the end, after the t = 8 shift+adds.

> If we were to start with vector entries that fit in ~59 bits or so (assuming we replace 256 with 5), then the products and their sums would all fit in u64s, making the accumulation cheaper. However, reducing certain values all the way down to ~59 bits seems a bit expensive compared to just reducing to 64 bits (as reduce128 currently does). Or is there another approach that I'm not seeing?

I think you're right that the advantage is only obtained if the inputs are already reduced.

unzvfu self-assigned this Jul 12, 2021
@unzvfu (Contributor) commented Aug 9, 2021

A full and correct implementation of Poseidon is now available on the poseidon-impl branch. It includes two variants: a naive implementation that just translates the definitions, and a fast implementation (the default) that uses some mathemagic to reduce the cost of the partial rounds. Both variants are specialised to width t = 12 and the x^3 S-box, and use the fast MDS matrices that I talked about above. The bench_hash binary has been updated to run a basic benchmark on GMiMC, Rescue, and the two Poseidon variants. The timings on my ancient ThinkPad (single core, i5-3230M @ 2.60 GHz peak):

Hash                          Time (μs)   Relative slowdown
GMiMC                         2.71        1
Rescue                        50.2        18.6
Poseidon (naive, slow MDS)*   56.0        20.7
Poseidon (naive, fast MDS)    9.30        3.44
Poseidon (smart, fast MDS)    4.97        1.84

(* Not currently available in the code; timing is approximate.)

Experience suggests that the relative timings are fairly constant. For example, before it died, the faster CPU in my other laptop produced timings of about 1.2μs for GMiMC and 2.3μs for Poseidon (smart, fast MDS).

There is still space in the implementation to improve CPU pipelining and parallelise with AVX if desired.

I should add some documentation and supporting scripts used to generate constants before this is merged.

@dlubarov (Member, Author) commented Aug 9, 2021

The performance looks promising! On my M1:

Bench for GMiMC:
--- avg 1.4395833333333334μs
Bench for Rescue:
--- avg 34.552083333333336μs
Bench for Poseidon naive:
--- avg 3.1458333333333335μs
Bench for Poseidon:
--- avg 2.408333333333333μs

> x^3 S-box

I think if we use x^3 we'll need 42 partial rounds; I was assuming x^7 with the 22 partial round figure. I used this script to get the numbers.

Do you think x^3 would be faster despite the extra partial rounds? In the recursive setting I think x^7 should be a little cheaper, as we basically pay a certain cost per S-box; the exponent doesn't affect things much (as long as it's less than 8).

@unzvfu (Contributor) commented Aug 10, 2021

Thanks for the feedback @dlubarov. Good catch with the S-box monomial! Here are the updated numbers (same computer as before):

Hash                                                Time (μs)   Relative slowdown
GMiMC                                               2.71        1
Rescue                                              50.2        18.6
Poseidon (naive, fast MDS, x^7 S-box, 22 partial)   10.19       3.77
Poseidon (smart, fast MDS, x^7 S-box, 22 partial)   5.55        2.05
Poseidon (naive, fast MDS, x^3 S-box, 42 partial)   15.26       5.65
Poseidon (smart, fast MDS, x^3 S-box, 42 partial)   6.51        2.41

So x^7 with 22 partial rounds comfortably beats x^3 with 42 partial rounds. I'm a little surprised that evaluating x^7 is that much slower than x^3, since x^3 takes 2 multiplications and x^7 takes 4, two of which are independent (x2 = x * x; x4 = x2 * x2; x3 = x2 * x; x7 = x4 * x3). I guess it's a testament to the fact that the fast MDS matrices mean the S-boxes are taking a bigger portion of the runtime.
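For concreteness, the dependency structure of that chain looks like the following sketch (`mul` is a stand-in for field multiplication):

```rust
/// Evaluate the x^7 S-box in 4 multiplications. x4 and x3 depend only on
/// x2 (and x), so they can issue back-to-back and overlap in the pipeline.
fn sbox7(x: u64, mul: impl Fn(u64, u64) -> u64) -> u64 {
    let x2 = mul(x, x);
    let x4 = mul(x2, x2); // independent of x3
    let x3 = mul(x2, x);  // independent of x4
    mul(x4, x3)           // x^7
}
```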

@dlubarov (Member, Author) commented:

For me it only increased from 2.4μs to 2.6μs:

Bench for Poseidon:
--- avg 2.6333333333333333μs

so it seems like a bit more parallelism is being used on my machine.

@unzvfu (Contributor) commented Aug 18, 2021

Quick update: Timing for GMiMC and Poseidon with width = 8:

Hash                                                Time (μs)   Relative slowdown
GMiMC                                               2.43        1
Poseidon (naive, fast MDS, x^7 S-box, 22 partial)   2.84        1.16
Poseidon (smart, fast MDS, x^7 S-box, 22 partial)   2.61        1.07

So the performance really converges for width = 8 and, perhaps surprisingly, the smart partial-round evaluation doesn't provide such a dramatic improvement. I would guess that this is because the S-box calculation now dominates the runtime, which could mean trading more partial rounds for a smaller S-box monomial is worth it. I'll continue investigating and report back.

Edit: updated the numbers with a better MDS matrix.

@unzvfu (Contributor) commented Aug 18, 2021

Times with the smaller S-box and more partial rounds:

Hash                                                Time (μs)   Relative slowdown
Poseidon (naive, fast MDS, x^3 S-box, 42 partial)   3.70        1.52
Poseidon (smart, fast MDS, x^3 S-box, 42 partial)   3.38        1.39

So it still looks like x^7 is the sweet spot for the S-box monomial vs. the number of partial rounds.

Further experimentation has shown that, somewhat annoyingly, the smart MDS evaluation and the naive MDS evaluation actually cost about the same when the width is 8. In principle, for width w, the work is w^2 muls in the naive case and 2w muls in the smart case. However, that ignores a big difference: the naive MDS evaluation computes the product with the MDS matrix directly, hence it uses my fast MDS entries directly, whereas the "smart" MDS evaluation uses precomputed constants which are intrinsically random 64-bit values. I think this explains why the performance gains from smart MDS evaluation are less visible when we reduce the hash width from 12 to 8 (at least on my old laptop).

@unzvfu (Contributor) commented Sep 1, 2021

Repeating a comment from #207 for reference:

Bizarrely, the performance has changed a bit somewhere between f118464 and 45613c9. The numbers I'm getting now for width = 8 are (name / microseconds / slowdown):

GMiMC             2.07   1.00
Poseidon          3.31   1.60
Poseidon naive    3.04   1.47

So, compared to the numbers in #4, GMiMC is 20% faster, and 'fast' Poseidon is now 10% slower than 'naive' Poseidon (which itself is slightly slower).

I think we can just push ahead though, since they're still very good, and we don't want to be optimising for my old laptop anyway! ;)

unzvfu changed the title from "Implement Poseidon (both native and circuit versions)" to "Implement native Poseidon" on Sep 4, 2021
@unzvfu (Contributor) commented Sep 4, 2021

This issue captures the now-merged native Poseidon implementation. Renamed to reflect that, and added #219 for the circuit version.

unzvfu closed this as completed on Sep 4, 2021