Scalar multiplication by the generator #697

fjarri · 2022-12-30T03:43:26Z

This PR introduces a way to multiply a scalar by the generator by keeping precomputed lookup tables. It is exported as mul_by_generator() (and used in the signing code). It takes 40us compared to 86us for regular multiplication; signing takes 70us compared to 116us.

The precomputed table contains values for every second radix and takes ~30kb; if we calculated it for every radix it would take 60kb with the performance improvement of only ~3%, so I don't think it is worth it (the same choice was taken by https://github.com/dalek-cryptography/curve25519-dalek).

I am not sure how to speed up the verification, which relies on calculating k1 * G + k2 * x. Pre-calculating lookup tables for G in lincomb only speeds it up from 129us to 122us on my machine. Calculating k1 * G via mul_by_generator() results in the timing of 126us. Not great, any ideas are welcome.

The precomputed tables use once_cell. Note that the feature critical-section of once_cell (which we need to use to support no_std) has the MSRV of 1.63. Proper gating is up to @tarcieri

tarcieri · 2022-12-30T04:31:39Z

Thanks for contributing this! I'm working on a somewhat analogous problem over here:

dalek-cryptography/ed25519-dalek#251

I agree it would be good to have a trait for fixed-base scalar multiplication / multiply-by-generator, particularly so basepoint tables can be enabled conditionally as a performance-vs-code-size tradeoff while encapsulating the details of whether or not basepoint tables are used.

fjarri · 2022-12-30T08:24:37Z

It is more of a startup time/memory than code size, since I'm using lazy_static.

But that actually gives me an idea, I can also put the g * 2^(i * window) into a table so that I can get rid of the doubling.

str4d · 2022-12-30T16:34:36Z

You may also want to look at the types in the group crate under group::wnaf, which also provide APIs for precomputing scalar multiplication tables.

tarcieri · 2022-12-30T16:58:31Z

k256/Cargo.toml

@@ -20,6 +20,7 @@ rust-version = "1.57"
 [dependencies]
 cfg-if = "1.0"
 elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] }
+lazy_static = { version = "1.4", features = ["spin_no_std"] }


I generally prefer once_cell these days, especially as it's slated for eventual inclusion in core/std.

The spin_no_std "feature" of lazy_static was always weirdly non-additive too. Enabling it enables it for all crates that use lazy_static.

Changed to once_cell. I am not 100% sure I'm using it correctly, so if you have experience with it, please check.

It seems that to use once_cell in no_std environment we need the critical_section feature, but when such a library is used with std enabled, there is a linker error that can be prevented by specifying critical-section = { version = "1.1", features = ["std"] } explicitly. This is kind of a bummer. Is there a better way?

Also once_cell has the MSRV 1.63, so this crate's MSRV will have to be updated.

Huh? once_cell should be MSRV 1.56

This is kind of a bummer. Is there a better way?

no_std users who want the basepoint tables can link once_cell and enable the critical_section feature themselves.

But I suspect many embedded will not want basepoint tables at all due to limitations on the size of the loaded program. In that regard it would probably be good to feature gate them ala dalek-cryptography/curve25519-dalek#489

Since you have a mul_by_generator function it can fallback to a slower implementation when the basepoint tables aren't available, but still provide a common API for either case.

no_std users who want the basepoint tables can link once_cell and enable the critical_section feature themselves.

Not sure I would like that as a user. Seems pretty brittle.

It is a lazily evaluated value, so the limitation will be on the memory/cache usage.

Well yes, but that's what I mean by "size of the loaded program". It may need to fit in e.g. SRAM, on a device with no DRAM.

Lazy or not, at the end of the day it's a static table that needs to fit in whatever RAM the device has available, and many embedded devices are highly constrained.

As a data point, the curve25519-dalek basepoint tables were too big to fit into some ARM Cortex M0/M4 devices people were experimenting with at OxidizeConf when I attended it a few years ago.

I thought the common API would be provided by the trait, which could use mul_by_generator() or mul() (in the default implementation)

We can use the trait when it's available, which provides a default mul-based implementation, and would let you conditionally override it in the impl block to use the basepoint tables.

But absent that upstream feature in group, you can do something like this with an inherent mul_by_generator() method:

https://github.com/dalek-cryptography/curve25519-dalek/pull/489/files#diff-0dbe5da3c59af6d6e20fbb1c92b248a7bcce0a5e15f0e5de6d16c9a60abe3ee3R705-R722

Not sure I would like that as a user. Seems pretty brittle.

Hopefully the ergonomic concerns here go away when it's merged upstream into core/std. Really that's the main reason to prefer it: it will soon hopefully be part of the standard library.

The spin_no_std feature of lazy_static is not a great solution here either. If one dependency in your std-dependent program activates it, suddenly every lazy static becomes spinlocked. And for that reason I think it's bad for dependencies to enable it rather than the toplevel [bin] crate.

@fjarri another option is to transform the precomputed tables into Rust literal syntax, and use that to define a static. That's what the dalek crates do:

https://github.com/dalek-cryptography/curve25519-dalek/blob/3effd73307a606e44469a425974f0b7b0eb85899/src/backend/serial/u64/constants.rs#L328

You can test if it diverges by running the precomputation again and seeing if it equals the static value.

That will lead to increased code size which may be less preferable than some start-up time, e.g. in WASM context.

tarcieri · 2022-12-30T17:08:37Z

@str4d speaking of group, it seems like mul_by_generator might be a useful addition there? zkcrypto/group#44

fjarri · 2022-12-30T23:44:53Z

replaced lazy_static with once_cell
separated mul_by_generator into its own module, since it doesn't need to use endomorphism
added more precalculated tables

So now mul_by_generator() takes 40us compared to 86us for regular multiplication; signing takes 70us compared to 116us. Verification would be improved too but less so, since we will lose the advantage of calculating linear combination. The price is 60kb of cache. This can be reduced to 30kb like they did in ed25519 by making the table more sparse and therefore performing some additional doublings.

Also I cannot change the verification code in this PR since it's in signatures/ecdsa and not in k256.

I guess this PR is blocked by zkcrypto/group#44; after that's merged and published, this can be updated too, and then signatures/ecdsa to use it for verification.

fjarri · 2022-12-30T23:51:58Z

As for using wNAF, I am not certain, won't it make the multiplication variable-time?

tarcieri · 2022-12-31T23:18:52Z

@fjarri I wouldn't say this PR is blocked on zkcrypto/group#44

You can make mul_by_generator an inherent method for now as you've already implemented it, and then we can potentially circle back on getting it added to the trait.

fjarri · 2022-12-31T23:25:41Z

I guess I'll focus on providing the optimized logic in this PR, and leave the issues of gating and once_cell usage to you :)

tarcieri · 2022-12-31T23:28:58Z

Sure, that's fine. Happy to do a followup which adds a basepoint-tables feature along the lines of dalek-cryptography/curve25519-dalek#489

tarcieri · 2023-01-02T17:02:14Z

Also I cannot change the verification code in this PR since it's in signatures/ecdsa and not in k256.

You can still change the verification implementation if you like: https://github.com/RustCrypto/elliptic-curves/blob/8ffa72c/k256/src/ecdsa.rs#L241

It's currently using the provided implementation of verify_prehashed:

https://docs.rs/ecdsa/0.15.0-pre.1/ecdsa/hazmat/trait.VerifyPrimitive.html#method.verify_prehashed

However if you copy/paste that implementation into the VerifyPrimitive impl you can tweak it however you'd like.

fjarri · 2023-01-02T18:37:55Z

It's actually not necessary anymore, since I currently don't know how to speed up verification (see the updated top message).

tarcieri · 2023-01-02T18:39:28Z

Yeah, seems like the two approaches are mutually incompatible.

It might be worth checking if the speedup provided by precomputed basepoint tables outperforms using linear combinations, though that said, being able to use the provided implementation of ECDSA from the ecdsa crate is also nice.

fjarri · 2023-01-02T18:40:51Z

It might be worth checking if the speedup provided by precomputed basepoint tables outperforms using linear combinations

I did check it (if I understand you correctly), see the timings in the first post

tarcieri · 2023-01-02T18:44:40Z

I meant specifically in regard to verification, where you said this:

Verification would be improved too but less so, since we will lose the advantage of calculating linear combination.

...which implies it would still benefit from basepoint tables over linear combinations, but I wasn't clear how much.

fjarri · 2023-01-02T18:47:37Z

Yes, at the moment I wrote this I was only guessing, but then I actually measured different approaches and updated the top post. Sorry for a bit of a non-linear post history.

tarcieri · 2023-01-02T21:05:04Z

@fjarri here's some semi-related code from curve25519-dalek, if it helps: https://github.com/dalek-cryptography/curve25519-dalek/blob/e01bb1b/src/backend/serial/scalar_mul/vartime_double_base.rs#L22-L64

fjarri · 2023-01-02T23:08:34Z

That's basically the approach I tried in here:

Pre-calculating lookup tables for G in lincomb only speeds it up from 129us to 122us on my machine.

Of course, that was still constant-time, using regular windows. They also use wNAF, since the verification does not have to be constant-time, which is an orthogonal change. It may give more speed-up. I don't think it should be done in this PR though.

tarcieri · 2023-01-02T23:23:55Z

That's cool.

Per @str4d's comment, I've been curious if the wNAF functionality in group::wnaf could assist here.

fjarri · 2023-01-08T23:12:32Z

@tarcieri , to use group::Wnaf I'll need group = 0.13, which requires elliptic-curve = 0.13 to be released and its version bumped here.

tarcieri · 2023-01-08T23:19:37Z

Duly noted. I'll try to wrap up another release cycle soon so we can switch master over to elliptic-curve v0.13 prereleases

tarcieri

Speaking of which, it would be nice to get this PR into that set of releases.

@fjarri it seems like this should be good to go other than feature-gating it? (which I can do in a followup)

… tables)

fjarri · 2023-01-08T23:29:00Z

Speaking of which, it would be nice to get this PR into that set of releases.

I was thinking the same, but it seems that I will need those version bumps first. The ecdsa dep will have to be bumped too since ecdsa=0.15.0-rc.1 depends on elliptic-curve=0.12 .

@fjarri it seems like this should be good to go other than feature-gating it? (which I can do in a followup)

Yes, should be fine.

tarcieri · 2023-01-08T23:34:22Z

I was thinking the same, but it seems that I will need those version bumps first.

What do you need bumped specifically?

The ecdsa dep will have to be bumped too since ecdsa=0.15.0-rc.1 depends on elliptic-curve=0.12 .

...which is the version currently in use? Or are you talking about for adding wNAF?

fjarri · 2023-01-08T23:36:56Z

Or are you talking about for adding wNAF?

Yes, for wNAF we need the new group which means the new ecdsa (will have to be -rc.2 I guess, because rc.1 still depends on elliptic-curve=0.12) and elliptic-curve. For this PR no bumps are necessary.

tarcieri · 2023-01-08T23:59:56Z

k256/src/arithmetic/mul.rs

+/// Calculages `k * G`, where `G` is the generator.
+pub fn mul_by_generator(k: &Scalar) -> ProjectivePoint {
+    let digits = Radix16Decomposition::<65>::new(k);
+    let table = *GEN_LOOKUP_TABLE;
+    let mut acc = table[32].select(digits.0[64]);
+    let mut acc2 = ProjectivePoint::IDENTITY;
+    for i in (0..32).rev() {
+        acc2 += &table[i].select(digits.0[i * 2 + 1]);
+        acc += &table[i].select(digits.0[i * 2]);
+    }
+    // This is the price of halving the precomputed table size (from 60kb to 30kb)
+    // The performance hit is minor, about 3%.
+    for _ in 0..4 {
+        acc2 = acc2.double();
+    }
+    acc + acc2
+}


@fjarri one thing I noticed too late refactoring... I think this would make sense as an inherent method of ProjectivePoint so it can be called as ProjectivePoint::mul_by_generator(&scalar)?

That way you don't have to import it separately.

Yes, I think I put it into mul.rs to make use of Radix16Decomposition and LookupTable, but it can be definitely defined in an impl block there.

~~Cool, I think I'll go ahead and refactor it into an inherent method in #705 then~~

Eh, or not, let's get the feature added first and then I can try to figure out how best to factor the inherent method change.

matklad · 2023-01-13T16:22:52Z

k256/Cargo.toml

@@ -20,6 +20,7 @@ rust-version = "1.60"
 [dependencies]
 cfg-if = "1.0"
 elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] }
+once_cell = { version = "1.16", default-features = false, features = ["critical-section"] }


🤔 not sure that enabling critical-section here is the right call. I'd imagine it should be up to the leaf crate with Cargo.lock to enable this, such that pulling this crate doesn't pull cs unconditionally.

One imporovement here would be to add std = "once-cell/std", so that once_cell doesn't use critical section (Cargo would still download and compile this crate, it'll just be unused).

But I think what you should do is:

[dependencies] once_cell = { version = "1.16", default-features = false } [features] default = ["arithmetic", "ecdsa", "pkcs8", "schnorr", "std"] alloc = ["ecdsa-core?/alloc", "elliptic-curve/alloc"] # Either std or critical-section should be enabled std = ["once-cell/std", "alloc", "ecdsa-core?/std", "elliptic-curve/std"] critical-section = ["once-cell/critical-section"]

That way:

people with std don't get critical_section at all

people with no_std would have to explicitly opt-into critical section

that's a bit bad because it doesn't "just work"

OTOH, silently opting the users in critical-section behavior feels not super friendly. It's a big hammer, better to be explicitly aware about it!

Sounds good to me. Want to make a PR?

FWIW, the situation is probably not as bad as it seems, because I expect most embedded users will want to disable precomputed-tables anyway to reduce the size of the library in memory

Want to make a PR?

Not really :)

FWIW, the situation is probably not as bad as it seems

This is a bit ambiguous, but, with the current code, my main worry is std people who get opted into critical-section. I am not sure how exactly critical-section behaves on std, but at minimum that hits much less tested and much slower path of once_cell, for all once cells in the final binary, and, at maximum, that actually serializes access to all once cells everywhere 😨

Heh ok, I'll make a PR.

Edit: opened #710

for all once cells in the final binary

I guess the once_cell API has the same problems as lazy_static. Big yikes. Hopefully it won't be the case when it's a part of the language proper.

std won't have this problem by virtue of not supporting no_std at all. If you want to configure synchronisation for a specific once_cell, rather than for all of them globally, https://crates.io/crates/generic_once_cell is the right crate.

Ah, so it's going to be std-only? That's a bummer.

@fjarri with an alloc dependency it would be possible to use race::OnceBox

fjarri force-pushed the mul-by-generator branch from 4245488 to 755e189 Compare December 30, 2022 03:48

fjarri requested a review from tarcieri December 30, 2022 04:00

fjarri marked this pull request as draft December 30, 2022 04:00

tarcieri reviewed Dec 30, 2022

View reviewed changes

fjarri force-pushed the mul-by-generator branch from 755e189 to 02b5469 Compare December 30, 2022 23:38

fjarri force-pushed the mul-by-generator branch from 02b5469 to 27833bb Compare January 1, 2023 02:03

fjarri marked this pull request as ready for review January 1, 2023 02:12

fjarri force-pushed the mul-by-generator branch 2 times, most recently from 4de6356 to 312fecd Compare January 1, 2023 02:18

tarcieri approved these changes Jan 8, 2023

View reviewed changes

k256: add optimized multiplication by generator (based on precomputed…

dd470ce

… tables)

fjarri force-pushed the mul-by-generator branch from 312fecd to dd470ce Compare January 8, 2023 23:23

tarcieri merged commit a15923c into RustCrypto:master Jan 8, 2023

fjarri deleted the mul-by-generator branch January 8, 2023 23:53

tarcieri reviewed Jan 8, 2023

View reviewed changes

matklad reviewed Jan 13, 2023

View reviewed changes

tarcieri mentioned this pull request Jan 16, 2023

k256 v0.12.0 #717

Merged

Scalar multiplication by the generator #697

Scalar multiplication by the generator #697

Conversation

fjarri commented Dec 30, 2022 • edited

tarcieri commented Dec 30, 2022

fjarri commented Dec 30, 2022 • edited

str4d commented Dec 30, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri Dec 31, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri Dec 31, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri commented Dec 30, 2022

fjarri commented Dec 30, 2022 • edited

fjarri commented Dec 30, 2022

tarcieri commented Dec 31, 2022 • edited

fjarri commented Dec 31, 2022

tarcieri commented Dec 31, 2022

tarcieri commented Jan 2, 2023

fjarri commented Jan 2, 2023

tarcieri commented Jan 2, 2023

fjarri commented Jan 2, 2023

tarcieri commented Jan 2, 2023

fjarri commented Jan 2, 2023

tarcieri commented Jan 2, 2023

fjarri commented Jan 2, 2023

tarcieri commented Jan 2, 2023

fjarri commented Jan 8, 2023

tarcieri commented Jan 8, 2023

tarcieri left a comment

Choose a reason for hiding this comment

fjarri commented Jan 8, 2023

tarcieri commented Jan 8, 2023

fjarri commented Jan 8, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri Jan 9, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri Jan 13, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjarri commented Dec 30, 2022 •

edited

fjarri commented Dec 30, 2022 •

edited

tarcieri Dec 31, 2022 •

edited

tarcieri Dec 31, 2022 •

edited

fjarri commented Dec 30, 2022 •

edited

tarcieri commented Dec 31, 2022 •

edited

fjarri commented Jan 8, 2023 •

edited

tarcieri Jan 9, 2023 •

edited

tarcieri Jan 13, 2023 •

edited