
Scalar multiplication by the generator #697

Merged
merged 1 commit into RustCrypto:master on Jan 8, 2023

Conversation

fjarri
Contributor

@fjarri fjarri commented Dec 30, 2022

This PR introduces a way to multiply a scalar by the generator by keeping precomputed lookup tables. It is exported as mul_by_generator() (and used in the signing code). It takes 40us compared to 86us for regular multiplication; signing takes 70us compared to 116us.

The precomputed table contains values for every second radix and takes ~30kb; if we calculated it for every radix, it would take 60kb for a performance improvement of only ~3%, so I don't think it is worth it (the same tradeoff was chosen by https://github.com/dalek-cryptography/curve25519-dalek).
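
The fixed-base idea described above can be sketched in a few lines. The following is a toy illustration only: the "group" is addition of integers mod a prime, standing in for secp256k1 points, and all names (`precompute`, `mul_by_generator`, the constants) are hypothetical, not the crate's API.

```rust
// Toy sketch of fixed-base multiplication with precomputed tables.
// The group here is just addition of integers mod P; illustrative only.

const P: u64 = 1_000_003; // toy group modulus
const G: u64 = 7;         // toy "generator"
const WINDOW: u32 = 4;    // radix-16 digits
const DIGITS: usize = 8;  // enough for the 32-bit toy scalars below

/// table[i][d] = d * 16^i * G, so for k = sum(d_i * 16^i),
/// k * G = sum(table[i][d_i]) -- table lookups and additions only,
/// with no doublings left at multiplication time.
fn precompute() -> Vec<[u64; 16]> {
    let mut table = Vec::with_capacity(DIGITS);
    let mut base = G; // holds 16^i * G
    for _ in 0..DIGITS {
        let mut row = [0u64; 16];
        for d in 1..16 {
            row[d] = (row[d - 1] + base) % P;
        }
        table.push(row);
        for _ in 0..WINDOW {
            base = (base + base) % P; // "double" WINDOW times
        }
    }
    table
}

fn mul_by_generator(table: &[[u64; 16]], k: u32) -> u64 {
    let mut acc = 0;
    for (i, row) in table.iter().enumerate() {
        let digit = ((k >> (WINDOW * i as u32)) & 0xF) as usize;
        acc = (acc + row[digit]) % P;
    }
    acc
}

fn main() {
    let table = precompute();
    // Spot-check against direct multiplication in the toy group.
    assert_eq!(mul_by_generator(&table, 5), 5 * G % P);
    assert_eq!(mul_by_generator(&table, 123_456), 123_456 * G % P);
    println!("ok");
}
```

The speedup comes from paying for all doublings once, at table-construction time; each subsequent multiplication is pure lookups and additions.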

I am not sure how to speed up the verification, which relies on calculating k1 * G + k2 * x. Pre-calculating lookup tables for G in lincomb only speeds it up from 129us to 122us on my machine. Calculating k1 * G via mul_by_generator() results in a timing of 126us. Not great; any ideas are welcome.
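
For reference, the k1 * G + k2 * x shape is the classic Straus/Shamir multi-scalar multiplication, where one shared chain of doublings serves both scalars. A minimal hedged sketch, again with integers mod P standing in for curve points and hypothetical names (the crate's real lincomb works on projective points):

```rust
// Minimal Straus/Shamir-style sketch of k1 * G + k2 * X with one shared
// chain of doublings, in the toy additive group of integers mod P.

const P: u64 = 1_000_003;

fn lincomb(g: u64, k1: u32, x: u64, k2: u32) -> u64 {
    let mut acc = 0u64;
    for i in (0..32).rev() {
        acc = (acc + acc) % P; // one doubling serves both scalars
        if (k1 >> i) & 1 == 1 {
            acc = (acc + g) % P;
        }
        if (k2 >> i) & 1 == 1 {
            acc = (acc + x) % P;
        }
    }
    acc
}

fn main() {
    // 3 * 7 + 5 * 11 = 76
    assert_eq!(lincomb(7, 3, 11, 5), 76);
    println!("ok");
}
```

This shows why precomputed tables for G alone help so little here: the doublings are shared with the k2 * x half, so eliminating them for G only removes the additions for one scalar.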

The precomputed tables use once_cell. Note that the critical-section feature of once_cell (which we need to support no_std) has an MSRV of 1.63. Proper gating is up to @tarcieri.

@fjarri fjarri requested a review from tarcieri December 30, 2022 04:00
@fjarri fjarri marked this pull request as draft December 30, 2022 04:00
@tarcieri
Member

Thanks for contributing this! I'm working on a somewhat analogous problem over here:

dalek-cryptography/ed25519-dalek#251

I agree it would be good to have a trait for fixed-base scalar multiplication / multiply-by-generator, particularly so basepoint tables can be enabled conditionally as a performance-vs-code-size tradeoff while encapsulating the details of whether or not basepoint tables are used.

@fjarri
Contributor Author

fjarri commented Dec 30, 2022

It is more of a startup-time/memory cost than a code-size one, since I'm using lazy_static.

But that actually gives me an idea: I can also put g * 2^(i * window) into a table so that I can get rid of the doubling.

@str4d
Contributor

str4d commented Dec 30, 2022

You may also want to look at the types in the group crate under group::wnaf, which also provide APIs for precomputing scalar multiplication tables.

k256/Cargo.toml Outdated
@@ -20,6 +20,7 @@ rust-version = "1.57"
[dependencies]
cfg-if = "1.0"
elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] }
lazy_static = { version = "1.4", features = ["spin_no_std"] }
Member

I generally prefer once_cell these days, especially as it's slated for eventual inclusion in core/std.

The spin_no_std "feature" of lazy_static was always weirdly non-additive too. Enabling it enables it for all crates that use lazy_static.

Contributor Author

Changed to once_cell. I am not 100% sure I'm using it correctly, so if you have experience with it, please check.

Contributor Author

It seems that to use once_cell in a no_std environment we need the critical-section feature, but when such a library is used with std enabled, there is a linker error that can be prevented by specifying critical-section = { version = "1.1", features = ["std"] } explicitly. This is kind of a bummer. Is there a better way?

Contributor Author

Also, once_cell has an MSRV of 1.63, so this crate's MSRV will have to be updated.

Member

@tarcieri tarcieri Dec 31, 2022

Huh? once_cell should be MSRV 1.56

This is kind of a bummer. Is there a better way?

no_std users who want the basepoint tables can link once_cell and enable the critical-section feature themselves.

But I suspect many embedded users will not want basepoint tables at all due to limitations on the size of the loaded program. In that regard it would probably be good to feature-gate them a la dalek-cryptography/curve25519-dalek#489

Since you have a mul_by_generator function, it can fall back to a slower implementation when the basepoint tables aren't available, but still provide a common API for either case.

Contributor Author

no_std users who want the basepoint tables can link once_cell and enable the critical-section feature themselves.

Not sure I would like that as a user. Seems pretty brittle.

Member

@tarcieri tarcieri Dec 31, 2022

It is a lazily evaluated value, so the limitation will be on the memory/cache usage.

Well yes, but that's what I mean by "size of the loaded program". It may need to fit in e.g. SRAM, on a device with no DRAM.

Lazy or not, at the end of the day it's a static table that needs to fit in whatever RAM the device has available, and many embedded devices are highly constrained.

As a data point, the curve25519-dalek basepoint tables were too big to fit into some ARM Cortex M0/M4 devices people were experimenting with at OxidizeConf when I attended it a few years ago.

I thought the common API would be provided by the trait, which could use mul_by_generator() or mul() (in the default implementation)

We can use the trait when it's available, which provides a default mul-based implementation, and would let you conditionally override it in the impl block to use the basepoint tables.

But absent that upstream feature in group, you can do something like this with an inherent mul_by_generator() method:

https://github.com/dalek-cryptography/curve25519-dalek/pull/489/files#diff-0dbe5da3c59af6d6e20fbb1c92b248a7bcce0a5e15f0e5de6d16c9a60abe3ee3R705-R722

Member

Not sure I would like that as a user. Seems pretty brittle.

Hopefully the ergonomic concerns here go away when it's merged upstream into core/std. Really that's the main reason to prefer it: it will soon hopefully be part of the standard library.

The spin_no_std feature of lazy_static is not a great solution here either. If one dependency in your std-dependent program activates it, suddenly every lazy static becomes spinlocked. For that reason I think it's bad for dependencies to enable it, rather than the top-level [bin] crate.

Member

@fjarri another option is to transform the precomputed tables into Rust literal syntax, and use that to define a static. That's what the dalek crates do:

https://github.com/dalek-cryptography/curve25519-dalek/blob/3effd73307a606e44469a425974f0b7b0eb85899/src/backend/serial/u64/constants.rs#L328

You can test if it diverges by running the precomputation again and seeing if it equals the static value.

Contributor Author

That will lead to increased code size, which may be less preferable than some start-up time cost, e.g. in a WASM context.

@tarcieri
Member

@str4d speaking of group, it seems like mul_by_generator might be a useful addition there? zkcrypto/group#44

@fjarri
Contributor Author

fjarri commented Dec 30, 2022

  • replaced lazy_static with once_cell
  • separated mul_by_generator into its own module, since it doesn't need to use endomorphism
  • added more precalculated tables

So now mul_by_generator() takes 40us compared to 86us for regular multiplication; signing takes 70us compared to 116us. Verification would be improved too but less so, since we will lose the advantage of calculating linear combination. The price is 60kb of cache. This can be reduced to 30kb like they did in ed25519 by making the table more sparse and therefore performing some additional doublings.
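
The sparse-table trick mentioned here (tables for every second radix-16 position, with the odd-position accumulator shifted by four doublings at the end) can be sketched in the same hedged toy setting, with integers mod P in place of curve points and hypothetical names, mirroring the structure of the real implementation:

```rust
// Toy sketch of the sparse-table trick: store tables only for the even
// radix-16 positions, accumulate odd-position digits into a second
// accumulator using the same rows, then shift it by one radix position
// (4 doublings) at the end. Toy group = integers mod P; illustrative only.

const P: u64 = 1_000_003;
const G: u64 = 7;

/// table[i][d] = d * 16^(2*i) * G for the even digit positions 0, 2, 4, 6.
fn precompute_sparse() -> Vec<[u64; 16]> {
    let mut table = Vec::with_capacity(4);
    let mut base = G; // holds 16^(2*i) * G
    for _ in 0..4 {
        let mut row = [0u64; 16];
        for d in 1..16 {
            row[d] = (row[d - 1] + base) % P;
        }
        table.push(row);
        for _ in 0..8 {
            base = (base + base) % P; // advance two radix positions
        }
    }
    table
}

fn mul_by_generator_sparse(table: &[[u64; 16]], k: u32) -> u64 {
    let digit = |pos: u32| ((k >> (4 * pos)) & 0xF) as usize;
    let mut acc = 0u64;  // even positions
    let mut acc2 = 0u64; // odd positions, off by one radix position
    for i in 0..4u32 {
        acc = (acc + table[i as usize][digit(2 * i)]) % P;
        acc2 = (acc2 + table[i as usize][digit(2 * i + 1)]) % P;
    }
    // The price of halving the table: 4 extra doublings (i.e. * 16).
    for _ in 0..4 {
        acc2 = (acc2 + acc2) % P;
    }
    (acc + acc2) % P
}

fn main() {
    let table = precompute_sparse();
    assert_eq!(mul_by_generator_sparse(&table, 0x21), 33 * G % P);
    assert_eq!(mul_by_generator_sparse(&table, 0xDEAD_BEEF), 0xDEAD_BEEFu64 * G % P);
    println!("ok");
}
```

Since an odd digit position 2i+1 contributes d * 16^(2i+1) * G = 16 * (d * 16^(2i) * G), the same table row serves both accumulators, and only four doublings are needed regardless of scalar size.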

Also I cannot change the verification code in this PR since it's in signatures/ecdsa and not in k256.

I guess this PR is blocked by zkcrypto/group#44; after that's merged and published, this can be updated too, and then signatures/ecdsa to use it for verification.

@fjarri
Contributor Author

fjarri commented Dec 30, 2022

As for using wNAF, I am not certain about it: won't it make the multiplication variable-time?
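
For context, wNAF recodes the scalar into signed digits whose pattern depends on the scalar's bits, which is exactly the source of the variable-time concern: the resulting add/skip sequence leaks through timing, so it is fine for verification (public inputs) but not for signing. A hedged recoding sketch (a hypothetical helper, not group::Wnaf's actual API):

```rust
// Width-w NAF recoding: each nonzero digit is odd and lies in
// (-2^(w-1), 2^(w-1)), and the add/skip pattern depends on the scalar --
// hence the variable-time concern. Sketch only.

fn wnaf(mut k: u64, w: u32) -> Vec<i64> {
    let half = 1i64 << (w - 1);
    let full = 1i64 << w;
    let mut digits = Vec::new();
    while k > 0 {
        if k & 1 == 1 {
            let mut d = (k & ((1u64 << w) - 1)) as i64; // k mod 2^w
            if d >= half {
                d -= full; // recentre into (-2^(w-1), 2^(w-1))
            }
            k = (k as i64 - d) as u64;
            digits.push(d);
        } else {
            digits.push(0);
        }
        k >>= 1;
    }
    digits
}

/// Recover k = sum(digits[i] * 2^i) to sanity-check the recoding.
fn reconstruct(digits: &[i64]) -> i64 {
    digits.iter().rev().fold(0, |acc, &d| acc * 2 + d)
}

fn main() {
    assert_eq!(reconstruct(&wnaf(13, 3)), 13);
    assert_eq!(reconstruct(&wnaf(0xDEAD_BEEF, 4)), 0xDEAD_BEEF);
    println!("ok");
}
```

The payoff of wNAF is that runs of zero digits cost only doublings, and the signed digits halve the table size, at the price of a scalar-dependent (variable-time) execution trace.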

@tarcieri
Member

tarcieri commented Dec 31, 2022

@fjarri I wouldn't say this PR is blocked on zkcrypto/group#44

You can make mul_by_generator an inherent method for now as you've already implemented it, and then we can potentially circle back on getting it added to the trait.

@fjarri
Contributor Author

fjarri commented Dec 31, 2022

I guess I'll focus on providing the optimized logic in this PR, and leave the issues of gating and once_cell usage to you :)

@tarcieri
Member

Sure, that's fine. Happy to do a followup which adds a basepoint-tables feature along the lines of dalek-cryptography/curve25519-dalek#489

@fjarri fjarri marked this pull request as ready for review January 1, 2023 02:12
@fjarri fjarri force-pushed the mul-by-generator branch 2 times, most recently from 4de6356 to 312fecd January 1, 2023 02:18
@tarcieri
Member

tarcieri commented Jan 2, 2023

Also I cannot change the verification code in this PR since it's in signatures/ecdsa and not in k256.

You can still change the verification implementation if you like: https://github.com/RustCrypto/elliptic-curves/blob/8ffa72c/k256/src/ecdsa.rs#L241

It's currently using the provided implementation of verify_prehashed:

https://docs.rs/ecdsa/0.15.0-pre.1/ecdsa/hazmat/trait.VerifyPrimitive.html#method.verify_prehashed

However if you copy/paste that implementation into the VerifyPrimitive impl you can tweak it however you'd like.

@fjarri
Contributor Author

fjarri commented Jan 2, 2023

It's actually not necessary anymore, since I currently don't know how to speed up verification (see the updated top message).

@tarcieri
Member

tarcieri commented Jan 2, 2023

Yeah, seems like the two approaches are mutually incompatible.

It might be worth checking if the speedup provided by precomputed basepoint tables outperforms using linear combinations, though that said, being able to use the provided implementation of ECDSA from the ecdsa crate is also nice.

@fjarri
Contributor Author

fjarri commented Jan 2, 2023

It might be worth checking if the speedup provided by precomputed basepoint tables outperforms using linear combinations

I did check that (if I understand you correctly); see the timings in the first post.

@tarcieri
Member

tarcieri commented Jan 2, 2023

I meant specifically in regard to verification, where you said this:

Verification would be improved too but less so, since we will lose the advantage of calculating linear combination.

...which implies it would still benefit from basepoint tables over linear combinations, but I wasn't clear how much.

@fjarri
Contributor Author

fjarri commented Jan 2, 2023

Yes, at the moment I wrote that I was only guessing, but then I actually measured the different approaches and updated the top post. Sorry for the somewhat non-linear post history.

@tarcieri
Member

tarcieri commented Jan 2, 2023

@fjarri
Contributor Author

fjarri commented Jan 2, 2023

That's basically the approach I tried in here:

Pre-calculating lookup tables for G in lincomb only speeds it up from 129us to 122us on my machine.

Of course, that was still constant-time, using regular windows. They also use wNAF (since verification does not have to be constant-time), which is an orthogonal change; it may give more of a speed-up, but I don't think it should be done in this PR.

@tarcieri
Member

tarcieri commented Jan 2, 2023

That's cool.

Per @str4d's comment, I've been curious if the wNAF functionality in group::wnaf could assist here.

@fjarri
Contributor Author

fjarri commented Jan 8, 2023

@tarcieri , to use group::Wnaf I'll need group = 0.13, which requires elliptic-curve = 0.13 to be released and its version bumped here.

@tarcieri
Member

tarcieri commented Jan 8, 2023

Duly noted. I'll try to wrap up another release cycle soon so we can switch master over to elliptic-curve v0.13 prereleases

Member

@tarcieri tarcieri left a comment

Speaking of which, it would be nice to get this PR into that set of releases.

@fjarri it seems like this should be good to go other than feature-gating it? (which I can do in a followup)

@fjarri
Contributor Author

fjarri commented Jan 8, 2023

Speaking of which, it would be nice to get this PR into that set of releases.

I was thinking the same, but it seems that I will need those version bumps first. The ecdsa dep will have to be bumped too, since ecdsa=0.15.0-rc.1 depends on elliptic-curve=0.12.

@fjarri it seems like this should be good to go other than feature-gating it? (which I can do in a followup)

Yes, should be fine.

@tarcieri
Member

tarcieri commented Jan 8, 2023

I was thinking the same, but it seems that I will need those version bumps first.

What do you need bumped specifically?

The ecdsa dep will have to be bumped too since ecdsa=0.15.0-rc.1 depends on elliptic-curve=0.12 .

...which is the version currently in use? Or are you talking about for adding wNAF?

@fjarri
Contributor Author

fjarri commented Jan 8, 2023

Or are you talking about for adding wNAF?

Yes, for wNAF we need the new group, which means the new ecdsa (it will have to be -rc.2, I guess, because rc.1 still depends on elliptic-curve=0.12) and elliptic-curve. For this PR no bumps are necessary.

@tarcieri tarcieri merged commit a15923c into RustCrypto:master Jan 8, 2023
@fjarri fjarri deleted the mul-by-generator branch January 8, 2023 23:53
Comment on lines +392 to +408
/// Calculates `k * G`, where `G` is the generator.
pub fn mul_by_generator(k: &Scalar) -> ProjectivePoint {
    let digits = Radix16Decomposition::<65>::new(k);
    let table = *GEN_LOOKUP_TABLE;
    let mut acc = table[32].select(digits.0[64]);
    let mut acc2 = ProjectivePoint::IDENTITY;
    for i in (0..32).rev() {
        acc2 += &table[i].select(digits.0[i * 2 + 1]);
        acc += &table[i].select(digits.0[i * 2]);
    }
    // This is the price of halving the precomputed table size (from 60kb to 30kb);
    // the performance hit is minor, about 3%.
    for _ in 0..4 {
        acc2 = acc2.double();
    }
    acc + acc2
}
Member

@fjarri one thing I noticed too late while refactoring... I think this would make sense as an inherent method of ProjectivePoint, so it can be called as ProjectivePoint::mul_by_generator(&scalar)?

That way you don't have to import it separately.

Contributor Author

Yes, I think I put it into mul.rs to make use of Radix16Decomposition and LookupTable, but it can definitely be defined in an impl block there.

Member

@tarcieri tarcieri Jan 9, 2023

Cool, I think I'll go ahead and refactor it into an inherent method in #705 then

Eh, or not, let's get the feature added first and then I can try to figure out how best to factor the inherent method change.

@@ -20,6 +20,7 @@ rust-version = "1.60"
[dependencies]
cfg-if = "1.0"
elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] }
once_cell = { version = "1.16", default-features = false, features = ["critical-section"] }

🤔 not sure that enabling critical-section here is the right call. I'd imagine it should be up to the leaf crate with the Cargo.lock to enable this, so that pulling in this crate doesn't pull in critical-section unconditionally.

One improvement here would be to add std = ["once_cell/std"], so that once_cell doesn't use critical-section (Cargo would still download and compile the crate; it'll just be unused).

But I think what you should do is:

[dependencies]
once_cell = { version = "1.16", default-features = false }

[features]
default = ["arithmetic", "ecdsa", "pkcs8", "schnorr", "std"]
alloc = ["ecdsa-core?/alloc", "elliptic-curve/alloc"]

# Either std or critical-section should be enabled
std = ["once_cell/std", "alloc", "ecdsa-core?/std", "elliptic-curve/std"]
critical-section = ["once_cell/critical-section"]

That way:

  • people with std don't get critical-section at all
  • people with no_std would have to explicitly opt into critical-section
    • that's a bit bad because it doesn't "just work"
    • OTOH, silently opting users into critical-section behavior feels not super friendly. It's a big hammer; better to be explicitly aware of it!

Member

Sounds good to me. Want to make a PR?

FWIW, the situation is probably not as bad as it seems, because I expect most embedded users will want to disable precomputed-tables anyway to reduce the size of the library in memory


Want to make a PR?

Not really :)

FWIW, the situation is probably not as bad as it seems

This is a bit ambiguous, but with the current code my main worry is std people who get opted into critical-section. I am not sure how exactly critical-section behaves on std, but at minimum it hits a much less tested and much slower path of once_cell, for all once cells in the final binary, and at maximum it actually serializes access to all once cells everywhere 😨

Member

@tarcieri tarcieri Jan 13, 2023

Heh ok, I'll make a PR.

Edit: opened #710

Contributor Author

for all once cells in the final binary

I guess the once_cell API has the same problems as lazy_static. Big yikes. Hopefully that won't be the case once it's part of the language proper.


std won't have this problem by virtue of not supporting no_std at all. If you want to configure synchronisation for a specific once_cell, rather than for all of them globally, https://crates.io/crates/generic_once_cell is the right crate.

Contributor Author

Ah, so it's going to be std-only? That's a bummer.

Member

@fjarri with an alloc dependency it would be possible to use race::OnceBox

@tarcieri tarcieri mentioned this pull request Jan 16, 2023