New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scalar multiplication by the generator #697
Conversation
4245488
to
755e189
Compare
Thanks for contributing this! I'm working on a somewhat analogous problem over here: dalek-cryptography/ed25519-dalek#251 I agree it would be good to have a trait for fixed-base scalar multiplication / multiply-by-generator, particularly so basepoint tables can be enabled conditionally as a performance-vs-code-size tradeoff while encapsulating the details of whether or not basepoint tables are used. |
It is more of a startup time/memory than code size, since I'm using But that actually gives me an idea, I can also put the |
You may also want to look at the types in the |
k256/Cargo.toml
Outdated
@@ -20,6 +20,7 @@ rust-version = "1.57" | |||
[dependencies] | |||
cfg-if = "1.0" | |||
elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] } | |||
lazy_static = { version = "1.4", features = ["spin_no_std"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally prefer once_cell
these days, especially as it's slated for eventual inclusion in core
/std
.
The spin_no_std
"feature" of lazy_static
was always weirdly non-additive too. Enabling it enables it for all crates that use lazy_static
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to once_cell
. I am not 100% sure I'm using it correctly, so if you have experience with it, please check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that to use once_cell
in no_std
environment we need the critical_section
feature, but when such a library is used with std
enabled, there is a linker error that can be prevented by specifying critical-section = { version = "1.1", features = ["std"] }
explicitly. This is kind of a bummer. Is there a better way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also once_cell
has the MSRV 1.63, so this crate's MSRV will have to be updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh? once_cell
should be MSRV 1.56
This is kind of a bummer. Is there a better way?
no_std
users who want the basepoint tables can link once_cell
and enable the critical_section
feature themselves.
But I suspect many embedded will not want basepoint tables at all due to limitations on the size of the loaded program. In that regard it would probably be good to feature gate them ala dalek-cryptography/curve25519-dalek#489
Since you have a mul_by_generator
function it can fallback to a slower implementation when the basepoint tables aren't available, but still provide a common API for either case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no_std users who want the basepoint tables can link once_cell and enable the critical_section feature themselves.
Not sure I would like that as a user. Seems pretty brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a lazily evaluated value, so the limitation will be on the memory/cache usage.
Well yes, but that's what I mean by "size of the loaded program". It may need to fit in e.g. SRAM, on a device with no DRAM.
Lazy or not, at the end of the day it's a static table that needs to fit in whatever RAM the device has available, and many embedded devices are highly constrained.
As a data point, the curve25519-dalek
basepoint tables were too big to fit into some ARM Cortex M0/M4 devices people were experimenting with at OxidizeConf when I attended it a few years ago.
I thought the common API would be provided by the trait, which could use mul_by_generator() or mul() (in the default implementation)
We can use the trait when it's available, which provides a default mul
-based implementation, and would let you conditionally override it in the impl block to use the basepoint tables.
But absent that upstream feature in group
, you can do something like this with an inherent mul_by_generator()
method:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I would like that as a user. Seems pretty brittle.
Hopefully the ergonomic concerns here go away when it's merged upstream into core
/std
. Really that's the main reason to prefer it: it will soon hopefully be part of the standard library.
The spin_no_std
feature of lazy_static
is not a great solution here either. If one dependency in your std
-dependent program activates it, suddenly every lazy static becomes spinlocked. And for that reason I think it's bad for dependencies to enable it rather than the toplevel [bin]
crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjarri another option is to transform the precomputed tables into Rust literal syntax, and use that to define a static. That's what the dalek crates do:
You can test if it diverges by running the precomputation again and seeing if it equals the static value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will lead to increased code size which may be less preferable than some start-up time, e.g. in WASM context.
@str4d speaking of |
755e189
to
02b5469
Compare
So now Also I cannot change the verification code in this PR since it's in I guess this PR is blocked by zkcrypto/group#44; after that's merged and published, this can be updated too, and then |
As for using wNAF, I am not certain, won't it make the multiplication variable-time? |
@fjarri I wouldn't say this PR is blocked on zkcrypto/group#44 You can make |
I guess I'll focus on providing the optimized logic in this PR, and leave the issues of gating and |
Sure, that's fine. Happy to do a followup which adds a |
02b5469
to
27833bb
Compare
4de6356
to
312fecd
Compare
You can still change the verification implementation if you like: https://github.com/RustCrypto/elliptic-curves/blob/8ffa72c/k256/src/ecdsa.rs#L241 It's currently using the provided implementation of https://docs.rs/ecdsa/0.15.0-pre.1/ecdsa/hazmat/trait.VerifyPrimitive.html#method.verify_prehashed However if you copy/paste that implementation into the |
It's actually not necessary anymore, since I currently don't know how to speed up verification (see the updated top message). |
Yeah, seems like the two approaches are mutually incompatible. It might be worth checking if the speedup provided by precomputed basepoint tables outperforms using linear combinations, though that said, being able to use the provided implementation of ECDSA from the |
I did check it (if I understand you correctly), see the timings in the first post |
I meant specifically in regard to verification, where you said this:
...which implies it would still benefit from basepoint tables over linear combinations, but I wasn't clear how much. |
Yes, at the moment I wrote this I was only guessing, but then I actually measured different approaches and updated the top post. Sorry for a bit of a non-linear post history. |
@fjarri here's some semi-related code from curve25519-dalek, if it helps: https://github.com/dalek-cryptography/curve25519-dalek/blob/e01bb1b/src/backend/serial/scalar_mul/vartime_double_base.rs#L22-L64 |
That's basically the approach I tried in here:
Of course, that was still constant-time, using regular windows. They also use wNAF, since the verification does not have to be constant-time, which is an orthogonal change. It may give more speed-up. I don't think it should be done in this PR though. |
That's cool. Per @str4d's comment, I've been curious if the wNAF functionality in |
@tarcieri , to use |
Duly noted. I'll try to wrap up another release cycle soon so we can switch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of which, it would be nice to get this PR into that set of releases.
@fjarri it seems like this should be good to go other than feature-gating it? (which I can do in a followup)
312fecd
to
dd470ce
Compare
I was thinking the same, but it seems that I will need those version bumps first. The
Yes, should be fine. |
What do you need bumped specifically?
...which is the version currently in use? Or are you talking about for adding wNAF? |
Yes, for wNAF we need the new |
/// Calculages `k * G`, where `G` is the generator. | ||
pub fn mul_by_generator(k: &Scalar) -> ProjectivePoint { | ||
let digits = Radix16Decomposition::<65>::new(k); | ||
let table = *GEN_LOOKUP_TABLE; | ||
let mut acc = table[32].select(digits.0[64]); | ||
let mut acc2 = ProjectivePoint::IDENTITY; | ||
for i in (0..32).rev() { | ||
acc2 += &table[i].select(digits.0[i * 2 + 1]); | ||
acc += &table[i].select(digits.0[i * 2]); | ||
} | ||
// This is the price of halving the precomputed table size (from 60kb to 30kb) | ||
// The performance hit is minor, about 3%. | ||
for _ in 0..4 { | ||
acc2 = acc2.double(); | ||
} | ||
acc + acc2 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjarri one thing I noticed too late refactoring... I think this would make sense as an inherent method of ProjectivePoint
so it can be called as ProjectivePoint::mul_by_generator(&scalar)
?
That way you don't have to import it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think I put it into mul.rs
to make use of Radix16Decomposition
and LookupTable
, but it can be definitely defined in an impl
block there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I think I'll go ahead and refactor it into an inherent method in #705 then
Eh, or not, let's get the feature added first and then I can try to figure out how best to factor the inherent method change.
@@ -20,6 +20,7 @@ rust-version = "1.60" | |||
[dependencies] | |||
cfg-if = "1.0" | |||
elliptic-curve = { version = "0.12.3", default-features = false, features = ["hazmat", "sec1"] } | |||
once_cell = { version = "1.16", default-features = false, features = ["critical-section"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 not sure that enabling critical-section here is the right call. I'd imagine it should be up to the leaf crate with Cargo.lock to enable this, such that pulling this crate doesn't pull cs unconditionally.
One imporovement here would be to add std = "once-cell/std"
, so that once_cell doesn't use critical section (Cargo would still download and compile this crate, it'll just be unused).
But I think what you should do is:
[dependencies]
once_cell = { version = "1.16", default-features = false }
[features]
default = ["arithmetic", "ecdsa", "pkcs8", "schnorr", "std"]
alloc = ["ecdsa-core?/alloc", "elliptic-curve/alloc"]
# Either std or critical-section should be enabled
std = ["once-cell/std", "alloc", "ecdsa-core?/std", "elliptic-curve/std"]
critical-section = ["once-cell/critical-section"]
That way:
- people with
std
don't getcritical_section
at all - people with
no_std
would have to explicitly opt-into critical section- that's a bit bad because it doesn't "just work"
- OTOH, silently opting the users in critical-section behavior feels not super friendly. It's a big hammer, better to be explicitly aware about it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. Want to make a PR?
FWIW, the situation is probably not as bad as it seems, because I expect most embedded users will want to disable precomputed-tables
anyway to reduce the size of the library in memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Want to make a PR?
Not really :)
FWIW, the situation is probably not as bad as it seems
This is a bit ambiguous, but, with the current code, my main worry is std people who get opted into critical-section. I am not sure how exactly critical-section behaves on std, but at minimum that hits much less tested and much slower path of once_cell, for all once cells in the final binary, and, at maximum, that actually serializes access to all once cells everywhere 😨
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh ok, I'll make a PR.
Edit: opened #710
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for all once cells in the final binary
I guess the once_cell
API has the same problems as lazy_static
. Big yikes. Hopefully it won't be the case when it's a part of the language proper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std won't have this problem by virtue of not supporting no_std at all. If you want to configure synchronisation for a specific once_cell, rather than for all of them globally, https://crates.io/crates/generic_once_cell is the right crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so it's going to be std-only? That's a bummer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjarri with an alloc
dependency it would be possible to use race::OnceBox
This PR introduces a way to multiply a scalar by the generator by keeping precomputed lookup tables. It is exported as
mul_by_generator()
(and used in the signing code). It takes 40us compared to 86us for regular multiplication; signing takes 70us compared to 116us.The precomputed table contains values for every second radix and takes ~30kb; if we calculated it for every radix it would take 60kb with the performance improvement of only ~3%, so I don't think it is worth it (the same choice was taken by https://github.com/dalek-cryptography/curve25519-dalek).
I am not sure how to speed up the verification, which relies on calculating
k1 * G + k2 * x
. Pre-calculating lookup tables forG
inlincomb
only speeds it up from 129us to 122us on my machine. Calculatingk1 * G
viamul_by_generator()
results in the timing of 126us. Not great, any ideas are welcome.The precomputed tables use
once_cell
. Note that the featurecritical-section
ofonce_cell
(which we need to use to supportno_std
) has the MSRV of 1.63. Proper gating is up to @tarcieri