Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hasher::new_derive_key accepts a str, why not [u8]? #13

Closed
WildCryptoFox opened this issue Jan 10, 2020 · 21 comments
Closed

Hasher::new_derive_key accepts a str, why not [u8]? #13

WildCryptoFox opened this issue Jan 10, 2020 · 21 comments

Comments

@WildCryptoFox
Copy link

There is no reason for UTF-8 here. Please s/str/[u8]/ here. Alternatively provide another function to avoid the breaking change. Though, as I've mentioned in #11 I'm not against a breaking change for consistency with other crates, especially this early.

https://docs.rs/blake3/0.1.0/blake3/struct.Hasher.html#method.new_derive_key

pub fn new_derive_key(context: &str) -> Self
@Luro02
Copy link

Luro02 commented Jan 10, 2020

You could also change the signature to

pub fn new_derive_key<T: AsRef<[u8]>>(context: T) -> Self

which would allow vec and slices

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 10, 2020

@Luro02 That's also almost a non-breaking change.. but there are cases which would break (I.e. Hasher::new_derive_key(foo.into())). Could also use (context: impl AsRef<[u8]>). I usually do this, but forgot to suggest it in the top post. :)

@oconnor663
Copy link
Member

&str (and char* in the C code) was a deliberate choice in this implementation (not required by the spec) to discourage misuse. For all the details, please take a look at Sections 6.2 and 7.8 in the spec and at the summarized version in the Rust docs. I should be upfront that this is something that we knew in advance a lot of programmers would find weird, and that a lot of people would think the problem we're trying to solve isn't a big problem. So with that in mind, I'll describe what we mean by "misuse". Forgive me for what's going to be a long comment.

The way derive_key is supposed to be used is with a constant context string, like this:

const XYZ_CONTEXT_STRING: &str = "foo bar app 2020-01-10 14:54:14 purpose xyz";

fn purpose_xyz_subkey(root_key: &[u8; 32]) -> [u8; 32] {
    let mut out = [0; 32];
    blake3::derive_key(XYZ_CONTEXT_STRING, root_key, &mut out);
    out
}

Of course not everyone is going to follow that exact format for the context string, I'm not kidding myself, but the main thing is that I've picked something that no one else will pick and that I've hardcoded it. In particular, derive_key is not supposed to be used with a dynamic context string, like this:

fn purpose_xyz_user_subkey(root_key: &[u8; 32], user_id: &str) -> [u8; 32] {
    let mut out = [0; 32];
    let context_string = format!("foo bar app purpose xyz {}", user_id);
    blake3::derive_key(&context_string, root_key, &mut out);  // PLEASE DO NOT DO THIS
    out
}

Why not? The simplest reason is that some other programmer in the same application might decide to use purpose xyz 2 in their context string, and now maybe there are collisions between different subkeys when the user_id starts with a "2". Or maybe some other programmer changes user IDs from numbers to strings next year, not realizing that allowing whitespace in IDs destroys domain separation. We could try to define some specific convention like "always insert a null character between the context and dynamic data", but I think 1) it's unrealistic to expect programmers in the real world to respect such a convention, and 2) even that's not good enough, without additional even-less-realistic requirements like "the dynamic data must never contain null characters". (Similarly, a convention like "never use a context string that's a prefix of any other" implicitly also requires "always append dynamic data rather than prepending it.")

But beyond avoiding "obvious" mistakes, derive_key has a more difficult goal. It's trying to be safe to use, even with key material that's already in use with other algorithms. For example, say you're using a key with FooBar AEAD Cipher. Then (maybe years later) you decide to add new features, and you need to use the same key in new ways. If you can't break backwards compatibility by switching FooBar from the original key to a derived key, can you safely use derive_key to derive an independent subkey for your new feature? What we don't want, is for the answer to that question to be "that depends on whether FooBar uses derive_key internally." We don't want that, because no one in the real world knows things like that. FooBar might be in a totally different library, written by some other team in your company. Who knows. And the FooBar team might even change FooBar later to use derive_key internally, without knowing about the assumption you made.

What we'd rather say, is something like "you don't have to worry about it, as long as you and the FooBar team are both using derive_key correctly." We just need to figure out some meaning of "correctly" that's reasonable for the real world. For BLAKE3, we settled on: the context string should be globally unique, application-specific, and hardcoded. If you and the FooBar devs both use hardcoded context strings, and you pick them specifically enough that they're not going to happen to be exactly the same string, then you are safe, even though both of you are using derive_key without coordinating with each other.

So anyway, what does this have to do with &str vs &[u8]? My thinking is this: If the thing you are feeding to derive_key requires arbitrary bytes, you are almost certainly violating that security requirement with some kind of dynamic input. And if the surprising type on that parameter causes a programmer to go "huh" and read the docs, that's a really good thing too.

One objection might be that not all programming languages represent strings as UTF-8 bytes, and that this is going to be a burden on e.g. C# developers. I agree that it's a burden, but I think we'd have the same problem if we were taking bytes, except the problem would fall on every caller instead of on library/bindings authors.

Note that derive_key(context, key_material) is functionally identical to keyed_hash(key=hash(context), input=key_material). The only difference between those things is that derive_key sets a different internal flag, so its output is domain-separated. For that domain separation to have any value in the real world, we have to ask people to use derive_key in some vaguely consistent way. Saying "always hardcode the context" is dirt simple, and simplicity is crucial here. If you do know exactly what you're doing, and you want to do fun tricks with arbitrary bytes, you can go to town with hash + keyed_hash and take domain separation into your own hands.

So, with apologies for asking you read all that, what use cases do you have in mind for using arbitrary bytes in the context string?

@WildCryptoFox
Copy link
Author

@oconnor663 https://docs.rs/merlin uses &'static [u8] in attempt to prevent this misuse. The special 'static lifetime only applies to references which outlive the program itself; including memory leaks Box::leak. This requires leaking or unsafe code to misuse.

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 10, 2020

fn f(_: &str) {}
fn g(_: impl AsRef<[u8]> + 'static) {}

fn main() {
    f("aaa");
    g("aaa");
    g(b"aaa");
    //g(&"aaa".to_string());
    //---^^^^^^^^^^^^^^^^^-- temporary value is freed at the end of this statement
    //  |     |  |
    //  |     |  creates a temporary which is freed while still in use
    //  |     argument requires that borrow lasts for `'static`
}

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 10, 2020

Another option would be to use a doc-hidden function as a hidden constructor and use a macro which binds in the crate name or module path.

macro_rules! context {
    ($tt:tt) => {
        $crate::do_not_call_me_manually(
            concat!(env!("CARGO_PKG_NAME"), " or ", module_path!(), ": ", $tt).as_bytes()
        )
    }
}

struct SafeContextStr(&'static [u8]);

#[doc(hidden)]
fn do_not_call_me_manually(s: &'static [u8]) -> SafeContextStr {
    SafeContextStr(s)
}

fn f(_: SafeContextStr) {}

fn main() {
    f(context!("foobar"));
}

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 10, 2020

So, with apologies for asking you read all that, what use cases do you have in mind for using arbitrary bytes in the context string?

Returning to your question. I'd rather expose impl AsRef<[u8]> + 'static or &'static [u8] in my own transcript hashing library where I don't want to restrict the context string to UTF-8; but I do want it to be hardcoded and unique.

Edit: I see @Luro02 mentioned vectors; that'd be a bad example but impl AsRef<[u8]> does cover other types which may be suitable. Though it would be fine as just &'static [u8].

@oconnor663
Copy link
Member

oconnor663 commented Jan 13, 2020

I've used &'static before, but more recently I'm not super into that approach. As you mentioned, it can be worked around with Box::leak, and also with global things like lazy_static or once_cell. Additionally, it would disallow something like this, which I don't really have a good reason to disallow:

const COMPANY_NAME: &str = "example.com";
const PURPOSE_A: &str = "2020-01-13 16:17:46 purpose A";
const PURPOSE_B: &str = "2020-01-13 16:17:47 purpose B";

let root_key = b"my secret key";
let mut subkey_a = [0; 32];
blake3::derive_key(&format!("{} {}", COMPANY_NAME, PURPOSE_A), root_key, &mut subkey_a);
let mut subkey_b = [0; 32];
blake3::derive_key(&format!("{} {}", COMPANY_NAME, PURPOSE_B), root_key, &mut subkey_b);

Those are "dynamic" context strings in a kind of pedantic sense, but they don't contain any sort of input, which is what really matters. It wouldn't be my first recommendation --- mainly because seeing format! with derive_key might give some readers the impression that dynamic context strings are ok --- but there's nothing inherently wrong with it. I worry that if we get overzealous in the things we forbid, we might get programmers in the habit of working around our limitations, which would also be bad.

Another consideration is that the patterns in this library will have a big influence on future implementations of BLAKE3 in other languages. If we use &'static [u8] here, that'll get lost in translation. But if we use &str, that's a strong suggestion for other languages to use char* or std::string_view, which I believe they should.

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 14, 2020

Concatenating arrays does work with const-generics. staticvec::StaticVec exposes a similar interface; just consider the following without the vector.

const SUBKEY_A = StaticVec::new_from_const_array(COMPANY_NAME)
    .concat(StaticVec::new_from_const_array(PURPOSE_A));

The resulting StaticVec (if not just an array) can be passed to blake3::Hasher::new if it accepts impl AsRef<[u8]> + 'static (or &'static [u8] for just arrays).

Alternatively without const-generics you could use macros for this.

macro_rules! define {
    ($name:ident = $str:tt) => {
        macro_rules! $name {
            () => { $str }
        }
    }
}
macro_rules! context {
    ($($name:ident),*) => {
        concat!("" $(,$name!(),)" "*)
    }
}

define!(COMPANY_NAME = "example.com");
define!(PURPOSE_A = "purpose A");
define!(PURPOSE_B = "purpose B");

const SUBKEY_A: &str = context!(COMPANY_NAME, PURPOSE_A);
const SUBKEY_B: &str = context!(COMPANY_NAME, PURPOSE_B);

Edit: I have previously tried to define macros from inside macros and was surprised to see this define!-context! pair to work. This requires at least rustc 1.40 (the latest stable as of writing).

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 14, 2020

Regarding other languages. Both &str and &[u8] are const char* (or const uint8_t* ) in C, not char*. CStr in Rust maps to NULL-terminated char*. I don't know if there are any __magic_things__ to get something similar to 'static. Edit: C's const is the closest I think we have to Rust's 'static.

@Luro02
Copy link

Luro02 commented Jan 14, 2020

pub fn derive_key(context: &str, key_material: &[u8], output: &mut [u8])

Is the &str constrained to a static lifetime?
If not you can still use dynamic data with either https://doc.rust-lang.org/std/str/fn.from_utf8.html or https://doc.rust-lang.org/std/string/struct.String.html#method.as_str
so it is still possible to add something not hard coded.

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 14, 2020

@Luro02 No specified lifetime does not imply 'static unless it is a const or static, as those must be. Using 'static is a method to deny dynamically constructed strings to avoid poorly defined contexts. The alternative construction methods demonstrate you can have something partially dynamic if all the inputs are known at compile time. No runtime accepting user input (except to maybe decide which of the static set to use).

@oconnor663
Copy link
Member

You can do a fully dynamic (but also 'static) string this way:

use once_cell::sync::Lazy;
use std::io::prelude::*;

static CONTEXT: Lazy<String> = Lazy::new(|| {
    let mut s = String::new();
    std::io::stdin().read_to_string(&mut s).unwrap();
    s
});

fn main() {
    let s: &'static str = &*CONTEXT;
    println!("{}", s);
}

Both &str and &[u8] are uint8_t const* (or unsigned char const*) in C, not char*.

In layout yes, but what I meant to suggest was that each language should use its natural string type for the context parameter. As in, whatever type you get from the literal "foo".

(There is a performance downside to accepting a char*, because you'll have to traverse it an extra time to get the length. I expect that will be negligible in practice. Also, applications that really care about the performance of derive_key, will want to bust open the implementation and hardcode the intermediate context key, skipping the first hashing step entirely.)


All of the above comments are good points, and I expect this thread to grow over the coming year. My thoughts about this API are probably going to change over time. The crate is v0.1.1 now, and I'm not at all opposed to changing this API for v0.2, v0.3, etc. Incidentally I also expect the stabilization of const generics to have an influence here: I would rather return an [u8; N] than force the caller to pass in &mut [u8]. (The current API forces the caller to create an empty buffer first, and I fear cases where the caller might accidentally delete the call to derive_key and just use the empty buffer as a key. Variable length output will always be possible using the Hasher::finalize_xof API.)

@WildCryptoFox
Copy link
Author

WildCryptoFox commented Jan 15, 2020

The type you get with "string literals" in C is const char* which is exactly what we want, and matches &'static str better than I thought before. However, C's const char* does not restrict the string to UTF-8, like str in Rust does. Thus the appropriate translation here is closer to &'static [u8] when we use C's "string literals" as the baseline. Only 'almost' because C's strings are NULL-terminated.


Side note: once_cell's Lazy types are likely to land in std/core and I cannot wait for const-generics to land in full!

@oconnor663
Copy link
Member

oconnor663 commented Jan 15, 2020

Tangent: My understanding is that a const char* in C is very different from a &[u8] in Rust. It does not mean that the pointed-to bytes are immutable (even for the lifetime of the reference), but rather than this pointer in particular should not be used to mutate them. There maybe be other non-const pointers to the same bytes at the same time.

@WildCryptoFox
Copy link
Author

Hmm. Okay. Still the best matching without hunting through __magic_attributes__.

@oconnor663
Copy link
Member

Going to close this one for now. We will definitely revisit this API when const generics land, if not before.

@Luro02
Copy link

Luro02 commented Feb 4, 2020

I should keep this issue open and tag this with a blocked tag?
You might forget this issue, because closed issues are not shown by default under issues.

@WildCryptoFox
Copy link
Author

@oconnor663 Almost closed with impl crypto_mac::Mac for Hasher, but you didn't override the new_varkey method. Now that'd be a breaking change... Whoops. :/

@oconnor663
Copy link
Member

@Luro02 I prefer the current API to the alternatives we've proposed here so far. A caller requiring arbitrary bytes in the context field is odd, and I still suspect that the substantial majority of such callers are actually violating the security requirements and would be better served by an API that did not let them do that.

The main thing that would make me reconsider is a caller in the real world with a specific requirement. For example, "we're exposing this crate to a large legacy C++ codebase, and our convention for all string constants is UCS-2, and we cannot afford to pay the cost of converting to ASCII/UTF-8 every time a key is derived." There are several reasons I think such a caller is unlikely to appear:

  • That sort of codebase is unlikely to wrap a Rust crate.
  • That sort of codebase tends to have a million different string types anyway.
  • Callers with intense performance needs for derive_key already have a reason to fork the implementation: they need to skip the first hashing step and precompute+hardcode the intermediate context key. (Or somehow arrange for it to be precomputed at compile time.) Such an API is not exposed in this crate, because of the high potential for abuse compared to its niche utility. If you're forking the implementation to provide yourself such an API, you can give yourself the option of supplying arbitrary bytes at the same time.

So anyway, it could happen, but I want to wait until it does happen.

@oconnor663
Copy link
Member

See #42 for my stream-of-consciousness thoughts about what callers should be doing with arbitrary dynamic bytes. I'd love to have other folks chime in there with ideas.

jac-cbi pushed a commit to jac-cbi/BLAKE3 that referenced this issue Aug 28, 2024
This can be enabled with the unstable feature flag, `i_know_what_i_am_doing`.

Developers attempting to deploy blake3 via the Rust crate often encounter
`derive_key(&str, ...)` which forces the INFO parameter to be a valid UTF-8
string.

Please see the discussion in issue [13](BLAKE3-team#13),
in particular, this [comment](BLAKE3-team#13 (comment)).

The recommended course of action for those with non-UTF-8 INFO is to use `hash`
and `keyed_hash` and open code their own `derive_key` which takes an `&[u8]`.

This is not good for two reasons:  First, it is quickly seen that this forces
deviation from the Blake3 paper for how `derive_key` should be performed, as
`hash` doesn't let you set the `flags` field.  Attempting to use the underlying
`hash_all_at_once` fails because it's not exported.  Second, the developer is
now forced into the position of maintaining their own patches on top of the
blake3 repo.  This is a burden on the developer, and makes following blake3
upstream releases *much* less likely.

This patch proposes a reasonable compromise.  For developers who require `&[u8]`
INFO field, they can enable the rust feature flag `i_know_what_i_am_doing` to
expose the API `derive_key_u8ref`.  This enables developers to use upstream
blake3 directly, while still discouraging sloppy behavior in the default case.

Signed-off-by: Jason Cooper <jason@coldbeach.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants