Enhancement: b3sum add -c --check flag for verification #33

tERyceNzAchE · 2020-01-18T22:35:23Z

Adding -c, --check would make b3sum a drop in replacement for b2sum in some scripts.

oconnor663 · 2020-01-19T16:59:10Z

Agreed, this would be a good feature. It's not high on my personal priority list, because I don't use --check myself, but I'd be happy to review a PR.

phayes · 2020-01-21T19:09:29Z

I would be happy to create a PR for this.

But first I think I would like @oconnor663's decision on #24 since that will determine if the hex parsing is in b3sum or in blake3

oconnor663 · 2020-01-21T20:34:11Z

Apologies for dragging my feet on these. I'm hoping I'll get to them by the end of the week.

phayes · 2020-01-21T21:30:26Z

@oconnor663 , no worries! It's your project - take your time. :)

oconnor663 · 2020-01-23T19:38:10Z

I decided against #24, so let's go with putting the hex parsing in b3sum. (Or for what it's worth, we could do constant-time hex comparisons instead of parsing.)

phayes · 2020-01-23T23:06:58Z

OK great. I'll have this done in the next day or two.

oconnor663 · 2020-01-23T23:32:08Z

Exciting! I know the https://github.com/jtdowney/b2sum-rs project has a --check implementation, in case that helps. I'm not sure whether everyone does this the same way, or whether we're going to have to make choices.

phayes · 2020-01-26T01:57:25Z

A PR is ready for review here: #44

llowrey · 2020-05-02T20:56:55Z

Is there interest in completing this feature? It looks like there are a few unresolved issues with PR #44 and those issues appear to be more policy (requirements) related than technical/implementation/defects.

I am a heavy user of "b2sum --check" and would like to use b3sum, for performance reasons. In one trivial use case I am able to check a dir of files in 42s with b3sum that takes 5m37s with b2sum. That's a big difference but atm I have to do the check out-of-band which is neither convenient or reliable (in my case).

I'd like to help get this feature released but not sure how to contribute.

oconnor663 · 2020-05-02T22:02:44Z

I'm definitely interested in the abstract. It just hasn't been a priority of mine. There were some policy questions that came up in that PR, which I think it might be a good idea to for us to try to answer before we write much more code. Let me see if I can list them here:

Weird filepath support. Can filepaths contain e.g. newlines? Can they contain invalid UTF-8? "Do whatever Coreutils does here" could be a reasonable answer, but Coreutils actually does some very non-trivial things, and I'd want to make sure we really understand what those are.
Windows support. I don't know whether "do whatever Coreutils does" will work for us here. Can filepaths contain Windows path separators (\)? Should they do so by default on Windows? If so, is an output produced on Windows verifiable on Unix? What happens to paths from Unix systems that contain \ as a filename character? (Such paths aren't even representable on Windows as far as I know, so tools like Zip and Git have to use hacky escaping workarounds. I don't think we have to make everything magically work, but I'd like to have a good idea of what will work and what won't.)
Nonstandard hash parameters. Can we verify hashes that aren't 32 bytes? If so, should we auto-detect the hash length? Can we verify keyed hashes? My biggest concern here is that supporting short hashes could be a security problem.

For all of these things, I strongly lean towards "support as few things as possible now, such that we can choose to add features later as needed". Once we start encouraging people to start saving check files, it'll be painful to make any changes that break existing users.

oconnor663 · 2020-05-02T22:05:50Z

Maybe some of these questions can get swept under the rug as "we don't construct paths ourselves, we just get them from the command line"? I'm not totally sure. In Rust terms, command line args come to us as OsString, and there's no "unopinionated" conversion from OsString to String or Vec<u8>.

"Paths must be valid Unicode strings and may not contain newlines" could be a pretty good starting point, but even there I have a question: What does b3sum * do when the current directory contains a filename with a newline in it? Does it fail, or does it produce some output which --check cannot consume? If the latter, what error does --check report, and how does it detect this condition?

llowrey · 2020-05-03T01:42:36Z

Thank you for responding so quickly!

I suspect (hope) that the coreutils folks will adopt blake3 and create their own b3sum, with all the fancy rules they want. I liked you comment about aiming for a 90% solution at 10% the effort.

IMO, the contract must be that --check shall consume whatever b3sum produces as output. The output requirements should have priority and --check should do the best it can with whatever that is.

So...

Weird filepath support. b3sum must accept valid filepaths, for the OS it's running on. I haven't looked into what b3sum does now on output but I would expect it would backslash escape CR, LF and backslashes. I expect the output to be OsString since the output must be readable to the user. If we force output to UTF-8 the user won't be able to read it correctly if they're not using UTF-8.
Windows support. I believe it's the user's responsibility to deal with this. As you point out, a backslash could be present in a unix filename so the presence of a backslash cannot universally signal a windows path. This, to me, is part of the 90% solution approach.
Nonstandard hash parameters. The secret key is not output by b3sum (good thing) and therefore will not be available to --check. The consequence of this is that the input to --check will not have keys therefore either no hash uses a secret key or all use the same key. The tricky part is that b3sum must consume hashes via stdin which is where the secret key comes from. I'm inclined to require the user to prepend the hash stream with the secret key raw bytes. Not trivial, but possible. So, when --check is invoked with --keyed, the first 32 bytes are assumed to be the secret key and the remainder of the input is the hash stream. The secret key would be applied to all hashes. Variable lengths would need to be supported. I understand the security concern but, imo, --check should consume whatever b3sum produces. I'd leave it to the user not to harm themselves.

Re hash length. I would prefer that the hash length be part of the hash such that a 16 byte hash would produce a different value from the first 16 bytes of a 32 byte hash. IOW, truncating the hash to 1 byte would not produce the same hash as when computing a 1 byte hash. This would address your security concern, which exists irrespective of whether --check is doing the verification or users are doing that verification out-of-band.

Would it help if I documented coreutils' behavior wrt filepathing?

oconnor663 · 2020-05-03T17:58:57Z

Ok I understand the md5sum (and other Coreutils [HASH]sum utilities) behavior better now, based on this StackExchange question and the docs that it links to:

Without --zero, if file contains a backslash or newline, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names.

For example, a filename with a newline in the middle looks like this:

\d41d8cd98f00b204e9800998ecf8427e  abc\ndef

And a filename with a backslash in the middle looks like this:

\d41d8cd98f00b204e9800998ecf8427e  abc\\def

Every other character (or I should say every other byte, since invalid Unicode is allowed) is copied literally to the output.

A subtle point here is that this is backwards-compatible with older versions of md5sum that did not do this escaping. That means an old checkfile with a literal, unescaped backslash in one of the filenames will still work today, even if it's coincidentally followed by an n character, because unescaping is only done when the line starts with a backslash. (Though I assume unescaped newlines are still hopeless.)

Another subtle point here is that, if --zero isn't used, it's possible to have a null character in the middle of a filepath. I don't think it's possible to make md5sum emit such a filepath, but it's certainly possible to insert a null character in the output manually, and then we have to ask what happens if md5sum consumes that doctored output. As far as I can tell, the answer is that it ignores all characters after the null, probably because it passes the string directly to libc, which interprets the interior null as terminating. I think we should error out in this case, by explicitly checking parsed strings for interior null, rather than reproducing this behavior.

Summing all that up, I think "do exactly what md5sum does, but be more aggressive about panicking on interior nulls" could be a good policy on Unix. However, I think it's going to give us trouble on Windows.

The first problem is that Windows filepaths aren't "arbitrary bytes", but rather "arbitrary byte pairs" (that is, arbitrary 16-bit characters). If we converted byte pairs to bytes in the naive, literal way, then "abc" on Unix would correspond to "a\0b\0c\0" on Windows. That wouldn't be acceptable; even in the most trivial cases, check files wouldn't be portable between Unix and Windows. (And I'm not even going to guess at how a Windows terminal would display those bytes to the user.)

The more reasonable strategy is to parse Windows paths as UTF-16 and convert them to UTF-8, however this runs into the second problem: Windows paths aren't guaranteed to be well-formed UTF-16. They might contain unpaired "surrogate characters" which can't be represented in well-formed UTF-8. This is why "WTF-8" was defined, and it's essentially the entire reason the OsString type exists in Rust. (If the whole world was Unix, we could probably just use &[u8] and Vec<u8> everywhere we use OsStr and OsString today.)

It's tempting to say "ok, just use WTF-8 in the output," but the spec is quite adamant about this:

WTF-8 must not be used to represent text in a file format or for transmission over the Internet.

And actually, "just use WTF-8" wouldn't even solve the problem, because while it can represent all Windows filepaths, it can't represent all Unix filepaths.

Given all that, here's what I propose:

All b3sum output is valid Unicode, always. We can leave open the possibility that we could detect some system/terminal encoding other than UTF-8, and convert to that as a final step, but for now this means all output is UTF-8.
Paths are received from the command line as OsString and opened in that form, so that hashing any file works. However, before emitting any output, paths are converted with .to_string_lossy(), replacing non-Unicode sequences with the � character.
When Unicode conversion is lossy, an additional line is emitted, in order to prevent b3sum from consuming the result as a valid check file. This could be anything that doesn't parse, but I recommend a comment line like: # Warning: Invalid Unicode. Notably, this line should be emitted to stdout (not stderr), so that it's preserved when output is redirected to a file.
- Edit: Maybe it would be simpler and cleaner to just bail if we see any � characters in the filepath. It's teeeeechnically possible for a valid Unicode filepath to contain such characters, but it must be super rare in any context where we really care about integrity, and we could avoid the messy "generate unparseable output" hack. Furthermore, the same rationale that makes us want to fail the check if we saw some invalid Unicode (the possibility that two distinct filepaths might get conflated) might equally apply if some other tool saw invalid Unicode before it got to us.
If --zero is not used, md5sum-like escaping is applied. All backslash and (Unix) newline characters are replaced with the escape sequences \\ and \n, and any line containing those escape sequences is prefixed with a \ character.
When consuming a check file, b3sum asserts that every line begins with a valid hash, with an optional \ character. (This is the part that will fail on comment lines, without requiring any comment-specific logic.) De-escaping is only done if the \ character is present. If de-escaping is done, any backslash sequence other than \\ and \n is an error. In addition, b3sum asserts that parsed filenames don't contain any interior null characters.
For now, --check is incompatible with any flags that modify the output, like --length or --keyed. Only unkeyed, 32-byte hashes can be checked. We can consider relaxing this in the future.

(Edit: More important bullet points below regarding Windows.)

oconnor663 · 2020-05-04T02:07:27Z

Major oversight above: We obviously shouldn't escape the backslashes in C:\foo\bar.txt on Windows. Does anyone know what md5sum does on Windows? (I see on Google that there is such a thing as "Coreutils on Windows", but I've never used it.) Maybe we can solve the problem with two additional rules:

After converting to Unicode strings, but before escaping, backslashes in Windows paths are normalized to forward slashes.
On Windows, b3sum --check reports an error if there are any backslashes in a filepath.

The goal there is that files with backslashes in their names can be hashed and checked on Unix, but attempting to check such a file on Windows is always an error. (The file cannot exist anyway, and we don't want the name fragments to unintentionally match existing directories.)

sneves · 2020-05-04T10:43:10Z

Backslashes are escaped on Windows:

C:\Users\John>md5sum.exe \\.\C:\Users\John\Desktop\b3sum_windows_x64_bin.exe
\1d54334dd99612a139fbb51736290e85 *\\\\.\\C:\\Users\\John\\Desktop\\b3sum_windows_x64_bin.exe

oconnor663 · 2020-05-04T16:05:55Z

Thanks, that makes sense. Looks like they don't have any special handling for Windows. I think we can do better, tentatively leaning towards the two rules I suggested above. Here's the scenario I'm imagining:

A Windows user clones this repo.
That user runs something like
```
cd BLAKE3 ; b3sum README.md Cargo.toml src\lib.rs
```
(note the backslash) and captures the output as a checkfile.
They send the checkfile to a Unix user.
The Unix user clones this repo too, and then runs b3sum --check on the checkfile.

I don't think we should go crazy with cross-platform compatibility, since non-Unicode paths tend to be totally unrepresentable on other platforms anyway. But "unzip an archive and hash the files in it" is the core use case of --check, and if all the filenames in there are well-behaved, it seems reasonable to make it work. If I understand correctly, it'll work today with md5sum if you go from Unix to Windows, but not if you go from Windows (with standard backslash-separated paths) to Unix.

Also, it would be nice if our default output on Windows wasn't super ugly :)

oconnor663 · 2020-05-04T16:13:15Z

By the way, in @sneves output above, I see a * character at the start of the path. What's that doing there? Is that some special feature of md5sum that I don't know about yet?

sneves · 2020-05-04T16:54:09Z

putchar (file_is_binary ? '*' : ' ');

Indicates whether the file is opened as fopen(name, "rb") or fopen(name, "r"). It should always be binary, since "r" mode will mess with newline characters and such across operating systems.

oconnor663 · 2020-05-04T17:29:12Z

Ok, yeah, I think it's safe to say b3sum will just always use binary mode, and never print the *.

sneves · 2020-05-04T19:18:58Z

To be clear, * indicates binary mode.

oconnor663 · 2020-05-04T19:52:23Z

Apparently it's "binary mode, if the difference is significant", which I assume means "only on Windows" in practice? Anyway, we'll always read raw bytes regardless of the OS, so I don't think we need to worry about it.

As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.

oconnor663 · 2020-05-04T22:57:34Z

Ooo, credit where credit is due, it looks like macOS does not allow invalid Unicode filenames!

As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.

oconnor663 · 2020-05-05T06:13:33Z

I'm working on this in the https://github.com/BLAKE3-team/BLAKE3/tree/check branch.

Looks like the Clap argument parsing library might have a bug on Windows that'll get in the way of testing this: clap-rs/clap#1905. I have a PR up to fix that, and I might like to get it released first.

Update: Clap v2.33.1 has been released with a fix.

oconnor663 · 2020-05-15T20:00:47Z

I've finished a --check implementation and pushed it to master. Its exact behavior is documented here: https://github.com/BLAKE3-team/BLAKE3/blob/master/b3sum/what_does_check_do.md. I'd really appreciate it if folks could test it out, especially folks on Windows. The intention is that producing a checkfile on Unix and checking it on Windows should work, and vice versa.

oconnor663 · 2020-05-17T19:22:59Z

Basically as soon as someone tells me that they've tried this and it seems to work for them, I'm gonna release it :)

llowrey · 2020-05-17T20:12:03Z

Linux
- Simple case, no paths: Verified detection of mismatch, returns 1. All good return 0. As expected.
- With path structure (rel and abs): a-ok
- Unicode text "Funky näme.txt" as expected in hash output and verify output
Windows
- Globbing does not work: eg b3sum *.rs return "b3sum: The filename, directory name, or volume label syntax is incorrect. (os error 123)"
- Simple case, no paths: Verified detection of mismatch, return 1. All good return 0. As Expected.
- With path structure (rel and abs): a-ok
- Noted that backslashes in path are converted to forward slash in output
- Unicode text "Funky näme.txt" in looks like "Funky n├ñme.txt" in the output
  - Check works as expected and outputs "Funky näme.txt: OK"
  - Verified hash output looks right when viewed as UTF-8
Confirmed b3sum hash from Linux verifies on Windows and hash from Windows verifies on Linux

This is awesome! Thank you very much. This is going to speed up several of my workflows substantially.

nice to have (from b2sum):
--quiet don't print OK for each successfully verified file

oconnor663 · 2020-05-17T20:40:36Z

Thanks for kicking the tires!

Windows: Globbing does not work

T is expected. Globbing is done by the shell on Unix, and my understanding is that the Windows shell has never supported it, and every different program does its own thing (or nothing). That said, I don't know which tools go through the trouble of supporting globbing on Windows. It seems like it would create an awkward situation involving filepaths with literal * characters in their names...

Windows: Unicode text "Funky näme.txt" in looks like "Funky n├ñme.txt" in the output

Is there any right thing to do here? Presumably the current codepage can't even represent most characters?

nice to have (from b2sum): --quiet

That does sound nice to have, and it should be very quick to implement. Let me take a shot at it.

Relatedly, I'm curious if anyone actually uses --zero.

llowrey · 2020-05-17T21:03:34Z

Thanks for kicking the tires!

Windows: Globbing does not work

T is expected. Globbing is done by the shell on Unix, and my understanding is that the Windows shell has never supported it, and every different program does its own thing (or nothing). That said, I don't know which tools go through the trouble of supporting globbing on Windows. It seems like it would create an awkward situation involving filepaths with literal * characters in their names...

Windows doesn't allow literal * or ? in filenames (or \, /,:,", <, > or | ). So, it should be safe to use the following when run on Windows (assuming it does what I think it does).

https://docs.rs/glob/0.3.0/glob/

Windows: Unicode text "Funky näme.txt" in looks like "Funky n├ñme.txt" in the output

Is there any right thing to do here? Presumably the current codepage can't even represent most characters?

I just noticed that the output of b3sum looks right unless it's directed to something that isn't console. I also verified that notepad and notepad++ show the saved output correctly so are detecting UTF-8. It's really only the type command that doesn't 'do the right thing'. So, I'm a-ok with the behavior as-is.

nice to have (from b2sum): --quiet

That does sound nice to have, and it should be very quick to implement. Let me take a shot at it.

Relatedly, I'm curious if anyone actually uses --zero.

There's probably a tool out there that execs b2sum and doesn't want to be bothered with parsing the output. I don't know what it might be but I don't see a need to implement this option unless someone shows up asking for it.

sneves · 2020-05-17T22:53:29Z

The MSVC C runtime is already perfectly able to handle globbing on its own, however this is not enabled by default. By linking against setargv.obj, or by defining

extern "C" _crt_argv_mode __CRTDECL _get_startup_argv_mode()
{
    return _crt_argv_expanded_arguments;
}

oneself, globbing becomes transparently enabled.

So with some special linker incantations, it could be possible to enable globbing on Windows without any extra code.

EDIT: But it isn't, because Rust ignores the argc and argv from the C runtime, and calls GetCommandLineW itself, along with all the parsing.

oconnor663 · 2020-05-19T15:59:34Z

Interesting. Reading through this thread, I found a link to the wild crate, which looks to be for exactly this problem. I'll try it out.

Edit: Success! The wild crate makes this a one-line change. 48512ec

@llowrey

Suggested by @llowrey: #33 (comment)

oconnor663 · 2020-05-23T19:14:31Z

Shipped as part of version 0.3.4! Globbing and --quiet support included.

vans163 · 2022-10-04T15:03:18Z

Not sure if this is a bug or a feature but it makes it unable to be a drop-in replacement for sha256sum.

using --check --quiet requires 2 spaces (\n\n) between the filename and hash otherwise we get Invalid space error.

root@node0:/tmp# b3sum --check --quiet <<< "0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 hi"
b3sum: Invalid space

root@node0:/tmp# b3sum --check --quiet <<< "0b8b60248fad7ac6dfac221b7e01a8b91c772421a15b387dd1fb2d6a94aee438 
 hi"

oconnor663 mentioned this issue May 15, 2020

Adding --check to b3sum #44

Closed

oconnor663 added a commit that referenced this issue May 23, 2020

add support for --quiet to b3sum --check

c9a1676

Suggested by @llowrey: #33 (comment)

oconnor663 closed this as completed May 23, 2020

Rudxain mentioned this issue Jul 12, 2022

Add --check support Rudxain/xorsum#6

Open

joveian mentioned this issue Jan 12, 2023

Warn on stderr when producing output that cannot be checked with --check #281

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: b3sum add -c --check flag for verification #33

Enhancement: b3sum add -c --check flag for verification #33

tERyceNzAchE commented Jan 18, 2020

oconnor663 commented Jan 19, 2020

phayes commented Jan 21, 2020 •

edited

Loading

oconnor663 commented Jan 21, 2020

phayes commented Jan 21, 2020

oconnor663 commented Jan 23, 2020

phayes commented Jan 23, 2020

oconnor663 commented Jan 23, 2020

phayes commented Jan 26, 2020 •

edited

Loading

llowrey commented May 2, 2020

oconnor663 commented May 2, 2020

oconnor663 commented May 2, 2020 •

edited

Loading

llowrey commented May 3, 2020

oconnor663 commented May 3, 2020 •

edited

Loading

oconnor663 commented May 4, 2020 •

edited

Loading

sneves commented May 4, 2020

oconnor663 commented May 4, 2020 •

edited

Loading

oconnor663 commented May 4, 2020

sneves commented May 4, 2020 •

edited

Loading

oconnor663 commented May 4, 2020

sneves commented May 4, 2020

oconnor663 commented May 4, 2020

oconnor663 commented May 4, 2020

oconnor663 commented May 5, 2020 •

edited

Loading

oconnor663 commented May 15, 2020

oconnor663 commented May 17, 2020

llowrey commented May 17, 2020

oconnor663 commented May 17, 2020

llowrey commented May 17, 2020

sneves commented May 17, 2020 •

edited

Loading

oconnor663 commented May 19, 2020 •

edited

Loading

oconnor663 commented May 23, 2020

vans163 commented Oct 4, 2022 •

edited

Loading

Enhancement: b3sum add -c --check flag for verification #33

Enhancement: b3sum add -c --check flag for verification #33

Comments

tERyceNzAchE commented Jan 18, 2020

oconnor663 commented Jan 19, 2020

phayes commented Jan 21, 2020 • edited Loading

oconnor663 commented Jan 21, 2020

phayes commented Jan 21, 2020

oconnor663 commented Jan 23, 2020

phayes commented Jan 23, 2020

oconnor663 commented Jan 23, 2020

phayes commented Jan 26, 2020 • edited Loading

llowrey commented May 2, 2020

oconnor663 commented May 2, 2020

oconnor663 commented May 2, 2020 • edited Loading

llowrey commented May 3, 2020

oconnor663 commented May 3, 2020 • edited Loading

oconnor663 commented May 4, 2020 • edited Loading

sneves commented May 4, 2020

oconnor663 commented May 4, 2020 • edited Loading

oconnor663 commented May 4, 2020

sneves commented May 4, 2020 • edited Loading

oconnor663 commented May 4, 2020

sneves commented May 4, 2020

oconnor663 commented May 4, 2020

oconnor663 commented May 4, 2020

oconnor663 commented May 5, 2020 • edited Loading

oconnor663 commented May 15, 2020

oconnor663 commented May 17, 2020

llowrey commented May 17, 2020

oconnor663 commented May 17, 2020

llowrey commented May 17, 2020

sneves commented May 17, 2020 • edited Loading

oconnor663 commented May 19, 2020 • edited Loading

oconnor663 commented May 23, 2020

vans163 commented Oct 4, 2022 • edited Loading

phayes commented Jan 21, 2020 •

edited

Loading

phayes commented Jan 26, 2020 •

edited

Loading

oconnor663 commented May 2, 2020 •

edited

Loading

oconnor663 commented May 3, 2020 •

edited

Loading

oconnor663 commented May 4, 2020 •

edited

Loading

oconnor663 commented May 4, 2020 •

edited

Loading

sneves commented May 4, 2020 •

edited

Loading

oconnor663 commented May 5, 2020 •

edited

Loading

sneves commented May 17, 2020 •

edited

Loading

oconnor663 commented May 19, 2020 •

edited

Loading

vans163 commented Oct 4, 2022 •

edited

Loading