-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement: b3sum add -c --check flag for verification #33
Comments
Agreed, this would be a good feature. It's not high on my personal priority list, because I don't use |
I would be happy to create a PR for this. But first I think I would like @oconnor663's decision on #24 since that will determine if the hex parsing is in |
Apologies for dragging my feet on these. I'm hoping I'll get to them by the end of the week. |
@oconnor663 , no worries! It's your project - take your time. :) |
I decided against #24, so let's go with putting the hex parsing in |
OK great. I'll have this done in the next day or two. |
Exciting! I know the https://github.com/jtdowney/b2sum-rs project has a |
A PR is ready for review here: #44 |
Is there interest in completing this feature? It looks like there are a few unresolved issues with PR #44 and those issues appear to be more policy (requirements) related than technical/implementation/defects. I am a heavy user of "b2sum --check" and would like to use b3sum, for performance reasons. In one trivial use case I am able to check a dir of files in 42s with b3sum that takes 5m37s with b2sum. That's a big difference but atm I have to do the check out-of-band which is neither convenient or reliable (in my case). I'd like to help get this feature released but not sure how to contribute. |
I'm definitely interested in the abstract. It just hasn't been a priority of mine. There were some policy questions that came up in that PR, which I think it might be a good idea to for us to try to answer before we write much more code. Let me see if I can list them here:
For all of these things, I strongly lean towards "support as few things as possible now, such that we can choose to add features later as needed". Once we start encouraging people to start saving check files, it'll be painful to make any changes that break existing users. |
Maybe some of these questions can get swept under the rug as "we don't construct paths ourselves, we just get them from the command line"? I'm not totally sure. In Rust terms, command line args come to us as "Paths must be valid Unicode strings and may not contain newlines" could be a pretty good starting point, but even there I have a question: What does |
Thank you for responding so quickly! I suspect (hope) that the coreutils folks will adopt blake3 and create their own IMO, the contract must be that So...
Re hash length. I would prefer that the hash length be part of the hash such that a 16 byte hash would produce a different value from the first 16 bytes of a 32 byte hash. IOW, truncating the hash to 1 byte would not produce the same hash as when computing a 1 byte hash. This would address your security concern, which exists irrespective of whether Would it help if I documented coreutils' behavior wrt filepathing? |
Ok I understand the
For example, a filename with a newline in the middle looks like this:
And a filename with a backslash in the middle looks like this:
Every other character (or I should say every other byte, since invalid Unicode is allowed) is copied literally to the output. A subtle point here is that this is backwards-compatible with older versions of Another subtle point here is that, if Summing all that up, I think "do exactly what The first problem is that Windows filepaths aren't "arbitrary bytes", but rather "arbitrary byte pairs" (that is, arbitrary 16-bit characters). If we converted byte pairs to bytes in the naive, literal way, then The more reasonable strategy is to parse Windows paths as UTF-16 and convert them to UTF-8, however this runs into the second problem: Windows paths aren't guaranteed to be well-formed UTF-16. They might contain unpaired "surrogate characters" which can't be represented in well-formed UTF-8. This is why "WTF-8" was defined, and it's essentially the entire reason the It's tempting to say "ok, just use WTF-8 in the output," but the spec is quite adamant about this:
And actually, "just use WTF-8" wouldn't even solve the problem, because while it can represent all Windows filepaths, it can't represent all Unix filepaths. Given all that, here's what I propose:
(Edit: More important bullet points below regarding Windows.) |
Major oversight above: We obviously shouldn't escape the backslashes in
The goal there is that files with backslashes in their names can be hashed and checked on Unix, but attempting to check such a file on Windows is always an error. (The file cannot exist anyway, and we don't want the name fragments to unintentionally match existing directories.) |
Backslashes are escaped on Windows:
|
Thanks, that makes sense. Looks like they don't have any special handling for Windows. I think we can do better, tentatively leaning towards the two rules I suggested above. Here's the scenario I'm imagining:
I don't think we should go crazy with cross-platform compatibility, since non-Unicode paths tend to be totally unrepresentable on other platforms anyway. But "unzip an archive and hash the files in it" is the core use case of Also, it would be nice if our default output on Windows wasn't super ugly :) |
By the way, in @sneves output above, I see a |
Indicates whether the file is opened as |
Ok, yeah, I think it's safe to say |
To be clear, |
Apparently it's "binary mode, if the difference is significant", which I assume means "only on Windows" in practice? Anyway, we'll always read raw bytes regardless of the OS, so I don't think we need to worry about it. |
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
Ooo, credit where credit is due, it looks like macOS does not allow invalid Unicode filenames! |
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
As proposed in #33 (comment) This brings b3sum behavior close to md5sum. All occurrences of backslash are replaced with "\\", and all occurrences of (Unix) newline are replaced with "\n". In addition, any line containing these escapes has a single "\" prepended to the front. Filepaths were already being converted to UTF-8 with to_string_lossy(), but this commit adds an extra warning when that conversion is in fact lossy (because the path is not valid Unicode). This new warning is printed to stdout, with the goal of deliberately breaking --check (which is not yet implemented) in this case.
I'm working on this in the https://github.com/BLAKE3-team/BLAKE3/tree/check branch. Looks like the Clap argument parsing library might have a bug on Windows that'll get in the way of testing this: clap-rs/clap#1905. I have a PR up to fix that, and I might like to get it released first. Update: Clap v2.33.1 has been released with a fix. |
I've finished a |
Basically as soon as someone tells me that they've tried this and it seems to work for them, I'm gonna release it :) |
This is awesome! Thank you very much. This is going to speed up several of my workflows substantially. nice to have (from b2sum): |
Thanks for kicking the tires!
T is expected. Globbing is done by the shell on Unix, and my understanding is that the Windows shell has never supported it, and every different program does its own thing (or nothing). That said, I don't know which tools go through the trouble of supporting globbing on Windows. It seems like it would create an awkward situation involving filepaths with literal
Is there any right thing to do here? Presumably the current codepage can't even represent most characters?
That does sound nice to have, and it should be very quick to implement. Let me take a shot at it. Relatedly, I'm curious if anyone actually uses |
Windows doesn't allow literal https://docs.rs/glob/0.3.0/glob/
I just noticed that the output of
There's probably a tool out there that execs |
The MSVC C runtime is already perfectly able to handle globbing on its own, however this is not enabled by default. By linking against
oneself, globbing becomes transparently enabled. So with some special linker incantations, it could be possible to enable globbing on Windows without any extra code. EDIT: But it isn't, because Rust ignores the |
Interesting. Reading through this thread, I found a link to the Edit: Success! The |
Suggested by @llowrey: #33 (comment)
Shipped as part of version 0.3.4! Globbing and |
Not sure if this is a bug or a feature but it makes it unable to be a drop-in replacement for sha256sum. using
|
Adding -c, --check would make b3sum a drop in replacement for b2sum in some scripts.
The text was updated successfully, but these errors were encountered: