b3sum: Implement recursive file hashing #170

Open · wants to merge 2 commits into master
Conversation

@daviessm commented May 7, 2021

Add an argument -r (--recurse) to recurse through any directories in the list of files and process all the contained files in a defined order.

@daviessm (Author) commented May 8, 2021

I've been thinking about my implementation of this. I'm a Rust newbie and I'd appreciate it if someone could review the code and check that my understanding is correct:

  • Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.
  • Each file is sent to the BLAKE3 hasher which also uses num_threads to process the file, so overall there could be num_threads * num_threads threads running.
  • The directory recursion is done within a single thread so all the files processed within a directory are guaranteed to be processed in the same order.
  • If there are multiple directories on the command line then each of them is recursed in a separate thread and the files within them will be intermingled; the output order will be predictable for each directory but the output will be mixed between directories.

If that's the case, is the order of the output this change gives correct for a --recurse option?

@daviessm changed the title from "Implement recursive file hashing" to "b3sum: Implement recursive file hashing" on May 8, 2021
if md.is_dir() && args.recurse() {
    let mut entries = fs::read_dir(path)?
        .map(|res| res.map(|e| e.path()))
        .collect::<Result<Vec<_>, io::Error>>()?;
Member
.collect::<Result<Vec<_>, io::Error>>()? very smooth but now I'm not sure I believe you when you say "I'm a Rust newbie" :)

Author
I can neither confirm nor deny this was copied from some documentation I looked up. 😁

@oconnor663 (Member)

Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.

This doesn't sound right to me. I think you're referring to the call to thread_pool.install in the main loop. But that call is only made once, and the closure inside of it does a serial for-loop over the args list.
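For illustration, here is a minimal sketch (assumed names and dependencies, not the actual b3sum source) of the pattern being described: the rayon pool is built and entered once, and a serial loop hashes each argument in order, while each individual file hash can still use all of the pool's threads internally.

```rust
use rayon::ThreadPoolBuilder;

fn main() -> std::io::Result<()> {
    let args: Vec<String> = std::env::args().skip(1).collect();
    // Build the pool once; entering it does not, by itself, parallelize anything.
    let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    pool.install(|| {
        // Serial for-loop over the arguments: output order matches argument order.
        for path in &args {
            let mut hasher = blake3::Hasher::new();
            // update_mmap_rayon (blake3 "mmap" + "rayon" features) can still use
            // the pool's worker threads internally for a single large file.
            hasher.update_mmap_rayon(path)?;
            println!("{}  {}", hasher.finalize(), path);
        }
        Ok(())
    })
}
```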

}
if args.no_names() {
    let md = metadata(path).unwrap();
    if md.is_dir() && args.recurse() {
Member
Have you thought about what will happen for directory entries that are symlinks in this case? It looks like if they're symlinks to directories, then they'll be opened as though they were files, which will probably cause an error. But in fixing this, we have to be careful not to infinitely loop on circular symlinks. The right thing to do here isn't clear to me, and we might want to look at what similar recursive tools do.

Author

No, good point, I hadn't considered that. My use-case is on Windows where symlinks are very unlikely to exist but I'll add another option --follow-symlinks to follow them, and investigate how to detect loops. IMHO it would make sense not to follow them by default but I don't mind doing the opposite and having a --no-follow-symlinks option.

@Rudxain (Contributor) commented Jul 18, 2022

I think we need something similar to JavaScript's WeakSet to track symlinks, thereby avoiding infinite loops.

Edit: found it
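A minimal sketch of that idea in Rust (a hypothetical helper, not from this PR): record the canonical path of every directory entered in a HashSet, so a circular symlink resolves to a directory that has already been seen and gets skipped.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Collect files recursively while refusing to re-enter any directory we've
// already visited, which breaks cycles created by circular symlinks.
fn walk(dir: &Path, visited: &mut HashSet<PathBuf>, files: &mut Vec<PathBuf>) -> io::Result<()> {
    let canonical = fs::canonicalize(dir)?;
    if !visited.insert(canonical) {
        return Ok(()); // already seen: symlink cycle or duplicate entry
    }
    let mut entries: Vec<PathBuf> = fs::read_dir(dir)?
        .map(|res| res.map(|e| e.path()))
        .collect::<Result<_, io::Error>>()?;
    entries.sort(); // keep the output order defined
    for entry in entries {
        if entry.is_dir() {
            walk(&entry, visited, files)?;
        } else {
            files.push(entry);
        }
    }
    Ok(())
}
```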

Collaborator

My use-case is on Windows where symlinks are very unlikely to exist

Directory junctions have been commonly used by Microsoft to ensure backward compatibility when OS upgrades changed the directory structure, e.g. the XP => Vista migration created %userprofile%/My Documents <<===>> %userprofile%/Documents.

@oconnor663 (Member) commented May 8, 2021

I was about to suggest that a Zsh glob could accomplish something similar, but lo and behold it didn't work when I tried it:

$ b3sum **/*
zsh: argument list too long: b3sum

So I immediately see some value in this feature! :) Could you say more about the use case that you're interested in using it for?

A high level thought: So far, b3sum hasn't put any effort into optimizing how it hashes multiple files. As I mentioned above, there's just a serial loop that goes one-by-one through each file you supplied on the command line. This is actually fine if your total runtime is going to be dominated by a few large files, because we'll be able to use all your cores for each one of those, and there's no extra benefit to doing them in parallel. But if you're going to be hashing thousands or millions of small files, which might not individually be able to make good use of all your cores, and where we might do a lot of waiting on IO for files and directories to open, then b3sum's current performance isn't good.

So this becomes something of an existential question: Do we want b3sum to be a highly optimized way of hashing a million small files? I'm not sure. If we did want to implement this, I think the next step would be to look at what ripgrep does using the walkdir and ignore crates for efficient directory traversals. (We'd also want to think carefully about the choices that ripgrep makes, like respecting .gitignore files. Note that traversing an ignored target/ dir is exactly why that Zsh glob above failed for me.)
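As a rough sketch of what that could look like (assuming walkdir as a dependency, which is only a suggestion here): walkdir gives a deterministic order via sort_by_file_name(), and with follow_links(false) it won't recurse into symlinked directories at all.

```rust
use std::path::PathBuf;
use walkdir::WalkDir;

// Yield every regular file under `root` in a stable, sorted order.
fn recursive_paths(root: &str) -> impl Iterator<Item = PathBuf> {
    WalkDir::new(root)
        .follow_links(false) // don't follow symlinks (walkdir checks for loops when following them)
        .sort_by_file_name() // deterministic traversal order
        .into_iter()
        .filter_map(Result::ok) // skip unreadable entries for this sketch
        .filter(|entry| entry.file_type().is_file())
        .map(|entry| entry.into_path())
}
```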

It could be that the answer here is "no". Maybe it could make sense to support --recursive, but not to put in a lot of work to make it fast. (Or to leave that work for later if the need arises.) I'm not sure. What do you think?

@daviessm (Author) commented May 8, 2021

Could you say more about the use case that you're interested in using it for?

I have a project to move directory trees of files of differing sizes from one server to another and to be able to cryptographically prove that the source matches the destination - i.e. that nothing was modified in the transfer. The tree might contain terabyte-size files or millions of tiny files. I'd want to use this along with #171 to show the digest of the whole directory tree.

You make good points about the parallelism. If the processing ran in parallel and each file's output were printed on completion, then my objective of comparing the whole set of files wouldn't be as simple, because the files might not be processed in the same order on each server. In that case the results would have to be stored in some kind of tree map and output at the end of the process - actually an idea I was already contemplating for #117. Would you be open to considering that?
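A minimal sketch of that tree-map idea (a hypothetical helper, assuming results arrive in arbitrary order): collecting into a BTreeMap keyed by path makes the final printout sorted identically on both servers, regardless of the order in which the hashes finished.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

// Print results sorted by path, independent of the order they were computed in.
fn print_sorted(results: Vec<(PathBuf, blake3::Hash)>) {
    let sorted: BTreeMap<PathBuf, blake3::Hash> = results.into_iter().collect();
    for (path, hash) in &sorted {
        println!("{}  {}", hash, path.display());
    }
}
```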

@daviessm (Author) commented May 8, 2021

Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.

This doesn't sound right to me. I think you're referring to the call to thread_pool.install in the main loop. But that call is only made once, and the closure inside of it does a serial for-loop over the args list.

Ok, that's my lack of Rust knowledge coming through. Is the thread_pool simply initialised there for use in the main hashing?

@daviessm (Author) commented May 9, 2021

I think the next step would be to look at what ripgrep does using the walkdir and ignore crates for efficient directory traversals. (We'd also want to think carefully about the choices that ripgrep makes, like respecting .gitignore files. Note that traversing an ignored target/ dir is exactly why that Zsh glob above failed for me.)

I've had a quick look at those crates and they seem like they might fit this project. Where do you envisage the overheads being? Do we want to look at using a producer/consumer pattern with (at least) two threads: one to find the files and put them in a queue; and the other to process them off the queue as quickly as possible? I've done that a few times in Java for similar projects and it works well. Perhaps using flume?
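A rough sketch of what that producer/consumer split might look like (assuming walkdir and flume as dependencies; none of this exists in b3sum): one thread walks the tree and queues paths, and worker threads hash them off the queue. Output order here would still be nondeterministic, so for comparing two servers the results would need to be collected and sorted as discussed above.

```rust
use std::path::PathBuf;
use std::thread;

fn hash_tree(root: PathBuf, workers: usize) {
    let (tx, rx) = flume::unbounded::<PathBuf>();

    // Producer: find the files and put them in the queue.
    let walker = thread::spawn(move || {
        for entry in walkdir::WalkDir::new(root)
            .sort_by_file_name()
            .into_iter()
            .filter_map(Result::ok)
            .filter(|e| e.file_type().is_file())
        {
            tx.send(entry.into_path()).unwrap();
        }
        // Dropping tx closes the channel, which ends the consumers' loops.
    });

    // Consumers: take paths off the queue and hash them as quickly as possible.
    let consumers: Vec<_> = (0..workers)
        .map(|_| {
            let rx = rx.clone(); // flume receivers can be cloned (multi-consumer)
            thread::spawn(move || {
                for path in rx.iter() {
                    let mut hasher = blake3::Hasher::new();
                    // update_mmap requires the blake3 "mmap" feature.
                    if hasher.update_mmap(&path).is_ok() {
                        println!("{}  {}", hasher.finalize(), path.display());
                    }
                }
            })
        })
        .collect();

    walker.join().unwrap();
    for consumer in consumers {
        consumer.join().unwrap();
    }
}
```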

@daviessm (Author) commented May 9, 2021

Do we want to look at using a producer/consumer pattern with (at least) two threads: one to find the files and put them in a queue; and the other to process them off the queue as quickly as possible? I've done that a few times in Java for similar projects and it works well. Perhaps using flume?

I had a go at doing this and gave up because it's way beyond my current knowledge of Rust so would take me a few months to implement. I have to stick to synchronous code for now!

@xczh commented Jul 26, 2023

A very practical feature. Is there any progress now?

@polarathene
Is there any progress now?

If there was, you'd see it above in the activity/discussion.

The associated issue provides a workaround CLI command to use find to grab the files, sort them (by checksum hash) and create a b3sum of that output.

The associated issue also had a comment after that recently, linking to paq, which provides a crate your project can use, or a binary you can run directly, to accomplish the same goal.
