b3sum: Implement recursive file hashing #170

Open · wants to merge 2 commits into master
Conversation

@daviessm commented May 7, 2021

Add an argument -r (--recurse) to recurse through any directories in the list of files and process all the contained files in a defined order.

@daviessm (Author) commented May 8, 2021

I've been thinking about my implementation of this. I'm a Rust newbie and I'd appreciate it if someone could review the code and check that my understanding is correct:

  • Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.
  • Each file is sent to the BLAKE3 hasher which also uses num_threads to process the file, so overall there could be num_threads * num_threads threads running.
  • The directory recursion is done within a single thread so all the files processed within a directory are guaranteed to be processed in the same order.
  • If there are multiple directories on the command line then each of them is recursed in a separate thread and the files within them will be intermingled; the output order will be predictable for each directory but the output will be mixed between directories.

If that's the case, is the order of the output this change gives correct for a --recurse option?

@daviessm changed the title from "Implement recursive file hashing" to "b3sum: Implement recursive file hashing" on May 8, 2021
if md.is_dir() && args.recurse() {
    let mut entries = fs::read_dir(path)?
        .map(|res| res.map(|e| e.path()))
        .collect::<Result<Vec<_>, io::Error>>()?;
Member
.collect::<Result<Vec<_>, io::Error>>()? very smooth but now I'm not sure I believe you when you say "I'm a Rust newbie" :)

Author
I can neither confirm nor deny this was copied from some documentation I looked up. 😁

@oconnor663 (Member)

Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.

This doesn't sound right to me. I think you're referring to the call to thread_pool.install in the main loop. But that call is only made once, and the closure inside of it does a serial for-loop over the args list.
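For illustration, here is a minimal sketch (assumed names and dependencies, not the actual b3sum source) of the pattern being described: the rayon pool is built and entered once, and a serial loop hashes each argument in order, while each individual file hash can still use all of the pool's threads internally.

```rust
use rayon::ThreadPoolBuilder;

fn main() -> std::io::Result<()> {
    let args: Vec<String> = std::env::args().skip(1).collect();
    // Build the pool once; entering it does not, by itself, parallelize anything.
    let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    pool.install(|| {
        // Serial for-loop over the arguments: output order matches argument order.
        for path in &args {
            let mut hasher = blake3::Hasher::new();
            // update_mmap_rayon (blake3 "mmap" + "rayon" features) can still use
            // the pool's worker threads internally for a single large file.
            hasher.update_mmap_rayon(path)?;
            println!("{}  {}", hasher.finalize(), path);
        }
        Ok(())
    })
}
```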

}
if args.no_names() {
    let md = metadata(path).unwrap();
    if md.is_dir() && args.recurse() {
Member
Have you thought about what will happen for directory entries that are symlinks in this case? It looks like if they're symlinks to directories, then they'll be opened as though they were files, which will probably cause an error. But in fixing this, we have to be careful not to infinitely loop on circular symlinks. The right thing to do here isn't clear to me, and we might want to look at what similar recursive tools do.

Author

No, good point, I hadn't considered that. My use-case is on Windows where symlinks are very unlikely to exist but I'll add another option --follow-symlinks to follow them, and investigate how to detect loops. IMHO it would make sense not to follow them by default but I don't mind doing the opposite and having a --no-follow-symlinks option.

@Rudxain (Contributor) commented Jul 18, 2022

I think we need something similar to JavaScript's WeakSet to track symlinks, thereby avoiding infinite loops.

Edit: found it
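A minimal sketch of that idea in Rust (a hypothetical helper, not from this PR): record the canonical path of every directory entered in a HashSet, so a circular symlink resolves to a directory that has already been seen and gets skipped.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Collect files recursively while refusing to re-enter any directory we've
// already visited, which breaks cycles created by circular symlinks.
fn walk(dir: &Path, visited: &mut HashSet<PathBuf>, files: &mut Vec<PathBuf>) -> io::Result<()> {
    let canonical = fs::canonicalize(dir)?;
    if !visited.insert(canonical) {
        return Ok(()); // already seen: symlink cycle or duplicate entry
    }
    let mut entries: Vec<PathBuf> = fs::read_dir(dir)?
        .map(|res| res.map(|e| e.path()))
        .collect::<Result<_, io::Error>>()?;
    entries.sort(); // keep the output order defined
    for entry in entries {
        if entry.is_dir() {
            walk(&entry, visited, files)?;
        } else {
            files.push(entry);
        }
    }
    Ok(())
}
```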

Collaborator

My use-case is on Windows where symlinks are very unlikely to exist

Directory junctions have been commonly used by Microsoft to ensure backward compatibility when OS upgrades changed the directory structure, e.g. the XP => Vista migration created %userprofile%/My Documents <<===>> %userprofile%/Documents.

@oconnor663 (Member) commented May 8, 2021

I was about to suggest that a Zsh glob could accomplish something similar, but lo and behold it didn't work when I tried it:

$ b3sum **/*
zsh: argument list too long: b3sum

So I immediately see some value in this feature! :) Could you say more about the use case that you're interested in using it for?

A high level thought: So far, b3sum hasn't put any effort into optimizing how it hashes multiple files. As I mentioned above, there's just a serial loop that goes one-by-one through each file you supplied on the command line. This is actually fine if your total runtime is going to be dominated by a few large files, because we'll be able to use all your cores for each one of those, and there's no extra benefit to doing them in parallel. But if you're going to be hashing thousands or millions of small files, which might not individually be able to make good use of all your cores, and where we might do a lot of waiting on IO for files and directories to open, then b3sum's current performance isn't good.

So this becomes something of an existential question: Do we want b3sum to be a highly optimized way of hashing a million small files? I'm not sure. If we did want to implement this, I think the next step would be to look at what ripgrep does using the walkdir and ignore crates for efficient directory traversals. (We'd also want to think carefully about the choices that ripgrep makes, like respecting .gitignore files. Note that traversing an ignored target/ dir is exactly why that Zsh glob above failed for me.)
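As a rough sketch of what that could look like (assuming walkdir as a dependency, which is only a suggestion here): walkdir gives a deterministic order via sort_by_file_name(), and with follow_links(false) it won't recurse into symlinked directories at all.

```rust
use std::path::PathBuf;
use walkdir::WalkDir;

// Yield every regular file under `root` in a stable, sorted order.
fn recursive_paths(root: &str) -> impl Iterator<Item = PathBuf> {
    WalkDir::new(root)
        .follow_links(false) // don't follow symlinks (walkdir checks for loops when following them)
        .sort_by_file_name() // deterministic traversal order
        .into_iter()
        .filter_map(Result::ok) // skip unreadable entries for this sketch
        .filter(|entry| entry.file_type().is_file())
        .map(|entry| entry.into_path())
}
```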

It could be that the answer here is "no". Maybe it could make sense to support --recursive, but not to put in a lot of work to make it fast. (Or to leave that work for later if the need arises.) I'm not sure. What do you think?

@daviessm (Author) commented May 8, 2021

Could you say more about the use case that you're interested in using it for?

I have a project to move directory trees of files of differing sizes from one server to another and to be able to cryptographically prove that the source matches the destination - i.e. that nothing was modified in the transfer. The tree might contain terabyte-size files or millions of tiny files. I'd want to use this along with #171 to show the digest of the whole directory tree.

You make good points about the parallelism. If the processing ran in parallel and each file's output were printed on completion, then my objective of comparing the whole set of files wouldn't be as simple, because the files might not be processed in the same order on each server. In that case the results would have to be stored in some kind of tree map and output at the end of the process - actually an idea I was already contemplating for #117. Would you be open to considering that?
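A minimal sketch of that tree-map idea (a hypothetical helper, assuming results arrive in arbitrary order): collecting into a BTreeMap keyed by path makes the final printout sorted identically on both servers, regardless of the order in which the hashes finished.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;

// Print results sorted by path, independent of the order they were computed in.
fn print_sorted(results: Vec<(PathBuf, blake3::Hash)>) {
    let sorted: BTreeMap<PathBuf, blake3::Hash> = results.into_iter().collect();
    for (path, hash) in &sorted {
        println!("{}  {}", hash, path.display());
    }
}
```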

@daviessm (Author) commented May 8, 2021

Prior to this change, all files passed on the command line are processed in separate threads up to num_threads; therefore the order of the output hashes is undefined.

This doesn't sound right to me. I think you're referring to the call to thread_pool.install in the main loop. But that call is only made once, and the closure inside of it does a serial for-loop over the args list.

Ok, that's my lack of Rust knowledge coming through. Is the thread_pool simply initialised there for use in the main hashing?

@daviessm (Author) commented May 9, 2021

I think the next step would be to look at what ripgrep does using the walkdir and ignore crates for efficient directory traversals. (We'd also want to think carefully about the choices that ripgrep makes, like respecting .gitignore files. Note that traversing an ignored target/ dir is exactly why that Zsh glob above failed for me.)

I've had a quick look at those crates and they seem like they might fit this project. Where do you envisage the overheads being? Do we want to look at using a producer/consumer pattern with (at least) two threads: one to find the files and put them in a queue; and the other to process them off the queue as quickly as possible? I've done that a few times in Java for similar projects and it works well. Perhaps using flume?
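A rough sketch of what that producer/consumer split might look like (assuming walkdir and flume as dependencies; none of this exists in b3sum): one thread walks the tree and queues paths, and worker threads hash them off the queue. Output order here would still be nondeterministic, so for comparing two servers the results would need to be collected and sorted as discussed above.

```rust
use std::path::PathBuf;
use std::thread;

fn hash_tree(root: PathBuf, workers: usize) {
    let (tx, rx) = flume::unbounded::<PathBuf>();

    // Producer: find the files and put them in the queue.
    let walker = thread::spawn(move || {
        for entry in walkdir::WalkDir::new(root)
            .sort_by_file_name()
            .into_iter()
            .filter_map(Result::ok)
            .filter(|e| e.file_type().is_file())
        {
            tx.send(entry.into_path()).unwrap();
        }
        // Dropping tx closes the channel, which ends the consumers' loops.
    });

    // Consumers: take paths off the queue and hash them as quickly as possible.
    let consumers: Vec<_> = (0..workers)
        .map(|_| {
            let rx = rx.clone(); // flume receivers can be cloned (multi-consumer)
            thread::spawn(move || {
                for path in rx.iter() {
                    let mut hasher = blake3::Hasher::new();
                    // update_mmap requires the blake3 "mmap" feature.
                    if hasher.update_mmap(&path).is_ok() {
                        println!("{}  {}", hasher.finalize(), path.display());
                    }
                }
            })
        })
        .collect();

    walker.join().unwrap();
    for consumer in consumers {
        consumer.join().unwrap();
    }
}
```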

@daviessm (Author) commented May 9, 2021

Do we want to look at using a producer/consumer pattern with (at least) two threads: one to find the files and put them in a queue; and the other to process them off the queue as quickly as possible? I've done that a few times in Java for similar projects and it works well. Perhaps using flume?

I had a go at doing this and gave up because it's way beyond my current knowledge of Rust so would take me a few months to implement. I have to stick to synchronous code for now!

@xczh commented Jul 26, 2023

A very practical feature. Is there any progress now?

@polarathene
Is there any progress now?

If there was, you'd see it above in the activity/discussion.

The associated issue provides a workaround CLI command to use find to grab the files, sort them (by checksum hash) and create a b3sum of that output.

The associated issue also had a comment after that recently, linking to paq, which provides a crate your project can use, or a binary you can run directly, to accomplish the same goal.
