Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the number of requests in dir_size #3382

Merged
merged 2 commits into from Aug 1, 2022

Conversation

robertbindar
Copy link
Contributor

Extend ls_with_sizes to get is_dir/is_file metadata info within the same request and use the extra info to get rid of the
recursive vfs::is_file and vfs::file_size requests.

Note for reviewers:

  • We can further optimize this function if we get rid of the initial is_dir sanity check.
  • If we make the assumption that dir_size(file) = file_size(file), which is similar to what du does on unix, we can get rid of the extra request which we pay for all dir_size calls.
  • This will result in a slight change of dir_size APIs behavior, so it's not obvious to me if it's worth chasing this or not.

TYPE: IMPROVEMENT
DESC: Reduce the number of requests in dir_size

@shortcut-integration
Copy link

This pull request has been linked to Shortcut Story #18944: Size for arrays is often N/A.

@@ -75,9 +75,10 @@ class directory_entry {
* @param p The path of the entry
* @param size The size of the filesystem entry
*/
directory_entry(const std::string& p, uintmax_t size)
directory_entry(const std::string& p, uintmax_t size, bool is_directory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the new parameter to the doc.

@anastasop
Copy link

Tested this with the arrays mentioned in the SC card. For almost all of them there is a speedup of 3

@robertbindar
Copy link
Contributor Author

Spyros is reporting good numbers, but still, it's minutes for a dir_size op. The S3 v1 client can get the dir size in 90sec for an array, whilst VFS gets the same size in 9 mins (my implementation here), so I guess there is still room for improvement.

I've researched this a bit for S3 (with a bit of luck we might be able to generalize it to other backends) and it seems it's possible to get the dir_size with a single ls_with_sizes(prefix, delimiter="") request, which should give us all the objects under a prefix, no squashing based on \. @Shelnutt2 is this somehing worth pursuing?

@robertbindar robertbindar force-pushed the rbin/ch18944/reduce_req_no_dir_size branch from 7df437a to 3ca4a22 Compare August 1, 2022 11:28
@ihnorton ihnorton merged commit 767fe80 into dev Aug 1, 2022
@ihnorton ihnorton deleted the rbin/ch18944/reduce_req_no_dir_size branch August 1, 2022 15:21
github-actions bot pushed a commit that referenced this pull request Aug 1, 2022
* Extend ls_with_sizes to retrieve is_directory metadata

Optimize no of requests for dir_size

* address comment from @-KiterLuc, fix hdfs failing build
ihnorton pushed a commit that referenced this pull request Aug 1, 2022
* Extend ls_with_sizes to retrieve is_directory metadata

Optimize no of requests for dir_size

* address comment from @-KiterLuc, fix hdfs failing build

Co-authored-by: Robert Bindar <robertbindar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants