New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support to consolidate ok/wrt files. #2933
Conversation
This pull request has been linked to Shortcut Story #14827: Support consolidating |
571c022
to
a4c0433
Compare
This adds support for consolidating/vacuuming ok/wrt files. To speed up opening arrays with many fragments, a user can now consolidate the commit files into one. --- TYPE: IMPROVEMENT DESC: Adding support to consolidate ok/wrt files.
a4c0433
to
2b60e5c
Compare
tiledb/sm/array/array_directory.h
Outdated
@@ -45,6 +45,13 @@ using namespace tiledb::common; | |||
namespace tiledb { | |||
namespace sm { | |||
|
|||
enum class ArrayDirectoryMode { | |||
DEFAULT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean "load all"? If so, shall we remain to LOAD_ALL
or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not. In some cases, for example for consolidated commits vacuuming, default might skip work that's not required.
URI latest_fragment_meta_uri_v12_or_higher; | ||
|
||
// Load (in parallel) the root directory data | ||
if (!only_schemas_) | ||
if (mode_ != ArrayDirectoryMode::SCHEMA_ONLY) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better clarity now that we have multiple modes, is it possible to make cases based on the mode that is true rather than the mode that is not true? Can we also break this into different functions, e.g., load_schema_only
, load_commits
, load_<mode>
, etc? I understand that there may be a bit of overlap, but it will be easier to read and maintain down the road, especially if we add more modes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unable to get to this one for now, but added some more comments to clarify. Maybe we can file a follow up?
if (stdx::string::ends_with( | ||
uri.to_string(), constants::ignore_file_suffix)) { | ||
uint64_t size = 0; | ||
RETURN_NOT_OK_TUPLE(vfs_->file_size(uri, &size), nullopt, nullopt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make a note to eliminate this once we implement VFS::ls_with_sizes
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
for (auto& uri : uris_set) { | ||
uris.emplace_back(uri); | ||
} | ||
std::sort(uris.begin(), uris.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this unnecessary? I think the set iterator already visits the URIs in sorted order. If not, perhaps making a sorted map passing the URI comparator is a better idea here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. The unordered_set gets used in multiple functions later. And changing it to a set makes the count function logarithmic instead of constant time. So this might be better?
tiledb/sm/array/array_directory.h
Outdated
SCHEMA_ONLY, | ||
COMMITS, | ||
VACUUM_FRAGMENTS | ||
DEFAULT, // Default mode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this READ
? Does default mean load all URIs? (can be addressed in a future PR)
406a09b
to
ea87597
Compare
ctx_, vfs_, v11_arrays_dir.c_str(), SPARSE_ARRAY_NAME) == TILEDB_OK); | ||
|
||
// Write v11 fragment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KiterLuc double-checking: is this function writing a v11 fragment or writing a new fragment at the current version, to the v11 array copied above? cc @bekadavis9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ihnorton This will write at the current version, then we upgrade the array to the current version and write one more fragment...
This adds support for consolidating/vacuuming ok/wrt files. To speed up
opening arrays with many fragments, a user can now consolidate the
commit files into one.
TYPE: IMPROVEMENT
DESC: Adding support to consolidate ok/wrt files.