Adding support to consolidate ok/wrt files. #2933

Merged
KiterLuc merged 4 commits into dev from lr/wrt-consolidation/ch14827 on Mar 9, 2022

Conversation

Contributor

@KiterLuc KiterLuc commented Mar 3, 2022

This adds support for consolidating/vacuuming ok/wrt files. To speed up
opening arrays with many fragments, a user can now consolidate the
commit files into one.


TYPE: IMPROVEMENT
DESC: Adding support to consolidate ok/wrt files.
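
To illustrate the idea in the description, here is a minimal sketch of collapsing many per-fragment commit file names into a single payload. It is not the code in this PR: `has_suffix`, `consolidate_commit_names`, the `.wrt`/`.ok` suffix check, and the newline-separated layout are all assumptions made for illustration, and the real on-disk format is whatever `format_spec/consolidated_commits_file.md` defines.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: true if `name` ends with `suffix`.
bool has_suffix(const std::string& name, const std::string& suffix) {
  return name.size() >= suffix.size() &&
         name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical consolidation step: gather the names of commit files
// (assumed here to end in ".wrt" or ".ok") into one newline-separated
// payload, so a reader can fetch one file instead of listing many.
std::string consolidate_commit_names(const std::vector<std::string>& names) {
  std::string payload;
  for (const auto& name : names) {
    // The newline is the separator in this sketch, so names must not contain it.
    assert(name.find('\n') == std::string::npos);
    if (has_suffix(name, ".wrt") || has_suffix(name, ".ok")) {
      payload += name;
      payload += '\n';
    }
  }
  return payload;
}
```

Under these assumptions, opening an array would need to read only the consolidated payload rather than one entry per fragment, which is the speed-up the description refers to.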

@shortcut-integration

This pull request has been linked to Shortcut Story #14827: Support consolidating wrt files.

@KiterLuc KiterLuc force-pushed the lr/wrt-consolidation/ch14827 branch from 571c022 to a4c0433 Compare March 3, 2022 08:17
@KiterLuc KiterLuc force-pushed the lr/wrt-consolidation/ch14827 branch from a4c0433 to 2b60e5c Compare March 3, 2022 14:28
format_spec/consolidated_commits_file.md (outdated, resolved)
format_spec/consolidated_commits_file.md (resolved)
format_spec/ignore_file.md (resolved)
format_spec/consolidated_commits_file.md (outdated, resolved)
format_spec/array_file_hierarchy.md (outdated, resolved)
@@ -45,6 +45,13 @@ using namespace tiledb::common;
namespace tiledb {
namespace sm {

enum class ArrayDirectoryMode {
DEFAULT,
Member

Does this mean "load all"? If so, shall we rename it to LOAD_ALL or something like that?

Contributor Author

It does not. In some cases, for example for consolidated commits vacuuming, default might skip work that's not required.

tiledb/sm/array/array_directory.h (outdated, resolved)
tiledb/sm/array/array_directory.h (resolved)
tiledb/sm/misc/constants.cc (outdated, resolved)
URI latest_fragment_meta_uri_v12_or_higher;

// Load (in parallel) the root directory data
- if (!only_schemas_)
+ if (mode_ != ArrayDirectoryMode::SCHEMA_ONLY) {
Member

For better clarity now that we have multiple modes, is it possible to make cases based on the mode that is true rather than the mode that is not true? Can we also break this into different functions, e.g., load_schema_only, load_commits, load_<mode>, etc? I understand that there may be a bit of overlap, but it will be easier to read and maintain down the road, especially if we add more modes.

Contributor Author

@KiterLuc KiterLuc Mar 9, 2022

I was unable to get to this one for now, but I added some more comments to clarify. Maybe we can file a follow-up?
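
For whoever picks up the follow-up, the suggestion above could look roughly like the sketch below: dispatch on the mode that is set, with one function per mode. The enum values mirror the names in this PR's diff, but `DirectoryLoader`, the `load_*` functions, and the step names they record are hypothetical placeholders, not the actual `ArrayDirectory` behavior.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mirrors the mode names that appear in this PR's diff.
enum class ArrayDirectoryMode { DEFAULT, SCHEMA_ONLY, COMMITS, VACUUM_FRAGMENTS };

// Hypothetical loader. It records step names instead of touching storage,
// and the steps chosen for each mode are placeholders, not the real behavior.
class DirectoryLoader {
 public:
  explicit DirectoryLoader(ArrayDirectoryMode mode) : mode_(mode) {}

  // Dispatch on the mode that is set, with one function per mode.
  std::vector<std::string> load() {
    steps_.clear();
    switch (mode_) {
      case ArrayDirectoryMode::DEFAULT:
        load_default();
        break;
      case ArrayDirectoryMode::SCHEMA_ONLY:
        load_schema_only();
        break;
      case ArrayDirectoryMode::COMMITS:
        load_commits();
        break;
      case ArrayDirectoryMode::VACUUM_FRAGMENTS:
        load_vacuum_fragments();
        break;
    }
    assert(!steps_.empty());  // Every mode is expected to load something.
    return steps_;
  }

 private:
  void load_default() { steps_ = {"root", "commits", "schemas"}; }
  void load_schema_only() { steps_ = {"schemas"}; }
  void load_commits() { steps_ = {"commits"}; }
  void load_vacuum_fragments() { steps_ = {"root", "commits"}; }

  ArrayDirectoryMode mode_;
  std::vector<std::string> steps_;
};
```

Because the switch names every enumerator and has no default branch, compilers that warn on unhandled enumerators will flag a newly added mode that lacks a case, which is one argument for this shape over chained negative checks.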

if (stdx::string::ends_with(
uri.to_string(), constants::ignore_file_suffix)) {
uint64_t size = 0;
RETURN_NOT_OK_TUPLE(vfs_->file_size(uri, &size), nullopt, nullopt);
Member

Let's make a note to eliminate this once we implement VFS::ls_with_sizes.

Contributor Author

👍

for (auto& uri : uris_set) {
uris.emplace_back(uri);
}
std::sort(uris.begin(), uris.end());
Member

Is this unnecessary? I think the set iterator already visits the URIs in sorted order. If not, perhaps making a sorted map passing the URI comparator is a better idea here?

Contributor Author

I think the sort is necessary. The unordered_set is used in multiple functions later, and changing it to a set would make the count function logarithmic instead of constant time. So this might be better?
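
A small standalone sketch of the trade-off described in this reply: keep the hash set for constant-time average membership checks and sort a copy only where order matters. `sorted_copy` is a hypothetical helper, and `std::string` stands in for the URI type here as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Copy the members of a hash set into a vector and sort the copy.
// The set itself is left untouched, so count() on it stays O(1) on average.
std::vector<std::string> sorted_copy(
    const std::unordered_set<std::string>& uris_set) {
  std::vector<std::string> uris(uris_set.begin(), uris_set.end());
  // Iteration order of an unordered_set is unspecified, so sort explicitly.
  std::sort(uris.begin(), uris.end());
  assert(std::is_sorted(uris.begin(), uris.end()));
  return uris;
}
```

The sort costs O(n log n) once per call, while a `std::set` would pay O(log n) on every later `count` lookup.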

tiledb/sm/array/array_directory.cc (resolved)
SCHEMA_ONLY,
COMMITS,
VACUUM_FRAGMENTS
DEFAULT, // Default mode.
Member

Maybe call this READ? Does default mean load all URIs? (can be addressed in a future PR)

@KiterLuc KiterLuc force-pushed the lr/wrt-consolidation/ch14827 branch from 406a09b to ea87597 Compare March 9, 2022 15:26
@KiterLuc KiterLuc merged commit 2cdbb09 into dev Mar 9, 2022
@KiterLuc KiterLuc deleted the lr/wrt-consolidation/ch14827 branch March 9, 2022 18:36
ctx_, vfs_, v11_arrays_dir.c_str(), SPARSE_ARRAY_NAME) == TILEDB_OK);

// Write v11 fragment.
Member

@ihnorton ihnorton Jun 14, 2022

@KiterLuc double-checking: is this function writing a v11 fragment or writing a new fragment at the current version, to the v11 array copied above? cc @bekadavis9

Contributor Author

@ihnorton This will write at the current version, then we upgrade the array to the current version and write one more fragment...

4 participants