Support position deletes for Iceberg TableEngine #83094
alexey-milovidov
left a comment
will be fixed.
Otherwise, any pull request on data lakes will be reverted immediately.
a85efd8 to fd9a66b
…k/add_positional_delete_after_flaky_test_2
@alexey-milovidov, what relation do Delta, Hive Metastore, and Glue have to a feature that is specific to Iceberg? The Iceberg flaky test is going to be investigated, as stated in the header of this PR; of course, it seems this PR is connected to the failure.
I want to prioritize the resolution of technical debt.
For now I have failed to reproduce what is going on here :-(
Status update: I managed to reproduce the test failure locally with the help of @fm4v (one run fails in about a hundred attempts, so the feedback loop is going to be long, but at least now it exists). I will try to establish the root cause of the flaky test using this repro. I will also try to get a repro of this commit so that CI can be used for reproducing: a9c7ec1
mutable std::optional<Strings> cached_unprunned_files_for_last_processed_snapshot TSA_GUARDED_BY(cached_unprunned_files_for_last_processed_snapshot_mutex);
mutable std::optional<std::vector<ParsedDataFileInfo>>
    cached_unprunned_data_files_for_last_processed_snapshot TSA_GUARDED_BY(cached_unprunned_files_for_last_processed_snapshot_mutex);
This looks very complex. I am not sure why this cache is needed:
now that we cache ManifestFileContent in IcebergMetadataFilesCache, it already holds:
std::vector<ManifestFileEntry> data_files;
std::vector<ManifestFileEntry> position_deletes_files;
That means that on a cache hit, getting the data and delete file entries costs almost nothing.
So can we just remove this cache?
Theoretically, it can lead to unpredictable performance consequences, though I would also really like to do it, so let's try, and bring the cache back if it turns out to be needed.
We can remove the cache, but unfortunately we can't remove ParsedDataFileInfo (it is used not only in the cache).
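For readers following this thread, the pattern under discussion (a `mutable std::optional` member guarded by a mutex that remembers the result for the last processed snapshot) can be sketched as below. This is a minimal illustration, not the PR's code: the class `SnapshotFileListCache`, its `Loader` callback, the `cached_snapshot_id` member, and the single field of `ParsedDataFileInfo` are all hypothetical. The `TSA_GUARDED_BY` annotation from the diff is left out so that the sketch compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in: the real fields of ParsedDataFileInfo are not shown in this thread.
struct ParsedDataFileInfo
{
    std::string data_object_file_path;
};

// Hypothetical class; only the "mutable optional guarded by a mutex" idea comes from the diff.
class SnapshotFileListCache
{
public:
    using Loader = std::function<std::vector<ParsedDataFileInfo>(int64_t)>;

    explicit SnapshotFileListCache(Loader loader_) : loader(std::move(loader_)) {}

    // Returns the file list for `snapshot_id`, reusing the cached list when the
    // same snapshot was processed last. The list is returned by value so that
    // callers never read the guarded member without holding the mutex.
    std::vector<ParsedDataFileInfo> getDataFiles(int64_t snapshot_id) const
    {
        std::lock_guard lock(mutex);
        if (!cached_files || cached_snapshot_id != snapshot_id)
        {
            cached_files = loader(snapshot_id);
            cached_snapshot_id = snapshot_id;
        }
        assert(cached_files.has_value());
        return *cached_files;
    }

private:
    Loader loader;
    mutable std::mutex mutex;
    // The members below are the ones that the diff would annotate with TSA_GUARDED_BY.
    mutable std::optional<std::vector<ParsedDataFileInfo>> cached_files;
    mutable int64_t cached_snapshot_id = 0;
};
```

Removing such a cache, as proposed above, would amount to calling the loader on every request and relying on the lower-level manifest cache to keep that cheap.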
What happened in the stress test?
…k/add_positional_delete_after_flaky_test_2
… 'master' of github.com:ClickHouse/ClickHouse into divanik/add_positional_delete_after_flaky_test_2
…k/add_positional_delete_after_flaky_test_2
{
    IcebergDataObjectInfo(std::optional<ObjectMetadata> metadata_, ParsedDataFileInfo parsed_data_file_info_);

    ParsedDataFileInfo parsed_data_file_info;
IcebergDataObjectInfo only contains one field, ParsedDataFileInfo parsed_data_file_info. Why do we need two different object info types, IcebergDataObjectInfo and ParsedDataFileInfo? What is the difference between them?
IcebergDataObjectInfo is also derived from RelativePathWithMetadata; it can potentially take a lot of memory, and we don't want to store it explicitly.
In ManifestFile.h we store std::vector<ManifestFileEntry> data_files;, and in IcebergMetadata we want to transform it into ParsedDataFileInfo to reduce memory use. How about using std::shared_ptr<ManifestFileEntry> in both files?
Unfortunately, this doesn't work: we can't afford to store the ManifestFileEntry of every data file in a table, whether or not we use shared_ptr. In ManifestFileContent, only the files that correspond to one manifest file are stored explicitly, which is affordable.
Why would it store the ManifestFileEntry of every data file in a table? It is called in the iterate function in a streaming way: every time we iterate we also read a ManifestFileContent, so I think the ManifestFileEntry objects have the same lifespan in the iterators as in ManifestFileContent.
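The lifespan argument in the last two comments can be illustrated with a small sketch. It assumes a streaming iterator that keeps a reference to only one manifest at a time and hands out a compact per-file record. Every type here is a hypothetical stand-in named after the ones in the thread, and `StreamingDataFileIterator` and `ManifestLoader` are invented for illustration; this is not how the PR implements iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the types named in the thread.
struct ManifestFileEntry
{
    std::string file_path;
    std::string heavy_statistics; // stands in for per-file metadata that is costly to keep
};

struct ManifestFileContent
{
    std::vector<ManifestFileEntry> data_files;
};

struct ParsedDataFileInfo
{
    std::string file_path; // only what is needed downstream
};

class StreamingDataFileIterator
{
public:
    using ManifestLoader = std::function<std::shared_ptr<const ManifestFileContent>(size_t)>;

    StreamingDataFileIterator(size_t manifest_count_, ManifestLoader loader_)
        : manifest_count(manifest_count_), loader(std::move(loader_))
    {
        assert(loader); // an empty loader would make next() throw std::bad_function_call
    }

    // Yields one compact record per data file. Only the manifest currently
    // being read is referenced; earlier ones are released on reassignment.
    std::optional<ParsedDataFileInfo> next()
    {
        while (true)
        {
            if (current && entry_index < current->data_files.size())
                return ParsedDataFileInfo{current->data_files[entry_index++].file_path};
            if (next_manifest == manifest_count)
            {
                current.reset();
                return std::nullopt;
            }
            current = loader(next_manifest++); // drops the reference to the previous manifest
            entry_index = 0;
        }
    }

private:
    size_t manifest_count;
    ManifestLoader loader;
    std::shared_ptr<const ManifestFileContent> current;
    size_t next_manifest = 0;
    size_t entry_index = 0;
};
```

In this sketch the entries of a manifest live exactly as long as the iterator is positioned on it. Whether that matches the real code depends on what the consumer keeps after `next()` returns: anything that collects the yielded records for the whole table (such as the cache discussed earlier) pays the per-record size for every data file, which is the concern raised above.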
…k/add_positional_delete_after_flaky_test_2
Changelog category
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Support position deletes for Iceberg TableEngine
Documentation entry for user-facing changes
This PR is a second attempt at merging positional deletes into CH. Compared to this PR, several issues are going to be addressed: