Skip to content

Support position deletes for Iceberg TableEngine #83094

Merged
divanik merged 20 commits intomasterfrom
divanik/add_positional_delete_after_flaky_test_2
Aug 7, 2025
Merged

Support position deletes for Iceberg TableEngine #83094
divanik merged 20 commits intomasterfrom
divanik/add_positional_delete_after_flaky_test_2

Conversation

@divanik
Copy link
Copy Markdown
Member

@divanik divanik commented Jul 2, 2025

Changelog category

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Support position deletes for Iceberg TableEngine

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

This PR is a second attempt of merging Positional deletes to CH. Comparing to this PR several issues are going to be addressed:

  • Docs should be written (the box is above)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jul 2, 2025

Workflow [PR], commit [7251b06]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Jul 2, 2025
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not merge until all tests for Iceberg:
https://play.clickhouse.com/play?user=play#U0VMRUNUIGNoZWNrX3N0YXJ0X3RpbWUsIGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcmVwb3J0X3VybApGUk9NIGNoZWNrcwpXSEVSRSBjaGVja19zdGFydF90aW1lID49IG5vdygpIC0gSU5URVJWQUwgMSBNT05USAogICAgQU5EIChoZWFkX3JlZiA9ICdtYXN0ZXInIEFORCBzdGFydHNXaXRoKGhlYWRfcmVwbywgJ0NsaWNrSG91c2UvJykpCiAgICBBTkQgdGVzdF9zdGF0dXMgIT0gJ1NLSVBQRUQnCiAgICBBTkQgKHRlc3Rfc3RhdHVzIExJS0UgJ0YlJyBPUiB0ZXN0X3N0YXR1cyBMSUtFICdFJScpIAogICAgQU5EIGNoZWNrX3N0YXR1cyAhPSAnc3VjY2VzcycKICAgIEFORCBwb3NpdGlvbih0ZXN0X25hbWUsICdpY2ViZXJnJykgPiAwCk9SREVSIEJZIGNoZWNrX3N0YXJ0X3RpbWU=

for Delta https://play.clickhouse.com/play?user=play#U0VMRUNUIGNoZWNrX3N0YXJ0X3RpbWUsIGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcmVwb3J0X3VybApGUk9NIGNoZWNrcwpXSEVSRSBjaGVja19zdGFydF90aW1lID49IG5vdygpIC0gSU5URVJWQUwgMSBNT05USAogICAgQU5EIChoZWFkX3JlZiA9ICdtYXN0ZXInIEFORCBzdGFydHNXaXRoKGhlYWRfcmVwbywgJ0NsaWNrSG91c2UvJykpCiAgICBBTkQgdGVzdF9zdGF0dXMgIT0gJ1NLSVBQRUQnCiAgICBBTkQgKHRlc3Rfc3RhdHVzIExJS0UgJ0YlJyBPUiB0ZXN0X3N0YXR1cyBMSUtFICdFJScpIAogICAgQU5EIGNoZWNrX3N0YXR1cyAhPSAnc3VjY2VzcycKICAgIEFORCBwb3NpdGlvbih0ZXN0X25hbWUsICdkZWx0YScpID4gMApPUkRFUiBCWSBjaGVja19zdGFydF90aW1l

for Hive Metastore https://play.clickhouse.com/play?user=play#U0VMRUNUIGNoZWNrX3N0YXJ0X3RpbWUsIGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcmVwb3J0X3VybApGUk9NIGNoZWNrcwpXSEVSRSBjaGVja19zdGFydF90aW1lID49IG5vdygpIC0gSU5URVJWQUwgMSBNT05USAogICAgQU5EIChoZWFkX3JlZiA9ICdtYXN0ZXInIEFORCBzdGFydHNXaXRoKGhlYWRfcmVwbywgJ0NsaWNrSG91c2UvJykpCiAgICBBTkQgdGVzdF9zdGF0dXMgIT0gJ1NLSVBQRUQnCiAgICBBTkQgKHRlc3Rfc3RhdHVzIExJS0UgJ0YlJyBPUiB0ZXN0X3N0YXR1cyBMSUtFICdFJScpIAogICAgQU5EIGNoZWNrX3N0YXR1cyAhPSAnc3VjY2VzcycKICAgIEFORCBwb3NpdGlvbih0ZXN0X25hbWUsICdobXMnKSA+IDAKT1JERVIgQlkgY2hlY2tfc3RhcnRfdGltZQ==

for Glue https://play.clickhouse.com/play?user=play#U0VMRUNUIGNoZWNrX3N0YXJ0X3RpbWUsIGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcmVwb3J0X3VybApGUk9NIGNoZWNrcwpXSEVSRSBjaGVja19zdGFydF90aW1lID49IG5vdygpIC0gSU5URVJWQUwgMSBNT05USAogICAgQU5EIChoZWFkX3JlZiA9ICdtYXN0ZXInIEFORCBzdGFydHNXaXRoKGhlYWRfcmVwbywgJ0NsaWNrSG91c2UvJykpCiAgICBBTkQgdGVzdF9zdGF0dXMgIT0gJ1NLSVBQRUQnCiAgICBBTkQgKHRlc3Rfc3RhdHVzIExJS0UgJ0YlJyBPUiB0ZXN0X3N0YXR1cyBMSUtFICdFJScpIAogICAgQU5EIGNoZWNrX3N0YXR1cyAhPSAnc3VjY2VzcycKICAgIEFORCBwb3NpdGlvbih0ZXN0X25hbWUsICdnbHVlJykgPiAwCk9SREVSIEJZIGNoZWNrX3N0YXJ0X3RpbWU=

will be fixed.

Otherwise, any pull request on data lakes will be reverted immediately.

@hanfei1991 hanfei1991 self-assigned this Jul 3, 2025
@divanik divanik force-pushed the divanik/add_positional_delete_after_flaky_test_2 branch from a85efd8 to fd9a66b Compare July 3, 2025 10:41
@divanik
Copy link
Copy Markdown
Member Author

divanik commented Jul 3, 2025

@alexey-milovidov, what relation do Delta, Hive Metastore and Glue have with a feature that is specific to Iceberg? The Iceberg flaky test is going to be investigated as stated in the header of this PR, of course, it seems this PR is connected to the failure

@alexey-milovidov
Copy link
Copy Markdown
Member

I want to prioritize the resolution of technical debt.

@divanik
Copy link
Copy Markdown
Member Author

divanik commented Jul 4, 2025

For now I failed to reproduce what id going on here :-(
It seems that I need to launch the whole job on a particular commit but I don't know how to do it

@divanik divanik added ci-integration-test-flaky Run only integration flaky tests ci-integration-test CI with integration test jobs only and removed ci-integration-test-flaky Run only integration flaky tests ci-integration-test CI with integration test jobs only labels Jul 4, 2025
@divanik
Copy link
Copy Markdown
Member Author

divanik commented Jul 7, 2025

Status update: I managed to reproduce the test locally with help of @ fm4v (one test fails in a hundred of attempts, so feed back loop is going to be big, but at least now it exists), will try to establish a root cause of flaky test using this repro. Also will try to get repro of this commit to use CI for reproducing: a9c7ec1


mutable std::optional<Strings> cached_unprunned_files_for_last_processed_snapshot TSA_GUARDED_BY(cached_unprunned_files_for_last_processed_snapshot_mutex);
mutable std::optional<std::vector<ParsedDataFileInfo>>
cached_unprunned_data_files_for_last_processed_snapshot TSA_GUARDED_BY(cached_unprunned_files_for_last_processed_snapshot_mutex);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks very complex here. I suspect why this cache is needed:
now that we cache ManifestFileContent in IcebergMetadataFilesCache, it has cached:

  • std::vector<ManifestFileEntry> data_files;
  • std::vector<ManifestFileEntry> position_deletes_files;

That means if we hit this cache, we have almost no cost to get data/deletes files entry.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so can we just remove these cache?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It theoretically can lead to unpredicted perfomance consequences, though I really also want to do it, so let's try and maybe return it if it is needed

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove cache but we can't remove ParsedDataFileInfo, unfortunately (it is used not only in cache)

@alexey-milovidov
Copy link
Copy Markdown
Member

What happened in the stress test?

divanik added 4 commits July 30, 2025 16:41
… 'master' of github.com:ClickHouse/ClickHouse into divanik/add_positional_delete_after_flaky_test_2
@divanik divanik marked this pull request as ready for review August 1, 2025 17:56
{
IcebergDataObjectInfo(std::optional<ObjectMetadata> metadata_, ParsedDataFileInfo parsed_data_file_info_);

ParsedDataFileInfo parsed_data_file_info;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IcebergDataObjectInfo only contains one field ParsedDataFileInfo parsed_data_file_info, why do we need two different object info IcebergDataObjectInfo and ParserdDataFileInfo? what is the difference between them?

Copy link
Copy Markdown
Member Author

@divanik divanik Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IcebergDataObjectInfo is also a derived from RelativePathWithMetadata, it can potentially take a lot of memory and we don't to store it explicitly

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ManifestFile.h we have stored std::vector<ManifestFileEntry> data_files; and in IcebergMetadata we want to transform it to ParsedDataFileInfo to save memory use. How about using std::shared_ptr<ManifestFileEntry> in both files?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work unfortunately, we can't allow ourselves to store ManifestFileEntry of all data files in a table, doesn't matter if we use shared_ptr or not. In ManifestFileContent only files which correspond to one manifest file are explicitly stored which is affordable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why will it store ManifestFileEntry of all data files in a table? It's called in iterate function in a streaming way, everytime we iterate we also read a ManifestFileContent, so I think the ManifestFileEntrys have same life span in iterators as in ManifestFileContent

Copy link
Copy Markdown
Member

@hanfei1991 hanfei1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

@divanik divanik added this pull request to the merge queue Aug 7, 2025
Merged via the queue into master with commit 4e307b1 Aug 7, 2025
240 of 244 checks passed
@divanik divanik deleted the divanik/add_positional_delete_after_flaky_test_2 branch August 7, 2025 10:40
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants