Skip to content

validate added data files for snapshot compatibility #2050

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Jun 24, 2025

Conversation

kaushiksrini
Copy link
Contributor

Closes #1929

Rationale for this change

Are these changes tested?

Yes

Are there any user-facing changes?

No

@kaushiksrini kaushiksrini changed the title Feat/validate added data files validate added data files by supporting validateAddedDataFiles May 28, 2025
@kaushiksrini kaushiksrini changed the title validate added data files by supporting validateAddedDataFiles validate added data files for snapshot compatibility May 28, 2025
Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kaushiksrini thank you for working on this PR!

The logic looks sound to me, I have some nit picks regarding doc strings and reducing duplication of code.

Please let me know what you think about the feedback!

partition_set: Optional[dict[int, set[Record]]],
parent_snapshot: Optional[Snapshot],
) -> Iterator[ManifestEntry]:
if parent_snapshot is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we introduce a doc string for each function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this if makes sense, why not make the parent_snapshot non-optional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko
Copy link
Contributor

Fokko commented Jun 2, 2025

@kaushiksrini could you resolve the conflicts? Thanks!

Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com>
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @kaushiksrini Some small comments, but I think we can move this forward 👍

Comment on lines 154 to +159
for manifest in manifests:
for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False):
if entry.snapshot_id not in snapshot_ids:
continue

if entry.status != ManifestEntryStatus.DELETED:
continue

if data_filter is not None and evaluator(entry.data_file) is ROWS_CANNOT_MATCH:
continue

if partition_set is not None:
spec_id = entry.data_file.spec_id
partition = entry.data_file.partition
if spec_id not in partition_set or partition not in partition_set[spec_id]:
continue

yield entry
if _filter_manifest_entries(
entry, snapshot_ids, data_filter, partition_set, ManifestEntryStatus.DELETED, table.schema()
):
yield entry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore this comment, but a missed opportunity for some syntactic sugar:

for manifest in manifests:
    yield from (
        entry
        for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False)
        if _filter_manifest_entries(
            entry, snapshot_ids, data_filter, partition_set, ManifestEntryStatus.DELETED, table.schema()
        )
    )

partition_set: Optional[dict[int, set[Record]]],
parent_snapshot: Optional[Snapshot],
) -> Iterator[ManifestEntry]:
if parent_snapshot is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this if makes sense, why not make the parent_snapshot non-optional?

@Fokko Fokko merged commit c32aa04 into apache:main Jun 24, 2025
10 checks passed
@Fokko
Copy link
Contributor

Fokko commented Jun 24, 2025

This looks great @kaushiksrini, let's move this forward 🚀 Thanks for the review @sungwy and @jayceslesar 🙌

amitgilad3 pushed a commit to amitgilad3/iceberg-python that referenced this pull request Jul 7, 2025
<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

Closes apache#1929

# Rationale for this change

- Since we want to support snapshot write compatibility (apache#1772) and is
part of the following parent issue apache#819

# Are these changes tested?

Yes

# Are there any user-facing changes?

No

<!-- In the case of user-facing changes, please add the changelog label.
-->

---------

Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
<!--
Thanks for opening a pull request!
-->

<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
<!-- Closes #${GITHUB_ISSUE_ID} -->

Closes apache#1929

# Rationale for this change

- Since we want to support snapshot write compatibility (apache#1772) and is
part of the following parent issue apache#819

# Are these changes tested?

Yes

# Are there any user-facing changes?

No

<!-- In the case of user-facing changes, please add the changelog label.
-->

---------

Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Concurrency Safety Validation: Implement validateAddedDataFiles
4 participants