-
Notifications
You must be signed in to change notification settings - Fork 347
validate added data files for snapshot compatibility #2050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
validate added data files for snapshot compatibility #2050
Conversation
validateAddedDataFiles
validateAddedDataFiles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kaushiksrini thank you for working on this PR!
The logic looks sound to me, I have some nit picks regarding doc strings and reducing duplication of code.
Please let me know what you think about the feedback!
partition_set: Optional[dict[int, set[Record]]], | ||
parent_snapshot: Optional[Snapshot], | ||
) -> Iterator[ManifestEntry]: | ||
if parent_snapshot is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we introduce a doc string for each function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this if
makes sense, why not make the parent_snapshot
non-optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its what the java implementation does right? https://github.com/apache/iceberg/blob/3a29199e73f2e9ae0f8f92a1a0732a338c66aa0d/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L685
@kaushiksrini could you resolve the conflicts? Thanks! |
Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @kaushiksrini Some small comments, but I think we can move this forward 👍
for manifest in manifests: | ||
for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False): | ||
if entry.snapshot_id not in snapshot_ids: | ||
continue | ||
|
||
if entry.status != ManifestEntryStatus.DELETED: | ||
continue | ||
|
||
if data_filter is not None and evaluator(entry.data_file) is ROWS_CANNOT_MATCH: | ||
continue | ||
|
||
if partition_set is not None: | ||
spec_id = entry.data_file.spec_id | ||
partition = entry.data_file.partition | ||
if spec_id not in partition_set or partition not in partition_set[spec_id]: | ||
continue | ||
|
||
yield entry | ||
if _filter_manifest_entries( | ||
entry, snapshot_ids, data_filter, partition_set, ManifestEntryStatus.DELETED, table.schema() | ||
): | ||
yield entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignore this comment, but a missed opportunity for some syntactic sugar:
for manifest in manifests:
yield from (
entry
for entry in manifest.fetch_manifest_entry(table.io, discard_deleted=False)
if _filter_manifest_entries(
entry, snapshot_ids, data_filter, partition_set, ManifestEntryStatus.DELETED, table.schema()
)
)
partition_set: Optional[dict[int, set[Record]]], | ||
parent_snapshot: Optional[Snapshot], | ||
) -> Iterator[ManifestEntry]: | ||
if parent_snapshot is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this if
makes sense, why not make the parent_snapshot
non-optional?
This looks great @kaushiksrini, let's move this forward 🚀 Thanks for the review @sungwy and @jayceslesar 🙌 |
<!-- Thanks for opening a pull request! --> <!-- In the case this PR will resolve an issue, please replace ${GITHUB_ISSUE_ID} below with the actual Github issue id. --> <!-- Closes #${GITHUB_ISSUE_ID} --> Closes apache#1929 # Rationale for this change - Since we want to support snapshot write compatibility (apache#1772) and is part of the following parent issue apache#819 # Are these changes tested? Yes # Are there any user-facing changes? No <!-- In the case of user-facing changes, please add the changelog label. --> --------- Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com> Co-authored-by: Fokko Driesprong <fokko@apache.org>
<!-- Thanks for opening a pull request! --> <!-- In the case this PR will resolve an issue, please replace ${GITHUB_ISSUE_ID} below with the actual Github issue id. --> <!-- Closes #${GITHUB_ISSUE_ID} --> Closes apache#1929 # Rationale for this change - Since we want to support snapshot write compatibility (apache#1772) and is part of the following parent issue apache#819 # Are these changes tested? Yes # Are there any user-facing changes? No <!-- In the case of user-facing changes, please add the changelog label. --> --------- Co-authored-by: Jayce Slesar <47452474+jayceslesar@users.noreply.github.com> Co-authored-by: Fokko Driesprong <fokko@apache.org>
Closes #1929
Rationale for this change
Are these changes tested?
Yes
Are there any user-facing changes?
No