-
Notifications
You must be signed in to change notification settings - Fork 83
Implement snapshot expiration #401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
} | ||
|
||
for refName, ref := range t.meta.refs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine these loops and just put the continue after refsToDelete[refName] = struct{}{}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to keep it as explained here: https://iceberg.apache.org/spec/#snapshot-retention-policy
I can change that if you think it's better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything in that description which prevents combining these loops into a single loop, so let's change it to simplify the logic and avoid the extra trip through all of the refs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! other than a few comments, the only thing missing here is some tests. Can you add some tests please?
Hi @zeroshade Sure, I'll add tests.
If we decide to handle file removal as part of ExpireSnapshots, we would have to postpone the actual deletion after the commit is successful. I don't see any easy way to do it. |
@kevinjqliu @Fokko do either of you know the answers to @arnaudbriche's questions here you both have more experience with the spec side of things than I do as far as what makes sense for those situations. |
a4b4b36
to
ec8d33b
Compare
I just added a test and implemented snapshot-log pruning because it's growing unbounded otherwise. |
Regarding file deletion, maybe we can add a new method to the |
It's a good question. If we have the table, then we have the file system interface so that we can remove the files. It might make sense for us to simply have an option to tell it whether or not to delete the files (rather than allow generic post-commit code). |
the
From the ExpireSnapshots javadoc, it looks you can optionally remove the data and metadata files, but its set to true by default
Also from the java docs for When committing, these changes will be applied to the latest table metadata. Commit conflicts will be resolved by applying the changes to the new latest metadata and reattempting the commit. A new snapshot should be created reflecting the newest table update, i.e. old snapshots removed Hope this helps! Happy to help point to any other questions :) |
ec8d33b
to
97a5820
Compare
Thx for all the answers @kevinjqliu. I have a hard time figuring out how a new snapshot could be created, I we technically don't write , update or delete any data from the table here. |
…e slice's underlying arrays with the old one
b.schemaList = metadata.Schemas() | ||
b.schemaList = slices.Clone(metadata.Schemas()) | ||
b.currentSchemaID = metadata.CurrentSchema().ID | ||
b.specs = metadata.PartitionSpecs() | ||
b.specs = slices.Clone(metadata.PartitionSpecs()) | ||
b.defaultSpecID = metadata.DefaultPartitionSpec() | ||
b.lastPartitionID = metadata.LastPartitionSpecID() | ||
b.props = metadata.Properties() | ||
b.snapshotList = metadata.Snapshots() | ||
b.sortOrderList = metadata.SortOrders() | ||
b.props = maps.Clone(metadata.Properties()) | ||
b.snapshotList = slices.Clone(metadata.Snapshots()) | ||
b.sortOrderList = slices.Clone(metadata.SortOrders()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there's a way we could lazily do this cloning? Avoid performing the copies until we actually need to?
refAge := time.Now().UnixMilli() - snap.TimestampMs | ||
if refAge > *maxRefAgeMs { | ||
updates = append(updates, NewRemoveSnapshotRefUpdate(refName)) | ||
refsToDelete[refName] = struct{}{} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract time.Now().UnixMilli()
from the loop instead of calling it repeatedly. Stick it in the var block at the top of this function.
for _, snap := range preTable.Metadata().Snapshots() { | ||
if slices.Contains(u.SnapshotIDs, snap.SnapshotID) { | ||
filesToDelete[snap.ManifestList] = struct{}{} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have our list of snapshotIDs, so why do we need this loop?
mans, err := snap.Manifests(preTable.fs) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
for _, man := range mans { | ||
filesToDelete[man.FilePath()] = struct{}{} | ||
|
||
entries, err := man.FetchEntries(preTable.fs, false) | ||
if err != nil { | ||
return err | ||
} | ||
|
||
for _, entry := range entries { | ||
filesToDelete[entry.DataFile().FilePath()] = struct{}{} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Snapshot
objects have a dataFiles
method which will return an iterator over the files, maybe we can leverage this to encapsulate this logic more rather than duplicate it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is shaping up nicely! Just a few questions left.
|
||
package internal | ||
|
||
func Coalesce[T any](vals ...*T) *T { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: better to move it to utils.go, no need to span a new file just for this function.
Hi @zeroshade,
As discussed on a previous PR, I'm trying to implement snapshots expiration now.
This is not complete, but I wanted to discuss it with you before I further work on it.
If the design looks ok to you, here are a few technical questions regarding snapshot expiration:
References