Fix S3Store retry might cause poisoned data #1383
Conversation
When S3Store is used together with VerifyStore, the S3 store could receive data that VerifyStore deemed invalid because of the retry logic. This only affects S3Store + VerifyStore: due to the way the AwsS3Sdk crate works, we need to hold recent data in the BufChannel. When VerifyStore detected an invalid hash, the retry logic in S3Store would trigger, but instead of surfacing the error, the channel would replay the already-buffered data on retry, so S3 would still receive the invalid data. This PR makes the BufChannel logic set a flag so that the next read in S3Store always triggers the error condition.
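To make the idea concrete, here is a minimal, self-contained sketch (not the actual nativelink `BufChannel` API; the type `BufReadHalf` and the methods `poison`, `try_reset_for_retry`, and `recv` are hypothetical names used only for illustration): a read half buffers recently delivered chunks so a retry can replay them, and a poison flag forces every later read to fail instead of replaying data a verifier already rejected.

```rust
use std::collections::VecDeque;

/// Hypothetical read half that keeps recent chunks for retries, plus a
/// "poison" flag that makes any later read fail instead of replaying
/// data that a downstream verifier already rejected.
struct BufReadHalf {
    queued_data: VecDeque<Vec<u8>>, // data not yet handed to the reader
    recent_data: Vec<Vec<u8>>,      // data already handed out, kept for retries
    poisoned: bool,                 // set when the consumer reports an error
}

impl BufReadHalf {
    fn new(chunks: impl IntoIterator<Item = Vec<u8>>) -> Self {
        Self {
            queued_data: chunks.into_iter().collect(),
            recent_data: Vec::new(),
            poisoned: false,
        }
    }

    /// Called when the downstream consumer (e.g. a verifier) rejects the
    /// data. Clears the replay buffers so a retry cannot resend them and
    /// marks the stream as poisoned.
    fn poison(&mut self) {
        self.queued_data.clear();
        self.recent_data.clear();
        self.poisoned = true;
    }

    /// Rewinds `recent_data` so a retry can replay it. Fails if poisoned.
    fn try_reset_for_retry(&mut self) -> Result<(), &'static str> {
        if self.poisoned {
            return Err("stream poisoned: previous data was rejected");
        }
        for chunk in self.recent_data.drain(..).rev() {
            self.queued_data.push_front(chunk);
        }
        Ok(())
    }

    /// Returns the next chunk, or an error if the stream was poisoned.
    fn recv(&mut self) -> Result<Option<Vec<u8>>, &'static str> {
        if self.poisoned {
            return Err("stream poisoned: previous data was rejected");
        }
        match self.queued_data.pop_front() {
            Some(chunk) => {
                self.recent_data.push(chunk.clone());
                Ok(Some(chunk))
            }
            None => Ok(None),
        }
    }
}

fn main() {
    let mut reader = BufReadHalf::new([b"abc".to_vec(), b"def".to_vec()]);
    assert_eq!(reader.recv().unwrap(), Some(b"abc".to_vec()));
    // The verifier rejects the data, so the stream is poisoned.
    reader.poison();
    // A retry can no longer replay the rejected bytes; it sees an error.
    assert!(reader.try_reset_for_retry().is_err());
    assert!(reader.recv().is_err());
}
```

The key point this sketch tries to capture is that poisoning also clears the replay buffer, so the uploader's retry path observes an error rather than silently resending the previously buffered (invalid) bytes.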
d6801d2 to b71f04f
Reviewable status: 0 of 1 LGTMs obtained, and 0 of 3 files reviewed, and pending CI: pre-commit-checks (waiting on @adam-singer)
nativelink-store/src/s3_store.rs
line 437 at r1 (raw file):
.retrier
    .retry(unfold(reader, move |mut reader| async move {
        let UploadSizeInfo::ExactSize(sz) = upload_size else {
fyi: Just seemed weird to do this in retry. Not related to this PR though.
Reviewed 2 of 3 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and all files reviewed, and pending CI: Installation / macos-13, Remote / large-ubuntu-22.04, and 1 discussions need to be resolved
nativelink-util/src/buf_channel.rs
line 222 at r2 (raw file):
self.queued_data.clear();
self.recent_data.clear();
self.bytes_received = 0;
Should this be part of some sort of "reset stream" or "reset channel" type function? Would there be other cases where we need to reset/clear?
Code quote:
self.queued_data.clear();
self.recent_data.clear();
self.bytes_received = 0;
Reviewable status: complete! 1 of 1 LGTMs obtained, and all files reviewed
nativelink-util/src/buf_channel.rs
line 222 at r2 (raw file):
Previously, adam-singer (Adam Singer) wrote…
Should this be part of some sort of "reset stream" or "reset channel" type function? Would there be other cases where we need to reset/clear?
I originally had it this way, but then realized the only way an error can happen is if it drops early. We don't expose a way for an error to be sent from the sender side.
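For context, the "drops early" behavior is the same pattern as a standard Rust channel, sketched below with `std::sync::mpsc` (an illustration of the general pattern only, not the nativelink `buf_channel` API): there is no explicit error-send path on the sender side; the receiver only observes a failure because the sender half went away before the stream finished.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    thread::spawn(move || {
        tx.send(b"partial chunk".to_vec()).unwrap();
        // The sender is dropped here without finishing the upload.
    });

    assert_eq!(rx.recv().unwrap(), b"partial chunk".to_vec());
    // With the sender gone, the next recv() returns an error; this is
    // the "drops early" case described above.
    assert!(rx.recv().is_err());
}
```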