Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix avoidable S3 race condition (#693) #735

Merged
merged 2 commits into from
Nov 29, 2022

Conversation

RachitSharma2001
Copy link
Contributor

Motivation

Fixes #693

As proposed by @ronkorving, it may be common for the s3 iter_bucket function to encounter 404 not found errors due to the s3 list bucket command sometimes returning recently deleted keys. So a workaround of this problem is to surpress 404 errors for iter_bucket, and instead log them as warnings for the user to assess.

Tests

All existing tests related to iter_bucket within s3.py pass. I also added two new tests: test_iter_bucket_404 and test_iter_bucket_non_404. The first test simulates the scenario with iter_bucket for a key throws a 404 error, and expects that error to be suppressed and for computation to carry on for the rest of the keys. The latter test simulates the scenario where a non 404 error occurs, in which case it should not be suppressed.

smart_open/s3.py Outdated
if 'Error' in err.response and err.response['Error'].get('Code') == '404':
logger.warning(
"Encountered '404 Not Found' error for key #%i",
key_no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it perhaps make sense to log the key, rather than the key_no? If anybody does care to see the log, I'm not sure the number would be as helpful in debugging a situation.

An argument could perhaps be made that it's not actually worthy of being logged as a warning. An end-user could read this as something being wrong or broken, when this is completely valid behavior that is to be expected from time to time.

Copy link
Contributor Author

@RachitSharma2001 RachitSharma2001 Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very good point. If a 404 error occurs during iter_bucket, it is almost certainly not the users fault, but rather the fault of the internal list_keys function. I agree that logging the error would just be confusing to the user. I've removed the warning in my latest commit. Let me know what you think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me 👍

smart_open/s3.py Outdated
while True:
try:
(key, content) = result_iterator.__next__()
if True or key_no % 1000 == 0:
Copy link
Contributor

@ronkorving ronkorving Nov 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any need for the True or bit?

edit: I realize that was there before. I wonder why 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also wondering this while I was looking at the code. I am not sure why it is there, I can remove it if its confirmed that it is not needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @mpenkov knows more about this.

@piskvorky piskvorky added this to the 6.3.0 milestone Nov 19, 2022
@mpenkov mpenkov merged commit 7472d65 into piskvorky:develop Nov 29, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Nov 29, 2022

Thank you @RachitSharma2001 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Probably avoidable fatal S3 race condition
4 participants