New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix s3.iter_bucket failure when botocore_session passed in #749
Conversation
Thank you for this carefully prepared PR. I think you may be misunderstanding the StackOverflow post. Multiprocessing does not require things to be global. It only requires them to be pickle-able. For functions, this means defined at module scope (I think this is what you refer to as global). For objects (like boto3 sessions) there is no such requirement. Is there a reason why we're not passing the entire session object to iter_bucket? I think that would make it possible to avoid using globals here. |
I remembered why we're doing this. The boto3 docs advise to do so:
So it looks like their recommendation is part of the problem here. We could go against it (and pass the Session across the process boundary like I suggested above). There aren't really any good alternatives here, because passing the Client (instead of the Session) across a process boundary is forbidden:
I've re-read the original issue:
... and now I'm unsure if the use case is valid. If the botocore.Session cannot be safely passed across a subprocess boundary (effectively cloning it) and re-used, there isn't much we can do. Going against Boto3 recommendations and passing a boto3.Session across the subprocess boundary (whether as a global or otherwise) also sounds like a bad idea to me. |
@mpenkov That is a good point. So from what I am thinking, either we don't allow the user to pass in a custom boto3 Session object at all (and throw an error or warning if they do), or (if we are willing to go against boto3 recommendations and pass around the given session object) allow them to pass it in but throw necessary warnings about possible side effects (or warn that errors can occur if the passed in session is not thread-safe). For either case, I could update this pr to make the necessary changes if needed. |
I think a warning is sufficient. Also, in this particular case, we don't care about thread safety - we care about process safety. The other tricky thing to consider is how this process issue interplays with the MULTIPROCESSING flag. If that flag is set to False, then the game changes significantly, and using the single session is OK. So we may have to split the implementation into two:
It's a bit of a pain... I'm not sure what the best way to handle this is. |
I have an idea of how to handle this but it does seem a bit messy:
This would help avoid the 'upload file' error that occurs when multiprocessing is False while also avoiding any unnecessary errors when multiprocessing is True (except, of course, when We could also just add a botocore_session argument to iter_bucket itself. Let me know if this idea sounds good or if there could be a better way to do this. |
Yeah, that does sound a bit messy. There should be a better way. Give me a little while to think about it. |
I'm going to close this, because the use case described in the original issue has a work-around (disable multiprocessing). Thank you for your work. In this particular case, supporting the original use case does not justify increasing the complexity of the iter_bucket function even further. I had a think about what the best solution here would be, but couldn't come up with anything. If I end up refactoring iter_bucket in the future, I'll take this work into consideration. |
Motivation
Fixes #670
As shown by @SootyOwl, when a user declares their own
botocore_session
object and passes it intos3.iter_bucket
, one of two errors occur:AttributeError: Can't pickle local object 'lazy_call.<locals>._handler
RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.
As explained here, the reason the first error occurs is that the multiprocessing module performs pickling on objects and requires those objects to be global, not local.
As explained in the original raised issue, the reason the second error occurs is that
_list_bucket
and_download_key
both createsboto3.session.Session
objects out of the passed inbotocore_session
, which is not allowed by the boto3 library.The proposed changes address both issues by creating a global session object within
iter_bucket
that_list_bucket
and_download_key
can access.Tests
All existing tests related to
iter_bucket
withins3.py
pass. I also added two new tests:test_iter_bucket_passed_in_session_multiprocessing_false
andtest_iter_bucket_passed_in_session_multiprocessing_true
. These test the two previously failing situations.