Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix s3.iter_bucket failure when botocore_session passed in #749

Closed
wants to merge 1 commit into from

Conversation

RachitSharma2001
Copy link
Contributor

Motivation

Fixes #670

As shown by @SootyOwl, when a user declares their own botocore_session object and passes it into s3.iter_bucket, one of two errors occur:

  1. With smart_open.concurrency._MULTIPROCESSING = True:
    AttributeError: Can't pickle local object 'lazy_call.<locals>._handler
  2. With smart_open.concurrency._MULTIPROCESSING = False:
    RuntimeError: Cannot inject class attribute "upload_file", attribute already exists in class dict.

As explained here, the reason the first error occurs is that the multiprocessing module performs pickling on objects and requires those objects to be global, not local.

As explained in the original raised issue, the reason the second error occurs is that _list_bucket and _download_key both creates boto3.session.Session objects out of the passed in botocore_session, which is not allowed by the boto3 library.

The proposed changes address both issues by creating a global session object within iter_bucket that _list_bucket and _download_key can access.

Tests

All existing tests related to iter_bucket within s3.py pass. I also added two new tests: test_iter_bucket_passed_in_session_multiprocessing_false and test_iter_bucket_passed_in_session_multiprocessing_true. These test the two previously failing situations.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 25, 2022

Thank you for this carefully prepared PR.

I think you may be misunderstanding the StackOverflow post. Multiprocessing does not require things to be global. It only requires them to be pickle-able. For functions, this means defined at module scope (I think this is what you refer to as global). For objects (like boto3 sessions) there is no such requirement.

Is there a reason why we're not passing the entire session object to iter_bucket? I think that would make it possible to avoid using globals here.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 25, 2022

Is there a reason why we're not passing the entire session object to iter_bucket?

I remembered why we're doing this. The boto3 docs advise to do so:

It's recommended to create a new Session object for each thread or process

So it looks like their recommendation is part of the problem here. We could go against it (and pass the Session across the process boundary like I suggested above). There aren't really any good alternatives here, because passing the Client (instead of the Session) across a process boundary is forbidden:

Multi-Processing: While clients are thread-safe, they cannot be shared across processes due to their networking implementation. Doing so may lead to incorrect response ordering when calling services.

I've re-read the original issue:

As explained in the original raised issue, the reason the second error occurs is that _list_bucket and _download_key both creates boto3.session.Session objects out of the passed in botocore_session, which is not allowed by the boto3 library.

... and now I'm unsure if the use case is valid. If the botocore.Session cannot be safely passed across a subprocess boundary (effectively cloning it) and re-used, there isn't much we can do. Going against Boto3 recommendations and passing a boto3.Session across the subprocess boundary (whether as a global or otherwise) also sounds like a bad idea to me.

@RachitSharma2001
Copy link
Contributor Author

RachitSharma2001 commented Dec 26, 2022

@mpenkov That is a good point. So from what I am thinking, either we don't allow the user to pass in a custom boto3 Session object at all (and throw an error or warning if they do), or (if we are willing to go against boto3 recommendations and pass around the given session object) allow them to pass it in but throw necessary warnings about possible side effects (or warn that errors can occur if the passed in session is not thread-safe). For either case, I could update this pr to make the necessary changes if needed.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 27, 2022

I think a warning is sufficient. Also, in this particular case, we don't care about thread safety - we care about process safety.

The other tricky thing to consider is how this process issue interplays with the MULTIPROCESSING flag. If that flag is set to False, then the game changes significantly, and using the single session is OK. So we may have to split the implementation into two:

  1. multiprocessing implementation, process-safety caveats apply
  2. single-process implementation, process-safety caveats do not apply

It's a bit of a pain... I'm not sure what the best way to handle this is.

@RachitSharma2001
Copy link
Contributor Author

I have an idea of how to handle this but it does seem a bit messy:

  1. Give _download_key and _list_bucket two additional arguments: boto3_session which by default is None, and **session_kwargs.
  2. Then, in iter_bucket, check if botocore_session exists in session_kwargs. If so, create a boto3 session object out of it, and pass it in to _download_key and _list_bucket (thus avoiding the 'upload_file' error that occurs when multiprocessing is False)
  3. But if no botocore_session object is passed in, then we can just call _download_key and _list_bucket just with **session_kwargs, and they will both individually create their own boto3 session objects (thus avoiding any pickling error that may occur when multiprocessing is true yet no botocore session object is passed in)
  4. and if the user passes in botocore_session to iter_bucket but multiprocessing is true, we give a warning

This would help avoid the 'upload file' error that occurs when multiprocessing is False while also avoiding any unnecessary errors when multiprocessing is True (except, of course, when botocore_session is passed in).

We could also just add a botocore_session argument to iter_bucket itself. Let me know if this idea sounds good or if there could be a better way to do this.

@mpenkov
Copy link
Collaborator

mpenkov commented Dec 29, 2022

Yeah, that does sound a bit messy. There should be a better way. Give me a little while to think about it.

@mpenkov
Copy link
Collaborator

mpenkov commented Feb 22, 2024

I'm going to close this, because the use case described in the original issue has a work-around (disable multiprocessing).

Thank you for your work. In this particular case, supporting the original use case does not justify increasing the complexity of the iter_bucket function even further. I had a think about what the best solution here would be, but couldn't come up with anything.

If I end up refactoring iter_bucket in the future, I'll take this work into consideration.

@mpenkov mpenkov closed this Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using botocore_session parameter of s3.iter_bucket fails
2 participants