Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create separate compression-specific layer to enable writing gzipped files #131

Merged
merged 26 commits into from Oct 26, 2017
Merged

Create separate compression-specific layer to enable writing gzipped files #131

merged 26 commits into from Oct 26, 2017

Conversation

mpenkov
Copy link
Collaborator

@mpenkov mpenkov commented Aug 17, 2017

This PR addresses issue #91. There are two major parts to this PR.

First, I rewrote all the lower-level S3 stuff using boto3 in a separate module, smart_open/s3.py. he reader and writer inherit from the io classes in the standard library, meaning that you can pretty much treat them like local file streams and pass them to codecs like gzip, bz2, etc.

Second, I integrated this module into smart_open_lib.py. This part required a rebase: I started working on this over a year ago, and much has happened since then. The part I couldn't handle well during the rebase was: there was a new s3u mode added, with code that depends on the old boto library. I couldn't port this part to boto3 - this functionality may be discontinued. The code is isolated in the setup_unsecured_mode function and is currently commented out.

This takes care of #91 for S3 only. In the future, we should apply the same approach to other protocols like HDFS, more specifically: make sure readers and writers inherit from stdlib io classes.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Aug 17, 2017

I forgot to mention that this also resolves #42 (s3 seek), and partially resolves #43 (starts boto3 migration).

I'm not 100% sure, but it should also resolve #64 and #74.

AttributeError: 'module' object has no attribute 'test_smart_open'
@piskvorky
Copy link
Owner

piskvorky commented Aug 17, 2017

Epic! @menshikh-iv let's review ASAP.

We should probably clean up and merge the smaller PRs first, since this will be a major update and release.

@mpenkov what kind of backward incompatibilities do you expect here?

@mpenkov
Copy link
Collaborator Author

mpenkov commented Aug 17, 2017

@piskvorky I don't really understand the s3u stuff that was added in the meanwhile. Since that's currently effectively disabled in this branch, there is a chance of an incompatibility there.

Other than that, I don't expect these changes to break anything - they still pass pretty much the same tests and work the same way.

# Conflicts:
#	requirements-test.txt
#	smart_open/smart_open_lib.py
#	smart_open/tests/test_gzipstreamfile.py
#	smart_open/tests/test_smart_open.py
Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @mpenkov, sorry for very long waiting, I resolved conflicts with master, you can check it
8ea8387, looks good for me

smart_open/s3.py Outdated
#
# https://github.com/matthewwithanm/django-imagekit/commit/0d5bfe37517955cca2284735769fe1dffb38ed37
#
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to remove (we dropped 2.6 support)

smart_open/s3.py Outdated
import six


_LOGGER = logging.getLogger(__name__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use logger (everywhere)


return self._read_from_buffer(size)

def read1(self, size=-1):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this needed here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's part of the interface for BufferedIOBase.

"""This is the same as read()."""
return self.read(size=size)

def readinto(self, b):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very strange, why we need this method with mutable b?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. We try to behave like a BufferedIOBase, which defines this method.

Some application downstream of smart_open may depend on this functionality.

return len(data)

def terminate(self):
"""Do nothing."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add pass

@mpenkov
Copy link
Collaborator Author

mpenkov commented Oct 21, 2017

@menshikh-iv I made changes in response to your review. I left two items as they are because they are required to satisfy interface contracts. Let me know if I need to do anything else. Thanks!

@menshikh-iv
Copy link
Contributor

Thanks for clarification about BufferIOBase, great work, big thanks for your contribution @mpenkov 🔥 🔥 🔥

@menshikh-iv menshikh-iv merged commit 5f3b6fa into piskvorky:master Oct 26, 2017
@mpenkov mpenkov deleted the issue91-rebase branch December 21, 2017 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants