
Create separate compression-specific layer to enable writing gzipped files #91

Closed · tmylk opened this issue Oct 5, 2016 · 16 comments

tmylk (Contributor) commented Oct 5, 2016

Implement the solution described by mpenkov in #82 (comment)

tmylk changed the title Create separate compression-specific layer → Create separate compression-specific layer to enable writing gzipped files on Oct 5, 2016
mpenkov (Collaborator) commented Oct 5, 2016

Is anybody working on this? If not, I'll have a look at it.

tmylk (Contributor, Author) commented Oct 5, 2016

@mpenkov You will be very welcome!

mpenkov (Collaborator) commented Oct 7, 2016

@tmylk @piskvorky I've started working on this on a separate branch.

Basically, I rewrote the S3 subsystem using a hierarchy of classes based on the native io library. The S3 subsystem now returns file-like objects that can be passed to other decoders, etc. The existing tests, as well as ones I added, pass; things seem to work well. Almost.
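
To sketch the layering (names are illustrative, not the actual branch code): a raw reader subclasses io.RawIOBase, and the standard buffered wrappers turn it into an ordinary file-like object. The in-memory stream below stands in for the real network fetch.

```python
import io

class S3RawReader(io.RawIOBase):
    """Toy stand-in for the raw S3 layer: serves bytes from memory."""

    def __init__(self, data):
        self._stream = io.BytesIO(data)  # placeholder for the S3 download

    def readable(self):
        return True

    def readinto(self, buffer):
        chunk = self._stream.read(len(buffer))
        buffer[:len(chunk)] = chunk
        return len(chunk)

# The buffered wrapper is what callers see: a normal file-like object
# that can be handed to gzip, csv, or any other decoder.
reader = io.BufferedReader(S3RawReader(b'example payload'))
print(reader.read())  # b'example payload'
```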

gzip (using the native library) doesn't work with Python 2. The 2.x implementation tries to seek around the file, and AFAIK that just isn't possible with S3 (no random access). This is a real shame, since plugging the gzip library in is a one-liner.

The alternative is to write a separate decoder using the lower-level zlib, which we could use for Python 2.x only, keeping the native gzip for Python 3 since it is much more powerful. Please let me know what you think.
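
To make that concrete, here is a minimal sketch of what such a zlib-based decoder could look like (illustrative only, not branch code; `fileobj` stands for any sequential file-like object, e.g. the S3 reader above). The `16 + zlib.MAX_WBITS` trick makes zlib parse the gzip header itself, and it works on both 2.x and 3.x:

```python
import zlib

class GzipStreamDecoder(object):
    """Sequential gzip decoder built on zlib; never calls seek() or tell()."""

    def __init__(self, fileobj, chunk_size=16 * 1024):
        self._fileobj = fileobj        # raw compressed stream, e.g. from S3
        self._chunk_size = chunk_size
        self._buffer = b''
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        self._decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def read(self, size):
        while len(self._buffer) < size:
            chunk = self._fileobj.read(self._chunk_size)
            if not chunk:  # underlying stream exhausted
                self._buffer += self._decompressor.flush()
                break
            self._buffer += self._decompressor.decompress(chunk)
        result, self._buffer = self._buffer[:size], self._buffer[size:]
        return result
```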

bgreen-litl commented

👍

This issue is tagged as "easy". Is that true? Is @mpenkov's branch the best place to start on this?

mpenkov (Collaborator) commented Mar 27, 2017

It's reasonably easy. I don't know how far behind master my branch is, but I imagine it'd be a good place to start.

Working around the gzip issue is pretty much the only thing that's left, if I recall correctly @bgreen-litl

cariaso commented Jul 18, 2017

Just a nudge: I was bitten by this. In my case it was trivial to modify the filename so that .gz wasn't at the end, which is good enough for my needs, but this remains an unresolved wart on the smart_open API.

val314159 commented Jul 18, 2017

I wonder if this could be used as a backend to the FUSE filesystem library, so that you could "just" mount a smart_open drive as an actual filesystem (given you have permission, of course).

Bonus: This would allow any C++ lib to use it as well.

(btw, I've implemented multi-user encrypted filesystems w/ FUSE, so I know it can be done :)

mpenkov (Collaborator) commented Jul 20, 2017

@menshikh-iv Are you actively working on this? I've scheduled some time off in August and may have time to look into it.

menshikh-iv (Contributor) commented

@mpenkov you are welcome, feel free to contribute

mpenkov (Collaborator) commented Aug 3, 2017

@menshikh-iv I've had a look at it. The problem can be summarized as:

  • Python 2's gzip.GzipFile uses seek and tell to detect EOF. More specifically, it seeks to the end of the file to detect EOF. Obviously, this isn't something we can do when streaming from S3.
  • Python 3's gzip.GzipFile is a bit more clever. It uses zlib's decompressobj.eof flag to detect EOF, so no seeking/telling is necessary (see the sketch at the end of this comment).
  • It appears that we can't easily backport gzip.GzipFile from Python 3 and bundle it with smart_open, because it depends on a newer version of the zlib module, which in turn is implemented in C.

The above means that we should continue to use GzipStreamFile instead of gzip.GzipFile. I'm not sure how well this fits into the design that I proposed at the start of this issue - I need to think about it.

In the meantime, can someone please comment on the logic of the above? Is it really impossible to backport Python 3's gzip and bundle it with smart_open?
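
For reference, a minimal Python 3 illustration of the eof-based detection from the second bullet above (not smart_open code, just the underlying mechanism; zlib's Decompress.eof is available since 3.3):

```python
import gzip
import io
import zlib

payload = gzip.compress(b'hello smart_open')

# Feed the compressed bytes to zlib in deliberately tiny pieces;
# Decompress.eof flips to True once the end of the gzip stream is
# reached, so no seek()/tell() is ever needed.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
stream = io.BytesIO(payload)
output = b''
while not decompressor.eof:
    chunk = stream.read(4)
    if not chunk:
        break
    output += decompressor.decompress(chunk)

assert output == b'hello smart_open'
```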

mpenkov (Collaborator) commented Aug 10, 2017

The other way forward would be to use boto3 (sample solution in #42), because that's seekable, but we're currently blocked from doing that (#41).
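
For illustration, one way the seeking could work (a hedged sketch, not the #42 code: bucket/key are placeholders, and buffering and error handling are omitted). Each seek() simply restarts the download from the requested offset with an HTTP Range request:

```python
import boto3

class SeekableS3Reader(object):
    """Sketch: seek() restarts the download with an HTTP Range request."""

    def __init__(self, bucket, key):
        self._client = boto3.client('s3')
        self._bucket = bucket
        self._key = key
        self._position = 0
        self._body = self._open(0)

    def _open(self, offset):
        response = self._client.get_object(
            Bucket=self._bucket,
            Key=self._key,
            Range='bytes=%d-' % offset,
        )
        return response['Body']  # botocore StreamingBody

    def seek(self, offset):
        self._body.close()
        self._body = self._open(offset)
        self._position = offset
        return self._position

    def read(self, size=None):
        data = self._body.read(size)
        self._position += len(data)
        return data

    def tell(self):
        return self._position
```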

menshikh-iv (Contributor) commented

Hi @mpenkov, sorry for the late answer (I missed the notification).

In my opinion, boto3 is the best option (moreover, we should migrate to boto3 anyway); backporting gzip from Python 3 looks difficult. About #41: how does it block the boto3 migration?

mpenkov (Collaborator) commented Aug 14, 2017

@menshikh-iv Sorry, I linked the wrong issue; the blocker is #43.

AFAICT, the reason for blocking is that boto3 may not be entirely backwards-compatible with boto, although that was brought up a while ago and the situation may have changed already.

cariaso commented Aug 14, 2017 via email

mpenkov (Collaborator) commented Aug 14, 2017

@menshikh-iv OK. I will go ahead and bring in boto3 to implement S3 seeking.

menshikh-iv (Contributor) commented

Good luck @mpenkov 👍
