RFC/merge: Support for gzip functions in zlib #1355

Merged
merged 2 commits into from Oct 28, 2012

Projects

None yet

4 participants

@kmsquire
Member

I've been sitting on this for a while, and thought I would dust it off and commit it. It probably should be turned into a module.

Some notes:

  • This interface is only for gzipped files, not the streaming zlib compression interface. Internally, it depends on/uses the streaming interface, but the gzip related functions are higher level functions.
  • GZipStream is an implementation of IO and should be able to be used anywhere IO is used.
  • This implementation mimics the IOStream implementation, with some consequences:
    • There are some inefficiencies/awkwardness because of differences in the internals of ios and zlib (e.g., readuntil(), and read() for arrays oft BitKinds). Some suggestions for these two functions, in particular, would be greatly appreciated.
    • Unicode is supported in a very similar manner to ios, but it might be better to do so using icu.jl (CC: @nolta)
  • Currently missing documentation.

Thoughts: while wrapping zlib is fine, the impedance mismatch/overlap between ios and zlib is awkward. Ideally, it would be nice if these were integrated and/or one or both were moved into pure julia code. But probably won't happen right now.

Comments/suggestions appreciated.

Edit: since this is a large patch and might not be merged for a while, moved fix for #1377 to it's own pull request.

@JeffBezanson
Member

Given the state of the IO stuff, I'd say you did a good job here and made all the right choices.

@kmsquire
Member

Cool, thanks for reviewing! I'll address your comments and add some docs.

Kevin

@ViralBShah
Member

Perhaps #1377 can be addressed in this pull request.

@kmsquire
Member

I'll look into it. Might not get to it for a couple of days.

@kmsquire
Member

Okay, I was finally able to work through the gzip patch. The current version is squashed and rebased, and includes:

  • docs for gzip
  • A few additional tests for gzip
  • Fix for #1337 (zlib is now built with large file support by default)
  • minor changes to gzopen, gzdopen, to prevent files from being closed twice
  • changes addressing most of Jeff's original comments (which were, unfortunately, lost when I squashed... wish there was a way around that.)
  • I Included some minor fixes to documentation of other modules missing spaces in key places. (put in a separate pull request)
  • I removed BitsType test in function read{T<:Union(Int8,Uint8,Int16,Uint16,Int32,Uint32,Int64,Uint64,Int128,Uint128,Float32,Float64,Complex64,Complex128)}(s::IOStream, a::Array{T}), in io.jl, in response to Jeff's comment on the gzip version of that function.

Not addressed:

  • ~Jeff suggested generalizing write(s::GZipStream, c::Char) and read(s::GZipStream, ::Type{Char}) to work with any IO subtype. I attempted that, but if I included those definitions in io.jl, I received a @str not defined error while building. I'll put that into a separate pull request.~Taken care of in a separate pull request.

I think it's in good shape. Let me know of any concerns, otherwise feel free to merge.

Cheers!

@ViralBShah
Member

This can't be automatically merged now. Perhaps a rebase is needed?

@kmsquire
Member

Should be good now.

@ViralBShah
Member

Will let @JeffBezanson pull the trigger, since he has reviewed this.

@kmsquire
Member

Just saw that @carlobaldassi implemented some of this functionality in zlib.jl last week...

@carlobaldassi
Member

I just saw that as well... it was actually test code committed by accident, which I have already reverted. Sorry!

@kmsquire
Member

No worries!

@carlobaldassi
Member

@kmsquire, can you fix this? Github says it cannot be merged automatically anymore (I suppose this has to do with 5b2727e).

Apart from that, I'd like to see this merged. (I'd also like the more low-level functions to be exported, or at least gzread, which I'm using in parser to achieve as much performance as possible.)

@kmsquire kmsquire Support for gzip functions in zlib.
(Moved fix for #1377 to it's own pull request.)
469ab1e
@kmsquire
Member

Thanks for the heads up, @carlobaldassi. It was actually test/Makefile. Should be good to go. I don't know if you want to wait for @JeffBezanson or commit it yourself.

Regarding the low-level functions: they can certainly be exposed, but I'm wondering if it's a good idea... there are other compression libraries (bzip2 and xz), and exposing the low-level interfaces here might encourage their use and would make it that much harder to use the libraries interchangeably (assuming they are wrapped/implemented in the future). I would also love to someday see a pure julia port replace this (e.g., with parallel compression).

Of course, you can always access them with GZip.gz* or (similar to what you pointed out to me the other day) do

import GZip
gzread = GZip.gzread

What do you think?

(I'm actually quite curious about the parser you're working on, as I've been thinking about parsing in julia a bit recently. Anything you can share?)

@carlobaldassi
Member

Thanks! I think I'll still wait to see if there are objections, otherwise I'll merge this in the next days.

About exposing the low-lever functions, I don't think there's any harm in that: if one wants to write a high-level generic interface, he still can, but in this case buffered reads are more efficient, so a low-level call exploiting that is useful. I also feel uncomfortable in using not-explicitly-exported functions, since to me it means they are for internal use and therefore they might not be stable API. But perhaps in time there could be a more fine-grained specification (possibly based on some conventions) where some part of the API is documented but not exported.

The parser I've been writing is not very exciting unfortunately: I just need to scan FASTA files very quickly while keeping them compressed on disk (some of them tend to be quite large), but the format is trivial (I have no intention of doing anything fancy for now, just read out header-sequence pairs one at a time). I was thinking of proposing it as a package once there are some packaging examples to refer to.

@kmsquire
Member

About exposing the low-lever functions, I don't think there's any harm in
that: if one wants to write a high-level generic interface, he still can,
but in this case buffered reads are more efficient, so a low-level call
exploiting that is useful. I also feel uncomfortable in using
not-explicitly-exported functions, since to me it means they are for
internal use and therefore they might not be stable API. But perhaps in
time there could be a more fine-grained specification (possibly based on
some conventions) where some part of the API is documented but not exported.

In this case, I'm assuming you're wanting to buffer the reads yourself for
the parser, since zlib already buffers reads internally? In case you
didn't notice, I added a parameter to gzopen() to increase that buffer
size--zlib's default is 8Kb, but 64 or 128Kb would be better for your (and
most other) application, whether or not you buffer yourself.

It would also seem to me that implementing something like

read!{T<:...}(s::IO, Array{T}, size)

or maybe

read!{T<:...}(s::IO, Ptr{T}, size)

would still be better, since it would allow the code to work with
uncompressed files or (in the future) files compressed with other methods.

But anyway, I can expose the lower level functions.

The parser I've been writing is not very exciting unfortunately: I just
need to scan FASTA http://en.wikipedia.org/wiki/FASTA_format files very
quickly while keeping them compressed on disk (some of them tend to be
quite large), but the format is trivial (I have no intention of doing
anything fancy for now, just read out header-sequence pairs one at a time).
I was thinking of proposing it as a package once there are some packaging
examples to refer to.

Actually, what's cool depends on your perspective: I work in
bioinformatics! ;-) (Although if you have to pick a bioinformatics format,
fasta is about as unsexy as you can get...)

I'd actually like to have something like ragel (
http://www.complang.org/ragel/) (with https://github.com/kortschak/bioragel)
to produce julia parsing code, preferably inline. Maybe I should clone
myself first....

[pao: Markdown cleanup...or not]

@carlobaldassi
Member

@kmsquire (and anyone interested), if you want to check out the FASTA parser I have uploaded it here. It's inspired to kseq.h. I tried to make it as fast as I could and I ended up needing gzread, but I don't remember all the details and I'm in a bit of a hurry now.

@carlobaldassi
Member

Thanks @kmsquire, I'm merging this.

@carlobaldassi carlobaldassi merged commit d58c5c1 into JuliaLang:master Oct 28, 2012
@JeffBezanson

You want mktempdir

@JeffBezanson

Is it possible to get rid of these and just use the ones in io.jl?

Member

Done in #1464.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment