Add GZipBufferedStream for buffered reading #32

Closed · wants to merge 1 commit

Conversation

@jiahao (Contributor) commented Jul 28, 2015

Also introduces a new gzbopen() function that creates GZipBufferedStreams.

Reading a character stream from a GZipBufferedStream is ~6x faster than from an ordinary GZipStream. The speedup comes from maintaining an internal IOBuffer and populating it with gzread(), rather than making a call to the unbuffered gzgetc() each time a character is requested.
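
In rough outline, the buffered read path works like this (a minimal sketch of the idea, not the actual code in this PR; the BufferedGZ name and refill! helper are illustrative):

# Minimal sketch of the buffered-read idea (illustrative only, not this PR's code).
# Decompressed bytes are pulled in chunks with gzread() and parked in an IOBuffer;
# read(io, UInt8) then serves bytes from memory instead of calling into zlib each time.
using GZip

type BufferedGZ <: IO                    # hypothetical name
    gz::GZipStream
    buf::IOBuffer                        # current chunk of decompressed bytes
    chunk::Vector{UInt8}                 # scratch space handed to gzread()
end

BufferedGZ(gz::GZipStream, bufsize::Integer=8192) =
    BufferedGZ(gz, IOBuffer(), Array(UInt8, bufsize))

function refill!(io::BufferedGZ)
    n = ccall((:gzread, GZip._zlib), Int32,
              (Ptr{Void}, Ptr{Void}, UInt32),
              io.gz.gz_file, io.chunk, length(io.chunk))
    io.buf = IOBuffer(io.chunk[1:n])     # replace the exhausted buffer (copies n bytes)
    n
end

function Base.read(io::BufferedGZ, ::Type{UInt8})
    eof(io.buf) && refill!(io)
    read(io.buf, UInt8)
end

Base.eof(io::BufferedGZ) = eof(io.buf) && refill!(io) == 0
Base.close(io::BufferedGZ) = close(io.gz)

The PR's gzbopen() presumably wires something like this up behind the familiar open-style interface.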

Here are some benchmark numbers showing how performance varies with buffer size. In the table, 0 = ordinary GZipStream and sys = system time for gunzip; 2^12 is the default buffer size and 2^17 = 131072 is the "big" buffer size used by GZip.

(gunzip is still 2.8x faster though...)

log2(buf_size)           time (s)
 0                        9.434
12 (default)              1.484
13                        1.505
14                        1.504
15                        1.479
16                        1.504
17                        1.497
18                        1.481
19                        1.496
20                        1.497
21                        1.487
22                        1.467
23                        1.470
24                        1.442
25                        1.417
26                        1.376
27                        1.310
28 (entire file fits)     1.264
sys                       0.519

Benchmark code:

using GZip

function benchmark(nreps=10)
    filename = "ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz"
    isfile(filename) ||
        download("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz", filename)

    timings = Dict{Int,Vector}()
    for i = 1:nreps
        # Key 0: ordinary (unbuffered) GZipStream
        gzopen(filename) do io
            n = 0
            t = @elapsed while true
                read(io, UInt8)
                n += 1
                eof(io) && break
            end
            timings[0] = push!(get(timings, 0, Float64[]), t)
        end

        # Keys 12..28: GZipBufferedStream with a 2^sexp-byte buffer
        for sexp in 28:-1:12
            println("Buffer size: ", 2^sexp)
            n = 0
            io = gzbopen(filename, "rb", 2^sexp)
            t = @elapsed while true
                read(io, UInt8)
                n += 1
                eof(io) && break
            end
            timings[sexp] = push!(get(timings, sexp, Float64[]), t)
            close(io)
        end

        # Report the best time seen so far for each buffer size
        for k in sort!(collect(keys(timings)))
            println(k, '\t', minimum(timings[k]))
        end
    end
end

@@ -0,0 +1,57 @@
immutable GZipBufferedStream

Review comment:
<: IO ?

@jiahao (Contributor, author) replied:

Yes. What does the IO interface look like?

@jakebolewski (Contributor)

Shouldn't the buffered version just be the default?

@jiahao (Contributor, author) commented Jul 28, 2015

@jakebolewski Yes, it would be nice to make the buffered stream the default eventually. The interface is not complete, though (I didn't implement buffered writes).

@jakebolewski (Contributor)

Have you seen http://www.htslib.org/benchmarks/zlib.html?

@kmsquire (Contributor)

On one hand, this is definitely a good direction to go. I would think/hope with some work we could reach gunzip speeds, but I haven't looked at this code in a long time, so I don't know where the gains would come from.

Rather than a separate name, I'd prefer something like a keyword parameter to open / gzopen (e.g., buffered=true).

For kicks, it would be good to test against the Zlib Stream API.
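
To make the keyword suggestion concrete, one hypothetical shape for it could be the following (gzopen_buffered and the GZipBufferedStream(::GZipStream) wrapper constructor are illustrative assumptions, not GZip.jl's actual API):

# Hypothetical sketch of the buffered= keyword idea.
function gzopen_buffered(fname::AbstractString, mode::AbstractString="rb",
                         buf_size::Integer=8192; buffered::Bool=true)
    s = gzopen(fname, mode, buf_size)        # the existing unbuffered open
    buffered ? GZipBufferedStream(s) : s     # wrap only when asked
end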

@stevengj (Member)

I'd also recommend just making GZipStream buffered. There's no point in having two APIs, one slow and one fast, if the fast one can support all of the same functionality.

@jiahao (Contributor, author) commented Jul 29, 2015

@stevengj @jakebolewski I'd like to merge this functionality as is and wait until we get buffered writes (#23) before we get rid of the current GZipStream.

@jiahao (Contributor, author) commented Jul 29, 2015

I like @kmsquire's idea of the keyword parameter to open/gzopen. It would probably help with the transition to buffered I/O.

@jiahao force-pushed the cjh/bufread branch 2 times, most recently from a60dcbc to fd96d9a on July 29, 2015 at 02:04
@jakebolewski (Contributor)

Adding to the API only to remove it later seems weird; why not just write the buffered write? It will be ~30 lines of code.

@jiahao (Contributor, author) commented Jul 29, 2015

It's not a priority for me now. You're more than welcome to implement it :)

Commit: "Introduce new buffered keywords to gzopen, open, and gzdopen that create GZipBufferedStreams"
@jiahao (Contributor, author) commented Jul 29, 2015

I've profiled this code and it looks like ~40% of the execution time is spent in the bowels of iobuffer.jl. Inlining the byte read (see JuliaLang/julia#12364) may cut down the overhead somewhat, but it looks like IOBuffer is too heavy an abstraction for this problem.

Here is a reimplementation using a plain Vector{UInt8} that runs in about one-third less time than gunzip (168 ms vs. 255 ms, below):

using GZip

type GZBufferedStream <: IO
    io::GZipStream
    buf::Vector{UInt8}
    len::Int
    ptr::Int

    function GZBufferedStream(io::GZipStream)
        buf = Array(UInt8, io.buf_size)

        len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.gz_file, buf, io.buf_size)

        new(io, buf, len, 1)
    end
end

Base.close(io::GZBufferedStream) = close(io.io)

@inline function Base.read(io::GZBufferedStream, ::Type{UInt8})
    c = io.buf[io.ptr]
    io.ptr += 1

    if io.ptr == io.len+1 # no more data left in the buffer; refill it from zlib
        io.len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.io.buf_size)
        io.ptr = 1
    end
    c
end

Base.eof(io::GZBufferedStream) = io.len == 0

function bench()
    io = GZBufferedStream(GZip.open("random.gz", "rb"))

    thischar = 0x00
    n = 0
    while !eof(io)
        thischar = read(io, UInt8)
        n += 1
    end
    close(io)

    println(n)
    thischar
end


bench()

# random.gz contains 10^8 rand(UInt8) bytes
@time run(`gunzip -k -f random.gz`)  # 255 ms
@time bench()                        # 168 ms
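
The write side discussed earlier (buffered writes, #23) is not implemented here; for reference, a write path in the same style as the read path above might look roughly like this (a sketch only; it assumes the underlying stream was opened for writing and reuses buf/ptr as a fill buffer):

# Sketch only: a possible write-side counterpart to GZBufferedStream above.
# Assumes the wrapped GZipStream was opened for writing and that buf/ptr here
# track the fill position rather than the read position.
@inline function Base.write(io::GZBufferedStream, c::UInt8)
    io.buf[io.ptr] = c
    io.ptr += 1
    io.ptr == length(io.buf) + 1 && flush(io)   # buffer full: hand it to zlib
    1
end

function Base.flush(io::GZBufferedStream)
    n = io.ptr - 1
    n > 0 && ccall((:gzwrite, GZip._zlib), Int32,
                   (Ptr{Void}, Ptr{Void}, UInt32),
                   io.io.gz_file, io.buf, n)
    io.ptr = 1
    nothing
end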

@jiahao closed this Jul 29, 2015
@kmsquire (Contributor)

Nice!

@quinnj (Member) commented Aug 7, 2015

@jiahao why the close?

@jakebolewski (Contributor)

We also have Zlib.jl, which looks like it does buffered reads/writes. It seems silly to have two packages with duplicated functionality.

@jiahao (Contributor, author) commented Aug 10, 2015

@quinnj I was going to prepare a new PR based on the code snippet here: #32 (comment)

It will take a significant rewrite of GZip.jl though.

@jakebolewski you're echoing #23.

@quinnj (Member) commented Aug 10, 2015

Cool. I may take a stab at it in the next week or two since I want to really integrate with this in the CSV/ODBC/SQLite packages.

@jiahao (Contributor, author) commented Aug 10, 2015

I just benchmarked Zlib.Reader for the same random data and it took 9.38 seconds.
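
(That comparison presumably looked something like the following; this is a hypothetical reconstruction, since the exact call is not shown in the thread, and it assumes Zlib.Reader wraps an already-open IO handle over the gzip file.)

# Hypothetical reconstruction of the Zlib.jl comparison; Zlib.Reader's
# constructor and close() behavior are assumed, not taken from this thread.
using Zlib

function bench_zlib(filename="random.gz")
    io = Zlib.Reader(open(filename, "r"))
    n = 0
    while !eof(io)
        read(io, UInt8)
        n += 1
    end
    close(io)
    n
end

@time bench_zlib()   # reported above as ~9.38 s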

@jakebolewski (Contributor)

Hmm, that seems suspiciously close to the unbuffered timings you posted above.

@jiahao (Contributor, author) commented Aug 10, 2015

I'm not sure why that is the case...

@dcjones mentioned this pull request Aug 20, 2015