Add GZipBufferedStream for buffered reading #32

Closed · wants to merge 1 commit

Conversation

@jiahao (Contributor) commented Jul 28, 2015

Also introduces a new gzbopen() function that creates GZipBufferedStreams.

Reading a character stream from a GZipBufferedStream is ~6x faster than from an ordinary GZipStream. The speedup comes from maintaining an internal IOBuffer and populating it with gzread(), rather than making a call to the unbuffered gzgetc() each time a character is requested.
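
In rough outline, the buffered read path works like this (a minimal sketch of the idea, not the actual code in this PR; the BufferedGZ name and refill! helper are illustrative):

# Minimal sketch of the buffered-read idea (illustrative only, not this PR's code).
# Decompressed bytes are pulled in chunks with gzread() and parked in an IOBuffer;
# read(io, UInt8) then serves bytes from memory instead of calling into zlib each time.
using GZip

type BufferedGZ <: IO                    # hypothetical name
    gz::GZipStream
    buf::IOBuffer                        # current chunk of decompressed bytes
    chunk::Vector{UInt8}                 # scratch space handed to gzread()
end

BufferedGZ(gz::GZipStream, bufsize::Integer=8192) =
    BufferedGZ(gz, IOBuffer(), Array(UInt8, bufsize))

function refill!(io::BufferedGZ)
    n = ccall((:gzread, GZip._zlib), Int32,
              (Ptr{Void}, Ptr{Void}, UInt32),
              io.gz.gz_file, io.chunk, length(io.chunk))
    io.buf = IOBuffer(io.chunk[1:n])     # replace the exhausted buffer (copies n bytes)
    n
end

function Base.read(io::BufferedGZ, ::Type{UInt8})
    eof(io.buf) && refill!(io)
    read(io.buf, UInt8)
end

Base.eof(io::BufferedGZ) = eof(io.buf) && refill!(io) == 0
Base.close(io::BufferedGZ) = close(io.gz)

The PR's gzbopen() presumably wires something like this up behind the familiar open-style interface.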

Here are some benchmark numbers showing how performance varies with buffer size. In the table, 0 = ordinary GZipStream and sys = system time for gunzip; 2^12 is the default buffer size and 2^17 = 131072 is the "big" buffer size used by GZip.

(gunzip is still 2.8x faster though...)

log2(buf_size)           time (s)
 0                        9.434
12 (default)              1.484
13                        1.505
14                        1.504
15                        1.479
16                        1.504
17                        1.497
18                        1.481
19                        1.496
20                        1.497
21                        1.487
22                        1.467
23                        1.470
24                        1.442
25                        1.417
26                        1.376
27                        1.310
28 (entire file fits)     1.264
sys                       0.519

Benchmark code:

using GZip

function benchmark(nreps=10)
    filename = "ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz"
    isfile(filename) ||
        download("ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v1a.20130502.genotypes.vcf.gz", filename)

    timings = Dict{Int,Vector}()
    for i = 1:nreps
        # Key 0: ordinary (unbuffered) GZipStream
        gzopen(filename) do io
            n = 0
            t = @elapsed while true
                read(io, UInt8)
                n += 1
                eof(io) && break
            end
            timings[0] = push!(get(timings, 0, Float64[]), t)
        end

        # Keys 12..28: GZipBufferedStream with a 2^sexp-byte buffer
        for sexp in 28:-1:12
            println("Buffer size: ", 2^sexp)
            n = 0
            io = gzbopen(filename, "rb", 2^sexp)
            t = @elapsed while true
                read(io, UInt8)
                n += 1
                eof(io) && break
            end
            timings[sexp] = push!(get(timings, sexp, Float64[]), t)
            close(io)
        end

        # Report the best time seen so far for each buffer size
        for k in sort!(collect(keys(timings)))
            println(k, '\t', minimum(timings[k]))
        end
    end
end

@@ -0,0 +1,57 @@
immutable GZipBufferedStream

Review comment:
<: IO ?

@jiahao (Contributor, author) replied:

Yes. What does the IO interface look like?

@jakebolewski (Contributor)

Shouldn't the buffered version just be the default?

@jiahao (Contributor, author) commented Jul 28, 2015

@jakebolewski Yes, it would be nice to make the buffered stream the default eventually. The interface is not complete, though (I didn't implement buffered writes).

@jakebolewski (Contributor)

Have you seen http://www.htslib.org/benchmarks/zlib.html?

@kmsquire (Contributor)

On one hand, this is definitely a good direction to go. I would think/hope with some work we could reach gunzip speeds, but I haven't looked at this code in a long time, so I don't know where the gains would come from.

Rather than a separate name, I'd prefer something like a keyword parameter to open / gzopen (e.g., buffered=true).

For kicks, it would be good to test against the Zlib Stream API.
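
To make the keyword suggestion concrete, one hypothetical shape for it could be the following (gzopen_buffered and the GZipBufferedStream(::GZipStream) wrapper constructor are illustrative assumptions, not GZip.jl's actual API):

# Hypothetical sketch of the buffered= keyword idea.
function gzopen_buffered(fname::AbstractString, mode::AbstractString="rb",
                         buf_size::Integer=8192; buffered::Bool=true)
    s = gzopen(fname, mode, buf_size)        # the existing unbuffered open
    buffered ? GZipBufferedStream(s) : s     # wrap only when asked
end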

@stevengj (Member)

I'd also recommend just making GZipStream buffered. There's no point in having two APIs, one slow and one fast, if the fast one can support all of the same functionality.

@jiahao (Contributor, author) commented Jul 29, 2015

@stevengj @jakebolewski I'd like to merge this functionality as is and wait until we get buffered writes (#23) before we get rid of the current GZipStream.

@jiahao (Contributor, author) commented Jul 29, 2015

I like @kmsquire's idea of the keyword parameter to open/gzopen. It would probably help with the transition to buffered I/O.

@jiahao force-pushed the cjh/bufread branch 2 times, most recently from a60dcbc to fd96d9a on July 29, 2015 at 02:04
@jakebolewski (Contributor)

Adding to the API only to remove it later seems weird; why not just write the buffered write? It will be ~30 lines of code.

@jiahao (Contributor, author) commented Jul 29, 2015

It's not a priority for me now. You're more than welcome to implement it :)

Commit: "Introduce new buffered keywords to gzopen, open, and gzdopen that create GZipBufferedStreams"
@jiahao (Contributor, author) commented Jul 29, 2015

I've profiled this code and it looks like ~40% of the execution time is spent in the bowels of iobuffer.jl. Inlining the byte read (see JuliaLang/julia#12364) may cut down the overhead somewhat, but it looks like IOBuffer is too heavy an abstraction for this problem.

Here is a reimplementation using a plain Vector{UInt8} that runs in about one-third less time than gunzip (168 ms vs. 255 ms, below):

using GZip

type GZBufferedStream <: IO
    io::GZipStream
    buf::Vector{UInt8}
    len::Int
    ptr::Int

    function GZBufferedStream(io::GZipStream)
        buf = Array(UInt8, io.buf_size)

        len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.gz_file, buf, io.buf_size)

        new(io, buf, len, 1)
    end
end

Base.close(io::GZBufferedStream) = close(io.io)

@inline function Base.read(io::GZBufferedStream, ::Type{UInt8})
    c = io.buf[io.ptr]
    io.ptr += 1

    if io.ptr == io.len+1 # no more data left in the buffer; refill it from zlib
        io.len = ccall((:gzread, GZip._zlib), Int32,
            (Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.io.buf_size)
        io.ptr = 1
    end
    c
end

Base.eof(io::GZBufferedStream) = io.len == 0

function bench()
    io = GZBufferedStream(GZip.open("random.gz", "rb"))

    thischar = 0x00
    n = 0
    while !eof(io)
        thischar = read(io, UInt8)
        n += 1
    end
    close(io)

    println(n)
    thischar
end


bench()

# random.gz contains 10^8 rand(UInt8) bytes
@time run(`gunzip -k -f random.gz`)  # 255 ms
@time bench()                        # 168 ms
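
The write side discussed earlier (buffered writes, #23) is not implemented here; for reference, a write path in the same style as the read path above might look roughly like this (a sketch only; it assumes the underlying stream was opened for writing and reuses buf/ptr as a fill buffer):

# Sketch only: a possible write-side counterpart to GZBufferedStream above.
# Assumes the wrapped GZipStream was opened for writing and that buf/ptr here
# track the fill position rather than the read position.
@inline function Base.write(io::GZBufferedStream, c::UInt8)
    io.buf[io.ptr] = c
    io.ptr += 1
    io.ptr == length(io.buf) + 1 && flush(io)   # buffer full: hand it to zlib
    1
end

function Base.flush(io::GZBufferedStream)
    n = io.ptr - 1
    n > 0 && ccall((:gzwrite, GZip._zlib), Int32,
                   (Ptr{Void}, Ptr{Void}, UInt32),
                   io.io.gz_file, io.buf, n)
    io.ptr = 1
    nothing
end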

@jiahao closed this Jul 29, 2015
@kmsquire (Contributor)

Nice!

@quinnj (Member) commented Aug 7, 2015

@jiahao why the close?

@jakebolewski (Contributor)

We also have Zlib.jl, which looks like it does buffered reads/writes. It seems silly to have two packages with duplicated functionality.

@jiahao (Contributor, author) commented Aug 10, 2015

@quinnj I was going to prepare a new PR based on the code snippet here: #32 (comment)

It will take a significant rewrite of GZip.jl though.

@jakebolewski you're echoing #23.

@quinnj (Member) commented Aug 10, 2015

Cool. I may take a stab at it in the next week or two since I want to really integrate with this in the CSV/ODBC/SQLite packages.

@jiahao (Contributor, author) commented Aug 10, 2015

I just benchmarked Zlib.Reader for the same random data and it took 9.38 seconds.
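
(That comparison presumably looked something like the following; this is a hypothetical reconstruction, since the exact call is not shown in the thread, and it assumes Zlib.Reader wraps an already-open IO handle over the gzip file.)

# Hypothetical reconstruction of the Zlib.jl comparison; Zlib.Reader's
# constructor and close() behavior are assumed, not taken from this thread.
using Zlib

function bench_zlib(filename="random.gz")
    io = Zlib.Reader(open(filename, "r"))
    n = 0
    while !eof(io)
        read(io, UInt8)
        n += 1
    end
    close(io)
    n
end

@time bench_zlib()   # reported above as ~9.38 s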

@jakebolewski (Contributor)

Hmm, that seems suspiciously close to the unbuffered timings you posted above.

@jiahao (Contributor, author) commented Aug 10, 2015

I'm not sure why that is the case...

@dcjones mentioned this pull request Aug 20, 2015