-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GZipBufferedStream for buffered reading #32
Conversation
@@ -0,0 +1,57 @@ | |||
immutable GZipBufferedStream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<: IO
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. What does the IO interface look like?
Shouldn't the buffered version just be the default? |
@jakebolewski yes it would be nice to make the buffered stream the default eventually. The interface is not complete though. (I didn't implement buffered writes) |
Have you seen http://www.htslib.org/benchmarks/zlib.html? |
On one hand, this is definitely a good direction to go. I would think/hope with some work we could reach gunzip speeds, but I haven't looked at this code in a long time, so I don't know where the gains would come from. Rather than a separate name, I'd prefer something like a keyword parameter to For kicks, it would be good to test against the Zlib Stream API. |
I'd also recommend just making |
@stevengj @jakebolewski I'd like to merge this functionality as is and wait until we get buffered writes (#23) before we get rid of the current |
I like @kmsquire's idea of the keyword parameter to open/gzopen. It would probably help with the transition to buffered I/O. |
a60dcbc
to
fd96d9a
Compare
Adding to the API only to remove it seems weird, why not just write the buffered |
It's not a priority for me now. You're more than welcome to implement it :) |
Introduce new buffered keywords to gzopen, open, and gzdopen that create GZipBufferedStreams
I've profiled this code and it looks like ~40% of the execution time is spent in the bowels of Here is a reimplementation using a plain using GZip
type GZBufferedStream <: IO
io::GZipStream
buf::Vector{UInt8}
len::Int
ptr::Int
function GZBufferedStream(io::GZipStream)
buf = Array(UInt8, io.buf_size)
len = ccall((:gzread, GZip._zlib), Int32,
(Ptr{Void}, Ptr{Void}, UInt32), io.gz_file, buf, io.buf_size)
new(io, buf, len, 1)
end
end
Base.close(io::GZBufferedStream) = close(io.io)
@inline function Base.read(io::GZBufferedStream, ::Type{UInt8})
c = io.buf[io.ptr]
io.ptr += 1
if io.ptr == io.len+1 #No more data
io.len = ccall((:gzread, GZip._zlib), Int32,
(Ptr{Void}, Ptr{Void}, UInt32), io.io.gz_file, io.buf, io.io.buf_size)
io.ptr = 1
end
c
end
Base.eof(io::GZBufferedStream) = io.len == 0
function bench()
io = GZBufferedStream(GZip.open("random.gz", "rb"))
thischar = 0x00
n = 0
while !eof(io)
thischar = read(io, UInt8)
n += 1
end
close(io)
println(n)
thischar
end
bench()
#random contains 10^8 rand(UInt8)s
@time run(`gunzip -k -f random.gz`) #255 ms
@time bench() #168 ms |
Nice! |
@jiahao why the close? |
We also have Zlib.jl that looks like it does buffered reads / writes. It seems silly to have 2 packages with duplicated functionality. |
@quinnj I was going to prepare a new PR based on the code snippet here: #32 (comment) It will take a significant rewrite of GZip.jl though. @jakebolewski you're echoing #23. |
Cool. I may take a stab at it in the next week or two since I want to really integrate with this in the CSV/ODBC/SQLite packages. |
I just benchmarked |
Hmm, that seems suspiciously close to the unbuffered timings you posted above. |
I'm not sure why that is the case... |
Also introduces new
gzbopen()
function that generatesGZipBufferedStream
s.read
ing a character stream from aGZipBufferedStream
is ~6x faster than from an ordinaryGZipStream
. The speedup comes from managing an internalIOBuffer
and populating it withgzread()
, rather than making a call to the unbufferedgzgetc()
each time a character is sought.Here are some benchmark numbers for how the performance varies with buffer size. 0 = ordinary
GZipStream
, sys = system time forgunzip
. 2^12 = 8192 is the default buffer size and 2^17 = 131072 is the "big" buffer size used by GZip.(
gunzip
is still 2.8x faster though...)Benchmark code: