
in-place transcode when output size is known #134

Closed
wants to merge 2 commits

Conversation


@svilupp svilupp commented Mar 9, 2023

Proposal for #132

This is an initial proposal to tackle in-place decoding by providing the correctly-sized buffer.

It would be hugely beneficial for the Arrow format, which is the workhorse of modern data analytics (even Pandas is moving to it as a backend). I've done a quick benchmark and Arrow.jl is among the slowest parsers, especially with compressed files. Based on profiling, the time spent resizing buffers in transcode() is roughly equal to the time spent on the decoding itself!
Thanks to the Arrow IPC file specification, we always know the output size of a field; however, at the moment we cannot take advantage of it.
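To make the "known output size" point concrete: in the Arrow IPC format, each compressed buffer is prefixed with its uncompressed length as a little-endian 64-bit integer, so a reader can size the destination exactly before decoding. A minimal sketch (the helper name is hypothetical, not part of this PR or Arrow.jl):

```julia
# Hypothetical helper, not part of this PR or Arrow.jl: read the
# uncompressed-length prefix of an Arrow IPC compressed buffer.
# The prefix is a little-endian Int64, which matches host byte order
# on most modern platforms (a big-endian host would need a byte swap).
function uncompressed_length(buf::Vector{UInt8})
    return reinterpret(Int64, buf[1:8])[1]
end

# Example: a buffer whose length prefix declares 300 uncompressed bytes.
prefix = collect(reinterpret(UInt8, Int64[300]))
@assert uncompressed_length(prefix) == 300
```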

This PR defines a new function, transcode!, that writes into a user-provided Buffer.
The benefits are quantified in the code snippet below (e.g., roughly 3x faster processing with large texts).
I haven't yet explored the failure modes; I just wanted to start the discussion.

using TranscodingStreams, CodecZstd
using BenchmarkTools


text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean sollicitudin
mauris non nisi consectetur, a dapibus urna pretium. Vestibulum non posuere
erat. Donec luctus a turpis eget aliquet. Cras tristique iaculis ex, eu
malesuada sem interdum sed. Vestibulum ante ipsum primis in faucibus orci luctus
et ultrices posuere cubilia Curae; Etiam volutpat, risus nec gravida ultricies,
erat ex bibendum ipsum, sed varius ipsum ipsum vitae dui.
"""

# Array API
decompressor = ZstdDecompressor()
compressed = transcode(ZstdCompressor, text)
@assert sizeof(compressed) < sizeof(text)
@assert transcode(decompressor, compressed) == Vector{UInt8}(text)

# Proposed in-place API, writes directly to output buffer.
output_length = ncodeunits(text)
output = TranscodingStreams.Buffer(output_length)
TranscodingStreams.transcode!(decompressor, compressed, output);
@assert output.data == Vector{UInt8}(text)


# Benchmark on smaller text
compressed = transcode(ZstdCompressor, text)
output = TranscodingStreams.Buffer(ncodeunits(text))
@btime transcode($decompressor, $compressed);
# 2.120 μs (4 allocations: 736 bytes)

@btime TranscodingStreams.transcode!($decompressor, $compressed, output) evals=1 setup=(output = deepcopy($output));
# 2.041 μs (2 allocations: 64 bytes)

# Benchmark on longer text
longtext = text^100
compressed = transcode(ZstdCompressor, longtext)
@btime transcode($decompressor, $compressed);
# 10.167 μs (10 allocations: 232.92 KiB)

@btime output = TranscodingStreams.Buffer(ncodeunits($longtext));
# 443.409 ns (3 allocations: 43.47 KiB)
@btime TranscodingStreams.transcode!($decompressor, $compressed, output) evals=1 setup=(output = deepcopy($output));
# 3.791 μs (2 allocations: 64 bytes)

Moelf commented Mar 9, 2023

this is great! I'm not very familiar with the internal Buffer type, but since I eventually want the byte data back, can I assume I can construct a Buffer myself and extract the bytes out afterwards?
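For reference, the pattern being asked about would look roughly like this under the API sketched in the PR description (a sketch, assuming the Buffer(n) constructor and the .data field behave as shown above; it requires this PR's branch, since transcode! is not yet merged):

```julia
using TranscodingStreams, CodecZstd

text = "hello, world"
compressed = transcode(ZstdCompressor, text)

# Construct a correctly-sized Buffer yourself...
decompressor = ZstdDecompressor()
output = TranscodingStreams.Buffer(ncodeunits(text))
TranscodingStreams.transcode!(decompressor, compressed, output)

# ...and extract the bytes out afterwards via the internal data field.
bytes = output.data
@assert bytes == Vector{UInt8}(text)
```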

@baumgold
Contributor

@svilupp - this suggestion looks great to me. Can this PR be moved out of draft status so it can be reviewed, approved, merged, and released? It seems like quite a useful feature, so I'd like to get it out there ASAP. Thanks!

@baumgold
Contributor

I think we can implement this without as much code duplication. See #136. Thoughts?

@svilupp svilupp closed this Mar 15, 2023
Moelf commented Mar 16, 2023

thanks, @svilupp for the initiative!

3 participants