
Read trait for raw::Encoder/Decoder #41

Closed
milesgranger opened this issue Mar 15, 2021 · 4 comments


@milesgranger

Hello,

Picking up from the rabbit-hole conversation I got us into about Read for raw::Encoder/Decoder.

I'm trying to expose various de/compression algorithms written in Rust to Python. Most of them, including Snappy's framed format, provide Read/Write-enabled en/decoders. This lets me offer Python users file-like objects, bytes-like objects, or numpy arrays, all of which implement Read/Write on the Rust side, and pass them to every de/compression variant through a common entry signature like the following:

fn compress<W: Write + ?Sized, R: Read>(input: R, output: &mut W) -> Result<usize, Error>;
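A minimal std-only sketch of that entry point (the pass-through body is a stand-in for a real algorithm; in practice `output` would be wrapped in an encoder such as `snap::write::FrameEncoder` before copying):

```rust
use std::io::{self, Read, Write};

// Generic entry point: stream any Read source into any Write sink.
// A real implementation would wrap `output` in an encoder; io::copy
// then streams the data through without buffering everything at once.
fn compress<W: Write + ?Sized, R: Read>(mut input: R, output: &mut W) -> io::Result<usize> {
    let n = io::copy(&mut input, output)?;
    Ok(n as usize)
}

fn main() -> io::Result<()> {
    // &[u8] implements Read; Vec<u8> implements Write.
    let src: &[u8] = b"hello snappy";
    let mut dst: Vec<u8> = Vec::new();
    let n = compress(src, &mut dst)?;
    assert_eq!(n, 12);
    assert_eq!(dst, b"hello snappy".to_vec());
    Ok(())
}
```

Because the signature only demands `Read` and `Write`, the same entry point serves files, in-memory byte buffers, and numpy-backed readers alike.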

As mentioned in the previous thread, there are users with good use cases for an equivalent API for the Snappy raw format.
However, as we know, the raw en/decoders' de/compress methods only accept byte slices.

I've attempted to implement my own RawEncoder/RawDecoder which implements the Read trait, here, but it seems to require doing a full read into a buffer before anything can be passed to snap::raw::Encoder::compress, and thus ends up being slower than python-snappy. Before, when using byte slices, it was "just as fast", at the cost of less flexibility for the Python user.

Should you find yourself unconvinced that this would add value to rust-snappy, I would be very happy if you could point out any of my (perhaps many) performance mistakes in my own RawEncoder/RawDecoder implementation.

@BurntSushi
Owner

@milesgranger Thanks for the issue! I appreciate the link to the diff. I'm on parental leave at the moment, so my time is pretty limited, and reviewing that patch would eat up quite a bit of time. Is there a way for you to shrink your issue down into a smaller code snippet?

Another thing that would be helpful to address is which APIs you were using previously. (From the C++ Snappy reference implementation, I presume?) As I understand it, the Snappy raw format requires everything to be in memory at once. IIRC, this was a key point of confusion that I had in our previous discussion, and I really think we should work to try and resolve this point so that we're both on the same page. That is, when it comes to the Snappy raw format, you cannot read piecemeal from a stream and pass it into some raw encoder to be compressed, unless you're okay with dealing with a raw Snappy compressed block for each piece that you read.
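To make the "one raw block per piece" point concrete, here is a std-only sketch; `raw_compress` is a stand-in for a real raw compressor (e.g. something like `snap::raw::Encoder::compress_vec`). Compressing a stream chunk-by-chunk yields a sequence of independent blocks, not one block for the whole stream, and a consumer must then track where each block starts and ends:

```rust
use std::io::{self, Read};

// Stand-in for a real raw compressor; here it just copies the chunk so
// the block boundaries stay visible.
fn raw_compress(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}

// Reading piecemeal from a stream and raw-compressing each piece produces
// one independent raw block per read call.
fn compress_piecemeal<R: Read>(mut input: R, chunk_size: usize) -> io::Result<Vec<Vec<u8>>> {
    let mut blocks = Vec::new();
    let mut buf = vec![0u8; chunk_size];
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        blocks.push(raw_compress(&buf[..n]));
    }
    Ok(blocks)
}

fn main() -> io::Result<()> {
    // Six bytes read in chunks of two become three separate blocks.
    let blocks = compress_piecemeal(&b"abcdef"[..], 2)?;
    assert_eq!(blocks.len(), 3);
    Ok(())
}
```

This is the trade-off described above: either buffer the entire input and emit a single raw block, or accept a raw block per piece and invent your own framing around them — which is essentially what the Snappy frame format already does for you.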

@BurntSushi
Owner

As an example, the C API of the reference Snappy implementation is almost exactly the same as the raw API exposed in this crate: https://github.com/google/snappy/blob/master/snappy-c.h

So what I'm saying here is, whatever it was you were doing before, you should be able to do with this crate too.

@milesgranger
Author

I should have been clearer: I was using snap::raw::De/Encoder::de/compress.

Agreed, I think we're on the same page now about everything needing to be in memory for the Snappy raw format. This is perhaps the flaw in my desire to support the raw format for file-like objects: one simply cannot stream an uncompressed file into a Snappy raw compressed one without first reading the entire file into memory. Which is basically what I wanted to show in that diff, but I can't seem to link to a specific code snippet.

However, the current implementation looks like this:

impl<R: Read> Read for RawDecoder<R> {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        // On the first call, read *all* bytes into a buffer, decompress
        // them, then serve subsequent reads from the decompressed cursor.
        if self.buffer.is_empty() {
            // self.inner -> R: Read, so maybe a file, who knows; the raw
            // format needs the whole thing in memory at once.
            // (read_to_end appends exactly what it read, so no truncate
            // of self.buffer is needed afterwards.)
            self.inner.read_to_end(&mut self.buffer)?;
            let len = decompress_len(&self.buffer)?;
            self.decompressed.get_mut().resize(len, 0);
            let n = self
                .decoder
                .decompress(self.buffer.as_slice(), self.decompressed.get_mut())?;
            self.decompressed.get_mut().truncate(n);
            self.decompressed.set_position(0);
        }
        self.decompressed.read(buf) // <- Cursor<Vec<u8>>
    }
}

So yeah, maybe there isn't much room for improvement after all; as you say, and I agree, for raw to work we need the whole thing.

Thank you again for the help, and congratulations on the new kiddo — enjoy the parental leave!

@BurntSushi
Owner

This is perhaps the flaw in my desire to support the raw format for file-like objects: one simply cannot stream an uncompressed file into a Snappy raw compressed one without first reading the entire file into memory.

Yup, that is exactly correct and matches my expectations. That's exactly why the crate's docs push you toward the Snappy stream impl: the raw format is inconvenient. Of course, if you have external stuff using the raw format, then you have no choice; you gotta play their ball game. But with that comes the in-memory requirement. My guess is that in those scenarios, memory buffering is implemented at a different level. Or at least, I would hope so. Not only does the raw format require everything to be in memory, but it has a relatively small limit on the maximum such size. And that's not just a limitation of this library; it's a limitation imposed by the spec itself.
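For sizing those in-memory buffers, the reference implementation's public header exposes a worst-case output bound, `MaxCompressedLength(n) = 32 + n + n/6` (the snap crate mirrors this as `snap::raw::max_compress_len`, which additionally returns 0 when the input is too large to encode). A std-only restatement of that formula:

```rust
// Worst-case compressed size of a raw Snappy block, per the reference
// implementation's MaxCompressedLength: 32 + source_len + source_len / 6.
fn max_compress_len(source_len: usize) -> usize {
    32 + source_len + source_len / 6
}

fn main() {
    // Even incompressible input only grows by ~17% plus a small constant,
    // so the output buffer can always be preallocated up front.
    assert_eq!(max_compress_len(0), 32);
    assert_eq!(max_compress_len(6), 39);
    println!("bound for 1 MiB: {}", max_compress_len(1 << 20));
}
```

This is why a raw compress call can take a caller-provided output slice: the caller can compute the bound before compressing, with no reallocation mid-compression.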
