`GzDecoder` eager reading in the constructor blocks IO #368

cgbur · 2023-08-03T16:49:48Z

Currently the GzDecoder new method eagerly attempts to read the header information in the constructor. It has match statements for if the io should be non-blocking, but in general this will never happen. I think it would be more proper to follow the usual pattern of constructors in Rust not doing work by trying to read the underlying stream, and instead lazily determine the state when a read is requested.

The case where this eager reading behavior is a problem is when there are multiple fifo stream being initialized. The first decoder created will block on trying to read from the underlying stream before any work as been done, preventing the rest of the application from working. I don't think it is technically incorrect, but in general Rust prefers being lazy with computation especially when it has side effects.

The current workaround that I have to use to avoid generating read syscalls during construction is pretty hacky:

enum LazyGzipReader {
    Uninitialized(File),
    Initialized(MultiGzDecoder<BufReader<File>>),
}

impl Read for LazyGzipReader {
    fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
        match self {
            LazyGzipReader::Uninitialized(file) => {
                *self = LazyGzipReader::Initialized(MultiGzDecoder::new(BufReader::new(
                    file.try_clone()?,
                )));
                self.read(buf)
            }
            LazyGzipReader::Initialized(reader) => reader.read(buf),
        }
    }
}

The text was updated successfully, but these errors were encountered:

jongiddy · 2023-08-03T20:18:19Z

You have a good point that parsing the header in the constructor is not idiomatic. But a change would be backwards-incompatible. If the inner reader is blocking then a caller can expect that, after a successful new call, calling the header method will always return a header.

cgbur · 2023-08-03T20:25:31Z

In this case calling header would be the blocking call because that would initiate the process of issuing reads. So I believe construction could still be lazy, and you would not be breaking anything in a backwards incompatible way as I don't think users should be depending on construction being blocking vs calling header or read. I am imagining a solution similar to one I provided but probably less hacky. I think a following something like the lateinit pattern would work here?

jongiddy · 2023-08-03T20:55:41Z

True. The header method could read until the state is GzState::Body and then return the header. Technically the same should be done for get_ref, get_mut and into_inner so that they always return a reader that has read past the header. But we could get away with saying it was never specified in the docs that that was guaranteed.

cgbur · 2023-08-03T22:05:00Z

I feel whether get_ref and friends should give a reader that is past the header is a decision that should be made, one way or the other, and then documented. If I had not known about the header skipping I would expect a stream that has been unmodified / not advanced, as those methods don't feel like they should have side effects given other Reader types. For example making a BufReader and then calling into_inner does not pull any bytes from the underlying stream. The current new and header methods do not return a Result<_ io:error> so its no entirely clear from the signature that there is work being done on the streams.

I feel it is probably ok to go ahead and make header do the work of reading and buffer up the error until a read is called as it does now.

Byron · 2023-08-04T05:59:39Z

I'd be interested in understanding how the current behaviour of eagerly reading the header came to be.
What I don't particularly like about the API in its current state is that errors are delayed. From what I can tell, an error in the constructor will be exposed only when calling read().

Without any knowledge on how this came to be, I would tend to make this API more conventional, reporting errors when they occour and doing work only when needed, but reflect that, along with failure modes, in the signature of each method.

Making any change to signatures would probably break the symmetry among the different implementations, which makes it even more prohibitive and brings me back to trying to understand why this way of handling things was desirable in the first place, as I am clearly having a knowledge gap :/.

jongiddy · 2023-08-04T08:16:33Z

Looking at some older code the reason for parsing the header seems to be that all the formats supported by flate2 are containers for DEFLATE encoded data and the different decoders exist to strip off the header and trailer of the specific format. The types get you to the point where read decodes DEFLATE data and the header is mostly an inconvenience to be skipped.

In #185 support for ~~async~~ non-blocking was added. The first commit removes the header parsing from the constructor and the PR author makes the comment that "Usually in async code, we don't want to read IO before poll. The new behavior does not allow us to get gzip header before read." But then the 3rd commit adds it back with the message "keep old behavior".

Byron · 2023-08-05T06:13:35Z

Thanks for the detective-work and the additional explanation. Regarding the header being an inconvenience to be skippedI could imagine that it was simply easiest to strip the header off the first chance one gets, disregarding that doing IO in the constructor might be unexpected or even problematic under circumstances.

To me it seems that there is no reason we'd know that would rate the current behaviour higher (to make it worth keeping) in the face of the problem its causing. To fix it, one would have to lazily parse the header during the first read. The API wouldn't change which seems beneficial.

The behaviour of header() isn't documented, so one might argue it's not a breaking change to adjust it and maybe document it while we are there. Existing code that relies on the header being present early would have to be adjusted to try the header after the first read or once the reading is done. To me, this seems like acceptable churn to fix an issue, which will leave not only this crate more robust, but also the calling code.

Trying to scrutinise the paragraph above, one might say previously one could use GzDecoder to strip the header from a stream and read it, which wouldn't be possible anymore as the first read would set the stream past the header. Maybe that wouldn't be an issue with decoder.read(&[]) though, a subtle but explicit feature to restore the previous capability.

This looks like we should go ahead and make this type/these types lazy - what do you think?

jongiddy · 2023-08-05T07:08:08Z

GzDecoder::read already parses the header if it has not yet been parsed (since non-blocking reads may put it in that state). So the only change required is to replace the GzDecoder::new header parsing with let state = GzState::Header(header_parser);.

If the state is GzState::Header then the header method should call header_parser.parse once, similar to current new. This ensures that, for blocking reads, calling new then header on valid data still gives the header.

The as_ref, as_mut, and into_inner methods should be documented as generally not guaranteeing how much data has been read from the inner type. For the read module this is always true. For the bufread module the guarantees are:

when read returns 0 the next read will start immediately after the end of the compressed data
for blocking readers only, after calling new then header the next read will start at the beginning of the DEFLATE data, matching the existing behaviour of just calling new.

Before making these technically backwards-incompatible changes (even if they are relied on by very few, if any, people) I suggest that there is a 1.0.27 release with the changes to main, hopefully including #367

The changes suggested for this issue can then go into a 1.1.0 release so that if anyone does rely on the previous behaviour they can pin to 1.0 and we keep open the possibility of back-porting fixes to 1.0 if there is demand.

Byron · 2023-08-08T05:52:06Z

Thanks a lot for the assessment, it makes sense to me and I am looking forward to seeing it through.

The new release is in preparation and can happen once my PR was merged.

Byron added the question label Aug 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`GzDecoder` eager reading in the constructor blocks IO #368

`GzDecoder` eager reading in the constructor blocks IO #368

cgbur commented Aug 3, 2023

jongiddy commented Aug 3, 2023

cgbur commented Aug 3, 2023

jongiddy commented Aug 3, 2023

cgbur commented Aug 3, 2023

Byron commented Aug 4, 2023

jongiddy commented Aug 4, 2023 •

edited

Byron commented Aug 5, 2023

jongiddy commented Aug 5, 2023 •

edited

Byron commented Aug 8, 2023

GzDecoder eager reading in the constructor blocks IO #368

GzDecoder eager reading in the constructor blocks IO #368

Comments

cgbur commented Aug 3, 2023

jongiddy commented Aug 3, 2023

cgbur commented Aug 3, 2023

jongiddy commented Aug 3, 2023

cgbur commented Aug 3, 2023

Byron commented Aug 4, 2023

jongiddy commented Aug 4, 2023 • edited

Byron commented Aug 5, 2023

jongiddy commented Aug 5, 2023 • edited

Byron commented Aug 8, 2023

`GzDecoder` eager reading in the constructor blocks IO #368

`GzDecoder` eager reading in the constructor blocks IO #368

jongiddy commented Aug 4, 2023 •

edited

jongiddy commented Aug 5, 2023 •

edited