Helper for parsing from a R: Read or R: BufRead #1145

Closed
str4d opened this issue Apr 25, 2020 · 8 comments

str4d commented Apr 25, 2020

I use nom and cookie-factory for parsing in I/O. In every instance, I end up with code similar to the following:

fn read<R: Read>(mut input: R) -> io::Result<Self> {
    let mut data = vec![];
    loop {
        match read::header(&data) {
            Ok((_, header)) => break Ok(header),
            Err(nom::Err::Incomplete(nom::Needed::Size(n))) => {
                // Read the needed additional bytes. We need to be careful how the
                // parser is constructed, because if we read more than we need, the
                // remainder of the input will be truncated.
                let m = data.len();
                data.resize(m + n, 0);
                input.read_exact(&mut data[m..m + n])?;
            }
            Err(_) => {
                break Err(io::Error::new(io::ErrorKind::InvalidData, "invalid header"));
            }
        }
    }
}
struct Connection<R: Read> {
    input: BufReader<R>,
    buffer: String,
}

fn read_response(&mut self) -> io::Result<Response> {
    // We are finished with any prior response.
    self.buffer.zeroize();
    self.buffer.clear();

    loop {
        match read::server_response(self.buffer.as_bytes()) {
            // We can't return the response here, because we need to be able to mutate
            // self.buffer inside the loop.
            Ok(_) => break,
            Err(nom::Err::Incomplete(_)) => {
                if self.input.read_line(&mut self.buffer)? == 0 {
                    return Err(io::Error::new(
                        io::ErrorKind::UnexpectedEof,
                        "incomplete response",
                    ));
                };
            }
            Err(_) => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "invalid response",
                ));
            }
        }
    }

    // Now that we know the buffer contains a valid response, we re-parse so that we
    // can return an immutable lifetime.
    Ok(read::server_response(self.buffer.as_bytes())
        .map(|(_, r)| r)
        .expect("Is valid"))
}

It would be useful if nom had helpers for this general pattern of "read just enough from some R: Read to either satisfy the parser or reach an error case".
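
A helper along these lines might look like the following rough sketch (read_parse is a hypothetical name, not an existing nom API, and it assumes nom 5.x, where Needed::Size carries a plain usize):

use std::io::{self, Read};

use nom::IResult;

// Hypothetical helper: repeatedly feed a streaming parser, reading only as many
// extra bytes from `input` as the parser asks for.
fn read_parse<R, O, P>(mut input: R, parser: P) -> io::Result<O>
where
    R: Read,
    P: Fn(&[u8]) -> IResult<&[u8], O>,
{
    let mut data = vec![];
    loop {
        match parser(&data) {
            Ok((_, value)) => break Ok(value),
            Err(nom::Err::Incomplete(nom::Needed::Size(n))) => {
                // Read exactly the additional bytes the parser asked for.
                let m = data.len();
                data.resize(m + n, 0);
                input.read_exact(&mut data[m..m + n])?;
            }
            Err(nom::Err::Incomplete(nom::Needed::Unknown)) => {
                // The parser cannot say how much it needs; read one more byte.
                let m = data.len();
                data.resize(m + 1, 0);
                input.read_exact(&mut data[m..m + 1])?;
            }
            Err(_) => break Err(io::Error::new(io::ErrorKind::InvalidData, "parse error")),
        }
    }
}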

str4d commented Apr 25, 2020

#1055 is also relevant (specifically wrt being able to read no more than necessary from R: Read).

@sollyucko

It should even be possible to implement some operations on std::io traits without an additional buffer, for example:

Geal added this to the 6.0 milestone Aug 16, 2020

Geal commented Aug 23, 2020

Reading from a Read is common, so I should provide tooling for that.
BufReader is not really the right interface for this: its fill_buf method only refills when it reaches the end of its buffer (https://doc.rust-lang.org/src/std/io/buffered.rs.html#306-323), while with nom we would want to tell the IO side to refill according to the parser's result (Incomplete or not).
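
Something along the lines of this sketch is closer to the interface described above (my own illustration with hypothetical names, not an existing nom or std type): a buffer whose refills are driven by the parser's result rather than by the buffer running empty.

use std::io::{self, Read};

// Sketch: the parser loop calls `refill` whenever it gets Incomplete, and
// `consume` once a value has been parsed.
struct ParserBuffer<R: Read> {
    reader: R,
    data: Vec<u8>,
}

impl<R: Read> ParserBuffer<R> {
    fn new(reader: R) -> Self {
        ParserBuffer { reader, data: Vec::new() }
    }

    /// Current unparsed data, to be handed to the parser.
    fn buffer(&self) -> &[u8] {
        &self.data
    }

    /// Read up to `n` more bytes *in addition to* what is already buffered,
    /// unlike BufReader::fill_buf, which only refills once the buffer is empty.
    fn refill(&mut self, n: usize) -> io::Result<usize> {
        let start = self.data.len();
        self.data.resize(start + n, 0);
        let read = self.reader.read(&mut self.data[start..])?;
        self.data.truncate(start + read);
        Ok(read)
    }

    /// Drop the prefix consumed by a successful parse.
    fn consume(&mut self, amount: usize) {
        self.data.drain(..amount);
    }
}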

xhebox commented Aug 31, 2020

Methods of InputIter take &self, but reading from a Read needs &mut self. It would be good if parsers could accept a mutable version of InputIter.

I have tried to make a custom InputIter that wraps around Read, but it suffers from mutability and lifetime problems. @Geal

EDIT: I only managed to get past it by using Rc.
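
Roughly, the idea is something like the following sketch (hypothetical names, illustration only): interior mutability lets a &self method, as the input traits require, still pull bytes from the underlying reader.

use std::cell::RefCell;
use std::io::{self, Read};
use std::rc::Rc;

// Hypothetical wrapper: `fill` takes `&self` but can still read, because the
// reader lives behind Rc<RefCell<...>>.
struct SharedReader<R: Read> {
    inner: Rc<RefCell<R>>,
}

impl<R: Read> SharedReader<R> {
    fn new(reader: R) -> Self {
        SharedReader { inner: Rc::new(RefCell::new(reader)) }
    }

    /// Append up to `n` more bytes to `buf` through a shared reference.
    fn fill(&self, buf: &mut Vec<u8>, n: usize) -> io::Result<usize> {
        let start = buf.len();
        buf.resize(start + n, 0);
        let read = self.inner.borrow_mut().read(&mut buf[start..])?;
        buf.truncate(start + read);
        Ok(read)
    }
}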

Geal commented Oct 25, 2021

FYI it's now possible to use https://crates.io/crates/nom-bufreader

Geal closed this as completed Oct 25, 2021
@douglas-raillard-arm

Could we consider re-opening this issue? As described in #1582, streaming parsers are not a silver bullet. One cannot simply change from complete to streaming parsers and expect things to keep working. I had some pretty nasty, subtle bugs coming from streaming parsers.

What would really solve the issue is implementing the "input traits" of nom for Read, and modifying the traits if necessary to make that possible. Then all parsers would become streaming out of the box (with the caveat that input production would be driven by the parser itself, so not async-friendly).

As it stands, the only robust solution is to apply sub-parsers to known-length parts of the input. This can be made to work for binary formats that encode the size of each blob before the blob itself, so each blob can be extracted into one &[u8] and then processed with a complete parser, but that is extra pain and it does not cover every case.
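
As a concrete sketch of that workaround (my own helper for illustration, assuming a little-endian u64 length prefix): read one sized blob fully into memory, then run a complete parser on it.

use std::io::{self, Read};

// Read one length-prefixed blob into memory; the resulting Vec<u8> can then be
// handed to an ordinary complete nom parser.
fn read_sized_blob<R: Read>(input: &mut R) -> io::Result<Vec<u8>> {
    let mut len_bytes = [0u8; 8];
    input.read_exact(&mut len_bytes)?;
    let len = u64::from_le_bytes(len_bytes) as usize;

    let mut blob = vec![0u8; len];
    input.read_exact(&mut blob)?;
    Ok(blob)
}

This works, but it has to be repeated by hand for every sized field in the format, which is the extra pain mentioned above.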

Geal commented Jan 4, 2023

In my experience, trying to apply the parsers directly on Read or BufRead is the wrong approach. It does not apply correctly on parsers that need to backtrack, and performance does not end up at the level you'd expect, so much that you'd rather just read everything then parse in one pass.
Similarly, I tried the async await style and storing a closure, and neither was worth the complexity it introduced. Very quickly you end up introducing mutexes and allocations all over the place, and lose what makes nom fast: working on and returning immutable slices of the input data. Haskell makes this a lot easier than Rust, but at a great cost in performance.

Streaming parsers work well when one or more of those conditions are met:

  • the format uses a kind of tag/length/value design for messages so you can know in advance how much you need to buffer
  • the format has clear delimiters for subsections of the format (example: HTTP headers) so those subsections can be parsed one by one
  • there's a state machine driving IO and parsing that can accumulate data. This is what I do in all my protocol implementations (a rough sketch follows below)
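
A minimal sketch of that third option (illustration only, with hypothetical names): the state machine owns the buffer, the IO side feeds it bytes as they arrive, and parsing only proceeds once enough data has accumulated.

use nom::IResult;

// The IO side pushes bytes with `feed`; `step` attempts a parse and reports
// whether more data is needed.
struct Accumulator {
    buffer: Vec<u8>,
}

enum Step<T> {
    NeedMoreData,
    Parsed(T),
}

impl Accumulator {
    fn new() -> Self {
        Accumulator { buffer: Vec::new() }
    }

    /// The IO side (blocking or async) pushes whatever bytes it currently has.
    fn feed(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    /// Try to parse one message; on success, drop the consumed prefix.
    fn step<O>(
        &mut self,
        parser: impl Fn(&[u8]) -> IResult<&[u8], O>,
    ) -> Result<Step<O>, String> {
        let (consumed, value) = match parser(&self.buffer) {
            Ok((rest, value)) => (self.buffer.len() - rest.len(), value),
            Err(nom::Err::Incomplete(_)) => return Ok(Step::NeedMoreData),
            Err(e) => return Err(format!("parse error: {:?}", e)),
        };
        self.buffer.drain(..consumed);
        Ok(Step::Parsed(value))
    }
}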

I am currently testing a design where the input type indicates whether we're in streaming mode or not, which should help in writing parsers. But if you want to work with streaming parsers, there's a basic level of complexity you need to manage, and decisions you have to make about your format, which nom cannot make for you because that's the wrong level for it.

Can you tell me what kind of format you are working on? Maybe I can provide some pointers to make things easier

douglas-raillard-arm commented Jan 5, 2023

Thanks for your quick response :)

The format I'm writing a parser for is:
https://man7.org/linux/man-pages/man5/trace-cmd.dat.5.html

Overall, it's organized as follows ("sized XXX" means "an XXX blob preceded by e.g. a u64 stating the size of the blob"):

  • a header:
    • a sequence of sized blobs (but sadly no overall header size, so I can't just extract the whole header easily without nom)
  • a concatenation of arbitrarily large data buffers (could be bigger than memory):
    • each buffer is composed of a sequence of sized small blobs

It does not apply correctly on parsers that need to backtrack

I don't need to backtrack in my specific case, but wouldn't Seek help fix that? I'm not sure how efficient Seek is on BufReader, though.

so much that you'd rather just read everything then parse in one pass.

This is unfortunately not an option. The file I'm parsing may be larger than memory. The problem I'm facing is parsing the header of the format, which is not easily sized without decoding it (it's made of multiple reasonably-sized blobs, but I need basic le_u64/be_u64, take() and tag() to gather them). The data themselves will have the same issue.

Similarly, I tried the async await style and storing a closure.

Yes, that would be more about enabling new use cases. It's possible to make an API for arbitrarily large files without async, and I'm sure a Parser could still be exposed to async with a blocking read on a channel in a separate thread fed by the async world, or something like that.

So to summarize:

  • I'm very happy with nom to parse individual blobs (they fit for sure in memory, so I can load them into a &[u8] and just run a complete parser)
  • I can't use nom as-is to parse the skeleton of the file, as it is basically an arbitrarily large sequence of sized blobs, so it can't be made into a &[u8].
  • In order to parse the skeleton, I only need basic parsers (le_u64(), take(), tag()), but they need to work on a Read/Seek type.

EDIT: added mentions to tag()
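
As a rough sketch of that last summary point (plain std helpers with hypothetical names, not nom parsers): the skeleton can be walked with Read + Seek, reading each size field and either loading or skipping the blob that follows.

use std::io::{self, Read, Seek, SeekFrom};

// le_u64/tag-style helpers working directly on Read instead of &[u8].
fn read_le_u64<R: Read>(input: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    input.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn expect_tag<R: Read>(input: &mut R, expected: &[u8]) -> io::Result<()> {
    let mut buf = vec![0u8; expected.len()];
    input.read_exact(&mut buf)?;
    if buf.as_slice() == expected {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "tag mismatch"))
    }
}

/// Skip a sized blob without loading it, so files larger than memory stay cheap.
fn skip_sized_blob<R: Read + Seek>(input: &mut R) -> io::Result<()> {
    let len = read_le_u64(input)?;
    input.seek(SeekFrom::Current(len as i64))?;
    Ok(())
}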

epage added commits to epage/winnow that referenced this issue Feb 17, 2023