Helper for parsing from a R: Read or R: BufRead #1145

Closed
str4d opened this issue Apr 25, 2020 · 8 comments

str4d commented Apr 25, 2020

I use nom and cookie-factory for parsing in I/O. In every instance, I end up with code similar to the following:

fn read<R: Read>(mut input: R) -> io::Result<Self> {
    let mut data = vec![];
    loop {
        match read::header(&data) {
            Ok((_, header)) => break Ok(header),
            Err(nom::Err::Incomplete(nom::Needed::Size(n))) => {
                // Read the needed additional bytes. We need to be careful how the
                // parser is constructed, because if we read more than we need, the
                // remainder of the input will be truncated.
                let m = data.len();
                data.resize(m + n, 0);
                input.read_exact(&mut data[m..m + n])?;
            }
            Err(_) => {
                break Err(io::Error::new(io::ErrorKind::InvalidData, "invalid header"));
            }
        }
    }
}
struct Connection<R: Read> {
    input: BufReader<R>,
    buffer: String,
}

fn read_response(&mut self) -> io::Result<Response> {
    // We are finished with any prior response.
    self.buffer.zeroize();
    self.buffer.clear();

    loop {
        match read::server_response(self.buffer.as_bytes()) {
            // We can't return the response here, because we need to be able to mutate
            // self.buffer inside the loop.
            Ok(_) => break,
            Err(nom::Err::Incomplete(_)) => {
                if self.input.read_line(&mut self.buffer)? == 0 {
                    return Err(io::Error::new(
                        io::ErrorKind::UnexpectedEof,
                        "incomplete response",
                    ));
                };
            }
            Err(_) => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "invalid response",
                ));
            }
        }
    }

    // Now that we know the buffer contains a valid response, we re-parse so that we
    // can return an immutable lifetime.
    Ok(read::server_response(self.buffer.as_bytes())
        .map(|(_, r)| r)
        .expect("Is valid"))
}

It would be useful if nom had helpers for this general pattern of "read just enough from some R: Read to either satisfy the parser or reach an error case".
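
A helper along these lines might look like the following rough sketch (read_parse is a hypothetical name, not an existing nom API, and it assumes nom 5.x, where Needed::Size carries a plain usize):

use std::io::{self, Read};

use nom::IResult;

// Hypothetical helper: repeatedly feed a streaming parser, reading only as many
// extra bytes from `input` as the parser asks for.
fn read_parse<R, O, P>(mut input: R, parser: P) -> io::Result<O>
where
    R: Read,
    P: Fn(&[u8]) -> IResult<&[u8], O>,
{
    let mut data = vec![];
    loop {
        match parser(&data) {
            Ok((_, value)) => break Ok(value),
            Err(nom::Err::Incomplete(nom::Needed::Size(n))) => {
                // Read exactly the additional bytes the parser asked for.
                let m = data.len();
                data.resize(m + n, 0);
                input.read_exact(&mut data[m..m + n])?;
            }
            Err(nom::Err::Incomplete(nom::Needed::Unknown)) => {
                // The parser cannot say how much it needs; read one more byte.
                let m = data.len();
                data.resize(m + 1, 0);
                input.read_exact(&mut data[m..m + 1])?;
            }
            Err(_) => break Err(io::Error::new(io::ErrorKind::InvalidData, "parse error")),
        }
    }
}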

str4d commented Apr 25, 2020

#1055 is also relevant (specifically wrt being able to read no more than necessary from R: Read).

@sollyucko

It should even be possible to implement some operations on std::io traits without an additional buffer, for example:

Geal added this to the 6.0 milestone Aug 16, 2020

Geal commented Aug 23, 2020

Reading from a Read is common, so I should provide tooling for that.
BufReader is not really the right interface for this: its fill_buf method only refills when it reaches the end of its buffer (https://doc.rust-lang.org/src/std/io/buffered.rs.html#306-323), while with nom we would want to tell the IO side to refill according to the parser's result (Incomplete or not).
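
Something along the lines of this sketch is closer to the interface described above (my own illustration with hypothetical names, not an existing nom or std type): a buffer whose refills are driven by the parser's result rather than by the buffer running empty.

use std::io::{self, Read};

// Sketch: the parser loop calls `refill` whenever it gets Incomplete, and
// `consume` once a value has been parsed.
struct ParserBuffer<R: Read> {
    reader: R,
    data: Vec<u8>,
}

impl<R: Read> ParserBuffer<R> {
    fn new(reader: R) -> Self {
        ParserBuffer { reader, data: Vec::new() }
    }

    /// Current unparsed data, to be handed to the parser.
    fn buffer(&self) -> &[u8] {
        &self.data
    }

    /// Read up to `n` more bytes *in addition to* what is already buffered,
    /// unlike BufReader::fill_buf, which only refills once the buffer is empty.
    fn refill(&mut self, n: usize) -> io::Result<usize> {
        let start = self.data.len();
        self.data.resize(start + n, 0);
        let read = self.reader.read(&mut self.data[start..])?;
        self.data.truncate(start + read);
        Ok(read)
    }

    /// Drop the prefix consumed by a successful parse.
    fn consume(&mut self, amount: usize) {
        self.data.drain(..amount);
    }
}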

xhebox commented Aug 31, 2020

Methods of InputIter take &self, but reading from a Read needs &mut self. It would be good if parsers could accept a mutable version of InputIter.

I have tried to make a custom InputIter that wraps around Read, but it suffers from mutability and lifetime problems. @Geal

EDIT: I only managed to get past it by using Rc.
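
Roughly, the idea is something like the following sketch (hypothetical names, illustration only): interior mutability lets a &self method, as the input traits require, still pull bytes from the underlying reader.

use std::cell::RefCell;
use std::io::{self, Read};
use std::rc::Rc;

// Hypothetical wrapper: `fill` takes `&self` but can still read, because the
// reader lives behind Rc<RefCell<...>>.
struct SharedReader<R: Read> {
    inner: Rc<RefCell<R>>,
}

impl<R: Read> SharedReader<R> {
    fn new(reader: R) -> Self {
        SharedReader { inner: Rc::new(RefCell::new(reader)) }
    }

    /// Append up to `n` more bytes to `buf` through a shared reference.
    fn fill(&self, buf: &mut Vec<u8>, n: usize) -> io::Result<usize> {
        let start = buf.len();
        buf.resize(start + n, 0);
        let read = self.inner.borrow_mut().read(&mut buf[start..])?;
        buf.truncate(start + read);
        Ok(read)
    }
}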

Geal commented Oct 25, 2021

FYI it's now possible to use https://crates.io/crates/nom-bufreader

Geal closed this as completed Oct 25, 2021
@douglas-raillard-arm

Could we consider re-opening this issue? As described in #1582, streaming parsers are not a silver bullet. One cannot simply change from complete to streaming parsers and expect things to keep working. I had some pretty nasty, subtle bugs coming from streaming parsers.

What would really solve the issue is implementing the "input traits" of nom for Read, and modifying the traits if necessary to make that possible. Then all parsers would become streaming out of the box (with the caveat that input production would be driven by the parser itself, so not async-friendly).

As it stands, the only robust solution is to apply sub-parsers to known-length parts of the input. This can be made to work for binary formats that encode the size of each blob before the blob itself, so each blob can be extracted into one &[u8] and then processed with a complete parser, but that is extra pain and it does not cover every case.
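
As a concrete sketch of that workaround (my own helper for illustration, assuming a little-endian u64 length prefix): read one sized blob fully into memory, then run a complete parser on it.

use std::io::{self, Read};

// Read one length-prefixed blob into memory; the resulting Vec<u8> can then be
// handed to an ordinary complete nom parser.
fn read_sized_blob<R: Read>(input: &mut R) -> io::Result<Vec<u8>> {
    let mut len_bytes = [0u8; 8];
    input.read_exact(&mut len_bytes)?;
    let len = u64::from_le_bytes(len_bytes) as usize;

    let mut blob = vec![0u8; len];
    input.read_exact(&mut blob)?;
    Ok(blob)
}

This works, but it has to be repeated by hand for every sized field in the format, which is the extra pain mentioned above.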

Geal commented Jan 4, 2023

In my experience, trying to apply the parsers directly on Read or BufRead is the wrong approach. It does not apply correctly on parsers that need to backtrack, and performance does not end up at the level you'd expect, so much that you'd rather just read everything then parse in one pass.
Similarly, I tried the async await style and storing a closure, and neither was worth the complexity it introduced. Very quickly you end up introducing mutexes and allocations all over the place, and lose what makes nom fast: working on and returning immutable slices of the input data. Haskell makes this a lot easier than Rust, but at a great cost in performance.

Streaming parsers work well when one or more of those conditions are met:

  • the format uses a kind of tag/length/value design for messages so you can know in advance how much you need to buffer
  • the format has clear delimiters for subsections of the format (example: HTTP headers) so those subsections can be parsed one by one
  • there's a state machine driving IO and parsing that can accumulate data. This is what I do in all my protocol implementations (a rough sketch follows below)
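
A minimal sketch of that third option (illustration only, with hypothetical names): the state machine owns the buffer, the IO side feeds it bytes as they arrive, and parsing only proceeds once enough data has accumulated.

use nom::IResult;

// The IO side pushes bytes with `feed`; `step` attempts a parse and reports
// whether more data is needed.
struct Accumulator {
    buffer: Vec<u8>,
}

enum Step<T> {
    NeedMoreData,
    Parsed(T),
}

impl Accumulator {
    fn new() -> Self {
        Accumulator { buffer: Vec::new() }
    }

    /// The IO side (blocking or async) pushes whatever bytes it currently has.
    fn feed(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    /// Try to parse one message; on success, drop the consumed prefix.
    fn step<O>(
        &mut self,
        parser: impl Fn(&[u8]) -> IResult<&[u8], O>,
    ) -> Result<Step<O>, String> {
        let (consumed, value) = match parser(&self.buffer) {
            Ok((rest, value)) => (self.buffer.len() - rest.len(), value),
            Err(nom::Err::Incomplete(_)) => return Ok(Step::NeedMoreData),
            Err(e) => return Err(format!("parse error: {:?}", e)),
        };
        self.buffer.drain(..consumed);
        Ok(Step::Parsed(value))
    }
}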

I am currently testing a design where the input type indicates whether we're in streaming mode or not, which should help in writing parsers. But if you want to work with streaming parsers, there's a basic level of complexity you need to manage, and decisions you have to make about your format, which nom cannot make for you because that's the wrong level for it.

Can you tell me what kind of format you are working on? Maybe I can provide some pointers to make things easier

douglas-raillard-arm commented Jan 5, 2023

Thanks for your quick response :)

The format I'm writing a parser for is:
https://man7.org/linux/man-pages/man5/trace-cmd.dat.5.html

Overall, it's organized as follows ("sized XXX" means "an XXX blob preceded by e.g. a u64 stating the size of the blob"):

  • a header:
    • a sequence of sized blobs (but sadly no overall header size, so I can't just extract the whole header easily without nom)
  • a concatenation of arbitrarily large data buffers (could be bigger than memory):
    • each buffer is composed of a sequence of sized small blobs

It does not apply correctly on parsers that need to backtrack

I don't need to backtrack in my specific case, but wouldn't Seek help fix that? I'm not sure how efficient Seek is on BufReader, though.

so much that you'd rather just read everything then parse in one pass.

This is unfortunately not an option. The file I'm parsing may be larger than memory. The problem I'm facing is parsing the header of the format, which is not easily sized without decoding it (it's made of multiple reasonably-sized blobs, but I need basic le_u64/be_u64, take() and tag() to gather them). The data themselves will have the same issue.

Similarly, I tried the async await style and storing a closure.

Yes, that would be more about enabling new use cases. It's possible to make an API for arbitrarily large files without async, and I'm sure a Parser could still be exposed to async with a blocking read on a channel in a separate thread fed by the async world, or something like that.

So to summarize:

  • I'm very happy with nom to parse individual blobs (they fit for sure in memory, so I can load them into a &[u8] and just run a complete parser)
  • I can't use nom as-is to parse the skeleton of the file, as it is basically an arbitrarily large sequence of sized blobs, so it can't be made into a &[u8].
  • In order to parse the skeleton, I only need basic parsers (le_u64(), take(), tag()), but they need to work on a Read/Seek type.

EDIT: added mentions to tag()
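
As a rough sketch of that last summary point (plain std helpers with hypothetical names, not nom parsers): the skeleton can be walked with Read + Seek, reading each size field and either loading or skipping the blob that follows.

use std::io::{self, Read, Seek, SeekFrom};

// le_u64/tag-style helpers working directly on Read instead of &[u8].
fn read_le_u64<R: Read>(input: &mut R) -> io::Result<u64> {
    let mut buf = [0u8; 8];
    input.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn expect_tag<R: Read>(input: &mut R, expected: &[u8]) -> io::Result<()> {
    let mut buf = vec![0u8; expected.len()];
    input.read_exact(&mut buf)?;
    if buf.as_slice() == expected {
        Ok(())
    } else {
        Err(io::Error::new(io::ErrorKind::InvalidData, "tag mismatch"))
    }
}

/// Skip a sized blob without loading it, so files larger than memory stay cheap.
fn skip_sized_blob<R: Read + Seek>(input: &mut R) -> io::Result<()> {
    let len = read_le_u64(input)?;
    input.seek(SeekFrom::Current(len as i64))?;
    Ok(())
}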

epage added commits to epage/winnow that referenced this issue Feb 17, 2023