New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protecting against bomb-like inputs #129

Open
paulkernfeld opened this Issue Sep 3, 2018 · 3 comments

Comments

Projects
None yet
2 participants
@paulkernfeld
Contributor

paulkernfeld commented Sep 3, 2018

It is possible to use a simple streaming input to make rust-csv allocate an infinite amount of memory parsing a single CSV line:

extern crate csv;
extern crate serde;

#[cfg(test)]
mod tests {
    use csv::Reader;
    use std::io::repeat;

    #[test]
    fn test_csv_bomb() {
        let mut rdr = Reader::from_reader(repeat(b','));

        // This line runs forever and keeps allocating memory
        rdr.deserialize::<f64>().next();
    }
}

As a user, I had expected that rust-csv would know that each line is supposed to be deserialized into a single f64 and therefore only parse a single field before returning an error.

This would make it risky to use rust-csv to parse data from untrusted streaming sources, even if the user is expecting a finite number of records where each record has a finite size. I can imagine someone writing a server that parses a CSV as it is uploaded, which would be vulnerable to this kind of issue.

I originally thought of this while working on ndarray-csv; it's pretty easy to deal with arrays that have too many rows, but I'm not sure what to do about arrays with too many columns.

By the way, the performance of this is awesome! I am able to fill up about a gigabyte of memory per second on my laptop! 🤣

@BurntSushi

This comment has been minimized.

Owner

BurntSushi commented Sep 3, 2018

Interesting bug report! Unfortunately, I don't think I agree with your framing/analysis of this issue.

Firstly, without setting some kind of limit, I don't think you have any reasonable expectation that something consuming an unbounded input stream would itself not also use unbounded resources, unless it's explicitly documented.

Secondly, the framing of this issue is somewhat off. Namely, if you really want to feed untrusted input data to this library, then this case is just the tip of the iceberg. I note that this program will not terminate either, and there is no reasonable expectation for the deserializer to "know" there is only one field because it never gets done parsing a single field:

extern crate csv;
extern crate serde;

use std::io;

use csv::Reader;

fn main() {
    let mut rdr = Reader::from_reader(io::repeat(b'0'));
    rdr.deserialize::<f64>().next();
}

A csv library that can execute safely on untrusted streams of data (including unbounded streams) would have a fairly different design IMO. I'm not keen on bolting something like that on to this library. In particular, how do other CSV libraries deal with untrusted data, if at all?

If you want to use this library on untrusted data that can be unbounded and you don't want to use memory proportional to that stream, then I think you should be able to write your own io::Read shim that imposes some fixed limits. e.g., A limit on the number of bytes on each line.

@paulkernfeld

This comment has been minimized.

Contributor

paulkernfeld commented Sep 3, 2018

Okay, that makes sense! I like your 0 example.

I have not done research on untrusted CSV data; I'm guessing that most CSV libraries don't even try to deal with untrusted data.

Your suggestion of wrapping io::Read seems like a great idea! Since I know the number of columns I'm expecting, I could multiply that by the "maximum reasonable size of a single field" to get the number of bytes per line.

Would it be possible to describe what kinds of inputs are and are not safe to parse with rust-csv? E.g. is it safe to accept untrusted inputs if the line length is limited and you parse a limited number of lines?

@BurntSushi

This comment has been minimized.

Owner

BurntSushi commented Sep 3, 2018

Would it be possible to describe what kinds of inputs are and are not safe to parse with rust-csv? E.g. is it safe to accept untrusted inputs if the line length is limited and you parse a limited number of lines?

I do think it might be worthwhile to document this. Before doing so, I think it would be worth an audit of the code, but I am fairly certain that if you limit both the line length and the number of lines, then you should be guaranteed to only use resources proportional to the size of the limits. That is, there shouldn't be an quadratic blow up (or worse) in parsing time or memory use.

@BurntSushi BurntSushi added the doc label Sep 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment