
The trait serde::ser::Serialize is not implemented for csv::ByteRecord #76

Closed
emk opened this issue May 26, 2017 · 8 comments

emk commented May 26, 2017

Hello! I'm trying out the new csv beta release on some of our in-house data-munging tools, and it's great.

But I did discover one tiny way to improve the ergonomics:

        // Use `byte_records` to get raw binary data without UTF-8 validation
        // for speed, since we may have enormous numbers of rows and hundreds of
        // columns.
        for row in rdr.byte_records() {
            let mut row = row?;
            let zip = from_utf8(&row[zip_col_idx])
                .chain_err(|| -> Error { "Could not parse zip code as UTF-8".into() })?
                .to_owned();
            row.push_field(self.chunk_for(&zip)?.as_bytes());
            wtr.serialize(&row)?;
        }

The above code has no idea what the columns are in any given record, except for zip_col_idx, which it uses to compute a new column and add it to the output row. But this code fails with:

error[E0277]: the trait bound `csv::ByteRecord: serde::ser::Serialize` is not satisfied
   --> src/zip2010.rs:101:17
    |
101 |             wtr.serialize(&row)?;
    |                 ^^^^^^^^^ the trait `serde::ser::Serialize` is not implemented for `csv::ByteRecord`
    |
    = note: required because of the requirements on the impl of `serde::ser::Serialize` for `&csv::ByteRecord`

This turns up in a lot of places in the API when working with generic CSV tools. I suspect one way to improve this might be to implement Serialize and Deserialize for both ByteRecord and Record.
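For illustration, here is a rough sketch of what such an impl might look like (purely hypothetical: Serialize and ByteRecord are both foreign to downstream crates, so it could only be added inside csv itself):

use serde::ser::{Serialize, Serializer};

// Hypothetical sketch only; this would have to live inside the csv crate.
impl Serialize for ByteRecord {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        // Emit the record as a sequence of its raw byte fields.
        serializer.collect_seq(self.iter())
    }
}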

I might be able to find time sometime soon to put together a PR. But is there a better way to fix this?

emk added a commit to faradayio/geochunk that referenced this issue May 26, 2017
We also upgraded to the latest beta release of `csv` to get support for
`position`, which we may want to use in error messages in the future.
But that's blocked on BurntSushi/rust-csv#76
BurntSushi (Owner) commented:

@emk Could you maybe say why you are using serialize here? Note that there is also write_record (which takes an iterator of fields) and a write_byte_record (which takes a &ByteRecord and is the fastest).

Also note that byte_records isn't the fastest way to iterate over records, since you're forcing an allocation for each record. I think the section on amortization in the tutorial might be helpful!

I can't test this with the compiler, but I'd think this would be the best way to write your loop:

let mut row = csv::ByteRecord::new();
while rdr.read_byte_record(&mut row)? {
    let zip = from_utf8(&row[zip_col_idx])
        .chain_err(|| -> Error { "Could not parse zip code as UTF-8".into() })?;
    row.push_field(self.chunk_for(zip)?.as_bytes());
    wtr.write_byte_record(&row)?;
}

Note that I also tried to get rid of the to_owned call in your let zip = ...; statement, since it seems like an unnecessary allocation, but I don't know for sure. :-)

I suspect one way to improve this might be to implement Serialize and Deserialize for both ByteRecord and Record.

I did indeed consider that. I decided to not do it for now because (as I think is the case here) it results in folks using sub-optimal APIs. e.g., You already have a ByteRecord, so I don't think there's any good reason to go through the Serde infrastructure just to write that row. Does that make sense or am I missing something else about your use case?


emk commented May 26, 2017

Does that make sense or am I missing something else about your use case?

I think it's me who missed the relevant sections of the new documentation while porting code forward. :-/ Sorry about that. I'll go re-read the documentation more carefully this time. Thank you for the pointer in the right direction!

emk closed this as completed May 26, 2017
BurntSushi (Owner) commented:

@emk No worries! Note that write_byte_record isn't mentioned in the tutorial. I added it in 1.0.0-beta.3 and haven't updated the tutorial to include it. But there's an example in the API docs. :-)
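For anyone landing here later, a minimal sketch of the write_byte_record pattern (illustrative only, not the exact example from the API docs):

use std::error::Error;
use std::io;

fn write_example() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // Build a record from raw byte fields; no UTF-8 validation is involved.
    let record = csv::ByteRecord::from(vec!["foo", "12345", "bar"]);
    wtr.write_byte_record(&record)?;
    wtr.flush()?;
    Ok(())
}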


emk commented May 26, 2017

I really learned a lot from the old tutorial section that showed the absolute fastest ways to do various operations—that was a super-helpful bit of documentation.

Basically, if we touch CSV data, there's automatically going to be at least 150 GB of it. So I try to know all the tricks for these inner loops.

Thank you once again for csv! It's a terrific library, and everybody at Faraday is constantly impressed by the sheer speed of what the lower-level APIs can do, and by the xsv tool.

(Now I need to go rebuild my Pachyderm containers and get ready for another couple thousand CPU hours worth of data processing next week. ;-) And no, csv is not where the bulk of those hours are going, happily! Most of that time is now spent on a custom bit of very hairy data munging.)

BurntSushi (Owner) commented:

@emk Thanks for your kind words. :-) Note that xsv 0.12.1 is quite a bit faster on several operations now!

BurntSushi (Owner) commented:

@emk Coming back around to this, I wanted to say a little bit more: I would very much appreciate any feedback you have on csv 1.0. I released a beta first specifically so I could still make breaking changes if I have to. :-)


emk commented May 27, 2017

I really ought to update scrubcsv, which relies on the fact that csv makes a partial effort to deal with malformed quotes and still return data. catcsv actually avoids CSV parsing as much as possible—it just parses and looks at headers, and then pipes everything after straight through with io::copy. So aside from scrubcsv's particular misuse of csv's APIs, I think we should mostly be OK with whatever you do. We mostly stick to your lower-level APIs and try to squeeze out everything we can.
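For illustration, the header-then-copy idea looks roughly like this (my own sketch, not catcsv's actual code; the function name and header-matching rule are made up):

use std::io::{self, BufRead, Read, Write};

// Sketch only: parse just the header line, then stream the remaining bytes
// through untouched with io::copy.
fn cat_one(input: impl Read, out: &mut impl Write, expected_header: Option<&str>) -> io::Result<()> {
    let mut input = io::BufReader::new(input);

    let mut header = String::new();
    input.read_line(&mut header)?;

    match expected_header {
        // First file: emit its header.
        None => out.write_all(header.as_bytes())?,
        // Later files: headers must match, and are not repeated.
        Some(expected) if header.trim_end() == expected => {}
        Some(_) => return Err(io::Error::new(io::ErrorKind::InvalidData, "header mismatch")),
    }

    // Everything after the header is piped straight through.
    io::copy(&mut input, out)?;
    Ok(())
}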

We like CSV for "extract transform load" (ETL) tasks, not because it's the most performant data format, but because it allows us to build pipelines between heterogeneous tools and still get mostly acceptable performance. csv and xsv have made this strategy a lot more viable.


emk commented May 27, 2017

Oh, also, I personally released cli_test_dir, which is a standalone version of your WorkDir test pattern so that I can trivially test the CLIs of new Rust tools.

I noticed this pattern while working on xsv partition and wanted to use it elsewhere. :-)

emk added a commit to faradayio/csv-tools that referenced this issue Feb 12, 2021
We also upgraded to the latest beta release of `csv` to get support for
`position`, which we may want to use in error messages in the future.
But that's blocked on BurntSushi/rust-csv#76