
The trait serde::ser::Serialize is not implemented for csv::ByteRecord #76

Closed
emk opened this issue May 26, 2017 · 8 comments

emk commented May 26, 2017

Hello! I'm trying out the new csv beta release on some of our in-house data-munging tools, and it's great.

But I did discover one tiny way to improve the ergonomics:

        // Use `byte_records` to get raw binary data without UTF-8 validation
        // for speed, since we may have enormous numbers of rows and hundreds of
        // columns.
        for row in rdr.byte_records() {
            let mut row = row?;
            let zip = from_utf8(&row[zip_col_idx])
                .chain_err(|| -> Error { "Could not parse zip code as UTF-8".into() })?
                .to_owned();
            row.push_field(self.chunk_for(&zip)?.as_bytes());
            wtr.serialize(&row)?;
        }

The above code has no idea what the columns are in any given record, except for zip_col_idx, which it uses to compute a new column and add it to the output row. But this code fails with:

error[E0277]: the trait bound `csv::ByteRecord: serde::ser::Serialize` is not satisfied
   --> src/zip2010.rs:101:17
    |
101 |             wtr.serialize(&row)?;
    |                 ^^^^^^^^^ the trait `serde::ser::Serialize` is not implemented for `csv::ByteRecord`
    |
    = note: required because of the requirements on the impl of `serde::ser::Serialize` for `&csv::ByteRecord`

This turns up in a lot of places in the API when working with generic CSV tools. I suspect one way to improve this might be to implement Serialize and Deserialize for both ByteRecord and Record.
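For illustration, here is a rough sketch of what such an impl might look like (purely hypothetical: Serialize and ByteRecord are both foreign to downstream crates, so it could only be added inside csv itself):

use serde::ser::{Serialize, Serializer};

// Hypothetical sketch only; this would have to live inside the csv crate.
impl Serialize for ByteRecord {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer,
    {
        // Emit the record as a sequence of its raw byte fields.
        serializer.collect_seq(self.iter())
    }
}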

I might be able to find time sometime soon to put together a PR. But is there a better way to fix this?

emk added a commit to faradayio/geochunk that referenced this issue May 26, 2017
We also upgraded to the latest beta release of `csv` to get support for
`position`, which we may want to use in error messages in the future.
But that's blocked on BurntSushi/rust-csv#76
BurntSushi (Owner) commented:

@emk Could you maybe say why you are using serialize here? Note that there is also write_record (which takes an iterator of fields) and a write_byte_record (which takes a &ByteRecord and is the fastest).

Also note that byte_records isn't the fastest way to iterate over records, since you're forcing an allocation for each record. I think the section on amortization in the tutorial might be helpful!

I can't test this with the compiler, but I'd think this would be the best way to write your loop:

let mut row = csv::ByteRecord::new();
while rdr.read_byte_record(&mut row)? {
    let zip = from_utf8(&row[zip_col_idx])
        .chain_err(|| -> Error { "Could not parse zip code as UTF-8".into() })?;
    row.push_field(self.chunk_for(zip)?.as_bytes());
    wtr.write_byte_record(&row)?;
}

Note that I also tried to get rid of the to_owned call in your let zip = ...; statement, since it seems like an unnecessary allocation, but I don't know for sure. :-)

I suspect one way to improve this might be to implement Serialize and Deserialize for both ByteRecord and Record.

I did indeed consider that. I decided to not do it for now because (as I think is the case here) it results in folks using sub-optimal APIs. e.g., You already have a ByteRecord, so I don't think there's any good reason to go through the Serde infrastructure just to write that row. Does that make sense or am I missing something else about your use case?


emk commented May 26, 2017

Does that make sense or am I missing something else about your use case?

I think it's me who missed the relevant sections of the new documentation while porting code forward. :-/ Sorry about that. I'll go re-read the documentation more carefully this time. Thank you for the pointer in the right direction!

emk closed this as completed May 26, 2017
BurntSushi (Owner) commented:

@emk No worries! Note that write_byte_record isn't mentioned in the tutorial. I added it in 1.0.0-beta.3 and haven't updated the tutorial to include it. But there's an example in the API docs. :-)
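For anyone landing here later, a minimal sketch of the write_byte_record pattern (illustrative only, not the exact example from the API docs):

use std::error::Error;
use std::io;

fn write_example() -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::Writer::from_writer(io::stdout());
    // Build a record from raw byte fields; no UTF-8 validation is involved.
    let record = csv::ByteRecord::from(vec!["foo", "12345", "bar"]);
    wtr.write_byte_record(&record)?;
    wtr.flush()?;
    Ok(())
}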


emk commented May 26, 2017

I really learned a lot from the old tutorial section that showed the absolute fastest ways to do various operations—that was a super-helpful bit of documentation.

Basically, if we touch CSV data, there's automatically going to be at least 150 GB of it. So I try to know all the tricks for these inner loops.

Thank you once again for csv! It's a terrific library, and everybody at Faraday is constantly impressed by the sheer speed of what the lower-level APIs can do, and by the xsv tool.

(Now I need to go rebuild my Pachyderm containers and get ready for another couple thousand CPU hours worth of data processing next week. ;-) And no, csv is not where the bulk of those hours are going, happily! Most of that time is now spent on a custom bit of very hairy data munging.)

BurntSushi (Owner) commented:

@emk Thanks for your kind words. :-) Note that xsv 0.12.1 is quite a bit faster on several operations now!

BurntSushi (Owner) commented:

@emk Coming back around to this, I wanted to say a little bit more: I would very much appreciate any feedback you have on csv 1.0. I released a beta first specifically so I could still make breaking changes if I have to. :-)


emk commented May 27, 2017

I really ought to update scrubcsv, which relies on the fact that csv makes a partial effort to deal with malformed quotes and still return data. catcsv actually avoids CSV parsing as much as possible—it just parses and looks at headers, and then pipes everything after straight through with io::copy. So aside from scrubcsv's particular misuse of csv's APIs, I think we should mostly be OK with whatever you do. We mostly stick to your lower-level APIs and try to squeeze out everything we can.
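For illustration, the header-then-copy idea looks roughly like this (my own sketch, not catcsv's actual code; the function name and header-matching rule are made up):

use std::io::{self, BufRead, Read, Write};

// Sketch only: parse just the header line, then stream the remaining bytes
// through untouched with io::copy.
fn cat_one(input: impl Read, out: &mut impl Write, expected_header: Option<&str>) -> io::Result<()> {
    let mut input = io::BufReader::new(input);

    let mut header = String::new();
    input.read_line(&mut header)?;

    match expected_header {
        // First file: emit its header.
        None => out.write_all(header.as_bytes())?,
        // Later files: headers must match, and are not repeated.
        Some(expected) if header.trim_end() == expected => {}
        Some(_) => return Err(io::Error::new(io::ErrorKind::InvalidData, "header mismatch")),
    }

    // Everything after the header is piped straight through.
    io::copy(&mut input, out)?;
    Ok(())
}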

We like CSV for "extract transform load" (ETL) tasks, not because it's the most performant data format, but because it allows us to build pipelines between heterogeneous tools and still get mostly acceptable performance. csv and xsv have made this strategy a lot more viable.


emk commented May 27, 2017

Oh, also, I personally released cli_test_dir, which is a standalone version of your WorkDir test pattern so that I can trivially test the CLIs of new Rust tools.

I noticed this pattern while working on xsv partition and wanted to use it elsewhere. :-)

emk added a commit to faradayio/csv-tools that referenced this issue Feb 12, 2021
We also upgraded to the latest beta release of `csv` to get support for
`position`, which we may want to use in error messages in the future.
But that's blocked on BurntSushi/rust-csv#76