Optional Fields #33

Closed
teburd opened this issue Oct 14, 2015 · 5 comments

Comments

@teburd

teburd commented Oct 14, 2015

So I was attempting to use rust-csv to read GTFS data directly. GTFS is a set of CSV files with headers. Some columns are optional, though: when they are present I should read them, and when they are absent I should be able to skip them.

I didn't see any way to do that with rust-csv and rustc_serialize as they stand today, so it looks like I'll have to write some sort of csv -> hashmap -> struct mapping function. It would be cool if this CSV library could handle optional fields automatically.

@BurntSushi
Owner

Could you show some example data and describe specifically how you want to read it? I ask because "optional fields" can mean a number of different things. For example, if optional field means "the column is there but a value isn't always there," then that's easy: just use an Option<TheTypeForWhenTheValueIsPresent>.
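To illustrate the "column is there but a value isn't always there" case, here is a minimal std-only sketch. It is not rust-csv's API: `parse_optional` is a hypothetical helper that only shows the idea of mapping an empty cell to `None` and anything else to `Some(value)`.

```rust
use std::str::FromStr;

/// Hypothetical helper (not part of rust-csv): turn one CSV cell into an
/// `Option<T>`. An empty or all-whitespace cell becomes `None`; any other
/// cell must parse as `T`, otherwise the parse error is returned.
fn parse_optional<T: FromStr>(cell: &str) -> Result<Option<T>, T::Err> {
    let trimmed = cell.trim();
    if trimmed.is_empty() {
        Ok(None)
    } else {
        trimmed.parse::<T>().map(Some)
    }
}
```

A missing value is therefore distinguishable from a malformed one: the first yields `Ok(None)`, the second an `Err`.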

@teburd
Author

teburd commented Oct 14, 2015

Certainly. In GTFS data, one file might look like...

agency_name,agency_url,agency_timezone
Chicago Transit Authority,http://www.cta.com,America/Chicago

Another might look like...

agency_id,agency_name,agency_url,agency_timezone
MTA,Metropolitan Transportation Authority,http://mta.info,America/NewYork

Notably, the agency_id column is optional according to the GTFS spec:
https://developers.google.com/transit/gtfs/reference#agency_fields

So in most cases it's entirely absent, but in some cases it is there.

@BurntSushi
Owner

Hmm, yeah, that will be tricky. You'll unfortunately need to roll your own handling for that sort of thing right now, because the decoder doesn't actually take field names into account. It's purely based on order, and since that kind of optional field changes the ordering, the order-based approach isn't workable.

I wonder if this may be helped by serde (which is the serialization framework I'll eventually move this crate to).

Having to decode each row into a hash map would be deeply unfortunate because I imagine that will slow processing down by at least an order of magnitude.
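One way to roll your own handling without building a hash map per row is to resolve column positions from the header once and then index into each row. The following is a rough sketch under stated assumptions: it uses only the standard library, splits naively on commas (no quoted fields), and the names `Agency`, `AgencyColumns`, `resolve_columns`, and `parse_agencies` are made up for illustration rather than taken from this crate.

```rust
#[derive(Debug, PartialEq)]
struct Agency {
    agency_id: Option<String>, // optional column in GTFS
    agency_name: String,
    agency_url: String,
    agency_timezone: String,
}

/// Column positions, looked up once from the header row.
struct AgencyColumns {
    id: Option<usize>,
    name: usize,
    url: usize,
    timezone: usize,
}

fn resolve_columns(header: &str) -> Result<AgencyColumns, String> {
    let names: Vec<&str> = header.split(',').map(str::trim).collect();
    let find = |wanted: &str| names.iter().position(|n| *n == wanted);
    let require = |wanted: &str| {
        find(wanted).ok_or_else(|| format!("missing required column `{}`", wanted))
    };
    Ok(AgencyColumns {
        id: find("agency_id"), // absent column is fine
        name: require("agency_name")?,
        url: require("agency_url")?,
        timezone: require("agency_timezone")?,
    })
}

/// Naive parser: assumes no quoted fields or embedded commas.
fn parse_agencies(data: &str) -> Result<Vec<Agency>, String> {
    let mut lines = data.lines().filter(|l| !l.trim().is_empty());
    let header = lines.next().ok_or_else(|| "empty input".to_string())?;
    let cols = resolve_columns(header)?;
    let mut agencies = Vec::new();
    for line in lines {
        let fields: Vec<&str> = line.split(',').collect();
        let get = |i: usize| -> Result<String, String> {
            fields
                .get(i)
                .map(|s| s.trim().to_string())
                .ok_or_else(|| format!("row has no field at index {}", i))
        };
        let agency_id = match cols.id {
            Some(i) => Some(get(i)?),
            None => None,
        };
        agencies.push(Agency {
            agency_id,
            agency_name: get(cols.name)?,
            agency_url: get(cols.url)?,
            agency_timezone: get(cols.timezone)?,
        });
    }
    Ok(agencies)
}
```

With this shape, the header lookup cost is paid once per file, and each row only needs plain index access.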

@teburd
Author

teburd commented Oct 14, 2015

Since serde works well with JSON, I imagine that if you follow a similar lead to the serde_json crate, at least as an optional method of decoding based on the CSV headers (versus the position-based decoding you're using now), it should work.

I appreciate the time you took to look at this issue, but yeah, for now it looks like a hashmap might be the best bet.
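For reference, the csv -> hashmap -> struct workaround could look roughly like this. It is only a sketch: it assumes the header and row have already been split into fields (here naively on commas, with no quoting support), and `row_to_map` and `agency_from_map` are hypothetical names, not functions from rust-csv.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Agency {
    agency_id: Option<String>, // optional column in GTFS
    agency_name: String,
    agency_url: String,
    agency_timezone: String,
}

/// Pair each header name with the field at the same position in the row.
/// Naive split: assumes no quoted fields or embedded commas.
fn row_to_map<'a>(header: &[&'a str], line: &'a str) -> HashMap<&'a str, &'a str> {
    header
        .iter()
        .copied()
        .zip(line.split(',').map(str::trim))
        .collect()
}

/// Build an `Agency` by looking fields up by name. A missing `agency_id`
/// becomes `None`; a missing required column is an error.
fn agency_from_map(row: &HashMap<&str, &str>) -> Result<Agency, String> {
    let required = |key: &str| -> Result<String, String> {
        row.get(key)
            .map(|v| v.to_string())
            .ok_or_else(|| format!("missing required column `{}`", key))
    };
    Ok(Agency {
        agency_id: row.get("agency_id").map(|v| v.to_string()),
        agency_name: required("agency_name")?,
        agency_url: required("agency_url")?,
        agency_timezone: required("agency_timezone")?,
    })
}
```

This builds one map per row, which is the overhead mentioned earlier in the thread, but it keeps the mapping code independent of column order.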

@abyrd

abyrd commented Oct 19, 2015

Hello @BurntSushi, thanks for all your work on this CSV library. I just wanted to add another voice in support of using the header row to determine the order of fields and their presence or absence. My use case is the same as @BFrog's. GTFS is an increasingly common CSV-based format.

BurntSushi added a commit that referenced this issue May 23, 2017
This commit contains a ground up rewrite of the CSV crate. Nothing
survived. This rewrite was long overdue. Namely, the API of the previous
version was initially designed 3 years ago, which was 1 year before Rust
1.0 was released.

The big changes:

1. Use a DFA to get nearly a factor of 2 speed improvement across the board.
2. Serde support, including matching headers with names in structs.
3. A new crate, csv-core, for parsing CSV without the standard library.

The performance improvements:

    name                                old ns/iter            new ns/iter            diff ns/iter   diff %  speedup
    count_game_deserialize_owned_bytes  30,404,805 (85 MB/s)   23,878,089 (108 MB/s)    -6,526,716  -21.47%   x 1.27
    count_game_deserialize_owned_str    30,431,169 (85 MB/s)   22,861,276 (113 MB/s)    -7,569,893  -24.88%   x 1.33
    count_game_iter_bytes               21,751,711 (119 MB/s)  11,873,257 (218 MB/s)    -9,878,454  -45.41%   x 1.83
    count_game_iter_str                 25,609,184 (101 MB/s)  13,769,390 (188 MB/s)   -11,839,794  -46.23%   x 1.86
    count_game_read_bytes               12,110,082 (214 MB/s)  6,686,121 (388 MB/s)     -5,423,961  -44.79%   x 1.81
    count_game_read_str                 15,497,249 (167 MB/s)  8,269,207 (314 MB/s)     -7,228,042  -46.64%   x 1.87
    count_mbta_deserialize_owned_bytes  5,779,138 (125 MB/s)   3,775,874 (191 MB/s)     -2,003,264  -34.66%   x 1.53
    count_mbta_deserialize_owned_str    5,777,055 (125 MB/s)   4,353,921 (166 MB/s)     -1,423,134  -24.63%   x 1.33
    count_mbta_iter_bytes               3,991,047 (181 MB/s)   1,805,387 (400 MB/s)     -2,185,660  -54.76%   x 2.21
    count_mbta_iter_str                 4,726,647 (153 MB/s)   2,354,842 (307 MB/s)     -2,371,805  -50.18%   x 2.01
    count_mbta_read_bytes               2,690,641 (268 MB/s)   1,253,111 (577 MB/s)     -1,437,530  -53.43%   x 2.15
    count_mbta_read_str                 3,399,631 (212 MB/s)   1,743,035 (415 MB/s)     -1,656,596  -48.73%   x 1.95
    count_nfl_deserialize_owned_bytes   10,608,513 (128 MB/s)  5,828,747 (234 MB/s)     -4,779,766  -45.06%   x 1.82
    count_nfl_deserialize_owned_str     10,612,366 (128 MB/s)  6,814,770 (200 MB/s)     -3,797,596  -35.78%   x 1.56
    count_nfl_iter_bytes                6,798,767 (200 MB/s)   2,564,448 (532 MB/s)     -4,234,319  -62.28%   x 2.65
    count_nfl_iter_str                  7,888,662 (172 MB/s)   3,579,865 (381 MB/s)     -4,308,797  -54.62%   x 2.20
    count_nfl_read_bytes                4,588,369 (297 MB/s)   1,911,120 (714 MB/s)     -2,677,249  -58.35%   x 2.40
    count_nfl_read_str                  5,755,926 (237 MB/s)   2,847,833 (479 MB/s)     -2,908,093  -50.52%   x 2.02
    count_pop_deserialize_owned_bytes   11,052,436 (86 MB/s)   8,848,364 (108 MB/s)     -2,204,072  -19.94%   x 1.25
    count_pop_deserialize_owned_str     11,054,638 (86 MB/s)   9,184,678 (104 MB/s)     -1,869,960  -16.92%   x 1.20
    count_pop_iter_bytes                6,190,345 (154 MB/s)   3,110,704 (307 MB/s)     -3,079,641  -49.75%   x 1.99
    count_pop_iter_str                  7,679,804 (124 MB/s)   4,274,842 (223 MB/s)     -3,404,962  -44.34%   x 1.80
    count_pop_read_bytes                3,898,119 (245 MB/s)   2,218,535 (430 MB/s)     -1,679,584  -43.09%   x 1.76
    count_pop_read_str                  5,195,237 (183 MB/s)   3,209,998 (297 MB/s)     -1,985,239  -38.21%   x 1.62

The rewrite/redesign was largely fueled by two things:

1. Reorganizing the API to permit performance improvements. For example,
   the lower level APIs now operate on entire records instead of
   one-field-at-a-time.
2. Fix a large number of outstanding issues.

Fixes #16, Fixes #28, Fixes #29, Fixes #32, Fixes #33, Fixes #36,
Fixes #39, Fixes #42, Fixes #44, Fixes #46, Fixes #49, Fixes #52,
Fixes #56, Fixes #59, Fixes #67