Accept whitespace trimming settings #97

Closed. Wants to merge 1 commit into base: master.
2 participants
medwards (Contributor) commented Nov 13, 2017

Fixes #78

Per my comment in #78, I'm popping whitespace and updating the record bounds to reflect the changes.

Some thoughts:

  • I don't know whether match or if statements are more idiomatic in this case. Let me know if you have an opinion.
  • If the user turns off headers then I apply trimming to the first row if Trim::All or Trim::Fields are set. I think this is reasonable but I could imagine that maybe Trim::Headers should be Trim::FirstRow or something in which case it isn't reasonable.
  • Just now I was thinking maybe Trim::Fields should be Trim::Records instead.
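To make the settings discussed above concrete, here is a std-only sketch of how a Trim setting could gate per-record trimming. The variant names follow the discussion in this thread; the plain Vec<String> "record", the function names, and the is_header_row flag are hypothetical stand-ins for the crate's actual types, not the PR's implementation.

```rust
// Illustrative sketch only: a Trim setting gating which rows get trimmed.
// Variant names follow the PR discussion; everything else is hypothetical.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
enum Trim {
    None,
    Headers, // trim only the header row
    Fields,  // trim only non-header rows
    All,     // trim every row
}

fn trim_all_fields(fields: &mut Vec<String>) {
    for f in fields.iter_mut() {
        *f = f.trim().to_string();
    }
}

fn apply_trim(trim: Trim, is_header_row: bool, fields: &mut Vec<String>) {
    match trim {
        Trim::All => trim_all_fields(fields),
        Trim::Headers if is_header_row => trim_all_fields(fields),
        Trim::Fields if !is_header_row => trim_all_fields(fields),
        _ => {}
    }
}

fn main() {
    let mut row = vec![" a ".to_string(), "\tb".to_string()];
    apply_trim(Trim::Fields, false, &mut row);
    assert_eq!(row, vec!["a", "b"]);
}
```

Under this sketch, the question raised in the second bullet is visible in the guards: with headers disabled, every row takes the `!is_header_row` path, so Trim::Fields would trim the first row too.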
medwards (Contributor) commented Nov 14, 2017

Added some updates; it's much cleaner and a bit faster.

I've also done some performance analysis now. Here is master:
test count_nfl_read_bytes ... bench: 1,969,508 ns/iter (+/- 97,802) = 692 MB/s
And if you add trimming you get
test count_nfl_read_bytes ... bench: 3,838,782 ns/iter (+/- 601,554) = 355 MB/s
(full bench results)

Averaged across all benchmarks, the performance impact is roughly 46% slower (the error bars might move that by up to 10%, not helped by weird things like the write benchmarks being mysteriously slower too).


medwards (Contributor) commented Nov 20, 2017

@BurntSushi: Hey, I know you're probably busy with impl period stuff, but I was hoping you could take a preliminary look at this and at least give me an acceptable/unacceptable in principle. I'm happy to come up with an alternative approach, but I'll have really spotty internet in a week, so if it's necessary I'd like to figure out the high-level stuff this week.


BurntSushi (Owner) reviewed Nov 20, 2017

@medwards Thanks so much for doing this! I know this is hard work, and I'm sorry it took me so long to look at this, but PRs implementing tricky tasks like this are time consuming to review.

Overall, most of my comments on the code are pretty minor, but the bigger piece of feedback here is: should trimming only account for ASCII spaces in string records? It makes sense for byte records, but for string records, it kind of seems like we should trim all kinds of whitespace, as defined by Unicode.

Similarly, if we offer a trim method on ByteRecord, then we should also offer a trim method on StringRecord. This may be a little tricky to implement in a way that's fast, so I'd suggest picking a slow but convenient implementation if possible. I can try to help with that...
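The ASCII-versus-Unicode distinction raised here can be sketched in std-only Rust. `trim_ascii` below is a hypothetical helper, not the crate's API, and it uses `u8::is_ascii_whitespace` (which matches space, tab, newline, form feed, and carriage return); `str::trim` already removes anything with Unicode's White_Space property.

```rust
// Sketch of the review point: byte records can only trim ASCII whitespace,
// while string records can use str::trim, which follows Unicode rules.
// `trim_ascii` is a hypothetical helper, not the csv crate's API.
fn trim_ascii(bytes: &[u8]) -> &[u8] {
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let end = bytes
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &bytes[start..end]
}

fn main() {
    assert_eq!(trim_ascii(b"\t foo \r\n"), &b"foo"[..]);
    // U+00A0 (no-break space) is Unicode whitespace but not ASCII, so a
    // byte-level trim misses it while str::trim removes it:
    assert_eq!("\u{00A0}foo\u{00A0}".trim(), "foo");
}
```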

[Outdated review comments on csv-core/src/lib.rs and csv-core/src/reader.rs]
@@ -417,6 +417,52 @@ impl ByteRecord {
self.truncate(0);
}
/// Trim the fields of this record so that leading and trailing whitespace is removed.
///
/// Note that the whitespace trimmed is currently only the ASCII space and tab.

BurntSushi (Owner) commented on the diff, Nov 20, 2017

I suspect this should include all ASCII space characters. According to Unicode, that's [\u0009-\u000D] and \u0020.
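The character class named here, [\u0009-\u000D] plus \u0020, fits in a one-line byte predicate. The helper name `is_ascii_space` is hypothetical, not the crate's API.

```rust
// The class from the review comment: \u0009 through \u000D, plus \u0020.
// Note this includes U+000B (vertical tab), which u8::is_ascii_whitespace
// does not match. `is_ascii_space` is a hypothetical helper name.
fn is_ascii_space(b: u8) -> bool {
    matches!(b, 0x09..=0x0D | b' ')
}

fn main() {
    assert!(is_ascii_space(b'\t'));
    assert!(is_ascii_space(0x0B)); // vertical tab, included in the range
    assert!(is_ascii_space(b' '));
    assert!(!is_ascii_space(b'a'));
}
```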


[Outdated review comments on src/byte_record.rs and src/reader.rs]
medwards (Contributor) commented Nov 21, 2017

@BurntSushi: Hey man, no worries. We're all pitching in with the time we got, I'm just trying to schedule optimally within that ;) I especially appreciate the detailed code review. I'll address these once I'm on planes and stuff but FYI you might not see the result til after Dec 10.

As for the high-level stuff:

  • re: just trimming ASCII whitespace - aha, I misread your comment and thought you were limiting the scope to just ByteRecord trimming.
  • re: core_trim - at first I did it because I thought I would be making the change in csv-core; after that turned out not to be true, it felt kind of like polluting the non-core reader to add it (all of the knobs currently seem to be in csv-core). I'm fine either way, and it should be easy.
  • re: more tests for ByteRecord.trim - are the doctests insufficient? Several (all?) of the cases you list are covered by the doctest, and I got the sense that several functions in ByteRecord had their core use cases tested through doctests, with just the weird edge cases going into unit tests. Let me know either way.

While we're at it, do you want a benchmark too, rather than my ad hoc benchmarking?


BurntSushi (Owner) commented Nov 21, 2017

re: just trimming ASCII whitespace - aha I misread your comment and thought you were limiting the scope to just ByteRecord trimming.

Ah yeah sorry about the confusion. I think the issue here is that we're introducing a trim option on the reader itself, and if we're going to go that route, then we need to go all the way and implement proper Unicode trimming for StringRecord as well.

There does exist an incremental step here, should you wish to take it. You could get rid of the trim option on the reader and instead just focus on providing the trim method on ByteRecord. We could then incrementally add trim to StringRecord, and then finally add it to the reader. It's up to you though!

re: core_trim - at first I did it because I thought I would be making the change in csv-core; after that turned out not to be true, it felt kind of like polluting the non-core reader to add it (all of the knobs currently seem to be in csv-core). I'm fine either way, and it should be easy.

Yup I definitely don't see this as pollution. It just so happens that most of the knobs exist at the csv core level, but not all knobs do! For example, the flexible and header knobs exist at the csv crate level.

re: more tests for ByteRecord.trim - are the doctests insufficient? Several (all?) of the cases you list are covered by the doctest, and I got the sense that several functions in ByteRecord had their core use cases tested through doctests, with just the weird edge cases going into unit tests. Let me know either way.

Oh hmm I missed those. They don't cover all the cases though. The one I don't see is a test for the empty record. It might also be good to add more variations with 1-field record vs multi-field records.

While we're at it, do you want a benchmark too rather than my adhoc benchmarking?

That'd be great! But it's strictly bonus. :-)

FWIW, I'm totally OK with this option being "slow" in the initial implementation. We can always improve it later. (And a benchmark would help with that.)


medwards (Contributor) commented Nov 22, 2017

There does exist an incremental step here, should you wish to take it

Naw, I want to be able to use this, so I'll take it to the finish line and make sure it's exposed.

I don't see is a test for the empty record. It might also be good to add more variations with 1-field record vs multi-field records.

K, I'll try to think of some single-field edge cases. Empty record is literally just an empty string? With multiple fields?


BurntSushi (Owner) commented Nov 22, 2017

Empty record is literally just an empty string? With multiple fields?

Either, I suppose. The goal here should be to test trim on any valid record value.


medwards (Contributor) commented Nov 29, 2017

Did this on the plane. Needs a lot more tests but I am curious to see whether the recursive approach in StringRecord.trim could be used in ByteRecord and what the performance difference is.

Some thoughts I'd like comments on now:

  • The trim setting is in ReaderState with similar settings, but it feels weird there. Just less weird than on Reader, which is pretty bare.
  • I think I got set_headers_impl hashed out. It's really hard to understand what it's doing, and the Result that contains two potential input values doesn't help. Is Result here being used as a bootstrap for an "either/or" struct?
    • NOTE: UTF-8 errors in str headers can have a valid_to value that doesn't sync up with the byte version of the headers (see the unit test for an example). I think we need to think about this (or we could just cover it with appropriate documentation in Utf8Error).
  • Every StringRecord that gets trimmed first gets trimmed as a ByteRecord as part of string_record::read, so there is some duplicated and wasted effort. Is this OK for now?

unit tests and benchmarks coming...
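The "Result as either/or" question above is about using Result<A, B> where Err is not actually an error, just the second of two possible inputs. A dedicated enum usually states that intent more clearly; here is a std-only sketch with hypothetical stand-in types (plain &str and &[u8] rather than the crate's StringRecord and ByteRecord).

```rust
// An "either of two inputs" type spelled out as an enum instead of
// overloading Result. The types are hypothetical stand-ins, not the
// crate's StringRecord/ByteRecord.
enum HeaderInput<'a> {
    Str(&'a str),
    Bytes(&'a [u8]),
}

fn header_byte_len(h: HeaderInput) -> usize {
    match h {
        HeaderInput::Str(s) => s.len(),
        HeaderInput::Bytes(b) => b.len(),
    }
}

fn main() {
    assert_eq!(header_byte_len(HeaderInput::Str("a,b")), 3);
    assert_eq!(header_byte_len(HeaderInput::Bytes(b"a,b,c")), 5);
}
```

Compared with Result<&str, &[u8]>, the enum avoids implying that the bytes case is a failure, which seems to be the readability concern raised here.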


medwards (Contributor) commented Nov 30, 2017

@BurntSushi I think I'm ready for another review.

There are some notes in my previous comment, as to the new stuff:

  • Tell me if the unit tests are comprehensive enough for you. I could have gotten into single-record "trim left/trim right/ignore internal whitespace" tests, but I felt this was excessive given the doctests. Let me know either way.
  • I only added benchmarks to one of the sample sets; let me know where you want them, or whether you want them for every set:
test count_mbta_iter_bytes                 ... bench:   1,980,153 ns/iter (+/- 14,208) = 365 MB/s
test count_mbta_iter_bytes_trimmed         ... bench:   3,484,228 ns/iter (+/- 23,231) = 207 MB/s
test count_mbta_iter_str                   ... bench:   2,527,966 ns/iter (+/- 9,659) = 286 MB/s
test count_mbta_iter_str_trimmed           ... bench:   7,438,755 ns/iter (+/- 112,220) = 97 MB/s
medwards (Contributor) commented Dec 1, 2017

I did the experiment I mentioned above:
The current ByteRecord.trim implementation is about 36% faster than a recursive version and isn't much less readable. I've left the current version in the PR.


medwards (Contributor) commented Jan 10, 2018

@BurntSushi Hey man, hope you had a swell New Year's :) Is there anything left here for me to do?


BurntSushi (Owner) commented Jan 30, 2018

@medwards I wanted to try to get this merged, so I took this over the finish line. I hope you don't mind. You can see my PR that is based off your work in #106.

Many of my changes were stylistic, but there were a few logical changes as well:

  • When reading a header, byte records should only be trimmed via ASCII whitespace rules. I believe this PR would check to see if there was a string record, trim that, and then set the byte record to that result.
  • Recursion in StringRecord's implementation of trim just isn't viable. I dare say that recursion anywhere in this library is not viable. :-) I modified it to use a simple for loop.
  • I added a few more tests.
  • I added more documentation, particularly on the CSV reader builder, including an example.
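A loop-based per-field trim in the spirit of the change described above might look like the following. This is a sketch over plain string slices, not the crate's StringRecord, whose real implementation differs (it works over the record's internal buffer rather than allocating a new Vec).

```rust
// Loop-based field trimming: one pass over the fields, no recursion.
// Sketched over &[&str] rather than the crate's StringRecord.
fn trim_fields(record: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(record.len());
    for field in record {
        out.push(field.trim().to_string());
    }
    out
}

fn main() {
    assert_eq!(trim_fields(&[" a ", "b\t", ""]), vec!["a", "b", ""]);
    assert_eq!(trim_fields(&[]), Vec::<String>::new());
}
```

The empty-record and single-field cases discussed earlier in the thread fall out naturally here: an empty slice yields an empty Vec, and a one-field record is just a one-iteration loop.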

BurntSushi closed this Jan 30, 2018

medwards (Contributor) commented Feb 2, 2018

Hrm, yeah, I understand you wanting to push it over the finish line, as I felt the same way. Regarding the headers, it's probably best if you fiddle with it; looking at my old diff, I'm not even sure what I was doing in that section of code. At minimum, please see my comment above about set_headers_impl needing readability improvements.

