Accept whitespace trimming settings #97
Conversation
98d40f6 to
b9baeee
Compare
|
Added some updates; it's much cleaner and a bit faster. I've also done some performance analysis now. Here is master: The average performance impact across all benchmarks is roughly 46% slower (the error bars might be up to 10% off that, not helped by weird things like write benchmarks being mysteriously slower too).
|
@BurntSushi: Hey, I know you're probably busy with impl period stuff, but I was hoping you could take a preliminary look at this and at least give me an acceptable/unacceptable in principle. I'm happy to come up with an alternative approach, but I'll have really spotty internet in a week, so if it's necessary I'd like to figure out the high-level stuff this week.
BurntSushi
left a comment
@medwards Thanks so much for doing this! I know this is hard work, and I'm sorry it took me so long to look at this, but PRs implementing tricky tasks like this are time-consuming to review.
Overall, most of my comments on the code are pretty minor, but the bigger piece of feedback here is: should trimming only account for ASCII spaces in string records? It makes sense for byte records, but for string records, it kind of seems like we should trim all kinds of whitespace, as defined by Unicode.
Similarly, if we offer a trim method on ByteRecord, then we should also offer a trim method on StringRecord. This may be a little tricky to implement in a way that's fast, so I'd suggest picking a slow but convenient implementation if possible. I can try to help with that...
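To make the ASCII versus Unicode distinction concrete, here is a minimal sketch using only the standard library. The function names `trim_ascii` and `trim_unicode` are hypothetical and are not part of the csv crate. `str::trim` removes everything with the Unicode `White_Space` property, while the byte-level helper only sees ASCII bytes (and `u8::is_ascii_whitespace` in particular does not treat vertical tab as whitespace).

```rust
/// Trim leading and trailing ASCII whitespace from a byte field.
/// Hypothetical helper for illustration; not the csv crate's API.
fn trim_ascii(field: &[u8]) -> &[u8] {
    // Index of the first non-whitespace byte, or the length if there is none.
    let start = field
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(field.len());
    // One past the last non-whitespace byte; collapses to `start` when the
    // field is empty or entirely whitespace.
    let end = field
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(start, |i| i + 1);
    &field[start..end]
}

/// Trim Unicode whitespace from a string field.
/// `str::trim` follows the Unicode `White_Space` property.
fn trim_unicode(field: &str) -> &str {
    field.trim()
}
```

With these helpers, a field such as `"\u{2003}a "` (EM SPACE, `a`, space) keeps its leading EM SPACE under `trim_ascii` but loses it under `trim_unicode`, which is the behavioural gap the review comment is pointing at.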
| /// Trim whitespace from headers | ||
| Headers, | ||
| /// Trim whitespace from fields | ||
| Fields, |
| /// The terminator that separates records. | ||
| term: Terminator, | ||
| /// Whether to trim fields or headers | ||
| pub trim: Trim, |
| /// The whitespace preservation behaviour when reading CSV data. | ||
| #[derive(Clone, Copy, Debug, PartialEq)] | ||
| pub enum Trim { |
I'm a little surprised that this type is in csv-core. In particular, csv-core doesn't know about things like headers, and more importantly, this type isn't even used in csv-core at all. It seems like you set it as a knob on the reader, but you never actually use it anywhere.
(To be clear, I don't think csv-core should implement trimming either. It should strictly be in the csv crate.)
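As a rough sketch of what keeping the setting entirely in the csv crate could look like, the enum below carries two small predicate methods so the reader does not need to repeat the `== Fields || == All` comparison. The `None` variant and the method names `should_trim_fields` and `should_trim_headers` are assumptions made for illustration; only `Headers`, `Fields` and `All` appear in the diff under review.

```rust
/// The whitespace preservation behaviour when reading CSV data.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Trim {
    /// Preserve fields and headers as-is (assumed variant, not in the diff).
    None,
    /// Trim whitespace from headers.
    Headers,
    /// Trim whitespace from fields.
    Fields,
    /// Trim whitespace from both headers and fields.
    All,
}

impl Trim {
    /// True when ordinary records should be trimmed (hypothetical helper).
    pub fn should_trim_fields(self) -> bool {
        matches!(self, Trim::Fields | Trim::All)
    }

    /// True when the header record should be trimmed (hypothetical helper).
    pub fn should_trim_headers(self) -> bool {
        matches!(self, Trim::Headers | Trim::All)
    }
}
```

A reader holding a `Trim` value could then branch on `trim.should_trim_fields()` at each call site.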
| /// Trim the fields of this record so that leading and trailing whitespace is removed. | ||
| /// | ||
| /// Note that the whitespace trimmed is currently only the ASCII space and tab. |
I suspect this should include all ASCII space characters. According to Unicode, that's [\u0009-\u000D] and \u0020.
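A predicate for exactly that set might look like the sketch below. The name `is_ascii_space` is hypothetical. Note that the standard library's `u8::is_ascii_whitespace` is not quite the same set, since it excludes U+000B (vertical tab).

```rust
/// Returns true for the ASCII bytes that carry the Unicode `White_Space`
/// property: U+0009 through U+000D, plus U+0020.
/// Hypothetical helper name, for illustration only.
fn is_ascii_space(b: u8) -> bool {
    // b'\t' is 0x09 and b'\r' is 0x0D, so the range covers
    // tab, line feed, vertical tab, form feed and carriage return.
    matches!(b, b'\t'..=b'\r' | b' ')
}
```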
| fn count_leading_whitespace<R: Iterator<Item=usize>>(&self, range: R) -> usize { | ||
| let mut count = 0; | ||
| for i in range { | ||
| let field_char = self.0.fields[i] as char; |
We're dealing with ASCII here, so there's no need to cast to a scalar value. You can use, e.g., b'\t' instead of '\t'.
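For illustration, the counting loop can compare bytes directly against byte literals. This is a standalone sketch: it takes the buffer as a parameter instead of reading `self.0.fields`, and it matches only space and tab, as the doc comment in the diff describes.

```rust
/// Count how many consecutive bytes, visited in the order given by `range`,
/// are ASCII space or tab. Standalone sketch of the method in the diff.
fn count_leading_whitespace<R: Iterator<Item = usize>>(buf: &[u8], range: R) -> usize {
    let mut count = 0;
    for i in range {
        // Compare the raw byte against byte literals; no cast to `char`.
        match buf[i] {
            b' ' | b'\t' => count += 1,
            _ => break,
        }
    }
    count
}
```

Passing a reversed range, such as `(0..buf.len()).rev()`, counts trailing whitespace with the same function.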
| if let Some(ref headers) = self.state.headers { | ||
| self.state.first = true; | ||
| record.clone_from(&headers.byte_record); | ||
| if self.core.trim == CoreTrim::Fields || self.core.trim == CoreTrim::All { |
| // read and return the next one. | ||
| if self.state.has_headers { | ||
| return self.read_byte_record_impl(record); | ||
| let ret_val = self.read_byte_record_impl(record); |
Call this let result = ... please.
| if self.state.has_headers { | ||
| return self.read_byte_record_impl(record); | ||
| let ret_val = self.read_byte_record_impl(record); | ||
| if self.core.trim == CoreTrim::Fields || self.core.trim == CoreTrim::All { |
| } | ||
| return ret_val; | ||
| } | ||
| } else if self.core.trim == CoreTrim::Fields || self.core.trim == CoreTrim::All { |
| } | ||
| #[test] | ||
| fn read_trimmed_records() { |
|
@BurntSushi: Hey man, no worries. We're all pitching in with the time we've got, I'm just trying to schedule optimally within that ;) I especially appreciate the detailed code review. I'll address these once I'm on planes and stuff, but FYI you might not see the result til after Dec 10. As for the high-level stuff: While we're at it, do you want a benchmark too rather than my ad hoc benchmarking?
Ah yeah sorry about the confusion. I think the issue here is that we're introducing a There does exist an incremental step here, should you wish to take it. You could get rid of the
Yup I definitely don't see this as pollution. It just so happens that most of the knobs exist at the csv core level, but not all knobs do! For example, the flexible and header knobs exist at the
Oh hmm I missed those. They don't cover all the cases though. The one I don't see is a test for the empty record. It might also be good to add more variations with 1-field record vs multi-field records.
That'd be great! But it's strictly bonus. :-) FWIW, I'm totally OK with this option being "slow" in the initial implementation. We can always improve it later. (And a benchmark would help with that.) |
Naw, I want to be able to use this, so I'll take it to the finish line and make sure it's exposed.
K, I'll try to think of some single field edge cases. Empty record is literally just an empty string? With multiple fields? |
Either, I suppose. The goal here should be to test |
b9baeee to
288acc1
Compare
|
Did this on the plane. Needs a lot more tests but I am curious to see whether the recursive approach in StringRecord.trim could be used in ByteRecord and what the performance difference is. Some thoughts I'd like comments on now:
unit tests and benchmarks coming... |
288acc1 to
38c8630
Compare
|
@BurntSushi I think I'm ready for another review. There are some notes in my previous comment, as to the new stuff:
|
|
I did the experiment I mentioned above: |
|
@BurntSushi Hey man, hope you had a swell New Years :) Is there anything here left for me to do? |
|
@medwards I wanted to try to get this merged, so I took this over the finish line. I hope you don't mind. You can see my PR that is based off your work in #106. Many of my changes were stylistic, but there were a few logical changes as well:
|
|
Hrm, yeah, I feel you wanting to push it over the finish line as I felt the same way. Regarding the headers at least, it's probably best if you fiddle with it. Looking at my old diff I'm not even sure what I was doing in that section of code. At minimum please refer to my comment above regarding |
Fixes #78
Per my comment in #78, I'm popping whitespace and updating the record bounds to reflect the changes.
Some thoughts:
Trim::All or Trim::Fields are set. I think this is reasonable but I could imagine that maybe Trim::Headers should be Trim::FirstRow or something in which case it isn't reasonable. Trim::Fields should be Trim::Records instead.
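The "popping whitespace and updating the record bounds" idea can be sketched on a simplified record that stores all fields in one contiguous buffer plus a list of end offsets. `FlatRecord` and its methods are hypothetical stand-ins, not the crate's `ByteRecord`, and only space and tab are treated as whitespace here.

```rust
/// Simplified stand-in for a byte record: one flat buffer plus the end
/// offset of each field. Hypothetical type, for illustration only.
struct FlatRecord {
    fields: Vec<u8>,
    ends: Vec<usize>,
}

impl FlatRecord {
    fn from_fields(fields: &[&str]) -> FlatRecord {
        let mut rec = FlatRecord { fields: Vec::new(), ends: Vec::new() };
        for f in fields {
            rec.fields.extend_from_slice(f.as_bytes());
            rec.ends.push(rec.fields.len());
        }
        rec
    }

    /// Return field `i`, or `None` if the index is out of range.
    fn get(&self, i: usize) -> Option<&[u8]> {
        let end = *self.ends.get(i)?;
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        Some(&self.fields[start..end])
    }

    /// Trim space and tab from both sides of every field, in place,
    /// compacting the buffer and rewriting the bounds as it goes.
    fn trim(&mut self) {
        let is_space = |b: u8| b == b' ' || b == b'\t';
        let mut read_start = 0; // start of the current field in the old layout
        let mut write = 0; // next free position in the compacted layout
        for i in 0..self.ends.len() {
            let read_end = self.ends[i];
            let (mut s, mut e) = (read_start, read_end);
            while s < e && is_space(self.fields[s]) {
                s += 1;
            }
            while e > s && is_space(self.fields[e - 1]) {
                e -= 1;
            }
            // `write <= s` always holds, so this never clobbers unread bytes.
            self.fields.copy_within(s..e, write);
            write += e - s;
            self.ends[i] = write;
            read_start = read_end;
        }
        self.fields.truncate(write);
    }
}
```

This version makes a single pass and never allocates, at the cost of more index bookkeeping than rebuilding the record field by field.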