Conversation

davidanthoff
Member

Before I go on with this, I wanted to get some feedback whether folks are ok with this. If we limit things like delim to ASCII characters, we can speed up things by quite a bit again. I tested this branch with my comprehensive test suite, and while I don't remember the precise improvement, it was pretty sizable.

If folks are ok with that, I would probably also change some of the other special characters to ASCII only, which should give us a bit more performance, and then it would be ready to merge.

@shashi any thoughts?

@codecov-io

codecov-io commented Nov 29, 2018

Codecov Report

Merging #88 into master will increase coverage by 0.48%.
The diff coverage is 84.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   74.76%   75.24%   +0.48%     
==========================================
  Files          11       11              
  Lines        1153     1224      +71     
==========================================
+ Hits          862      921      +59     
- Misses        291      303      +12
Impacted Files Coverage Δ
src/field.jl 77.66% <100%> (-1.22%) ⬇️
src/csv.jl 78.99% <100%> (ø) ⬆️
src/lib/date-tryparse-internal.jl 65.21% <100%> (ø) ⬆️
src/guesstype.jl 84.48% <66.66%> (ø) ⬆️
src/utf8optimizations.jl 78.19% <81.08%> (+1.87%) ⬆️
src/util.jl 69.23% <0%> (+3.84%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e34a12a...928ec8d. Read the comment docs.

@shashi
Collaborator

shashi commented Nov 29, 2018

Can this be done in a way that does not require ASCII? Like maybe with an AsciiChar type that we use if the given delim is in fact ascii.

@shashi
Collaborator

shashi commented Nov 29, 2018

But that does mean we pay compilation cost

@davidanthoff davidanthoff force-pushed the fieldparsing-opt branch 3 times, most recently from 8a1b708 to aaed63c Compare December 24, 2018 17:08
@davidanthoff davidanthoff changed the title WIP Use UInt8 for delim Use UInt8 for delim Dec 24, 2018
@davidanthoff davidanthoff changed the title Use UInt8 for delim WIP Use UInt8 for delim Dec 25, 2018
@davidanthoff davidanthoff changed the title WIP Use UInt8 for delim Optionally use UInt8 for various chars Dec 25, 2018
@davidanthoff
Member Author

Alright, this is ready for review. It needs a VERY careful review, because I'm quite nervous about some of the changes I made, in particular about how the global config for things like quotechar etc. is handled, and I'm not 100% sure I got it right...

The main difference here is how I've changed the handling of all "global default" choices for quotechar, escapechar and delim. Essentially, in the old implementation, the functions that returned either the value from the opts instance or the Quoted instance always ended up being type-unstable once we allow for either Char or UInt8. In this new implementation, one always has to set these values explicitly, and I'm not 100% sure I got that right everywhere...
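To illustrate the type-instability issue described above, here is a minimal sketch (with hypothetical names, not the actual TextParse internals): when a field can hold either a Char or a UInt8, any accessor returning it is type-unstable, whereas parameterizing the wrapper type on the character's concrete type keeps inference happy.

```julia
# Hypothetical sketch of the type-instability problem; names are illustrative,
# not TextParse's actual code.

struct Opts
    quotechar::Union{Char, UInt8}   # field can hold either type
end

# Type-unstable: the compiler cannot know whether this returns Char or UInt8,
# so call sites in a hot parsing loop fall back to dynamic dispatch.
getquotechar(opts::Opts) = opts.quotechar

# Type-stable alternative: carry the character in a type parameter, so each
# parser instance is specialized on a concrete Char or UInt8.
struct Quoted{T}
    quotechar::T
end

getquotechar(q::Quoted{T}) where {T} = q.quotechar  # inferred as T
```

The cost of the parameterized version is a separate compiled method instance per concrete type, which is the compilation-cost trade-off discussed below.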

Having said that, in terms of performance, things are looking very good. With this PR here and #95, I get the following performance numbers, which are excellent:
[benchmark plot: cols_20_withoutna]

The benchmarks with missing data look slightly less good (and are not posted here), but then we can still give the NAToken the UTF8 treatment, so I'm quite positive we'll get that under control as well.

But the short version of a long story is that we are now essentially on par with R's fread single threaded. We'll probably get a little bit worse again once we address #79, but the profiler also indicated a number of other areas where there might still be room for improvement.

Can this be done in a way that does not require ASCII? Like maybe with an AsciiChar type that we use if the given delim is in fact ascii.

Done, one can now either use a Char or a UInt8 for all these special characters. If one uses ASCII chars, it will switch to the faster implementation based on UInt8.
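A minimal sketch of that selection logic (hypothetical function name, not the actual TextParse API): an ASCII Char is narrowed to UInt8 at construction time so the parser specializes on the fast byte-comparison path, while a non-ASCII Char stays on the generic path.

```julia
# Hypothetical sketch of choosing the fast path; not TextParse's actual code.

# If the user passes an ASCII Char, narrow it to a UInt8 so the parser
# specializes on byte comparisons; otherwise keep the generic Char path.
narrowdelim(c::Char) = isascii(c) ? UInt8(c) : c
narrowdelim(b::UInt8) = b

narrowdelim(',')   # ASCII: becomes 0x2c, takes the UInt8 code path
narrowdelim('λ')   # non-ASCII: stays a Char, takes the generic code path
```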

But that does mean we pay compilation cost

I think like 98% of users will use ASCII characters for this, and so in all those cases we'll end up with one specialized code path that uses the UInt8 version of the code. If someone uses a non-ASCII character, yes, then there will be extra compile overhead, but I feel that is an ok trade-off, given how rare that should be.

@shashi
Collaborator

shashi commented Dec 25, 2018

Awesome!! Is there a comparison of how much speed up this PR gives over TextParse master? I'm just curious.

@davidanthoff
Member Author

Awesome!! Is there a comparison of how much speed up this PR gives over TextParse master? I'm just curious.

master:

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.391713 seconds (55.36 k allocations: 218.195 MiB, 6.16% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.277756 seconds (39.16 k allocations: 217.360 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.277830 seconds (39.16 k allocations: 217.360 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.452584 seconds (39.17 k allocations: 217.360 MiB, 9.62% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.293188 seconds (39.16 k allocations: 217.360 MiB)

This PR:

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.031143 seconds (54.06 k allocations: 218.152 MiB, 10.52% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.889580 seconds (37.86 k allocations: 217.317 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.882089 seconds (37.86 k allocations: 217.317 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.095873 seconds (37.87 k allocations: 217.317 MiB, 13.29% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.889244 seconds (37.86 k allocations: 217.317 MiB)

@shashi shashi merged commit 48fa957 into queryverse:master Jan 3, 2019
@shashi
Collaborator

shashi commented Jan 3, 2019

Good stuff @davidanthoff !

@davidanthoff davidanthoff deleted the fieldparsing-opt branch January 3, 2019 23:18