Conversation

davidanthoff
Member

Before I go on with this, I wanted to get some feedback whether folks are ok with this. If we limit things like delim to ASCII characters, we can speed up things by quite a bit again. I tested this branch with my comprehensive test suite, and while I don't remember the precise improvement, it was pretty sizable.

If folks are ok with that, I would probably also change some of the other special characters to ASCII only, which should give us a bit more performance, and then it would be ready to merge.

@shashi any thoughts?

@codecov-io

codecov-io commented Nov 29, 2018

Codecov Report

Merging #88 into master will increase coverage by 0.48%.
The diff coverage is 84.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #88      +/-   ##
==========================================
+ Coverage   74.76%   75.24%   +0.48%     
==========================================
  Files          11       11              
  Lines        1153     1224      +71     
==========================================
+ Hits          862      921      +59     
- Misses        291      303      +12
Impacted Files Coverage Δ
src/field.jl 77.66% <100%> (-1.22%) ⬇️
src/csv.jl 78.99% <100%> (ø) ⬆️
src/lib/date-tryparse-internal.jl 65.21% <100%> (ø) ⬆️
src/guesstype.jl 84.48% <66.66%> (ø) ⬆️
src/utf8optimizations.jl 78.19% <81.08%> (+1.87%) ⬆️
src/util.jl 69.23% <0%> (+3.84%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e34a12a...928ec8d. Read the comment docs.

@shashi
Collaborator

shashi commented Nov 29, 2018

Can this be done in a way that does not require ASCII? Like maybe with an AsciiChar type that we use if the given delim is in fact ascii.

@shashi
Collaborator

shashi commented Nov 29, 2018

But that does mean we pay compilation cost

@davidanthoff davidanthoff force-pushed the fieldparsing-opt branch 3 times, most recently from 8a1b708 to aaed63c Compare December 24, 2018 17:08
@davidanthoff davidanthoff changed the title WIP Use UInt8 for delim Use UInt8 for delim Dec 24, 2018
@davidanthoff davidanthoff changed the title Use UInt8 for delim WIP Use UInt8 for delim Dec 25, 2018
@davidanthoff davidanthoff changed the title WIP Use UInt8 for delim Optionally use UInt8 for various chars Dec 25, 2018
@davidanthoff
Member Author

Alright, this is ready for review. It needs a VERY careful review, because I'm quite nervous about some of the changes I made, in particular about how the global config for things like quotechar etc. is handled, and I'm not 100% sure I got it right...

The main difference here is how I've changed the handling of all "global default" choices for quotechar, escapechar and delim. Essentially, in the old implementation, the functions that returned either the value from the opts instance or the Quoted instance always ended up being type-unstable once we allow for either Char or UInt8. In this new implementation, one always has to set these values explicitly, and I'm not 100% sure I got that right everywhere...
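To illustrate the type-instability issue described above, here is a minimal sketch (with hypothetical names, not the actual TextParse internals): when a field can hold either a Char or a UInt8, any accessor returning it is type-unstable, whereas parameterizing the wrapper type on the character's concrete type keeps inference happy.

```julia
# Hypothetical sketch of the type-instability problem; names are illustrative,
# not TextParse's actual code.

struct Opts
    quotechar::Union{Char, UInt8}   # field can hold either type
end

# Type-unstable: the compiler cannot know whether this returns Char or UInt8,
# so call sites in a hot parsing loop fall back to dynamic dispatch.
getquotechar(opts::Opts) = opts.quotechar

# Type-stable alternative: carry the character in a type parameter, so each
# parser instance is specialized on a concrete Char or UInt8.
struct Quoted{T}
    quotechar::T
end

getquotechar(q::Quoted{T}) where {T} = q.quotechar  # inferred as T
```

The cost of the parameterized version is a separate compiled method instance per concrete type, which is the compilation-cost trade-off discussed below.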

Having said that, in terms of performance, things are looking very good. With this PR here and #95, I get the following performance numbers, which are excellent:
[benchmark plot: cols_20_withoutna]

The benchmarks with missing data look slightly less good (and are not posted here), but then we can still give the NAToken the UTF8 treatment, so I'm quite positive we'll get that under control as well.

But the short version of a long story is that we are now essentially on par with R's fread single threaded. We'll probably get a little bit worse again once we address #79, but the profiler also indicated a number of other areas where there might still be room for improvement.

Can this be done in a way that does not require ASCII? Like maybe with an AsciiChar type that we use if the given delim is in fact ascii.

Done, one can now either use a Char or a UInt8 for all these special characters. If one uses ASCII chars, it will switch to the faster implementation based on UInt8.
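A minimal sketch of that selection logic (hypothetical function name, not the actual TextParse API): an ASCII Char is narrowed to UInt8 at construction time so the parser specializes on the fast byte-comparison path, while a non-ASCII Char stays on the generic path.

```julia
# Hypothetical sketch of choosing the fast path; not TextParse's actual code.

# If the user passes an ASCII Char, narrow it to a UInt8 so the parser
# specializes on byte comparisons; otherwise keep the generic Char path.
narrowdelim(c::Char) = isascii(c) ? UInt8(c) : c
narrowdelim(b::UInt8) = b

narrowdelim(',')   # ASCII: becomes 0x2c, takes the UInt8 code path
narrowdelim('λ')   # non-ASCII: stays a Char, takes the generic code path
```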

But that does mean we pay compilation cost

I think like 98% of users will use ASCII characters for this, and so in all those cases we'll end up with one specialized code path that uses the UInt8 version of the code. If someone uses a non-ASCII character, yes, then there will be extra compile overhead, but I feel that is an ok trade-off, given how rare that should be.

@shashi
Collaborator

shashi commented Dec 25, 2018

Awesome!! Is there a comparison of how much speed up this PR gives over TextParse master? I'm just curious.

@davidanthoff
Member Author

Awesome!! Is there a comparison of how much speed up this PR gives over TextParse master? I'm just curious.

master:

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.391713 seconds (55.36 k allocations: 218.195 MiB, 6.16% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.277756 seconds (39.16 k allocations: 217.360 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.277830 seconds (39.16 k allocations: 217.360 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.452584 seconds (39.17 k allocations: 217.360 MiB, 9.62% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.293188 seconds (39.16 k allocations: 217.360 MiB)

This PR:

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.031143 seconds (54.06 k allocations: 218.152 MiB, 10.52% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.889580 seconds (37.86 k allocations: 217.317 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.882089 seconds (37.86 k allocations: 217.317 MiB)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  1.095873 seconds (37.87 k allocations: 217.317 MiB, 13.29% gc time)

julia> @time csvread("warmup_uniform_data_shortfloat64.csv");
  0.889244 seconds (37.86 k allocations: 217.317 MiB)

@shashi shashi merged commit 48fa957 into queryverse:master Jan 3, 2019
@shashi
Collaborator

shashi commented Jan 3, 2019

Good stuff @davidanthoff !

@davidanthoff davidanthoff deleted the fieldparsing-opt branch January 3, 2019 23:18