Sample column types when finding multithreaded row chunks #688

quinnj · 2020-07-11T20:08:06Z

This is another optimization I've wanted to do for a while; we're
already checking up to 5 rows from various locations in the file to find
row boundaries for multithreaded parsing, so this just also samples the
types from those locations. The end result is that we already go into
full parsing having checked tasks * 5 # of rows for column types. The
2 main advantages of adding this are that we can avoid running detect
in all tasks in some cases, i.e. if we've already sampled that a column
is Int64, then each task will immediately start parsing Int64s
instead of having to detect, then allocate, etc. The other advantage
is if we happen to catch a rogue string value in a part of the file, it
can be promoted immediately instead of having to parse entire chunks as,
say, Int64, and then re-parsed later. It's not super robust in that
case because we're only sampling 5 rows from each chunk, but it's better
than nothing and doesn't add extra cost to what we're already doing.

This PR also adds an optimization when a PooledString column is
promoted to String, we don't need to reparse, but can just allocate
the String array and put the ref values in directly according to the
codes.

This is another optimization I've wanted to do for a while; we're already checking up to 5 rows from various locations in the file to find row boundaries for multithreaded parsing, so this just also samples the types from those locations. The end result is that we already go into full parsing having checked `tasks * 5` # of rows for column types. The 2 main advantages of adding this are that we can avoid running `detect` in all tasks in some cases, i.e. if we've already sampled that a column is `Int64`, then each task will immediately start parsing `Int64`s instead of having to `detect`, then `allocate`, etc. The other advantage is if we happen to catch a rogue string value in a part of the file, it can be promoted immediately instead of having to parse entire chunks as, say, `Int64`, and then re-parsed later. It's not super robust in that case because we're only sampling 5 rows from each chunk, but it's better than nothing and doesn't add extra cost to what we're already doing. This PR also adds an optimization when a `PooledString` column is promoted to `String`, we don't need to reparse, but can just allocate the `String` array and put the ref values in directly according to the codes.

codecov · 2020-07-11T21:09:37Z

Codecov Report

Merging #688 into master will increase coverage by 0.55%.
The diff coverage is 98.57%.

@@            Coverage Diff             @@
##           master     #688      +/-   ##
==========================================
+ Coverage   84.28%   84.84%   +0.55%     
==========================================
  Files          10       10              
  Lines        1801     1847      +46     
==========================================
+ Hits         1518     1567      +49     
+ Misses        283      280       -3

Impacted Files	Coverage Δ
src/detection.jl	`94.64% <96.15%> (+0.07%)`	⬆️
src/chunks.jl	`100.00% <100.00%> (ø)`
src/file.jl	`95.19% <100.00%> (+0.17%)`	⬆️
src/CSV.jl	`25.00% <0.00%> (-8.34%)`	⬇️
src/utils.jl	`85.50% <0.00%> (+2.89%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4136307...dfff952. Read the comment docs.

More work on cleaning things up

dfff952

quinnj mentioned this pull request Jul 28, 2020

Performance regression? #695

Closed

quinnj merged commit 57b48a9 into master Jul 29, 2020

quinnj deleted the jq/opts branch July 29, 2020 04:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sample column types when finding multithreaded row chunks #688

Sample column types when finding multithreaded row chunks #688

quinnj commented Jul 11, 2020

codecov bot commented Jul 11, 2020 •

edited

Sample column types when finding multithreaded row chunks #688

Sample column types when finding multithreaded row chunks #688

Conversation

quinnj commented Jul 11, 2020

codecov bot commented Jul 11, 2020 • edited

Codecov Report

codecov bot commented Jul 11, 2020 •

edited