Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sample column types when finding multithreaded row chunks #688

Merged
merged 2 commits into from
Jul 29, 2020
Merged

Conversation

quinnj
Copy link
Member

@quinnj quinnj commented Jul 11, 2020

This is another optimization I've wanted to do for a while; we're
already checking up to 5 rows from various locations in the file to find
row boundaries for multithreaded parsing, so this just also samples the
types from those locations. The end result is that we already go into
full parsing having checked tasks * 5 # of rows for column types. The
2 main advantages of adding this are that we can avoid running detect
in all tasks in some cases, i.e. if we've already sampled that a column
is Int64, then each task will immediately start parsing Int64s
instead of having to detect, then allocate, etc. The other advantage
is if we happen to catch a rogue string value in a part of the file, it
can be promoted immediately instead of having to parse entire chunks as,
say, Int64, and then re-parsed later. It's not super robust in that
case because we're only sampling 5 rows from each chunk, but it's better
than nothing and doesn't add extra cost to what we're already doing.

This PR also adds an optimization when a PooledString column is
promoted to String, we don't need to reparse, but can just allocate
the String array and put the ref values in directly according to the
codes.

This is another optimization I've wanted to do for a while; we're
already checking up to 5 rows from various locations in the file to find
row boundaries for multithreaded parsing, so this just also samples the
types from those locations. The end result is that we already go into
full parsing having checked `tasks * 5` # of rows for column types. The
2 main advantages of adding this are that we can avoid running `detect`
in all tasks in some cases, i.e. if we've already sampled that a column
is `Int64`, then each task will immediately start parsing `Int64`s
instead of having to `detect`, then `allocate`, etc. The other advantage
is if we happen to catch a rogue string value in a part of the file, it
can be promoted immediately instead of having to parse entire chunks as,
say, `Int64`, and then re-parsed later. It's not super robust in that
case because we're only sampling 5 rows from each chunk, but it's better
than nothing and doesn't add extra cost to what we're already doing.

This PR also adds an optimization when a `PooledString` column is
promoted to `String`, we don't need to reparse, but can just allocate
the `String` array and put the ref values in directly according to the
codes.
@codecov
Copy link

codecov bot commented Jul 11, 2020

Codecov Report

Merging #688 into master will increase coverage by 0.55%.
The diff coverage is 98.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #688      +/-   ##
==========================================
+ Coverage   84.28%   84.84%   +0.55%     
==========================================
  Files          10       10              
  Lines        1801     1847      +46     
==========================================
+ Hits         1518     1567      +49     
+ Misses        283      280       -3     
Impacted Files Coverage Δ
src/detection.jl 94.64% <96.15%> (+0.07%) ⬆️
src/chunks.jl 100.00% <100.00%> (ø)
src/file.jl 95.19% <100.00%> (+0.17%) ⬆️
src/CSV.jl 25.00% <0.00%> (-8.34%) ⬇️
src/utils.jl 85.50% <0.00%> (+2.89%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4136307...dfff952. Read the comment docs.

@quinnj quinnj mentioned this pull request Jul 28, 2020
@quinnj quinnj merged commit 57b48a9 into master Jul 29, 2020
@quinnj quinnj deleted the jq/opts branch July 29, 2020 04:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant