Refactor CSV internals to produce fully mutable columns by default #639
This involves quite a bit, so I'll try to spell out some of the chunks
of work going on here, starting with the key insight that kicked off all
this work.
Previously, to get a "type-stable" inner parsing loop over columns, I
had come to the conclusion that you actually needed the same concrete
column type, no matter the element type of the data. This led to the
development of the `typecode` + `tapes` system that has served us pretty
well over the last little while. The idea there was any typed value
could be coerced to a `UInt64` and thus every column's underlying
storage was `Vector{UInt64}`, wrapped in our `CSV.Column` type to
reinterpret the bits as `Int64`, `Float64`, etc. when indexed. This was
a huge step forward for the package because we were no longer trying to
compile custom parsing kernels for every single file with a unique set
of column types (which could be fast once compiled, but with considerable
overhead at every first run).
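For illustration only, here's a minimal sketch of the bit-reinterpretation idea behind the old tape approach (this is not CSV.jl's actual internals, just the core trick):

```julia
# Minimal illustration of the old tape idea: a single Vector{UInt64} holds the
# raw bits of values of different types, and the original value is recovered
# by reinterpreting the bits on access.
tape = Vector{UInt64}(undef, 2)

tape[1] = reinterpret(UInt64, Int64(42))   # store an Int64's bits
tape[2] = reinterpret(UInt64, 3.14)        # store a Float64's bits

reinterpret(Int64, tape[1])    # 42
reinterpret(Float64, tape[2])  # 3.14
```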
The disadvantage of the tape system was it left our columns read-only;
yes, we could have made the regular bits types mutable without too much
trouble, but it was hard to generalize to all types and we'd be stuck
playing the "make `CSV.Column` be more like `Array` in _this_ way" game
forever. All those poor users who tried mutating operations on
`CSV.Column`, not realizing they needed to make a copy first.
While refactoring ODBC.jl recently, at one point I realized that with
the fixed, small set of unique types you can receive from a database,
it'd be nice to roll my own "union splitting" optimization and unroll
the 10-15 types myself in order to specialize. This is because the
union-splitting algorithm provided by core Julia will bail once you hit 4 unique
types. But in the case of ODBC.jl, I knew the static set of types, and
unrolling the 12 or so of them for what is basically a `setindex!` operation could
be a boon to performance. As I sat pondering how I could reach into the
compiler with generated functions or macros or some other nonsense, the
ever-heroic Tim Holy swooped in on my over-complicated thinking and
showed me that just writing the `if` branches and checking the objects
against concrete types _is exactly the same_ as what union-splitting
does. What luck!
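To make the trick concrete, here's a tiny, hypothetical example (the names `process` and `handle` are made up for illustration): inside each `isa` branch the compiler knows the concrete type of `x`, so each call specializes exactly as it would under automatic union splitting, except the hand-unrolling works for as many types as you care to write out.

```julia
# Hypothetical illustration of manual union splitting. Inside each `isa`
# branch the compiler knows the concrete type of `x`, so `process(x)` is
# statically dispatched -- the same specialization automatic union splitting
# gives you, but hand-unrolled so it keeps working beyond a handful of types.
process(x::Int64)   = x + 1
process(x::Float64) = x * 2.0
process(x::String)  = uppercase(x)

function handle(x)
    if x isa Int64
        return process(x)    # x::Int64 here
    elseif x isa Float64
        return process(x)    # x::Float64 here
    elseif x isa String
        return process(x)    # x::String here
    # ...one branch per supported type
    else
        throw(ArgumentError("unsupported type: $(typeof(x))"))
    end
end
```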
What this means for CSV.jl is we can now allocate "normal" vectors and
in the innermost parsing loop (see `parserow` in `file.jl`), pull the
column out of our `columns::Vector{AbstractVector}` and check it against
the standard set of types we support and manually dispatch to the right
parsing function. All without spurious allocations or any kind of
dynamic dispatch. Brilliant!
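A rough sketch of the shape of that inner loop follows (this is not the actual `parserow` code; `parsefield!` and `rawvalues` are placeholders invented for illustration):

```julia
# Sketch only: each column is a plain, mutable Vector, and manual `isa`
# checks against the supported column types give the compiler a concrete
# column type before parsing into it -- no dynamic dispatch in the hot loop.
parsefield!(col::Vector{Int64},   row, raw) = (col[row] = parse(Int64, raw))
parsefield!(col::Vector{Float64}, row, raw) = (col[row] = parse(Float64, raw))
parsefield!(col::Vector{String},  row, raw) = (col[row] = String(raw))

function parserow!(columns::Vector{AbstractVector}, rawvalues, row::Int)
    for (i, col) in enumerate(columns)
        raw = rawvalues[i]
        if col isa Vector{Int64}
            parsefield!(col, row, raw)      # col::Vector{Int64} here
        elseif col isa Vector{Float64}
            parsefield!(col, row, raw)      # col::Vector{Float64} here
        elseif col isa Vector{String}
            parsefield!(col, row, raw)      # col::Vector{String} here
        # ...one branch per supported column type
        else
            error("unsupported column type: $(typeof(col))")
        end
    end
    return columns
end
```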
In implementing this "trick", I realized several opportunities to
simplify/clean things up (net -400 LOC!), which include:
* Getting rid of `iteration.jl` and `tables.jl`; we only need a couple
lines from each now that we are getting rid of `CSV.Column` and the
threaded-row iteration shenanigans we were up to
* Getting rid of the `typecodes` we were using as our pseudo-type
system. While this technically could have stayed, I opted to switch to
using regular Julia types to streamline the process from parsing ->
output, and hopefully pave the way to supporting a broader range of
types while parsing (see #431).
* Moving all the "sentinel array" tricks we were using into a more
formally established (and tested)
[SentinelArrays.jl](https://github.com/JuliaData/SentinelArrays.jl)
package; a `SentinelArray` uses a sentinel value of the underlying array
type for `missing`, so it can be thought of as a `Vector{Union{T,
Missing}}` for all intents and purposes. This package's array types will
be used while solutions for using regular `Vector{Union{T, Missing}}`
directly are worked out; specifically, how to efficiently allow a non-copying
operation like `convert(Vector{T}, A::Vector{Union{T, Missing}})`. A sketch
of the sentinel idea follows this list.
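Here is a hypothetical illustration of the sentinel idea itself (not the actual SentinelArrays.jl API; `SentinelVec` is an invented type for this sketch): the storage is a plain `Vector{T}`, and one reserved bit pattern stands in for `missing`.

```julia
# Hypothetical sketch of the sentinel idea (not the real SentinelArrays.jl
# API): storage is a plain Vector{T}, and one reserved value -- the sentinel --
# stands in for `missing`, so the array behaves like Vector{Union{T, Missing}}
# without a separate type-tag array.
struct SentinelVec{T} <: AbstractVector{Union{T, Missing}}
    data::Vector{T}
    sentinel::T
end

Base.size(v::SentinelVec) = size(v.data)
Base.getindex(v::SentinelVec, i::Int) =
    (x = v.data[i]; x === v.sentinel ? missing : x)
Base.setindex!(v::SentinelVec, ::Missing, i::Int) = (v.data[i] = v.sentinel; v)
Base.setindex!(v::SentinelVec{T}, x, i::Int) where {T} = (v.data[i] = convert(T, x); v)

v = SentinelVec(fill(NaN, 3), NaN)   # NaN used as the sentinel for Float64
v[1] = 1.5
v[2] = missing
v[1], v[2], v[3]                     # (1.5, missing, missing)
```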
Probably the most surprising thing about this PR is that it's pretty much
non-breaking! I haven't finished the `CategoricalArrays` support just
yet, but we'll include it for now as we get ready to make other
deprecations in preparation for 1.0. The other bit I'm still mulling
over is how exactly to treat string columns; in this PR, we just fully
materialize them as `Vector{String}`, but that can be a tad expensive
depending on the data. I'm going to try out some solutions locally to
see if we can utilize `WeakRefStringArray` but will probably leave the
option to just materialize the full string columns, since we've had
people ask for that before.
If you've made it this far, congratulations! I hope it was useful in
explaining a bit about what's going on here. I'd love any
feedback/thoughts if you have them. I know it's a lot to ask anyone to
review such a monster PR in their free time, but if you're willing, I'm
more than happy to answer questions or chat on slack (#data channel) to
help clarify things.
Codecov Report
| | master | #639 | +/- |
|---|---|---|---|
| Coverage | 84.53% | 74.71% | -9.83% |
| Files | 9 | 8 | -1 |
| Lines | 1513 | 1641 | +128 |
| Hits | 1279 | 1226 | -53 |
| Misses | 234 | 415 | +181 |
Continue to review full report at Codecov.