@quinnj (Member) commented Jun 10, 2020

This involves quite a bit, so I'll try to spell out some of the chunks
of work going on here, starting with the key insight that kicked off all
this work.

Previously, to get a "type-stable" inner parsing loop over columns, I
had concluded that you actually needed a single concrete column type,
no matter what type of data each column held. This led to the
development of the `typecode` + `tapes` system that has served us pretty
well over the last little while. The idea there was that any typed value
could be coerced to a `UInt64`, and thus every column's underlying
storage was a `Vector{UInt64}`, wrapped in our `CSV.Column` type to
reinterpret the bits as `Int64`, `Float64`, etc. when indexed. This was
a huge step forward for the package because we were no longer trying to
compile custom parsing kernels for every single file with a unique set
of column types (which could be fast once compiled, but with considerable
overhead on every first run).
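
To make that concrete, here's a minimal sketch of the old tape idea (hypothetical helper names, not the actual `CSV.Column` implementation): every 64-bit value is stored as raw bits in a single `Vector{UInt64}`, and indexing a column reinterprets those bits back into the column's element type.

```julia
# One Vector{UInt64} "tape" backs a column of any 64-bit primitive type;
# values are stored as their raw bit patterns.
tape = UInt64[reinterpret(UInt64, 1.5),
              reinterpret(UInt64, -2.0),
              reinterpret(UInt64, 3.25)]

# Indexing a "Float64 column" (or "Int64 column") backed by the tape
# just reinterprets the bits back on access:
getvalue(::Type{Float64}, tape, i) = reinterpret(Float64, tape[i])
getvalue(::Type{Int64},   tape, i) = reinterpret(Int64,   tape[i])

@assert getvalue(Float64, tape, 2) === -2.0
```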

The disadvantage of the tape system was that it left our columns
read-only; yes, we could have made the regular bits types mutable
without too much trouble, but it was hard to generalize to all types,
and we'd be stuck playing the "make `CSV.Column` be more like `Array`
in _this_ way" game forever. All those poor users who tried mutating
operations on a `CSV.Column`, not realizing they needed to make a copy.

While refactoring ODBC.jl recently, I realized that with the fixed,
small set of unique types you can receive from a database, it'd be nice
to roll my own "union splitting" optimization and unroll the 10-15
types myself in order to specialize. This is because the
union-splitting algorithm provided by core Julia bails once you hit 4
unique types. But in the case of ODBC.jl, I knew the static set of
types, and unrolling the 12 or so types for what are basically
`setindex!` operations could be a boon to performance. As I sat
pondering how I could reach into the compiler with generated functions
or macros or some other nonsense, the ever-heroic Tim Holy swooped in
on my over-complicated thinking and showed me that just writing the
`if` branches and checking the objects against concrete types _is
exactly the same_ as what union-splitting does. What luck!
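
Here's a hand-written sketch of the trick (a hypothetical `store!` function, not the actual ODBC.jl code): each `isa` check against a concrete type pins down `col`'s type inside its branch, so the compiler emits a direct, statically-dispatched `setindex!`, exactly as it would for an automatically union-split small `Union`.

```julia
function store!(col::AbstractVector, val, i)
    # Inside each branch, `col` has a known concrete type, so setindex!
    # is a direct call: no allocation, no dynamic dispatch.
    if col isa Vector{Int64}
        col[i] = val
    elseif col isa Vector{Float64}
        col[i] = val
    elseif col isa Vector{String}
        col[i] = val
    else
        col[i] = val  # rare fallback: dynamic dispatch for unexpected types
    end
    return nothing
end

cols = AbstractVector[Int64[0], Float64[0.0]]
store!(cols[1], 42, 1); store!(cols[2], 3.14, 1)
```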

What this means for CSV.jl is that we can now allocate "normal" vectors
and, in the innermost parsing loop (see `parserow` in `file.jl`), pull
each column out of our `columns::Vector{AbstractVector}`, check it
against the standard set of types we support, and manually dispatch to
the right parsing function. All without spurious allocations or any
kind of dynamic dispatch. Brilliant!
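
A highly simplified sketch of that manual dispatch (hypothetical names, with string fields standing in for the real byte-level parsing that `parserow` actually does):

```julia
# Recover each column's concrete type with explicit checks, then call
# the matching typed parsing routine.
function parserow!(columns::Vector{AbstractVector}, row::Int, fields)
    for (j, col) in enumerate(columns)
        if col isa Vector{Int64}
            col[row] = parse(Int64, fields[j])
        elseif col isa Vector{Float64}
            col[row] = parse(Float64, fields[j])
        elseif col isa Vector{String}
            col[row] = String(fields[j])
        end
    end
    return nothing
end

columns = AbstractVector[Vector{Int64}(undef, 1), Vector{Float64}(undef, 1)]
parserow!(columns, 1, ["42", "3.14"])
@assert columns[1][1] === 42 && columns[2][1] === 3.14
```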

In implementing this "trick", I realized several opportunities to
simplify/clean things up (net -400 LOC!), which include:

* Getting rid of `iteration.jl` and `tables.jl`; we only need a couple
lines from each now that we are getting rid of `CSV.Column` and the
threaded-row iteration shenanigans we were up to.
* Getting rid of the `typecodes` we were using as our pseudo-type
system. While this technically could have stayed, I opted to switch to
using regular Julia types to streamline the process from parsing ->
output, and hopefully pave the way to supporting a broader range of
types while parsing (see #431).
* Moving all the "sentinel array" tricks we were using into a more
formally established (and tested)
[SentinelArrays.jl](https://github.com/JuliaData/SentinelArrays.jl)
package; a `SentinelArray` uses a sentinel value of the underlying
array type to represent `missing`, so it can be thought of as a
`Vector{Union{T, Missing}}` for all intents and purposes (a minimal
usage sketch follows this list). This package's array types will be
used until solutions for using regular `Vector{Union{T, Missing}}` are
worked out; specifically, how to efficiently allow a non-copying
operation like `convert(Vector{T}, A::Vector{Union{T, Missing}})`.
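
As promised above, a minimal usage sketch of SentinelArrays.jl, assuming the `SentinelVector{T}(undef, n)` constructor from the package's README:

```julia
using SentinelArrays

# The storage is a plain Vector{Float64}; one reserved sentinel bit
# pattern stands in for `missing`, so the element type behaves like
# Union{Float64, Missing} without a separate type-tag byte per element.
A = SentinelVector{Float64}(undef, 3)
A[1] = 1.5
A[2] = missing   # stored internally as the sentinel value
A[3] = 2.0

@assert eltype(A) == Union{Float64, Missing}
@assert ismissing(A[2])
@assert A[1] === 1.5
```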

Probably the most surprising thing about this PR is that it's pretty
much non-breaking! I haven't finished the `CategoricalArrays` support
just yet, but we'll include it for now as we get ready to make other
deprecations in preparation for 1.0. The other bit I'm still mulling
over is how exactly to treat string columns; in this PR, we just fully
materialize them as `Vector{String}`, but that can be a tad expensive
depending on the data. I'm going to try out some solutions locally to
see if we can utilize `WeakRefStringArray`, but will probably leave the
option to just materialize the full string columns, since we've had
people ask for that before.

If you've made it this far, congratulations! I hope it was useful in
explaining a bit about what's going on here. I'd love any
feedback/thoughts if you have them. I know it's a lot to ask anyone to
review such a monster PR in their free time, but if you're willing, I'm
more than happy to answer questions or chat on Slack (#data channel) to
help clarify things.

@codecov (bot) commented Jun 18, 2020

Codecov Report

Merging #639 into master will decrease coverage by 9.82%.
The diff coverage is 68.22%.


```diff
@@            Coverage Diff             @@
##           master     #639      +/-   ##
==========================================
- Coverage   84.53%   74.71%   -9.83%     
==========================================
  Files           9        8       -1     
  Lines        1513     1641     +128     
==========================================
- Hits         1279     1226      -53     
- Misses        234      415     +181     
```
| Impacted Files | Coverage Δ |
|---|---|
| src/CSV.jl | 80.00% <ø> (ø) |
| src/file.jl | 70.14% <62.18%> (-15.31%) ⬇️ |
| src/utils.jl | 83.33% <80.00%> (+2.29%) ⬆️ |
| src/rows.jl | 92.40% <84.61%> (-2.17%) ⬇️ |
| src/header.jl | 93.47% <91.66%> (-4.69%) ⬇️ |
| src/tables.jl | 0.00% <0.00%> (-75.26%) ⬇️ |
| src/write.jl | 82.90% <0.00%> (-4.29%) ⬇️ |

... and 1 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18f8e51...8a91f8f.
