Performance regressions CSV.Rows since 0.5? #752

Closed
altre opened this issue Oct 12, 2020 · 3 comments · Fixed by #753

Comments

altre commented Oct 12, 2020

When I run the following benchmark

using CSV
using BenchmarkTools
using Random
Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        write(f, join([randstring('a':'z') for _ in 1:8], ","))
        write(f, "\n")
    end
end
function read()
    rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

@benchmark read()

on v0.5.26, I get:

(Bla) pkg> st
Project Bla v0.1.0
Status `/private/tmp/Bla/Project.toml`
  [336ed68f] CSV v0.5.26

julia> @benchmark read()
BenchmarkTools.Trial: 
  memory estimate:  47.31 MiB
  allocs estimate:  1100059
  --------------
  minimum time:     65.747 ms (4.78% GC)
  median time:      72.943 ms (4.71% GC)
  mean time:        73.037 ms (5.13% GC)
  maximum time:     82.599 ms (4.64% GC)
  --------------
  samples:          69
  evals/sample:     1

Running the same on v0.7.7 is about 4x slower:

julia> @benchmark read()
BenchmarkTools.Trial: 
  memory estimate:  177.01 MiB
  allocs estimate:  3500118
  --------------
  minimum time:     287.515 ms (3.16% GC)
  median time:      294.814 ms (3.11% GC)
  mean time:        294.901 ms (3.14% GC)
  maximum time:     302.321 ms (3.21% GC)
  --------------
  samples:          17
  evals/sample:     1

(@v1.5) pkg> st CSV
Status `~/.julia/environments/v1.5/Project.toml`
  [336ed68f] CSV v0.7.7

Is this a performance regression, or have I missed an API change?

quinnj (Member) commented Oct 15, 2020

Thanks for reporting; I did some digging last night and I think I know what's going on. I'll try to get a fix up today.

quinnj added a commit that referenced this issue Oct 16, 2020
Fixes #752. This is a case of us not providing quite enough
information to the compiler, along with the compiler itself being too
clever. The default for `CSV.Rows` is to treat each column as
`Union{String, Missing}`, which results in the `V` type parameter of
`CSV.Rows` being `CSV.PosLen`, instead of `Any`. If that's the case, we
should get pretty good inferrability for `getproperty(::Row2,
::Symbol)`, because we should be able to know the return value will at
least be `Union{String, Missing}`. This knowledge, however, was trapped
in the "csv domain" and not expressed clearly enough to the compiler. It
inspected `Tables.getcolumn(::Row2, nm::Symbol)` and saw that it called
`Tables.getcolumn(::Row2, i::Int)`, which in turn called
`Tables.getcolumn(::Row2, T, i, nm)`. This is all fine and expected,
except that when we started supporting non-String types for `CSV.Rows`
(i.e. you can pass in whatever type you want and we'll parse it directly
from the file for each row), we added an additional typed
`Tables.getcolumn` method that handled all the non-String columns. Oops.
Now the compiler is confused because from `Tables.getcolumn(::Row2,
nm::Symbol)` it knows that it can return `missing`, a `String`, or if we
call this third method, it'll return an instance of our `V` type
parameter, which, if you'll remember, in the default case is
`CSV.PosLen`, or more simply, `UInt64`. So we ended up with a return
type of `Union{Missing, UInt64, String}`, which makes downstream
operations even trickier to figure out.

Luckily, the solution here is to just help connect the dots for the
compiler: i.e. define specialized methods that dispatch on `V`,
specifically when `V === UInt64`. Then the compiler will see/know that
we will only ever call the `Union{String, Missing}` method and can
ignore the custom types codepath. This PR also rearranges a few
`@inbounds` uses since we can avoid the bounds checks further down the
stack once we've checked them higher up.
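
The inference problem described in this commit message can be reproduced in miniature. The following is only a sketch under assumptions: `ToyRow`, `decode_text`, `getcol_before`, and `getcol_after` are hypothetical names invented for illustration and are not CSV.jl's actual types or methods. It shows how a generic accessor lets the raw value type leak into the inferred return type, and how an extra method that dispatches on the `V` parameter should narrow it again.

```julia
# Hypothetical stand-ins; none of these names come from CSV.jl.
struct ToyRow{V}
    types::Vector{DataType}  # per-column type, known only at run time
    cells::Vector{V}         # raw per-column values
end

# "String" path: decode a raw cell into Union{String, Missing}.
decode_text(x::UInt64)::Union{String, Missing} = iszero(x) ? missing : string(x)

# Before: one generic method. Inference cannot see which branch runs,
# so the raw value type is part of the inferred return type.
function getcol_before(r::ToyRow, i::Int)
    return r.types[i] === String ? decode_text(r.cells[i]) : r.cells[i]
end

# After: the same generic fallback, plus a method specialized on V === UInt64
# that can only take the text path.
getcol_after(r::ToyRow, i::Int) = getcol_before(r, i)
getcol_after(r::ToyRow{UInt64}, i::Int) = decode_text(r.cells[i])

row = ToyRow{UInt64}([String, String], UInt64[0, 42])
getcol_after(row, 1)  # zero cell decodes to missing
getcol_after(row, 2)  # nonzero cell decodes to a String

# Compare what inference reports for each accessor.
Base.return_types(getcol_before, (ToyRow{UInt64}, Int))
Base.return_types(getcol_after, (ToyRow{UInt64}, Int))
```

In this toy version, the first `Base.return_types` call should report a union that still includes `UInt64`, while the second should report only `Union{Missing, String}`, which is the narrowing the commit message describes.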

quinnj (Member) commented Oct 16, 2020

Alright; it was a little tricky to track down, but I've got a fix up for this: #753.

One thing to note is that the benchmarking is much cleaner if you pass in the CSV.Rows object to the read function, like:

function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
@benchmark read(rows)

This allows read to properly specialize on the CSV.Rows type parameters, which is important since the rows = CSV.Rows(...) operation is inherently type unstable (i.e. it's trying to figure out the type parameters from the initial parsing of the file).
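
The rearrangement above is an instance of what is often called a function barrier. A minimal sketch of the same idea without CSV.jl follows; `ELTYPES`, `make_data`, and `total` are hypothetical names chosen for illustration. The constructor is type unstable because the element type comes from a run-time lookup, but the loop lives in a separate function that Julia compiles for the concrete argument type it actually receives.

```julia
# Hypothetical lookup table: the element type is chosen by a run-time value.
const ELTYPES = Dict{Symbol, DataType}(:int => Int, :float => Float64)

# Type-unstable constructor: the compiler cannot know the concrete array type here.
make_data(kind::Symbol, n::Int) = ones(ELTYPES[kind], n)

# Kernel behind the barrier: compiled once per concrete argument type.
function total(data)
    acc = zero(eltype(data))
    for x in data
        acc += x
    end
    return acc
end

data = make_data(:float, 1_000)  # the unstable step happens once, outside the hot loop
total(data)                      # dispatches to code specialized on Vector{Float64}
```

When timing such a kernel with BenchmarkTools, interpolating the global (for example `@benchmark total($data)`) keeps global-variable access out of the measurement.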

With that rearrangement, I get these timings with the fix in my PR:
v0.5.26

julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  30.52 MiB
  allocs estimate:  900001
  --------------
  minimum time:     32.326 ms (0.00% GC)
  median time:      37.181 ms (9.83% GC)
  mean time:        37.344 ms (7.23% GC)
  maximum time:     46.213 ms (9.55% GC)
  --------------
  samples:          134
  evals/sample:     1

PR:

julia> @benchmark read(rows)
BenchmarkTools.Trial:
  memory estimate:  24.41 MiB
  allocs estimate:  800001
  --------------
  minimum time:     41.163 ms (0.00% GC)
  median time:      44.335 ms (4.09% GC)
  mean time:        44.553 ms (2.50% GC)
  maximum time:     54.575 ms (0.00% GC)
  --------------
  samples:          113
  evals/sample:     1

That seems in line with what I would expect; note that a big update between 0.5.26 and 0.7.7 was the ability to parse custom types for any column, which incurred a similar 5-10% performance hit, along with some of the other improvements that have been made.


schlurp commented Feb 28, 2023

Opened new issue #1075, since the cause is likely different.
