
Change Tables.rows implementation to use eachrow #2051

Merged
merged 4 commits into master from jq/tablesrows on Dec 22, 2019

Conversation

@quinnj
Member

quinnj commented Dec 14, 2019

This changes the Tables.rows implementation to use eachrow instead of converting to a NamedTuple of Vectors (via `columntable`). The conversion to NamedTuple is problematic in the case of extremely wide DataFrames, and as such is the root cause of JuliaData/CSV.jl#538.

The reason NamedTuple was used in the first place, however, was a desire in certain querying contexts to get a type-stable row iterator (cc: @davidanthoff).

While this change generally follows the DataFrames.jl approach of avoiding extreme compiler pressure, it does incur a just-over-2x performance penalty on a simple manipulation like:

df |> @map({x=_.id + 1, y=_.salary * 2.0}) |> DataFrame

Perhaps it's easy enough for users who need that extra speed to call `columntable(df)` first, before manipulating, when they know their DataFrame has a reasonable number of columns, but it might also be possible to switch between typed/untyped representations automatically, depending on the number of columns and pre-selected thresholds.
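To make that workaround concrete, here is a minimal sketch (assuming Query.jl and Tables.jl are loaded, that the DataFrame has id and salary columns as in the example above, and that Tables.jl's queryverse glue accepts a NamedTuple of Vectors as a source):

using DataFrames, Tables, Query

df = DataFrame(id = 1:3, salary = [100.0, 200.0, 300.0])

# Convert once to a NamedTuple of Vectors so the query pipeline sees a
# type-stable source, then run the same @map as above.
df |> Tables.columntable |> @map({x = _.id + 1, y = _.salary * 2.0}) |> DataFrame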

@bkamins
Member

bkamins commented Dec 14, 2019

  1. The change would have an additional consequence: with Tables.rows you learn the type of a column directly from the row, while with eachrow you would have to get it from the parent data frame (see the sketch below).
  2. Tests I ran some time ago showed that, if I recall correctly, for typical cases the breakpoint is around 1000 columns (though it also depends on how "heterogeneous" the column types are).
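
A rough illustration of point 1 (the column names are arbitrary; Tables.rowtable stands in here for the old NamedTuple-based rows):

using DataFrames, Tables

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Old behaviour: a NamedTuple row encodes the column types in its own type.
nt_row = first(Tables.rowtable(df))
typeof(nt_row)              # NamedTuple{(:a, :b), Tuple{Int64, String}}

# New behaviour: a DataFrameRow is a view, so column types are looked up
# via the parent data frame.
dfr = first(eachrow(df))
eltype(parent(dfr).a)       # Int64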

@bkamins
Member

bkamins commented Dec 14, 2019

@nalimilan - do we need to make a similar decision for by in DataFrames.jl?

@nalimilan
Member

Given that DataFrame doesn't encode information about column types, it makes sense to return a type-unstable iterator, as the compiler cannot optimize this code anyway (except if you use a function barrier). But maybe Tables.jl should provide a way to create a type-stable iterator? Maybe Query could convert the result of eachrow(df) to a named tuple of vectors for type stability?

@nalimilan - do we need to make a similar decision for by in DataFrames.jl?

by/combine only specializes on the types of columns passed to each function separately, so in practice it should never hit this problem.

@davidanthoff
Contributor

Maybe one way to handle this is to replace

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df)

with

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
  Tables.datavaluerows(columntable(df))

Not sure about the performance implications of that; I don't really understand the whole fallback story in Tables, what calls what, and whether this would introduce some additional indirection or not. But in general it should be possible to handle the queryverse/tabletraits story independently of this change, given that we have the separate interface in tabletraits for the Query.jl story.

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples? Doesn't this change essentially mean that any client who currently uses the Tables.rows interface in anything even vaguely performance critical would have to change their implementation to use something else instead if they want to preserve performance?

@bkamins
Member

bkamins commented Dec 14, 2019

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples?

This is the performance of type-unstable API from a fresh Julia session:

julia> using DataFrames

julia> df = DataFrame(rand(10^5, 10^4));

julia> function f(df)
           s = 0.0
           for row in eachrow(df)
               s += row.x10 + row.x9990
           end
           return s
       end
f (generic function with 1 method)

julia> @time f(df)
  0.305932 seconds (1.82 M allocations: 58.548 MiB)
100042.41979340125

julia> @time f(df)
  0.026889 seconds (898.98 k allocations: 15.243 MiB)
100042.41979340125

julia> @time f(df)
  0.030614 seconds (898.98 k allocations: 15.243 MiB)
100042.41979340125

julia> Base.summarysize(df)/10^9
8.000838824

Maybe 8GB is not huge, but I would say it is roughly the typical maximum a user would ask for. Of course this is roughly 200x slower than what would be possible if we ran the same operation in a type-stable way, but my point is that 0.03 seconds is not bad for a normal scenario (i.e. if you are not doing this operation inside a hot loop).

Can you please try running the same example using your pipeline and report the first three runtimes as above, so that we can see the difference?

My preference would be to have both options (type stable and type unstable). In particular, for CSV.write I think the type-unstable option should be good enough, as IO is probably much more expensive than type inference.
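
For reference, the CSV.write scenario that triggered this (JuliaData/CSV.jl#538) boils down to writing a very wide table; a hypothetical minimal version, not the exact reproduction from the issue:

using DataFrames, CSV

# A very wide table: with the old NamedTuple-based Tables.rows, writing this
# incurred huge compilation times; IO itself dominates once compiled.
df = DataFrame(rand(10, 10^4))
CSV.write("wide.csv", df)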

@nalimilan
Member

I'd say your proposal is fine @davidanthoff, at least if what you need is a type-stable iterator for Query.

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples? Doesn't this change essentially mean that any client who currently uses the Tables.rows interface in anything even vaguely performance critical would have to change their implementation to use something else instead if they want to preserve performance?

My point is that with a type-unstable struct like DataFrame, it's not useful to have a type-stable iterator anyway unless you use a function barrier. If you do that (manually), you may as well explicitly ask for a type-stable iterator.
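
A rough sketch of the manual pattern being described here (the function names are hypothetical): explicitly ask for a type-stable representation once, then pass it through a function barrier.

using DataFrames, Tables

# Type-unstable outer function: the compiler only sees an AbstractDataFrame here.
function rowsum(df::AbstractDataFrame)
    cols = Tables.columntable(df)   # explicit, one-time conversion
    return _rowsum_kernel(cols)     # function barrier: kernel specializes on typeof(cols)
end

# Type-stable kernel, compiled for the concrete NamedTuple-of-Vectors type.
function _rowsum_kernel(cols)
    s = 0.0
    for row in Tables.rows(cols)
        s += row.a + row.b
    end
    return s
end

rowsum(DataFrame(a = 1:3, b = 4:6))   # 21.0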

@davidanthoff
Contributor

I guess I was just worried that currently people might have code that uses Tables.rows that is structured in such a way that they have a function barrier and things work well with type-stable code. For those folks, this PR seems pretty breaking. But from my/queryverse's end that is not a problem, as long as we make sure IteratorInterfaceExtensions.getiterator continues to return a type-stable named tuple iterator.

@quinnj
Member Author

quinnj commented Dec 15, 2019

For those folks, this PR seems pretty breaking.

Let's make sure we're using clear language here; there is absolutely nothing breaking here. A 2x performance slowdown is not breaking, and as @bkamins pointed out, probably wouldn't even be noticed in a wide variety of use-cases.

And let's not forget the original issue here: CSV.write couldn't even write a several-thousand-column DataFrame due to compilation costs (I also tried a simple @map operation and had a similar result). This is a real production issue for applications trying to accept various wide/tall-shaped datasets. On a personal note, over the last year, the number one production issue we've had at Domo deploying a Julia application has been unanticipated compiler hangs for these kinds of "overly-typed" use-cases. We will far and away take a 2x performance hit if it means avoiding the compiler-hanging edge case.

Now, I do think we should probably go with the proposal of defining:

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
  Tables.datavaluerows(columntable(df))

**

But it still makes me a little uncomfortable knowing there's a code path that could bite people who are using DataFrames as the stable, mature table format. I know that 90% or even 95% of use-cases will never run into compilation issues with very wide datasets, but it does raise the question of Query's suitability for these very wide scenarios; I mean, it effectively can't be used if you're dealing with a really wide dataset, right? Are there any plans to try and support that?

My thoughts and feelings have definitely evolved over the last year or two on how critical type stability really is, after running into lots of production issues, issues with compilation costs in general, etc. I just think there are smarter ways to leverage Julia's power and flexibility by picking and choosing your "type battles"; there are a lot of cases, after going back and reviewing, where I've realized that the overhead of type stability was definitely not worth it compared to a simpler approach, with some extra code or API that allows introducing type stability more strategically or automatically when it would be really useful.

Anyway, sorry for waxing a tad too theoretical, but I wanted to make sure my thoughts on the "dangers" of type stability were thoroughly on the record.

** as a separate thought, it might be nice if we could just define it as Tables.datavaluerows(df), i.e. it would be nice if Query.jl didn't need a type-stable iterator itself, but could unroll the first NamedTuple and use the type stability from there. Not that it would really help in the worst-case scenario we're talking about here, but could perhaps save an extra conversion for non-type-stable table types.

@davidanthoff
Contributor

Yes, "breaking" is probably not the right word. I was really just referring to @bkamins statement that things would be 200x slower and a problem if this happened in a hot loop. For folks that use things in that way, this PR seems to have the potential to be a major performance regression. Not sure how important that is.

Query.jl really doesn't work with wide data at all; at this point I think one can realistically only use it with tidy data or something like that. My plan to fix this is to create an alternative backend for sources that store things in columnar format, and add a full column-oriented processing and whole-query optimization story (think MonetDB/X100). The hooks to do something like that are all there (this was the plan from the beginning of the whole project), but implementing it is a major undertaking, so that is more of a multi-year project. I'll probably try to get some research funding down the road so that I can hire some folks to work on this; I think there are enough novel issues here that it can count as research.

as a separate thought, it might be nice if we could just define it as Tables.datavaluerows(df), i.e. it would be nice if Query.jl didn't need a type-stable iterator itself, but could unroll the first NamedTuple and use the type stability from there. Not that it would really help in the worst-case scenario we're talking about here, but could perhaps save an extra conversion for non-type-stable table types.

I'm not entirely sure I understand this. Do you mean that it would be nice if IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df) worked? IteratorInterfaceExtensions.getiterator doesn't have to be type stable, but the object returned from that function needs to have enough type information so that the iterate methods on it are type stable. So I think whether that would work really just depends on how Tables.datavaluerows is implemented, right?

@quinnj
Member Author

quinnj commented Dec 15, 2019

I was really just referring to @bkamins statement that things would be 200x slower

As I stated in the original post, for the specific workflows of doing various sequences of Query.jl operations, in my benchmarks, the performance hit was 2x. I'm not exactly sure what @bkamins was referring to with the 200x statement as I didn't seem to see a comparison of type-stable vs. not. In my benchmarks, I was directly comparing current release code vs. this PR.

So I think whether that would work really just depends on how Tables.datavaluerows is implemented, right?

No; as you probably could guess, there isn't a way for Tables.datavaluerows to produce stably-typed elements from a type-unstable iterator. What I meant was that @map or @filter or any Query.jl operation could perhaps use a mechanism like collect_to_with_first! from Base, where the first element is iterated and then passed along as the only type information needed for the various operations to be performant. We have similar machinery in Tables.jl's buildcolumns code, so that input tables don't have to have a defined Schema to still be performant. I've actually been really pleased, and a little surprised, that the performance there is very close to the "known schema" case, without needing to rely on inference or anything.
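
A loose sketch of that idea (the helper names are hypothetical, not an actual Query.jl or Tables.jl API): iterate the first element of a possibly type-unstable iterator, then hand it to a kernel that gets compiled for its concrete type.

# Hypothetical sketch, similar in spirit to Base.collect_to_with_first!:
# take the first row eagerly, then specialize the rest of the reduction on it.
function mapsum_rows(f, rows)
    y = iterate(rows)
    y === nothing && error("empty row iterator")
    row, state = y
    # `row` now has a concrete type, so this call acts as a function barrier:
    # the kernel below specializes on the concrete types of f(row) and rows.
    return _mapsum_rest(f, f(row), rows, state)
end

function _mapsum_rest(f, acc, rows, state)
    y = iterate(rows, state)
    while y !== nothing
        row, state = y
        acc += f(row)
        y = iterate(rows, state)
    end
    return acc
end

# Usage: works on eachrow(df) even though its element type is not inferable up front.
# mapsum_rows(r -> r.a + r.b, eachrow(df))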

My plan to fix this is to create an alternative backend for sources that store things in columnar format, and add a full column oriented processing and whole-query optimization story

I remember discussing some future Query stuff like this, but I'm not sure I quite see how that enables processing wide datasets any better. Currently, the interface requires iterating strongly typed NamedTuples, right? Which I think already puts really wide datasets into a bit of a trap. Would the alternative backend use some other kind of mechanism to avoid materializing the NamedTuples, or even their types? I know there's some scaffolding in TableTraits for column processing, but I think I remember it still required returning a NamedTuple of vectors as columns, which is still problematic.

Anyway, probably a little off-topic at this point, so I'll make the changes for Query.jl here and I think it should be ready to go.

@bkamins
Member

bkamins commented Dec 15, 2019

Just to be clear, here is a simple comparison. x200 is the worst ratio I could get. Here is a "normal" situation:

julia> using DataFrames, Tables

julia> df = DataFrame!([rand(10^5) for _ in 1:10^4]);

julia> function f1(df)
                  s = 0.0
                  for row in eachrow(df)
                      s += row.x10 + row.x9990
                  end
                  return s
              end
f1 (generic function with 1 method)

julia> function f2(df)
                  s = 0.0
                  for row in Tables.rows(df)
                      s += row.x10 + row.x9990
                  end
                  return s
              end
f2 (generic function with 1 method)

julia> @time f2(df)
294.192706 seconds (23.67 M allocations: 1.205 GiB, 0.18% gc time)
100109.37692058666

julia> @time f2(df)
  0.417728 seconds (709.01 k allocations: 15.016 MiB, 5.58% gc time)
100109.37692058666

julia> @time f2(df)
  0.397983 seconds (709.01 k allocations: 15.016 MiB)
100109.37692058666

julia> @time f1(df)
  0.097221 seconds (1.31 M allocations: 33.495 MiB, 9.64% gc time)
100109.37692058666

julia> @time f1(df)
  0.015580 seconds (898.98 k allocations: 15.243 MiB)
100109.37692058666

julia> @time f1(df)
  0.018833 seconds (898.98 k allocations: 15.243 MiB)
100109.37692058666

And f2 is slower than f1 not only to compile but also when iterating.

My x200 statement was about an optimal way to do a certain operation, which in this case roughly is:

# the sum is different, as I had to restart my terminal
julia> function f3(df)
           sum(df.x1) + sum(df.x9900)
       end
f3 (generic function with 1 method)

julia> @time f3(df)
  0.025982 seconds (78.78 k allocations: 3.969 MiB)
100139.2393094188

julia> @time f3(df)
  0.000087 seconds (7 allocations: 208 bytes)
100139.2393094188

julia> @time f3(df)
  0.000088 seconds (7 allocations: 208 bytes)
100139.2393094188


Tables.schema(df::AbstractDataFrame) = Tables.Schema(names(df), eltype.(eachcol(df)))
Tables.schema(df::DataFrameRows) = Tables.schema(getfield(df, :df))
Member

Maybe this needs a new test?

Member

Agreed - it is probably clear what this will produce, but a test will give us an error if someone makes a breaking change to this part of the source code in the future.
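
For concreteness, such a test might look roughly like this (a sketch only, not necessarily the exact test added in #2055):

using DataFrames, Tables, Test

df = DataFrame(a = [1, 2], b = ["x", "y"])
sch = Tables.schema(eachrow(df))
@test sch.names == (:a, :b)
@test sch.types == (Int, String)   # Int == Int64 on 64-bit systems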

Member Author

I went ahead and merged @tkf's PRs that added tests for Tables.schema(df::DataFrameRows) already, so we should be good to go here now.

Member

You mean #2054? I don't see how it tests schema.

Member Author

No, #2055

Member

Ah, OK. TBH I'd have preferred if you had waited a bit more until we had a chance to comment on the final design...

Member Author

Oops. I didn't realize there was more to discuss or pending questions/concerns. Is there something specific you're still wondering about?

Contributor

Actually, I have one: #2055 (comment). Sorry, it turned out I didn't understand what I wanted...

Member

@bkamins left a comment

I am OK to merge it. If someone wants type stability, it is easy enough to call Tables.columntable on a data frame and later work with it.

@bkamins
Member

bkamins commented Dec 16, 2019

CI fails because DataFrameRows returns propertynames as a vector, not as a tuple.
This follows the contract of propertynames, so it is technically OK, but maybe Tables.jl relies on propertynames being a Tuple and we should change this?

@@ -70,7 +70,7 @@ Base.propertynames(d::DuplicateNamesColumnTable) = (:a, :a, :b)
 @test @inferred(Tables.materializer(table)(Tables.columns(table))) isa typeof(table)

 row = first(Tables.rows(table))
-@test propertynames(row) == (:a, :b)
+@test collect(propertynames(row)) == [:a, :b]
Member

E.g. this is the point I raised for discussion. Now, depending on whether you pass e.g. df or eachrow(df) to Tables.rows, you get a different type of row and, in consequence, a different return type from propertynames. Is this OK to have?
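
For illustration, a hypothetical REPL snippet showing that difference as it stood at the time (#2056 later changed DataFrames so that propertynames returns a Tuple here as well):

using DataFrames, Tables

df = DataFrame(a = 1:2, b = 3:4)

# DataFrameRow (what Tables.rows on a DataFrame now yields): a Vector at the time
propertynames(first(eachrow(df)))           # [:a, :b]

# NamedTuple row (the old Tables.rows element type): a Tuple
propertynames(first(Tables.rowtable(df)))   # (:a, :b)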

Member

Actually I have made #2056 to make sure we do not have to change this line as propertynames now always returns a Tuple.

@quinnj merged commit 6e70765 into master Dec 22, 2019
@quinnj deleted the jq/tablesrows branch December 22, 2019 06:54