
Change Tables.rows implementation to use eachrow #2051

Merged
merged 4 commits into master from jq/tablesrows on Dec 22, 2019

Conversation

@quinnj
Member

quinnj commented Dec 14, 2019

This changes the Tables.rows implementation to use eachrow instead of converting to a NamedTuple of Vectors (via `columntable`). The conversion to NamedTuple is problematic in the case of extremely wide DataFrames, and as such is the root cause of JuliaData/CSV.jl#538.

The reason NamedTuple was used in the first place, however, was a desire in certain querying contexts to get a type-stable row iterator (cc: @davidanthoff).

While this change generally follows the DataFrames.jl approach of avoiding extreme compiler pressure, it does incur a just-over-2x performance penalty on a simple manipulation like:

df |> @map({x=_.id + 1, y=_.salary * 2.0}) |> DataFrame

Perhaps it's easy enough for users who need that extra speed to call `columntable(df)` first, before manipulating, when they know their DataFrame has a reasonable number of columns, but it might also be possible to switch between typed/untyped representations automatically, depending on the number of columns and pre-selected thresholds.
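To make that workaround concrete, here is a minimal sketch (assuming Query.jl and Tables.jl are loaded, that the DataFrame has id and salary columns as in the example above, and that Tables.jl's queryverse glue accepts a NamedTuple of Vectors as a source):

using DataFrames, Tables, Query

df = DataFrame(id = 1:3, salary = [100.0, 200.0, 300.0])

# Convert once to a NamedTuple of Vectors so the query pipeline sees a
# type-stable source, then run the same @map as above.
df |> Tables.columntable |> @map({x = _.id + 1, y = _.salary * 2.0}) |> DataFrame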

@bkamins
Member

bkamins commented Dec 14, 2019

  1. The change would have an additional consequence: with Tables.rows you learn the type of a column directly from the row, while with eachrow you would have to get it from the parent data frame (see the sketch below).
  2. Tests I ran some time ago showed that, if I recall correctly, for typical cases the breakpoint is around 1000 columns (though it also depends on how "heterogeneous" the column types are).
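
A rough illustration of point 1 (the column names are arbitrary; Tables.rowtable stands in here for the old NamedTuple-based rows):

using DataFrames, Tables

df = DataFrame(a = [1, 2], b = ["x", "y"])

# Old behaviour: a NamedTuple row encodes the column types in its own type.
nt_row = first(Tables.rowtable(df))
typeof(nt_row)              # NamedTuple{(:a, :b), Tuple{Int64, String}}

# New behaviour: a DataFrameRow is a view, so column types are looked up
# via the parent data frame.
dfr = first(eachrow(df))
eltype(parent(dfr).a)       # Int64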

@bkamins
Member

bkamins commented Dec 14, 2019

@nalimilan - do we need to make a similar decision for by in DataFrames.jl?

@nalimilan
Member

Given that DataFrame doesn't encode information about column types, it makes sense to return a type-unstable iterator, as the compiler cannot optimize this code anyway (except if you use a function barrier). But maybe Tables.jl should provide a way to create a type-stable iterator? Maybe Query could convert the result of eachrow(df) to a named tuple of vectors for type stability?

@nalimilan - do we need to make a similar decision for by in DataFrames.jl?

by/combine only specializes on the types of columns passed to each function separately, so in practice it should never hit this problem.

@davidanthoff
Contributor

Maybe one way to handle this is to replace

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df)

with

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
  Tables.datavaluerows(columntable(df))

Not sure about the performance implications of that; I don't really understand the whole fallback story in Tables, what calls what, and whether this would introduce some additional indirection or not. But in general it should be possible to handle the queryverse/tabletraits story independently of this change, given that we have the separate interface in tabletraits for the Query.jl story.

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples? Doesn't this change essentially mean that any client who currently uses the Tables.rows interface in anything even vaguely performance critical would have to change their implementation to use something else instead if they want to preserve performance?

@bkamins
Member

bkamins commented Dec 14, 2019

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples?

This is the performance of type-unstable API from a fresh Julia session:

julia> using DataFrames

julia> df = DataFrame(rand(10^5, 10^4));

julia> function f(df)
           s = 0.0
           for row in eachrow(df)
               s += row.x10 + row.x9990
           end
           return s
       end
f (generic function with 1 method)

julia> @time f(df)
  0.305932 seconds (1.82 M allocations: 58.548 MiB)
100042.41979340125

julia> @time f(df)
  0.026889 seconds (898.98 k allocations: 15.243 MiB)
100042.41979340125

julia> @time f(df)
  0.030614 seconds (898.98 k allocations: 15.243 MiB)
100042.41979340125

julia> Base.summarysize(df)/10^9
8.000838824

Maybe 8GB is not huge, but I would say it is roughly the typical maximum a user would ask for. Of course this is roughly 200x slower than what would be possible if we ran the same operation in a type-stable way, but my point is that 0.03 seconds is not bad for a normal scenario (i.e. if you are not doing this operation inside a hot loop).

Can you please try running the same example using your pipeline and report the first three runtimes as above, so that we can see the difference?

My preference would be to have both options (type stable and type unstable). In particular, for CSV.write I think the type-unstable option should be good enough, as IO is probably much more expensive than type inference.
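
For reference, the CSV.write scenario that triggered this (JuliaData/CSV.jl#538) boils down to writing a very wide table; a hypothetical minimal version, not the exact reproduction from the issue:

using DataFrames, CSV

# A very wide table: with the old NamedTuple-based Tables.rows, writing this
# incurred huge compilation times; IO itself dominates once compiled.
df = DataFrame(rand(10, 10^4))
CSV.write("wide.csv", df)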

@nalimilan
Member

I'd say your proposal is fine @davidanthoff, at least if what you need is a type-stable iterator for Query.

On a more general level, is a non-type stable row iterator useful for anything but very small toy examples? Doesn't this change essentially mean that any client who currently uses the Tables.rows interface in anything even vaguely performance critical would have to change their implementation to use something else instead if they want to preserve performance?

My point is that with a type-unstable struct like DataFrame, it's not useful to have a type-stable iterator anyway unless you use a function barrier. If you do that (manually), you may as well explicitly ask for a type-stable iterator.
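
A rough sketch of the manual pattern being described here (the function names are hypothetical): explicitly ask for a type-stable representation once, then pass it through a function barrier.

using DataFrames, Tables

# Type-unstable outer function: the compiler only sees an AbstractDataFrame here.
function rowsum(df::AbstractDataFrame)
    cols = Tables.columntable(df)   # explicit, one-time conversion
    return _rowsum_kernel(cols)     # function barrier: kernel specializes on typeof(cols)
end

# Type-stable kernel, compiled for the concrete NamedTuple-of-Vectors type.
function _rowsum_kernel(cols)
    s = 0.0
    for row in Tables.rows(cols)
        s += row.a + row.b
    end
    return s
end

rowsum(DataFrame(a = 1:3, b = 4:6))   # 21.0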

@davidanthoff
Contributor

I guess I was just worried that currently people might have code that uses Tables.rows that is structured in such a way that they have a function barrier and things work well with type-stable code. For those folks, this PR seems pretty breaking. But from my/queryverse's end that is not a problem, as long as we make sure IteratorInterfaceExtensions.getiterator continues to return a type-stable named tuple iterator.

@quinnj
Member Author

quinnj commented Dec 15, 2019

For those folks, this PR seems pretty breaking.

Let's make sure we're using clear language here; there is absolutely nothing breaking here. A 2x performance slowdown is not breaking, and as @bkamins pointed out, probably wouldn't even be noticed in a wide variety of use-cases.

And let's not forget the original issue here: CSV.write couldn't even write a several-thousand-column DataFrame due to compilation costs (I also tried a simple @map operation and had a similar result). This is a real production issue for applications trying to accept various wide/tall-shaped datasets. On a personal note, over the last year, the number one production issue we've had at Domo deploying a Julia application has been unanticipated compiler hangs for these kinds of "overly-typed" use-cases. We will far and away take a 2x performance hit if it means avoiding the compiler-hanging edge case.

Now, I do think we should probably go with the proposal of defining:

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) =
  Tables.datavaluerows(columntable(df))

**

But it still makes me a little uncomfortable knowing there's a code path that could bite people who are using DataFrames as the stable, mature table format. I know that 90% or even 95% of use-cases will never run into compilation issues with very wide datasets, but it does raise the question of Query's suitability for these very wide scenarios; I mean, it effectively can't be used if you're dealing with a really wide dataset, right? Are there any plans to try and support that?

My thoughts and feelings have definitely evolved over the last year or two on how critical type stability really is, after running into lots of production issues, issues with compilation costs in general, etc. I just think there are smarter ways to leverage Julia's power and flexibility by picking and choosing your "type battles"; there are a lot of cases, after going back and reviewing, where I've realized that the overhead of type stability was definitely not worth it compared to a simpler approach, with some extra code or API that allows introducing type stability more strategically or automatically when it would be really useful.

Anyway, sorry for waxing a tad too theoretical, but I wanted to make sure my thoughts on the "dangers" of type stability were thoroughly on the record.

** as a separate thought, it might be nice if we could just define it as Tables.datavaluerows(df), i.e. it would be nice if Query.jl didn't need a type-stable iterator itself, but could unroll the first NamedTuple and use the type stability from there. Not that it would really help in the worst-case scenario we're talking about here, but could perhaps save an extra conversion for non-type-stable table types.

@davidanthoff
Contributor

Yes, "breaking" is probably not the right word. I was really just referring to @bkamins statement that things would be 200x slower and a problem if this happened in a hot loop. For folks that use things in that way, this PR seems to have the potential to be a major performance regression. Not sure how important that is.

Query.jl really doesn't work with wide data at all; at this point I think one can realistically only use it with tidy data or something like that. My plan to fix this is to create an alternative backend for sources that store things in columnar format, and add a full column-oriented processing and whole-query optimization story (think MonetDB/X100). The hooks to do something like that are all there (this was the plan from the beginning of the whole project), but implementing it is a major undertaking, so that is more of a multi-year project. I'll probably try to get some research funding down the road so that I can hire some folks to work on this; I think there are enough novel issues here that it can count as research.

as a separate thought, it might be nice if we could just define it as Tables.datavaluerows(df), i.e. it would be nice if Query.jl didn't need a type-stable iterator itself, but could unroll the first NamedTuple and use the type stability from there. Not that it would really help in the worst-case scenario we're talking about here, but could perhaps save an extra conversion for non-type-stable table types.

I'm not entirely sure I understand this. Do you mean that it would be nice if IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df) worked? IteratorInterfaceExtensions.getiterator doesn't have to be type stable, but the object returned from that function needs to have enough type information so that the iterate methods on it are type stable. So I think whether that would work really just depends on how Tables.datavaluerows is implemented, right?

@quinnj
Member Author

quinnj commented Dec 15, 2019

I was really just referring to @bkamins statement that things would be 200x slower

As I stated in the original post, for the specific workflows of doing various sequences of Query.jl operations, in my benchmarks, the performance hit was 2x. I'm not exactly sure what @bkamins was referring to with the 200x statement as I didn't seem to see a comparison of type-stable vs. not. In my benchmarks, I was directly comparing current release code vs. this PR.

So I think whether that would work really just depends on how Tables.datavaluerows is implemented, right?

No; as you probably could guess, there isn't a way for Tables.datavaluerows to produce stably-typed elements from a type-unstable iterator. What I meant was that @map or @filter or any Query.jl operation could perhaps use a mechanism like collect_to_with_first! from Base, where the first element is iterated and then passed along as the only type information needed for the various operations to be performant. We have similar machinery in Tables.jl's buildcolumns code, so that input tables don't have to have a defined Schema to still be performant. I've actually been really pleased, and a little surprised, that the performance there is very close to the "known schema" case, without needing to rely on inference or anything.
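
A loose sketch of that idea (the helper names are hypothetical, not an actual Query.jl or Tables.jl API): iterate the first element of a possibly type-unstable iterator, then hand it to a kernel that gets compiled for its concrete type.

# Hypothetical sketch, similar in spirit to Base.collect_to_with_first!:
# take the first row eagerly, then specialize the rest of the reduction on it.
function mapsum_rows(f, rows)
    y = iterate(rows)
    y === nothing && error("empty row iterator")
    row, state = y
    # `row` now has a concrete type, so this call acts as a function barrier:
    # the kernel below specializes on the concrete types of f(row) and rows.
    return _mapsum_rest(f, f(row), rows, state)
end

function _mapsum_rest(f, acc, rows, state)
    y = iterate(rows, state)
    while y !== nothing
        row, state = y
        acc += f(row)
        y = iterate(rows, state)
    end
    return acc
end

# Usage: works on eachrow(df) even though its element type is not inferable up front.
# mapsum_rows(r -> r.a + r.b, eachrow(df))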

My plan to fix this is to create an alternative backend for sources that store things in columnar format, and add a full column oriented processing and whole-query optimization story

I remember discussing some future Query stuff like this, but I'm not sure I quite see how that enables processing wide datasets any better. Currently, the interface requires iterating strongly typed NamedTuples, right? Which I think already puts really wide datasets into a bit of a trap. Would the alternative backend use some other kind of mechanism to avoid materializing the NamedTuples, or even their types? I know there's some scaffolding in TableTraits for column processing, but I think I remember it still required returning a NamedTuple of vectors as columns, which is still problematic.

Anyway, probably a little off-topic at this point, so I'll make the changes for Query.jl here and I think it should be ready to go.

@bkamins
Member

bkamins commented Dec 15, 2019

Just to be clear, here is a simple comparison. x200 is the worst ratio I could get. Here is a "normal" situation:

julia> using DataFrames, Tables

julia> df = DataFrame!([rand(10^5) for _ in 1:10^4]);

julia> function f1(df)
                  s = 0.0
                  for row in eachrow(df)
                      s += row.x10 + row.x9990
                  end
                  return s
              end
f1 (generic function with 1 method)

julia> function f2(df)
                  s = 0.0
                  for row in Tables.rows(df)
                      s += row.x10 + row.x9990
                  end
                  return s
              end
f2 (generic function with 1 method)

julia> @time f2(df)
294.192706 seconds (23.67 M allocations: 1.205 GiB, 0.18% gc time)
100109.37692058666

julia> @time f2(df)
  0.417728 seconds (709.01 k allocations: 15.016 MiB, 5.58% gc time)
100109.37692058666

julia> @time f2(df)
  0.397983 seconds (709.01 k allocations: 15.016 MiB)
100109.37692058666

julia> @time f1(df)
  0.097221 seconds (1.31 M allocations: 33.495 MiB, 9.64% gc time)
100109.37692058666

julia> @time f1(df)
  0.015580 seconds (898.98 k allocations: 15.243 MiB)
100109.37692058666

julia> @time f1(df)
  0.018833 seconds (898.98 k allocations: 15.243 MiB)
100109.37692058666

And f2 is slower than f1 not only to compile but also when iterating.

My x200 statement was about an optimal way to do a certain operation, which in this case roughly is:

# the sum is different, as I had to restart my terminal
julia> function f3(df)
           sum(df.x1) + sum(df.x9900)
       end
f3 (generic function with 1 method)

julia> @time f3(df)
  0.025982 seconds (78.78 k allocations: 3.969 MiB)
100139.2393094188

julia> @time f3(df)
  0.000087 seconds (7 allocations: 208 bytes)
100139.2393094188

julia> @time f3(df)
  0.000088 seconds (7 allocations: 208 bytes)
100139.2393094188


Tables.schema(df::AbstractDataFrame) = Tables.Schema(names(df), eltype.(eachcol(df)))
Tables.schema(df::DataFrameRows) = Tables.schema(getfield(df, :df))
Member

Maybe this needs a new test?

Member

Agreed - it is probably clear what this will produce, but a test will give us an error if someone makes a breaking change to this part of the source code in the future.
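
For concreteness, such a test might look roughly like this (a sketch only, not necessarily the exact test added in #2055):

using DataFrames, Tables, Test

df = DataFrame(a = [1, 2], b = ["x", "y"])
sch = Tables.schema(eachrow(df))
@test sch.names == (:a, :b)
@test sch.types == (Int, String)   # Int == Int64 on 64-bit systems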

Member Author

I went ahead and merged @tkf's PRs that added tests for Tables.schema(df::DataFrameRows) already, so we should be good to go here now.

Member

You mean #2054? I don't see how it tests schema.

Member Author

No, #2055

Member

Ah, OK. TBH I'd have preferred if you had waited a bit more until we had a chance to comment on the final design...

Member Author

Oops. I didn't realize there was more to discuss or pending questions/concerns. Is there something specific you're still wondering about?

Contributor

Actually, I have one: #2055 (comment). Sorry, it turned out I didn't understand what I wanted...

Member

@bkamins left a comment

I am OK to merge it. If someone wants type stability, it is easy enough to call Tables.columntable on a data frame and later work with it.

@bkamins
Member

bkamins commented Dec 16, 2019

CI fails because DataFrameRows returns propertynames as a vector, not as a tuple.
This follows the contract of propertynames, so it is technically OK, but maybe Tables.jl relies on propertynames being a Tuple and we should change this?

@@ -70,7 +70,7 @@ Base.propertynames(d::DuplicateNamesColumnTable) = (:a, :a, :b)
 @test @inferred(Tables.materializer(table)(Tables.columns(table))) isa typeof(table)

 row = first(Tables.rows(table))
-@test propertynames(row) == (:a, :b)
+@test collect(propertynames(row)) == [:a, :b]
Member

E.g. this is the point I raised for discussion. Now, depending on whether you pass e.g. df or eachrow(df) to Tables.rows, you get a different type of row and, in consequence, a different return type from propertynames. Is this OK to have?
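
For illustration, a hypothetical REPL snippet showing that difference as it stood at the time (#2056 later changed DataFrames so that propertynames returns a Tuple here as well):

using DataFrames, Tables

df = DataFrame(a = 1:2, b = 3:4)

# DataFrameRow (what Tables.rows on a DataFrame now yields): a Vector at the time
propertynames(first(eachrow(df)))           # [:a, :b]

# NamedTuple row (the old Tables.rows element type): a Tuple
propertynames(first(Tables.rowtable(df)))   # (:a, :b)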

Member

Actually I have made #2056 to make sure we do not have to change this line as propertynames now always returns a Tuple.

@quinnj merged commit 6e70765 into master Dec 22, 2019
@quinnj deleted the jq/tablesrows branch December 22, 2019 06:54