
Ideas for better static type inference #451

Closed

simonster opened this issue Dec 16, 2013 · 7 comments

@simonster
Contributor

I had a discussion with @malmaud today about type inference for DataFrames. The core problem is that when you index into a DataFrame, under the hood, we index into a Vector{Any} and so you get no type information about the result. Thus, if you want to iterate over one or more columns of a DataFrame, the only reasonably performant way to do so is to first pull out the columns with typeasserts (or pass them to a function) and then iterate over these columns, which is not really idiomatic.
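
For illustration, a minimal sketch of that workaround (the column name "x", the function names, and the assumption that the column holds plain floats are all made up for the example):

# Slow: df[i, "x"] is inferred as Any, so every operation in the loop body
# is dynamically dispatched.
function colsum_slow(df::DataFrame)
    s = 0.0
    for i in 1:nrow(df)
        s += df[i, "x"]
    end
    s
end

# Workaround: pull the column out once and hand it to a helper function;
# the helper gets compiled for the column's concrete type.
colsum_fast(df::DataFrame) = colsum_inner(df["x"])

function colsum_inner(col)
    s = 0.0
    for i in 1:length(col)
        s += col[i]
    end
    s
end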

Part of one potential solution would be to make the columns of a DataFrame a tuple instead of a Vector{Any}, and to parametrize the DataFrame by the type of this tuple. Then we'd have type information for the columns if referenced by index, and we'd at least have a type union for indexing by a string. The downside to this approach is that we wouldn't be able to add columns to an existing DataFrame, only create new DataFrames with more columns or re-ordered columns. Syntax like df["newcol"] = x wouldn't work.

This could make code that indexes a DataFrame as df[i, 1] nearly as fast as indexing a corresponding Array. However, ideally code that indexes a DataFrame as df[i, "mycol"] would also be as fast as indexing an Array. This doesn't seem to be possible without changes in Base. We'd have to push the process of looking up the column name in the Index to type inference time, so that 1) we don't have to perform the lookup on each loop iteration and 2) type inference has information about the type of the column. We'd either need a way to hook into the type inference process or some kind of "named tuple" primitive that has type information for each member like a tuple but can be indexed with strings.
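
As a rough sketch of what the tuple-backed layout might look like (TupleDataFrame and its fields are hypothetical names, not a concrete design):

immutable TupleDataFrame{C}
    columns::C        # C is the tuple of column types, e.g. (Vector{Int}, Vector{Float64})
    colindex::Index   # name-to-position lookup, as in the current DataFrame
end

# Indexing by integer position goes straight to the underlying tuple, so a
# constant column index could in principle expose the column type to inference.
Base.getindex(df::TupleDataFrame, i::Real, j::Int) = df.columns[j][i]

# Indexing by name still does a runtime lookup (assuming the Index maps a name
# to its position), so at best the result is inferred as the union of the
# column element types.
Base.getindex(df::TupleDataFrame, i::Real, name::String) =
    df.columns[df.colindex[name]][i]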

@tshort
Contributor

tshort commented Dec 16, 2013

Those are intriguing ideas, Simon. I've been thinking about trying to make an AbstractDataFrame that's just a type created on the fly with columns as members. It has some of the same advantages and disadvantages as your approach. I like that your approach doesn't "abuse" the type system.

Maybe we could have a "loose" DataFrame type and another "locked-in" DataFrame that parametrizes using the tuple. The locked-in version could be created when needed.


@kmsquire
Contributor

> Syntax like df["newcol"] = x wouldn't work.

Actually, it should be totally possible to make this work. It would just mean creating a new tuple, which might not be efficient for DataFrames with a lot of columns, but it should work.
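
Concretely, with the tuple-backed layout sketched earlier, appending a column means splatting the old columns into a brand-new tuple, which is an O(ncol) operation and produces a value with a different type:

cols = ([1, 2, 3], [1.0, 2.0, 3.0])   # existing columns
newcol = [true, false, true]
newcols = tuple(cols..., newcol)      # fresh 3-element tuple with a new type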

@simonster
Contributor Author

> Actually, it should be totally possible to make this work. It would just mean creating a new tuple, which might not be efficient for DataFrames with a lot of columns, but it should work.

But we need to parametrize the DataFrame by the type of the tuple to get type inference on it, and if we do that, we can't then go change that type without creating a new DataFrame.

@kmsquire
Contributor

Okay, that's true with your proposal.

How about skipping that parameterization but, under the hood, still dispatching based on the type of df.columns? Something like

getindex(df::DataFrame, row_ind::Real, col_ind::ColumnIndex) =
    df_getindex(df.columns, row_ind, col_ind)

function df_getindex{T}(columns::T, row_ind::Real, col_ind::ColumnIndex)
    ...
end

Would this work in a similar manner to your initial proposal?


@johnmyleswhite
Contributor

This is definitely a really cool idea. One thing we might do before making any decisions to scrap the current internals is to finally build up some reliable benchmarks. It seems, from a shallow reading, like the proposal would improve some things and worsen others. Assuming that's true, it would be nice to know that more benchmarks are improved by this change than are worsened. It seems like that should be the case, but it would be worth testing.

@simonster
Contributor Author

As I think about it more, this approach doesn't actually help much, because type inference for tuple indexing with a constant index is implemented by a t-function in inference.jl. If we have:

immutable A{T}
    a::T
end
Base.getindex(x::A, i::Int) = x.a[i]
f(x) = (y = x[1]; y)

then:

julia> code_typed(f, ((Int, Float64),))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{:y},{{:x,(Int64,Float64),0},{:y,Int64,18}},{}}, quote  # none, line 1:
        y = tupleref(x::(Int64,Float64),1)::Union(Float64,Int64)
        return y::Int64
    end)))

julia> code_typed(f, (A{(Int, Float64)},))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{:y},{{:x,A{(Int64,Float64)},0},{:y,Union(Float64,Int64),18}},{}}, quote  # none, line 1:
        y = tupleref(top(getfield)(x::A{(Int64,Float64)},:a)::(Int64,Float64),1)::Union(Float64,Int64)
        return y::Union(Float64,Int64)
    end)))

So we don't get the concrete type information we'd like if we index into a type that wraps the tuple with a constant index. To get concrete type information, we'd need "rerun type inference after inlining in some cases" from JuliaLang/julia#3440 or our own t-function. OTOH, even knowing the union type might be better than nothing, depending on what later code does.
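
For what it's worth, when only the union is inferred, a typeassert at the use site can still recover a concrete type for the code that follows (a minimal sketch reusing the A wrapper above; asserting Int just stands in for whatever we happen to know about that column):

# x[1] is inferred as Union(Float64,Int64); the typeassert narrows it to Int,
# so the addition below is compiled for Int.
g(x::A) = (y = x[1]::Int; y + 1)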

I also worry that we'd be compiling code for every combination of DataFrame column types, which might be bad for code that works with a large number of differently-typed DataFrames, but is probably fine in general.

@kmsquire I don't think that would help much, since calling df_getindex still requires looking up the method at runtime, which is expensive.

@johnmyleswhite I agree that any decisions we make should be informed by benchmarks.

@nalimilan
Member

Closing in favor of #1335.
