
Ideas for better static type inference #451

Closed

simonster opened this issue Dec 16, 2013 · 7 comments

@simonster
Contributor

I had a discussion with @malmaud today about type inference for DataFrames. The core problem is that when you index into a DataFrame, under the hood, we index into a Vector{Any} and so you get no type information about the result. Thus, if you want to iterate over one or more columns of a DataFrame, the only reasonably performant way to do so is to first pull out the columns with typeasserts (or pass them to a function) and then iterate over these columns, which is not really idiomatic.
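
For illustration, a minimal sketch of that workaround (the column name "x", the function names, and the assumption that the column holds plain floats are all made up for the example):

# Slow: df[i, "x"] is inferred as Any, so every operation in the loop body
# is dynamically dispatched.
function colsum_slow(df::DataFrame)
    s = 0.0
    for i in 1:nrow(df)
        s += df[i, "x"]
    end
    s
end

# Workaround: pull the column out once and hand it to a helper function;
# the helper gets compiled for the column's concrete type.
colsum_fast(df::DataFrame) = colsum_inner(df["x"])

function colsum_inner(col)
    s = 0.0
    for i in 1:length(col)
        s += col[i]
    end
    s
end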

Part of one potential solution would be to make the columns of a DataFrame a tuple instead of a Vector{Any}, and to parametrize the DataFrame by the type of this tuple. Then we'd have type information for the columns if referenced by index, and we'd at least have a type union for indexing by a string. The downside to this approach is that we wouldn't be able to add columns to an existing DataFrame, only create new DataFrames with more columns or re-ordered columns. Syntax like df["newcol"] = x wouldn't work.

This could make code that indexes a DataFrame as df[i, 1] nearly as fast as indexing a corresponding Array. However, ideally code that indexes a DataFrame as df[i, "mycol"] would also be as fast as indexing an Array. This doesn't seem to be possible without changes in Base. We'd have to push the process of looking up the column name in the Index to type inference time, so that 1) we don't have to perform the lookup on each loop iteration and 2) type inference has information about the type of the column. We'd either need a way to hook into the type inference process or some kind of "named tuple" primitive that has type information for each member like a tuple but can be indexed with strings.
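
As a rough sketch of what the tuple-backed layout might look like (TupleDataFrame and its fields are hypothetical names, not a concrete design):

immutable TupleDataFrame{C}
    columns::C        # C is the tuple of column types, e.g. (Vector{Int}, Vector{Float64})
    colindex::Index   # name-to-position lookup, as in the current DataFrame
end

# Indexing by integer position goes straight to the underlying tuple, so a
# constant column index could in principle expose the column type to inference.
Base.getindex(df::TupleDataFrame, i::Real, j::Int) = df.columns[j][i]

# Indexing by name still does a runtime lookup (assuming the Index maps a name
# to its position), so at best the result is inferred as the union of the
# column element types.
Base.getindex(df::TupleDataFrame, i::Real, name::String) =
    df.columns[df.colindex[name]][i]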

@tshort
Contributor

tshort commented Dec 16, 2013

Those are intriguing ideas, Simon. I've been thinking about trying to make an AbstractDataFrame that's just a type created on the fly with columns as members. It has some of the same advantages and disadvantages as your approach. I like that your approach doesn't "abuse" the type system.

Maybe we could have a "loose" DataFrame type and another "locked-in" DataFrame that parametrizes using the tuple. The locked-in version could be created when needed.


@kmsquire
Contributor

> Syntax like df["newcol"] = x wouldn't work.

Actually, it should be totally possible to make this work. It would just mean creating a new tuple, which might not be efficient for DataFrames with a lot of columns, but it should work.
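
Concretely, with the tuple-backed layout sketched earlier, appending a column means splatting the old columns into a brand-new tuple, which is an O(ncol) operation and produces a value with a different type:

cols = ([1, 2, 3], [1.0, 2.0, 3.0])   # existing columns
newcol = [true, false, true]
newcols = tuple(cols..., newcol)      # fresh 3-element tuple with a new type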

@simonster
Contributor Author

> Actually, it should be totally possible to make this work. It would just mean creating a new tuple, which might not be efficient for DataFrames with a lot of columns, but it should work.

But we need to parametrize the DataFrame by the type of the tuple to get type inference on it, and if we do that, we can't then go change that type without creating a new DataFrame.

@kmsquire
Contributor

Okay, that's true with your proposal.

How about skipping that parameterization but, under the hood, still dispatching based on the type of df.columns? Something like

getindex(df::DataFrame, row_ind::Real, col_ind::ColumnIndex) =
    df_getindex(df.columns, row_ind, col_ind)

function df_getindex{T}(columns::T, row_ind::Real, col_ind::ColumnIndex)
    ...
end

Would this work in a similar manner to your initial proposal?


@johnmyleswhite
Contributor

This is definitely a really cool idea. One thing we might do before making any decisions to scrap the current internals is to finally build up some reliable benchmarks. It seems, from a shallow reading, like the proposal would improve some things and worsen others. Assuming that's true, it would be nice to know that more benchmarks are improved by this change than are worsened. It seems like that should be the case, but it would be worth testing.

@simonster
Contributor Author

As I think about it more, this approach doesn't actually help much, because type inference for tuple indexing with a constant index is implemented by a t-function in inference.jl. If we have:

immutable A{T}
    a::T
end
Base.getindex(x::A, i::Int) = x.a[i]
f(x) = (y = x[1]; y)

then:

julia> code_typed(f, ((Int, Float64),))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{:y},{{:x,(Int64,Float64),0},{:y,Int64,18}},{}}, quote  # none, line 1:
        y = tupleref(x::(Int64,Float64),1)::Union(Float64,Int64)
        return y::Int64
    end)))

julia> code_typed(f, (A{(Int, Float64)},))
1-element Array{Any,1}:
 :($(Expr(:lambda, {:x}, {{:y},{{:x,A{(Int64,Float64)},0},{:y,Union(Float64,Int64),18}},{}}, quote  # none, line 1:
        y = tupleref(top(getfield)(x::A{(Int64,Float64)},:a)::(Int64,Float64),1)::Union(Float64,Int64)
        return y::Union(Float64,Int64)
    end)))

So we don't get the concrete type information we'd like if we index into a type that wraps the tuple with a constant index. To get concrete type information, we'd need "rerun type inference after inlining in some cases" from JuliaLang/julia#3440 or our own t-function. OTOH, even knowing the union type might be better than nothing, depending on what later code does.
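
For what it's worth, when only the union is inferred, a typeassert at the use site can still recover a concrete type for the code that follows (a minimal sketch reusing the A wrapper above; asserting Int just stands in for whatever we happen to know about that column):

# x[1] is inferred as Union(Float64,Int64); the typeassert narrows it to Int,
# so the addition below is compiled for Int.
g(x::A) = (y = x[1]::Int; y + 1)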

I also worry that we'd be compiling code for every combination of DataFrame column types, which might be bad for code that works with a large number of differently-typed DataFrames, but is probably fine in general.

@kmsquire I don't think that would help much, since calling df_getindex still requires looking up the method at runtime, which is expensive.

@johnmyleswhite I agree that any decisions we make should be informed by benchmarks.

@nalimilan
Member

Closing in favor of #1335.
