Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Vector conversion for a DataFrame #1461

Closed
bkamins opened this issue Jul 24, 2018 · 19 comments
Closed

Adding Vector conversion for a DataFrame #1461

bkamins opened this issue Jul 24, 2018 · 19 comments

Comments

@bkamins
Copy link
Member

bkamins commented Jul 24, 2018

Every once in a while I need to access columns of a DataFrame as a vector of vectors. This is exactly what df.columns is, but of course we should not expose it. On the other hand eachcol is not very user friendly now.

What we could do:

  • add something like Vector(df::DataFrame) = copy(df.columns) conversion;
  • redefine eachcol to be more usable.

Actually I prefer the first option as we already have Matrix conversion. Any thoughts?

@pdeffebach
Copy link
Contributor

pdeffebach commented Jul 27, 2018

I think we should add an eachcol operation that doesn't return a (name, col) tuple, rather just iterating through columns, so it is easier to use like a vector of vectors.

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

So the todo would be:

  1. update current eachcol to support new iteration protocol
  2. add kwarg usenames to eachcol that is true by default; if it is false then do not pass names in the iteration
  3. make DFColumnIterator have a second parameter that is true or false and keeps information which style we want (so that we can dispatch on it)
  4. define map in such a way that if usenames is false a standard map is used (not the custom one that is present currently)

If we are OK with this plan I can implement it.

@pdeffebach
Copy link
Contributor

Would changing the output of the iterator based on a keyword argument lead to type instability? If it is true, it returns a tuple, if not, it returns a vector?

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

The idea is the following:

struct DFColumnIterator{U, T <: AbstractDataFrame}
    df::T
end
eachcol(df::T; usenames::Bool=true) where T<: AbstractDataFrame= DFColumnIterator{usenames, T}(df)

and then in any function you specify:

somefunction(itr::DFColumnIterator{true}) = ...

or

somefunction(itr::DFColumnIterator{false}) = ...

And it will be type stable AFAIK.

@pdeffebach
Copy link
Contributor

I think that is a good idea. maybe names instead of usenames. It kind of makes sense to define another iterator but I don't know what the name would be.

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

Ah - I see what you mean - we could use Val if we find that the compiler complains. But this should be a small union and those should be handled efficiently. (EDIT: this refers to your earlier question)

@pdeffebach
Copy link
Contributor

It might be hard to make this as fast as df[i] for i in 1:ncol(df)

using BenchmarkTools
df = DataFrame(rand(100,100));

function newarray(df::DataFrame, f::Function)
       [f(df[i]) for i in 1:ncol(df)]
end

function newarrayiter(df::DataFrame, f::Function)
       [f(col) for (name, col) in eachcol(df)]
end

julia> @btime newarray($df, mean);
  7.931 μs (104 allocations: 2.52 KiB)

julia> @btime newarrayiter($df, mean);
  30.791 μs (203 allocations: 5.59 KiB)

Perhaps this performance difference will go away if we stop returning name.

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

This in tests is actually faster:

struct Test{T <: AbstractDataFrame}
    df::T
end
ec(df::AbstractDataFrame) = Test(df)
Base.length(itr::Test) = ncol(itr.df)

function Base.iterate(itr::Test, state::Int=0)
    state += 1
    ncol(itr.df) < state && return nothing
    (itr.df[state], state)
end

than [f(df[i]) for i in 1:ncol(df)]

@pdeffebach
Copy link
Contributor

Cool. let's add the functionality then, and if we decide we need it to be a separate iterator rather than a parametric type we can always do that.

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

OK - @nalimilan - do you have any comments before the PR?

@pdeffebach
Copy link
Contributor

This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

Yes - this is the issue, but fortunately only noticeable if work done on a column is small; if the work is large enough then it is usually delegated to a function that works as barrier-function.

@pdeffebach
Copy link
Contributor

True. I've been benchmarking some large aggregate operations recently and I think that inference is a barrier to high performance for this. aggregate iterates, so maybe a decent amount of bottlenecks could be gotten rid of with just a typed iterator without the need for a NamedTuple implementation. But I don't know if typed iterators are a thing.

@nickeubank
Copy link
Contributor

This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?

Sorry, I recognize this is a touch off topic, but not sure where its best to bring it up: seems like a lack of type-stability of columns underlies many of the performance issues for DataFrames (if I understand what's going on). Is moving to type-stable columns a subject of discussion? If so, where?

@pdeffebach
Copy link
Contributor

I would say #1335 and #1256 have this discussion. Then there is nl/typed which has some code for an implementation.

@bkamins
Copy link
Member Author

bkamins commented Jul 27, 2018

And #744 - and old issue that is still open

@nickeubank
Copy link
Contributor

thx

@nalimilan
Copy link
Member

@bkamins Do you think something still needs to be done here?

@bkamins
Copy link
Member Author

bkamins commented Jan 21, 2019

It can be closed given our current implementation (there is still a deprecation period finished by #1613, but we will not lose track of it). The type stability issue is important, but I would discuss it in the other threads.

@bkamins bkamins closed this as completed Jan 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants