Many functions want vector, DataFrame produces DataArray #368

Closed
houshuang opened this issue Oct 4, 2013 · 2 comments

Comments

@houshuang

I ran into this problem first with Gadfly:

a = DataFrame(diff = [1,2,3,3,3,4,3,2])
plot(a, x="diff", Geom.histogram)

no method choose_bin_count_1d(PooledDataArray{Int64,Uint32,1},)
 in apply_statistic at /Users/Stian/.julia/Gadfly/src/statistics.jl:92
 in apply_statistics at /Users/Stian/.julia/Gadfly/src/statistics.jl:33
 in render at /Users/Stian/.julia/Gadfly/src/Gadfly.jl:575
 in writemime at /Users/Stian/.julia/Gadfly/src/Gadfly.jl:667
 in sprint at io.jl:426
 in display_dict at /Users/Stian/.julia/IJulia/src/execute_request.jl:35

By manually changing the type specification of choose_bin_count_1d, I got it to work, and it plotted nicely. I then tried Geom.density, and this time the kde function from Distributions complained, which made me realize that the problem goes beyond the Gadfly package.

julia> a = DataFrame(diff=[1,2,3,4,5])
5x1 DataFrame:
        diff
[1,]       1
[2,]       2
[3,]       3
[4,]       4
[5,]       5

julia> kde(a)
ERROR: no method kde(DataFrame,)

julia> kde(a["diff"])
ERROR: no method kde(DataArray{Int64,1},)

julia> kde([1,2,3,4,5])
UnivariateKDE([-1.9207538685519072,-1.9159484448521495,-1.9111430211523919,-1.906

This seems very impractical and illogical... I'm not sure if all functions have to be rewritten to use AbstractVector, or how this could be solved, but it's the first time I've had Julia's type system hit me in the head - in a duck-typing system, you'd say that if it was able to give you a bunch of numbers, go calculate...

Just removing the type annotations from the definitions of kde (and bandwidth) makes it work:

julia> a=DataFrame(f=[1,2,3,4,5])
5x1 DataFrame:
        f
[1,]    1
[2,]    2
[3,]    3
[4,]    4
[5,]    5


julia> kde(a["f"])
UnivariateKDE([-1.9207538685519072,-1.9159484448521495,-1.9111430211523919,-1.906337597452634,-1.9015321737528763,-1.8967267500531186,-1.891921326353361,-1.887115902653603,-1.8823104789538454,-1.8775050552540877    7.87269963155433,7.877505055254087,7.882310478953846,7.887115902653603,7.89192132635336,7.896726750053119,7.9015321737528765,7.906337597452634,7.911143021152391,7.91594844485215],[0.0018792598991436565,0.0018796348087474213,0.0018803845265443903,0.0018815091277097906,0.0018830087249675914,0.0018848834685850496,0.0018871335463659517,0.0018897591836418703,0.001892760643261479,0.0018961382255785753    0.0018961355641484124,0.0018927582655279998,0.0018897570884212234,0.0018871317326146142,0.0018848819353993965,0.0018830074715836878,0.001881508153502998,0.0018803838310290889,0.0018796343915768762,0.0018792597601098747])
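
The signature problem described above can be sketched in current Julia (1.x) syntax, which differs from the 2013 syntax used in this thread. The functions `strict_mean` and `generic_mean` are hypothetical stand-ins, not the actual kde or bandwidth definitions from Distributions; the point is only that a method typed on a concrete `Vector` rejects other array types, while one typed on `AbstractVector` accepts them.

```julia
# Hypothetical stand-ins; these are not the Distributions definitions.

# Over-specific signature: only a concrete Vector{Float64} dispatches here.
strict_mean(x::Vector{Float64}) = sum(x) / length(x)

# Looser signature: any one-dimensional array of real numbers is accepted.
generic_mean(x::AbstractVector{<:Real}) = sum(x) / length(x)

r = 1:5   # a UnitRange{Int}: an AbstractVector, but not a Vector

println(generic_mean(r))                    # dispatches via the abstract signature
println(applicable(strict_mean, r))         # false: no matching method exists
println(strict_mean(collect(Float64, r)))   # converting to a Vector first also works
```

Widening the signature only addresses dispatch; whether the function body gives a sensible answer for a container that can hold NA is a separate question.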
@johnmyleswhite
Contributor

You should try using the vector function to convert DataVectors into Vectors.

Beyond that, this isn't a simple thing to resolve. We should change most functions to work on AbstractVector, but they will almost certainly give the wrong answer for almost all DataVectors, so that change will do sadly little to improve the situation. You can't use the same function definitions when NAs are present, because NAs will poison the result. What makes things worse is that you also can't use a simple strategy like always dropping NAs: that works only in a small subset of cases. What needs to happen is that we implement every function for both Array and DataArray.
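
As a rough illustration of the poisoning point, here is a minimal sketch in current Julia (1.x), using Base's `missing` as a stand-in for the NA of this thread (an assumption; DataArrays' NA is a different type). It shows a generic reduction being poisoned, and one case where dropping values independently would be wrong.

```julia
# `missing` stands in for the NA discussed in the thread (an assumption).
x = [1.0, missing, 3.0]

# A generic reduction propagates the missing value: the result is `missing`.
total = sum(x)
println(ismissing(total))

# Dropping missing values has to be an explicit choice by the caller.
println(sum(skipmissing(x)))

# Dropping is not always meaningful: for paired data, dropping from each
# vector independently would misalign the pairs, so complete pairs must
# be selected jointly.
y = [missing, 2.0, 6.0]
keep = .!(ismissing.(x) .| ismissing.(y))   # true only where both values are present
println(x[keep], " ", y[keep])
```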

This is certainly going to be one of the hardest points of transition between Julia and R, but it's basically exactly what happens in Python and NumPy with two distinct types of arrays. The reason this seems like a non-issue in R is that R doesn't have an Array type at all -- it only has DataArray.

I think it would be better to open specific issues for functions we need to extend to work on DataArray.

@quinnj
Member

quinnj commented Sep 7, 2017

DataFrames are now mostly agnostic to the underlying column type, so if you put Vectors in, you get Vectors out.

@quinnj quinnj closed this as completed Sep 7, 2017