Is there a case for operations like \ for DataArrays? #692

Closed
tomasaschan opened this issue Sep 27, 2014 · 10 comments

Comments

@tomasaschan

I just stumbled over this (trying to construct a linear regression from my data):

data = readtable("mydata.csv")

A = hcat(data[:S], ones(size(data,1)))
c = A\data[:V]
ERROR: `A_ldiv_B!` has no method matching A_ldiv_B!(::Array{Float64,2}, ::DataArray{Float64,1})
 in \ at linalg/generic.jl:232

Since c = A\data[:V].data works, I figure this could easily be fixed by adding a method that does just that, but I also noted in #368, #346 and #165 that there might be reasons not to.

If implementing this in DataFrames.jl is a bad idea, what would be the "standard" way to solve an equation system like the one above?
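One explicit way to do this is sketched below. It is written in current Julia syntax, where `Union{Float64,Missing}` stands in for a `DataArray`, and is not the 2014 DataArrays API. The values in `S` and `V` are made up for illustration. The idea is to check for missing values first, convert to plain arrays deliberately, and only then call `\`:

```julia
using LinearAlgebra

# Hypothetical data standing in for data[:S] and data[:V].
# Union{Float64,Missing} plays the role of a DataArray in this sketch.
S = Union{Float64,Missing}[1.0, 2.0, 3.0, 4.0]
V = Union{Float64,Missing}[3.0, 5.0, 7.0, 9.0]

# Fail loudly instead of silently reading through missing entries.
any(ismissing, S) && error("S contains missing values")
any(ismissing, V) && error("V contains missing values")

# Explicit conversion to plain arrays, then a least-squares solve.
A = hcat(Float64.(S), ones(length(S)))
c = A \ Float64.(V)   # c[1] is the slope, c[2] the intercept
```

The conversion is one extra line, but it makes the "no missing values" assumption visible at the call site.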

@johnmyleswhite
Contributor

This is a tough question to answer. It's not clear to me whether the output solution should be all missing values when there is a single missing value in the input.

If we did implement this, it should go in DataArrays.

@tomasaschan
Author

I have no idea if it's feasible implementation-wise, but a cool way of dealing with missing values for this specific operation would be to discard the rows (i.e. equations) that have missing values, and try to solve the remaining system. So if A is MxN, and b::DataArray has M rows, n of which are missing, then the resulting system would be (M-n)xN, with no NAs. If the system is still solvable (i.e. A with all missing-value rows removed still has full column rank, and the system is still sufficiently well determined), the answer is the solution of that system. If the system is not solvable, either an error should be thrown (preferable to me) or a DataArray with only NAs could be returned (which might be confusing...).

Rationale: Most of the time when using A\b, we're solving an overdetermined system of equations, i.e. to fit some line or curve to a dataset. If I need to get rid of NAs before I solve the system, that's probably what I'd end up doing anyway, so it'd be nice having DataArrays do it for me.
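The row-dropping idea described above could be sketched as follows. This is only an illustration in current Julia syntax (`missing` instead of `NA`). The function name `solve_dropping_missing` is hypothetical and is not part of DataArrays or DataFrames, and the sample data is made up:

```julia
using LinearAlgebra

# Hypothetical helper: drop every equation (row) that contains a missing
# value in A or b, then solve the remaining system in the least-squares sense.
function solve_dropping_missing(A::AbstractMatrix, b::AbstractVector)
    size(A, 1) == length(b) ||
        throw(DimensionMismatch("A and b must have the same number of rows"))

    # A row is kept only if b and the whole row of A are free of missing values.
    keep = [!ismissing(b[i]) && !any(ismissing, view(A, i, :)) for i in 1:length(b)]
    Ak = Float64.(A[keep, :])
    bk = Float64.(b[keep])

    # Throw an error instead of returning an all-missing result.
    size(Ak, 1) >= size(Ak, 2) || error("too few complete rows to determine a solution")
    rank(Ak) == size(Ak, 2) || error("remaining system is rank deficient")

    return Ak \ bk
end

# Made-up example: the second equation is dropped because b[2] is missing.
A = [1.0 1.0; 2.0 1.0; 3.0 1.0; 4.0 1.0]
b = [3.0, missing, 7.0, 9.0]
x = solve_dropping_missing(A, b)
```

Because the helper is an ordinary named function, the decision to drop rows stays visible in user code and is not hidden behind `\`.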

@johnmyleswhite
Contributor

I'd rather that something that involves meaningful assumptions be explicit, rather than be an implicit default.

@tomasaschan
Author

In what context would A\b not mean "solve the equation system Ax=b"?

@johnmyleswhite
Contributor

That's not what I said. The question is: "are there multiple reasonable ways to propose solving a linear equation where some entries of A or b are unknown?". And the answer is: "yes".
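To illustrate the point, here is a small sketch comparing two plausible treatments of a missing entry in b: dropping the affected equation, and filling it with the mean of the observed values. Both strategies, and the data, are chosen here for illustration and are not taken from the thread. The code uses current Julia syntax rather than DataArrays:

```julia
using LinearAlgebra
using Statistics

# Made-up data with one unknown entry in the right-hand side.
S = [1.0, 2.0, 3.0, 4.0]
V = [3.0, missing, 7.0, 10.0]
A = hcat(S, ones(length(S)))

# Strategy 1: drop equations whose right-hand side is missing.
keep = .!ismissing.(V)
c_drop = A[keep, :] \ Float64.(V[keep])

# Strategy 2: replace each missing value with the mean of the observed ones.
V_filled = coalesce.(V, mean(skipmissing(V)))
c_impute = A \ V_filled

# The two fitted lines generally differ, so neither is an obvious default.
println("drop rows:   ", c_drop)
println("mean impute: ", c_impute)
```

Both results are defensible answers to "solve Ax = b with unknown entries", which is why the choice arguably belongs in user code.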

@grayclhn

Yikes. Implicitly casting a DataFrame to a matrix for matrix operations seems like a frightening idea. Especially since it's shorthand that users can trivially implement locally if it fits with their workflow.

One case where A\data[:V] is likely to do badly is when data[:V] is a PooledDataArray. Let alone when there are missing values.

@floswald
Contributor

Is there any strong reason for you not to use GLM for curve fitting? There may be other solutions out there as well.


@tomasaschan
Author

GLM is serious overkill for the scenario I have at hand at the moment. Also, GLM.jl seems to be aimed primarily at users coming from R, and I've never used R. On the other hand, I've used MATLAB more than enough, and there (to no one's surprise) linear fitting is most easily done with the backslash operator...

@johnmyleswhite
Contributor

Just want to remind folks that this issue should be shifted to DataArrays since it doesn't make any sense to try to define matrix ops on DataFrames.

@tomasaschan
Author

Closing this for housekeeping, since it doesn't seem that it's a feature we want (and definitely not here).

4 participants