Future plan for Stats.jl and statistical computing in general #4168

Closed
ViralBShah opened this issue Aug 28, 2013 · 7 comments

Comments

@ViralBShah
Member

I would love to do the following to make statistical computing much more accessible in julia. This would make things a lot simpler for folks coming from the R world.

  1. Merge the existing Stats.jl into Base. It is really basic stuff, and it appears to be in much better shape for Base now than when it was split out.
  2. Make Stats.jl a meta-package that depends on DataFrames, GLM, and Distributions. Over time, this could become a standard way to get all statistical computing functionality for julia that is reasonably well tested.
  3. Have one manual for statistical computing that combines the documentation for those 3 packages, and link this manual into the main julia manual.
  4. Function reference should get merged into the julia help() - this needs some work, as the current help does not have the ability to look into packages.

@StefanKarpinski
Member

I like this plan. cc: @dmbates, @johnmyleswhite, @HarlanH.

@BobPortmann
Contributor

It would be unfortunate to have to use DataFrames to do statistics (i.e., the R approach), as implied in (2). It would be preferable if all statistical functions worked directly on Arrays and Numbers and then had wrappers to use DataFrames as well. Otherwise, those of us who use large multidimensional Arrays have extra overhead to push Arrays (or Array slices) into DataFrames (probably in a loop) just to call stats functions that ultimately just act on Arrays. This approach turned me off R. A DataFrame is just an abstraction for tabular data, and in my experience (climate modeling) most data sets and model output do not fit nicely into this framework.

@johnmyleswhite
Member

I'm generally on board:

(1) For the moment, moving Stats back into Base seems unwise to me since removing Stats from Base has given us much more freedom to work. In addition, Stats depends upon both NumericExtensions and Distances, so those also would have to be brought into Base. In the long run, I'd like all of that functionality brought into Base -- but that's a few months away I would think.

(2) I'm happy to make Stats into a meta-package, but we have to then create a temporary package that stores all of the material currently in Stats.

(3) Unified documentation would be great.

(4) Function references would be great as well.

@BobPortmann, all of the functions in Stats already work on vectors by default. DataFrames then extends them to work on DataArrays, which are the relevant data structure. Operations on DataFrames are defined in terms of actions on DataArrays. If you dislike DataFrames, you can avoid them. You can also avoid DataArrays.

@dmbates
Member

dmbates commented Aug 28, 2013

I'm fine with moving to a "sumo" Stats.jl that requires the other packages.

@johnmyleswhite Commits on the current Stats.jl have been infrequent of late. Do you anticipate the need to update those methods frequently in the future?

@BobPortmann
Contributor

@johnmyleswhite Yes, I realize that is true now, but I thought that item (2) above was proposing to move away from that model. I am glad to hear that it is not. I'm not sure what you mean by DataArrays. Are these normal Arrays or an extension of DataFrames to higher dimensions?

@dmbates
Member

dmbates commented Aug 28, 2013

@BobPortmann DataArrays are arrays that include a missing data specification. A DataFrame can contain one or more DataArray or other types of vectors or arrays. See DataFrames/src/dataarray.jl

@ViralBShah
Member Author

The general consensus seems to be that we still need Stats.jl to be independent. I do wish there was an easy way to get command line help for package functions.
