Add essential packages for statistics #4

Merged
merged 2 commits into from
Aug 28, 2018

Conversation

nalimilan
Member

This makes the package useful again.

@nalimilan changed the title from "Add 12 essential packages for statistics" to "Add essential packages for statistics" on May 16, 2018
@ChrisRackauckas

ManifoldLearning.jl should be forked into JuliaStats, master tagged, and added here IMO.

@andreasnoack
Member

andreasnoack commented May 16, 2018

@nalimilan Great that you are pushing this. It will make things much more user-friendly. One of the open questions is how documentation should be handled. Maybe examples of analyses that use functionality across several packages would be useful, leaving the actual API documentation to the individual packages.

> ManifoldLearning.jl should be forked into JuliaStats, master tagged, and added here IMO.

I'm not sure if it is an obvious candidate for inclusion here. I think the idea here is to cover the standard stuff that you'd see in stats courses.

@ChrisRackauckas

ChrisRackauckas commented May 16, 2018

Are manifold-based methods and t-SNE not in standard stats courses by now? I wouldn't be able to find a stats-based computational bio course without them.

@nalimilan
Member Author

> One of the open questions is how documentation should be handled. Maybe examples of analyses that use functionality across several packages would be useful, leaving the actual API documentation to the individual packages.

Yes, that's a difficult question. Maybe the ideal would be to have a tutorial presenting the most common features of each domain, and redirecting to the packages for more details. But that's a lot of work. So maybe we can just start with a link to each package's manual on each line?

Regarding ManifoldLearning.jl, I have no idea what it is so I can't really say. One good criterion would be whether other statistical environments provide it by default.

@andreasnoack
Member

> Are manifold-based methods and t-SNE not in standard stats courses by now? I wouldn't be able to find a stats-based computational bio course without them.

Computational bio is not statistics. At least not the flavors of it that I've seen.

@ChrisRackauckas

ChrisRackauckas commented May 16, 2018

Alright, I'll leave it alone. The best solution down the line is probably to add that stuff to MultivariateStats.jl, which has the other half of the commonly used dimensionality reduction methods.

The others that come to mind for me are LOESS.jl and Bootstrap.jl. At least to me, anything further is probably "specialized" and those are sitting right on the cutoff line.

> Computational bio is not statistics. At least not the flavors of it that I've seen.

There are tons of flavors, to the point where computational/systems biology needs a word in front of it to really be descriptive.

@nalimilan
Member Author

Good point, I've added Bootstrap and Loess. I missed the latter because it's not listed on the website; we should update it (and remove unmaintained packages). Also, shouldn't Loess be renamed to LOESS?

@rofinn
Member

rofinn commented May 16, 2018

Should we include RDatasets? I know people who use that for their demos.

@gragusa

gragusa commented May 16, 2018

There is CovarianceMatrices.jl. I am working on making it generic, but as it is, it is a nice complement to GLM.jl (in certain fields, these variances are the standard ones).

@ararslan
Member

Want Jackknife?

@mkborregaard

Great list. What about MixedModels? Would be nice to have that really integrated into the ecosystem here. In ecology at least nobody seems to do a GLM without random effects these days.

@mkborregaard

variance, bias and estimator must already be defined in Stats packages other than Jackknife, right? Shouldn't Jackknife extend those functions with new methods? (Sorry if this is out of place.)

@ararslan
Member

Jackknife doesn't export anything, so you have to call them as Jackknife.variance, etc.
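
For readers unfamiliar with the estimators being discussed, here is a minimal, language-agnostic sketch of the leave-one-out jackknife in Python. It is not the Jackknife.jl implementation or API; the function names `jackknife_estimates`, `jackknife_bias`, and `jackknife_variance` are hypothetical and chosen only for illustration.

```python
from statistics import mean


def jackknife_estimates(estimator, data):
    """Return the leave-one-out estimates: estimator applied with each point removed."""
    data = list(data)
    return [estimator(data[:i] + data[i + 1:]) for i in range(len(data))]


def jackknife_bias(estimator, data):
    """Jackknife bias estimate: (n - 1) * (mean of leave-one-out estimates - full estimate)."""
    data = list(data)
    n = len(data)
    loo = jackknife_estimates(estimator, data)
    return (n - 1) * (mean(loo) - estimator(data))


def jackknife_variance(estimator, data):
    """Jackknife variance estimate: (n - 1) / n * sum of squared deviations of the
    leave-one-out estimates from their mean."""
    data = list(data)
    n = len(data)
    loo = jackknife_estimates(estimator, data)
    center = mean(loo)
    return (n - 1) / n * sum((value - center) ** 2 for value in loo)


sample = [1.0, 2.0, 3.0, 4.0, 5.0]
print(jackknife_bias(mean, sample))      # the sample mean is unbiased, so this is 0
print(jackknife_variance(mean, sample))  # for the mean this equals s^2 / n
```

Keeping such generic names behind a namespace, as described above, avoids clashing with functions of the same name defined elsewhere.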

@ararslan
Member

Btw we may want to do some serious cleanup and dedicated maintenance if we're going to fully endorse all of these packages. While I think most are fine, I don't know that anybody really tends to MultivariateStats these days.

@mkborregaard

Such an important package though.

@ChrisRackauckas

Yeah, it's chicken and egg. I think you put it in so that it has to be maintained. FWIW it's already widely used and right now it works. Maybe it just hasn't been touched because it's working just fine. But yes,

> Such an important package though.

It has a lot of stuff in there, but at least PCA is pretty standard in most toolkits.

@nalimilan
Member Author

A few comments:

  • Jackknife: sure, let's add it. BTW, we should try unifying interfaces with Bootstrap (and with StatsBase, e.g. for se vs. stderror).
  • RDatasets: I've thought about it, but it feels weird to include a package with R in the name in our standard set of packages. Maybe we should just rename it, which would also allow adding datasets from other sources if needed.
  • CovarianceMatrices: AFAICT this package is more powerful than what stats environments usually support by default, but I guess that's a strength as long as it also provides the basics and the API remains simple.
  • MixedModels: that's definitely a strong point of the Julia stats ecosystem, so maybe it makes sense to include it by default. One issue is that it exports a bootstrap function which conflicts with Bootstrap. Better sort out this problem first.
  • MultivariateStats: that's a difficult case. On the one hand, things like PCA are very standard in most statistical environments. On the other hand, I don't think we want to commit too much to a mostly unmaintained package which may have to be modified a lot. For example, the APIs currently do not support formulas. The risk is that we end up like R, where external packages are typically used instead of the standard prcomp function. When in doubt, it would be safer to leave it out for now.

@dmbates

dmbates commented May 18, 2018

I'm happy to reconcile the exported MixedModels.bootstrap function with the Bootstrap package. I'd actually forgotten that there was a Bootstrap package.

@ararslan
Member

Want MultivariateTests? It'd just have to be registered first.

@nalimilan
Member Author

Yeah, but why not add these tests to HypothesisTests instead?

@ararslan
Member

Yeah, I suppose they would work just fine there, good point. They were originally separate because it started as a project for my master's program. 😛

@matthieugomez

matthieugomez commented May 28, 2018

I think the list should be shorter rather than longer. IMO, only packages that have proved themselves useful/popular with end-users should be in this list. Otherwise, this list may give the impression that a lot of things are "done" in Julia, which is not true and which potentially stifles innovation.

@mkborregaard

Which of the packages above do not fulfill these criteria in your opinion?

@matthieugomez

matthieugomez commented May 28, 2018

I do not really know a lot of these packages. But it just seems safer to me to start with a small list of packages and then expand it, rather than having to remove existing functionality later.
The very short list I have in mind would look something like this: CategoricalArrays, CSV, DataFrames, Distances, Distributions, StatsBase, StatsModels, GLM, and maybe Clustering, TimeSeries, MultivariateStats, HypothesisTests, MixedModels.

@nalimilan
Member Author

So basically you object to Bootstrap, KernelDensity, Loess, Jackknife and CovarianceMatrices? Care to elaborate on why?

@nalimilan
Member Author

Are KDE.jl and LOESS.jl complete enough to be worth including?

@nignatiadis

I think that MultipleTesting.jl should be included in the list of essential packages as well! The Benjamini-Hochberg procedure and the related Storey procedure are both ubiquitous in high-throughput studies. R provides some of that functionality through the p.adjust function, which is probably one of the most commonly used ones. Also, MultipleTesting.jl is lightweight and would not introduce additional dependencies (and the implementations are thorough and well-tested).

cc @juliangehring

@nalimilan For what it is worth, if by KDE.jl you mean KernelDensity.jl then whenever I needed it, it has been useful and it seems to have the (basic) required functionality (and I think a nonparametric density estimator falls within the "essential" category).
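
To make the Benjamini-Hochberg adjustment mentioned above concrete, here is a minimal Python sketch of the step-up procedure. It is not the MultipleTesting.jl API; the function name `benjamini_hochberg` is hypothetical, and the code only illustrates the standard textbook adjustment.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, scaling each by m / rank
    # and taking a running minimum so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(pvals))
```

A hypothesis is then rejected at false discovery rate level q when its adjusted p-value is at most q.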

@mkborregaard

Hi, I'm just curious where I can learn about the plans for this package?

@nalimilan
Member Author

I'm not aware of any plans besides this PR. I think we should make a decision and merge it.

@andreasnoack
Member

Let's merge what's here now. We can adjust later if needed. I'll do it tomorrow if nobody objects.

@nalimilan
Member Author

If anybody thinks a package should be added or removed from the list, please file a new issue.

@mkborregaard mentioned this pull request on Sep 3, 2018

10 participants