Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgradeable Statistics stdlib #46501

Merged
merged 21 commits into from Sep 16, 2023
Merged

Upgradeable Statistics stdlib #46501

merged 21 commits into from Sep 16, 2023

Conversation

nalimilan
Copy link
Member

This finishes the removal of the Statistics stdlib started at #45594, to make it a separate package. We will then be able to merge Statistics and StatsBase to avoid splitting basic stats functions across multiple modules.

Since there are concerns that users will complain about the need to install a package just to compute the mean and standard deviation, the PR also copies the code from mean, std and var and exports these functions from Base. This is yet one more step in a saga started by #27152. I'm not too happy about this since the choice of functions that live in Base is kind of arbitrary. In particular, its not great that weighted methods defined in StatsBase (to be merged with Statistics) are documented in a different place from the unweighted ones, and we had plans to make weights a keyword argument, which makes dispatching on weights types defined in StatsBase tricky if functions are defined in Base. But well... at least the situation after the PR is less confusing than having both Statistics and StatsBase.

Code is just copied from Statistics, the only changes are:

  • add compatibility note to docstrings
  • remove references to Statistics
  • change tests to use the mean keyword argument instead of stdm/varm
  • do not test sparse matrices as they are not available in Base

This copies the code for these three methods from the Statistics stdlib module.
Only changes are:
- add compatibility note to docstrings
- remove references to `Statistics`
- change tests to use the `mean` keyword argument instead of `stdm`/`varm`
- do not test sparse matrices as they are not available in Base
doc/src/base/statistics.md Outdated Show resolved Hide resolved
2.2285192400943226
```
"""
mean(f, A::AbstractArray; dims=:) = _mean(f, A, dims)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be able to dispatch on a keyword, how strange would it be to build in a hook in for licensed piracy... something like:

Suggested change
mean(f, A::AbstractArray; dims=:) = _mean(f, A, dims)
mean(f, A::AbstractArray; dims=:, weights=nothing) = _mean(f, A, dims, weights)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's the kind of pattern I had in mind. But let's leave that question for later as moving Statistics out would already be a big achievement.

@DilumAluthge DilumAluthge added the domain:statistics The Statistics stdlib module label Aug 27, 2022
@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 27, 2022

In my opinion, adding 438 lines of statistics code to Base kind of defeats the point of excising the Statistics stdlib. Because we'll still have 438 lines of code that we can't make breaking changes to, etc.

For the most common use cases, e.g. mean(f, arr), is it really unreasonable to expect users to just write sum(f.(arr))/length(arr) themselves? And for anything more complicated, shouldn't they just install the Statistics package anyway?

There's even less friction for adding packages now, since doing import Foo or using Foo in the REPL would automatically prompt the user, and they can just press Enter to install the package.

@DilumAluthge
Copy link
Member

Also, what makes mean, std, and var special enough that they need to be included in Base, but other functions don't need to be?

@nalimilan
Copy link
Member Author

I mostly agree. We could also just remove Statistics for now and see how strongly people complain. As you say, with Pkg improvements, installing an external package now only requires typing y after using Statistics so it's not a big effort to ask even from newcomers.

That said the previous attempt ended up being reverted (see #27374) so we kind of know that people do complain. The situation may be a bit different now given that even sparse matrix support has been moved out of the stdlib.

@bkamins
Copy link
Member

bkamins commented Aug 27, 2022

@nalimilan - thank you for your effort.

Could you please summarize:

  1. the design of the ecosystem of Base/Statistics.jl/StatsBase.jl in Julia 1.9 release.
  2. the target design of the Base/Statistics.jl/StatsBase.jl ecosystem (if it is different than the one for 1.9 release)
  3. How do we expect users to write code that would work under: Julia 1.6 LTS, Julia 1.8, Julia 1.9 and target (if target is different). In particular how would existing code using Statistics.jl and StatsBase.jl be affected?

Thank you!

@mschauer
Copy link
Contributor

People starting to use sum(x)/length(x) could be an indicator of friction caused by moving mean to the periphery

@bkamins
Copy link
Member

bkamins commented Aug 27, 2022

People starting to use sum(x)/length(x)

This pattern as a replacement of mean is plain incorrect as it will not work for x that does not have O(1) cost of computing length as in such cases it will be undefined.

@ViralBShah
Copy link
Member

ViralBShah commented Aug 27, 2022

mean, var and std are considered common functions, and not special for statistics. People have a very strong objection to needing to install a package for mean, especially. I personally tend to agree with that view myself.

Sparse matrix is still significantly far less common than mean. Also the sparse package is not out of stdlib yet, and has been brought back into the system image until we have conditional dependencies.

@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 27, 2022

I'll concede that mean(a::AbstractArray) and var(a::AbstractArray) (where the latter would be returning the unbiased sample variance) would be considered common methods to most people. If this PR only added those two methods, I'll withdraw my objection.

But this PR adds a lot of other methods as well. It's harder for me to buy the argument that all of those methods would be considered common/non-statistical to most people.

I'm a little concerned about the slippery slope here. If we add those methods, then why not add weights? It's not clear to me why some functionality gets to be included, and other functionality is excluded.

@DilumAluthge
Copy link
Member

On a separate note, I'm not sure why we need to add std(a) to Base - users can't do sqrt(var(a))?

@bkamins
Copy link
Member

bkamins commented Aug 27, 2022

While waiting for a comment by @nalimilan about the long-term plan, I think that if we agree mean([1, 2, 3]) should be included then also e.g. something like mean(i for i in 1:3 if i > 1.5) should be included. The cases when length is not defined are pretty common.

@DilumAluthge
Copy link
Member

DilumAluthge commented Aug 27, 2022

I think I can probably be talked into adding just mean(iterable) and var(iterable), with support for arbitrary iterables, including unknown length.

But with a long-term plan that makes it explicit that we won't be adding any additional methods (including positional args or keyword args) to mean and var.

@DilumAluthge
Copy link
Member

Also, I'd be happier with variance instead of var, but that can be infinitely bikeshed, and that's not the hill I'm going to die on.

@ViralBShah
Copy link
Member

ViralBShah commented Aug 27, 2022

We can bring the weights in as well. I would not be opposed to that, but I suspect that will need to bring in a whole lot of other machinery around weights that is better managed outside.

Numpy actually goes much further in what it considers common functions and has a whole statistics module: https://numpy.org/doc/stable/reference/routines.statistics.html. Now numpy is itself a python package, so it is a little bit apples and oranges.

@ViralBShah
Copy link
Member

BTW, the one thing that I am not sure of is, that with the package manager improvements in the last 4 years, will people have the same reaction to mean not being in Base as they did in 2018. I suspect there will be lesser opposition, but there will still be quite a lot of discontent.

What if Base had, say, average, and we then have everything else (mean, std, var) in Statistics?

@DilumAluthge
Copy link
Member

What is average here? Just the regular unweighted mean of an array or iterable? E.g. average(arr::AbstractArray) or average(iterable)?

@nalimilan
Copy link
Member Author

Could you please summarize:

1. the design of the ecosystem of Base/Statistics.jl/StatsBase.jl in Julia 1.9 release.

2. the target design of the Base/Statistics.jl/StatsBase.jl ecosystem (if it is different than the one for 1.9 release)

3. How do we expect users to write code that would work under: Julia 1.6 LTS, Julia 1.8, Julia 1.9 and target (if target is different). In particular how would existing code using Statistics.jl and StatsBase.jl be affected?

@bkamins As I see it, the goal is to have Statistics as a standalone package to which we will be able to move most or all of the StatsBase features (details to be discussed, we probably don't want to import deprecated API), and to deprecate StatsBase in the long term. For Julia 1.9 probably the only change will be that Statistics is no longer an stdlib module but a separate package, as importing things from StatsBase will take some work.

Nothing should break for users of StatsBase and Statistics, except for having to install Statistics and add Statistics = "1" compat version bound to Project.toml (which will be done automatically in the registry like JuliaRegistries/General@c56dc97). But since Statistics cannot be upgraded on older Julia versions, packages will likely continue to use StatsBase for a long time (at least until the next LTS is out). Actually it could make sense to have StatsBase merely reexports functions from Statistics on recent Julia versions, a bit like Compat does for Julia Base. The exact implementation still needs reflection, but I don't think we need to sort it out to make a decision regarding this PR.

@ViralBShah
Copy link
Member

ViralBShah commented Aug 27, 2022

I believe JuliaCon 2023 could be the right timeframe for the next LTS. Hopefully we have conditional dependencies in place by then, and can even move out SparseArrays and perhaps a few other things as well for the new LTS.

@nalimilan
Copy link
Member Author

@DilumAluthge @ViralBShah The reason why I think weighted stats should not be included is that it would pull in a whole machinery of weight vectors types which really cannot live in Base as they are too specialized. That's not a problem for mean as we could just have a keyword argument taking any AbstractVector (including weight vectors), but for std and var the correction factor depends on the kind of weight so the types would have to live in Base too.

A better comparison than Numpy is probably Python's statistics standard library:

The module is not intended to be a competitor to third-party libraries such as NumPy, SciPy, or proprietary full-featured statistics packages aimed at professional statisticians such as Minitab, SAS and Matlab. It is aimed at the level of graphing and scientific calculators.

It is very similar to our current Statistics stdlib, but it also provides mode, harmonic and geometric mean, and basic linear regression. So if we took Python as a reference we would favor the status quo. :-) But the situation isn't ideal in Python as NumPy provides separate definitions, so you have multiple functions to do the same thing.

Conversely, Rust, Go and Swift don't provide any statistical functions in their standard libraries, not even mean.

@bkamins
Copy link
Member

bkamins commented Aug 27, 2022

@ViralBShah - is changing a module in which some name is defined (mean, mean!, var, and std in the case of this PR) considered non-breaking?

@mbaz
Copy link
Contributor

mbaz commented Aug 27, 2022

I use basic statistical functions in my work in signal processing all the time, but I am not a professional statistician. I am not concerned about having to import a package to use mean, var, median and other similar functions. What I am concerned about, though, is the future direction of the Statistics package and the impllications for people like me who just need basic functionality and whose data is mostly represented as regular, plain vectors and matrices.

Recent discussions over in Discourse (see for example here) make it clear that the desire of the professional statisticians in the community is to adapt Statistics.jl to meet their needs. Among other things, there is a clear reluctance to continue to support functions that take anything other than tables. Such changes would make it difficult and inconvenient for me (and I suspect, many others) to use that package in my own projects.

My concrete proposal is to excise statistics from stdlib if you must, but give a different name to the new package (maybe BasicStats or ClassicStats [or NonSeriousStats 😉 ]). Then the serious statisticians would be free to drive Statistics.jl forward, while the rest of us would continue to use the more basic functions.

@ViralBShah
Copy link
Member

@ViralBShah - is changing a module in which some name is defined (mean, mean!, var, and std in the case of this PR) considered non-breaking?

Perhaps this is more of a @KristofferC or @StefanKarpinski question. I believe that since these are exported names, the module name is less important. And of course, the Statistics.jl package can still provide compatibility.

@mschauer
Copy link
Contributor

Speaking for the serious statisticians: we have no interest in developing the package in a direction where it stops being practical

@nalimilan
Copy link
Member Author

Right, we can probably revert this now. So the PR becomes really minimal! :-) I've pushed a commit to do that.

@@ -57,6 +57,9 @@ Standard library changes

#### Dates

#### Statistics

* Statistics is now an upgradeable standard library.([#46501])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Statistics is now an upgradeable standard library.([#46501])
* Statistics is now an upgradeable standard library ([#46501]).

@vchuravy
Copy link
Sponsor Member

So I think the only real thing left to do is step 4 and 5 in #50697

@vchuravy
Copy link
Sponsor Member

Ah and I see in #45540 we did remove it from the spdx file.

@nalimilan
Copy link
Member Author

So I think the only real thing left to do is step 4 and 5 in #50697

Step 4: JuliaRegistries/General#89713
Step 5: JuliaLang/Pkg.jl#3587 (failing until step 4 is completed AFAICT).

Ah and I see in #45540 we did remove it from the spdx file.

Right, though as I noted the situation was different at the time since the module was completely removed. Maybe we just forgot to revert that.

@bkamins
Copy link
Member

bkamins commented Aug 15, 2023

We will then be able to merge Statistics and StatsBase to avoid splitting basic stats functions across multiple modules.

Just to make sure. With all these updates we are still going to merge Statistics.jl and StatsBase.jl as the last step some time in the future?

@nalimilan
Copy link
Member Author

I hope so. One complication will be that we probably won't be able to add dependencies to Statistics, which could block porting some features.

@bkamins
Copy link
Member

bkamins commented Aug 16, 2023

which could block porting some features.

That is why I am asking. I think we should have some plan prepared for this. So that we do not end up with a situation where we cannot achieve what we wanted to do in the first place. In particular - where things that cannot go to Statistics.jl should be moved (as I assume leaving leftovers in StatsBase.jl is not a good idea).

If no one has thought about it yet - please let me know and I can try to check.

@nalimilan
Copy link
Member Author

At any rate making Statistics upgradable is an improvement. But yeah it would be good to list at JuliaStats/Statistics.jl#87 features that rely on dependencies and see what we can do about them. Thanks for offering your help!

@cmcaine
Copy link
Contributor

cmcaine commented Aug 24, 2023

If Statistics becomes an upgradeable bundled library (sounds great to me!) then can we leave mean out of Base after all? That's less change for end users and seems like less overall confusion to me.

Any complaints about discoverability can be solved nicely by adding an exception hint for the names of common stats functions or for all names that are exported from the environment/some core list of packages/whatever.

@vchuravy vchuravy changed the title Remove Statistics stdlib and copy mean from it Upgradeable Statistics stdlib Aug 24, 2023
@nalimilan
Copy link
Member Author

If Statistics becomes an upgradeable bundled library (sounds great to me!) then can we leave mean out of Base after all? That's less change for end users and seems like less overall confusion to me.

Yes, that's what the PR does now.

@rayegun
Copy link
Member

rayegun commented Aug 24, 2023

I will briefly be the voice of Alan here and say that he detests the removal of mean. For what it's worth 😅

I think I hear about it every other meeting

@nalimilan
Copy link
Member Author

Step 5 (JuliaLang/Pkg.jl#3587) has just been completed, so it looks like we're good?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:statistics The Statistics stdlib module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet