Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AbstractWeights restriction to Real is too restrictive #745

Open
JeffFessler opened this issue Dec 4, 2021 · 19 comments · May be fixed by #755
Open

AbstractWeights restriction to Real is too restrictive #745

JeffFessler opened this issue Dec 4, 2021 · 19 comments · May be fixed by #755

Comments

@JeffFessler
Copy link

JeffFessler commented Dec 4, 2021

The type of AbstractWeights is restricted to Real which precludes complex numbers, numbers with units from Unitful, etc. Using Number for the most general type would be more inclusive.

abstract type AbstractWeights{S<:Real, T<:Real, V<:AbstractVector{T}} <: AbstractVector{T} end

Of course many subtypes like ProbabilityWeights must be Real, but general weights do not, especially for how weights are used in the fit method for a Histogram. The follow MWE ideally would work, but fails due to the weights() method restrictions:

using StatsBase
#T = Float32 # this version works fine
T = ComplexF64 # this version errors at weights(w)
w = rand(T, 100) # weights 
data = randn(100)
fit(Histogram, data, weights(w))

An alternative to generalizing the AbstractWeights type would be to generalize the fit method so that the weights can simply be any Number type.

I would be eventually willing to help with a PR if there is openness to such a generalization, but it would helpful to know if there is a preference between generalizing Weights vs generalizing fit.

@ParadaCarleton
Copy link
Contributor

What would be the use case for complex weights or weights with units? I've never seen weights that aren't dimensionless, positive reals.

@nalimilan
Copy link
Member

I don't understand what would be the meaning of the code you posted above either. Do you have any examples of other software that allows non-real weights?

@JeffFessler
Copy link
Author

Sorry I thought I had included a link in an earlier reply. Here is a detailed Literate example that uses fit(Histogram,...) and that specifically refers to this issue:
https://juliaimagerecon.github.io/Examples/generated/mri/1-nufft/
There I had to use a "kludge" to separate the real and imaginary parts and strip the values of units.

Here is a book chapter that discusses the MRI use case in detail. The MWE above is a minimal simplification of the essence of what is needed.

I realize that it may be hard to convince you that complex weights are useful, but perhaps you would at least consider weights with units? In a classical histogram, the weights are simply integers (counts). This package has already taken a useful step by generalizing to allow floating point weights. (I'm not quite sure why that generalization is there but I'm glad it is.) In scientific computing, in many places where we have floating point values it is because they are associated with physical measurements that have units. The Unitful.jl package is designed to support such units, and a growing number of packages in the Julia ecosystem have endeavored to support Unitful data, often by relaxing type restrictions to be Number instead of Real, because Unitful values are subtypes of Number rather than subtypes of Real. A side benefit of such generalizations is that other Number types like complex values also are supported, which makes the packages more widely useful.

@nalimilan
Copy link
Member

Thanks for the links, but I must say I don't really understand them. Do you mean that you would like to pass complex weights so that the real part of the weights applies to the real part of data and the imaginary part of weights to the imaginary part of data? Are you sure that if complex numbers were allowed for histograms, it would give what you expect, or would more changes be needed in StatsBase?

Could you give a self-contained example which can easily be understood without the context of other packages? Do you have any examples of other software that allows this? Same for the case of unitful weights.

Something worth noting is that weights are not just used for histograms in StatsBase, so if we relax the assumptions we have to make sure that would give correct results everywhere. But it would be possible to allow any AbstractArray of weights for histograms instead of requiring AbstractWeights (which is a good idea in general if we don't need that for dispatch).

@JeffFessler
Copy link
Author

I gave a MWE in the OP. If you have any examples of use cases where one needs Real instead of Int weights then I am sure that in those use cases often the Real values have (or should have) units. But it sounds like generalizing AbstractWeights would be a lot of work, so I think that relaxing the type for fit to AbstractArray{<:Number} could be the easiest way to go.

@JeffFessler
Copy link
Author

Sorry, I just realized that I reported the error in the OP, but I did not describe well enough the desired behavior. Here is a MWE of approximately how I'd like it to work:

using StatsBase
w = rand(ComplexF64, 100) # weights
data = randn(100)
tmp1 = fit(Histogram, data, weights(real(w)))
tmp2 = fit(Histogram, data, weights(imag(w)))
result = complex.(tmp1.weights, tmp2.weights)

Having to split and the recombine is tolerable but inelegant, and the result is a Vector, not a Histogram type (so the edges are "lost").

@nalimilan
Copy link
Member

Thanks for the self-contained example. AFAICT fit(Histogram, ...) would not automatically give this result if we made the signature less strict. In particular, something that seems problematic to me is how to compute the edges (unless you specify them manually). Unless we add a special method for complex numbers, there's no way to decide what to do with the real and imaginary parts in order to choose edges which are then used for both.

Also you haven't confirmed that this behavior is supported in other existing implementations.

I gave a MWE in the OP. If you have any examples of use cases where one needs Real instead of Int weights then I am sure that in those use cases often the Real values have (or should have) units.

No, absolutely not. For example, AnalyticWeights and ProbabilityWeights reflect the inverse of (respectively) the variance and of the draw probability (there's a lot of literature on and implementations of this). These quantities do not have units.

@JeffFessler
Copy link
Author

Unless we add a special method for complex numbers, there's no way to decide what to do with the real and imaginary parts in order to choose edges which are then used for both.

The data here are real; it is the weights that are complex valued. The edges come from the data, not from the weights. Certainly the data should still be real.

Also you haven't confirmed that this behavior is supported in other existing implementations.

Here is Matlab weighted histogram code that works perfectly fine with complex weights:

https://www.mathworks.com/matlabcentral/fileexchange/42493-generate-weighted-histogram

I confirmed that it works as expected by
histwc(randn(20,1), randn(20,1) + 2i, 30)

Matlab does not support units (AFAIK) but that code would work fine with weights that have units if Matlab did support units.
That code is 1D only and I need 2D and 3D cases which is why I am hoping to get this generality in StatsBase.

@nalimilan
Copy link
Member

The data here are real; it is the weights that are complex valued. The edges come from the data, not from the weights. Certainly the data should still be real.

Ah sorry I thought that the choice of bins depends on the weights, but looking at the code again that's not the case.

I wish you had mentioned the MATLAB precedent earlier when I asked! :-) It's generally much easier to think about new features when previous implementations exist.

Could you check whether replacing wv::AbstractWeights by wv::AbstractArray{<:Number} in src/hist.jl is enough for you to get the expected result? As I said, going this way will likely be much simpler than changing AbstractWeights itself, as it's used in many other places.

@Moelf @oschulz Do you have any opinion on this?

@Moelf
Copy link
Contributor

Moelf commented Jan 14, 2022

which precludes complex numbers, numbers with units from Unitful,

Matlab does not support units (AFAIK) but that code would work fine with weights that have units if Matlab did support units.

the unit should go with data right? "weight", is dimensionless, it's "counting", by default it's "one item per one data point"

@JeffFessler JeffFessler linked a pull request Jan 17, 2022 that will close this issue
@JeffFessler
Copy link
Author

the unit should go with data right?

Good point, but supporting units for the data is another issue for later.
This issue is about relaxing the type of the weights.

"weight", is dimensionless, it's "counting", by default it's "one item per one data point"

If it were always "one item" then the type should be Int not Real.
It's already Real and I'm hoping to generalize it to Number because my weights are complex.

Could you check whether replacing wv::AbstractWeights by wv::AbstractArray{<:Number} in src/hist.jl

I have checked this and it does work perfectly for my use case, but it causes a separate issue.
The current dispatch for fit allows these 3 argument combinations (among others):

  • fit(Histogram, data, edges)
  • fit(Histogram, data, weights)
  • fit(Histogram, data, weights, edges)
    There is no ambiguity here when weights is of type AbstractWeights, but when I relax it to be AbstractVector (which is the same type as edges), then sadly there is an ambiguity between the first two.
    Personally I avoid supporting so many positional arguments (who can remember if weights or edges is 3rd and 4th?). Requiring both edges and weights to be named keywords would avoid this ambiguity.
    Probably that deprecation would be unpopular at this point...

I've made a PR #755 to show a step towards what this would look like.
If edges must remain as a positional argument, then I don't see any way to do this other than generalizing AbstractWeights. Other ideas?

@Moelf
Copy link
Contributor

Moelf commented Jan 17, 2022

If it were always "one item" then the type should be Int not Real.
It's already Real and I'm hoping to generalize it to Number because my weights are complex.

no, because to get better distribution, many sampling processes generate <1 weight, while keeping the integration "correct", but in any case it's still counting, thus real. I've never seen non-real weights, and I personally can't imagine a situation where the "unit" is in weights not in data

@JeffFessler
Copy link
Author

I've never seen non-real weights, and I personally can't imagine

I hope that the packages I am developing in Julia will be used in ways I have never seen or imagined. And I've been around long enough to have the modesty to understand that I will never know or imagine most things.

@JeffFessler
Copy link
Author

Let me try to elaborate here. Once we go from Int to Real, we are going beyond counting to binned accumulation. Binned accumulation is useful beyond statistics, including in the physics-based MRI application that motivates this request. One can do binned accumulation of many types of quantities, complex numbers, rationals, even quaternions - anything that can be added. Even polynomials could be accumulated, but I don't have a use case for those so I limit my proposal here to the type Number because that includes complex values and values with units.

I believe that "anything Matlab can do, Julia can do better" so I would hope that the existence of Matlab code (linked above) that can do a weighted histogram with complex weights, along with my pointers above to a book chapter and a Julia repo that needs this functionality would be sufficiently compelling. Of course one could create a separate "BinnedAccumulation.jl" package but Histograms here is the natural home I think.

I just realized that my use case for fit uses the 4-argument version so I will update my PR to just handle that case and that will avoid breaking the existing edges positional argument. I'll post again when I have a working PR with proper tests.

@Moelf
Copy link
Contributor

Moelf commented Jan 17, 2022

Once we go from Int to Real, we are going beyond counting to binned accumulation.

we're still doing the same thing, as I said, the most important reason to support non-int real number is this to get the correct distribution, you can even have negative weight: https://arxiv.org/abs/2110.15211 . but it's still counting, imagine you're generating an approximation to a gaussian, you can't use int weight because then you can only have 1 bin count representing a gaussian distribution -- doesn't work. One can think of it as first generate histogram then normalize integral to 1, but some complex sampling process doesn't yield equal-weight samples, so you really need non-int weight as oppose to normalize histogram post-hoc.

I think Julia community doesn't believe in "doing something, anything is always better than error", that sounds like javascript. So unless there's a conceptually sound / existing domain-specific application, I don't think weight can be non-real number.

tbh, not the hill I'd die on, but I just want to be clear why int -> real doesn't imply real -> complex.

@JeffFessler
Copy link
Author

So unless there's a conceptually sound / existing domain-specific application, I don't think weight can be non-real number.

Above I have 1) provided pointers to a Julia code repo that needs complex weights, 2) linked a related book chapter that describes why, and 3) linked to a Matlab function that works for complex numbers as needed.

@nalimilan
Copy link
Member

I'd be OK with widening the signature, but the dispatch ambiguity is problematic. For a long time I've wanted to move from a positional to a keyword arguments for weights, but that's currently not possible to do everywhere due to functions like mean being defined in Statistics (https://github.com/JuliaLang/Statistics.jl/issues/87), and some people didn't like that approach. I'm reluctant to use a keyword argument for histograms but not for other functions as we really want a consistent API. And unfortunately, allowing non-real values for AbstractWeights sounds risky, it's already difficult to define the behavior of all weights types.

So unfortunately I don't see a short-term way forward, apart from making progress in the merge of Statistics and StatsBase.

@ParadaCarleton
Copy link
Contributor

I think Julia community doesn't believe in "doing something, anything is always better than error", that sounds like javascript. So unless there's a conceptually sound / existing domain-specific application, I don't think weight can be non-real number.

We could have all the weights implemented in StatsBase require a real input, but let users add weights of any Number

@nalimilan
Copy link
Member

If we do that we should probably allow Weights to have any Number type, as this is a fallback weights type without any particular definition. Otherwise, we wouldn't even test an AbstractWeights type with non-Real values so we couldn't guarantee it works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants