Feature request: provide L2 norm of each distribution's pdf #806
Comments
@ablaom sorry about the delay.

@fkiraly Do you want to suggest a shortlist of distributions for which we want to implement the L2 norm, aka `LinearAlgebra.norm(d)`?

PR welcome.
Sure - I'll try to make the shortlist short...

Composites:
Continuous atoms:
Discrete atoms:
Though, having thought about the matter a little bit, I'm not sure whether `LinearAlgebra.norm(d)` is the best way to do this. That is because it's the L2 norm of the pdf, rather than of the distribution - the latter would be an ill-defined concept. After all, you could take the L2 norm of any distribution-defining function (if it exists) - pdf, cdf, mgf, etc. - and the result would be different. Why should one single out the pdf, given that it is merely one of multiple ways to uniquely define a distribution, and given that it doesn't even exist for mixed distributions?

Also, would it not be more natural if something like `l2norm(d)`, or similar, gave the distribution of the random variable |X|_2, where X is distributed according to d? That is, returned a distribution object rather than a number? I'm not saying that such a behaviour is something I'd particularly be interested in, just that it seems to me like the more natural behaviour of applying functions to distribution objects.
Ok yes, I see your point - in that case a new function could be created, something like `l2norm`.
I might be misunderstanding, but wouldn't the Brier loss simply be:

```julia
function brier_loss(d::Distribution, x0)
    m = mean(d)
    return var(d) + m * (m - 2 * x0) + x0
end
```

I'm not opposed to having it here, but it does seem that scoring rules like this could also go in a separate package.
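For what it's worth, the snippet above expands the expectation E[(X - x0)^2] = var(d) + (m - x0)^2, except that the final term is written `x0` rather than `x0^2`, so the two agree exactly when x0 is 0 or 1. A quick numeric check of that claim (a Python sketch for portability, since the thread's code is Julia; `proposed_loss` and `expected_sq_error` are hypothetical names):

```python
def proposed_loss(mean, var, x0):
    # var(d) + m*(m - 2*x0) + x0, as in the Julia snippet above
    return var + mean * (mean - 2 * x0) + x0

def expected_sq_error(p, x0):
    # E[(X - x0)^2] for X ~ Bernoulli(p)
    return p * (1 - x0) ** 2 + (1 - p) * (0 - x0) ** 2

p = 0.3
m, v = p, p * (1 - p)  # mean and variance of Bernoulli(p)
for x0 in (0, 1):
    # for binary outcomes x0 = x0^2, so the two expressions coincide
    assert abs(proposed_loss(m, v, x0) - expected_sq_error(p, x0)) < 1e-12
```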
@simonbyrne - no, that's not the Brier loss. The Brier loss would be -2 p(y) + |p|_2^2, where p is the pmf for discrete distributions and the pdf for absolutely continuous ones.
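Concretely, for a discrete prediction with pmf p and observed class y, the Brier loss -2 p(y) + |p|_2^2 can be evaluated directly, and one can check numerically that it is proper: the expected loss under the true distribution is minimized by reporting the true distribution. A Python sketch (function names are illustrative, not from any package):

```python
def brier_loss(p, y):
    # -2 p(y) + |p|_2^2 for a pmf given as a list of class probabilities
    return -2 * p[y] + sum(q * q for q in p)

def expected_loss(true_p, reported_p):
    # E_{y ~ true_p}[ brier_loss(reported_p, y) ]
    return sum(true_p[y] * brier_loss(reported_p, y) for y in range(len(true_p)))

true_p = [0.5, 0.3, 0.2]
honest = expected_loss(true_p, true_p)
# propriety check: any dishonest report has strictly larger expected loss
for q in ([1/3, 1/3, 1/3], [0.6, 0.3, 0.1], [0.5, 0.2, 0.3]):
    assert expected_loss(true_p, q) > honest
```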
@simonbyrne or, is your statement that the two are always, analytically, identical?
@simonbyrne it's actually not - here's a disproof: your score is linear in x0, while the Brier loss is not.
Ignore my comment above, I misunderstood what was being proposed.

From what I can tell, the Brier score is really only defined for binary outcomes (i.e. Bernoulli predictions). It seems like there would be many ways one could generalize it to other distributions; e.g. `brier_loss(d, x0) = (x0 - mean(d))^2` would be an obvious candidate.

Additionally, from what I can tell, I think your definition is off by a linear scaling and offset, e.g. in the Bernoulli case.
Another generalization which would have your desired property of depending only on the probability at the observed point would be:

```julia
brier_score(d, x0) = (1 - pdf(d, x0))^2
```

though this obviously wouldn't work for continuous distributions.
@simonbyrne I think we should be systematic - there are three high-level questions here: (i) is this a standard thing, (ii) should it live in Distributions.jl, and (iii) are the alternative definitions you propose also proper?

I think (i) yes, (ii) yes, and (iii) no - and in addition you are probably not too familiar with probabilistic losses, and maybe a little confused. I hope you do not take this too negatively. I'll try to explain the theory below.

Regarding (i), I believe this is a standard thing, at least for the classification setting. Brier is implemented by relevant packages such as sklearn and mlr, and it is one of multiple standard ways to train or evaluate probabilistic classification models. So are the generalizations for regression models, but these are more rare, since common packages do not have good probabilistic interfaces (mlj does!).

The answer to (ii) is of course up to the Distributions.jl devs to decide, but I was under the impression that their feeling was "yes".

Regarding (iii), I think we need to clarify a few points. The alternatives you propose are not proper for the multi-class classification case, or the regression case. For binary classification, it turns out that -2 p(y) + |p|_2^2 and the other (commonly used) expression (1 - p(y))^2 happen to be the same up to scaling and offset, so they measure the same thing. But this is not true in the general case - one simple way to see this is that (1 - p(y))^2 doesn't make too much sense for continuous pdfs which are not upper bounded by 1. For the same reason, I believe even in the binary classification case the expression -2 p(y) + |p|_2^2 is more helpful, since it is more or less the only one which generalizes to all relevant cases (except mixed distributions, but that's another story).
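The binary-case claim can be verified directly: with two classes, p(1-y) = 1 - p(y), so -2 p(y) + p(y)^2 + (1 - p(y))^2 = 2(1 - p(y))^2 - 1, i.e. the two expressions differ exactly by a factor of 2 and an offset of -1, while with three or more classes no single affine relation fits all outcomes. A Python sketch of both checks (helper names are hypothetical):

```python
def loss_quadratic(p, y):
    # -2 p(y) + |p|_2^2, the Brier loss discussed above
    return -2 * p[y] + sum(q * q for q in p)

def loss_one_minus(p, y):
    # (1 - p(y))^2, the other commonly used expression
    return (1 - p[y]) ** 2

# Binary case: identical up to scaling (x2) and offset (-1), for any pmf.
for s in (0.1, 0.3, 0.5, 0.9):
    p = [1 - s, s]
    for y in (0, 1):
        assert abs(loss_quadratic(p, y) - (2 * loss_one_minus(p, y) - 1)) < 1e-12

# Three-class case: the affine fit through two outcomes misses the third.
p = [0.5, 0.3, 0.2]
pts = [(loss_one_minus(p, y), loss_quadratic(p, y)) for y in range(3)]
(x1, y1), (x2, y2), (x3, y3) = pts
slope = (y2 - y1) / (x2 - x1)
offset = y1 - slope * x1
assert abs(slope * x3 + offset - y3) > 1e-3
```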
I don't really mind, but my point is that there is nothing really that special about having such rules in Distributions.jl: they could easily go in another package (say ScoringRules.jl), or be incorporated into LossFunctions.jl. Indeed, it is often desirable to work in smaller packages, since it is much easier and quicker to develop iteratively. In terms of implementation, the main question is: how would you handle cases where no closed-form solution exists?

That is a good point about propriety, and your L2 loss definition does match up with Selten's characterization.
@simonbyrne, great that we are on the same page regarding theory! Let's chat about implementation then.

If you want to support computation of the key proper scoring rules (or proper losses, with the sign convention more common in ML) for the continuous case, you have two options that I can see. Number (i) is infeasible i.m.o., since you would need to write a symbolic computation engine à la Mathematica from scratch, and it seems overkill. Another alternative, numerical integration, isn't really one (I would argue), since it can be arbitrarily wrong: that's not only problematic if you have to do it for each data point, it's also a no-go if you want to use the losses for unbiased evaluation (where unbiased is not in the statistical, but in the empirical sense).
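For the distributions where |p|_2^2 has a closed form, exact evaluation is cheap. For example, for a Normal(mu, sigma) density the standard identity is ∫ pdf(x)^2 dx = 1/(2 sigma sqrt(pi)). A Python sketch comparing the closed form against a brute-force trapezoidal quadrature (function names are illustrative):

```python
import math

def l2_norm_sq_normal(sigma):
    # Closed form: integral of N(x; mu, sigma)^2 dx = 1 / (2 * sigma * sqrt(pi))
    return 1.0 / (2.0 * sigma * math.sqrt(math.pi))

def l2_norm_sq_quadrature(mu, sigma, n=200_001, width=10.0):
    # Trapezoidal rule over mu +/- width*sigma; accuracy depends on n and width
    a, b = mu - width * sigma, mu + width * sigma
    h = (b - a) / (n - 1)

    def pdf(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

    total = 0.5 * (pdf(a) ** 2 + pdf(b) ** 2)
    total += sum(pdf(a + i * h) ** 2 for i in range(1, n - 1))
    return total * h

exact = l2_norm_sq_normal(1.0)
approx = l2_norm_sq_quadrature(0.0, 1.0)
assert abs(exact - approx) < 1e-8  # quadrature agrees here, but only for well-chosen n and width
```

The point above stands, though: the quadrature's error budget depends on grid choices that cannot be validated automatically for arbitrary densities, whereas the closed form is exact.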
I think it is fair to say that symbolic operations are far out of scope of this package. We could certainly have a
yes, exactly!
Wishing for a method

```julia
l2norm(d::Distribution)
```

that returns the L2 norm of the probability density function for `d`. Our use case is computing the Brier loss and integrated square loss for machine learning models learning probability distributions: MLJ issue #34
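A minimal sketch of what such a method could look like, restricted to distributions with well-known closed forms: for Uniform(a, b) the pdf is the constant 1/(b-a), so ‖pdf‖_2 = 1/sqrt(b-a); for Exponential(rate λ), ∫ λ^2 e^(-2λx) dx = λ/2, so ‖pdf‖_2 = sqrt(λ/2); for Normal(μ, σ), ‖pdf‖_2^2 = 1/(2σ√π). Written in Python for illustration only; a real implementation would dispatch on Distributions.jl types in Julia:

```python
import math
from dataclasses import dataclass

@dataclass
class Normal:
    mu: float
    sigma: float

@dataclass
class Uniform:
    a: float
    b: float

@dataclass
class Exponential:
    rate: float

def l2norm(d):
    """L2 norm of the pdf, for distributions with a known closed form."""
    if isinstance(d, Normal):
        return math.sqrt(1.0 / (2.0 * d.sigma * math.sqrt(math.pi)))
    if isinstance(d, Uniform):
        return math.sqrt(1.0 / (d.b - d.a))
    if isinstance(d, Exponential):
        return math.sqrt(d.rate / 2.0)
    raise NotImplementedError(f"no closed-form L2 norm for {type(d).__name__}")

assert abs(l2norm(Uniform(0.0, 4.0)) - 0.5) < 1e-12
```

Per the discussion above, distributions without a closed form would simply raise an error rather than fall back to numerical integration.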