
Feature request: provide L2 norm of each distribution's pdf #806

Open

ablaom opened this issue Jan 8, 2019 · 17 comments


ablaom commented Jan 8, 2019

Wishing for a method l2norm(d::Distribution) that returns the L2 norm of the probability density function of d. Our use case is computing the Brier loss and the integrated squared loss for machine learning models that learn probability distributions: MLJ issue #34
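
For a couple of simple cases this has a closed form. A minimal sketch of the requested method, assuming the name l2norm and Distributions.jl's field names and accessors:

```julia
using Distributions

# Hypothetical l2norm, sketched for two cases with known closed forms.
# Normal(μ, σ): ∫ pdf(x)^2 dx = 1/(2σ√π), so the norm is (2σ√π)^(-1/2).
l2norm(d::Normal) = inv(sqrt(2 * d.σ * sqrt(π)))

# Bernoulli(p): the L2 norm of the pmf is sqrt(p^2 + (1 - p)^2).
l2norm(d::Bernoulli) = sqrt(succprob(d)^2 + failprob(d)^2)
```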

@matbesancon (Member)

@ablaom sorry about the delay.
The best option would be to implement LinearAlgebra.norm for Distributions. A PR is welcome, but you can also do this in your own repo (technically type piracy, but as long as it isn't implemented in Distributions itself, that's fine).
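
What that extension might look like in a downstream package (a sketch only, reusing the Normal closed form from above; restricting to p == 2 is a design choice, not an established API):

```julia
using Distributions, LinearAlgebra

# Type piracy: we own neither LinearAlgebra.norm nor Normal.
function LinearAlgebra.norm(d::Normal, p::Real = 2)
    p == 2 || throw(ArgumentError("only the L2 norm is implemented"))
    return inv(sqrt(2 * d.σ * sqrt(π)))
end
```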


ablaom commented May 2, 2019

@fkiraly Do you want to suggest a shortlist of distributions for which we should implement the L2 norm, aka LinearAlgebra.norm(·, 2)? Here are all the leaf types of Distributions.Distribution (a sketch of how to generate such a list follows it):

MixtureModel
UnivariateGMM
MvNormal
MvNormalCanon
Arcsine
Bernoulli
Beta
BetaBinomial
BetaPrime
Binomial
Biweight
Categorical
Cauchy
Chi
Chisq
Cosine
Dirichlet
DirichletMultinomial
DiscreteUniform
MvLogNormal
Distributions.GenericMvTDist
EdgeworthMean
EdgeworthSum
EdgeworthZ
EmpiricalUnivariateDistribution
Epanechnikov
Erlang
Exponential
FDist
Frechet
Gamma
GeneralizedExtremeValue
GeneralizedPareto
Geometric
Gumbel
Hypergeometric
InverseGamma
InverseGaussian
InverseWishart
KSDist
KSOneSided
Kolmogorov
Laplace
Levy
LocationScale
LogNormal
Logistic
Multinomial
NegativeBinomial
NoncentralBeta
NoncentralChisq
NoncentralF
FisherNoncentralHypergeometric
WalleniusNoncentralHypergeometric
NoncentralT
Normal
NormalCanon
NormalInverseGaussian
Pareto
Poisson
PoissonBinomial
Rayleigh
Semicircle
Skellam
SymTriangularDist
TDist
TriangularDist
Triweight
Truncated
Uniform
VonMises
VonMisesFisher
Weibull
Wishart
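
A sketch of how such a list can be produced: walk the type tree under Distribution and keep the leaves (leaf_subtypes is a hypothetical helper, not part of Distributions.jl):

```julia
using Distributions, InteractiveUtils

# Recursively collect the leaf types (types with no subtypes of their own).
function leaf_subtypes(T::Type)
    st = subtypes(T)
    isempty(st) && return Type[T]
    return reduce(vcat, leaf_subtypes.(st); init = Type[])
end

leaf_subtypes(Distribution)
```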

@matbesancon (Member)

PR welcome.
Also, since this is a new feature, a page in the documentation would be great.


fkiraly commented May 2, 2019

Sure - I'll try to make the shortlist short... (closed forms for two of these atoms are sketched after the list)

Composites:
MixtureModel
Truncated

Continuous atoms:
Beta
Chi
Chisq
Dirichlet
Exponential
Gamma
GeneralizedExtremeValue
GeneralizedPareto
Gumbel
Laplace
Logistic
Normal
Pareto
Uniform
VonMises
Weibull

Discrete atoms:
Bernoulli
Categorical
DiscreteUniform
Geometric
Hypergeometric
Multinomial
NegativeBinomial
Poisson
Skellam
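
As a sketch of what implementations for these atoms might look like, closed forms for two of them (l2norm is still the hypothetical name from above, and d.a, d.b, d.θ are Distributions.jl's field names):

```julia
using Distributions

# Uniform(a, b): ∫ (1/(b-a))^2 dx over [a, b] = 1/(b - a).
l2norm(d::Uniform) = inv(sqrt(d.b - d.a))

# Exponential with scale θ (pdf (1/θ)·e^(-x/θ)): ∫ pdf^2 = 1/(2θ).
l2norm(d::Exponential) = inv(sqrt(2 * d.θ))
```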


fkiraly commented May 2, 2019

Though, having thought about the matter a little:
The L2 norm is useful - but what would be even more useful is the integral of the squared pdf between 0 and x, for any x. The same for the cdf.

In addition, I'm not sure whether LinearAlgebra.norm(d) is the best way to do this, because it is the L2 norm of the pdf rather than of the distribution; the latter would be an ill-defined concept.

After all, you could take the L2 norm of any distribution-defining function (where it exists): pdf, cdf, mgf, etc., and the result would differ in each case. Why single out the pdf, given that it is merely one of multiple ways to uniquely define a distribution, and given that it doesn't even exist for mixed distributions?

Also, would it not be more natural if something like l2norm(d) returned the distribution of the random variable |X|_2, where X is distributed according to d? That is, if it returned a distribution object rather than a number?

I'm not saying that such behaviour is something I'd particularly be interested in, just that it seems to me like the more natural behaviour when applying functions to distribution objects.

@matbesancon (Member)

OK, yes, I see your point. In that case a new function could be created, something like dist_norm.
It would have to appear in the documentation.

@simonbyrne (Member)

I might be misunderstanding, but wouldn't the Brier loss simply be:

function brier_loss(d::Distribution, x0)
    m = mean(d)
    return var(d) + m*(m-2*x0) + x0
end

I'm not opposed to having it here, but it does seem that scoring rules like this could also go in a separate package.


fkiraly commented May 9, 2019

@simonbyrne - no, that's not the Brier loss.

The Brier loss would be

function brier_loss(d::Distribution, x0)
    return -2 * pmf(d, x0) + l2norm(d)^2
end

with pmf for discrete distributions, and pdf for absolutely continuous ones.
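
Concretely, for a Bernoulli prediction this works out as follows (a sketch; succprob is the Distributions.jl accessor, pdf plays the role of pmf, and the l2norm term is expanded by hand):

```julia
using Distributions

# Brier loss for a Bernoulli prediction d at an observed outcome x0 ∈ {0, 1}:
# -2 * pmf(x0) plus the squared L2 norm p^2 + (1 - p)^2 of the pmf.
function brier_loss(d::Bernoulli, x0)
    p = succprob(d)
    return -2 * pdf(d, x0) + p^2 + (1 - p)^2
end
```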


fkiraly commented May 9, 2019

@simonbyrne or is your claim that the two are always, analytically, identical?
(Proof?)


fkiraly commented May 9, 2019

@simonbyrne it's actually not; here's a disproof: your score is linear in x0.
The Brier score (as defined by me) is linear in pmf(x0) and doesn't otherwise depend on x0.
Hence the Brier score is in general not linear in x0, so the two can't be identical.


simonbyrne commented May 9, 2019

Ignore my comment above, I misunderstood what was being proposed.

From what I can tell, the Brier score is really only defined for binary outcomes (i.e. Bernoulli predictions). It seems there are many ways one could generalize it to other distributions, e.g.

brier_loss(d, x0) = (x0 - mean(d))^2

would be an obvious candidate.

Additionally, from what I can tell, I think your definition is off by a linear scaling and offset: e.g. in the Bernoulli case, if x0 == 0, your expression equals -2*(1-p) + p^2 + (1-p)^2 == (-2 + 2p) + p^2 + (p^2 - 2p + 1) == 2p^2 - 1 (whereas the classic Brier score would be p^2).
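
A quick numeric check of that claim, under the same hypothetical definitions (fk below is fkiraly's expression at x0 = 0; the factor-2-and-offset relation holds for any p):

```julia
using Distributions

p = 0.3
d = Bernoulli(p)
fk = -2 * pdf(d, 0) + p^2 + (1 - p)^2  # fkiraly's Brier loss at x0 = 0
classic = p^2                          # classic Brier score at x0 = 0
fk ≈ 2 * classic - 1                   # true: equal up to scaling and offset
```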


simonbyrne commented May 9, 2019

Another generalization, which would have your desired property of depending on x0 only through the pdf, would be

brier_score(d, x0) = (1 - pdf(d, x0))^2

though this obviously wouldn't work for continuous distributions.


fkiraly commented May 10, 2019

@simonbyrne I think we should be systematic - there are three high-level questions here:
(i) whether the original Brier/squared loss is important/relevant;
(ii) whether Distributions functionality supporting it should be implemented;
(iii) whether your alternatives are reasonable/viable - this is primarily a theory question rather than an implementation question.

I think (i) yes, (ii) yes, and (iii) no. In addition, you are probably not too familiar with probabilistic losses, and maybe a little confused - I hope you do not take this too negatively. I'll try to explain the theory below.

Regarding (i), I believe this is a standard thing, at least in the classification setting. The Brier loss is implemented in relevant packages such as sklearn and mlr, and it is one of several standard ways to train or evaluate probabilistic classification models. The same goes for its generalizations to regression models, but these are rarer, since common packages do not have good probabilistic interfaces (MLJ does!).

The answer to (ii) is of course up to the Distributions.jl devs to decide, but I was of the impression that their feeling was "yes".

Regarding (iii), I think we need to clarify a few points.
One can't just arbitrarily define losses for probabilistic predictions - they need to measure sensibly how close the prediction is to the truth, and properness is one such "reasonableness" property.

The alternatives you propose are not proper for the multi-class classification case or the regression case. For binary classification, it turns out that -2 p(y) + |p|_2^2 and the other (commonly used) expression (1 - p(y))^2 happen to be the same up to scaling and offset - indeed, writing q = p(y), one has -2q + q^2 + (1-q)^2 = 2(1-q)^2 - 1 - so they measure the same thing.

But this is not true in the general case - one simple way to see this is that (1 - p(y))^2 doesn't make much sense for continuous pdfs, which are not upper bounded by 1.

For the same reason, I believe that even in the binary classification case the expression -2 p(y) + |p|_2^2 is more helpful, since it is more or less the only one that generalizes to all relevant cases (except mixed distributions, but that's another story).

@simonbyrne (Member)

> The answer to (ii) is of course up to the Distributions.jl devs to decide, but I was of the impression that their feeling was "yes".

I don't really mind, but my point is that there is nothing particularly special about having such rules in Distributions.jl: they could easily go in another package (say ScoringRules.jl, or be incorporated into LossFunctions.jl). Indeed, it is often desirable to work in smaller packages, since iterative development is much easier and quicker.

In terms of implementation, the main question is how you would handle cases where no closed-form solution exists.

That is a good point about propriety, and your L2 loss definition does match up with the Selten characterization.


fkiraly commented May 10, 2019

@simonbyrne, great that we are on the same page regarding theory!

Let's talk implementation then. If you want to support computation of the key proper scoring rules (or proper losses, with the sign convention more common in ML) for the continuous case, you have two options that I can see:
(i) on-demand symbolic integration, based on symbolic expressions that the distributions provide;
(ii) the distributions provide explicit expressions for the integrals etc. that appear in the scoring rules - for squared/Brier loss, this is the 2-norm of the pdf.

Option (i) is infeasible i.m.o., since you would need to write a symbolic computation engine à la Mathematica from scratch, which seems overkill.

Another alternative, numerical integration, isn't really one (I would argue), since it can be arbitrarily wrong: that's not only problematic if you have to do it for each data point, it's also a no-go if you want to use the losses for unbiased evaluation (where unbiased is meant in the empirical, not the statistical, sense).
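
For reference, the numerical-integration route being dismissed here would look something like this (a sketch using the QuadGK.jl package; pdfnorm_numeric is a hypothetical name):

```julia
using Distributions, QuadGK

# L2 norm of a pdf by adaptive quadrature: fine for well-behaved
# densities, but with no uniform accuracy guarantee across distributions.
function pdfnorm_numeric(d::ContinuousUnivariateDistribution)
    I, _ = quadgk(x -> pdf(d, x)^2, minimum(d), maximum(d))
    return sqrt(I)
end

pdfnorm_numeric(Normal())  # ≈ (2√π)^(-1/2) ≈ 0.5311
```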

@simonbyrne (Member)

I think it is fair to say that symbolic operations are far out of scope for this package.

We could certainly have a pdfnorm(d::Distribution) (or similarly named) function for distributions where it can be calculated easily; it would throw a MethodError in the other cases.
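
A minimal sketch of that design, reusing the closed forms derived earlier in the thread (pdfnorm is the name proposed above, not an existing Distributions.jl function):

```julia
using Distributions

pdfnorm(d::Normal) = inv(sqrt(2 * d.σ * sqrt(π)))
pdfnorm(d::Uniform) = inv(sqrt(d.b - d.a))

pdfnorm(Normal(0.0, 2.0))   # works: closed form implemented
# pdfnorm(Gamma(2.0, 1.0))  # would throw MethodError until a method is defined
```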


fkiraly commented May 10, 2019

Yes, exactly!
