# Probability Estimation via Scoring


There are various methods in machine learning for inducing probabilistic predictors.
These are hypotheses $h$ that do not merely output point predictions $h(\vec{x}) \in \mathcal{Y}$, 
i.e., elements of the output space $\mathcal{Y}$, 
but probability estimates $p_h(\cdot \vert \vec{x}) =  p(\cdot \vert \vec{x}, h)$, 
i.e., complete probability distributions on $\mathcal{Y}$. 
In the case of classification, 
this means predicting a single (conditional) probability $p_h(y \vert \vec{x}) = p(y \vert \vec{x} , h)$ for each class $y \in \mathcal{Y}$, 
whereas in regression, $p( \cdot \vert \vec{x}, h)$ is a density function on $\mathbb{R}$. 
Such predictors can be learned in a discriminative way, 
i.e., in the form of a mapping $\vec{x} \mapsto p( \cdot \vert \vec{x})$, 
or in a generative way, which essentially means learning a joint distribution on $\mathcal{X} \times \mathcal{Y}$. 
Moreover, the approaches can be parametric (assuming specific parametric families of probability distributions) or non-parametric. 
Well-known examples include classical statistical methods such as logistic and linear regression, 
Bayesian approaches such as Bayesian networks and Gaussian processes,
as well as various techniques in the realm of (deep) neural networks.

Training probabilistic predictors is typically accomplished by minimizing suitable loss functions, 
i.e., loss functions that enforce "correct" (conditional) probabilities as predictions. 
In this regard, 
proper scoring rules ({cite:t}`gnei_sp05`)
play an important role, 
including the log-loss as a well-known special case. 
Sometimes, however, estimates are also obtained in a very simple way, 
following basic frequentist techniques for probability estimation, 
like in Naïve Bayes or nearest neighbor classification. 

The predictions delivered by such methods are at best "pseudo-probabilities" that are often not very accurate.
Besides, there are many methods that deliver natural scores, 
intuitively expressing a degree of confidence 
(like the distance from the separating hyperplane in support vector machines), 
but which do not immediately qualify as probabilities either. 
The idea of *scaling* or *calibration methods* is to turn such scores into proper, 
well-calibrated probabilities, that is, 
to learn a mapping from scores to the unit interval that can be applied to the output of a predictor as a kind of post-processing step ({cite:t}`flac_cc17`).
Examples of such methods include binning ({cite:p}`zadr_oc01`),
isotonic regression ({cite:t}`zadr_tc02`),
logistic scaling ({cite:p}`Pla00`) and improvements thereof ({cite:p}`kull_bc17`),
as well as the use of Venn predictors ({cite:p}`joha_vp18`).
Calibration is still a topic of ongoing research. 
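
As a rough illustration of such a post-processing step, the following sketch implements a simplified form of binning: the score range is partitioned into intervals, and each interval is assigned the fraction of positive training examples whose scores fall into it. The equal-width bins, the toy data, and the helper names `fit_binning` and `apply_binning` are assumptions made for this example and are not taken from the cited work.

```python
import numpy as np


def fit_binning(scores, labels, n_bins=5):
    """Estimate one calibrated probability per equal-width bin on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map every score to a bin index in 0..n_bins-1
    idx = np.digitize(scores, edges[1:-1])
    rates = np.full(n_bins, np.nan)  # NaN marks bins without training data
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            # Relative frequency of the positive class within the bin
            rates[b] = labels[in_bin].mean()
    return edges, rates


def apply_binning(scores, edges, rates):
    """Replace each raw score by the positive rate of its bin."""
    return rates[np.digitize(scores, edges[1:-1])]


# Hypothetical uncalibrated scores and binary labels (illustration only)
train_scores = np.array([0.05, 0.15, 0.30, 0.35, 0.55, 0.58, 0.70, 0.75, 0.90, 0.95])
train_labels = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

edges, rates = fit_binning(train_scores, train_labels)
calibrated = apply_binning(np.array([0.1, 0.5, 0.85]), edges, rates)
print(calibrated)  # positive rates of the bins containing 0.1, 0.5 and 0.85
```

With so few training points the per-bin estimates are very coarse; in practice, the number of bins trades off resolution against the variance of each estimate.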

Another important class of methods is *ensemble learning*, with techniques such as bagging or boosting that are especially popular in machine learning due to their ability to improve the accuracy of (point) predictions.
Since such methods produce a (large) set of predictors $h_1, \ldots, h_M$ instead of a single hypothesis, it is tempting to produce probability estimates following basic frequentist principles.
In the simplest case (of classification), each prediction $h_i(\vec{x})$ can be interpreted as a "vote" in favor of a class $y \in \mathcal{Y}$, and probabilities can be estimated by relative frequencies. Needless to say, probabilities constructed in this way tend to be biased and are not necessarily well calibrated.
Especially important in this field are tree-based methods such as random forests ({cite:t}`brei_rf01,krup_pe14`).
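
The voting scheme can be sketched in a few lines. The votes below are hypothetical class labels predicted by $M = 10$ ensemble members for a single query instance; the number of classes and the variable names are likewise assumptions made for this example.

```python
import numpy as np

# Hypothetical point predictions h_1(x), ..., h_M(x) of M = 10 ensemble members
votes = np.array([0, 2, 2, 1, 2, 2, 0, 2, 1, 2])
n_classes = 3

# Count the votes per class and normalize to relative frequencies
counts = np.bincount(votes, minlength=n_classes)
p_hat = counts / counts.sum()

for y, p in enumerate(p_hat):
    print(f"class {y}: {p:.1f}")
```

Here classes 0 and 1 receive two votes each and class 2 receives six, giving estimates of 0.2, 0.2 and 0.6. With only $M$ members, the estimates are restricted to multiples of $1/M$, which is one reason they tend to be poorly calibrated.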

Obviously, while standard probability estimation is a viable approach to representing uncertainty in a prediction, 
there is no explicit distinction between different types of uncertainty. 
Methods falling into this category are mostly concerned with the aleatoric part of the overall uncertainty.
(Yet, as will be seen later on, one way to go beyond mere aleatoric uncertainty is to combine the above methods, for example by learning ensembles of probabilistic predictors.)


## 1. Logarithmic Scoring Rule

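```python
import numpy as np

# Binary ground-truth labels and predicted probabilities of the positive class
y_true = np.array([1, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6])

# Probability assigned to the observed class: p for y = 1, 1 - p for y = 0
p_observed = np.where(y_true == 1, p_pred, 1 - p_pred)

# Log score (log-loss): mean negative log-probability of the observed class
log_score = -np.mean(np.log(p_observed))
print(f"Log Score: {log_score:.4f}")
```

```
Log Score: 0.7811
```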
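
The log score is a proper scoring rule in the sense that its expected value is minimized by predicting the true probability. This can be checked numerically with a small sketch, assuming a hypothetical true probability $q = 0.7$ for the positive class and a grid of candidate predictions (both chosen only for illustration):

```python
import numpy as np

q = 0.7  # assumed true probability of the positive class (illustration only)

# Candidate predictions 0.01, 0.02, ..., 0.99
candidates = np.linspace(0.01, 0.99, 99)

# Expected log-loss of predicting p when the outcome is 1 with probability q
expected_loss = -(q * np.log(candidates) + (1 - q) * np.log(1 - candidates))

best = candidates[np.argmin(expected_loss)]
print(f"Prediction with minimal expected log-loss: {best:.2f}")
```

The minimizer on the grid coincides with $q$, so a predictor has no incentive to report anything other than its actual belief.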


## 2. Brier Score

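```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Binary ground-truth labels and predicted probabilities of the positive class
y_true = np.array([1, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6])

# Brier score: mean squared difference between predicted probability and outcome
brier_score = brier_score_loss(y_true, p_pred)
print(f"Brier Score: {brier_score:.4f}")
```

```
Brier Score: 0.2967
```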
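
For binary outcomes, the Brier score is the mean squared difference between the predicted probability of the positive class and the observed 0/1 outcome. As a cross-check that does not depend on scikit-learn, the same value can be computed directly with NumPy; the variable names below are chosen for this example.

```python
import numpy as np

y_true = np.array([1, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6])

# Squared errors per instance: (0.8 - 1)^2, (0.3 - 1)^2, (0.6 - 0)^2
squared_errors = (p_pred - y_true) ** 2

# Average over instances: (0.04 + 0.49 + 0.36) / 3
brier_manual = squared_errors.mean()
print(f"Brier Score (manual): {brier_manual:.4f}")
```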


## 3. Continuous Ranked Probability Score

```python
import numpy as np
from properscoring import crps_ensemble

# A single observation and an ensemble of five forecast values for it
y_true = np.array([3.5])
predicted_ensemble = np.array([[3.0, 3.2, 3.4, 3.6, 3.8]])

# CRPS of the empirical distribution defined by the ensemble members
crps_score = crps_ensemble(y_true, predicted_ensemble)
print(f"Continuous Ranked Probability Score: {crps_score.mean():.4f}")
```

```
Continuous Ranked Probability Score: 0.1000
```
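
For a finite ensemble, the CRPS can be written as $\mathbb{E}|X - y| - \tfrac{1}{2}\,\mathbb{E}|X - X'|$, where $X$ and $X'$ are independent draws from the ensemble and $y$ is the observation. The following sketch evaluates this expression directly with NumPy for the same observation and ensemble members, as a cross-check that does not depend on `properscoring`; the variable names are chosen for this example.

```python
import numpy as np

y = 3.5
ensemble = np.array([3.0, 3.2, 3.4, 3.6, 3.8])

# Mean absolute deviation of the ensemble members from the observation
term_obs = np.mean(np.abs(ensemble - y))

# Mean absolute difference over all ordered pairs of members (including i == j)
term_spread = np.mean(np.abs(ensemble[:, None] - ensemble[None, :]))

crps_manual = term_obs - 0.5 * term_spread
print(f"CRPS (manual): {crps_manual:.4f}")
```

The first term (0.26) rewards ensembles centered near the observation, while the second term (0.32, entering with weight one half) credits the spread of the ensemble, giving 0.26 − 0.16 = 0.10.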
