# Evaluating Accuracies of Probabilistic Quantile Regression and Classification Models

### Install Dependencies

In [1]:
!pip install scikit-learn==1.5.2
# Install TabPFN
!pip install tabpfn
# Also install random forest quantile regression
!pip install quantile-forest
# install sigmaeval
!pip install sigmaeval

In [15]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import r2_score, roc_auc_score
from quantile_forest import RandomForestQuantileRegressor
from sklearn.ensemble import RandomForestClassifier

from tabpfn import TabPFNRegressor, TabPFNClassifier

from sigmaeval.sigmaeval import crps_quantile, brier_score_multiclass

## Test Regression Quantile Accuracies

To evaluate the accuracy of the predictive posterior (here given by quantiles), we use the **Continuous Ranked Probability Score (CRPS)**.

CRPS is a proper scoring rule for probabilistic predictions of continuous outcomes. It measures how closely a predicted cumulative distribution function (CDF) matches the observed value, and lower values are better. CRPS generalizes the **Brier Score** to continuous variables and reduces to the absolute error when the forecast is a single point. It rewards both **calibration** (how well predicted probabilities match observed frequencies) and **sharpness** (how concentrated the predictive distribution is).
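
For reference, the standard definition for a predictive CDF $F$ and an observation $y$ is given below, together with the equivalent quantile form that makes the score computable from predicted quantiles. The symbols $F$, $\tau$, and $\rho_\tau$ (the pinball loss) are notation introduced here for illustration, not taken from the notebook:

$$
\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(x) - \mathbf{1}\{x \ge y\} \right)^2 \, dx = 2 \int_{0}^{1} \rho_\tau\!\left( y - F^{-1}(\tau) \right) d\tau, \qquad \rho_\tau(u) = u \left( \tau - \mathbf{1}\{u < 0\} \right)
$$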


### **Advantages of CRPS for Quantile-Based Predictions**

✅ **Handles Uncertainty Properly:** Unlike MSE, which only considers point predictions, CRPS evaluates the full distribution.  
✅ **Comparable Across Different Models:** Since CRPS measures the entire distribution’s performance, it is useful for comparing probabilistic models.  
✅ **Works with Prediction Intervals:** Even if the model only predicts a finite set of quantiles, the CRPS can be approximated from them, for example by averaging the pinball loss over the quantile levels.  
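
The CRPS equals twice the pinball (quantile) loss integrated over all quantile levels, so a finite set of quantiles gives a simple approximation. The sketch below is a dependency-free illustration of that idea. The helper names `pinball_loss` and `crps_from_quantiles` are hypothetical, the equal-weight average is a crude Riemann sum that ignores the tails beyond the outermost quantiles, and this is not necessarily how `sigmaeval.crps_quantile` computes its result.

```python
def pinball_loss(y_true, q_pred, tau):
    """Pinball (quantile) loss of one predicted quantile at level tau."""
    diff = y_true - q_pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def crps_from_quantiles(quantile_preds, levels, y_true):
    """Approximate the mean CRPS from predicted quantiles.

    quantile_preds[j][i] is the predicted quantile at level levels[j]
    for sample i, i.e. rows are quantile levels and columns are samples.
    """
    total = 0.0
    for i, y in enumerate(y_true):
        losses = [
            pinball_loss(y, quantile_preds[j][i], tau)
            for j, tau in enumerate(levels)
        ]
        # Twice the average pinball loss approximates the CRPS integral
        total += 2.0 * sum(losses) / len(levels)
    return total / len(y_true)


levels = [0.1, 0.5, 0.9]
quantile_preds = [
    [1.0, 2.0],  # 10% quantile for samples 0 and 1
    [2.0, 3.0],  # 50% quantile
    [3.0, 4.0],  # 90% quantile
]
y_true = [2.0, 5.0]

score = crps_from_quantiles(quantile_preds, levels, y_true)
print(round(score, 4))
```

The first sample sits at its predicted median and contributes little, while the second lies above all predicted quantiles and dominates the average.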

### Load Dataset and Fit Models

Here we compare TabPFN to random forest quantile regression (from the `quantile-forest` package).

In [3]:
X, y = datasets.fetch_california_housing(return_X_y=True)
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=500, test_size=500, random_state=42)

In [4]:
model_pfn = TabPFNRegressor()
model_pfn.fit(X_train, y_train)

In [5]:
quantiles = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pred_quantiles_pfn = model_pfn.predict(X_test, output_type="quantiles", quantiles=quantiles)
pred_quantiles_pfn = np.asarray(pred_quantiles_pfn)

In [6]:
model_rf = RandomForestQuantileRegressor()
model_rf.fit(X_train, y_train)
# get predictions and quantiles
pred_rf = model_rf.predict(X_test, quantiles=quantiles)

In [7]:
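# Evaluate the R2 score of the median (0.5-quantile) prediction for both models
median_idx = quantiles.index(0.5)  # position of the 0.5 quantile (index 4)
# TabPFN output is indexed as (quantile, sample); quantile-forest output as (sample, quantile)
r2_pfn = r2_score(y_test, pred_quantiles_pfn[median_idx, :])
r2_rf = r2_score(y_test, pred_rf[:, median_idx])
print(f"R2 score for TabPFN: {round(r2_pfn,4)}")
print(f"R2 score for RandomForest: {round(r2_rf,4)}")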

R2 score for TabPFN: 0.7928
R2 score for RandomForest: 0.7145
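
The R2 score only judges the median as a point prediction and ignores the rest of the predicted distribution, which is why the CRPS comparison follows. As a reminder of what the number means, here is a minimal sketch of the coefficient of determination on toy values. The helper `r2` is a hypothetical stand-in for `sklearn.metrics.r2_score`, and the data are made up for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy targets and point predictions (illustrative values only)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

score = r2(y_true, y_pred)
print(round(score, 4))
```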


In [8]:
# Evaluate CRPS for the two models
crps_pfn = crps_quantile(pred_quantiles_pfn, quantiles, y_test)
print(f"CRPS TabPFN: {round(crps_pfn,4)}")
# Transpose the random forest output so both inputs are indexed as (quantile, sample)
crps_rf = crps_quantile(pred_rf.T, quantiles, y_test)
print(f"CRPS RandomForest: {round(crps_rf,4)}")

CRPS TabPFN: 0.1618
CRPS RandomForest: 0.2108
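
CRPS folds calibration and sharpness into one number. A complementary check, sketched here under assumptions, is the empirical coverage of a central interval: the interval between the 10% and 90% quantiles should contain roughly 80% of the observations if the quantiles are calibrated. The helper `empirical_coverage` and the toy values are hypothetical and are not part of `sigmaeval`.

```python
def empirical_coverage(lower, upper, y_true):
    """Fraction of observations that fall inside [lower, upper]."""
    inside = sum(1 for lo, hi, y in zip(lower, upper, y_true) if lo <= y <= hi)
    return inside / len(y_true)


# Toy 10% and 90% quantile predictions for five samples (illustrative values)
lower = [1.0, 2.0, 0.5, 3.0, 1.5]
upper = [3.0, 4.0, 2.5, 5.0, 3.5]
y_true = [2.0, 4.5, 1.0, 3.2, 3.5]

coverage = empirical_coverage(lower, upper, y_true)
print(coverage)  # compare against the nominal 0.8
```

In this notebook, the bounds would presumably be the first and last rows of `pred_quantiles_pfn` and the first and last columns of `pred_rf`, given how those arrays are indexed for the median.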


## Test Classification Probabilities

The **Brier Score** is a proper scoring rule that measures the accuracy of probabilistic predictions. For **multi-class classification**, the generalized **multi-class Brier Score** is defined as:

$$
\text{Brier Score} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} (p_{ik} - y_{ik})^2
$$

where:
* $N$ is the number of samples
* $K$ is the number of classes
* $p_{ik}$ is the predicted probability for class $k$ in sample $i$
* $y_{ik}$ is a one-hot encoded ground-truth indicator (1 if true class, 0 otherwise)
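
The formula can be evaluated by hand on a small example. The sketch below is a direct, dependency-free transcription of the definition. The function name `brier_multiclass` is hypothetical and only mirrors the argument order used with `brier_score_multiclass` in this notebook; it is not the `sigmaeval` implementation.

```python
def brier_multiclass(y_index, probs):
    """Multi-class Brier score.

    y_index[i] is the integer class of sample i (0-based);
    probs[i][k] is the predicted probability of class k for sample i.
    """
    total = 0.0
    for true_k, p_row in zip(y_index, probs):
        # Squared distance between the probability vector and the one-hot target
        total += sum(
            (p - (1.0 if k == true_k else 0.0)) ** 2
            for k, p in enumerate(p_row)
        )
    return total / len(y_index)


# Two samples, three classes (illustrative values)
y_index = [0, 2]
probs = [
    [0.8, 0.1, 0.1],  # confident and correct: contributes 0.06
    [0.2, 0.3, 0.5],  # hesitant but correct: contributes 0.38
]

score = brier_multiclass(y_index, probs)
print(round(score, 4))
```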

### Key Properties

* **Proper scoring rule**: Encourages well-calibrated probabilistic predictions
* **Lower values are better**: A perfect prediction gives a Brier score of 0
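* **Works for multi-class settings and is bounded**: With the definition above, the score lies between 0 and 2. Unlike log-loss, it stays finite even when the predicted probability of the true class is 0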
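
To make "proper" concrete: if class 1 truly occurs with probability 0.7, the expected Brier score is minimized by forecasting exactly 0.7, so there is no incentive to report anything other than the true probability. The sketch below checks this numerically for the two-class case of the formula above. The helper `expected_brier` and the grid search are illustrative assumptions.

```python
def expected_brier(p_true, q_forecast):
    """Expected two-class Brier score.

    Class 1 occurs with probability p_true; the forecast assigns
    q_forecast to class 1 and 1 - q_forecast to class 0.
    """
    score_if_1 = 2.0 * (1.0 - q_forecast) ** 2  # outcome is class 1
    score_if_0 = 2.0 * q_forecast ** 2          # outcome is class 0
    return p_true * score_if_1 + (1.0 - p_true) * score_if_0


p_true = 0.7
grid = [i / 100 for i in range(101)]

# The forecast with the lowest expected score is the true probability
best_q = min(grid, key=lambda q: expected_brier(p_true, q))

# A fully confident, wrong forecast gets the worst possible score
worst_case = expected_brier(1.0, 0.0)

print(best_q, worst_case)
```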


### Test on Classification Data (Two Classes)

In [9]:
data = datasets.fetch_openml(name="parkinsons")
print(data.DESCR)

**Author**:   
**Source**: UCI
**Please cite**: 'Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007) 

* Abstract: 

Oxford Parkinson's Disease Detection Dataset

* Source:

The dataset was created by Max Little of the University of Oxford, in collaboration with the National Centre for Voice and Speech, Denver, Colorado, who recorded the speech signals. The original study published the feature extraction methods for general voice disorders.

* Data Set Information:
This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds one of 195 voice recording from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to "status" colu

In [10]:
X = data.data
y = data.target
categories = data.categories
feature_names = data.feature_names
print('X shape:', X.shape)

X shape: (195, 22)


In [11]:
# split in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [12]:
# TabPFN Classifier
model_pfn = TabPFNClassifier()
model_pfn.fit(X_train, y_train)


In [13]:
pred_pfn = model_pfn.predict_proba(X_test)
# Shift the 1-based labels to 0-based class indices that match the predict_proba columns
brier_pfn = brier_score_multiclass(y_test.to_numpy().astype(int) - 1, pred_pfn)
print(f"Brier Score TabPFN: {round(float(brier_pfn), 4)}")

Brier Score TabPFN: 0.0588
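
The `astype(int) - 1` conversion relies on the labels being consecutive integers starting at 1. A more general approach, sketched here as an assumption rather than as part of the original workflow, maps each label to its column position in `predict_proba`. scikit-learn classifiers order those columns by their `classes_` attribute, and TabPFNClassifier is assumed to do the same. The helper `labels_to_indices` is hypothetical.

```python
def labels_to_indices(labels, classes):
    """Map raw labels to their column index in a predict_proba output."""
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels]


# Assumed contents of model.classes_ for a two-class problem with string labels
classes = ["1", "2"]
labels = ["2", "1", "2", "2"]

indices = labels_to_indices(labels, classes)
print(indices)
```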


In [16]:
# Compare with RandomForest
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)
pred_rf = model_rf.predict_proba(X_test)
brier_rf = brier_score_multiclass(y_test.to_numpy().astype(int)-1,pred_rf)
print(f"Brier Score RandomForest: {round(brier_rf,4)}")

Brier Score RandomForest: 0.0658
