A comprehensive, reproducible implementation of post-hoc calibration methods for binary classification, based on the paper "Calibration Meets Reality: Making Machine Learning Predictions Trustworthy".
- Principal Author: Kristina P. Sinaga (sinagakristinap@gmail.com), Independent Researcher
- Implementation Contributor: Arjun S. Nair (5minutepodcastforyou@gmail.com)
This repository provides production-ready implementations of post-hoc calibration methods with extensive experimental validation on real-world datasets. The experiments demonstrate that proper calibration can significantly improve the reliability of machine learning predictions, with ECE improvements of up to 81.8%.
- Implementation of Platt Scaling and Isotonic Regression calibration methods
- Experiments on 5 real-world datasets from UCI repository
- Comprehensive statistical analysis with significance testing
- Extensive visualizations of calibration performance
- Complete reproducibility of all experimental results
Method | Avg ECE | Avg Brier | Avg AUC | Win Rate | Best Dataset |
---|---|---|---|---|---|
Uncalibrated | 0.0708 | 0.0922 | 0.9039 | 24.0% | Adult |
Platt | 0.0678 | 0.0918 | 0.9039 | 20.0% | Breast Cancer |
Isotonic | 0.0544 | 0.0913 | 0.8989 | 56.0% | Adult |
Rank | Dataset | Classifier | Method | ECE Before | ECE After | Improvement |
---|---|---|---|---|---|---|
1 | Breast Cancer | GradientBoosting | Isotonic | 0.0472 | 0.0086 | 81.8% |
2 | German Credit | NeuralNetwork | Platt | 0.2020 | 0.0382 | 81.1% |
3 | Adult | SVM | Isotonic | 0.0344 | 0.0071 | 79.5% |
4 | Adult | RandomForest | Isotonic | 0.0361 | 0.0089 | 75.3% |
5 | Sonar | RandomForest | Isotonic | 0.1213 | 0.0325 | 73.2% |
Dataset | Best Method | Lowest ECE | ECE Improvement | Best Classifier |
---|---|---|---|---|
Adult | Isotonic | 0.0130 | 64.1% | SVM |
Breast Cancer | Isotonic | 0.0214 | 47.2% | GradientBoosting |
German Credit | Platt | 0.0620 | 36.1% | NeuralNetwork |
Ionosphere | Uncalibrated | 0.0607 | 0.0% | GradientBoosting |
Sonar | Isotonic | 0.0887 | 25.7% | RandomForest |
Classifier | Best Method | Avg ECE | Avg Brier | Avg AUC | ECE Improvement |
---|---|---|---|---|---|
RandomForest | Isotonic | 0.0405 | 0.0924 | 0.9022 | 47.2% |
LogisticRegression | Isotonic | 0.0583 | 0.0981 | 0.8807 | 12.3% |
SVM | Isotonic | 0.0621 | 0.0875 | 0.9058 | 11.5% |
GradientBoosting | Uncalibrated | 0.0537 | 0.0878 | 0.9133 | 0.0% |
NeuralNetwork | Isotonic | 0.0512 | 0.0858 | 0.8959 | 41.0% |
Metric | Uncalibrated | Platt Scaling | Isotonic Regression |
---|---|---|---|
ECE (mean ± std) | 0.0708 ± 0.0456 | 0.0678 ± 0.0403 | 0.0544 ± 0.0367 |
Brier (mean ± std) | 0.0922 ± 0.0527 | 0.0918 ± 0.0487 | 0.0913 ± 0.0490 |
AUC (mean ± std) | 0.9039 ± 0.0803 | 0.9039 ± 0.0803 | 0.8989 ± 0.0792 |
ECE (min) | 0.0146 | 0.0151 | 0.0071 |
ECE (max) | 0.2020 | 0.1646 | 0.1370 |
```bash
# Clone the repository
git clone https://github.com/yourusername/calibration-experiments.git
cd calibration-experiments

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run all experiments
python run_experiments.py

# Generate visualizations
python create_clean_visualizations.py

# Generate result tables
python generate_result_tables.py
```
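For a quick standalone look at calibration quality, scikit-learn's `calibration_curve` can draw a basic reliability diagram. The snippet below is only a minimal sketch, not the repository's plotting code (`create_clean_visualizations.py` produces the full figure set), and the dataset and classifier are placeholders chosen for illustration.

```python
# Minimal reliability-diagram sketch (not the repository's plotting code).
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Predicted probabilities on a held-out split.
probs = GradientBoostingClassifier(random_state=0).fit(X_train, y_train).predict_proba(X_test)[:, 1]
frac_positives, mean_predicted = calibration_curve(y_test, probs, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")
plt.plot(mean_predicted, frac_positives, marker="o", label="Gradient boosting")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```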
```
calibration-experiments/
├── data/                 # Dataset storage
│   ├── raw/              # Original datasets from UCI
│   └── processed/        # Preprocessed data
├── src/                  # Source code
│   ├── calibration/      # Calibration methods
│   ├── data_loaders/     # Dataset loaders
│   ├── evaluation/       # Metrics and tests
│   └── visualization/    # Generated plots
├── experiments/          # Experimental scripts
├── results/              # Outputs and reports
│   ├── figures/          # Generated figures
│   └── tables/           # Result tables
└── docs/                 # Documentation
```
- Platt Scaling: Parametric sigmoid transformation
  - Fits parameters A and B to map classifier scores to probabilities
  - Suitable for small calibration sets
  - Optimal for well-separated classes
- Isotonic Regression: Non-parametric monotonic mapping
  - Fits a piecewise-constant, non-decreasing function
  - More flexible, but requires more calibration data
  - Superior performance in most cases (56% win rate)
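As a concrete illustration, both methods can be fit on a held-out calibration split using scikit-learn primitives. This is only a sketch: the repository's own wrappers in `src/calibration/` may expose a different interface, and the dataset and classifier below are placeholders.

```python
# Sketch of Platt scaling and isotonic regression on a held-out calibration split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Separate calibration split so the mapping is fit on scores the model has not seen.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
cal_scores = clf.predict_proba(X_cal)[:, 1]
test_scores = clf.predict_proba(X_test)[:, 1]

# Platt scaling: a sigmoid fit to the raw scores (logistic regression on 1-D input).
platt = LogisticRegression().fit(cal_scores.reshape(-1, 1), y_cal)
platt_probs = platt.predict_proba(test_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: a non-decreasing, piecewise-constant mapping of the scores.
iso = IsotonicRegression(out_of_bounds="clip").fit(cal_scores, y_cal)
iso_probs = iso.predict(test_scores)
```

Here Platt scaling is expressed as a logistic regression on the raw scores, which is equivalent to fitting the sigmoid parameters A and B, and `out_of_bounds="clip"` keeps isotonic predictions defined for scores outside the calibration range.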
- Expected Calibration Error (ECE): Sample-weighted average gap between predicted probability and observed accuracy across probability bins
- Maximum Calibration Error (MCE): Largest such gap in any single bin (worst-case calibration error)
- Brier Score: Proper scoring rule for probability estimates
- Area Under the ROC Curve (AUC): Classification performance metric
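A minimal sketch of how ECE and MCE can be computed from equal-width probability bins follows; the bin count (15) is an assumption, and the binning scheme in the repository's evaluation code may differ.

```python
import numpy as np

def calibration_errors(y_true, y_prob, n_bins=15):
    """Return (ECE, MCE) computed over equal-width probability bins.

    Sketch implementation; the bin count and binning scheme used in the
    repository's evaluation code may differ.
    """
    y_true, y_prob = np.asarray(y_true, dtype=float), np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (y_prob > lo) & (y_prob <= hi)
        if lo == 0.0:                      # include p == 0 in the first bin
            in_bin |= y_prob == 0.0
        if not in_bin.any():
            continue
        gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
        ece += in_bin.mean() * gap         # weight by the fraction of samples in the bin
        mce = max(mce, gap)                # worst-case bin gap
    return ece, mce

# Example: a roughly calibrated predictor should give a small ECE.
ece, mce = calibration_errors([0, 1, 1, 0, 1], [0.1, 0.9, 0.8, 0.2, 0.7])
```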
- 5-fold stratified cross-validation
- Paired t-tests with Bonferroni correction
- Cohen's d for effect size measurement
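The significance testing above can be reproduced with SciPy and NumPy. The sketch below assumes `ece_a` and `ece_b` hold per-fold ECE values for two methods and `n_comparisons` counts the method pairs being corrected for; the example numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

def compare_calibration_methods(ece_a, ece_b, n_comparisons=3):
    """Paired t-test with Bonferroni correction, plus Cohen's d on the paired
    differences. Inputs are per-fold ECE values for two methods (hypothetical
    arrays); `n_comparisons` is the number of pairwise tests being run."""
    ece_a, ece_b = np.asarray(ece_a, dtype=float), np.asarray(ece_b, dtype=float)
    t_stat, p_value = stats.ttest_rel(ece_a, ece_b)
    p_bonferroni = min(1.0, p_value * n_comparisons)   # Bonferroni correction
    diff = ece_a - ece_b
    cohens_d = diff.mean() / diff.std(ddof=1)          # effect size on paired differences
    return {"t": t_stat, "p_corrected": p_bonferroni, "cohens_d": cohens_d}

# Example: per-fold ECE for uncalibrated vs. isotonic (illustrative numbers only).
result = compare_calibration_methods([0.071, 0.065, 0.080, 0.074, 0.069],
                                     [0.055, 0.050, 0.061, 0.058, 0.052])
```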
Dataset | Samples | Features | Task | Classes |
---|---|---|---|---|
UCI Adult | 48,842 | 14 | Income prediction | 2 |
Breast Cancer | 569 | 30 | Cancer diagnosis | 2 |
German Credit | 1,000 | 20 | Credit risk assessment | 2 |
Ionosphere | 351 | 34 | Radar signal classification | 2 |
Sonar | 208 | 60 | Mine vs rock classification | 2 |
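For experimentation outside the bundled loaders in `src/data_loaders/`, some of these datasets are also reachable through scikit-learn and OpenML. This is a convenience sketch only: the OpenML dataset name and version are assumptions, and the repository's loaders apply the preprocessing actually used in the experiments.

```python
# Convenience sketch only; the repository's loaders in src/data_loaders/
# handle the preprocessing used in the experiments.
from sklearn.datasets import fetch_openml, load_breast_cancer

# Breast Cancer (Wisconsin Diagnostic) ships with scikit-learn.
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# UCI Adult is mirrored on OpenML (dataset name and version assumed here).
adult = fetch_openml("adult", version=2, as_frame=True)
X_adult, y_adult = adult.data, adult.target
```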
- Isotonic Regression Superiority: Wins 56% of comparisons across all experiments
- Neural Network Benefits: Shows some of the largest improvements with calibration (up to 81.1% ECE reduction)
- Dataset Dependency: Calibration effectiveness varies significantly by dataset characteristics
- Gradient Boosting Exception: Often well-calibrated without post-hoc methods
If you use this code in your research, please cite:
```bibtex
@article{sinaga2025calibration,
  title={Calibration Meets Reality: Making Machine Learning Predictions Trustworthy},
  author={Sinaga, Kristina P.},
  journal={arXiv preprint},
  year={2025}
}
```
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Principal Author: Kristina P. Sinaga (sinagakristinap@gmail.com)
- Implementation Contributor: Arjun S. Nair (5minutepodcastforyou@gmail.com)
For questions or collaborations, please contact the authors.
- UCI Machine Learning Repository for datasets
- scikit-learn for base calibration implementations
- All contributors and reviewers