
Calibration Analysis: Making Machine Learning Predictions Trustworthy

Python 3.8+ License: MIT

A comprehensive, reproducible implementation of post-hoc calibration methods for binary classification, based on the paper "Calibration Meets Reality: Making Machine Learning Predictions Trustworthy".

Principal Author: Kristina P. Sinaga
Email: sinagakristinap@gmail.com
Affiliation: Independent Researcher

Implementation Contributor: Arjun S. Nair
Email: 5minutepodcastforyou@gmail.com


Overview

This repository provides production-ready implementations of post-hoc calibration methods with extensive experimental validation on real-world datasets. The implementation demonstrates that proper calibration can significantly improve the reliability of machine learning predictions, with Expected Calibration Error (ECE) reductions of up to 81.8% in our experiments.

Key Features

  • Implementation of Platt Scaling and Isotonic Regression calibration methods
  • Experiments on 5 real-world datasets from UCI repository
  • Comprehensive statistical analysis with significance testing
  • Extensive visualizations of calibration performance
  • Complete reproducibility of all experimental results

Key Results Summary

Method Performance Comparison

| Method | Avg ECE | Avg Brier | Avg AUC | Win Rate | Best Dataset |
|---|---|---|---|---|---|
| Uncalibrated | 0.0708 | 0.0922 | 0.9039 | 24.0% | Adult |
| Platt | 0.0678 | 0.0918 | 0.9039 | 20.0% | Breast Cancer |
| Isotonic | 0.0544 | 0.0913 | 0.8989 | 56.0% | Adult |

Top ECE Improvements

| Rank | Dataset | Classifier | Method | ECE Before | ECE After | Improvement |
|---|---|---|---|---|---|---|
| 1 | Breast Cancer | GradientBoosting | Isotonic | 0.0472 | 0.0086 | 81.8% |
| 2 | German Credit | NeuralNetwork | Platt | 0.2020 | 0.0382 | 81.1% |
| 3 | Adult | SVM | Isotonic | 0.0344 | 0.0071 | 79.5% |
| 4 | Adult | RandomForest | Isotonic | 0.0361 | 0.0089 | 75.3% |
| 5 | Sonar | RandomForest | Isotonic | 0.1213 | 0.0325 | 73.2% |

Visualizations

All figures are generated into results/figures/ by create_clean_visualizations.py.

Summary Visualizations

  • ECE heatmap across all experiments
  • Method comparison
  • Top ECE improvements

Dataset-Specific Results

Performance and improvement figures for each dataset:

  • Adult
  • Breast Cancer
  • German Credit
  • Ionosphere
  • Sonar

Classifier-Specific Results

Performance figures for each classifier:

  • Random Forest
  • Logistic Regression
  • Support Vector Machine
  • Gradient Boosting
  • Neural Network


Experimental Results

Summary by Dataset

| Dataset | Best Method | Lowest ECE | ECE Improvement | Best Classifier |
|---|---|---|---|---|
| Adult | Isotonic | 0.0130 | 64.1% | SVM |
| Breast Cancer | Isotonic | 0.0214 | 47.2% | GradientBoosting |
| German Credit | Platt | 0.0620 | 36.1% | NeuralNetwork |
| Ionosphere | Uncalibrated | 0.0607 | 0.0% | GradientBoosting |
| Sonar | Isotonic | 0.0887 | 25.7% | RandomForest |

Summary by Classifier

| Classifier | Best Method | Avg ECE | Avg Brier | Avg AUC | ECE Improvement |
|---|---|---|---|---|---|
| RandomForest | Isotonic | 0.0405 | 0.0924 | 0.9022 | 47.2% |
| LogisticRegression | Isotonic | 0.0583 | 0.0981 | 0.8807 | 12.3% |
| SVM | Isotonic | 0.0621 | 0.0875 | 0.9058 | 11.5% |
| GradientBoosting | Uncalibrated | 0.0537 | 0.0878 | 0.9133 | 0.0% |
| NeuralNetwork | Isotonic | 0.0512 | 0.0858 | 0.8959 | 41.0% |

Statistical Summary

| Metric | Uncalibrated | Platt Scaling | Isotonic Regression |
|---|---|---|---|
| ECE (mean ± std) | 0.0708 ± 0.0456 | 0.0678 ± 0.0403 | 0.0544 ± 0.0367 |
| Brier (mean ± std) | 0.0922 ± 0.0527 | 0.0918 ± 0.0487 | 0.0913 ± 0.0490 |
| AUC (mean ± std) | 0.9039 ± 0.0803 | 0.9039 ± 0.0803 | 0.8989 ± 0.0792 |
| ECE (min) | 0.0146 | 0.0151 | 0.0071 |
| ECE (max) | 0.2020 | 0.1646 | 0.1370 |

Quick Start

Installation

# Clone the repository
git clone https://github.com/yourusername/calibration-experiments.git
cd calibration-experiments

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Running Experiments

# Run all experiments
python run_experiments.py

# Generate visualizations
python create_clean_visualizations.py

# Generate result tables
python generate_result_tables.py

Project Structure

calibration-experiments/
├── data/                      # Dataset storage
│   ├── raw/                  # Original datasets from UCI
│   └── processed/            # Preprocessed data
├── src/                      # Source code
│   ├── calibration/          # Calibration methods
│   ├── data_loaders/         # Dataset loaders
│   ├── evaluation/           # Metrics and tests
│   └── visualization/        # Generated plots
├── experiments/              # Experimental scripts
├── results/                  # Outputs and reports
│   ├── figures/             # Generated figures
│   └── tables/              # Result tables
└── docs/                     # Documentation

Methodology

Calibration Methods

  1. Platt Scaling: Parametric sigmoid transformation
     • Fits parameters A and B to map scores to probabilities
     • Suitable for small calibration sets
     • Optimal for well-separated classes
  2. Isotonic Regression: Non-parametric monotonic mapping
     • Fits a piecewise-constant function
     • More flexible, but requires more calibration data
     • Superior performance in most cases (56% win rate)
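
Both methods build on scikit-learn's calibration utilities (see Acknowledgments). Below is a minimal usage sketch, with a synthetic dataset and illustrative hyperparameters rather than the repository's actual experiment pipeline:

```python
# Minimal sketch of applying both calibration methods via scikit-learn's
# CalibratedClassifierCV (illustrative only; not this repository's experiment code).
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

base = RandomForestClassifier(n_estimators=200, random_state=0)

# Platt scaling: a logistic sigmoid fitted on held-out folds of the training data.
platt = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_train, y_train)

# Isotonic regression: a piecewise-constant monotonic mapping (needs more data).
isotonic = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X_train, y_train)

# Calibrated positive-class probabilities for evaluation.
p_platt = platt.predict_proba(X_test)[:, 1]
p_isotonic = isotonic.predict_proba(X_test)[:, 1]
```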

Evaluation Metrics

  • Expected Calibration Error (ECE): Average calibration error across bins
  • Maximum Calibration Error (MCE): Worst-case calibration error
  • Brier Score: Proper scoring rule for probability estimates
  • Area Under Curve (AUC): Classification performance metric
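
For reference, here is a small NumPy sketch of the binned ECE and MCE described above. It bins on the positive-class probability with 10 equal-width bins, which is an assumption about the exact variant used and not the repository's evaluation code; Brier score and AUC come directly from scikit-learn (brier_score_loss, roc_auc_score).

```python
import numpy as np

def ece_mce(y_true, y_prob, n_bins=10):
    """Expected / Maximum Calibration Error, binning on the positive-class probability."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])       # assign each prediction to one of n_bins bins
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())  # |accuracy - confidence| in the bin
        ece += mask.mean() * gap                              # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return ece, mce
```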

Statistical Testing

  • 5-fold stratified cross-validation
  • Paired t-tests with Bonferroni correction
  • Cohen's d for effect size measurement
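
A rough sketch of this testing procedure, assuming per-fold ECE values from the cross-validation are already available (the fold values and the number of comparisons below are illustrative, not results from this repository):

```python
import numpy as np
from scipy.stats import ttest_rel

# Per-fold ECE from 5-fold stratified cross-validation (illustrative numbers).
ece_uncalibrated = np.array([0.071, 0.065, 0.074, 0.069, 0.070])
ece_isotonic = np.array([0.055, 0.051, 0.058, 0.052, 0.056])

# Paired t-test across folds.
t_stat, p_value = ttest_rel(ece_uncalibrated, ece_isotonic)

# Bonferroni correction for multiple comparisons (e.g. 3 pairwise method comparisons).
n_comparisons = 3
p_corrected = min(1.0, p_value * n_comparisons)

# Cohen's d for paired samples: mean difference over the std of the differences.
diff = ece_uncalibrated - ece_isotonic
cohens_d = diff.mean() / diff.std(ddof=1)

print(f"t={t_stat:.3f}, corrected p={p_corrected:.4f}, Cohen's d={cohens_d:.2f}")
```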

Datasets

| Dataset | Samples | Features | Task | Classes |
|---|---|---|---|---|
| UCI Adult | 48,842 | 14 | Income prediction | 2 |
| Breast Cancer | 569 | 30 | Cancer diagnosis | 2 |
| German Credit | 1,000 | 20 | Credit risk assessment | 2 |
| Ionosphere | 351 | 34 | Radar signal classification | 2 |
| Sonar | 208 | 60 | Mine vs. rock classification | 2 |
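
The repository loads these datasets through src/data_loaders/. As a rough stand-in, comparable copies of most of them can be fetched with scikit-learn; the OpenML names and versions below are assumptions, and the shipped loaders may preprocess the data differently:

```python
# Illustrative loading of comparable dataset copies (not the repository's loaders).
from sklearn.datasets import fetch_openml, load_breast_cancer

X_bc, y_bc = load_breast_cancer(return_X_y=True)              # Breast Cancer (569 x 30)
adult = fetch_openml("adult", version=2, as_frame=True)        # UCI Adult (~48k samples)
credit = fetch_openml("credit-g", version=1, as_frame=True)    # German Credit (1,000 x 20)
sonar = fetch_openml("sonar", version=1, as_frame=True)        # Sonar (208 x 60)
```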

Key Findings

  1. Isotonic Regression Superiority: Wins 56% of comparisons across all experiments
  2. Neural Network Benefits: Shows largest improvements with calibration (up to 81.1% ECE reduction)
  3. Dataset Dependency: Calibration effectiveness varies significantly by dataset characteristics
  4. Gradient Boosting Exception: Often well-calibrated without post-hoc methods

Citation

If you use this code in your research, please cite:

@article{sinaga2025calibration,
  title={Calibration Meets Reality: Making Machine Learning Predictions Trustworthy},
  author={Sinaga, Kristina P.},
  journal={arXiv preprint},
  year={2025}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.


Contributing

Contributions are welcome! Please feel free to submit a Pull Request.


Contact

Principal Author: Kristina P. Sinaga (sinagakristinap@gmail.com)
Implementation Contributor: Arjun S. Nair (5minutepodcastforyou@gmail.com)

For questions or collaborations, please contact the authors.


Acknowledgments

  • UCI Machine Learning Repository for datasets
  • scikit-learn for base calibration implementations
  • All contributors and reviewers
