# Summary and guide for calibration metrics

We provide a summary of the calibration metrics provided by calzone, including the pros and cons of each metric. For a more detailed explanation of each metric and how to calculate it using calzone, please refer to the corresponding notebook.

In [1]:
import pandas as pd
from IPython.display import display, HTML

data = {
    'Metrics': ['ECE', 'MCE', 'Hosmer-Lemeshow test', "Spiegelhalter's z test", "Cox's analysis", 'Integrated calibration index (ICI)'],
    'Description': [
        'Uses the binned reliability diagram: sum of absolute differences, weighted by bin count.',
        'Uses the binned reliability diagram: maximum absolute difference.',
        'Uses the binned reliability diagram: chi-squared test on expected and observed counts.',
        'Based on a decomposition of the Brier score. Test statistic is normally distributed.',
        'Logistic regression of the outcomes on the logits of the predicted probabilities',
        'Similar to ECE, using a smooth fit (usually loess) instead of binning to get the calibration curve'
    ],
    'Pros': [
        '• Intuitive<br>• Easy to calculate',
        '• Intuitive<br>• Easy to calculate',
        '• Intuitive<br>• Statistical meaning',
        '• Doesn\'t rely on binning<br>• Statistical meaning',
        '• Doesn\'t rely on binning<br>• Its value shows how the calibration is off',
        '• Doesn\'t rely on binning<br>• Captures all kinds of miscalibration'
    ],
    'Cons': [
        '• Depends on binning<br>• Depends on the class-by-class or top-class choice',
        '• Depends on binning<br>• Depends on the class-by-class or top-class choice',
        '• Depends on binning<br>• Low power<br>• Wrong coverage',
        '• Doesn\'t detect prevalence shift',
        '• Fails to capture some cases of miscalibration',
        '• Depends on the choice of curve fitting<br>• Depends on fitting parameters'
    ],
    'Meaning': [
        'Average deviation from true probability',
        'Maximum deviation from true probability',
        'Test of calibration',
        'Test of calibration',
        'A logit fit to the calibration curve',
        'Average deviation from true probability'
    ]
}

df = pd.DataFrame(data)

# Apply custom styling
styled_df = df.style.set_properties(**{'text-align': 'left', 'white-space': 'pre-wrap'})
styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])

# Display the styled dataframe
#display(HTML(styled_df.to_html(escape=False)))


In [2]:
display(HTML(styled_df.to_html(escape=False)))

| | Metrics | Description | Pros | Cons | Meaning |
|---|---|---|---|---|---|
| 0 | ECE | Uses the binned reliability diagram: sum of absolute differences, weighted by bin count. | • Intuitive<br>• Easy to calculate | • Depends on binning<br>• Depends on the class-by-class or top-class choice | Average deviation from true probability |
| 1 | MCE | Uses the binned reliability diagram: maximum absolute difference. | • Intuitive<br>• Easy to calculate | • Depends on binning<br>• Depends on the class-by-class or top-class choice | Maximum deviation from true probability |
| 2 | Hosmer-Lemeshow test | Uses the binned reliability diagram: chi-squared test on expected and observed counts. | • Intuitive<br>• Statistical meaning | • Depends on binning<br>• Low power<br>• Wrong coverage | Test of calibration |
| 3 | Spiegelhalter's z test | Based on a decomposition of the Brier score. Test statistic is normally distributed. | • Doesn't rely on binning<br>• Statistical meaning | • Doesn't detect prevalence shift | Test of calibration |
| 4 | Cox's analysis | Logistic regression of the outcomes on the logits of the predicted probabilities | • Doesn't rely on binning<br>• Its value shows how the calibration is off | • Fails to capture some cases of miscalibration | A logit fit to the calibration curve |
| 5 | Integrated calibration index (ICI) | Similar to ECE, using a smooth fit (usually loess) instead of binning to get the calibration curve | • Doesn't rely on binning<br>• Captures all kinds of miscalibration | • Depends on the choice of curve fitting<br>• Depends on fitting parameters | Average deviation from true probability |
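
To make the ECE and MCE rows concrete, here is a minimal NumPy sketch of the binned calculation for a binary problem. It is not calzone's implementation: the function name `binned_calibration_errors`, the equal-width binning, and the default of 10 bins are assumptions made for illustration only.

```python
import numpy as np

def binned_calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) for binary labels using equal-width bins.

    Hypothetical helper for illustration; not part of calzone.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Interior bin edges; np.digitize then yields bin ids in 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between observed frequency and mean predicted probability.
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap   # weight by the fraction of samples in the bin
        mce = max(mce, gap)        # track the worst bin
    return ece, mce

# Example call on toy data
ece, mce = binned_calibration_errors([0, 0, 1, 1], [0.25, 0.25, 0.25, 0.85])
print(f"ECE={ece:.3f}, MCE={mce:.3f}")
```

Changing `n_bins` or switching to equal-count bins generally changes both values, which is the binning dependence listed under Cons.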


## Guide to Calibration Metrics

We recommend visualizing the calibration with reliability diagrams. If you see general over- or under-estimation of the probability for a given class, consider applying a prevalence adjustment to check whether the miscalibration is due only to prevalence shift. After the prevalence adjustment, plot the reliability diagrams again and examine the calibration metrics. We recommend the Cox and loess integrated calibration index (ICI) to get a general sense of the average probability deviation, and Spiegelhalter's z test to test for calibration. Other metrics such as ECE, the Cox slope/intercept, and the HL test have their limitations and should be used with caution. See the detailed description of each metric in the following sections.
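
As an illustration of the recommended test, the sketch below computes Spiegelhalter's z statistic directly from its standard formula, with a two-sided p-value from the standard normal distribution. It is a hedged, self-contained example rather than calzone's own routine; the function name `spiegelhalter_z` is hypothetical.

```python
import math
import numpy as np

def spiegelhalter_z(y_true, y_prob):
    """Return (z, two-sided p-value) for Spiegelhalter's z test.

    Hypothetical helper for illustration; not part of calzone.
    Under the null hypothesis of perfect calibration, z is
    approximately standard normal.
    """
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_prob, dtype=float)

    numerator = np.sum((y - p) * (1.0 - 2.0 * p))
    variance = np.sum((1.0 - 2.0 * p) ** 2 * p * (1.0 - p))
    # Note: variance is zero if every prediction is exactly 0, 0.5, or 1,
    # in which case the statistic is undefined.
    z = numerator / math.sqrt(variance)

    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Example call on simulated, well-calibrated predictions
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=1000)
labels = rng.binomial(1, probs)
z, p_value = spiegelhalter_z(labels, probs)
print(f"z={z:.3f}, p={p_value:.3f}")
```

A small p-value is evidence against calibration. As the table notes, the test does not detect prevalence shift, so it is best read alongside the reliability diagram.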