# Summary and guide for calzone

We provide a summary of the calibration metrics provided by calzone, including the pros and cons of each metric. For a more detailed explanation of each metric and how to calculate it using calzone, please refer to the corresponding notebook.

In [3]:
import pandas as pd
from IPython.display import display, HTML
data = {
    'Metrics': ['Expected calibration error<br>(ECE)', 'Maximum calibration error<br>(MCE)', 'Hosmer-Lemeshow test', "Spiegelhalter's z test", "Cox's analysis", 'Integrated calibration index<br> (ICI)'],
    'Description': [
        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>sum of absolute difference, weighted by bin count.</div>',
        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Maximum absolute difference.</div>',
        '<div>Using binned reliability diagram<br>(equal-width or equal-count binning),<br>Chi-squared based test using expected and observed.</div>',
        '<div>Decomposition of Brier score.<br>Normally distributed<br> </div>',
        '<div>Logistic regression of the logits<br> <br> </div>',
        '<div>Similar to ECE, using smooth fit (usually loess)<br>instead of binning to get<br>the calibration curve</div>'
    ],
    'Pros': [
        '<div>• Intuitive<br>• Easy to calculate</div>',
        '<div>• Intuitive<br>• Easy to calculate</div>',
        '<div>• Intuitive<br>• Statistical meaning</div>',
        '<div>• Doesn\'t rely on binning<br>• Statistical meaning</div>',
        '<div>• Doesn\'t rely on binning<br>• Hints at miscalibration type</div>',
        '<div>• Doesn\'t rely on binning<br>• Captures all kinds of miscalibration</div>'
    ],
    'Cons': [
        '<div>• Depends on binning<br>• Depends on class-by-class/top-class</div>',
        '<div>• Depends on binning<br>• Depends on class-by-class/top-class</div>',
        '<div>• Depends on binning<br>• Low power<br>• Wrong coverage</div>',
        '<div>• Doesn\'t detect prevalence shift</div>',
        '<div>• Fails to capture some miscalibration</div>',
        '<div>• Depends on the choice of curve fitting<br>• Depends on fitting parameters</div>'
    ],
    'Meaning': [
        '<div>Average deviation from<br>true probability</div>',
        '<div>Maximum deviation from<br>true probability</div>',
        '<div>Test of<br>calibration</div>',
        '<div>Test of<br>calibration</div>',
        '<div>A logit fit to the<br>calibration curve</div>',
        '<div>Average deviation from<br>true probability</div>'
    ]
}
df = pd.DataFrame(data)

# Apply custom styling
styled_df = df.style.set_properties(**{'text-align': 'left', 'white-space': 'pre-wrap'})
styled_df = styled_df.set_table_styles([dict(selector='th', props=[('text-align', 'center')])])

styled_df = styled_df.hide(axis="index")

# Display the styled dataframe
display(HTML(styled_df.to_html(escape=False)))
# ### Export PNG format of the table
# import dataframe_image as dfi

# dfi.export(styled_df,"mytable.png",table_conversion = 'matplotlib',dpi=300)

| Metrics | Description | Pros | Cons | Meaning |
|---|---|---|---|---|
| Expected calibration error (ECE) | Using binned reliability diagram (equal-width or equal-count binning), sum of absolute difference, weighted by bin count. | • Intuitive<br>• Easy to calculate | • Depends on binning<br>• Depends on class-by-class/top-class | Average deviation from true probability |
| Maximum calibration error (MCE) | Using binned reliability diagram (equal-width or equal-count binning), maximum absolute difference. | • Intuitive<br>• Easy to calculate | • Depends on binning<br>• Depends on class-by-class/top-class | Maximum deviation from true probability |
| Hosmer-Lemeshow test | Using binned reliability diagram (equal-width or equal-count binning), chi-squared based test using expected and observed. | • Intuitive<br>• Statistical meaning | • Depends on binning<br>• Low power<br>• Wrong coverage | Test of calibration |
| Spiegelhalter's z test | Decomposition of Brier score. Normally distributed | • Doesn't rely on binning<br>• Statistical meaning | • Doesn't detect prevalence shift | Test of calibration |
| Cox's analysis | Logistic regression of the logits | • Doesn't rely on binning<br>• Hints at miscalibration type | • Fails to capture some miscalibration | A logit fit to the calibration curve |
| Integrated calibration index (ICI) | Similar to ECE, using smooth fit (usually loess) instead of binning to get the calibration curve | • Doesn't rely on binning<br>• Captures all kinds of miscalibration | • Depends on the choice of curve fitting<br>• Depends on fitting parameters | Average deviation from true probability |


### PNG version of the table (in case of formatting problems)
![Summary table of calibration metrics](mytable.png "Summary of calibration metrics")

## Guide to calzone and calibration metrics

calzone aims to assess whether a model achieves moderate calibration, that is, whether $\mathbb{P}(D=1|\hat{P}=p)=p$ for all $p\in[0,1]$, where $D$ is the true binary label and $\hat{P}$ is the predicted probability.

To accurately assess the calibration of machine learning models, it is essential to have a comprehensive and representative dataset with sufficient coverage of the prediction space. The calibration metrics are not meaningful if the dataset is not representative of the true intended population.

calzone takes a CSV dataset containing the predicted probability of each class and the true label. Most metrics in calzone only work with binary classification, so multi-class problems are transformed into 1-vs-rest problems when the metrics are calculated. Therefore, you need to specify the class of interest when using the metrics. The only exceptions are the top-class expected calibration error ($ECE_{top}$) and top-class maximum calibration error ($MCE_{top}$), which measure only the calibration of the class with the highest predicted probability and hence work for multi-class problems. See the corresponding documentation for more details.
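To make the two views concrete, here is a minimal NumPy sketch of the 1-vs-rest and top-class transformations. It does not use calzone itself: the function names `one_vs_rest` and `top_class` and the toy data are hypothetical, and the layout (one probability column per class plus an integer label) is an assumption for illustration.

```python
import numpy as np

def one_vs_rest(probs, labels, class_of_interest):
    """Reduce multi-class predictions to a binary problem for one class.

    Hypothetical helper, not part of calzone.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_pos = probs[:, class_of_interest]                 # predicted probability of the class
    y_pos = (labels == class_of_interest).astype(int)   # 1 if the sample belongs to it
    return p_pos, y_pos

def top_class(probs, labels):
    """Keep only the highest predicted probability and whether it was correct.

    Hypothetical helper, not part of calzone.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)                    # class with the highest probability
    confidence = probs.max(axis=1)                      # that highest probability
    correct = (predicted == labels).astype(int)         # 1 if the top class is the true class
    return confidence, correct

# Toy data: three samples, three classes (each row sums to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.3, 0.5]])
labels = np.array([0, 2, 2])

p1, y1 = one_vs_rest(probs, labels, class_of_interest=2)
conf, correct = top_class(probs, labels)
```

The 1-vs-rest pair `(p1, y1)` depends on the chosen class of interest, while the top-class pair `(conf, correct)` is defined once for the whole multi-class problem.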


We recommend visualizing calibration using reliability diagrams. If you observe general over- or under-estimation of probabilities for a given class, consider applying a prevalence adjustment to determine if it's solely due to prevalence shift. After prevalence adjustment, plot the reliability diagrams again and examine the results of calibration metrics.
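As an illustration of what a prevalence adjustment can look like, the sketch below rescales the predicted odds by the ratio of two prevalences and searches for the source prevalence that minimizes the cross-entropy on the evaluation data. This is one common formulation and is an assumption here, not necessarily how calzone implements it; the names `adjust_for_prevalence`, `cross_entropy`, and `fit_source_prevalence` are hypothetical, and the data are simulated.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def adjust_for_prevalence(p, prev_from, prev_to):
    """Rescale probabilities from a population with prevalence prev_from
    to one with prevalence prev_to (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    pos = p * prev_to / prev_from
    neg = (1.0 - p) * (1.0 - prev_to) / (1.0 - prev_from)
    return pos / (pos + neg)

def cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy, with clipping to avoid log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_source_prevalence(p, y):
    """Assumption: the target prevalence is the observed prevalence in the
    evaluation data; search for the source prevalence that minimizes
    cross-entropy after adjustment."""
    y = np.asarray(y, dtype=float)
    prev_to = y.mean()
    result = minimize_scalar(
        lambda prev_from: cross_entropy(y, adjust_for_prevalence(p, prev_from, prev_to)),
        bounds=(1e-3, 1.0 - 1e-3),
        method="bounded",
    )
    return result.x, prev_to

# Simulated data: labels follow p_true, but the model's outputs are shifted
# as if it had been developed on a population with prevalence 0.2.
rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=5000)
y = rng.binomial(1, p_true)
p_model = adjust_for_prevalence(p_true, prev_from=0.5, prev_to=0.2)

prev_from_hat, prev_to_hat = fit_source_prevalence(p_model, y)
p_adjusted = adjust_for_prevalence(p_model, prev_from_hat, prev_to_hat)
print("cross-entropy before:", cross_entropy(y, p_model))
print("cross-entropy after: ", cross_entropy(y, p_adjusted))
```

If the adjusted probabilities are well calibrated while the raw ones are not, the miscalibration is consistent with a pure prevalence shift; if miscalibration remains after adjustment, something else is going on.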

For a general sense of average probability deviation, we recommend using the Cox and loess integrated calibration indices (ICI), as they don't depend on binning. Alternatively, ECE can be used to measure the same quantity, but the result will depend on the binning scheme you use. If the probability distribution is highly skewed toward 0 and 1, use equal-count binning for ECE.
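The dependence of ECE on the binning scheme can be seen in a small NumPy sketch. It is a plain implementation of the definition in the summary table (absolute difference between mean predicted probability and observed frequency per bin, weighted by bin count), not calzone's own function; the name `expected_calibration_error` and its parameters are hypothetical.

```python
import numpy as np

def expected_calibration_error(p, y, n_bins=10, binning="equal-width"):
    """Binned ECE for a binary problem (hypothetical helper, not calzone's API)."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    if binning == "equal-width":
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    elif binning == "equal-count":
        # Quantile edges put roughly the same number of samples in each bin.
        edges = np.quantile(p, np.linspace(0.0, 1.0, n_bins + 1))
    else:
        raise ValueError("binning must be 'equal-width' or 'equal-count'")
    # Use interior edges only, so every sample lands in one of n_bins bins.
    bin_index = np.searchsorted(edges[1:-1], p, side="right")
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        gap = abs(y[in_bin].mean() - p[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Simulated, well-calibrated predictions that are skewed toward 0 and 1.
rng = np.random.default_rng(0)
p = rng.beta(0.3, 0.3, size=2000)
y = rng.binomial(1, p)
print("equal-width:", expected_calibration_error(p, y, binning="equal-width"))
print("equal-count:", expected_calibration_error(p, y, binning="equal-count"))
```

With skewed probabilities, equal-width bins in the middle of the range hold few samples and give noisy estimates, whereas equal-count bins keep the per-bin sample size roughly constant.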

Please refer to the notebooks for detailed descriptions of each metric.