# Upskilling a no-skill classifier with Conformal Prediction

Whenever a scientist builds a model, they have to evaluate its results against some set of metrics chosen for the context.

This has led to a culture that focuses too much on optimizing metrics instead of measuring how the model would impact the business.

In this notebook I'll show a way to evaluate models through a different lens: conformal prediction, which can help you get a clear picture of what your model is actually predicting.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from crepes import WrapClassifier
from sklearn.base import ClassifierMixin
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

alt.data_transformers.enable("vegafusion")


def calculate_coverage(
    X: pd.DataFrame,
    y: pd.Series,
    label: int,
    calibrated_conformal_classifier: WrapClassifier,
    alphas: list[float],
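) -> pd.DataFrame:
    """Summarize, for each alpha, how the prediction sets treat one label.

    Counts how many prediction sets contain `label` and how many of those
    sets belong to samples whose true label really is `label`.
    """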
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)

    sets = {
        alpha: calibrated_conformal_classifier.predict_set(X, confidence=1 - alpha)
        for alpha in alphas
    }

    count_true_label = sum(True for value in y if value == label)
    size = len(y)

    random_guessing = count_true_label / size

    results = []

    for alpha in alphas:
        count_coverage = 0
        count_sets = 0

        sets_aux = sets[alpha]
        for i, value in enumerate(y):
            if sets_aux[i, label]:
                count_sets += 1
                if value == label:
                    count_coverage += 1

        denominator_count_sets = count_sets if count_sets > 0 else 1

        res = {
            "alpha": alpha,
            "coverage": count_coverage,
            "% coverage (recall)": round(count_coverage * 100 / count_true_label, 2),
            "# sets containing target": count_sets,
            "% sets containing_target": round(count_sets * 100 / size, 2),
            "% sets correctly covering target (precision)": round(
                count_coverage * 100 / denominator_count_sets, 2
            ),
            "pp gain over random guessing": round(
                ((count_coverage * 100) / denominator_count_sets)
                - (random_guessing * 100),
                2,
            ),
        }

        results.append(res)

    return pd.DataFrame(results).set_index("alpha")


def train_and_calibrate(
    classifier: ClassifierMixin,
    X_train: pd.DataFrame,
    X_calib: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_calib: pd.Series,
    y_test: pd.Series,
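) -> WrapClassifier:
    """Fit the classifier, print its test-set report, and return a
    class-conditional conformal wrapper calibrated on the calibration split.
    """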
    classifier.fit(X_train, y_train)
    y_preds = classifier.predict(X_test)

    print(classification_report(y_test, y_preds, zero_division=0))

    conformal_classifier = WrapClassifier(classifier)
    conformal_classifier.calibrate(
        X_calib.reset_index(drop=True), y_calib.reset_index(drop=True), class_cond=True
    )

    return conformal_classifier


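def make_df(random_state: int) -> pd.DataFrame:
    """Build an imbalanced three-blob dataset: 18,000 samples of label 0
    and 1,000 samples each of labels 1 and 2.
    """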
    size_sample = 20000
    size_1 = 1000
    size_2 = 1000
    size_0 = size_sample - size_1 - size_2

    X_, y_ = make_blobs(
        n_samples=[size_0, size_1, size_2],
        centers=[(0, 0), (3, 0), (1, 2)],
        cluster_std=[2, 0.5, 0.8],
        random_state=random_state,
    )

    features = ["a", "b"]
    target = "y"

    return pd.concat(
        (
            pd.DataFrame(X_, columns=features),
            pd.Series(y_, name=target),
        ),
        axis=1,
    )

In [2]:
start_alpha = 0.05
end_alpha = 0.95
# Number of grid points from start_alpha to end_alpha in steps of 0.05 (19 here).
num_alpha = round((end_alpha - start_alpha) / 0.05) + 1

alphas = np.linspace(start_alpha, end_alpha, num_alpha)
alphas = [round(i, 2) for i in alphas]

df = make_df(0)

print(df.info())
display(df.head())
alt.Chart(df).mark_point(size=1).encode(
    alt.X("a:Q"),
    alt.Y("b:Q"),
    alt.Color("y:N"),
)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       20000 non-null  float64
 1   b       20000 non-null  float64
 2   y       20000 non-null  int64  
dtypes: float64(2), int64(1)
memory usage: 468.9 KB
None


,a,b,y
0,-1.484079,-0.089315,0
1,2.804356,1.00883,1
2,0.622711,-2.05976,0
3,-0.236287,0.590889,0
4,-3.034808,2.488674,0


From the plot above one can already tell that it will be very hard to correctly classify label 1, and almost impossible to do the same with label 2.

Some would say "upsampling" while others would say "downsampling". Upsampling is objectively bad, and I'm not going to spend time here on why creating artificial data is a bad idea. Downsampling, on the other hand, is OK, but it adds an extra layer of complexity that can't be avoided if you want correct results.

For this case I prefer Conformal Prediction. It's a method that lets you say something like "in this region of the feature space, these are the labels we can't rule out at this confidence level", with mathematical grounding.
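
To make the idea of a prediction set concrete, here is a minimal sketch of class-conditional split conformal prediction in plain Python. It is a generic textbook-style version, not the internals of `crepes`: the nonconformity score `1 - p`, the helper names `conformal_threshold` and `prediction_set`, and the tiny calibration data are all assumptions made up for illustration.

```python
import math


def conformal_threshold(scores: list[float], alpha: float) -> float:
    """Return the conformal quantile of the calibration nonconformity scores."""
    n = len(scores)
    # Finite-sample correction: use the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: no label can be ruled out
    return sorted(scores)[rank - 1]


def prediction_set(probs: list[float], thresholds: list[float]) -> list[bool]:
    """Keep label k when its nonconformity score 1 - p_k is within its threshold."""
    return [1 - p <= t for p, t in zip(probs, thresholds)]


# Hypothetical calibration data: predicted probabilities for two labels
# and the true label of each calibration sample.
calib_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.95, 0.05],  # true label 0
    [0.6, 0.4], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7],    # true label 1
]
calib_labels = [0, 0, 0, 0, 1, 1, 1, 1]
alpha = 0.25

# Class-conditional calibration: one threshold per label, computed only from
# the calibration samples that truly belong to that label.
thresholds = []
for label in (0, 1):
    scores = [1 - p[label] for p, y in zip(calib_probs, calib_labels) if y == label]
    thresholds.append(conformal_threshold(scores, alpha))

new_points = ([0.85, 0.15], [0.75, 0.25], [0.2, 0.8])
sets = [prediction_set(p, thresholds) for p in new_points]
print(sets)
```

Each inner list says which labels stay plausible for one new point. The second point keeps both labels, which is how the method expresses "the model can't tell these apart here" instead of forcing a single answer.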

What does this mean in practice?
 * for a targeted campaign this could mean "anyone from this region is sure to be at least interested in this ad" (remember that there are cases in which we can't even meet the whole demand for something, so if we can reduce our spending on marketing campaigns while still guaranteeing we sell the whole stock, it is a big win);
 * on the other hand, if we are talking about fraud detection, we can use a cheaper model to flag anyone slightly suspicious and then send only those cases to a more sophisticated and expensive model, maybe a paid API we contracted for our business, which means that we don't waste too many resources trying to detect fraud on most transactions.
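
The fraud example above can be sketched as a simple routing rule over prediction sets. This is only an illustration: `FRAUD_LABEL`, `needs_second_stage`, and the example sets are hypothetical, and the rows are assumed to be laid out as one boolean (or 0/1) flag per label, which is how this notebook indexes the output of `predict_set`.

```python
FRAUD_LABEL = 1  # hypothetical: index of the "fraud" label


def needs_second_stage(prediction_set: list[int], fraud_label: int = FRAUD_LABEL) -> bool:
    """Escalate a transaction when the cheap model cannot rule out fraud."""
    return bool(prediction_set[fraud_label])


# Hypothetical prediction sets for five transactions (columns: legit, fraud).
sets = [[1, 0], [1, 1], [0, 1], [1, 0], [1, 0]]

escalated = [i for i, s in enumerate(sets) if needs_second_stage(s)]
share_escalated = len(escalated) / len(sets)
print(escalated, share_escalated)
```

Only the transactions whose set still contains the fraud label reach the expensive model. The chosen alpha controls how many true frauds are allowed to slip past this first filter.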

In [3]:
features = ["a", "b"]
target = "y"

X, y = df.filter(features), df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(
    X_test, y_test, test_size=0.5, random_state=0
)

train_data = (X_train, X_calib, X_test, y_train, y_calib, y_test)

new_df = make_df(1)
new_X, new_y = new_df.filter(features), new_df[target]

## Logistic Regression Classifier

Training a Logistic Regression Classifier yields a model that can't correctly classify anything beyond label 0. However, it isn't the useless model that the metrics make it seem.

Looking at the table calculated from the prediction sets, we can see how the model performs at each alpha. This gives us the ability to find where the model is performing correctly, with virtually 100% certainty that the label is correct, while showing us at which threshold the model starts to "fail".

That means even a classically bad classifier can be useful if we don't have anything better.

In [4]:
conformal_logistic_regression_classifier = train_and_calibrate(
    LogisticRegression(random_state=0), *train_data
)
for label in (0, 1, 2):
    print(f"***** Results for label {label} *****")
    display(
        calculate_coverage(
            X_test,
            y_test,
            label,
            conformal_logistic_regression_classifier,
            alphas,
        )
    )

              precision    recall  f1-score   support

           0       0.91      0.98      0.94      3630
           1       0.00      0.00      0.00       164
           2       0.00      0.00      0.00       206

    accuracy                           0.89      4000
   macro avg       0.30      0.33      0.31      4000
weighted avg       0.82      0.89      0.86      4000

***** Results for label 0 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,3440,94.77,3787,94.67,90.84,0.09
0.1,3271,90.11,3542,88.55,92.35,1.6
0.15,3095,85.26,3273,81.83,94.56,3.81
0.2,2929,80.69,3065,76.62,95.56,4.81
0.25,2757,75.95,2838,70.95,97.15,6.4
0.3,2592,71.4,2643,66.08,98.07,7.32
0.35,2404,66.23,2425,60.62,99.13,8.38
0.4,2201,60.63,2212,55.3,99.5,8.75
0.45,2002,55.15,2006,50.15,99.8,9.05
0.5,1840,50.69,1842,46.05,99.89,9.14


***** Results for label 1 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,150,91.46,638,15.95,23.51,19.41
0.1,143,87.2,569,14.22,25.13,21.03
0.15,134,81.71,499,12.47,26.85,22.75
0.2,128,78.05,472,11.8,27.12,23.02
0.25,124,75.61,450,11.25,27.56,23.46
0.3,116,70.73,419,10.47,27.68,23.58
0.35,111,67.68,401,10.03,27.68,23.58
0.4,98,59.76,366,9.15,26.78,22.68
0.45,91,55.49,339,8.47,26.84,22.74
0.5,86,52.44,330,8.25,26.06,21.96


***** Results for label 2 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,194,94.17,1322,33.05,14.67,9.52
0.1,186,90.29,1176,29.4,15.82,10.67
0.15,171,83.01,1018,25.45,16.8,11.65
0.2,167,81.07,924,23.1,18.07,12.92
0.25,156,75.73,821,20.52,19.0,13.85
0.3,144,69.9,770,19.25,18.7,13.55
0.35,140,67.96,708,17.7,19.77,14.62
0.4,132,64.08,656,16.4,20.12,14.97
0.45,120,58.25,617,15.43,19.45,14.3
0.5,115,55.83,588,14.7,19.56,14.41
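
To read the columns above, it helps to redo one row by hand. The sketch below reproduces the label 1 row at alpha = 0.05 from its raw counts, using the same formulas as `calculate_coverage`. The counts are copied from that row and from the support column of the classification report.

```python
# Counts copied from the label 1 row at alpha = 0.05 and the classification report.
size = 4000             # samples in the test split
count_true_label = 164  # test samples whose true label is 1
count_sets = 638        # prediction sets that contain label 1
count_coverage = 150    # of those sets, how many belong to a true label 1

recall = round(count_coverage * 100 / count_true_label, 2)
share_of_sets = round(count_sets * 100 / size, 2)
precision = round(count_coverage * 100 / count_sets, 2)

# Baseline: precision you would get by tagging samples as label 1 at random.
random_guessing = count_true_label * 100 / size
gain = round(count_coverage * 100 / count_sets - random_guessing, 2)

print(recall, share_of_sets, precision, gain)  # 91.46 15.95 23.51 19.41
```

In words: flagging about 16% of the test samples is enough to catch about 91% of the true label 1 cases, and a flagged sample is label 1 almost six times more often than a randomly picked one (23.51% against 4.1%).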


### Predicting on new data

The results stay roughly the same on new data that follows the same distribution.

In [5]:
for label in (0, 1, 2):
    print(f"***** Results for label {label} *****")
    display(
        calculate_coverage(
            new_X,
            new_y,
            label,
            conformal_logistic_regression_classifier,
            alphas,
        )
    )

***** Results for label 0 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,17041,94.67,18914,94.57,90.1,0.1
0.1,16148,89.71,17543,87.72,92.05,2.05
0.15,15231,84.62,16162,80.81,94.24,4.24
0.2,14427,80.15,15060,75.3,95.8,5.8
0.25,13460,74.78,13835,69.17,97.29,7.29
0.3,12641,70.23,12870,64.35,98.22,8.22
0.35,11771,65.39,11899,59.49,98.92,8.92
0.4,10839,60.22,10905,54.52,99.39,9.39
0.45,9768,54.27,9800,49.0,99.67,9.67
0.5,8918,49.54,8931,44.66,99.85,9.85


***** Results for label 1 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,963,96.3,3647,18.23,26.41,21.41
0.1,926,92.6,3235,16.18,28.62,23.62
0.15,869,86.9,2838,14.19,30.62,25.62
0.2,831,83.1,2687,13.44,30.93,25.93
0.25,801,80.1,2580,12.9,31.05,26.05
0.3,758,75.8,2424,12.12,31.27,26.27
0.35,731,73.1,2305,11.53,31.71,26.71
0.4,664,66.4,2109,10.54,31.48,26.48
0.45,594,59.4,1930,9.65,30.78,25.78
0.5,576,57.6,1868,9.34,30.84,25.84


***** Results for label 2 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,923,92.3,6451,32.26,14.31,9.31
0.1,892,89.2,5741,28.7,15.54,10.54
0.15,840,84.0,5013,25.07,16.76,11.76
0.2,791,79.1,4498,22.49,17.59,12.59
0.25,735,73.5,4070,20.35,18.06,13.06
0.3,694,69.4,3787,18.93,18.33,13.33
0.35,655,65.5,3545,17.73,18.48,13.48
0.4,599,59.9,3227,16.14,18.56,13.56
0.45,557,55.7,3008,15.04,18.52,13.52
0.5,519,51.9,2842,14.21,18.26,13.26
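
One way to check "roughly the same" is against what class-conditional calibration is designed to deliver: for each label, the share of true cases covered should sit close to 1 - alpha. The sketch below compares the "% coverage (recall)" column of the label 0 table on the new data with that target. The values are copied from the table above, and the variable names are made up for this illustration.

```python
# (alpha, % coverage (recall)) for label 0 on the new data, copied from the table above.
label_0_recall = [
    (0.05, 94.67), (0.1, 89.71), (0.15, 84.62), (0.2, 80.15), (0.25, 74.78),
    (0.3, 70.23), (0.35, 65.39), (0.4, 60.22), (0.45, 54.27), (0.5, 49.54),
]

# Gap, in percentage points, between observed recall and the 1 - alpha target.
gaps = {
    alpha: round(recall - (1 - alpha) * 100, 2) for alpha, recall in label_0_recall
}
largest_gap = max(abs(gap) for gap in gaps.values())

for alpha, gap in gaps.items():
    print(f"alpha={alpha}: {gap:+.2f} pp")
```

For label 0 every gap stays within one percentage point of the target. Labels 1 and 2 have only 1,000 samples each in the new data, and their recall drifts further from 1 - alpha, as the other two tables show.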


## Random Forest Classifier

Of course the Random Forest Classifier performs way better than the Logistic Regression Classifier. It still performs very badly, but Conformal Prediction on top of it again gives much better control over the results.

But hey, you do lose something for label 0: there is no region with virtually 100% correct labels! Precision plateaus at 99.35% from alpha = 0.4 onward, while the Logistic Regression Classifier reached 99.89% at alpha = 0.5. So there are cases in which the no-skill Logistic Regression Classifier can be more useful than the smarter Random Forest Classifier, once Conformal Prediction is put on top of it.

In [6]:
conformal_random_forest_classifier = train_and_calibrate(
    RandomForestClassifier(random_state=0), *train_data
)
for label in (0, 1, 2):
    print(f"***** Results for label {label} *****")
    display(
        calculate_coverage(
            X_test,
            y_test,
            label,
            conformal_random_forest_classifier,
            alphas,
        )
    )

              precision    recall  f1-score   support

           0       0.93      0.96      0.95      3630
           1       0.50      0.49      0.50       164
           2       0.26      0.10      0.15       206

    accuracy                           0.90      4000
   macro avg       0.56      0.52      0.53      4000
weighted avg       0.88      0.90      0.89      4000

***** Results for label 0 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,3459,95.29,3693,92.33,93.66,2.91
0.1,3281,90.39,3450,86.25,95.1,4.35
0.15,3130,86.23,3257,81.42,96.1,5.35
0.2,2953,81.35,3039,75.97,97.17,6.42
0.25,2753,75.84,2811,70.28,97.94,7.19
0.3,2646,72.89,2686,67.15,98.51,7.76
0.35,2525,69.56,2554,63.85,98.86,8.11
0.4,2309,63.61,2324,58.1,99.35,8.6
0.45,2309,63.61,2324,58.1,99.35,8.6
0.5,2309,63.61,2324,58.1,99.35,8.6


***** Results for label 1 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,159,96.95,568,14.2,27.99,23.89
0.1,147,89.63,341,8.53,43.11,39.01
0.15,139,84.76,305,7.62,45.57,41.47
0.2,126,76.83,263,6.58,47.91,43.81
0.25,116,70.73,235,5.88,49.36,45.26
0.3,112,68.29,224,5.6,50.0,45.9
0.35,100,60.98,194,4.85,51.55,47.45
0.4,84,51.22,167,4.17,50.3,46.2
0.45,76,46.34,148,3.7,51.35,47.25
0.5,71,43.29,130,3.25,54.62,50.52


***** Results for label 2 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,206,100.0,4000,100.0,5.15,0.0
0.1,179,86.89,1035,25.88,17.29,12.14
0.15,171,83.01,929,23.23,18.41,13.26
0.2,155,75.24,816,20.4,19.0,13.85
0.25,150,72.82,762,19.05,19.69,14.54
0.3,139,67.48,655,16.38,21.22,16.07
0.35,133,64.56,602,15.05,22.09,16.94
0.4,115,55.83,503,12.57,22.86,17.71
0.45,101,49.03,419,10.47,24.11,18.96
0.5,94,45.63,387,9.68,24.29,19.14
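
The trade-off between the two models on label 0 can be made explicit by asking which alpha first reaches a target precision. The sketch below does that with the precision column of the two label 0 tables on the test split. The helper `smallest_alpha_reaching` is a name made up for this illustration, and only the alphas shown in the tables are considered.

```python
from typing import Optional


def smallest_alpha_reaching(
    rows: list[tuple[float, float]], target_precision: float
) -> Optional[float]:
    """Return the smallest alpha whose precision meets the target, or None."""
    for alpha, precision in sorted(rows):
        if precision >= target_precision:
            return alpha
    return None


# (alpha, precision %) for label 0 on the test split, copied from the tables.
logistic_regression = [
    (0.05, 90.84), (0.1, 92.35), (0.15, 94.56), (0.2, 95.56), (0.25, 97.15),
    (0.3, 98.07), (0.35, 99.13), (0.4, 99.5), (0.45, 99.8), (0.5, 99.89),
]
random_forest = [
    (0.05, 93.66), (0.1, 95.1), (0.15, 96.1), (0.2, 97.17), (0.25, 97.94),
    (0.3, 98.51), (0.35, 98.86), (0.4, 99.35), (0.45, 99.35), (0.5, 99.35),
]

for target in (95.0, 99.5):
    print(
        target,
        smallest_alpha_reaching(logistic_regression, target),
        smallest_alpha_reaching(random_forest, target),
    )
```

For a 95% target the random forest gets there at a smaller alpha, which means it keeps more of the true label 0 cases. For a 99.5% target only the logistic regression gets there at all within the alphas shown.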


### Predicting on new data

Same as with the logistic regression classifier, so we can see that conformal prediction helps in many scenarios, while confirming that we lost something for label 0 even on new data.

In [7]:
for label in (0, 1, 2):
    print(f"***** Results for label {label} *****")
    display(
        calculate_coverage(
            new_X,
            new_y,
            label,
            conformal_random_forest_classifier,
            alphas,
        )
    )

***** Results for label 0 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,17132,95.18,18313,91.56,93.55,3.55
0.1,16327,90.71,17150,85.75,95.2,5.2
0.15,15577,86.54,16151,80.75,96.45,6.45
0.2,14658,81.43,15013,75.06,97.64,7.64
0.25,13747,76.37,13957,69.78,98.5,8.5
0.3,13098,72.77,13242,66.21,98.91,8.91
0.35,12575,69.86,12689,63.45,99.1,9.1
0.4,11492,63.84,11570,57.85,99.33,9.33
0.45,11492,63.84,11570,57.85,99.33,9.33
0.5,11492,63.84,11570,57.85,99.33,9.33


***** Results for label 1 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,975,97.5,3131,15.65,31.14,26.14
0.1,929,92.9,1998,9.99,46.5,41.5
0.15,890,89.0,1782,8.91,49.94,44.94
0.2,813,81.3,1509,7.54,53.88,48.88
0.25,753,75.3,1330,6.65,56.62,51.62
0.3,731,73.1,1278,6.39,57.2,52.2
0.35,658,65.8,1111,5.55,59.23,54.23
0.4,611,61.1,1021,5.11,59.84,54.84
0.45,563,56.3,930,4.65,60.54,55.54
0.5,508,50.8,829,4.14,61.28,56.28


***** Results for label 2 *****


alpha,coverage,% coverage (recall),# sets containing target,% sets containing_target,% sets correctly covering target (precision),pp gain over random guessing
0.05,1000,100.0,20000,100.0,5.0,0.0
0.1,892,89.2,4971,24.86,17.94,12.94
0.15,869,86.9,4551,22.75,19.09,14.09
0.2,808,80.8,3909,19.55,20.67,15.67
0.25,779,77.9,3677,18.39,21.19,16.19
0.3,705,70.5,3155,15.78,22.35,17.35
0.35,676,67.6,2901,14.51,23.3,18.3
0.4,596,59.6,2424,12.12,24.59,19.59
0.45,521,52.1,2030,10.15,25.67,20.67
0.5,489,48.9,1861,9.3,26.28,21.28
