Copyright (c) 2020. Cognitive Scale Inc. All rights reserved.
Licensed under CognitiveScale Example Code [License](https://github.com/CognitiveScale/cortex-certifai-examples/blob/master/LICENSE.md)


# Alternate formulations for regression 

In this notebook we'll show how to change what counts as a 'significant change', that is, how far a prediction must move before the outcome is considered different from (counterfactual to) the original data point's prediction.

We'll base it on the primary [Regression example notebook](./Regression.ipynb) and use the same datasets and models as that example.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from sklearn.model_selection import train_test_split
import numpy as np
import random
import pprint

from sklearn.svm import SVR, LinearSVR
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score

from certifai.common.utils.encoding import CatEncoder
from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiPredictorWrapper, CertifaiModel, CertifaiModelMetric,
                                      CertifaiDataset, CertifaiGroupingFeature, CertifaiDatasetSource,
                                      CertifaiPredictionTask, CertifaiTaskOutcomes, CertifaiOutcomeValue)
from certifai.scanner.report_utils import scores, construct_scores_dataframe
from certifai.scanner.explanation_utils import explanations, construct_explanations_dataframe, counterfactual_changes

In [2]:
# Prepare datasets for test/train split
base_path = '..'
all_data_file = f"{base_path}/datasets/auto_insurance_claims_dataset.csv"
explanation_data_file = f"{base_path}/datasets/auto_insurance_explan.csv"
RANDOM_SEED = 42

df = pd.read_csv(all_data_file)

cat_columns = [
    'State Code',
    'Coverage',
    'Education',
    'EmploymentStatus',
    'Gender',
    'Location Code',
    'Marital Status',
    'Policy',
    'Claim Reason',
    'Sales Channel',
    'Vehicle Class',
    'Vehicle Size',
]
label_column = "Total Claim Amount"


Y = df[label_column]
X = df.drop(label_column, axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=RANDOM_SEED)

encoder = CatEncoder(cat_columns, X)

# Train models

def build_model(data, name, model_family, test=None):
    if test is None:
        test = data
        
    if model_family == 'SVM':
        parameters = {'C':[0.1, .5, 1, 2, 4, 10], 'epsilon':[0, 0.01, 0.1, 0.5, 1, 2, 4]}
        m = LinearSVR()
    elif model_family == 'Lasso':
        parameters = {'alpha': [0.001,0.01,.1]}
        m = Lasso()
    model = GridSearchCV(m, parameters, cv=3)
    model.fit(data[0], data[1])

    r2 = r2_score(test[1], model.predict(test[0]))
    print(f"{name} R-Squared: {r2}")
    return model

linl1_model = build_model((encoder(X_train.values), Y_train),
                          "LinL1",
                          "Lasso",
                          (encoder(X_test.values), Y_test))


svm_model = build_model((encoder(X_train.values), Y_train),
                          "SVM",
                          "SVM",
                          (encoder(X_test.values), Y_test))

Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.620e+06, tolerance: 4.158e+04
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 6.310e+06, tolerance: 4.165e+04
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.136e+07, tolerance: 4.179e+04
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 7.387e+05, tolerance: 4.158e+04
Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 5.996e+05, tolerance: 4.165e+04
Objective did n

LinL1 R-Squared: 0.763031334080441
SVM R-Squared: 0.7458998551454252


In [3]:
# Wrap the models for use by Certifai as a local model
linl1_model_proxy = CertifaiPredictorWrapper(linl1_model, encoder=encoder)
svm_model_proxy = CertifaiPredictorWrapper(svm_model, encoder=encoder)

In [4]:
# Make a scan object given the task definition (which encompasses the specifics of the formulation desired)
def make_scan(task):
    # Create the scan object from scratch using the ScanBuilder class
    scan = CertifaiScanBuilder.create('regression_test_use_case',
                                      prediction_task=task)

    # Add our local models
    first_model = CertifaiModel('LinL1', local_predictor=linl1_model_proxy)
    scan.add_model(first_model)

    second_model = CertifaiModel('SVM', local_predictor=svm_model_proxy)
    scan.add_model(second_model)


    # Add datasets to the scan
    eval_dataset = CertifaiDataset('evaluation', CertifaiDatasetSource.csv(all_data_file))
    scan.add_dataset(eval_dataset)
    scan.evaluation_dataset_id = eval_dataset.id

    # For the sake of illustration we'll just extract explanations for a few examples
    NUM_EXPLANATIONS = 5
    explan_df = pd.read_csv(explanation_data_file)[:NUM_EXPLANATIONS]
    explan_dataset = CertifaiDataset('explanation', CertifaiDatasetSource.dataframe(explan_df))
    scan.add_dataset(explan_dataset)
    scan.explanation_dataset_id = explan_dataset.id

    # Because the datasets contain a ground truth outcome column which the model does not
    # expect to receive as input we need to state that in the dataset schema (since it cannot
    # be inferred from the CSV)
    scan.dataset_schema.outcome_feature_name = 'Total Claim Amount'

    # Setup an evaluation that just produces explanations in the interests of keeping this example
    # simpler
    scan.add_evaluation_type('explanation')
    
    return scan

# Set up absolute-threshold formulation

Here we'll define settlement values above \$500 as being favorable. That is, all predictions above \$500
will now be considered favorable (an absolute, fixed threshold) rather than those that are 0.5 standard deviations above the original prediction (which is what the baseline example did).
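
To make the distinction concrete, here is a minimal, library-independent sketch of the two notions of a 'different outcome'. It is not Certifai's internal implementation: the function names, the strict `>` comparison, and the standard deviation value are assumptions made for illustration. The two predictions are taken from the first LinL1 explanation shown later in this notebook.

```python
def is_favorable(prediction, threshold=500.0):
    """Absolute formulation: favorable means above a fixed threshold.
    ASSUMPTION: a strict comparison is used here for illustration."""
    return prediction > threshold


def changed_absolute(original, candidate, threshold=500.0):
    """A candidate is a different outcome only if it lands on the other
    side of the fixed threshold from the original prediction."""
    return is_favorable(original, threshold) != is_favorable(candidate, threshold)


def changed_relative(original, candidate, std_dev, fraction=0.5):
    """Relative formulation (as described for the baseline example): the
    prediction must move by at least `fraction` standard deviations."""
    return abs(candidate - original) >= fraction * std_dev


# Predictions from the LinL1 results for the first explanation row.
original, candidate = 539.669717, 499.09581

# ASSUMPTION: 290.0 is a made-up standard deviation, used only for illustration.
print(changed_absolute(original, candidate))         # does the move cross $500?
print(changed_relative(original, candidate, 290.0))  # is a ~$40 move large enough?
```

Under the absolute formulation a small move can count as a counterfactual as long as it crosses the threshold, while a large move that stays on the same side does not.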

In [5]:
# Set the favorable direction to be increasing, and consider
# predictions above a fixed absolute threshold of $500 to be favorable
task = CertifaiPredictionTask(CertifaiTaskOutcomes.regression(True, absolute_threshold=500),
                              prediction_description='Amount of Settled Claim')

In [6]:
# Create the scan object for this formulation
scan = make_scan(task)

In [7]:
# Run the scan.
# By default this will write the results into individual report files (one per model and evaluation
# type) in the 'reports' directory relative to this notebook. This may be disabled by specifying
# `write_reports=False` as below
# The result is a dictionary of dictionaries of reports.  The top level dict key is the evaluation type
# and the second level key is model id.
results = scan.run(write_reports=False)

Starting scan with model_use_case_id: 'regression_test_use_case' and scan_id: '6c04926b600a'
[--------------------] 2023-01-05 12:14:36.685876 - 0 of 2 reports (0.0% complete) - Running explanation evaluation for model: LinL1
[##########----------] 2023-01-05 12:14:50.424096 - 1 of 2 reports (50.0% complete) - Running explanation evaluation for model: SVM
[####################] 2023-01-05 12:15:04.573167 - 2 of 2 reports (100.0% complete) - Completed all evaluations
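
The two-level layout of the results described in the comments above can be walked with ordinary dictionary iteration. The sketch below uses a stand-in dictionary with the same shape; the report contents and the helper name `list_reports` are hypothetical and only illustrate the nesting.

```python
# ASSUMPTION: a stand-in with the two-level shape described above;
# real report contents are richer and are not reproduced here.
example_results = {
    "explanation": {
        "LinL1": {"placeholder": "report for LinL1"},
        "SVM": {"placeholder": "report for SVM"},
    }
}


def list_reports(results):
    """Return (evaluation_type, model_id) pairs found in a nested results dict."""
    return [
        (evaluation_type, model_id)
        for evaluation_type, per_model in results.items()
        for model_id in per_model
    ]


for evaluation_type, model_id in list_reports(example_results):
    print(f"{evaluation_type}: {model_id}")
```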


In [8]:
# Using Certifai's explanation utilities we can programmatically explore counterfactuals produced
# during the explanation evaluation. Below we examine only a single explanation for each model
# by displaying the original input data followed by what features were changed by each
# counterfactual. Note that for this regression use case a counterfactual may be produced in the
# favorable direction (increasing) or in the unfavorable direction (decreasing)
linl1_explanations = construct_explanations_dataframe(explanations(results, model_id='LinL1'))
svm_explanations = construct_explanations_dataframe(explanations(results, model_id='SVM'))

def display_explanation(df):
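    # Show the original row first, then only the features each counterfactual changed.
    # (The earlier `style.hide_index()` call was dropped: its result was discarded,
    # and the method is no longer available in recent pandas releases.)
    df_original = df[df['instance']=='original']
    display(df_original)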
    changes = counterfactual_changes(df)
    display(changes)
    if len(changes) < 2:
        print("No counterfactual found")
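
As a rough illustration of what a table of counterfactual changes conveys, the sketch below compares an original row with a counterfactual row and keeps only the features that differ. It is not the implementation of `counterfactual_changes`; the helper name `changed_features` is hypothetical, and the values are a subset of the columns reported for LinL1 explanation row 3 in the output that follows.

```python
def changed_features(original, counterfactual):
    """Return {feature: (old, new)} for every feature whose value differs.
    Assumes both rows contain the same feature names."""
    return {
        name: (original[name], counterfactual[name])
        for name in original
        if counterfactual[name] != original[name]
    }


# A subset of the feature values for LinL1 explanation row 3.
original = {"State Code": "IA", "Location Code": "Rural", "Monthly Premium Auto": 100}
counterfactual = {"State Code": "IA", "Location Code": "Suburban", "Monthly Premium Auto": 101}

print(changed_features(original, counterfactual))
```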

In [9]:
print("Explanations for LinL1 Model (settlements over $500 favorable):\n")
for idx in linl1_explanations['row'].unique():
    display_explanation(linl1_explanations[linl1_explanations['row'] == idx]) 

Explanations for LinL1 Model (settlements over $500 favorable):



Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
0,LinL1,1,original,0,original prediction,539.669717,0.0,,NE,387.364705,...,105,18,50,0,1,Personal L3,Hail,Agent,Sports Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,539.669717,0.0,105
0,counterfactual,1,prediction decreased,499.09581,2.25,97


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
2,LinL1,2,original,0,original prediction,330.827029,0.0,,IA,1212.836511,...,61,7,36,0,2,Personal L3,Hail,Agent,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,330.827029,0.0,61
0,counterfactual,1,prediction increased,503.266133,0.529412,95


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
4,LinL1,3,original,0,original prediction,119.084194,0.0,,IA,800.054506,...,100,23,19,1,3,Personal L3,Other,Agent,Sports Car,Small


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,119.084194,0.0,Rural,100
0,counterfactual,1,prediction increased,500.42205,0.947368,Suburban,101


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
6,LinL1,4,original,0,original prediction,-22.158995,0.0,,MO,795.615006,...,67,25,41,1,2,Personal L3,Collision,Web,Two-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,-22.158995,0.0,Rural,67
0,counterfactual,1,prediction increased,501.187536,0.382979,Suburban,96


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
8,LinL1,5,original,0,original prediction,537.893339,0.0,,NE,234.75918,...,69,14,10,1,1,Personal L3,Collision,Branch,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,537.893339,0.0,69
0,counterfactual,1,prediction decreased,497.343602,2.25,61


In [10]:
print("Explanations for SVM Model (settlements over $500 favorable):\n")
for idx in svm_explanations['row'].unique():
    display_explanation(svm_explanations[svm_explanations['row'] == idx]) 

Explanations for SVM Model (settlements over $500 favorable):



Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
0,SVM,1,original,0,original prediction,502.018625,0.0,,NE,387.364705,...,105,18,50,0,1,Personal L3,Hail,Agent,Sports Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,502.018625,0.0,105
0,counterfactual,1,prediction decreased,497.196141,18.0,104


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
2,SVM,2,original,0,original prediction,299.891718,0.0,,IA,1212.836511,...,61,7,36,0,2,Personal L3,Hail,Agent,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,299.891718,0.0,61
0,counterfactual,1,prediction increased,502.436064,0.428571,103


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
4,SVM,3,original,0,original prediction,161.965253,0.0,,IA,800.054506,...,100,23,19,1,3,Personal L3,Other,Agent,Sports Car,Small


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,161.965253,0.0,Rural,100
0,counterfactual,1,prediction increased,504.647543,0.75,Suburban,106


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
6,SVM,4,original,0,original prediction,18.315705,0.0,,MO,795.615006,...,67,25,41,1,2,Personal L3,Collision,Web,Two-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,18.315705,0.0,Rural,67
0,counterfactual,1,prediction increased,500.850043,0.339623,Suburban,102


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
8,SVM,5,original,0,original prediction,488.576211,0.0,,NE,234.75918,...,69,14,10,1,1,Personal L3,Collision,Branch,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,488.576211,0.0,69
0,counterfactual,1,prediction increased,503.043664,6.0,72


# Set up percentile threshold formulation

Similarly, we can define a fixed threshold in terms of a percentile of the evaluation dataset's predicted values.
In this case, let's define 'favorable' as the top 20% of the distribution, that is, predictions above the 80th percentile.
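
A percentile threshold is derived from the predictions themselves rather than supplied up front. The sketch below shows one common way to compute it (linear interpolation between order statistics); whether Certifai uses this exact interpolation rule is an assumption, and the prediction values are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# ASSUMPTION: hypothetical predictions, not taken from the dataset.
predictions = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
threshold = percentile(predictions, 80)
favorable = [p for p in predictions if p > threshold]
print(threshold, favorable)
```

Because the threshold depends on the distribution of predictions, two models evaluated on the same data will generally end up with different thresholds.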

In [11]:
# Set the favorable direction to be increasing, and consider
# predictions above the 80th percentile of the predicted values to be favorable
task = CertifaiPredictionTask(CertifaiTaskOutcomes.regression(True, absolute_percentile=80),
                              prediction_description='Amount of Settled Claim')

In [12]:
# Create the scan object for this formulation
scan = make_scan(task)

In [13]:
results = scan.run(write_reports=False)

linl1_explanations = construct_explanations_dataframe(explanations(results, model_id='LinL1'))
svm_explanations = construct_explanations_dataframe(explanations(results, model_id='SVM'))

[--------------------] 2023-01-05 12:15:04.965269 - 0 of 2 reports (0.0% complete) - Starting scan with model_use_case_id: 'regression_test_use_case' and scan_id: 'd82ffd9ddb00'
[--------------------] 2023-01-05 12:15:04.965424 - 0 of 2 reports (0.0% complete) - Running explanation evaluation for model: LinL1
[##########----------] 2023-01-05 12:15:18.824522 - 1 of 2 reports (50.0% complete) - Running explanation evaluation for model: SVM
[####################] 2023-01-05 12:15:32.539990 - 2 of 2 reports (100.0% complete) - Completed all evaluations


In [14]:
print("Explanations for LinL1 Model (80th percentile favorable):\n")
for idx in linl1_explanations['row'].unique():
    display_explanation(linl1_explanations[linl1_explanations['row'] == idx]) 

Explanations for LinL1 Model (80th percentile favorable):



Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
0,LinL1,1,original,0,original prediction,539.669717,0.0,,NE,387.364705,...,105,18,50,0,1,Personal L3,Hail,Agent,Sports Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,539.669717,0.0,105
0,counterfactual,1,prediction increased,610.674054,1.285714,119


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
2,LinL1,2,original,0,original prediction,330.827029,0.0,,IA,1212.836511,...,61,7,36,0,2,Personal L3,Hail,Agent,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,330.827029,0.0,61
0,counterfactual,1,prediction increased,609.772639,0.327273,116


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
4,LinL1,3,original,0,original prediction,119.084194,0.0,,IA,800.054506,...,100,23,19,1,3,Personal L3,Other,Agent,Sports Car,Small


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,119.084194,0.0,Rural,100
0,counterfactual,1,prediction increased,612.000294,0.439024,Suburban,123


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
6,LinL1,4,original,0,original prediction,-22.158995,0.0,,MO,795.615006,...,67,25,41,1,2,Personal L3,Collision,Web,Two-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,-22.158995,0.0,Rural,67
0,counterfactual,1,prediction increased,607.694041,0.264706,Suburban,117


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
8,LinL1,5,original,0,original prediction,537.893339,0.0,,NE,234.75918,...,69,14,10,1,1,Personal L3,Collision,Branch,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,537.893339,0.0,69
0,counterfactual,1,prediction increased,608.897676,1.285714,83


In [15]:
print("Explanations for SVM Model (80th percentile favorable):\n")
for idx in svm_explanations['row'].unique():
    display_explanation(svm_explanations[svm_explanations['row'] == idx]) 

Explanations for SVM Model (80th percentile favorable):



Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
0,SVM,1,original,0,original prediction,502.018625,0.0,,NE,387.364705,...,105,18,50,0,1,Personal L3,Hail,Agent,Sports Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,502.018625,0.0,105
0,counterfactual,1,prediction increased,564.710922,1.384615,118


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
2,SVM,2,original,0,original prediction,299.891718,0.0,,IA,1212.836511,...,61,7,36,0,2,Personal L3,Hail,Agent,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,299.891718,0.0,61
0,counterfactual,1,prediction increased,560.305877,0.333333,115


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
4,SVM,3,original,0,original prediction,161.965253,0.0,,IA,800.054506,...,100,23,19,1,3,Personal L3,Other,Agent,Sports Car,Small


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,161.965253,0.0,Rural,100
0,counterfactual,1,prediction increased,562.517356,0.5,Suburban,118


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
6,SVM,4,original,0,original prediction,18.315705,0.0,,MO,795.615006,...,67,25,41,1,2,Personal L3,Collision,Web,Two-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Location Code,Monthly Premium Auto
0,original,0,original prediction,18.315705,0.0,Rural,67
0,counterfactual,1,prediction increased,563.542341,0.272727,Suburban,115


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,contribution,State Code,Claim Amount,...,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,Number of Open Complaints,Number of Policies,Policy,Claim Reason,Sales Channel,Vehicle Class,Vehicle Size
8,SVM,5,original,0,original prediction,488.576211,0.0,,NE,234.75918,...,69,14,10,1,1,Personal L3,Collision,Branch,Four-Door Car,Medsize


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,Monthly Premium Auto
0,original,0,original prediction,488.576211,0.0,69
0,counterfactual,1,prediction increased,560.913477,1.2,84


# Notes

It is apparent from the last formulation that using a percentile of the outcome distribution is *not* the
same as specifying an a-priori fixed value as the favorability threshold. Comparing the results of the two models shows this: the SVM model had an 80th percentile at about \$560, whereas the LinL1 (Lasso) model had one at a little over \$600. It is also notable that the LinL1 model predicts some negative values, which are clearly not realistic. Both observations suggest that the LinL1 model (which has a linear response surface) is insufficiently complex to capture the data behaviour at the extremes. This is despite its R-squared (0.763) actually being better than that of the SVM (0.746).

For comparing different models, one would therefore typically prefer to specify a fixed threshold informed by domain knowledge rather than an outcome percentile, though an outcome-percentile formulation can be useful, especially during initial exploration of a use case.
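
The following sketch illustrates why percentile thresholds make cross-model comparison awkward. The two prediction lists are hypothetical (they are not outputs of the models in this notebook), and `percentile_80` is an illustrative helper built on the standard library.

```python
import statistics


def percentile_80(predictions):
    """80th percentile: the last of the four quintile cut points."""
    return statistics.quantiles(predictions, n=5, method="inclusive")[-1]


# ASSUMPTION: hypothetical predictions from two models on the same six rows.
model_a = [0.0, 200.0, 400.0, 600.0, 650.0, 700.0]
model_b = [50.0, 200.0, 350.0, 500.0, 560.0, 620.0]

prediction = 600.0
for name, predictions in (("A", model_a), ("B", model_b)):
    by_percentile = prediction > percentile_80(predictions)
    by_fixed = prediction > 500.0
    print(f"model {name}: percentile-favorable={by_percentile}, fixed-favorable={by_fixed}")
```

The same \$600 prediction is judged differently against each model's own percentile threshold, whereas the fixed \$500 threshold treats it identically for both.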