In this notebook, we use the UCI Heart dataset to learn a simple checklist that minimizes the false positive rate (FPR) subject to a constraint on the false negative rate (FNR).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

from IPChecklists.dataset import BinaryDataset

# if using CPLEX
from IPChecklists.model_cplex import ChecklistMIP
from IPChecklists.constraints_cplex import MaxNumFeatureConstraint, FNRConstraint, FPRConstraint

# if using Python-MIP
# from IPChecklists.model_pythonmip import ChecklistMIP
# from IPChecklists.constraints_pythonmip import MaxNumFeatureConstraint, FNRConstraint, FPRConstraint

Using CPLEX version 20.1.0.0


### 1. Load and process the dataset

In [2]:
df = pd.read_csv('./data/heart.csv')

# process feature columns
cont_cols = ['trestbps', 'chol', 'thalach', 'age', 'oldpeak']
for i in cont_cols:
    df[i] = df[i].astype(float)

cat_cols = ['cp', 'thal', 'ca', 'slope', 'restecg']
for i in cat_cols: # cast categorical columns as string for later type inference
    df[i] = df[i].astype(str)

df_train, df_test = train_test_split(df, test_size = 0.25, random_state = 42, stratify = df['target'])

In [3]:
df_train.head()

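|     | age  | sex | cp | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|------|-----|----|----------|-------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 66  | 51.0 | 1   | 2  | 100.0    | 222.0 | 0   | 1       | 143.0   | 1     | 1.2     | 1     | 0  | 2    | 1      |
| 260 | 66.0 | 0   | 0  | 178.0    | 228.0 | 1   | 1       | 165.0   | 1     | 1.0     | 1     | 2  | 3    | 0      |
| 289 | 55.0 | 0   | 0  | 128.0    | 205.0 | 0   | 2       | 130.0   | 1     | 2.0     | 1     | 1  | 3    | 0      |
| 237 | 60.0 | 1   | 0  | 140.0    | 293.0 | 0   | 0       | 170.0   | 0     | 1.2     | 1     | 2  | 3    | 0      |
| 144 | 76.0 | 0   | 2  | 140.0    | 197.0 | 0   | 2       | 116.0   | 0     | 1.1     | 1     | 0  | 2    | 1      |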


### 2. Binarize the dataset

In [4]:
train_ds = BinaryDataset(df_train,
                         target_name = 'target',  # column name of the target variable
                         pos_label = 1,  # target value treated as the positive class
                         col_subset = cont_cols + cat_cols  # use only these columns for modelling
                        )

INFO:root:Removed 2 non-informative columns: {'oldpeak<~0.0', 'oldpeak>=0.0'}
INFO:root:Binary dataframe: 66 binary features and 227 samples


In [5]:
# binarized features
train_ds.binarized_df.columns

Index(['trestbps>=120.0', 'trestbps<~120.0', 'trestbps>=130.0',
       'trestbps<~130.0', 'trestbps>=140.0', 'trestbps<~140.0', 'chol>=211.0',
       'chol<~211.0', 'chol>=240.0', 'chol<~240.0', 'chol>=270.5',
       'chol<~270.5', 'thalach>=136.5', 'thalach<~136.5', 'thalach>=152.0',
       'thalach<~152.0', 'thalach>=166.0', 'thalach<~166.0', 'age>=47.0',
       'age<~47.0', 'age>=55.0', 'age<~55.0', 'age>=61.0', 'age<~61.0',
       'oldpeak>=0.8', 'oldpeak<~0.8', 'oldpeak>=1.8', 'oldpeak<~1.8', 'cp==2',
       'cp!=2', 'cp==0', 'cp!=0', 'cp==1', 'cp!=1', 'cp==3', 'cp!=3',
       'thal==2', 'thal!=2', 'thal==3', 'thal!=3', 'thal==1', 'thal!=1',
       'thal==0', 'thal!=0', 'ca==0', 'ca!=0', 'ca==2', 'ca!=2', 'ca==1',
       'ca!=1', 'ca==3', 'ca!=3', 'ca==4', 'ca!=4', 'slope==1', 'slope!=1',
       'slope==2', 'slope!=2', 'slope==0', 'slope!=0', 'restecg==1',
       'restecg!=1', 'restecg==2', 'restecg!=2', 'restecg==0', 'restecg!=0'],
      dtype='object')

In [6]:
test_ds = train_ds.apply_transform(df_test) # binarize the test set using the same thresholds
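
The idea of reusing training thresholds on the test set can be sketched in plain pandas. This is only an illustration, not how IPChecklists works internally: the helpers `fit_thresholds` and `binarize` are hypothetical, the quantile-based cut points are an assumption, and the complement column uses a plain `<` rather than the library's `<~` naming.

```python
import pandas as pd

def fit_thresholds(train_col, quantiles=(0.25, 0.5, 0.75)):
    """Hypothetical helper: choose cut points from the training column only."""
    return sorted({float(train_col.quantile(q)) for q in quantiles})

def binarize(col, thresholds):
    """Hypothetical helper: one '>=' indicator and its complement per threshold."""
    out = {}
    for t in thresholds:
        out[f"{col.name}>={t}"] = (col >= t).astype(int)
        out[f"{col.name}<{t}"] = (col < t).astype(int)
    return pd.DataFrame(out, index=col.index)

# Toy values for illustration, not taken from heart.csv.
train = pd.Series([0.0, 0.8, 1.2, 1.8, 2.5], name="oldpeak")
test = pd.Series([0.5, 3.0], name="oldpeak")

thresholds = fit_thresholds(train)      # learned on the training split only
train_bin = binarize(train, thresholds)
test_bin = binarize(test, thresholds)   # same cut points, so same columns
```

Because the cut points come from the training split alone, both binarized frames share the same columns and no information from the test set leaks into the features.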

### 3. Create a MIP

Here, we minimize the FPR subject to an FNR constraint. The FNR constraint is required, because the model could otherwise obtain 0% FPR by only making negative predictions.

Alternatively, we could have set cost_func = '01' (i.e., maximize accuracy), in which case no performance constraints would be needed.
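
To see why the FNR constraint matters, consider a classifier that always predicts negative. The following is a small sketch with made-up labels; `fnr_fpr` is a hypothetical helper and not part of IPChecklists.

```python
import numpy as np

def fnr_fpr(y_true, y_pred):
    """Hypothetical helper: return (FNR, FPR) for binary labels and predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    fn = np.sum(y_true & ~y_pred)   # positives predicted as negative
    fp = np.sum(~y_true & y_pred)   # negatives predicted as positive
    return float(fn / np.sum(y_true)), float(fp / np.sum(~y_true))

y_true = np.array([1, 1, 1, 0, 0])      # toy labels
all_negative = np.zeros_like(y_true)    # never predicts the positive class

fnr, fpr = fnr_fpr(y_true, all_negative)
print(f"FNR = {fnr:.2f}, FPR = {fpr:.2f}")
```

The trivial classifier reaches an FPR of 0 only by missing every positive case (FNR of 1), which is exactly the solution the FNR constraint rules out.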

In [7]:
model = ChecklistMIP(train_ds, cost_func = 'FPR') 

INFO:root:Before compression: 227 rows
INFO:root:After compression: 223 rows


### 4. Build the MIP and add constraints

In [8]:
model.add_constraint(FNRConstraint(0.1)) # FNR <= 10%
model.build_problem(N_constraint = MaxNumFeatureConstraint('<=', 5)) # use at most 5 features

### 5. Solve the MIP

In [9]:
stats = model.solve(max_seconds=60, display_progress=False) # a longer time limit may yield a better solution

Advanced basis not built.


Found solution with objective 1770.502211784141 and optimality gap 62.61%.


### 6. Create a "checklist" from the MIP

In [10]:
check = model.to_checklist()

In [11]:
check

oldpeak<~0.8
cp!=0
thal==2
ca==0
slope!=1

M = 3.0, N = 5.0
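
Assuming the usual reading of an M-of-N checklist (predict positive when at least M of the N items are checked), the rule above can be applied by hand. The sketch below uses two made-up, already binarized patients and does not call the library's own prediction code.

```python
import pandas as pd

# Items of the learned checklist, copied from the output above.
items = ["oldpeak<~0.8", "cp!=0", "thal==2", "ca==0", "slope!=1"]
M = 3  # minimum number of checked items for a positive prediction (assumed rule)

# Two made-up patients; 1 means the item is checked.
patients = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [0, 1, 0, 0, 1]],
    columns=items,
)

checked = patients[items].sum(axis=1)     # number of checked items per patient
prediction = (checked >= M).astype(int)   # 1 = positive, 0 = negative
print(pd.DataFrame({"checked": checked, "prediction": prediction}))
```

Under this assumption, the first patient (three items checked) is predicted positive and the second (two items checked) is predicted negative.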

### 7. Examine various metrics

In [12]:
# training set performance; note that the FNR (about 8.9%) satisfies the 10% constraint
check.get_metrics(train_ds)

{'accuracy': 0.8766519823788547,
 'n_samples': 227,
 'TN': 86,
 'FN': 11,
 'TP': 113,
 'FP': 17,
 'error': 28,
 'TPR': 0.9112903225806451,
 'FNR': 0.08870967741935484,
 'FPR': 0.1650485436893204,
 'TNR': 0.8349514563106796,
 'precision': 0.8692307692307693,
 'pred_prevalence': 0.5726872246696035,
 'prevalence': 0.5462555066079295}

In [13]:
# test set performance
check.get_metrics(test_ds)

{'accuracy': 0.7894736842105263,
 'n_samples': 76,
 'TN': 26,
 'FN': 7,
 'TP': 34,
 'FP': 9,
 'error': 16,
 'TPR': 0.8292682926829268,
 'FNR': 0.17073170731707318,
 'FPR': 0.2571428571428571,
 'TNR': 0.7428571428571429,
 'precision': 0.7906976744186046,
 'pred_prevalence': 0.5657894736842105,
 'prevalence': 0.5394736842105263}
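
As a sanity check, the headline rates can be recomputed from the confusion counts reported above. This is a minimal sketch using the standard definitions; `rates` is a hypothetical helper rather than a library function.

```python
def rates(tn, fn, tp, fp):
    """Hypothetical helper: standard rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tn + fn + tp + fp),
        "FNR": fn / (fn + tp),        # missed positives / all positives
        "FPR": fp / (fp + tn),        # false alarms / all negatives
        "precision": tp / (tp + fp),
    }

# Counts copied from the get_metrics outputs above.
train = rates(tn=86, fn=11, tp=113, fp=17)
test = rates(tn=26, fn=7, tp=34, fp=9)

for name, r in (("train", train), ("test", test)):
    print(name, {k: round(v, 4) for k, v in r.items()})
```

On these counts the training FNR is about 8.9%, within the 10% limit, while the test FNR is about 17.1%. The constraint is enforced only on the training data, so it is not guaranteed to hold on held-out samples.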