# Introduction

This is the first notebook in this example of how to explain models using Certifai.

In this notebook, we will:
1. Perform some preprocessing of the dataset, and save the processed dataset so it can be used in the second notebook when explaining the models.
2. Save some additional metadata about the dataset so that it can be used by Certifai.
3. Train two scikit-learn models and save them for use in the [second notebook](patient-readmission-explain-scan.ipynb).


This example uses the Kaggle dataset [Diabetes 130 US hospitals for years 1999-2008](https://www.kaggle.com/brandao/diabetes), where the task is to predict whether a patient will be readmitted to hospital after being discharged. Please refer to Kaggle for more details.


In [1]:
import numpy as np
import pandas as pd
import time, pickle
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score


# Data Preprocessing
This section performs some minimal data preprocessing. 

In [2]:
df = pd.read_csv('diabetic_data.csv')
df.replace('?',np.nan,inplace=True)
df.head()

| | encounter_id | patient_nbr | race | gender | age | weight | admission_type_id | discharge_disposition_id | admission_source_id | time_in_hospital | ... | citoglipton | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2278392 | 8222157 | Caucasian | Female | [0-10) | NaN | 6 | 25 | 1 | 1 | ... | No | No | No | No | No | No | No | No | No | NO |
| 1 | 149190 | 55629189 | Caucasian | Female | [10-20) | NaN | 1 | 1 | 7 | 3 | ... | No | Up | No | No | No | No | No | Ch | Yes | >30 |
| 2 | 64410 | 86047875 | AfricanAmerican | Female | [20-30) | NaN | 1 | 1 | 7 | 2 | ... | No | No | No | No | No | No | No | No | Yes | NO |
| 3 | 500364 | 82442376 | Caucasian | Male | [30-40) | NaN | 1 | 1 | 7 | 2 | ... | No | Up | No | No | No | No | No | Ch | Yes | NO |
| 4 | 16680 | 42519267 | Caucasian | Male | [40-50) | NaN | 1 | 1 | 7 | 1 | ... | No | Steady | No | No | No | No | No | Ch | Yes | NO |


Drop columns that are known to offer little information to the model: identifiers and columns with many missing values. Then drop every column that contains only a single value.

In [3]:
# IDs and known low-info columns
df.drop(['weight','medical_specialty','payer_code', 'encounter_id','patient_nbr','admission_type_id',
         'discharge_disposition_id','admission_source_id'], axis=1, inplace=True)

# dropping columns with no variation
invariate_cols = [c for c in df.columns if len(df[c].unique()) < 2]
for c in invariate_cols:
    print(f"{c}")
df.drop(columns=invariate_cols, inplace=True)
df.columns

examide
citoglipton


Index(['race', 'gender', 'age', 'time_in_hospital', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'diag_1', 'diag_2', 'diag_3',
       'number_diagnoses', 'max_glu_serum', 'A1Cresult', 'metformin',
       'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
       'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'insulin', 'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmitted'],
      dtype='object')
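
The column screening described above can be sketched on a toy frame. The DataFrame contents and the 50% missing-value threshold below are hypothetical and chosen only for illustration; the notebook itself drops a hand-picked list of columns rather than applying a threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; '?' marks a missing value, as in the raw CSV.
toy = pd.DataFrame({
    "weight": ["?", "?", "?", "[75-100)"],
    "examide": ["No", "No", "No", "No"],
    "insulin": ["No", "Up", "Steady", "No"],
})
toy = toy.replace("?", np.nan)

# Fraction of missing values per column.
missing_fraction = toy.isna().mean()

# Assumed threshold: flag columns where more than half the values are missing.
mostly_missing = missing_fraction[missing_fraction > 0.5].index.tolist()

# Columns with fewer than two distinct values (NaN counts as a value here,
# matching the len(df[c].unique()) < 2 test used in the notebook).
constant_cols = [c for c in toy.columns if len(toy[c].unique()) < 2]

print("Mostly missing:", mostly_missing)
print("Constant:", constant_cols)
```

A threshold like this is only a starting point; whether a sparse column is worth keeping depends on the modelling task.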

Based on Table 2 in [Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records](https://www.hindawi.com/journals/bmri/2014/781670/), reduce the large number of diagnostic codes to a small number of condition group names.

In [4]:
diag_cols = ['diag_1','diag_2','diag_3']
for col in diag_cols:
    # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas versions
    df[col] = df[col].fillna('0')
    # Anything with E or V will get mapped into 'Other' group, so set as '0'
    df.loc[df[col].str.contains('E'), col] = '0'
    df.loc[df[col].str.contains('V'), col] = '0'
    # Any '250.xx' will be mapped to Diabetes
    df.loc[df[col].str.contains('250'), col] = '250'

df[diag_cols] = df[diag_cols].astype(float)

# diagnosis grouping
for col in diag_cols:
    df['temp']='Other'
    
    condition = (df[col]>=390) & (df[col]<=458) | (df[col]==785)
    df.loc[condition,'temp']='Circulatory'
    
    condition = (df[col]>=460) & (df[col]<=519) | (df[col]==786)
    df.loc[condition,'temp']='Respiratory'
    
    condition = (df[col]>=520) & (df[col]<=579) | (df[col]==787)
    df.loc[condition,'temp']='Digestive'
    
    condition = df[col]==250
    df.loc[condition,'temp']='Diabetes'
    
    condition = (df[col]>=800) & (df[col]<=999)
    df.loc[condition,'temp']='Injury'
    
    condition = (df[col]>=710) & (df[col]<=739)
    df.loc[condition,'temp']='Muscoloskeletal'
   
    condition = (df[col]>=580) & (df[col]<=629) | (df[col]==788)
    df.loc[condition,'temp']='Genitourinary'
    
    condition = (df[col]>=140) & (df[col]<=239)
    df.loc[condition,'temp']='Neoplasms'
    
    df.loc[df[col].isnull(),'temp']='Unknown'
    df[col]=df['temp']
    df.drop('temp',axis=1,inplace=True)

df[diag_cols]

| | diag_1 | diag_2 | diag_3 |
|---|---|---|---|
| 0 | Diabetes | Other | Other |
| 1 | Other | Diabetes | Other |
| 2 | Other | Diabetes | Other |
| 3 | Other | Diabetes | Circulatory |
| 4 | Neoplasms | Neoplasms | Diabetes |
| ... | ... | ... | ... |
| 101761 | Diabetes | Other | Circulatory |
| 101762 | Digestive | Other | Digestive |
| 101763 | Other | Genitourinary | Other |
| 101764 | Injury | Other | Injury |
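
The same grouping can be expressed as a single function applied to one code at a time, which makes the rules easier to read and test. This is a minimal sketch, not the notebook's implementation: `icd9_group` is a hypothetical helper, and it uses a prefix test for `250.xx` codes where the cell above uses a substring test.

```python
def icd9_group(code):
    """Map a raw ICD-9 code string to a condition group (illustrative helper)."""
    # Missing values, E-codes and V-codes all end up in 'Other'.
    if code is None or "E" in code or "V" in code:
        return "Other"
    # Any '250.xx' code is a diabetes diagnosis.
    if code.startswith("250"):
        return "Diabetes"
    value = float(code)
    # (group name, range start, range end, optional extra single code)
    # Group spellings are kept exactly as in the notebook so values match.
    groups = [
        ("Circulatory", 390, 458, 785),
        ("Respiratory", 460, 519, 786),
        ("Digestive", 520, 579, 787),
        ("Injury", 800, 999, None),
        ("Muscoloskeletal", 710, 739, None),
        ("Genitourinary", 580, 629, 788),
        ("Neoplasms", 140, 239, None),
    ]
    for name, start, end, extra in groups:
        if start <= value <= end or value == extra:
            return name
    return "Other"


for code in ["250.83", "428", "786", "V45", None]:
    print(code, "->", icd9_group(code))
```

With a helper like this, each diagnosis column could be converted with `df[col].map(icd9_group)`, provided missing values are passed as `None` or filled beforehand.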


Encode the outcome column so that 0 means 'not readmitted' and 1 means 'readmitted'. Any value other than 'NO' is treated as a readmission.

In [5]:
condition = df['readmitted']!='NO'
df['readmitted'] = np.where(condition,1,0)
df.head()

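| | race | gender | age | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_outpatient | number_emergency | number_inpatient | ... | tolazamide | insulin | glyburide-metformin | glipizide-metformin | glimepiride-pioglitazone | metformin-rosiglitazone | metformin-pioglitazone | change | diabetesMed | readmitted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Caucasian | Female | [0-10) | 1 | 41 | 0 | 1 | 0 | 0 | 0 | ... | No | No | No | No | No | No | No | No | No | 0 |
| 1 | Caucasian | Female | [10-20) | 3 | 59 | 0 | 18 | 0 | 0 | 0 | ... | No | Up | No | No | No | No | No | Ch | Yes | 1 |
| 2 | AfricanAmerican | Female | [20-30) | 2 | 11 | 5 | 13 | 2 | 0 | 1 | ... | No | No | No | No | No | No | No | No | Yes | 0 |
| 3 | Caucasian | Male | [30-40) | 2 | 44 | 1 | 16 | 0 | 0 | 0 | ... | No | Up | No | No | No | No | No | Ch | Yes | 0 |
| 4 | Caucasian | Male | [40-50) | 1 | 51 | 0 | 8 | 0 | 0 | 0 | ... | No | Steady | No | No | No | No | No | Ch | Yes | 0 |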
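
As a small illustration of the encoding above, the toy series below (hypothetical rows using the dataset's three outcome labels) shows that both `<30` and `>30` collapse into the positive class:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw outcome labels.
raw = pd.Series(["NO", ">30", "<30", "NO"], name="readmitted")

# Anything other than 'NO' counts as a readmission.
encoded = np.where(raw != "NO", 1, 0)

print(encoded.tolist())
```

Treating both labels as 1 turns the task into binary classification; the timing of the readmission is no longer distinguished.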


Encode each age range as the midpoint of the range rather than as a categorical, so that it has a defined ordering.

In [6]:
def parse_age(r):
    # Strip the surrounding '[' and ')' and split the range into its bounds
    low, high = r[1:-1].split('-')
    return (int(low) + int(high)) // 2

df['age'] = df['age'].map(parse_age)
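
To make the midpoint encoding concrete, here is the same parsing logic applied to a few bucket labels in the format the dataset uses. The function is restated so the snippet runs on its own.

```python
def parse_age(r):
    # '[40-50)' -> '40-50' -> ('40', '50') -> 45
    low, high = r[1:-1].split('-')
    return (int(low) + int(high)) // 2


for bucket in ["[0-10)", "[40-50)", "[90-100)"]:
    print(bucket, "->", parse_age(bucket))
```

Because the midpoints are plain integers, the models can use the natural ordering of ages, which a one-hot encoding of the buckets would hide.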

Encode all of the categorical columns. Save the information about the encoding in a pickle so we can use it later when we construct the scan definition. Note: Make sure to include a column for null data, as Certifai requires there to be no zero-hot or multi-hot rows in the one-hot encoding.

In [7]:
cat_cols = df.select_dtypes('object').columns
df = pd.get_dummies(df, columns=cat_cols, dtype=int, dummy_na=True)
invariate_cols = [c for c in df.columns if len(df[c].unique()) < 2]
df.drop(columns=invariate_cols, inplace=True)

df.head()

# get_dummies uses a pattern of 'feature_value' for one-hot columns. Save this as a pickled dictionary for use in the scan
# definition
cat_value_mappings = {}
for feature in cat_cols:
    one_hot_col_name_prefix = f"{feature}_"
    mapping = {}
    for ec in df.columns:
        if ec.startswith(one_hot_col_name_prefix):
            value = ec[len(one_hot_col_name_prefix):]
            mapping[ec] = value
    cat_value_mappings[feature] = mapping
    print(f"Feature value -> column mappings for categorical feature '{feature}':")
    for col, val in cat_value_mappings[feature].items():
        print(f"\t{col} -> {val}")
        
with open('cat_value_mappings.pkl', 'wb') as file:
    pickle.dump(cat_value_mappings, file)

Feature value -> column mappings for categorical feature 'race':
	race_AfricanAmerican -> AfricanAmerican
	race_Asian -> Asian
	race_Caucasian -> Caucasian
	race_Hispanic -> Hispanic
	race_Other -> Other
	race_nan -> nan
Feature value -> column mappings for categorical feature 'gender':
	gender_Female -> Female
	gender_Male -> Male
	gender_Unknown/Invalid -> Unknown/Invalid
Feature value -> column mappings for categorical feature 'diag_1':
	diag_1_Circulatory -> Circulatory
	diag_1_Diabetes -> Diabetes
	diag_1_Digestive -> Digestive
	diag_1_Genitourinary -> Genitourinary
	diag_1_Injury -> Injury
	diag_1_Muscoloskeletal -> Muscoloskeletal
	diag_1_Neoplasms -> Neoplasms
	diag_1_Other -> Other
	diag_1_Respiratory -> Respiratory
Feature value -> column mappings for categorical feature 'diag_2':
	diag_2_Circulatory -> Circulatory
	diag_2_Diabetes -> Diabetes
	diag_2_Digestive -> Digestive
	diag_2_Genitourinary -> Genitourinary
	diag_2_Injury -> Injury
	diag_2_Muscoloskeletal -> Muscoloskele
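
The one-hot requirement noted above (exactly one active column per row for each categorical feature) can be checked directly. The following is a minimal sketch on a hypothetical toy column, comparing the encoding with and without `dummy_na=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
toy = pd.DataFrame({"race": ["Asian", np.nan, "Other"]})

# With dummy_na=True the missing value gets its own 'race_nan' column.
encoded = pd.get_dummies(toy, columns=["race"], dtype=int, dummy_na=True)
row_sums = encoded.filter(like="race_").sum(axis=1)

# Without it, the row with the missing value has no active column (zero-hot).
without_na = pd.get_dummies(toy, columns=["race"], dtype=int)
row_sums_without_na = without_na.sum(axis=1)

print(list(encoded.columns))
print(row_sums.tolist())
print(row_sums_without_na.tolist())
```

Running the same row-sum check over each feature's group of columns in the processed dataset is a cheap way to confirm the encoding before handing it to a scan.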

Save the encoded dataset.

In [8]:
df.to_csv('diabetic_data_processed.csv', index=False)

# Model Training

Create the training and test datasets, holding out 20% of the rows for testing.

In [9]:
X = df.drop('readmitted',axis=1)
y = df['readmitted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train.shape,X_test.shape

((81412, 126), (20354, 126))
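
The shapes printed above follow from simple arithmetic on the row count. The sketch below assumes that `train_test_split` rounds the test count up when `test_size` is a fraction, which is consistent with the printed shapes.

```python
import math

n_samples = 81412 + 20354   # total rows, taken from the shapes above
test_size = 0.20

# Assumption: the test count is rounded up, and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_samples, n_train, n_test)
```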

Train two models, a logistic regression and a multi-layer perceptron, and compare them on the test set.

In [10]:
def scores(name, model, x, y):
    preds = model.predict(x)
    return [
        name,
        accuracy_score(y, preds), 
        f1_score(y, preds),
        roc_auc_score(y, model.predict_proba(x)[:,1])
    ]

# Logistic regression baseline
logit_model = LogisticRegression(random_state=0, solver="lbfgs", max_iter=1000)
logit_model.fit(X_train,y_train)
# Multi-layer perceptron with two hidden layers of 20 units each
mlp_model = MLPClassifier(random_state=0, hidden_layer_sizes=(20,20), max_iter=1000)
mlp_model.fit(X_train,y_train)
# Evaluate both models on the held-out test set
results = [scores('logit', logit_model, X_test, y_test), scores('mlp', mlp_model, X_test, y_test)]
display(pd.DataFrame(results, columns=['Name', 'Accuracy', 'F1 Score', 'AUC_ROC']))

| | Name | Accuracy | F1 Score | AUC_ROC |
|---|---|---|---|---|
| 0 | logit | 0.621942 | 0.507898 | 0.65946 |
| 1 | mlp | 0.628476 | 0.571558 | 0.672792 |
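
To make the Accuracy and F1 columns concrete, both can be computed by hand from confusion counts. The helper and the toy labels below are hypothetical and only illustrate the definitions; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score`.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Hypothetical labels: 2 true positives, 2 true negatives, 1 FP, 1 FN.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(round(accuracy, 4), round(f1, 4))
```

AUC differs from these two in that it is computed from predicted probabilities rather than hard labels, which is why `scores` passes `predict_proba` output to `roc_auc_score`.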


Save the models as pickle files. In this case, all of the encoding is done in the data pipeline so we do not need to save an encoder. 

In [11]:

def save(name, model, encoder=None):
    # Store the encoder that was passed in (None here, since encoding is done in the data pipeline)
    model_obj = {'model': model, 'encoder': encoder, 'name': name, 'created': int(time.time())}
    with open(f'readmission_{name}.pkl', 'wb') as file:
        pickle.dump(model_obj, file)
    print(f"Saved: {name}")

# Save models as pickle files
save('logit', logit_model)
save('mlp', mlp_model)

Saved: logit
Saved: mlp
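
A saved file can be read back with `pickle.load` to recover the same dictionary. The round trip below is a minimal sketch: it uses a placeholder dictionary in place of a fitted model and a hypothetical file name in a temporary directory, so it runs without the training steps.

```python
import os
import pickle
import tempfile
import time

# Placeholder standing in for a fitted scikit-learn estimator.
placeholder_model = {"kind": "placeholder"}
model_obj = {
    "model": placeholder_model,
    "encoder": None,
    "name": "demo",
    "created": int(time.time()),
}

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "readmission_demo.pkl")
with open(path, "wb") as file:
    pickle.dump(model_obj, file)

with open(path, "rb") as file:
    loaded = pickle.load(file)

print(loaded["name"], sorted(loaded))
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that saved it.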
