# Predicting chronic kidney disease using an XGBoost pipeline

This notebook demonstrates how to use XGBoost in a scikit-learn pipeline to preprocess data and predict chronic kidney disease in patients. The <a href="https://archive.ics.uci.edu/ml/datasets/chronic_kidney_disease">dataset</a> is provided by the UCI Machine Learning Repository.

The first step is to load all necessary modules and read in the dataset. The column names are not included in the dataset, so we assign them at load time using the description given on the webpage.

In [11]:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

In [12]:
df=pd.read_csv('chronic_kidney_disease.csv',
               header=None,names=['age','bp','sg','al','su','rbc','pc','pcc','ba','bgr','bu','sc',
                                  'sod','pot','hemo','pcv','wc','rc','htn','dm','cad','appet','pe',
                                  'ane','class'])

In [13]:
print(df.head())
print(df.dtypes)

  age  bp     sg al su     rbc        pc         pcc          ba  bgr  ...  \
0  48  80  1.020  1  0       ?    normal  notpresent  notpresent  121  ...   
1   7  50  1.020  4  0       ?    normal  notpresent  notpresent    ?  ...   
2  62  80  1.010  2  3  normal    normal  notpresent  notpresent  423  ...   
3  48  70  1.005  4  0  normal  abnormal     present  notpresent  117  ...   
4  51  80  1.010  2  0  normal    normal  notpresent  notpresent  106  ...   

  pcv    wc   rc  htn   dm cad appet   pe  ane class  
0  44  7800  5.2  yes  yes  no  good   no   no   ckd  
1  38  6000    ?   no   no  no  good   no   no   ckd  
2  31  7500    ?   no  yes  no  poor   no  yes   ckd  
3  32  6700  3.9  yes   no  no  poor  yes  yes   ckd  
4  35  7300  4.6   no   no  no  good   no   no   ckd  

[5 rows x 25 columns]
age      object
bp       object
sg       object
al       object
su       object
rbc      object
pc       object
pcc      object
ba       object
bgr      object
bu       object
sc


Upon initial inspection, we see that missing values are displayed as '?' and that all columns are of object type. This is not ideal. I will first replace '?' with NaN, which makes later processing easier. I will also convert the columns that hold numeric values to float type. The description of each column on the dataset webpage tells me which columns should be numerical and which categorical. The column names of each type are listed in 'numeric_cols' and 'cat_cols'.
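
As an alternative to replacing '?' after loading, pandas can treat it as missing while parsing through the `na_values` argument of `read_csv`. The snippet below is a minimal sketch on a hypothetical three-row sample (the rows and the reduced column list are made up for illustration). Whether this alone is enough for the real file is an assumption, since other formatting quirks could still leave columns as object type.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the style of the raw file:
# no header row, '?' marking a missing value.
raw = io.StringIO("48,80,normal,ckd\n7,?,?,ckd\n62,80,normal,notckd\n")

sample = pd.read_csv(
    raw,
    header=None,
    names=['age', 'bp', 'rbc', 'class'],
    na_values='?',  # parse '?' as NaN instead of as a string
)

print(sample.dtypes)        # 'bp' is parsed as a float column
print(sample.isna().sum())  # one missing value each in 'bp' and 'rbc'
```

With this approach the numeric columns that contain only numbers and '?' are converted during parsing, so the separate `replace` step is not needed for them.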


In [14]:
df.replace('?',np.nan,inplace=True)

In [15]:
numeric_cols=['age','bp','sg','al','su','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc']
cat_cols=['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']

In [16]:
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

In [17]:
print(df.head())
print(df.dtypes)

    age    bp     sg   al   su     rbc        pc         pcc          ba  \
0  48.0  80.0  1.020  1.0  0.0     NaN    normal  notpresent  notpresent   
1   7.0  50.0  1.020  4.0  0.0     NaN    normal  notpresent  notpresent   
2  62.0  80.0  1.010  2.0  3.0  normal    normal  notpresent  notpresent   
3  48.0  70.0  1.005  4.0  0.0  normal  abnormal     present  notpresent   
4  51.0  80.0  1.010  2.0  0.0  normal    normal  notpresent  notpresent   

     bgr  ...   pcv      wc   rc  htn   dm  cad  appet   pe  ane class  
0  121.0  ...  44.0  7800.0  5.2  yes  yes   no   good   no   no   ckd  
1    NaN  ...  38.0  6000.0  NaN   no   no   no   good   no   no   ckd  
2  423.0  ...  31.0  7500.0  NaN   no  yes   no   poor   no  yes   ckd  
3  117.0  ...  32.0  6700.0  3.9  yes   no   no   poor  yes  yes   ckd  
4  106.0  ...  35.0  7300.0  4.6   no   no   no   good   no   no   ckd  

[5 rows x 25 columns]
age      float64
bp       float64
sg       float64
al       float64
su       float

Now that the data is in an easily processable form, I will separate the target labels (the 'class' column) from the feature table. The target labels are converted to 1 for a positive case of chronic kidney disease and 0 for no kidney disease.

In [18]:
target=df['class'].to_numpy()
y=np.where(target=='ckd',1,0)
feature=df.drop(['class'],axis=1)
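
One caveat with `np.where(target=='ckd',1,0)` is that any label that is not exactly 'ckd' silently becomes 0. The sketch below shows a defensive variant on hypothetical labels. The padded entries are an assumption used to illustrate the failure mode, not values confirmed to be in this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical labels; the padded entries are made up for illustration.
labels = pd.Series(['ckd', 'ckd\t', 'notckd', ' notckd'])

# Direct comparison: the padded 'ckd\t' is silently mapped to 0.
naive = np.where(labels.to_numpy() == 'ckd', 1, 0)

# Strip surrounding whitespace, then fail loudly on anything unexpected.
cleaned = labels.str.strip()
unexpected = set(cleaned.unique()) - {'ckd', 'notckd'}
if unexpected:
    raise ValueError(f"unexpected class labels: {unexpected}")
y_checked = np.where(cleaned.to_numpy() == 'ckd', 1, 0)

print(naive)      # [1 0 0 0]
print(y_checked)  # [1 1 0 0]
```

Running `df['class'].value_counts()` before encoding is a quick way to see whether such a check is needed.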

Now I begin constructing the pipeline steps. Since there are missing values, we use an imputer to replace them with the column median for numeric variables and with the column's most frequent value for categorical variables. The categorical variables must then be encoded before they can be fed into the ML model; for this I will use a OneHotEncoder.

In [19]:
numeric_transformer = SimpleImputer(strategy='median')


categoric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
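
To see what these two transformers do, the sketch below applies the same strategies to hypothetical toy columns (the values are made up and are not rows from the dataset).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy columns, each with one missing entry.
toy_num = pd.DataFrame({'age': [40.0, np.nan, 60.0, 80.0]})
toy_cat = pd.DataFrame({'appet': ['good', np.nan, 'poor', 'good']})

# Numeric: the missing value is replaced by the median of 40, 60, 80.
num_out = SimpleImputer(strategy='median').fit_transform(toy_num)

# Categorical: fill with the most frequent value, then one-hot encode.
cat_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
# The encoder returns a sparse matrix by default; densify it for printing.
cat_out = cat_pipe.fit_transform(toy_cat).toarray()

print(num_out.ravel())  # [40. 60. 60. 80.]
print(cat_out)          # columns are 'good', 'poor'; the missing row becomes 'good'
```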

Now the two transformers need to be combined so that numeric_transformer is applied to numeric_cols and categoric_transformer to cat_cols. I use a ColumnTransformer for this, which lets me include the combined preprocessing as a single pipeline step.

In [20]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categoric_transformer, cat_cols)
    ])
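
The layout of the combined output can be checked on a small example. The sketch below rebuilds the same kind of ColumnTransformer on a hypothetical three-row frame (the names `toy` and `toy_pre` and the values are made up for illustration).

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: one numeric and two categorical columns.
toy = pd.DataFrame({
    'age': [40.0, np.nan, 60.0],
    'htn': ['yes', 'no', np.nan],
    'appet': ['good', 'poor', 'good'],
})

toy_pre = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), ['age']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['htn', 'appet']),
])

out = toy_pre.fit_transform(toy)
# Depending on density, the result may be sparse; densify for inspection.
if sparse.issparse(out):
    out = out.toarray()

# One numeric column followed by two one-hot columns per categorical feature.
print(out.shape)  # (3, 5)
print(out[:, 0])  # imputed 'age' column comes first
```

Columns not listed in any transformer are dropped by default (`remainder='drop'`), so every feature the model should see has to appear in one of the column lists.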

Below I construct the main pipeline, which consists of the 'preprocessor' step followed by an 'xgbclf' step that holds the XGBoost classifier. The initial classifier uses trees with a max depth of 3 and the 'logloss' evaluation metric, which is appropriate for a binary classification problem. The model will run in parallel using 3 threads.

In [21]:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('xgbclf', xgb.XGBClassifier(max_depth=3,eval_metric='logloss',n_jobs=3,use_label_encoder=False))])

To make sure the pipeline runs without errors and the data is in the correct format, I run the pipeline with 5-fold cross-validation. The model is evaluated using the ROC AUC score, since we want the model to separate positive and negative cases as well as possible.
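
For intuition about the metric, ROC AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as half. The sketch below checks this on four hypothetical predictions (the labels and probabilities are made up).

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

pos = [p for p, t in zip(y_prob, y_true) if t == 1]
neg = [p for p, t in zip(y_prob, y_true) if t == 0]

# Count correctly ordered pairs; a tie contributes half a pair.
correct = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos for n in neg
)
manual_auc = correct / (len(pos) * len(neg))

print(manual_auc)                     # 0.75
print(roc_auc_score(y_true, y_prob))  # 0.75
```

A score of 1.0 means every positive case is ranked above every negative one, and 0.5 corresponds to random ranking.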

In [22]:
cross_val_scores = cross_val_score(pipeline, feature, y, scoring="roc_auc", cv=5)

In [23]:
print(np.mean(cross_val_scores))

0.9982666666666666


The mean ROC AUC score is approximately 0.998, which is extremely good for an untuned model! Depending on the requirements of the problem, this model could be used as is without further tuning. However, for demonstration purposes, I will see whether the model can be improved by tuning the hyperparameters. The three hyperparameters tuned here are the learning rate, the max depth of each tree, and the number of trees or 'boosters' used. The complete list of hyperparameters can be found in the <a href="https://xgboost.readthedocs.io/en/stable/parameter.html">XGB documentation</a>.

In [24]:
xgb_grid = {
    'xgbclf__learning_rate': np.arange(0.05, 1, 0.05),
    'xgbclf__max_depth': np.arange(3,15, 1),
    'xgbclf__n_estimators': np.arange(50, 250, 50)
}

I will use a randomized search to sample 100 different parameter settings from the grid above. Each parameter setting is cross-validated with 4 folds, which produces 100 $\times$ 4 = 400 fits in total.
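
To put the 100 sampled settings in context, the sketch below counts how many combinations the grid contains. It rebuilds the same grid; the helper names `sizes`, `n_combinations`, `n_iter` and `n_folds` are introduced here for illustration only.

```python
import numpy as np

xgb_grid = {
    'xgbclf__learning_rate': np.arange(0.05, 1, 0.05),
    'xgbclf__max_depth': np.arange(3, 15, 1),
    'xgbclf__n_estimators': np.arange(50, 250, 50),
}

# Number of candidate values per hyperparameter.
sizes = {name: len(values) for name, values in xgb_grid.items()}
n_combinations = int(np.prod(list(sizes.values())))

n_iter, n_folds = 100, 4

print(sizes)                    # 19, 12 and 4 values respectively
print(n_combinations)           # 912
print(n_iter / n_combinations)  # share of the grid that is sampled
print(n_iter * n_folds)         # 400 fits
```

The randomized search therefore evaluates roughly 11% of the full grid, whereas an exhaustive grid search with the same folds would need 912 $\times$ 4 = 3648 fits.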

In [28]:
randomized_roc_auc = RandomizedSearchCV(pipeline,param_distributions=xgb_grid,
                                        n_iter=100,scoring='roc_auc',verbose=1,cv=4)

In [29]:
randomized_roc_auc.fit(feature,y)

Fitting 4 folds for each of 100 candidates, totalling 400 fits


RandomizedSearchCV(cv=4,
                   estimator=Pipeline(steps=[('preprocessor',
                                              ColumnTransformer(transformers=[('num',
                                                                               SimpleImputer(strategy='median'),
                                                                               ['age',
                                                                                'bp',
                                                                                'sg',
                                                                                'al',
                                                                                'su',
                                                                                'bgr',
                                                                                'bu',
                                                                                'sc',
                        

In [30]:
print(randomized_roc_auc.best_score_)
print(randomized_roc_auc.best_params_)

0.9997854997854998
{'xgbclf__n_estimators': 50, 'xgbclf__max_depth': 14, 'xgbclf__learning_rate': 0.6500000000000001}


Great! The tuned model produces a slightly higher ROC AUC score, and the hyperparameter values of the best model are displayed above. This model can now be tuned further if desired, or validated on a test set before being sent to production.
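
Validation on a test set could look roughly like the sketch below. It is not the notebook's own workflow. It uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for the tuned XGBoost pipeline, so that the example runs on its own; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the kidney disease dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold out 25% of the rows, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Stand-in model; fitting and any tuning use the training part only.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Score the untouched test part with the same metric as above.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(test_auc)
```

In this notebook, the same split would be applied to `feature` and `y` before calling `randomized_roc_auc.fit`, so that the hyperparameter search never sees the held-out rows.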