# Chronic Kidney Disease Analysis

An application that can be reused across biomedical data science projects. The proof of concept uses a dataset of clinical measurements and biomarkers that can help physicians better understand chronic kidney disease (CKD).

## Description

This project demonstrates how machine learning can be used to analyse data such as the Chronic Kidney Disease dataset.

### Relevant Information:

			age		-	age	
			bp		-	blood pressure
			sg		-	specific gravity
			al		-   	albumin
			su		-	sugar
			rbc		-	red blood cells
			pc		-	pus cell
			pcc		-	pus cell clumps
			ba		-	bacteria
			bgr		-	blood glucose random
			bu		-	blood urea
			sc		-	serum creatinine
			sod		-	sodium
			pot		-	potassium
			hemo		-	hemoglobin
			pcv		-	packed cell volume
			wc		-	white blood cell count
			rc		-	red blood cell count
			htn		-	hypertension
			dm		-	diabetes mellitus
			cad		-	coronary artery disease
			appet		-	appetite
			pe		-	pedal edema
			ane		-	anemia
			class		-	class	


Number of attributes: 24 + class = 25 (11 numeric, 14 nominal)

### Attribute Information:

 	1.Age(numerical)
  	  	age in years
 	2.Blood Pressure(numerical)
	       	bp in mm/Hg
 	3.Specific Gravity(nominal)
	  	sg - (1.005,1.010,1.015,1.020,1.025)
 	4.Albumin(nominal)
		al - (0,1,2,3,4,5)
 	5.Sugar(nominal)
		su - (0,1,2,3,4,5)
 	6.Red Blood Cells(nominal)
		rbc - (normal,abnormal)
 	7.Pus Cell (nominal)
		pc - (normal,abnormal)
 	8.Pus Cell clumps(nominal)
		pcc - (present,notpresent)
 	9.Bacteria(nominal)
		ba  - (present,notpresent)
 	10.Blood Glucose Random(numerical)		
		bgr in mgs/dl
 	11.Blood Urea(numerical)	
		bu in mgs/dl
 	12.Serum Creatinine(numerical)	
		sc in mgs/dl
 	13.Sodium(numerical)
		sod in mEq/L
 	14.Potassium(numerical)	
		pot in mEq/L
 	15.Hemoglobin(numerical)
		hemo in gms
 	16.Packed  Cell Volume(numerical)
 	17.White Blood Cell Count(numerical)
		wc in cells/cumm
 	18.Red Blood Cell Count(numerical)	
		rc in millions/cmm
 	19.Hypertension(nominal)	
		htn - (yes,no)
 	20.Diabetes Mellitus(nominal)	
		dm - (yes,no)
 	21.Coronary Artery Disease(nominal)
		cad - (yes,no)
 	22.Appetite(nominal)	
		appet - (good,poor)
 	23.Pedal Edema(nominal)
		pe - (yes,no)	
 	24.Anemia(nominal)
		ane - (yes,no)
 	25.Class (nominal)		
		class - (ckd,notckd)
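
The attribute list above can be captured as a small lookup table, which is handy when deciding how to preprocess each column. This is only a sketch: the name `ATTRIBUTE_TYPES` is made up for illustration, and the keys use the abbreviations from the list. The loaded DataFrame names the two cell-count columns `wbcc` and `rbcc` rather than `wc` and `rc`, so the keys may need adjusting to the actual file.

```python
# Attribute types as listed above (abbreviation -> "numerical" or "nominal").
# ATTRIBUTE_TYPES is a hypothetical name used only for this illustration.
ATTRIBUTE_TYPES = {
    "age": "numerical", "bp": "numerical", "sg": "nominal", "al": "nominal",
    "su": "nominal", "rbc": "nominal", "pc": "nominal", "pcc": "nominal",
    "ba": "nominal", "bgr": "numerical", "bu": "numerical", "sc": "numerical",
    "sod": "numerical", "pot": "numerical", "hemo": "numerical",
    "pcv": "numerical", "wc": "numerical", "rc": "numerical",
    "htn": "nominal", "dm": "nominal", "cad": "nominal", "appet": "nominal",
    "pe": "nominal", "ane": "nominal", "class": "nominal",
}

numeric_cols = [name for name, kind in ATTRIBUTE_TYPES.items() if kind == "numerical"]
nominal_cols = [name for name, kind in ATTRIBUTE_TYPES.items() if kind == "nominal"]

# Matches the summary line: 11 numeric and 14 nominal attributes.
print(len(numeric_cols), len(nominal_cols))
```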

In [1]:
from src import ETL_tool as ETL
from sklearn.model_selection import train_test_split
from autogluon.tabular import TabularDataset, TabularPredictor
from pandas import DataFrame, Series



## Data loading

In [2]:

df = ETL.load_data("data/chronic_kidney_disease_full.arff")
print(df.head(5))
print(df.describe())


    age    bp     sg al su     rbc        pc         pcc          ba    bgr  \
0  48.0  80.0  1.020  1  0     NaN    normal  notpresent  notpresent  121.0   
1   7.0  50.0  1.020  4  0     NaN    normal  notpresent  notpresent    NaN   
2  62.0  80.0  1.010  2  3  normal    normal  notpresent  notpresent  423.0   
3  48.0  70.0  1.005  4  0  normal  abnormal     present  notpresent  117.0   
4  51.0  80.0  1.010  2  0  normal    normal  notpresent  notpresent  106.0   

   ...   pcv    wbcc  rbcc  htn   dm  cad  appet   pe  ane class  
0  ...  44.0  7800.0   5.2  yes  yes   no   good   no   no   ckd  
1  ...  38.0  6000.0   NaN   no   no   no   good   no   no   ckd  
2  ...  31.0  7500.0   NaN   no  yes   no   poor   no  yes   ckd  
3  ...  32.0  6700.0   3.9  yes   no   no   poor  yes  yes   ckd  
4  ...  35.0  7300.0   4.6   no   no   no   good   no   no   ckd  

[5 rows x 25 columns]
              age          bp         bgr          bu          sc         sod  \
count  391.000000  
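
`ETL.load_data` belongs to this project's `src/ETL_tool` module, and its implementation is not shown here. As a rough idea of what loading an ARFF file involves, the sketch below parses ARFF text with the standard library only and turns `?` (the ARFF missing-value marker) into `None`. The function name `parse_arff` and the sample text are assumptions for illustration. Real files may need a more robust reader such as `scipy.io.arff.loadarff`.

```python
import csv
import io

def parse_arff(text):
    """Parse a simple ARFF document into (column_names, rows)."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # "@attribute 'age' numeric" -> "age"
                names.append(line.split(None, 2)[1].strip("'\""))
            elif lower.startswith("@data"):
                in_data = True
            continue
        # Data rows are comma-separated; "?" marks a missing value.
        values = next(csv.reader(io.StringIO(line), skipinitialspace=True))
        rows.append([None if v.strip() == "?" else v.strip() for v in values])
    return names, rows

# Tiny made-up sample in ARFF syntax (not taken from the real file).
sample = """@relation ckd_demo
@attribute 'age' numeric
@attribute 'rbc' {normal,abnormal}
@attribute 'class' {ckd,notckd}
@data
48,?,ckd
62,normal,ckd
"""

names, rows = parse_arff(sample)
print(names)
print(rows)
```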

## Split dataset

In [3]:
df_train, df_test = train_test_split(df, test_size=0.33, random_state=1)
print(df_train.shape)
print(df_test.shape)

(268, 25)
(132, 25)
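
The shapes follow from the size of the dataset: with a fractional `test_size`, scikit-learn rounds the test count up and gives the remaining rows to the training set. A quick check of that arithmetic, where the 400-row total is inferred from 268 + 132:

```python
import math

n_samples = 268 + 132                  # total rows, from the shapes printed above
n_test = math.ceil(0.33 * n_samples)   # a fractional test_size is rounded up
n_train = n_samples - n_test           # the remaining rows go to training

print(n_train, n_test)
```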


## Separate input and output

In [4]:
test_data = df_test.drop(['class'], axis=1)
print(test_data.shape)
print(test_data.head(5))

(132, 24)
      age    bp     sg   al   su     rbc      pc         pcc          ba  \
398  17.0  60.0  1.025    0    0  normal  normal  notpresent  notpresent   
125  72.0  90.0    NaN  NaN  NaN     NaN     NaN  notpresent  notpresent   
328  28.0  70.0  1.020    0    0  normal  normal         NaN         NaN   
339  25.0  70.0  1.020    0    0  normal  normal  notpresent  notpresent   
172  62.0  80.0  1.010    1    2     NaN     NaN  notpresent  notpresent   

       bgr  ...  hemo   pcv     wbcc  rbcc  htn   dm  cad  appet  pe ane  
398  114.0  ...  14.2  51.0   7200.0   5.9   no   no   no   good  no  no  
125  308.0  ...   NaN   NaN      NaN   NaN  yes  yes   no   poor  no  no  
328  131.0  ...   NaN  45.0   8600.0   6.5   no   no   no   good  no  no  
339   88.0  ...  13.3  48.0   7000.0   4.9   no   no   no   good  no  no  
172  309.0  ...  10.6  34.0  12800.0   4.9   no   no   no   good  no  no  

[5 rows x 24 columns]


## Prediction
When the `fit` method is called, AutoGluon infers the type of prediction problem from the label column. In this case it chooses binary classification because there are only two output classes, "ckd" and "notckd", which indicate whether or not the individual has chronic kidney disease. The `fit` method then selects suitable models for the task and trains them all. Validation data is held out from the training set automatically (with the `best_quality` preset this is done through 5-fold bagging, as the log below reports) so that the models can be scored and optimised.

In [5]:
predictor = TabularPredictor(label='class').fit(train_data=df_train, verbosity=2, presets='best_quality')

No path specified. Models will be saved in: "AutogluonModels\ag-20230702_213736\"
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=5, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20230702_213736\"
AutoGluon Version:  0.8.2
Python Version:     3.10.11
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
Disk Space Avail:   36.29 GB / 478.87 GB (7.6%)
Train Data Rows:    268
Train Data Columns: 24
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  ['notckd', 'ckd']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = notckd, class 0 = ckd
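
The log above reports `num_bag_folds=5`, meaning each model is trained five times, each time holding out a different fifth of the 268 training rows for validation. The sketch below only illustrates how such a 5-fold partition covers the data. It is not AutoGluon's internal code, and `kfold_indices` is a hypothetical helper written for this example.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, near-equal folds."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_rows=268, n_folds=5))
val_sizes = [len(val) for _, val in folds]

# Every row is used for validation exactly once across the five folds.
print(val_sizes)
```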

The `fit_summary` method shows the details of each trained model: the prediction score on the validation data (`score_val`), the time it took to train the model (`fit_time`), the time it took to run the model on the validation data (`pred_time_val`), and the marginal times (`fit_time_marginal`, `pred_time_val_marginal`), which count only the model itself and exclude the models it is stacked on.

In [6]:
predictor.fit_summary()

*** Summary of fit() ***
Estimated performance of each model:
                      model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0         LightGBMXT_BAG_L1   1.000000       0.233871  13.442483                0.233871          13.442483            1       True          3
1       WeightedEnsemble_L2   1.000000       0.236869  16.542738                0.002998           3.100255            2       True         14
2    NeuralNetFastAI_BAG_L1   1.000000       0.472729  42.422261                0.472729          42.422261            1       True         10
3     ExtraTreesEntr_BAG_L1   0.996269       0.250859   1.360399                0.250859           1.360399            1       True          9
4   RandomForestGini_BAG_L1   0.996269       0.277844   1.757026                0.277844           1.757026            1       True          5
5     ExtraTreesGini_BAG_L1   0.996269       0.330810   1.738003                



{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'RandomForestGini_BAG_L1': 'StackerEnsembleModel_RF',
  'RandomForestEntr_BAG_L1': 'StackerEnsembleModel_RF',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'ExtraTreesGini_BAG_L1': 'StackerEnsembleModel_XT',
  'ExtraTreesEntr_BAG_L1': 'StackerEnsembleModel_XT',
  'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'NeuralNetTorch_BAG_L1': 'StackerEnsembleModel_TabularNeuralNetTorch',
  'LightGBMLarge_BAG_L1': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif_BAG_L1': 0.6753731343283582,
  'KNeighborsDist_BAG_L1': 0.6940298507462687,
  'LightGBMXT_BAG_L1': 1.0,
  'LightGBM_BAG_L1': 0.9925373134328358,
  'RandomForestGini
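
The `*_marginal` columns report the cost of a model on its own, while `fit_time` and `pred_time_val` also include the base models it depends on. Using the values printed above, the ensemble's totals equal those of `LightGBMXT_BAG_L1` plus the ensemble's marginal cost. That would be consistent with the ensemble relying on that single base model, although this is an inference from the numbers and not something the output states directly.

```python
import math

# Values copied from the fit_summary() table above.
lightgbmxt = {"fit_time": 13.442483, "pred_time_val": 0.233871}
ensemble = {
    "fit_time": 16.542738,
    "fit_time_marginal": 3.100255,
    "pred_time_val": 0.236869,
    "pred_time_val_marginal": 0.002998,
}

# Total = base model cost + the ensemble's own (marginal) cost.
fit_total = lightgbmxt["fit_time"] + ensemble["fit_time_marginal"]
pred_total = lightgbmxt["pred_time_val"] + ensemble["pred_time_val_marginal"]

print(math.isclose(fit_total, ensemble["fit_time"], abs_tol=1e-6))
print(math.isclose(pred_total, ensemble["pred_time_val"], abs_tol=1e-6))
```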

The `leaderboard` method shows the statistics of each trained model, sorted by its score on the data passed to it (`score_test`). Because `df_train` is passed here, `score_test` is measured on the training set. I would have sorted it by the score on the validation set (`score_val`), which is the most important factor in determining which model is the best. We can see that the weighted ensemble, which is considered the best model, is as accurate as the LightGBMXT model and has the lowest marginal prediction time. The NeuralNetFastAI model is equally accurate on the validation set but the slowest to predict, which is acceptable in this case as we don't need to predict in real time.

In [7]:
predictor.leaderboard(df_train, silent=True)

| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KNeighborsDist_BAG_L1 | 1.0 | 0.69403 | 0.069995 | 0.027212 | 0.022759 | 0.069995 | 0.027212 | 0.022759 | 1 | True | 2 |
| 1 | ExtraTreesGini_BAG_L1 | 1.0 | 0.996269 | 0.197887 | 0.33081 | 1.738003 | 0.197887 | 0.33081 | 1.738003 | 1 | True | 8 |
| 2 | RandomForestEntr_BAG_L1 | 1.0 | 0.996269 | 0.232898 | 0.447743 | 1.3796 | 0.232898 | 0.447743 | 1.3796 | 1 | True | 6 |
| 3 | RandomForestGini_BAG_L1 | 1.0 | 0.996269 | 0.240413 | 0.277844 | 1.757026 | 0.240413 | 0.277844 | 1.757026 | 1 | True | 5 |
| 4 | ExtraTreesEntr_BAG_L1 | 1.0 | 0.996269 | 0.259851 | 0.250859 | 1.360399 | 0.259851 | 0.250859 | 1.360399 | 1 | True | 9 |
| 5 | LightGBMLarge_BAG_L1 | 1.0 | 0.992537 | 0.352796 | 0.156909 | 13.124398 | 0.352796 | 0.156909 | 13.124398 | 1 | True | 13 |
| 6 | NeuralNetTorch_BAG_L1 | 1.0 | 0.992537 | 0.488729 | 0.3558 | 29.268425 | 0.488729 | 0.3558 | 29.268425 | 1 | True | 12 |
| 7 | XGBoost_BAG_L1 | 1.0 | 0.988806 | 0.494725 | 0.189888 | 13.382499 | 0.494725 | 0.189888 | 13.382499 | 1 | True | 11 |
| 8 | NeuralNetFastAI_BAG_L1 | 1.0 | 1.0 | 3.006343 | 0.472729 | 42.422261 | 3.006343 | 0.472729 | 42.422261 | 1 | True | 10 |
| 9 | CatBoost_BAG_L1 | 0.996269 | 0.992537 | 0.218873 | 0.174899 | 97.99498 | 0.218873 | 0.174899 | 97.99498 | 1 | True | 7 |
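
To rank models by validation score instead, the returned leaderboard can be re-sorted. `leaderboard()` returns a pandas DataFrame, so calling `sort_values('score_val', ascending=False)` on it would do this directly. The dependency-free sketch below applies the same ordering to a few rows copied from the output above. The tie-break on prediction time is an illustrative choice, not something AutoGluon prescribes.

```python
# (model, score_test, score_val, pred_time_val) copied from the leaderboard above.
rows = [
    ("KNeighborsDist_BAG_L1", 1.0, 0.69403, 0.027212),
    ("ExtraTreesGini_BAG_L1", 1.0, 0.996269, 0.33081),
    ("XGBoost_BAG_L1", 1.0, 0.988806, 0.189888),
    ("NeuralNetFastAI_BAG_L1", 1.0, 1.0, 0.472729),
    ("CatBoost_BAG_L1", 0.996269, 0.992537, 0.174899),
]

# Highest validation score first; ties broken by faster validation prediction.
by_val = sorted(rows, key=lambda r: (-r[2], r[3]))

for model, score_test, score_val, _ in by_val:
    print(f"{model:24s} score_val={score_val:.6f} score_test={score_test:.6f}")
```

`KNeighborsDist_BAG_L1` scores 1.0 on the training rows but only 0.694 on validation, which shows why a score measured on training data is not a reliable way to rank models.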


## Feature importance

The `feature_importance` method shows which factors matter most for predicting kidney disease. It seems that red blood cells (rbc), hemoglobin (hemo) and serum creatinine (sc) are the biggest factors in determining chronic kidney disease from this dataset.
Anemia (ane), appetite (appet), coronary artery disease (cad), specific gravity (sg), sugar (su), white blood cell count (wbcc), pus cell clumps (pcc) and bacteria (ba) do not seem to factor into CKD; however, the dataset might not be large enough to reveal relationships with these parameters.

In [8]:
predictor.feature_importance(data=df_train)

Computing feature importance via permutation shuffling for 24 features using 268 rows with 5 shuffle sets...
	33.92s	= Expected runtime (6.78s per shuffle set)
	3.44s	= Actual runtime (Completed 5 of 5 shuffle sets)


| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| rbc | 0.087313 | 0.008986 | 1.3e-05 | 5 | 0.105816 | 0.068811 |
| hemo | 0.021642 | 0.007647 | 0.001596 | 5 | 0.037387 | 0.005897 |
| sc | 0.011194 | 0.0059 | 0.006618 | 5 | 0.023342 | -0.000954 |
| pcv | 0.009701 | 0.004254 | 0.003494 | 5 | 0.018461 | 0.000942 |
| rbcc | 0.008209 | 0.004865 | 0.009777 | 5 | 0.018226 | -0.001808 |
| dm | 0.007463 | 0.002638 | 0.001599 | 5 | 0.012895 | 0.00203 |
| sod | 0.005224 | 0.003337 | 0.012448 | 5 | 0.012096 | -0.001648 |
| bgr | 0.003731 | 0.003731 | 0.044505 | 5 | 0.011414 | -0.003952 |
| pot | 0.002985 | 0.001669 | 0.008065 | 5 | 0.006421 | -0.000451 |
| htn | 0.002985 | 0.003122 | 0.04965 | 5 | 0.009413 | -0.003443 |
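
The log above mentions permutation shuffling: a feature's importance is estimated as the drop in score when that feature's column is shuffled, which breaks its link to the label. The toy sketch below illustrates the idea with a hand-written rule instead of a trained model. The data, the threshold and the function names are all invented for illustration and are unrelated to the real CKD results.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, col, n_shuffles=5, seed=0):
    """Mean drop in accuracy after shuffling the values of column `col`."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_shuffles):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in place of the original.
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_shuffles

# Invented toy rows: (hemo, noise). The toy "model" looks only at hemo.
rows = [(9.0, 1), (10.5, 0), (11.0, 1), (14.2, 0), (15.0, 1), (16.1, 0)]
labels = ["ckd", "ckd", "ckd", "notckd", "notckd", "notckd"]

def predict(row):
    return "ckd" if row[0] < 13.0 else "notckd"

imp_hemo = permutation_importance(predict, rows, labels, col=0)
imp_noise = permutation_importance(predict, rows, labels, col=1)
print(imp_hemo, imp_noise)
```

Shuffling the column the rule ignores never changes a prediction, so its importance is exactly zero, while shuffling the column it relies on lowers the accuracy.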


## Model prediction test

In [13]:
y_pred = predictor.predict(test_data)
df_pred = DataFrame(y_pred, columns=['class'])
df_pred

| | class |
|---|---|
| 398 | notckd |
| 125 | ckd |
| 328 | notckd |
| 339 | notckd |
| 172 | ckd |
| ... | ... |
| 12 | ckd |
| 309 | ckd |
| 399 | notckd |
| 333 | notckd |


In [14]:
predictor.evaluate(df_test)

Evaluation: accuracy on test data: 0.9848484848484849
Evaluations on test data:
{
    "accuracy": 0.9848484848484849,
    "balanced_accuracy": 0.9821428571428572,
    "mcc": 0.969309258988296,
    "roc_auc": 1.0,
    "f1": 0.9818181818181818,
    "precision": 1.0,
    "recall": 0.9642857142857143
}


{'accuracy': 0.9848484848484849,
 'balanced_accuracy': 0.9821428571428572,
 'mcc': 0.969309258988296,
 'roc_auc': 1.0,
 'f1': 0.9818181818181818,
 'precision': 1.0,
 'recall': 0.9642857142857143}
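
These metrics pin down the confusion matrix on the 132 test rows. Assuming the positive class is `notckd` (class 1 in the label mapping logged during training), precision 1.0 means there are no false positives, accuracy 130/132 means two errors in total, and recall 54/56 places both of them as false negatives. The counts below are derived from the reported metrics and are not read from AutoGluon output.

```python
import math

# Counts inferred from the reported metrics (positive class assumed: notckd).
tp, fn, fp, tn = 54, 2, 0, 76

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * tp / (2 * tp + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(accuracy, balanced_accuracy, mcc, f1, precision, recall)
```

Under that assumption, the two misclassified test rows are `notckd` individuals predicted as `ckd`.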

## Import saved model and use it

In [18]:
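# Keep only the best model ("WeightedEnsemble_L2") and the models it depends on,
# then reload the predictor from disk and predict with it.
predictor.delete_models(models_to_keep='best', dry_run=False)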
predictor = TabularPredictor.load("AutogluonModels/ag-20230702_213736/")
y_pred = predictor.predict(test_data)
df_pred = DataFrame(y_pred, columns=['class'])
df_pred

Deleting model KNeighborsUnif_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\KNeighborsUnif_BAG_L1\ will be removed.
Deleting model KNeighborsDist_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\KNeighborsDist_BAG_L1\ will be removed.
Deleting model LightGBM_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\LightGBM_BAG_L1\ will be removed.
Deleting model RandomForestGini_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\RandomForestGini_BAG_L1\ will be removed.
Deleting model RandomForestEntr_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\RandomForestEntr_BAG_L1\ will be removed.
Deleting model CatBoost_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\CatBoost_BAG_L1\ will be removed.
Deleting model ExtraTreesGini_BAG_L1. All files under AutogluonModels/ag-20230702_213736/\models\ExtraTreesGini_BAG_L1\ will be removed.
Deleting model ExtraTreesEntr_BAG_L1. All files under Aut

| | class |
|---|---|
| 398 | notckd |
| 125 | ckd |
| 328 | notckd |
| 339 | notckd |
| 172 | ckd |
| ... | ... |
| 12 | ckd |
| 309 | ckd |
| 399 | notckd |
| 333 | notckd |
