# CKD RESEARCH

## Business Understanding



> Chronic Kidney Disease (CKD) is a major global public health concern, affecting approximately 9–13% of the world’s population.

Despite its high prevalence, a substantial proportion of cases remain undiagnosed, particularly in the early stages when the disease is largely asymptomatic. Late diagnosis often leads to severe complications, including cardiovascular disease, end-stage renal disease (ESRD), and increased mortality.

CKD is strongly associated with common chronic conditions such as diabetes, hypertension, and obesity—conditions that are themselves highly prevalent. In many healthcare settings, especially in low- and middle-income countries, routine screening and early detection remain inadequate due to limited resources and delayed clinical evaluation.

Traditional diagnosis relies on laboratory markers such as estimated glomerular filtration rate (eGFR) and proteinuria. However, these indicators are often assessed only after clinical suspicion arises, reducing opportunities for preventive intervention.

Given the availability of routinely collected demographic and laboratory data, there is an opportunity to leverage machine learning techniques to develop predictive models capable of identifying individuals at high risk of CKD earlier and more efficiently. An accurate and sensitive predictive model could support clinical decision-making, improve early detection rates, and ultimately reduce disease progression and healthcare burden.

## Metric of Success

> A model with a recall of at least 0.85 and ROC-AUC of at least 0.88.
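
The two targets above can be checked mechanically once a model produces scores. The following is a minimal sketch in plain Python, not the project's evaluation code. It assumes labels are coded 1 for `ckd` and 0 for `notckd`, and that a 0.5 decision threshold is used for recall; the function names are hypothetical. In practice, scikit-learn's `recall_score` and `roc_auc_score` compute the same quantities.

```python
def recall(y_true, y_pred):
    """Share of true positives among all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_targets(y_true, y_score, threshold=0.5, min_recall=0.85, min_auc=0.88):
    """True only if both success criteria are satisfied."""
    y_pred = [int(s >= threshold) for s in y_score]  # threshold is an assumption
    return recall(y_true, y_pred) >= min_recall and roc_auc(y_true, y_score) >= min_auc


# Toy labels and scores, made up purely for illustration.
y_true = [1, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.4, 0.7]
result = meets_targets(y_true, y_score)
print(result)
```

In this toy case recall is perfect, but one negative outranks one positive, so the ROC-AUC falls below 0.88 and the check fails. Both criteria have to hold at once.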

## Experimental Design

- Research question: Can we predict whether a patient has CKD using routine lab and demographic data?
- Follow-up question: Building on the main research question, can we also predict the stage of CKD using lab values?
- The data was collected from 
- The variables: `age`, `bp`, `sg`, `al`, `su`, `rbc`, `pc`, `pcc`, `ba`, `bgr`, `bu`, `sc`, `sod`, `pot`, `hemo`, `pcv`, `wbcc`, `rbcc`, `htn`, `dm`, `cad`, `appet`, `pe`, `ane`, `class`


- Attribute descriptions:
  1. Age (numerical): `age`, in years
  2. Blood Pressure (numerical): `bp`, in mmHg
  3. Specific Gravity (nominal): `sg` (1.005, 1.010, 1.015, 1.020, 1.025)
  4. Albumin (nominal): `al` (0, 1, 2, 3, 4, 5)
  5. Sugar (nominal): `su` (0, 1, 2, 3, 4, 5)
  6. Red Blood Cells (nominal): `rbc` (normal, abnormal)
  7. Pus Cell (nominal): `pc` (normal, abnormal)
  8. Pus Cell Clumps (nominal): `pcc` (present, notpresent)
  9. Bacteria (nominal): `ba` (present, notpresent)
  10. Blood Glucose Random (numerical): `bgr`, in mg/dL
  11. Blood Urea (numerical): `bu`, in mg/dL
  12. Serum Creatinine (numerical): `sc`, in mg/dL
  13. Sodium (numerical): `sod`, in mEq/L
  14. Potassium (numerical): `pot`, in mEq/L
  15. Hemoglobin (numerical): `hemo`, in gms
  16. Packed Cell Volume (numerical): `pcv`
  17. White Blood Cell Count (numerical): `wbcc`, in cells/cumm
  18. Red Blood Cell Count (numerical): `rbcc`, in millions/cmm
  19. Hypertension (nominal): `htn` (yes, no)
  20. Diabetes Mellitus (nominal): `dm` (yes, no)
  21. Coronary Artery Disease (nominal): `cad` (yes, no)
  22. Appetite (nominal): `appet` (good, poor)
  23. Pedal Edema (nominal): `pe` (yes, no)
  24. Anemia (nominal): `ane` (yes, no)
  25. Class (nominal): `class` (ckd, notckd)
- Model evaluation
- Conclusions and recommendations.
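
The attribute list above can be written down as a small schema, which is handy later for choosing encoders and for spotting values that do not match the documented levels. This is only a sketch under stated assumptions: the constant and function names are hypothetical, and the levels are stored as strings exactly as listed, so real data may need normalizing first (for example `1.02` versus `1.020`).

```python
# Numerical attributes, using the column names from the variable list.
NUMERIC_COLS = ["age", "bp", "bgr", "bu", "sc", "sod", "pot",
                "hemo", "pcv", "wbcc", "rbcc"]

# Nominal attributes (including the target) and their documented levels.
NOMINAL_LEVELS = {
    "sg": ["1.005", "1.010", "1.015", "1.020", "1.025"],
    "al": ["0", "1", "2", "3", "4", "5"],
    "su": ["0", "1", "2", "3", "4", "5"],
    "rbc": ["normal", "abnormal"],
    "pc": ["normal", "abnormal"],
    "pcc": ["present", "notpresent"],
    "ba": ["present", "notpresent"],
    "htn": ["yes", "no"],
    "dm": ["yes", "no"],
    "cad": ["yes", "no"],
    "appet": ["good", "poor"],
    "pe": ["yes", "no"],
    "ane": ["yes", "no"],
    "class": ["ckd", "notckd"],
}


def unexpected_levels(column, values):
    """Return distinct non-missing values that are not documented levels."""
    allowed = set(NOMINAL_LEVELS[column])
    seen = []
    for v in values:
        if v is not None and v not in allowed and v not in seen:
            seen.append(v)
    return seen


# Made-up values: a stray leading space is flagged, None (missing) is ignored.
print(unexpected_levels("dm", ["yes", "no", None, " yes", "no"]))  # prints [' yes']
```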

## Data Understanding

In [1]:
# import relevant libraries
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
#import the data 
from convert_arff_to_csv import arff_to_dataframe
df = arff_to_dataframe("data/chronic_kidney_disease.arff")
df.head()

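| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wbcc | rbcc | htn | dm | cad | appet | pe | ane | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 48.0 | 80.0 | 1.02 | 1 | 0 | NaN | normal | notpresent | notpresent | 121.0 | ... | 44.0 | 7800.0 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 7.0 | 50.0 | 1.02 | 4 | 0 | NaN | normal | notpresent | notpresent | NaN | ... | 38.0 | 6000.0 | NaN | no | no | no | good | no | no | ckd |
| 2 | 62.0 | 80.0 | 1.01 | 2 | 3 | normal | normal | notpresent | notpresent | 423.0 | ... | 31.0 | 7500.0 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 48.0 | 70.0 | 1.005 | 4 | 0 | normal | abnormal | present | notpresent | 117.0 | ... | 32.0 | 6700.0 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 51.0 | 80.0 | 1.01 | 2 | 0 | normal | normal | notpresent | notpresent | 106.0 | ... | 35.0 | 7300.0 | 4.6 | no | no | no | good | no | no | ckd |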
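
The `arff_to_dataframe` helper comes from a local module that is not shown here. As a rough illustration of what such a loader has to do, the sketch below parses a simple dense ARFF document with the standard library only. It is not the author's implementation: `read_arff_rows` is a hypothetical name, and the parser assumes unquoted or single-token attribute names, no sparse rows, and `?` as the missing-value marker (the ARFF convention).

```python
import csv


def read_arff_rows(text):
    """Parse simple dense ARFF text into (column_names, list_of_row_dicts)."""
    names, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                # The second token is the attribute name, possibly quoted.
                names.append(line.split()[1].strip("'\""))
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [v.strip() for v in next(csv.reader([line]))]
        # '?' (or an empty field) marks a missing value; keep other values as strings.
        rows.append({n: (None if v in ("?", "") else v)
                     for n, v in zip(names, values)})
    return names, rows


# Tiny made-up sample in the same shape as the CKD file.
sample = """@relation ckd
@attribute 'age' numeric
@attribute 'rbc' {normal,abnormal}
@attribute 'class' {ckd,notckd}
@data
48,?,ckd
7,normal,ckd
"""
names, rows = read_arff_rows(sample)
print(names)
print(rows[0])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`, after which `pd.to_numeric` can convert the numerical columns.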


In [5]:
df.columns

Index(['age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr', 'bu',
       'sc', 'sod', 'pot', 'hemo', 'pcv', 'wbcc', 'rbcc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'class'],
      dtype='object')

In [6]:
#descriptive statistics for our data
df.describe(include='number')

| | age | bp | bgr | bu | sc | sod | pot | hemo | pcv | wbcc | rbcc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 391.0 | 388.0 | 356.0 | 381.0 | 383.0 | 313.0 | 312.0 | 348.0 | 329.0 | 294.0 | 269.0 |
| mean | 51.483376 | 76.469072 | 148.036517 | 57.425722 | 3.072454 | 137.528754 | 4.627244 | 12.526437 | 38.884498 | 8406.122449 | 4.707435 |
| std | 17.169714 | 13.683637 | 79.281714 | 50.503006 | 5.741126 | 10.408752 | 3.193904 | 2.912587 | 8.990105 | 2944.47419 | 1.025323 |
| min | 2.0 | 50.0 | 22.0 | 1.5 | 0.4 | 4.5 | 2.5 | 3.1 | 9.0 | 2200.0 | 2.1 |
| 25% | 42.0 | 70.0 | 99.0 | 27.0 | 0.9 | 135.0 | 3.8 | 10.3 | 32.0 | 6500.0 | 3.9 |
| 50% | 55.0 | 80.0 | 121.0 | 42.0 | 1.3 | 138.0 | 4.4 | 12.65 | 40.0 | 8000.0 | 4.8 |
| 75% | 64.5 | 80.0 | 163.0 | 66.0 | 2.8 | 142.0 | 4.9 | 15.0 | 45.0 | 9800.0 | 5.4 |
| max | 90.0 | 180.0 | 490.0 | 391.0 | 76.0 | 163.0 | 47.0 | 17.8 | 54.0 | 26400.0 | 8.0 |


In [7]:
df.shape

(400, 25)

### Data Pre-processing

In [8]:
# check for missing values
df.isnull().sum()

age        9
bp        12
sg        47
al        46
su        49
rbc      152
pc        65
pcc        4
ba         4
bgr       44
bu        19
sc        17
sod       87
pot       88
hemo      52
pcv       71
wbcc     106
rbcc     131
htn        2
dm         3
cad        2
appet      1
pe         1
ane        1
class      0
dtype: int64
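
Raw counts are easier to interpret as a share of the 400 rows reported by `df.shape`. The sketch below redoes that arithmetic in plain Python from the counts printed above. The 30% cut-off for "heavily missing" is an illustrative assumption, not a threshold chosen in this notebook. On the DataFrame itself, `df.isnull().mean()` gives the same per-column shares.

```python
# Missing-value counts copied from the df.isnull().sum() output above.
missing = {
    "age": 9, "bp": 12, "sg": 47, "al": 46, "su": 49, "rbc": 152, "pc": 65,
    "pcc": 4, "ba": 4, "bgr": 44, "bu": 19, "sc": 17, "sod": 87, "pot": 88,
    "hemo": 52, "pcv": 71, "wbcc": 106, "rbcc": 131, "htn": 2, "dm": 3,
    "cad": 2, "appet": 1, "pe": 1, "ane": 1, "class": 0,
}
n_rows = 400  # from df.shape

# Share of missing rows per column, sorted from most to least missing.
share = {col: count / n_rows for col, count in missing.items()}
ranked = sorted(share.items(), key=lambda kv: kv[1], reverse=True)

HIGH_MISSING = 0.30  # hypothetical cut-off, for illustration only
heavy = [col for col, frac in ranked if frac >= HIGH_MISSING]

for col, frac in ranked[:5]:
    print(f"{col:>5}: {frac:.3f}")
print("heavily missing:", heavy)
```

Under that assumed cut-off only `rbc` and `rbcc` are flagged, while the target `class` has no missing values at all.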

In [9]:
#check for duplicates
df.duplicated().sum()

0