## Chronic Kidney Disease (CKD) using RandomForestClassifier

[Dataset](https://www.kaggle.com/datasets/mansoordaku/ckdisease)

[Google Colab Notebook](https://colab.research.google.com/drive/1HJJvZdouj6NnH8TcEeUGNBzsvM8Rr6aK)

[GitHub](https://github.com/z5208980/machine-learning-health/tree/main/chronic_kidney_disease)

Chronic Kidney Disease (CKD) is the gradual loss of kidney function. As it progresses, the body becomes less able to filter waste and excess fluid from the blood, which can cause them to build up. CKD is a progressive disease with 5 stages, Stage 5 being the most severe. The stage is mainly determined from the GFR or eGFR, with each stage corresponding to a range of values. (e)GFR stands for (estimated) glomerular filtration rate; it measures kidney function and is calculated from blood creatinine together with age, weight and other factors.
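
As a rough illustration of how eGFR ranges map to stages, here is a minimal sketch. The function name `ckd_stage` is hypothetical, and the cut-offs are the commonly cited eGFR bands (90, 60, 30 and 15 mL/min/1.73 m²). In clinical practice Stages 1 and 2 also require other evidence of kidney damage, and Stage 3 is often split into 3a and 3b, so treat this as a simplification.

```python
def ckd_stage(egfr: float) -> int:
    """Map an eGFR value (mL/min/1.73 m^2) to a CKD stage from 1 to 5.

    Simplified: uses eGFR alone and does not split Stage 3 into 3a/3b.
    """
    if egfr < 0:
        raise ValueError("eGFR cannot be negative")
    if egfr >= 90:
        return 1
    if egfr >= 60:
        return 2
    if egfr >= 30:
        return 3  # often split into 3a (45-59) and 3b (30-44)
    if egfr >= 15:
        return 4
    return 5


for value in (95, 72, 41, 20, 9):
    print(f"eGFR {value:>3} -> stage {ckd_stage(value)}")
```

Note that the dataset used here does not contain an eGFR column; the model instead learns from related measurements such as serum creatinine (`sc`).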

The dataset can be used to help classify whether an individual has Chronic Kidney Disease based on their measurements, which could assist doctors in deciding whether to take further action. An ML model trained on it could serve as a diagnostic aid or an early-detection tool for CKD.

A note about this dataset is that it was collected in India. Geographical differences may affect how well the results generalise, but that is not a concern for this exercise. The target is the column `classification`, which takes the values *ckd* (the patient likely has chronic kidney disease) or *notckd* (the patient likely does not). The feature column names are abbreviated and a bit difficult to interpret. Besides `age`, they are:

- bp: Blood Pressure (mmHg)
- sg: Specific Gravity (of urine; a measure of how concentrated the urine is)
- al: Albumin
- su: Sugar Level
- rbc: Red Blood Cells 
- pc: Pus Cells (Related to dead white blood cells)
- pcc: Pus Cell Clumps
- ba: Bacteria
- bgr: Random Blood Glucose (mg/dL)
- bu: Blood Urea (mg/dL)
- sc: Serum Creatinine (mg/dL)
- sod: Sodium (mEq/L) 
- pot: Potassium (mEq/L)
- hemo: Hemoglobin (gms)  
- pcv: Packed Cell Volume   
- wc: White Blood Cells count
- rc: Red Blood Cells count
- htn: Hypertension
- dm: Diabetes Mellitus
- cad: Coronary Artery Disease
- appet: Appetite
- pe: Pedal Edema (swelling or puffiness of the feet)
- ane: Anemia




In [50]:
import numpy as np
import pandas as pd
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize, LabelEncoder, MinMaxScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, mean_squared_error, precision_recall_curve
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

In [43]:
# Loading and inspecting the data

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/z5208980/machine-learning-health/main/chronic_kidney_disease/data/data.csv')

print(f"There are {df.shape[0]} rows with {df.shape[1]} columns including the target")

# Preview the first rows
df.head(5)

There are 400 rows with 26 columns including the target


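| | id | age | bp | sg | al | su | rbc | pc | pcc | ba | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 48.0 | 80.0 | 1.02 | 1.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 44 | 7800 | 5.2 | yes | yes | no | good | no | no | ckd |
| 1 | 1 | 7.0 | 50.0 | 1.02 | 4.0 | 0.0 | NaN | normal | notpresent | notpresent | ... | 38 | 6000 | NaN | no | no | no | good | no | no | ckd |
| 2 | 2 | 62.0 | 80.0 | 1.01 | 2.0 | 3.0 | normal | normal | notpresent | notpresent | ... | 31 | 7500 | NaN | no | yes | no | poor | no | yes | ckd |
| 3 | 3 | 48.0 | 70.0 | 1.005 | 4.0 | 0.0 | normal | abnormal | present | notpresent | ... | 32 | 6700 | 3.9 | yes | no | no | poor | yes | yes | ckd |
| 4 | 4 | 51.0 | 80.0 | 1.01 | 2.0 | 0.0 | normal | normal | notpresent | notpresent | ... | 35 | 7300 | 4.6 | no | no | no | good | no | no | ckd |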


In [44]:
# Processing the data

# Remove unnecessary features
df.drop("id", axis=1, inplace=True)

# Drop every row that contains a NaN rather than filling values in, since filled-in values might affect the model
df.dropna(axis=0, inplace=True)

# Some values in this column carry stray whitespace, such as '\tyes' or '\tno'
df.dm = df.dm.str.strip()

# Note that only 158 of the 400 rows (about 40%) remain, so roughly 60% of the data is dropped; that is accepted here
print(f"There are {df.shape[0]} rows with {df.shape[1]} columns including the target")

There are 158 rows with 25 columns including the target
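
Because dropping every incomplete row discards most of the dataset, imputation is a common alternative worth comparing against. The sketch below is not what this notebook does; it uses a made-up toy frame (the values and the `numeric_cols` / `categorical_cols` lists are assumptions for illustration) and fills numeric gaps with the median and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few CKD columns (values are made up)
toy = pd.DataFrame({
    "age": [48.0, 7.0, np.nan, 51.0],
    "bp": [80.0, np.nan, 80.0, 70.0],
    "rbc": ["normal", np.nan, "normal", "abnormal"],
})

numeric_cols = ["age", "bp"]
categorical_cols = ["rbc"]

# Numeric columns: replace missing values with the column median
for col in numeric_cols:
    toy[col] = toy[col].fillna(toy[col].median())

# Categorical columns: replace missing values with the most frequent value
for col in categorical_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
print("Remaining NaN values:", int(toy.isna().sum().sum()))
```

In a real experiment the medians and modes would ideally be computed on the training split only, so that no information from the test rows leaks into preprocessing.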


In [46]:
# Encoding

# Encode categorical features. Note: LabelEncoder assigns codes in sorted order, so ckd = 0, notckd = 1
encode_features = ["htn", "dm", "cad", "pe", "ane", "rbc", "pc", "appet", "pcc", "ba", "classification"]
for feature in encode_features:
  encoder = LabelEncoder()
  encoder.fit(df[feature])
  df[feature] = encoder.transform(df[feature])

# Scaling for numerical features
scaler_features = ["age", "bp", "sg", "al", "bgr", "wc", "rc", "bu", "sc", "sod", "pot", "hemo", "pcv"]
for feature in scaler_features:
  scaler = StandardScaler()
  df[feature] = scaler.fit_transform(df[[feature]])

# # Save
# filename = '/content/sample_data/processed.csv'
# df.to_csv(filename)  

df.head()

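| | age | bp | sg | al | su | rbc | pc | pcc | ba | bgr | ... | pcv | wc | rc | htn | dm | cad | appet | pe | ane | classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | -0.101098 | -0.363613 | -2.713365 | 2.273474 | 0.0 | 1 | 0 | 1 | 0 | -0.221549 | ... | -1.092705 | -0.569768 | -0.976025 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 9 | 0.222253 | 1.431726 | 0.023092 | 0.853676 | 0.0 | 0 | 0 | 1 | 0 | -0.947597 | ... | -1.423236 | 1.162684 | -1.17285 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |
| 11 | 0.868954 | -0.363613 | -1.801213 | 1.563575 | 0.0 | 0 | 0 | 1 | 0 | 3.841231 | ... | -1.092705 | -1.275582 | -1.074438 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 14 | 1.192305 | 0.534056 | -1.801213 | 1.563575 | 2.0 | 1 | 0 | 1 | 1 | 0.396364 | ... | -2.855537 | 0.809777 | -2.255385 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| 20 | 0.739614 | 0.534056 | -0.88906 | 0.853676 | 0.0 | 0 | 0 | 0 | 0 | 0.643529 | ... | -1.974121 | 0.232293 | -1.664911 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |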
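
To see which integer each label receives, it helps to inspect the encoder directly. The short sketch below uses a toy series rather than the real column (an assumption for illustration): `LabelEncoder` sorts the classes, so `ckd` maps to 0 and `notckd` to 1, which matches the `classification` values of 0 shown for the *ckd* rows above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target column with the two labels used in the dataset
labels = pd.Series(["ckd", "notckd", "ckd", "notckd"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted; each class's position in it is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns model outputs back into readable labels
decoded = encoder.inverse_transform([0, 1])
print(list(decoded))
```

Keeping each fitted encoder (for example in a dictionary keyed by column name) instead of overwriting it in the loop makes it possible to decode predictions and to encode new patient data consistently later.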


In [47]:
X = df.drop("classification", axis=1)
y = df.classification

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=200)

The chosen model is **RandomForestClassifier**, which yields 100% accuracy in training and testing. No parameters are required, since the defaults are used. However, a perfect score is suspicious and is thought to indicate overfitting, especially given that only 158 rows remain after cleaning.
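
One way to probe whether a perfect score reflects overfitting or a lucky split is k-fold cross-validation, which reports accuracy across several different train/test partitions. The sketch below is an illustration only: it uses synthetic data from `make_classification` shaped like the cleaned dataset (158 rows, 24 features), so the numbers it prints say nothing about the real CKD data. The names `X_demo`, `y_demo` and `forest` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape as the cleaned data (158 rows, 24 features)
X_demo, y_demo = make_classification(
    n_samples=158, n_features=24, n_informative=8, random_state=0
)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
forest = RandomForestClassifier(random_state=0)

scores = cross_val_score(forest, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

On the real data, `X` and `y` from the notebook could be passed in place of `X_demo` and `y_demo`. Consistently high scores across folds would be more convincing than a single split, although scaling the whole dataset before splitting still leaks some information.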

## Using the model

To use the model, supply all 24 features (`age` plus the 23 listed in the introduction) in the same column order, encoded and scaled the same way as the training data. In the demo below, we pick one of the training rows as dummy data to test the model.


In [48]:
model = RandomForestClassifier()
model.fit(X_train, y_train)

y_pred_class = model.predict(X_test)

print('RESULT')
print('Accuracy:', metrics.accuracy_score(y_test, y_pred_class))

# Save the fitted model; the with-block closes the file handle afterwards
filename = '/content/sample_data/model.sav'
with open(filename, 'wb') as f:
  pickle.dump(model, f)

RESULT
Accuracy: 1.0


In [51]:
# Load the saved model
with open('/content/sample_data/model.sav', 'rb') as f:
  model = pickle.load(f)

# Use one training row as dummy input; double brackets keep it a one-row DataFrame
row = 23
sample = X_train.iloc[[row]]
prediction = model.predict(sample)

print("X=%s, Predicted=%s, Actually=%s" % (sample.iloc[0].tolist(), prediction[0], y_train.iloc[row]))

X=[-1.3298317658438865, 0.5340564282548816, 0.023092471320819787, -0.5661220593180517, 0.0, 1.0, 1.0, 0.0, 0.0, -0.7467756043907438, -0.07568922380695657, -0.4200345919696298, 0.020346261311939572, -0.385737603314116, 1.3270334711364955, -0.21128879998624175, 0.45687028110698197, -0.18872732431074937, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Predicted=1, Actually=1
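
The demo above works because the row was already encoded and scaled. For genuinely new patient data, the same transformations have to be reapplied, which is easier if preprocessing is saved together with the model. The following is a minimal sketch of that idea using scikit-learn's `Pipeline` and `ColumnTransformer`; it is not the notebook's method, and the toy rows, the binary labels and the names `raw`, `target`, `pipeline` and `new_patient` are made up for illustration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Made-up rows using three of the dataset's column names
raw = pd.DataFrame({
    "age": [48, 62, 35, 51, 29, 70],
    "sc": [1.2, 3.8, 0.9, 4.1, 0.8, 5.0],
    "htn": ["yes", "yes", "no", "yes", "no", "yes"],
})
target = np.array([0, 0, 1, 0, 1, 0])  # made-up binary labels

# Scale numeric columns and encode the categorical one inside the pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "sc"]),
    ("cat", OrdinalEncoder(), ["htn"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(raw, target)

# Round-trip through pickle in memory, then predict on an unscaled, unencoded row
restored = pickle.loads(pickle.dumps(pipeline))
new_patient = pd.DataFrame({"age": [55], "sc": [4.5], "htn": ["yes"]})
result = restored.predict(new_patient)
print(result)
```

Fitting the pipeline on the training split only also means the scaler never sees the test rows, which avoids the leakage that comes from scaling before splitting.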
