# Chronic Kidney Disease

![Chronic kidney disease](chronic-kidney-disease.jpg)

## Data Set Information:

The dataset uses the following abbreviations for its attributes:

    age - age
    bp - blood pressure
    sg - specific gravity
    al - albumin
    su - sugar
    rbc - red blood cells
    pc - pus cell
    pcc - pus cell clumps
    ba - bacteria
    bgr - blood glucose random
    bu - blood urea
    sc - serum creatinine
    sod - sodium
    pot - potassium
    hemo - hemoglobin
    pcv - packed cell volume
    wc - white blood cell count
    rc - red blood cell count
    htn - hypertension
    dm - diabetes mellitus
    cad - coronary artery disease
    appet - appetite
    pe - pedal edema
    ane - anemia
    class - class 
    
## Attribute Information:

The dataset has 24 attributes plus the class label, 25 columns in total (11 numeric, 14 nominal).

    1. Age                        (numerical)   age in years
    2. Blood Pressure             (numerical)   bp in mmHg
    3. Specific Gravity           (nominal)     sg - (1.005, 1.010, 1.015, 1.020, 1.025)
    4. Albumin                    (nominal)     al - (0, 1, 2, 3, 4, 5)
    5. Sugar                      (nominal)     su - (0, 1, 2, 3, 4, 5)
    6. Red Blood Cells            (nominal)     rbc - (normal, abnormal)
    7. Pus Cell                   (nominal)     pc - (normal, abnormal)
    8. Pus Cell Clumps            (nominal)     pcc - (present, notpresent)
    9. Bacteria                   (nominal)     ba - (present, notpresent)
    10. Blood Glucose Random      (numerical)   bgr in mg/dL
    11. Blood Urea                (numerical)   bu in mg/dL
    12. Serum Creatinine          (numerical)   sc in mg/dL
    13. Sodium                    (numerical)   sod in mEq/L
    14. Potassium                 (numerical)   pot in mEq/L
    15. Hemoglobin                (numerical)   hemo in g
    16. Packed Cell Volume        (numerical)   pcv (unit not given)
    17. White Blood Cell Count    (numerical)   wc in cells/cumm
    18. Red Blood Cell Count      (numerical)   rc in millions/cmm
    19. Hypertension              (nominal)     htn - (yes, no)
    20. Diabetes Mellitus         (nominal)     dm - (yes, no)
    21. Coronary Artery Disease   (nominal)     cad - (yes, no)
    22. Appetite                  (nominal)     appet - (good, poor)
    23. Pedal Edema               (nominal)     pe - (yes, no)
    24. Anemia                    (nominal)     ane - (yes, no)
    25. Class                     (nominal)     class - (ckd, notckd)
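
The attribute list above can be written down as two column groups, which makes the "11 numeric, 14 nominal" count easy to check. This is a minimal sketch: `NUMERIC`, `NOMINAL` and `check_schema` are hypothetical names introduced here, and the sample column list at the bottom is taken from the `df.info()` output later in this README, where the CSV uses `spec_gravity`, `wbcc` and `rbcc` rather than `sg`, `wc` and `rc`.

```python
# Column groups copied from the attribute table (short names).
NUMERIC = ["age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wc", "rc"]
NOMINAL = ["sg", "al", "su", "rbc", "pc", "pcc", "ba",
           "htn", "dm", "cad", "appet", "pe", "ane", "class"]


def check_schema(columns):
    """Return the column names that appear in neither group (hypothetical helper)."""
    known = set(NUMERIC) | set(NOMINAL)
    return [c for c in columns if c not in known]


print(len(NUMERIC), len(NOMINAL))

# A few column names as they appear in the CSV used below.
unknown = check_schema(["age", "spec_gravity", "wbcc", "rbcc", "class"])
print(unknown)
```

Any name reported by `check_schema` needs to be mapped to its abbreviation before the two groups can be used to select columns.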

## Importing Modules

In [21]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd 
import pandas_profiling as pp
import seaborn as sns
import plotly_express as px 

from sklearn.impute import SimpleImputer
from impyute.imputation.cs import fast_knn
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import RFE

from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import accuracy_score

## Defining functions for modelling

In [22]:
def print_line():
    """Print a horizontal separator."""
    print("----------------------------------------------------------")

def train(train_x, test_x, train_y, test_y, clf):
    """Fit clf on the training split and return its test accuracy in percent."""
    return clf.fit(train_x, train_y).score(test_x, test_y) * 100


def model(X, y, test_size=0.25, shuffle=True):
    """Train five classifiers on one train/test split and print each accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=test_size, shuffle=shuffle)
    classifiers = [
        ("Extra Tree Classifier", ExtraTreeClassifier()),
        ("Decision Tree Classifier", DecisionTreeClassifier()),
        ("Random Forest Classifier", RandomForestClassifier()),
        ("XGBClassifier", XGBClassifier()),
        ("MLP Classifier", MLPClassifier()),
    ]
    for name, clf in classifiers:
        print("Accuracy of {} is {}%".format(name, train(train_x, test_x, train_y, test_y, clf)))
        print_line()
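
The helper above prints its results and uses an unseeded split, so two runs can give different numbers. A possible variant, sketched below under assumptions, returns the accuracies as a dictionary and fixes the split with a seed and stratification. `evaluate`, `seed`, `X_demo` and `y_demo` are hypothetical names, the data is synthetic (not the kidney dataset), and only scikit-learn classifiers are used to keep the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


def evaluate(X, y, classifiers, test_size=0.25, seed=0):
    """Fit each classifier on one stratified split; return test accuracy in percent."""
    train_x, test_x, train_y, test_y = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    return {name: clf.fit(train_x, train_y).score(test_x, test_y) * 100
            for name, clf in classifiers.items()}


# Synthetic stand-in with the same shape as the imputed table (400 rows, 24 features).
X_demo, y_demo = make_classification(n_samples=400, n_features=24, random_state=0)

scores = evaluate(X_demo, y_demo, {
    "extra_tree": ExtraTreeClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
})
for name, acc in scores.items():
    print("Accuracy of {} is {:.2f}%".format(name, acc))
```

Returning a dictionary makes it straightforward to compare runs or collect results into a table.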

## Dataset import and overview

In [23]:
df = pd.read_csv('chronic_kidney.csv')
new_df = df.copy()  # work on a copy so df keeps the raw values
print(df.head(15))
print_line()
print(df.info())

     age     bp  spec_gravity   al   su       rbc        pc         pcc  \
0   48.0   80.0         1.020  1.0  0.0       NaN    normal  notpresent   
1    7.0   50.0         1.020  4.0  0.0       NaN    normal  notpresent   
2   62.0   80.0         1.010  2.0  3.0    normal    normal  notpresent   
3   48.0   70.0         1.005  4.0  0.0    normal  abnormal     present   
4   51.0   80.0         1.010  2.0  0.0    normal    normal  notpresent   
5   60.0   90.0         1.015  3.0  0.0       NaN       NaN  notpresent   
6   68.0   70.0         1.010  0.0  0.0       NaN    normal  notpresent   
7   24.0    NaN         1.015  2.0  4.0    normal  abnormal  notpresent   
8   52.0  100.0         1.015  3.0  0.0    normal  abnormal     present   
9   53.0   90.0         1.020  2.0  0.0  abnormal  abnormal     present   
10  50.0   60.0         1.010  2.0  4.0       NaN  abnormal     present   
11  63.0   70.0         1.010  3.0  0.0  abnormal  abnormal     present   
12  68.0   70.0         1
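
The preview shows `NaN` entries in several columns. A quick way to see how much is missing per column is `isna().sum()`. The sketch below uses a hypothetical four-row frame `toy` whose values are copied from rows 0, 1, 2 and 7 of the preview, restricted to three columns.

```python
import numpy as np
import pandas as pd

# Rows 0, 1, 2 and 7 of the preview above, three columns only.
toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0, 24.0],
    "bp": [80.0, 50.0, 80.0, np.nan],
    "rbc": [np.nan, np.nan, "normal", "normal"],
})

# Count missing values per column, most incomplete column first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

On the full table the same call would be `new_df.isna().sum()`.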

## Rough analysis using Pandas Profiling

In [24]:
pp.ProfileReport(new_df)



## Imputing Object (Categorical) Columns

In [25]:
# Fill missing values in each categorical (object) column with its most frequent value
for i in new_df.columns:
    if new_df[i].dtype == 'object':
        new_df[i] = new_df[i].fillna(new_df[i].mode()[0])
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
age             391 non-null float64
bp              388 non-null float64
spec_gravity    353 non-null float64
al              354 non-null float64
su              351 non-null float64
rbc             400 non-null object
pc              400 non-null object
pcc             400 non-null object
ba              400 non-null object
bgr             356 non-null float64
bu              381 non-null float64
sc              383 non-null float64
sod             313 non-null float64
pot             312 non-null float64
hemo            348 non-null float64
pcv             329 non-null float64
wbcc            294 non-null float64
rbcc            269 non-null float64
htn             400 non-null object
dm              400 non-null object
cad             400 non-null object
appet           400 non-null object
pe              400 non-null object
ane             400 non-null object
class           4
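
The effect of the mode fill is easiest to see on a small frame. The sketch below is an illustration under assumptions: `toy` is a hypothetical frame with made-up values, and non-numeric columns are picked with `select_dtypes(exclude="number")` instead of the dtype comparison used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "rbc": [np.nan, "normal", "normal", "abnormal"],
    "age": [48.0, np.nan, 62.0, 48.0],
})

# Fill only the non-numeric columns; numeric gaps are left for a later step.
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The missing `rbc` entry becomes `normal`, the most frequent value, while the gap in `age` is untouched.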

## Label Encoding 

In [26]:
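# Replace each categorical column with integer codes.
# The same encoder object is refit for every column, so the per-column mappings are not kept.
enc = LabelEncoder()
for i in new_df.columns:
    if new_df[i].dtype == 'object':
        new_df[i] = enc.fit_transform(new_df[i].astype(str))
new_df.info()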

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
age             391 non-null float64
bp              388 non-null float64
spec_gravity    353 non-null float64
al              354 non-null float64
su              351 non-null float64
rbc             400 non-null int64
pc              400 non-null int64
pcc             400 non-null int64
ba              400 non-null int64
bgr             356 non-null float64
bu              381 non-null float64
sc              383 non-null float64
sod             313 non-null float64
pot             312 non-null float64
hemo            348 non-null float64
pcv             329 non-null float64
wbcc            294 non-null float64
rbcc            269 non-null float64
htn             400 non-null int64
dm              400 non-null int64
cad             400 non-null int64
appet           400 non-null int64
pe              400 non-null int64
ane             400 non-null int64
class           400 non-nul
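
Because a single `LabelEncoder` is refit for each column, the mapping from codes back to labels is lost after the loop. One possible alternative, sketched here with a hypothetical frame `toy` and a hypothetical `encoders` dictionary, keeps one encoder per column so the codes can be inspected or inverted later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "rbc": ["normal", "abnormal", "normal"],
    "htn": ["yes", "no", "no"],
})

# One fitted encoder per column, keyed by column name.
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col].astype(str))

# classes_ lists the labels in code order: index 0 is code 0, and so on.
mapping = {col: enc.classes_.tolist() for col, enc in encoders.items()}
print(mapping)

# Codes can be turned back into labels with inverse_transform.
restored = encoders["rbc"].inverse_transform(toy["rbc"]).tolist()
print(restored)
```

`LabelEncoder` assigns codes in sorted label order, so `abnormal` becomes 0 and `normal` becomes 1.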

## Imputation using FastKNN

In [27]:
fknn = fast_knn(new_df, k = 3)
fknn.columns = new_df.columns
fknn

,age,bp,spec_gravity,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wbcc,rbcc,htn,dm,cad,appet,pe,ane,class
0,48.0,80.0,1.020,1.0,0.0,1.0,1.0,0.0,0.0,121.000000,...,44.0,7800.0,5.200000,1.0,2.0,0.0,0.0,0.0,0.0,0.0
1,7.0,50.0,1.020,4.0,0.0,1.0,1.0,0.0,0.0,118.794048,...,38.0,6000.0,5.231679,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,62.0,80.0,1.010,2.0,3.0,1.0,1.0,0.0,0.0,423.000000,...,31.0,7500.0,3.502840,0.0,2.0,0.0,1.0,0.0,1.0,0.0
3,48.0,70.0,1.005,4.0,0.0,1.0,0.0,1.0,0.0,117.000000,...,32.0,6700.0,3.900000,1.0,1.0,0.0,1.0,1.0,1.0,0.0
4,51.0,80.0,1.010,2.0,0.0,1.0,1.0,0.0,0.0,106.000000,...,35.0,7300.0,4.600000,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
395,55.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,140.000000,...,47.0,6700.0,4.900000,0.0,1.0,0.0,0.0,0.0,0.0,1.0
396,42.0,70.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,75.000000,...,54.0,7800.0,6.200000,0.0,1.0,0.0,0.0,0.0,0.0,1.0
397,12.0,80.0,1.020,0.0,0.0,1.0,1.0,0.0,0.0,100.000000,...,49.0,6600.0,5.400000,0.0,1.0,0.0,0.0,0.0,0.0,1.0
398,17.0,60.0,1.025,0.0,0.0,1.0,1.0,0.0,0.0,114.000000,...,51.0,7200.0,5.900000,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [28]:
fknn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 25 columns):
age             400 non-null float64
bp              400 non-null float64
spec_gravity    400 non-null float64
al              400 non-null float64
su              400 non-null float64
rbc             400 non-null float64
pc              400 non-null float64
pcc             400 non-null float64
ba              400 non-null float64
bgr             400 non-null float64
bu              400 non-null float64
sc              400 non-null float64
sod             400 non-null float64
pot             400 non-null float64
hemo            400 non-null float64
pcv             400 non-null float64
wbcc            400 non-null float64
rbcc            400 non-null float64
htn             400 non-null float64
dm              400 non-null float64
cad             400 non-null float64
appet           400 non-null float64
pe              400 non-null float64
ane             400 non-null float64
class  
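
For readers without `impyute`, a similar nearest-neighbour fill can be sketched with scikit-learn's `KNNImputer`. This is a swapped-in technique, not the `fast_knn` call used above, and its results may differ. `toy` is a hypothetical frame; the `age` and `bp` values follow the preview rows and the `sc` values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0, 48.0, 51.0],
    "bp": [80.0, 50.0, 80.0, 70.0, np.nan],
    "sc": [1.2, 0.8, np.nan, 3.8, 1.4],  # made-up values for illustration
})

# Each missing entry is replaced by the mean of that column over the
# 3 nearest rows, with distances computed on the observed columns.
imputer = KNNImputer(n_neighbors=3)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
print(filled)
```

Wrapping the result in a `DataFrame` with the original columns and index keeps the table aligned with the input, since `fit_transform` returns a plain array.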

## Feature selection using RFE - Recursive Feature Elimination

In [29]:
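xgb = XGBClassifier(learning_rate=0.001, n_estimators=1000, n_jobs=-1)
# Keep 15 features, dropping up to 5 per iteration
sel = RFE(xgb, n_features_to_select=15, step=5, verbose=2)
feature_df = fknn.drop(columns='class')
sel = sel.fit(feature_df, fknn['class'])
# Pair each feature name with its selection flag
features = list(zip(feature_df.columns, sel.support_))
features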

Fitting estimator with 24 features.
Fitting estimator with 19 features.


[('age', False),
 ('bp', False),
 ('spec_gravity', True),
 ('al', True),
 ('su', True),
 ('rbc', True),
 ('pc', True),
 ('pcc', True),
 ('ba', True),
 ('bgr', True),
 ('bu', True),
 ('sc', True),
 ('sod', True),
 ('pot', False),
 ('hemo', True),
 ('pcv', True),
 ('wbcc', False),
 ('rbcc', False),
 ('htn', False),
 ('dm', False),
 ('cad', False),
 ('appet', False),
 ('pe', True),
 ('ane', True)]

In [30]:
# Keep only the names flagged True by RFE
selected_features = [name for name, keep in features if keep]
selected_features

['spec_gravity',
 'al',
 'su',
 'rbc',
 'pc',
 'pcc',
 'ba',
 'bgr',
 'bu',
 'sc',
 'sod',
 'hemo',
 'pcv',
 'pe',
 'ane']
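
The selection mask can also be applied directly to the column index, which yields both the names and the reduced feature table. The sketch below runs on synthetic data with a decision tree as the estimator, so it is not the XGBoost setup used above; `X_demo`, `y_demo` and the `f0`..`f9` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 200 rows, 10 features, 4 of them informative.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(10)])

# Eliminate up to 2 features per iteration until 5 remain.
sel = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=5, step=2)
sel.fit(X_demo, y_demo)

# support_ is a boolean mask aligned with the input columns.
selected = X_demo.columns[sel.support_].tolist()
X_selected = X_demo[selected]
print(selected)
```

On the kidney data the equivalent reduced table would be `fknn[selected_features]`.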

In [31]:
# All 24 features are used here; selected_features from RFE is not applied
X = fknn.drop(columns='class')
y = fknn['class']

## Modeling 

In [33]:
# An integer test_size is an absolute count: 15 rows are held out, not 15%
model(X, y, test_size = 15, shuffle = True)

Accuracy of Extra Tree Classifier is 100.0%
----------------------------------------------------------
Accuracy of Decision Tree Classifier is 93.33333333333333%
----------------------------------------------------------
Accuracy of Random Forest Classifier is 100.0%
----------------------------------------------------------
Accuracy of XGBClassifier is 100.0%
----------------------------------------------------------
Accuracy of MLP Classifier is 93.33333333333333%
----------------------------------------------------------
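
With 15 test rows, each misclassified row changes the accuracy by about 6.67 points, and 93.33% corresponds to 14 of 15 correct. Averaging over several folds gives a steadier estimate. The sketch below shows stratified 5-fold cross-validation on synthetic data, not the kidney dataset; `X_demo`, `y_demo` and the seed values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# One error on a 15-row test set gives the 93.33% seen above.
print(round(14 / 15 * 100, 2))  # 93.33

# Synthetic stand-in with the same shape as the imputed table.
X_demo, y_demo = make_classification(n_samples=400, n_features=24, random_state=0)

# Five stratified folds: every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))
```

The spread across folds also indicates how much a single-split accuracy can vary.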


## Conclusion 
    
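    With shuffle=True, ExtraTreeClassifier, XGBClassifier and RandomForestClassifier are the best-performing algorithms.
    With shuffle=False, DecisionTreeClassifier, XGBClassifier and RandomForestClassifier are the best-performing algorithms.
    These accuracies come from a single split with 15 test rows (test_size = 15), so they are rough estimates.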