## Data Cleaning

Before starting with analyzing the data, it is important for us to clean our dataset.

Importing necessary packages 

In [1]:
import pandas as pd
import numpy as np
import datetime
import sklearn
from sklearn.cross_validation import train_test_split 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.grid_search import GridSearchCV
from sklearn import linear_model
from sklearn.metrics import *
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()





Importing dataset.

In [2]:
df=pd.read_csv('kidney_disease.csv')

In [3]:
df.columns

Index(['id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
       'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')

In [4]:
df.shape

(400, 26)

Let's find out whether our data contains any missing data or not.

In [5]:
df.isnull().sum()

id                  0
age                 9
bp                 12
sg                 47
al                 46
su                 49
rbc               152
pc                 65
pcc                 4
ba                  4
bgr                44
bu                 19
sc                 17
sod                87
pot                88
hemo               52
pcv                70
wc                105
rc                130
htn                 2
dm                  2
cad                 2
appet               1
pe                  1
ane                 1
classification      0
dtype: int64

From the above code, we can understand that there are lots of missing values in almost all columns. It is difficult for us to proceed with analyzing these data if we don't remove these missing data on first place.

Let's find out what are different types of data in our dataset.

In [6]:
df.dtypes

id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object

We can see that many colummns contains data which are not in either float or integer format. This can hinder our analysis. So, it is advisable for us to change these to numeric data.

In [7]:
df

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd
5,5,60.0,90.0,1.015,3.0,0.0,,,notpresent,notpresent,...,39,7800,4.4,yes,yes,no,good,yes,no,ckd
6,6,68.0,70.0,1.010,0.0,0.0,,normal,notpresent,notpresent,...,36,,,no,no,no,good,no,no,ckd
7,7,24.0,,1.015,2.0,4.0,normal,abnormal,notpresent,notpresent,...,44,6900,5,no,yes,no,good,yes,no,ckd
8,8,52.0,100.0,1.015,3.0,0.0,normal,abnormal,present,notpresent,...,33,9600,4,yes,yes,no,good,no,yes,ckd
9,9,53.0,90.0,1.020,2.0,0.0,abnormal,abnormal,present,notpresent,...,29,12100,3.7,yes,yes,no,poor,no,yes,ckd


We found that there are special characters in our dataset. This can hinder our analysis. So, replacing it with `null`

In [8]:
df.replace(to_replace="\t?",value=np.nan,inplace=True)

In [10]:
df.dtypes

id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv                object
wc                 object
rc                 object
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object

In [11]:
df['rbc'].unique()

array([nan, 'normal', 'abnormal'], dtype=object)

In [12]:
df.columns

Index(['id', 'age', 'bp', 'sg', 'al', 'su', 'rbc', 'pc', 'pcc', 'ba', 'bgr',
       'bu', 'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad',
       'appet', 'pe', 'ane', 'classification'],
      dtype='object')

In [13]:
df

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.020,1.0,0.0,,normal,notpresent,notpresent,...,44,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.020,4.0,0.0,,normal,notpresent,notpresent,...,38,6000,,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.010,2.0,3.0,normal,normal,notpresent,notpresent,...,31,7500,,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.010,2.0,0.0,normal,normal,notpresent,notpresent,...,35,7300,4.6,no,no,no,good,no,no,ckd
5,5,60.0,90.0,1.015,3.0,0.0,,,notpresent,notpresent,...,39,7800,4.4,yes,yes,no,good,yes,no,ckd
6,6,68.0,70.0,1.010,0.0,0.0,,normal,notpresent,notpresent,...,36,,,no,no,no,good,no,no,ckd
7,7,24.0,,1.015,2.0,4.0,normal,abnormal,notpresent,notpresent,...,44,6900,5,no,yes,no,good,yes,no,ckd
8,8,52.0,100.0,1.015,3.0,0.0,normal,abnormal,present,notpresent,...,33,9600,4,yes,yes,no,good,no,yes,ckd
9,9,53.0,90.0,1.020,2.0,0.0,abnormal,abnormal,present,notpresent,...,29,12100,3.7,yes,yes,no,poor,no,yes,ckd


We have to remove missing values. We analyzed our dataste throughly and we can up with few reults.

1) In columns like: age, bgr, bu, sc, sod, pot, hemo, pcv and rc , missing values will be replaced with average/mean of that column's data.

2) In columns like: rbc, pc, pcc, ba, pcv, wc, rc, htn, dm , cad, appet, pe, ane and classification , missing values will be replaced with most frequently occuring data as these contain string data.

In [14]:
bp = pd.DataFrame(df.groupby('bp').size()).idxmax()[0]
df['bp'] = df['bp'].fillna(bp)

sg = pd.DataFrame(df.groupby('sg').size()).idxmax()[0]
df['sg'] = df['sg'].fillna(sg)

al = pd.DataFrame(df.groupby('al').size()).idxmax()[0]
df['al'] = df['al'].fillna(al)

su = pd.DataFrame(df.groupby('su').size()).idxmax()[0]
df['su'] = df['su'].fillna(su)


rbc = pd.DataFrame(df.groupby('rbc').size()).idxmax()[0]
df['rbc'] = df['rbc'].fillna(rbc)


pc = pd.DataFrame(df.groupby('pc').size()).idxmax()[0]
df['pc'] = df['pc'].fillna(pc)

pcc = pd.DataFrame(df.groupby('pcc').size()).idxmax()[0]
df['pcc'] = df['pcc'].fillna(pcc)



ba = pd.DataFrame(df.groupby('ba').size()).idxmax()[0]
df['ba'] = df['ba'].fillna(ba)


bgr = pd.DataFrame(df.groupby('bgr').size()).idxmax()[0]
df['bgr'] = df['bgr'].fillna(bgr)

wc = pd.DataFrame(df.groupby('wc').size()).idxmax()[0]
df['wc'] = df['wc'].fillna(wc)

htn = pd.DataFrame(df.groupby('htn').size()).idxmax()[0]
df['htn'] = df['htn'].fillna(htn)

dm = pd.DataFrame(df.groupby('dm').size()).idxmax()[0]
df['dm'] = df['dm'].fillna(dm)

wc = pd.DataFrame(df.groupby('wc').size()).idxmax()[0]
df['wc'] = df['wc'].fillna(wc)

cad = pd.DataFrame(df.groupby('cad').size()).idxmax()[0]
df['cad'] = df['cad'].fillna(cad)


appet = pd.DataFrame(df.groupby('appet').size()).idxmax()[0]
df['appet'] = df['appet'].fillna(appet)

pe = pd.DataFrame(df.groupby('pe').size()).idxmax()[0]
df['pe'] = df['pe'].fillna(pe)

ane = pd.DataFrame(df.groupby('ane').size()).idxmax()[0]
df['ane'] = df['ane'].fillna(ane)

classification = pd.DataFrame(df.groupby('classification').size()).idxmax()[0]
df['classification'] = df['classification'].fillna(classification)





In [15]:
df['age'] = df['age'].fillna(df['age'].mean(axis=0))
df['bgr'] = df['bgr'].fillna(df['bgr'].mean(axis=0))
df['sc'] = df['sc'].fillna(df['sc'].mean(axis=0))


In [16]:

df['bu'] = df['bu'].fillna(df['bu'].mean(axis=0))


df['sod'] = df['sod'].fillna(df['sod'].mean(axis=0))



In [17]:

df['pot'] = df['pot'].fillna(df['pot'].mean(axis=0))

In [18]:
df['pcv'] = df['pcv'].astype('float64')
df['wc'] = df['wc'].astype('int64')

In [19]:

df['pcv'] = df['pcv'].fillna(df['pcv'].mean(axis=0))



In [20]:
df['rc'] = df['rc'].astype('float64') 

In [21]:

df['hemo'] = df['hemo'].fillna(df['hemo'].mean(axis=0))
df['rc'] = df['rc'].fillna(df['rc'].mean(axis=0))

In [22]:
df.dtypes

id                  int64
age               float64
bp                float64
sg                float64
al                float64
su                float64
rbc                object
pc                 object
pcc                object
ba                 object
bgr               float64
bu                float64
sc                float64
sod               float64
pot               float64
hemo              float64
pcv               float64
wc                  int64
rc                float64
htn                object
dm                 object
cad                object
appet              object
pe                 object
ane                object
classification     object
dtype: object

Rounding off to nearest decimal value.

In [23]:
df.age = df.age.round()

In [24]:
df.bu = df.bu.round()

df.sod = df.sod.round()
df.pcv = df.pcv.round()

In [25]:
df.head(10)

Unnamed: 0,id,age,bp,sg,al,su,rbc,pc,pcc,ba,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,0,48.0,80.0,1.02,1.0,0.0,normal,normal,notpresent,notpresent,...,44.0,7800,5.2,yes,yes,no,good,no,no,ckd
1,1,7.0,50.0,1.02,4.0,0.0,normal,normal,notpresent,notpresent,...,38.0,6000,4.707435,no,no,no,good,no,no,ckd
2,2,62.0,80.0,1.01,2.0,3.0,normal,normal,notpresent,notpresent,...,31.0,7500,4.707435,no,yes,no,poor,no,yes,ckd
3,3,48.0,70.0,1.005,4.0,0.0,normal,abnormal,present,notpresent,...,32.0,6700,3.9,yes,no,no,poor,yes,yes,ckd
4,4,51.0,80.0,1.01,2.0,0.0,normal,normal,notpresent,notpresent,...,35.0,7300,4.6,no,no,no,good,no,no,ckd
5,5,60.0,90.0,1.015,3.0,0.0,normal,normal,notpresent,notpresent,...,39.0,7800,4.4,yes,yes,no,good,yes,no,ckd
6,6,68.0,70.0,1.01,0.0,0.0,normal,normal,notpresent,notpresent,...,36.0,9800,4.707435,no,no,no,good,no,no,ckd
7,7,24.0,80.0,1.015,2.0,4.0,normal,abnormal,notpresent,notpresent,...,44.0,6900,5.0,no,yes,no,good,yes,no,ckd
8,8,52.0,100.0,1.015,3.0,0.0,normal,abnormal,present,notpresent,...,33.0,9600,4.0,yes,yes,no,good,no,yes,ckd
9,9,53.0,90.0,1.02,2.0,0.0,abnormal,abnormal,present,notpresent,...,29.0,12100,3.7,yes,yes,no,poor,no,yes,ckd


In [27]:
df.isnull().sum()

id                0
age               0
bp                0
sg                0
al                0
su                0
rbc               0
pc                0
pcc               0
ba                0
bgr               0
bu                0
sc                0
sod               0
pot               0
hemo              0
pcv               0
wc                0
rc                0
htn               0
dm                0
cad               0
appet             0
pe                0
ane               0
classification    0
dtype: int64

Now, we can see that we have removed all null values. But we still have `string` data in our which has to be removed. For this, we are creating dummy variables. 

df=pd.get_dummies(df,prefix=['RBC','PC','PCC' , 'BA' , 'HTN' , 'DM' , 'CAD' , 'Appet' , 'PE' , 'Ane' , 'Classification'],columns=['rbc','pc','pcc' , 'ba' , 'htn' , 'dm' , 'cad' , 'appet' , 'pe' , 'ane' , 'classification'])

In [31]:
df.isnull().sum()

id                       0
age                      0
bp                       0
sg                       0
al                       0
su                       0
bgr                      0
bu                       0
sc                       0
sod                      0
pot                      0
hemo                     0
pcv                      0
wc                       0
rc                       0
RBC_abnormal             0
RBC_normal               0
PC_abnormal              0
PC_normal                0
PCC_notpresent           0
PCC_present              0
BA_notpresent            0
BA_present               0
HTN_no                   0
HTN_yes                  0
DM_no                    0
DM_yes                   0
CAD_no                   0
CAD_yes                  0
Appet_good               0
Appet_poor               0
PE_no                    0
PE_yes                   0
Ane_no                   0
Ane_yes                  0
Classification_ckd       0
Classification_notckd    0
d

In [33]:
df.head()

Unnamed: 0,id,age,bp,sg,al,su,bgr,bu,sc,sod,...,CAD_no,CAD_yes,Appet_good,Appet_poor,PE_no,PE_yes,Ane_no,Ane_yes,Classification_ckd,Classification_notckd
0,0,48.0,80.0,1.02,1.0,0.0,121.0,36.0,1.2,138.0,...,1,0,1,0,1,0,1,0,1,0
1,1,7.0,50.0,1.02,4.0,0.0,99.0,18.0,0.8,138.0,...,1,0,1,0,1,0,1,0,1,0
2,2,62.0,80.0,1.01,2.0,3.0,423.0,53.0,1.8,138.0,...,1,0,0,1,1,0,0,1,1,0
3,3,48.0,70.0,1.005,4.0,0.0,117.0,56.0,3.8,111.0,...,1,0,0,1,0,1,0,1,1,0
4,4,51.0,80.0,1.01,2.0,0.0,106.0,26.0,1.4,138.0,...,1,0,1,0,1,0,1,0,1,0


Conclusion: We have removed all the missing data from our dataset and we have also removed all the striing values and converted it to numeric data which can make our data analysis easy.