Chronic kidney disease includes conditions that damage your kidneys and decrease their ability to keep you healthy by doing the jobs listed. If kidney disease gets worse, wastes can build to high levels in your blood and make you feel sick. You may develop complications like high blood pressure, anemia (low blood count), weak bones, poor nutritional health and nerve damage. Also, kidney disease increases your risk of having heart and blood vessel disease. These problems may happen slowly over a long period of time. Chronic kidney disease may be caused by diabetes, high blood pressure and other disorders. Early detection and treatment can often keep chronic kidney disease from getting worse.
# In this notebook I predict whether a person is prone to CKD based on features such as blood pressure, haemoglobin, RBC count and WBC count
* In the analysis section I automate the exploration of relationships between features with violin and scatter plots from the plotly module, establish correlations with a heatmap, and replace null values by random-sample imputation so that the distribution of each feature is not distorted
* Later I use SelectKBest for feature selection and finally do the modelling with XGBoost



![](https://i1.wp.com/vegofwa.org/wp-content/uploads/2019/02/kidneys-cartoon.jpeg?fit=532%2C421&ssl=1)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=pd.read_csv('../input/ckdisease/kidney_disease.csv')

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
# now let's look at the description file to understand what these features mean exactly
df1=pd.read_csv('../input/kidney/data_description.txt',sep='-')
df1=df1.reset_index()

In [None]:
df1

In [None]:
df1.columns=['shortf','longf']
df1

In [None]:
df1['longf'].values

In [None]:
# let's rename the columns in our original dataframe
df.columns=df1['longf'].values
df.head()

In [None]:
df.dtypes

In [None]:
# some features with numeric values have the object dtype (packed cell volume, rbc count, wbc count), so let's convert them to numeric
def convert_dtypes(df,feature):
  # errors='coerce' turns values that cannot be parsed as numbers into NaN instead of raising an error
  df[feature]=pd.to_numeric(df[feature],errors='coerce')

In [None]:
features=['packed cell volume', 'white blood cell count',
       'red blood cell count']
for feature in features:
  convert_dtypes(df,feature)


In [None]:
df.dtypes
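
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a made-up Series (the values are illustrative, not taken from the dataset): strings that cannot be parsed as numbers become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with unparseable entries
raw = pd.Series(['44', '38', '\t?', '5.2', None])

# errors='coerce' turns anything that cannot be parsed into NaN
clean = pd.to_numeric(raw, errors='coerce')

print(clean.tolist())
print(clean.dtype)
print(clean.isna().sum())  # entries that became (or already were) NaN
```

The price of this convenience is that malformed entries silently turn into missing values, so it is worth re-checking `isnull().sum()` after the conversion.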

In [None]:
df.drop('id',axis=1,inplace=True)
# the id column is just a row identifier and carries no predictive information

Let's clean the data.
First, let's separate the categorical and numerical columns.

In [None]:
# split the column names by dtype: object columns are categorical, the rest are numerical
def extract_cat_nume(df):
  cat_col=[col for col in df.columns if df[col].dtype=='object']
  num_col=[col for col in df.columns if df[col].dtype!='object']
  return cat_col,num_col
  


In [None]:
cat_col,num_col=extract_cat_nume(df)

In [None]:
num_col

In [None]:
cat_col

In [None]:
### unique categories in each categorical feature, to check whether the data contains any dirty values

In [None]:
for col in cat_col:
    print('{} has {} values '.format(col,df[col].unique()))
    print('\n')

In [None]:
## ckd --> chronic kidney disease
## notckd --> not chronic kidney disease

In [None]:
for col in cat_col:
    print('{} has {} values  '.format(col, df[col].unique()))
    print('\n')

In [None]:
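# Replace incorrect values (stray tabs and spaces in the category labels)

# assign the result back instead of calling replace(inplace=True) on a column selection, which may not update df in recent pandas
df['diabetes mellitus'] = df['diabetes mellitus'].replace(to_replace = {'\tno':'no','\tyes':'yes',' yes':'yes'})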

df['coronary artery disease'] = df['coronary artery disease'].replace(to_replace = '\tno', value='no')

df['class'] = df['class'].replace(to_replace = 'ckd\t', value = 'ckd')

In [None]:
for col in cat_col:
    print('{} has {} values  '.format(col, df[col].unique()))
    print('\n')

In [None]:
# apart from nan values we are good to go
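
The replacements above list each dirty value by hand. As an alternative sketch (not what this notebook does), stray tabs and spaces can be removed in one pass with `str.strip()`. The toy Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical column with the same kind of stray whitespace seen above
s = pd.Series(['yes', '\tno', ' yes', 'no', '\tyes', None])

# str.strip() removes leading/trailing whitespace (tabs included); missing values stay missing
cleaned = s.str.strip()

print(cleaned.dropna().unique())
```

This would also catch whitespace variants that were not spotted when inspecting the unique values.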

# Checking feature distribution

In [None]:
plt.figure(figsize=(30,20))
for i,feature in enumerate(num_col):
    # enumerate yields an index i for each feature
    plt.subplot(5,3,i+1)
    # i+1 because subplot positions start at 1
    df[feature].hist()
    plt.title(feature)

In [None]:
# We can see there are some outliers
# Observations:
#         1. Age looks a bit left skewed
#         2. Blood glucose random is right skewed
#         3. Blood urea is also a bit right skewed
#         4. The rest of the features are lightly skewed
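
Skewness can also be quantified rather than read off the histograms. Below is a small sketch with `Series.skew()` on synthetic data (the numbers are generated, not from the dataset): a positive value indicates a right skew, a negative value a left skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic examples: an exponential sample is right skewed,
# and its negation is left skewed by construction
right = pd.Series(rng.exponential(scale=1.0, size=1000))
left = -right

print(round(right.skew(), 2))  # positive
print(round(left.skew(), 2))   # negative, same magnitude
```

On the real data the equivalent call would be `df[num_col].skew()`.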

In [None]:
plt.figure(figsize=(30,20))
for i,feature in enumerate(cat_col):
  plt.subplot(4,3,i+1)
  sns.countplot(x=df[feature])

Let's look at the correlation between the features.

In [None]:
plt.figure(figsize=(20,20))
# restrict to the numerical columns: corr() raises on object columns in recent pandas
sns.heatmap(df[num_col].corr(),annot=True)

Positive correlation:

* Specific gravity -> Red blood cell count, Packed cell volume and Hemoglobin
* Sugar -> Blood glucose random
* Blood urea -> Serum creatinine
* Hemoglobin -> Red blood cell count <- Packed cell volume

Negative correlation:

* Albumin, Blood urea -> Red blood cell count, Packed cell volume, Hemoglobin
* Serum creatinine -> Sodium
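
The pairs above were read off the heatmap. One possible way to list such pairs programmatically is sketched below on a small synthetic frame. The column names `a`, `b`, `c` and the 0.5 threshold are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),   # strongly positive with a
    'c': -a + rng.normal(scale=0.1, size=200),      # strongly negative with a
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then flatten
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pairs whose absolute correlation exceeds the (arbitrary) threshold
strong = pairs[pairs.abs() > 0.5].sort_values()
print(strong)
```

On the real data, `toy.corr()` would be replaced by the correlation matrix of the numerical columns.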

In [None]:
df.groupby(['red blood cells','class'])['red blood cell count'].agg(['count','mean','median','min','max'])

For people with normal red blood cells and no chronic kidney disease, the `count` column shows 134 rows with a recorded red blood cell count, and the mean value is higher than in the other groups.
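
To make the columns of this table concrete, here is a toy sketch (made-up values) showing that `count` in `agg` is the number of non-missing rows per group, not a measurement.

```python
import numpy as np
import pandas as pd

# Made-up data: two groups, one missing measurement in group x
toy = pd.DataFrame({
    'group': ['x', 'x', 'x', 'y', 'y'],
    'value': [4.0, 5.0, np.nan, 3.0, 3.5],
})

summary = toy.groupby('group')['value'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```

Group `x` has three rows but a `count` of 2, because the missing value is ignored.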

### Let's check for positive correlation and its impact on classes

In [None]:
import plotly.express as px

In [None]:
px.violin(df,y='red blood cell count',x="class", color="class")

# the violin plot gives a descriptive summary of the distribution for each class

In [None]:
px.scatter(df,'haemoglobin','packed cell volume')
# how haemoglobin and packed cell volume relate

In [None]:
### analysing distribution of 'red_blood_cell_count' in both Labels 

grid=sns.FacetGrid(df, hue="class",aspect=2)
# aspect=2 makes each plot twice as wide as it is tall
grid.map(sns.kdeplot, 'red blood cell count')
#mapping a distribution of red blood cell count(feature)
grid.add_legend()

This plot makes the pattern clear: people without chronic kidney disease tend to have a higher RBC count.

In [None]:
#lets automate our analysis with the help of violin plot
def violin(col):
  fig=px.violin(df,y=col,x='class',color='class',box=True)
  fig.show()

In [None]:
def scatter(col1,col2):
  fig=px.scatter(df,x=col1,y=col2,color="class")
  fig.show()

In [None]:
def kde_plot(feature):
    grid = sns.FacetGrid(df, hue="class",aspect=2)
    grid.map(sns.kdeplot, feature)
    grid.add_legend()

In [None]:
# now we have reusable functions that we can easily call for any of the available features
# for example, let's draw the kde plot for haemoglobin
kde_plot('haemoglobin')

We can see from the plot that people without chronic kidney disease tend to have higher haemoglobin levels.

In [None]:
df.columns

In [None]:
scatter('red blood cell count','packed cell volume')

We can observe that the relationship looks non-linear for people without the disease, while for those with the disease the data shows some linearity.

In [None]:
scatter('red blood cell count','haemoglobin')

In [None]:
violin('packed cell volume')

We can observe that when the packed cell volume is between 35 and 55, the person is generally found not to have chronic kidney disease.

In [None]:
df.isna().sum().sort_values(ascending=False)

In [None]:
# lets fill the missing values
 

In [None]:
data=df.copy()

In [None]:
data

In [None]:
# this function works for any feature containing nulls
def random_value_imputation(feature):
  # draw as many observed (non-null) values at random as there are missing entries
  random_sample=data[feature].dropna().sample(data[feature].isnull().sum())
  # give the sampled values the index of the missing rows so they align on assignment
  random_sample.index=data[data[feature].isnull()].index
  data.loc[data[feature].isnull(),feature]=random_sample

In [None]:
# now let's look at the missing values in the numerical columns
data[num_col].isnull().sum()

In [None]:
for col in num_col:
  random_value_imputation(col)
  

In [None]:
# check the missing values in the numerical columns again
data[num_col].isnull().sum()

We can see that all missing values in the numerical columns have been filled.
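
The idea behind `random_value_imputation` can be illustrated on a toy Series (the values are made up, and `random_state` is added here only for reproducibility): each gap is filled with a value drawn from the observed entries, so no new values are invented and the overall distribution is roughly preserved.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries
s = pd.Series([10.0, np.nan, 12.0, 11.0, np.nan, 13.0])

missing = s.isna()
# Draw as many observed values as there are gaps
fill = s.dropna().sample(int(missing.sum()), random_state=0)
# Re-label the drawn values so they line up with the missing positions
fill.index = s[missing].index

imputed = s.copy()
imputed.loc[missing] = fill

print(imputed.isna().sum())
print(imputed.tolist())
```

Unlike mean or median imputation, this does not pile the filled rows onto a single value, which is why the histograms keep their shape.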

In [None]:
data[cat_col].isnull().sum()

In [None]:
random_value_imputation(' pus cell')
random_value_imputation('red blood cells')

In [None]:
data[cat_col].isnull().sum()

In [None]:
# the remaining columns have few missing values, so we can fill them with the mode, since these are categorical features


In [None]:
def impute_mode(feature):
    mode=data[feature].mode()[0]
    data[feature]=data[feature].fillna(mode)

In [None]:
for col in cat_col:
    impute_mode(col)

In [None]:
data[cat_col].isnull().sum()

Now let's apply feature encoding to the categorical data.

In [None]:
for col in cat_col:
    print('{} has {} categories'.format(col, data[col].nunique()))
    

In [None]:
#### since each feature has just 2 categories, we can use LabelEncoder; it adds no extra columns, so it does not increase dimensionality

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()

In [None]:
for col in cat_col:
    data[col]=le.fit_transform(data[col])

In [None]:
data.head()

We can see that our categorical data has been converted to numerical data.
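
One detail worth keeping in mind: `LabelEncoder` assigns integer codes in sorted order of the labels. A minimal sketch with the two class labels from this dataset:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['ckd', 'notckd', 'ckd', 'notckd'])

# classes_ holds the original labels; a label's position is its integer code
print(list(le.classes_))   # ['ckd', 'notckd']
print(list(codes))         # [0, 1, 0, 1]
print(list(le.inverse_transform([0, 1])))
```

So `ckd` is encoded as 0 and `notckd` as 1. Note also that the loop above reuses one encoder object, so after the loop it only remembers the mapping of the last column it was fitted on.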

# Feature importance

In [None]:
from sklearn.feature_selection import SelectKBest
# keeps the k features with the highest scores under the chosen scoring function
from sklearn.feature_selection import chi2
# chi-squared statistic between each non-negative feature and the target

In [None]:
ind_feature=[col for col in data.columns if col!='class']
target=data['class']


In [None]:
X=data[ind_feature]
y=data['class']

In [None]:
y

In [None]:
ordered_rank_features=SelectKBest(score_func=chi2,k=20)
ordered_rank_features

In [None]:
ordered_rank_features=ordered_rank_features.fit(X,y)
ordered_rank_features

In [None]:
import pandas as pd

In [None]:
datascores=pd.DataFrame(ordered_rank_features.scores_,columns=['Score'])

In [None]:
datascores

In [None]:
dfcols=X.columns
dfcols

In [None]:
dfcols=pd.DataFrame(dfcols,columns=['Features'])
dfcols

In [None]:
features_rank=pd.concat([dfcols,datascores],axis=1)
features_rank

In [None]:
features_rank.nlargest(10,'Score')
# on the basis of score lets select the top 10 features


In [None]:
selected_columns=features_rank.nlargest(10,'Score')['Features'].values
# this will convert into an array

In [None]:
selected_columns
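
The ranking steps above can be condensed. The sketch below uses made-up, non-negative toy data (the column names `informative` and `noise` are hypothetical) to show how `scores_` and `get_support()` relate to the selected columns.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up, non-negative features (chi2 requires non-negative values)
X_toy = pd.DataFrame({
    'informative': [9, 8, 9, 7, 1, 0, 2, 1],
    'noise':       [3, 4, 3, 4, 3, 4, 3, 4],
})
y_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0])

selector = SelectKBest(score_func=chi2, k=1).fit(X_toy, y_toy)

# Pair each score with its column name and see which column was kept
scores = pd.Series(selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))
print(list(X_toy.columns[selector.get_support()]))
```

Building the score table as a Series indexed by column name avoids the separate `concat` of feature names and scores.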

In [None]:
X_new=data[selected_columns]

In [None]:
X_new.shape

In [None]:
X=X_new

In [None]:
X

# Fitting a cross-validated model and then checking its accuracy

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

In [None]:
from xgboost import XGBClassifier


In [None]:
XGBClassifier()

In [None]:
params={
  "learning_rate"    : [0.05, 0.20, 0.25 ] ,
 "max_depth"        : [ 5, 8, 10, 12],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.7 ]
}

In [None]:
import warnings
from warnings import filterwarnings
filterwarnings('ignore')

In [None]:
from sklearn.model_selection import RandomizedSearchCV


In [None]:
classifier=XGBClassifier()

In [None]:
random_search=RandomizedSearchCV(classifier,param_distributions=params,n_iter=5,scoring='roc_auc',n_jobs=-1,cv=5,verbose=3)

In [None]:
random_search.fit(X_train,y_train)

In [None]:
random_search.best_estimator_

In [None]:
random_search.best_params_
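
A side note on reusing the search result: with the default `refit=True`, `RandomizedSearchCV` refits the best configuration on the full training data and exposes it as `best_estimator_`, so the parameters need not be copied by hand. The sketch below shows this on synthetic data with a `DecisionTreeClassifier` as a stand-in model. Both are assumptions for illustration, not part of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model (assumptions for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={'max_depth': [2, 3, 5, 8], 'min_samples_leaf': [1, 3, 5]},
    n_iter=5, scoring='roc_auc', cv=5, random_state=0,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.score(X_te, y_te))
```

Passing `random_state` to the search makes the sampled parameter combinations reproducible between runs.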

In [None]:
# parameters as printed for the best estimator; arguments that are deprecated or
# removed in newer xgboost releases (silent, nthread, seed, missing=None) are omitted
classifier=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7, gamma=0.1,
              learning_rate=0.2, max_delta_step=0, max_depth=5,
              min_child_weight=1, n_estimators=100, n_jobs=1,
              objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              subsample=1, verbosity=1)

In [None]:
classifier.fit(X_train,y_train)

In [None]:
y_pred=classifier.predict(X_test)

In [None]:
y_pred

In [None]:
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score

In [None]:
confusion_matrix(y_test,y_pred)

In [None]:
accuracy_score(y_test,y_pred)

We get an accuracy of up to 97% on this use case using XGBClassifier.
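
Accuracy is only one summary of the confusion matrix computed above. As a sketch with made-up counts (not this notebook's results), precision and recall for the class encoded as 1 can be derived from the same four cells:

```python
import numpy as np

# Made-up 2x2 confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes
cm = np.array([[50, 5],
               [3, 42]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

`classification_report(y_test, y_pred)`, already imported above, prints these metrics per class for the actual predictions.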