# Chronic Kidney Disease Prediction Project

This notebook walks through building a machine learning model to predict chronic kidney disease. It covers data loading, preprocessing, feature selection, class rebalancing, model training, evaluation, hyperparameter tuning, and saving the fitted model and preprocessing objects.

## 1. Import Required Libraries

In [None]:
# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.feature_selection import SelectKBest, f_classif
from imblearn.combine import SMOTETomek
import joblib

import os

# Display settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

## 2. Load and Explore the Dataset

In [3]:
# Load the dataset (update the path as needed)
df = pd.read_csv('kidney_disease_dataset.csv')

# Display shape and first few rows
print('Dataset shape:', df.shape)
df.head()

Dataset shape: (20538, 43)


Unnamed: 0,Age of the patient,Blood pressure (mm/Hg),Specific gravity of urine,Albumin in urine,Sugar in urine,Red blood cells in urine,Pus cells in urine,Pus cell clumps in urine,Bacteria in urine,Random blood glucose level (mg/dl),Blood urea (mg/dl),Serum creatinine (mg/dl),Sodium level (mEq/L),Potassium level (mEq/L),Hemoglobin level (gms),Packed cell volume (%),White blood cell count (cells/cumm),Red blood cell count (millions/cumm),Hypertension (yes/no),Diabetes mellitus (yes/no),Coronary artery disease (yes/no),Appetite (good/poor),Pedal edema (yes/no),Anemia (yes/no),Estimated Glomerular Filtration Rate (eGFR),Urine protein-to-creatinine ratio,Urine output (ml/day),Serum albumin level,Cholesterol level,Parathyroid hormone (PTH) level,Serum calcium level,Serum phosphate level,Family history of chronic kidney disease,Smoking status,Body Mass Index (BMI),Physical activity level,Duration of diabetes mellitus (years),Duration of hypertension (years),Cystatin C level,Urinary sediment microscopy results,C-reactive protein (CRP) level,Interleukin-6 (IL-6) level,Target
0,54,167,1.023,1,4,normal,abnormal,not present,not present,96,169.101369,7.55,146.06841,6.272576,11.8,35,5791,5.6,yes,yes,no,good,no,no,71.62,2.51,1397,3.23,152,65.078329,8.71,4.31,no,yes,25.3,low,4,16,0.67,normal,4.88,10.23,No_Disease
1,42,127,1.023,3,2,normal,normal,not present,present,73,183.223479,13.37,123.501427,5.611303,8.2,25,5390,4.6,no,yes,no,good,yes,yes,13.93,4.27,1632,3.47,242,46.030692,10.41,5.78,yes,no,20.6,moderate,3,13,0.55,abnormal,4.49,13.11,Low_Risk
2,38,148,1.016,0,0,abnormal,normal,not present,not present,77,193.141665,9.49,149.456527,3.965957,10.1,46,12098,4.7,no,no,yes,good,yes,no,60.09,1.56,889,4.42,103,26.214653,9.14,3.66,no,no,38.4,high,11,23,2.37,abnormal,4.57,13.27,No_Disease
3,7,98,1.017,4,0,abnormal,normal,not present,present,225,125.939396,10.98,131.758843,4.980997,14.0,24,6747,4.8,no,no,yes,good,no,yes,31.62,3.19,2424,3.44,140,11.931283,9.81,3.71,no,no,24.7,high,24,3,2.54,abnormal,8.57,12.36,No_Disease
4,67,174,1.015,1,1,normal,abnormal,not present,not present,376,197.1886,3.01,120.912465,4.097602,16.1,46,5759,5.7,no,no,no,good,yes,yes,36.61,1.23,893,4.14,149,34.909936,10.17,4.62,no,yes,17.6,high,22,24,1.9,normal,6.75,1.46,No_Disease


In [4]:
# Dataset info and class distribution
print('Dataset info:')
df.info()

print('\nClass distribution:')
print(df['Target'].value_counts())

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20538 entries, 0 to 20537
Data columns (total 43 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Age of the patient                           20538 non-null  int64  
 1   Blood pressure (mm/Hg)                       20538 non-null  int64  
 2   Specific gravity of urine                    20538 non-null  float64
 3   Albumin in urine                             20538 non-null  int64  
 4   Sugar in urine                               20538 non-null  int64  
 5   Red blood cells in urine                     20538 non-null  object 
 6   Pus cells in urine                           20538 non-null  object 
 7   Pus cell clumps in urine                     20538 non-null  object 
 8   Bacteria in urine                            20538 non-null  object 
 9   Random blood glucose level (mg/dl)           20538 non-nul
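
The class counts printed above drive the later resampling step, so it can help to look at class shares next to the raw counts. The sketch below does this on a toy label series: the values are made up, and only the label names `No_Disease` and `Low_Risk` come from the rows shown earlier. The variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for df['Target']; the counts are invented for illustration.
target = pd.Series(["No_Disease"] * 8 + ["Low_Risk"] * 2, name="Target")

counts = target.value_counts()
shares = target.value_counts(normalize=True)
summary = pd.DataFrame({"count": counts, "share": shares})

# Accuracy of a model that always predicts the most common class.
majority_baseline = shares.max()

print(summary)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any accuracy figure reported later is easier to judge against this majority-class baseline.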

## 3. Preprocess the Data

In [5]:
# Handle missing values
# Fill numeric columns with median, categorical with mode
df.fillna(df.select_dtypes(include='number').median(), inplace=True)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Encode categorical variables
# Note: the single LabelEncoder is refit on every column, so after this cell
# `le` only holds the mapping from its last fit, not the per-feature mappings.
le = LabelEncoder()
categorical_cols = [col for col in df.columns if df[col].dtype == 'object' and col != 'Target']
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])

# Encode target variable if not numeric
if df['Target'].dtype == 'object':
    df['Target'] = le.fit_transform(df['Target'])
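
If the category mappings are needed later (for example, to encode new patient records or to turn predicted codes back into labels), one option is to keep a separate fitted encoder per column. This is a minimal sketch on a toy frame, not the notebook's method: the column names mirror the dataset, the values are made up, and the `encoders` dictionary is a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Appetite (good/poor)": ["good", "poor", "good"],
    "Anemia (yes/no)": ["no", "yes", "no"],
    "Target": ["No_Disease", "Low_Risk", "No_Disease"],
})

encoders = {}  # hypothetical: column name -> fitted LabelEncoder
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Each encoder keeps its own classes, so integer codes can be mapped back.
print(encoders["Target"].classes_)
print(encoders["Target"].inverse_transform(toy["Target"]))
```

`LabelEncoder` assigns codes in sorted order of the class names, so the mapping is reproducible as long as the same set of categories is seen at fit time.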

In [6]:
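# Scale features
# Note: the scaler is fit on all rows before the train-test split,
# so test-set statistics influence the scaling.
scaler = StandardScaler()
X = df.drop('Target', axis=1)
y = df['Target']
X_scaled = scaler.fit_transform(X)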
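
A common way to keep test data out of the scaling statistics is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data from `make_classification`; the data and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # test rows reuse the training statistics

print("Train column means:", X_tr_scaled.mean(axis=0).round(6))
print("Test column means: ", X_te_scaled.mean(axis=0).round(3))
```

The test-set means are close to zero but not exactly zero, which is expected: the test rows never contributed to the fitted statistics.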

## 4. Feature Selection and Engineering

In [7]:
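# Feature selection using ANOVA F-test (keeps the k highest-scoring features)
# Note: scores are computed on all rows, including those later used for testing.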
k = min(20, X.shape[1])
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X_scaled, y)
selected_features = X.columns[selector.get_support()].tolist()

print('Selected features:', selected_features)

Selected features: ['Age of the patient', 'Specific gravity of urine', 'Sugar in urine', 'Red blood cells in urine', 'Pus cells in urine', 'Serum creatinine (mg/dl)', 'Hemoglobin level (gms)', 'White blood cell count (cells/cumm)', 'Hypertension (yes/no)', 'Coronary artery disease (yes/no)', 'Appetite (good/poor)', 'Pedal edema (yes/no)', 'Anemia (yes/no)', 'Estimated Glomerular Filtration Rate (eGFR)', 'Serum albumin level', 'Parathyroid hormone (PTH) level', 'Serum phosphate level', 'Duration of diabetes mellitus (years)', 'Cystatin C level', 'Urinary sediment microscopy results']
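
The list above shows which features were kept but not how strongly each one scored. A fitted `SelectKBest` exposes `scores_` and `pvalues_`, which can be put into a ranking table. The sketch below does this on synthetic data; the feature names and the `ranking` frame are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with hypothetical feature names.
X_arr, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

demo_selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_demo)

# One row per feature, sorted from highest to lowest F-score.
ranking = (
    pd.DataFrame({
        "feature": X_demo.columns,
        "f_score": demo_selector.scores_,
        "p_value": demo_selector.pvalues_,
        "selected": demo_selector.get_support(),
    })
    .sort_values("f_score", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Note that the F-test scores each feature on its own, so it does not account for interactions or redundancy between features.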


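## 5. Resample and Split Data into Training and Test Sets

In [8]:
# Apply SMOTETomek to handle class imbalance
# Note: resampling happens before the split, so the test set below contains
# synthetic samples and scores on it are likely optimistic for real patients.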
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X_selected, y)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled
)

print('Train set size:', X_train.shape)
print('Test set size:', X_test.shape)

Train set size: (65728, 20)
Test set size: (16432, 20)


## 6. Build and Train Baseline Model

In [9]:
# Train a baseline Random Forest model
baseline_model = RandomForestClassifier(random_state=42)
baseline_model.fit(X_train, y_train)

## 7. Evaluate Baseline Model Performance

In [10]:
# Evaluate baseline model
baseline_pred = baseline_model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, baseline_pred))
print('\nClassification Report:')
print(classification_report(y_test, baseline_pred))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, baseline_pred))

Accuracy: 0.9813169425511198

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      3287
           1       0.99      0.95      0.97      3286
           2       0.99      0.99      0.99      3286
           3       0.93      0.99      0.96      3287
           4       1.00      1.00      1.00      3286

    accuracy                           0.98     16432
   macro avg       0.98      0.98      0.98     16432
weighted avg       0.98      0.98      0.98     16432


Confusion Matrix:
[[3239    5    5   36    2]
 [   7 3116   11  151    1]
 [   3    3 3248   32    0]
 [   1   30   10 3245    1]
 [   0    1    0    8 3277]]
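
The figures in the classification report can be recomputed directly from the confusion matrix above: rows are true classes, columns are predicted classes, and the diagonal holds the correct predictions. The sketch below copies the printed matrix and derives per-class precision and recall plus overall accuracy; the variable names are illustrative.

```python
import numpy as np

# Baseline confusion matrix as printed above (rows: true, columns: predicted).
cm = np.array([
    [3239,    5,    5,   36,    2],
    [   7, 3116,   11,  151,    1],
    [   3,    3, 3248,   32,    0],
    [   1,   30,   10, 3245,    1],
    [   0,    1,    0,    8, 3277],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)     # correct / all samples of the true class
precision = correct / cm.sum(axis=0)  # correct / all samples predicted as the class
accuracy = correct.sum() / cm.sum()

for label, (p, r) in enumerate(zip(precision, recall)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Most of the errors sit in column 3: 151 samples of class 1 are predicted as class 3, which accounts for the lower recall of class 1 and the lower precision of class 3.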


## 8. Improve Model with Hyperparameter Tuning

In [11]:
# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [100, 150, 200],
    'max_depth': [8, 10, 12],
    'min_samples_leaf': [2, 3, 4],
    'class_weight': ['balanced', 'balanced_subsample']
}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

print('Best parameters:', grid_search.best_params_)
print('Best cross-validation accuracy:', grid_search.best_score_)

Best parameters: {'class_weight': 'balanced', 'max_depth': 12, 'min_samples_leaf': 2, 'n_estimators': 200}
Best cross-validation accuracy: 0.7979555275237518
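
The grid above does not contain the settings the baseline used (scikit-learn's defaults of `max_depth=None` and `min_samples_leaf=1`), so the search cannot rediscover the baseline configuration. One option is to include the defaults as grid values. The sketch below illustrates the idea on synthetic data with a deliberately small forest; the data, the reduced grid, and the variable names are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Cross-validated accuracy of the untuned forest, for reference.
baseline_cv = cross_val_score(
    RandomForestClassifier(n_estimators=25, random_state=42),
    X_demo, y_demo, cv=3, scoring="accuracy",
).mean()

demo_grid = {
    "max_depth": [None, 4, 8],      # None is the RandomForestClassifier default
    "min_samples_leaf": [1, 2, 4],  # 1 is the default
}
demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=42),
    demo_grid, cv=3, scoring="accuracy",
)
demo_search.fit(X_demo, y_demo)

print("Baseline CV accuracy:", round(baseline_cv, 4))
print("Best grid CV accuracy:", round(demo_search.best_score_, 4))
print("Best parameters:", demo_search.best_params_)
```

Because the default configuration is one of the grid points and the folds and seed are the same, the best cross-validation score from the search cannot fall below the baseline's cross-validation score.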


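## 9. Evaluate Tuned Model

The tuned model scores lower on the test set (accuracy 0.816) than the baseline (0.981). A likely cause is that the grid caps `max_depth` at 12 and requires `min_samples_leaf` of at least 2, while the baseline uses scikit-learn's defaults (unlimited depth, `min_samples_leaf=1`). The best parameters also sit at the edge of the grid, which suggests the search range was too narrow.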

In [12]:
# Evaluate improved model
best_model = grid_search.best_estimator_
improved_pred = best_model.predict(X_test)
print('Improved Accuracy:', accuracy_score(y_test, improved_pred))
print('\nClassification Report:')
print(classification_report(y_test, improved_pred))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, improved_pred))

Improved Accuracy: 0.815968841285297

Classification Report:
              precision    recall  f1-score   support

           0       0.85      0.81      0.83      3287
           1       0.87      0.58      0.69      3286
           2       0.84      0.83      0.84      3286
           3       0.77      0.91      0.84      3287
           4       0.78      0.95      0.86      3286

    accuracy                           0.82     16432
   macro avg       0.82      0.82      0.81     16432
weighted avg       0.82      0.82      0.81     16432


Confusion Matrix:
[[2665   83  122  169  248]
 [ 252 1894  300  488  352]
 [ 101   59 2741  174  211]
 [  51  125   61 3001   49]
 [  54   18   39   68 3107]]
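
Since a tuned model is not guaranteed to beat the baseline, it can be worth comparing candidates on a held-out validation split before deciding which one to keep. The sketch below compares two forests by macro-averaged F1 on synthetic data; the data, the candidate settings, and names such as `candidates` and `chosen_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Two illustrative candidates: library defaults versus a depth-limited forest.
candidates = {
    "baseline": RandomForestClassifier(n_estimators=50, random_state=42),
    "depth_limited": RandomForestClassifier(
        n_estimators=50, max_depth=3, min_samples_leaf=4, random_state=42
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

# Keep whichever candidate has the highest validation macro-F1.
chosen_name = max(scores, key=scores.get)
chosen_model = candidates[chosen_name]

print(scores)
print("Chosen model:", chosen_name)
```

Using a validation split for this choice leaves the test set free for a final, unbiased estimate.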


## 10. Save the Final Model

In [None]:
os.makedirs('models', exist_ok=True)

# Save the tuned model (grid_search.best_estimator_)
joblib.dump(best_model, 'models/ckd_best_model.joblib')
print('Model saved as ckd_best_model.joblib')

Model saved as ckd_best_model.joblib


In [None]:
# Save label encoder (holds only the mapping from its last fit)
try:
    joblib.dump(le, 'models/encoder.joblib')
except Exception as e:
    print('Encoder not saved:', e)

# Save scaler (if used)
try:
    joblib.dump(scaler, 'models/scaler.joblib')
except Exception as e:
    print('Scaler not saved:', e)

# Save selector (if used)
try:
    joblib.dump(selector, 'models/selector.joblib')
except Exception as e:
    print('Selector not saved:', e)

print('Preprocessing objects saved to models folder.')

Preprocessing objects saved to models folder.
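
Saving the scaler, selector, and model as separate files means they must be reloaded and applied in the right order at prediction time. One alternative is to bundle the steps in a scikit-learn `Pipeline` and save a single artifact. The sketch below shows a save-and-reload round trip on synthetic data; the file name `ckd_pipeline.joblib`, the step names, and the data are assumptions, and a temporary directory is used so nothing is written to the `models` folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=42
)

# Scaling, feature selection, and the classifier as one fitted object.
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=5)),
    ("model", RandomForestClassifier(n_estimators=25, random_state=42)),
])
demo_pipeline.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ckd_pipeline.joblib")  # hypothetical file name
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline applies every step in order on raw (unscaled) input.
same_predictions = np.array_equal(
    restored.predict(X_demo), demo_pipeline.predict(X_demo)
)
print("Round-trip predictions match:", same_predictions)
```

A pipeline like this also fits its scaler and selector only on the data passed to `fit`, which makes it straightforward to use inside cross-validation.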
