# Chronic Kidney Disease Prediction - Data Science Project

# 1. Introduction

- Chronic Kidney Disease (CKD) is a global health issue that affects millions of people. Early prediction and diagnosis can significantly improve patient outcomes. This project develops predictive models for CKD using machine learning techniques.

# 2. Problem Statement

The goal is to classify whether a patient has CKD or not based on clinical data. This involves:

- Preprocessing the dataset.

- Building and evaluating machine learning models.

- Comparing model performance.

# 3. Dataset Overview

Dataset: kidney_disease.csv

The dataset contains clinical measurements and patient conditions, including:

- Features: blood pressure, albumin levels, glucose levels, etc.

- Target: classification, which records whether the patient has CKD (ckd) or not (notckd).
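A quick check of how the two target labels are distributed can be sketched as follows. The small frame below is a made-up stand-in for kidney_disease.csv, so the counts are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for the real data: only the target column matters here.
toy = pd.DataFrame({"classification": ["ckd", "ckd", "notckd", "ckd", "notckd"]})

# Absolute counts and relative frequencies of each label
label_counts = toy["classification"].value_counts()
label_share = toy["classification"].value_counts(normalize=True)

print(label_counts)
print(label_share.round(2))
```

A strongly skewed split would suggest looking at precision and recall per class rather than accuracy alone.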

# 4. Steps in the Project

- Import necessary libraries.

- Load and explore the dataset.

- Preprocess the data (cleaning, encoding, handling missing values).

- Train and evaluate machine learning models.

- Compare performance metrics.

# 5. Code and Explanations

## 5.1 Importing Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
import warnings
warnings.simplefilter('ignore')

- Importing libraries for data manipulation, preprocessing, machine learning models, and evaluation.

## Loading and Exploring the Dataset


In [2]:
df = pd.read_csv("kidney_disease.csv", header=0, na_values="?")
print(df.head(10))
print(df.info())
print(df.columns)
print(df.describe())
print(df.shape)

   id   age     bp     sg   al   su       rbc        pc         pcc  \
0   0  48.0   80.0  1.020  1.0  0.0       NaN    normal  notpresent   
1   1   7.0   50.0  1.020  4.0  0.0       NaN    normal  notpresent   
2   2  62.0   80.0  1.010  2.0  3.0    normal    normal  notpresent   
3   3  48.0   70.0  1.005  4.0  0.0    normal  abnormal     present   
4   4  51.0   80.0  1.010  2.0  0.0    normal    normal  notpresent   
5   5  60.0   90.0  1.015  3.0  0.0       NaN       NaN  notpresent   
6   6  68.0   70.0  1.010  0.0  0.0       NaN    normal  notpresent   
7   7  24.0    NaN  1.015  2.0  4.0    normal  abnormal  notpresent   
8   8  52.0  100.0  1.015  3.0  0.0    normal  abnormal     present   
9   9  53.0   90.0  1.020  2.0  0.0  abnormal  abnormal     present   

           ba  ...  pcv     wc   rc  htn   dm  cad appet   pe  ane  \
0  notpresent  ...   44   7800  5.2  yes  yes   no  good   no   no   
1  notpresent  ...   38   6000  NaN   no   no   no  good   no   no   
2  notpr


Load the dataset and inspect its structure, size, and basic statistics.
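Since the preview above already shows several NaN entries, a per-column count of missing values is a natural next inspection step. The sketch below assumes a tiny hand-made frame (toy) instead of the real file; the column names are borrowed from the dataset, but the rows are invented.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; not taken from kidney_disease.csv.
toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0, np.nan],
    "bp": [80.0, 50.0, np.nan, np.nan],
    "rbc": [np.nan, np.nan, "normal", "normal"],
})

# Number of missing entries per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)

# Same information as a percentage of rows
missing_pct = (toy.isna().mean() * 100).round(1)

print(missing)
print(missing_pct)
```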

## Renaming Columns

In [3]:
cols_names = {
    "bp": "blood_pressure",
    "sg": "specific_gravity",
    "al": "albumin",
    "su": "sugar",
    "rbc": "red_blood_cells",
    "pc": "pus_cell",
    "pcc": "pus_cell_clumps",
    "ba": "bacteria",
    "bgr": "blood_glucose_random",
    "bu": "blood_urea",
    "sc": "serum_creatinine",
    "sod": "sodium",
    "pot": "potassium",
    "hemo": "haemoglobin",
    "pcv": "packed_cell_volume",
    "wc": "white_blood_cell_count",
    "rc": "red_blood_cell_count",
    "htn": "hypertension",
    "dm": "diabetes_mellitus",
    "cad": "coronary_artery_disease",
    "appet": "appetite",
    "pe": "pedal_edema",
    "ane": "anemia",
    "classification": "classification"
}
df.rename(columns=cols_names, inplace=True)


- Renaming columns for clarity and better interpretation.

## Data Type Conversion

In [4]:
df['red_blood_cell_count'] = pd.to_numeric(df['red_blood_cell_count'], errors='coerce')
df['packed_cell_volume'] = pd.to_numeric(df['packed_cell_volume'], errors='coerce')
df['white_blood_cell_count'] = pd.to_numeric(df['white_blood_cell_count'], errors='coerce')


- Converts selected columns to numeric, replacing invalid values with NaN.
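To see what errors='coerce' does in isolation, here is a minimal sketch on a hand-written Series. The values are assumptions chosen to mimic stray non-numeric entries; they are not read from the dataset.

```python
import pandas as pd

# Illustrative raw strings: two clean numbers and two unparseable entries
raw = pd.Series(["44", "38", "\t?", "abc"])

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# How many values were lost to coercion
n_coerced = int(converted.isna().sum())

print(converted)
print("coerced to NaN:", n_coerced)
```

Counting the coerced values before and after conversion is a cheap way to confirm that only genuinely invalid entries were turned into NaN.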

## Dropping Irrelevant Columns

In [5]:
df.drop(["id"], axis=1, inplace=True)

Drops the id column as it doesn’t contribute to analysis.

## Handling Inconsistent Values


In [6]:

df['diabetes_mellitus'] = df['diabetes_mellitus'].replace({'\tno': 'no', '\tyes': 'yes', ' yes': 'yes'})
df['coronary_artery_disease'] = df['coronary_artery_disease'].replace('\tno', 'no')
df['classification'] = df['classification'].replace('ckd\t', 'ckd')


Fixes inconsistent entries such as tabs or extra spaces.
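The explicit replacements above cover the variants that were spotted by hand. A more general alternative, shown here only as a sketch on invented values, is to strip leading and trailing whitespace (including tabs) from every text column of interest.

```python
import pandas as pd

# Illustrative values with the same kind of stray tabs and spaces
toy = pd.DataFrame({
    "diabetes_mellitus": ["yes", "\tno", " yes", "no", None],
    "classification": ["ckd", "ckd\t", "notckd", "ckd", "notckd"],
})

# str.strip() removes surrounding whitespace and leaves missing values untouched
for col in ["diabetes_mellitus", "classification"]:
    toy[col] = toy[col].str.strip()

print(toy["diabetes_mellitus"].unique())
print(toy["classification"].unique())
```

Stripping does not fix misspellings or different capitalisation, so checking the unique values afterwards is still worthwhile.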

## Encoding Categorical Variables and Handling Missing Values

In [7]:
conv_value = {
    "red_blood_cells": {"normal": 1, "abnormal": 0},
    "pus_cell": {"normal": 1, "abnormal": 0},
    "pus_cell_clumps": {"present": 1, "notpresent": 0},
    "bacteria": {"present": 1, "notpresent": 0},
    "hypertension": {"yes": 1, "no": 0},
    "diabetes_mellitus": {"yes": 1, "no": 0},
    "coronary_artery_disease": {"yes": 1, "no": 0},
    "appetite": {"good": 1, "poor": 0},
    "pedal_edema": {"yes": 1, "no": 0},
    "anemia": {"yes": 1, "no": 0},
    "classification": {"ckd": 1, "notckd": 0}
}
df.replace(conv_value, inplace=True)
df.fillna(round(df.mean(), 2), inplace=True)

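- Encodes categorical variables as 0/1 and fills missing values with each column's mean, rounded to two decimals. Because the fill runs after encoding, missing entries in the binary columns receive a fractional mean rather than 0 or 1.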
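One possible variation, which is not what the project code does, is to fill continuous columns with the mean and binary columns with the most frequent value so that they stay 0/1. The sketch below assumes a made-up frame and a hand-written split into numeric_cols and binary_cols; both lists are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative frame: one continuous and one already-encoded binary column
toy = pd.DataFrame({
    "blood_pressure": [80.0, 50.0, np.nan, 70.0],
    "hypertension": [1.0, 0.0, 1.0, np.nan],
})

numeric_cols = ["blood_pressure"]   # assumed continuous features
binary_cols = ["hypertension"]      # assumed 0/1 features

filled = toy.copy()

# Continuous columns: column mean, rounded to two decimals as in the project
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean().round(2))

# Binary columns: most frequent value, so the column stays 0/1
for col in binary_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```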

## Splitting Features and Target

In [8]:
x = df.iloc[:, :-1]
y = df.iloc[:, -1]

- Separates the dataset into features (x) and target variable (y).
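Positional slicing works as long as classification remains the last column. A name-based split, sketched here on an invented frame, does not depend on column order.

```python
import pandas as pd

# Illustrative frame with the target in the last position
toy = pd.DataFrame({
    "age": [48.0, 7.0, 62.0],
    "blood_pressure": [80.0, 50.0, 80.0],
    "classification": [1, 1, 0],
})

target = "classification"

# Select by name instead of position
x = toy.drop(columns=target)
y = toy[target]

print(x.columns.tolist())
print(y.tolist())
```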

## Model Definition

In [9]:
models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Decision Trees": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "K Neighbors": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Naive Bayes": GaussianNB(),
    "Gradient Boosting": GradientBoostingClassifier(),
}

- Defines a dictionary of machine learning models.

## Model Training and Evaluation

In [10]:
train_acc = {}
test_acc = {}
cv_score_train = {}
precision_train = {}
recall_train = {}
f1_train = {}
best_random_state = {}

for name, model in models.items():
    train_acc_temp = []
    test_acc_temp = []
    cv_temp = []

    for i in range(0, 100):
        x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=i)
        scaler = MinMaxScaler()
        x_train_scaled = scaler.fit_transform(x_train)
        x_test_scaled = scaler.transform(x_test)

        model.fit(x_train_scaled, y_train)
        ypred_train = model.predict(x_train_scaled)
        ypred_test = model.predict(x_test_scaled)

        cv_temp.append(cross_val_score(model, x_train_scaled, y_train, cv=5).mean())
        train_acc_temp.append(accuracy_score(y_train, ypred_train))
        test_acc_temp.append(accuracy_score(y_test, ypred_test))

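    em = pd.DataFrame({"train_acc": train_acc_temp, "cv": cv_temp, "test_acc": test_acc_temp})
    # Keep only splits whose test accuracy is within 0.05 of the CV score
    gm = em[(em["test_acc"] - em["cv"]).abs() <= 0.05]
    if gm.empty:
        gm = em  # no split met the threshold; fall back to all splits
    # Row index equals the random_state; idxmax returns the first best split
    rs = int(gm["test_acc"].idxmax())
    best_random_state[name] = rs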

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=rs)
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)

    model.fit(x_train_scaled, y_train)
    ypred_train = model.predict(x_train_scaled)
    ypred_test = model.predict(x_test_scaled)

    train_acc[name] = accuracy_score(y_train, ypred_train)
    test_acc[name] = accuracy_score(y_test, ypred_test)
    cv_score_train[name] = cross_val_score(model, x_train_scaled, y_train, cv=5).mean()
    precision_train[name] = precision_score(y_train, ypred_train, average='weighted')
    recall_train[name] = recall_score(y_train, ypred_train, average='weighted')
    f1_train[name] = f1_score(y_train, ypred_train, average='weighted')

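- For each model, tries 100 train/test splits (random_state 0 to 99), scaling the features with MinMaxScaler and recording train accuracy, test accuracy, and the mean 5-fold cross-validation score. Among the splits where test accuracy is within 0.05 of the cross-validation score, it keeps the one with the highest test accuracy as the best random state, then refits the model and records the final metrics on that split.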
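A variation on the evaluation step is to put the scaler and the model in one scikit-learn pipeline, so that the scaler is refit inside every cross-validation fold, and to report precision and recall on the held-out split. The sketch below is not the project's procedure: it uses synthetic data from make_classification and a single fixed random_state, both of which are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the CKD features and labels
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Fit on the full training split, then score once on the untouched test split
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
test_accuracy = pipe.score(X_test, y_test)
test_precision = precision_score(y_test, y_pred)
test_recall = recall_score(y_test, y_pred)

print("CV mean:", cv_scores.mean())
print("Test accuracy:", test_accuracy)
print("Test precision:", test_precision, "recall:", test_recall)
```

Keeping the split fixed in advance means the test score is not used to choose the split, at the cost of a noisier single estimate.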

## Results Summary

In [11]:
for name in models.keys():
    print(f"Model: {name}")
    print(f"Best Random State: {best_random_state[name]}")
    print(f"Train Accuracy: {train_acc[name]}")
    print(f"Test Accuracy: {test_acc[name]}")
    print(f"CV Score (Train): {cv_score_train[name]}")
    print(f"Precision (Train): {precision_train[name]}")
    print(f"Recall (Train): {recall_train[name]}")
    print(f"F1 Score (Train): {f1_train[name]}")
    print("_" * 50)

Model: Logistic Regression
Best Random State: 4
Train Accuracy: 0.990625
Test Accuracy: 1.0
CV Score (Train): 0.98125
Precision (Train): 0.99085
Recall (Train): 0.990625
F1 Score (Train): 0.9906461507556324
__________________________________________________
Model: SVM
Best Random State: 0
Train Accuracy: 0.996875
Test Accuracy: 1.0
CV Score (Train): 0.9875
Precision (Train): 0.996900406504065
Recall (Train): 0.996875
F1 Score (Train): 0.9968774218548179
__________________________________________________
Model: Decision Trees
Best Random State: 1
Train Accuracy: 1.0
Test Accuracy: 1.0
CV Score (Train): 0.9625
Precision (Train): 1.0
Recall (Train): 1.0
F1 Score (Train): 1.0
__________________________________________________
Model: Random Forest
Best Random State: 0
Train Accuracy: 1.0
Test Accuracy: 1.0
CV Score (Train): 0.99375
Precision (Train): 1.0
Recall (Train): 1.0
F1 Score (Train): 1.0
__________________________________________________
Model: K Neighbors
Best Random State: 0
Train

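Displays the performance metrics for all models. Precision, recall, and F1 are computed on the training split; only accuracy is also reported for the test split.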
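The printed blocks can be easier to compare side by side in a table. The sketch below copies the four complete results shown above into a DataFrame; the K Neighbors entry is cut off in the output, so it is left out. The train_cv_gap column is an added helper for illustration, not something the project computes.

```python
import pandas as pd

# Values copied from the printed results for the four complete entries
results = pd.DataFrame(
    {
        "train_acc": [0.990625, 0.996875, 1.0, 1.0],
        "test_acc": [1.0, 1.0, 1.0, 1.0],
        "cv_score": [0.98125, 0.9875, 0.9625, 0.99375],
    },
    index=["Logistic Regression", "SVM", "Decision Trees", "Random Forest"],
)

# Gap between training accuracy and CV score: a rough overfitting indicator
results["train_cv_gap"] = results["train_acc"] - results["cv_score"]

# Rank by cross-validation score, highest first
ranked = results.sort_values("cv_score", ascending=False)

print(ranked)
```

Ranked this way, Random Forest comes first and Decision Trees shows the largest gap between training accuracy and cross-validation score.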


# 6. Conclusion
- This project successfully tackled the challenge of predicting Chronic Kidney Disease (CKD) using machine learning techniques. By systematically preprocessing the dataset, addressing inconsistencies, and encoding categorical variables, the project ensured data readiness for analysis. Several models, including Logistic Regression, Random Forest, SVM, and XGBoost, were trained and evaluated using metrics like accuracy, precision, recall, and F1-score.

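- Each model was scored with 5-fold cross-validation on 100 different train/test splits, and the split with the highest test accuracy among those within 0.05 of the cross-validation score was kept. On the selected splits, Logistic Regression, SVM, Decision Trees, and Random Forest all reached a test accuracy of 1.0, with cross-validation scores between 0.9625 and 0.99375. Because the split is chosen using test accuracy, these test figures are likely optimistic; the cross-validation scores give a more conservative view of the models' potential for real-world application.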

- Overall, this project underscores the significance of data science in healthcare, showcasing how predictive analytics can support early CKD diagnosis and improve patient outcomes. Future enhancements, such as hyperparameter tuning and deployment in clinical settings, can further elevate the utility of these models.