## About this Project
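This project predicts chronic kidney disease (CKD). It trains models on the UCI Machine Learning Repository CKD dataset and evaluates their accuracy. Here I compare six supervised classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), K-Nearest Neighbors (KNN), and Naive Bayes (NB).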
## 📚 1. Importing Required Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# ML libraries

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

## 📥 2. Loading the Dataset
Load the CKD dataset and show the first few rows.

In [None]:
data = pd.read_csv('new_model.csv')
data.head()

### About the Dataset
#### Context
This dataset is originally from the UCI Machine Learning Repository. Its objective is to predict, from the diagnostic measurements it contains, whether a patient has chronic kidney disease.

#### Content
The dataset consists of several medical predictor variables and one target variable, Class. The predictor variables include Blood Pressure (Bp), Albumin (Al), etc.
## 📊 3. Basic Data Overview
We now inspect the shape, missing values, and data types.

In [None]:
print("Shape of data:", data.shape)
print("\nMissing values per column:\n", data.isnull().sum())
print("\nData Types:\n")
data.info()

The dataset has no null values, so no imputation or row removal is needed. If nulls were present, numeric values could be replaced with the column mean and categorical values with the column mode.
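
As a minimal sketch of that imputation rule, assuming a small made-up DataFrame (the column names `Bp` and `Rbc` and all values are illustrative and are not read from `new_model.csv`), numeric gaps are filled with the mean and non-numeric gaps with the mode. The helper name `impute_simple` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the CKD data
toy = pd.DataFrame({
    "Bp": [80.0, np.nan, 70.0, 90.0],
    "Rbc": ["normal", "abnormal", None, "normal"],
})

def impute_simple(df):
    """Fill numeric columns with the mean and all other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores missing values; take the most frequent category
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy_filled = impute_simple(toy)
print(toy_filled)
```

In a real workflow the mean and mode would normally be computed on the training split only and then reused for the test split.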

All columns are also numeric, so no data type conversion is needed. If there were categorical columns, they could be encoded in two ways: label encoding or one-hot encoding.
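
The two encodings can be sketched as follows, assuming a made-up categorical column (`Appetite` and its values are illustrative only and do not come from the dataset). Label encoding maps each category to one integer, while one-hot encoding creates one indicator column per category.

```python
import pandas as pd

# Hypothetical categorical column, not part of new_model.csv
toy = pd.DataFrame({"Appetite": ["good", "poor", "good", "poor", "good"]})

# Label encoding: each category becomes one integer code
# (categories are sorted, so "good" -> 0 and "poor" -> 1)
toy["Appetite_label"] = toy["Appetite"].astype("category").cat.codes

# One-hot encoding: each category becomes its own 0/1 indicator column
one_hot = pd.get_dummies(toy["Appetite"], prefix="Appetite", dtype=int)

print(toy)
print(one_hot)
```

Label encoding implies an ordering between categories, which tree-based models tolerate well; one-hot encoding avoids that implied order and is usually the safer choice for linear models and KNN.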

## 🔍 4. Check for Duplicate Records
Identify if there are any duplicated rows in the dataset.

In [None]:
data_duplicates = data[data.duplicated()]
print("Number of duplicate rows:", data_duplicates.shape[0])
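
If duplicates do turn up, one way to handle them is to drop the repeated rows. The sketch below uses a made-up three-row frame (values are illustrative, not taken from the CKD data) to show the count and the cleaned result.

```python
import pandas as pd

# Hypothetical toy frame in which the second row repeats the first
toy = pd.DataFrame({"Bp": [80, 80, 70], "Al": [1, 1, 0]})

# duplicated() flags every repeat of an earlier row
n_dupes = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild a clean index
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicates found:", n_dupes)
print("Shape after dropping:", deduped.shape)
```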

## 🔥 5. Correlation Heatmap
Visualizing correlations among features using a heatmap.

In [None]:
correlation_matrix = data.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()

## 📦 6. Box Plot to Visualize Outliers
Box plots help identify outliers in each feature.

In [None]:
data.plot.box(figsize=(15, 6), rot=90)
plt.title("Boxplot of Features")
plt.show()

## 🚨 7. Detecting Outliers Using IQR Method
Check for outlier indices in each numeric feature.


In [None]:
db = data.drop("Class", axis=1)

def detect_outlier(feature):
    Q1 = feature.quantile(0.25)
    Q3 = feature.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return feature.index[(feature < lower_bound) | (feature > upper_bound)].tolist()

for col in db:
    print(f"{col} --> {detect_outlier(data[col])}")
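
The cell above only lists outlier indices; it does not change the data. One possible follow-up, shown here as a sketch and not as a step this notebook performs, is to clip values to the same IQR bounds. The function name `clip_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def clip_outliers_iqr(feature, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1 = feature.quantile(0.25)
    q3 = feature.quantile(0.75)
    iqr = q3 - q1
    return feature.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical values with one obvious outlier (100)
toy = pd.Series([10, 12, 11, 13, 12, 100])
clipped = clip_outliers_iqr(toy)

print(clipped.tolist())
```

Whether clipping is appropriate depends on the feature: in clinical data an extreme lab value may be a genuine sign of disease, so removing or capping it can discard useful signal.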


## 📦 8. Boxplot After Outlier Detection
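Plot the box plots again for comparison, this time without the `Class` column. No outliers have been removed at this point, so the feature distributions are unchanged.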


In [None]:
db.plot.box(figsize=(15, 6), rot=90)
plt.title("Boxplot After Outlier Detection")
plt.show()



## 📈 9. Histogram of Features
Distribution of each feature to understand data spread and skewness.


In [None]:
db.hist(figsize=(16, 12), bins=20, edgecolor='black')
plt.suptitle('Histograms of Features', fontsize=16)
plt.tight_layout(rect=(0, 0, 1, 0.97))
plt.show()


## 🌊 10. KDE (Kernel Density Estimation) Plots
Visualizing the density distribution of each feature.


In [None]:
db.plot(kind='density', subplots=True, layout=(4, 4), sharex=False, figsize=(16, 12))
plt.suptitle('KDE Plots of Features', fontsize=16)
plt.tight_layout(rect=(0, 0, 1, 0.97))
plt.show()


## 🔗 11. Pairplot of Selected Features
Checking pairwise relationships between the first 5 features.


In [None]:
selected_cols = db.columns[:5]
sns.pairplot(db[selected_cols].copy())
plt.suptitle('Pairplot of Selected Features', y=1.02)
plt.show()


## 📊 12. Class Distribution
Visualizing how many instances belong to each CKD class.


In [None]:
if 'Class' in data.columns:
    plt.figure(figsize=(8, 5))
    sns.countplot(x='Class', data=data, order=data['Class'].value_counts().index)
    plt.title('Class Distribution')
    plt.xticks(rotation=45)
    plt.show()


## 🧬 13. Violin Plots of Features by CKD Class
Detailed view of distributions for each feature across CKD classes.


In [None]:
# Plot every feature except the target, wherever 'Class' sits in the frame
for col in data.columns.drop('Class'):
    plt.figure(figsize=(8, 5))
    sns.violinplot(x='Class', y=col, data=data, inner='quartile')
    plt.title(f'Violin Plot of {col} by Class')
    plt.xticks(rotation=45)
    plt.show()


## ⚙️ 14. Data Preprocessing
Split data, scale features, and prepare for training.


In [None]:
# Encode labels if not already numerical
if data['Class'].dtype == 'object':
    data['Class'] = data['Class'].astype('category').cat.codes

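X = data.drop('Class', axis=1)
y = data['Class']

# Train-test split first, so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)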


## 🤖 15. Train Multiple Classifiers
We will train Logistic Regression, SVM, Decision Tree, KNN, Naive Bayes, and Random Forest.


In [None]:
# random_state is fixed on the randomized models so results are reproducible
models = {
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=42)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"\n🧠 {name} Accuracy: {acc:.2f}")
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
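
A single 80/20 split can give a noisy accuracy estimate on a small dataset. As a hedged sketch of a more stable comparison, the block below runs 5-fold stratified cross-validation with the scaler wrapped in a pipeline. It uses synthetic data from `make_classification` as a stand-in, because the CKD file is not assumed to be available here; the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CKD features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is re-fitted inside each fold, so no held-out statistics leak in
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Fold accuracies: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by the unscaled `X` and `y`, and the pipeline's final step by any of the six classifiers.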


## 🏁 Summary
This notebook walked through loading, visualizing, preprocessing, and classifying Chronic Kidney Disease data using several ML algorithms. You can improve results further using hyperparameter tuning, feature selection, or ensemble models.
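
As one illustration of the hyperparameter tuning mentioned above, the sketch below runs a small grid search over a Random Forest. It is not part of the original analysis: the data is synthetic (`make_classification`), and the grid values are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the CKD features and labels (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Small illustrative grid; real searches usually cover more values
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Held-out accuracy: {search.score(X_te, y_te):.2f}")
```

The held-out test set is touched only once, after the search has picked its parameters on the training folds, so the final score is not biased by the tuning itself.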
