# Classification Assignment
#### Problem Statement or Requirement:

###### The hospital's management has asked us to build a predictive model that predicts Chronic Kidney Disease (CKD) from several clinical parameters. The client has provided the dataset for this purpose.



1. The hospital wants a reliable predictive model to classify patients into CKD (Chronic Kidney Disease) or Not CKD, using medical attributes.
You are tasked with building the best-performing classification model, evaluated using precision, recall, f1-score, and accuracy.

Domain: ML

Type: Supervised Learning

Objective: Binary Classification

In [32]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB, CategoricalNB
from sklearn.metrics import classification_report, accuracy_score


In [33]:
# Reading the Dataset
dataset = pd.read_csv('CKD.csv')

2. Report basic information about the dataset (total number of rows and columns)

In [34]:
print(f"\nRows: {dataset.shape[0]}")
print(f"\nColumns: {dataset.shape[1]}\n\n")

display(dataset)

# Displaying the dataset information.
# info() prints its summary itself and returns None, so display() also shows "None" at the end.
display(dataset.info())


Rows: 399

Columns: 25




Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,2.000000,76.459948,c,3.0,0.0,normal,abnormal,notpresent,notpresent,148.112676,...,38.868902,8408.191126,4.705597,no,no,no,yes,yes,no,yes
1,3.000000,76.459948,c,2.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,34.000000,12300.000000,4.705597,no,no,no,yes,poor,no,yes
2,4.000000,76.459948,a,1.0,0.0,normal,normal,notpresent,notpresent,99.000000,...,34.000000,8408.191126,4.705597,no,no,no,yes,poor,no,yes
3,5.000000,76.459948,d,1.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,38.868902,8408.191126,4.705597,no,no,no,yes,poor,yes,yes
4,5.000000,50.000000,c,0.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,36.000000,12400.000000,4.705597,no,no,no,yes,poor,no,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394,51.492308,70.000000,a,0.0,0.0,normal,normal,notpresent,notpresent,219.000000,...,37.000000,9800.000000,4.400000,no,no,no,yes,poor,no,yes
395,51.492308,70.000000,c,0.0,2.0,normal,normal,notpresent,notpresent,220.000000,...,27.000000,8408.191126,4.705597,yes,yes,no,yes,poor,yes,yes
396,51.492308,70.000000,c,3.0,0.0,normal,normal,notpresent,notpresent,110.000000,...,26.000000,9200.000000,3.400000,yes,yes,no,poor,poor,no,yes
397,51.492308,90.000000,a,0.0,0.0,normal,normal,notpresent,notpresent,207.000000,...,38.868902,8408.191126,4.705597,yes,yes,no,yes,poor,yes,yes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             399 non-null    float64
 1   bp              399 non-null    float64
 2   sg              399 non-null    object 
 3   al              399 non-null    float64
 4   su              399 non-null    float64
 5   rbc             399 non-null    object 
 6   pc              399 non-null    object 
 7   pcc             399 non-null    object 
 8   ba              399 non-null    object 
 9   bgr             399 non-null    float64
 10  bu              399 non-null    float64
 11  sc              399 non-null    float64
 12  sod             399 non-null    float64
 13  pot             399 non-null    float64
 14  hrmo            399 non-null    float64
 15  pcv             399 non-null    float64
 16  wc              399 non-null    float64
 17  rc              399 non-null    flo

None

3. Describe any pre-processing applied (for example, converting nominal string data to numbers)

In [35]:
# Converting categorical (string) variables to numerical values.
# The object columns, including the target `classification`, are nominal (unordered) data,
# so they are one-hot encoded with OneHotEncoder.
from sklearn.preprocessing import OneHotEncoder

# Ensure dataset is defined
if 'dataset' not in locals():
    raise ValueError("dataset is not defined. Please run the cell where dataset is loaded.")

# Handle categorical and numerical separately
categorical_cols = dataset.select_dtypes(include='object').columns
numerical_cols = dataset.select_dtypes(exclude='object').columns

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
# Fit and transform the categorical columns
encoded_categorical = encoder.fit_transform(dataset[categorical_cols])
# Create a DataFrame with the encoded categorical columns
encoded_categorical_df = pd.DataFrame(encoded_categorical, columns=encoder.get_feature_names_out(categorical_cols))
# Concatenate the encoded categorical columns with the numerical columns
dataset = pd.concat([dataset[numerical_cols].reset_index(drop=True), encoded_categorical_df.reset_index(drop=True)], axis=1)
# Display the processed dataset
display(dataset)

Unnamed: 0,age,bp,al,su,bgr,bu,sc,sod,pot,hrmo,...,pc_normal,pcc_present,ba_present,htn_yes,dm_yes,cad_yes,appet_yes,pe_yes,ane_yes,classification_yes
0,2.000000,76.459948,3.0,0.0,148.112676,57.482105,3.077356,137.528754,4.627244,12.518156,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0
1,3.000000,76.459948,2.0,0.0,148.112676,22.000000,0.700000,137.528754,4.627244,10.700000,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,4.000000,76.459948,1.0,0.0,99.000000,23.000000,0.600000,138.000000,4.400000,12.000000,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
3,5.000000,76.459948,1.0,0.0,148.112676,16.000000,0.700000,138.000000,3.200000,8.100000,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
4,5.000000,50.000000,0.0,0.0,148.112676,25.000000,0.600000,137.528754,4.627244,11.800000,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394,51.492308,70.000000,0.0,0.0,219.000000,36.000000,1.300000,139.000000,3.700000,12.500000,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
395,51.492308,70.000000,0.0,2.0,220.000000,68.000000,2.800000,137.528754,4.627244,8.700000,...,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0
396,51.492308,70.000000,3.0,0.0,110.000000,115.000000,6.000000,134.000000,2.700000,9.100000,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
397,51.492308,90.000000,0.0,0.0,207.000000,80.000000,6.800000,142.000000,5.500000,8.500000,...,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0
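
To make the encoding step easier to follow, here is a minimal sketch on a made-up four-row table. It is not the notebook's code: it swaps in `pd.get_dummies(..., drop_first=True)` as a lightweight stand-in for `OneHotEncoder(drop='first')`, and it separates the target before encoding so the label never ends up among the features. The toy values and the names `toy`, `features`, `X` and `y` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the CKD table: one numeric column, two nominal
# columns and the target. All values are made up for illustration.
toy = pd.DataFrame({
    "age": [48.0, 62.0, 51.0, 35.0],
    "rbc": ["normal", "abnormal", "normal", "normal"],
    "htn": ["yes", "no", "no", "yes"],
    "classification": ["yes", "yes", "no", "no"],
})

# Separate the target first so that it is never treated as a feature.
y = (toy["classification"] == "yes").astype(int)
features = toy.drop(columns="classification")

# drop_first=True mirrors OneHotEncoder(drop='first'): each two-level column
# collapses to a single 0/1 indicator named <column>_<kept level>.
X = pd.get_dummies(features, columns=["rbc", "htn"], drop_first=True, dtype=float)

print(list(X.columns))
print(y.tolist())
```

Dropping the first level avoids storing two perfectly redundant indicator columns for every binary attribute, which is why the processed table above contains names such as `htn_yes` but no `htn_no`.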


4. Develop a good model with a good evaluation metric. You can use any machine learning algorithm and create as many models as you like, but you must settle on one final model.

# Naive Bayes

In [36]:
# Splitting the dataset into independent and dependent variables
independent = dataset.drop('classification_yes', axis=1)
dependent = dataset['classification_yes']

In [37]:
# Split into training and test sets (one third of the rows is held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(independent, dependent, test_size = 1/3, random_state = 0)
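
As a side note on the split, the sketch below shows what `test_size=1/3` does to a 399-row table and how the optional `stratify` argument keeps the class ratio the same in both parts. The feature matrix is random and the class counts (150 vs. 249) are assumed purely for illustration; they are not taken from the CKD data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the 399-row table; the class counts are assumed.
rng = np.random.default_rng(0)
X = rng.normal(size=(399, 5))
y = np.array([0] * 150 + [1] * 249)

# stratify=y asks for the same class proportions in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)    # row counts of each part
print(y_train.mean(), y_test.mean())  # share of class 1 in each part
```

With 399 rows, one third corresponds to 133 test rows, which matches the support shown in the classification report further down.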


# CategoricalNB

In [38]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import GridSearchCV
import numpy as np

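# Define the hyperparameter grid: ten smoothing values from 1.0 down to 1e-9
param_grid = {'alpha': np.logspace(0, -9, num=10)}

# Grid search with the default 5-fold cross-validation, scored by weighted F1
grid = GridSearchCV(CategoricalNB(), param_grid, refit=True, verbose=3, n_jobs=-1, scoring='f1_weighted')

# Fit the model.
# NOTE: this fits on the full dataset, which includes the rows in X_test, so the
# test-set metrics reported further down are optimistic. Fitting on
# (X_train, y_train) instead would keep the test set unseen.
grid.fit(independent, dependent)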


Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 2/5] END ...........................alpha=1.0;, score=nan total time=   0.0s
[CV 1/5] END .........................alpha=1.0;, score=0.950 total time=   0.0s
[CV 4/5] END ...........................alpha=1.0;, score=nan total time=   0.0s
[CV 3/5] END ...........................alpha=1.0;, score=nan total time=   0.0s
[CV 5/5] END ...........................alpha=1.0;, score=nan total time=   0.0s
[CV 2/5] END ...........................alpha=0.1;, score=nan total time=   0.0s
[CV 1/5] END .........................alpha=0.1;, score=0.963 total time=   0.0s
[CV 3/5] END ...........................alpha=0.1;, score=nan total time=   0.0s
[CV 1/5] END ........................alpha=0.01;, score=0.975 total time=   0.0s
[CV 4/5] END ...........................alpha=0.1;, score=nan total time=   0.0s
[CV 3/5] END ..........................alpha=0.01;, score=nan total time=   0.0s
[CV 2/5] END ..........................alpha=0.0

Traceback (most recent call last):
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 949, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 288, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 380, in _score
    y_pred = method_caller(
             ^^^^^^^^^^^^^^
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/metrics/_scorer.py", line 90, in _cached_call
    result, _ = _get_response_values(
                ^^^^^^^^^^^^^^^^^^^^^
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/s

[CV 2/5] END ........................alpha=0.0001;, score=nan total time=   0.0s
[CV 1/5] END .......................alpha=1e-06;, score=0.962 total time=   0.0s
[CV 3/5] END .........................alpha=1e-06;, score=nan total time=   0.0s
[CV 1/5] END .......................alpha=1e-05;, score=0.975 total time=   0.0s
[CV 2/5] END .........................alpha=1e-06;, score=nan total time=   0.0s
[CV 1/5] END .......................alpha=1e-08;, score=0.962 total time=   0.0s
[CV 4/5] END .........................alpha=1e-06;, score=nan total time=   0.0s
[CV 2/5] END .........................alpha=1e-05;, score=nan total time=   0.0s
[CV 2/5] END .........................alpha=1e-07;, score=nan total time=   0.0s
[CV 2/5] END .........................alpha=1e-08;, score=nan total time=   0.0s
[CV 3/5] END .........................alpha=1e-08;, score=nan total time=   0.0s
[CV 5/5] END .........................alpha=1e-06;, score=nan total time=   0.0s
[CV 3/5] END ...............
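
For comparison, here is a hedged sketch of a grid search that keeps the test set unseen. It is not the notebook's method: it uses synthetic data from `make_classification`, swaps in `GaussianNB` (which models continuous features directly) and tunes its `var_smoothing` parameter. The data, the grid and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the CKD features (assumption: 399 rows, 10 numeric columns).
X, y = make_classification(n_samples=399, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y
)

# Tune the variance-smoothing term with 5-fold cross-validation on the training part only.
param_grid = {"var_smoothing": np.logspace(0, -9, num=10)}
grid = GridSearchCV(GaussianNB(), param_grid, scoring="f1_weighted", cv=5)
grid.fit(X_train, y_train)

# The held-out rows are used exactly once, for the final estimate.
test_f1 = f1_score(y_test, grid.predict(X_test), average="weighted")
print(grid.best_params_, round(test_f1, 3))
```

Because the search only ever sees `X_train`, the final weighted F1 is an estimate on rows the model has not been fitted to.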


In [39]:
# Print the best parameter after tuning
#print(grid.best_params_)
result = grid.cv_results_
#print(result)
grid_predictions = grid.predict(X_test)

# Confusion matrix (stored under a new name so the imported function is not overwritten)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, grid_predictions)

# Classification report
from sklearn.metrics import classification_report
clf_report = classification_report(y_test, grid_predictions)

In [40]:

from sklearn.metrics import f1_score
# average='weighted' gives the support-weighted F1, not the macro F1
f1_weighted = f1_score(y_test, grid_predictions, average='weighted')
print("The f1_weighted value for best parameter {}:".format(grid.best_params_), f1_weighted)

print("The confusion Matrix:\n", cm)
print("\nThe report using CategoricalNB :\n\n\n", clf_report)


The f1_weighted value for best parameter {'alpha': np.float64(1.0)}: 0.9924946382275899
The confusion Matrix:
 [[51  0]
 [ 1 81]]

The report using CategoricalNB :


               precision    recall  f1-score   support

         0.0       0.98      1.00      0.99        51
         1.0       1.00      0.99      0.99        82

    accuracy                           0.99       133
   macro avg       0.99      0.99      0.99       133
weighted avg       0.99      0.99      0.99       133
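
The figures in this report can be re-derived by hand from the confusion matrix, which is a useful sanity check on what each metric means. The sketch below uses only NumPy and the four counts printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are true classes, columns are predictions.
cm = np.array([[51, 0],
               [1, 81]])

support = cm.sum(axis=1)                    # true samples per class
precision = np.diag(cm) / cm.sum(axis=0)    # correct predictions / all predictions of that class
recall = np.diag(cm) / support              # correct predictions / all true members of that class
f1 = 2 * precision * recall / (precision + recall)

accuracy = np.trace(cm) / cm.sum()
macro_f1 = f1.mean()                            # plain average over classes
weighted_f1 = np.average(f1, weights=support)   # average weighted by class support

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(accuracy, macro_f1, weighted_f1)
```

The support-weighted F1 computed this way agrees with the value printed by `f1_score(..., average='weighted')` above, which confirms that the printed score is the weighted rather than the macro average.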



In [41]:
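from sklearn.metrics import roc_auc_score

# The score is stored rather than shown: a notebook cell only displays its last expression
roc_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])

# Cross-validation results as a table
table = pd.DataFrame.from_dict(result)
table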


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.016335,0.005067,0.003481,0.001334,1.0,{'alpha': 1.0},0.950296,,,,,,,1
1,0.027465,0.00763,0.006951,0.0028,0.1,{'alpha': 0.1},0.962618,,,,,,,1
2,0.020942,0.006437,0.00765,0.004637,0.01,{'alpha': 0.01},0.975,,,,,,,1
3,0.016113,0.005933,0.00549,0.003554,0.001,{'alpha': 0.001},0.975,,,,,,,1
4,0.014842,0.006049,0.005388,0.003,0.0001,{'alpha': 0.0001},0.975,,,,,,,1
5,0.015363,0.005505,0.006114,0.00374,1e-05,{'alpha': 1e-05},0.975,,,,,,,1
6,0.01535,0.005964,0.004781,0.001304,1e-06,{'alpha': 1e-06},0.962368,,,,,,,1
7,0.022958,0.007874,0.008386,0.003976,1e-07,{'alpha': 1e-07},0.962368,,,,,,,1
8,0.008309,0.001239,0.003989,0.002426,1e-08,{'alpha': 1e-08},0.962368,,,,,,,1
9,0.016116,0.003874,0.005815,0.005208,1e-09,{'alpha': 1e-09},0.949628,,,,,,,1
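
Only `split0_test_score` is populated in this table; the other folds, and therefore the mean, are NaN. The traceback in the log is truncated, so the cause cannot be confirmed from the output alone. One plausible explanation, offered here as an assumption, is that `CategoricalNB` treats every feature value as an integer category code and fails at prediction time when a validation fold contains a code larger than any seen during fitting. The toy arrays below are made up to reproduce that situation in isolation.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Training data uses category codes 0 and 1 only.
X_train = np.array([[0], [1], [0], [1]])
y_train = np.array([0, 1, 0, 1])
# The "validation fold" contains code 2, which was never seen during fit.
X_val = np.array([[2]])

model = CategoricalNB().fit(X_train, y_train)
try:
    model.predict(X_val)
    failed = False
except IndexError as exc:
    failed = True
    print("predict failed:", exc)

# Declaring the number of categories up front reserves a smoothed slot for unseen codes.
safe_model = CategoricalNB(min_categories=3).fit(X_train, y_train)
proba = safe_model.predict_proba(X_val)
print(proba)
```

If this is indeed what happened, it would also explain why every candidate shares rank 1 and why `best_params_` is simply the first value in the grid: with all mean scores NaN, the search has no basis for preferring one `alpha` over another. For continuous measurements such as `bgr` or `sc`, a model like `GaussianNB`, or binning the values before using `CategoricalNB`, is usually a better fit.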
