# Classification Assignment
#### Problem Statement or Requirement:

###### The hospital management has asked us to build a predictive model that predicts Chronic Kidney Disease (CKD) from several patient parameters. The client has provided the dataset.

###### 1.) Identify your problem statement

###### 2.) Tell basic info about the dataset (Total number of rows, columns)

###### 3.) Mention the pre-processing method if you’re doing any (like converting string to number – nominal data)

###### 4.) Develop a good model with a good evaluation metric. You can use any machine learning algorithm and create as many models as you like, but you must settle on one final model.

###### 5.) All the research values of each algorithm should be documented. (You can tabulate the results or include screenshots.)

###### 6.) Mention your final model and justify why you have chosen it.

1. The hospital wants a reliable predictive model to classify patients into CKD (Chronic Kidney Disease) or Not CKD, using medical attributes.
You are tasked with building the best-performing classification model, evaluated using precision, recall, f1-score, and accuracy.

Domain: ML

Type: Supervised Learning

Objective: Binary Classification

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd


In [None]:
# Reading the Dataset
dataset = pd.read_csv('CKD.csv')


2. Tell basic info about the dataset (Total number of rows, columns)

In [None]:
print(f"\nRows: {dataset.shape[0]}")
print(f"\nColumns: {dataset.shape[1]}\n\n")

display(dataset)

# Displaying the dataset information
# (info() prints its summary and returns None, so display() also shows "None")
display(dataset.info())



Rows: 399

Columns: 25




Unnamed: 0,age,bp,sg,al,su,rbc,pc,pcc,ba,bgr,...,pcv,wc,rc,htn,dm,cad,appet,pe,ane,classification
0,2.000000,76.459948,c,3.0,0.0,normal,abnormal,notpresent,notpresent,148.112676,...,38.868902,8408.191126,4.705597,no,no,no,yes,yes,no,yes
1,3.000000,76.459948,c,2.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,34.000000,12300.000000,4.705597,no,no,no,yes,poor,no,yes
2,4.000000,76.459948,a,1.0,0.0,normal,normal,notpresent,notpresent,99.000000,...,34.000000,8408.191126,4.705597,no,no,no,yes,poor,no,yes
3,5.000000,76.459948,d,1.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,38.868902,8408.191126,4.705597,no,no,no,yes,poor,yes,yes
4,5.000000,50.000000,c,0.0,0.0,normal,normal,notpresent,notpresent,148.112676,...,36.000000,12400.000000,4.705597,no,no,no,yes,poor,no,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394,51.492308,70.000000,a,0.0,0.0,normal,normal,notpresent,notpresent,219.000000,...,37.000000,9800.000000,4.400000,no,no,no,yes,poor,no,yes
395,51.492308,70.000000,c,0.0,2.0,normal,normal,notpresent,notpresent,220.000000,...,27.000000,8408.191126,4.705597,yes,yes,no,yes,poor,yes,yes
396,51.492308,70.000000,c,3.0,0.0,normal,normal,notpresent,notpresent,110.000000,...,26.000000,9200.000000,3.400000,yes,yes,no,poor,poor,no,yes
397,51.492308,90.000000,a,0.0,0.0,normal,normal,notpresent,notpresent,207.000000,...,38.868902,8408.191126,4.705597,yes,yes,no,yes,poor,yes,yes


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 399 entries, 0 to 398
Data columns (total 25 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             399 non-null    float64
 1   bp              399 non-null    float64
 2   sg              399 non-null    object 
 3   al              399 non-null    float64
 4   su              399 non-null    float64
 5   rbc             399 non-null    object 
 6   pc              399 non-null    object 
 7   pcc             399 non-null    object 
 8   ba              399 non-null    object 
 9   bgr             399 non-null    float64
 10  bu              399 non-null    float64
 11  sc              399 non-null    float64
 12  sod             399 non-null    float64
 13  pot             399 non-null    float64
 14  hrmo            399 non-null    float64
 15  pcv             399 non-null    float64
 16  wc              399 non-null    float64
 17  rc              399 non-null    flo

None

3.) Mention the pre-processing method if you’re doing any (like converting string to number – nominal data)

In [None]:
# Converting categorical variables to numerical values.
# The categorical columns (including the binary `classification` target) are nominal,
# so each one is mapped to integer codes with LabelEncoder.
from sklearn.preprocessing import LabelEncoder

# Handle categorical and numerical separately
categorical_cols = dataset.select_dtypes(include='object').columns
numerical_cols = dataset.select_dtypes(exclude='object').columns

# Fill missing values
dataset[categorical_cols] = dataset[categorical_cols].fillna(dataset[categorical_cols].mode().iloc[0])
dataset[numerical_cols] = dataset[numerical_cols].fillna(dataset[numerical_cols].mean())

# Encode categorical columns
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    dataset[col] = le.fit_transform(dataset[col])
    label_encoders[col] = le
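
The encoding step above can be illustrated on a small scale. The sketch below uses a hypothetical toy frame (the column values only mimic the style of the CKD columns) to show that `LabelEncoder` assigns integer codes in sorted class order, and that keeping the fitted encoders makes it possible to map codes back to the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the values only mimic the style of the CKD columns
toy = pd.DataFrame({"htn": ["no", "yes", "no"], "appet": ["yes", "poor", "yes"]})

encoders = {}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    encoders[col] = le

# Classes are stored in sorted order; the position of a class is its integer code
mapping = {
    col: dict(zip(le.classes_, range(len(le.classes_))))
    for col, le in encoders.items()
}

# The stored encoder can turn codes back into the original labels
decoded_htn = list(encoders["htn"].inverse_transform(toy["htn"]))
print(mapping)
print(decoded_htn)
```

Because the codes follow alphabetical order, they carry no clinical meaning; for nominal features with more than two categories, one-hot encoding is a common alternative.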


4. Develop a good model with a good evaluation metric. You can use any machine learning algorithm and create as many models as you like, but you must settle on one final model.

# RandomForest

In [None]:
# Splitting the dataset into independent and dependent variables
independent = dataset.drop('classification', axis=1)
dependent = dataset['classification']


In [None]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(independent, dependent, test_size = 1/3, random_state = 0)
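
The split above is purely random. As a possible refinement, the sketch below shows the `stratify` argument of `train_test_split`, which keeps the class proportions close to equal in the training and test parts. The labels are hypothetical: 399 rows with an assumed 150/249 class split, not the actual CKD counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 399 rows with an assumed 150/249 class split
y = np.array([0] * 150 + [1] * 249)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y
)

# With stratify=y, the positive-class share is nearly identical in both parts
print(y.mean(), y_tr.mean(), y_te.mean())
```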


In [None]:
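from sklearn.preprocessing import StandardScaler

# Use a separate name for the instance so the StandardScaler class is not overwritten
scaler = StandardScaler()
# Fit on the training set only, then apply the same statistics to the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)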
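
To make the scaling step concrete, here is a minimal sketch on synthetic data (not the CKD dataset). It shows that the scaler learns its mean and standard deviation from the training set only, and that `transform` on the test set reuses those stored values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean_ and scale_ from train only
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

# transform() is (x - mean_) / scale_ with the stored training values
manual = (test - demo_scaler.mean_) / demo_scaler.scale_
print(np.allclose(manual, test_scaled))
```

Tree-based models such as random forests are generally insensitive to feature scaling, so this step matters more if other algorithms are compared on the same split.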


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

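# NOTE: 'auto' is not accepted for max_features by recent scikit-learn releases,
# so every candidate that uses it fails to fit and is scored nan (see the log below).
param_grid = {'n_estimators': [100, 200, 300],
              'max_features': ['auto', 'sqrt', 'log2'],
              'min_samples_split': [2, 5, 10],
              'bootstrap': [True, False]}

grid = GridSearchCV(RandomForestClassifier(), param_grid, refit=True, verbose=3,
                    n_jobs=-1, scoring='f1_weighted')

# Fitting the model for grid search
grid.fit(X_train, y_train)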


Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV 2/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=100;, score=nan total time=   0.0s
[CV 5/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=100;, score=nan total time=   0.0s
[CV 1/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=200;, score=nan total time=   0.0s
[CV 2/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=200;, score=nan total time=   0.0s
[CV 3/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=200;, score=nan total time=   0.0s
[CV 4/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=200;, score=nan total time=   0.0s
[CV 5/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=200;, score=nan total time=   0.0s
[CV 4/5] END bootstrap=True, max_features=auto, min_samples_split=2, n_estimators=100;, score=nan total time=   0.0s
[C

90 fits failed out of a total of 270.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/home/deehub/JoinDeeHub/.venv/lib/python3.12/site-packages/sklearn/utils/_param_validation.py", line
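
The 90 failed fits are consistent with the 18 candidates that use `max_features='auto'` (18 candidates × 5 folds), a value that recent scikit-learn releases no longer accept. The sketch below runs a grid without that value on synthetic stand-in data; the grid sizes, `cv=3`, and `random_state` are assumptions chosen only to keep the example fast and repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the CKD dataset
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# Same idea as the notebook's grid, minus 'auto'; sizes reduced to keep the sketch fast
demo_grid = {
    "n_estimators": [10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5],
    "bootstrap": [True, False],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    scoring="f1_weighted",
    cv=3,
    error_score="raise",  # surface invalid parameters immediately instead of scoring nan
)
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
print(search.best_params_, np.isnan(scores).any())
```

Setting `error_score="raise"` is the option the warning itself suggests for debugging: an invalid parameter then stops the search at once rather than producing nan rows.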

In [None]:
# Collect the cross-validation results and predict on the test set
# print(grid.best_params_)
result = grid.cv_results_
# print(result)
grid_predictions = grid.predict(X_test)

# Call the metric functions through the module so the result variables
# below do not overwrite the functions of the same name
from sklearn import metrics
confusion_matrix = metrics.confusion_matrix(y_test, grid_predictions)

# Build the classification report
classification_report = metrics.classification_report(y_test, grid_predictions)


In [None]:

from sklearn.metrics import f1_score
# average='weighted' gives the support-weighted F1, not the macro F1
f1_weighted = f1_score(y_test, grid_predictions, average='weighted')
print("The weighted F1 value for best parameter {}:".format(grid.best_params_), f1_weighted)


The weighted F1 value for best parameter {'bootstrap': False, 'max_features': 'sqrt', 'min_samples_split': 2, 'n_estimators': 100}: 1.0


In [None]:
print("The confusion Matrix:\n", confusion_matrix)


The confusion Matrix:
 [[51  0]
 [ 0 82]]


In [None]:
print("\nThe report using RandomForestClassifier :\n\n\n", classification_report)



The report using RandomForestClassifier :


               precision    recall  f1-score   support

           0       1.00      1.00      1.00        51
           1       1.00      1.00      1.00        82

    accuracy                           1.00       133
   macro avg       1.00      1.00      1.00       133
weighted avg       1.00      1.00      1.00       133
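
The figures in the report follow directly from the confusion matrix printed earlier. As a worked check, the sketch below recomputes precision, recall, F1, and accuracy for class 1 from those four counts, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the notebook output: rows = true class, columns = predicted class
cm = np.array([[51, 0],
               [0, 82]])

tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()
print(precision, recall, f1, accuracy)
```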



In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test,grid.predict_proba(X_test)[:,1])


np.float64(1.0)

In [None]:
table=pd.DataFrame.from_dict(result)
table


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_bootstrap,param_max_features,param_min_samples_split,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001635,0.000816,0.0,0.0,True,auto,2,100,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
1,0.000923,4.3e-05,0.0,0.0,True,auto,2,200,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
2,0.001546,0.000744,0.0,0.0,True,auto,2,300,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
3,0.001706,0.000677,0.0,0.0,True,auto,5,100,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
4,0.001856,0.000652,0.0,0.0,True,auto,5,200,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
5,0.001701,0.000741,0.0,0.0,True,auto,5,300,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
6,0.001304,0.000656,0.0,0.0,True,auto,10,100,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
7,0.001561,0.000842,0.0,0.0,True,auto,10,200,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
8,0.000842,0.000171,0.0,0.0,True,auto,10,300,"{'bootstrap': True, 'max_features': 'auto', 'm...",,,,,,,,37
9,0.219602,0.018353,0.016579,0.005681,True,sqrt,2,100,"{'bootstrap': True, 'max_features': 'sqrt', 'm...",1.0,0.961755,1.0,0.962264,1.0,0.984804,0.018612,2


In [None]:
import pickle

# Save the fitted grid search (already refit on the best parameters) to a file
good_fit_model_filename = "random_forest_classification_model.sav"
with open(good_fit_model_filename, "wb") as f:
    pickle.dump(grid, f)

# Load the model back from the file
with open(good_fit_model_filename, "rb") as f:
    load_good_fit_model = pickle.load(f)

# X_test is already standardized; new inputs need the same scaling before predict
result = load_good_fit_model.predict(X_test)
print("Predicted ", result)
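
The saved model expects inputs that have already been standardized. One possible way to avoid carrying the scaler separately is to bundle both steps in a scikit-learn `Pipeline` and pickle that instead. The sketch below is an illustration on synthetic data, not the notebook's method; the step names and forest settings are assumptions.

```python
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the CKD dataset
X_demo, y_demo = make_classification(n_samples=100, n_features=6, random_state=0)

# Scaling and the classifier travel together as one object
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Round-trip through pickle in memory; the restored pipeline takes raw features
restored = pickle.loads(pickle.dumps(pipe))
same = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print(same)
```

With this layout, whoever loads the file can call `predict` on unscaled feature rows, because the stored scaler is applied first.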

