# Hyperparameter Tuning with GridSearchCV
This dataset, obtained from Kaggle, describes patients at various stages of liver disease. The stages are:
- Inflammation<br>
- Fibrosis<br>
- Cirrhosis<br>
- Liver Failure

Link to dataset: https://www.kaggle.com/datasets/fedesoriano/cirrhosis-prediction-dataset


## Objectives
- To understand the concept of hyperparameter tuning.
- To improve the accuracy of a model by using GridSearchCV for hyperparameter optimization.

### Import necessary packages

In [117]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt

## Understanding the data
Cirrhosis is an advanced phase of liver scarring (fibrosis) caused by various liver diseases and conditions, including hepatitis and long-term alcohol abuse. The fields listed below come from the Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver.

1) ID: unique identifier<br>
2) N_Days: number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986<br>
3) Status: status of the patient C (censored), CL (censored due to liver transplant), or D (death)<br>
4) Drug: type of drug D-penicillamine or placebo<br>
5) Age: age in [days]<br>
6) Sex: M (male) or F (female)<br>
7) Ascites: presence of ascites N (No) or Y (Yes)<br>
8) Hepatomegaly: presence of hepatomegaly N (No) or Y (Yes)<br>
9) Spiders: presence of spiders N (No) or Y (Yes)<br>
10) Edema: presence of edema N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)<br>
11) Bilirubin: serum bilirubin in [mg/dl]<br>
12) Cholesterol: serum cholesterol in [mg/dl]<br>
13) Albumin: albumin in [gm/dl]<br>
14) Copper: urine copper in [ug/day]<br>
15) Alk_Phos: alkaline phosphatase in [U/liter]<br>
16) SGOT: SGOT in [U/ml]<br>
17) Triglycerides: triglycerides in [mg/dl] (the column is spelled `Tryglicerides` in the CSV)<br>
18) Platelets: platelets per cubic [ml/1000]<br>
19) Prothrombin: prothrombin time in seconds [s]<br>
20) Stage: histologic stage of disease (1, 2, 3, or 4)<br>
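Age is recorded in days, which is hard to read at a glance. A minimal sketch of converting it to years, using the first five `Age` values from `cirrhosis.csv` and assuming an average year length of 365.25 days (the variable names here are illustrative, not part of the notebook):

```python
# First five Age values (in days) from cirrhosis.csv
ages_days = [21464, 20617, 25594, 19994, 13918]

# Assumption: 365.25 days per year, to average over leap years
DAYS_PER_YEAR = 365.25

ages_years = [days / DAYS_PER_YEAR for days in ages_days]

for days, years in zip(ages_days, ages_years):
    print(f"{days} days is about {years:.1f} years")
```

On the full DataFrame the same conversion would be `df["Age"] / 365.25`.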

### Reading the data in

In [118]:
#read the data in 
df = pd.read_csv("cirrhosis.csv")
#take a look at the data
df.head()

Unnamed: 0,ID,N_Days,Status,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
0,1,400,D,D-penicillamine,21464,F,Y,Y,Y,Y,14.5,261.0,2.6,156.0,1718.0,137.95,172.0,190.0,12.2,4.0
1,2,4500,C,D-penicillamine,20617,F,N,Y,Y,N,1.1,302.0,4.14,54.0,7394.8,113.52,88.0,221.0,10.6,3.0
2,3,1012,D,D-penicillamine,25594,M,N,N,N,S,1.4,176.0,3.48,210.0,516.0,96.1,55.0,151.0,12.0,4.0
3,4,1925,D,D-penicillamine,19994,F,N,Y,Y,S,1.8,244.0,2.54,64.0,6121.8,60.63,92.0,183.0,10.3,4.0
4,5,1504,CL,Placebo,13918,F,N,Y,Y,N,3.4,279.0,3.53,143.0,671.0,113.15,72.0,136.0,10.9,3.0


### Data Exploration

Let's take a look at the dataset.

In [119]:
df.value_counts("Stage")

Stage
3.0    155
4.0    144
2.0     92
1.0     21
dtype: int64

Stage counts: 21 Inflammation (stage 1), 92 Fibrosis (stage 2), 155 Cirrhosis (stage 3), 144 Liver Failure (stage 4).
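The stage counts are unbalanced, so it helps to know how well a trivial model would do before judging any classifier. The sketch below computes the class proportions and the accuracy of always predicting the most common stage, using only the counts printed by `value_counts` (rows with a missing stage are excluded):

```python
# Stage counts copied from the value_counts output
stage_counts = {1: 21, 2: 92, 3: 155, 4: 144}

total = sum(stage_counts.values())
proportions = {stage: count / total for stage, count in stage_counts.items()}

# Accuracy of a trivial model that always predicts the most common stage
majority_stage = max(stage_counts, key=stage_counts.get)
baseline_accuracy = stage_counts[majority_stage] / total

for stage, share in proportions.items():
    print(f"Stage {stage}: {share:.1%}")
print(f"Majority-class baseline (always stage {majority_stage}): {baseline_accuracy:.3f}")
```

Any accuracy reported later should be read against this baseline rather than against zero.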

In [120]:
#Statistical summary of dataset
df.describe()

Unnamed: 0,ID,N_Days,Age,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
count,418.0,418.0,418.0,418.0,284.0,418.0,310.0,312.0,312.0,282.0,407.0,416.0,412.0
mean,209.5,1917.782297,18533.351675,3.220813,369.510563,3.49744,97.648387,1982.655769,122.556346,124.702128,257.02457,10.731731,3.024272
std,120.810458,1104.672992,3815.845055,4.407506,231.944545,0.424972,85.61392,2140.388824,56.699525,65.148639,98.325585,1.022,0.882042
min,1.0,41.0,9598.0,0.3,120.0,1.96,4.0,289.0,26.35,33.0,62.0,9.0,1.0
25%,105.25,1092.75,15644.5,0.8,249.5,3.2425,41.25,871.5,80.6,84.25,188.5,10.0,2.0
50%,209.5,1730.0,18628.0,1.4,309.5,3.53,73.0,1259.0,114.7,108.0,251.0,10.6,3.0
75%,313.75,2613.5,21272.5,3.4,400.0,3.77,123.0,1980.0,151.9,151.0,318.0,11.1,4.0
max,418.0,4795.0,28650.0,28.0,1775.0,4.64,588.0,13862.4,457.25,598.0,721.0,18.0,4.0


In [121]:
df.groupby("Sex").count()

Sex,ID,N_Days,Status,Drug,Age,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin,Stage
F,374,374,374,276,374,276,276,276,374,374,249,374,274,276,276,247,364,372,368
M,44,44,44,36,44,36,36,36,44,44,35,44,36,36,36,35,43,44,44


The dataset contains far more women (374 records) than men (44 records). These counts describe who is in the sample; on their own they do not show that women are more likely than men to develop liver failure, because that would require comparing rates against the underlying population.

In [122]:
#checking for null observations
df.isnull().sum()

ID                 0
N_Days             0
Status             0
Drug             106
Age                0
Sex                0
Ascites          106
Hepatomegaly     106
Spiders          106
Edema              0
Bilirubin          0
Cholesterol      134
Albumin            0
Copper           108
Alk_Phos         106
SGOT             106
Tryglicerides    136
Platelets         11
Prothrombin        2
Stage              6
dtype: int64

In [123]:
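#replace nulls in numeric columns (including Stage) with the column mean, rounded to 1 decimal
#categorical columns such as Drug and Ascites have no mean, so their nulls are left in place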
df.fillna(df.mean(numeric_only=True).round(1), inplace=True)
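To make the behaviour of this imputation step concrete, here is a small sketch on a toy DataFrame. The values are hypothetical and only the column names are borrowed from the dataset. It shows that `mean(numeric_only=True)` produces fill values for numeric columns only, so missing categorical entries survive the call:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one numeric and one categorical column
toy = pd.DataFrame({
    "Cholesterol": [200.0, np.nan, 300.0],
    "Drug": ["Placebo", None, "D-penicillamine"],
})

# Same pattern as the notebook: a Series of per-column means, matched by column name
toy_filled = toy.fillna(toy.mean(numeric_only=True).round(1))

print(toy_filled)
print(toy_filled.isnull().sum())
```

If the categorical gaps should be filled as well, one option is to fill them with the most frequent value or with an explicit "Unknown" category; which is appropriate depends on why the values are missing.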

## Feature Set

Let's define feature sets, X:

In [124]:
X = df.drop(["ID","N_Days","Status","Stage"],axis=1)
X.columns

Index(['Drug', 'Age', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema',
       'Bilirubin', 'Cholesterol', 'Albumin', 'Copper', 'Alk_Phos', 'SGOT',
       'Tryglicerides', 'Platelets', 'Prothrombin'],
      dtype='object')

In [127]:
#encoding the categorical features as integers with LabelEncoder from scikit-learn
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
X["Drug"] = encoder.fit_transform(X["Drug"]) 
X["Sex"] = encoder.fit_transform(X["Sex"]) 
X["Ascites"] = encoder.fit_transform(X["Ascites"]) 
X["Spiders"] = encoder.fit_transform(X["Spiders"]) 
X["Edema"] = encoder.fit_transform(X["Edema"]) 
X["Hepatomegaly"] = encoder.fit_transform(X["Hepatomegaly"])

X.head(20)


Unnamed: 0,Drug,Age,Sex,Ascites,Hepatomegaly,Spiders,Edema,Bilirubin,Cholesterol,Albumin,Copper,Alk_Phos,SGOT,Tryglicerides,Platelets,Prothrombin
0,0,21464,0,1,1,1,2,14.5,261.0,2.6,156.0,1718.0,137.95,172.0,190.0,12.2
1,0,20617,0,0,1,1,0,1.1,302.0,4.14,54.0,7394.8,113.52,88.0,221.0,10.6
2,0,25594,1,0,0,0,1,1.4,176.0,3.48,210.0,516.0,96.1,55.0,151.0,12.0
3,0,19994,0,0,1,1,1,1.8,244.0,2.54,64.0,6121.8,60.63,92.0,183.0,10.3
4,1,13918,0,0,1,1,0,3.4,279.0,3.53,143.0,671.0,113.15,72.0,136.0,10.9
5,1,24201,0,0,1,0,0,0.8,248.0,3.98,50.0,944.0,93.0,63.0,257.0,11.0
6,1,20284,0,0,1,0,0,1.0,322.0,4.09,52.0,824.0,60.45,213.0,204.0,9.7
7,1,19379,0,0,0,0,0,0.3,280.0,4.0,52.0,4651.2,28.38,189.0,373.0,11.0
8,0,15526,0,0,0,1,0,3.2,562.0,3.08,79.0,2276.0,144.15,88.0,251.0,11.0
9,1,25772,0,1,0,1,2,12.6,200.0,2.74,140.0,918.0,147.25,143.0,302.0,11.5
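`LabelEncoder` assigns integer codes in sorted order of the labels. The sketch below reproduces this for the first five `Edema` values in the dataset (`Y, N, S, S, N`) and prints the label-to-code mapping; the `mapping` dictionary is an illustrative helper, not part of the notebook:

```python
from sklearn.preprocessing import LabelEncoder

# First five Edema values from the raw data
edema_values = ["Y", "N", "S", "S", "N"]

encoder = LabelEncoder()
codes = encoder.fit_transform(edema_values)

# classes_ holds the sorted unique labels; each code is an index into it
mapping = {label: index for index, label in enumerate(encoder.classes_)}

print(mapping)
print(codes)
```

Note that integer codes imply an order and a distance between categories, which a distance-based model such as KNN will take literally. One-hot encoding is a common alternative when the categories have no natural order.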


What are our labels?

In [145]:
Y = df["Stage"].values
Y[:5]

array([4., 3., 4., 4., 3.])

## Normalize Data

Standardization rescales each feature to zero mean and unit variance. It is good practice, especially for distance-based algorithms such as KNN, where features with large ranges would otherwise dominate the distance calculation:



In [146]:
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
X[0:5]

array([[-1.1155219 ,  0.76894112, -0.34299717,  0.50176025,  0.1414695 ,
         0.32988642,  3.55381809,  2.56215232, -0.56855073, -2.11429564,
         0.79288537, -0.14335573,  0.31452689,  0.88547764, -0.69165327,
         1.44199133],
       [-1.1155219 ,  0.54670595, -0.34299717, -0.65063417,  0.1414695 ,
         0.32988642, -0.39696904, -0.48175876, -0.35372089,  1.51381832,
        -0.59280013,  2.93145839, -0.18499205, -0.68708851, -0.37174882,
        -0.12921069],
       [-1.1155219 ,  1.85256717,  2.91547595, -0.65063417, -1.14405771,
        -0.85884224,  1.57842452, -0.4136115 , -1.01392968, -0.04108766,
         1.52648358, -0.79441381, -0.54117789, -1.30488235, -1.09411371,
         1.24559107],
       [-1.1155219 ,  0.38324372, -0.34299717, -0.65063417,  0.1414695 ,
         0.32988642,  1.57842452, -0.32274848, -0.65762652, -2.25565073,
        -0.45694861,  2.24194345, -1.26643113, -0.61220441, -0.76388976,
        -0.42381107],
       [ 0.15848945, -1.21097222, -0

## Train/Test Split

In [147]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
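In this notebook the scaler is fitted on all rows before the split, so the test rows influence the means and standard deviations used for scaling. A common alternative is to fit the scaler on the training rows only, which a scikit-learn `Pipeline` does automatically. The sketch below is a minimal illustration on synthetic data from `make_classification` (an assumption, standing in for the encoded cirrhosis features); the names `X_demo`, `pipe`, and so on are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (not the cirrhosis data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() learns the scaling statistics from the training rows only;
# predict() and score() reuse those statistics on new rows
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=4))
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
train_means = scaler.transform(X_tr).mean(axis=0)
print("Largest training-column mean after scaling:", abs(train_means).max())
print("Test accuracy:", pipe.score(X_te, y_te))
```

A pipeline like this can also be passed directly to `GridSearchCV`, so that scaling is refitted inside each cross-validation fold.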

## Classification

### K nearest neighbor (KNN)

#### Import library


Classifier implementing the k-nearest neighbors vote.


In [148]:
from sklearn.neighbors import KNeighborsClassifier

### Training

In [149]:
model = KNeighborsClassifier(n_neighbors=4)
model.fit(X_train,Y_train)

KNeighborsClassifier(n_neighbors=4)

### Predictions

In [150]:
pred = model.predict(X_test)
pred[:5]

array([3., 4., 2., 3., 4.])

### Evaluation

In [151]:
from sklearn.metrics import accuracy_score
print("Train set Accuracy: ", accuracy_score(Y_train, model.predict(X_train)))
print("Test set Accuracy: ", accuracy_score(Y_test, pred))

Train set Accuracy:  0.6616766467065869
Test set Accuracy:  0.4166666666666667
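A single accuracy number hides which stages are being confused with each other. A confusion matrix breaks the result down by class. The sketch below uses short hypothetical label arrays (not the notebook's actual predictions) to show how it is computed and how accuracy relates to its diagonal:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted stages, for illustration only
y_true = [3, 4, 2, 3, 4, 1, 3, 2]
y_pred = [3, 4, 3, 3, 3, 2, 4, 2]

# Rows are true stages 1-4, columns are predicted stages 1-4
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
print(cm)

# Accuracy is the share of samples on the diagonal
accuracy = cm.trace() / cm.sum()
print("Accuracy:", accuracy)
```

In the notebook the equivalent call would be `confusion_matrix(Y_test, pred)`.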


### Hyperparameter tuning with GridSearchCV

In [152]:
from sklearn.model_selection import GridSearchCV 

In [153]:
# Define the k-NN model and the hyperparameters to tune
model = KNeighborsClassifier()
k_range = list(range(1, 10))
param_grid = {'n_neighbors': k_range, 'weights': ['uniform', 'distance']}

In [154]:
#Instantiating the GridSearchCV
grid = GridSearchCV(model,param_grid,cv=10,scoring='accuracy')
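Conceptually, `GridSearchCV` tries every combination in `param_grid`, scores each one with cross-validation, and keeps the combination with the best mean score. The sketch below spells that loop out by hand with `cross_val_score`. It runs on synthetic data from `make_classification` (an assumption, not the cirrhosis data), and the variable names are illustrative:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the cirrhosis data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=16, n_informative=6, n_classes=4, random_state=0
)

param_grid = {"n_neighbors": list(range(1, 10)), "weights": ["uniform", "distance"]}

results = []
for k, w in product(param_grid["n_neighbors"], param_grid["weights"]):
    candidate = KNeighborsClassifier(n_neighbors=k, weights=w)
    # 10-fold cross-validation: one accuracy value per fold
    scores = cross_val_score(candidate, X_demo, y_demo, cv=10, scoring="accuracy")
    results.append(({"n_neighbors": k, "weights": w}, scores.mean()))

best_params, best_score = max(results, key=lambda item: item[1])
print(len(results), "candidates x 10 folds =", len(results) * 10, "fits")
print("Best:", best_params, round(best_score, 3))
```

With 9 values of `n_neighbors` and 2 weighting schemes, the grid has 18 candidates, so the search cost grows quickly as more hyperparameters are added.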

In [155]:
#fitting the model with GridsearchCV
gridsearch = grid.fit(X_train,Y_train)

In [156]:
#Best Params and Best Score
print('Best Hyperparameters: ', gridsearch.best_params_)
print('Validation accuracy: ', gridsearch.best_score_)

Best Hyperparameters:  {'n_neighbors': 7, 'weights': 'distance'}
Validation accuracy:  0.5118538324420677
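`best_score_` is a mean cross-validated accuracy on the training data, so it is not directly comparable with a test-set accuracy. To compare like with like, the tuned model can be scored on the held-out test set. The sketch below shows the pattern on synthetic data from `make_classification` (an assumption, standing in for the notebook's `X_train`/`X_test`):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the cirrhosis data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=16, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

param_grid = {"n_neighbors": list(range(1, 10)), "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all training rows
test_pred = search.best_estimator_.predict(X_te)
test_accuracy = accuracy_score(y_te, test_pred)

print("Best hyperparameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

In the notebook the same check would be `accuracy_score(Y_test, gridsearch.best_estimator_.predict(X_test))`.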


### Conclusion

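Hyperparameter tuning selected `n_neighbors=7` with distance weighting and reached a mean cross-validated accuracy of about 0.51. This is higher than the 0.42 test accuracy of the untuned model, but the two figures are measured on different data, so the tuned model should still be scored on the test set before claiming an improvement. Overall accuracy remains low, likely because the dataset is small and has a high number of missing values.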