# Binary Classifier - Non-Alcoholic Fatty Liver Disease

## A. Introduction

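Each example is a fatty liver disease case with a status of either 1 (not alive) or 0 (alive). We will train a binary classifier to predict this status from age, sex, weight, height, and BMI.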

## B. Importing Libraries & Dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Desktop/archive/nafld1.csv")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,id,age,male,weight,height,bmi,case.id,futime,status
0,3631,1,57,0,60.0,163.0,22.690939,10630.0,6261,0
1,8458,2,67,0,70.4,168.0,24.884028,14817.0,624,0
2,6298,3,53,1,105.8,186.0,30.453537,3.0,1783,0
3,15398,4,56,1,109.3,170.0,37.8301,6628.0,3143,0
4,13261,5,68,1,,,,1871.0,1836,1


In [4]:
df.tail()

Unnamed: 0.1,Unnamed: 0,id,age,male,weight,height,bmi,case.id,futime,status
17544,11130,17562,46,0,53.0,161.0,20.501023,12713.0,1894,0
17545,1099,17563,52,1,111.8,154.0,47.335905,17563.0,3841,0
17546,1522,17564,59,0,57.3,,,16164.0,5081,0
17547,5764,17565,61,0,,,,17276.0,3627,1
17548,6658,17566,69,1,94.1,180.0,29.20465,2017.0,2744,0


In [5]:
df.drop('id', axis=1, inplace=True)

In [6]:
df.drop('case.id', axis=1, inplace=True)

In [7]:
# Drop the leftover index column; axis is redundant when columns= is given
df.drop(columns=df.columns[0], inplace=True)
df.tail()

Unnamed: 0,age,male,weight,height,bmi,futime,status
17544,46,0,53.0,161.0,20.501023,1894,0
17545,52,1,111.8,154.0,47.335905,3841,0
17546,59,0,57.3,,,5081,0
17547,61,0,,,,3627,1
17548,69,1,94.1,180.0,29.20465,2744,0


In [8]:
df.isnull().sum() / len(df) * 100

age        0.000000
male       0.000000
weight    27.272209
height    18.052311
bmi       28.269417
futime     0.000000
status     0.000000
dtype: float64

In [9]:
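for col in df.columns:
    # Numeric columns get the median, text columns get the most frequent value.
    # Assigning the result back avoids the unreliable chained inplace fillna.
    if df[col].dtypes != 'object':
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])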
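
The cell above computes each median on the full dataset, including rows that later end up in the test set. A minimal sketch of a variant that avoids this takes the medians from the training rows only. The toy frame, its values, and the positional split are illustrative assumptions and are not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NAFLD frame; the values are made up for illustration.
toy = pd.DataFrame({
    "weight": [60.0, 70.4, np.nan, 109.3, np.nan, 94.1],
    "height": [163.0, np.nan, 186.0, 170.0, 154.0, 180.0],
})

# Hypothetical split: the first four rows are training data, the last two are test data.
toy_train, toy_test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Medians are computed from the training rows only ...
train_medians = toy_train.median()

# ... and then used to fill the gaps in both parts.
toy_train = toy_train.fillna(train_medians)
toy_test = toy_test.fillna(train_medians)

print(train_medians)
print(toy_test)
```

In the notebook itself this would mean moving the imputation after the train/test split.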

## C. Feature Selection

We will separate the columns into *predictors* (age, male, weight, height, bmi) and *target* (status). The follow-up time futime is not used as a predictor.

In [10]:
feature_columns = ['age','male','weight','height','bmi']

In [11]:
X = df[feature_columns]
y = df['status']

## D. Splitting Dataset

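We will divide the dataset into a training set and a test set using the function train_test_split(), passing the features X, the target y, and two parameters:
- test_size: the fraction of rows held out for testing (0.33 here)
- random_state: a seed that makes the split reproducible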

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state = 42)
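
Because the not alive class is rare, the share of each class can differ slightly between the two parts of a purely random split. A minimal sketch of a stratified split follows. It uses synthetic data, and the stratify argument is not used in the original notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows with 8% positives (made-up numbers).
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.array([1] * 80 + [0] * 920)

# stratify=y_toy keeps the share of positives roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print('rows:', len(y_tr), 'train,', len(y_te), 'test')
print('share of positives:', y_tr.mean(), 'train,', y_te.mean(), 'test')
```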

To prevent any data leakage, we fit a scaler on X_train only and then use it to transform both X_train and X_test to z-scores:

In [13]:
from sklearn.preprocessing import StandardScaler
# Named scaler so it is not confused with the classifier, which is called model below
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

## E. Model Development 

We will tune the hyperparameters of our SVC model using a grid search. 

We keep the grid small to reduce the time needed to train the model. With 2 values of C and 2 kernels there are 2 x 2 = 4 parameter combinations; with 3-fold cross-validation this amounts to 12 fits.

In [14]:
import numpy as np
grid = {'C': [1, 10], 
        'kernel': ['linear', 'rbf'] }
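
The number of candidates in a grid can be checked before starting a long search. A small sketch using scikit-learn's ParameterGrid follows; the fold count of 3 is an assumption chosen to match the cv value used for the search:

```python
from sklearn.model_selection import ParameterGrid

grid = {'C': [1, 10],
        'kernel': ['linear', 'rbf']}

# ParameterGrid expands the dictionary into every combination of values.
candidates = list(ParameterGrid(grid))
for params in candidates:
    print(params)

n_folds = 3  # assumed here; matches the cv value used for the search
n_fits = len(candidates) * n_folds
print(len(candidates), 'candidates,', n_fits, 'fits')
```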

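We will create the GridSearchCV model around an SVC with the gamma hyperparameter set to auto to avoid warnings. The probability option of SVC is left at its default (False), since only class predictions are used below.

To keep the search time down:
- we will set cv = 3 to use 3-fold cross-validation, so each parameter combination is fitted 3 times,
- we will set n_jobs = -1 to run the fits in parallel on all available cores.

With verbose = 1 the search prints a summary of the number of folds, candidates, and total fits. The computation time for each fold and parameter candidate is only shown for values above 1.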

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
model = GridSearchCV(estimator=SVC(gamma='auto'), param_grid=grid, cv=3, n_jobs=-1, verbose=1)
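
When features are scaled, one way to keep the scaling inside the cross-validation is to search over a pipeline, so the scaler is refitted on the training part of every fold. The sketch below is not the original notebook's setup. It assumes synthetic data from make_classification, and the names pipe, pipe_grid, and search are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the NAFLD dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# make_pipeline names each step after its lowercased class name,
# so the SVC parameters are addressed with the "svc__" prefix.
pipe = make_pipeline(StandardScaler(), SVC(gamma='auto'))
pipe_grid = {'svc__C': [1, 10],
             'svc__kernel': ['linear', 'rbf']}

search = GridSearchCV(estimator=pipe, param_grid=pipe_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```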

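We will fit the grid search model and measure the elapsed time. Note that the call below passes the unscaled X_train rather than X_train_scaled, so all results reported from here on were obtained on unscaled features.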

In [16]:
import time
start_time = time.time()
model.fit(X_train, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
print('Execution time:', elapsed_time, 'seconds')

Fitting 3 folds for each of 4 candidates, totalling 12 fits
Execution time: 297.43122720718384 seconds


We will print the best parameters:

In [17]:
best_parameters = model.best_params_
print(best_parameters)

{'C': 10, 'kernel': 'linear'}


We will generate predictions for the training data:

In [19]:
model_pred_train = model.predict(X_train)

We will generate predictions for the test data, then compute the accuracy on both sets:

In [20]:
model_pred_test = model.predict(X_test)

In [21]:
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_true=y_train, y_pred=model_pred_train)
test_accuracy = accuracy_score(y_true=y_test, y_pred=model_pred_test)
print('SVC:\n> Accuracy on training data = {:.4f}\n> Accuracy on validation data = {:.4f}'.format(train_accuracy, test_accuracy))

SVC:
> Accuracy on training data = 0.9238
> Accuracy on validation data = 0.9216


- The classifier predicts 92.38% of the training data correctly.
- The accuracy on the test data (92.16%, labeled "validation data" in the output above) is almost the same as the accuracy on the training data. The variance of the model is not high, so we assume that the classifier is not overfitting the training data.

## F. Model Evaluation

A confusion matrix is a table that is used to evaluate the performance of a classification model.

In [22]:
from sklearn.metrics import confusion_matrix
cm = pd.DataFrame(confusion_matrix(y_test, model_pred_test))
cm['Total'] = cm.sum(axis=1)
# DataFrame.append was removed in pandas 2.0, so the totals row is added with pd.concat
cm = pd.concat([cm, cm.sum(axis=0).to_frame().T], ignore_index=True)
cm.columns = ['Predicted Alive', 'Predicted Not Alive', 'Total']
cm.index = ['Alive', 'Not Alive', 'Total']
print(cm)

           Predicted Alive  Predicted Not Alive  Total
Alive                 5330                    3   5333
Not Alive              451                    8    459
Total                 5781                   11   5792


5330 and 8 are correct predictions, and 451 and 3 are incorrect predictions. Of the 459 test cases that are actually not alive, the model identifies only 8.
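
The four cells of the confusion matrix are enough to work out precision, recall, and F1 by hand. The sketch below copies the counts from the table above and treats not alive as the positive class; the variable names are illustrative:

```python
# Counts copied from the confusion matrix, with "not alive" as the positive class.
tn, fp = 5330, 3   # actual alive: predicted alive / predicted not alive
fn, tp = 451, 8    # actual not alive: predicted alive / predicted not alive

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # share of "not alive" predictions that are correct
recall = tp / (tp + fn)      # share of actual "not alive" cases that are found
f1 = 2 * precision * recall / (precision + recall)

print('accuracy  = {:.4f}'.format(accuracy))
print('precision = {:.2f}'.format(precision))
print('recall    = {:.2f}'.format(recall))
print('f1-score  = {:.2f}'.format(f1))
```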

We will evaluate the model using classification_report for accuracy, precision, and recall.

In [23]:
from sklearn.metrics import classification_report
target_names = ['alive', 'not alive']
print(classification_report(y_test, model_pred_test, target_names=target_names))

              precision    recall  f1-score   support

       alive       0.92      1.00      0.96      5333
   not alive       0.73      0.02      0.03       459

    accuracy                           0.92      5792
   macro avg       0.82      0.51      0.50      5792
weighted avg       0.91      0.92      0.89      5792



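We got a classification rate of 92%. However, accuracy alone is misleading here because the classes are imbalanced: 5333 of the 5792 test cases (92.1%) are alive, so a model that always predicts alive would reach nearly the same accuracy. The recall of 0.02 for the not alive class shows that the classifier misses almost all of those cases.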
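
To put the accuracy figure in context, it can be compared with the accuracy of always predicting the majority class, and with the balanced accuracy (the mean of the per-class recalls). The sketch below uses only the test-set counts from the confusion matrix in section F; the variable names are illustrative:

```python
# Test-set counts taken from the confusion matrix in section F.
n_alive, n_not_alive = 5333, 459
correct_alive, correct_not_alive = 5330, 8

n_total = n_alive + n_not_alive

# Accuracy of a trivial model that always predicts the majority class (alive).
baseline_accuracy = n_alive / n_total

# Accuracy of the fitted classifier.
model_accuracy = (correct_alive + correct_not_alive) / n_total

# Balanced accuracy: the mean of the recall for each class.
balanced_accuracy = (correct_alive / n_alive + correct_not_alive / n_not_alive) / 2

print('baseline accuracy = {:.4f}'.format(baseline_accuracy))
print('model accuracy    = {:.4f}'.format(model_accuracy))
print('balanced accuracy = {:.4f}'.format(balanced_accuracy))
```

The balanced accuracy agrees with the macro-average recall of 0.51 in the classification report.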