# Blood Donor vs Hepatitis Prediction

The dataset contains 615 observations, each labelled as a blood donor, a suspected blood donor, or a non-donor. The non-donor observations are patients with Hepatitis, Fibrosis or Cirrhosis.

The data set is publicly available from the URL below:

*https://archive.ics.uci.edu/ml/datasets/HCV+data* 

This notebook is divided into five sections, each covering a different phase of the CRISP-DM model:

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation & Deployment


(**Dependencies**)

In [2]:
import pandas as pd

## Business Understanding

ToDo: describe why the detection is important for the business. What is the business case?

## Data Understanding

This section begins by importing the dataset into a pandas DataFrame, so that we can query it and get a better understanding of the data.

In [3]:
df = pd.read_csv("KNN-hcvdat0.csv")
df

Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,0=Blood Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,0=Blood Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.80,74.0,15.6,76.5
2,3,0=Blood Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.20,86.0,33.2,79.3
3,4,0=Blood Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,0=Blood Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
610,611,3=Cirrhosis,62,f,32.0,416.6,5.9,110.3,50.0,5.57,6.30,55.7,650.9,68.5
611,612,3=Cirrhosis,64,f,24.0,102.8,2.9,44.4,20.0,1.54,3.02,63.0,35.9,71.3
612,613,3=Cirrhosis,64,f,29.0,87.3,3.5,99.0,48.0,1.66,3.63,66.7,64.2,82.0
613,614,3=Cirrhosis,46,f,33.0,,39.0,62.0,20.0,3.56,4.20,52.0,50.0,71.0


In [4]:
df.dtypes

Unnamed: 0      int64
Category       object
Age             int64
Sex            object
ALB           float64
ALP           float64
ALT           float64
AST           float64
BIL           float64
CHE           float64
CHOL          float64
CREA          float64
GGT           float64
PROT          float64
dtype: object

In [7]:

for i in df.columns:
    print(i)

Unnamed: 0
Category
Age
Sex
ALB
ALP
ALT
AST
BIL
CHE
CHOL
CREA
GGT
PROT


The dataset has 615 observations and 14 variables (615 rows and 14 columns):

1. Unnamed: 0 (functions as the index/ID column, so it is not needed in the training data)
2. Category (the target column to be predicted, i.e. the label)
3. Age (numerical column)
4. Sex (categorical column, f and m)
5. ALB (numerical column)
6. ALP (numerical column)
7. ALT (numerical column)
8. AST (numerical column)
9. BIL (numerical column)
10. CHE (numerical column)
11. CHOL (numerical column)
12. CREA (numerical column)
13. GGT (numerical column)
14. PROT (numerical column)

In [8]:
df.isnull().sum()

Unnamed: 0     0
Category       0
Age            0
Sex            0
ALB            1
ALP           18
ALT            1
AST            0
BIL            0
CHE            0
CHOL          10
CREA           0
GGT            0
PROT           1
dtype: int64

The output above shows 31 missing values in total, spread over five columns (ALB, ALP, ALT, CHOL and PROT). To fix this, the affected rows can be dropped, or the missing values can be replaced with the mean or median of the column.
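
The options for handling missing values can be sketched on a small made-up frame. The values below are illustrative and not taken from the dataset; only the column names mirror it. The sketch also separates two counts that are easy to mix up: the number of missing cells and the number of rows that contain at least one missing cell.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "ALB": [38.5, np.nan, 46.9, 43.2],
    "ALP": [52.5, np.nan, np.nan, 52.0],
})

# Missing cells versus rows that contain at least one missing cell.
missing_cells = int(toy.isnull().sum().sum())
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the median of its column.
imputed = toy.fillna(toy.median())

print("missing cells:", missing_cells)
print("rows with missing values:", rows_with_missing)
print("rows left after dropna:", len(dropped))
print(imputed)
```

Dropping is the simplest choice but discards whole observations; imputing keeps every row at the cost of inserting values that were never measured.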



In [5]:
df = df.dropna()  # drop rows with missing values (fillna() without a value raises an error)
df.isnull().sum()

Unnamed: 0    0
Category      0
Age           0
Sex           0
ALB           0
ALP           0
ALT           0
AST           0
BIL           0
CHE           0
CHOL          0
CREA          0
GGT           0
PROT          0
dtype: int64

In [6]:
df['Category'].value_counts()

0=Blood Donor             526
3=Cirrhosis                24
1=Hepatitis                20
2=Fibrosis                 12
0s=suspect Blood Donor      7
Name: Category, dtype: int64

In [7]:
df.loc[(df.Category == '0=Blood Donor'), 'Category'] ='Donor'

In [8]:
df['Category'].value_counts()

Donor                     526
3=Cirrhosis                24
1=Hepatitis                20
2=Fibrosis                 12
0s=suspect Blood Donor      7
Name: Category, dtype: int64

In [9]:
df.loc[(df['Category'] != 'Donor'), 'Category'] = 'NotDonor'

In [10]:
df.Category.value_counts()

Donor       526
NotDonor     63
Name: Category, dtype: int64
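
The two relabelling steps above can also be written as a single expression. The sketch below is one possible alternative, shown on made-up labels that reuse the category strings from the value counts; the variable names are illustrative.

```python
import pandas as pd

# Made-up labels that reuse the category strings from the dataset.
category = pd.Series([
    "0=Blood Donor",
    "0s=suspect Blood Donor",
    "1=Hepatitis",
    "2=Fibrosis",
    "3=Cirrhosis",
    "0=Blood Donor",
])

# Everything that is not exactly "0=Blood Donor" becomes "NotDonor".
# Note that suspected blood donors therefore end up in the NotDonor class.
binary = category.eq("0=Blood Donor").map({True: "Donor", False: "NotDonor"})

print(binary.value_counts())
```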

In [11]:
df.dtypes

Unnamed: 0      int64
Category       object
Age             int64
Sex            object
ALB           float64
ALP           float64
ALT           float64
AST           float64
BIL           float64
CHE           float64
CHOL          float64
CREA          float64
GGT           float64
PROT          float64
dtype: object

In [12]:
# Category is now Donor vs NotDonor
# now let's look at Sex

In [13]:
df.Sex.value_counts()

m    363
f    226
Name: Sex, dtype: int64

In [14]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Age,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0
mean,298.648557,47.417657,41.624278,68.12309,26.575382,33.772835,11.018166,8.203633,5.391341,81.6691,38.198472,71.890153
std,174.142507,9.931334,5.761794,25.921072,20.86312,32.866871,17.406572,2.191073,1.128954,50.696991,54.302407,5.348883
min,1.0,23.0,14.9,11.3,0.9,10.6,0.8,1.42,1.43,8.0,4.5,44.8
25%,149.0,39.0,38.8,52.5,16.4,21.5,5.2,6.93,4.62,68.0,15.6,69.3
50%,296.0,47.0,41.9,66.2,22.7,25.7,7.1,8.26,5.31,77.0,22.8,72.1
75%,448.0,54.0,45.1,79.9,31.9,31.7,11.0,9.57,6.08,89.0,37.6,75.2
max,613.0,77.0,82.2,416.6,325.3,324.0,209.0,16.41,9.67,1079.1,650.9,86.5


In [15]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,1,Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,2,Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5
2,3,Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3
3,4,Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,5,Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
5,6,Donor,32,m,41.6,43.3,18.5,19.7,12.3,9.92,6.05,111.0,91.0,74.0
6,7,Donor,32,m,46.3,41.3,17.5,17.8,8.5,7.01,4.79,70.0,16.9,74.5
7,8,Donor,32,m,42.2,41.9,35.8,31.1,16.1,5.82,4.6,109.0,21.5,67.1
8,9,Donor,32,m,50.9,65.5,23.2,21.2,6.9,8.69,4.1,83.0,13.7,71.3
9,10,Donor,32,m,42.4,86.3,20.3,20.0,35.2,5.46,4.45,81.0,15.9,69.9


In [16]:
df = df.drop('Unnamed: 0', axis=1)

In [17]:
df

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,Donor,32,m,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,Donor,32,m,38.5,70.3,18.0,24.7,3.9,11.17,4.80,74.0,15.6,76.5
2,Donor,32,m,46.9,74.7,36.2,52.6,6.1,8.84,5.20,86.0,33.2,79.3
3,Donor,32,m,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,Donor,32,m,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
608,NotDonor,58,f,34.0,46.4,15.0,150.0,8.0,6.26,3.98,56.0,49.7,80.6
609,NotDonor,59,f,39.0,51.3,19.6,285.8,40.0,5.77,4.51,136.1,101.1,70.5
610,NotDonor,62,f,32.0,416.6,5.9,110.3,50.0,5.57,6.30,55.7,650.9,68.5
611,NotDonor,64,f,24.0,102.8,2.9,44.4,20.0,1.54,3.02,63.0,35.9,71.3


In [18]:
df.loc[df['Sex'] == 'm', 'Sex'] = 0

In [19]:
df.loc[df['Sex'] == 'f', 'Sex'] = 1

In [20]:
df.Sex.value_counts()

0    363
1    226
Name: Sex, dtype: int64
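
Assigning integers into a string column with `.loc` may leave the column with `object` dtype. A minimal alternative sketch, on made-up values, is to map both codes in one step and cast the result to an integer dtype:

```python
import pandas as pd

# Made-up values using the same codes as the Sex column.
sex = pd.Series(["m", "f", "m", "f", "f"])

# Map both categories at once and store the result as integers.
sex_encoded = sex.map({"m": 0, "f": 1}).astype("int64")

print(sex_encoded.tolist())
print(sex_encoded.dtype)
```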

In [21]:
df

Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
0,Donor,32,0,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0
1,Donor,32,0,38.5,70.3,18.0,24.7,3.9,11.17,4.80,74.0,15.6,76.5
2,Donor,32,0,46.9,74.7,36.2,52.6,6.1,8.84,5.20,86.0,33.2,79.3
3,Donor,32,0,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7
4,Donor,32,0,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
608,NotDonor,58,1,34.0,46.4,15.0,150.0,8.0,6.26,3.98,56.0,49.7,80.6
609,NotDonor,59,1,39.0,51.3,19.6,285.8,40.0,5.77,4.51,136.1,101.1,70.5
610,NotDonor,62,1,32.0,416.6,5.9,110.3,50.0,5.57,6.30,55.7,650.9,68.5
611,NotDonor,64,1,24.0,102.8,2.9,44.4,20.0,1.54,3.02,63.0,35.9,71.3


In [22]:
normal = df[['ALB', 'ALP', 'ALT', 'AST', 'BIL', 'CHE', 'CHOL', 'CREA', 'GGT', 'PROT']]

In [23]:
normal.describe()


Unnamed: 0,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0
mean,41.624278,68.12309,26.575382,33.772835,11.018166,8.203633,5.391341,81.6691,38.198472,71.890153
std,5.761794,25.921072,20.86312,32.866871,17.406572,2.191073,1.128954,50.696991,54.302407,5.348883
min,14.9,11.3,0.9,10.6,0.8,1.42,1.43,8.0,4.5,44.8
25%,38.8,52.5,16.4,21.5,5.2,6.93,4.62,68.0,15.6,69.3
50%,41.9,66.2,22.7,25.7,7.1,8.26,5.31,77.0,22.8,72.1
75%,45.1,79.9,31.9,31.7,11.0,9.57,6.08,89.0,37.6,75.2
max,82.2,416.6,325.3,324.0,209.0,16.41,9.67,1079.1,650.9,86.5


In [24]:

import numpy as np

np.random.seed(0)

def minmax_norm(df_input):
    # Scale each column of df_input to [0, 1] using that column's own min and max
    return (df_input - df_input.min()) / (df_input.max() - df_input.min())

normailized = minmax_norm(normal)

normailized.describe()

Unnamed: 0,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT
count,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0,589.0
mean,0.397092,0.1402,0.079147,0.07394,0.049079,0.452544,0.480745,0.068779,0.052133,0.649644
std,0.085614,0.063955,0.064313,0.104872,0.083605,0.146169,0.137009,0.047332,0.084007,0.128271
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.355126,0.101653,0.047781,0.03478,0.021134,0.367578,0.387136,0.056017,0.017172,0.58753
50%,0.401189,0.135455,0.067201,0.048181,0.030259,0.456304,0.470874,0.06442,0.028311,0.654676
75%,0.448737,0.169257,0.095561,0.067326,0.048991,0.543696,0.56432,0.075623,0.051207,0.729017
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
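
Min-max scaling maps each column to [0, 1] via (x - min) / (max - min). As a sanity check, the sketch below applies the formula to the minimum, median and maximum of ALB and GGT reported by `normal.describe()` and compares the result with scikit-learn's `MinMaxScaler`. The toy frame and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# min, 50% and max of ALB and GGT as reported by normal.describe()
toy = pd.DataFrame({"ALB": [14.9, 41.9, 82.2], "GGT": [4.5, 22.8, 650.9]})

# Manual min-max scaling, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn.
scaler = MinMaxScaler()
with_sklearn = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

print(manual.round(6))
print("identical results:", np.allclose(manual, with_sklearn))
```

A scaler object also makes it possible to fit the min and max on the training split only and reuse them on the test split, which is not what the cell above does.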


In [25]:
normailized.isna().sum()

ALB     0
ALP     0
ALT     0
AST     0
BIL     0
CHE     0
CHOL    0
CREA    0
GGT     0
PROT    0
dtype: int64

In [26]:
cleandf = normailized

In [27]:
cleandf['Age'] = df['Age']
cleandf['Sex'] = df['Sex']


In [28]:
cleandf

Unnamed: 0,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,Age,Sex
0,0.350669,0.101653,0.020962,0.036694,0.032181,0.367578,0.218447,0.091495,0.011757,0.580336,32,0
1,0.350669,0.145571,0.052713,0.044990,0.014890,0.650434,0.408981,0.061619,0.017172,0.760192,32,0
2,0.475483,0.156427,0.108816,0.134014,0.025456,0.494997,0.457524,0.072822,0.044400,0.827338,32,0
3,0.420505,0.100419,0.091554,0.038290,0.086936,0.394263,0.401699,0.067221,0.045328,0.741007,32,0
4,0.361070,0.154947,0.097719,0.045310,0.042267,0.515677,0.350728,0.063486,0.039295,0.573141,32,0
...,...,...,...,...,...,...,...,...,...,...,...,...
608,0.283804,0.086603,0.043465,0.444799,0.034582,0.322882,0.309466,0.044814,0.069926,0.858513,58,1
609,0.358098,0.098692,0.057645,0.878111,0.188280,0.290193,0.373786,0.119597,0.149443,0.616307,59,1
610,0.254086,1.000000,0.015413,0.318124,0.236311,0.276851,0.591019,0.044534,1.000000,0.568345,62,1
611,0.135215,0.225759,0.006165,0.107849,0.092219,0.008005,0.192961,0.051349,0.048577,0.635492,64,1
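
In the frame above the lab values lie in [0, 1], while Age is still expressed in years (23 to 77 according to `df.describe()`). Because KNN relies on distances, a column on a much larger scale can dominate the neighbour search. The sketch below illustrates the effect with three made-up patients and Euclidean distance; whether scaling Age improves this particular model is not tested here.

```python
import numpy as np

AGE_MIN, AGE_MAX = 23.0, 77.0  # min and max of Age from df.describe()

# Made-up patients: [scaled lab value, Age in years]
a = np.array([0.10, 32.0])
b = np.array([0.90, 33.0])  # very different lab value, almost the same age
c = np.array([0.10, 40.0])  # identical lab value, eight years older


def scale_age(p):
    """Return a copy of p with Age min-max scaled to [0, 1]."""
    q = p.copy()
    q[1] = (q[1] - AGE_MIN) / (AGE_MAX - AGE_MIN)
    return q


# Distances with Age in years.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Distances after scaling Age as well.
scaled_ab = np.linalg.norm(scale_age(a) - scale_age(b))
scaled_ac = np.linalg.norm(scale_age(a) - scale_age(c))

print(f"raw:    d(a,b)={raw_ab:.3f}  d(a,c)={raw_ac:.3f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

With Age in years, patient b looks closer to a than c does, even though c has the same lab value as a; after scaling Age the order is reversed.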


In [29]:
label = df['Category']

In [30]:
label.head()

0    Donor
1    Donor
2    Donor
3    Donor
4    Donor
Name: Category, dtype: object

In [31]:
from sklearn.model_selection import train_test_split

x = cleandf
y = label

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 2)
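
The `stratify = y` argument keeps the class proportions roughly equal in both splits, which matters here because NotDonor is a small minority. The sketch below illustrates this with made-up labels that only reproduce the class counts reported earlier (526 Donor, 63 NotDonor); the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with the class counts from the notebook.
y = np.array(["Donor"] * 526 + ["NotDonor"] * 63)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Share of the minority class in the full data and in each split.
share_all = (y == "NotDonor").mean()
share_train = (y_train == "NotDonor").mean()
share_test = (y_test == "NotDonor").mean()

print(f"NotDonor share: all={share_all:.3f} train={share_train:.3f} test={share_test:.3f}")
print("test size:", len(y_test))
```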

In [33]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=23)
classifier.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=23)
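
The notebook uses `n_neighbors=23` without showing how that value was chosen. One common way to pick it is a cross-validated search over a few candidates. The sketch below does this on synthetic, imbalanced data from `make_classification` (not the HCV data); the candidate values and the choice of balanced accuracy as the score are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with roughly a 90/10 class imbalance.
X, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=4,
    weights=[0.9, 0.1],
    random_state=0,
)

candidate_k = [1, 3, 5, 11, 23]  # arbitrary candidates for illustration

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_k},
    scoring="balanced_accuracy",  # less flattering than accuracy on imbalanced data
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k)
print("mean balanced accuracy:", round(search.best_score_, 3))
```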

In [36]:
y_pred = classifier.predict(x_test)

In [37]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[105   0]
 [ 13   0]]
              precision    recall  f1-score   support

       Donor       0.89      1.00      0.94       105
    NotDonor       0.00      0.00      0.00        13

    accuracy                           0.89       118
   macro avg       0.44      0.50      0.47       118
weighted avg       0.79      0.89      0.84       118





In [39]:
from sklearn.metrics import accuracy_score

# Accuracy on training data
x_train_pred = classifier.predict(x_train)
training_accuracy = accuracy_score(y_train, x_train_pred)  # argument order is (y_true, y_pred)
print('The accuracy of the model on the training data is: ', training_accuracy)

The accuracy of the model on the training data is:  0.8938428874734607


In [45]:
# Accuracy on test data
x_test_pred = classifier.predict(x_test)
test_accuracy = accuracy_score(y_test, x_test_pred)  # argument order is (y_true, y_pred)
print('The accuracy of the model on the test data is: ', (test_accuracy*100))

The accuracy of the model on the test data is:  88.98305084745762
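
The accuracy figures above are easier to judge next to the confusion matrix reported earlier. The sketch below recomputes a few metrics directly from that matrix; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported earlier: rows are true classes, columns are
# predicted classes, both in the order [Donor, NotDonor].
cm = np.array([[105, 0],
               [13, 0]])

# Fraction of correct predictions: 105 out of 118.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the true count of each class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = recall_per_class.mean()

# Accuracy of a model that always predicts the majority class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy:          {accuracy:.4f}")
print(f"recall per class:  {recall_per_class}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Every test sample is predicted as Donor, so the test accuracy of about 0.89 equals the score of always predicting the majority class, and the recall for NotDonor is 0.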
