# Disease Prediction
In this project I use machine learning as an alternative to running diagnostic tests on individuals. The goal is to classify each individual as healthy or as having a specific disease, based on their health attributes.<br>
I will use features such as cholesterol level, white and red blood cell counts, and heart rate to predict the disease an individual is likely to have.

In [1]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE

In [2]:
df = pd.read_csv('Train_data.csv')

# Data Cleaning
Before applying any machine learning tools to this data, I first have to check whether the dataset is clean.

In [3]:
df.head()

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,White Blood Cells,Red Blood Cells,Hematocrit,Mean Corpuscular Volume,Mean Corpuscular Hemoglobin,Mean Corpuscular Hemoglobin Concentration,...,HbA1c,LDL Cholesterol,HDL Cholesterol,ALT,AST,Heart Rate,Creatinine,Troponin,C-reactive Protein,Disease
0,0.739597,0.650198,0.713631,0.868491,0.687433,0.529895,0.290006,0.631045,0.001328,0.795829,...,0.502665,0.21556,0.512941,0.064187,0.610827,0.939485,0.095512,0.465957,0.76923,Healthy
1,0.121786,0.023058,0.944893,0.905372,0.507711,0.403033,0.164216,0.307553,0.207938,0.505562,...,0.85681,0.652465,0.106961,0.942549,0.344261,0.666368,0.65906,0.816982,0.401166,Diabetes
2,0.452539,0.116135,0.54456,0.40064,0.294538,0.382021,0.625267,0.295122,0.868369,0.026808,...,0.466795,0.387332,0.421763,0.007186,0.506918,0.431704,0.417295,0.799074,0.779208,Thalasse
3,0.136609,0.015605,0.419957,0.191487,0.081168,0.166214,0.073293,0.668719,0.125447,0.501051,...,0.016256,0.040137,0.826721,0.265415,0.594148,0.225756,0.490349,0.637061,0.354094,Anemia
4,0.176737,0.75222,0.971779,0.785286,0.44388,0.439851,0.894991,0.442159,0.257288,0.805987,...,0.429431,0.146294,0.221574,0.01528,0.567115,0.841412,0.15335,0.794008,0.09497,Thalasse


In [4]:
df.isnull().sum()

Glucose                                      0
Cholesterol                                  0
Hemoglobin                                   0
Platelets                                    0
White Blood Cells                            0
Red Blood Cells                              0
Hematocrit                                   0
Mean Corpuscular Volume                      0
Mean Corpuscular Hemoglobin                  0
Mean Corpuscular Hemoglobin Concentration    0
Insulin                                      0
BMI                                          0
Systolic Blood Pressure                      0
Diastolic Blood Pressure                     0
Triglycerides                                0
HbA1c                                        0
LDL Cholesterol                              0
HDL Cholesterol                              0
ALT                                          0
AST                                          0
Heart Rate                                   0
Creatinine   

We have no null values.

In [5]:
df[df.duplicated()].sort_values(by='Glucose')

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,White Blood Cells,Red Blood Cells,Hematocrit,Mean Corpuscular Volume,Mean Corpuscular Hemoglobin,Mean Corpuscular Hemoglobin Concentration,...,HbA1c,LDL Cholesterol,HDL Cholesterol,ALT,AST,Heart Rate,Creatinine,Troponin,C-reactive Protein,Disease
461,0.010994,0.316546,0.516925,0.894809,0.547689,0.981186,0.011772,0.703163,0.323070,0.319178,...,0.484525,0.377543,0.970482,0.637386,0.335753,0.985786,0.646643,0.008220,0.389637,Anemia
1089,0.010994,0.316546,0.516925,0.894809,0.547689,0.981186,0.011772,0.703163,0.323070,0.319178,...,0.484525,0.377543,0.970482,0.637386,0.335753,0.985786,0.646643,0.008220,0.389637,Anemia
2314,0.010994,0.316546,0.516925,0.894809,0.547689,0.981186,0.011772,0.703163,0.323070,0.319178,...,0.484525,0.377543,0.970482,0.637386,0.335753,0.985786,0.646643,0.008220,0.389637,Anemia
678,0.010994,0.316546,0.516925,0.894809,0.547689,0.981186,0.011772,0.703163,0.323070,0.319178,...,0.484525,0.377543,0.970482,0.637386,0.335753,0.985786,0.646643,0.008220,0.389637,Anemia
751,0.010994,0.316546,0.516925,0.894809,0.547689,0.981186,0.011772,0.703163,0.323070,0.319178,...,0.484525,0.377543,0.970482,0.637386,0.335753,0.985786,0.646643,0.008220,0.389637,Anemia
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
888,0.968460,0.406009,0.956986,0.615633,0.146895,0.172418,0.919517,0.959807,0.759804,0.943169,...,0.064762,0.354323,0.308892,0.248164,0.083672,0.114550,0.095323,0.914947,0.042490,Diabetes
1147,0.968460,0.406009,0.956986,0.615633,0.146895,0.172418,0.919517,0.959807,0.759804,0.943169,...,0.064762,0.354323,0.308892,0.248164,0.083672,0.114550,0.095323,0.914947,0.042490,Diabetes
1082,0.968460,0.406009,0.956986,0.615633,0.146895,0.172418,0.919517,0.959807,0.759804,0.943169,...,0.064762,0.354323,0.308892,0.248164,0.083672,0.114550,0.095323,0.914947,0.042490,Diabetes
867,0.968460,0.406009,0.956986,0.615633,0.146895,0.172418,0.919517,0.959807,0.759804,0.943169,...,0.064762,0.354323,0.308892,0.248164,0.083672,0.114550,0.095323,0.914947,0.042490,Diabetes


In [6]:
df[df.duplicated()].shape

(2286, 25)

In [7]:
df.shape

(2351, 25)

In [8]:
df.shape[0] - df[df.duplicated()].shape[0]

65

The number of duplicate rows in this dataframe is striking: 2,286 of the 2,351 rows are flagged as duplicates, which leaves only 65 distinct rows.<br>
Duplicates may indicate either repeated measurements or valid separate occurrences, and in this instance it is the former.<br>
These are complete duplicates; however, I'm going to keep them because the model will be evaluated on a separate test set.<br>
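
To make the count above concrete, here is a minimal sketch on a small made-up frame (the `toy` data is an illustrative assumption, not rows from `Train_data.csv`). `duplicated()` marks every repeat after the first occurrence, so subtracting its sum from the row count gives the number of distinct rows, and `drop_duplicates()` is how the repeats could be removed if one chose to.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "Glucose": [0.10, 0.50, 0.10, 0.10],
    "Disease": ["Anemia", "Healthy", "Anemia", "Anemia"],
})

n_repeats = int(toy.duplicated().sum())   # repeats after the first occurrence
n_unique = len(toy) - n_repeats           # distinct rows
deduped = toy.drop_duplicates()           # keeps the first copy of each row

print(n_repeats, n_unique, len(deduped))
```

Keeping the repeats, as done here, means each distinct row is effectively weighted by how often it occurs.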

In [9]:
df.shape

(2351, 25)

# Encoding categorical variables

In [10]:
df['Disease'].value_counts()

Disease
Anemia      623
Healthy     556
Diabetes    540
Thalasse    509
Thromboc    123
Name: count, dtype: int64

In [11]:
df.loc[:, 'Disease'] = df['Disease'].map({
    'Healthy': 0,
    'Diabetes': 1,
    'Anemia': 2,
    'Thalasse': 3,
    'Thromboc': 4
})

In [12]:
df['Disease'].value_counts(normalize=True)

Disease
2    0.264994
0    0.236495
1    0.229689
3    0.216504
4    0.052318
Name: proportion, dtype: float64

Our data is imbalanced: class 4 (`Thromboc`) makes up only about 5% of the rows. We will use SMOTE (Synthetic Minority Over-sampling Technique) to balance the classes.
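
For intuition, SMOTE creates a synthetic minority sample by picking a minority point, choosing one of its nearest minority neighbours, and interpolating a random fraction of the way between them. The following is a simplified NumPy sketch of that idea, not imblearn's implementation; the function name `smote_like_samples`, the toy points, and `k=2` are assumptions for illustration.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # chosen points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # chosen neighbours
    lam = rng.random((n_new, 1))                               # fraction in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Hypothetical minority-class points with two features.
X_min = np.array([[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25]])
synthetic = smote_like_samples(X_min, n_new=5)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region the minority class already occupies.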

In [13]:
df['Disease'] = df['Disease'].astype(int)

# Outlier Detection

In [14]:
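def outlier_detection(dataframe, column_name):
    # Quartiles of the column (the IQR rule needs scalars, not the raw column)
    Q1 = dataframe[column_name].quantile(0.25)
    Q3 = dataframe[column_name].quantile(0.75)
    IQR = Q3 - Q1
    # Values more than 1.5 * IQR beyond the quartiles are treated as outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = dataframe[(dataframe[column_name] < lower_bound) | (dataframe[column_name] > upper_bound)]
    return outliers.shape[0]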

In [15]:
df.columns

Index(['Glucose', 'Cholesterol', 'Hemoglobin', 'Platelets',
       'White Blood Cells', 'Red Blood Cells', 'Hematocrit',
       'Mean Corpuscular Volume', 'Mean Corpuscular Hemoglobin',
       'Mean Corpuscular Hemoglobin Concentration', 'Insulin', 'BMI',
       'Systolic Blood Pressure', 'Diastolic Blood Pressure', 'Triglycerides',
       'HbA1c', 'LDL Cholesterol', 'HDL Cholesterol', 'ALT', 'AST',
       'Heart Rate', 'Creatinine', 'Troponin', 'C-reactive Protein',
       'Disease'],
      dtype='object')

In [16]:
for column in df.columns:
    print(outlier_detection(df, column))

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


The IQR check reports no outliers in any column.
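
As a sanity check on any IQR rule, it helps to run it on a column where the answer is known in advance. The sketch below uses a hypothetical helper `iqr_outlier_count` and a made-up series with one obvious extreme value; the quartiles come from `Series.quantile`.

```python
import pandas as pd

def iqr_outlier_count(series):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Hypothetical column: nine ordinary values and one extreme value.
toy = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 10.0])
with_extreme = iqr_outlier_count(toy)          # the 10.0 should be flagged
without_extreme = iqr_outlier_count(toy[:-1])  # no extreme value left

print(with_extreme, without_extreme)
```

A rule that returns zero on the first series as well would be a sign that the bounds are not being computed from the quartiles.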

# Model Building

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

In [18]:
df

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,White Blood Cells,Red Blood Cells,Hematocrit,Mean Corpuscular Volume,Mean Corpuscular Hemoglobin,Mean Corpuscular Hemoglobin Concentration,...,HbA1c,LDL Cholesterol,HDL Cholesterol,ALT,AST,Heart Rate,Creatinine,Troponin,C-reactive Protein,Disease
0,0.739597,0.650198,0.713631,0.868491,0.687433,0.529895,0.290006,0.631045,0.001328,0.795829,...,0.502665,0.215560,0.512941,0.064187,0.610827,0.939485,0.095512,0.465957,0.769230,0
1,0.121786,0.023058,0.944893,0.905372,0.507711,0.403033,0.164216,0.307553,0.207938,0.505562,...,0.856810,0.652465,0.106961,0.942549,0.344261,0.666368,0.659060,0.816982,0.401166,1
2,0.452539,0.116135,0.544560,0.400640,0.294538,0.382021,0.625267,0.295122,0.868369,0.026808,...,0.466795,0.387332,0.421763,0.007186,0.506918,0.431704,0.417295,0.799074,0.779208,3
3,0.136609,0.015605,0.419957,0.191487,0.081168,0.166214,0.073293,0.668719,0.125447,0.501051,...,0.016256,0.040137,0.826721,0.265415,0.594148,0.225756,0.490349,0.637061,0.354094,2
4,0.176737,0.752220,0.971779,0.785286,0.443880,0.439851,0.894991,0.442159,0.257288,0.805987,...,0.429431,0.146294,0.221574,0.015280,0.567115,0.841412,0.153350,0.794008,0.094970,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2346,0.012956,0.336925,0.451218,0.175006,0.734664,0.382770,0.656463,0.177502,0.808162,0.684499,...,0.670665,0.311568,0.595083,0.155866,0.885812,0.636125,0.132226,0.716519,0.006121,1
2347,0.407101,0.124738,0.983306,0.663867,0.361113,0.663716,0.232516,0.341056,0.847441,0.309766,...,0.491185,0.701914,0.218104,0.790341,0.570902,0.339125,0.310964,0.310900,0.622403,3
2348,0.344356,0.783918,0.582171,0.996841,0.065363,0.242885,0.658851,0.543017,0.290106,0.838722,...,0.141738,0.155871,0.473638,0.250535,0.387197,0.344728,0.606719,0.395145,0.134021,2
2349,0.351722,0.014278,0.898615,0.167550,0.727148,0.046091,0.900434,0.136227,0.134361,0.279219,...,0.570553,0.171245,0.858352,0.362012,0.290984,0.996873,0.882164,0.411158,0.146255,1


In [19]:
smote = SMOTE(random_state=42)

In [20]:
lr = LogisticRegression(max_iter=200)
svm = SVC()
gnb = GaussianNB()
scaler = StandardScaler()

In [21]:
X_train = df.drop('Disease', axis=1)
y_train = df['Disease']

In [22]:
y_train.value_counts()

Disease
2    623
0    556
1    540
3    509
4    123
Name: count, dtype: int64

In [23]:
X_train, y_train = smote.fit_resample(X_train, y_train)

In [24]:
X_train = scaler.fit_transform(X_train)

In [25]:
y_train.value_counts()

Disease
0    623
1    623
3    623
2    623
4    623
Name: count, dtype: int64

In [26]:
y_train.shape

(3115,)

In [27]:
lr.fit(X_train, y_train)

In [28]:
svm.fit(X_train, y_train)

In [29]:
gnb.fit(X_train, y_train)

# Cleaning Test Data

In [30]:
df_test = pd.read_csv('test_data.csv')

In [31]:
df_test[df_test.duplicated()]

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,White Blood Cells,Red Blood Cells,Hematocrit,Mean Corpuscular Volume,Mean Corpuscular Hemoglobin,Mean Corpuscular Hemoglobin Concentration,...,HbA1c,LDL Cholesterol,HDL Cholesterol,ALT,AST,Heart Rate,Creatinine,Troponin,C-reactive Protein,Disease


In [32]:
df_test.isnull().sum()

Glucose                                      0
Cholesterol                                  0
Hemoglobin                                   0
Platelets                                    0
White Blood Cells                            0
Red Blood Cells                              0
Hematocrit                                   0
Mean Corpuscular Volume                      0
Mean Corpuscular Hemoglobin                  0
Mean Corpuscular Hemoglobin Concentration    0
Insulin                                      0
BMI                                          0
Systolic Blood Pressure                      0
Diastolic Blood Pressure                     0
Triglycerides                                0
HbA1c                                        0
LDL Cholesterol                              0
HDL Cholesterol                              0
ALT                                          0
AST                                          0
Heart Rate                                   0
Creatinine   

In [33]:
df_test['Disease'].value_counts()

Disease
Diabetes    294
Anemia       84
Thalasse     48
Heart Di     39
Thromboc     16
Healthy       5
Name: count, dtype: int64

In [34]:
df_test.loc[:, 'Disease'] = df_test['Disease'].map({
    'Healthy': 0,
    'Diabetes': 1,
    'Anemia': 2,
    'Thalasse': 3,
    'Thromboc': 4,
    'Heart Di': 6
})
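
The test file contains a label, `Heart Di`, that never appears in the training data, so no model trained on `Train_data.csv` can ever predict it. A minimal sketch for detecting such labels before evaluation, using the label names from the two `value_counts()` outputs (the variable names are illustrative):

```python
import pandas as pd

# Label names as they appear in the training and test value_counts() outputs.
train_labels = pd.Series(["Anemia", "Healthy", "Diabetes", "Thalasse", "Thromboc"])
test_labels = pd.Series(["Diabetes", "Anemia", "Thalasse", "Heart Di", "Thromboc", "Healthy"])

# Labels present in the test set but absent from training.
unseen = sorted(set(test_labels) - set(train_labels))
print(unseen)

# One option (an assumption, not this notebook's choice): keep only seen labels.
mask = test_labels.isin(train_labels)
print(int(mask.sum()), "of", len(test_labels), "label names are seen in training")
```

With 39 `Heart Di` rows out of 486, about 8% of the test predictions are wrong by construction.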

In [35]:
df_test

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,White Blood Cells,Red Blood Cells,Hematocrit,Mean Corpuscular Volume,Mean Corpuscular Hemoglobin,Mean Corpuscular Hemoglobin Concentration,...,HbA1c,LDL Cholesterol,HDL Cholesterol,ALT,AST,Heart Rate,Creatinine,Troponin,C-reactive Protein,Disease
0,0.001827,0.033693,0.114755,0.997927,0.562604,0.866499,0.578042,0.914615,0.026864,0.038641,...,0.653230,0.186104,0.430398,0.016678,0.885352,0.652733,0.788235,0.054788,0.031313,3
1,0.436679,0.972653,0.084998,0.180909,0.675736,0.563889,0.798382,0.670361,0.376092,0.184890,...,0.833540,0.153001,0.458533,0.401845,0.635969,0.574425,0.047025,0.607985,0.594123,1
2,0.545697,0.324815,0.584467,0.475748,0.558596,0.661007,0.934056,0.381782,0.500342,0.531829,...,0.678901,0.220479,0.817151,0.690981,0.101633,0.855740,0.551124,0.413294,0.070909,6
3,0.172994,0.050351,0.736000,0.782022,0.069435,0.085219,0.032907,0.460619,0.785448,0.491495,...,0.381500,0.459396,0.420154,0.798537,0.399236,0.324600,0.499504,0.436662,0.242766,1
4,0.758534,0.739968,0.597868,0.772683,0.875720,0.860265,0.486189,0.486686,0.621048,0.191756,...,0.993381,0.272338,0.663579,0.265227,0.918847,0.804910,0.571119,0.188368,0.750848,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
481,0.985163,0.412960,0.529993,0.263765,0.431288,0.198882,0.581289,0.701192,0.249410,0.246893,...,0.680556,0.048191,0.465272,0.066511,0.965544,0.015051,0.442730,0.196986,0.816038,1
482,0.581914,0.629325,0.491644,0.901473,0.347797,0.633286,0.698114,0.516947,0.674259,0.798153,...,0.261767,0.482322,0.799523,0.807460,0.325313,0.825194,0.777866,0.415987,0.842804,6
483,0.066669,0.404558,0.591041,0.228401,0.127461,0.026670,0.847444,0.279740,0.575425,0.156438,...,0.168146,0.763625,0.677782,0.890501,0.638825,0.559993,0.795478,0.669925,0.124874,2
484,0.901444,0.430680,0.243853,0.825551,0.493884,0.726299,0.660930,0.445560,0.349782,0.343069,...,0.893448,0.500059,0.112250,0.548469,0.211496,0.938355,0.463381,0.862921,0.658526,1


In [36]:
df_test['Disease'] = df_test['Disease'].astype(int)

In [37]:
df_test['Disease'].value_counts()

Disease
1    294
2     84
3     48
6     39
4     16
0      5
Name: count, dtype: int64

In [38]:
X_test = df_test.drop('Disease', axis=1)
y_test = df_test['Disease']

In [39]:
X_test = scaler.transform(X_test)

In [40]:
svm.score(X_test, y_test)

0.49176954732510286

In [41]:
y_svm_pred = svm.predict(X_test)
print(classification_report(y_true=y_test, y_pred=y_svm_pred))

              precision    recall  f1-score   support

           0       0.04      0.60      0.07         5
           1       0.69      0.68      0.68       294
           2       0.42      0.32      0.36        84
           3       0.17      0.19      0.18        48
           4       0.00      0.00      0.00        16
           6       0.00      0.00      0.00        39

    accuracy                           0.49       486
   macro avg       0.22      0.30      0.22       486
weighted avg       0.51      0.49      0.50       486





Gaussian NB
SVM

In [42]:
lr.score(X_test, y_test)

0.31893004115226337

In [43]:
y_lr_pred = lr.predict(X_test)
print(classification_report(y_true=y_test, y_pred=y_lr_pred))

              precision    recall  f1-score   support

           0       0.03      0.60      0.06         5
           1       0.65      0.35      0.46       294
           2       0.29      0.42      0.34        84
           3       0.14      0.21      0.17        48
           4       0.13      0.25      0.17        16
           6       0.00      0.00      0.00        39

    accuracy                           0.32       486
   macro avg       0.21      0.30      0.20       486
weighted avg       0.46      0.32      0.36       486





In [44]:
gnb.score(X_test, y_test)

0.5390946502057613

In [45]:
y_gnb_pred = gnb.predict(X_test)
print(classification_report(y_true=y_test, y_pred=y_gnb_pred))

              precision    recall  f1-score   support

           0       0.03      0.40      0.05         5
           1       0.70      0.77      0.73       294
           2       0.45      0.26      0.33        84
           3       0.27      0.25      0.26        48
           4       0.00      0.00      0.00        16
           6       0.00      0.00      0.00        39

    accuracy                           0.54       486
   macro avg       0.24      0.28      0.23       486
weighted avg       0.53      0.54      0.53       486
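
The gap between accuracy and the macro averages comes from the class mix in the test set. As a worked check, the sketch below recomputes the macro and support-weighted recall from the per-class numbers in the GaussianNB report above (values copied from the report, so results agree only up to rounding). Support-weighted recall is the same quantity as accuracy.

```python
# Per-class recall and support copied from the GaussianNB report above.
recall = {0: 0.40, 1: 0.77, 2: 0.26, 3: 0.25, 4: 0.00, 6: 0.00}
support = {0: 5, 1: 294, 2: 84, 3: 48, 4: 16, 6: 39}

total = sum(support.values())
macro_recall = sum(recall.values()) / len(recall)
weighted_recall = sum(recall[c] * support[c] for c in recall) / total
# Fraction of all correct predictions that come from class 1 alone.
share_class_1 = recall[1] * support[1] / (weighted_recall * total)

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
print(f"share of correct predictions from class 1: {share_class_1:.0%}")
```

Class 1 accounts for most of the correct predictions, so accuracy mainly reflects how well that one class is recognised.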





# Using TensorFlow

In [219]:
import tensorflow as tf
from tensorflow import keras

In [220]:
model = keras.Sequential([
    keras.layers.Dense(24, input_shape=(24,), activation='relu'),
    keras.layers.Dense(48, activation='relu'),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
             loss='binary_crossentropy',
             metrics=['accuracy'])




In [221]:
X_train.shape

(3115, 24)

In [222]:
model.fit(X_train, y_train, epochs=500)

Epoch 1/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.1859 - loss: -1.2765
Epoch 2/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1946 - loss: -114.6222
Epoch 3/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 999us/step - accuracy: 0.2068 - loss: -1509.3124
Epoch 4/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2055 - loss: -8491.9180
Epoch 5/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.1965 - loss: -27542.9648
Epoch 6/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.2118 - loss: -64489.5039
Epoch 7/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1869 - loss: -149938.6094
Epoch 8/500
98/98 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1839 - loss: -269493.1875
Epoch 9/500
98/98

<keras.src.callbacks.history.History at 0x1782e2310>

In [224]:
model.evaluate(X_test, y_test)

16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 924us/step - accuracy: 0.5978 - loss: -145244880896.0000


[-141392134144.0, 0.604938268661499]
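
The increasingly negative loss in the training log is what binary cross-entropy produces when the targets are not restricted to 0 and 1. With integer labels 0 to 4 and a single sigmoid output, the term `(1 - y) * log(1 - p)` has a negative coefficient whenever `y > 1`, so the loss can be pushed without bound below zero by driving `p` towards 1. The NumPy sketch below evaluates the formula directly (the helper `bce` is illustrative, not the Keras implementation, which also clips probabilities). For a five-class target like this one, the usual configuration is a 5-unit softmax output layer with `sparse_categorical_crossentropy`.

```python
import numpy as np

def bce(y, p):
    """Binary cross-entropy for a single prediction p in (0, 1)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Valid binary target: the loss is always non-negative.
print(bce(1, 0.9))

# Multi-class label used as a "binary" target: the loss turns negative
# and keeps falling as p approaches 1.
for p in (0.9, 0.999, 0.999999):
    print(p, bce(3, p))
```

The reported test accuracy of 0.6049 equals 294/486, the share of test rows labelled 1, which is consistent with the network outputting 1 for every row rather than learning to separate the classes.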