# Building The Models

In this notebook, we use the data processed in the [Data Preprocessing Stage](./Data_Preprocessing.ipynb)  
to build three different models:

1. KNN model
2. Logistic-Regression model (Machine Learning Approach)
3. FNN (Deep Learning Approach)


In [1]:
# General imports
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score

# Import the Sequential model and the layers used to build the Keras models
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input, Dropout
from tensorflow.keras import activations


In [2]:
# read the data
x_train = pd.read_csv('./data/features_training_data.csv')
x_val = pd.read_csv('./data/features_validation_data.csv')
y_train = pd.read_csv('./data/target_training_data.csv')
y_val = pd.read_csv('./data/target_validation_data.csv')

# flatten the targets into 1D vectors, the shape scikit-learn expects
y_train_reshaped = np.ravel(y_train)
y_val_reshaped = np.ravel(y_val)

print(x_train.shape, y_train.shape, x_val.shape, y_val.shape, y_train_reshaped.shape, y_val_reshaped.shape)

(575, 20) (575, 1) (144, 20) (144, 1) (575,) (144,)


## KNN model

To build this model and choose the best k and distance metric, I will use Optuna.


In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
import optuna

In [4]:
# Define the search space
# Note: scikit-learn's 'minkowski' defaults to p=2, which is the Euclidean distance
search_space_knn = {
  'n_neighbors': [5, 7, 9, 11, 13],
  'metric': ['euclidean', 'manhattan', 'minkowski']
}
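
One thing worth knowing about this grid: in scikit-learn, `metric='minkowski'` uses `p=2` by default, which is the Euclidean distance, so the `'minkowski'` and `'euclidean'` entries should score the same for a given `n_neighbors`. A minimal sketch on synthetic data (the data and the names `X_demo`, `y_demo` are made up for illustration and are not the data used in this notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the notebook's data set)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

preds = {}
for metric in ['euclidean', 'minkowski', 'manhattan']:
    model = KNeighborsClassifier(n_neighbors=7, metric=metric)
    model.fit(X_demo[:150], y_demo[:150])
    preds[metric] = model.predict(X_demo[150:])

# Minkowski with the default p=2 is the Euclidean distance,
# so the two sets of predictions should be identical
same_as_euclidean = np.array_equal(preds['euclidean'], preds['minkowski'])
print(same_as_euclidean)
```

To search over a genuinely different Minkowski distance, `p` would have to be passed explicitly (for example `p=3`).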

In [5]:
# Create the objective function: mean 10-fold cross-validation accuracy
def objective_knn(trial):
  # With GridSampler, the suggested values are taken from search_space_knn
  n_neighbors = trial.suggest_int('n_neighbors', 5, 13)
  metric = trial.suggest_categorical('metric', ['euclidean', 'manhattan', 'minkowski'])

  knn = KNeighborsClassifier(n_neighbors=n_neighbors, metric=metric)

  # Score the candidate on 10 folds of the training data
  scores = cross_val_score(knn, x_train, y_train_reshaped, cv=10)
  accuracy = scores.mean()
  return accuracy
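
To make the objective's return value concrete: `cross_val_score` with `cv=10` fits the classifier ten times, each time scoring it on a different held-out tenth of the training data, and the objective returns the mean of those ten accuracies. A small sketch on synthetic data (an assumption for illustration; the row and feature counts only mirror the shapes printed earlier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as x_train (575 rows, 20 features)
X_demo, y_demo = make_classification(n_samples=575, n_features=20, random_state=0)

knn_demo = KNeighborsClassifier(n_neighbors=13, metric='manhattan')
fold_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=10)

print(len(fold_scores))    # one accuracy per fold
print(fold_scores.mean())  # the kind of value objective_knn returns
```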

In [6]:
# Define the KNN study
sampler = optuna.samplers.GridSampler(search_space_knn)
# The pruner has no effect here because the objective never reports intermediate values
pruner = optuna.pruners.MedianPruner()
direction = 'maximize'
study_knn = optuna.create_study(sampler=sampler, pruner=pruner, direction=direction)

[I 2024-09-05 22:15:39,598] A new study created in memory with name: no-name-dc13ba68-dab6-435e-83a5-602122eb6e2a
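
As far as I understand Optuna's `GridSampler`, it walks through every combination in the search space and stops the study once all of them have been tried, which is why `optimize` can be called without `n_trials`. The grid size is easy to check by hand; this sketch only enumerates the combinations (the dictionary is copied from the search space cell, and `grid_demo` is a name introduced here):

```python
from itertools import product

# Same grid as search_space_knn
grid_demo = {
    'n_neighbors': [5, 7, 9, 11, 13],
    'metric': ['euclidean', 'manhattan', 'minkowski'],
}

combinations = list(product(*grid_demo.values()))
print(len(combinations))  # 5 values of k x 3 metrics = 15 trials
print(combinations[:3])
```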


In [7]:
# Let's study ☺
study_knn.optimize(func=objective_knn)

[I 2024-09-05 22:15:40,017] Trial 0 finished with value: 0.853901996370236 and parameters: {'n_neighbors': 7, 'metric': 'euclidean'}. Best is trial 0 with value: 0.853901996370236.
[I 2024-09-05 22:15:40,247] Trial 1 finished with value: 0.8659407138535995 and parameters: {'n_neighbors': 7, 'metric': 'manhattan'}. Best is trial 1 with value: 0.8659407138535995.
[I 2024-09-05 22:15:40,390] Trial 2 finished with value: 0.8677253478523896 and parameters: {'n_neighbors': 11, 'metric': 'manhattan'}. Best is trial 2 with value: 0.8677253478523896.
[I 2024-09-05 22:15:40,516] Trial 3 finished with value: 0.8677555958862673 and parameters: {'n_neighbors': 13, 'metric': 'manhattan'}. Best is trial 3 with value: 0.8677555958862673.
[I 2024-09-05 22:15:40,643] Trial 4 finished with value: 0.8590744101633394 and parameters: {'n_neighbors': 13, 'metric': 'minkowski'}. Best is trial 3 with value: 0.8677555958862673.
[I 2024-09-05 22:15:40,767] Trial 5 finished with value: 0.8590744101633394 and para

In [8]:
print(f'The best parameters for the KNN model are\n{study_knn.best_params}\n')
print(f'Which resulted in a best value of {study_knn.best_trial.value}')

The best parameters for the KNN model are
{'n_neighbors': 13, 'metric': 'manhattan'}

Which resulted in a best value of 0.8677555958862673


In [9]:
# Now, let's create the model with the best parameters
knn = KNeighborsClassifier(**study_knn.best_params)
# Fit the model on the whole training set
knn.fit(x_train, y_train_reshaped)

In [10]:
y_predict_KNN = knn.predict(x_val)
KNN_accuracy = accuracy_score(y_val_reshaped, y_predict_KNN)

print(f'The accuracy score for the knn model is {KNN_accuracy}')

The accuracy score for the knn model is 0.8611111111111112


## Logistic Regression (Machine Learning approach)

Here we build a logistic regression model with the
TensorFlow Keras API.  

We also use Optuna to choose the best batch size and number of epochs.
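
A single `Dense` unit with a sigmoid activation is logistic regression: it computes `sigmoid(w · x + b)` and is trained with binary cross-entropy. The NumPy sketch below spells out that forward pass and loss by hand; the weights, inputs and labels are made-up values for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip the probabilities to avoid log(0)
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 20))  # 4 hypothetical examples, 20 features
w = np.zeros(20)                   # hypothetical weights (20 of them)
b = 0.0                            # hypothetical bias (1 more parameter)
y_demo = np.array([0, 1, 1, 0])

probs = sigmoid(X_demo @ w + b)
loss = binary_crossentropy(y_demo, probs)
print(probs)  # all 0.5 because the weights are zero
print(loss)   # log(2), about 0.693
```

With 20 input features the model has only 21 trainable parameters, so it is by far the smallest of the three models.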

In [11]:
number_of_classes = 1  # one sigmoid output unit is enough for binary classification
input_shape = (x_train.shape[1],)

# Define the search space
search_space_lr = {
  'epochs': [10, 13, 16, 20],
  'batch_size': [25, 50, 100]
}
print(input_shape)

(20,)


In [12]:
# Build the objective function
def objective_lr(trial):
  # With GridSampler, the suggested values are taken from search_space_lr
  epochs = trial.suggest_int('epochs', 10, 20)
  batch_size = trial.suggest_int('batch_size', 25, 100)

  # Build the logistic regression model: a single sigmoid unit on top of the inputs
  logistic_reg = Sequential(
    [
      Input(shape=input_shape),
      Dense(number_of_classes, activation='sigmoid')
    ]
  )

  # Compile and fit
  logistic_reg.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  history = logistic_reg.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)

  # Get the accuracy of the last epoch
  # (note: this is measured on the training data, not on held-out data)
  accuracy = history.history['accuracy'][-1]

  return accuracy

In [13]:
# Define the study. Same setup as the KNN study, except that no pruning is used (NopPruner)
sampler = optuna.samplers.GridSampler(search_space_lr)
pruner = optuna.pruners.NopPruner()
direction = 'maximize'
study_lr = optuna.create_study(sampler=sampler, pruner=pruner, direction=direction)

[I 2024-09-05 22:15:42,465] A new study created in memory with name: no-name-4e6a3fb5-ef95-4a0a-a43e-cb463f5c5be1


In [None]:
study_lr.optimize(func=objective_lr)

In [15]:
# Best parameters with best accuracy are
print(f'The best parameters for the Logistic Regression model are\n{study_lr.best_params}\n')
print(f'Which resulted in a best value of {study_lr.best_trial.value}')

The best parameters for the Logistic Regression model are
{'epochs': 20, 'batch_size': 25}

Which resulted in a best value of 0.8243478536605835


In [16]:
# OK, let's build the final model
logistic_reg = Sequential(
  [
    Input(shape=input_shape),
    Dense(number_of_classes, activation='sigmoid')
  ]
)

# Compile and Fit
logistic_reg.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = logistic_reg.fit(x_train, y_train, **study_lr.best_params)

Epoch 1/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7767 - loss: 0.5030
Epoch 2/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.7336 - loss: 0.5451
Epoch 3/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7632 - loss: 0.5098
Epoch 4/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7710 - loss: 0.4914
Epoch 5/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7891 - loss: 0.4731
Epoch 6/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7868 - loss: 0.4762
Epoch 7/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8293 - loss: 0.4158
Epoch 8/20
23/23 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7797 - loss: 0.4448
Epoch 9/20
23/23 ━━━━━━━━━━━━━━━━

In [17]:
# Now, let's test the model on the validation data and get the accuracy
y_predicted_LR = logistic_reg.predict(x_val)

# threshold the predicted probabilities at 0.5 to get binary labels (0, 1)
y_predicted_LR = (y_predicted_LR > 0.5).astype(int)

# get the accuracy
LR_accuracy = accuracy_score(y_val, y_predicted_LR)
print(f'The accuracy score for the Logistic Regression model is {LR_accuracy}')

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step
The accuracy score for the Logistic Regression model is 0.7986111111111112
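
The thresholding step can be looked at in isolation. The model returns one probability per row with shape `(n, 1)`; comparing against 0.5 gives booleans, and `astype(int)` turns them into 0/1 labels. The probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n, 1) like the result of predict()
probs_demo = np.array([[0.12], [0.50], [0.51], [0.97]])

labels_demo = (probs_demo > 0.5).astype(int)
print(labels_demo.ravel())  # [0 0 1 1]; a probability of exactly 0.5 maps to class 0
```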


## FNN (Deep Learning Approach)

Here we build a feedforward neural network (FNN).

In [18]:
fnn = Sequential(
  [
    Input(shape=input_shape),
    Dense(128, activation=activations.relu),
    Dropout(0.5),
    Dense(64, activation=activations.relu),
    Dropout(0.5),
    Dense(32, activation=activations.relu),
    Dense(16, activation=activations.relu),
    Dense(1, activation=activations.sigmoid),
  ]
)

fnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
fnn.fit(x_train, y_train, epochs=17, batch_size=32)

Epoch 1/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.4988 - loss: 0.7033
Epoch 2/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.6487 - loss: 0.6233
Epoch 3/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7668 - loss: 0.5201
Epoch 4/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.8320 - loss: 0.4096
Epoch 5/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8408 - loss: 0.4023
Epoch 6/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8477 - loss: 0.3846
Epoch 7/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.8520 - loss: 0.3609
Epoch 8/17
18/18 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.8567 - loss: 0.3662
Epoch 9/17
18/18 ━━━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7cdee5f7ce50>
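
For a sense of scale, the number of trainable parameters in this network can be worked out by hand: a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases, and `Dropout` adds none. A quick sketch of that arithmetic for the layer sizes used here (I would expect `fnn.summary()` to report the same total):

```python
# Layer widths: 20 input features, then the Dense units of the FNN
layer_sizes = [20, 128, 64, 32, 16, 1]

params_per_layer = [
    n_in * n_out + n_out  # weights + biases
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [2688, 8256, 2080, 528, 17]
print(total_params)      # 13569
```

That is roughly 24 parameters per training example (13,569 parameters for 575 rows).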

In [19]:
# Now, let's test the model on the validation data and get the accuracy
y_predicted_FNN = fnn.predict(x_val)

# threshold the predicted probabilities at 0.5 to get binary labels (0, 1)
y_predicted_FNN = (y_predicted_FNN > 0.5).astype(int)

# get the accuracy
FNN_accuracy = accuracy_score(y_val, y_predicted_FNN)
print(f'The accuracy score for the FNN model is {FNN_accuracy}')

5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
The accuracy score for the FNN model is 0.8541666666666666


### Different Configurations

| Layers (units) | Accuracy | Notes |
|----------------|----------|-------|
| (128, 64, 1) | 0.854 | The accuracy is not stable |
| (128, 64, 32, 1) | 0.868 / 0.861 | The accuracy is not stable |
| (128, 64, 32, 16, 1) | [0.88, 0.847] | The accuracy is not stable |
| (64, 64, 16, 1) | 0.840 / 0.861 | The accuracy is not stable |
| (64, 32, 16, 1) | 0.854 / 0.861 | The accuracy is not stable |
| (32, 32, 1) | <= 0.854 | The accuracy is not stable |
| (64, 64, 1) | <= 0.861 | The accuracy is not stable |

## Model Selection

### Some thoughts and insights

1. The KNN model has the most stable and most accurate results

2. The LR model is the worst of the three

3. The Deep Learning approach landed in the middle and didn't outperform KNN,  
  probably because of the small training set (only 575 examples)
  
4. The FNN model gave unstable performance even for the same configuration,  
  probably because of the random initial weights it used
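
One way to quantify this run-to-run variation is to repeat training with several seeds and report the mean and spread instead of a single number. The sketch below uses scikit-learn's `MLPClassifier` on synthetic data as a stand-in (it is not the Keras model or the data from this notebook); for Keras, I believe `tf.keras.utils.set_random_seed` plays the role that `random_state` plays here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, split 575 / 144 like the notebook's data
X_demo, y_demo = make_classification(n_samples=719, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=144, random_state=0)

def val_accuracy(seed):
    # Only the seed changes between runs: it controls weight init and batch order
    mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=200, random_state=seed)
    mlp.fit(X_tr, y_tr)
    return mlp.score(X_va, y_va)

accuracies = np.array([val_accuracy(seed) for seed in range(5)])
print(accuracies)
print(f'mean={accuracies.mean():.3f}, std={accuracies.std():.3f}')

# Fixing the seed makes a run repeatable
repeatable = val_accuracy(0) == val_accuracy(0)
print(repeatable)
```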

### Evaluation Criteria

I will compare the three models by their accuracy on the validation set

In [20]:
print(KNN_accuracy, LR_accuracy, FNN_accuracy)

0.8611111111111112 0.7986111111111112 0.8541666666666666
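
Since the validation set has 144 rows, each accuracy corresponds to a whole number of correct predictions, which makes the gaps easier to read. A quick sketch of that conversion, with the accuracies copied from the output above; the standard error uses the usual binomial approximation and is only a rough guide:

```python
import math

n_val = 144  # rows in the validation set
accuracies = {
    'KNN': 0.8611111111111112,
    'LR': 0.7986111111111112,
    'FNN': 0.8541666666666666,
}

correct_counts = {}
for name, acc in accuracies.items():
    correct_counts[name] = round(acc * n_val)
    std_err = math.sqrt(acc * (1 - acc) / n_val)
    print(f'{name}: {correct_counts[name]}/{n_val} correct, standard error ~ {std_err:.3f}')
```

This gives 124, 115 and 123 correct predictions for KNN, LR and FNN respectively, so in this run KNN and FNN differ by a single validation example.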


- As the numbers show, KNN is the most accurate (in almost every case) and the most stable as well  
- Computation cost will not be a big problem because the data set is small

In [21]:
knn

In [24]:
# Save the model, once next to the notebook and once for the app
import joblib

joblib.dump(knn, './model/heart_disease_model.pkl')
joblib.dump(knn, './app/app_ml/model/heart_disease_model.pkl')

['./app/app_ml/model/heart_disease_model.pkl']
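
For completeness, the saved file can be read back with `joblib.load`, which returns the fitted estimator ready for `predict`. The round trip below uses a throwaway KNN on synthetic data and a temporary directory, so the path, file name and data are placeholders rather than the ones used in this notebook:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Throwaway model on synthetic data (assumption: stands in for the tuned knn)
X_demo, y_demo = make_classification(n_samples=100, n_features=20, random_state=0)
knn_demo = KNeighborsClassifier(n_neighbors=13, metric='manhattan').fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_model.pkl')  # hypothetical file name
    joblib.dump(knn_demo, model_path)
    loaded = joblib.load(model_path)

# The reloaded model should predict exactly like the original
round_trip_ok = np.array_equal(knn_demo.predict(X_demo), loaded.predict(X_demo))
print(round_trip_ok)
```

Note that `joblib.dump` does not create missing directories, so the target folders need to exist before saving.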