# Import Libraries

- `Data Manipulation`: **pandas**, **numpy**
- `Data Visualisation`: **matplotlib**, **seaborn**, **bokeh**
- `Machine Learning`: **sklearn**, **pytorch**, **tensorflow**
- `Others`: **klib**, **imblearn**, **streamlit**

In [1]:
import streamlit
import pandas as pd
import numpy as np
import matplotlib, klib, bokeh
import sklearn, imblearn
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
import pickle, json, joblib
import torch, netron

#### Streamlit
<img src="https://streamlit.io/images/brand/streamlit-logo-secondary-colormark-darktext.png" alt="Alternative text" width = 300 />

[Streamlit](https://streamlit.io/) is an open-source Python library that enables you to create web applications for data science and machine learning with minimal effort. With Streamlit, you can turn data scripts into shareable web apps in just a few lines of code.

To get started with Streamlit, you can install it using:

```bash
pip install streamlit
```

For more detailed information and documentation, refer to the official [Streamlit documentation](https://docs.streamlit.io/).

In [2]:
# !streamlit hello

The `streamlit hello` command is a quick way to get started with [Streamlit](https://streamlit.io/): it launches the built-in demo app in your browser, which also confirms that the installation works.

## 1. Data Description

The heart attack dataset was collected at Zheen hospital in Erbil, Iraq, from **January 2019 to May 2019**.

Dataset Link: **https://data.mendeley.com/datasets/wmhctcrt5v/1**

According to the dataset description, each record is **classified** as either a heart attack (`positive`) or no heart attack (`negative`).

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/FuZhangCheng/fyp-project/main/dataset/Medicaldataset.csv")
data.head()

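|   | Age | Gender | Heart rate | Systolic blood pressure | Diastolic blood pressure | Blood sugar | CK-MB | Troponin | Result |
|---|-----|--------|------------|-------------------------|--------------------------|-------------|-------|----------|--------|
| 0 | 64 | 1 | 66 | 160 | 83 | 160.0 | 1.8 | 0.012 | negative |
| 1 | 21 | 1 | 94 | 98 | 46 | 296.0 | 6.75 | 1.06 | positive |
| 2 | 55 | 1 | 64 | 160 | 77 | 270.0 | 1.99 | 0.003 | negative |
| 3 | 64 | 1 | 70 | 120 | 55 | 270.0 | 13.87 | 0.122 | positive |
| 4 | 55 | 1 | 64 | 112 | 65 | 300.0 | 1.08 | 0.003 | negative |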


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1319 entries, 0 to 1318
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       1319 non-null   int64  
 1   Gender                    1319 non-null   int64  
 2   Heart rate                1319 non-null   int64  
 3   Systolic blood pressure   1319 non-null   int64  
 4   Diastolic blood pressure  1319 non-null   int64  
 5   Blood sugar               1319 non-null   float64
 6   CK-MB                     1319 non-null   float64
 7   Troponin                  1319 non-null   float64
 8   Result                    1319 non-null   object 
dtypes: float64(3), int64(5), object(1)
memory usage: 92.9+ KB


In [5]:
data.isnull().sum()

Age                         0
Gender                      0
Heart rate                  0
Systolic blood pressure     0
Diastolic blood pressure    0
Blood sugar                 0
CK-MB                       0
Troponin                    0
Result                      0
dtype: int64

The dataset contains no **null** values.

In [6]:
data.describe()

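|       | Age | Gender | Heart rate | Systolic blood pressure | Diastolic blood pressure | Blood sugar | CK-MB | Troponin |
|-------|-----|--------|------------|-------------------------|--------------------------|-------------|-------|----------|
| count | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 | 1319.0 |
| mean | 56.191812 | 0.659591 | 78.336619 | 127.170584 | 72.269143 | 146.634344 | 15.274306 | 0.360942 |
| std | 13.647315 | 0.474027 | 51.63027 | 26.12272 | 14.033924 | 74.923045 | 46.327083 | 1.154568 |
| min | 14.0 | 0.0 | 20.0 | 42.0 | 38.0 | 35.0 | 0.321 | 0.001 |
| 25% | 47.0 | 0.0 | 64.0 | 110.0 | 62.0 | 98.0 | 1.655 | 0.006 |
| 50% | 58.0 | 1.0 | 74.0 | 124.0 | 72.0 | 116.0 | 2.85 | 0.014 |
| 75% | 65.0 | 1.0 | 85.0 | 143.0 | 81.0 | 169.5 | 5.805 | 0.0855 |
| max | 103.0 | 1.0 | 1111.0 | 223.0 | 154.0 | 541.0 | 300.0 | 10.3 |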


## 2. Data exploration and visualization

In [7]:
# !streamlit run streamlit_app.py

## 3. Data Preprocessing

The data preprocessing consists of three steps:
1. **Step 1**: Standardize the range of the numerical features.
2. **Step 2**: Encode the categorical features using `OneHotEncoder`.
3. **Step 3**: Convert the target feature into binary form, `0` and `1` (binary classification).

After that, we use `train_test_split` to split the data into a training set and a test set for our models.
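
The three steps above can be sketched with scikit-learn alone. This is a minimal illustration on a made-up DataFrame (`toy`), not the implementation of the project's own preprocessing module, whose source is not shown here; the column subset and values are assumptions chosen for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "Age": [64, 21, 55, 64],
    "Heart rate": [66, 94, 64, 70],
    "Gender": [1, 1, 0, 1],
    "Result": ["negative", "positive", "negative", "positive"],
})

numerical_columns = ["Age", "Heart rate"]
categorical_columns = ["Gender"]

# Steps 1 and 2: scale the numerical columns, one-hot encode the categorical ones
transformer = ColumnTransformer([
    ("num", StandardScaler(), numerical_columns),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
])
features = transformer.fit_transform(toy[numerical_columns + categorical_columns])

# Step 3: map the target to 1 (positive) and 0 (negative)
target = (toy["Result"] == "positive").astype(int)

print(features.shape)   # 2 scaled columns + 2 one-hot columns
print(target.tolist())
```

In the notebook itself, `categorical_columns` is set to `None`, so `Gender` (already coded as 0/1) is carried through unchanged rather than one-hot encoded.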

In [8]:
from preprocess_data import DataPreprocessor

In [9]:
preprocessor = DataPreprocessor(data=data,
                                remove_columns=None,  # No columns to remove in this example
                                numerical_columns=['Age', 'Heart rate', 'Systolic blood pressure', 'Diastolic blood pressure', 'Blood sugar', 'Troponin', "CK-MB"],
                                categorical_columns = None)

In [10]:
preprocessed_data = pd.DataFrame(preprocessor.fit_transform())
preprocessed_data["Result"] = (preprocessed_data["Result"] == "positive").astype(int)
preprocessed_data.columns = preprocessed_data.columns.astype(str)
preprocessed_data

|      | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Result | Gender |
|------|---|---|---|---|---|---|---|--------|--------|
| 0 | 0.572358 | -0.239032 | 1.257215 | 0.764927 | 0.178459 | -0.302342 | -0.290962 | 0 | 1 |
| 1 | -2.579640 | 0.303491 | -1.117098 | -1.872542 | 1.994344 | 0.605701 | -0.184072 | 1 | 1 |
| 2 | -0.087363 | -0.277784 | 1.257215 | 0.337229 | 1.647189 | -0.310140 | -0.286859 | 0 | 1 |
| 3 | 0.572358 | -0.161529 | -0.274600 | -1.230995 | 1.647189 | -0.207032 | -0.030324 | 1 | 1 |
| 4 | -0.087363 | -0.277784 | -0.580963 | -0.518166 | 2.047752 | -0.310140 | -0.306509 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1314 | -0.893688 | 0.303491 | -0.198009 | -0.375600 | 0.765951 | -0.307541 | -0.294633 | 0 | 1 |
| 1315 | 0.718963 | 0.109733 | -0.083123 | -1.230995 | 0.031586 | -0.163710 | -0.301111 | 1 | 1 |
| 1316 | -0.820385 | 0.129109 | 1.563578 | 2.261869 | -0.676074 | 3.369688 | -0.303054 | 1 | 1 |
| 1317 | -0.160665 | -0.394039 | -0.389486 | -0.304317 | 3.957101 | -0.001683 | -0.204587 | 1 | 1 |
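
As a sanity check, the first scaled `Age` value (column `0`) can be reproduced by hand from the `data.describe()` statistics. The sketch below assumes the preprocessor standardizes the way scikit-learn's `StandardScaler` does, that is with the population standard deviation (`ddof=0`), whereas `describe()` reports the sample standard deviation (`ddof=1`). The helper name `standardize` is made up for illustration.

```python
import math

n_rows = 1319                # number of rows, from data.info()
age_mean = 56.191812         # from data.describe()
age_std_sample = 13.647315   # sample std (ddof=1), from data.describe()

# Convert the sample std (ddof=1) into the population std (ddof=0)
age_std_pop = age_std_sample * math.sqrt((n_rows - 1) / n_rows)

def standardize(x, mean, std):
    """Return the z-score of x for the given mean and standard deviation."""
    return (x - mean) / std

# Age of the first row in data.head() is 64
z_first_row = standardize(64, age_mean, age_std_pop)
print(round(z_first_row, 4))
```

Under that assumption the result should agree, up to rounding, with the `0.572358` shown in the first row of the preprocessed table.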


In [11]:
preprocessor.save("data_preprocessing/preprocessing_1.joblib")

DataPreprocessor object saved to data_preprocessing/preprocessing_1.joblib


In [12]:
data = preprocessed_data

#### Train Test Split

In [13]:
# X = training features, y = target features
X = data.drop('Result', axis = 1)
y = data['Result']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

print("After splitting into training set and test set")
print("X_train.shape and X_test.shape: ", X_train.shape, ",", X_test.shape)
print("y_train.shape and y_test.shape: ", y_train.shape, ",", y_test.shape)

After splitting into training set and test set
X_train.shape and X_test.shape:  (1055, 8) , (264, 8)
y_train.shape and y_test.shape:  (1055,) , (264,)
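
The split above does not pass `stratify`, so the share of positive cases in the training and test sets can differ slightly by chance. A minimal sketch of a stratified variant follows; the labels are made up (60 positives, 40 negatives) and the `_toy` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60 positives and 40 negatives
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# With stratify, both splits keep roughly the original 60/40 class ratio
print(y_tr.mean(), y_te.mean())
```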


## 4. Select and Train model

1. Logistic Regression
2. Support Vector Machines Classifier
3. Nearest Neighbors Classifier
4. Naive Bayes Classifier
5. Decision Tree Classifier
6. Random Forest Classifier
7. Deep Learning Classifier (Neural Network)

In [14]:
log_cls = LogisticRegression(random_state=2, max_iter = 2000)
svm_cls = SVC(random_state = 13, max_iter = 2000)
knn_cls = KNeighborsClassifier(n_neighbors = 3)
nb_cls = GaussianNB()
dt_cls = DecisionTreeClassifier(random_state = 15)
rf_cls = RandomForestClassifier(random_state = 16, n_estimators = 100)
mlp_cls = MLPClassifier(random_state=17, max_iter = 2000, alpha = 0.001, hidden_layer_sizes = (3, 3))
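
`cross_val_score` is imported at the top of the notebook but not used in the cells shown. As an optional sketch, k-fold cross-validation could complement the single train/test split when comparing these classifiers. The data below is synthetic (`make_classification`), a stand-in for the heart attack dataset, and `demo_cls` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: 8 features, like the preprocessed table)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_cls = LogisticRegression(random_state=2, max_iter=2000)

# 5-fold cross-validated accuracy: one score per fold
scores = cross_val_score(demo_cls, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```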

In [15]:
classifier_name = ["Logistic Regression", "SVM", "KNN", "NB", "Decision Tree", "Random Forest", "MLP"]
classifier_filename = ["model/logistic.pkl", "model/svm.pkl", "model/knn.pkl", "model/nb.pkl", "model/dt.pkl", "model/rf.pkl", "model/mlp.pkl"]
classifier = [log_cls, svm_cls, knn_cls, nb_cls, dt_cls, rf_cls, mlp_cls]

In [16]:
def save_or_load_model(model, filename, action='save'):
    if action == 'save':
        with open(filename, 'wb') as file:
            pickle.dump(model, file)
        print(f"Model saved to {filename}")

    elif action == 'load':
        with open(filename, 'rb') as file:
            loaded_model = pickle.load(file)
        print(f"Model loaded from {filename}")
        return loaded_model
    else:
        raise ValueError("Invalid action. Use 'save' or 'load'.")
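
Note that `open(filename, 'wb')` does not create missing folders, so the `model/` directory has to exist before saving. The following round-trip sketch shows one way to handle that with `os.makedirs`; it uses a temporary directory and a plain dict in place of a trained model, and all names in it are illustrative.

```python
import os
import pickle
import tempfile

# A plain dict stands in for a trained model; any picklable object works
demo_model = {"name": "demo", "coef": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical path that mirrors the notebook's "model/<name>.pkl" layout
    filename = os.path.join(tmp_dir, "model", "demo.pkl")

    # Create the target folder first; open(..., "wb") does not create it
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    with open(filename, "wb") as file:
        pickle.dump(demo_model, file)

    with open(filename, "rb") as file:
        restored_model = pickle.load(file)

print(restored_model == demo_model)
```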

In [17]:
def train(classifier, X, y, cls_name = None):
    classifier.fit(X, y)
    if cls_name is not None:
        print(f"{cls_name} training completed!!!")
    else:
        print("Classifier training completed!!!")
    
    return classifier

In [18]:
def train_and_save_model(classifier, X, y, filename, cls_name=None):
    # Train the model
    trained_model = train(classifier, X, y, cls_name)

    # Save the trained model
    save_or_load_model(trained_model, filename)

    print(f"Model trained and saved to {filename}")
    return trained_model

In [19]:
trained_model_dict = dict()

for a, b, c in zip(classifier, classifier_filename, classifier_name):
    trained_model_dict[c] = train_and_save_model(a, X_train, y_train, b, cls_name=c)

Logistic Regression training completed!!!
Model saved to model/logistic.pkl
Model trained and saved to model/logistic.pkl
SVM training completed!!!
Model saved to model/svm.pkl
Model trained and saved to model/svm.pkl
KNN training completed!!!
Model saved to model/knn.pkl
Model trained and saved to model/knn.pkl
NB training completed!!!
Model saved to model/nb.pkl
Model trained and saved to model/nb.pkl
Decision Tree training completed!!!
Model saved to model/dt.pkl
Model trained and saved to model/dt.pkl
Random Forest training completed!!!
Model saved to model/rf.pkl
Model trained and saved to model/rf.pkl
MLP training completed!!!
Model saved to model/mlp.pkl
Model trained and saved to model/mlp.pkl


In [20]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluation_table(model_dict, X, y):
    # Collect one row of metrics per model and build the DataFrame once at the end.
    # (Concatenating onto an empty DataFrame triggers a pandas FutureWarning.)
    columns = ['Model', 'Accuracy', 'Precision (True)', 'Precision (False)', 'Recall (True)', 'Recall (False)', 'F1 Score (True)', 'F1 Score (False)', 'ROC-AUC Score']
    rows = []

    # Loop through the models in the dictionary
    for model_name, model in model_dict.items():
        y_pred = model.predict(X)
        accuracy = accuracy_score(y, y_pred)
        # labels=[1, 0] returns the positive class ("True") first, then the negative class ("False")
        precision_true, precision_false = precision_score(y, y_pred, average=None, labels=[1, 0])
        recall_true, recall_false = recall_score(y, y_pred, average=None, labels=[1, 0])
        f1_true, f1_false = f1_score(y, y_pred, average=None, labels=[1, 0])
        # ROC-AUC is computed here from hard class predictions, not probability scores
        roc_auc = roc_auc_score(y, y_pred)

        rows.append({
            'Model': str(model_name),
            'Accuracy': accuracy,
            'Precision (True)': precision_true,
            'Precision (False)': precision_false,
            'Recall (True)': recall_true,
            'Recall (False)': recall_false,
            'F1 Score (True)': f1_true,
            'F1 Score (False)': f1_false,
            'ROC-AUC Score': roc_auc
        })

        print(str(model_name) + "  ... Completed")

    eval_df = pd.DataFrame(rows, columns=columns)

    # Format the numeric columns as percentages
    eval_df = eval_df.style.format({col: "{:.3%}" for col in columns if col != 'Model'})

    return eval_df
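
The "(True)" columns report a metric with class `1` (positive) as the class of interest, and the "(False)" columns do the same for class `0` (negative). A small worked example with made-up labels shows how the two views differ; the helper `precision_recall` is hypothetical and written in plain Python to expose the counting.

```python
# Made-up labels: 1 = positive (heart attack), 0 = negative
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

def precision_recall(y_true, y_pred, label):
    """Precision and recall when `label` is treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed instances of label
    return tp / (tp + fp), tp / (tp + fn)

precision_true, recall_true = precision_recall(y_true_demo, y_pred_demo, label=1)
precision_false, recall_false = precision_recall(y_true_demo, y_pred_demo, label=0)

print(precision_true, recall_true)    # 3/5 and 3/4
print(precision_false, recall_false)  # 4/5 and 4/6
```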

In [21]:
train_evaluation = evaluation_table(trained_model_dict, X_train, y_train)
train_evaluation.data.to_csv('evaluation/train_evaluation.csv', index=False, encoding='utf-8')
train_evaluation

Logistic Regression  ... Completed
SVM  ... Completed
KNN  ... Completed
NB  ... Completed
Decision Tree  ... Completed
Random Forest  ... Completed
MLP  ... Completed


|   | Model | Accuracy | Precision (True) | Precision (False) | Recall (True) | Recall (False) | F1 Score (True) | F1 Score (False) | ROC-AUC Score |
|---|-------|----------|------------------|-------------------|---------------|----------------|-----------------|------------------|---------------|
| 0 | Logistic Regression | 81.137% | 86.364% | 73.804% | 82.226% | 79.412% | 84.244% | 76.505% | 80.819% |
| 1 | SVM | 77.725% | 82.087% | 70.944% | 81.453% | 71.814% | 81.769% | 71.376% | 76.633% |
| 2 | KNN | 83.412% | 87.700% | 77.156% | 84.853% | 81.127% | 86.253% | 79.092% | 82.990% |
| 3 | NB | 78.957% | 99.766% | 64.809% | 65.842% | 99.755% | 79.330% | 78.571% | 82.799% |
| 4 | Decision Tree | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% |
| 5 | Random Forest | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% | 100.000% |
| 6 | MLP | 92.417% | 96.705% | 86.607% | 90.726% | 95.098% | 93.620% | 90.654% | 92.912% |


In [22]:
test_evaluation = evaluation_table(trained_model_dict, X_test, y_test)
test_evaluation.data.to_csv('evaluation/test_evaluation.csv', index=False, encoding='utf-8')
test_evaluation

Logistic Regression  ... Completed
SVM  ... Completed
KNN  ... Completed
NB  ... Completed
Decision Tree  ... Completed
Random Forest  ... Completed
MLP  ... Completed


|   | Model | Accuracy | Precision (True) | Precision (False) | Recall (True) | Recall (False) | F1 Score (True) | F1 Score (False) | ROC-AUC Score |
|---|-------|----------|------------------|-------------------|---------------|----------------|-----------------|------------------|---------------|
| 0 | Logistic Regression | 79.924% | 81.977% | 76.087% | 86.503% | 69.307% | 84.179% | 72.539% | 77.905% |
| 1 | SVM | 73.485% | 75.978% | 68.235% | 83.436% | 57.426% | 79.532% | 62.366% | 70.431% |
| 2 | KNN | 63.258% | 69.880% | 52.041% | 71.166% | 50.495% | 70.517% | 51.256% | 60.830% |
| 3 | NB | 78.788% | 99.083% | 64.516% | 66.258% | 99.010% | 79.412% | 78.125% | 82.634% |
| 4 | Decision Tree | 97.727% | 98.160% | 97.030% | 98.160% | 97.030% | 98.160% | 97.030% | 97.595% |
| 5 | Random Forest | 97.727% | 98.160% | 97.030% | 98.160% | 97.030% | 98.160% | 97.030% | 97.595% |
| 6 | MLP | 91.288% | 93.210% | 88.235% | 92.638% | 89.109% | 92.923% | 88.670% | 90.873% |


## 5. Test the system

In [23]:
mlp_model = save_or_load_model(mlp_cls, "model/mlp.pkl", action='load')

Model loaded from model/mlp.pkl


In [24]:
preprocessor.load("data_preprocessing/preprocessing_1.joblib")

DataPreprocessor object loaded from data_preprocessing/preprocessing_1.joblib


In [25]:
record = {
    'Age': 12,
    "Gender": 1,
    'Heart rate': 80,
    'Systolic blood pressure': 120,
    "Diastolic blood pressure": 60,
    "Blood sugar": 100,
    "CK-MB": 3.1,
    "Troponin": 1
}

record = pd.DataFrame([record])
record

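|   | Age | Gender | Heart rate | Systolic blood pressure | Diastolic blood pressure | Blood sugar | CK-MB | Troponin |
|---|-----|--------|------------|-------------------------|--------------------------|-------------|-------|----------|
| 0 | 12 | 1 | 80 | 120 | 60 | 100 | 3.1 | 1 |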


In [26]:
preprocessed_data = pd.DataFrame(preprocessor.transform(record))
preprocessed_data.columns = preprocessed_data.columns.astype(str)
preprocessed_data["Gender"] = record["Gender"].values  # carry over Gender from the input record instead of hard-coding 1
preprocessed_data

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | Gender |
|---|---|---|---|---|---|---|---|--------|
| 0 | -3.23936 | 0.032229 | -0.2746 | -0.874581 | -0.622666 | 0.553714 | -0.26289 | 1 |


In [27]:
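# Named `prediction` rather than `re` to avoid shadowing Python's standard `re` module name
prediction = mlp_model.predict(preprocessed_data)
prediction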

array([1])
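
The output `array([1])` uses the binary encoding from the preprocessing step, where `1` stands for `positive` and `0` for `negative`. A small helper can translate predictions back into those labels; `LABELS` and `decode_prediction` are illustrative names, not part of the project code.

```python
# Mapping follows the preprocessing step: "positive" -> 1, anything else -> 0
LABELS = {0: "negative", 1: "positive"}

def decode_prediction(predicted):
    """Map an iterable of 0/1 predictions back to the dataset's labels."""
    return [LABELS[int(value)] for value in predicted]

print(decode_prediction([1]))  # ['positive']
```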