## 1. Problem Definition
>This model aims to classify a patient as either having heart disease or being healthy, based on the patient's health records.

## 2. Data
* The dataset was originally taken from the UCI repository: [https://archive.ics.uci.edu/ml/datasets/Heart+Disease](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)
* The dataset can also be found at [https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset)

## 3. Evaluation
Because the model predicts something as critical as whether a person has heart disease, the objective is to achieve more than 90% accuracy.

## 4. Features
The dataset has the following patient attributes:
* `age`
* `sex`
* `chest pain type` (4 values)
* `resting blood pressure`
* `serum cholesterol` in mg/dl
* `fasting blood sugar` > 120 mg/dl
* `resting electrocardiographic` results (values 0,1,2)
* `maximum heart rate` achieved
* `exercise induced angina`
* `oldpeak` = ST depression induced by exercise relative to rest
* the slope of the `peak exercise ST segment`
* number of major vessels (0-3) colored by fluoroscopy
* `thal`: 0 = normal; 1 = fixed defect; 2 = reversible defect
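
CSV distributions of this dataset usually store these attributes under abbreviated column names. The mapping below is a sketch based on that assumption: the abbreviations are not stated in this notebook, and `describe_columns` is a hypothetical helper, so compare the keys against `heart_df.columns` before relying on them.

```python
# Assumed short column names mapped to the descriptions in the list above.
# The abbreviations are an assumption; verify them against heart_df.columns.
FEATURE_DESCRIPTIONS = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "0 = normal; 1 = fixed defect; 2 = reversible defect",
}


def describe_columns(columns):
    """Return a description per column, flagging names missing from the mapping."""
    return {col: FEATURE_DESCRIPTIONS.get(col, "unknown column") for col in columns}


print(describe_columns(["age", "chol", "target"]))
```

Columns that are not in the mapping, such as the `target` label, are reported as `unknown column` rather than raising an error.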

### 5.1 Getting Started with standard imports

In [None]:
#base imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#import models
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#import for evaluation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay

#import for storing model
from joblib import load, dump

#### 5.2.1 Importing data


In [None]:
heart_df = pd.read_csv("./data/heart-disease.csv")
heart_df.head()

#### 5.2.2 Checking for missing values

In [None]:
heart_df.isna().sum()
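
`isna().sum()` prints a count for every column, including those with no gaps. A minimal sketch of narrowing that output to the columns that actually need attention is shown here; `missing_report` is a hypothetical helper and `toy_df` holds made-up values standing in for `heart_df`.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return only the columns that contain missing values, with their counts."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy frame standing in for heart_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "age": [63, 37, np.nan],
    "chol": [233, 250, 204],
})

report = missing_report(toy_df)
print(report)
```

An empty result means no column has missing values, so no imputation or row dropping is needed before modelling.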

#### 5.2.3 Splitting into independent and target variables

In [None]:
x = heart_df.drop("target", axis = 1)
y = heart_df["target"]
#random_state fixes the shuffle so the split is reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
x, y
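
With a small dataset, a random split can leave the two classes in different proportions in the training and test sets. The sketch below shows one way to guard against that using the `stratify` argument of `train_test_split`. The data is synthetic and the names `x_demo` and `y_demo` are illustrative, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (not the real dataset).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 40 + [1] * 60, name="target")

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positive labels in each part; both should match the full data.
print(y_tr.mean(), y_te.mean())
```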

### 5.3 Training models
* Using Linear Support Vector Classifier estimator

In [None]:
#Set random seed
np.random.seed(42)

#Hyper Parameters to use
hyperParameters = {
    "max_iter" : [1000, 2000, 3000, 4000, 5000]
}

#Instantiating model
linearSVCModel = LinearSVC()
rs_linearSVCModel = RandomizedSearchCV(linearSVCModel, hyperParameters, cv = 5, verbose = 2)

#Fitting data
rs_linearSVCModel.fit(x_train, y_train)

#Making predictions
y_preds1 = rs_linearSVCModel.predict(x_test)

#score of model
rs_linearSVCModel_score = rs_linearSVCModel.score(x_test, y_test)
print(f"Linear SVC Model accuracy = {rs_linearSVCModel_score:.2f}")

#confusion matrix display 
ConfusionMatrixDisplay.from_predictions(y_test, y_preds1)
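
Linear SVMs tend to converge more reliably when features share a similar scale, and the attributes here (for example blood pressure versus `oldpeak`) sit on quite different ranges. The sketch below shows one possible way to add scaling with a pipeline. It runs on synthetic data from `make_classification`, so the printed accuracy says nothing about the heart disease dataset, and scaling is a suggestion rather than a step this notebook performs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data standing in for the 13 heart disease features.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Inside a pipeline the scaler is fitted on the training data only,
# so no information from the test split leaks into the preprocessing.
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000, random_state=42))
scaled_svc.fit(X_tr, y_tr)

svc_accuracy = scaled_svc.score(X_te, y_te)
print(f"Scaled LinearSVC accuracy = {svc_accuracy:.2f}")
```

A pipeline like this can be passed to `RandomizedSearchCV` in place of the bare estimator, in which case parameter names need the step prefix, for example `linearsvc__max_iter`.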

* Using KNeighborsClassifier estimator

In [None]:
#Set random seed
np.random.seed(42)

#Hyper Parameters to use
hyperParameters = {
    "n_neighbors": [3, 5, 7, 9],
    "algorithm": ["ball_tree", "kd_tree", "brute"]
}

#Instantiating model
model = KNeighborsClassifier()
rs_KNeighborsModel = RandomizedSearchCV(model, hyperParameters, cv = 5, verbose = 2)

#Fitting data (on the training split, so the test split stays unseen)
rs_KNeighborsModel.fit(x_train, y_train)

#Making predictions
y_preds2 = rs_KNeighborsModel.predict(x_test)

#score of model
rs_KNeighborModel_score = rs_KNeighborsModel.score(x_test, y_test)
print(f"KNeighbor Classifier Model accuracy = {rs_KNeighborModel_score:.2f}")

#confusion matrix display 
ConfusionMatrixDisplay.from_predictions(y_test, y_preds2)
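
After a randomized search finishes, it helps to look at which combination was selected and how it scored during cross-validation. The sketch below shows the relevant attributes on synthetic data. It passes `n_iter=5` and `random_state=42` explicitly, which the search above does not do.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

param_distributions = {
    "n_neighbors": [3, 5, 7, 9],
    "algorithm": ["ball_tree", "kd_tree", "brute"],
}

# n_iter candidates are sampled from the 12 possible combinations.
search = RandomizedSearchCV(
    KNeighborsClassifier(), param_distributions, n_iter=5, cv=5, random_state=42
)
search.fit(X_demo, y_demo)

print(search.best_params_)                       # winning combination
print(f"{search.best_score_:.2f}")               # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))         # number of candidates tried
```

`best_score_` is the mean accuracy over the validation folds, which is a separate figure from the score on a held-out test set.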

* Using RandomForestClassifier estimator

In [None]:
#Set random seed
np.random.seed(42)

#Hyper Parameters to use
hyperParameters = {
    "n_estimators": [10, 100, 200, 500, 1000, 1200],
    "max_depth": [None, 5, 10, 20, 30],
    #"auto" was removed in scikit-learn 1.3; for classifiers it was equivalent to "sqrt"
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2, 4]
}

#Instantiating model
model = RandomForestClassifier()
rs_RandomForestClassifier = RandomizedSearchCV(model, hyperParameters, cv = 5, verbose = 2)

#Fitting data (on the training split, so the test split stays unseen)
rs_RandomForestClassifier.fit(x_train, y_train)

#Making predictions
y_preds3 = rs_RandomForestClassifier.predict(x_test)

#score of model
rs_RandomForestClassifierModel_score = rs_RandomForestClassifier.score(x_test, y_test)
print(f"Random Forest Classifier Model accuracy = {rs_RandomForestClassifierModel_score:.2f}")

#confusion matrix display 
ConfusionMatrixDisplay.from_predictions(y_test, y_preds3)
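
The random forest search space is larger than it may look, and `RandomizedSearchCV` samples only part of it. The arithmetic below counts the combinations from the list lengths in the grid above. `n_iter = 10` is the scikit-learn default, which applies here because the call does not set it.

```python
from math import prod

# Number of candidate values per hyperparameter, taken from the grid above.
options_per_parameter = {
    "n_estimators": 6,
    "max_depth": 5,
    "max_features": 2,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
}

n_combinations = prod(options_per_parameter.values())  # size of the full grid
n_iter = 10   # RandomizedSearchCV default when n_iter is not passed
cv_folds = 5  # cv = 5 in the search above

total_fits = n_iter * cv_folds              # fits during cross-validation
sampled_fraction = n_iter / n_combinations  # share of the grid that is tried

print(n_combinations, total_fits, round(sampled_fraction, 3))
```

Under these assumptions the search tries 10 of 540 combinations, so raising `n_iter` is one way to explore more of the space at the cost of a longer run.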

## 6. Comparison

In [None]:
def evaluatePrediction(y_true, y_preds):
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    return {"accuracy": round(accuracy, 2),
           "precision": round(precision, 2),
           "recall": round(recall, 2),
           "f1-score": round(f1, 2)}


In [None]:
evaluate_metrics = pd.DataFrame({
    "Linear SVC": evaluatePrediction(y_test, y_preds1),
    "KNeighbors Classifier": evaluatePrediction(y_test, y_preds2),
    "Random Forest Classifier": evaluatePrediction(y_test, y_preds3)
})
evaluate_metrics.plot.bar()
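
The four metrics collected by `evaluatePrediction` can be checked by hand on a small example. The labels below are made up for illustration: they contain 3 true positives, 3 true negatives, 1 false positive and 1 false negative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up labels: 1 = heart disease, 0 = healthy.
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

demo_metrics = {
    "accuracy": accuracy_score(y_true_demo, y_pred_demo),    # (TP + TN) / all = 6 / 8
    "precision": precision_score(y_true_demo, y_pred_demo),  # TP / (TP + FP) = 3 / 4
    "recall": recall_score(y_true_demo, y_pred_demo),        # TP / (TP + FN) = 3 / 4
    "f1-score": f1_score(y_true_demo, y_pred_demo),          # harmonic mean of the two
}
print(demo_metrics)
```

In a screening setting, recall is often the figure to watch alongside accuracy, because a false negative means a patient with heart disease is classified as healthy.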

In [None]:
#save models

#Linear SVC
dump(rs_linearSVCModel, filename = "model/LinearSVCModel.joblib")

#KNeighbors Classifier
dump(rs_KNeighborsModel, filename = "model/KNeighborsClassifier.joblib")

#Random Forest Classifier
dump(rs_RandomForestClassifier, filename = "model/RandomForestClassifier.joblib")
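
A saved model is only useful if it can be restored and produces the same predictions. The sketch below shows a round trip with `dump` and `load`, using a small model trained on synthetic data and a temporary folder, so the file name `demo_model.joblib` is illustrative only. It also creates the target folder first, since `dump` does not create missing directories.

```python
import tempfile
from pathlib import Path

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Small model on synthetic data, standing in for the fitted searches above.
X_demo, y_demo = make_classification(n_samples=100, n_features=13, random_state=42)
demo_model = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir) / "model"
    model_dir.mkdir(parents=True, exist_ok=True)  # make sure the folder exists
    model_path = model_dir / "demo_model.joblib"

    dump(demo_model, model_path)    # write the fitted model to disk
    restored_model = load(model_path)  # read it back into memory

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print(same_predictions)
```

Files written by joblib should only be loaded from trusted sources, and preferably with the same scikit-learn version that produced them.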