## Data Processing

In [None]:
df = pd.read_csv('heart.csv')

In [None]:
df.rename(columns={'age':'Age',
                     'sex':'Sex',
                     'cp': 'Chest Pain',
                     'trestbps':'Resting Blood Pressure',
                     'chol':'Cholestrol',
                     'fbs':'Blood Sugar',
                     'restecg':'Resting ECG',
                     'thalach': 'Max Heart Rate',
                     'exang': 'Exercise Induced Angina',
                     'oldpeak': 'Depression',
                     'slope': 'Slope',
                     'ca': 'Vessels colored by flourosopy',
                     'thal': 'Thallium',
                     'target': 'Heart Condition'
},inplace=True)

The column names are changed to make the dataset easier to understand.

In [None]:
df

Unnamed: 0,Age,Sex,Chest Pain,Resting Blood Pressure,Cholestrol,Blood Sugar,Resting ECG,Max Heart Rate,Exercise Induced Angina,Depression,Slope,Vessels colored by flourosopy,Thallium,Heart Condition
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


In [None]:
df.isnull().sum()

Age                              0
Sex                              0
Chest Pain                       0
Resting Blood Pressure           0
Cholestrol                       0
Blood Sugar                      0
Resting ECG                      0
Max Heart Rate                   0
Exercise Induced Angina          0
Depression                       0
Slope                            0
Vessels colored by flourosopy    0
Thallium                         0
Heart Condition                  0
dtype: int64

In [None]:
df.dtypes

Age                                int64
Sex                                int64
Chest Pain                         int64
Resting Blood Pressure             int64
Cholestrol                         int64
Blood Sugar                        int64
Resting ECG                        int64
Max Heart Rate                     int64
Exercise Induced Angina            int64
Depression                       float64
Slope                              int64
Vessels colored by flourosopy      int64
Thallium                           int64
Heart Condition                    int64
dtype: object

## **MODEL TRAINING AND PREDICTION**


**SVC (Support Vector Classifier)** - SVC is a supervised machine learning algorithm used for classification; it belongs to the support vector machine family, which also covers regression. It works by finding a hyperplane in an N-dimensional space (where N is the number of features) that separates the classes. SVC is particularly effective in high-dimensional spaces and handles both linear and non-linear classification problems through the kernel trick.
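As a rough illustration of the hyperplane idea, the sketch below classifies points by the sign of `w·x + b`. The weights `w` and bias `b` are made-up values for a two-feature example, not parameters learned by the `SVC` models in this notebook.

```python
def linear_decision(x, w, b):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Class 1 if the point lies on the non-negative side of the hyperplane, else 0."""
    return 1 if linear_decision(x, w, b) >= 0 else 0


# Hypothetical hyperplane in a 2-feature space: x1 + x2 - 3 = 0
w, b = [1.0, 1.0], -3.0
print(predict([2.5, 2.0], w, b))  # point on the positive side
print(predict([0.5, 1.0], w, b))  # point on the negative side
```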

**RandomForestClassifier** - Random Forest is an ensemble learning method that operates by constructing multiple decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It is known for its robustness and ability to handle large datasets with high dimensionality.
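The "mode of the classes" step can be sketched as a simple majority vote. The votes below are invented for illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplification of the idea, not the library's exact procedure.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common class label among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for one patient (1 = heart condition)
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
print(forest_prediction)
```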

**DecisionTreeClassifier** - A Decision Tree Classifier is a type of supervised learning algorithm that is mostly used for classification problems. It works by creating a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. The model is built by splitting the dataset into subsets based on the values of input features, and these splits are made recursively in a manner called recursive partitioning.
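One way to picture how a single split is chosen is to score every candidate threshold by the weighted Gini impurity of the two resulting subsets. The sketch below does this for one feature; the values and labels are made up, and a real tree repeats the search over all features and then recurses into each subset.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_threshold(values, labels):
    """Try each observed value as a 'value <= t' rule; keep the lowest weighted impurity."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must send samples to both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up feature values and labels, for illustration only
max_heart_rate = [106, 118, 125, 141, 155, 161, 164, 168]
has_condition = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_threshold(max_heart_rate, has_condition))
```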

**LogisticRegression** - Logistic Regression is a statistical model used for binary classification problems. It is a linear model that uses the logistic function to model a binary dependent variable. The logistic function, also known as the sigmoid function, is an S-shaped curve that maps any real-valued number to a value between 0 and 1. This output can be interpreted as the probability of the positive class.
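A minimal sketch of the sigmoid mapping follows. The input `z` stands for the linear combination of features, and the 0.5 cut-off is an assumed default for turning the probability into a class.

```python
import math


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(z, threshold=0.5):
    """Turn the probability into a 0/1 class using a cut-off (0.5 assumed here)."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
```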

**Confusion Matrix** - A confusion matrix is a table used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. It makes the performance of an algorithm easy to inspect. For binary classification the matrix is 2x2, where the rows represent the actual classes and the columns represent the predicted classes. The diagonal elements count the points whose predicted label equals the true label, while the off-diagonal elements count those mislabeled by the classifier.
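The layout described above (rows are actual classes, columns are predicted classes) can be sketched by counting label pairs directly. The labels here are made up for illustration.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]
cm = confusion_matrix_2x2(y_true, y_pred)
print(cm)
```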

**Classification Report** - A classification report is a text report showing the main classification metrics. It includes precision, recall, f1-score, and support for each class. Precision is the ability of the classifier not to label a negative sample as positive; recall is its ability to find all the positive samples.

##### **DATA SPLIT**

In [None]:
X = df.drop('Heart Condition', axis=1)
y = df['Heart Condition']

# X is the feature matrix and y holds the labels; hold out 10% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# Further split the training set into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
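The two successive 10% splits can be checked with a little arithmetic. This sketch assumes the held-out count is rounded up, which is consistent with the 93 validation samples shown in the reports below; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_held_out), rounding the held-out share up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_rows = 1025  # total rows in df (matches the support in the full-data reports)
n_train_full, n_test = split_sizes(n_rows, 0.1)
n_train, n_val = split_sizes(n_train_full, 0.1)
print(n_train, n_val, n_test)
```

Under that assumption the models are fitted on 829 rows, validated on 93, and tested on 103.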

#### MODEL ARCHITECTURES

##### **LOGISTIC REGRESSION**

In [None]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)


In [None]:
y_pred = lr_model.predict(X_val)

# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.8494623655913979
[[31 11]
 [ 3 48]]
              precision    recall  f1-score   support

           0       0.91      0.74      0.82        42
           1       0.81      0.94      0.87        51

    accuracy                           0.85        93
   macro avg       0.86      0.84      0.84        93
weighted avg       0.86      0.85      0.85        93



##### **DECISION TREE**

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [None]:
y_pred = dt_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93



##### **SUPPORT VECTOR MACHINE**

In [None]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

In [None]:
y_pred = svm_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))


Model Accuracy: 0.8602150537634409
[[30 12]
 [ 1 50]]
              precision    recall  f1-score   support

           0       0.97      0.71      0.82        42
           1       0.81      0.98      0.88        51

    accuracy                           0.86        93
   macro avg       0.89      0.85      0.85        93
weighted avg       0.88      0.86      0.86        93



##### **RANDOM FOREST**

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
val_score = rf_model.score(X_val, y_val)
print(f"Validation Accuracy: {val_score}")

# Optionally, evaluate the model on the test set
test_score = rf_model.score(X_test, y_test)
print(f"Test Accuracy: {test_score}")

Validation Accuracy: 1.0
Test Accuracy: 0.970873786407767


In [None]:
y_pred = rf_model.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93



In [None]:
joblib.dump(rf_model, "rf_model.pkl")

['rf_model.pkl']

In [None]:
# cm = confusion_matrix(y_val, y_pred)

# # Plotting the confusion matrix
# plt.figure(figsize=(10,7))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
# plt.title("Confusion Matrix (validation set)")
# plt.xlabel('Predicted')
# plt.ylabel('Actual')
# plt.show()

##### **XGBOOST**

In [None]:
xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)
xgb_model.fit(X_train, y_train)

In [None]:
y_pred = xgb_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

val_score = xgb_model.score(X_val, y_val)
print(f"Validation Accuracy: {val_score}")

Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93

Validation Accuracy: 1.0


In [None]:
test_score = xgb_model.score(X_test, y_test)
print(f"Test Accuracy: {test_score}")

Test Accuracy: 0.970873786407767


In [None]:
joblib.dump(xgb_model, "xgb_model.pkl")

['xgb_model.pkl']

### MODEL TRAINING WITH FEATURE SELECTION

##### **DATA SPLIT**

In [None]:
data = df
data.head(2)

Unnamed: 0,Age,Sex,Chest Pain,Resting Blood Pressure,Cholestrol,Blood Sugar,Resting ECG,Max Heart Rate,Exercise Induced Angina,Depression,Slope,Vessels colored by flourosopy,Thallium,Heart Condition
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0


In [None]:
data.columns

Index(['Age', 'Sex', 'Chest Pain', 'Resting Blood Pressure', 'Cholestrol',
       'Blood Sugar', 'Resting ECG', 'Max Heart Rate',
       'Exercise Induced Angina', 'Depression', 'Slope',
       'Vessels colored by flourosopy', 'Thallium', 'Heart Condition'],
      dtype='object')

**1. The heart data is split into training and validation sets only, with the validation set taking 10% of the data being split.**

**2. A separate dataset (heart_statlog_cleveland_hungary_final) is used to check how well the model generalizes to new, unseen data.**

In [None]:
X = df.drop(['Heart Condition', 'Vessels colored by flourosopy', 'Thallium'], axis=1)
y = df['Heart Condition']


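# Split the training set into train and validation sets.
# NOTE: as written, this re-splits the earlier X_train/y_train (all 13 features);
# pass X and y here instead to train on the reduced feature set defined above.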
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

#### MODELS

##### **LOGISTIC REGRESSION**

In [None]:
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)


In [None]:
y_pred = lr_model.predict(X_val)

# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.7469879518072289
[[29 14]
 [ 7 33]]
              precision    recall  f1-score   support

           0       0.81      0.67      0.73        43
           1       0.70      0.82      0.76        40

    accuracy                           0.75        83
   macro avg       0.75      0.75      0.75        83
weighted avg       0.76      0.75      0.75        83



##### **DECISION TREE**

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [None]:
y_pred = dt_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.963855421686747
[[42  1]
 [ 2 38]]
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.97      0.95      0.96        40

    accuracy                           0.96        83
   macro avg       0.96      0.96      0.96        83
weighted avg       0.96      0.96      0.96        83



##### **SUPPORT VECTOR MACHINE**

In [None]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

In [None]:
y_pred = svm_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.7469879518072289
[[29 14]
 [ 7 33]]
              precision    recall  f1-score   support

           0       0.81      0.67      0.73        43
           1       0.70      0.82      0.76        40

    accuracy                           0.75        83
   macro avg       0.75      0.75      0.75        83
weighted avg       0.76      0.75      0.75        83



##### **RANDOM FOREST**

In [None]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

In [None]:
val_score = rf_model.score(X_val, y_val)
print(f"Validation Accuracy: {val_score}")

# Optionally, evaluate the model on the test set
test_score = rf_model.score(X_test, y_test)
print(f"Test Accuracy: {test_score}")

Validation Accuracy: 0.963855421686747
Test Accuracy: 0.9514563106796117


In [None]:
y_pred = rf_model.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.963855421686747
[[42  1]
 [ 2 38]]
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.97      0.95      0.96        40

    accuracy                           0.96        83
   macro avg       0.96      0.96      0.96        83
weighted avg       0.96      0.96      0.96        83



##### **XGBOOST**

In [None]:
# initialize the xgb classifier
xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)
xgb_model.fit(X_train, y_train)

In [None]:
# model predictions
y_pred = xgb_model.predict(X_val)
# model metrics
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))

Model Accuracy: 0.963855421686747
[[42  1]
 [ 2 38]]
              precision    recall  f1-score   support

           0       0.95      0.98      0.97        43
           1       0.97      0.95      0.96        40

    accuracy                           0.96        83
   macro avg       0.96      0.96      0.96        83
weighted avg       0.96      0.96      0.96        83



In [None]:
val_score = xgb_model.score(X_val, y_val)
print(f"Validation Accuracy: {val_score}")

# Optionally, evaluate the model on the test set
test_score = xgb_model.score(X_test, y_test)
print(f"Test Accuracy: {test_score}")

Validation Accuracy: 0.963855421686747
Test Accuracy: 0.9514563106796117


**MODEL INFERENCE ON THE HEART_STATLOG_CLEVELAND... SET**

In [None]:
inf_set = pd.read_csv('heart_statlog_cleveland_hungary_final.csv')
inf_set.head(2)

Unnamed: 0,age,sex,chest pain type,resting bp s,cholesterol,fasting blood sugar,resting ecg,max heart rate,exercise angina,oldpeak,ST slope,target
0,40,1,2,140,289,0,0,172,0,0.0,1,0
1,49,0,3,160,180,0,0,156,0,1.0,2,1


In [None]:
inf_set.columns

Index(['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
       'fasting blood sugar', 'resting ecg', 'max heart rate',
       'exercise angina', 'oldpeak', 'ST slope', 'target'],
      dtype='object')

In [None]:
inf_set.rename(columns={'age':'Age',
                     'sex':'Sex',
                     'chest pain type': 'Chest Pain',
                     'resting bp s':'Resting Blood Pressure',
                     'cholesterol':'Cholestrol',
                     'fasting blood sugar':'Blood Sugar',
                     'resting ecg':'Resting ECG',
                     'max heart rate': 'Max Heart Rate',
                     'exercise angina': 'Exercise Induced Angina',
                     'oldpeak': 'Depression',
                     'ST slope': 'Slope',
                     'target': 'Heart Condition'
},inplace=True)
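Before predicting on a second dataset, it can help to confirm that its columns line up with the features the models were fitted on. The sketch below hard-codes the two column lists shown in this notebook (after renaming, with the target removed) and reports any mismatch; the variable names are illustrative.

```python
# Feature columns from df.columns, minus the target
train_features = [
    'Age', 'Sex', 'Chest Pain', 'Resting Blood Pressure', 'Cholestrol',
    'Blood Sugar', 'Resting ECG', 'Max Heart Rate', 'Exercise Induced Angina',
    'Depression', 'Slope', 'Vessels colored by flourosopy', 'Thallium',
]
# Columns of inf_set after the rename above, minus the target
inference_features = [
    'Age', 'Sex', 'Chest Pain', 'Resting Blood Pressure', 'Cholestrol',
    'Blood Sugar', 'Resting ECG', 'Max Heart Rate', 'Exercise Induced Angina',
    'Depression', 'Slope',
]

missing = [c for c in train_features if c not in inference_features]
extra = [c for c in inference_features if c not in train_features]
print("Missing from inference set:", missing)
print("Not seen in training:", extra)
```

A model fitted on all 13 features cannot score the second dataset directly, since two of its inputs are absent there.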

In [None]:
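# NOTE: X and y are built from df (the original heart.csv data), not from inf_set,
# so the reports below cover the 1025 rows of df rather than the external dataset.
X = df.drop(['Heart Condition'], axis=1)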
y = df['Heart Condition']


##### **MODEL PERFORMANCE**

In [None]:
# prediction for xgboost
predictions = xgb_model.predict(X)

# Evaluate the model's performance
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy}")

# Print classification report for detailed evaluation metrics
print(classification_report(y, predictions))

Accuracy: 0.9902439024390244
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       499
           1       0.99      0.99      0.99       526

    accuracy                           0.99      1025
   macro avg       0.99      0.99      0.99      1025
weighted avg       0.99      0.99      0.99      1025



In [None]:
y_pred = rf_model.predict(X)

# model metrics
accuracy = accuracy_score(y, y_pred)
print(f"Model Accuracy: {accuracy}")

print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

Model Accuracy: 0.9902439024390244
[[495   4]
 [  6 520]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       499
           1       0.99      0.99      0.99       526

    accuracy                           0.99      1025
   macro avg       0.99      0.99      0.99      1025
weighted avg       0.99      0.99      0.99      1025



In [None]:
# prediction for decision tree
predictions = dt_model.predict(X)

# Evaluate the model's performance
accuracy = accuracy_score(y, predictions)
print(f"Accuracy: {accuracy:.4f}")

# Print classification report for detailed evaluation metrics
print(classification_report(y, predictions))

Accuracy: 0.9902
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       499
           1       0.99      0.99      0.99       526

    accuracy                           0.99      1025
   macro avg       0.99      0.99      0.99      1025
weighted avg       0.99      0.99      0.99      1025



**METRICS USED**
1. Precision:
Precision measures the proportion of true positive predictions among all positive predictions made by the classifier.

2. Recall (Sensitivity):
Recall, also known as sensitivity or true positive rate, measures the proportion of true positive predictions among all actual positive instances in the dataset.

3. F1-score:
The F1-score is the harmonic mean of precision and recall. It provides a balanced measure of a model's performance, especially in scenarios where class imbalance exists.

4. Support:
Support refers to the number of instances in each class in the dataset.

5. Accuracy:
Accuracy measures the overall correctness of the classifier's predictions across all classes.

6. Macro Avg and Weighted Avg:
Macro average (macro avg) calculates the unweighted mean of precision, recall, and F1-score across classes, so every class counts equally regardless of its size.
Weighted average (weighted avg) calculates the average of precision, recall, and F1-score considering the support (number of instances) of each class, giving more weight to classes with more instances.
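To make these definitions concrete, the sketch below recomputes the metrics from the confusion matrix printed for logistic regression on the first validation split. `class_metrics` is a hypothetical helper written for this illustration; it assumes rows are actual classes and columns are predicted classes.

```python
def class_metrics(cm, positive):
    """Precision, recall, F1 and support for one class of a 2x2 matrix
    whose rows are actual classes and columns are predicted classes."""
    negative = 1 - positive
    tp = cm[positive][positive]
    fp = cm[negative][positive]  # predicted `positive`, actually the other class
    fn = cm[positive][negative]  # actually `positive`, predicted the other class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, tp + fn


# Confusion matrix printed for logistic regression on the first validation split
cm = [[31, 11], [3, 48]]
rows = [class_metrics(cm, c) for c in (0, 1)]
total = sum(r[3] for r in rows)

accuracy = (cm[0][0] + cm[1][1]) / total
macro_f1 = sum(r[2] for r in rows) / len(rows)           # every class counts equally
weighted_f1 = sum(r[2] * r[3] for r in rows) / total     # weighted by support

for label, (p, r, f1, support) in enumerate(rows):
    print(label, round(p, 2), round(r, 2), round(f1, 2), support)
print(round(accuracy, 2), round(macro_f1, 2), round(weighted_f1, 2))
```

The rounded values agree with the corresponding classification report shown earlier.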

#### **PROBLEM, MODEL SELECTION AND MODEL SUMMARY**

PROBLEM TYPE: Classification Problem

MODELS CONSIDERED and reasons for consideration
1. Logistic Regression:
* Model Complexity: Logistic regression is a relatively simple model with low complexity. It uses a linear decision boundary to separate classes.

* Classification Effectiveness: Logistic regression is effective for binary classification tasks and can handle well-separated classes. It performs particularly well when the decision boundary is linear or when features have a linear relationship with the log-odds of the target.

* Pattern Capture: Logistic regression captures patterns through linear combinations of features. It assumes that the log-odds of the target variable are a linear function of the input features.

* Underlying Logic: The underlying logic of logistic regression is based on the logistic function (sigmoid function), which maps the linear combination of features to probabilities between 0 and 1. The decision boundary is determined by the threshold probability (e.g., 0.5 for binary classification).

* Model Performance:
Model Accuracy: 0.8494623655913979
[[31 11]
 [ 3 48]]
              precision    recall  f1-score   support

           0       0.91      0.74      0.82        42
           1       0.81      0.94      0.87        51

    accuracy                           0.85        93
   macro avg       0.86      0.84      0.84        93
weighted avg       0.86      0.85      0.85        93



2. Decision Trees:
* Model Complexity: Decision trees can have varying complexity levels depending on the depth, number of nodes, and splits in the tree. They can capture complex non-linear relationships.

* Classification Effectiveness: Decision trees are effective for both binary and multi-class classification tasks. They can handle non-linear decision boundaries and are robust to outliers.

* Pattern Capture: Decision trees capture patterns by recursively splitting the feature space based on the most informative features at each node. They create hierarchical decision rules that partition the data into classes.

* Underlying Logic: The underlying logic of decision trees involves selecting features that maximize information gain (or minimize impurity) at each split. The final prediction is made by traversing the tree from the root to a leaf node based on feature values.

* Model Performance:
Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93



3. Random Forests (Ensemble of Decision Trees):
* Model Complexity: Random forests consist of multiple decision trees (ensemble method), which collectively form a more complex model than a single decision tree.

* Classification Effectiveness: Random forests are highly effective for classification tasks, including binary and multi-class problems. They reduce overfitting compared to individual decision trees and provide improved generalization.

* Pattern Capture: Random forests capture patterns by aggregating predictions from multiple decision trees. Each tree focuses on different subsets of features and instances, leading to diverse models that collectively capture complex patterns in the data.

* Underlying Logic: The underlying logic of random forests involves building a collection of decision trees through bootstrapping (sampling with replacement) and random feature selection. The final prediction is made by averaging or voting among the predictions of individual trees.

* Model Performance:
Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93




4. Support Vector Machines (SVM):
* Model Complexity: SVMs can have varying complexity depending on the choice of kernel (linear, polynomial, radial basis function, etc.) and regularization parameters. They can capture complex decision boundaries.

* Classification Effectiveness: SVMs are effective for binary and multi-class classification tasks. They work well in high-dimensional spaces and can handle non-linear relationships through kernel tricks.

* Pattern Capture: SVMs capture patterns by finding the hyperplane that best separates classes in the feature space. The kernel trick allows them to map data into higher-dimensional spaces where classes are more easily separable.

* Underlying Logic: The underlying logic of SVMs involves maximizing the margin (distance) between the hyperplane and the closest data points (support vectors) of different classes. The optimal hyperplane serves as the decision boundary.

* Model Performance:
Model Accuracy: 0.8602150537634409
[[30 12]
 [ 1 50]]
              precision    recall  f1-score   support

           0       0.97      0.71      0.82        42
           1       0.81      0.98      0.88        51

    accuracy                           0.86        93
   macro avg       0.89      0.85      0.85        93
weighted avg       0.88      0.86      0.86        93




5. XGBoost (Gradient Boosting Trees):
* Model Complexity: XGBoost is an ensemble method that combines multiple decision trees (boosting). The model complexity depends on the number of trees, tree depth, and learning rate.

* Classification Effectiveness: XGBoost is highly effective for classification tasks, including binary and multi-class problems. It reduces bias and variance, leading to improved predictive performance.

* Pattern Capture: XGBoost captures patterns by sequentially fitting decision trees to the residuals of the previous trees. It focuses on correcting errors made by earlier trees, leading to an overall model that captures complex patterns and interactions in the data.

* Underlying Logic: The underlying logic of XGBoost involves gradient boosting, where each new tree is trained to minimize the residual errors of the ensemble. XGBoost uses regularization techniques and tree pruning to control model complexity and prevent overfitting.

* Model Performance:
Model Accuracy: 1.0
[[42  0]
 [ 0 51]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        42
           1       1.00      1.00      1.00        51

    accuracy                           1.00        93
   macro avg       1.00      1.00      1.00        93
weighted avg       1.00      1.00      1.00        93

Validation Accuracy: 1.0
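The residual-fitting idea described for XGBoost can be sketched with a toy booster. This is a simplified gradient boosting loop for squared error on one feature, using one-split "stumps"; it is not XGBoost's actual algorithm, which also uses second-order gradients and regularization. The data, `fit_stump`, and `boost` are all hypothetical and exist only for this illustration.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit each new stump to the residuals left by the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
        # Track the mean squared error after each round
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history


# Made-up one-feature data, for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
base, stumps, history = boost(xs, ys)
print(round(history[0], 4), round(history[-1], 4))
```

Each round shrinks the training error a little, which is why the number of trees and the learning rate together control model complexity.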


**BEST PERFORMING MODELS**

1. RANDOM FOREST, with a test accuracy of about 95% on fewer features
2. XGBOOST, also with a test accuracy of about 95% on fewer features

**Best Choice**
XGBOOST:
* Reason: XGBoost is very efficient in terms of storage and showed faster prediction than the Random Forest

