# Introduction

## Heart Disease
In this notebook, we will try to answer some questions regarding heart disease as well as try to devise a way to predict whether or not a patient may have heart disease. We will attempt to determine what factors may cause heart disease or at least have some correlation to heart disease.

### What is Heart Disease
Heart disease is a general term that refers to many types of heart conditions, including coronary artery disease (CAD), arrhythmia, heart valve disease, and heart failure. CAD is the most common type of heart disease in the United States. It affects the major blood vessels that supply the heart muscle. Decreased blood flow to the heart can cause a heart attack.

Heart disease cannot be cured once it has been diagnosed. However, the factors that contributed to the development of coronary artery disease can be treated, which can reduce how the condition impacts your body.

According to CDC statistics:
- One person dies every 33 seconds from cardiovascular disease in the United States
- 695,000 people in the United States died from heart disease in 2021, which is 1 in every 5 deaths
- Heart disease cost the United States about $239.9 billion each year from 2018 to 2019. This includes the cost of health care services, medicines, and lost productivity due to death

As you can see, heart disease is a major concern for a lot of people in the United States. It affects many millions of people, either directly or indirectly. Preventing this disease, or at least catching it early, combined with the latest treatments and lifestyle changes, can result in major strides in longevity as well as quality of life well into older age.

# Data Wrangling
Import the data and make any modifications needed to work with it.

In [1]:
# Import libraries to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data from the csv file
df = pd.read_csv('HD_cleaned.csv')
df.head()

Unnamed: 0,General_Health,Checkup,Exercise,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Height_(cm),Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption,Heart_Disease
0,Poor,Within the past 2 years,No,No,No,No,No,Yes,Female,70-74,150,32.66,14.54,Yes,0,30,16,12,No
1,Very Good,Within the past year,No,No,No,No,Yes,No,Female,70-74,165,77.11,28.29,No,0,30,0,4,Yes
2,Very Good,Within the past year,Yes,No,No,No,Yes,No,Female,60-64,163,88.45,33.47,No,4,12,3,16,No
3,Poor,Within the past year,Yes,No,No,No,Yes,No,Male,75-79,180,93.44,28.73,No,0,30,30,8,Yes
4,Good,Within the past year,No,No,No,No,No,No,Male,80+,191,88.45,24.37,Yes,0,8,4,0,No


In [2]:
df.shape

(308854, 19)

In [3]:
df.dtypes

General_Health                   object
Checkup                          object
Exercise                         object
Skin_Cancer                      object
Other_Cancer                     object
Depression                       object
Diabetes                         object
Arthritis                        object
Sex                              object
Age_Category                     object
Height_(cm)                       int64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  object
Alcohol_Consumption               int64
Fruit_Consumption                 int64
Green_Vegetables_Consumption      int64
FriedPotato_Consumption           int64
Heart_Disease                    object
dtype: object

In [4]:
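# Split data into modeling and prediction sets
# NOTE: df is reassigned on the first line, so df_unseen is sampled from the
# modeling rows and the two sets overlap rather than being disjoint
df = df.sample(frac=.95, random_state=123).reset_index(drop=True)
df_unseen = df.sample(frac=.05, random_state=123).reset_index(drop=True)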

print('Data for modeling: ' + str(df.shape))
print('Data for predictions: ' + str(df_unseen.shape))

Data for modeling: (293411, 19)
Data for predictions: (14671, 19)
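
Because `df_unseen` is drawn from the already-reduced `df` in the cell above, its rows also appear in the modeling data. A minimal sketch of a disjoint 95/5 partition follows, assuming the intent is to keep the prediction rows out of modeling entirely. The toy frame and the names `full`, `modeling`, and `unseen` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the full dataset (the notebook itself reads HD_cleaned.csv)
full = pd.DataFrame({"BMI": range(100), "Heart_Disease": ["No"] * 92 + ["Yes"] * 8})

# Draw the modeling sample first, keeping the original index for now
modeling = full.sample(frac=0.95, random_state=123)

# Every row not drawn for modeling becomes the unseen set
unseen = full.drop(modeling.index)

# Reset indices only after the split, so the drop above uses the original labels
modeling = modeling.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

print("Data for modeling:", modeling.shape)
print("Data for predictions:", unseen.shape)
```

Dropping by index before calling `reset_index` is the key step: once the index is reset, the original row labels needed to separate the two sets are gone.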


In [5]:
# Set up the PyCaret environment
from pycaret.classification import *

In [6]:
clf_exp1 = setup(data = df, target = 'Heart_Disease', session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Heart_Disease
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(293411, 19)"
5,Transformed data shape,"(293411, 42)"
6,Transformed train set shape,"(205387, 42)"
7,Transformed test set shape,"(88024, 42)"
8,Ordinal features,7
9,Numeric features,7


# Exploratory Data Analysis

In [49]:
# Import EDA profiling report tools
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report", dark_mode=True)
profile




In [8]:
# profile.to_file('HD_EDA_Profile.html')

# Modeling

In [50]:
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.9194,0.8367,0.0343,0.5429,0.0645,0.0554,0.1213,0.658
gbc,Gradient Boosting Classifier,0.9192,0.8358,0.054,0.5102,0.0976,0.0834,0.1462,4.577
lr,Logistic Regression,0.919,0.8356,0.0623,0.4982,0.1108,0.0945,0.1546,0.754
svm,SVM - Linear Kernel,0.919,0.0,0.0376,0.4784,0.0653,0.0559,0.1059,2.274
ridge,Ridge Classifier,0.919,0.0,0.0014,0.5379,0.0028,0.0023,0.0232,0.338
dummy,Dummy Classifier,0.919,0.5,0.0,0.0,0.0,0.0,0.0,0.336
ada,Ada Boost Classifier,0.9184,0.8357,0.0768,0.4767,0.1321,0.1122,0.1666,1.971
catboost,CatBoost Classifier,0.9183,0.8338,0.0526,0.4635,0.0945,0.0792,0.135,9.94
xgboost,Extreme Gradient Boosting,0.9182,0.8312,0.0527,0.4568,0.0944,0.079,0.1337,2.675
rf,Random Forest Classifier,0.9178,0.8086,0.0414,0.424,0.0753,0.0618,0.1121,0.742
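
The Dummy Classifier row above matches the best models on accuracy (0.919) while scoring 0 on recall, which is the pattern a heavily imbalanced target tends to produce. A small sketch of that majority-class baseline, using a made-up label list with a similar class ratio (the counts are illustrative, not taken from the dataset):

```python
from collections import Counter

# Illustrative labels only: 919 negatives and 81 positives
labels = ["No"] * 919 + ["Yes"] * 81

# A majority-class baseline always predicts the most common label
majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

# Recall for the positive class is zero because "Yes" is never predicted
true_positives = sum(1 for y in labels if y == "Yes" and majority_label == "Yes")
baseline_recall = true_positives / labels.count("Yes")

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline recall:   {baseline_recall:.3f}")
```

With a target this skewed, recall, F1, and AUC are likely to be more informative than accuracy when ranking the models.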


In [10]:
lightgbm = create_model('lightgbm')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.9185,0.8354,0.0337,0.459,0.0627,0.0523,0.1071
1,0.9199,0.836,0.0379,0.578,0.0711,0.0618,0.1331
2,0.9196,0.8347,0.0355,0.5566,0.0667,0.0576,0.1256
3,0.9199,0.8345,0.0385,0.5872,0.0722,0.0629,0.1355
4,0.9192,0.8277,0.0343,0.5229,0.0643,0.0549,0.1183
5,0.9195,0.8422,0.0349,0.5472,0.0655,0.0564,0.123
6,0.9197,0.8357,0.0349,0.5743,0.0657,0.057,0.1271
7,0.9199,0.841,0.0391,0.5804,0.0732,0.0637,0.1356
8,0.9199,0.8424,0.0307,0.6071,0.0584,0.051,0.1236
9,0.9182,0.8378,0.0241,0.4167,0.0455,0.037,0.0843


In [11]:
# tuned_lightgbm = tune_model(lightgbm)

Although tuning raised the AUC slightly, every other score dropped, so we will use the standard lightgbm model rather than the tuned one.

In [12]:
lr = create_model('lr')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.9176,0.8367,0.0619,0.4364,0.1085,0.0902,0.1405
1,0.92,0.8347,0.0679,0.5485,0.1209,0.1049,0.1725
2,0.9183,0.8327,0.0601,0.4651,0.1065,0.0896,0.1448
3,0.9197,0.8314,0.0727,0.5307,0.1279,0.1105,0.1746
4,0.9184,0.8279,0.0547,0.4715,0.098,0.0826,0.1394
5,0.919,0.8407,0.0565,0.5027,0.1016,0.0866,0.1481
6,0.9192,0.8341,0.0655,0.5093,0.1161,0.0995,0.1611
7,0.9209,0.8431,0.0746,0.5933,0.1325,0.1165,0.1904
8,0.9196,0.8389,0.0643,0.5271,0.1147,0.0988,0.1634
9,0.9171,0.8363,0.0451,0.3968,0.081,0.0656,0.1116


In [13]:
# tuned_lr = tune_model(lr)

Again, tuning improved only a few scores, and too little to make the tuned model worthwhile, so we will use the standard version of the model.

In [14]:
nb = create_model('nb')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.6236,0.8063,0.8665,0.161,0.2716,0.1564,0.2579
1,0.6272,0.8009,0.8382,0.1587,0.2669,0.1514,0.2464
2,0.616,0.8017,0.84,0.1549,0.2616,0.1446,0.2398
3,0.6233,0.8011,0.845,0.1582,0.2666,0.1507,0.2471
4,0.6187,0.7977,0.8347,0.1553,0.2618,0.145,0.239
5,0.6273,0.8085,0.8534,0.1608,0.2706,0.1555,0.254
6,0.6244,0.8049,0.8588,0.1604,0.2703,0.155,0.2547
7,0.6261,0.8142,0.8629,0.1615,0.2721,0.1571,0.2578
8,0.619,0.802,0.8491,0.1572,0.2652,0.1489,0.2463
9,0.6209,0.805,0.8497,0.1579,0.2663,0.1503,0.2478


In [15]:
# tuned_nb = tune_model(nb)

Once again, we will use the standard model, as it will complement the other models with its higher recall.

In [16]:
# rf = create_model('rf')

In [17]:
# tuned_rf = tune_model(rf)


In [18]:
# Choosing different fonts for plots because of font warnings

plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'  # You can choose a different font if you like


In [19]:
evaluate_model(lightgbm)


In [20]:
evaluate_model(lr)


In [21]:
evaluate_model(nb)


In [22]:
# evaluate_model(rf)

In [23]:
# blend_soft_standard = blend_models([lightgbm, lr, nb, rf], method = 'soft')

In [24]:
blend_soft_standard2 = blend_models([lightgbm, lr, nb], method = 'soft')

Unnamed: 0_level_0,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.8911,0.8301,0.3428,0.3327,0.3377,0.2784,0.2784
1,0.8935,0.8257,0.3331,0.3393,0.3362,0.2783,0.2783
2,0.8952,0.8237,0.3542,0.3531,0.3536,0.2966,0.2966
3,0.893,0.825,0.354,0.344,0.3489,0.2906,0.2907
4,0.8942,0.8196,0.3377,0.3439,0.3408,0.2833,0.2833
5,0.8953,0.8313,0.3606,0.3559,0.3582,0.3012,0.3012
6,0.8927,0.8293,0.3516,0.3421,0.3468,0.2883,0.2884
7,0.8946,0.8362,0.3602,0.3524,0.3562,0.2988,0.2989
8,0.8946,0.8295,0.3536,0.3504,0.352,0.2946,0.2946
9,0.8951,0.8296,0.365,0.3558,0.3603,0.3032,0.3032


Because of its higher F1 score and only slightly lower accuracy, we will use the second blended model, `blend_soft_standard2`, going forward.
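
For context, soft voting averages the class probabilities predicted by each member model and picks the class with the highest mean. A minimal NumPy sketch of that idea, with made-up probabilities for three models (this illustrates the general technique, not PyCaret's internal code):

```python
import numpy as np

# Made-up [P(No), P(Yes)] for three patients from three hypothetical models
proba_lightgbm = np.array([[0.95, 0.05], [0.70, 0.30], [0.60, 0.40]])
proba_lr = np.array([[0.90, 0.10], [0.65, 0.35], [0.55, 0.45]])
proba_nb = np.array([[0.80, 0.20], [0.20, 0.80], [0.10, 0.90]])

# Soft voting: average the probabilities across models, then take the argmax
mean_proba = np.mean([proba_lightgbm, proba_lr, proba_nb], axis=0)
predicted_class = mean_proba.argmax(axis=1)  # 0 = No, 1 = Yes

print(mean_proba.round(3))
print(predicted_class)
```

In this toy example the third patient is flagged only because the high-recall model pulls the average above 0.5, which is the kind of complementarity the blend relies on.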

In [51]:
evaluate_model(blend_soft_standard2)


In [26]:
predict_model(blend_soft_standard2);

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.8932,0.8239,0.3444,0.3416,0.343,0.2848,0.2848


In [27]:
# Finalize Model
final_model = finalize_model(blend_soft_standard2)
print(final_model)

Pipeline(memory=FastMemory(location=C:\Users\Ramon\AppData\Local\Temp\joblib),
         steps=[('label_encoding',
                 TransformerWrapperWithInverse(exclude=None, include=None,
                                               transformer=LabelEncoder())),
                ('numerical_imputer',
                 TransformerWrapper(exclude=None,
                                    include=['Height_(cm)', 'Weight_(kg)',
                                             'BMI', 'Alcohol_Consumption',
                                             'Fruit_Consumption',
                                             'Green_Vegetables_Consum...
                                                                  class_weight=None,
                                                                  dual=False,
                                                                  fit_intercept=True,
                                                                  intercept_scaling=1,
                     

In [28]:
# Test the finalized model on the df_unseen sample
unseen_predictions = predict_model(final_model, data=df_unseen)
unseen_predictions.head()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.8979,0.843,0.3833,0.3827,0.383,0.3274,0.3274


Unnamed: 0,General_Health,Checkup,Exercise,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,...,Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption,Heart_Disease,prediction_label,prediction_score
0,Very Good,Never,Yes,No,No,No,No,No,Male,30-34,...,38.560001,13.72,Yes,3,2,30,2,0,No,0.9939
1,Very Good,Within the past year,Yes,No,No,Yes,Yes,No,Male,70-74,...,77.110001,28.290001,No,2,20,5,2,0,No,0.5459
2,Good,Within the past 2 years,Yes,No,No,No,No,No,Male,50-54,...,117.93,36.259998,No,2,0,0,0,0,No,0.9794
3,Very Good,Within the past year,Yes,Yes,No,No,No,No,Female,70-74,...,58.970001,20.98,No,4,30,20,8,0,No,0.6479
4,Very Good,Within the past year,No,No,No,No,No,No,Male,80+,...,79.379997,26.610001,No,2,30,8,2,1,No,0.5566


In [41]:
unseen_predictions.to_csv('unseen_predictions.csv', index=False)

In [29]:
# Save model for future use
# save_model(final_model, 'blended_final_model')

In [37]:
# Test loading model
# saved_blended_model = load_model('blended_final_model')

Transformation Pipeline and Model Successfully Loaded


In [42]:
# Verify model performance on unseen data and check against baseline finalized results
# new_predictions = predict_model(saved_blended_model, data=df_unseen)
# new_predictions.head()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Voting Classifier,0.8979,0.843,0.3833,0.3827,0.383,0.3274,0.3274


Unnamed: 0,General_Health,Checkup,Exercise,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,...,Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption,Heart_Disease,prediction_label,prediction_score
0,Very Good,Never,Yes,No,No,No,No,No,Male,30-34,...,38.560001,13.72,Yes,3,2,30,2,0,No,0.9939
1,Very Good,Within the past year,Yes,No,No,Yes,Yes,No,Male,70-74,...,77.110001,28.290001,No,2,20,5,2,0,No,0.5459
2,Good,Within the past 2 years,Yes,No,No,No,No,No,Male,50-54,...,117.93,36.259998,No,2,0,0,0,0,No,0.9794
3,Very Good,Within the past year,Yes,Yes,No,No,No,No,Female,70-74,...,58.970001,20.98,No,4,30,20,8,0,No,0.6479
4,Very Good,Within the past year,No,No,No,No,No,No,Male,80+,...,79.379997,26.610001,No,2,30,8,2,1,No,0.5566


# Conclusion

From this dataset, we conclude that a high BMI and older age, while not the only factors, are among the features most strongly associated with heart disease.

![lgbm_feature_importance.png](attachment:lgbm_feature_importance.png)

![lr_feature_importance.png](attachment:lr_feature_importance.png)



![final_model_auc.png](attachment:final_model_auc.png)

![final_model_conf_matrix.png](attachment:final_model_conf_matrix.png)

Insights:
- The model performs well at identifying negative cases (no heart disease), with a high specificity of 94.21%.
- However, there is room for improvement in the detection of positive cases (heart disease), as indicated by the relatively low precision and recall values (around 34%).
- The number of False Negatives is concerning, as these represent cases where the model failed to identify heart disease. This could have serious implications in a real-world medical scenario.
- Similarly, the number of False Positives indicates instances where healthy individuals were incorrectly classified as having heart disease, leading to potentially unnecessary medical interventions.
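
The metrics quoted above all derive from the four cells of the confusion matrix. A short sketch of the formulas, using placeholder counts rather than the notebook's actual matrix (which is only available here as an image):

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Return common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

# Placeholder counts for illustration only
metrics = confusion_metrics(tn=900, fp=60, fn=50, tp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Specificity depends only on the negative class, which is why it can stay high even when precision and recall on the positive class are low.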

The heart disease classifier demonstrates good performance in identifying individuals without heart disease but requires further tuning and investigation, particularly in the detection of positive cases. The focus should be on reducing both False Negatives and False Positives to create a more reliable and effective classifier for heart disease prediction. Collaboration with medical experts and utilization of domain knowledge could further enhance the model's performance and applicability.
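
One lever for trading False Negatives against False Positives is the decision threshold applied to the predicted probability of heart disease. A minimal sketch with made-up scores and labels (the data and threshold values are assumptions for illustration, not results from this notebook):

```python
def precision_recall_at(scores, labels, threshold):
    """Compute precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up probabilities of heart disease and true labels (True = has heart disease)
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.80, 0.90]
labels = [False, False, False, True, False, True, False, True, True, True]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, lowering the threshold raises recall at the expense of precision, so a threshold change shifts errors between the two types rather than reducing both at once.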

Overall, we were able to identify which factors determine, or at least correlate strongly with, heart disease, but we were not able to build a model that reliably predicts whether someone has heart disease. As always, these models could be improved with more data and other techniques.

# Appendix

CDC statistics: https://www.cdc.gov/heartdisease/facts.htm#:~:text=Heart%20Disease%20in%20the%20United%20States&text=One%20person%20dies%20every%2033,United%20States%20from%20cardiovascular%20disease.&text=About%20695%2C000%20people%20in%20the,1%20in%20every%205%20deaths.

