# Introduction

## Heart Disease
In this notebook, we try to answer some questions about heart disease and build a model that predicts whether a patient may have it. We also attempt to identify factors that may cause heart disease or at least correlate with it.

### What is Heart Disease
Heart disease is a general term that refers to many types of heart conditions, including coronary artery disease (CAD), arrhythmia, heart valve disease, and heart failure. CAD is the most common type of heart disease in the United States. It affects the major blood vessels that supply the heart muscle. Decreased blood flow to the heart can cause a heart attack.

Heart disease can't be cured once it has been diagnosed. However, you can treat the things that contributed to the development of coronary artery disease, which can reduce the impact the condition has on your body.

According to CDC statistics:
- One person dies every 33 seconds from heart disease in the United States
- 695,000 people in the United States died from heart disease in 2021, which is 1 in every 5 deaths
- Heart disease cost the United States about $239.9 billion each year from 2018 to 2019. This includes the cost of health care services, medicines, and lost productivity due to death

As you can see, heart disease is a major concern for a lot of people in the United States. It affects many millions of people, either directly or indirectly. Preventing this disease, or at least catching it early, combined with the latest treatments and lifestyle changes, can lead to major gains in both length of life and quality of life well into older age.

# Data Wrangling
Import the data and make any modifications needed to work with it.

In [39]:
# Import libraries to be used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the data from the csv file
df = pd.read_csv('HD_cleaned.csv')
df.head()

Unnamed: 0,General_Health,Checkup,Exercise,Skin_Cancer,Other_Cancer,Depression,Diabetes,Arthritis,Sex,Age_Category,Height_(cm),Weight_(kg),BMI,Smoking_History,Alcohol_Consumption,Fruit_Consumption,Green_Vegetables_Consumption,FriedPotato_Consumption,Heart_Disease
0,Poor,Within the past 2 years,No,No,No,No,No,Yes,Female,70-74,150,32.66,14.54,Yes,0,30,16,12,No
1,Very Good,Within the past year,No,No,No,No,Yes,No,Female,70-74,165,77.11,28.29,No,0,30,0,4,Yes
2,Very Good,Within the past year,Yes,No,No,No,Yes,No,Female,60-64,163,88.45,33.47,No,4,12,3,16,No
3,Poor,Within the past year,Yes,No,No,No,Yes,No,Male,75-79,180,93.44,28.73,No,0,30,30,8,Yes
4,Good,Within the past year,No,No,No,No,No,No,Male,80+,191,88.45,24.37,Yes,0,8,4,0,No


In [12]:
df.shape

(308854, 19)

In [13]:
df.dtypes

General_Health                   object
Checkup                          object
Exercise                         object
Skin_Cancer                      object
Other_Cancer                     object
Depression                       object
Diabetes                         object
Arthritis                        object
Sex                              object
Age_Category                     object
Height_(cm)                       int64
Weight_(kg)                     float64
BMI                             float64
Smoking_History                  object
Alcohol_Consumption               int64
Fruit_Consumption                 int64
Green_Vegetables_Consumption      int64
FriedPotato_Consumption           int64
Heart_Disease                    object
dtype: object

In [14]:
# Split data into disjoint modeling and prediction (hold-out) sets
df_full = df
df = df_full.sample(frac=.95, random_state=123)
# Rows not sampled for modeling become the unseen set (no overlap)
df_unseen = df_full.drop(df.index).reset_index(drop=True)
df = df.reset_index(drop=True)

print('Data for modeling: ' + str(df.shape))
print('Data for predictions: ' + str(df_unseen.shape))

Data for modeling: (293411, 19)
Data for predictions: (15443, 19)
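
A hold-out set is only informative if it shares no rows with the modeling data, so it is worth checking that property for any split like the one above. The following is a minimal sketch using only the standard library, with hypothetical row ids standing in for the DataFrame index; `split_holdout` is an illustrative helper, not a pandas or PyCaret function.

```python
import random


def split_holdout(row_ids, holdout_frac=0.05, seed=123):
    """Split row ids into two disjoint lists: modeling and hold-out."""
    rng = random.Random(seed)
    shuffled = list(row_ids)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_frac)
    holdout = shuffled[:n_holdout]
    modeling = shuffled[n_holdout:]
    return modeling, holdout


# Hypothetical row ids (assumption: 1,000 rows instead of the real data)
row_ids = range(1000)
modeling_ids, holdout_ids = split_holdout(row_ids)

# Any shared id would mean the "unseen" data was seen during modeling
overlap = set(modeling_ids) & set(holdout_ids)
print(len(modeling_ids), len(holdout_ids), len(overlap))  # 950 50 0
```

With pandas, the same check could be done by comparing the original index values of the two frames before resetting them.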


In [15]:
# Set up the PyCaret environment
from pycaret.classification import *

In [16]:
clf_exp1 = setup(data = df, target = 'Heart_Disease', session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Heart_Disease
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(293411, 19)"
5,Transformed data shape,"(293411, 42)"
6,Transformed train set shape,"(205387, 42)"
7,Transformed test set shape,"(88024, 42)"
8,Ordinal features,7
9,Numeric features,7


# Exploratory Data Analysis

In [22]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title="Profiling Report", dark_mode=True)
profile




In [23]:
profile.to_file('HD_EDA_Profile.html')


# Modeling

In [24]:
compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.9194,0.8367,0.0343,0.5429,0.0645,0.0554,0.1213,0.612
gbc,Gradient Boosting Classifier,0.9192,0.8358,0.054,0.5102,0.0976,0.0834,0.1462,4.725
lr,Logistic Regression,0.919,0.8356,0.0623,0.4982,0.1108,0.0945,0.1546,6.772
svm,SVM - Linear Kernel,0.919,0.0,0.0376,0.4784,0.0653,0.0559,0.1059,2.34
ridge,Ridge Classifier,0.919,0.0,0.0014,0.5379,0.0028,0.0023,0.0232,0.313
dummy,Dummy Classifier,0.919,0.5,0.0,0.0,0.0,0.0,0.0,0.267
ada,Ada Boost Classifier,0.9184,0.8357,0.0768,0.4767,0.1321,0.1122,0.1666,1.787
catboost,CatBoost Classifier,0.9183,0.8338,0.0526,0.4635,0.0945,0.0792,0.135,10.162
xgboost,Extreme Gradient Boosting,0.9182,0.8312,0.0527,0.4568,0.0944,0.079,0.1337,2.654
rf,Random Forest Classifier,0.9178,0.8086,0.0414,0.424,0.0753,0.0618,0.1121,3.969
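
The table above shows the Dummy Classifier reaching 0.919 accuracy with zero recall, which suggests that accuracy alone says little on data this imbalanced. As a rough illustration, the sketch below computes the same metrics from confusion-matrix counts. The counts are hypothetical (1,000 patients, 81 with heart disease) and `classification_metrics` is an illustrative helper, not a PyCaret function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical classifier that always predicts "No"
always_no = classification_metrics(tp=0, fp=0, fn=81, tn=919)

# Hypothetical classifier that flags many patients as "Yes"
high_recall = classification_metrics(tp=70, fp=330, fn=11, tn=589)

for name, m in [("always_no", always_no), ("high_recall", high_recall)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The first classifier scores higher on accuracy while missing every patient with heart disease; the second has lower accuracy but finds most of them, at the cost of many false alarms.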



In [25]:
lightgbm = create_model('lightgbm')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9185,0.8354,0.0337,0.459,0.0627,0.0523,0.1071
1,0.9199,0.836,0.0379,0.578,0.0711,0.0618,0.1331
2,0.9196,0.8347,0.0355,0.5566,0.0667,0.0576,0.1256
3,0.9199,0.8345,0.0385,0.5872,0.0722,0.0629,0.1355
4,0.9192,0.8277,0.0343,0.5229,0.0643,0.0549,0.1183
5,0.9195,0.8422,0.0349,0.5472,0.0655,0.0564,0.123
6,0.9197,0.8357,0.0349,0.5743,0.0657,0.057,0.1271
7,0.9199,0.841,0.0391,0.5804,0.0732,0.0637,0.1356
8,0.9199,0.8424,0.0307,0.6071,0.0584,0.051,0.1236
9,0.9182,0.8378,0.0241,0.4167,0.0455,0.037,0.0843


In [27]:
tuned_lightgbm = tune_model(lightgbm, optimize = 'Recall')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.9105,0.8196,0.0842,0.307,0.1321,0.1008,0.1249
1,0.911,0.8185,0.0758,0.3029,0.1212,0.0918,0.117
2,0.9122,0.8169,0.0794,0.3259,0.1277,0.0991,0.1274
3,0.9119,0.8187,0.0823,0.327,0.1315,0.1023,0.1301
4,0.9114,0.8125,0.0721,0.303,0.1165,0.0881,0.1141
5,0.9105,0.8255,0.0631,0.2734,0.1025,0.0744,0.0973
6,0.9109,0.8171,0.0751,0.3005,0.1202,0.0907,0.1156
7,0.9099,0.824,0.0788,0.2918,0.1241,0.0928,0.1155
8,0.9134,0.8248,0.0734,0.3389,0.1206,0.0945,0.1263
9,0.9111,0.8202,0.0704,0.2947,0.1136,0.085,0.11


Fitting 10 folds for each of 10 candidates, totalling 100 fits
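
Even after tuning for recall, the LightGBM folds above stay below 0.1 recall. One common alternative, which this notebook does not apply, is to lower the decision threshold on the predicted probability instead of changing hyperparameters. The sketch below illustrates the trade-off on hypothetical probabilities and labels; `recall_precision_at` is an illustrative helper.

```python
def recall_precision_at(threshold, probabilities, labels):
    """Return (recall, precision) when predicting 1 for p >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical predicted probabilities of heart disease and true labels
probabilities = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.15, 0.08]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]

default_recall, default_precision = recall_precision_at(0.5, probabilities, labels)
lowered_recall, lowered_precision = recall_precision_at(0.25, probabilities, labels)

print(default_recall, default_precision)
print(lowered_recall, lowered_precision)
```

In this toy data, lowering the threshold from 0.5 to 0.25 raises recall from 2/3 to 1.0 while precision drops from 1.0 to 0.6.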


In [26]:
qda = create_model('qda')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.5477,0.7911,0.896,0.1405,0.2429,0.1196,0.2254
1,0.75,0.6926,0.4925,0.1603,0.2419,0.1364,0.1674
2,0.51,0.7628,0.8966,0.131,0.2286,0.1016,0.2044
3,0.5426,0.7885,0.8798,0.1374,0.2376,0.1134,0.2145
4,0.5705,0.7895,0.8606,0.1429,0.2451,0.1232,0.2213
5,0.5293,0.7857,0.8936,0.1355,0.2352,0.11,0.2138
6,0.5554,0.7877,0.8786,0.1407,0.2425,0.1196,0.2213
7,0.5574,0.8045,0.893,0.1428,0.2462,0.1239,0.2296
8,0.4539,0.7414,0.9254,0.1219,0.2154,0.0843,0.1894
9,0.5286,0.7987,0.9086,0.1369,0.2379,0.1131,0.2209


In [29]:
tuned_qda = tune_model(qda, optimize = 'Recall')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6928,0.5732,0.3909,0.1093,0.1709,0.0507,0.0663
1,0.6959,0.5841,0.3921,0.1108,0.1727,0.0532,0.0692
2,0.7012,0.5753,0.3758,0.1092,0.1692,0.05,0.0643
3,0.7032,0.5776,0.3738,0.1096,0.1695,0.0505,0.0647
4,0.6977,0.5745,0.3732,0.1073,0.1667,0.0467,0.0604
5,0.6926,0.5759,0.3894,0.109,0.1703,0.0501,0.0655
6,0.6954,0.579,0.3834,0.1087,0.1694,0.0494,0.0642
7,0.7064,0.5802,0.3854,0.1135,0.1753,0.0574,0.0734
8,0.7022,0.5926,0.4041,0.1159,0.1801,0.0621,0.0803
9,0.7007,0.5836,0.377,0.1093,0.1695,0.0502,0.0646


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [30]:
nb = create_model('nb')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6236,0.8063,0.8665,0.161,0.2716,0.1564,0.2579
1,0.6272,0.8009,0.8382,0.1587,0.2669,0.1514,0.2464
2,0.616,0.8017,0.84,0.1549,0.2616,0.1446,0.2398
3,0.6233,0.8011,0.845,0.1582,0.2666,0.1507,0.2471
4,0.6187,0.7977,0.8347,0.1553,0.2618,0.145,0.239
5,0.6273,0.8085,0.8534,0.1608,0.2706,0.1555,0.254
6,0.6244,0.8049,0.8588,0.1604,0.2703,0.155,0.2547
7,0.6261,0.8142,0.8629,0.1615,0.2721,0.1571,0.2578
8,0.619,0.802,0.8491,0.1572,0.2652,0.1489,0.2463
9,0.6209,0.805,0.8497,0.1579,0.2663,0.1503,0.2478


In [33]:
tuned_nb = tune_model(nb, optimize = 'Recall')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6236,0.8063,0.8665,0.161,0.2716,0.1564,0.2579
1,0.6272,0.8009,0.8382,0.1587,0.2669,0.1514,0.2464
2,0.616,0.8017,0.84,0.1549,0.2616,0.1446,0.2398
3,0.6234,0.8011,0.845,0.1583,0.2666,0.1507,0.2472
4,0.6187,0.7977,0.8347,0.1553,0.2618,0.145,0.239
5,0.6273,0.8085,0.8534,0.1608,0.2706,0.1555,0.254
6,0.6244,0.8049,0.8588,0.1604,0.2703,0.155,0.2547
7,0.6261,0.8142,0.8629,0.1615,0.2721,0.1571,0.2578
8,0.619,0.802,0.8491,0.1572,0.2652,0.1489,0.2463
9,0.6209,0.805,0.8497,0.1579,0.2663,0.1503,0.2478


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).


In [41]:
# Choosing different fonts for plots because of font warnings

plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.sans-serif'] = 'Arial'  # You can choose a different font if you like


In [42]:
evaluate_model(tuned_lightgbm)


In [43]:
evaluate_model(tuned_qda)


In [44]:
evaluate_model(tuned_nb)


In [48]:
blend_soft_standard = blend_models([lightgbm, qda, nb], method = 'soft')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6298,0.8235,0.8533,0.1616,0.2718,0.157,0.2555
1,0.7827,0.7956,0.4877,0.1834,0.2666,0.1687,0.1967
2,0.6301,0.8168,0.8232,0.1578,0.2649,0.1493,0.2409
3,0.6324,0.8163,0.8293,0.1596,0.2677,0.1525,0.2456
4,0.6266,0.8139,0.8197,0.1562,0.2624,0.1462,0.2369
5,0.6366,0.8225,0.8383,0.1624,0.2721,0.1578,0.2529
6,0.6333,0.821,0.8413,0.1615,0.271,0.1563,0.2521
7,0.6313,0.83,0.8575,0.1627,0.2736,0.1591,0.2586
8,0.6188,0.8212,0.8485,0.157,0.2649,0.1486,0.2458
9,0.6275,0.8239,0.8443,0.1596,0.2685,0.1531,0.2495


In [49]:
evaluate_model(blend_soft_standard)


In [50]:
blend_soft_tuned = blend_models([tuned_lightgbm, tuned_qda, tuned_nb], method = 'soft')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.63,0.818,0.8539,0.1618,0.2721,0.1573,0.256
1,0.7806,0.7912,0.4979,0.184,0.2687,0.1707,0.2001
2,0.6291,0.8103,0.8256,0.1578,0.265,0.1493,0.2415
3,0.6324,0.8106,0.8305,0.1598,0.268,0.1528,0.2461
4,0.6265,0.8081,0.8227,0.1565,0.263,0.1469,0.2384
5,0.6364,0.8167,0.8431,0.163,0.2731,0.1589,0.2551
6,0.6329,0.8131,0.8438,0.1617,0.2714,0.1567,0.253
7,0.6309,0.8244,0.8563,0.1625,0.2731,0.1586,0.2578
8,0.6181,0.8145,0.8503,0.157,0.265,0.1486,0.2463
9,0.6273,0.8178,0.8479,0.16,0.2692,0.154,0.2512


In [51]:
evaluate_model(blend_soft_tuned)


In [52]:
blend_hard_standard = blend_models([lightgbm, qda, nb], method = 'hard')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6311,0.0,0.8515,0.1619,0.2721,0.1575,0.2555
1,0.7849,0.0,0.4787,0.1832,0.2649,0.1674,0.1941
2,0.6305,0.0,0.8196,0.1575,0.2643,0.1486,0.2395
3,0.6334,0.0,0.8239,0.1593,0.267,0.1518,0.2436
4,0.628,0.0,0.8173,0.1564,0.2625,0.1464,0.2367
5,0.6377,0.0,0.8347,0.1624,0.2718,0.1576,0.2519
6,0.6345,0.0,0.8341,0.161,0.2699,0.1552,0.2493
7,0.631,0.0,0.8557,0.1624,0.273,0.1585,0.2575
8,0.6206,0.0,0.8455,0.1573,0.2652,0.149,0.2455
9,0.6289,0.0,0.8425,0.1599,0.2688,0.1536,0.2496
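
The `soft` and `hard` blending methods used in these cells combine the base models in different ways: soft voting averages the predicted probabilities, while hard voting takes a majority of the predicted labels. Because hard voting yields labels rather than probabilities, that is presumably why the AUC column reads 0.0 for the hard blends. The following is a simplified sketch of the two rules on hypothetical probabilities from three models; it assumes a 0.5 cut-off and equal weights, and is not PyCaret's actual implementation.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average positive-class probabilities across models, then threshold."""
    averaged = [sum(ps) / len(ps) for ps in zip(*prob_lists)]
    labels = [1 if a >= threshold else 0 for a in averaged]
    return labels, averaged


def hard_vote(prob_lists, threshold=0.5):
    """Threshold each model separately, then take the majority label."""
    votes = [[1 if p >= threshold else 0 for p in probs]
             for probs in prob_lists]
    return [1 if sum(v) > len(v) / 2 else 0 for v in zip(*votes)]


# Hypothetical positive-class probabilities for four patients
model_a = [0.05, 0.99, 0.20, 0.90]
model_b = [0.55, 0.45, 0.40, 0.80]
model_c = [0.55, 0.45, 0.30, 0.70]

soft_labels, soft_probs = soft_vote([model_a, model_b, model_c])
hard_labels = hard_vote([model_a, model_b, model_c])

print(soft_labels)  # [0, 1, 0, 1]
print(hard_labels)  # [1, 0, 0, 1]
```

The first two patients show where the rules disagree: one very confident model can outweigh two lukewarm ones under soft voting, whereas hard voting counts each model's vote equally.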


In [53]:
blend_hard_tuned = blend_models([tuned_lightgbm, tuned_qda, tuned_nb], method = 'hard')

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.6309,0.0,0.8521,0.1619,0.2721,0.1575,0.2557
1,0.7828,0.0,0.4907,0.1842,0.2678,0.1701,0.1984
2,0.6297,0.0,0.8214,0.1575,0.2643,0.1486,0.2398
3,0.6327,0.0,0.8263,0.1593,0.2672,0.152,0.2443
4,0.6276,0.0,0.8191,0.1565,0.2627,0.1467,0.2373
5,0.637,0.0,0.8365,0.1623,0.2719,0.1576,0.2523
6,0.6336,0.0,0.8353,0.1608,0.2697,0.1549,0.2493
7,0.6303,0.0,0.8557,0.1621,0.2726,0.158,0.257
8,0.6199,0.0,0.8479,0.1573,0.2654,0.1492,0.2463
9,0.628,0.0,0.8443,0.1598,0.2688,0.1535,0.2499


# Conclusion

# Appendix

CDC statistics: https://www.cdc.gov/heartdisease/facts.htm#:~:text=Heart%20Disease%20in%20the%20United%20States&text=One%20person%20dies%20every%2033,United%20States%20from%20cardiovascular%20disease.&text=About%20695%2C000%20people%20in%20the,1%20in%20every%205%20deaths.

