### Heart Disease Risk Prediction and Early-Stage Heart Disease Detection

In [None]:
import pandas as pd
import altair as alt
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split

# !! more imports later

### Summary

We wish to create a simple machine learning classification model that can help us identify individuals at high risk of heart disease. We try two methods, a logistic regression and an SVM with an RBF kernel, using 13 common features related to heart disease to make the predictions. The model is intended to return whether a patient is predicted to develop heart disease or not, which lets us identify high-risk individuals and implement relevant prevention methods ahead of time.

We select the F2 score as our model performance metric because we care most about keeping the number of false negatives as low as possible. Our final classifier performed fairly well on an unseen test data set, with an F2 score (beta = 2) of <> and an overall accuracy of <>. Of the <num> test cases, it correctly predicted <num>. It incorrectly predicted <num> cases, all of which were false positives: predicting that a patient is prone to develop heart disease when in fact they are not. This kind of incorrect prediction is not as harmful as a false negative in our context. Although it could in theory cause a patient to undergo unnecessary treatment if the model is used as a decision tool, we expect additional decision layers to mitigate this. As such, we believe this model is at the very least a useful tool that helps medical professionals look at important cases more closely and follow up with them more frequently.

### Introduction

According to the American Heart Association (n.d.), ischemic heart disease (IHD) is a condition in which narrowed coronary arteries reduce blood flow to the heart. According to the World Health Organization (n.d.), in India, a country of 1.4 billion people, heart disease was consistently the leading cause of death for both sexes over the last decade (2010-2020).

Cardiovascular diseases (CVDs) are the number one cause of mortality in India, accounting for approximately **31%** of all deaths, according to the latest Sample Registration System report (Press Trust of India, 2025). India’s age-standardized CVD death rate is estimated at **272 per 100,000**, significantly higher than the global average of about **235 per 100,000** (Gupta et al., 2018). According to one report, about **1.7 million deaths** in India (in 2019) were due to IHD, constituting **15.2%** of all deaths. 

If high-risk individuals can be identified before clinical events (such as heart attacks), interventions can reduce morbidity and mortality (National Heart, Lung, and Blood Institute, n.d.). Traditional diagnosis often depends on physician expertise, subjective assessment, and resource-intensive tests. A data-driven predictive model could help identify patients more prone to such events, especially in resource-limited settings.

Given that CVDs cause almost one-third of all deaths in India, even minor improvements or supplementary methods in early detection could make a large difference in population health. Thus, in this project we attempt to use measurable features to identify high-risk cases, which can lead to more careful monitoring and early-stage prevention measures.

### Methods

#### 1. Data

The dataset contains 1026 unique rows, each holding information such as cholesterol, blood pressure, and fasting blood sugar for one individual. There are 13 provided features (the columns grouped for preprocessing in Section 2.2), along with an identifier column for each patient. Our target column uses a binary encoding, where 1 means 'yes, heart disease' and 0 means 'no heart disease'. This heart disease dataset is acquired from <link>. Each row in the data set represents summary statistics from a single patient.

A detailed explanation of all the important features is provided by the same source, which gives a better overview of the summary statistics for the numerical columns. The dataset itself can be downloaded <here>.

#### 2. Analysis

**2.1 Importing the data and preliminary EDA**

We can start with some EDA to see what we're working with.

In [4]:
df = pd.read_csv("./data/heart.csv")
df
# EDA CODE AND MARKDOWN PART

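| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |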


We can select an 80-20 split for training and testing data respectively. 

In [None]:
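# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)

# Save both splits (assumes the ./data/processed/ directory already exists).
train_df.to_csv("./data/processed/heart_train.csv")
test_df.to_csv("./data/processed/heart_test.csv")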
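
As an optional variation on the split above, the sketch below shows how the `stratify` argument of `train_test_split` keeps the share of positive cases the same in both splits. The toy DataFrame and its values are made up for illustration and are not the heart dataset; whether stratification is needed here depends on how balanced the real target column is.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the heart data: 100 rows, 30% positive class (made-up values).
toy_df = pd.DataFrame({
    "age": range(100),
    "target": [1] * 30 + [0] * 70,
})

# stratify keeps the proportion of each target class the same in both splits.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=123, stratify=toy_df["target"]
)

train_share = toy_train["target"].mean()
test_share = toy_test["target"].mean()
print(f"positive share: train={train_share:.2f}, test={test_share:.2f}")
```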

**2.2 Data Prep: Splitting target column and feature column cleanup**

The test data was already set aside in the split above, which keeps us from violating the golden rule. Before going further, we separate the target column from the feature columns in both the training and the test set.

In [None]:
X_train = train_df.drop(columns = ['target'])
y_train = train_df['target']
X_test = test_df.drop(columns = ['target'])
y_test = test_df['target']

We can now define the preprocessing for each column according to its type.

In [6]:
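# Group the feature columns by the preprocessing each group needs.
binary = ['sex', 'fbs', 'exang']   # already 0/1, passed through unchanged
ohe = ['cp', 'restecg', 'thal']    # categorical, one-hot encoded
numerical = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'ca']  # standardized
ordinal = ['slope']                # treated as ordered, integer encoded

# The output columns follow the order of the transformers listed here.
preprocessor = make_column_transformer(
    (StandardScaler(), numerical),
    (OneHotEncoder(), ohe),
    (OrdinalEncoder(), ordinal),
    ('passthrough', binary)
)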

We can view the first 5 rows of the preprocessed data to have an idea of what we are going to input into our model.

In [7]:
X_train_preprocessed = preprocessor.fit_transform(X_train)
# Column names must follow the transformer order in the preprocessor:
# numerical, one-hot encoded, ordinal, then binary (passthrough).
column_names = (
    numerical
    + preprocessor.named_transformers_['onehotencoder'].get_feature_names_out(ohe).tolist()
    + ordinal
    + binary
)
X_train_preprocessed = pd.DataFrame(X_train_preprocessed, columns=column_names)
X_train_preprocessed.head(5)

| | age | trestbps | chol | thalach | oldpeak | ca | cp_0 | cp_1 | cp_2 | cp_3 | ... | restecg_1 | restecg_2 | thal_0 | thal_1 | thal_2 | thal_3 | slope | sex | fbs | exang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.04821 | 3.439983 | 0.659128 | 1.968909 | -0.914049 | 0.217877 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.830279 | -0.665467 | 0.621406 | -2.052711 | 0.266165 | 0.217877 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.720468 | 0.132815 | -0.265068 | -0.216754 | 1.277777 | 1.173272 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | -2.134621 | -0.323346 | 0.640267 | 0.264092 | -0.914049 | -0.737518 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 |
| 4 | 1.708768 | -0.095265 | 1.394713 | -1.790431 | 1.109175 | 2.128667 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |


_Table 2.2.1: Preprocessed columns_

**2.3 Scoring Metric**

We use the F2 score to check the performance of the model. Because this is a medical problem, we reason that false negatives are much more harmful than false positives, and we wish to catch as many true cases as we can. Unlike accuracy, the F2 score combines precision and recall, and it treats recall as twice as important as precision, so optimizing it pushes the model toward fewer false negatives.

$$
F_{\beta} = (1+\beta^2) \cdot \frac{\text{precision} \times \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
$$

Setting $\beta = 2$ gives the F2 score.
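
To make the formula concrete, the sketch below computes the F2 score by hand on a small set of made-up labels and compares it with `fbeta_score` from scikit-learn. The labels are illustrative only and do not come from the heart dataset.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: 4 true positives, 1 false negative,
# 2 false positives, 3 true negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 4 / (4 + 2)
recall = recall_score(y_true, y_pred)        # 4 / (4 + 1)

# F-beta formula from above with beta = 2.
beta = 2
f2_manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# The same score computed by scikit-learn.
f2_sklearn = fbeta_score(y_true, y_pred, beta=2)
print(f"precision={precision:.3f}, recall={recall:.3f}, F2={f2_sklearn:.3f}")
```

Because recall (0.8) is higher than precision (about 0.667) in this toy case, the F2 score lands closer to recall than an equally weighted F1 score would.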

**2.4 Model Tuning**

In [None]:
# code and analysis for models

# init model
# hyperparameter tuning and cross validate
# plot the hyperparameter outputs and select the best
# accuracy, confusion matrix, f score
# run on test set and get score
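
The steps outlined above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the final analysis: it uses synthetic data from `make_classification` in place of the heart data, a plain `StandardScaler` in place of the `preprocessor` defined earlier, and small hyperparameter grids chosen only for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in X_train / y_train from above.
X, y = make_classification(n_samples=300, n_features=13, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

# Scorer that lets cross-validation optimize the F2 score.
f2_scorer = make_scorer(fbeta_score, beta=2)

# Candidate models with illustrative (not tuned) hyperparameter grids.
candidates = {
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1, 10]},
    ),
    "svm_rbf": (
        make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    ),
}

# Grid search with 5-fold cross-validation for each candidate.
results = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, scoring=f2_scorer, cv=5)
    search.fit(X_tr, y_tr)
    results[name] = search
    print(name, search.best_params_, round(search.best_score_, 3))

# Pick the model with the best cross-validated F2, then score it once on the test set.
best_name = max(results, key=lambda n: results[n].best_score_)
test_f2 = fbeta_score(y_te, results[best_name].predict(X_te), beta=2)
print(best_name, "test F2:", round(test_f2, 3))
```

Selecting the model on cross-validation scores and touching the test set only once at the end keeps the reported test F2 an honest estimate.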

### Results and Discussion

**Final model selection**

< select the better model and discuss the results >

### References

1. Press Trust of India. (2025, September 5). Cardiovascular diseases cause one-third of all deaths in India: Report. Business Standard. https://www.business-standard.com/amp/health/cardiovascular-diseases-cause-one-third-of-all-deaths-in-india-report-125090500028_1.html
2. Prabhakaran, D., Jeemon, P., & Roy, A. (2016). Cardiovascular diseases in India: Current epidemiology and future directions. Circulation, 133(16), 1605–1620. https://doi.org/10.1161/CIRCULATIONAHA.114.008729
3. Gupta, R., Khedar, R. S., Gaur, K., & Xavier, D. (2018). Low quality cardiovascular care is important coronary risk factor in India. Indian Heart Journal, 70(Suppl 3), S419–S430. https://doi.org/10.1016/j.ihj.2018.05.002
4. Alizadehsani, R., Roshanzamir, M., Abdar, M., Beykikhoshk, A., Khosravi, A., Panahiazar, M., Koohestani, A., Khozeimeh, F., Nahavandi, S., & Sarrafzadegan, N. (2019). A database for using machine learning and data mining techniques for coronary artery disease diagnosis. Scientific Data, 6, Article 227. https://doi.org/10.1038/s41597-019-0206-3
5. World Health Organization. (n.d.). India — Health data overview. WHO. Retrieved November 18, 2025, from https://data.who.int/countries/356
6. American Heart Association. (n.d.). Ischemic Heart Disease and Silent Ischemia. Retrieved [date], from https://www.heart.org/en/health-topics/heart-attack/about-heart-attacks/silent-ischemia-and-ischemic-heart-disease
7. World Health Organization. (n.d.). Leading causes of death. WHO Global Health Observatory. Retrieved November 21, 2025, from https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-leading-causes-of-death
8. National Heart, Lung, and Blood Institute. (n.d.). Treatment of coronary heart disease. NIH. Retrieved November 21, 2025, from https://www.nhlbi.nih.gov/health/coronary-heart-disease/treatment
