# Keeping your finger on the pulse: an at-home predictive model of heart disease

## Introduction

According to the WHO, heart disease is one of the leading causes of death worldwide, and one third of these deaths occur in people under 70 years old. Many of these deaths could have been prevented with earlier diagnosis and treatment.



To help diagnose heart disease earlier, we want to train a model that lets people take measurements at home, enter them, and find out whether they may be at risk of heart disease. With this information, they may be more likely to seek care from a specialist who can confirm a diagnosis.

<font size="2">Source: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1</font>

Data were collected from Cleveland, Hungary, Switzerland, and the VA Long Beach. The dataset consists of 303 instances and the following variables: Age, Sex, Chest Pain, Resting Blood Pressure, Serum Cholesterol, Resting Electrocardiographic Results, Maximum Heart Rate Achieved, Exercise Induced Angina, ST Depression Induced by Exercise Relative to Rest, Diagnosis of Heart Disease.
<br>
Diagnosis of Heart Disease: 0 represents no heart disease, and 1, 2, 3, and 4 represent increasing levels of heart disease.
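
One common simplification, which the cells below do not apply, is to collapse the five diagnosis levels into a yes/no flag. The following is a minimal sketch under that assumption; the sample values and the name `has_disease` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the `num` column (0 = no disease, 1-4 = increasing levels).
num = pd.Series([0, 2, 1, 0, 4, 3, 0], name="num")

# Collapse the five levels into a binary flag: 0 stays 0, anything above 0 becomes 1.
has_disease = (num > 0).astype(int).rename("has_disease")

print(has_disease.tolist())
```

A binary target of this kind answers the question "should I see a doctor?" directly, at the cost of discarding the severity information.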


## Preliminary Exploratory Data Analysis

In [None]:
%pip install ucimlrepo

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 

### Reading Data From Web

In [None]:
heart_df = pd.read_csv("https://archive.ics.uci.edu/static/public/45/data.csv")

In [None]:
heart_df

### Data Summary

In [None]:
columns = heart_df.columns.to_list()
columns

| Variable | Description |
| --- | --- |
| age | Age of Patient |
| sex | Sex of Patient |
| cp | Chest Pain Type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic) |
| trestbps | Resting Blood Pressure (in mm Hg on admission to the hospital) |
| chol | Serum Cholesterol in mg/dl |
| fbs | Fasting Blood Sugar > 120 mg/dl (1 = true; 0 = false) |
| restecg | Resting Electrocardiographic Results (0: normal, 1: having ST-T wave abnormality, 2: showing probable or definite left ventricular hypertrophy) |
| thalach | Maximum Heart Rate Achieved |
| exang | Exercise Induced Angina |
| oldpeak | ST Depression Induced by Exercise Relative to Rest |
| slope | Slope of the Peak Exercise ST Segment (1: upsloping, 2: flat, 3: downsloping) |
| thal | 3 = normal; 6 = fixed defect; 7 = reversible defect |
| ca | Number of Major Vessels (0-3) Colored by Fluoroscopy |
| num | Diagnosis of Heart Disease |

The Preliminary Exploratory Data Analysis should only look at training data, so we split the data into training and test sets here:

In [None]:
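heart_train, heart_test = train_test_split(
    heart_df,
    test_size=0.25,
    random_state=2000,
)

# Features are every column except the last one ("num"), which is the target.
X_train = heart_train.iloc[:, :-1]
y_train = heart_train["num"]

# The test features must come from heart_test, not heart_train.
X_test = heart_test.iloc[:, :-1]
y_test = heart_test["num"]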

In [None]:
heart_train

In [None]:
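def details(df):
    """Summarize each column: dtype, missing count, missing percentage, unique count."""
    table_d = pd.DataFrame(df.dtypes, columns=["data_type"])
    table_d['#missing'] = df.isnull().sum().values
    # Multiply by 100 so the column holds a percentage, as its name says.
    table_d['%missing'] = df.isnull().sum().values / len(df) * 100
    table_d['#unique'] = df.nunique().values
    return table_d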

Finding the count, mean, std, min, 25%, 50%, 75% and max of each variable, with results rounded to 2 decimal places.

In [None]:
round(heart_train.describe(),2)

Checking for count of missing data, % of missing data, and count of unique data.

In [None]:
details(heart_train)
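
Many scikit-learn estimators reject missing values, so one possible follow-up to this check is to drop the rows that are incomplete in the columns a model will use. This is a minimal sketch on a made-up frame; the values and the `model_columns` list are assumptions for illustration, not the notebook's actual choice.

```python
import numpy as np
import pandas as pd

# Made-up rows that reuse some of the dataset's column names.
toy = pd.DataFrame({
    "age": [63, 67, 41],
    "chol": [233, np.nan, 204],
    "ca": [0.0, 3.0, np.nan],
})

# Only require completeness in the columns the model would use (assumed here).
model_columns = ["age", "chol"]
complete = toy.dropna(subset=model_columns)

print(len(toy), "rows before,", len(complete), "rows after")
```

Restricting `subset` to the model columns keeps rows whose only gaps are in variables that will be dropped anyway.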

### Data Visualization

We want to see how patients are distributed across each variable, so we plot the patient count against every column below:

In [None]:
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
fig, ax = plt.subplots(5, 3, figsize=(30, 40))
ax = ax.flatten()
fig.subplots_adjust(top=0.97)
fig.suptitle("Count of Patients per Variable (Training Set)", fontsize=20)

for i, f in enumerate(columns):
    # Histogram for variables with many distinct values, bar chart of counts otherwise.
    if heart_train[f].nunique() > 10:
        sns.histplot(heart_train, ax=ax[i], x=f)
    else:
        sns.countplot(data=heart_train, ax=ax[i], x=f)

# 14 variables on a 5x3 grid leave the last panel unused, so hide it.
ax[-1].set_visible(False)


## Methods

The goal of this project is to give people at home an idea of whether they should visit a doctor for a more detailed heart disease exam. In our dataset, we therefore keep only data that can be obtained easily without professional medical equipment or testing, and drop: Number of Major Vessels, Resting Electrocardiographic Results, Slope of the Peak Exercise ST Segment, thal, and ST Depression Induced by Exercise Relative to Rest.
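
The column selection described above could be sketched as follows. The helper name `keep_home_measurable` and the toy rows are hypothetical; the dropped column names follow the variable table in the Data Summary.

```python
import pandas as pd

# Columns the Methods text lists as needing clinical equipment or testing.
CLINICAL_ONLY = ["ca", "restecg", "slope", "thal", "oldpeak"]


def keep_home_measurable(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop the clinic-only columns if they are present."""
    return df.drop(columns=CLINICAL_ONLY, errors="ignore")


# Tiny made-up frame with the same column names as the dataset.
toy = pd.DataFrame({
    "age": [63, 41], "sex": [1, 0], "cp": [1, 2], "trestbps": [145, 130],
    "chol": [233, 204], "fbs": [1, 0], "restecg": [2, 0], "thalach": [150, 172],
    "exang": [0, 0], "oldpeak": [2.3, 1.4], "slope": [3, 1], "ca": [0.0, 0.0],
    "thal": [6.0, 3.0], "num": [0, 0],
})

home_df = keep_home_measurable(toy)
print(home_df.columns.tolist())
```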

We are planning on doing 3 sets of diagrams per feature: a histogram/bar diagram for positive diagnosis of heart disease, a histogram/bar diagram for negative diagnosis of heart disease, and a stacked histogram/bar diagram for all diagnoses.
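
As a rough illustration of the numbers behind those three diagrams, the sketch below tabulates one categorical feature by diagnosis. The data are invented, and treating any `num` above 0 as a positive diagnosis is an assumption.

```python
import pandas as pd

# Made-up rows: one categorical feature and the diagnosis column.
toy = pd.DataFrame({
    "sex": [1, 1, 0, 0, 1, 0],
    "num": [0, 2, 0, 0, 1, 3],
})

# Assumption: any value of `num` above 0 counts as a positive diagnosis.
toy["diagnosis"] = (toy["num"] > 0).map({True: "positive", False: "negative"})

# Rows are feature values, columns are diagnosis groups.
counts = pd.crosstab(toy["sex"], toy["diagnosis"])
print(counts)
```

Each column of `counts` holds the bar heights for one per-diagnosis diagram, and the row totals give the heights of the stacked diagram.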


## Expected outcomes and significance

We expect to find which variables, if any, are correlated with heart disease. Based on previous research, we expect older male patients with high cholesterol, high blood sugar, and high resting and post-exercise heart rates to have higher rates of heart disease.

This will allow patients to avoid waiting at a primary care clinic, where they are often at risk of catching an illness, and give themselves a preliminary diagnosis at home. Although this model should not replace the opinion of a primary care specialist, it may help people determine their risk of heart disease.

Additionally, this model can be used to examine trends in patient data and types of heart disease. If certain variables seem to always be correlated with a certain type of heart disease, further studies may be done to test this connection. 

## What future questions could this lead to?

- How can we keep advancing this model to diagnose heart disease earlier, more accurately, and in an unbiased manner? 

- Which variables have the biggest impact on heart disease?

- How can we suggest potential lifestyle changes to patients in an unbiased manner?


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

In [None]:
heart_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "trestbps","chol","thalach"]),
)

knn = KNeighborsClassifier() 

X_train = heart_train[["age", "trestbps","chol","thalach"]]
y_train = heart_train["num"]

heart_tune_pipe = make_pipeline(heart_preprocessor, knn)
param_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 25, 1),
}

heart_tune_grid = GridSearchCV(
    estimator=heart_tune_pipe,
    param_grid=param_grid,
    cv=5
)

accuracies_grid = pd.DataFrame(
    heart_tune_grid.fit(
        X_train,
        y_train
    ).cv_results_
)

accuracies_grid2 = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 5**(1/2))  # 5 = number of CV folds
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)
accuracies_grid2
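
For reference, the standard error of a mean accuracy over cross-validation folds is the standard deviation of the fold scores divided by the square root of the number of folds. The sketch below works this through with made-up fold scores. It uses the population standard deviation on the assumption that this matches how `std_test_score` is computed.

```python
import math
import statistics

# Invented accuracies for one K value across 5 folds (illustration only).
fold_accuracies = [0.52, 0.58, 0.55, 0.60, 0.50]
n_folds = len(fold_accuracies)

mean_acc = statistics.fmean(fold_accuracies)
std_acc = statistics.pstdev(fold_accuracies)  # population standard deviation (assumed)
sem_acc = std_acc / math.sqrt(n_folds)        # standard error of the mean

print(round(mean_acc, 4), round(std_acc, 4), round(sem_acc, 4))
```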

In [None]:
import altair as alt

cross_val_plot=alt.Chart(accuracies_grid2).mark_line(point=True).encode(
    x=alt.X("n_neighbors",title = "K value", scale=alt.Scale(zero=False)),
    y=alt.Y("mean_test_score",title = "Accuracy", scale=alt.Scale(zero=False))
)
cross_val_plot
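
Instead of reading the best K off the plot, it can also be picked from the results table. This is a minimal sketch with made-up scores; the numbers are illustrative only and are not results from this notebook.

```python
import pandas as pd

# Invented cross-validation summary with the same columns as accuracies_grid2.
toy_results = pd.DataFrame({
    "n_neighbors": [1, 3, 5, 7, 9],
    "mean_test_score": [0.48, 0.53, 0.55, 0.58, 0.56],
})

# Take the row with the highest mean accuracy and read its K value.
best_row = toy_results.loc[toy_results["mean_test_score"].idxmax()]
best_k = int(best_row["n_neighbors"])

print(best_k)
```

After fitting, `GridSearchCV` exposes the same choice through its `best_params_` attribute.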

In [None]:
knn_spec = KNeighborsClassifier(n_neighbors=17)

heart_fit = make_pipeline(heart_preprocessor, knn_spec).fit(X_train, y_train)

heart_test_predictions = heart_test.assign(
    predicted=heart_fit.predict(heart_test[["age", "trestbps","chol","thalach"]])
)
heart_test_predictions

In [None]:
X_test = heart_test[["age", "trestbps","chol","thalach"]]
y_test = heart_test["num"]
heart_prediction_accuracy = heart_fit.score(X_test, y_test)
heart_prediction_accuracy # this is the accuracy of our predictor when predicting test data after being trained with training data
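
An accuracy figure is easier to interpret next to a naive baseline. The sketch below computes the accuracy of always predicting the most common training label. The helper name and the label lists are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Hypothetical helper: accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Made-up labels on the same 0-4 scale as `num`.
train = [0, 0, 0, 1, 2, 0, 3, 1]
test = [0, 1, 0, 0, 4]

label, baseline_acc = majority_baseline_accuracy(train, test)
print(label, baseline_acc)
```

A classifier is only informative to the extent that its test accuracy exceeds this baseline.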

In [None]:
heart_matrix = pd.crosstab(
    heart_test_predictions["num"],  # True labels
    heart_test_predictions["predicted"],  # Predicted labels
)

heart_matrix
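
For a screening tool, the share of sick patients the model catches matters as much as overall accuracy. The sketch below derives sensitivity and specificity from true and predicted labels. The labels are invented, and treating any level above 0 as "has heart disease" is an assumption.

```python
# Invented true and predicted labels on the 0-4 scale.
actual = [0, 0, 1, 2, 3, 0, 1, 4]
predicted = [0, 1, 0, 0, 3, 0, 1, 0]

# Assumption: any level above 0 counts as "has heart disease".
pairs = [(a > 0, p > 0) for a, p in zip(actual, predicted)]

tp = sum(1 for a, p in pairs if a and p)          # sick, flagged
fn = sum(1 for a, p in pairs if a and not p)      # sick, missed
tn = sum(1 for a, p in pairs if not a and not p)  # healthy, cleared
fp = sum(1 for a, p in pairs if not a and p)      # healthy, flagged

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(sensitivity, specificity)
```

Low sensitivity would mean many people with heart disease are told they are not at risk, which is the costlier error for an at-home check.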

In [None]:
plot1=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("age",title = "Age (years)", scale=alt.Scale(zero=False)),
    y=alt.Y("chol",title = "Serum Cholesterol (mg/dl)", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Serum Cholesterol vs Age")
plot1

The plot above shows no clear trend between serum cholesterol and age. However, patients aged 50 to 65 with a serum cholesterol level of around 200 to 300 mg/dl should be aware that they are at higher risk of heart disease.

In [None]:
plot2=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("age",title = "Age (years)", scale=alt.Scale(zero=False)),
    y=alt.Y("trestbps",title = "Resting Blood Pressure (in mm Hg on admission to the hospital)", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Resting Blood Pressure vs Age")
plot2

The plot above shows no clear trend between resting blood pressure and age. However, patients aged 50 to 65 with a resting blood pressure of around 130 mm Hg should be aware that they are at higher risk of heart disease.

In [None]:
plot3=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("age",title = "Age (years)", scale=alt.Scale(zero=False)),
    y=alt.Y("thalach",title = "Maximum Heart Rate Achieved", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Maximum Heart Rate Achieved vs Age")
plot3

The plot above shows no clear trend between maximum heart rate achieved and age. However, patients aged 50 to 65 with a maximum heart rate of 100 to 160 should be aware that they are at higher risk of heart disease. As in the previous analyses, patients aged 50 to 55 seem to be at higher risk of heart disease.

In [None]:
plot4=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("trestbps",title = "Resting Blood Pressure (in mm Hg on admission to the hospital)", scale=alt.Scale(zero=False)),
    y=alt.Y("chol",title = "Serum Cholesterol (mg/dl)", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Serum Cholesterol vs Resting Blood Pressure")
plot4

The plot above shows no clear trend between serum cholesterol and resting blood pressure. However, patients with a resting blood pressure of around 110 to 140 mm Hg and a serum cholesterol level of 200 to 300 mg/dl should be aware that they are at higher risk of heart disease.

In [None]:
plot5=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("trestbps",title = "Resting Blood Pressure", scale=alt.Scale(zero=False)),
    y=alt.Y("thalach",title = "Maximum Heart Rate Achieved", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Maximum Heart Rate Achieved vs Resting Blood Pressure")
plot5

The plot above shows no clear trend between maximum heart rate achieved and resting blood pressure. However, patients with a resting blood pressure of around 110 to 140 mm Hg and a maximum heart rate of 100 to 160 should be aware that they are at higher risk of heart disease.

In [None]:
plot6=alt.Chart(heart_test_predictions).mark_point().encode(
    x=alt.X("chol",title = "Serum Cholesterol (mg/dl)", scale=alt.Scale(zero=False)),
    y=alt.Y("thalach",title = "Maximum Heart Rate Achieved", scale=alt.Scale(zero=False)),
    color=alt.Color("predicted:N",title = "Diagnosis of Heart Disease")
).properties(title="Maximum Heart Rate Achieved vs Serum Cholesterol")
plot6

The plot above shows no clear trend between maximum heart rate achieved and serum cholesterol. However, patients with a maximum heart rate of 100 to 160 and a serum cholesterol level of around 200 to 300 mg/dl should be aware that they are at higher risk of heart disease.