# Proposal

In [33]:
pip install -U altair

Note: you may need to restart the kernel to use updated packages.


In [34]:
pip install ucimlrepo

Note: you may need to restart the kernel to use updated packages.


In [35]:
import altair as alt
import pandas as pd 
from ucimlrepo import fetch_ucirepo 
  
# Fetching the dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 

## Introduction
Heart disease is often attributed to two broad causes in the medical sciences: lifestyle and genetics. While it is difficult to obtain genetic data and use it to predict whether an individual will develop heart disease, it is easier to make such a prediction from their lifestyle and present health.

This project aims to predict the prevalence of heart disease in patients based on a number of lifestyle variables that give us insight into the individual's current health. The data set classifies the presence of heart disease into 5 categories, ranging from 0 (absence) to 4 (highest prevalence), and we aim to correctly predict the class of a new observation from the variables we have. It provides 13 explanatory variables and 303 observations, which we will use as inputs to a KNN model.

Our aim is therefore not only to build a relevant KNN model, but also to train it, test it, and improve its accuracy so that a new observation is classified correctly with sufficiently high probability.

## Preliminary exploratory data analysis

In [36]:
heart_disease.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,age,Feature,Integer,Age,,years,no
1,sex,Feature,Categorical,Sex,,,no
2,cp,Feature,Categorical,,,,no
3,trestbps,Feature,Integer,,resting blood pressure (on admission to the ho...,mm Hg,no
4,chol,Feature,Integer,,serum cholestoral,mg/dl,no
5,fbs,Feature,Categorical,,fasting blood sugar > 120 mg/dl,,no
6,restecg,Feature,Categorical,,,,no
7,thalach,Feature,Integer,,maximum heart rate achieved,,no
8,exang,Feature,Categorical,,exercise induced angina,,no
9,oldpeak,Feature,Integer,,ST depression induced by exercise relative to ...,,no


The feature data is shown below. It is in tidy format, with columns renamed to descriptive names and the undocumented columns dropped.

In [37]:
X = X.rename(columns = {
    "trestps" : "resting_blood_pressure", 
    "chol" : "serum_cholestoral", 
    "fbs" : "fasting_blood_sugar_greater_than_120_mg/dl", 
    "thalach" : "maximum_heart_rate_achieved", 
    "exang" : "exercise_induced_angina", 
    "oldpeak" : "ST_depression_induced_by_exercise_relative_to_rest", 
    "ca" : "number_of_major_vessels"
}).drop(columns = ["cp", "restecg", "slope", "thal"])

X

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels
0,63,1,145,233,1,150,0,2.3,0.0
1,67,1,160,286,0,108,1,1.5,3.0
2,67,1,120,229,0,129,1,2.6,2.0
3,37,1,130,250,0,187,0,3.5,0.0
4,41,0,130,204,0,172,0,1.4,0.0
...,...,...,...,...,...,...,...,...,...
298,45,1,110,264,0,132,0,1.2,0.0
299,68,1,144,193,1,141,0,3.4,2.0
300,57,1,130,131,0,115,1,1.2,1.0
301,57,0,130,236,0,174,0,0.0,1.0


In [38]:
y

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0
...,...
298,1
299,2
300,3
301,1


Table showing the data type of each column:

In [39]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 9 columns):
 #   Column                                              Non-Null Count  Dtype  
---  ------                                              --------------  -----  
 0   age                                                 303 non-null    int64  
 1   sex                                                 303 non-null    int64  
 2   trestbps                                            303 non-null    int64  
 3   serum_cholestoral                                   303 non-null    int64  
 4   fasting_blood_sugar_greater_than_120_mg/dl          303 non-null    int64  
 5   maximum_heart_rate_achieved                         303 non-null    int64  
 6   exercise_induced_angina                             303 non-null    int64  
 7   ST_depression_induced_by_exercise_relative_to_rest  303 non-null    float64
 8   number_of_major_vessels                             299 non-null    float64
dtype

Table showing the number of observations (count), mean, standard deviation (std), and other statistics: 

In [40]:
X.describe()

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,299.0
mean,54.438944,0.679868,131.689769,246.693069,0.148515,149.607261,0.326733,1.039604,0.672241
std,9.038662,0.467299,17.599748,51.776918,0.356198,22.875003,0.469794,1.161075,0.937438
min,29.0,0.0,94.0,126.0,0.0,71.0,0.0,0.0,0.0
25%,48.0,0.0,120.0,211.0,0.0,133.5,0.0,0.0,0.0
50%,56.0,1.0,130.0,241.0,0.0,153.0,0.0,0.8,0.0
75%,61.0,1.0,140.0,275.0,0.0,166.0,1.0,1.6,1.0
max,77.0,1.0,200.0,564.0,1.0,202.0,1.0,6.2,3.0


Histogram of distribution of age:

In [41]:
# combine X and y
Xy = X.assign(presence_of_heart_disease = y)

# Titles are passed as keyword arguments, which works on both Altair 4 and 5
# (the .title() method form requires Altair 5 or later)
age_distribution = alt.Chart(Xy, title = "Distribution of age and severity of heart disease").mark_bar().encode(
    x = alt.X("age", title = "Age (in years)"), 
    y = alt.Y("count()", title = "Counts of Age"), 
    color = alt.Color("presence_of_heart_disease:N", title = [
      "Presence of Heart Disease",
      "(with 0 being absent and",
      "4 being most severe)"
    ])
)

age_distribution

In [None]:
# distribution of gender and age for people with heart disease

# filter for presence_of_heart_disease != 0
Xy_with_disease = Xy[Xy["presence_of_heart_disease"] != 0]

# Titles are passed as keyword arguments, which works on both Altair 4 and 5
age_gender_dist = alt.Chart(Xy_with_disease, title = "Age and gender of people with heart disease").mark_bar().encode(
    x = alt.X("age", title = "Age (in years)"), 
    y = alt.Y("count()", title = "Counts of Age (of people with heart disease)"), 
    color = alt.Color("sex:N", title = "Gender")
)

age_gender_dist

## Methods

In [None]:
X.describe()

We can see that the dataset is already fairly clean. For example, `sex` is already encoded as a binary (0/1) feature, as are the other yes/no features such as fasting blood sugar and exercise-induced angina.
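One exception is visible in the `X.info()` output: `number_of_major_vessels` has 299 non-null values out of 303 rows. The sketch below shows one possible way to count and handle such gaps. It runs on a small made-up stand-in (`X_demo`, `y_demo` are hypothetical, not the real data), and dropping the incomplete rows is an assumption on our part, not a decision already made in this proposal.

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for X and y (hypothetical values, not the real data).
X_demo = pd.DataFrame({
    "age": [63, 67, 37, 41, 56],
    "number_of_major_vessels": [0.0, 3.0, np.nan, 0.0, 1.0],
})
y_demo = pd.DataFrame({"num": [0, 2, 0, 0, 1]})

# Count the missing values in each column.
missing_per_column = X_demo.isna().sum()
print(missing_per_column)

# Assumption: drop rows with any missing feature, and drop the
# same rows from the target so features and labels stay aligned.
complete_rows = X_demo.notna().all(axis=1)
X_clean = X_demo[complete_rows]
y_clean = y_demo[complete_rows]
print(len(X_clean), len(y_clean))
```

Imputing the missing values instead of dropping rows would be another option; with only four incomplete rows either choice should have a small effect.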

We could not identify any uninformative features in the dataset (such as an "id" column) that carry no information for the prediction. However, some features were poorly documented or not documented at all, so we decided to drop them:
- `cp`
- `restecg`
- `slope` 
- `thal`

As the `describe` table shows, the features of the dataset are not yet scaled. We will need to scale them so that kNN classification works correctly: kNN relies on distances between observations, so features with large ranges would otherwise dominate. To do so, we will use `StandardScaler`, as seen in class.
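As a rough illustration of the scaling step, here is a minimal sketch that chains `StandardScaler` and `KNeighborsClassifier` in a scikit-learn pipeline. The data is synthetic (two made-up columns on different scales and a hypothetical binary label), and the split ratio, `random_state`, and `n_neighbors=5` are placeholder assumptions rather than choices made for this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features: two columns on very different scales.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(54, 9, n)
chol = rng.normal(247, 52, n)
X_demo = np.column_stack([age, chol])
y_demo = (age > 54).astype(int)  # hypothetical label that depends on age only

# Hold out a test set; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# The scaler is fitted on the training data only, then applied before kNN.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_train, y_train)

test_accuracy = knn_pipeline.score(X_test, y_test)
print(test_accuracy)
```

Putting the scaler inside the pipeline means the test set is scaled with statistics learned from the training set, which avoids leaking test information into the model.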

We plan to use `GridSearchCV` to run cross-validation over several candidate values and find a good value for the hyperparameter `n_neighbors`.
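The tuning step could look roughly like the sketch below. It uses a synthetic data set from `make_classification` in place of the heart disease data, and the candidate range for `n_neighbors`, the 5 folds, and the accuracy metric are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the heart disease data set).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Scaling happens inside the pipeline, so it is refitted on each training fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Assumed candidate values: odd k from 1 to 29.
param_grid = {"knn__n_neighbors": list(range(1, 31, 2))}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

best_k = grid.best_params_["knn__n_neighbors"]
print(best_k, grid.best_score_)

# The refitted best model is evaluated once on the held-out test set.
print(grid.score(X_test, y_test))
```

The test set is only used after the search has finished, so the reported test accuracy is not influenced by the choice of `n_neighbors`.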

## Expected outcomes and significance

This project aims to predict the likelihood that an individual develops heart disease given their lifestyle, and to identify which behaviours are more likely to lead to heart disease. The analysis should lead to a model that can predict heart disease from factors such as age, diet, or sex. The preliminary results suggest that heart disease is more likely to be present in older individuals and in those with higher blood sugar. As the project moves forward, the goal is to refine those results into the class system described in the introduction.

If successful, this model could help health care professionals identify which patients need specific care to prevent heart disease, and could even serve as a self-assessment tool for individuals who want to check their propensity for the disease.

The project also raises further questions. One concerns medical datasets and their biases: as with all data, this information reflects the biases around us, and in situations where a predictive model can actively hurt individuals, a larger dataset could help balance out this effect. Another is which treatment plan may be effective given the individual factors used to assess a person’s likelihood of developing the disease. Health varies greatly between individuals, and also across the variables used in this project. Exploring these questions could further the usefulness of this project.