### Predict the presence of heart disease given the parameters 

This dataset contains various features of patients and whether or not they had heart disease. The data is from Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci

Various features that were recorded are:
- age
- sex
- chest pain type (**cp**) - 0: asymptomatic, 1: atypical angina, 2: non-anginal pain, 3: typical angina
- resting blood pressure in mmHg (**trestbps**) -> renamed to **restingBP**
- serum cholesterol in mg/dl (**chol**)
- fasting blood sugar (**fbs**) - 1 if fasting blood sugar > 120 mg/dl, 0 otherwise -> renamed to **fastingBS**
- resting electrocardiographic results (values 0, 1, 2) (**restecg**) - 0: showing probable or definite left ventricular hypertrophy by Estes' criteria, 1: normal, 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
- maximum heart rate achieved (**thalach**) -> renamed to **maxHR**
- exercise induced angina (**exang**)
- ST depression induced by exercise relative to rest (**oldpeak**)
- the slope of the peak exercise ST segment (**slope**) - 0: downsloping; 1: flat; 2: upsloping
- number of major vessels (0-3) colored by fluoroscopy (**ca**) -> renamed to **numVessels**
- thal: 1 = fixed defect; 2 = normal; 3 = reversible defect (**thal**)

- target: 0 = heart disease, 1 = healthy

Note: rows **93, 139, 164, 165 and 252** have ca = 4, which is invalid because ca only ranges from 0 to 3. In the original Cleveland dataset these values are NaNs, so the rows should be removed.
Rows **49 and 282** have thal = 0, which is also invalid. They are also NaNs in the original dataset.
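
Instead of hardcoding row numbers, the invalid rows described in the note could also be located by value. The following is a minimal sketch, not the method used in this notebook: the helper name `find_invalid_rows` and the demo frame are made up for illustration, and it assumes the valid codes are 0-3 for `ca` and 1-3 for `thal`.

```python
import pandas as pd

def find_invalid_rows(df: pd.DataFrame) -> list:
    """Return the index labels of rows whose ca or thal code is out of range."""
    bad_ca = ~df["ca"].isin([0, 1, 2, 3])    # ca is documented as 0-3
    bad_thal = ~df["thal"].isin([1, 2, 3])   # assumption: thal = 0 marks a missing value
    return df.index[bad_ca | bad_thal].tolist()

# Tiny made-up frame for illustration only (these are not rows from heart.csv)
demo = pd.DataFrame({"ca": [0, 4, 2, 1], "thal": [2, 2, 0, 3]})
invalid = find_invalid_rows(demo)
print(invalid)
```

Run against the full dataset, the returned labels could be compared with the seven row numbers listed in the note before dropping anything.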


In [25]:
import pandas as pd

In [26]:
heart_data = pd.read_csv('../data/heart.csv')
heart_data.head()

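|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |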


In [27]:
heart_data.shape

(303, 14)

Considerations for selecting a particular ML model

- The dataset is small: 303 observations and 13 features. With so few observations relative to the number of features, this could be considered **fat data** in that sense.
- The model needs to be **explainable**.
- **False positives** are **acceptable**, but **false negatives** should be **avoided**, because missing a potential heart disease can be life-threatening. Recall is therefore an important metric here.
- The trade-off between training time and prediction time is not that important, I believe.
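
To make the recall consideration concrete, here is a minimal sketch in plain Python. It assumes the label coding from the feature list (target 0 = heart disease), so the positive class is 0. The function name `recall` and the example labels are made up for illustration and are not model results.

```python
def recall(y_true, y_pred, positive=0):
    """Share of actual positive cases that the predictions caught."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # disease found
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # disease missed
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up labels: four patients with heart disease (0), two healthy (1)
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
disease_recall = recall(y_true, y_pred)  # one of the four disease cases is missed
print(disease_recall)
```

Each false negative lowers this value, which is why recall on the disease class fits the goal of not missing patients.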

In [28]:
# Check the number of target types
heart_data['target'].unique()

array([1, 0], dtype=int64)

In [29]:
# Check the count of targets
heart_data['target'].value_counts()

1    165
0    138
Name: target, dtype: int64

##### There is almost an even distribution of targets
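
As a worked check of that statement, the class shares can be computed from the counts printed above. This is a small sketch in plain Python; the variable names are chosen here for illustration.

```python
# Counts taken from the value_counts() output above
counts = {1: 165, 0: 138}
total = sum(counts.values())

# Share of each target value in the dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"target={label}: {share:.1%}")
# target=1: 54.5%
# target=0: 45.5%
```

In pandas, `heart_data['target'].value_counts(normalize=True)` should give the same proportions directly.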

In [30]:
# Check the types of features
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


All columns are numeric: 13 integer columns and one float column (oldpeak).

In [31]:
# Check if there are any null values
heart_data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

In [32]:
heart_data['thal'].unique()

array([1, 2, 3, 0], dtype=int64)

Continuous features: age, trestbps, chol, thalach, oldpeak

Categorical features: sex, cp, fbs, restecg, exang, slope, ca, thal
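
Because the categorical features are stored as integers, pandas treats them as numbers by default. One possible way to record the split above is to cast those columns to the `category` dtype. This is only a sketch: the helper `mark_categorical` and the demo frame are hypothetical, and the notebook itself does not perform this step.

```python
import pandas as pd

CATEGORICAL = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

def mark_categorical(df: pd.DataFrame, columns=CATEGORICAL) -> pd.DataFrame:
    """Return a copy in which the listed columns (if present) use the category dtype."""
    present = [c for c in columns if c in df.columns]
    return df.astype({c: "category" for c in present})

# Made-up two-row frame for illustration only
demo = pd.DataFrame({"age": [63, 37], "cp": [3, 2], "thal": [1, 2]})
typed = mark_categorical(demo)
print(typed.dtypes)
```

Continuous columns such as `age` keep their numeric dtype, while the coded columns are marked so that later steps can tell them apart.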

In [33]:
# remove the incorrect entries as mentioned in the EDA_1 description
to_remove = [93, 139, 164, 165, 252, 49, 282]
heart_data.drop(to_remove, axis=0, inplace=True)
heart_data.shape

(296, 14)

In [34]:
# Copy some columns under more descriptive names
heart_data['restingBP'] = heart_data['trestbps']
heart_data['fastingBS'] = heart_data['fbs']
heart_data['maxHR'] = heart_data['thalach']
heart_data['numVessels'] = heart_data['ca']

In [35]:
# Drop the original columns now that the renamed copies exist
heart_data.drop(['trestbps', 'fbs', 'thalach', 'ca'], axis=1, inplace=True)

In [36]:
heart_data.to_csv('../data/heart_clean.csv', index=False)
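
The copy-and-drop renaming above could also be written as a single `DataFrame.rename` call. The sketch below uses a made-up one-row frame, and the name `RENAMES` is introduced here for illustration. Note one difference: `rename` keeps each column in its original position, whereas copying and dropping moves the renamed columns to the end.

```python
import pandas as pd

# Old column name -> more descriptive name, as used in this notebook
RENAMES = {
    "trestbps": "restingBP",
    "fbs": "fastingBS",
    "thalach": "maxHR",
    "ca": "numVessels",
}

# Made-up one-row frame for illustration only
demo = pd.DataFrame(
    {"age": [63], "trestbps": [145], "fbs": [1], "thalach": [150], "ca": [0], "target": [1]}
)
renamed = demo.rename(columns=RENAMES)
print(list(renamed.columns))
```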