# Objectives

We are using the UCI Heart Disease Data after wrangling in the file data/heart.csv

We will try to predict the indication of heart disease in the target variable.

In this notebook we will do feature engineering based on the EDA from the previous notebook, along with pre-processing and preparing the training and test data.

# 1. Load the data from data/heart.csv

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import math

from library.sb_utils import save_file

In [3]:
### Load the data produced by the Data Wrangling notebook.
heart_data = pd.read_csv("../data/heart.csv")
heart_data.info()
heart_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Age         303 non-null    int64
 1   Sex         303 non-null    int64
 2   ChestPain   303 non-null    int64
 3   SystolicBP  303 non-null    int64
 4   Chol        303 non-null    int64
 5   Glucose     303 non-null    int64
 6   RestECG     303 non-null    int64
 7   STMaxRate   303 non-null    int64
 8   STPain      303 non-null    int64
 9   STWave      303 non-null    int64
 10  STSlope     303 non-null    int64
 11  NumColor    303 non-null    int64
 12  Defects     303 non-null    int64
 13  AngioTgt    303 non-null    int64
dtypes: int64(14)
memory usage: 33.3 KB


Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,STSlope,NumColor,Defects,AngioTgt
0,63,1,3,145,233,1,0,150,0,2,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1,2,0,2,1
3,56,1,1,120,236,0,1,178,0,1,2,0,2,1
4,57,0,4,120,354,0,1,163,1,1,2,0,2,1


# 2. Feature Selection and Engineering

## 2A. Candidates

### Best Candidate Features from EDA

1. Chest Pain (ordinal: monotonically negative AngioTgt trend 1-3, drop off at 4)
2. ST Pain    (binary:  negative correlation to AngioTgt) 
        Correlated with ChestPain(+.45)
3. ST Wave    (ordinal: monotonically negative trend 0-6, moderating elbow at 3, no 5s) 
        Correlated with ChestPain(+.33)
4. ST Max Rate (numerical: mostly positive correlation to AngioTgt)  
        Correlated with STPain (-.38) and ChestPain (-.38) and STWave (-.33)
5. Num Color  (categorical: monotonically negative trend 0-3, 4 jumps up to 80%)
6. ST Slope   (3 categories: 0 just above 1, 1 far below 2)  
        Correlated with STWave(-.55) and MaxRate(+.39)
        
### Other Candidate Features

7. Defects   (categorical: ignoring sparsely populated values 0 & 1, 2 and 3 are negative to AngioTgt)
8. Sex       (0=F, 1=M, surprisingly negative trend with AngioTgt)
9. Age       (numerical: mostly negative trend)
10. Systolic BP (numerical: surprisingly a weak negative correlation to AngioTgt in this data)
11. Chol     (numerical: surprisingly uncorrelated to AngioTgt, according to this data)
12. RestECG  (3 values: surprisingly weak correlation to AntioTgt)
13. Glucose  (binary: surprisingly uncorrelated to AntioTgt)


## 2B. Feature Sets

We have a good set of 13 features with little correlation, the biggest correlation magnitude is 0.55.  We'll use all the features:

ChestPain, STPain, STWave, STMaxRate, NumColor, STSlope, Defects, Sex, Age, SystolicBP, Chol, RestECG, Glucose

### Unscaled Set

For non-distance-based algorithms, we will use them unscaled, with dummy variables for categoricals, but not ordinals that showed a trend with the target. 

### Scaled Set

For distance-based algorithms, we will scale numerical and ordinal variables, in addition to dummy variables for categories.  As the standard scaling reduces numericals down to the +/- 2.5 range, with very few 3 sigmas as expected, we'll scale any ordinals down to roughly the same magnitude, 0 to 5 or +/- 3.  Any ordinals that already have a variation less than 5 will be left alone.


## 2C. Encoding Categorical Features with Dummy Variables

* ChestPain is Ordinal. (AngioTgt trends with values.)  Use the 4 values, 1-4 directly.  This is less than 5, so no scaling.
* STPain is already a binary. (1 = Yes)
* STWave is ordinal after Data Wrangling it down to 6 bins.  Bin 5 is empty. Move 6 to 5.  Use 0-5 directly, no scaling.
* STMaxRate is numerical with most values between 80 and 200 and mode at 160.  Standardize it AFTER splitting test and train data sets!
* NumColor is Ordinal. Documentation defined 0-3.  We see very few 4 values and they are against the trend.  Use 0-3 only as Ordinals.  Replace 4 with 0, as their mean AngioTgt scores are similar.  No scaling needed.
* STSlope is categorical. Make dummies. Drop sparse first value (0) to avoid collinearity.  2 binary columns.
* Defects is categorical.  Make dummies.  Drop very sparse first value (0) to avoid collinearity.  3 binary columns.
* Sex is already a binary.  (1 = Male)
* Age is numerical with most values between 40 and 70 and mode at 60. Standardize after splitting for scaled set.

* SystolicBP is continuous and shows little correlation to AngioTgt.  Standardize it after splitting.
* Chol is continuous. Standardize after splitting test and train data sets.
* Glucose is binary.  (1 = diabetes)
* RestECG has three values, but very few 2s. Replace 2 with 0 to make it ordinal and binary.


# 3. Pre-processing

## Unscaled Feature Set

In [4]:
features = heart_data.drop(columns="AngioTgt")

## Replace certain ordinal values as specified above
features = features.replace({'NumColor':4}, 0)
features = features.replace({'STWave':6}, 5)
features = features.replace({'RestECG':2}, 0)

## Use one-hot encoding, dropping one category's column to avoid collinearity
features = pd.get_dummies(features, columns=['STSlope','Defects'], drop_first=True)

## Must wait to scale continuous variables in testing and training sets separately
print(features.NumColor.value_counts(), features.STWave.value_counts(), features.RestECG.value_counts())
features

0    180
1     65
2     38
3     20
Name: NumColor, dtype: int64 0    135
1     83
2     47
3     25
4     11
5      2
Name: STWave, dtype: int64 1    152
0    151
Name: RestECG, dtype: int64


Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,NumColor,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
0,63,1,3,145,233,1,0,150,0,2,0,0,0,1,0,0
1,37,1,2,130,250,0,1,187,0,3,0,0,0,0,1,0
2,41,0,1,130,204,0,0,172,0,1,0,0,1,0,1,0
3,56,1,1,120,236,0,1,178,0,1,0,0,1,0,1,0
4,57,0,4,120,354,0,1,163,1,1,0,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,4,140,241,0,1,123,1,0,0,1,0,0,0,1
299,45,1,3,110,264,0,1,132,0,1,0,1,0,0,0,1
300,68,1,4,144,193,1,1,141,0,3,2,1,0,0,0,1
301,57,1,4,130,131,0,1,115,1,1,1,1,0,0,0,1


In [6]:
features.mean()

Age            54.366337
Sex             0.683168
ChestPain       2.854785
SystolicBP    131.623762
Chol          246.264026
Glucose         0.148515
RestECG         0.501650
STMaxRate     149.646865
STPain          0.326733
STWave          1.009901
NumColor        0.663366
STSlope_1       0.462046
STSlope_2       0.468647
Defects_1       0.059406
Defects_2       0.547855
Defects_3       0.386139
dtype: float64

# 4. Split Training and Test Data


In [18]:
y = heart_data.AngioTgt    
print (y)

X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.20, random_state=21)
print (X_train.shape, X_test.shape)
X_train.head()

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: AngioTgt, Length: 303, dtype: int64
(242, 16) (61, 16)


Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,NumColor,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
281,52,1,4,128,204,1,1,156,1,1,0,1,0,0,0,0
262,53,1,4,123,282,0,1,95,1,2,2,1,0,0,0,1
60,71,0,2,110,265,1,0,130,0,0,1,0,1,0,1,0
76,51,1,2,125,245,1,0,166,0,2,0,1,0,0,1,0
37,54,1,2,150,232,0,0,165,0,2,0,0,1,0,0,1


In [19]:
X_test.head()

Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,NumColor,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
219,48,1,4,130,256,1,0,150,1,0,2,0,1,0,0,1
216,62,0,2,130,263,0,1,97,0,1,1,1,0,0,0,1
259,38,1,3,120,231,0,1,182,1,4,0,1,0,0,0,1
179,57,1,4,150,276,0,0,112,1,1,1,1,0,1,0,0
225,70,1,4,145,174,0,1,125,1,3,0,0,0,0,0,1


## Scaled Feature Set

In [22]:
### Set up a scaler for each column, to be fit and used on training set and re-used on test set, 
### as the scaler did not scale linearly when I did all the columns together.

SSA = StandardScaler()
SSMR = StandardScaler()
SSBP = StandardScaler()
SSC = StandardScaler()

Xs_train = X_train.drop(columns=['Age','STMaxRate','SystolicBP','Chol'])
Xs_train.insert(0, "Age_SS", SSA.fit_transform(X_train[['Age']]))
Xs_train.insert(5, "STMaxRate_SS", SSMR.fit_transform(X_train[['STMaxRate']]))
Xs_train.insert(3, "SystolicBP_SS", SSBP.fit_transform(X_train[['SystolicBP']]))
Xs_train.insert(4, "Chol_SS", SSC.fit_transform(X_train[['Chol']]))

Xs_train

Unnamed: 0,Age_SS,Sex,ChestPain,SystolicBP_SS,Chol_SS,Glucose,RestECG,STMaxRate_SS,STPain,STWave,NumColor,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
281,-0.250145,1,4,-0.169421,-0.808200,1,1,0.226721,1,1,0,1,0,0,0,0
262,-0.137206,1,4,-0.456536,0.692743,0,1,-2.576337,1,2,2,1,0,0,0,1
60,1.895685,0,2,-1.203033,0.365614,1,0,-0.968025,0,0,1,0,1,0,1,0
76,-0.363083,1,2,-0.341690,-0.019243,1,0,0.686238,0,2,0,1,0,0,1,0
37,-0.024268,1,2,1.093882,-0.269400,0,0,0.640287,0,2,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,-0.476021,1,2,0.519653,-0.250157,0,1,0.548383,0,1,1,1,0,0,0,1
120,1.105116,0,4,-0.054575,1.096843,0,1,-1.335639,0,2,2,1,0,0,1,0
48,-0.137206,0,2,-0.169421,-0.577286,0,0,-1.657302,0,0,0,0,1,0,0,0
260,1.330993,0,4,2.701723,-0.346371,1,1,0.640287,1,1,2,1,0,0,0,1


In [23]:
### Reuse the scalers fitted on the training set for the testing set, so as not to leak any information back into training.

Xs_test = X_test.drop(columns=['Age','STMaxRate','SystolicBP','Chol'])
Xs_test.insert(0, "Age_SS", SSA.fit_transform(X_test[['Age']]))
Xs_test.insert(5, "STMaxRate_SS", SSMR.fit_transform(X_test[['STMaxRate']]))
Xs_test.insert(3, "SystolicBP_SS", SSBP.fit_transform(X_test[['SystolicBP']]))
Xs_test.insert(4, "Chol_SS", SSC.fit_transform(X_test[['Chol']]))

Xs_test

Unnamed: 0,Age_SS,Sex,ChestPain,SystolicBP_SS,Chol_SS,Glucose,RestECG,STMaxRate_SS,STPain,STWave,NumColor,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
219,-0.707836,1,4,-0.243657,0.170897,1,0,0.229641,1,0,2,0,1,0,0,1
216,0.714498,0,2,-0.243657,0.308582,0,1,-1.804411,0,1,1,1,0,0,0,1
259,-1.723789,1,3,-0.810948,-0.320835,0,1,1.457748,1,4,0,1,0,0,0,1
179,0.206522,1,4,0.890927,0.564283,0,0,-1.228736,1,1,1,1,0,1,0,0
225,1.527260,1,4,0.607281,-1.441985,0,1,-0.729818,1,3,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44,-1.622194,1,2,0.323635,1.449402,0,0,1.457748,0,0,0,0,1,0,1,0
129,1.933642,0,1,-0.810948,0.426598,0,0,-0.883331,1,0,1,0,1,0,1,0
272,1.222475,1,4,-0.810948,-0.202819,0,1,-2.802248,0,1,0,1,0,0,1,0
9,0.206522,1,2,0.890927,-1.560001,0,1,1.150721,0,2,0,0,1,0,1,0


In [24]:
datapath = '../data'
save_file(X_train, 'X_train.csv', datapath)
save_file(Xs_train, 'Xs_train.csv', datapath)
save_file(X_test, 'X_test.csv', datapath)
save_file(Xs_test, 'Xs_test.csv', datapath)
save_file(y_train, 'y_train.csv', datapath)
save_file(y_test, 'y_test.csv', datapath)

Writing file.  "../data\X_train.csv"
Writing file.  "../data\Xs_train.csv"
Writing file.  "../data\X_test.csv"
Writing file.  "../data\Xs_test.csv"
Writing file.  "../data\y_train.csv"
Writing file.  "../data\y_test.csv"
