# Objectives

We use the UCI Heart Disease data, after wrangling, from the file data/heart.csv.

We will try to predict the indication of heart disease recorded in the target variable, AngioTgt.

In this notebook we will do feature engineering based on the EDA from the previous notebook, along with pre-processing and preparing the training and test data.

# 1. Load the data from data/heart.csv

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import math

from library.sb_utils import save_file

In [2]:
### Load the data produced by the Data Wrangling notebook.
heart_data = pd.read_csv("../data/heart.csv")
heart_data.info()
heart_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Age         303 non-null    int64
 1   Sex         303 non-null    int64
 2   ChestPain   303 non-null    int64
 3   SystolicBP  303 non-null    int64
 4   Chol        303 non-null    int64
 5   Glucose     303 non-null    int64
 6   RestECG     303 non-null    int64
 7   STMaxRate   303 non-null    int64
 8   STPain      303 non-null    int64
 9   STWave      303 non-null    int64
 10  STSlope     303 non-null    int64
 11  NumColor    303 non-null    int64
 12  Defects     303 non-null    int64
 13  AngioTgt    303 non-null    int64
dtypes: int64(14)
memory usage: 33.3 KB


Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,STSlope,NumColor,Defects,AngioTgt
0,63,1,3,145,233,1,0,150,0,2,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1,2,0,2,1
3,56,1,1,120,236,0,1,178,0,1,2,0,2,1
4,57,0,4,120,354,0,1,163,1,1,2,0,2,1


# 2. Feature Selection and Engineering

## 2A. Candidates

### Best Candidate Features from EDA

1. Chest Pain (ordinal: monotonically negative trend 0-3, drop off at 4)
2. ST Pain    (2 values, negative correlation) 
        Correlated with ChestPain(+.45)
3. ST Wave    (ordinal: monotonically negative trend 0-6, elbow at 3, no 5s) 
        Correlated with ChestPain(+.33)
4. ST Max Rate (mostly positive continuous feature)  
        Correlated with STPain (-.38) and ChestPain (-.38) and STWave (-.33)
5. Num Color  (monotonically negative trend 0-3, 4 jumps up to 80%)
6. ST Slope   (0 just above 1, 1 far below 2)  
        Correlated with STWave (-.55) and STMaxRate (+.39)
        
### Other Candidate Features

7. Defects   (ignoring sparsely populated values 0 & 1, 2 and 3 are negative to AngioTgt)
8. Sex       (0=F, 1=M, negative trend)
9. Age       (continuous, mostly negative trend)


### Minimal set of Features (dropping features that are correlated with better features)

1. Chest Pain 
2. Num Color  
3. ST Slope   
4. Defects    
5. Sex        
6. Age        
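
The pruning above was done by eye from the EDA correlations. As a rough sketch of how such pairs could be listed programmatically, the snippet below flags column pairs whose absolute Pearson correlation exceeds a cutoff. The toy values, the helper name `correlated_pairs`, and the 0.3 threshold are assumptions for illustration, not values taken from the notebook.

```python
import pandas as pd

# Toy data standing in for a few heart.csv columns (values are made up).
toy = pd.DataFrame({
    "ChestPain": [1, 2, 3, 4, 4, 2, 1, 3],
    "STPain":    [0, 0, 1, 1, 1, 0, 0, 1],
    "Sex":       [1, 0, 1, 0, 0, 1, 1, 0],
})

def correlated_pairs(df, threshold=0.3):
    """Return (col_a, col_b, r) for every column pair with |r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return pairs

pairs = correlated_pairs(toy, threshold=0.3)
print(pairs)
```

On the real data, one would pass `heart_data.drop(columns="AngioTgt")` and then decide which member of each flagged pair to keep.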

### Decent Features to add in 
7. ST Wave      (ordinal after binning in the Data Wrangling notebook; use as is)
8. ST Max Rate  (continuous, standardize)
9. ST Pain      (binary)

## 2B. Feature Sets

### Minimal Set

ChestPain, NumColor, STSlope, Defects, Sex, Age

### Medium Set

ChestPain, NumColor, STSlope, Defects, Sex, Age, STWave, STMaxRate, STPain

### Full Set

Age, Sex, ChestPain, SystolicBP, Chol, Glucose, RestECG, STMaxRate, STPain, STWave, STSlope, NumColor, Defects
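
One way to keep the three sets consistent is to build each from the previous one. This is only a sketch; the list names `minimal_set`, `medium_set`, and `full_set` are hypothetical and do not appear elsewhere in the notebook.

```python
# Column names as they appear in heart.csv.
minimal_set = ["ChestPain", "NumColor", "STSlope", "Defects", "Sex", "Age"]
medium_set = minimal_set + ["STWave", "STMaxRate", "STPain"]
full_set = medium_set + ["SystolicBP", "Chol", "Glucose", "RestECG"]

# Each set strictly contains the previous one; the full set is every
# column of heart.csv except the target, AngioTgt.
assert set(minimal_set) < set(medium_set) < set(full_set)
print(len(minimal_set), len(medium_set), len(full_set))
```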

## 2C. Encoding Categorical Features with Dummy Variables

### Minimal Set

* ChestPain is Ordinal. (AngioTgt trends with values.)  Use the 4 values, 1-4 directly.
* NumColor is Ordinal. Documentation defined 0-3.  We see very few 4 values and they are against the trend.  Use 0-3 only as Ordinals.  Replace 4 with 0, as their mean AngioTgt scores are similar.
* STSlope is categorical. Make dummies. Drop the sparse first value (0) to avoid collinearity.
* Defects is categorical.  Make dummies.  Drop very sparse first value (0).
* Sex is already a binary.  (1 = Male)
* Age is continuous with most values between 40 and 70 and mode at 60. Standardize after splitting.
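
The two encoding steps described above can be sketched on a toy frame as follows. The values are invented; only the column names come from the data set.

```python
import pandas as pd

# Made-up rows; NumColor contains one out-of-range 4.
toy = pd.DataFrame({
    "NumColor": [0, 1, 4, 3],
    "STSlope":  [0, 1, 2, 1],
})

# Replace the value 4 with 0, but only in the NumColor column.
toy = toy.replace({"NumColor": 4}, 0)

# One dummy column per STSlope level, dropping the first level (0)
# so the remaining dummies are not collinear.
encoded = pd.get_dummies(toy, columns=["STSlope"], drop_first=True)
print(encoded)
```

A row with both `STSlope_1` and `STSlope_2` equal to zero then represents the dropped level, 0.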


### Medium Set additions

* STWave is ordinal after the Data Wrangling notebook reduced it to 6 bins.  Use values 0-4 and 6 directly.
* STMaxRate is continuous with most values between 80 and 200 and mode at 160.  Standardize it with sklearn's scaler.  Must scale AFTER splitting test and train data sets!
* STPain is already a binary. (1 = Yes)
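
The reason for scaling after the split is that the scaler's mean and standard deviation should be learned from the training rows only, so that no information from the test rows leaks into training. A minimal sketch with random stand-in data (the values and the 50-row size are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame({"STMaxRate": rng.integers(80, 200, size=50)})
y = pd.Series(rng.integers(0, 2, size=50))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=21)

# Fit on the training rows only, then apply the same fitted scaler to both sets.
scaler = StandardScaler().fit(X_train[["STMaxRate"]])
X_train = X_train.assign(
    STMaxRate_SS=scaler.transform(X_train[["STMaxRate"]]).ravel())
X_test = X_test.assign(
    STMaxRate_SS=scaler.transform(X_test[["STMaxRate"]]).ravel())

# The training column has mean ~0 and unit variance; the test column generally will not.
print(X_train["STMaxRate_SS"].mean(), X_test["STMaxRate_SS"].mean())
```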

### Full Set additions

* SystolicBP is continuous and shows little correlation to AngioTgt.  Standardize it after splitting.
* Chol is continuous. Standardize after splitting test and train data sets!
* Glucose is binary.  (1 = diabetes)
* RestECG has three values, but very few 2s. Replace 2 with 0 to make it ordinal and binary.




# 3. Pre-processing

## Minimal Feature Set

In [137]:
min_feature = heart_data[['ChestPain','NumColor','STSlope','Defects','Sex','Age']]
min_feature = min_feature.replace({'NumColor':4}, 0)
min_feature = pd.get_dummies(min_feature, columns=['STSlope','Defects'], drop_first=True)
#min_feature.Age = pd.cut(min_feature.Age, [0,35,45,55,65,75,120], labels=[30,40,50,60,70,80])
#min_feature = pd.get_dummies(min_feature, columns=['Age'])
#min_feature.drop(columns='Age_60', inplace=True)

## Must wait to scale Age in testing and training sets separately
print(min_feature.NumColor.value_counts())
min_feature

0    180
1     65
2     38
3     20
Name: NumColor, dtype: int64


Unnamed: 0,ChestPain,NumColor,Sex,Age,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
0,3,0,1,63,0,0,1,0,0
1,2,0,1,37,0,0,0,1,0
2,1,0,0,41,0,1,0,1,0
3,1,0,1,56,0,1,0,1,0
4,4,0,0,57,0,1,0,1,0
...,...,...,...,...,...,...,...,...,...
298,4,0,0,57,1,0,0,0,1
299,3,0,1,45,1,0,0,0,1
300,4,2,1,68,1,0,0,0,1
301,4,1,1,57,1,0,0,0,1


## Medium Feature Set

In [138]:
med_feature = pd.concat([min_feature, heart_data[['STWave','STMaxRate','STPain']] ], axis=1)
med_feature.head(10)
## Must wait to scale STMaxRate in testing and training sets separately

Unnamed: 0,ChestPain,NumColor,Sex,Age,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,STWave,STMaxRate,STPain
0,3,0,1,63,0,0,1,0,0,2,150,0
1,2,0,1,37,0,0,0,1,0,3,187,0
2,1,0,0,41,0,1,0,1,0,1,172,0
3,1,0,1,56,0,1,0,1,0,1,178,0
4,4,0,0,57,0,1,0,1,0,1,163,1
5,4,0,1,57,1,0,1,0,0,0,148,0
6,1,0,0,56,1,0,0,1,0,1,153,0
7,1,0,1,44,0,1,0,0,1,0,173,0
8,2,0,1,52,0,1,0,0,1,0,162,0
9,2,0,1,57,0,1,0,1,0,2,174,0


## Full Feature Set

In [141]:
full_feature = pd.concat([med_feature, heart_data[['SystolicBP','Chol','Glucose','RestECG']]], axis=1)
full_feature = full_feature.replace({'RestECG':2}, 0)   ## RestECG lives in full_feature, not min_feature
full_feature.head()
## Must wait and scale SystolicBP and Chol in testing and training sets separately

Unnamed: 0,ChestPain,NumColor,Sex,Age,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,STWave,STMaxRate,STPain,SystolicBP,Chol,Glucose,RestECG
0,3,0,1,63,0,0,1,0,0,2,150,0,145,233,1,0
1,2,0,1,37,0,0,0,1,0,3,187,0,130,250,0,1
2,1,0,0,41,0,1,0,1,0,1,172,0,130,204,0,0
3,1,0,1,56,0,1,0,1,0,1,178,0,120,236,0,1
4,4,0,0,57,0,1,0,1,0,1,163,1,120,354,0,1


# 4. Split Training and Test Data


## Minimal Feature Split


In [164]:
y = heart_data.AngioTgt      ## AngioTgt was never entered into the feature dataframes

Xmf_train, Xmf_test, ymf_train, ymf_test = train_test_split(min_feature, y, test_size=0.20, random_state=21)

## NOW standardize Age in the training set
SS = StandardScaler()
SS.fit(Xmf_train[['Age']])
Xmf_train.insert(3, "Age_SS", SS.transform(Xmf_train[['Age']]))
print(Xmf_train)
Xmf_train = Xmf_train.drop(columns="Age")

Xmf_test.insert(3, "Age_SS", SS.transform(Xmf_test[['Age']]))
Xmf_test = Xmf_test.drop(columns="Age")

Xmf_test

     ChestPain  NumColor  Sex    Age_SS  Age  STSlope_1  STSlope_2  Defects_1  \
281          4         0    1 -0.250145   52          1          0          0   
262          4         2    1 -0.137206   53          1          0          0   
60           2         1    0  1.895685   71          0          1          0   
76           2         0    1 -0.363083   51          1          0          0   
37           2         0    1 -0.024268   54          0          1          0   
..         ...       ...  ...       ...  ...        ...        ...        ...   
188          2         1    1 -0.476021   50          1          0          0   
120          4         2    0  1.105116   64          1          0          0   
48           2         0    0 -0.137206   53          0          1          0   
260          4         2    0  1.330993   66          1          0          0   
207          4         2    0  0.653363   60          1          0          0   

     Defects_2  Defects_3  

Unnamed: 0,ChestPain,NumColor,Sex,Age_SS,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3
219,4,2,1,-0.701898,0,1,0,0,1
216,2,1,0,0.879240,1,0,0,0,1
259,3,0,1,-1.831282,1,0,0,0,1
179,4,1,1,0.314547,1,0,1,0,0
225,4,0,1,1.782747,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...
44,2,0,1,-1.718344,0,1,0,1,0
129,1,1,0,2.234500,0,1,0,1,0
272,4,0,1,1.443932,1,0,0,1,0
9,2,0,1,0.314547,0,1,0,1,0


## Medium Feature Split


In [187]:
y = heart_data.AngioTgt     ## AngioTgt was never entered into the feature dataframes

XMdf_train, XMdf_test, yMdf_train, yMdf_test = \
        train_test_split(med_feature, y, test_size=0.20, stratify=y, random_state=21)

## Now scale Age and STMaxRate in the train and test features
SSA = StandardScaler()
SSA.fit(XMdf_train[['Age']])
SSMR = StandardScaler()
SSMR.fit(XMdf_train[['STMaxRate']])
#XMdf_train['Age_SS'] = SSA.fit_transform(XMdf_train[['Age']])
XMdf_train.insert(3, "Age_SS", SSA.transform(XMdf_train[['Age']]))
XMdf_train.insert(10, "STMaxRate_SS", SSMR.transform(XMdf_train[['STMaxRate']]))
XMdf_train = XMdf_train.drop(columns=['Age','STMaxRate'])
print(XMdf_train)

XMdf_test.insert(3, "Age_SS", SSA.transform(XMdf_test[['Age']]))
XMdf_test.insert(10, "STMaxRate_SS", SSMR.transform(XMdf_test[['STMaxRate']]))
XMdf_test = XMdf_test.drop(columns=['Age','STMaxRate'])
XMdf_test

     ChestPain  NumColor  Sex    Age_SS  STSlope_1  STSlope_2  Defects_1  \
96           4         0    0  0.803350          1          0          0   
52           2         3    1  0.803350          1          0          0   
168          4         1    1  0.910464          1          0          0   
228          3         0    1  0.482010          1          0          0   
17           3         0    0  1.231804          0          0          0   
..         ...       ...  ...       ...        ...        ...        ...   
205          4         1    1 -0.267783          0          1          0   
109          4         0    0 -0.482010          0          1          0   
123          2         0    0 -0.053557          0          1          0   
107          4         0    0 -1.017577          1          0          0   
51           4         0    1  1.231804          1          0          0   

     Defects_2  Defects_3  STMaxRate_SS  STWave  STPain  
96           1          0    

Unnamed: 0,ChestPain,NumColor,Sex,Age_SS,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,STMaxRate_SS,STWave,STPain
148,2,0,1,-1.124690,0,1,0,1,0,0.866749,0,0
14,3,0,0,0.374897,0,1,0,1,0,0.567651,1,0
43,4,0,0,-0.160670,1,0,0,1,0,-0.244187,0,0
78,1,0,1,-0.267783,0,1,0,1,0,1.507673,0,0
118,1,0,0,-0.910464,0,1,0,1,0,0.994934,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
88,2,0,0,-0.053557,1,0,0,1,0,0.396738,2,0
221,4,0,1,0.053557,0,0,0,0,1,-1.611493,6,1
1,2,0,1,-1.874484,0,0,0,1,0,1.635858,3,0
265,4,1,1,1.231804,0,1,0,1,0,-0.714198,0,1
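
The medium split passes `stratify=y`, which keeps the class proportions of the target the same in the training and test sets. A small sketch of that effect, using a synthetic 30/70 target (the data here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target: 30 positives and 70 negatives.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 30 + [0] * 70)

_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=21)

# With stratification, both parts keep the 30% positive rate.
print(y_train_s.mean(), y_test_s.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the overall rate purely by chance.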


## Full Feature Split


In [193]:
y = heart_data.AngioTgt       
XFF_train, XFF_test, yFF_train, yFF_test = train_test_split(full_feature, y, test_size=0.20, random_state=21)

#SS = StandardScaler()
#scaled_cols = \
#    pd.DataFrame(SS.fit_transform(XFF_train[['Age','STMaxRate','SystolicBP','Chol']]),\
#                 columns=['Age_SS', 'STMaxRate_SS', 'SystolicBP_SS', 'Chol_SS'])
#XFF_train = pd.concat([XFF_train.drop(columns=['Age','STMaxRate','SystolicBP','Chol']), scaled_cols], axis=1)
#XFF_train = pd.concat([XFF_train[['Age','STMaxRate','SystolicBP','Chol']], scaled_cols], axis=1, join='inner')
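#### Doesn't work.  Scaled values do not line up with the real values: the new DataFrame gets
#### a fresh 0..n-1 index while XFF_train keeps its shuffled original index, and pd.concat aligns rows on index.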


### Set up a scaler for each column, to be fit and used on training set and re-used on test set.
SSA = StandardScaler()
SSA.fit(XFF_train[['Age']])
SSMR = StandardScaler()
SSMR.fit(XFF_train[['STMaxRate']])
SSBP = StandardScaler()
SSBP.fit(XFF_train[['SystolicBP']])
SSC = StandardScaler()
SSC.fit(XFF_train[['Chol']])

XFF_train.insert(3, "Age_SS", SSA.transform(XFF_train[['Age']]))
XFF_train.insert(10, "STMaxRate_SS", SSMR.transform(XFF_train[['STMaxRate']]))
XFF_train.insert(15, "SystolicBP_SS", SSBP.transform(XFF_train[['SystolicBP']]))
XFF_train.insert(16, "Chol_SS", SSC.transform(XFF_train[['Chol']]))
XFF_train = XFF_train.drop(columns=['Age','STMaxRate','SystolicBP','Chol'])
#print(XFF_train.head(10), XFF_train.tail(10))


XFF_test.insert(3, "Age_SS", SSA.transform(XFF_test[['Age']]))
XFF_test.insert(10, "STMaxRate_SS", SSMR.transform(XFF_test[['STMaxRate']]))
XFF_test.insert(15, "SystolicBP_SS", SSBP.transform(XFF_test[['SystolicBP']]))
XFF_test.insert(16, "Chol_SS", SSC.transform(XFF_test[['Chol']]))

XFF_test = XFF_test.drop(columns=['Age','STMaxRate','SystolicBP','Chol'])

print(XFF_train, XFF_test)


     ChestPain  NumColor  Sex    Age_SS  STSlope_1  STSlope_2  Defects_1  \
281          4         0    1 -0.250145          1          0          0   
262          4         2    1 -0.137206          1          0          0   
60           2         1    0  1.895685          0          1          0   
76           2         0    1 -0.363083          1          0          0   
37           2         0    1 -0.024268          0          1          0   
..         ...       ...  ...       ...        ...        ...        ...   
188          2         1    1 -0.476021          1          0          0   
120          4         2    0  1.105116          1          0          0   
48           2         0    0 -0.137206          0          1          0   
260          4         2    0  1.330993          1          0          0   
207          4         2    0  0.653363          1          0          0   

     Defects_2  Defects_3  STMaxRate_SS  STWave  STPain  SystolicBP_SS  \
281          
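
The commented-out single-scaler attempt in the cell above fails because the scaled values are wrapped in a DataFrame with a fresh index. A possible fix, sketched here on a three-row toy frame, is to reuse the training frame's index so that `pd.concat` lines the rows up. The frame contents and the names `scaled` and `X_train_scaled` are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training frame with a shuffled, non-contiguous index, as produced by train_test_split.
X_train = pd.DataFrame({"Age": [52, 71, 54], "Sex": [1, 0, 1]},
                       index=[281, 60, 37])
cols = ["Age"]

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(X_train[cols]),
    columns=[c + "_SS" for c in cols],
    index=X_train.index,          # keep the original row labels
)

# Rows now align on index, so no NaNs and no extra rows appear.
X_train_scaled = pd.concat([X_train.drop(columns=cols), scaled], axis=1)
print(X_train_scaled)
```

The same fitted `scaler` would then be applied to the test frame with `transform`, again passing `index=X_test.index`.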

In [194]:
datapath = '../data'
save_file(XFF_train, 'XFF_train.csv', datapath)
save_file(XMdf_train, 'XMdf_train.csv', datapath)
save_file(Xmf_train, 'Xmf_train.csv', datapath)
save_file(XFF_test, 'XFF_test.csv', datapath)
save_file(XMdf_test, 'XMdf_test.csv', datapath)
save_file(Xmf_test, 'Xmf_test.csv', datapath)
save_file(yFF_train, 'yFF_train.csv', datapath)
save_file(yMdf_train, 'yMdf_train.csv', datapath)
save_file(ymf_train, 'ymf_train.csv', datapath)
save_file(yFF_test, 'yFF_test.csv', datapath)
save_file(yMdf_test, 'yMdf_test.csv', datapath)
save_file(ymf_test, 'ymf_test.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\XFF_train.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\XMdf_train.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\Xmf_train.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\XFF_test.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\XMdf_test.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\Xmf_test.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\yFF_train.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\yMdf_train.csv"
A file already exists with this name.

Do you want to overwrite? (Y/N)y
Writing file.  "../data\ymf_train.csv"
A