# Objectives

We are using the UCI Heart Disease data, after wrangling, from the file data/heart.csv.

We will try to predict the indication of heart disease recorded in the target variable, AngioTgt.

In this notebook we will do feature engineering based on the EDA from the previous notebook, along with pre-processing and preparing the training and test data.

# 1. Load the data from data/heart.csv

In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import math

from library.sb_utils import save_file

In [2]:
heart_data = pd.read_csv("../data/heart.csv")
heart_data.info()
heart_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Age         303 non-null    int64
 1   Sex         303 non-null    int64
 2   ChestPain   303 non-null    int64
 3   SystolicBP  303 non-null    int64
 4   Chol        303 non-null    int64
 5   Glucose     303 non-null    int64
 6   RestECG     303 non-null    int64
 7   STMaxRate   303 non-null    int64
 8   STPain      303 non-null    int64
 9   STWave      303 non-null    int64
 10  STSlope     303 non-null    int64
 11  NumColor    303 non-null    int64
 12  Defects     303 non-null    int64
 13  AngioTgt    303 non-null    int64
dtypes: int64(14)
memory usage: 33.3 KB


Unnamed: 0,Age,Sex,ChestPain,SystolicBP,Chol,Glucose,RestECG,STMaxRate,STPain,STWave,STSlope,NumColor,Defects,AngioTgt
0,63,1,3,145,233,1,0,150,0,2,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1,2,0,2,1
3,56,1,1,120,236,0,1,178,0,1,2,0,2,1
4,57,0,4,120,354,0,1,163,1,1,2,0,2,1


# 2. Feature Selection and Engineering

## 2A. Candidates

### Best Candidate Features from EDA

1. Chest Pain (monotonically negative trend 0-3, drop off at 4)
2. ST Pain    (2 values, negative correlation) 
        Correlated with ChestPain(+.45)
3. ST Wave    (monotonically negative trend 0-6, elbow at 3, no 5s) 
        Correlated with ChestPain(+.33)
4. ST Max Rate (mostly positive continuous feature)  
        Correlated with STPain (-.38) and ChestPain (-.38) and STWave (-.33)
5. Num Color  (monotonically negative trend 0-3, 4 jumps up to 80%)
6. ST Slope   (0 just above 1, 1 far below 2)  
        Correlated with STWave (-.55) and STMaxRate (+.39)
        
### Other Candidate Features

7. Defects   (ignoring the sparsely populated values 0 and 1, values 2 and 3 show a negative trend with AngioTgt)
8. Sex       (0=F, 1=M, negative trend)
9. Age       (continuous, mostly negative trend)
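
The correlations quoted in the candidate lists can be checked with pandas. The sketch below is a minimal illustration, assuming the column names from `heart_data`; the small frame is made-up toy data (not rows from data/heart.csv), and `rank_by_target_corr` is a hypothetical helper name.

```python
import pandas as pd

def rank_by_target_corr(df, target="AngioTgt"):
    """Return features sorted by absolute Pearson correlation with the target."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "ChestPain": [1, 2, 3, 4, 4, 1, 2, 4],
    "STPain":    [0, 0, 0, 1, 1, 0, 0, 1],
    "Chol":      [210, 250, 230, 240, 220, 260, 205, 235],
    "AngioTgt":  [1, 1, 1, 0, 0, 1, 1, 0],
})

ranked = rank_by_target_corr(toy)
print(ranked)

# Pairwise correlation between two candidate features
print(toy["ChestPain"].corr(toy["STPain"]))
```

On the real data the same call would be `rank_by_target_corr(heart_data)`; the signs show the direction of each trend and the ordering gives a first-pass ranking of candidates.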


### Minimal set of Features (tossing out features correlated to better features)

1. Chest Pain
2. Num Color
3. ST Slope
4. Defects
5. Sex
6. Age

### Decent Features to add in

7. ST Wave
8. ST Max Rate
9. ST Pain
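
One way to reproduce the "toss out features correlated to better features" step is a greedy filter over the ranked candidates. This is a sketch under stated assumptions, not necessarily the procedure followed in the EDA: the pairwise correlations are the ones quoted in section 2A, unlisted pairs are assumed to be negligible, and the 0.30 cutoff and the helper names are illustrative choices.

```python
# Pairwise correlations quoted in the candidate list; unlisted pairs are
# assumed to fall below the cutoff and are treated as 0 here.
PAIR_CORR = {
    ("STPain", "ChestPain"): 0.45,
    ("STWave", "ChestPain"): 0.33,
    ("STMaxRate", "STPain"): -0.38,
    ("STMaxRate", "ChestPain"): -0.38,
    ("STMaxRate", "STWave"): -0.33,
    ("STSlope", "STWave"): -0.55,
    ("STSlope", "STMaxRate"): 0.39,
}

def pair_corr(a, b):
    """Look up a correlation regardless of the order of the pair."""
    return PAIR_CORR.get((a, b), PAIR_CORR.get((b, a), 0.0))

def greedy_select(ranked, cutoff):
    """Keep a feature only if it is not correlated with one already kept."""
    kept = []
    for feat in ranked:
        if all(abs(pair_corr(feat, k)) < cutoff for k in kept):
            kept.append(feat)
    return kept

# Best candidates in the order they are ranked in section 2A
ranked = ["ChestPain", "STPain", "STWave", "STMaxRate", "NumColor", "STSlope"]
kept = greedy_select(ranked, cutoff=0.30)  # 0.30 is an illustrative cutoff
print(kept)
```

With these inputs the filter keeps ChestPain, NumColor and STSlope, and sets aside STPain, STWave and STMaxRate, which matches the split between the minimal set and the features added back in the medium set.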

## 2B. Feature Sets

### Minimal Set

ChestPain, NumColor, STSlope, Defects, Sex, Age

### Medium Set

ChestPain, NumColor, STSlope, Defects, Sex, Age, STWave, STMaxRate, STPain

### Full Set

Age, Sex, ChestPain, SystolicBP, Chol, Glucose, RestECG, STMaxRate, STPain, STWave, STSlope, NumColor, Defects
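
The three sets are nested, so they can be written as lists that build on each other. A small sketch, where the constant names `MINIMAL`, `MEDIUM`, `FULL` and `FEATURE_SETS` are hypothetical and not used elsewhere in this notebook:

```python
# Raw column names, before any encoding
MINIMAL = ["ChestPain", "NumColor", "STSlope", "Defects", "Sex", "Age"]
MEDIUM = MINIMAL + ["STWave", "STMaxRate", "STPain"]
FULL = MEDIUM + ["SystolicBP", "Chol", "Glucose", "RestECG"]

FEATURE_SETS = {"minimal": MINIMAL, "medium": MEDIUM, "full": FULL}

for name, cols in FEATURE_SETS.items():
    print(f"{name}: {len(cols)} features")
```

Keeping the lists in one place makes it harder for the target column, AngioTgt, to slip into a feature frame by accident.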

## 2C. Encoding Categorical Features with Dummy Variables

### Minimal Set

* ChestPain is ordinal. (AngioTgt trends with values.)  Use the 4 values, 1-4 directly.
* NumColor: documentation defined 0-3.  We see very few 4 values and they are against the trend.  Use 0-3 only as ordinals.  Replace 4 with 0, as their mean AngioTgt scores are similar.
* STSlope has no trend with the target.  Categorical. Make dummies. Drop first value.
* Defects is categorical.  Make dummies.  Drop first value.
* Sex is already a binary.  (1 = Male)
* Age is continuous with most values between 40 and 70 and mode at 60.  Bin it by rounding to the nearest 10, then one-hot encode it with 60 as the baseline, leaving these dummy variables:
Age_30, Age_40, Age_50, Age_70, Age_80
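
The two baseline choices above work differently in pandas: `drop_first=True` always drops the lowest category, while a chosen baseline such as 60 has to be dropped explicitly. The sketch below illustrates both on made-up toy values (not rows from data/heart.csv).

```python
import pandas as pd

# Made-up toy values, for illustration only
toy = pd.DataFrame({
    "STSlope": [0, 1, 2, 1, 2, 0, 1, 2],
    "Age":     [29, 37, 44, 52, 60, 63, 71, 77],
})

# drop_first=True removes the lowest category (STSlope_0), which becomes the baseline
slope = pd.get_dummies(toy["STSlope"], prefix="STSlope", drop_first=True)

# Age: right-closed bins centred on each decade, then drop the chosen baseline (60)
age_bin = pd.cut(toy["Age"], bins=[0, 35, 45, 55, 65, 75, 120],
                 labels=[30, 40, 50, 60, 70, 80])
age = pd.get_dummies(age_bin, prefix="Age").drop(columns="Age_60")

encoded = pd.concat([slope, age], axis=1).astype(int)
print(encoded)
```

Rows whose age falls in the 60 bin have zeros in every `Age_*` column, which is what makes 60 the reference level. Because the bins are right-closed, an age of exactly 35 lands in the 30 bin rather than the 40 bin.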


### Medium Set additions

* STWave is ordinal after data wrangling reduced it to 6 bins.  Use the values 0-4 and 6 directly.
* STMaxRate is continuous with most values between 80 and 200 and mode at 160.  Try keeping it continuous, but standardize it with sklearn's scaler.  Must scale AFTER splitting test and train data sets!
  (Alternative, not used here: bin it by rounding to the nearest 20, then one-hot encode it with 140 as the baseline, since 140 sits in the middle of the associated AngioTgt trend.  Dummy variables: STRate_80, STRate_100, STRate_120, STRate_160, STRate_180, STRate_200)
* STPain is already a binary. (1 = Yes)

### Full Set additions

* SystolicBP is continuous and shows little correlation to AngioTgt.  Try it with scale standardization.  Must scale AFTER splitting test and train data sets!
* Chol is continuous and shows very little correlation to AngioTgt in this data.  Try it with standardization.  Must scale AFTER splitting test and train data sets!
* Glucose is binary.  (1 = diabetes)
* RestECG has three values, but very few 2s, and shows little correlation to AngioTgt.  Use it as is.
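
One common reading of "scale AFTER splitting" is to learn the mean and standard deviation from the training rows only and reuse them on the test rows, so that no information from the test set shapes the transform. The sketch below shows that pattern with sklearn's `StandardScaler`. It is a swapped-in alternative to the `scale` function imported at the top of this notebook, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(21)
# Made-up values in roughly the ranges described above
X = pd.DataFrame({
    "STMaxRate": rng.integers(80, 200, size=50).astype(float),
    "STPain": rng.integers(0, 2, size=50),
})
y = pd.Series(np.tile([0, 1], 25))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=21)
# Explicit copies so the column assignments below act on independent frames
X_train, X_test = X_train.copy(), X_test.copy()

num_cols = ["STMaxRate"]
scaler = StandardScaler().fit(X_train[num_cols])        # statistics from train only
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])   # reuse the train mean and std

print(X_train["STMaxRate"].mean(), X_train["STMaxRate"].std(ddof=0))
```

The scaled training column has mean 0 and unit variance by construction; the test column generally does not, and that is expected. Binary and ordinal columns such as STPain are left untouched.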




# 3. Pre-processing

## Minimal Feature Set

In [74]:
min_feature = heart_data[['ChestPain','NumColor','STSlope','Defects','Sex','Age']]
min_feature = min_feature.replace({'NumColor':4}, 0)
min_feature = pd.get_dummies(min_feature, columns=['STSlope','Defects'], drop_first=True)
min_feature.Age = pd.cut(min_feature.Age, [0,35,45,55,65,75,120], labels=[30,40,50,60,70,80])
min_feature = pd.get_dummies(min_feature, columns=['Age'])
min_feature.drop(columns='Age_60', inplace=True)
print(min_feature.NumColor.value_counts())
min_feature

0    180
1     65
2     38
3     20
Name: NumColor, dtype: int64


Unnamed: 0,ChestPain,NumColor,Sex,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,Age_30,Age_40,Age_50,Age_70,Age_80
0,3,0,1,0,0,1,0,0,0,0,0,0,0
1,2,0,1,0,0,0,1,0,0,1,0,0,0
2,1,0,0,0,1,0,1,0,0,1,0,0,0
3,1,0,1,0,1,0,1,0,0,0,0,0,0
4,4,0,0,0,1,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,4,0,0,1,0,0,0,1,0,0,0,0,0
299,3,0,1,1,0,0,0,1,0,1,0,0,0
300,4,2,1,1,0,0,0,1,0,0,0,1,0
301,4,1,1,1,0,0,0,1,0,0,0,0,0


## Medium Feature Set

In [76]:
med_feature = pd.concat([min_feature, heart_data[['STWave','STMaxRate','STPain']] ], axis=1)
med_feature.head(10)
## Must wait and scale STMaxRate in testing and training sets separately

Unnamed: 0,ChestPain,NumColor,Sex,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,Age_30,Age_40,Age_50,Age_70,Age_80,STWave,STMaxRate,STPain
0,3,0,1,0,0,1,0,0,0,0,0,0,0,2,150,0
1,2,0,1,0,0,0,1,0,0,1,0,0,0,3,187,0
2,1,0,0,0,1,0,1,0,0,1,0,0,0,1,172,0
3,1,0,1,0,1,0,1,0,0,0,0,0,0,1,178,0
4,4,0,0,0,1,0,1,0,0,0,0,0,0,1,163,1
5,4,0,1,1,0,1,0,0,0,0,0,0,0,0,148,0
6,1,0,0,1,0,0,1,0,0,0,0,0,0,1,153,0
7,1,0,1,0,1,0,0,1,0,1,0,0,0,0,173,0
8,2,0,1,0,1,0,0,1,0,0,1,0,0,0,162,0
9,2,0,1,0,1,0,1,0,0,0,0,0,0,2,174,0


## Full Feature Set

In [77]:
full_feature = pd.concat([med_feature, heart_data[['SystolicBP','Chol','Glucose','RestECG']]], axis=1)
full_feature.head()
## Must wait and scale SystolicBP and Chol in testing and training sets separately

Unnamed: 0,ChestPain,NumColor,Sex,STSlope_1,STSlope_2,Defects_1,Defects_2,Defects_3,Age_30,Age_40,Age_50,Age_70,Age_80,STWave,STMaxRate,STPain,SystolicBP,Chol,Glucose,RestECG
0,3,0,1,0,0,1,0,0,0,0,0,0,0,2,150,0,145,233,1,0
1,2,0,1,0,0,0,1,0,0,1,0,0,0,3,187,0,130,250,0,1
2,1,0,0,0,1,0,1,0,0,1,0,0,0,1,172,0,130,204,0,0
3,1,0,1,0,1,0,1,0,0,0,0,0,0,1,178,0,120,236,0,1
4,4,0,0,0,1,0,1,0,0,0,0,0,0,1,163,1,120,354,0,1


# 4. Split Training and Test Data


## Minimal Feature Split


In [None]:
y = heart_data.AngioTgt
Xmf = min_feature        ## AngioTgt was never entered into the feature dataframes

Xmf_train, Xmf_test, ymf_train, ymf_test = train_test_split(Xmf, y, test_size=0.20, random_state=21)
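
The Medium Feature Split passes `stratify=y`, while the split above does not. As a sketch of what that argument does, the example below splits made-up labels (70 positives, 30 negatives) and compares the class proportions on each side. The toy frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 70 positives, 30 negatives
y = pd.Series([1] * 70 + [0] * 30, name="AngioTgt")
X = pd.DataFrame({"feature": np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=21)

# Share of positives in each split
print(y_tr.mean(), y_te.mean())
```

With stratification both splits keep the 70/30 balance of the full label vector. Without it the test share can drift, which matters more on a data set of only 303 rows. Using the same split settings for all three feature sets would also keep their test rows identical and the later model comparisons like-for-like.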

## Medium Feature Split


In [118]:
y = heart_data.AngioTgt
XMdf = med_feature        ## AngioTgt was never entered into the feature dataframes

XMdf_train, XMdf_test, yMdf_train, yMdf_test = \
        train_test_split(XMdf, y, test_size=0.20, stratify=y, random_state=21)
## Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
XMdf_train = XMdf_train.copy()
XMdf_test = XMdf_test.copy()
## Now scale STMaxRate (each split is standardized with its own mean and std)
XMdf_train['STMaxRate'] = scale(XMdf_train.STMaxRate)
XMdf_test['STMaxRate'] = scale(XMdf_test.STMaxRate)
print(XMdf_train, XMdf_test)

     ChestPain  NumColor  Sex  STSlope_1  STSlope_2  Defects_1  Defects_2  \
96           4         0    0          1          0          0          1   
52           2         3    1          1          0          0          0   
168          4         1    1          1          0          0          0   
228          3         0    1          1          0          0          0   
17           3         0    0          0          0          0          1   
..         ...       ...  ...        ...        ...        ...        ...   
205          4         1    1          0          1          0          0   
109          4         0    0          0          1          0          1   
123          2         0    0          0          1          0          1   
107          4         0    0          1          0          0          1   
51           4         0    1          1          0          0          1   

     Defects_3  Age_30  Age_40  Age_50  Age_70  Age_80  STWave  STMaxRate  



## Full Feature Split


In [123]:
y = heart_data.AngioTgt
XFF = full_feature        ## AngioTgt was never entered into the feature dataframes

XFF_train, XFF_test, yFF_train, yFF_test = train_test_split(XFF, y, test_size=0.20, random_state=21)
## Work on explicit copies so the assignments below do not raise SettingWithCopyWarning
XFF_train = XFF_train.copy()
XFF_test = XFF_test.copy()

## Scale the continuous columns (each split is standardized with its own mean and std)
scale_cols = ['STMaxRate','SystolicBP','Chol']
XFF_train[scale_cols] = scale(XFF_train[scale_cols])
XFF_test[scale_cols] = scale(XFF_test[scale_cols])

print(XFF_train, XFF_test)


     ChestPain  NumColor  Sex  STSlope_1  STSlope_2  Defects_1  Defects_2  \
281          4         0    1          1          0          0          0   
262          4         2    1          1          0          0          0   
60           2         1    0          0          1          0          1   
76           2         0    1          1          0          0          1   
37           2         0    1          0          1          0          0   
..         ...       ...  ...        ...        ...        ...        ...   
188          2         1    1          1          0          0          0   
120          4         2    0          1          0          0          1   
48           2         0    0          0          1          0          0   
260          4         2    0          1          0          0          0   
207          4         2    0          1          0          0          0   

     Defects_3  Age_30  Age_40  Age_50  Age_70  Age_80  STWave  STMaxRate  

