# Machine Learning Pipeline - Feature Engineering

In the following notebooks, we will go through the implementation of each one of the steps in the Machine Learning Pipeline. 

We will discuss:

1. Data Analysis
2. **Feature Engineering**
3. Feature Selection
4. Model Training
5. Obtaining Predictions / Scoring

In [1]:
# to handle datasets
import pandas as pd
import numpy as np

# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')

# for the yeo-johnson transformation
import scipy.stats as stats

# to divide train and test set
from sklearn.model_selection import train_test_split

# feature scaling
from sklearn.preprocessing import MinMaxScaler

# to save the trained scaler class
import joblib

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

In [2]:
# load dataset
data = pd.read_csv('heart.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(918, 12)


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


# Separate dataset into train and test

It is important to separate our data into training and testing sets.

When we engineer features, some techniques learn parameters from the data. It is important to learn these parameters only from the train set. This avoids over-fitting and keeps information from the test set from leaking into the model.

**Separating the data into train and test involves randomness; therefore, we need to set the seed to make the split reproducible.**
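
As a minimal sketch of why the seed matters, the toy example below splits the same data twice with the same `random_state` and checks that both splits pick the same rows. The frame `toy`, its columns and the helper `split` are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical toy data, not the heart dataset
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})


def split(seed):
    # return the index labels of the rows that land in the train set
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[["x"]], toy["y"], test_size=0.2, random_state=seed
    )
    return list(X_tr.index)


first = split(0)
second = split(0)

# same seed, same rows in the train set
print(first == second)
```

Without a fixed seed, each run could send different rows to the train set, and every parameter learned from it would change with it.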

In [3]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['HeartDisease'], axis=1), # predictive variables
    data['HeartDisease'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [4]:
X_train.shape, X_test.shape

((734, 11), (184, 11))

In [5]:
X_train.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope
378,70,M,ASY,140,0,1,Normal,157,Y,2.0,Flat
356,46,M,ASY,115,0,0,Normal,113,Y,1.5,Flat
738,65,F,NAP,160,360,0,LVH,151,N,0.8,Up
85,66,M,ASY,140,139,0,Normal,94,Y,1.0,Flat
427,59,M,ASY,140,0,0,ST,117,Y,1.0,Flat


# Feature Engineering

## Missing values

Recall that we had no null values, but rather a recurrence of unusual 0 values in our Cholesterol and RestingBP variables. To correct this, we will replace the zero values with the means of the respective variables.

**NOTE:** Our mean computation will exclude the rows with 0 values, i.e. the mean is taken over the non-zero rows of the train set only.
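
To make the arithmetic concrete, here is a small worked sketch on made-up numbers (the series `toy_train` and `toy_test` are hypothetical, not rows from the heart dataset). The mean is learned from the non-zero rows of the train set and then applied to both sets.

```python
import numpy as np
import pandas as pd

# hypothetical toy columns; 0 stands in for a missing reading
toy_train = pd.Series([200, 0, 300, 0, 250], name="Cholesterol")
toy_test = pd.Series([0, 180], name="Cholesterol")

# mean over the non-zero rows of the train set only: (200 + 300 + 250) / 3
nonzero_mean = toy_train[toy_train != 0].mean()

# replace zeros in both sets with the value learned from the train set
train_filled = np.where(toy_train != 0, toy_train, nonzero_mean)
test_filled = np.where(toy_test != 0, toy_test, nonzero_mean)

print(nonzero_mean)
print(train_filled)
print(test_filled)
```

Had the zeros been included, the train mean would have been 750 / 5 = 150 rather than 250, pulling the imputed value well below the real readings.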

### Cholesterol

In [6]:
# keep only the non-zero rows (i.e. exclude rows where Cholesterol is 0)
tmp = X_train[X_train['Cholesterol']!=0]

# grab the mean
cholesterol_mean = tmp['Cholesterol'].mean()
cholesterol_mean

242.8818635607321

In [7]:
# replace 0 values with the mean
X_train['Cholesterol'] = np.where(X_train['Cholesterol']!=0,X_train['Cholesterol'],cholesterol_mean)
X_test['Cholesterol'] = np.where(X_test['Cholesterol']!=0,X_test['Cholesterol'],cholesterol_mean)

### RestingBP

In [8]:
# keep only the non-zero rows (i.e. exclude rows where RestingBP is 0)
tmp = X_train[X_train['RestingBP']!=0]

# grab the mean
restingbp_mean = tmp['RestingBP'].mean()
restingbp_mean

132.72442019099591

In [9]:
# replace 0 values with the mean
X_train['RestingBP'] = np.where(X_train['RestingBP']!=0,X_train['RestingBP'],restingbp_mean)
X_test['RestingBP'] = np.where(X_test['RestingBP']!=0,X_test['RestingBP'],restingbp_mean)

## Categorical Variables

A common operation with categorical variables is to map non-binary variables to integers following their order, if they happen to be ordinal. Ordinality for our variables would have to be determined by domain knowledge, which we currently do not have. In its place, we can assign an order based on the rate of heart disease per label in the category.

For the binary variables, we will go ahead and one-hot encode them. This operation is typically done after removing rare labels but our variables have no rare labels.

### Encoding of categorical variables

#### Binary variables

We will now transform the strings of our binary variables into numbers.

In [10]:
X_test['ExerciseAngina'].head()

306    N
711    N
298    N
466    Y
253    N
Name: ExerciseAngina, dtype: object

In [11]:
binary = ['Sex', 'ExerciseAngina', 'FastingBS']

# cast all binary variables to strings so they are treated as categorical
# (FastingBS is recorded as int)
X_train[binary] = X_train[binary].astype(str)
X_test[binary] = X_test[binary].astype(str)

In [12]:
X_train[binary].head()

Unnamed: 0,Sex,ExerciseAngina,FastingBS
378,M,Y,1
356,M,Y,0
738,F,N,0
85,M,Y,0
427,M,Y,0


In [13]:
# function to encode categorical variables

def category_encoder(X, variables):
    # loop over each feature in the list
    for feature in variables:
        # grab the dummies, dropping the first label; the prefix affixes the
        # feature name to each new column to make it easily identifiable
        dummies = pd.get_dummies(X[feature], prefix=feature, drop_first=True)
        X = pd.concat([X, dummies], axis=1)  # concat the dummies to the original dataframe
        X = X.drop(feature, axis=1)  # drop the original string column
    return X  # return the dataframe

In [14]:
X_train = category_encoder(X_train,binary)
X_test = category_encoder(X_test,binary)
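
As an illustration of what the drop-one step does, the sketch below encodes a made-up two-label column (the frame `toy` is hypothetical). `dtype=int` is passed only so that the output is 0/1 regardless of the pandas version.

```python
import pandas as pd

# hypothetical toy column with two labels
toy = pd.DataFrame({"Sex": ["M", "F", "M"]})

# drop_first=True drops the first label in sorted order ("F"),
# leaving a single indicator column for "M"
dummies = pd.get_dummies(toy["Sex"], prefix="Sex", drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies["Sex_M"].tolist())
```

One indicator column is enough for a binary variable, since a 0 in `Sex_M` already implies the dropped label.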

A common issue with drop-one operations during one-hot encoding of binary variables is that new variables created in the train set may not be replicated in the test set. This suggests that the original variable the encoded variables came from could be a quasi-constant feature.

It is always best to check for this disparity after one-hot encoding and to drop the variable entirely if the disparity exists.

In [15]:
# sanity check after drop-one operation
for feat in X_train.columns:
    if feat not in X_test.columns:
        print(feat)

We don't have any disparity in features between the train and test set.
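
For reference, the sketch below shows how such a disparity can arise and be detected on made-up data. The variable `Grade` and both toy frames are hypothetical. The `reindex` step at the end is one possible remedy, offered as an assumption rather than the approach taken in this notebook, which is to drop the variable.

```python
import pandas as pd

# hypothetical toy sets: label "C" appears only in the train set
toy_train = pd.DataFrame({"Grade": ["A", "B", "C", "A"]})
toy_test = pd.DataFrame({"Grade": ["A", "B", "B"]})

train_enc = pd.get_dummies(toy_train, columns=["Grade"], drop_first=True, dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["Grade"], drop_first=True, dtype=int)

# columns created in the train set but absent from the test set
missing_in_test = [c for c in train_enc.columns if c not in test_enc.columns]
print(missing_in_test)

# possible remedy (an assumption): give the test set the train columns,
# filling the absent ones with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```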

#### Non-binary variables

In [16]:
non_binary = ['ChestPainType', 'RestingECG', 'ST_Slope']

In [17]:
X_train['ChestPainType'].value_counts()

ASY    391
NAP    162
ATA    145
TA      36
Name: ChestPainType, dtype: int64

In [18]:
train = pd.concat([X_train,y_train],axis=1)

In [19]:
# empty dictionary to store the parameters of the variables
params = {'ChestPainType': {},
     'RestingECG': {},
     'ST_Slope': {}}

# fit: grab and persist parameters to the dictionary
# NOTE: the parameters are the heart disease rates per label in the category
for column in non_binary:
    for label in train[column].unique():
        label_disease = len(train[(train[column]==label) & (train['HeartDisease']==1)])
        label_size = len(train[train[column]==label])
        params[column][label] = label_disease / label_size
        
    # transform: re-categorise variables with saved parameters
    labels = pd.Series(params[column])
    ordered_labels = labels.sort_values().index
    ordinal_label = {k: i for i, k in enumerate(ordered_labels, 1)}
    
    print(column, ordinal_label)
    print()
    
    X_train[column] = X_train[column].map(ordinal_label)
    X_test[column] = X_test[column].map(ordinal_label)

ChestPainType {'ATA': 1, 'NAP': 2, 'TA': 3, 'ASY': 4}

RestingECG {'Normal': 1, 'LVH': 2, 'ST': 3}

ST_Slope {'Up': 1, 'Down': 2, 'Flat': 3}



In [20]:
# print out the persisted parameters
params

{'ChestPainType': {'ASY': 0.7851662404092071,
  'NAP': 0.35185185185185186,
  'TA': 0.5,
  'ATA': 0.1310344827586207},
 'RestingECG': {'Normal': 0.5146067415730337,
  'LVH': 0.5625,
  'ST': 0.6275862068965518},
 'ST_Slope': {'Flat': 0.8351648351648352,
  'Up': 0.1761006289308176,
  'Down': 0.7884615384615384}}
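
Because the target is coded 0/1, the per-label disease rate is simply the mean of the target within each label. The sketch below computes the same kind of parameters with `groupby` on a made-up frame (`toy` and its values are hypothetical), then derives the ordinal mapping from them.

```python
import pandas as pd

# hypothetical toy train set with a categorical feature and a binary target
toy = pd.DataFrame({
    "ST_Slope": ["Up", "Up", "Flat", "Flat", "Flat", "Down", "Down", "Up"],
    "HeartDisease": [0, 0, 1, 1, 1, 1, 0, 1],
})

# fit: the mean of a 0/1 target per label is the disease rate for that label
rates = toy.groupby("ST_Slope")["HeartDisease"].mean()

# transform: rank labels from lowest to highest rate, starting at 1
ordered_labels = rates.sort_values().index
ordinal_label = {label: rank for rank, label in enumerate(ordered_labels, 1)}

print(rates.to_dict())
print(ordinal_label)
```

Here `Up` has a rate of 1/3, `Down` 1/2 and `Flat` 1, so they are mapped to 1, 2 and 3 respectively.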

In [21]:
# visualise the dataframe to confirm the transformations
X_train.head()

Unnamed: 0,Age,ChestPainType,RestingBP,Cholesterol,RestingECG,MaxHR,Oldpeak,ST_Slope,Sex_M,ExerciseAngina_Y,FastingBS_1
378,70,4,140.0,242.881864,1,157,2.0,3,1,1,1
356,46,4,115.0,242.881864,1,113,1.5,3,1,1,0
738,65,2,160.0,360.0,2,151,0.8,1,0,0,0
85,66,4,140.0,139.0,1,94,1.0,3,1,1,0
427,59,4,140.0,242.881864,3,117,1.0,3,1,1,0
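
Beyond eyeballing the first rows, it can help to check that the mapping did not introduce missing values: `Series.map` with a dictionary returns NaN for any label that is not a key. The sketch below shows this on a made-up test column. The label `Sideways` and the names `toy_test` and `n_unmapped` are hypothetical.

```python
import pandas as pd

# mapping of the kind learned from the train set
ordinal_label = {"Up": 1, "Down": 2, "Flat": 3}

# hypothetical test column containing a label the mapping has never seen
toy_test = pd.Series(["Up", "Flat", "Sideways"], name="ST_Slope")

encoded = toy_test.map(ordinal_label)

# labels missing from the mapping come back as NaN
n_unmapped = int(encoded.isnull().sum())
print(n_unmapped)
```

A count of zero on the real `X_train` and `X_test` columns would confirm that every label was mapped.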


In [22]:
# let's now save the train and test sets for the next notebook!

X_train.to_csv('xtrain_unscaled.csv', index=False)
X_test.to_csv('xtest_unscaled.csv', index=False)

y_train.to_csv('ytrain.csv', index=False)
y_test.to_csv('ytest.csv', index=False)