# Machine Learning Pipeline - Feature Engineering with Classes

In this notebook, we reproduce the feature engineering pipeline from notebook 2 (02-Machine-Learning-Pipeline-Feature-Engineering), replacing the hand-written functions with our own classes wherever possible, so that we can see what value the classes add.

Our classes are defined in the preprocessors module.

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# from sklearn
from sklearn.model_selection import train_test_split

# to visualise all the columns in the dataframe
pd.set_option('display.max_columns', None)

# our in-house pre-processing module
import preprocessors as pp

In [2]:
# load dataset
data = pd.read_csv('heart.csv')

# rows and columns of the data
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')

# visualise the dataset
data.head()

Number of rows: 918
Number of columns: 12


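| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |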


# Separate dataset into train and test

It is important to separate our data into training and testing sets.

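When we engineer features, some techniques learn parameters from the data. It is important to learn these parameters from the train set only. Otherwise, information from the test set leaks into the features, and the performance measured on the test set becomes over-optimistic.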

**Separating the data into train and test involves randomness, therefore, we need to set the seed.**

In [3]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['HeartDisease'], axis=1), # predictive variables
    data['HeartDisease'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [4]:
X_train.shape, X_test.shape

((734, 11), (184, 11))
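
To illustrate why the seed matters, here is a minimal sketch that splits a small made-up array twice with the same random_state and checks that both calls select the same test rows. The toy arrays and variable names are assumptions for illustration, not the heart dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 10 rows, one feature, a 0/1 target
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) % 2

# two independent calls with the same seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)

# the same rows end up in the test set both times
same_split = np.array_equal(a_test, b_test)
print(same_split)
```

Without a fixed random_state, each run may place different rows in the test set, so results would not be reproducible between runs.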

# Feature Engineering

In the following cells, we will engineer the variables of the heart disease dataset so that we tackle:

1. Zero values in some variables: mean imputation
2. Categorical variables: one-hot encode the binary variables
3. Categorical variables: ordinal-encode the non-binary variables
4. Continuous variables: scale to the same range

## Mean Imputation

Recall that the data has no null values, but the Cholesterol and RestingBP variables contain a number of implausible 0 values. To correct this, we will replace the zeros with the mean of the respective variable.

**NOTE:** The mean is computed over the non-zero rows only, i.e. the zeros contribute neither to the sum nor to the number of rows.

In [5]:
mean_impute = pp.MeanImputation(variables=['Cholesterol','RestingBP'])

In [6]:
mean_impute.fit(X_train)

<preprocessors.MeanImputation at 0x11b0d4670>

In [7]:
# print out the parameters
mean_impute.params_

{'Cholesterol': 242.8818635607321, 'RestingBP': 132.72442019099591}

In [8]:
X_train = mean_impute.transform(X_train)
X_test = mean_impute.transform(X_test)
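
The preprocessors module itself is not shown in this notebook, so the following is only a sketch of how a class with this behaviour could look. The class name ZeroMeanImputer and the toy dataframes are hypothetical. Only the variables argument and the params_ attribute mirror what the cells above use.

```python
import pandas as pd


class ZeroMeanImputer:
    """Hypothetical stand-in for pp.MeanImputation (the real class is not shown)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn each mean from the non-zero rows only
        self.params_ = {
            var: float(X.loc[X[var] != 0, var].mean()) for var in self.variables
        }
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for var in self.variables:
            col = X[var].astype(float)
            # replace zeros with the mean learned during fit
            X[var] = col.mask(col == 0, self.params_[var])
        return X


# toy data (hypothetical)
toy_train = pd.DataFrame({'Cholesterol': [200, 0, 300], 'RestingBP': [120, 140, 0]})
toy_test = pd.DataFrame({'Cholesterol': [0, 180], 'RestingBP': [0, 110]})

imputer = ZeroMeanImputer(variables=['Cholesterol', 'RestingBP']).fit(toy_train)
imputed_test = imputer.transform(toy_test)
print(imputer.params_)
print(imputed_test)
```

Note that the zeros in the test set are filled with the means learned from the train set, not with means recomputed on the test set.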

## Categorical Variables 

### Encode binary variables
Next, we need to transform the strings of the categorical variables into numbers. We will apply one-hot encoding on our binary features.

In [9]:
cat_encode = pp.CategoricalEncoder(variables=['Sex','ExerciseAngina','FastingBS'])

In [10]:
cat_encode.fit(X_train)

<preprocessors.CategoricalEncoder at 0x11b0d4610>

In [11]:
cat_encode.encoder_dict_

{'Sex': ['Sex_M'],
 'ExerciseAngina': ['ExerciseAngina_Y'],
 'FastingBS': ['FastingBS_1']}

In [12]:
X_train = cat_encode.transform(X_train)
X_test = cat_encode.transform(X_test)
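
As a rough sketch of what such an encoder might do internally, the class below uses pandas get_dummies and remembers which dummy column each variable produced on the train set. The name BinaryOneHotEncoder and the toy data are hypothetical, and the actual pp.CategoricalEncoder may be implemented differently.

```python
import pandas as pd


class BinaryOneHotEncoder:
    """Hypothetical stand-in for pp.CategoricalEncoder."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # remember which dummy column each variable yields on the train set
        self.encoder_dict_ = {}
        for var in self.variables:
            dummies = pd.get_dummies(X[var], prefix=var, drop_first=True)
            self.encoder_dict_[var] = list(dummies.columns)
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            dummies = pd.get_dummies(X[var], prefix=var)
            # keep exactly the columns seen during fit, even if a
            # category happens to be absent from the new data
            dummies = dummies.reindex(
                columns=self.encoder_dict_[var], fill_value=0
            ).astype(int)
            X = pd.concat([X.drop(columns=var), dummies], axis=1)
        return X


# toy data (hypothetical); the test set contains no 'M' at all
toy_train = pd.DataFrame({'Sex': ['M', 'F', 'M'], 'ExerciseAngina': ['N', 'Y', 'N']})
toy_test = pd.DataFrame({'Sex': ['F', 'F'], 'ExerciseAngina': ['Y', 'N']})

encoder = BinaryOneHotEncoder(variables=['Sex', 'ExerciseAngina']).fit(toy_train)
encoded_test = encoder.transform(toy_test)
print(encoder.encoder_dict_)
print(encoded_test)
```

Storing the column names at fit time is the design choice that keeps the train and test sets aligned: the test set always ends up with the same dummy columns, regardless of which categories it happens to contain.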

### Encode non-binary variables

For our non-binary features, we will transform the strings into numbers that capture the monotonic relationship between the label/category and the target. 

A common operation with categorical variables is to map the categories to integers following their natural order, if they happen to be ordinal. Determining that order for our variables would require domain knowledge, which we currently do not have. Instead, we can assign an order based on the rate of heart disease per label in each category.

In [13]:
ord_encode = pp.OrdinalEncoder(variables=['ChestPainType','RestingECG','ST_Slope'],target='HeartDisease')

In [14]:
ord_encode.fit(X_train,y_train)

<preprocessors.OrdinalEncoder at 0x11b0d46a0>

In [15]:
ord_encode.params_

{'ChestPainType': {'ASY': 0.7851662404092071,
  'NAP': 0.35185185185185186,
  'TA': 0.5,
  'ATA': 0.1310344827586207},
 'RestingECG': {'Normal': 0.5146067415730337,
  'LVH': 0.5625,
  'ST': 0.6275862068965518},
 'ST_Slope': {'Flat': 0.8351648351648352,
  'Up': 0.1761006289308176,
  'Down': 0.7884615384615384}}

In [16]:
X_train = ord_encode.transform(X_train)
X_test = ord_encode.transform(X_test)

In [17]:
ord_encode.ordinal_labels_

{'ChestPainType': {'ATA': 1, 'NAP': 2, 'TA': 3, 'ASY': 4},
 'RestingECG': {'Normal': 1, 'LVH': 2, 'ST': 3},
 'ST_Slope': {'Up': 1, 'Down': 2, 'Flat': 3}}
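
The ordering shown above (categories ranked from the lowest to the highest rate of heart disease) could be produced along the following lines. This is a sketch under assumptions: the class name TargetRateOrdinalEncoder and the toy data are hypothetical, and the sketch takes the target through fit rather than through a target argument as pp.OrdinalEncoder does.

```python
import pandas as pd


class TargetRateOrdinalEncoder:
    """Hypothetical stand-in for pp.OrdinalEncoder."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y):
        self.params_ = {}
        self.ordinal_labels_ = {}
        for var in self.variables:
            # the mean of a 0/1 target per category is the rate of positives
            rates = y.groupby(X[var]).mean().sort_values()
            self.params_[var] = rates.to_dict()
            # rank 1 = lowest rate, rank k = highest rate
            self.ordinal_labels_[var] = {
                cat: rank for rank, cat in enumerate(rates.index, start=1)
            }
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            # categories not seen during fit become NaN
            X[var] = X[var].map(self.ordinal_labels_[var])
        return X


# toy data (hypothetical)
toy_X = pd.DataFrame({'ST_Slope': ['Up', 'Up', 'Flat', 'Flat', 'Down', 'Down']})
toy_y = pd.Series([0, 0, 1, 1, 1, 0])
toy_new = pd.DataFrame({'ST_Slope': ['Flat', 'Up', 'Sideways']})

ord_enc = TargetRateOrdinalEncoder(variables=['ST_Slope']).fit(toy_X, toy_y)
encoded_train = ord_enc.transform(toy_X)
encoded_new = ord_enc.transform(toy_new)
print(ord_enc.ordinal_labels_)
print(encoded_new)
```

Because the mapping is learned from the train set, a category that first appears in new data has no label. In this sketch it becomes NaN and would need separate handling.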

In [18]:
X_train.head()

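| | Age | ChestPainType | RestingBP | Cholesterol | RestingECG | MaxHR | Oldpeak | ST_Slope | Sex_M | ExerciseAngina_Y | FastingBS_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 378 | 70 | 4 | 140.0 | 242.881864 | 1 | 157 | 2.0 | 3 | 1 | 1 | 1 |
| 356 | 46 | 4 | 115.0 | 242.881864 | 1 | 113 | 1.5 | 3 | 1 | 1 | 0 |
| 738 | 65 | 2 | 160.0 | 360.0 | 2 | 151 | 0.8 | 1 | 0 | 0 | 0 |
| 85 | 66 | 4 | 140.0 | 139.0 | 1 | 94 | 1.0 | 3 | 1 | 1 | 0 |
| 427 | 59 | 4 | 140.0 | 242.881864 | 3 | 117 | 1.0 | 3 | 1 | 1 | 0 |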

## Feature Scaling

Finally, we scale a selection of the numerical variables to the same range. As with the previous steps, the scaling parameters are learned from the train set only.


In [19]:
scaled = ['Age',
 'ChestPainType',
 'RestingBP',
 'Cholesterol',
 'MaxHR',
 'Oldpeak',
 'ST_Slope']

In [20]:
# create scaler
scaler = pp.ContinuousScaler(variables=scaled)

# fit the scaler to the train set
scaler.fit(X_train)

# transform train and test sets with learned parameters
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [21]:
X_train.head()

| | Age | ChestPainType | RestingBP | Cholesterol | RestingECG | MaxHR | Oldpeak | ST_Slope | Sex_M | ExerciseAngina_Y | FastingBS_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 378 | 0.857143 | 1.0 | 0.5 | 0.304791 | 1 | 0.676259 | 0.522727 | 1.0 | 1 | 1 | 1 |
| 356 | 0.367347 | 1.0 | 0.291667 | 0.304791 | 1 | 0.359712 | 0.465909 | 1.0 | 1 | 1 | 0 |
| 738 | 0.755102 | 0.333333 | 0.666667 | 0.530888 | 2 | 0.633094 | 0.386364 | 0.0 | 0 | 0 | 0 |
| 85 | 0.77551 | 1.0 | 0.5 | 0.104247 | 1 | 0.223022 | 0.409091 | 1.0 | 1 | 1 | 0 |
| 427 | 0.632653 | 1.0 | 0.5 | 0.304791 | 3 | 0.388489 | 0.409091 | 1.0 | 1 | 1 | 0 |
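
The scaled values above lie between 0 and 1, which is consistent with min-max scaling, although the implementation of pp.ContinuousScaler is not shown here. Assuming min-max scaling, a minimal sketch could look as follows. The class name MinMaxColumnScaler and the toy ages are hypothetical.

```python
import pandas as pd


class MinMaxColumnScaler:
    """Hypothetical stand-in for pp.ContinuousScaler, assuming min-max scaling."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn the minimum and the range of each variable from the train set
        self.min_ = X[self.variables].min()
        self.range_ = X[self.variables].max() - self.min_
        return self

    def transform(self, X):
        X = X.copy()
        for var in self.variables:
            X[var] = (X[var] - self.min_[var]) / self.range_[var]
        return X


# toy data (hypothetical); the test set holds an age above the train maximum
toy_train = pd.DataFrame({'Age': [28, 77, 70]})
toy_test = pd.DataFrame({'Age': [28, 84]})

mm_scaler = MinMaxColumnScaler(variables=['Age']).fit(toy_train)
scaled_train = mm_scaler.transform(toy_train)
scaled_test = mm_scaler.transform(toy_test)
print(scaled_train)
print(scaled_test)
```

Since the minimum and range come from the train set, values in new data that fall outside the training range are mapped outside the interval from 0 to 1, as the last toy row shows.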


We now have several fitted objects that hold parameters learned from the training dataset. We can store them and load them again at a later stage, so that when a colleague arrives with new data, we can score it quickly.

Still:

- we would need to save each object,
- then load each object,
- and apply each transformation individually.

That sounds like a lot of work.

The good news is that we can reduce this work by setting up all the transformations within a pipeline.
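
As a preview of that idea, the sketch below chains two built-in scikit-learn transformers in a Pipeline, saves the fitted pipeline with joblib, and scores new data with the restored object. The built-in transformers only stand in for our custom classes, and the toy arrays and file name are assumptions. The custom classes could take their place if they follow scikit-learn's fit/transform convention.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# toy data (hypothetical); 0 marks an implausible reading
X_train = np.array([[200.0], [0.0], [300.0]])
X_new = np.array([[0.0], [250.0]])

# built-in transformers stand in for the custom classes
pipe = Pipeline([
    ('impute', SimpleImputer(missing_values=0, strategy='mean')),
    ('scale', MinMaxScaler()),
])
pipe.fit(X_train)  # every step learns its parameters from the train set

# one object to save, one to load, one transform call for new data
path = os.path.join(tempfile.mkdtemp(), 'pipeline.joblib')
joblib.dump(pipe, path)
restored = joblib.load(path)
scored = restored.transform(X_new)
print(scored)
```

Compared with the list above, there is now a single object to persist and a single transform call to apply, and the steps always run in the same order.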