# Final Machine Learning Pipeline

In this notebook, we set up all the feature engineering steps in a single Scikit-learn pipeline, combining open-source transformers with those we developed in-house.

The pipeline features:

- open-source classes
- classes from our in-house package
- only the selected features

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')

# for saving the pipeline
import joblib

# from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, precision_score, recall_score
from sklearn.pipeline import Pipeline

# from feature engine
from feature_engine.selection import DropFeatures

# our in-house pre-processing module
import preprocessors as pp

In [2]:
# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [3]:
# load dataset
data = pd.read_csv('heart.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(918, 12)


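|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |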


# Separate dataset into train and test

It is important to separate our data into training and test sets.

When we engineer features, some techniques learn parameters from the data. These parameters must be learned from the train set only. This avoids over-fitting and keeps information from the test set from leaking into the model.

**Separating the data into train and test sets involves randomness, so we need to set the seed to make the split reproducible.**
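To make the two points above concrete, here is a minimal sketch on toy data (the array values are made up and are not taken from `heart.csv`). `MinMaxScaler` stands in for any transformer that learns parameters: it is fitted on the train split only and then reused on the test split. The same `random_state` is shown to give the same split twice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# toy feature matrix and target (hypothetical values, not the heart dataset)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
scaler.fit(X_tr)                       # min and max are learned from the train set only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # the test set reuses the train parameters

# repeating the split with the same seed returns the same rows
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Because the scaler never sees the test rows during `fit`, scaled test values may fall outside the 0 to 1 range; that is expected and is not an error.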

In [4]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['HeartDisease'], axis=1), # predictive variables
    data['HeartDisease'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [5]:
X_train.shape, X_test.shape

((734, 11), (184, 11))

# Configuration

In [6]:
ZERO_VALUES = ['Cholesterol','RestingBP']

BINARY_VARS = ['Sex','ExerciseAngina','FastingBS']

NON_BINARY_VARS = ['ChestPainType','RestingECG','ST_Slope']

TARGET = 'HeartDisease'

SCALED_VARS = ['Age',
 'ChestPainType',
 'RestingBP',
 'Cholesterol',
 'MaxHR',
 'Oldpeak',
 'ST_Slope']

DROPPED_VARS = ['RestingECG']
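A typo in one of these lists would only surface later as a `KeyError` inside the pipeline. As a hedged sketch, the check below compares each list against the column names shown in the `data.head()` output. The helper `find_unknown_columns` and the `CONFIG` dictionary are hypothetical names introduced for this example and are not part of the in-house package.

```python
# column names as shown in the data.head() output above
COLUMNS = ['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
           'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']

TARGET = 'HeartDisease'

# the configuration lists from the cell above, gathered in one dict for checking
CONFIG = {
    'ZERO_VALUES': ['Cholesterol', 'RestingBP'],
    'BINARY_VARS': ['Sex', 'ExerciseAngina', 'FastingBS'],
    'NON_BINARY_VARS': ['ChestPainType', 'RestingECG', 'ST_Slope'],
    'SCALED_VARS': ['Age', 'ChestPainType', 'RestingBP', 'Cholesterol',
                    'MaxHR', 'Oldpeak', 'ST_Slope'],
    'DROPPED_VARS': ['RestingECG'],
}


def find_unknown_columns(config, columns):
    """Return {list_name: [names missing from columns]} for lists with unknown names."""
    known = set(columns)
    report = {}
    for name, variables in config.items():
        missing = [var for var in variables if var not in known]
        if missing:
            report[name] = missing
    return report


unknown = find_unknown_columns(CONFIG, COLUMNS)

# features that reach the model: everything except the target and the dropped variables
features = [col for col in COLUMNS if col != TARGET]
remaining = [col for col in features if col not in CONFIG['DROPPED_VARS']]

print(unknown)
print(remaining)
```

Note that `SCALED_VARS` contains `ChestPainType` and `ST_Slope`, which are categorical in the raw data; they can only be scaled because the encoding steps run earlier in the pipeline.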

# Pipeline - End-to-end

The pipeline chains six steps:

- replacing zero values in certain variables
- encoding of binary variables
- encoding of non-binary variables
- scaling the continuous variables
- dropping the unneeded features
- training the model with nearest neighbors
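The in-house classes in `preprocessors` are not shown in this notebook. To sit inside a `Pipeline`, each one needs `fit` and `transform` methods. The sketch below shows that pattern for the first step. `ZeroMeanImputer` is a hypothetical name, and treating the replacement value as the mean of the non-zero training values is an assumption; the actual `pp.MeanImputation` may work differently.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ZeroMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: replace zeros with the mean of the non-zero training values."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn one mean per variable, ignoring the zero entries
        self.means_ = {
            var: X.loc[X[var] != 0, var].mean() for var in self.variables
        }
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for var in self.variables:
            # keep non-zero values, substitute the learned mean where the value is zero
            X[var] = X[var].astype(float).where(X[var] != 0, self.means_[var])
        return X


# toy data (made-up values) to show that the test set reuses the train means
train = pd.DataFrame({'Cholesterol': [200, 0, 300], 'RestingBP': [120, 140, 0]})
test = pd.DataFrame({'Cholesterol': [0, 250], 'RestingBP': [0, 130]})

imputer = ZeroMeanImputer(variables=['Cholesterol', 'RestingBP']).fit(train)
test_imputed = imputer.transform(test)
print(test_imputed)
```

Storing the learned values in `fit` and only applying them in `transform` is what lets the pipeline learn from the train set and reuse the same parameters at prediction time.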

In [7]:
# set up the pipeline
heart_pipe = Pipeline([
    
    # ===== MEAN IMPUTATION =====
    # replace zero values with the adjusted mean
    ('mean_imputation', pp.MeanImputation(variables=ZERO_VALUES)),
    
    # ===== ENCODING =====
    # encoding of binary variables
    ('binary_encoder', pp.CategoricalEncoder(variables=BINARY_VARS)),
    
    # encoding of non-binary variables
    ('non_binary_encoder', pp.OrdinalEncoder(
        variables=NON_BINARY_VARS, target=TARGET)),
    
    # ===== SCALER =====
    # scale the continuous variables
    ('scaler', pp.ContinuousScaler(variables=SCALED_VARS)),
    
    # ===== DROP FEATURES =====
    # reduce dataset to selected features
    ('drop_features', DropFeatures(features_to_drop=DROPPED_VARS)),
    
    # ===== MODEL TRAINING =====
    ('knn_model', KNeighborsClassifier(n_neighbors=5)),
    
])

In [8]:
# train the pipeline
heart_pipe.fit(X_train,y_train)

Pipeline(steps=[('mean_imputation',
                 <preprocessors.MeanImputation object at 0x1279267f0>),
                ('binary_encoder',
                 <preprocessors.CategoricalEncoder object at 0x127926850>),
                ('non_binary_encoder',
                 <preprocessors.OrdinalEncoder object at 0x1279268b0>),
                ('scaler',
                 <preprocessors.ContinuousScaler object at 0x127926910>),
                ('drop_features',
                 DropFeatures(features_to_drop=['RestingECG'])),
                ('knn_model', KNeighborsClassifier())])

In [9]:
# evaluate the model:
# ====================

# make predictions for test set
preds = heart_pipe.predict(X_test)

# determine f1, accuracy, precision and recall
print(f'f1 score: {f1_score(y_test,preds)}')
print(f'accuracy: {accuracy_score(y_test,preds)}')
print(f'precision: {precision_score(y_test,preds)}')
print(f'recall: {recall_score(y_test,preds)}')

f1 score: 0.8755760368663594
accuracy: 0.8532608695652174
precision: 0.8636363636363636
recall: 0.8878504672897196
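As a cross-check, the four scores can be reproduced from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the reported scores and the 184 test rows, so treat them as a derived assumption. The sketch uses plain arithmetic rather than the scikit-learn functions.

```python
# confusion-matrix counts inferred from the reported scores (assumption, see lead-in)
tp, fp, fn, tn = 95, 15, 12, 62

precision = tp / (tp + fp)                            # share of predicted positives that are correct
recall = tp / (tp + fn)                               # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
accuracy = (tp + tn) / (tp + fp + fn + tn)            # share of all predictions that are correct

print(f'f1 score: {f1}')
print(f'accuracy: {accuracy}')
print(f'precision: {precision}')
print(f'recall: {recall}')
```

Under these counts, the model misses 12 of the 107 patients with heart disease and raises 15 false alarms among the 77 patients without it.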


These results are identical to those we obtained when we did the feature engineering manually.

WE CAN GO AHEAD AND DEPLOY!
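The import cell brings in `joblib` "for saving the pipeline", but the notebook does not show that step. Below is a minimal sketch of how a fitted pipeline could be saved and reloaded. It uses a stand-in pipeline on random toy data so that it runs without the in-house `preprocessors` module, and the file name `heart_pipe.joblib` is a hypothetical choice.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# toy data (random values, not the heart dataset)
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = (X[:, 0] > 0.5).astype(int)

# stand-in for heart_pipe, built only from scikit-learn classes
stand_in_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('knn_model', KNeighborsClassifier(n_neighbors=5)),
])
stand_in_pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'heart_pipe.joblib')  # hypothetical file name
    joblib.dump(stand_in_pipe, path)      # persist the fitted pipeline to disk
    restored_pipe = joblib.load(path)     # load it back as a new object

# the restored pipeline should give the same predictions as the original
predictions_match = np.array_equal(
    stand_in_pipe.predict(X), restored_pipe.predict(X)
)
print(predictions_match)
```

For the real `heart_pipe`, the environment that loads the file must be able to import the `preprocessors` module, because the saved file refers to the custom classes by their import path and does not contain their code.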