# Final Machine Learning Pipeline

In this notebook, we set up all the feature engineering steps within a Scikit-learn pipeline, using open-source transformers alongside those we developed in house.

The pipeline features:

- open-source classes
- in-house package classes
- only the selected features

**NOTE:** Our over-sampling operation and model training will be run outside the pipeline.
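
For readers who have not written one before, the sketch below shows how an in-house class can plug into a Scikit-learn pipeline. It is a minimal illustration only: `MeanImputerSketch` and the toy data are hypothetical and are not taken from our `preprocessors` module, whose actual implementation may differ.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class MeanImputerSketch(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an in-house imputer (not the real `pp` class)."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # learn one mean per variable from the data passed to fit
        self.means_ = X[self.variables].mean().to_dict()
        return self

    def transform(self, X):
        # work on a copy so the caller's dataframe is left untouched
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].fillna(self.means_[var])
        return X


# toy data, not the campaign dataset
train = pd.DataFrame({'Income': [10.0, np.nan, 30.0]})
test = pd.DataFrame({'Income': [np.nan, 50.0]})

pipe = Pipeline([('mean_imputation', MeanImputerSketch(variables=['Income']))])
pipe.fit(train)

train_out = pipe.transform(train)
test_out = pipe.transform(test)  # filled with the mean learned from `train`
print(train_out['Income'].tolist())
print(test_out['Income'].tolist())
```

Any class that exposes `fit` and `transform` in this way can be listed as a pipeline step next to the open-source transformers.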

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')

# for saving the pipeline
import joblib

# from scikit-learn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, precision_score, recall_score
from sklearn.pipeline import Pipeline

# from feature engine
from feature_engine.selection import DropFeatures

# our in-house pre-processing module
import preprocessors as pp

# to over-sample our minority label
from imblearn.over_sampling import SMOTE

In [2]:
# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [3]:
# load dataset
data = pd.read_csv('campaign.csv')

# rows and columns of the data
print(data.shape)

# visualise the dataset
data.head()

(2240, 29)


Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0


In [4]:
# Let's separate into train and test set
# Remember to set the seed (random_state for this sklearn function)

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['ID','Response'], axis=1), # predictive variables
    data['Response'], # target
    test_size=0.2, # portion of dataset to allocate to test set
    random_state=0, # we are setting the seed here
)

In [5]:
X_train.shape, X_test.shape

((1792, 27), (448, 27))
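
The split above does not pass `stratify`, so the share of positive responses in each part depends on the random seed. As an optional variation (not what this notebook does), the sketch below shows how `stratify` keeps the class proportions equal in both parts. The labels are synthetic and the 15% positive rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic, imbalanced labels: 150 positives out of 1000 rows (assumed rate)
y = np.array([1] * 150 + [0] * 850)
X = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    random_state=0,
    stratify=y,  # preserve the class proportions in both parts
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f'positive rate in train: {train_rate:.3f}')
print(f'positive rate in test: {test_rate:.3f}')
```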

In [6]:
TARGET = 'Response'

MISSING_VALS = ['Income']

DATE_VAR = ['Dt_Customer']

YEAR_VAR = ['Year_Birth']

NON_BINARY = ['Education','Marital_Status']

SCALED_VARS = ['Year_Birth',
 'Education',
 'Marital_Status',
 'Income',
 'Kidhome',
 'Teenhome',
 'Dt_Customer',
 'Recency',
 'MntWines',
 'MntFruits',
 'MntMeatProducts',
 'MntFishProducts',
 'MntSweetProducts',
 'MntGoldProds',
 'NumDealsPurchases',
 'NumWebPurchases',
 'NumCatalogPurchases',
 'NumStorePurchases',
 'NumWebVisitsMonth']

In [7]:
# set up the pipeline
customer_pipe = Pipeline([
    
    # ===== CONSTANT VALUES =====
    # drop variables with constant values from the dataset
    ('drop_constants', pp.DropConstant()),
    
    # ===== MEAN IMPUTATION =====
    # replace null values with the variable mean
    ('mean_imputation', pp.MissingImputer(
        variables=MISSING_VALS)),
    
    # ===== TEMPORAL VARIABLES =====
    # ===== Dt_Customer =====
    ('transform_date', pp.TransformDate(
        variables=DATE_VAR, current_year=2022)),
    
    # ===== Year_Birth =====
    ('transform_year', pp.TransformYear(
        variables=YEAR_VAR, current_year=2022)),
    
    # ===== ENCODING =====
    # encode non-binary variables
    ('non_binary_encoder', pp.OrdinalEncoder(
        variables=NON_BINARY, target=TARGET)),
    
    # ===== SCALER =====
    # scale the continuous variables
    ('scaler', pp.ContinuousScaler(variables=SCALED_VARS)),
    
])

In [8]:
customer_pipe.fit(X_train,y_train)

Pipeline(steps=[('drop_constants',
                 <preprocessors.DropConstant object at 0x11b8641c0>),
                ('mean_imputation',
                 <preprocessors.MissingImputer object at 0x11b8641f0>),
                ('transform_date',
                 <preprocessors.TransformDate object at 0x11b864250>),
                ('transform_year',
                 <preprocessors.TransformYear object at 0x11b8642b0>),
                ('non_binary_encoder',
                 <preprocessors.OrdinalEncoder object at 0x11b864f10>),
                ('scaler',
                 <preprocessors.ContinuousScaler object at 0x11b864f70>)])
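
`joblib` is imported at the top of the notebook for saving the pipeline. A minimal sketch of that step follows. It uses a stand-in pipeline built from `MinMaxScaler` so that it runs on its own, and the file name `customer_pipe.joblib` is hypothetical; in the notebook, the fitted `customer_pipe` would take the stand-in's place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# stand-in for the fitted pipeline (the notebook's object is `customer_pipe`)
stand_in_pipe = Pipeline([('scaler', MinMaxScaler())])
X = np.array([[1.0], [3.0], [5.0]])
stand_in_pipe.fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'customer_pipe.joblib')  # hypothetical file name
    joblib.dump(stand_in_pipe, path)   # persist the fitted pipeline
    restored_pipe = joblib.load(path)  # reload it, e.g. in a scoring script

# the restored pipeline should reproduce the original transformation
same_output = np.allclose(stand_in_pipe.transform(X), restored_pipe.transform(X))
print(same_output)
```

Because `joblib` relies on pickling, a pipeline that contains in-house classes can only be loaded where the `preprocessors` module is importable.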

In [9]:
X_train = customer_pipe.transform(X_train)
X_test = customer_pipe.transform(X_test)
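
The pipeline is fitted on the training set only, and the same fitted object then transforms both sets. Assuming the in-house steps follow the usual Scikit-learn convention, every learned parameter (means, encodings, scaling ranges) therefore comes from the training data. The sketch below illustrates the effect with `MinMaxScaler` on toy values; it is not our `ContinuousScaler`, whose scaling method may differ.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy values, not the campaign dataset
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # min and max are learned from the training data only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Test values outside the training range fall outside [0, 1] after scaling. That is expected: refitting on the test set would hide it, but would also leak test information into the preprocessing.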

In [10]:
# oversample the minority class to establish label parity in the target
smote = SMOTE(sampling_strategy='minority',random_state=0)
X_train, y_train = smote.fit_resample(X_train,y_train)
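
SMOTE creates synthetic minority rows by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step on toy points. It is a simplified illustration, not the `imblearn` implementation: the neighbour is picked by hand rather than by a nearest-neighbour search, and `synthesize` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy minority-class points in two dimensions (not the campaign data)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])


def synthesize(sample, neighbour, rng):
    """Return a point at a random position on the segment sample -> neighbour."""
    gap = rng.uniform(0.0, 1.0)
    return sample + gap * (neighbour - sample)


sample, neighbour = minority[0], minority[1]  # neighbour chosen by hand here
new_point = synthesize(sample, neighbour, rng)
print(new_point)
```

Note that only the training set is resampled in this notebook; the test set keeps its original class balance so that the evaluation reflects the data as collected.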

In [11]:
# set up the model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train,y_train)

KNeighborsClassifier()

In [12]:
# evaluate the model:
# ====================

# make predictions for test set
knn_preds = knn_model.predict(X_test)

# determine f1, accuracy, precision and recall
print(f'f1 score: {f1_score(y_test,knn_preds)}')
print(f'accuracy: {accuracy_score(y_test,knn_preds)}')
print(f'precision: {precision_score(y_test,knn_preds)}')
print(f'recall: {recall_score(y_test,knn_preds)}')

f1 score: 0.5849056603773586
accuracy: 0.8035714285714286
precision: 0.4696969696969697
recall: 0.775


These results mirror those we obtained during model training.

Note that we can certainly do more to improve this model. However, it is best to rely on the data itself to get the best inference we can from the model. Oversampling could be unhelpful in production/deployment because the real-world data distribution could differ considerably from the distribution of the resampled data the model was trained on. This could lead to incorrect predictions and model decay.

The only calibration this model might need is to de-sensitise it to false positives: with a precision of about 0.47, roughly half of its positive predictions are wrong. Besides that, its true positive picks are pretty impressive, with a recall of 0.775.
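
To make the false-positive point concrete, the four printed scores can be traced back to confusion-matrix counts. The counts below are not notebook output: they are inferred as the only whole-number combination consistent with the reported precision, recall and accuracy on the 448-row test set. The sketch recomputes each metric from them.

```python
# confusion-matrix counts inferred from the printed scores (not notebook output)
tp, fp, fn, tn = 62, 70, 18, 298

total = tp + fp + fn + tn            # should equal the 448 test rows
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of actual positives that are found
accuracy = (tp + tn) / total
f1 = 2 * precision * recall / (precision + recall)

print(f'test rows: {total}')
print(f'f1 score: {f1:.4f}')
print(f'accuracy: {accuracy:.4f}')
print(f'precision: {precision:.4f}')
print(f'recall: {recall:.4f}')
```

Under these counts the model makes 70 false-positive predictions against 62 true positives, while missing only 18 of the 80 actual responders.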