# Predicting Term Sheet Purchase

#### Model Building Steps

- Load the dataset and clean the data
- Check for outliers and treat them
- Check for class imbalance
- Split the data into train and test sets
- Build the preprocessing and estimation pipeline
  - Using an imblearn pipeline, oversample the minority class, apply PCA, and fit an estimator (Logistic Regression with L1 (lasso) regularization, and RandomForest Classifier)
  - Use GridSearch to search for the best parameters and estimators, as well as the number of PCA components

#### References

- https://stackoverflow.com/questions/46062679/right-order-of-doing-feature-selection-pca-and-normalization

- https://towardsdatascience.com/preventing-data-leakage-in-your-machine-learning-model-9ae54b3cd1fb

- https://www.mikulskibartosz.name/pca-how-to-choose-the-number-of-components/

- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

- https://stats.stackexchange.com/questions/363548/use-of-smote-with-training-test-and-dev-sets

- https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well

- https://ro-che.info/articles/2017-12-11-pca-explained-variance

- https://www.researchgate.net/deref/http%3A%2F%2Fwww.marcoaltini.com%2Fblog%2Fdealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation

> Import analysis and visualization libraries

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline  
import matplotlib.pyplot as plt
import seaborn as sns 
from pandas.api.types import is_numeric_dtype 
import os

> Import from project-defined modules

In [3]:
import bi_plot
import uni_plot
from data import WrangleData
from model import Preprocessor, plot_pca_components, plot_confusion_matrix, check_imbalance
from model import x_y_split, gridSearch, plot_grid_search
from model_metrics import metrics 

> Import preprocessing libraries

In [4]:
from sklearn.preprocessing import StandardScaler,RobustScaler, MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
from imblearn.over_sampling import SMOTE, RandomOverSampler
from sklearn.decomposition import PCA
from sklearn.feature_selection import f_classif, SelectFromModel, SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest  # outlier detection and removal
from collections import Counter



> Import estimator libraries

In [5]:
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, GridSearchCV
import xgboost
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.base import BaseEstimator, TransformerMixin

> Import libraries for measuring model performance

In [6]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate

> Import production libraries

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from imblearn.pipeline import Pipeline as ImbPipe
import joblib

## Data Preprocessing

> Import dataset

- Drop the `duration` column (it directly impacts the target variable, so it is not suitable for modelling)
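
In pandas terms, separating the column from the features might look like the sketch below. The tiny frame is made up for illustration; only the column names come from the dataset, and this is not the `WrangleData` implementation.

```python
import pandas as pd

# Toy rows; only the column names mirror the real dataset.
df = pd.DataFrame({
    "age": [56, 57, 37],
    "duration": [261, 149, 226],
    "y": ["no", "no", "yes"],
})

# Remove duration (and the target) so the model never sees them as inputs.
features = df.drop(columns=["duration", "y"])
target = df["y"]
print(list(features.columns))
```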

In [4]:
wrangle = WrangleData()

In [5]:
wrangle.load_data(path, sep=';')

You are now fit to use this object for wrangling


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [6]:
wrangle.data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,2,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,no
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-1.1,94.767,-50.8,1.028,4963.6,yes


In [7]:
wrangle.check_outliers()

Unnamed: 0,age,duration,campaign,pdays,previous,cons.conf.idx
0,56,261,1,999,0,-36.4
1,57,149,1,999,0,-36.4
2,37,226,1,999,0,-36.4
3,40,151,1,999,0,-36.4
4,56,307,1,999,0,-36.4
...,...,...,...,...,...,...
41183,73,334,1,999,0,-50.8
41184,46,383,1,999,0,-50.8
41185,56,189,2,999,0,-50.8
41186,44,442,1,999,0,-50.8


In [8]:
wrangle.treat_outliers(type_="isf")

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40823,33,technician,single,professional.course,no,yes,no,cellular,sep,fri,...,1,999,0,nonexistent,-1.1,94.199,-37.5,0.879,4963.6,yes
40830,48,services,married,basic.6y,no,no,no,cellular,sep,mon,...,2,999,0,nonexistent,-1.1,94.199,-37.5,0.879,4963.6,yes
40831,32,admin.,married,high.school,no,no,no,cellular,sep,mon,...,2,999,0,nonexistent,-1.1,94.199,-37.5,0.879,4963.6,no
40839,48,unemployed,married,professional.course,no,yes,no,cellular,sep,mon,...,2,999,0,nonexistent,-1.1,94.199,-37.5,0.879,4963.6,yes
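
The `type_="isf"` option presumably refers to scikit-learn's `IsolationForest`, which the notebook imports for outlier detection. How `treat_outliers` uses it is not shown here, so the sketch below is only one plausible approach: fit the forest on numeric columns and keep the rows it labels as inliers. The data and the `contamination` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "campaign": rng.poisson(2, 500),
})
df.loc[0, "age"] = 400  # one obviously extreme row

# fit_predict returns 1 for inliers and -1 for outliers.
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(df[["age", "campaign"]])

# Keep inliers only; .copy() avoids chained-assignment warnings later.
cleaned = df[labels == 1].copy()
print(len(df), len(cleaned))
```

Because rows are dropped rather than renumbered, the index of the cleaned frame has gaps, which matches the non-contiguous row labels in the output above.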


In [9]:
wrangle.split_data_single(target_cols=['duration']);

In [10]:
wrangle.split2.head(2)

Unnamed: 0,duration
0,261
1,149


In [11]:
wrangle.split1.head(2)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [12]:
wrangle.encode(use_split1 = True)

Unnamed: 0,age,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,0,...,45,46,47,48,49,50,51,52,53,54
0,56,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,57,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,37,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,40,1,999,0,1.1,93.994,-36.4,4.857,5191.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,56,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37062,43,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
37063,42,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
37064,42,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
37065,32,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
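
The integer-named columns in the output above look like one-hot indicators appended to the numeric columns. The internals of `encode` are not shown, so the following is a hedged sketch of that idea using `OneHotEncoder` on a toy frame; the column lists and the data are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "age": [56, 57, 37],
    "marital": ["married", "married", "single"],
    "contact": ["telephone", "cellular", "cellular"],
})
cat_cols = ["marital", "contact"]
num_cols = ["age"]

# handle_unknown="ignore" keeps transform from failing on unseen categories.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[cat_cols]).toarray()

# Re-attach the indicator columns to the numeric ones, keeping the row index.
encoded_df = pd.DataFrame(encoded, index=df.index)
result = pd.concat([df[num_cols], encoded_df], axis=1)
print(result.shape)
```

Passing `index=df.index` matters when rows were dropped earlier: without it, `pd.concat` would align on mismatched labels and introduce missing values.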


In [13]:
wrangle.split1

Unnamed: 0,age,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,0,...,45,46,47,48,49,50,51,52,53,54
0,56,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,57,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,37,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,40,1,999,0,1.1,93.994,-36.4,4.857,5191.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,56,1,999,0,1.1,93.994,-36.4,4.857,5191.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37062,43,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
37063,42,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
37064,42,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
37065,32,1,999,0,-2.9,92.469,-33.6,1.029,5076.2,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [14]:
wrangle.scale_data(use_split2=True)

array([[ 0.4047619 ],
       [-0.12857143],
       [ 0.23809524],
       ...,
       [-0.03333333],
       [ 0.66190476],
       [ 0.11904762]])
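
The scaled values are consistent with robust scaling, (x - median) / IQR: with a median of 176 and an interquartile range of 210 (both inferred from the output, not stated in the notebook), the first three durations 261, 149 and 226 map to 0.4048, -0.1286 and 0.2381. A minimal sketch with `RobustScaler` on made-up data follows; whether `scale_data` actually uses this scaler is an assumption.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up call durations in seconds.
durations = np.array([[100.0], [150.0], [200.0], [250.0], [300.0]])

scaler = RobustScaler()  # centres on the median, scales by the IQR
scaled = scaler.fit_transform(durations)

# The same transform written out by hand.
median = np.median(durations)
q1, q3 = np.percentile(durations, [25, 75])
manual = (durations - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```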

In [15]:
wrangle.split2

array([[ 0.4047619 ],
       [-0.12857143],
       [ 0.23809524],
       ...,
       [-0.03333333],
       [ 0.66190476],
       [ 0.11904762]])

In [1]:
"Works Fine"

'Works Fine'

### Modelling

In [20]:
p = Preprocessor() 

In [21]:
data = wrangle.data 

In [22]:
data.head(2)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [23]:
p.fit(data)

Fitted


The `Preprocessor` object is used to prepare the data for modelling.

In [24]:
data.head(2)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [25]:
p.check_outliers()

Unnamed: 0,age,duration,campaign,pdays,previous
0,56,261,1,999,0
1,57,149,1,999,0
2,37,226,1,999,0
3,40,151,1,999,0
4,56,307,1,999,0
...,...,...,...,...,...
40823,33,395,1,999,0
40830,48,188,2,999,0
40831,32,169,2,999,0
40839,48,315,2,999,0


In [26]:
treated_data = p.treat_outliers(type_='isf')

In [27]:
 treated_data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36223,27,technician,single,university.degree,no,no,no,cellular,may,thu,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.270,5099.1,no
36365,33,admin.,single,high.school,no,yes,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,no
36397,31,management,married,university.degree,no,no,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,no
36422,31,admin.,single,university.degree,no,no,yes,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,no


In [28]:
p.map_col_values(col_name='y', values_dict={'yes':1, 'no':0})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36223,27,technician,single,university.degree,no,no,no,cellular,may,thu,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.270,5099.1,0
36365,33,admin.,single,high.school,no,yes,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36397,31,management,married,university.degree,no,no,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36422,31,admin.,single,university.degree,no,no,yes,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
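
The `SettingWithCopyWarning` above typically appears when a column is assigned on a DataFrame that was itself produced by filtering another one, as outlier removal does. One common remedy, sketched below on toy data, is to take an explicit `.copy()` after filtering and then assign with `.map`. Whether `Preprocessor` filters this way internally is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"age": [56, 57, 37, 40], "y": ["no", "no", "yes", "no"]})

# Filtering returns a new object; .copy() makes the ownership explicit
# so later assignments do not trigger SettingWithCopyWarning.
kept = df[df["age"] > 38].copy()
kept["y"] = kept["y"].map({"yes": 1, "no": 0})

print(kept["y"].tolist())
```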


In [29]:
p.data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36223,27,technician,single,university.degree,no,no,no,cellular,may,thu,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.270,5099.1,0
36365,33,admin.,single,high.school,no,yes,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36397,31,management,married,university.degree,no,no,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36422,31,admin.,single,university.degree,no,no,yes,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0


In [30]:
p.data.rename(columns={'y':'purchases'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


In [31]:
p.data

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,purchases
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36223,27,technician,single,university.degree,no,no,no,cellular,may,thu,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.270,5099.1,0
36365,33,admin.,single,high.school,no,yes,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36397,31,management,married,university.degree,no,no,no,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0
36422,31,admin.,single,university.degree,no,no,yes,cellular,jun,tue,...,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2,0


In [32]:
p.split_data_single(target_cols=['purchases'])

(       age         job  marital          education  default housing loan  \
 0       56   housemaid  married           basic.4y       no      no   no   
 1       57    services  married        high.school  unknown      no   no   
 2       37    services  married        high.school       no     yes   no   
 3       40      admin.  married           basic.6y       no      no   no   
 4       56    services  married        high.school       no      no  yes   
 ...    ...         ...      ...                ...      ...     ...  ...   
 36223   27  technician   single  university.degree       no      no   no   
 36365   33      admin.   single        high.school       no     yes   no   
 36397   31  management  married  university.degree       no      no   no   
 36422   31      admin.   single  university.degree       no      no  yes   
 36438   31  management  married  university.degree       no      no   no   
 
          contact month day_of_week  duration  campaign  pdays  previous  

In [33]:
p.features

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36223,27,technician,single,university.degree,no,no,no,cellular,may,thu,68,1,999,0,nonexistent,-1.8,92.893,-46.2,1.270,5099.1
36365,33,admin.,single,high.school,no,yes,no,cellular,jun,tue,167,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2
36397,31,management,married,university.degree,no,no,no,cellular,jun,tue,217,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2
36422,31,admin.,single,university.degree,no,no,yes,cellular,jun,tue,210,1,999,0,nonexistent,-2.9,92.963,-40.8,1.262,5076.2


In [34]:
p.target

Unnamed: 0,purchases
0,0
1,0
2,0
3,0
4,0
...,...
36223,0
36365,0
36397,0
36422,0


In [35]:
p.encode()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,42,43,44,45,46,47,48,49,50,51
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33357,33,277,1,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
33358,35,200,3,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
33359,48,725,1,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
33360,33,204,1,999,1,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [36]:
p.features

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,42,43,44,45,46,47,48,49,50,51
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33357,33,277,1,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
33358,35,200,3,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
33359,48,725,1,999,0,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
33360,33,204,1,999,1,-1.8,92.893,-46.2,1.291,5099.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [37]:
p.split_data_double() 

(       age  duration  campaign  pdays  previous  emp.var.rate  cons.price.idx  \
 4096    39       335         3    999         0           1.1          93.994   
 2392    40       202         3    999         0           1.1          93.994   
 29702   45       170         1    999         1          -1.8          93.075   
 4659    46       323         1    999         0           1.1          93.994   
 23462   40       183         3    999         0           1.4          93.444   
 ...    ...       ...       ...    ...       ...           ...             ...   
 20992   34        85         3    999         0           1.4          93.444   
 26885   36       235         3    999         0          -0.1          93.200   
 15366   56       383         1    999         0           1.4          93.918   
 909     38       181         1    999         0           1.1          93.994   
 13500   27       121         3    999         0           1.4          93.918   
 
        cons.c

In [38]:
p.X_train

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,42,43,44,45,46,47,48,49,50,51
4096,39,335,3,999,0,1.1,93.994,-36.4,4.858,5191.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2392,40,202,3,999,0,1.1,93.994,-36.4,4.856,5191.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
29702,45,170,1,999,1,-1.8,93.075,-47.1,1.405,5099.1,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4659,46,323,1,999,0,1.1,93.994,-36.4,4.858,5191.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
23462,40,183,3,999,0,1.4,93.444,-36.1,4.964,5228.1,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20992,34,85,3,999,0,1.4,93.444,-36.1,4.964,5228.1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
26885,36,235,3,999,0,-0.1,93.200,-42.0,4.076,5195.8,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
15366,56,383,1,999,0,1.4,93.918,-42.7,4.957,5228.1,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
909,38,181,1,999,0,1.1,93.994,-36.4,4.856,5191.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
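
`split_data_double` appears to produce `X_train`, `X_test`, `y_train` and `y_test` with shuffled rows. A comparable split with scikit-learn, sketched on synthetic data, is shown below. The `stratify` argument, the test size and the random seed are assumptions, not settings taken from the notebook; stratifying keeps the minority-class share the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)  # 10% positives

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# With stratify=y both parts keep roughly the same positive rate.
print(y_train.mean(), y_test.mean())
```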


#### Train Models

#### Models to be used

- LogisticRegression (CV)

- RandomForest Classifier

- SVM

- XGBoost

- MLP (Multi-Layer Perceptron Network)

In [39]:
from sklearn.pipeline import FeatureUnion, Pipeline  
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline, make_pipeline
from sklearn.decomposition import PCA
from imblearn.metrics import make_index_balanced_accuracy
from sklearn.metrics import balanced_accuracy_score

from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.tree import ExtraTreeClassifier
import xgboost
from sklearn.neural_network import MLPClassifier

##### model1 - Logistic Regression CV

In [40]:
model1 = make_pipeline(
    SMOTE(sampling_strategy=0.60),
    PCA(n_components=45),
    LogisticRegressionCV(cv=5, random_state=123, max_iter=1000),
);

In [41]:
model1.fit(p.X_train,p.y_train.purchases.ravel());

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logist
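
The warning itself suggests scaling the data. PCA is also sensitive to feature scale, so a `StandardScaler` placed before it is a common arrangement. The sketch below uses synthetic features on very different scales and is not the notebook's pipeline (SMOTE is left out to keep it short).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
noise = rng.normal(scale=0.5, size=500)
y = (X[:, 0] + X[:, 1] + noise > 0).astype(int)
# Blow up the scale of two columns so features differ by orders of magnitude.
X[:, 2] *= 5000
X[:, 3] *= 1000

scaled_model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    LogisticRegression(max_iter=1000),
)
scaled_model.fit(X, y)

# n_iter_ reports how many solver iterations were actually needed.
print(scaled_model[-1].n_iter_)
```

If `n_iter_` stays well below `max_iter`, the solver converged and the warning should not appear.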

##### Test model1

In [42]:
model1_pred = model1.predict(p.X_test)

In [43]:
model1_pred

array([0, 0, 1, ..., 0, 0, 0])

In [44]:
balanced_accuracy_score(p.y_test, model1_pred)

0.8601190476190477
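
Balanced accuracy is the mean of the per-class recalls, which makes it more informative than plain accuracy on an imbalanced target. As a check, the sketch below rebuilds label arrays from the counts in model 1's confusion matrix (reported later in this notebook: 2620 and 292 for the 2912 non-purchasers, 28 and 128 for the 156 purchasers) and reproduces the score above. Reading the matrix rows as the actual classes is an assumption based on the row totals.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Counts taken from model 1's confusion matrix.
y_true = np.array([0] * 2912 + [1] * 156)
y_pred = np.array([0] * 2620 + [1] * 292 + [0] * 28 + [1] * 128)

recall_neg = 2620 / 2912   # share of non-purchasers predicted correctly
recall_pos = 128 / 156     # share of purchasers predicted correctly
manual = (recall_neg + recall_pos) / 2

print(round(manual, 4))                                   # 0.8601
print(round(balanced_accuracy_score(y_true, y_pred), 4))  # 0.8601
print(round(accuracy_score(y_true, y_pred), 4))           # 0.8957
```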

##### Train model2 - Logistic Regression

In [45]:
model2 = make_pipeline(
    SMOTE(sampling_strategy=0.60),
    PCA(n_components=45),
    LogisticRegression(random_state=124, max_iter=1000),
)

In [46]:
model2.fit(p.X_train,p.y_train.purchases.ravel());

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


##### Test Model2

In [47]:
model2_pred = model2.predict(p.X_test)
model2_pred

array([0, 0, 1, ..., 0, 0, 0])

In [48]:
balanced_accuracy_score(p.y_test, model2_pred) 

0.8677312271062272

##### Train Model3 - RandomForest Classifier (SkLearn Implementation)

In [49]:
from sklearn.ensemble import RandomForestClassifier

In [50]:
model3 = make_pipeline(SMOTE(sampling_strategy=.60), PCA(n_components=45), RandomForestClassifier())
model3.fit(p.X_train,p.y_train.purchases.ravel()); 

##### Test Model3

In [51]:
model3_pred = model3.predict(p.X_test)
balanced_accuracy_score(p.y_test, model3_pred) 

0.7446199633699634

##### Train Model4 - RandomForest Classifier (Imblearn Implementation)

In [52]:
from imblearn.ensemble import BalancedRandomForestClassifier

In [53]:
model4 = make_pipeline(SMOTE(sampling_strategy=.60), PCA(n_components=45), BalancedRandomForestClassifier())
model4.fit(p.X_train,p.y_train.purchases.ravel());

##### Test Model4

In [54]:
model4_pred = model4.predict(p.X_test)
model4_pred_proba = model4.predict_proba(p.X_test)

> Probability of each input belonging to a particular class (purchase and no-purchase)

In [55]:
model4_pred_proba

array([[0.89, 0.11],
       [0.99, 0.01],
       [0.28, 0.72],
       ...,
       [1.  , 0.  ],
       [0.99, 0.01],
       [0.99, 0.01]])
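
The columns of `predict_proba` follow the order of the fitted estimator's `classes_` attribute, so with labels 0 and 1 the second column is the purchase probability. `predict` effectively picks the more probable class; a different cut-off can be applied to the second column directly. The sketch below uses a plain logistic regression on synthetic data, and the 0.7 threshold is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)
print(clf.classes_)  # column order of proba

# Probability of the positive class (label 1) is the second column.
p_purchase = proba[:, 1]
default_pred = (p_purchase >= 0.5).astype(int)
strict_pred = (p_purchase >= 0.7).astype(int)  # fewer, more confident positives

print(default_pred.sum(), strict_pred.sum())
```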

> model score

In [56]:
balanced_accuracy_score(p.y_test, model4_pred)  

0.8282394688644689

> On balanced accuracy, BalancedRandomForestClassifier (0.828) performs better than RandomForestClassifier (0.745)

##### Train Model5 - SVM Classification

In [57]:
from sklearn.svm import SVC

In [58]:
model5 = make_pipeline(SMOTE(sampling_strategy=.60), PCA(n_components=45), SVC())
model5.fit(p.X_train,p.y_train.purchases.ravel());

##### Test Model5

In [59]:
model5_pred = model5.predict(p.X_test)

In [60]:
balanced_accuracy_score(p.y_test, model5_pred)  

0.8265796703296704

### Measuring Model Performance

In [134]:
y_train = p.y_train.purchases.ravel()
y_test = p.y_test.purchases.ravel()

##### 1. Evaluating Model Performances
> Put all models in a list and create a Metrics object

In [135]:
models = [model1, model2, model3, model4, model5]
model_evaluations = Metrics(p.X_train, y_train, p.X_test, y_test)

> Create an empty Dataframe for models evaluations

In [136]:
models_scores = pd.DataFrame()

> Loop through models and evaluate

In [65]:
for model in models:
    models_scores = model_evaluations.evaluate_classifier(clf=model, models_eval_scores=models_scores)

LogisticRegressionCV


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

acc->0.8992829204693612, bal_acc->0.8650412087912087
LogisticRegression


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


acc->0.8973272490221643, bal_acc->0.8609775641025641
RandomForestClassifier
acc->0.9423076923076923, bal_acc->0.7481684981684982
BalancedRandomForestClassifier
acc->0.9276401564537158, bal_acc->0.8314445970695971
SVC
acc->0.8396349413298566, bal_acc->0.8275526556776557


##### Comparison of Model Performances (Accuracy vs. Balanced Accuracy)

In [68]:
models_scores.T

| Model | Accuracy | Balanced accuracy |
|---|---|---|
| LogisticRegressionCV | 0.899 | 0.865 |
| LogisticRegression | 0.897 | 0.861 |
| RandomForestClassifier | 0.942 | 0.748 |
| BalancedRandomForestClassifier | 0.928 | 0.831 |
| SVC | 0.84 | 0.828 |
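
> The gap between the two columns follows from the class imbalance. As a sketch, consider a baseline that always predicts the majority class on a test set with the class sizes implied by the confusion matrices in the next subsection (2912 non-purchases, 156 purchases). The function name `accuracy_and_balanced_accuracy` is illustrative and is not part of the project's `Metrics` class.

```python
def accuracy_and_balanced_accuracy(y_true, y_pred):
    """Compute accuracy and balanced accuracy for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    accuracy = (tp + tn) / len(y_true)
    # Balanced accuracy is the mean of the recall on each class.
    balanced = 0.5 * (tp / n_pos + tn / n_neg)
    return accuracy, balanced


# Assumed class sizes: 2912 negatives (0) and 156 positives (1).
y_true = [0] * 2912 + [1] * 156
y_pred = [0] * len(y_true)  # baseline: always predict the majority class

acc, bal = accuracy_and_balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
# accuracy=0.949, balanced accuracy=0.500
```

> Under these assumed class sizes, the baseline scores a higher accuracy than any model in the table while identifying no purchasers, which is why balanced accuracy is the more informative column here.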


##### 2. Test conf_matrix, balanced_accuracy_score, balanced_classification_error, specif_sensitiv

In [91]:
predictions = [model1_pred, model2_pred, model3_pred, model4_pred, model5_pred]

confusion matrix

In [101]:
for count, pred in enumerate(predictions, start=1):
    print(f'model {count} confusion matrix -->\n{model_evaluations.conf_matrix(pred)}\n')

model 1 confusion matrix -->
               Actual_+ve  Actual_-ve
predicted_+ve        2620         292
predicted_-ve          28         128

model 2 confusion matrix -->
               Actual_+ve  Actual_-ve
predicted_+ve        2627         285
predicted_-ve          26         130

model 3 confusion matrix -->
               Actual_+ve  Actual_-ve
predicted_+ve        2806         106
predicted_-ve          74          82

model 4 confusion matrix -->
               Actual_+ve  Actual_-ve
predicted_+ve        2733         179
predicted_-ve          44         112

model 5 confusion matrix -->
               Actual_+ve  Actual_-ve
predicted_+ve        2462         450
predicted_-ve          30         126
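
> The printed row and column labels do not line up with the sensitivity and specificity values reported further down if rows are read as predictions: each row of model 1 sums to a class size (2620 + 292 = 2912 and 28 + 128 = 156), and 128/156 matches the reported sensitivity. The sketch below therefore assumes the layout used by scikit-learn's `confusion_matrix`, with rows for actual classes and columns for predicted classes, and builds such a matrix by counting. `confusion_counts` is an illustrative helper, not the project's `conf_matrix`.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """2x2 matrix as nested lists: rows = actual 0/1, columns = predicted 0/1."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(actual, predicted)] for predicted in (0, 1)]
            for actual in (0, 1)]


# Tiny hand-made example (not the notebook's data).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 1]]
```

> Read this way, the top-left cell holds true negatives and the bottom-right cell holds true positives.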



balanced accuracy score

In [102]:
for count, pred in enumerate(predictions, start=1):
    print(f'model {count} balanced accuracy -->\n{model_evaluations.balanced_accuracy_score(pred)}\n')

model 1 balanced accuracy -->
0.8601190476190477

model 2 balanced accuracy -->
0.8677312271062272

model 3 balanced accuracy -->
0.7446199633699634

model 4 balanced accuracy -->
0.8282394688644689

model 5 balanced accuracy -->
0.8265796703296704



balanced classification error

In [112]:
for count, pred in enumerate(predictions, start=1):
    print(f'model {count} balanced classification error -->\n{model_evaluations.balanced_classification_error(pred)}\n')

model 1 balanced classification error -->
0.13988095238095233

model 2 balanced classification error -->
0.1322687728937728

model 3 balanced classification error -->
0.2553800366300366

model 4 balanced classification error -->
0.17176053113553114

model 5 balanced classification error -->
0.1734203296703296



In [137]:
for count, pred in enumerate(predictions, start=1):
    print(f'model {count} specificity vs. sensitivity -->\n{model_evaluations.specif_sensitiv(pred)}\n')

model 1 specificity vs. sensitivity -->
   sensitivity  specificity
0     0.820513     0.899725

model 2 specificity vs. sensitivity -->
   sensitivity  specificity
0     0.833333     0.902129

model 3 specificity vs. sensitivity -->
   sensitivity  specificity
0     0.525641     0.963599

model 4 specificity vs. sensitivity -->
   sensitivity  specificity
0     0.717949      0.93853

model 5 specificity vs. sensitivity -->
   sensitivity  specificity
0     0.807692     0.845467
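
> The figures reported for model 1 can be reproduced from its confusion matrix, assuming rows are actual classes (non-purchase first) and columns are predicted classes. Under that assumption, the sketch below recomputes sensitivity, specificity, balanced accuracy, and balanced classification error. The variable names are illustrative.

```python
import math

# Model 1 counts, read as rows = actual class, columns = predicted class.
tn, fp = 2620, 292  # actual non-purchases: correctly rejected / false alarms
fn, tp = 28, 128    # actual purchases: missed / correctly identified

sensitivity = tp / (tp + fn)  # recall on the purchase class
specificity = tn / (tn + fp)  # recall on the non-purchase class
balanced_accuracy = (sensitivity + specificity) / 2
balanced_error = 1 - balanced_accuracy

print(f"sensitivity={sensitivity:.6f} specificity={specificity:.6f}")
print(f"balanced accuracy={balanced_accuracy:.6f} balanced error={balanced_error:.6f}")
```

> The results agree with the values printed above for model 1 (0.820513, 0.899725, 0.860119, and 0.139881 after rounding), which supports the assumed row/column reading.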



In [191]:
# Boolean masks split model1's predictions into predicted 0s and predicted 1s
x = model1_pred[model1_pred == 0], model1_pred[model1_pred == 1]

In [228]:
y = [[1,1,1,1,1], [0,0,0,0,0]]

In [223]:
# Place the two groups side by side; the shorter column is padded with NaN
pd.concat([pd.DataFrame(x[0].astype(object)), pd.DataFrame(x[1].astype(object))], axis=1)

Unnamed: 0,0,0.1
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
2643,0,
2644,0,
2645,0,
2646,0,


### Building Model Pipelines

In [None]:
#%%writefile pipeline.py
#%%writefile ../scripts/project_package/model_package/pipeline.py 
from sklearn.pipeline import FeatureUnion, Pipeline  
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline, make_pipeline