**Customer Propensity to Buy**

A model of customer propensity to buy, using data from a Portuguese bank taken from:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

First we import our libraries and read in our data (uploading it from a local hard drive is a bit of a faff in Colab!).

In [44]:
from google.colab import files
uploaded = files.upload()

Saving Bank.csv to Bank (1).csv


In [65]:
import io
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder
from sklearn.model_selection import GridSearchCV, train_test_split

In [58]:
df = pd.read_csv(io.BytesIO(uploaded["Bank.csv"]), sep=';')
df = df.dropna()
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


In [55]:
df['y'].value_counts()

no     4000
yes     521
Name: y, dtype: int64

**Preprocessing Dataframe**

We can see from above that the data is imbalanced by about 7.7:1 (4000 "no" to 521 "yes"). This will be important later on. There are not many explanatory variables, so dimensionality reduction is not needed.
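
As a quick check, the imbalance ratio and the share of the majority class can be computed directly from the counts printed above. This is a minimal sketch that uses only the two numbers from `value_counts()`; the variable names are illustrative.

```python
# Class counts as printed by df['y'].value_counts() above
counts = {"no": 4000, "yes": 521}

# Ratio of majority to minority class
ratio = counts["no"] / counts["yes"]

# Share of rows in the majority class, i.e. the accuracy of always predicting "no"
majority_share = counts["no"] / sum(counts.values())

print(f"imbalance ratio: {ratio:.2f}:1")
print(f"majority share:  {majority_share:.1%}")
```

The majority share is the accuracy a model would get by never predicting a sale, which is why accuracy alone says little on this dataset.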

For now, we identify the categories in the categorical columns and convert them into the correct formats for processing.

In [56]:
for i in ['default','housing','loan','job', 'marital', 'education','contact','month','poutcome']:
  print(df[i].unique())

['no' 'yes']
['no' 'yes']
['no' 'yes']
['unemployed' 'services' 'management' 'blue-collar' 'self-employed'
 'technician' 'entrepreneur' 'admin.' 'student' 'housemaid' 'retired'
 'unknown']
['married' 'single' 'divorced']
['primary' 'secondary' 'tertiary' 'unknown']
['cellular' 'unknown' 'telephone']
['oct' 'may' 'apr' 'jun' 'feb' 'aug' 'jan' 'jul' 'nov' 'sep' 'mar' 'dec']
['unknown' 'failure' 'other' 'success']


In [61]:
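#We can see that it is probably okay to ordinally encode the month and education, but the remaining categories will need to be one-hot encoded
#We use sklearn's built-in encoders. Note: OrdinalEncoder orders categories alphabetically by default, so months are not coded in calendar order unless `categories` is passed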
label = LabelEncoder()
df['y'] = label.fit_transform(df['y'])

ordinal = OrdinalEncoder()
df[['month', 'education']] = ordinal.fit_transform(df[['month', 'education']])

#One-hot encoding is slightly different: we make a one-hot array, append it to the dataframe, then drop the original column. pd.get_dummies makes this easier
for column in ['default','housing','loan','job', 'marital', 'contact','poutcome']:
    tempdf = pd.get_dummies(df[column], prefix=column)
    df = pd.merge(
        left=df,
        right=tempdf,
        left_index=True,
        right_index=True,
    )
    df = df.drop(columns=column)
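
One caveat on the ordinal encoding in the cell above: by default `OrdinalEncoder` sorts categories alphabetically, so `apr` is coded 0 and `aug` 1. If calendar order matters, the order can be passed explicitly through the `categories` parameter. The sketch below uses a small made-up frame (`demo`) rather than the bank data, and placing `unknown` first in the education order is an assumption, not something the dataset prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Explicit orderings; where "unknown" sits for education is an assumption
month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
education_order = ['unknown', 'primary', 'secondary', 'tertiary']

# Small illustrative frame, not the real bank data
demo = pd.DataFrame({
    'month': ['oct', 'may', 'apr', 'jan'],
    'education': ['primary', 'secondary', 'tertiary', 'unknown'],
})

# One list of categories per column, in the same order as the columns
ordinal = OrdinalEncoder(categories=[month_order, education_order])
encoded = ordinal.fit_transform(demo[['month', 'education']])
print(encoded)
```

Tree-based models split on thresholds, so a calendar ordering lets a single split separate, say, the first half of the year from the second, which an alphabetical coding cannot do.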


**Pre-Processing Model and Model Selection**

Now we define the independent and dependent variables as x and y, split them into train and test sets, standardise the values, and tune the hyperparameters. I have opted to use Random Forest in this case, as it is very robust and less prone to variance error. Because the classifier is resilient to overfitting, the imbalance in the data can be accounted for by adjusting the class weight setting, which makes life easier than having to use an oversampler. Other settings are left at their defaults, except n_jobs, which is altered to allow for parallel processing. The scoring metric I opted to use was F1.

Accuracy is misleading on an imbalanced set: a model that always predicts "no" would score about 88% (4000 of 4521) without finding a single buyer. We need precision so that the customers we flag as likely to buy really are likely to buy. However, we also need sensitivity (recall) to ensure we don't miss customers willing to buy, as these represent potentially lost sales. So as a compromise, F1 is best for this set. If the business valued one objective over the other, I would use the F-beta score instead, with beta set to weight precision against recall as needed. For example, if the business has the infrastructure to chase almost every single lead rather than lose a sale, we can set beta above 1 to favor recall.
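
To make the precision/recall trade-off concrete, the F-beta score can be written out from confusion-matrix counts. The counts below are hypothetical, chosen only for illustration, and `f_beta` is a helper defined here, not a library function.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 60 buyers found, 140 false alarms, 70 buyers missed
tp, fp, fn = 60, 140, 70

f1 = f_beta(tp, fp, fn, beta=1.0)      # balances precision and recall
f2 = f_beta(tp, fp, fn, beta=2.0)      # weights recall more heavily
f_half = f_beta(tp, fp, fn, beta=0.5)  # weights precision more heavily

print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f_half:.3f}")
```

In scikit-learn the same metric is available as `sklearn.metrics.fbeta_score`, and it can be wrapped with `make_scorer(fbeta_score, beta=2)` to serve as the `scoring` argument of `GridSearchCV`.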

In [76]:
#create x and y
x = df.drop('y', axis = 1)
y = df['y']

#split into train and test groups
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.25, random_state = 0)

#standardize the x variables
ss = StandardScaler()
xtrain = ss.fit_transform(xtrain)
xtest = ss.transform(xtest)


clf = RandomForestClassifier(class_weight = 'balanced', n_jobs = -1)
parameters = {'n_estimators':[80,100,120,150], 'min_samples_split':[2,3,4], 'ccp_alpha':[0,0.05,0.1]}
grid = GridSearchCV(clf, parameters, scoring = 'f1', cv = 5)
grid.fit(xtrain, ytrain)
grid.best_params_


{'ccp_alpha': 0.05, 'min_samples_split': 2, 'n_estimators': 80}

In [77]:
grid.score(xtrain, ytrain)

0.3590585659551177

**Model Deployment**


While this score isn't great, a low F1 is not unusual on an imbalanced dataset. Let's see how it performs on the test set.

In [78]:
grid.score(xtest, ytest)

0.38387096774193546

**Evaluation and Reflection**

Better than our train set! Excellent. Though the result is still not great. As usual, there are several ways we could address this:

1) There are many more hyperparameters which could be tuned, and possibly with wider ranges. I have only tuned those that, in my experience, are the most effective.

2) It is very easy to try a range of models by swapping in a different classifier in the code above. Another ensemble model that is very robust is ExtraTrees, but if time allowed, it would also be worth trying various linear models.

3) The evaluation I carried out was based on F1. Instead of using the grid, we could re-run the classifier with the best parameters to return a range of metrics from a confusion matrix. However, I opted not to do this because the ideal approach, had we wanted further metrics, would have been to run the GridSearch with multiple scorers instead.

4) As always with an imbalanced dataset, more data is key. With a larger dataset, and possibly also more explanatory variables, we could gain some improvement.
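
Point 3 can be sketched as follows: `GridSearchCV` accepts a dictionary of scorers, with `refit` naming the one used to pick the best model. The example runs on synthetic imbalanced data from `make_classification`, not the bank data, and uses a deliberately tiny grid so it finishes quickly. The data, grid values and variable names are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bank data: roughly 88% negatives, 12% positives
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.88, 0.12], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25,
                                      stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight='balanced', random_state=0)
params = {'n_estimators': [20, 40], 'min_samples_split': [2, 4]}

# Several scorers at once; refit='f1' keeps F1 as the selection criterion
scoring = {'f1': 'f1', 'precision': 'precision', 'recall': 'recall'}
grid = GridSearchCV(clf, params, scoring=scoring, refit='f1', cv=3)
grid.fit(Xtr, ytr)

# Cross-validated mean of each metric at the chosen parameters
best = grid.best_index_
for name in scoring:
    print(name, round(grid.cv_results_[f'mean_test_{name}'][best], 3))

# Confusion matrix on the held-out test set (rows: true, columns: predicted)
cm = confusion_matrix(yte, grid.predict(Xte))
print(cm)
```

Reading precision and recall side by side shows which of the two is holding the F1 score down, which in turn suggests whether a different beta or class weighting is worth trying.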