# Bank Marketing - Likelihood of a customer subscribing to a banking product

> The objective of this notebook is to build a machine learning model that predicts the likelihood of a customer subscribing to a banking product. The data comes from the UCI [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/bank+marketing#).

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import calendar
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from models.src.preprocessing import PreProcessor

## Load Data

In [2]:
# Notebooks do not define __file__, so resolve the data directory relative to the working directory.
DATA_DIR = os.path.abspath(os.path.join(os.getcwd(), '../../data'))
df = pd.read_csv(os.path.join(DATA_DIR, 'bank-full.csv'), sep=';')
df.head()

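|   | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|-----|-----|---------|-----------|---------|---------|---------|------|---------|-----|-------|----------|----------|-------|----------|----------|---|
| 0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
| 1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
| 2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
| 3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
| 4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |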


## Data Preprocessing

> Simple data exploration and quality checks in preparation for building an ML model.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


> We have a mixture of continuous and categorical features, and every column has a 100% fill rate. We also know from the dataset that some of the columns are binary:
* default
* housing
* loan
* y - target feature
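
Both observations can be checked directly rather than read off `df.info()`. The snippet below is a minimal sketch on a few toy rows shaped like the dataset; `sample`, `fill_rates` and `binary_cols` are illustrative names that do not appear in the notebook, and the rule "a column is binary if it only holds `yes` or `no`" is an assumption made for this check.

```python
import pandas as pd

# Toy rows shaped like the bank marketing data (illustrative, not the full dataset).
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "default": ["no", "no", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "yes", "no"],
    "y": ["no", "no", "yes"],
})

# Fill rate: share of non-null values per column (1.0 means fully populated).
fill_rates = sample.notnull().mean()

# Assumption: a column counts as binary if it only ever holds "yes" or "no".
binary_cols = [
    col for col in sample.columns
    if set(sample[col].dropna().unique()) <= {"yes", "no"}
]

print(fill_rates)
print(binary_cols)
```

Running the same two expressions against the full `df` should list the four columns above, provided they contain no values other than `yes` and `no`.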

> We will convert these binary columns to integers and also replace `month` with its numeric value. The logic for both steps is encapsulated in [PreProcessor](../src/preprocessing.py).

In [4]:
pp = PreProcessor()
df = pp.transform(df)

2019-03-23 14:36:17,441:INFO:Numerising month
2019-03-23 14:36:17,499:INFO:Binarising columns: default
2019-03-23 14:36:17,512:INFO:Binarising columns: housing
2019-03-23 14:36:17,525:INFO:Binarising columns: loan
2019-03-23 14:36:17,542:INFO:Binarising columns: y
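
The source of `PreProcessor` is not shown in this notebook, so the following is only a sketch of what the two logged steps might look like. `transform_sketch`, `MONTH_TO_NUMBER` and `BINARY_COLUMNS` are hypothetical names, and the `yes` to 1 / `no` to 0 mapping is an assumption; the real class may differ.

```python
import calendar

import pandas as pd

# Hypothetical stand-in for PreProcessor.transform; the real implementation
# lives in ../src/preprocessing.py and may differ.

# Map lowercase month abbreviations ("jan", "feb", ...) to 1..12.
MONTH_TO_NUMBER = {
    abbr.lower(): number
    for number, abbr in enumerate(calendar.month_abbr)
    if abbr  # index 0 of month_abbr is an empty string
}
BINARY_COLUMNS = ["default", "housing", "loan", "y"]


def transform_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `month` numerised and the yes/no columns binarised."""
    out = frame.copy()
    out["month"] = out["month"].map(MONTH_TO_NUMBER)
    for col in BINARY_COLUMNS:
        # Assumption: "yes" becomes 1 and "no" becomes 0.
        out[col] = out[col].map({"yes": 1, "no": 0})
    return out


sample = pd.DataFrame({
    "month": ["may", "jun", "dec"],
    "default": ["no", "no", "yes"],
    "housing": ["yes", "no", "yes"],
    "loan": ["no", "yes", "no"],
    "y": ["no", "no", "yes"],
})
transformed = transform_sketch(sample)
print(transformed.dtypes)
```

Working on a copy leaves the input frame untouched, which makes it safe to re-run the cell. Note that `Series.map` yields `NaN` for any value missing from the mapping, so an unexpected label would show up as a null rather than raise an error.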


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null int64
balance      45211 non-null int64
housing      45211 non-null int64
loan         45211 non-null int64
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null int64
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null int64
dtypes: int64(12), object(5)
memory usage: 5.9+ MB


> `month`, `default`, `housing`, `loan` and `y` are now all numerical features. The remaining categorical features have low cardinality, so we will one-hot encode them.
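
As a rough illustration of that encoding step, the sketch below checks cardinality with `nunique` and then applies `pd.get_dummies`. It runs on toy rows, and `sample`, `categorical_cols` and `encoded` are illustrative names; in the notebook the remaining object columns are `job`, `marital`, `education`, `contact` and `poutcome`.

```python
import pandas as pd

# Toy rows with two of the categorical columns (illustrative only).
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "contact": ["unknown", "cellular", "unknown"],
})
categorical_cols = ["marital", "contact"]

# Number of distinct values per categorical column; low counts keep the
# one-hot encoded frame narrow.
cardinality = sample[categorical_cols].nunique()

# One indicator column per category value; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=categorical_cols, dtype=int)

print(cardinality)
print(encoded.columns.tolist())
```

Each categorical column is replaced by one 0/1 column per distinct value, named `<column>_<value>`, while numeric columns such as `age` pass through unchanged.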