# Bank Marketing: Client Subscription Prediction

In [34]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

## Dataset

Source: https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

We use `bank-full.csv` from this archive for the prediction.

**Bank client data**:
1. age (numeric)
2. job: type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
3. marital: marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
4. education (categorical: "unknown", "secondary", "primary", "tertiary")
5. default: has credit in default? (binary: "yes", "no")
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: "yes", "no")
8. loan: has personal loan? (binary: "yes", "no")

**Related with the last contact of the current campaign**:

9. contact: contact communication type (categorical: "unknown", "telephone", "cellular")
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
12. duration: last contact duration, in seconds (numeric)

**Other attributes**:

13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")

**Output variable (desired target)**:

17. y: has the client subscribed a term deposit? (binary: "yes", "no")

**[Moro et al., 2011]** S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

## Problem description and model selection
- Given the client features above, we need to predict whether the client has subscribed to a term deposit.
- The target `y` takes one of two values, so this is a supervised binary classification problem. We use logistic regression as the classifier.
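
As a rough illustration of what a logistic regression classifier computes, the sketch below maps a linear score to a probability with the sigmoid function and thresholds it at 0.5. The weights, bias, and input values are made-up numbers chosen only to show the mechanics; they are not fitted values from this dataset.

```python
import math


def sigmoid(z):
    # Squash a real-valued score into the (0, 1) interval
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    # Linear score w . x + b, followed by the sigmoid
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical weights and inputs, for illustration only
weights = [0.8, -0.5]
bias = -0.2
client = [1.0, 2.0]

probability = predict_proba(client, weights, bias)
subscribed = int(probability >= 0.5)  # 1 = "yes", 0 = "no"
print(round(probability, 3), subscribed)
```

The 0.5 threshold is the conventional default for a hard class decision; a different cut-off could be used if one class matters more than the other.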

## Data Preparation
- Download the data, read it with pandas
- Look at the data
- Make column names and values look uniform
- Check if all columns read correctly
- Check if the target variable `y` needs any preparation

In [35]:
df = pd.read_csv('bank-full.csv', delimiter=';')
df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [36]:
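# Keep only the features used in this analysis, plus the target `y`
features = ['age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 
            'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y']

# .copy() gives an independent frame, so later column assignments do not act on a slice of the original
df = df[features].copy()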
df.head()

Unnamed: 0,age,job,marital,education,balance,housing,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,2143,yes,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,29,yes,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,2,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,1506,yes,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,1,no,unknown,5,may,198,1,-1,0,unknown,no


**Look at the data** 

In [37]:
df.head().T

Unnamed: 0,0,1,2,3,4
age,58,44,33,47,33
job,management,technician,entrepreneur,blue-collar,unknown
marital,married,single,married,married,single
education,tertiary,secondary,secondary,unknown,unknown
balance,2143,29,2,1506,1
housing,yes,yes,yes,yes,no
contact,unknown,unknown,unknown,unknown,unknown
day,5,5,5,5,5
month,may,may,may,may,may
duration,261,151,76,92,198


- Column names and values already use a uniform case (all lowercase).

In [38]:
df.dtypes

age           int64
job          object
marital      object
education    object
balance       int64
housing      object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

All columns were read with the types described in the dataset documentation.

In [39]:
df.isnull().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

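There are no null values. Note that several categorical columns (for example `job`, `education`, `contact`, `poutcome`) encode missing information as the category "unknown", which `isnull()` does not count.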

## Question 1: 
What is the most frequent observation (mode) for the column education? 

In [40]:
df['education'].value_counts()

education
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: count, dtype: int64

The most frequent value of `education` is `secondary`, with 23,202 rows.

## Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

1. What are the two features that have the biggest correlation?
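
To make the correlation coefficient concrete, here is a minimal pure-Python sketch of the Pearson formula, together with a search for the pair of columns with the largest absolute correlation. The toy columns `a`, `b`, `c` are invented for illustration and are not taken from `bank-full.csv`.

```python
import math


def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy columns (hypothetical values)
columns = {
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
}

# Compare every unordered pair and keep the one with the largest |r|
names = list(columns)
pairs = [(n1, n2) for i, n1 in enumerate(names) for n2 in names[i + 1:]]
best_pair = max(pairs, key=lambda p: abs(pearson(columns[p[0]], columns[p[1]])))
print(best_pair)
```

Comparing absolute values matters because a strong negative correlation is as informative as a strong positive one.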

In [41]:
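# Columns with dtype object are categorical; at this point that still includes the target `y`
categorical = list(df.dtypes[df.dtypes == 'object'].index)
# Everything else is numerical (set difference, so the column order is arbitrary)
numerical = list(set(df.columns) - set(categorical))
# Drop the target so that `categorical` holds features only
categorical.remove('y')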

In [42]:
df[numerical].corr()

Unnamed: 0,pdays,day,previous,balance,age,campaign,duration
pdays,1.0,-0.093044,0.45482,0.003435,-0.023758,-0.088628,-0.001565
day,-0.093044,1.0,-0.05171,0.004503,-0.00912,0.16249,-0.030206
previous,0.45482,-0.05171,1.0,0.016674,0.001288,-0.032855,0.001203
balance,0.003435,0.004503,0.016674,1.0,0.097783,-0.014578,0.02156
age,-0.023758,-0.00912,0.001288,0.097783,1.0,0.00476,-0.004648
campaign,-0.088628,0.16249,-0.032855,-0.014578,0.00476,1.0,-0.08457
duration,-0.001565,-0.030206,0.001203,0.02156,-0.004648,-0.08457,1.0


In [43]:
print(df[['age']].corrwith(df['balance']))
print(df[['day']].corrwith(df['campaign']))
print(df[['day']].corrwith(df['pdays']))
print(df[['pdays']].corrwith(df['previous']))

age    0.097783
dtype: float64
day    0.16249
dtype: float64
day   -0.093044
dtype: float64
pdays    0.45482
dtype: float64


`pdays` and `previous` have the highest correlation (0.45482).

## Target encoding

In [44]:
df['y'].unique()

array(['no', 'yes'], dtype=object)

In [45]:
# replace the values yes/no with 1/0
df['y'] = (df['y'] == 'yes').astype(int)


## Split the data

In [46]:
from sklearn.model_selection import train_test_split

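# Split into train/val/test sets with a 60%/20%/20% distribution:
# first hold out 20% for test, then take 25% of the remaining 80% (= 20% overall) for validation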
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)


In [47]:
len(df_train), len(df_val), len(df_test)

(27126, 9042, 9043)
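
The arithmetic behind these sizes can be checked by hand. The sketch below assumes the held-out part of each split is the ceiling of `test_size` times the number of rows, with the rest going to the other part; that assumption reproduces the sizes printed above.

```python
import math

n_rows = 27126 + 9042 + 9043  # total rows, from the sizes printed above

# Assumption: held-out size = ceil(test_size * n), remainder goes to training
n_test = math.ceil(0.2 * n_rows)
n_full_train = n_rows - n_test
n_val = math.ceil(0.25 * n_full_train)
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(round(n_train / n_rows, 3), round(n_val / n_rows, 3), round(n_test / n_rows, 3))
```

Taking 25% of the remaining 80% is what yields a validation set of about 20% of the full data.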

In [48]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train['y']
y_val = df_val['y']
y_test = df_test['y']

del df_train['y']
del df_val['y']
del df_test['y']

## Question 3: 
Mutual information between the target `y` and each categorical feature, computed on the training set.

In [49]:
from sklearn.metrics import mutual_info_score

In [50]:
def mutual_info_y_score(series):
    return mutual_info_score(series, y_train)

In [51]:
categorical

['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']

In [52]:
mi = df_train[categorical].apply(mutual_info_y_score)
mi.sort_values(ascending=False)

poutcome     0.029533
month        0.025090
contact      0.013356
housing      0.010343
job          0.007316
education    0.002697
marital      0.002050
dtype: float64
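
For intuition about what these scores measure, the following is a minimal pure-Python sketch of mutual information between two discrete variables, in nats (natural logarithm). The toy lists are invented for illustration; one feature determines the target exactly and the other is independent of it.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    # MI in nats: sum over (x, y) of p(x, y) * ln(p(x, y) / (p(x) * p(y)))
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), n_xy in joint.items():
        mi += (n_xy / n) * math.log(n_xy * n / (count_x[x] * count_y[y]))
    return mi


# Toy data (hypothetical values)
target = [1, 1, 0, 0, 1, 0]
informative = ['a', 'a', 'b', 'b', 'a', 'b']    # determines the target exactly
uninformative = ['x', 'y', 'x', 'y', 'x', 'x']  # independent of the target

mi_informative = mutual_information(informative, target)
mi_uninformative = mutual_information(uninformative, target)
print(mi_informative, mi_uninformative)
```

A feature that fully determines a balanced binary target reaches the maximum of ln 2, while an independent feature scores zero, so the small values in the table above indicate that no single categorical feature is strongly informative on its own.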

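## Question 4
One-hot encode the categorical features, train a logistic regression model on all features, and measure accuracy on the validation set.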

In [53]:
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)

train_dicts = df_train[categorical +  numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)
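
As a sketch of what this encoding step does with a list of record dictionaries, the pure-Python function below expands string values into `column=value` indicator columns and passes numeric values through unchanged. It is a simplified stand-in written for illustration, not the library's implementation; the two records reuse values from `df.head()`.

```python
def one_hot_encode(records):
    # Collect feature names: "column=value" for strings, "column" for numbers
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f'{key}={value}' if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    # Fill one row per record: 1.0 for the matching indicator, the raw number otherwise
    matrix = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f'{key}={value}']] = 1.0
            else:
                row[index[key]] = float(value)
        matrix.append(row)
    return feature_names, matrix


records = [
    {'age': 58, 'marital': 'married', 'housing': 'yes'},
    {'age': 33, 'marital': 'single', 'housing': 'no'},
]
feature_names, matrix = one_hot_encode(records)
print(feature_names)
print(matrix)
```

Unlike this sketch, the notebook fits the vectorizer on the training records only and then reuses it for validation, so both matrices share the same columns.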

In [54]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)

In [55]:
# Validation accuracy of the model trained on all features, rounded to one decimal
accuracy_with_all_features = (y_val == y_pred).mean().round(1)
print(accuracy_with_all_features)

0.9


## Question 5
Leave one feature out at a time and measure how the validation accuracy changes.

In [33]:
set(categorical +  numerical) - set(['age'])

{'balance',
 'campaign',
 'contact',
 'day',
 'duration',
 'education',
 'housing',
 'job',
 'marital',
 'month',
 'pdays',
 'poutcome',
 'previous'}

In [57]:
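test_features = ['age', 'balance', 'marital', 'previous']
all_features = categorical + numerical


def validation_accuracy(feature_set):
    # Vectorize train and validation with the SAME feature subset
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[feature_set].to_dict(orient='records'))
    X_val = dv.transform(df_val[feature_set].to_dict(orient='records'))

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    return (y_val == model.predict(X_val)).mean()


# Unrounded baseline, so the small differences are not distorted by rounding to 0.9
baseline_accuracy = validation_accuracy(all_features)

differences_in_accuracy = []

for feat in test_features:
    # Drop exactly one feature and retrain
    feature_set = [f for f in all_features if f != feat]
    differences_in_accuracy.append(baseline_accuracy - validation_accuracy(feature_set))


In [59]:
differences_in_accuracy

[-0.000796284007962833,
 -0.0013492590134925875,
 -0.0010174740101747126,
 -0.0006856890068568378]

Note: the values shown were recorded with the model trained on all features in every iteration and with the baseline rounded to 0.9, so they do not reflect a leave-one-out comparison. Rerun both cells to obtain the actual differences.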

## Question 6:
Compare the validation accuracy for different values of `C`.  
**What is C**:  
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.
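
To see why a smaller `C` means stronger regularization, consider one common formulation of the L2-penalized logistic regression objective: `0.5 * ||w||^2 + C * sum(log-loss)`. The sketch below evaluates it in pure Python; the exact objective used by a given solver may differ in details (for example, how the intercept is treated), and the toy data and weights are invented for illustration.

```python
import math


def l2_logistic_objective(weights, bias, X, y, C):
    # Penalty term: 0.5 * ||w||^2 (the bias is left out of the penalty in this sketch)
    penalty = 0.5 * sum(w * w for w in weights)
    # Data term: summed log-loss, with labels encoded as -1 / +1
    data_loss = 0.0
    for row, label in zip(X, y):
        score = bias + sum(w * x for w, x in zip(weights, row))
        data_loss += math.log(1.0 + math.exp(-label * score))
    return penalty + C * data_loss


# Toy data and weights (hypothetical values)
X = [[1.0, 2.0], [2.0, 0.5], [0.5, 1.5]]
y = [1, -1, 1]
weights, bias = [-1.0, 1.0], 0.0

penalty_shares = []
for C in [0.01, 1, 100]:
    total = l2_logistic_objective(weights, bias, X, y, C)
    # 0.5 * ||w||^2 equals 1.0 for these weights, so 1.0 / total is the penalty's share
    penalty_shares.append(1.0 / total)
    print(C, round(penalty_shares[-1], 3))
```

With a small `C` the penalty dominates the objective, which pushes the optimizer toward smaller weights; with a large `C` the data term dominates and the weights are free to fit the training set more closely.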


In [67]:
accuracy_scores = []

for inv_reg_strength in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=inv_reg_strength, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_val)
    
    accuracy_score = (y_val == y_pred).mean().round(3)
    accuracy_scores.append(accuracy_score)

print(accuracy_scores)


[0.898, 0.901, 0.901, 0.901, 0.901]


In [68]:
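# Index of the highest validation accuracy; argmax returns the first index on ties, i.e. the smallest such C
np.array(accuracy_scores).argmax()

1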