# Predicting Effectiveness of Bank Marketing 

`Author:` Rasika Guru (rguru@usc.edu)

This mini ML project predicts whether a person will subscribe to a bank's term deposit, based on the outcomes of previous marketing campaigns of a Portuguese banking institution. For more information about the data, see: https://archive.ics.uci.edu/ml/datasets/bank+marketing

### Importing packages that we'll use

In [1]:
import pandas as pd
import numpy as np
import sklearn
import sklearn.metrics  # needed for classification_report below
import sklearn.preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

### Reading the data

In [2]:
df = pd.read_csv('Data/bank-additional.csv', delimiter=';')
print(df.head())

   age          job  marital          education default  housing     loan  \
0   30  blue-collar  married           basic.9y      no      yes       no   
1   39     services   single        high.school      no       no       no   
2   25     services  married        high.school      no      yes       no   
3   38     services  married           basic.9y      no  unknown  unknown   
4   47       admin.  married  university.degree      no      yes       no   

     contact month day_of_week ...  campaign  pdays  previous     poutcome  \
0   cellular   may         fri ...         2    999         0  nonexistent   
1  telephone   may         fri ...         4    999         0  nonexistent   
2  telephone   jun         wed ...         1    999         0  nonexistent   
3  telephone   jun         fri ...         3    999         0  nonexistent   
4   cellular   nov         mon ...         1    999         0  nonexistent   

  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employe

### Dealing with Categorical features

Many of the features are categorical. scikit-learn classifiers expect numeric input, so we one-hot encode these features into dummy variables. The following function does this for a single column.

In [3]:
def hot_encoder(df, column_name):
    # Map each category string to an integer label.
    column = df[column_name].tolist()
    lab_enc = sklearn.preprocessing.LabelEncoder()
    enc_column = lab_enc.fit_transform(column)
    # OneHotEncoder expects a 2-D array with a single column.
    enc_column = np.reshape(enc_column, (len(enc_column), 1))
    enc = sklearn.preprocessing.OneHotEncoder()
    new_column = enc.fit_transform(enc_column).toarray()
    # Add one 0/1 column per category, named <column_name>_<label index>.
    for i in range(new_column.shape[1]):
        df[column_name + "_" + str(i)] = new_column[:, i]
    # Drop the original categorical column.
    df.drop(column_name, axis=1, inplace=True)
    return df
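
As an aside, pandas offers `get_dummies`, which builds the same kind of indicator columns in one call and names them after the category values rather than integer indices. The sketch below uses a small made-up frame (`toy`, not the bank data) to show the idea. It is an alternative, not what this notebook uses.

```python
import pandas as pd

# Hypothetical toy frame standing in for the bank data.
toy = pd.DataFrame({
    "age": [30, 39, 25],
    "marital": ["married", "single", "married"],
    "contact": ["cellular", "telephone", "telephone"],
})

# One indicator column per category value; the original columns are dropped.
encoded = pd.get_dummies(toy, columns=["marital", "contact"], dtype=float)
print(encoded.columns.tolist())
```

Column names such as `marital_single` are easier to interpret later than `marital_1`, which may matter when inspecting model coefficients.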

In [4]:
categorical_features = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']

All categorical features are one-hot encoded using the function above. The feature 'duration' is also removed: the duration of a call is known only after the call ends, by which point the outcome is known too, so it should not be used for prediction.

In [5]:
for feature in categorical_features:
    df = hot_encoder(df, feature)
df.drop('duration', axis=1, inplace=True)
print(df.shape)

(4119, 63)


Now we have 62 features + the response variable y and 4119 samples. 

### Splitting Data into train and test sets

In [6]:
X = df[[column for column in df.columns if column != 'y']]
y = df[['y']]
print(X.shape)
print(y.shape)

(4119, 62)
(4119, 1)


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
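
One optional refinement, not used in this notebook, is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. This can matter when one class is a small minority. The sketch below uses made-up labels (`y_toy`) and features (`X_toy`) rather than the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 'no' and 10 'yes'.
y_toy = np.array(["no"] * 90 + ["yes"] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

# With stratification the 'yes' share is the same in both splits.
print((y_tr == "yes").mean(), (y_te == "yes").mean())
```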

### Training a simple Logistic Regression Classifier

In [8]:
clf = LogisticRegression()
clf.fit(X_train, np.ravel(y_train))
y_pred = clf.predict(X_test)
report = sklearn.metrics.classification_report(y_test, y_pred)
print(report)

             precision    recall  f1-score   support

         no       0.92      0.98      0.95      1223
        yes       0.59      0.23      0.34       137

avg / total       0.89      0.91      0.89      1360



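The simple classifier reaches a weighted-average precision of 0.89, but that figure is dominated by the majority 'no' class (1223 of the 1360 test samples). For the 'yes' class, precision is 0.59 and recall is only 0.23, so most actual subscribers are missed.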
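
To make the per-class numbers concrete, the sketch below computes precision, recall and F1 from raw counts. The counts are illustrative: they were picked to be consistent with the rounded 'yes' row of the report above and are not read from the notebook's actual confusion matrix, which is never printed.

```python
# Illustrative counts for the 'yes' class, chosen to match the rounded report
# above; they are an assumption, not output from the notebook.
tp = 32   # subscribers the model flagged
fn = 105  # subscribers the model missed (tp + fn = 137, the 'yes' support)
fp = 22   # non-subscribers wrongly flagged

precision = tp / (tp + fp)   # of those flagged, how many subscribed
recall = tp / (tp + fn)      # of those who subscribed, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```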

### Training a Support Vector Classifier

In [9]:
clf = SVC()
clf.fit(X_train, np.ravel(y_train))
y_pred = clf.predict(X_test)
report = sklearn.metrics.classification_report(y_test, y_pred)
print(report)

             precision    recall  f1-score   support

         no       0.91      0.98      0.95      1223
        yes       0.49      0.15      0.22       137

avg / total       0.87      0.90      0.87      1360



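The SVC performs worse than logistic regression here, mainly on the 'yes' class (precision 0.49 vs. 0.59, recall 0.15 vs. 0.23), which is somewhat unexpected. One possible factor is that the features were not scaled, and the default RBF kernel is sensitive to feature scale; this was not tested here.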
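
One thing that could be tried, as an untested assumption rather than a result from this notebook, is standardizing the features before the SVC, since kernel SVMs are sensitive to feature scale. The sketch below shows the pipeline shape on synthetic data (`X_demo`, `y_demo` are made up). It makes no claim about how the scores on the bank data would change.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the bank data is not used here.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
X_demo[:, 0] *= 1000.0  # exaggerate one feature's scale, similar to pdays = 999

# Standardize each feature, then fit the SVC on the scaled values.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_demo, y_demo)
print(scaled_svc.score(X_demo, y_demo))  # accuracy on the training data
```

Wrapping both steps in a pipeline means the scaler is fitted only on the data passed to `fit`, so the test set does not leak into the scaling statistics.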

### Training an Ensemble classifier with logistic regression and SVC

In [10]:
clf1 = LogisticRegression()
clf2 = SVC(probability=True)

eclf1 = VotingClassifier(estimators=[('logist', clf1), ('svc', clf2)], voting='hard')
eclf1 = eclf1.fit(X_train, np.ravel(y_train))
y_pred1 = eclf1.predict(X_test)
report1 = sklearn.metrics.classification_report(y_test, y_pred1)
print(report1)

eclf2 = VotingClassifier(estimators=[('logist', clf1), ('svc', clf2)], voting='soft')
eclf2 = eclf2.fit(X_train, np.ravel(y_train))
y_pred2 = eclf2.predict(X_test)
report2 = sklearn.metrics.classification_report(y_test, y_pred2)
print(report2)

             precision    recall  f1-score   support

         no       0.91      0.99      0.95      1223
        yes       0.61      0.12      0.21       137

avg / total       0.88      0.90      0.87      1360

             precision    recall  f1-score   support

         no       0.92      0.99      0.95      1223
        yes       0.62      0.18      0.28       137

avg / total       0.89      0.91      0.88      1360



The precision for the 'yes' class is higher for the ensemble classifier (0.61 with hard voting, 0.62 with soft voting) than for logistic regression (0.59) or the SVC (0.49) alone. A smaller share of the ensemble's 'yes' predictions are false positives, which would mean fewer campaign phone calls to people who are unlikely to enroll in the term deposit. The trade-off is lower recall for 'yes' (0.12 and 0.18, compared with 0.23 for logistic regression alone).
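
The difference between the two voting modes can be illustrated with made-up probabilities from three hypothetical classifiers for a single customer. Hard voting counts predicted labels, while soft voting averages the predicted probabilities and then picks the more likely class. This is a simplified sketch of the idea, not scikit-learn's implementation.

```python
import numpy as np

# Made-up P('yes') from three hypothetical classifiers for one customer.
p_yes = np.array([0.45, 0.40, 0.95])

# Hard voting: each classifier votes for its own label, the majority wins.
votes = p_yes >= 0.5
hard = "yes" if votes.sum() > len(votes) / 2 else "no"

# Soft voting: average the probabilities first, then threshold at 0.5.
soft = "yes" if p_yes.mean() >= 0.5 else "no"

print(hard, soft)
```

Here one very confident classifier outweighs two mildly negative ones under soft voting but is outvoted under hard voting. With only two estimators, as in the cell above, hard voting can tie; scikit-learn documents that ties go to the class that sorts first, which here is 'no'. That is consistent with hard voting showing the lowest 'yes' recall.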

### Adding more cooks hoping for a better broth 

Let us add more classifiers to the vote, since ensembles look promising.

#### Ensemble of SVC, Logistic Regression, Gaussian Naive Bayes, Random Forest 

In [11]:
clf1 = LogisticRegression()
clf2 = SVC(probability=True)
clf3 = RandomForestClassifier(random_state=1)
clf4 = GaussianNB()

eclf = VotingClassifier(estimators=[('logist', clf1), ('svc', clf2), ('rf', clf3), ('nb', clf4)], voting='soft')
eclf = eclf.fit(X_train, np.ravel(y_train))
y_pred = eclf.predict(X_test)
report = sklearn.metrics.classification_report(y_test, y_pred)
print(report)

             precision    recall  f1-score   support

         no       0.93      0.96      0.95      1223
        yes       0.51      0.35      0.41       137

avg / total       0.89      0.90      0.89      1360



With this ensemble classifier we get the best recall so far for the 'yes' class (0.35), which means we detect more of the people who will enroll in the term deposit, at the cost of lower precision (0.51). The choice of classifier depends on what the bank wants. If the bank wants to find more people who will enroll and does not mind a few extra calls, this last ensemble can be used. But if making more calls is expensive, the previous soft-voting ensemble of logistic regression and SVC, which has higher precision (0.62), can be used.
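
Another way to move along the same precision/recall trade-off, not explored in this notebook, is to keep one probabilistic classifier and change the decision threshold applied to `predict_proba`. The sketch below uses synthetic imbalanced data (`X_demo`, `y_demo` are made up) and scores on the training data for brevity, so its numbers say nothing about the bank data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data (about 10% positives), standing in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=1
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
p_pos = model.predict_proba(X_demo)[:, 1]  # probability of the positive class

# A lower threshold flags more customers: recall cannot go down,
# while precision usually does.
for threshold in (0.5, 0.3):
    flagged = (p_pos >= threshold).astype(int)
    print(threshold,
          precision_score(y_demo, flagged, zero_division=0),
          recall_score(y_demo, flagged))
```

A bank that treats a missed subscriber as costlier than an extra call would pick a lower threshold; one that treats calls as expensive would pick a higher one.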