# A Solution to Bank Telemarketing Predictions

## Introduction

This notebook is a response to
<a href="https://www.kaggle.com/c/predicting-bank-telemarketing/overview">this</a> Kaggle competition.

### Goal

The competition asks us to identify which clients to target in a telemarketing campaign that sells long-term deposits. The data, which was collected from 2008 to 2013, contains demographic and personal information about each client. Our goal is to predict whether or not a client will buy a long-term deposit given this information.

### Data

#### Necessary Import Statements

In [1]:
### Necessary Imports
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

- The data was given in the form of a training set, a testing set, and a sample submission (as an example).
    - The training data has the column "duration," which indicates the duration of a call with the client. As we will not know the duration of a call **before** calling a client, we cannot use it as a predictor. Because of this, we remove it from the features before fitting the model.
    - Below, we can see that the training data has a column labeled 'y' with the result. 1 indicates a success (the client purchased the deposit), and 0 indicates a failure (the client did not purchase the deposit).
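
One detail matters when removing `duration`: `DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up two-row frame (the values are illustrative, not taken from the competition files):

```python
import pandas as pd

# Toy frame standing in for the training data (hypothetical values)
toy = pd.DataFrame({"age": [40, 31], "duration": [94, 116], "y": [0, 0]})

# drop() returns a new DataFrame; `toy` itself still has the column
dropped = toy.drop(columns="duration")
print("duration" in toy.columns)      # True
print("duration" in dropped.columns)  # False

# Assigning the result back is what actually removes the column
toy = toy.drop(columns="duration")
print(list(toy.columns))              # ['age', 'y']
```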
    
#### Reading in the Data

In [2]:
samp = pd.read_csv('samp_submission.csv')
train = pd.read_csv('bank-train.csv')
test = pd.read_csv('bank-test.csv')
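train.drop(columns = 'duration')  # displays a copy without 'duration'; train itself is unchanged (the column is dropped later when building X)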


Unnamed: 0,id,age,job,marital,education,default,housing,loan,contact,month,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,12556,40,blue-collar,married,basic.9y,unknown,yes,no,telephone,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.960,5228.1,0
1,35451,31,admin.,married,university.degree,no,no,no,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.244,5099.1,0
2,30592,59,retired,married,basic.4y,no,no,no,cellular,may,...,6,999,1,failure,-1.8,92.893,-46.2,1.354,5099.1,0
3,17914,43,housemaid,divorced,basic.9y,no,yes,no,cellular,jul,...,5,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1,0
4,3315,39,admin.,single,high.school,unknown,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32945,6265,58,retired,married,professional.course,unknown,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
32946,11284,37,management,married,university.degree,no,no,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.961,5228.1,0
32947,38158,35,admin.,married,high.school,no,yes,no,cellular,oct,...,1,4,1,success,-3.4,92.431,-26.9,0.754,5017.5,1
32948,860,40,management,married,university.degree,no,yes,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.856,5191.0,0


In [3]:
train.head()

Unnamed: 0,id,age,job,marital,education,default,housing,loan,contact,month,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,12556,40,blue-collar,married,basic.9y,unknown,yes,no,telephone,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.96,5228.1,0
1,35451,31,admin.,married,university.degree,no,no,no,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.244,5099.1,0
2,30592,59,retired,married,basic.4y,no,no,no,cellular,may,...,6,999,1,failure,-1.8,92.893,-46.2,1.354,5099.1,0
3,17914,43,housemaid,divorced,basic.9y,no,yes,no,cellular,jul,...,5,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1,0
4,3315,39,admin.,single,high.school,unknown,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,0


In [4]:
test.head()

Unnamed: 0,id,age,job,marital,education,default,housing,loan,contact,month,...,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,32884,57,technician,married,high.school,no,no,yes,cellular,may,...,371,1,999,1,failure,-1.8,92.893,-46.2,1.299,5099.1
1,3169,55,unknown,married,unknown,unknown,yes,no,telephone,may,...,285,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0
2,32206,33,blue-collar,married,basic.9y,no,no,no,cellular,may,...,52,1,999,1,failure,-1.8,92.893,-46.2,1.313,5099.1
3,9403,36,admin.,married,high.school,no,no,no,telephone,jun,...,355,4,999,0,nonexistent,1.4,94.465,-41.8,4.967,5228.1
4,14020,27,housemaid,married,high.school,no,yes,no,cellular,jul,...,189,2,999,0,nonexistent,1.4,93.918,-42.7,4.963,5228.1


#### Probabilistic Distributions

From the head of the training file, we can see that none of the first 5 clients purchased deposits. Here, I will look at the percentage of clients who declined the offer, and the percentage that accepted.
- **Failure (0) :** 88.76%
- **Success (1) :** 11.24%

Clearly, not many people liked the offer. Most clients say no, so a model that always predicts "no" would already be right about 89% of the time.

In [5]:
### This gives us the probability of each occurrence
train['y'].value_counts(normalize = True)

0    0.887557
1    0.112443
Name: y, dtype: float64

## Random Guess

If you were to guess randomly using these distributions, this is how you would do it:
- This attempt yielded me a "success score" of around 0.8. This seems pretty good, but it's only because it is very easy to guess a failure. If the failure rate is 88% and I guess "failure" 88% of the time, then about 0.88 × 0.88 = 77.44% of all clients are failures that I predict correctly simply by guessing. Correctly guessed successes add roughly another 1%, for an expected accuracy of about 0.8.
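
The arithmetic behind that score can be checked directly. The sketch below uses the class proportions printed earlier (0.887557 and 0.112443) and assumes the test set has the same class balance as the training set; the simulation size and seed are arbitrary choices.

```python
import numpy as np

# Class proportions from the training data (failure, success)
p_fail, p_success = 0.887557, 0.112443

# A guess is correct when the guess and the truth agree:
# P(correct) = P(fail) * P(guess fail) + P(success) * P(guess success)
expected_accuracy = p_fail * p_fail + p_success * p_success
print(round(expected_accuracy, 4))  # 0.8004

# Simulated check: draw "truth" and "guess" independently from the same distribution
rng = np.random.default_rng(0)
n = 200_000
truth = rng.choice([0, 1], size=n, p=[p_fail, p_success])
guess = rng.choice([0, 1], size=n, p=[p_fail, p_success])
simulated_accuracy = (truth == guess).mean()
print(abs(simulated_accuracy - expected_accuracy) < 0.01)
```

Under the same assumption, always predicting failure would score about 0.89, which is a more demanding baseline for the models that follow.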

Here is my code for determining the random predictions:

In [6]:
### In the sample data (which only has the client id), we randomly assign success values based on the probabilities shown above
class_probs = train['y'].value_counts(normalize = True).sort_index()
samp['Predicted'] = np.random.choice(class_probs.index, size = samp.shape[0], p = class_probs.values)

In [None]:
#samp.to_csv('first_test.csv', index = False)

## Decision Tree

### Removing NULL Rows
- Rows with NULL values can break our decision tree. Here, we will remove any rows that have NULL values

In [7]:
print(train.columns)
train = train.dropna()  # dropna() returns a new DataFrame, so assign the result back
train

Index(['id', 'age', 'job', 'marital', 'education', 'default', 'housing',
       'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign',
       'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')


Unnamed: 0,id,age,job,marital,education,default,housing,loan,contact,month,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,12556,40,blue-collar,married,basic.9y,unknown,yes,no,telephone,jul,...,2,999,0,nonexistent,1.4,93.918,-42.7,4.960,5228.1,0
1,35451,31,admin.,married,university.degree,no,no,no,cellular,may,...,4,999,0,nonexistent,-1.8,92.893,-46.2,1.244,5099.1,0
2,30592,59,retired,married,basic.4y,no,no,no,cellular,may,...,6,999,1,failure,-1.8,92.893,-46.2,1.354,5099.1,0
3,17914,43,housemaid,divorced,basic.9y,no,yes,no,cellular,jul,...,5,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1,0
4,3315,39,admin.,single,high.school,unknown,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.860,5191.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32945,6265,58,retired,married,professional.course,unknown,no,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
32946,11284,37,management,married,university.degree,no,no,no,telephone,jun,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.961,5228.1,0
32947,38158,35,admin.,married,high.school,no,yes,no,cellular,oct,...,1,4,1,success,-3.4,92.431,-26.9,0.754,5017.5,1
32948,860,40,management,married,university.degree,no,yes,no,telephone,may,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.856,5191.0,0


### Creating Dummy Variables
- Some of our columns contain categorical variables, like job status and marital status. Unfortunately, scikit-learn's decision tree requires numeric input, so it can't handle these string categories directly
- By creating dummy variables, we create a new column for each category in the original column
    - For example, the **marital** column would be broken up into 3 new columns: Single, Married, and Divorced. A '1' in the Single column represents that the person is single, a '0' in the Married column represents that the person is not married, etc.
- I added an extra column to the testing data. This is because no client in the testing data had the value 'yes' in the original default column, so `get_dummies` did not create a 'default_yes' column there. Because of this, I make default_yes a column of zeros
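
To make the dummy-variable step concrete, here is a small sketch on invented data. It also shows one way to handle a category that is missing from the test set: `DataFrame.reindex` with `fill_value=0` adds the absent column and puts all columns in the training order. The frames and values are hypothetical; the notebook itself adds `default_yes` by hand.

```python
import pandas as pd

# Hypothetical miniature train/test frames
toy_train = pd.DataFrame({"age": [40, 31, 59], "default": ["no", "yes", "unknown"]})
toy_test = pd.DataFrame({"age": [57, 55], "default": ["no", "unknown"]})  # no "yes" rows

# One indicator column per category
train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)
print(list(train_dummies.columns))  # ['age', 'default_no', 'default_unknown', 'default_yes']
print(list(test_dummies.columns))   # ['age', 'default_no', 'default_unknown']

# Add the missing column as zeros and match the training column order
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns) == list(train_dummies.columns))  # True
```

Matching the column order matters as much as matching the column names, because the fitted model expects the features in the same order it saw during training.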

In [8]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)
train.head()

Unnamed: 0,id,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,12556,40,94,2,999,0,1.4,93.918,-42.7,4.96,...,0,0,0,1,0,0,0,0,1,0
1,35451,31,116,4,999,0,-1.8,92.893,-46.2,1.244,...,0,0,0,1,0,0,0,0,1,0
2,30592,59,13,6,999,1,-1.8,92.893,-46.2,1.354,...,0,0,0,1,0,0,0,1,0,0
3,17914,43,94,5,999,0,1.4,93.918,-42.7,4.961,...,0,0,0,0,0,1,0,0,1,0
4,3315,39,344,2,999,0,1.1,93.994,-36.4,4.86,...,0,0,0,0,1,0,0,0,1,0


In [9]:
## Features: everything except the target, the client id, and the call duration
X = train.drop(columns = ['y', 'id', 'duration'])
Y = train['y']
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2)

## Apply the same column removal to the competition test set
test = test.drop(columns = ['id', 'duration'])
test['default_yes'] = 0

### Creating the tree
- I created the tree with a maximum depth of 5. This is because with more than 20 features (columns), our tree would otherwise grow very large. Having a large tree not only slows down the algorithm and makes the tree hard to interpret, but it can also cause overfitting.

In [10]:
## Fitting the tree to our training data
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train,Y_train)

### Testing the Tree
- Below we can see how well our model ran on the training and testing data we partitioned
- About 90% isn't bad!

In [11]:
## Running the tree on our training and testing data
print("Training accuracy:", tree.score(X_train, Y_train))
print("Testing accuracy:", tree.score(X_test, Y_test))

Training accuracy: 0.9048937784522003
Testing accuracy: 0.9015174506828528


### Most "Important" Features
- We will sort the features based off of their **gain**. In scikit-learn, `feature_importances_` reports each feature's normalized total reduction in impurity across the splits that use it
- Features with higher gain more heavily impact the model
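
As a rough illustration of what that number measures, the sketch below computes the impurity reduction for a single split by hand, using Gini impurity (the default criterion of `DecisionTreeClassifier`). The labels and the split are made up; scikit-learn additionally weights each node by its share of samples, sums over all splits on a feature, and normalizes so the importances add up to 1.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

# Hypothetical node with 10 clients, 2 of whom bought the deposit
parent = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical split that isolates most of the buyers on the right
left, right = [0, 0, 0, 0, 0, 0, 0], [0, 1, 1]

# Impurity after the split is the size-weighted average of the children
n = len(parent)
weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
gain = gini(parent) - weighted_children

print(round(gini(parent), 3))  # 0.32
print(round(gain, 3))          # 0.187
```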

In [12]:
pd.DataFrame({'Gain': tree.feature_importances_}, index = X_train.columns).sort_values('Gain', ascending = False)

Unnamed: 0,Gain
nr.employed,0.650221
pdays,0.121273
cons.conf.idx,0.062923
euribor3m,0.038924
month_oct,0.025556
...,...
default_unknown,0.000000
default_yes,0.000000
housing_no,0.000000
housing_unknown,0.000000


## Decision Tree with Bagging Classifier
- Now that we've created our decision tree, we can run a bagging classifier on the tree
- The classifier trains n_estimators copies of the model. Each copy is fit on a sample of the original data points, drawn with replacement. The predictions of all copies are then combined (by averaging predicted probabilities or by voting)
- For more information about bagging classifiers, look here: 
- In the run shown below, bagging moved our testing accuracy from **90.15%** to **90.18%**, a very small change
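
The resampling-and-combining idea can be sketched without scikit-learn. The example below is a deliberately simplified stand-in: the "model" is a hypothetical one-feature threshold rule rather than a decision tree, the data is synthetic, and predictions are combined by majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, label is 1 when the feature is large (plus noise)
x = rng.normal(size=500)
y = (x + rng.normal(scale=0.5, size=500) > 1.0).astype(int)

def fit_threshold(x_sample, y_sample):
    """Toy 'model': pick the cut-off with the best accuracy on the sample."""
    candidates = np.quantile(x_sample, np.linspace(0.05, 0.95, 19))
    accuracies = [((x_sample > c).astype(int) == y_sample).mean() for c in candidates]
    return candidates[int(np.argmax(accuracies))]

n_estimators = 25
thresholds = []
for _ in range(n_estimators):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(x), size=len(x))
    thresholds.append(fit_threshold(x[idx], y[idx]))

# Each fitted rule votes; the majority decides the final prediction
votes = np.stack([(x > t).astype(int) for t in thresholds])  # shape (n_estimators, n_samples)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Bagged accuracy:", (bagged_pred == y).mean())
```

An odd number of estimators is used here so that a majority vote can never end in a tie.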

In [13]:
## Running bagging classifier on our original decision tree
## Note: scikit-learn 1.2 renamed base_estimator to estimator; use estimator = tree on newer versions
bag_model = BaggingClassifier(base_estimator = tree, n_estimators = 100, bootstrap = True)
bag_model = bag_model.fit(X_train,Y_train)
y_pred = bag_model.predict(X_test)
print("Training accuracy: ", bag_model.score(X_train,Y_train))
print("Testing accuracy: ", bag_model.score(X_test,Y_test))


Training accuracy:  0.9053869499241275
Testing accuracy:  0.9018209408194233


### Most "Important" Features
- We will sort the features based off of their **gain**, or the weight given to each feature
- Features with higher gain more heavily impact the model
- We can see that these features are similar to, but not exactly the same as, the original decision tree

In [14]:

## Average the importances over every tree in the bagged ensemble
feature_importances = np.mean([
    est.feature_importances_ for est in bag_model.estimators_
], axis=0)

## Use the averaged values here (tree.feature_importances_ would only repeat the single tree's table)
pd.DataFrame({'Gain': feature_importances}, index = X_train.columns).sort_values('Gain', ascending = False)


## Exporting Data
- To export the data, we're running the bagging model on our ORIGINAL testing set. Then, we're saving the results to a csv file

In [15]:
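## Align test columns with the training features (scikit-learn requires the same names in the same order as in fit)
predictions = pd.DataFrame(bag_model.predict(test.reindex(columns = X_train.columns, fill_value = 0)))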



In [16]:
samp['Predicted'] = predictions
samp.to_csv('second_test.csv', index=False)