# Logistic regression

In this project we want to know how effective a bank's marketing campaign was at influencing customers' deposit decisions.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

<i>Interest rate</i> indicates the 3-month interest rate between banks and <i>duration</i> indicates the time since the last contact was made with a given consumer. The <i>previous</i> variable shows whether the last marketing campaign was successful with this customer. The <i>march</i> and <i>may</i> variables are Boolean flags that record when the call was made to the specific customer, and <i>credit</i> shows whether the customer has enough credit to avoid defaulting.

Source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(10,6)})

## Load the data

Load the ‘bank-data.csv’ dataset.

In [8]:
raw_data = pd.read_csv('bank-data.csv')
raw_data

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,0,1.334,0.0,1.0,0.0,0.0,117.0,no
1,1,0.767,0.0,0.0,2.0,1.0,274.0,yes
2,2,4.858,0.0,1.0,0.0,0.0,167.0,no
3,3,4.120,0.0,0.0,0.0,0.0,686.0,yes
4,4,4.856,0.0,1.0,0.0,0.0,157.0,no
...,...,...,...,...,...,...,...,...
513,513,1.334,0.0,1.0,0.0,0.0,204.0,no
514,514,0.861,0.0,0.0,2.0,1.0,806.0,yes
515,515,0.879,0.0,0.0,0.0,0.0,290.0,no
516,516,0.877,0.0,0.0,5.0,1.0,473.0,yes


In [3]:
data = raw_data.copy()
data = data.drop(['Unnamed: 0'], axis = 1) 
data['y'] = data['y'].map({'yes':1, 'no':0})
data

Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,1.334,0.0,1.0,0.0,0.0,117.0,0
1,0.767,0.0,0.0,2.0,1.0,274.0,1
2,4.858,0.0,1.0,0.0,0.0,167.0,0
3,4.120,0.0,0.0,0.0,0.0,686.0,1
4,4.856,0.0,1.0,0.0,0.0,157.0,0
...,...,...,...,...,...,...,...
513,1.334,0.0,1.0,0.0,0.0,204.0,0
514,0.861,0.0,0.0,2.0,1.0,806.0,1
515,0.879,0.0,0.0,0.0,0.0,290.0,0
516,0.877,0.0,0.0,5.0,1.0,473.0,1


In [4]:
data.describe()

Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,2.835776,0.034749,0.266409,0.388031,0.127413,382.177606,0.5
std,1.876903,0.183321,0.442508,0.814527,0.333758,344.29599,0.500483
min,0.635,0.0,0.0,0.0,0.0,9.0,0.0
25%,1.04275,0.0,0.0,0.0,0.0,155.0,0.0
50%,1.466,0.0,0.0,0.0,0.0,266.5,0.5
75%,4.9565,0.0,1.0,0.0,0.0,482.75,1.0
max,4.97,1.0,1.0,5.0,1.0,2653.0,1.0


### Declare the dependent and independent variables

In [5]:
y = data['y'] # dependent variable [target]
x1 = data['duration'] # independent variable [feature]

### Simple Logistic Regression

In [6]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit() # fit the model before requesting the summary
results_log.summary() # summary

In [None]:
plt.scatter(x1,y,color = 'C0') # scatter plot
plt.xlabel('Duration', fontsize = 20)
plt.ylabel('Subscription', fontsize = 20)

plt.show()

## Expanding the model

Add the ‘interest_rate’, ‘march’, ‘credit’ and ‘previous’ estimators to our model and run the regression again. 

### Declare the independent variable(s)

In [13]:
estimators=['interest_rate','credit','march','previous','duration']

X1_all = data[estimators]
y = data['y']

In [14]:
X_all = sm.add_constant(X1_all)
reg_logit = sm.Logit(y,X_all)
results_logit = reg_logit.fit()
results_logit.summary2()

Optimization terminated successfully.
         Current function value: 0.336664
         Iterations 7


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.514
Dependent Variable:,y,AIC:,360.7836
Date:,2021-05-23 18:54,BIC:,386.2834
No. Observations:,518,Log-Likelihood:,-174.39
Df Model:,5,LL-Null:,-359.05
Df Residuals:,512,LLR p-value:,1.2114e-77
Converged:,1.0000,Scale:,1.0
No. Iterations:,7.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
const,-0.0211,0.3113,-0.0677,0.9460,-0.6313,0.5891
interest_rate,-0.8001,0.0895,-8.9434,0.0000,-0.9755,-0.6248
credit,2.3585,1.0875,2.1688,0.0301,0.2271,4.4900
march,-1.8322,0.3297,-5.5563,0.0000,-2.4785,-1.1859
previous,1.5363,0.5010,3.0666,0.0022,0.5544,2.5182
duration,0.0070,0.0007,9.3810,0.0000,0.0055,0.0084
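
The coefficients in the table are on the log-odds scale, which makes them hard to read directly. The sketch below exponentiates them into odds ratios and then pushes the first training row shown earlier through the logistic function. The numbers are copied from the rounded summary output, so the results are approximate; the names `coefs`, `odds_ratios` and `row` are illustrative only.

```python
import math

# Coefficients copied (rounded) from the summary table above, log-odds scale.
coefs = {
    "const": -0.0211,
    "interest_rate": -0.8001,
    "credit": 2.3585,
    "march": -1.8322,
    "previous": 1.5363,
    "duration": 0.0070,
}

# exp(coef) is the multiplicative change in the odds of y = 1 for a
# one-unit increase in that variable, holding the other variables fixed.
odds_ratios = {name: math.exp(b) for name, b in coefs.items() if name != "const"}
for name, ratio in odds_ratios.items():
    print(f"{name:>13}: odds ratio = {ratio:.3f}")

# Worked prediction for the first training row
# (interest_rate=1.334, credit=0, march=1, previous=0, duration=117).
row = {
    "const": 1.0,
    "interest_rate": 1.334,
    "credit": 0.0,
    "march": 1.0,
    "previous": 0.0,
    "duration": 117.0,
}
log_odds = sum(coefs[name] * value for name, value in row.items())
probability = 1.0 / (1.0 + math.exp(-log_odds))  # logistic function
print(f"log-odds = {log_odds:.3f}, P(y=1) = {probability:.3f}")
```

An odds ratio below 1 (interest_rate, march) means the variable lowers the odds of subscribing, and a ratio above 1 (credit, previous, duration) raises them. The predicted probability for this row falls well below 0.5, which agrees with its recorded outcome of "no".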


### Confusion Matrix

In [15]:
def confusion_matrix(data, actual_values, model):
    # Predict the probabilities using the fitted Logit model
    pred_values = model.predict(data)
    # Specify the bin edges: [0, 0.5) maps to class 0 and [0.5, 1] maps to class 1
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: rows are the actual classes, columns the predicted classes
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy as the share of observations on the diagonal
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
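
To see how the binning in this function works, here is a minimal sketch on made-up inputs. The arrays `actual` and `predicted_prob` are hypothetical toy values, not taken from the bank data; they only illustrate how `np.histogram2d` assigns each (actual, predicted) pair to a cell.

```python
import numpy as np

# Hypothetical toy inputs (not from the bank dataset).
actual = np.array([0, 0, 1, 1, 1])
predicted_prob = np.array([0.10, 0.70, 0.40, 0.90, 0.50])

# The same edges are used for both axes: [0, 0.5) -> class 0, [0.5, 1] -> class 1.
bins = np.array([0, 0.5, 1])

# First axis (rows) = actual class, second axis (columns) = predicted class.
cm = np.histogram2d(actual, predicted_prob, bins=bins)[0]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)
print(accuracy)
```

Note that a predicted probability of exactly 0.5 lands in the second bin, so it counts as a prediction of class 1.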

In [16]:
confusion_matrix(X_all,y,results_logit) # confusion matrix for the training data; accuracy ≈ 86.3%

(array([[218.,  41.],
        [ 30., 229.]]),
 0.862934362934363)
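
Accuracy is only one summary of this matrix. As a rough sketch, the per-class rates can be read off the four cells, assuming the layout produced by the function above (rows are actual classes, columns are predicted classes). The variable names are illustrative.

```python
# Training confusion matrix copied from the output above.
# Rows: actual class (0, 1). Columns: predicted class (0, 1).
cm_train = [[218.0, 41.0],
            [30.0, 229.0]]

tn, fp = cm_train[0]  # actual 0: true negatives, false positives
fn, tp = cm_train[1]  # actual 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)    # share of predicted subscribers who did subscribe
recall = tp / (tp + fn)       # share of actual subscribers the model caught
specificity = tn / (tn + fp)  # share of actual non-subscribers the model caught

print(f"accuracy    = {accuracy:.4f}")
print(f"precision   = {precision:.4f}")
print(f"recall      = {recall:.4f}")
print(f"specificity = {specificity:.4f}")
```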

## Test the model

### Load new data 

In [17]:
raw_data2 = pd.read_csv('Bank-data-testing.csv')
data_test = raw_data2.copy()
data_test = data_test.drop(['Unnamed: 0'], axis = 1) # Removes the index column

In [18]:
data_test['y'] = data_test['y'].map({'yes':1, 'no':0})
data_test

Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,1.313,0.0,1.0,0.0,0.0,487.0,0
1,4.961,0.0,0.0,0.0,0.0,132.0,0
2,4.856,0.0,1.0,0.0,0.0,92.0,0
3,4.120,0.0,0.0,0.0,0.0,1468.0,1
4,4.963,0.0,0.0,0.0,0.0,36.0,0
...,...,...,...,...,...,...,...
217,4.963,0.0,0.0,0.0,0.0,458.0,1
218,1.264,0.0,1.0,1.0,0.0,397.0,1
219,1.281,0.0,1.0,0.0,0.0,34.0,0
220,0.739,0.0,0.0,2.0,0.0,233.0,0


### Declare the dependent and the independent variables

In [20]:
y_test = data_test['y']
X1_test = data_test[estimators]
X_test = sm.add_constant(X1_test)

In [21]:
confusion_matrix(X_test, y_test, results_logit) # confusion matrix for the test data; accuracy ≈ 86.04%

(array([[93., 18.],
        [13., 98.]]),
 0.8603603603603603)

In [22]:
confusion_matrix(X_all,y, results_logit) # confusion matrix for the training data, repeated for comparison; accuracy ≈ 86.29%

(array([[218.,  41.],
        [ 30., 229.]]),
 0.862934362934363)

Looking at the test accuracy, we see a number that is a tiny bit lower: 86.04%, compared to 86.29% for the training accuracy.
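
The comparison can be reproduced from the two confusion matrices alone. The sketch below recomputes both accuracies and their gap, and adds the accuracy of always guessing one class as a reference point. The helper name `accuracy_from_cm` is hypothetical.

```python
def accuracy_from_cm(cm):
    """Share of observations on the diagonal of a 2x2 confusion matrix."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

# Matrices copied from the outputs above (rows: actual, columns: predicted).
cm_train = [[218.0, 41.0], [30.0, 229.0]]
cm_test = [[93.0, 18.0], [13.0, 98.0]]

train_acc = accuracy_from_cm(cm_train)
test_acc = accuracy_from_cm(cm_test)
gap = train_acc - test_acc

# Reference point: accuracy of always predicting class 0 on the test set.
baseline_test = sum(cm_test[0]) / sum(sum(row) for row in cm_test)

print(f"train accuracy = {train_acc:.4f}")
print(f"test accuracy  = {test_acc:.4f}")
print(f"gap            = {gap:.4f}")
print(f"test baseline  = {baseline_test:.4f}")
```

The gap is about a quarter of a percentage point, and both sets are evenly split between the two classes, so a constant guess would score 50%.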
