# Binary Predictors in a Logistic Regression

Using the same code as in the previous exercise, find the odds ratio for 'duration'.

What does it tell you?

## Import the relevant libraries

In [100]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Load the data

Load the 'Bank-data.csv' dataset.

In [101]:
raw_data = pd.read_csv('Bank-data.csv')
raw_data.head()

Unnamed: 0.1,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,0,1.334,0.0,1.0,0.0,0.0,117.0,no
1,1,0.767,0.0,0.0,2.0,1.0,274.0,yes
2,2,4.858,0.0,1.0,0.0,0.0,167.0,no
3,3,4.12,0.0,0.0,0.0,0.0,686.0,yes
4,4,4.856,0.0,1.0,0.0,0.0,157.0,no


In [102]:
data = raw_data.drop('Unnamed: 0', axis=1)
data['y'] = data['y'].map({'yes':1, 'no':0})
data.head()

Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,1.334,0.0,1.0,0.0,0.0,117.0,0
1,0.767,0.0,0.0,2.0,1.0,274.0,1
2,4.858,0.0,1.0,0.0,0.0,167.0,0
3,4.12,0.0,0.0,0.0,0.0,686.0,1
4,4.856,0.0,1.0,0.0,0.0,157.0,0


### Declare the dependent and independent variables

Use 'duration' as the independent variable.

### Simple Logistic Regression

Run the regression.

In [103]:
logit = smf.logit(formula='y ~ duration', data=data[['y', 'duration']])
logit_result = logit.fit()

Optimization terminated successfully.
         Current function value: 0.546118
         Iterations 7


In [104]:
logit_result.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,518.0
Model:,Logit,Df Residuals:,516.0
Method:,MLE,Df Model:,1.0
Date:,"Tue, 16 May 2023",Pseudo R-squ.:,0.2121
Time:,16:35:04,Log-Likelihood:,-282.89
converged:,True,LL-Null:,-359.05
Covariance Type:,nonrobust,LLR p-value:,5.387e-35

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-1.7001,0.192,-8.863,0.000,-2.076,-1.324
duration,0.0051,0.001,9.159,0.000,0.004,0.006
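
To make the summary concrete, the fitted coefficients can be plugged into the logistic function. The sketch below hard-codes the rounded values from the table above (-1.7001 and 0.0051), so its results are approximate. The helper name `predicted_probability` is illustrative and not part of statsmodels.

```python
import math

# Rounded coefficients copied from the summary table above
INTERCEPT = -1.7001
BETA_DURATION = 0.0051

def predicted_probability(duration):
    """Illustrative helper: P(y = 1) for a given duration under the fitted model."""
    log_odds = INTERCEPT + BETA_DURATION * duration
    return 1.0 / (1.0 + math.exp(-log_odds))

# Durations taken from the first rows of the dataset
for d in (117, 274, 686):
    print(d, round(predicted_probability(d), 3))

# Duration at which the log-odds cross zero, i.e. P(y = 1) = 0.5
break_even = -INTERCEPT / BETA_DURATION
print(round(break_even, 1))
```

Durations below the break-even value are classified as 0 and durations above it as 1 when a 0.5 threshold is used.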


### Find the odds ratio of duration

In [105]:
# Odds ratio: exponent of the duration coefficient from the summary above
np.exp(0.0051)

1.005113027136717

In [106]:
cm_df = pd.DataFrame(logit_result.pred_table())
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0:'Actual 0', 1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,204.0,55.0
Actual 1,104.0,155.0


In [107]:
cm = np.array(cm_df)
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum()
accuracy_train

0.693050193050193
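
Accuracy alone hides how the errors are split between the two classes. As a minimal sketch, the same counts can be used to compute sensitivity and specificity as well. These two metrics are an addition here, not part of the original exercise, and the counts are copied by hand from the table above.

```python
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
cm = [[204.0, 55.0],
      [104.0, 155.0]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total      # share of all observations classified correctly
sensitivity = tp / (tp + fn)      # share of actual 1s predicted as 1
specificity = tn / (tn + fp)      # share of actual 0s predicted as 0

print(accuracy, sensitivity, specificity)
```

With these counts the model identifies actual 0s more reliably than actual 1s.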

The odds ratio for duration is very close to 1. Although duration is a statistically significant predictor, a change of 1 day multiplies the odds of y = 1 by only about 1.005, an increase of roughly 0.5%.

Note that we could have inferred this from the coefficient itself: for a coefficient b this close to 0, exp(b) is approximately 1 + b.
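
The arithmetic behind this can be checked directly. The sketch below uses the rounded coefficient from the summary table. The 100-unit change is an arbitrary illustration, and 0.004 and 0.006 are the reported interval bounds for the coefficient.

```python
import math

beta = 0.0051  # rounded duration coefficient from the summary table

# Odds ratio for a one-unit change, compared with the linear approximation 1 + beta
odds_ratio = math.exp(beta)
print(odds_ratio, 1 + beta)

# Odds multiplier for a 100-unit increase in duration (100 is an arbitrary choice)
odds_ratio_100 = math.exp(beta * 100)
print(odds_ratio_100)

# Odds ratios implied by the reported 95% interval bounds for the coefficient
print(math.exp(0.004), math.exp(0.006))
```

A per-unit effect that looks negligible compounds multiplicatively, so larger changes in duration shift the odds noticeably.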

Finally, note that the data is not standardized (scaled), and duration takes values of a relatively large order of magnitude, so a small per-unit effect can still be substantial across the observed range of durations.
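
As a rough illustration of what standardization does, the sketch below computes z-scores by hand for the first five duration values shown in the data preview. Because it uses only five rows, the results will not match values computed on the full dataset. It uses the population standard deviation, which is what `StandardScaler` uses.

```python
import statistics

# First five 'duration' values from the data preview above (illustration only)
durations = [117.0, 274.0, 167.0, 686.0, 157.0]

mean = statistics.fmean(durations)
std = statistics.pstdev(durations)  # population standard deviation (divides by n)

# z-score: distance from the mean, measured in standard deviations
z_scores = [(d - mean) / std for d in durations]
print(z_scores)

# In a model with an intercept, a coefficient of beta per original unit
# corresponds to beta * std per standard deviation of the feature
beta = 0.0051  # rounded coefficient from the unscaled fit
print(beta * std)
```

After scaling, the coefficient describes the effect of a one-standard-deviation change, which is easier to compare across features.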

In [108]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [109]:
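# Note: the name `input` shadows the Python built-in function of the same name
input = data[['duration']]
target = data[['y']]
# StandardScaler ignores y; only X is used to compute the mean and standard deviation
scaler.fit(X=input, y=target)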

In [110]:
# transform() returns a scaled copy; `input` itself is left unchanged
scaler.transform(input)

array([[-7.70946942e-01],
       [-3.14503158e-01],
       [-6.25582679e-01],
       [ 8.83298363e-01],
       [-6.54655532e-01],
       [-7.44781374e-01],
       [-8.66887355e-01],
       [-1.06167547e+00],
       [ 9.35629497e-01],
       [-5.73251545e-01],
       [-2.37746338e-02],
       [-8.46536358e-01],
       [-6.19768109e-01],
       [-3.87185289e-01],
       [-4.86032988e-01],
       [-1.69138896e-01],
       [-6.48840961e-01],
       [-7.62225086e-01],
       [-1.31344188e-01],
       [ 7.49563241e-01],
       [ 8.16430802e-01],
       [ 6.05366383e-02],
       [-2.38913742e-01],
       [-3.69741578e-01],
       [-2.59264739e-01],
       [-9.13403919e-01],
       [-3.37761440e-01],
       [-3.20317729e-01],
       [ 4.60002121e-02],
       [-7.44781374e-01],
       [-1.00934433e+00],
       [-7.44781374e-01],
       [ 8.74576507e-01],
       [-2.97059447e-01],
       [-6.43026391e-01],
       [-5.55807834e-01],
       [ 2.18866944e+00],
       [-8.75609210e-01],
       [ 3.0

In [111]:
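import statsmodels.api as sm
# sm.Logit does not add an intercept automatically, and `input` still holds
# the unscaled durations, so this model has a single coefficient and no constant
logit = sm.Logit(target, input)
logit_result = logit.fit()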

Optimization terminated successfully.
         Current function value: 0.641883
         Iterations 5


In [112]:
logit_result.summary()

0,1,2,3
Dep. Variable:,y,No. Observations:,518.0
Model:,Logit,Df Residuals:,517.0
Method:,MLE,Df Model:,0.0
Date:,"Tue, 16 May 2023",Pseudo R-squ.:,0.07396
Time:,16:35:05,Log-Likelihood:,-332.5
converged:,True,LL-Null:,-359.05
Covariance Type:,nonrobust,LLR p-value:,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
duration,0.0014,0.000,6.535,0.000,0.001,0.002


In [113]:
# Odds ratio for duration in the no-intercept model (rounded coefficient from the summary above)
np.exp(0.0014)

1.0014009804574935

## Accuracy of the Model

In [114]:
logit_result.predict()

array([0.54149696, 0.59619119, 0.55908984, 0.72620877, 0.55558168,
       0.54467263, 0.52982593, 0.50604308, 0.73126834, 0.56538947,
       0.62990678, 0.53230476, 0.55979078, 0.58760435, 0.57584111,
       0.61318459, 0.55628377, 0.5425559 , 0.6175599 , 0.71301265,
       0.71965811, 0.63946758, 0.60505936, 0.58967027, 0.60267831,
       0.52415468, 0.59344954, 0.59550633, 0.6378268 , 0.54467263,
       0.51243967, 0.54467263, 0.72535977, 0.59824351, 0.55698563,
       0.56748476, 0.83395718, 0.52876311, 0.88671242, 0.65021379,
       0.54678774, 0.61453282, 0.56189215, 0.55136459, 0.55487937,
       0.54079079, 0.52025199, 0.56014116, 0.56084174, 0.60845222,
       0.7332199 , 0.52238107, 0.53088847, 0.55452813, 0.54396723,
       0.61453282, 0.55206796, 0.52273584, 0.74093542, 0.62525396,
       0.53725753, 0.54008446, 0.74850434, 0.58932616, 0.61890236,
       0.60539911, 0.55241958, 0.56049148, 0.73349796, 0.536904  ,
       0.51386066, 0.58069601, 0.60879093, 0.58380885, 0.62658

In [115]:
np.array(data['y'])

array([0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,

### Confusion Matrix

In [119]:
# pred_table() is computed on the training data, not on test data
logit_result.pred_table()

array([[  0., 259.],
       [  0., 259.]])
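
To show how such a table comes about, the sketch below rebuilds a confusion matrix by hand from the first five predicted probabilities and actual labels printed above. It assumes a cutoff of 0.5, which is the default threshold of `pred_table()`. The variable names are illustrative.

```python
# First five predicted probabilities and actual labels from the outputs above
probs = [0.54149696, 0.59619119, 0.55908984, 0.72620877, 0.55558168]
actual = [0, 1, 0, 1, 0]

threshold = 0.5
predicted = [1 if p > threshold else 0 for p in probs]

# table[i][j] counts observations with actual class i and predicted class j
table = [[0, 0], [0, 0]]
for a, p in zip(actual, predicted):
    table[a][p] += 1

print(table)
```

Every one of these probabilities exceeds 0.5, so the first column stays empty, mirroring the pattern in the full table.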

In [117]:
cm_df = pd.DataFrame(logit_result.pred_table())
cm_df.columns = ['Predicted 0', 'Predicted 1']
cm_df = cm_df.rename(index={0:'Actual 0', 1:'Actual 1'})
cm_df

Unnamed: 0,Predicted 0,Predicted 1
Actual 0,0.0,259.0
Actual 1,0.0,259.0


In [118]:
cm = np.array(cm_df)
accuracy_train = (cm[0,0]+cm[1,1])/cm.sum()
accuracy_train

0.5
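
An accuracy of exactly 0.5 follows from the structure of the second model. It was fit without an intercept, so the log-odds equal the coefficient times duration, and with a positive coefficient every positive duration yields a probability above 0.5. The sketch below illustrates this with the rounded coefficient 0.0014. It assumes all durations in the data are positive, and the sample durations are arbitrary.

```python
import math

beta = 0.0014  # rounded duration coefficient from the no-intercept fit above

def sigmoid(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# With no intercept, log-odds = beta * duration, which is positive for any duration > 0
sample_durations = (1, 117, 686, 3000)  # arbitrary illustrative values
for d in sample_durations:
    print(d, round(sigmoid(beta * d), 4))
```

Since every observation is predicted as 1 and the two classes each contain 259 observations, exactly half of the predictions are correct.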