We continue to consider the use of a logistic regression model to
predict the probability of `default` using `income` and `balance` on the
`Default` data set. In particular, we will now compute estimates for the
standard errors of the `income` and `balance` logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the
standard formula for computing the standard errors in the `sm.GLM()`
function. Do not forget to set a random seed before beginning your
analysis.
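
Before turning to the `Default` data, the contrast between the two approaches can be illustrated on a simpler statistic. The sketch below is not part of the exercise: it uses synthetic Gaussian data and a hypothetical helper, `bootstrap_se()`, to compare the bootstrap standard error of a sample mean with the textbook formula s / sqrt(n). The sample size, number of resamples, and seeds are arbitrary assumptions.

```python
import random
import statistics

def bootstrap_se(sample, statistic, n_boot=1000, seed=0):
    """Bootstrap standard error: std. dev. of the statistic over resamples."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # draw n observations with replacement from the original sample
        resampled = rng.choices(sample, k=n)
        estimates.append(statistic(resampled))
    return statistics.stdev(estimates)

# synthetic data (an assumption for illustration, not the Default data)
data_rng = random.Random(1)
sample = [data_rng.gauss(0.0, 1.0) for _ in range(500)]

se_boot = bootstrap_se(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(se_boot, se_formula)
```

The two numbers should be of similar size. The same logic carries over to regression coefficients: refit the model on each resample and take the standard deviation of the resulting estimates.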

### Preprocessing

In [0]:
# import relevant statistical packages
import numpy as np
import pandas as pd

In [0]:
# import relevant data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# load and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Default.csv"
Default = spark.read.option("header", "true").csv(url).toPandas()
Default.set_index('_c0', inplace=True)

float_cols = ["balance", "income"]
str_cols = ["default", "student"]
Default[float_cols] = Default[float_cols].astype(float)
Default[str_cols] = Default[str_cols].astype(str)

In [0]:
Default.head()

In [0]:
Default.info()

In [0]:
dfX = Default[['student', 'balance','income']]
dfX = pd.get_dummies(data = dfX, drop_first=True)
dfy = Default['default']

In [0]:
dfX.head()

In [0]:
dfy.head()

**a. Using the `.summary()` and `sm.GLM()` functions, determine the
estimated standard errors for the coefficients associated with
income and balance in a multiple logistic regression model that
uses both predictors.**

In [0]:
import statsmodels.api as sm

In [0]:
X = dfX[['balance', 'income']]
X = sm.add_constant(X)
# encode default as 0/1 floats (recent pandas versions return booleans by default)
y = pd.get_dummies(dfy, drop_first=True, dtype=float)

In [0]:
glmfit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

In [0]:
glmfit.summary()

In [0]:
# bse holds the standard errors directly (same values as params / tvalues)
estimated_std_err = np.array(glmfit.bse)

In [0]:
estimated_std_err
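
For intuition about where such formula-based standard errors come from: in standard logistic regression theory, the estimated covariance of the coefficients is the inverse of the Fisher information, (XᵀWX)⁻¹, where W is diagonal with entries p(1 − p). The sketch below is an illustration on synthetic data, not the `statsmodels` implementation. The helper `logistic_fit_with_se()`, the simulated predictor, and the "true" coefficients are all assumptions made for the example.

```python
import numpy as np

def logistic_fit_with_se(X, y, n_iter=25):
    """Fit logistic regression by Newton's method; return (beta, std_err)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
        W = p * (1.0 - p)                      # Bernoulli variances
        info = X.T @ (X * W[:, None])          # Fisher information X'WX
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    # recompute the information at the final estimate
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, std_err

# synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
n = 2000
X_sim = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-1.0, 2.0])
p_sim = 1.0 / (1.0 + np.exp(-X_sim @ true_beta))
y_sim = (rng.random(n) < p_sim).astype(float)

beta_hat, se = logistic_fit_with_se(X_sim, y_sim)
print(beta_hat, se)
```

Fitting the same simulated data with `sm.GLM(..., family=sm.families.Binomial())` should give matching values in `bse`, up to numerical tolerance.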

**b. Write a function, `boot_fn()`, that takes as input the `Default` data
set as well as an index of the observations, and that outputs
the coefficient estimates for `income` and `balance` in the multiple
logistic regression model.**

In [0]:
def boot_fn(data, index):
    # design matrix: intercept plus the two predictors
    X = sm.add_constant(data[['balance', 'income']])
    # encode default as 0/1 floats
    y = pd.get_dummies(data['default'], drop_first=True, dtype=float)
    # refit the model on the observations selected by index
    X_train = X.iloc[index]
    y_train = y.iloc[index]
    glmfit = sm.GLM(y_train, X_train, family=sm.families.Binomial()).fit()
    # return the coefficient estimates (intercept, balance, income)
    return np.array(glmfit.params)

In [0]:
# sanity check: using every observation once reproduces the fit from part a
boot_fn(Default, list(range(10000)))

**c. Following the bootstrap example in the lab, use your `boot_fn()`
function to estimate the standard errors of the logistic regression
coefficients for `income` and `balance`.**

In [0]:
from sklearn.utils import resample

In [0]:
np.random.seed(0)  # set a seed so the bootstrap draws are reproducible
idx = list(range(10000))
boot_coefs = []

In [0]:
for i in range(1000):
    # resample row positions with replacement and refit the model
    boot_coefs.append(boot_fn(Default, resample(idx, replace=True)))

In [0]:
boot_coef_df = pd.DataFrame(boot_coefs, columns=['intercept', 'balance', 'income'])

In [0]:
boot_coef_df.head()

In [0]:
boot_coef_df.shape

In [0]:
# bootstrap standard error = standard deviation of the estimates across resamples
boot_coef_df.std()

**d. Comment on the estimated standard errors obtained using the
`sm.GLM()` function and using the bootstrap.**

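As we can see, the bootstrap estimates of the standard errors are close to the formula-based standard errors that `sm.GLM()` reports for the logistic regression coefficients. This agreement suggests that the assumptions behind the standard formula are reasonable for this data set.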
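
One way to make "close" concrete is to look at the relative gap between the two sets of standard errors, coefficient by coefficient. The helper below, `relative_difference()`, is a hypothetical name introduced for this sketch. It expects two arrays in the same coefficient order.

```python
import numpy as np

def relative_difference(formula_se, bootstrap_se):
    """Relative gap (bootstrap - formula) / formula for each coefficient."""
    formula_se = np.asarray(formula_se, dtype=float)
    bootstrap_se = np.asarray(bootstrap_se, dtype=float)
    return (bootstrap_se - formula_se) / formula_se
```

Values near zero indicate that the two approaches agree. A positive value means the bootstrap estimate is larger than the formula-based one.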