<blockquote>
    <h1>Exercise 5.5</h1>
    <p>In Chapter 4, we used logistic regression to predict the probability of $\mathrm{default}$ using $\mathrm{income}$ and $\mathrm{balance}$ on the <code>Default</code> data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.</p>
    <ol>
        <li>Fit a logistic regression model that uses $\mathrm{income}$ and $\mathrm{balance}$ to predict $\mathrm{default}$.</li>
        <li>Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
            <ol>
                <li>Split the sample set into a training set and a validation set.</li>
                <li>Fit a multiple logistic regression model using only the training observations.</li>
                <li>Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the $\mathrm{default}$ category if the posterior probability is greater than $0.5$.</li>
                <li>Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.</li>
            </ol>
        </li>
        <li>Repeat the process in 2 three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.</li>
        <li>Now consider a logistic regression model that predicts the probability of $\mathrm{default}$ using $\mathrm{income}$, $\mathrm{balance}$, and a dummy variable for $\mathrm{student}$. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for $\mathrm{student}$ leads to a reduction in the test error rate.</li>
    </ol>
</blockquote>

In [1]:
import pandas as pd
import numpy as np

%run ../../customModules/usefulFunctions.ipynb
# https://stackoverflow.com/questions/34398054/ipython-notebook-cell-multiple-outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.model_selection import train_test_split
import statsmodels.api as sm

<h3>Exercise 5.5.1</h3>
<blockquote>
    <i>Fit a logistic regression model that uses $\mathrm{income}$ and $\mathrm{balance}$ to predict $\mathrm{default}$.</i>
</blockquote>

In [2]:
df = pd.read_csv("../../DataSets/Default/Default.csv")
df['default01'] = np.where(df['default'] == 'Yes', 1, 0)
df.insert(0, 'Intercept', 1)
targetColumn = ['default01']
descriptiveColumns = ['Intercept', 'balance', 'income']
df_X = df[descriptiveColumns]
df_Y = df[targetColumn]
model = sm.Logit(df_Y, df_X)
fitted = model.fit()
fitted.summary()

Optimization terminated successfully.
         Current function value: 0.078948
         Iterations 10


0,1,2,3
Dep. Variable:,default01,No. Observations:,10000.0
Model:,Logit,Df Residuals:,9997.0
Method:,MLE,Df Model:,2.0
Date:,"Sun, 12 Jan 2020",Pseudo R-squ.:,0.4594
Time:,20:51:57,Log-Likelihood:,-789.48
converged:,True,LL-Null:,-1460.3
Covariance Type:,nonrobust,LLR p-value:,4.541e-292

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-11.5405,0.435,-26.544,0.000,-12.393,-10.688
balance,0.0056,0.000,24.835,0.000,0.005,0.006
income,2.081e-05,4.99e-06,4.174,0.000,1.1e-05,3.06e-05


<h3>Exercise 5.5.2</h3>
<blockquote>
    <i>Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
            <ol>
                <li>Split the sample set into a training set and a validation set.</li>
                <li>Fit a multiple logistic regression model using only the training observations.</li>
                <li>Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the $\mathrm{default}$ category if the posterior probability is greater than $0.5$.</li>
                <li>Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.</li>
            </ol></i>
</blockquote>

In [3]:
df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()
fitted.summary()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
df_confusion  
df_confusion_pct.round(2)

confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

Optimization terminated successfully.
         Current function value: 0.078493
         Iterations 10


0,1,2,3
Dep. Variable:,default01,No. Observations:,5000.0
Model:,Logit,Df Residuals:,4997.0
Method:,MLE,Df Model:,2.0
Date:,"Sun, 12 Jan 2020",Pseudo R-squ.:,0.4804
Time:,20:51:58,Log-Likelihood:,-392.46
converged:,True,LL-Null:,-755.25
Covariance Type:,nonrobust,LLR p-value:,2.774e-158

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-11.9681,0.640,-18.688,0.000,-13.223,-10.713
balance,0.0060,0.000,17.665,0.000,0.005,0.007
income,1.934e-05,6.99e-06,2.766,0.006,5.63e-06,3.3e-05


Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted,Predicted
Unnamed: 0_level_1,Unnamed: 1_level_1,less than median,greater than median
Observed,less than median,4818,23
Observed,greater than median,106,53


Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted (%),Predicted (%)
Unnamed: 0_level_1,Unnamed: 1_level_1,less than median,greater than median
Observed (%),less than median,99.52,0.48
Observed (%),greater than median,66.67,33.33


'The validation set error is 2.58%.'

<h3>Exercise 5.5.3</h3>
<blockquote>
    <i>Repeat the process in 2 three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.</i>
</blockquote>

In [4]:
df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

'-----------------------'

f_train, df_test = train_test_split(df, test_size=0.5, random_state=43)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

'-----------------------'

f_train, df_test = train_test_split(df, test_size=0.5, random_state=44)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

Optimization terminated successfully.
         Current function value: 0.078493
         Iterations 10


'The validation set error is 2.58%.'

'-----------------------'

Optimization terminated successfully.
         Current function value: 0.078493
         Iterations 10


'The validation set error is 2.86%.'

'-----------------------'

Optimization terminated successfully.
         Current function value: 0.078493
         Iterations 10


'The validation set error is 2.54%.'

<p>The average validation set error of these 3 fitted models is $2.66 \%$.</p>

<h3>Exercise 5.5.4</h3>
<blockquote>
    <i>Now consider a logistic regression model that predicts the probability of $\mathrm{default}$ using $\mathrm{income}$, $\mathrm{balance}$, and a dummy variable for $\mathrm{student}$. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for $\mathrm{student}$ leads to a reduction in the test error rate.</i>
</blockquote>

In [5]:
df['student01'] = np.where(df['student'] == 'Yes', 1, 0)
descriptiveColumns = ['Intercept', 'balance', 'income', 'student01']

df_train, df_test = train_test_split(df, test_size=0.5, random_state=42)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

'-----------------------'

f_train, df_test = train_test_split(df, test_size=0.5, random_state=43)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

'-----------------------'

f_train, df_test = train_test_split(df, test_size=0.5, random_state=44)
df_X_train = df_train[descriptiveColumns]
df_Y_train = df_train[targetColumn]
df_X_test = df_test[descriptiveColumns]
df_Y_test = df_test[targetColumn]

model = sm.Logit(df_Y_train, df_X_train)
fitted = model.fit()

sr_Y_pred = fitted.predict(df_X_test)
df_Y_test_and_pred = pd.DataFrame({
    'Observed': df_Y_test['default01'],
    'Predicted': np.where(sr_Y_pred > 0.5, 1, 0),
})
df_confusion, df_confusion_pct = createConfusionMatrixFromOutOfSampleData(df=df_Y_test_and_pred, binaryMap={0: 'less than median', 1: 'greater than median'})
confusion_matrix = df_confusion.to_numpy()
TN, FP, FN, TP = confusion_matrix[0, 0], confusion_matrix[0, 1], confusion_matrix[1, 0], confusion_matrix[1, 1]
missclass_acc = 100 * ((FP + FN) / (TN + FP + FN + TP))
f'The validation set error is {missclass_acc:.2f}%.'

Optimization terminated successfully.
         Current function value: 0.077900
         Iterations 10


'The validation set error is 2.56%.'

'-----------------------'

Optimization terminated successfully.
         Current function value: 0.077900
         Iterations 10


'The validation set error is 2.88%.'

'-----------------------'

Optimization terminated successfully.
         Current function value: 0.077900
         Iterations 10


'The validation set error is 2.56%.'

<p>The average validation set error is $2.67 \%$, so adding $\mathrm{student}$ as an independent variable does not seem to help reducing the validation set error.</p>