In Chapter 4, we used logistic regression to predict the probability of
`default` using `income` and `balance` on the `Default` data set. We will
now estimate the test error of this logistic regression model using the
validation set approach. Do not forget to set a random seed before
beginning your analysis.

### Preprocessing

In [0]:
# import relevant statistical packages
import numpy as np
import pandas as pd

In [0]:
# import relevant data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# load and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Default.csv"
Default = spark.read.option("header", "true").csv(url).toPandas()
Default.set_index('_c0', inplace=True)

float_cols = ["balance", "income"]
str_cols = ["default", "student"]
Default[float_cols] = Default[float_cols].astype(float)
Default[str_cols] = Default[str_cols].astype(str)

In [0]:
Default.head()

In [0]:
Default.info()

In [0]:
dfX = Default[['student', 'balance','income']]
dfX = pd.get_dummies(data = dfX, drop_first=True)
dfy = Default['default']

In [0]:
dfX.head()

In [0]:
dfy.head()

**a. Fit a logistic regression model that uses `income` and `balance` to
predict `default`.**

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
X = dfX[['income', 'balance']]
y = dfy

In [0]:
glmfit = LogisticRegression(solver = 'liblinear').fit(X, y)

In [0]:
glmfit.coef_

**b. Using the validation set approach, estimate the test error of this
model. In order to do this, you must perform the following steps:**
- i. Split the sample set into a training set and a validation set.
- ii. Fit a multiple logistic regression model using only the training observations.
- iii. Obtain a prediction of `default` status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
- iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
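
Steps iii and iv can be sketched without any library, as a minimal illustration. The probabilities and labels below are hypothetical values, not output from the `Default` data, and `classify` and `validation_error` are illustrative helper names. For a binary logistic regression in scikit-learn, `predict_proba` returns the posterior probabilities and `predict` should give the same labels as thresholding them at 0.5.

```python
def classify(probabilities, threshold=0.5):
    """Assign 'Yes' when the posterior probability of default exceeds the threshold."""
    return ["Yes" if p > threshold else "No" for p in probabilities]


def validation_error(y_true, y_pred):
    """Fraction of validation observations that are misclassified."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Hypothetical posterior probabilities and true labels for five validation observations.
posterior = [0.02, 0.81, 0.40, 0.55, 0.10]
y_val = ["No", "Yes", "Yes", "No", "No"]

y_hat = classify(posterior)
error = validation_error(y_val, y_hat)
print(y_hat, error)
```

The denominator is the number of validation observations, since only those observations are being predicted.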

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X = dfX[['income', 'balance']]
y = dfy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [0]:
print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)

In [0]:
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [0]:
glmpred = glmfit.predict(X_test)

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
conf_mat = confusion_matrix(y_test, glmpred)
conf_mat

In [0]:
# validation error: misclassified validation observations / validation set size
round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4)
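
As a small sketch of why the validation set size is the right denominator: the off-diagonal entries of a confusion matrix count the misclassified observations, and all four entries together sum to the number of validation observations. The counts below are hypothetical, and `error_from_confusion` is an illustrative helper, not a scikit-learn function. It assumes the layout that `confusion_matrix` uses, with rows for true labels and columns for predicted labels.

```python
def error_from_confusion(conf_mat):
    """Misclassification rate from a 2x2 confusion matrix (rows: true, columns: predicted)."""
    correct = conf_mat[0][0] + conf_mat[1][1]
    total = sum(sum(row) for row in conf_mat)  # equals the validation set size
    return (total - correct) / total


# Hypothetical counts for a validation set of 3000 observations.
example = [[2890, 10],
           [80, 20]]
rate = error_from_confusion(example)
print(round(rate, 4))
```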

**c. Repeat the process in (b) three times, using three different splits
of the observations into a training set and a validation set. Comment on the results obtained.**

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)
glmpred = glmfit.predict(X_test)
conf_mat = confusion_matrix(y_test, glmpred)
# validation error: misclassified validation observations / validation set size
round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)
glmpred = glmfit.predict(X_test)
conf_mat = confusion_matrix(y_test, glmpred)
# validation error: misclassified validation observations / validation set size
round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4)

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=42)
print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)
glmpred = glmfit.predict(X_test)
conf_mat = confusion_matrix(y_test, glmpred)
# validation error: misclassified validation observations / validation set size
round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4)
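
Another way to obtain different splits is to keep the split proportion fixed and change only the random seed. The sketch below illustrates that idea with the standard library; `split_indices` is a hypothetical helper, not a scikit-learn function, and it rounds the validation size rather than reproducing scikit-learn's exact rule. With `train_test_split`, the analogous change would be to vary `random_state` while holding `test_size` constant.

```python
import random


def split_indices(n, test_fraction, seed):
    """Shuffle the indices 0..n-1 with a seeded generator and split off a validation part."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = round(n * test_fraction)
    # first n_test shuffled indices form the validation set, the rest the training set
    return indices[n_test:], indices[:n_test]


# Three splits of the same size that differ only in the seed.
for seed in (1, 2, 3):
    train_idx, val_idx = split_indices(n=10, test_fraction=0.3, seed=seed)
    print(seed, sorted(val_idx))
```

Because the seed is the only thing that changes, any difference in the resulting error estimates reflects which observations landed in each set, not the size of the sets.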

Checking for multiple splits

In [0]:
sample = np.linspace(start = 0.05, stop = 0.95, num = 20)

In [0]:
sample

In [0]:
X = dfX[['income', 'balance']]
y = dfy
errors = []
for test_size in sample:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)
    glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)
    glmpred = glmfit.predict(X_test)
    conf_mat = confusion_matrix(y_test, glmpred)
    # validation error: misclassified validation observations / validation set size
    errors.append(round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4))
# collect the per-split errors in a single-column DataFrame (DataFrame.append was removed in pandas 2.0)
confpd = pd.DataFrame(errors)

In [0]:
confpd.reset_index(drop=True, inplace=True)

In [0]:
confpd.columns = ['Error']

In [0]:
confpd.head()

In [0]:
confpd.mean()

In [0]:
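plt.xkcd()
plt.figure(figsize = (25, 10))
# plot the error against the validation set fraction rather than the row index
plt.plot(sample, confpd['Error'], marker = 'o', markersize = 10)
plt.title("validation set fraction vs error rate")
plt.ylabel("error rate")
plt.xlabel("validation set fraction")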

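The estimate changes with the split proportion, which illustrates the main drawback of the validation set approach: the result depends on which observations fall into the training and validation sets. The misclassification count has to be divided by the number of validation observations (`y_test.shape[0]`) to give an error rate. Dividing by the training-set size instead makes the value grow mechanically as the validation share increases, which is how a figure as large as ~0.62 can arise.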
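
To comment on how much the estimate varies between splits, it can help to report a spread alongside the mean. The following is a minimal sketch using the standard library; the error values are hypothetical placeholders, not results from the `Default` data.

```python
from statistics import mean, stdev

# Hypothetical validation error rates from repeated splits (illustrative values only).
errors = [0.026, 0.031, 0.024, 0.029]

avg = mean(errors)      # central estimate of the test error
spread = stdev(errors)  # sample standard deviation across splits
print(round(avg, 4), round(spread, 4))
```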

**d. Now consider a logistic regression model that predicts the probability of `default` using `income`, `balance`, and a dummy variable
for `student`. Estimate the test error for this model using the validation set approach. Comment on whether or not including a
dummy variable for `student` leads to a reduction in the test error
rate.**
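
For reference, a dummy variable for a two-level categorical predictor is a 0/1 indicator. The sketch below shows the encoding by hand; `encode_student` is an illustrative helper and the input values are hypothetical. In the preprocessing above, `pd.get_dummies(..., drop_first=True)` plays this role, and for a `student` column holding `"Yes"`/`"No"` it would be expected to yield a single `student_Yes` column.

```python
def encode_student(values):
    """Map 'Yes'/'No' strings to a 0/1 dummy variable (1 means student)."""
    return [1 if v == "Yes" else 0 for v in values]


# Hypothetical student flags for three individuals.
student = ["No", "Yes", "No"]
print(encode_student(student))
```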

In [0]:
X = dfX # no need to change since dfX already incorporates the dummy variable transformation for 'student'
y = dfy

Using the validation set approach

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [0]:
glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [0]:
glmpred = glmfit.predict(X_test)

In [0]:
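# store the confusion matrix for this model so the error below does not reuse an earlier one
conf_mat = confusion_matrix(y_test, glmpred)
conf_mat

In [0]:
# validation error: misclassified validation observations / validation set size
round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4)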

Checking for multiple splits

In [0]:
sample = np.linspace(start = 0.05, stop = 0.95, num = 20)
sample

In [0]:
X = dfX
y = dfy
errors = []
for test_size in sample:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    print("X_train:", X_train.shape, "y_train:", y_train.shape, "X_test:", X_test.shape, "y_test:", y_test.shape)
    glmfit = LogisticRegression(solver='liblinear').fit(X_train, y_train)
    glmpred = glmfit.predict(X_test)
    conf_mat = confusion_matrix(y_test, glmpred)
    # validation error: misclassified validation observations / validation set size
    errors.append(round((conf_mat[0][1] + conf_mat[1][0]) / y_test.shape[0], 4))
# collect the per-split errors in a single-column DataFrame (DataFrame.append was removed in pandas 2.0)
confpd = pd.DataFrame(errors)

In [0]:
confpd.reset_index(drop=True, inplace=True)
confpd.columns = ['Error']
confpd.head()

In [0]:
confpd.mean()

In [0]:
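plt.xkcd()
plt.figure(figsize = (25, 10))
# plot the error against the validation set fraction rather than the row index
plt.plot(sample, confpd['Error'], marker = 'o', markersize = 10)
plt.title("validation set fraction vs error rate")
plt.ylabel("error rate")
plt.xlabel("validation set fraction")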

The plot closely resembles the one for the logistic regression without the dummy variable, so including the dummy variable
for `student` does not appear to reduce the test error rate.