# Validation Set and Cross Validation with `default` Data Set

Complete exercise #5 in section 5.4 of *An Introduction to Statistical Learning*, 2nd ed. (pp. 220-221).

For (b), (c), and (d), feel free to use convenience functions from `sklearn`.

Wherever possible, set `random_state=0` so that your results are reproducible.

In [1]:
import pandas as pd

# Load the dataset to inspect its structure
file_path = '/Users/yuanhanlim/Desktop/DS & ML/13_default_cross_validation/default.csv'
default_data = pd.read_csv(file_path)

# Display the first few rows of the dataset for inspection
default_data.head()

| Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|
| 0 | No | No | 729.526495 | 44361.625074 |
| 1 | No | Yes | 817.180407 | 12106.1347 |
| 2 | No | No | 1073.549164 | 31767.138947 |
| 3 | No | No | 529.250605 | 35704.493935 |
| 4 | No | No | 785.655883 | 38463.495879 |


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Encode categorical variables as 0/1 indicators (Yes -> 1, anything else -> 0)
default_data['default'] = (default_data['default'] == 'Yes').astype(int)
default_data['student'] = (default_data['student'] == 'Yes').astype(int)

# Task (a): Fit a logistic regression model using income and balance to predict default
X = default_data[['income', 'balance']]
y = default_data['default']

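# Initialize the logistic regression model
# Note: scikit-learn applies L2 regularization by default (C=1.0),
# whereas the textbook fits an unpenalized logistic regression
model = LogisticRegression()
model.fit(X, y)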

# Display coefficients of the fitted model
model.coef_, model.intercept_


(array([[2.08089741e-05, 5.64710265e-03]]), array([-11.54046792]))
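To read the fitted coefficients, it may help to turn them into a predicted probability by hand. The sketch below plugs the values printed above into the logistic function; the helper name `default_probability` and the two example customers are illustrative assumptions, not part of the exercise.

```python
import math

# Coefficients copied from the fitted model output above
INTERCEPT = -11.54046792
COEF_INCOME = 2.08089741e-05
COEF_BALANCE = 5.64710265e-03


def default_probability(income, balance):
    """Predicted P(default = 1) from the logistic model (hypothetical helper)."""
    log_odds = INTERCEPT + COEF_INCOME * income + COEF_BALANCE * balance
    return 1.0 / (1.0 + math.exp(-log_odds))


# Two made-up customers with the same income but different balances
p_low = default_probability(income=40_000, balance=1_000)
p_high = default_probability(income=40_000, balance=2_000)
print(f"balance 1000: {p_low:.4f}")
print(f"balance 2000: {p_high:.4f}")
```

With the 0.5 threshold used in the later cells, only the second customer would be classified as a default.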

In [3]:
# Passing random_state to train_test_split makes the split reproducible;
# a global NumPy seed is not needed here

# Task (b): Validation set approach to estimate test error

# Split the data into training and validation sets (70% train, 30% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the logistic regression model on the training set
model.fit(X_train, y_train)

# Predict on the validation set
y_pred_proba = model.predict_proba(X_val)[:, 1]  # Get probabilities for the positive class (default = 1)
y_pred = (y_pred_proba > 0.5).astype(int)  # Classify based on probability threshold of 0.5

# Compute the validation set error (misclassification rate)
validation_error = 1 - accuracy_score(y_val, y_pred)
validation_error


0.026666666666666616
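A misclassification rate is easier to judge against a trivial baseline. The sketch below computes the error rate directly and compares it with a classifier that always predicts "no default". The function names and the tiny label lists are made up for illustration and are not taken from the data set.

```python
def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def always_no_error_rate(y_true):
    """Error of a classifier that predicts 0 for everyone (labels coded 0/1)."""
    return sum(y_true) / len(y_true)


# Hypothetical labels: 2 defaults among 10 people, one of them missed
toy_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

model_error = misclassification_rate(toy_true, toy_pred)
baseline_error = always_no_error_rate(toy_true)
print(model_error, baseline_error)
```

On the notebook's data, the same comparison would be `1 - accuracy_score(y_val, y_pred)` against `y_val.mean()`, since `default` is coded as 0/1.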

In [4]:
# Task (c): Repeat the validation set approach with three different splits
errors = []

for seed in [1, 123, 456]:
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = (model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
    errors.append(1 - accuracy_score(y_val, y_pred))

errors


[0.024666666666666615, 0.025666666666666615, 0.02633333333333332]

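The three validation errors (about 2.47%, 2.57%, and 2.63%) are close to the 2.67% obtained from the first split. The estimate shifts by roughly 0.2 percentage points depending on which observations land in the validation set, so any single split gives only an approximate test error.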
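One way to reduce the dependence on a single split is k-fold cross-validation, which averages the error over several folds so that every observation is used for validation exactly once. The following is a minimal sketch on synthetic stand-in data, not the data set used in this notebook. The helper `cv_error`, the simulated columns, the use of `StratifiedKFold`, and the added `StandardScaler` step are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_error(X, y, n_splits=5, seed=0):
    """Mean misclassification rate over stratified k folds (hypothetical helper)."""
    # Scaling is an added step here so the solver converges on raw-scale features
    model = make_pipeline(StandardScaler(), LogisticRegression())
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return 1.0 - accuracy.mean(), accuracy


# Synthetic stand-in for (income, balance) -> default; all values are simulated
rng = np.random.default_rng(0)
n = 2000
income = rng.normal(35_000, 12_000, n)
balance = np.clip(rng.normal(800, 450, n), 0, None)
log_odds = -8.0 + 0.006 * balance
y_sim = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)
X_sim = np.column_stack([income, balance])

error, fold_accuracy = cv_error(X_sim, y_sim)
print(f"5-fold CV error: {error:.4f}")
```

Stratified folds keep the share of the rare positive class roughly equal across folds, which matters when defaults are uncommon.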

In [5]:
# Task (d): Include the student variable and evaluate the test error

# Add 'student' to the features
X_with_student = default_data[['income', 'balance', 'student']]

# Repeat the validation set approach with the new model
errors_with_student = []

for seed in [1, 123, 456]:
    X_train, X_val, y_train, y_val = train_test_split(X_with_student, y, test_size=0.3, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = (model.predict_proba(X_val)[:, 1] > 0.5).astype(int)
    errors_with_student.append(1 - accuracy_score(y_val, y_pred))

errors_with_student


[0.024333333333333318, 0.02733333333333332, 0.027000000000000024]

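With `student` included, the validation errors are about 2.43%, 2.73%, and 2.70%, compared with 2.47%, 2.57%, and 2.63% for the same seeds without it: slightly lower in one split and higher in the other two. These differences are within the split-to-split variation seen earlier, so adding the `student` dummy variable does not appear to reduce the test error.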
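Because both loops pass the same seeds to `train_test_split` on the same rows, the two error lists can be compared split by split. The sketch below does that with the values reported in the outputs above; the names `differences` and `seeds` are illustrative.

```python
from statistics import mean

# Validation errors copied from the outputs above (seeds 1, 123, 456)
seeds = [1, 123, 456]
errors = [0.024666666666666615, 0.025666666666666615, 0.02633333333333332]
errors_with_student = [0.024333333333333318, 0.02733333333333332, 0.027000000000000024]

# Per-split change in error when `student` is added (positive = worse)
differences = [w - b for b, w in zip(errors, errors_with_student)]

for seed, diff in zip(seeds, differences):
    print(f"seed {seed}: {diff:+.4f}")
print(f"mean without student: {mean(errors):.4f}")
print(f"mean with student:    {mean(errors_with_student):.4f}")
```

Three splits are too few for a formal test, but the sign pattern (one decrease, two increases) is consistent with `student` adding little once `balance` and `income` are in the model.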