# Chapter 5 - Resampling Methods

**5. In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.**

In [13]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [14]:
# Read data
file = 'data/Default.csv'
default = pd.read_csv(file)
default.head()

|   | Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | No | No | 729.526495 | 44361.625074 |
| 1 | 2 | No | Yes | 817.180407 | 12106.1347 |
| 2 | 3 | No | No | 1073.549164 | 31767.138947 |
| 3 | 4 | No | No | 529.250605 | 35704.493935 |
| 4 | 5 | No | No | 785.655883 | 38463.495879 |


In [15]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  10000 non-null  int64  
 1   default     10000 non-null  object 
 2   student     10000 non-null  object 
 3   balance     10000 non-null  float64
 4   income      10000 non-null  float64
dtypes: float64(2), int64(1), object(2)
memory usage: 390.8+ KB


In [16]:
encoding_dict = {'Yes': 1, 'No': 0}
default['default'] = default['default'].map(encoding_dict)
default['student'] = default['student'].map(encoding_dict)

default.head()

|   | Unnamed: 0 | default | student | balance | income |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 729.526495 | 44361.625074 |
| 1 | 2 | 0 | 1 | 817.180407 | 12106.1347 |
| 2 | 3 | 0 | 0 | 1073.549164 | 31767.138947 |
| 3 | 4 | 0 | 0 | 529.250605 | 35704.493935 |
| 4 | 5 | 0 | 0 | 785.655883 | 38463.495879 |


In [17]:
default.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  10000 non-null  int64  
 1   default     10000 non-null  int64  
 2   student     10000 non-null  int64  
 3   balance     10000 non-null  float64
 4   income      10000 non-null  float64
dtypes: float64(2), int64(3)
memory usage: 390.8 KB


**(a) Fit a logistic regression model that uses income and balance to predict default.**

In [20]:
from sklearn.metrics import confusion_matrix

In [25]:
# Drop rows with missing values in the 'default', 'balance', and 'income' columns
default.dropna(subset=['default', 'balance', 'income'], inplace=True)

# Define features (X) and target variable (y)
X = default[['balance', 'income']]
y = default['default']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and fit logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)

# Print coefficients and intercept
print("Coefficients:", log_reg.coef_)
print("Intercept:", log_reg.intercept_)

# Evaluate the model
train_accuracy = log_reg.score(X_train_scaled, y_train)
test_accuracy = log_reg.score(X_test_scaled, y_test)
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

# Predictions on the test set
y_pred = log_reg.predict(X_test_scaled)

# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)

# Unpack the 2x2 matrix in row-major order: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = conf_matrix.ravel()

# Overall fraction of correct predictions
correct_predictions = (tp + tn) / (tp + tn + fp + fn)
print("\nTrue Negative:", tn)
print("False Positive:", fp)
print("False Negative:", fn)
print("True Positive:", tp)
print("\nOverall Fraction of Correct Predictions:", correct_predictions)

Coefficients: [[2.77251347 0.26996715]]
Intercept: [-6.2137023]
Training Accuracy: 0.974875
Testing Accuracy: 0.9695

Confusion Matrix:
 [[1921   10]
 [  51   18]]

True Negative: 1921
False Positive: 10
False Negative: 51
True Positive: 18

Overall Fraction of Correct Predictions: 0.9695
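
The accuracy above is easier to judge next to the error rate and a trivial baseline. The sketch below recomputes them from the confusion matrix printed above; the names `error_rate`, `null_error`, and `sensitivity` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = actual, columns = predicted
conf_matrix = np.array([[1921, 10],
                        [51, 18]])
tn, fp, fn, tp = conf_matrix.ravel()
total = conf_matrix.sum()

# Test error is the complement of accuracy
error_rate = (fp + fn) / total

# Baseline: error of a classifier that always predicts "no default"
null_error = (tp + fn) / total

# Share of actual defaulters that the model identifies
sensitivity = tp / (tp + fn)

print(f"Test error rate:       {error_rate:.4f}")
print(f"Null classifier error: {null_error:.4f}")
print(f"Sensitivity:           {sensitivity:.4f}")
```

On this split the model misclassifies 61 of 2,000 test observations (3.05%), only slightly better than the 3.45% error of always predicting "no default", and it identifies 18 of the 69 actual defaulters.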


**(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:**

**i. Split the sample set into a training set and a validation set.**

**ii. Fit a multiple logistic regression model using only the training observations.**

**iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.**

**iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.**
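
Steps i to iv can be sketched as a single function. This is a minimal sketch rather than a definitive solution: the helper name `validation_set_error`, the 80/20 split (borrowed from part (a)), and the synthetic stand-in data are all assumptions made so the example runs without `data/Default.csv`. The coefficients used to simulate defaults are hypothetical and are not estimates from the real data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def validation_set_error(X, y, test_size=0.2, seed=0, threshold=0.5):
    """Return the fraction of validation observations that are misclassified."""
    # i. Split the sample into a training set and a validation set
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    # ii. Fit the model using only the training observations
    #     (the scaler is fitted on the training set only, inside the pipeline)
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X_train, y_train)

    # iii. Posterior probability of default (class 1), thresholded at 0.5
    prob_default = model.predict_proba(X_val)[:, 1]
    y_pred = (prob_default > threshold).astype(int)

    # iv. Validation set error = share of misclassified observations
    return float(np.mean(y_pred != np.asarray(y_val)))


# Synthetic stand-in for the Default data (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n = 2000
balance = np.clip(rng.normal(800, 450, n), 0, None)
income = np.clip(rng.normal(33000, 13000, n), 0, None)
logit = -6.0 + 0.004 * balance  # made-up coefficients, not fitted values
y_demo = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X_demo = np.column_stack([balance, income])

err = validation_set_error(X_demo, y_demo, seed=1)
print(f"Validation set error: {err:.4f}")
```

With the real data, the call would presumably be `validation_set_error(default[['balance', 'income']], default['default'], seed=42)`. Passing a different `seed` gives a different split and therefore a different error estimate.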