In [2]:
# Problem 1

import pandas as pd
import numpy as np
from ISLP import load_data

Default = load_data('Default')

# Report the dataset dimensions
print("Dataset dimensions:", Default.shape)
# Report the column names and their data types
print("\nColumn names and data types:\n", Default.dtypes)
# Report the distribution of the default variable
print("\nDistribution of the default variable:\n", Default['default'].value_counts())

Dataset dimensions: (10000, 4)

Column names and data types:
 default    category
student    category
balance     float64
income      float64
dtype: object

Distribution of the default variable:
 default
No     9667
Yes     333
Name: count, dtype: int64


In [3]:
# Problem 1

import statsmodels.api as sm
import statsmodels.formula.api as smf
# Create the logit model (encode default as 0/1 first)
Default['default_bin'] = (Default['default'] == 'Yes').astype(int)
logit_model = smf.logit('default_bin ~ student + balance + income', data=Default).fit()

# Report the summary of the model
print("\nLogit Model Summary:\n", logit_model.summary())

print("Balance coefficient:", logit_model.params['balance'])

# Report the odds ratios for the coefficients
odds_ratios = np.exp(logit_model.params)  # exponentiated coefficients; keeps the params index
print("\nOdds Ratios:\n", odds_ratios)

Optimization terminated successfully.
         Current function value: 0.078577
         Iterations 10

Logit Model Summary:
                            Logit Regression Results                           
Dep. Variable:            default_bin   No. Observations:                10000
Model:                          Logit   Df Residuals:                     9996
Method:                           MLE   Df Model:                            3
Date:                Fri, 26 Sep 2025   Pseudo R-squ.:                  0.4619
Time:                        11:54:40   Log-Likelihood:                -785.77
converged:                       True   LL-Null:                       -1460.3
Covariance Type:            nonrobust   LLR p-value:                3.257e-292
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept        -10.8690      0.492    -22.079      0.000     -11.834      

### Problem 1

The dataset has 10,000 observations and 4 variables (default, student, balance, income).

The default variable has 9,667 "No" and 333 "Yes" (about 3.3% defaults).

In the logistic regression model predicting default from income, balance, and student, the coefficient for **balance** is approximately 0.0055.

Interpretation: A one-dollar increase in balance raises the log-odds of defaulting by 0.0055, holding income and student status fixed. Equivalently, each extra dollar multiplies the odds of defaulting by exp(0.0055) ≈ 1.0055, an increase of about 0.55%.
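
As a quick check of this interpretation, the sketch below converts the log-odds coefficient into odds ratios. The value 0.0055 is the one quoted above; the 100-dollar increment is an illustrative assumption, chosen only because a one-dollar change is too small to be meaningful in practice.

```python
import math

beta_balance = 0.0055  # balance coefficient quoted above (log-odds per dollar)

# Odds ratio for a one-dollar increase in balance
or_1 = math.exp(beta_balance)

# Odds ratio for a 100-dollar increase (illustrative increment, not from the assignment)
or_100 = math.exp(beta_balance * 100)

print(f"Odds ratio per $1:   {or_1:.4f}")
print(f"Odds ratio per $100: {or_100:.3f}")
```

Because the effect is multiplicative on the odds, a 100-dollar increase corresponds to exp(0.55), roughly a 73% increase in the odds, rather than 100 × 0.55%.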

In [4]:
# Problem 2
from sklearn.model_selection import train_test_split

# Use income and balance as predictors
X = Default[['income', 'balance']]
y = (Default['default'] == 'Yes').astype(int) 
# Split 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [5]:
# Problem 2
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Report class means
print("\nLDA Class Means:\n", lda.means_)
# Report prior probabilities
print("\nLDA Prior Probabilities:\n", lda.priors_)


LDA Class Means:
 [[33530.99663426   806.21587478]
 [31626.77324835  1744.36429028]]

LDA Prior Probabilities:
 [0.96671429 0.03328571]


In [6]:
# Problem 2
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)

QuadraticDiscriminantAnalysis()


In [7]:
from sklearn.metrics import confusion_matrix, accuracy_score

# LDA predictions
y_pred_lda = lda.predict(X_test)
cm_lda = confusion_matrix(y_test, y_pred_lda)
acc_lda = accuracy_score(y_test, y_pred_lda)

print("\nLDA Confusion Matrix:\n", cm_lda)
print("LDA Test Accuracy:", acc_lda)

# QDA predictions
y_pred_qda = qda.predict(X_test)
cm_qda = confusion_matrix(y_test, y_pred_qda)
acc_qda = accuracy_score(y_test, y_pred_qda)

print("\nQDA Confusion Matrix:\n", cm_qda)
print("QDA Test Accuracy:", acc_qda)


LDA Confusion Matrix:
 [[2891    9]
 [  74   26]]
LDA Test Accuracy: 0.9723333333333334

QDA Confusion Matrix:
 [[2886   14]
 [  71   29]]
QDA Test Accuracy: 0.9716666666666667


### Problem 2

Both LDA and QDA achieve very high overall accuracy (about 97%), mainly because most customers do not default.

The class means show that defaulting customers tend to have higher balances and lower incomes compared to non-defaulters.

The priors reflect the strong class imbalance (about 3% defaults).

LDA correctly classifies most “No” cases but misses many “Yes” cases (74 false negatives vs. 26 true positives).

QDA performs similarly, with slightly lower accuracy on “No” (14 false positives vs. 9) but slightly better detection of “Yes” (29 true positives vs. 26).

Overall, balance is the most important variable for distinguishing default status, while the models struggle to capture the minority “Yes” cases due to data imbalance.
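
To put the 97% accuracy in context, the sketch below compares both models with a trivial rule that always predicts “No”. The confusion matrices are copied from the output above; the always-No baseline is an added reference point, not part of the original analysis.

```python
# Confusion matrices reported above (rows: true No/Yes, columns: predicted No/Yes)
cm = {
    "LDA": [[2891, 9], [74, 26]],
    "QDA": [[2886, 14], [71, 29]],
}

def accuracy(m):
    """Share of correct predictions in a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = m
    return (tn + tp) / (tn + fp + fn + tp)

# A rule that always predicts "No" gets every true "No" right and every "Yes" wrong
n_no, n_yes = 2891 + 9, 74 + 26
baseline = n_no / (n_no + n_yes)

print(f"Always-No baseline: {baseline:.4f}")
for name, m in cm.items():
    gain = accuracy(m) - baseline
    print(f"{name}: accuracy {accuracy(m):.4f}, gain over baseline {gain:+.4f}")
```

Both models beat the baseline by well under one percentage point, which is why accuracy alone says little about how well defaults are detected.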

In [8]:
# Problem 3
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

cm_nb = confusion_matrix(y_test, y_pred_nb)
acc_nb = accuracy_score(y_test, y_pred_nb)
print("\nNaive Bayes Confusion Matrix:\n", cm_nb)
print("Naive Bayes Test Accuracy:", acc_nb)


Naive Bayes Confusion Matrix:
 [[2883   17]
 [  73   27]]
Naive Bayes Test Accuracy: 0.97


In [9]:
new_data = [[40000, 2000]]   # shape = (1, 2)
prob_default = nb.predict_proba(new_data)

print("Predicted probabilities [No, Yes]:", prob_default)
print("Probability of default:", prob_default[0][1])

Predicted probabilities [No, Yes]: [[0.50059302 0.49940698]]
Probability of default: 0.4994069795081303




### Problem 3

**Naive Bayes classifier**

Trained with income and balance as predictors, the test accuracy is about 97%.

Overall accuracy (97.00%) is slightly lower than LDA (97.23%) and QDA (97.17%).

**Confusion matrix results**

Similar to LDA and QDA, most “No default” cases are classified correctly.

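Like the other models, it identifies “Yes default” cases poorly: 73 of the 100 test defaults are missed (27 detected), which falls between LDA (74 missed) and QDA (71 missed). It also produces slightly more false positives (17) than either.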

**Prediction for a new customer**

For income = 40,000 and balance = 2,000, the predicted probability of default is about 0.50.

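In other words, the model rates this customer as about equally likely to default or not. Because 0.4994 is just below 0.5, the predicted class at the default threshold is “No”, but only by a very narrow margin.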
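
The sketch below illustrates how a Gaussian naive Bayes posterior of this kind is formed: each class prior is multiplied by one normal density per feature, and the results are normalized. The priors and class means are the training-split values printed in Problem 2; the standard deviations are hypothetical placeholders, since the notebook does not report them, so the output is not expected to reproduce the 0.4994 above.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# Priors and class means (income, balance) as printed for the training split
priors = {"No": 0.96671429, "Yes": 0.03328571}
means = {
    "No": (33530.99663426, 806.21587478),
    "Yes": (31626.77324835, 1744.36429028),
}

# HYPOTHETICAL per-class standard deviations (income, balance); not from the notebook
sds = {"No": (13000.0, 460.0), "Yes": (13000.0, 340.0)}

def nb_posterior(x):
    """Posterior P(class | x): prior times per-feature densities, then normalized."""
    joint = {}
    for c in priors:
        likelihood = 1.0
        for xi, m, s in zip(x, means[c], sds[c]):
            likelihood *= gaussian_pdf(xi, m, s)  # features treated as independent
        joint[c] = priors[c] * likelihood
    total = sum(joint.values())
    return {c: joint[c] / total for c in joint}

post = nb_posterior((40000, 2000))
print(post)
```

The small prior for “Yes” has to be outweighed by the balance likelihood before the posterior approaches 0.5, which is why only customers with very high balances are flagged.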

**Overall conclusion**

All three models achieve high accuracy on the imbalanced dataset, but mainly by correctly predicting “No.”

Balance is the most important predictor of default, while income contributes very little.

LDA and QDA score marginally higher than Naive Bayes on this split, possibly because they can model correlation between the predictors, although the gaps amount to only a handful of test observations.

In [10]:
# Problem 4
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

#standardize the predictors
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Try KNN with k=1,3,5,10
k_values = [1, 3, 5, 10]
results = {}

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_scaled, y_train)
    y_pred_knn = knn.predict(X_test_scaled)
    acc_knn = accuracy_score(y_test, y_pred_knn)
    results[k] = acc_knn  
    print(f"KNN (k={k}) Test Accuracy:", acc_knn)

# Create table
results_df = pd.DataFrame.from_dict(results, orient='index', columns=['Test Accuracy'])
print("\nSummary Table:")
print(results_df)

KNN (k=1) Test Accuracy: 0.9583333333333334
KNN (k=3) Test Accuracy: 0.9673333333333334
KNN (k=5) Test Accuracy: 0.968
KNN (k=10) Test Accuracy: 0.9703333333333334

Summary Table:
    Test Accuracy
1        0.958333
3        0.967333
5        0.968000
10       0.970333


### Problem 4
The best performance is obtained at K = 10, with a test accuracy of about 97.0%.

Very small values of K (K = 1) tend to perform worse because the classifier is overly sensitive to noise or outliers: the prediction for a test point depends entirely on a single nearest neighbor, which may not represent the broader pattern.

As K increases, the decision boundary becomes smoother (as in the graph shown in class), reducing variance and improving generalization.

However, if K is set too large, the model becomes biased: averaging over many neighbors ignores local structure. Choosing K is therefore a bias-variance trade-off.
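
The sensitivity of K = 1 to a single noisy point can be seen in a toy example. The sketch below is a minimal pure-Python majority-vote KNN on hypothetical one-dimensional data (not the Default dataset), with one mislabeled “Yes” planted inside the “No” cluster.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query (1-D distance)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# HYPOTHETICAL data: "No" clusters near 0-2, "Yes" near 9-10,
# plus one noisy "Yes" at 1.1 inside the "No" cluster
train = [
    (0.0, "No"), (0.5, "No"), (1.0, "No"), (1.5, "No"), (2.0, "No"),
    (1.1, "Yes"),
    (9.0, "Yes"), (9.5, "Yes"), (10.0, "Yes"),
]

query = 1.12  # sits right next to the noisy point
for k in (1, 3, 5):
    print(f"k={k}: {knn_predict(train, query, k)}")
```

With K = 1 the prediction follows the noisy neighbor, while K = 3 and K = 5 outvote it with the surrounding “No” points.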

In [11]:
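# Problem 5: logistic regression test accuracy
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Note: this model also uses student, unlike the income/balance models above
X = Default[['student', 'balance', 'income']]
y = Default['default_bin']

# Same test_size, random_state, and stratification labels as the Problem 2 split,
# so the same rows should land in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)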

train = X_train.copy()
train['default_bin'] = y_train

test = X_test.copy()
test['default_bin'] = y_test

# logistic regression (refit on training data)
logit_model = smf.logit('default_bin ~ student + balance + income', data=train).fit()
y_pred_prob = logit_model.predict(test)
y_pred = (y_pred_prob > 0.5).astype(int)
acc_logit = accuracy_score(test['default_bin'], y_pred)

print("Logistic Regression Test Accuracy:", acc_logit)

# confusion matrix
cm_logit = confusion_matrix(test['default_bin'], y_pred)
print("Logistic Regression Confusion Matrix:\n", cm_logit)

Optimization terminated successfully.
         Current function value: 0.078110
         Iterations 10
Logistic Regression Test Accuracy: 0.9716666666666667
Logistic Regression Confusion Matrix:
 [[2882   18]
 [  67   33]]


In [12]:
from sklearn.neighbors import KNeighborsClassifier

# Use k=10, the best-performing value in Problem 4
# (X_train_scaled holds the standardized income and balance from that cell)
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_scaled, y_train)

# predict
y_pred_knn = knn.predict(X_test_scaled)

# confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
print("KNN Confusion Matrix:\n", cm_knn)

KNN Confusion Matrix:
 [[2878   22]
 [  67   33]]


In [13]:
fnr_logit = cm_logit[1,0] / (cm_logit[1,0] + cm_logit[1,1])
fnr_lda  = cm_lda[1,0]  / (cm_lda[1,0]  + cm_lda[1,1])
fnr_qda  = cm_qda[1,0]  / (cm_qda[1,0]  + cm_qda[1,1])
fnr_nb   = cm_nb[1,0]   / (cm_nb[1,0]   + cm_nb[1,1])
fnr_knn   = cm_knn[1,0]   / (cm_knn[1,0]   + cm_knn[1,1])

print("Logistic Regression FNR:", fnr_logit)
print("LDA FNR:", fnr_lda)
print("QDA FNR:", fnr_qda)
print("Naive Bayes FNR:", fnr_nb)
print("KNN FNR:", fnr_knn)

Logistic Regression FNR: 0.67
LDA FNR: 0.74
QDA FNR: 0.71
Naive Bayes FNR: 0.73
KNN FNR: 0.67


In [14]:
from sklearn.metrics import confusion_matrix

def rates_from_cm(cm):
    tn, fp, fn, tp = cm.ravel()
    # False Positive Rate (FPR) = FP / (FP + TN)
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    # False Negative Rate (FNR) = FN / (FN + TP)
    fnr = fn / (fn + tp) if (fn + tp) > 0 else 0.0
    return fpr, fnr, (tn, fp, fn, tp)

# Predicted probabilities on the test set
y_prob_logit = logit_model.predict(test)

# Threshold = 0.5
y_pred_05 = (y_prob_logit > 0.5).astype(int)
cm_05 = confusion_matrix(test['default_bin'], y_pred_05)
fpr_05, fnr_05, _ = rates_from_cm(cm_05)

# Threshold = 0.3
y_pred_03 = (y_prob_logit > 0.3).astype(int)
cm_03 = confusion_matrix(test['default_bin'], y_pred_03)
fpr_03, fnr_03, _ = rates_from_cm(cm_03)

print("Logistic Regression (threshold 0.5) -> FPR:", fpr_05, "FNR:", fnr_05, "CM:", cm_05)
print("Logistic Regression (threshold 0.3) -> FPR:", fpr_03, "FNR:", fnr_03, "CM:", cm_03)


Logistic Regression (threshold 0.5) -> FPR: 0.006206896551724138 FNR: 0.67 CM: [[2882   18]
 [  67   33]]
Logistic Regression (threshold 0.3) -> FPR: 0.019655172413793102 FNR: 0.47 CM: [[2843   57]
 [  47   53]]


### Problem 5

#### Summary Table of Test Accuracy

| Method              | Test Accuracy |
|---------------------|---------------|
| Logistic Regression | 0.9717        |
| LDA                 | 0.9723        |
| QDA                 | 0.9717        |
| Naive Bayes         | 0.9700        |
| KNN (k=1)           | 0.9583        |
| KNN (k=3)           | 0.9673        |
| KNN (k=5)           | 0.9680        |
| KNN (k=10, best)    | 0.9703        |

#### False Negative Rate (FNR) Comparison

| Method              | FNR  |
|---------------------|------|
| Logistic Regression | 0.67 |
| LDA                 | 0.74 |
| QDA                 | 0.71 |
| Naive Bayes         | 0.73 |
| KNN (k=10)          | 0.67 |

Lowest FNR: Logistic Regression (0.67) and KNN (0.67)

#### Cost-sensitive recommendation

Since missing a default (FN) costs 10× more than a false alarm (FP), the method with the lowest false negative rate (FNR) is preferred.

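Logistic Regression ties KNN (k=10) for the lowest FNR (0.67) and has fewer false positives (18 vs. 22), while maintaining very high test accuracy (97.17%). Note that the logistic model also uses student as a predictor, whereas the other methods use only income and balance, so the comparison is not strictly like for like.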

It is also easy to tune using probability thresholds, making it the most appropriate choice.
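
The 10:1 cost ratio can be applied directly to the test-set confusion matrices. The sketch below does so using the counts reported above; the cost units are arbitrary (only the 10:1 ratio comes from the problem statement), and the variable names are illustrative.

```python
# Test-set confusion matrices reported above, as (TN, FP, FN, TP)
cms = {
    "Logistic Regression": (2882, 18, 67, 33),
    "LDA": (2891, 9, 74, 26),
    "QDA": (2886, 14, 71, 29),
    "Naive Bayes": (2883, 17, 73, 27),
    "KNN (k=10)": (2878, 22, 67, 33),
}

COST_FN, COST_FP = 10, 1  # relative units; only the 10:1 ratio is given

# Total misclassification cost = 10 * (missed defaults) + 1 * (false alarms)
costs = {name: COST_FN * fn + COST_FP * fp for name, (tn, fp, fn, tp) in cms.items()}

for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} total cost = {cost}")
```

At the default 0.5 threshold, Logistic Regression has the lowest total cost (688), narrowly ahead of KNN (692), which agrees with the FNR ranking.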

#### Effect of lowering the threshold (0.5 to 0.3)

| Threshold | Confusion Matrix        | FPR   | FNR   | Interpretation                       |
|-----------|-------------------------|-------|-------|--------------------------------------|
| 0.5       | [[2882, 18], [67, 33]] | 0.006 | 0.67  | Very low FPR, but misses most defaults |
| 0.3       | [[2843, 57], [47, 53]] | 0.020 | 0.47  | Higher FPR, but catches more defaults |

#### Conclusion

Lowering the threshold from 0.5 to 0.3 cuts false negatives from 67 to 47 (FNR 0.67 to 0.47) while raising false positives from 18 to 57 (FPR 0.006 to 0.020). Given the 10:1 cost imbalance, this trade-off makes Logistic Regression at a 0.3 threshold the best of the options evaluated here.
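
The trade-off can be quantified from the two confusion matrices in the table. The sketch below is a minimal calculation using those reported counts, with the same assumed relative costs (10 per false negative, 1 per false positive).

```python
# Logistic regression confusion matrices from the table above, as (TN, FP, FN, TP)
by_threshold = {
    0.5: (2882, 18, 67, 33),
    0.3: (2843, 57, 47, 53),
}

COST_FN, COST_FP = 10, 1  # relative units; only the 10:1 ratio is given

summary = {}
for threshold, (tn, fp, fn, tp) in by_threshold.items():
    n = tn + fp + fn + tp
    summary[threshold] = {
        "cost": COST_FN * fn + COST_FP * fp,  # weighted misclassification cost
        "accuracy": (tn + tp) / n,            # plain test accuracy
    }

for threshold, s in summary.items():
    print(f"threshold {threshold}: cost = {s['cost']}, accuracy = {s['accuracy']:.4f}")
```

Total cost falls from 688 to 527 while accuracy drops only from 0.9717 to 0.9653, so the lower threshold is cheaper under this cost structure despite being slightly less accurate.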