#**DS5110 - Essentials of Data Science**
##**Fall 2025 Homework Assignment 6**

Submission Instructions:
- Please complete this homework assignment in the same notebook provided.
- Submit your completed assignment on Canvas by the deadline.

Submission Deadline:
**November 18th, 2025**

<p align="justify">
Please read the instructions carefully when answering questions and ensure your code works correctly before submission. The grader will run your code for grading the coding questions without any adjustment.
</p>

In [None]:
#@markdown ### Enter your first and last names below:
First_Name = "Linxuan" #@param {type:"string"}
Last_Name = "Li" #@param {type:"string"}

##**Problem Description**

Financial institutions that lend to consumers rely on models to help decide whom to approve or decline for credit (for lending products such as credit cards, automobile loans, or home loans). In this project, your task is to develop models that review credit card applications to determine which ones should be approved. You are given historical data on the response (a binary default indicator) and 20 predictor variables from credit card accounts for a hypothetical bank XYZ, a regional bank in the Bay Area. Three datasets are available: a [training](https://raw.githubusercontent.com/mh2t/DS5110/main/Homework/HW4-Train.csv) dataset with 20,000 accounts, a [validation](https://raw.githubusercontent.com/mh2t/DS5110/main/Homework/HW4-Validation.csv) dataset with 3,000 accounts, and a **hidden** test dataset with 5,000 accounts. Information about the variables is given in the [Appendix](https://github.com/mh2t/DS5110/blob/main/Homework/HW4-appx.pdf).

You are asked to do the following and also address specific questions below:

* **(10 points)** Do any necessary data pre-processing in preparation for modeling.
* **(20 points)** Develop and fit a logistic regression (LR) model, assess its performance, and interpret the results.
* **(20 points)** Develop an additional model based on a machine learning (ML) algorithm selected from one of the following: Random Forest, Gradient Boosting (XGBoost or another implementation), or Feedforward Neural Network; assess its performance, and make sure to explain why you chose this particular algorithm.
* **(10 points)** Compare the results from the ML algorithm with those from the logistic regression model and discuss their advantages and disadvantages; select one of these models for credit approval; and describe the reasons for your selection.
* **(5 points)** Describe which performance metrics you chose to evaluate your proposed models and why.
* **(10 points)** Describe how you would use your selected model to make decisions on future credit card applications.
* **(5 points)** Do customers who already have an account with the financial institution receive any favorable treatment in your model? Support your answer with appropriate analysis.
* **(20 points)** 2-page report.
* You can use any libraries for this homework.
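
One possible way to approach the question about existing customers is to read the fitted logistic regression coefficient of the relevant indicator as an odds ratio. The sketch below uses synthetic data only: the column names `has_existing_account` and `utilization`, and the simulated effect sizes, are assumptions made for illustration and are not taken from the Appendix.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in data; both column names are hypothetical.
toy = pd.DataFrame({
    "has_existing_account": rng.integers(0, 2, size=n),
    "utilization": rng.uniform(0, 1, size=n),
})

# Simulate defaults so that existing customers are less risky
# (assumed true log-odds effect of -1.0 for the indicator).
log_odds = (-2.0
            - 1.0 * toy["has_existing_account"].to_numpy()
            + 2.0 * toy["utilization"].to_numpy())
toy["default"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-log_odds)))

features = ["has_existing_account", "utilization"]
model = LogisticRegression(max_iter=1000)
model.fit(toy[features], toy["default"])

# exp(coefficient) is the multiplicative change in default odds per unit
# increase in the feature, holding the other feature fixed.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=features)
print(odds_ratios)
```

An odds ratio below 1 for the indicator would suggest that, with the other predictors held fixed, the model assigns lower default odds to existing customers. This describes an association in the fitted model, not a causal effect.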



##**Deliverables**

Please submit the following:

1. A report (doc file) that describes all important steps in your data analysis, model development, and comparison of the models, and that answers the specific questions and justifies your final model selection. The body of the report should be no more than 2 pages in length (font size 11 and spacing 1.2).
2. The code you used for the analysis should have brief but adequate annotations so that we can run it. Using the **IPYNB** format is mandatory. Clearly indicate the software packages and versions (if appropriate) that you used for the analysis.
3. You are allowed to review textbooks, published papers, websites, and other open literature in preparing for this homework. Note, however, that the material you submit in your report must be based on your own analysis and writing. If you relied on published scholarly work and open-source software for your analysis and findings (beyond what is generally known), you should provide references at the end of the report.


##**Top Model Bonus**

If the evaluation metric of your chosen model achieves the **highest** rank among all submissions, you will be awarded an additional **10 bonus points**. This bonus will be applied directly to your Homework 6 score. Note that the performance of your best model will be assessed on a hidden test set to ensure a fair and unbiased evaluation.

In [14]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, confusion_matrix

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


train = pd.read_csv("/content/HW4-Train.csv")
valid = pd.read_csv("/content/HW4-Validation.csv")

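# Preprocessing: drop every row that contains at least one missing value
train = train.dropna()
valid = valid.dropna()

# Separate the binary default indicator (target) from the predictors
y_train = train['Default_ind']
X_train = train.drop(['Default_ind'], axis=1)

y_valid = valid['Default_ind']
X_valid = valid.drop(['Default_ind'], axis=1)

# One-hot encode categorical predictors
X_train_dum = pd.get_dummies(X_train, drop_first=True)
X_valid_dum = pd.get_dummies(X_valid, drop_first=True)

# Give the validation set exactly the training columns (absent dummies are filled with 0)
X_train_dum, X_valid_dum = X_train_dum.align(X_valid_dum, join='left', axis=1, fill_value=0)

# Standardize using statistics learned from the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_dum)
X_valid_scaled = scaler.transform(X_valid_dum)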

print("X_train_dum shape:", X_train_dum.shape)
print("X_valid_dum shape:", X_valid_dum.shape)


X_train_dum shape: (16559, 25)
X_valid_dum shape: (2473, 25)
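
Dropping incomplete rows reduces the training set from 20,000 to 16,559 accounts and the validation set from 3,000 to 2,473, and the same step would discard accounts from the hidden test set if it also contains missing values. A possible alternative is to impute missing numeric values with the training median. The sketch below is a minimal illustration on a toy frame; the column names `income` and `utilization` are hypothetical, and whether median imputation suits the real variables is an assumption to check against the Appendix.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the real data (column names are hypothetical).
toy_train = pd.DataFrame({
    "income": [50.0, np.nan, 80.0, 65.0],
    "utilization": [0.2, 0.5, np.nan, 0.9],
})
toy_valid = pd.DataFrame({
    "income": [np.nan, 70.0],
    "utilization": [0.4, np.nan],
})

# Learn the per-column medians on the training data only,
# then reuse them for validation (and later for test) data.
imputer = SimpleImputer(strategy="median")
train_imputed = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
valid_imputed = pd.DataFrame(imputer.transform(toy_valid), columns=toy_valid.columns)

print(train_imputed)
print(valid_imputed)
```

Fitting the imputer on the training data and only transforming the other sets mirrors how `StandardScaler` is used above, so no information from validation or test rows leaks into the preprocessing.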


In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)

y_valid_pred = lr.predict(X_valid_scaled)
y_valid_prob = lr.predict_proba(X_valid_scaled)[:, 1]

accuracy  = accuracy_score(y_valid, y_valid_pred)
precision = precision_score(y_valid, y_valid_pred)
recall    = recall_score(y_valid, y_valid_pred)
f1        = f1_score(y_valid, y_valid_pred)
roc_auc   = roc_auc_score(y_valid, y_valid_prob)
cm        = confusion_matrix(y_valid, y_valid_pred)

print("==== Logistic Regression Performance ====")
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

print("\nConfusion Matrix:")
print(cm)


==== Logistic Regression Performance ====
Accuracy:  0.9393
Precision: 0.7260
Recall:    0.2896
F1-score:  0.4141
ROC-AUC:   0.8213

Confusion Matrix:
[[2270   20]
 [ 130   53]]
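
The recall of 0.2896 above comes from the default 0.5 probability cutoff used by `predict`. Because `predict_proba` returns scores, the cutoff can be lowered to catch more defaulters at the cost of precision. The sketch below shows the mechanics on a small hand-made example; the helper `precision_recall_at` and the toy arrays are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def precision_recall_at(y_true, y_prob, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as default."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Toy labels and predicted default probabilities (made up for illustration).
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.35, 0.60, 0.25, 0.40, 0.55, 0.90])

for t in (0.5, 0.3):
    p, r = precision_recall_at(toy_true, toy_prob, t)
    print(f"threshold={t:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On the real data, the same helper could be called with `y_valid.to_numpy()` and `y_valid_prob`, and the cutoff chosen on the validation set according to the relative cost of a missed default versus a declined good customer.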


In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score

# 1. Define and train Random Forest
rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_split=2,
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_dum, y_train)

# 2. Predict on validation set
y_valid_pred_rf = rf.predict(X_valid_dum)
y_valid_prob_rf = rf.predict_proba(X_valid_dum)[:, 1]

# 3. Evaluation metrics
acc_rf  = accuracy_score(y_valid, y_valid_pred_rf)
pre_rf  = precision_score(y_valid, y_valid_pred_rf)
rec_rf  = recall_score(y_valid, y_valid_pred_rf)
f1_rf   = f1_score(y_valid, y_valid_pred_rf)
auc_rf  = roc_auc_score(y_valid, y_valid_prob_rf)
cm_rf   = confusion_matrix(y_valid, y_valid_pred_rf)

print("==== Random Forest Performance ====")
print(f"Accuracy:  {acc_rf:.4f}")
print(f"Precision: {pre_rf:.4f}")
print(f"Recall:    {rec_rf:.4f}")
print(f"F1-score:  {f1_rf:.4f}")
print(f"ROC-AUC:   {auc_rf:.4f}")

print("\nConfusion Matrix:")
print(cm_rf)


==== Random Forest Performance ====
Accuracy:  0.9406
Precision: 0.7647
Recall:    0.2842
F1-score:  0.4143
ROC-AUC:   0.8530

Confusion Matrix:
[[2274   16]
 [ 131   52]]
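
The two confusion matrices make it possible to put the accuracy figures in context. Only 183 of the 2,473 validation accounts defaulted, so a rule that approves everyone would already be right about 92.6% of the time. The short check below recomputes the reported metrics from the printed matrices and adds that baseline; the helper name `summarize` is introduced here for illustration.

```python
import numpy as np

# Validation confusion matrices copied from the printed outputs
# (rows: actual class, columns: predicted class).
cm_lr = np.array([[2270, 20], [130, 53]])
cm_rf = np.array([[2274, 16], [131, 52]])

def summarize(cm):
    tn, fp, fn, tp = cm.ravel()
    total = cm.sum()
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy of always predicting "no default"
        "baseline_accuracy": (tn + fp) / total,
        "recall": tp / (tp + fn),
        "missed_defaults": int(fn),
    }

summary_lr = summarize(cm_lr)
summary_rf = summarize(cm_rf)

for name, s in (("Logistic Regression", summary_lr), ("Random Forest", summary_rf)):
    print(name, {k: round(float(v), 4) for k, v in s.items()})
```

Both models improve on the baseline accuracy by little more than one percentage point and miss 130 and 131 of the 183 defaulters respectively, which suggests that ROC-AUC and recall are more informative than accuracy for comparing them.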
