<div style="border:2px solid #2196F3; border-radius:10px; padding:16px; background:#f9f9ff">

<h3>⚙️ Notebook 2 — Preprocessing & Baseline Modeling</h3>

**Purpose**  
- Load the lean dataset (~54 features) derived from Notebook 1.  
- Apply preprocessing:  
  - Handle missing values  
  - Encode categorical variables  
  - Scale numerical features  
- Split data into train/test sets.  
- Train baseline models (Logistic Regression, Random Forest, XGBoost) to establish performance benchmarks.  
- Evaluate using ROC-AUC, LogLoss, precision/recall, and calibration plots.  

**Inputs**  
- <code>data/processed/loan_default_slim.parquet</code> (sampled or full dataset)  
- <code>columns_to_keep.txt</code> (list of selected features)  

**Outputs**  
- Preprocessing pipeline object  
- Baseline model artifacts for comparison in Notebook 3  

</div>

### <div class="alert alert-info" align = center> Imports</div>

In [1]:
# Core
import os, sys, json, pathlib, warnings
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn preprocessing & modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, log_loss, classification_report, confusion_matrix
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Gradient boosting
from xgboost import XGBClassifier

# Utility
warnings.filterwarnings("ignore")
plt.style.use("seaborn-v0_8-whitegrid")


### <div class="alert alert-info" align = center> Functions</div>

<div class="alert alert-info"><strong>Load The Data</strong></div>

In [5]:
# === Dataset Loading ===

USE_SAMPLE = True   # flip to False when ready for full dataset

# Paths
datapath = pathlib.Path("data/processed")
feat_file = pathlib.Path("columns_to_keep.txt")

# Load kept features
with open(feat_file) as f:
    features_to_keep = [line.strip() for line in f if line.strip()]

# Always include target
target = "target_default"
features_final = features_to_keep + [target]

# Load data
if USE_SAMPLE:
    print("🔹 Using SAMPLE dataset (10%)")
    df = pd.read_parquet(datapath / "loan_default_slim55.parquet").sample(frac=0.1, random_state=42)
else:
    print("🔹 Using FULL dataset")
    df = pd.read_parquet(datapath / "loan_default_full55.parquet")

# Keep only final features + target
df = df[features_final]

print(f"✅ Loaded dataset: {df.shape[0]:,} rows, {df.shape[1]} columns")
df.head()


🔹 Using SAMPLE dataset (10%)
✅ Loaded dataset: 1,000 rows, 56 columns


Unnamed: 0,loan_amnt,funded_amnt_inv,int_rate,annual_inc,mths_since_last_record,open_acc,pub_rec,out_prncp,out_prncp_inv,total_rec_prncp,...,annual_inc_joint_is_missing,il_util_is_missing,mths_since_recent_bc_dlq_is_missing,mths_since_recent_revol_delinq_is_missing,sec_app_open_act_il_is_missing,target_default,emp_length,emp_length_is_missing,int_rate_trunc,target_default.1
6252,15000.0,15000.0,13.56,42000.0,90.0,13.0,0.0,14645.769531,14645.769531,354.230011,...,1,1,1,1,1,,2 years,0,13.56,
4684,10000.0,10000.0,11.55,25000.0,90.0,8.0,0.0,9051.410156,9051.410156,948.590027,...,1,0,1,1,1,,,1,11.55,
1731,1675.0,1675.0,8.46,22000.0,90.0,4.0,0.0,1551.01001,1551.01001,123.989998,...,1,0,1,1,1,,,1,8.46,
4742,6500.0,6500.0,10.33,65000.0,69.0,13.0,1.0,5975.100098,5975.100098,524.900024,...,1,0,0,0,1,,9 years,0,10.3299,
4521,13000.0,13000.0,10.72,55000.0,90.0,8.0,0.0,12669.110352,12669.110352,330.890015,...,1,0,1,1,1,,10+ years,0,10.72,
