### Stage 3: Analysis of key reporting metrics and final feature selection

### ⚠️ Features with High Missing Rates

| Feature            | Rate   |
|--------------------|--------|
| `dti_joint`        | 0.9640 |
| `annual_inc_joint` | 0.9640 |
| `revol_bal_final`  | 0.0057 |
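
The rates in this table can be reproduced with a per-column missing-value share. The following is a minimal sketch, assuming the data is available as a pandas DataFrame; the helper name `missing_rates` and the toy values are illustrative and not taken from the project.

```python
import numpy as np
import pandas as pd


def missing_rates(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "dti_joint": [np.nan, np.nan, np.nan, 12.5],
    "revol_bal_final": [100.0, 250.0, 0.0, 75.0],
})
rates = missing_rates(toy)
print(rates)
```

Sorting in descending order puts the candidates for dropping or imputation at the top of the report.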

### 🚫 Low-Variance / Sparse Features (Zero-Dominant or Unused)

| Feature                        | Rate   |
|--------------------------------|--------|
| `loan_status_binary`           | 0.8049 |
| `hardship_dpd_filled`          | 0.9951 |
| `delinq_2yrs_reg`              | 0.8105 |
| `delinquency_score`            | 0.8023 |
| `avg_cur_bal_missing`          | 0.9636 |
| `num_tl_op_past_12m_missing`   | 0.9636 |
| `pub_rec_bankruptcies_missing` | 0.9996 |
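
Assuming the "Rate" column here is the share of rows equal to zero (the table does not say so explicitly), it could be computed roughly as below. The helper name `zero_share` and the toy values are hypothetical.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of rows equal to zero for each numeric column.

    Missing values compare unequal to zero, so they count as non-zero here.
    """
    numeric = df.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)


# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "hardship_dpd_filled": [0, 0, 0, 0, 3],
    "delinq_2yrs_reg": [0, 1, 0, 2, 0],
})
shares = zero_share(toy)
print(shares)
```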

### 📈 Binary Tree Features with Split Potential
| Feature                     | Correlation with Target |
|-----------------------------|-------------------------|
| `sub_grade_encoded`         | 0.2571                  |
| `int_rate`                  | 0.2533                  |
| `grade_encoded`             | 0.2507                  |
| `dti_joint`                 | 0.1276                  |
| `dti_final`                 | 0.0915                  |
| `hardship_dpd_filled`       | 0.0853                  |
| `loan_amount_band`          | 0.0782                  |
| `loan_to_installment_ratio` | 0.0542                  |
| `revol_util_reg`            | 0.0493                  |
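
A ranking like this one can be sketched with `DataFrame.corrwith`. The snippet assumes Pearson correlation and absolute values, which the table does not confirm; `target_correlations` and the toy data are illustrative only.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target_col: str) -> pd.Series:
    """Return the absolute Pearson correlation of each numeric feature with the target."""
    numeric = df.select_dtypes(include="number")
    features = numeric.drop(columns=[target_col])
    corr = features.corrwith(numeric[target_col])
    return corr.abs().sort_values(ascending=False)


# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "int_rate": [5.0, 7.0, 9.0, 11.0, 13.0, 15.0],
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "loan_status_binary": [0, 0, 0, 1, 1, 1],
})
corrs = target_correlations(toy, "loan_status_binary")
print(corrs)
```

Linear correlation only captures monotone, roughly linear relationships, so a low value does not rule out a feature that is useful for tree splits.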


### Features to keep
---

| Feature                                 | Notes                                          |
|-----------------------------------------|------------------------------------------------|
| `sub_grade_encoded`                     | Correlated with target (0.2571)                |
| `int_rate`                              | Correlated with target (0.2533)                |
| `grade_encoded`                         | Correlated with target (0.2507)                |
| `dti_final`                             | Somewhat correlated (0.0915), complete         |
| `loan_to_installment_ratio`             | Complete, weak signal (0.0542)                 |
| `revol_util_reg`                        | Complete, weak signal (0.0493)                 |
| `annual_inc_final`                      | Complete, weak negative correlation            |
| `cur_bal_to_income` / `cur_bal_to_loan` | Strong skew, but informative                   |
| `fico_average`                          | Good distribution, useful for models           |
| `loan_amount_band`                      | Clean categorical binning (correlation 0.0782) |

### Needs additional tuning

| Feature                    | Issue                         | Recommendation                                      |
|----------------------------|-------------------------------|-----------------------------------------------------|
| `hardship_dpd_filled`      | Extreme sparsity (99.5% zero) | Consider dropping unless modeling rare hardship     |
| `delinquency_score`        | 80% zeros                     | Keep if it helps tree-based splits; consider binning |
| `emp_length_clean_reg`     | 8% zero, low correlation      | May need binning or a non-linear encoding           |
| `initial_list_status_flag` | Weak correlation              | Possibly useful as an interaction term              |
| `purpose_risk_score`       | Low signal, categorical       | Re-check bins or combine with other purpose signals |

### Can be dropped

| Feature                              | Reason                          |
|--------------------------------------|---------------------------------|
| `dti_joint`                          | 96.4% missing                   |
| `annual_inc_joint`                   | 96.4% missing                   |
| `pub_rec_bankruptcies_missing`       | 99.96% zero                     |
| `num_tl_op_past_12m_missing`         | 96% zero                        |
| `avg_cur_bal_missing`                | 96% zero                        |
| `tot_cur_bal_missing`                | 96% zero                        |
| `recent_major_derog_flag`            | 94% zero, very weak correlation |
| `mths_since_recent_inq_missing`      | 87% zero, inverse correlation   |
| `mths_since_last_major_derog_filled` | Skewed & weak correlation       |
| `mths_since_last_record_filled`      | Weak signal, heavy skew         |
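
Applying the drop list could look like the sketch below. The column names come from the list above; the helper `drop_flagged_columns` is hypothetical, and it skips names that are absent so that it can run on intermediate datasets.

```python
import pandas as pd

# Columns listed under "Can be dropped"
DROP_COLS = [
    "dti_joint",
    "annual_inc_joint",
    "pub_rec_bankruptcies_missing",
    "num_tl_op_past_12m_missing",
    "avg_cur_bal_missing",
    "tot_cur_bal_missing",
    "recent_major_derog_flag",
    "mths_since_recent_inq_missing",
    "mths_since_last_major_derog_filled",
    "mths_since_last_record_filled",
]


def drop_flagged_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the flagged columns that are present."""
    present = [col for col in DROP_COLS if col in df.columns]
    return df.drop(columns=present)


# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "int_rate": [7.5, 11.2],
    "dti_joint": [None, 20.0],
    "recent_major_derog_flag": [0, 0],
})
reduced = drop_flagged_columns(toy)
print(list(reduced.columns))
```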

### Complementing the YData statistics with a RandomForestRegressor feature-importance analysis

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import numpy as np

def run_random_forest_analysis(df: pd.DataFrame):
    """
    Trains and evaluates a RandomForestRegressor on the cleaned regression dataset.

    Parameters:
        df (pd.DataFrame): The regression-ready dataset including 'loan_status_binary' as the target.

    Returns:
        dict: Fitted model, RMSE, R² score, and feature importances
    """
    # Drop rows with any missing values.
    # Note: a column with a high missing rate (e.g. dti_joint, if still unfilled)
    # removes most rows at this step.
    df_clean = df.dropna()

    # Separate features and target
    target_col = "loan_status_binary"
    feature_cols = [col for col in df_clean.columns if col != target_col]

    X = df_clean[feature_cols]
    y = df_clean[target_col]

    # Train-test split (80/20, no stratification)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=None
    )

    # Train model
    model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)

    # Predict on the held-out set
    y_pred = model.predict(X_test)

    # Evaluate
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    # Impurity-based feature importance, highest first
    importance_df = pd.DataFrame({
        "feature": X.columns,
        "importance": model.feature_importances_
    }).sort_values(by="importance", ascending=False)

    return {
        "model": model,
        "rmse": rmse,
        "r2_score": r2,
        "feature_importances": importance_df
    }
```
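
Because the function starts with `df.dropna()`, it may be worth checking how many rows survive that step before trusting the scores. The sketch below is illustrative: `dropna_survival` is a hypothetical helper and the toy values are made up. It only assumes that a sparse column such as `dti_joint` (96.4% missing in the table above) may still contain unfilled values.

```python
import numpy as np
import pandas as pd


def dropna_survival(df: pd.DataFrame) -> float:
    """Return the fraction of rows that remain after df.dropna()."""
    if len(df) == 0:
        return 0.0
    return len(df.dropna()) / len(df)


# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    "dti_joint": [np.nan, np.nan, np.nan, 20.0],
    "int_rate": [7.5, 11.2, 13.9, 9.1],
    "loan_status_binary": [0, 1, 0, 0],
})
kept = dropna_survival(toy)
kept_without_sparse = dropna_survival(toy.drop(columns=["dti_joint"]))
print(kept, kept_without_sparse)
```

If the surviving fraction is small, dropping or imputing the sparse columns first keeps the model from being trained on a narrow and possibly unrepresentative subset.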


```python
%reload_ext kedro.ipython
```

```python
from sklearn.ensemble import RandomForestRegressor

# Separate features and target
X = df_reg.drop(columns=["loan_status_binary"])
y = df_reg["loan_status_binary"]

# Check object or non-numeric columns
non_numeric_cols = X.select_dtypes(exclude=["number"]).columns
print("🔎 Non-numeric (object or category) columns:")
print(non_numeric_cols)

# Try to cast each non-numeric column to float and report every column that fails
for col in non_numeric_cols:
    try:
        _ = X[col].astype(float)
    except ValueError as e:
        print(f"❌ Column '{col}' caused an error: {e}")
```

Output:

```text
🔎 Non-numeric (object or category) columns:
Index(['fico_risk_band'], dtype='object')
❌ Column 'fico_risk_band' caused an error: Cannot cast object dtype to float64
```


```python
df_reg = df_reg.drop(columns=["fico_risk_band"])
results = run_random_forest_analysis(df_reg)

print("RMSE:", results["rmse"])
print("R² Score:", results["r2_score"])
results["feature_importances"].head(10)
```

Output:

```text
RMSE: 0.42063062607056145
R² Score: 0.0772807966335689
```

| Index | Feature                     | Importance |
|-------|-----------------------------|------------|
| 6     | `cur_bal_to_loan`           | 0.106465   |
| 5     | `cur_bal_to_income`         | 0.092901   |
| 26    | `revol_util_reg`            | 0.091724   |
| 25    | `revol_bal_final`           | 0.090279   |
| 17    | `loan_to_installment_ratio` | 0.075271   |
| 27    | `sub_grade_encoded`         | 0.068717   |
| 11    | `fico_average`              | 0.057947   |
| 0     | `dti_joint`                 | 0.056613   |
| 9     | `dti_final`                 | 0.05646    |
| 1     | `annual_inc_joint`          | 0.048277   |
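
The ranking above comes from `feature_importances_`, which is impurity-based and is known to favour continuous, high-cardinality features. As a cross-check, permutation importance on held-out data can be computed with `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data, so the column names `signal`, `noise`, and `target` are placeholders rather than project features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: "signal" drives the binary target, "noise" is unrelated
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = (toy["signal"] + 0.1 * rng.normal(size=n) > 0).astype(int)

X = toy[["signal", "noise"]]
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on the test set and measure the average drop in score
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
perm_df = pd.DataFrame({
    "feature": X.columns,
    "permutation_importance": perm.importances_mean,
}).sort_values(by="permutation_importance", ascending=False)
print(perm_df)
```

On the real data, the same call could be run with `results["model"]` and a held-out split, provided the split is made available outside `run_random_forest_analysis`.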


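```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# NOTE: X_train and y_train are local to run_random_forest_analysis, so this cell
# assumes they were also created at notebook scope (for example by an earlier
# train_test_split). If they were not, results["model"] and
# results["feature_importances"] from the previous cell can be reused instead.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, highest first
importance_df = pd.DataFrame({
    "feature": X_train.columns,
    "importance": model.feature_importances_
}).sort_values(by="importance", ascending=False)

# Horizontal bar chart of the top 20 features, most important at the top
importance_df.head(20).plot.barh(x="feature", y="importance", figsize=(10, 8))
plt.gca().invert_yaxis()
plt.title("Top Feature Importances - RandomForestRegressor")
plt.tight_layout()
plt.show()
```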
