# 0) Initiation

With our data cleaned and processed, the next logical step is to prepare it for the modeling phase. This involves loading the processed dataset, separating our features (the independent variables, X) from the target variable we want to predict (`int_rate`, y), and splitting the data into training and testing sets. The training set will be used to train and tune our models, while the test set will be kept aside as a final, unseen benchmark to evaluate the performance of our chosen model.

## 0.1) Loading Essential Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

from sklearn.base import clone
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate, RandomizedSearchCV
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

print("Loading Successful.")

Loading Successful.


## 0.2) Loading and Preparing Dataset

In [2]:
df_processed = pd.read_csv('Data/LoansData_Processed.csv')

X = df_processed.drop(['int_rate', 'grade_encoded'], axis=1)
y = df_processed['int_rate']

X.columns = X.columns.str.replace(' ', '_', regex=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=86)

scaler = StandardScaler()
scaler.set_output(transform="pandas") 

print(f"Training set shape: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set shape: X_test={X_test.shape}, y_test={y_test.shape}")

Training set shape: X_train=(8649, 27), y_train=(8649,)
Testing set shape: X_test=(2163, 27), y_test=(2163,)


# 1) Initial Training and Model Selection

To determine the best model for our specific case, we will first evaluate which type of model performs best: linear models, tree-based models, or gradient boosting models. Then, having identified the best-performing type, we will pick the specific model from that category that achieves the strongest results and try to squeeze more performance out of it through tuning.

## 1.1) Initial Training

To find the best approach for estimating interest rates, we'll evaluate a few different types of regression models known for their effectiveness on tabular data: a regularized linear model (`Ridge`), an ensemble of decision trees (`RandomForestRegressor`), and a gradient boosting model (`GradientBoostingRegressor`). To ensure fair comparison and prevent data leakage, especially since some models benefit from feature scaling, we'll bundle the scaling step (`StandardScaler`) and the model training step into a `Pipeline`.
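To make the leakage argument concrete, here is a minimal sketch on synthetic data (the `make_regression` data and all `*_demo` names are illustrative assumptions, not part of this project). It checks that, inside `cross_validate`, the pipeline's `StandardScaler` is fitted on the training part of each fold only, so the validation rows never influence the scaling statistics.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the loans dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
folds = KFold(n_splits=5)

# return_estimator=True keeps the fitted pipeline from every fold.
demo_scores = cross_validate(demo_pipe, X_demo, y_demo, cv=folds, return_estimator=True)

# For each fold, compare the scaler's learned mean with the mean of the
# training rows of that fold.
fold_checks = []
for (train_idx, _), fitted_pipe in zip(folds.split(X_demo), demo_scores["estimator"]):
    train_mean = X_demo[train_idx].mean(axis=0)
    fold_checks.append(np.allclose(fitted_pipe.named_steps["scaler"].mean_, train_mean))

print(fold_checks)
```

Scaling the full training set once before cross-validation would instead let every validation fold contribute to the mean and variance used for scaling.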

In [3]:
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

ridge_pipe = Pipeline([('scaler', scaler), ('ridge', Ridge(random_state=86))])
rf_pipe = Pipeline([('scaler', scaler), ('rf', RandomForestRegressor(random_state=86, n_estimators=100, n_jobs=-1))])
gbr_pipe = Pipeline([('scaler', scaler), ('gbr', GradientBoostingRegressor(random_state=86))])

pipelines = {
    'Ridge': ridge_pipe,
    'Random Forest': rf_pipe,
    'Gradient Boosting': gbr_pipe
}

scoring = {
    'R2': 'r2',
    'MAE': 'neg_mean_absolute_error',
    'RMSE': 'neg_root_mean_squared_error'
}

print("Models and scoring prepared successfully.")

Models and scoring prepared successfully.


In [4]:
cv_results = {}

print("Running Cross-Validation...")
for name, pipe in pipelines.items():
    scores = cross_validate(pipe, X_train, y_train, cv=5, scoring=scoring, n_jobs=-1)
    cv_results[name] = {
        'Fit Time': scores['fit_time'].mean(),
        'Score Time': scores['score_time'].mean(),
        'R2': scores['test_R2'].mean(),
        'MAE': -scores['test_MAE'].mean(), 
        'RMSE': -scores['test_RMSE'].mean() 
    }
    print(f"... {name} Done.")

cv_results_df = pd.DataFrame(cv_results).T
print("\nCross-Validation Results:")
cv_results_df

Running Cross-Validation...
... Ridge Done.
... Random Forest Done.
... Gradient Boosting Done.

Cross-Validation Results:


Model,Fit Time,Score Time,R2,MAE,RMSE
Ridge,0.023724,0.010787,0.48459,2.032439,2.552447
Random Forest,4.843351,0.073843,0.492766,2.016636,2.531651
Gradient Boosting,2.412019,0.011058,0.523871,1.949803,2.453181


We ran 5-fold cross-validation on the training data with `Ridge`, `Random Forest`, and `Gradient Boosting`, each inside a pipeline that includes feature scaling. `Gradient Boosting` performed best on every metric, with the highest R2 (0.52) and the lowest MAE (1.95) and RMSE (2.45), although its lead over the other two models is modest (about 0.03 to 0.04 in R2).
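When the gaps between models are this small, it can help to look at the fold-to-fold spread as well as the mean. The sketch below uses synthetic data and two stand-in models (all names and settings here are illustrative assumptions) to show one way of reporting the per-fold standard deviation and range next to the mean R2.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: not the loans dataset).
X_syn, y_syn = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

candidates = {
    "Ridge": Ridge(),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
}

rows = {}
for name, estimator in candidates.items():
    # With a single scoring string, the per-fold scores live under "test_score".
    fold_r2 = cross_validate(estimator, X_syn, y_syn, cv=5, scoring="r2")["test_score"]
    rows[name] = {
        "R2 mean": fold_r2.mean(),
        "R2 std": fold_r2.std(ddof=1),
        "R2 min": fold_r2.min(),
        "R2 max": fold_r2.max(),
    }

spread_df = pd.DataFrame(rows).T
print(spread_df.round(3))
```

If the difference between two means is smaller than the spread across folds, the ranking should be read as tentative.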

## 1.2) Model Selection

Since the initial results show `GradientBoostingRegressor` performing best, we will compare it against other gradient boosting implementations and select the strongest one for tuning and further analysis.

In [5]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

gbr_pipe = Pipeline([('scaler', scaler), ('gbr', GradientBoostingRegressor(random_state=86))])
hgb_pipe = Pipeline([('scaler', scaler), ('hgb', HistGradientBoostingRegressor(random_state=86))])
xgb_pipe = Pipeline([('scaler', scaler), ('xgb', XGBRegressor(random_state=86, n_jobs=-1, objective='reg:squarederror'))])
lgbm_pipe = Pipeline([('scaler', scaler), ('lgbm', LGBMRegressor(random_state=86, n_jobs=-1, verbose=0))])
catboost_pipe = Pipeline([('scaler', scaler), ('catboost', CatBoostRegressor(random_state=86, verbose=0))])

pipelines = {
    'Gradient Boosting': gbr_pipe,
    'XGBoost': xgb_pipe,
    'LightGBM': lgbm_pipe,
    'CatBoost': catboost_pipe,
    'HistGradientBoosting': hgb_pipe
}

scoring = {
    'R2': 'r2',
    'MAE': 'neg_mean_absolute_error',
    'RMSE': 'neg_root_mean_squared_error'
}

print("Models and scoring prepared successfully.")

Models and scoring prepared successfully.


In [6]:
cv_results = {}

print("Running Cross-Validation...")
for name, pipe in pipelines.items():
    scores = cross_validate(pipe, X_train, y_train, cv=5, scoring=scoring, n_jobs=-1)
    cv_results[name] = {
        'Fit Time': scores['fit_time'].mean(),
        'Score Time': scores['score_time'].mean(),
        'R2': scores['test_R2'].mean(),
        'MAE': -scores['test_MAE'].mean(),
        'RMSE': -scores['test_RMSE'].mean() 
    }
    print(f"... {name} Done.")

cv_results_df = pd.DataFrame(cv_results).T
print("\nCross-Validation Results:")
cv_results_df

Running Cross-Validation...
... Gradient Boosting Done.
... XGBoost Done.
... LightGBM Done.
... CatBoost Done.
... HistGradientBoosting Done.

Cross-Validation Results:


Model,Fit Time,Score Time,R2,MAE,RMSE
Gradient Boosting,2.23273,0.011572,0.523871,1.949803,2.453181
XGBoost,0.293193,0.016802,0.47757,2.025575,2.569511
LightGBM,1.899278,0.008927,0.530787,1.923744,2.435237
CatBoost,4.441201,0.013621,0.537908,1.905839,2.416588
HistGradientBoosting,0.604818,0.027318,0.526006,1.93515,2.447559


These results make the choice difficult. `CatBoost` edges out every other model (CV R2 of 0.538) but also has the longest fit time (about 4.4 s per fold), which makes hyperparameter tuning the most computationally expensive for it. `LightGBM` is a close runner-up (CV R2 of 0.531) with less than half the fit time (about 1.9 s), which makes it more practical to tune.

# 2) Feature Selection & Hyperparameter Tuning

We want to reduce the number of irrelevant features our models see. To do that, we define a function that trains the model, selects the most important features, retrains the model on those features, and compares performance before and after. We then apply this function to both `LightGBM` and `CatBoost` to decide which one to carry forward into hyperparameter tuning.
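As a small illustration of the selection step, the sketch below applies `SelectFromModel` with `threshold="median"` to synthetic data and a random forest (both are stand-ins chosen for the example, not the models used in this notebook). With a median threshold, every feature whose importance is at or above the median is kept, so at least half of the features survive.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, only 4 of which drive the target.
X_toy, y_toy = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=5.0, random_state=0
)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# prefit=True reuses the already-fitted model's feature_importances_.
selector = SelectFromModel(forest, threshold="median", prefit=True)
keep_mask = selector.get_support()
X_toy_selected = selector.transform(X_toy)

print("median importance:", np.median(forest.feature_importances_))
print("kept feature indices:", np.flatnonzero(keep_mask))
print("shape:", X_toy.shape, "->", X_toy_selected.shape)
```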

## 2.0) Feature Selection Helper Function

In [7]:
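def perform_feature_selection(pipeline, X_train, X_test, y_train, y_test, model_name='model', threshold="median", cv=5):
    """Fit `pipeline`, keep the features whose importance passes `threshold`, refit, and compare.

    Returns (comparison DataFrame, selected feature names, (X_train_selected, X_test_selected)).
    """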
    # Store feature names
    feature_names = X_train.columns.tolist()
    
    # Convert to numpy arrays if they're DataFrames to avoid the warnings
    X_train_values = X_train.values if hasattr(X_train, 'values') else X_train
    X_test_values = X_test.values if hasattr(X_test, 'values') else X_test
    
    # Train full model
    pipeline_full = clone(pipeline)
    pipeline_full.fit(X_train_values, y_train)
    model = pipeline_full.named_steps[model_name]
    
    # Get feature importances
    if hasattr(model, 'feature_importances_'):
        feature_importances = model.feature_importances_
    elif hasattr(model, 'get_feature_importance'):
        feature_importances = model.get_feature_importance()
    else:
        raise ValueError("Model doesn't support feature importance")
    
    # Evaluate full model
    y_pred_full = pipeline_full.predict(X_test_values)
    full_metrics = {
        'Test R2': r2_score(y_test, y_pred_full),
        'Test MAE': mean_absolute_error(y_test, y_pred_full),
        'Test RMSE': np.sqrt(mean_squared_error(y_test, y_pred_full))
    }
    
    # CV on full model
    scoring = {'R2': 'r2', 'MAE': 'neg_mean_absolute_error', 'RMSE': 'neg_root_mean_squared_error'}
    scores_full = cross_validate(pipeline_full, X_train_values, y_train, cv=cv, scoring=scoring, n_jobs=-1)
    full_metrics.update({
        'CV R2': scores_full['test_R2'].mean(),
        'CV MAE': -scores_full['test_MAE'].mean(),
        'CV RMSE': -scores_full['test_RMSE'].mean(),
        'Fit Time': scores_full['fit_time'].mean(),
        'Score Time': scores_full['score_time'].mean(),
        'Feature Count': X_train.shape[1]
    })
    
    # Feature selection
    selector = SelectFromModel(estimator=model, threshold=threshold, prefit=True)
    X_train_selected = selector.transform(X_train_values)
    X_test_selected = selector.transform(X_test_values)
    selected_indices = selector.get_support(indices=True)
    selected_names = [feature_names[i] for i in selected_indices]
    print(f"Features: {X_train.shape[1]} → {X_train_selected.shape[1]}, selected: {selected_names}")
    
    # Train and evaluate selected model
    pipeline_selected = clone(pipeline)
    pipeline_selected.fit(X_train_selected, y_train)
    y_pred_selected = pipeline_selected.predict(X_test_selected)
    selected_metrics = {
        'Test R2': r2_score(y_test, y_pred_selected),
        'Test MAE': mean_absolute_error(y_test, y_pred_selected),
        'Test RMSE': np.sqrt(mean_squared_error(y_test, y_pred_selected))
    }
    
    # CV on selected model
    scores_selected = cross_validate(pipeline_selected, X_train_selected, y_train, cv=cv, scoring=scoring, n_jobs=-1)
    selected_metrics.update({
        'CV R2': scores_selected['test_R2'].mean(),
        'CV MAE': -scores_selected['test_MAE'].mean(),
        'CV RMSE': -scores_selected['test_RMSE'].mean(),
        'Fit Time': scores_selected['fit_time'].mean(),
        'Score Time': scores_selected['score_time'].mean(),
        'Feature Count': X_train_selected.shape[1]
    })
    
    # Create comparison DataFrame
    metrics = ['Test R2', 'Test MAE', 'Test RMSE', 'CV R2', 'CV MAE', 'CV RMSE', 
               'Fit Time', 'Score Time', 'Feature Count']
    comparison = pd.DataFrame({
        'Metric': metrics,
        'All Features': [full_metrics[m] for m in metrics],
        'Selected Features': [selected_metrics[m] for m in metrics]
    })
    
    # Convert selected datasets back to DataFrames with proper feature names
    if hasattr(X_train, 'iloc'):  # Check if it's a DataFrame
        X_train_selected_df = pd.DataFrame(X_train_selected, columns=selected_names)
        X_test_selected_df = pd.DataFrame(X_test_selected, columns=selected_names)
        return comparison, selected_names, (X_train_selected_df, X_test_selected_df)
    else:
        return comparison, selected_names, (X_train_selected, X_test_selected)

## 2.1) Feature Selection on LightGBM & CatBoost

In [8]:
lgbm_pipeline = Pipeline([('scaler', scaler), ('lgbm', LGBMRegressor(random_state=86, n_jobs=-1, verbose=-1))])
lgbm_comparison, lgbm_selected_features, (X_train_lgbm_selected, X_test_lgbm_selected) = perform_feature_selection(
    lgbm_pipeline, X_train, X_test, y_train, y_test, model_name='lgbm'
)
print("\nLightGBM Feature Selection Results:")
display(lgbm_comparison.set_index('Metric').T)

catboost_pipeline = Pipeline([('scaler', scaler), ('catboost', CatBoostRegressor(random_seed=86, thread_count=-1, verbose=0))])
catboost_comparison, catboost_selected_features, (X_train_catboost_selected, X_test_catboost_selected) = perform_feature_selection(
    catboost_pipeline, X_train, X_test, y_train, y_test, model_name='catboost'
)
print("\nCatBoost Feature Selection Results:")
display(catboost_comparison.set_index('Metric').T)

print("\nModel Comparison (Selected Features Only):")
model_comparison = pd.DataFrame({
    'Metric': lgbm_comparison['Metric'],
    'LightGBM': lgbm_comparison['Selected Features'],
    'CatBoost': catboost_comparison['Selected Features']
})
display(model_comparison.set_index('Metric').T)

Features: 27 → 14, selected: ['fico', 'loan_amnt', 'dti', 'revol_util', 'total_bc_limit', 'revol_bal', 'total_acc', 'annual_inc', 'mths_since_recent_inq', 'open_acc', 'percent_bc_gt_75', 'term_encoded', 'verified_Verified', 'title_Credit_card_refinancing']

LightGBM Feature Selection Results:


Metric,Test R2,Test MAE,Test RMSE,CV R2,CV MAE,CV RMSE,Fit Time,Score Time,Feature Count
All Features,0.531715,1.924787,2.446152,0.530787,1.923744,2.435237,1.759468,0.011785,27.0
Selected Features,0.503099,1.990578,2.519782,0.507929,1.97743,2.493542,1.826092,0.009679,14.0


Features: 27 → 14, selected: ['fico', 'loan_amnt', 'dti', 'revol_util', 'total_bc_limit', 'revol_bal', 'total_acc', 'annual_inc', 'mths_since_recent_inq', 'open_acc', 'term_encoded', 'verified_Not_Verified', 'verified_Verified', 'title_Credit_card_refinancing']

CatBoost Feature Selection Results:


Metric,Test R2,Test MAE,Test RMSE,CV R2,CV MAE,CV RMSE,Fit Time,Score Time,Feature Count
All Features,0.542365,1.902167,2.418174,0.537908,1.905839,2.416588,4.502258,0.026253,27.0
Selected Features,0.520976,1.944465,2.47404,0.527284,1.936432,2.443754,3.374027,0.009694,14.0



Model Comparison (Selected Features Only):


Metric,Test R2,Test MAE,Test RMSE,CV R2,CV MAE,CV RMSE,Fit Time,Score Time,Feature Count
LightGBM,0.503099,1.990578,2.519782,0.507929,1.97743,2.493542,1.826092,0.009679,14.0
CatBoost,0.520976,1.944465,2.47404,0.527284,1.936432,2.443754,3.374027,0.009694,14.0


Performing this feature selection roughly halves the number of features (from 27 to 14) at a small cost in predictive performance: CV R2 drops by about 4% for `LightGBM` (0.531 to 0.508) and about 2% for `CatBoost` (0.538 to 0.527), with drops of about 4 to 5% in test R2. `CatBoost` proved more resilient to the selection than `LightGBM`. Although `CatBoost` still has a significantly higher fit time, the performance gap can no longer be ignored, so we will use it from now on.
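The relative cost of the selection can be worked out directly from the tables above. The short calculation below copies the reported R2 values by hand (the helper name `relative_drop_pct` is just for this example) and expresses each drop as a percentage of the all-features score.

```python
# R2 values copied from the result tables above: (all features, selected features).
cv_r2 = {"LightGBM": (0.530787, 0.507929), "CatBoost": (0.537908, 0.527284)}
test_r2 = {"LightGBM": (0.531715, 0.503099), "CatBoost": (0.542365, 0.520976)}


def relative_drop_pct(before, after):
    """Drop from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100


drops = {
    name: {
        "CV R2 drop %": relative_drop_pct(*cv_r2[name]),
        "Test R2 drop %": relative_drop_pct(*test_r2[name]),
    }
    for name in cv_r2
}

for name, values in drops.items():
    print(name, {metric: round(pct, 2) for metric, pct in values.items()})
```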

## 2.2) Hyperparameter Tuning

Since we have no prior knowledge of good hyperparameters for this dataset, an exhaustive `GridSearchCV` would take too long. As a starting point, we use `RandomizedSearchCV`, which samples a fixed number of configurations and keeps the computation manageable.
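To put the cost difference in numbers, the sketch below counts how many configurations an exhaustive search over the grid used in the next cell would have to evaluate, using scikit-learn's `ParameterGrid`. The variable names are illustrative; the values are the ones from this notebook's search space.

```python
from sklearn.model_selection import ParameterGrid

# Same search space as the tuning cell in this section.
search_space = {
    "catboost__learning_rate": [0.01, 0.03, 0.1],
    "catboost__depth": [4, 6, 8, 10],
    "catboost__iterations": [100, 200, 500],
    "catboost__l2_leaf_reg": [1, 3, 5, 7],
    "catboost__border_count": [32, 64, 128],
    "catboost__bagging_temperature": [0, 1, 10],
    "catboost__random_strength": [1, 10, 100],
}

n_folds = 5
n_sampled = 50  # n_iter passed to RandomizedSearchCV

n_combinations = len(ParameterGrid(search_space))
grid_fits = n_combinations * n_folds
random_fits = n_sampled * n_folds

print(f"Grid search:   {n_combinations} combinations, {grid_fits} fits")
print(f"Random search: {n_sampled} combinations, {random_fits} fits")
```

The exhaustive grid has 3,888 combinations (19,440 fits with 5 folds), against 250 fits for the randomized search.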

In [9]:
pipeline = Pipeline([('scaler', scaler), ('catboost', CatBoostRegressor(random_seed=86, thread_count=-1, verbose=0))])

param_grid = {
    'catboost__learning_rate': [0.01, 0.03, 0.1],
    'catboost__depth': [4, 6, 8, 10],
    'catboost__iterations': [100, 200, 500],
    'catboost__l2_leaf_reg': [1, 3, 5, 7],
    'catboost__border_count': [32, 64, 128],
    'catboost__bagging_temperature': [0, 1, 10],
    'catboost__random_strength': [1, 10, 100]
}

random_search = RandomizedSearchCV(estimator=pipeline, param_distributions=param_grid, n_iter=50, cv=5,  n_jobs=-1, 
                                   verbose=1, random_state=86, return_train_score=True)
random_search.fit(X_train_catboost_selected, y_train)

best_params = random_search.best_params_
best_score = random_search.best_score_
print(f"Best parameters: {best_params}")
# No `scoring` was passed, so best_score_ is the estimator's default score (R² for regressors)
print(f"Best cross-validation score (R²): {best_score:.4f}")

best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_catboost_selected)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
test_r2 = r2_score(y_test, y_pred)
print(f"Test set RMSE: {test_rmse:.4f}")
print(f"Test set R²: {test_r2:.4f}")

results = pd.DataFrame(random_search.cv_results_)
results = results.sort_values(by='rank_test_score')

Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters: {'catboost__random_strength': 1, 'catboost__learning_rate': 0.03, 'catboost__l2_leaf_reg': 3, 'catboost__iterations': 500, 'catboost__depth': 8, 'catboost__border_count': 128, 'catboost__bagging_temperature': 0}
Best cross-validation score (R²): 0.5253
Test set RMSE: 2.4722
Test set R²: 0.5217


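Hyperparameter tuning barely moved the needle. On the test set, R² went from 0.5210 to 0.5217 and RMSE from 2.4740 to 2.4722, while the cross-validated R² is essentially unchanged (0.5273 before tuning, 0.5253 after). The gain is negligible, but we keep the tuned configuration for the final model.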
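One detail worth keeping in mind when reading `best_score_`: the search above is created without a `scoring` argument, so scikit-learn falls back to the estimator's own `score` method, which for regressors is R², not a negated MSE. The sketch below checks this on synthetic data with a `Ridge` model (both are stand-ins for illustration only).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the loans dataset).
X_s, y_s = make_regression(n_samples=200, n_features=6, noise=15.0, random_state=0)
alphas = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

common = dict(param_distributions=alphas, n_iter=3, cv=5, random_state=0)

# No scoring argument: falls back to Ridge.score, i.e. R².
default_search = RandomizedSearchCV(Ridge(), **common).fit(X_s, y_s)
# Explicit R² scoring, same candidates and folds.
r2_search = RandomizedSearchCV(Ridge(), scoring="r2", **common).fit(X_s, y_s)
# Explicit negated MSE, for contrast: this score is never positive.
mse_search = RandomizedSearchCV(Ridge(), scoring="neg_mean_squared_error", **common).fit(X_s, y_s)

print("default :", default_search.best_score_)
print("r2      :", r2_search.best_score_)
print("neg MSE :", mse_search.best_score_)
```

Passing `scoring` explicitly makes the reported number unambiguous.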

# 3) Conclusion

In this notebook, we used the data processed in the other notebook to train and tune a CatBoost model that performs reasonably well: it explains about 52% of the variance in `int_rate` (test R² of 0.5217), with a typical error of roughly 2 to 2.5 points of interest rate (test RMSE of 2.47). This is acceptable for an initial assessment that gives users an idea of the interest rate they should expect. With that, we consider the modeling complete.

In [10]:
from joblib import dump
dump(best_model, 'catboost.joblib')
print("Model saved as 'catboost.joblib'")

Model saved as 'catboost.joblib'


# 4) NOTE

After deploying the model and testing it with some users, feedback showed that some features are too hard for users to estimate or guess. For that reason, we will eliminate some of these features, purely to improve the user experience.

In [26]:
X_train_catboost_selected.describe()

Statistic,fico,loan_amnt,dti,revol_util,total_bc_limit,revol_bal,total_acc,annual_inc,mths_since_recent_inq,open_acc,term_encoded,verified_Not_Verified,verified_Verified,title_Credit_card_refinancing
count,8649.0,8649.0,8649.0,8649.0,8649.0,8649.0,8649.0,8649.0,8649.0,8649.0,8649,8649,8649,8649
unique,20.0,794.0,3121.0,1002.0,401.0,7075.0,49.0,932.0,20.0,19.0,2,2,2,2
top,662.0,10000.0,12.78,61.0,5000.0,0.0,22.0,60000.0,1.0,9.0,0,False,False,False
freq,1403.0,701.0,13.0,26.0,109.0,11.0,447.0,395.0,1057.0,985.0,6081,6940,5899,6836


In [28]:
X_train_catboost_selected['revol_bal'].mean()

np.float64(9894.56364897676)
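The `describe()` output above reports `unique`, `top` and `freq` rather than `mean`, `std` and quantiles, which is what pandas does for object-dtype columns. A plausible explanation (an assumption, not verified against the original data) is that the helper function converts the frame with `.values`, and a frame mixing numeric and boolean columns becomes a single object-dtype array. The toy sketch below reproduces that behaviour and shows one way to recover the numeric dtypes; the two-column frame is made up for illustration.

```python
import pandas as pd

# Toy frame mixing a float column with a boolean dummy column (illustrative values).
mixed = pd.DataFrame(
    {"fico": [662.0, 700.0, 662.0], "verified_Verified": [False, True, False]}
)

# .values has to find one dtype for the whole array; float + bool gives object.
as_array = mixed.values
rebuilt = pd.DataFrame(as_array, columns=mixed.columns)

print(as_array.dtype)
print(rebuilt.dtypes.to_dict())
print(rebuilt.describe().index.tolist())

# infer_objects() converts object columns back to a concrete dtype where possible.
restored = rebuilt.infer_objects()
print(restored.dtypes.to_dict())
```

After restoring the dtypes, `describe()` returns the usual numeric summary for the numeric columns.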