# 04 - Model Comparison

### Goals:
- Train and evaluate different models.
- Compare model performance using metrics such as Accuracy, Precision, Recall, F1-score, and ROC-AUC.
- Visualize results and select the best model for predicting loan default probability.

## 1. Load Data and Test


In [1]:
import pandas as pd
import os

# Define data paths
data_dir = '../data'
processed_df_path = os.path.join(data_dir, 'processed_df.parquet')
# processed_df = pd.read_parquet("../data/processed_df.parquet", memory_map=False, engine="pyarrow")
processed_df = pd.read_parquet("/tmp/processed_df.parquet")


balanced_df = processed_df.groupby('target').sample(n=5500, random_state=42).reset_index(drop=True)
print(balanced_df['target'].value_counts())

nan_counts = balanced_df.isna().sum()

0    5500
1    5500
Name: target, dtype: int64
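The cell above balances the classes by undersampling each one to a fixed 5500 rows. The sketch below shows the same `groupby(...).sample(...)` pattern on a toy frame. As an illustration only (the notebook hard-codes the size), it derives the per-class size from the smallest class; `toy` and `n_per_class` are hypothetical names.

```python
import pandas as pd

# Toy, imbalanced data: 8 rows of class 0 and 3 rows of class 1 (illustrative only).
toy = pd.DataFrame({"target": [0] * 8 + [1] * 3, "x": range(11)})

# Use the size of the smallest class so sample() never asks for more rows
# than a group contains (which would raise a ValueError without replace=True).
n_per_class = int(toy["target"].value_counts().min())

balanced_toy = (
    toy.groupby("target")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)
print(balanced_toy["target"].value_counts().sort_index())
```

Deriving the size this way avoids an error if the processed data ever contains fewer than the hard-coded number of rows in one class.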


In [2]:
from _Model_Comparator_class import modelComparator

In [3]:
model_to_test = ['logistic_regression', 'xgboost', 'random_forest', 'svc', 'knn', 'gradient_boosting', 'adaboost', 'gaussian_nb']

model_comparator = modelComparator(balanced_df)
model_comparator.compare_models(model_to_test, kfolds=10, n_trials=15)

{'Best Params': {'model__n_estimators': 304, 'model__learning_rate': 0.14235407208694828, 'model__random_state': 42}, 'Recall': 0.92, 'Precision': 0.8083067092651757, 'Accuracy': 0.850909090909091, 'F1 Score': 0.8605442176870748, 'AUC ROC': 0.8509090909090908, 'Model': 'adaboost', 'Processing Time (s)': 209.4823019504547}
Testing model: gaussian_nb
{'Best Params': {}, 'Recall': 0.9872727272727273, 'Precision': 0.6609860012172855, 'Accuracy': 0.7404545454545455, 'F1 Score': 0.7918337586584032, 'AUC ROC': 0.7404545454545455, 'Model': 'gaussian_nb', 'Processing Time (s)': 0.06160426139831543}




| | Best Params | Recall | Precision | Accuracy | F1 Score | AUC ROC | Model | Processing Time (s) |
|---|---|---|---|---|---|---|---|---|
| 0 | {'model__penalty': 'l2', 'model__C': 1.8523547... | 0.884545 | 0.806131 | 0.835909 | 0.84352 | 0.835909 | logistic_regression | 48.498645 |
| 1 | {'model__n_estimators': 250, 'model__max_depth... | 0.902727 | 0.82202 | 0.853636 | 0.860485 | 0.853636 | xgboost | 257.482895 |
| 2 | {'model__n_estimators': 298, 'model__max_depth... | 0.931818 | 0.81869 | 0.862727 | 0.871599 | 0.862727 | random_forest | 521.000919 |
| 3 | {'model__C': 0.30814016839651326, 'model__kern... | 0.916364 | 0.7875 | 0.834545 | 0.847059 | 0.834545 | svc | 205.566595 |
| 4 | {'model__n_neighbors': 17, 'model__weights': '... | 0.74 | 0.714662 | 0.722273 | 0.72711 | 0.722273 | knn | 35.727542 |
| 5 | {'model__n_estimators': 440, 'model__learning_... | 0.918182 | 0.827869 | 0.863636 | 0.87069 | 0.863636 | gradient_boosting | 972.399217 |
| 6 | {'model__n_estimators': 304, 'model__learning_... | 0.92 | 0.808307 | 0.850909 | 0.860544 | 0.850909 | adaboost | 209.482302 |
| 7 | {} | 0.987273 | 0.660986 | 0.740455 | 0.791834 | 0.740455 | gaussian_nb | 0.061604 |
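One way to assemble a table like this from per-model metric dictionaries is to collect the dictionaries in a list and build the frame once. This is a sketch, not the `modelComparator` implementation: `metrics_rows` is a hypothetical name, and the two rows reuse rounded values from the printed output above. Building from a list also works on pandas 2.x, where `DataFrame.append` no longer exists.

```python
import pandas as pd

# Metric dictionaries shaped like the ones printed by compare_models
# (values rounded from the output above; 'Best Params' omitted for brevity).
metrics_rows = [
    {"Model": "gaussian_nb", "Recall": 0.987273, "Precision": 0.660986,
     "Accuracy": 0.740455, "F1 Score": 0.791834, "AUC ROC": 0.740455},
    {"Model": "adaboost", "Recall": 0.92, "Precision": 0.808307,
     "Accuracy": 0.850909, "F1 Score": 0.860544, "AUC ROC": 0.850909},
]

# Build the frame in one call, then rank models by AUC ROC.
results_df = (
    pd.DataFrame(metrics_rows)
    .sort_values("AUC ROC", ascending=False)
    .reset_index(drop=True)
)
print(results_df)
```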


In this comparison, gradient boosting achieved the highest AUC ROC and Accuracy (both 0.864), with a recall of 0.918 and an F1 score of 0.871, at the cost of the longest processing time (972 seconds). Random forest was nearly identical (AUC ROC 0.863, recall 0.932, F1 score 0.872), while xgboost and adaboost followed at 0.854 and 0.851. Logistic regression and SVC reached AUC ROC values around 0.835. KNN was the weakest model overall (AUC ROC 0.722). Gaussian NB had the highest recall (0.987) but the lowest precision (0.661), which left its AUC ROC at 0.740.
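In the results table, the AUC ROC column equals the Accuracy column in every row. That pattern is what one would see if ROC AUC were computed from hard class predictions on a class-balanced test set; this is an inference from the table, not something confirmed by the `modelComparator` source. The sketch below reproduces the effect on synthetic data (all names and data here are illustrative assumptions) and contrasts it with ROC AUC computed from predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, exactly balanced data: class 1 is shifted by +1 in every feature.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 1000)
X = rng.normal(size=(2000, 5)) + y[:, None]

# A stratified split keeps the test set balanced (200 rows per class).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)               # hard 0/1 labels
y_score = clf.predict_proba(X_test)[:, 1]  # probability of class 1

acc = accuracy_score(y_test, y_pred)
auc_from_labels = roc_auc_score(y_test, y_pred)   # equals (TPR + TNR) / 2
auc_from_scores = roc_auc_score(y_test, y_score)  # threshold-free ranking metric

print(f"accuracy={acc:.4f}  auc(labels)={auc_from_labels:.4f}  auc(scores)={auc_from_scores:.4f}")
```

If the comparator does work this way, passing `predict_proba` scores to `roc_auc_score` would make the AUC ROC column carry information beyond Accuracy.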

## 2. Feature Engineering

We create new features, grounded in financial logic, from the existing ones.

In [None]:
new_feature_df = balanced_df.copy()

# Debt-to-Income Adjusted Ratio
new_feature_df['dti_adjusted'] = new_feature_df['dti'] + (new_feature_df['annual_inc'] / new_feature_df['annual_inc'].mean())

# Income-to-Loan Ratio
new_feature_df['income_to_loan'] = new_feature_df['annual_inc'] / new_feature_df['loan_amnt']

# Income per Installment
new_feature_df['income_per_installment'] = new_feature_df['annual_inc'] / new_feature_df['installment']

# Loan Amount Percentile
new_feature_df['loan_amount_percentile'] = pd.qcut(new_feature_df['loan_amnt'], 10, labels=False)

# Installment-to-Income Ratio
new_feature_df['installment_to_income'] = new_feature_df['installment'] / (new_feature_df['annual_inc'] / 12)

# Repayment Progress (computed from 'last_pymnt_amnt' relative to the loan amount; the outstanding principal 'out_prncp' is not used here)
new_feature_df['repayment_progress'] = 1 - (new_feature_df['last_pymnt_amnt'] / new_feature_df['loan_amnt'])
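Several of the features above are ratios, so a zero in `loan_amnt`, `installment`, or `annual_inc` would produce `inf` values. Whether such zeros occur in this dataset is not shown here, so the guard below is a sketch under that assumption; `safe_ratio` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)


# Toy rows (illustrative only); the last one has a zero loan amount.
toy = pd.DataFrame({
    "annual_inc": [60000.0, 0.0, 45000.0],
    "loan_amnt": [10000.0, 5000.0, 0.0],
})
toy["income_to_loan"] = safe_ratio(toy["annual_inc"], toy["loan_amnt"])
print(toy)
```

The resulting NaN values can then be handled by whichever imputation step the pipeline already applies.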

In [5]:
model_to_test = ['logistic_regression', 'xgboost', 'random_forest', 'svc', 'knn', 'gradient_boosting', 'adaboost', 'gaussian_nb']

model_comparator_new_feature = modelComparator(new_feature_df)
model_comparator_new_feature.compare_models(model_to_test, kfolds=10, n_trials=20)

{'Best Params': {'model__n_estimators': 448, 'model__learning_rate': 0.9883477038666212, 'model__random_state': 42}, 'Recall': 0.9172727272727272, 'Precision': 0.8401332223147377, 'Accuracy': 0.8713636363636363, 'F1 Score': 0.8770099956540635, 'AUC ROC': 0.8713636363636363, 'Model': 'adaboost', 'Processing Time (s)': 427.5174045562744}
Testing model: gaussian_nb
{'Best Params': {}, 'Recall': 0.9845454545454545, 'Precision': 0.7273337810611148, 'Accuracy': 0.8077272727272727, 'F1 Score': 0.8366164542294322, 'AUC ROC': 0.8077272727272726, 'Model': 'gaussian_nb', 'Processing Time (s)': 0.06751108169555664}




| | Best Params | Recall | Precision | Accuracy | F1 Score | AUC ROC | Model | Processing Time (s) |
|---|---|---|---|---|---|---|---|---|
| 0 | {'model__penalty': 'l1', 'model__C': 0.1051764... | 0.913636 | 0.810484 | 0.85 | 0.858974 | 0.85 | logistic_regression | 49.946633 |
| 1 | {'model__n_estimators': 350, 'model__max_depth... | 0.960909 | 0.928822 | 0.943636 | 0.944593 | 0.943636 | xgboost | 540.60002 |
| 2 | {'model__n_estimators': 297, 'model__max_depth... | 0.948182 | 0.906169 | 0.925 | 0.926699 | 0.925 | random_forest | 967.051095 |
| 3 | {'model__C': 0.21767257562818648, 'model__kern... | 0.941818 | 0.792049 | 0.847273 | 0.860465 | 0.847273 | svc | 290.485037 |
| 4 | {'model__n_neighbors': 19, 'model__weights': '... | 0.816364 | 0.755892 | 0.776364 | 0.784965 | 0.776364 | knn | 60.502779 |
| 5 | {'model__n_estimators': 170, 'model__learning_... | 0.957273 | 0.934339 | 0.945 | 0.945667 | 0.945 | gradient_boosting | 1091.760772 |
| 6 | {'model__n_estimators': 448, 'model__learning_... | 0.917273 | 0.840133 | 0.871364 | 0.87701 | 0.871364 | adaboost | 427.517405 |
| 7 | {} | 0.984545 | 0.727334 | 0.807727 | 0.836616 | 0.807727 | gaussian_nb | 0.067511 |


After feature engineering, the AUC ROC score improved for every model, most of all for the tree-based boosting models: Gradient Boosting rose from 0.864 to 0.945 and xgboost from 0.854 to 0.944. Part of the change may come from the larger tuning budget in this run (n_trials=20 instead of 15), but the size of the gain suggests that the engineered features added information that lets these models capture more complex patterns.
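The per-model change can be checked directly from the two results tables. The sketch below copies the AUC ROC values from those tables into two Series (`auc_before` and `auc_after` are hypothetical names) and computes the difference.

```python
import pandas as pd

models = ["logistic_regression", "xgboost", "random_forest", "svc",
          "knn", "gradient_boosting", "adaboost", "gaussian_nb"]

# AUC ROC values copied from the two results tables in this notebook.
auc_before = pd.Series(
    [0.835909, 0.853636, 0.862727, 0.834545, 0.722273, 0.863636, 0.850909, 0.740455],
    index=models,
)
auc_after = pd.Series(
    [0.85, 0.943636, 0.925, 0.847273, 0.776364, 0.945, 0.871364, 0.807727],
    index=models,
)

# Gain per model, largest first.
delta = (auc_after - auc_before).sort_values(ascending=False)
print(delta.round(3))
```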

## 3. Feature Selection
Because Gradient Boosting is the best-performing model, we focus on it from here on.

In [None]:
model_to_test = ['gradient_boosting']

model_comparator_best_param = modelComparator(new_feature_df)
model_comparator_best_param.compare_models(model_to_test, kfolds=10, n_trials=20)


Next, we examine the feature importance scores to identify the most influential features in the model. By keeping only the features with an importance score above 1% (0.01), we refine the dataset to its most predictive elements.

In [None]:
import matplotlib.pyplot as plt

# Access the fitted Gradient Boosting estimator from the pipeline.
# This assumes compare_models() stores the fitted pipeline as `best_model`.
gradient_boosting_model = model_comparator_best_param.best_model.named_steps['model']

# Get the feature importances and feature names
importances = gradient_boosting_model.feature_importances_
feature_names = model_comparator_best_param.X_train.columns

# Create a DataFrame for easy sorting and filtering
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Filter for the top features
filtered_feature_importance = importance_df[importance_df['Importance'] > 0.01].set_index('Feature')['Importance'].to_dict()

print(f'Number of features initially: {importance_df.shape[0]}')
print(f'Number of features after selection: {len(filtered_feature_importance)}')

# Plot the filtered feature importance
plt.figure(figsize=(10, 6))
plt.barh(list(filtered_feature_importance.keys()), list(filtered_feature_importance.values()), color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Feature Importance for Gradient Boosting Model in Pipeline")
plt.show()

AttributeError: 'modelComparator' object has no attribute 'best_model'

The model starts with 60 features. Keeping only those with an importance score greater than 1% narrows the set to 8.
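The selection step can also be sketched without `modelComparator`, by fitting a scikit-learn `Pipeline` directly and reading `feature_importances_` from its final step. The data, the feature names, and the pipeline layout below are assumptions made for illustration; only the 1% threshold and the `'model'` step name come from the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with 12 features, of which 3 are informative (illustrative only).
X_arr, y = make_classification(
    n_samples=500, n_features=12, n_informative=3, n_redundant=0, random_state=42
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

# A pipeline whose final step is named 'model', as in the notebook.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X, y)

# Importances of the fitted estimator, indexed by feature name.
importances = pd.Series(
    pipe.named_steps["model"].feature_importances_, index=X.columns
).sort_values(ascending=False)

# Keep features whose importance exceeds 1%.
selected = importances[importances > 0.01]
print(f"Number of features initially: {len(importances)}")
print(f"Number of features after selection: {len(selected)}")
```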

We train the model again to see if there is an improvement.

In [None]:
# Get the list of keys from filtered_feature_importance
feature_keys = list(filtered_feature_importance.keys())

feature_selec_df = new_feature_df[['target'] + feature_keys]

In [None]:
model_to_test = ['gradient_boosting']
model_comparator_feature_selec = modelComparator(feature_selec_df)
model_comparator_feature_selec.compare_models(model_to_test, kfolds=10, n_trials=25)

After feature selection, the performance of the Gradient Boosting model improved further. With all 60 features, the model achieved an Accuracy of 0.943182; with only the 8 selected features, Accuracy rose to 0.959545, a gain of about 1.6 percentage points.

In [None]:
import joblib

feature_selec_df.to_parquet(os.path.join(data_dir, 'feature_selec_df.parquet'))

# Save the best model (create the output directory if it does not exist yet)
model_dir = 'model'
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, 'best_gradient_boosting_model.joblib')
joblib.dump(model_comparator_feature_selec.best_model, model_path)

print(f'Model saved to {model_path}')

NameError: name 'feature_selec_df' is not defined
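A quick round-trip check can confirm that a saved model reloads and predicts identically. The sketch below is self-contained: it trains a small stand-in model on synthetic data and writes to a temporary directory, both assumptions made so the example does not depend on the objects defined in earlier cells.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model trained on synthetic data (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
model = GradientBoostingClassifier(n_estimators=20, random_state=42).fit(X, y)

# Create the target directory before dumping; joblib.dump does not create it.
model_dir = os.path.join(tempfile.mkdtemp(), "model")
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, "best_gradient_boosting_model.joblib")
joblib.dump(model, model_path)

# Reload and verify that predictions are unchanged.
reloaded = joblib.load(model_path)
same_predictions = bool((reloaded.predict(X) == model.predict(X)).all())
print(f"Model saved to {model_path}; predictions identical: {same_predictions}")
```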

## 4. Conclusion
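- Gradient Boosting is the best model on the evaluation metrics, reaching an Accuracy of 0.959545 with the engineered features reduced to the 8 most important ones.
- Future improvements may include more extensive hyperparameter tuning or additional feature engineering.
- Next step: deploy the best model for production use.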