# Monotone constraints

## Rationale:

The first tools used to make credit-scoring decisions were based on expert rules, which encoded the knowledge of professionals with wide business experience. To use this prior knowledge, we can adjust the model's learning process to take into account the relationship between each feature and the target. Boosting models support this through monotone constraints, and here we test whether they improve model performance.

## Methodology:
We define the monotone constraint for each feature, i.e. the direction of the relationship between the feature and the target, in the following way:

1. Train a linear regression model to explain the target average (average risk) for every model decile.
2. Keep the trend coefficient of that model to define the constraint.
3. The constraint is the sign of the coefficient: `(+, -, 0)`.
4. Finally, compare the performance of both models, the one with and the one without constraints.
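
The steps above can be sketched as follows. This is only an illustration of the idea, not the project's `find_constraint` from `utils.functions__utils`, whose implementation is not shown here. It assumes the deciles are taken over the feature values (via `pandas.qcut`), a straight-line fit with `numpy.polyfit`, and a hypothetical tolerance `tol` below which the slope is treated as 0. The name `sketch_find_constraint` and its parameters are introduced for this example only.

```python
import numpy as np
import pandas as pd


def sketch_find_constraint(df, feature, target, n_bins=10, tol=1e-6):
    """Hypothetical stand-in for find_constraint: returns 1, -1 or 0."""
    data = df[[feature, target]].dropna()

    # Step 1a: bucket the feature into quantile bins (deciles by default).
    # duplicates="drop" merges repeated quantile edges caused by ties.
    bins = pd.qcut(data[feature], q=n_bins, labels=False, duplicates="drop")

    # Step 1b: average target (average risk) per bin.
    avg_risk = data[target].groupby(bins).mean()
    if len(avg_risk) < 2:
        return 0  # not enough bins to estimate a trend

    # Step 2: slope of a straight line fitted to (bin index, average risk).
    x = avg_risk.index.to_numpy(dtype=float)
    y = avg_risk.to_numpy(dtype=float)
    slope = np.polyfit(x, y, deg=1)[0]

    # Step 3: the constraint is the sign of the slope (0 if it is negligible).
    if abs(slope) <= tol:
        return 0
    return int(np.sign(slope))
```

The returned values follow the `1`, `-1`, `0` encoding that LightGBM uses for increasing, decreasing, and unconstrained features.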

## Conclusions:

**Conclusions from Model Performance Table:**

1. **Boruta Variants**: The table presents the performance of two variants of the model trained on the Boruta-selected features: 'boruta vanilla' (no constraints) and 'boruta monotone' (with monotone constraints).

2. **ROC AUC Scores**: Both variants show competitive ROC AUC scores, with 'boruta vanilla' having a slightly higher out-of-fold score (0.792510) compared to 'boruta monotone' (0.791011).

3. **Validation Dataset**: The models' performance on the validation dataset ('roc_auc_val') is consistent with the out-of-fold performance, indicating that the models generalize well.

4. **Model Selection**: Choosing between these two Boruta variants may depend on other factors such as model complexity, interpretability, or specific task requirements.

5. **Further Exploration**: To make a more informed decision, it might be valuable to explore other evaluation metrics, conduct feature importance analysis, and consider the context of the problem.

In summary, we were not able to improve model performance using the monotone constraints; this may be because the relationship between the features and the target goes beyond a linear trend.


| Model             | out_of_fold | roc_auc_val |
|-------------------|-------------|-------------|
| boruta vanilla    | 0.792510    | 0.799865    |
| boruta monotone   | 0.791011    | 0.799071    |



In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from lightgbm import LGBMClassifier as lgbm
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from copy import deepcopy

import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.append("../")

# local imports
from src.learner_params import target_column, space_column, boruta_learner_params, test_params
from utils.functions__utils import find_constraint

from utils.feature_selection_lists import fw_features, boruta_features, optuna_features, ensemble_features

from utils.functions__training import model_pipeline

In [8]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")

In [9]:
monotone_const_dict = {}
for feature in boruta_features:
    aux = find_constraint(train_df, feature, target_column)
    monotone_const_dict[feature] = aux

In [10]:
mc_params = deepcopy(test_params)
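# One constraint per feature, in the same order as the features passed to the model.
mc_params["learner_params"]["extra_params"]["monotone_constraints"] = [monotone_const_dict[feature] for feature in boruta_features]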
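
LightGBM's `monotone_constraints` parameter expects one entry per feature, in the same order as the feature columns, with values `1`, `-1`, or `0`. A minimal sketch of building and validating that list from a dictionary like the one above follows. `build_constraint_list` is a hypothetical helper, not part of this project's `utils`, and the feature names in the usage lines are made up for illustration.

```python
def build_constraint_list(features, constraint_by_feature):
    """Hypothetical helper: return constraints aligned with `features`."""
    allowed = {-1, 0, 1}
    constraints = []
    for name in features:
        if name not in constraint_by_feature:
            raise KeyError(f"no constraint defined for feature {name!r}")
        value = constraint_by_feature[name]
        if value not in allowed:
            raise ValueError(f"constraint for {name!r} must be -1, 0 or 1, got {value}")
        constraints.append(int(value))
    return constraints


# Made-up feature names, for illustration only.
example_features = ["income", "n_late_payments", "account_age"]
example_constraints = {"n_late_payments": 1, "account_age": 0, "income": -1}
print(build_constraint_list(example_features, example_constraints))  # prints [-1, 1, 0]
```

Indexing by feature name rather than relying on dictionary order keeps the list aligned even if the feature list is later reordered.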

In [31]:
boruta_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = test_params,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-09-23T14:43:15 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-23T14:43:16 | INFO | Training for fold 1
2023-09-23T14:45:42 | INFO | Training for fold 2
2023-09-23T14:48:10 | INFO | Training for fold 3
2023-09-23T14:50:44 | INFO | CV training finished!
2023-09-23T14:50:44 | INFO | Training the model in the full dataset...
2023-09-23T14:54:05 | INFO | Training process finished!
2023-09-23T14:54:05 | INFO | Calculating metrics...
2023-09-23T14:54:05 | INFO | Full process finished in 10.87 minutes.


In [11]:
challenger_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = mc_params,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-09-23T23:48:09 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-23T23:48:11 | INFO | Training for fold 1
2023-09-23T23:52:06 | INFO | Training for fold 2
2023-09-23T23:56:02 | INFO | Training for fold 3
2023-09-24T00:00:14 | INFO | CV training finished!
2023-09-24T00:00:14 | INFO | Training the model in the full dataset...
2023-09-24T00:05:53 | INFO | Training process finished!
2023-09-24T00:05:53 | INFO | Calculating metrics...
2023-09-24T00:05:53 | INFO | Full process finished in 17.76 minutes.


In [42]:
model_metrics = {}
models = [boruta_logs, challenger_logs]
names = ["boruta vanilla", "boruta monotone"]

for model, name in zip(models, names):
    model_metrics[name] = model["metrics"]["roc_auc"]
pd.DataFrame(model_metrics).T.sort_values(by = "roc_auc_val", ascending = False)

| Model             | out_of_fold | roc_auc_val |
|-------------------|-------------|-------------|
| boruta vanilla    | 0.79251     | 0.799865    |
| boruta monotone   | 0.791011    | 0.799071    |
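
To put the gap between the two models in perspective, the differences can be computed directly from the scores reported above. The snippet below is a small sketch that re-enters those scores by hand (in the notebook the same frame would come from `model_metrics`); the variable names are illustrative.

```python
import pandas as pd

# Scores copied from the results table above.
metrics = pd.DataFrame(
    {
        "out_of_fold": [0.792510, 0.791011],
        "roc_auc_val": [0.799865, 0.799071],
    },
    index=["boruta vanilla", "boruta monotone"],
)

# Negative values mean the constrained model scored lower than the unconstrained one.
delta = metrics.loc["boruta monotone"] - metrics.loc["boruta vanilla"]
print(delta.round(6))
```

Both differences are below 0.002 AUC. Whether a gap of this size is meaningful depends on the fold-to-fold variance, which the logs above do not report.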
