# Linear trees

## Rationale:
We want to test whether a more complex model design improves model performance.

## Methodology:
We test the ```linear_trees``` parameter of the ```LightGBM``` API:

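1. Train a baseline model
2. Train a challenger model with linear trees enabled, keeping the same features and settings
3. Compare the ROC AUC of both models on the out-of-fold predictions and on the validation set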
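
To make the idea behind the parameter concrete: with linear trees, splits are chosen as usual, but each leaf predicts with a linear model instead of a constant. The sketch below is a toy illustration in plain Python, not LightGBM's implementation; the data, the single fixed split at 0.5, and all helper names are assumptions made for the example.

```python
# Toy comparison of constant leaves vs. linear leaves on one fixed split.
# Illustrative only: data, split point and helper names are hypothetical.

def mean(values):
    return sum(values) / len(values)

def fit_constant(xs, ys):
    """Constant leaf: predict the mean target of the leaf."""
    c = mean(ys)
    return lambda x: c

def fit_linear(xs, ys):
    """Linear leaf: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, ys, split, fit_leaf):
    """One split, with one leaf model fitted on each side."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= split]
    right = [(x, y) for x, y in zip(xs, ys) if x > split]
    left_model = fit_leaf(*zip(*left))
    right_model = fit_leaf(*zip(*right))
    return lambda x: left_model(x) if x <= split else right_model(x)

def mse(model, xs, ys):
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

xs = [i / 20 for i in range(21)]
ys = [2 * x if x <= 0.5 else 3 - 4 * x for x in xs]  # piecewise linear target

constant_mse = mse(fit_stump(xs, ys, 0.5, fit_constant), xs, ys)
linear_mse = mse(fit_stump(xs, ys, 0.5, fit_linear), xs, ys)
print(f"constant leaves MSE: {constant_mse:.4f}")
print(f"linear leaves MSE:   {linear_mse:.4f}")
```

On a target that is linear within each leaf, the linear leaves fit almost exactly while the constant leaves cannot. On real data the extra flexibility is not guaranteed to help.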

## Conclusions:
The results show that linear trees did not improve performance. ROC AUC decreased by ```~15PP``` out of fold and by ```~20PP``` on the validation set. A likely cause is the added model complexity, in particular because the linear models fitted in the leaves bring their own hyperparameters, which we would need to tune.

| Model             | out_of_fold | validation |
|-------------------|------------|------------|
| boruta vanilla    | 0.862986   | 0.865603   |
| fw linear trees   | 0.708666   | 0.669988   |
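
One possible follow-up on the hyperparameter concern is to sweep the regularization of the leaf models. The sketch below only builds candidate parameter dicts and trains nothing. It assumes the nested ```learner_params``` / ```extra_params``` layout used in this notebook and uses a hypothetical stand-in for ```test_params```. ```linear_lambda``` is LightGBM's regularization parameter for linear trees; the values listed are arbitrary starting points, not tuned recommendations.

```python
from copy import deepcopy

# Hypothetical stand-in for test_params; the real dict lives in src.learner_params.
base_params = {"learner_params": {"extra_params": {"verbose": -1}}}

# Arbitrary candidate values for the leaf-model regularization (assumption).
linear_lambda_grid = [0.0, 0.1, 1.0, 10.0]

candidates = []
for lam in linear_lambda_grid:
    params = deepcopy(base_params)  # keep the baseline dict untouched
    extra = params["learner_params"]["extra_params"]
    extra["linear_trees"] = True
    extra["linear_lambda"] = lam
    candidates.append(params)

for params in candidates:
    print(params["learner_params"]["extra_params"])
```

Each candidate could then be passed as ```params``` to the same training pipeline and compared against the baseline on the validation set.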


In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd

from lightgbm import LGBMClassifier as lgbm
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from copy import deepcopy

import warnings
warnings.filterwarnings("ignore")

import sys
sys.path.append("../")

# local imports
from src.learner_params import target_column, boruta_learner_params, test_params
from utils.functions__utils import find_constraint

from utils.feature_selection_lists import fw_features, boruta_features, optuna_features, ensemble_features

from utils.functions__training import model_pipeline

In [2]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")

In [3]:
lt_params = deepcopy(test_params)
lt_params["learner_params"]["extra_params"]["linear_trees"] = True
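
The ```deepcopy``` matters here because the parameters are nested: a shallow copy would share the inner dicts, so enabling linear trees would also change the baseline run. A minimal sketch, using a hypothetical stand-in for ```test_params``` (its real contents are not shown in this notebook):

```python
from copy import copy, deepcopy

# Hypothetical stand-in; only the nesting mirrors the notebook.
test_params_demo = {"learner_params": {"extra_params": {}}}

shallow = copy(test_params_demo)
shallow["learner_params"]["extra_params"]["linear_trees"] = True
# The inner dict is shared, so the original was modified as well.
leaked = "linear_trees" in test_params_demo["learner_params"]["extra_params"]
print("shallow copy leaked into original:", leaked)

# Reset and repeat with a deep copy.
test_params_demo = {"learner_params": {"extra_params": {}}}
deep = deepcopy(test_params_demo)
deep["learner_params"]["extra_params"]["linear_trees"] = True
isolated = "linear_trees" not in test_params_demo["learner_params"]["extra_params"]
print("deep copy left original untouched:", isolated)
```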

In [4]:
base_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = test_params,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-10-08T15:18:05 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-08T15:18:05 | INFO | Training for fold 1
2023-10-08T15:18:23 | INFO | Training for fold 2
2023-10-08T15:18:40 | INFO | Training for fold 3
2023-10-08T15:18:57 | INFO | CV training finished!
2023-10-08T15:18:57 | INFO | Training the model in the full dataset...
2023-10-08T15:19:17 | INFO | Training process finished!
2023-10-08T15:19:17 | INFO | Calculating metrics...
2023-10-08T15:19:17 | INFO | Full process finished in 1.19 minutes.
2023-10-08T15:19:17 | INFO | Saving the predict function.
2023-10-08T15:19:17 | INFO | Predict function saved.


In [5]:
challenger_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = lt_params,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-10-08T15:19:17 | INFO | linear trees will be applied so training time may increase significantly.
2023-10-08T15:19:17 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-10-08T15:19:17 | INFO | Training for fold 1
2023-10-08T15:19:33 | INFO | Training for fold 2
2023-10-08T15:19:52 | INFO | Training for fold 3
2023-10-08T15:20:11 | INFO | CV training finished!
2023-10-08T15:20:11 | INFO | Training the model in the full dataset...
2023-10-08T15:20:34 | INFO | Training process finished!
2023-10-08T15:20:34 | INFO | Calculating metrics...
2023-10-08T15:20:34 | INFO | Full process finished in 1.29 minutes.
2023-10-08T15:20:34 | INFO | Saving the predict function.
2023-10-08T15:20:34 | INFO | Predict function saved.


In [6]:
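model_metrics = {}
models = [base_logs, challenger_logs]
# Both runs use boruta_features; the "fw" label is kept to match the reported table.
names = ["boruta vanilla", "fw linear trees"]

# Collect ROC AUC (out-of-fold and validation) for each model
for model, name in zip(models, names):
    model_metrics[name] = model["metrics"]["roc_auc"]

pd.DataFrame(model_metrics).T.sort_values(by="validation", ascending=False)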

|                   | out_of_fold | validation |
|-------------------|-------------|------------|
| boruta vanilla    | 0.862986    | 0.865603   |
| fw linear trees   | 0.708666    | 0.669988   |
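
The size of the drop can be read off the output above in percentage points (PP). A small sketch, with the values copied from that output and the variable names chosen for illustration:

```python
# ROC AUC values copied from the comparison output.
metrics = {
    "boruta vanilla": {"out_of_fold": 0.862986, "validation": 0.865603},
    "fw linear trees": {"out_of_fold": 0.708666, "validation": 0.669988},
}

baseline = metrics["boruta vanilla"]
challenger = metrics["fw linear trees"]

# Difference in percentage points: (challenger - baseline) * 100.
delta_pp = {split: (challenger[split] - baseline[split]) * 100 for split in baseline}

for split, delta in delta_pp.items():
    print(f"{split}: {delta:+.2f} PP")
```

This prints a change of -15.43 PP out of fold and -19.56 PP on the validation set.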
