## Dependencies

Install the dependencies that are not available by default on Google Colab.
Colab already provides numpy, pandas, sklearn, tensorflow, scipy, etc. (see requirements.txt).

In [None]:
!pip install pinard
!pip install scikeras

## Benchmark details

The results aggregate every combination of the following training configurations:
- estimation configurations: [regression, classification]
- dataset configurations: [Single Train, Cross validation with 5 folds and 2 repeats, Augmented Single Train]
- preprocessing configurations: [flat spectrum, savgol, haar, [small_set], [big_set]]
- models:
   - for all configurations: BACON, BACON-VG, DECON, PLS (components from 1 to 100), XGBoost, LW-PLS
   - for single train + small_set: Stack > [BACON, BACON-VG, DECON, PLS (components from 1 to 100), XGBoost, LW-PLS,
     f_PLSRegression, f_AdaBoostRegressor, f_BaggingRegressor, f_ExtraTreesRegressor, f_GradientBoostingRegressor, f_RandomForestRegressor,
     f_ARDRegression, f_BayesianRidge, f_ElasticNet, f_ElasticNetCV, f_HuberRegressor, f_LarsCV, f_LassoCV, f_Lasso, f_LassoLars, f_LassoLarsCV,
     f_LassoLarsIC, f_LinearRegression, f_OrthogonalMatchingPursuit, f_OrthogonalMatchingPursuitCV, f_PassiveAggressiveRegressor, f_RANSACRegressor,
     f_Ridge, f_RidgeCV, f_SGDRegressor, f_TheilSenRegressor, f_GaussianProcessRegressor, f_KNeighborsRegressor, f_Pipeline, f_MLPRegressor, f_LinearSVR,
     f_NuSVR, f_SVR, f_DecisionTreeRegressor, f_ExtraTreeRegressor, f_KernelRidge, f_XGBRegressor]
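
To make "combination" concrete, the grid for the models run in all configurations can be sketched as a Cartesian product. This is only an illustration: the string labels are hypothetical shorthand for the entries listed above, PLS is counted as a single entry rather than one per component count, and the stacked model is left out.

```python
from itertools import product

# Hypothetical labels standing in for the configurations listed above.
estimation = ["regression", "classification"]
datasets = ["single_train", "cv_5_folds_2_repeats", "augmented_single_train"]
preprocessing = ["flat_spectrum", "savgol", "haar", "small_set", "big_set"]
models = ["BACON", "BACON-VG", "DECON", "PLS", "XGBoost", "LW-PLS"]

# Every (estimation, dataset, preprocessing, model) tuple is one configuration.
grid = list(product(estimation, datasets, preprocessing, models))
print(len(grid))  # 2 * 3 * 5 * 6 = 180
```

Each cross-validation configuration additionally multiplies the number of trainings by its number of folds and repeats.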

Training is performed in two steps, (1) data transformation and (2) model training, because the sklearn pipeline does not natively use the test data.
This is expected to change with a future pinard update.

In [None]:
### FAST GPU RESET ####
from numba import cuda 
device = cuda.get_current_device()
device.reset()

In [4]:
# Browse the dataset path and launch the benchmark for every folder
%load_ext autoreload
%autoreload 2

from pathlib import Path
from preprocessings import preprocessing_list

from benchmark_loop import benchmark_dataset

import tensorflow as tf

tf.get_logger().setLevel("ERROR")
tf.keras.mixed_precision.set_global_policy("mixed_float16")

rootdir = Path('data/regression')
folder_list = [f for f in rootdir.glob('**/*') if f.is_dir()]

SEED = ord('D') + 31373

import preprocessings
import regressors
import pinard.preprocessing as pp
from pinard import augmentation, model_selection
from sklearn.cross_decomposition import PLSRegression
from xgboost import XGBRegressor
import sys
import os

def str_to_class(classname):
    return getattr(sys.modules['pinard.preprocessing'], classname)

# print(str_to_class('SavitzkyGolay'))




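def get_dataset_list(path):
    """Return the paths of the first three sub-folders found under `path`."""
    datasets = []
    for root, dirs, _ in os.walk(path):
        for folder in dirs:
            folder_path = os.path.join(root, folder)
            # Keep at most three datasets.
            if os.path.isdir(folder_path) and len(datasets) < 3:
                datasets.append(folder_path)
    return datasets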

split_configs = [
    None,
    # {'test_size':None, 'method':"random", 'random_state':SEED},
    # {'test_size':None, 'method':"stratified", 'random_state':SEED, 'n_bins':5},
    # {'test_size':None, 'method':"kennard_stone", 'random_state':SEED, 'metric':"euclidean", 'pca_components':None},
]

augmentations = [
    # None,
    [(2, augmentation.Rotate_Translate()),
    (1, augmentation.Random_X_Operation()),
    # (1, augmentation.Random_X_Spline_Deformation()),
    (1, augmentation.Random_Spline_Addition()),]
]

preprocessings_list = [
    # preprocessings.id_preprocessing(),
    # [('haar', pp.Haar()), ('savgol', pp.SavitzkyGolay())],
    preprocessings.decon_set(),
    # preprocessings.dumb_set(),
]



cv_configs = [
    # None,
    # {'n_splits':5, 'n_repeats':4},
    {'n_splits':4, 'n_repeats':2},
]

# Collect the dataset folders (get_dataset_list returns at most three)
folder = "data/regression"
folders = get_dataset_list(folder)

len_cv_configs = 0
for c in cv_configs:
    if c is None:
        len_cv_configs += 1
    else:
        len_cv_configs += (c['n_splits'] * c['n_repeats'])

models = [
    # (regressors.ML_Regressor(XGBRegressor), {"n_estimators":200, "max_depth":50, "seed":SEED}),
    # (regressors.Decon(), {'batch_size':512, 'epoch':1000, 'verbose':0, 'patience':250, 'optimizer':'Adam', 'loss':'mse'}),
    (regressors.Transformer_VG(), {'batch_size':60, 'epoch':200, 'verbose':0, 'patience':30, 'optimizer':'Adam', 'loss':'mse'}),
]

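# Note: benchmark_size is computed over every folder in `folders`, so it may
# exceed the number of runs launched when only one dataset is passed below.
benchmark_size = len(folders) * len(split_configs) * len_cv_configs * len(augmentations) * len(preprocessings_list) * len(models)
print("Benchmarking", benchmark_size, "runs.")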


# benchmark_dataset(folders, split_configs, cv_configs, augmentations, preprocessings_list, models, SEED,)
benchmark_dataset(["data/regression/ALPINE_Calpine_424_Murguzur_RMSE1.36"], split_configs, cv_configs, augmentations, preprocessings_list, models, SEED,)


# for folder in folder_list:
    # # print(ord(str(folder)[17]), ord('A'), ord('M'))
    # if ord(str(folder)[16]) < ord("L") or ord(str(folder)[16]) > ord("M"):
    #     continue
    # benchmark_dataset(folder, SEED, preprocessing_list(), 20, augment=False)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Benchmarking 24 runs.
Transformer_VG-NoSpl-CV_4_2-Fold_1(8)-Aug_2RT_1RXO_1RSA-PP_22_-1452-31441_22-12-14_11-24-05 (816, 2151) (816, 1) (68, 2151, 22) (68, 1)
--- Trainable: 20629 - untrainable: 0.0 > 20629.0




Epoch: 0 > RMSE: 4.461751 ( 1.36 ) - R²: -1.4918706917578297  val_loss 0.031275130808353424
Epoch: 1 > RMSE: 4.7474813 ( 1.36 ) - R²: -1.8212483060606939  val_loss 0.02151399478316307
Epoch: 2 > RMSE: 4.2011185 ( 1.36 ) - R²: -1.2092489472606833  val_loss 0.018280029296875
Epoch: 3 > RMSE: 3.7649891 ( 1.36 ) - R²: -0.7743620972546299  val_loss 0.014913670718669891
Epoch: 4 > RMSE: 3.48141 ( 1.36 ) - R²: -0.5171381852053241  val_loss 0.011570201255381107
Epoch: 5 > RMSE: 3.1640036 ( 1.36 ) - R²: -0.25310879711947565  val_loss 0.010711221024394035
Epoch: 6 > RMSE: 2.8755052 ( 1.36 ) - R²: -0.03500673683956568  val_loss 0.009290134534239769
Epoch: 7 > RMSE: 2.7926943 ( 1.36 ) - R²: 0.023748670659052507  val_loss 0.00829741545021534
Epoch: 9 > RMSE: 2.8114102 ( 1.36 ) - R²: 0.010619760355956576  val_loss 0.008164125494658947
Epoch: 10 > RMSE: 2.780783 ( 1.36 ) - R²: 0.03205874652841034  val_loss 0.006999969482421875
Epoch: 12 > RMSE: 2.8239748 ( 1.36 ) - R²: 0.0017565299478796703  val_loss

KeyboardInterrupt: 