<a href="https://colab.research.google.com/github/Existanze54/sirius-neural-networks-2024/blob/main/Practices/S11_AutoML_LLM/4_LightAutoML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A brief overview of AutoML tools in 2024: <a href='https://habr.com/ru/articles/811425/'>link</a>

<center><img src="https://github.com/Existanze54/sirius-neural-networks-2024/blob/main/Images/LAMA.png?raw=true" width=600></img></center>

LightAutoML GitHub repository: <a href='https://github.com/sb-ai-lab/LightAutoML'>link</a>

LightAutoML (LAMA) is a powerful open-source AutoML framework backed by the Sber AI Lab team, one of the strongest data science teams in terms of expertise. LAMA's superpower is blending and customizable experiments. At the same time, LAMA is more of a scalpel for professionals. It has not been updated for a while; hopefully a new release will appear soon.

In [None]:
!pip install lightautoml

Collecting lightautoml
  Downloading lightautoml-0.3.8.1-py3-none-any.whl (416 kB)
Collecting autowoe>=1.2 (from lightautoml)
  Downloading AutoWoE-1.3.2-py3-none-any.whl (215 kB)
Collecting catboost>=0.26.1 (from lightautoml)
  Downloading catboost-1.2.5-cp310-cp310-manylinux2014_x86_64.whl (98.2 MB)
Collecting cmaes (from lightautoml)
  Downloading cmaes-0.10.0-py3-none-any.whl (29 kB)
Collecting joblib<1.3.0 (from lightautoml)
  Downloading joblib-1.2.0-py3-none-any.whl (297 kB)

# Import Libraries & Set Parameters

In [None]:
# Standard python libraries
import os
import time
import requests

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score
import torch

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDeco

Here we set up some parameters to use in the kernel:
- `N_THREADS` - number of vCPUs for LightAutoML model creation
- `N_FOLDS` - number of folds in LightAutoML inner CV
- `RANDOM_STATE` - random seed for better reproducibility
- `TEST_SIZE` - holdout data part size (commented out below; a separate test file is loaded instead)
- `TIMEOUT` - limit in seconds for model to train
- `TARGET_NAME` - target column name in dataset

In [None]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
#TEST_SIZE = 0.2
#TIMEOUT = 10*3600
TIMEOUT = 5*60
TARGET_NAME = 'hospital_death'

In [None]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

# Load Data

It is important to note that missing values (NaN and others) should be left as is unless the reason for their presence or their specific meaning is known. Otherwise, the AutoML model will treat the filled values as a true pattern between the data and the target variable, which can hurt model quality. LightAutoML can deal with missing values and outliers automatically.

In [None]:
train_df = pd.read_csv('https://raw.githubusercontent.com/Existanze54/sirius-neural-networks-2024/main/Datasets/patient-survival-prediction/train_preprocessed.csv')
print(train_df.shape)
train_df.head()

(44939, 83)


Unnamed: 0.1,Unnamed: 0,hospital_id,age,bmi,elective_surgery,ethnicity,gender,height,icu_admit_source,icu_id,...,cirrhosis,diabetes_mellitus,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,apache_3j_bodysystem,apache_2_bodysystem,hospital_death
0,0,118.0,69.9,25.719814,0.0,2.0,0.0,162.6,0.0,100.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,4.0,1
1,1,185.0,57.0,20.357278,0.0,0.0,1.0,182.9,0.0,687.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0
2,2,99.0,71.0,30.558683,0.0,5.0,1.0,175.2,0.0,514.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
3,3,21.0,75.0,44.990982,0.0,2.0,1.0,175.2,3.0,504.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,4.0,0
4,4,70.0,62.0,16.620499,0.0,0.0,0.0,152.0,0.0,464.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [None]:
test_df = pd.read_csv('https://raw.githubusercontent.com/Existanze54/sirius-neural-networks-2024/main/Datasets/patient-survival-prediction/test_preprocessed.csv')
print(test_df.shape)
test_df.head()

(19260, 83)


Unnamed: 0.1,Unnamed: 0,hospital_id,age,bmi,elective_surgery,ethnicity,gender,height,icu_admit_source,icu_id,...,cirrhosis,diabetes_mellitus,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,apache_3j_bodysystem,apache_2_bodysystem,hospital_death
0,0,188.0,69.0,29.605976,0.0,2.0,0.0,165.1,0.0,840.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0
1,1,10.0,68.0,27.986953,0.0,2.0,1.0,185.4,0.0,428.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,4.0,0
2,2,176.0,55.0,32.64147,1.0,2.0,0.0,162.6,2.0,611.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0
3,3,19.0,53.0,19.444444,0.0,2.0,1.0,180.0,3.0,653.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0
4,4,128.0,74.0,16.508909,0.0,2.0,0.0,165.1,1.0,377.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,9.0,0.0,0


## Task definition

First we need to create a ```Task``` object, the class that sets up which task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task) in our documentation).

The following task types are available:

- ```'binary'``` - for binary classification.

- ```'reg'``` - for regression.

- ```'multiclass'``` - for multiclass classification.

- ```'multi:reg'``` - for multiple regression.

- ```'multilabel'``` - for multi-label classification.

In this example we will consider a binary classification:

In [None]:
task = Task('binary')

Note that logloss is the only loss available for the binary task, and it is the default. The default metric for binary classification is ROC-AUC. See more info about available and default losses and metrics [here](https://lightautoml.readthedocs.io/en/latest/pages/modules/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task).

**Depending on the task, you can and should choose exactly those metrics and losses that you want and need to optimize.**

To solve the task, we need to set up column roles. LightAutoML can define the types and roles of data columns automatically, but they can also be specified directly through the ```roles``` dictionary parameter when training the AutoML model (see next section "AutoML training"). Any role can be specified by a string with its name. The key in the dictionary must be the name of the role, and the value must be the name, or a list of names, of the corresponding columns in the dataset. The **only role you must set up is the** ```'target'``` **role** (the column with the target variable); everything else (```'drop', 'numeric', 'category', 'group', 'weights'``` etc.) is up to the user:

In [None]:
roles = {
    'target': TARGET_NAME,
    'drop': ['patient_id', 'encounter_id']
}
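Because roles refer to columns by name, it can help to check those names against the DataFrame before training. The sketch below is a plain-Python illustration: `build_roles` is a hypothetical helper (not part of LightAutoML), and the column list is a toy stand-in for `train_df.columns`.

```python
def build_roles(columns, target, drop=()):
    """Build a roles dict, keeping only 'drop' names that exist in `columns`.

    Hypothetical helper for illustration; returns the dict and the list of
    requested 'drop' names that were not found.
    """
    available = set(columns)
    if target not in available:
        raise KeyError(f"target column {target!r} not found")
    roles = {'target': target}
    present = [c for c in drop if c in available]
    missing = [c for c in drop if c not in available]
    if present:
        roles['drop'] = present
    return roles, missing


# Toy column list standing in for train_df.columns
toy_columns = ['Unnamed: 0', 'hospital_id', 'age', 'hospital_death']
roles_checked, missing_cols = build_roles(
    toy_columns, 'hospital_death', drop=['patient_id', 'hospital_id']
)
print(roles_checked)
print(missing_cols)
```

In the notebook the same call would take `train_df.columns` and `TARGET_NAME`; the returned list of missing names makes any mismatch between the `'drop'` role and the actual columns visible before `fit_predict` runs.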

You can also optionally specify the following roles:

- ```'numeric'``` - numerical feature

- ```'category'``` - categorical feature

- ```'text'``` - text data

- ```'datetime'``` - features with date and time

- ```'date'``` - features with date only

- ```'group'``` - features by which the data can be divided into groups and which can be taken into account for group k-fold validation (so the same group is not represented in both testing and training sets)

- ```'drop'``` - features to drop, they will not be used in model building

- ```'weights'``` - object weights for the loss and metric

- ```'path'``` - image file paths (for CV tasks)

- ```'treatment'``` - object group in uplift modelling tasks: treatment or control

Note that role names can be written in any case. It is also possible to pass individual objects of role classes with specific arguments instead of strings with role names, for specific tasks and more optimal pipeline construction ([more details](https://github.com/sb-ai-lab/LightAutoML/blob/master/lightautoml/dataset/roles.py)).

For example, to set the date role, you can use the ```DatetimeRole``` class.

In [None]:
# from lightautoml.dataset.roles import DatetimeRole

Different seasonality components can be extracted from the data through the ```seasonality``` parameter: years (```'y'```), months (```'m'```), days (```'d'```), weekdays (```'wd'```), hours (```'hour'```), minutes (```'min'```), seconds (```'sec'```), milliseconds (```'ms'```), nanoseconds (```'ns'```). These features will be treated as categorical. Another important parameter is ```base_date```. It lets you specify a base date and convert the feature to the distance from that date (set to ```False``` by default). All role classes also have a ```force_input``` parameter: if it is ```True```, the corresponding features pass all further feature selection steps and are not excluded (```False``` by default). It is also always possible to specify the data type for any role using the ```dtype``` argument.

Here is an example of such a role assignment through a class object for date feature (but there is no such feature in the considered dataset):

In [None]:
# roles = {
#     DatetimeRole(base_date=False, seasonality=('d', 'wd', 'hour')): 'date_time'
# }
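To make the `seasonality` and `base_date` ideas concrete, the sketch below derives the `'d'`, `'wd'` and `'hour'` components and a day distance to a base date using only the standard library. It illustrates the concept and is not LightAutoML's implementation: the function names are hypothetical, and the weekday numbering (Monday = 0) is Python's convention, assumed here.

```python
from datetime import datetime


def seasonal_parts(ts, seasonality=('d', 'wd', 'hour')):
    """Extract the requested calendar components from a timestamp."""
    extractors = {
        'y': lambda t: t.year,
        'm': lambda t: t.month,
        'd': lambda t: t.day,
        'wd': lambda t: t.weekday(),  # Monday = 0 (Python convention)
        'hour': lambda t: t.hour,
        'min': lambda t: t.minute,
        'sec': lambda t: t.second,
    }
    return {key: extractors[key](ts) for key in seasonality}


def days_since(ts, base_date):
    """Distance from `base_date` to `ts`, in (fractional) days."""
    return (ts - base_date).total_seconds() / 86400.0


stamp = datetime(2024, 5, 17, 14, 30)
parts = seasonal_parts(stamp)
dist = days_since(stamp, datetime(2024, 1, 1))
print(parts, dist)
```

Each extracted component is a small integer, which is why treating such features as categorical is a natural choice.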

Next we are going to create a LightAutoML model with the `TabularAutoML` class, a preset with a default model structure, in just several lines. The cell below uses `TabularUtilizedAutoML`, imported above from the same module; judging by the training log, it runs preset configurations one after another within the time budget.

Let's discuss some of the params we can set up:
- `task` - the type of the ML task (the only **must have** parameter)
- `timeout` - time limit in seconds for model to train
- `cpu_limit` - vCPU count for model to use
- `reader_params` - parameters for the ```Reader``` object inside the preset, which works on the first step of data preparation: automatic feature typization, preliminary removal of almost-constant features, correct CV setup etc. For example, it can set `n_jobs` threads for the typization algorithm, `cv` folds and `random_state` as the inner CV seed (the cell below sets only `n_jobs`).
- `general_params` - general parameters dictionary, in which it is possible to specify the list of algorithms to use (```'use_algos'```), whether to use nested CV (```'nested_cv'```) etc.

**Important note**: `reader_params` key is one of the YAML config keys, which is used inside `TabularAutoML` preset. [More details](https://github.com/sb-ai-lab/LightAutoML/blob/master/lightautoml/automl/presets/tabular_config.yml) on its structure with explanation comments can be found on the link attached. Each key from this config can be modified with user settings during preset object initialization. To get more info about different parameters setting (for example, ML algos which can be used in `general_params->use_algos`) please take a look at our [article on TowardsDataScience](https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936).
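The idea of overriding individual config keys can be pictured as a recursive dictionary merge. The sketch below is only an illustration under that assumption; `merge_params` is a hypothetical helper, not LightAutoML code. The toy values mirror the `reader_params` entries that appear in the training log of this notebook.

```python
def merge_params(defaults, overrides):
    """Return a copy of `defaults` with `overrides` applied key by key.

    Nested dicts are merged recursively; any other value is replaced.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = value
    return merged


# Toy defaults and user settings (illustrative values only)
default_cfg = {'reader_params': {'random_state': 42}}
user_cfg = {'reader_params': {'n_jobs': 4}}
merged_cfg = merge_params(default_cfg, user_cfg)
print(merged_cfg)
```

Keys the user does not mention keep their default values, so a call such as `reader_params = {'n_jobs': N_THREADS}` changes one setting without discarding the rest.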

Moreover, to receive an automatic report for the model, we can wrap it in the `ReportDeco` decorator and work with the decorated version in the same way as with the usual one.

In [None]:
automl = TabularUtilizedAutoML(
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    tuning_params = {'max_tuning_time': 900},
    reader_params = {'n_jobs': N_THREADS}
)

To run AutoML training, use the ```fit_predict``` method.

Main arguments:

- `train_data` - dataset to train.
- `roles` - column roles dict.
- `verbose` - controls the verbosity: the higher, the more messages:
        <1  : messages are not displayed;
        >=1 : the computation process for layers is displayed;
        >=2 : the information about folds processing is also displayed;
        >=3 : the hyperparameters optimization process is also displayed;
        >=4 : the training process for every algorithm is displayed;

Note: out-of-fold predictions are calculated during training and returned by the `fit_predict` method.
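As a reminder of what "out-of-fold" means, here is a toy sketch in plain Python: every row is predicted by a model that never saw it. The "model" is just the mean target of the other folds, and the fold assignment (`i % n_folds`) is an assumption made for brevity; none of this reflects LightAutoML's internals. The optional `folds_to_fit` argument mimics a run that stops before every fold has been fitted.

```python
import math


def oof_mean_predictions(y, n_folds, folds_to_fit=None):
    """Toy out-of-fold scheme with a mean-target 'model'.

    Rows belonging to folds that are never fitted keep a NaN prediction.
    """
    n = len(y)
    fold_of = [i % n_folds for i in range(n)]
    folds = range(n_folds) if folds_to_fit is None else folds_to_fit
    oof = [math.nan] * n
    for k in folds:
        # Train on every row outside fold k ...
        train = [y[i] for i in range(n) if fold_of[i] != k]
        pred = sum(train) / len(train)
        # ... and predict only the rows inside fold k
        for i in range(n):
            if fold_of[i] == k:
                oof[i] = pred
    return oof


y_toy_oof = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
full = oof_mean_predictions(y_toy_oof, n_folds=5)
partial = oof_mean_predictions(y_toy_oof, n_folds=5, folds_to_fit=[0, 1, 2])
print(full)
print(partial)
```

With all five folds fitted every row gets a prediction; with only three folds fitted, the rows of the remaining two folds stay NaN.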

In [None]:
%%time
oof_pred = automl.fit_predict(train_df, roles = roles, verbose = 1)

[07:17:42] Start automl [1mutilizator[0m with listed constraints:


INFO:lightautoml.addons.utilization.utilization:Start automl [1mutilizator[0m with listed constraints:


[07:17:42] - time: 300.00 seconds


INFO:lightautoml.addons.utilization.utilization:- time: 300.00 seconds


[07:17:42] - CPU: 4 cores


INFO:lightautoml.addons.utilization.utilization:- CPU: 4 cores


[07:17:42] - memory: 16 GB



INFO:lightautoml.addons.utilization.utilization:- memory: 16 GB



[07:17:42] [1mIf one preset completes earlier, next preset configuration will be started[0m



INFO:lightautoml.addons.utilization.utilization:[1mIf one preset completes earlier, next preset configuration will be started[0m







[07:17:42] Start 0 automl preset configuration:


INFO:lightautoml.addons.utilization.utilization:Start 0 automl preset configuration:


[07:17:42] [1mconf_0_sel_type_0.yml[0m, random state: {'reader_params': {'random_state': 42}, 'nn_params': {'random_state': 42}, 'general_params': {'return_all_predictions': False}}


INFO:lightautoml.addons.utilization.utilization:[1mconf_0_sel_type_0.yml[0m, random state: {'reader_params': {'random_state': 42}, 'nn_params': {'random_state': 42}, 'general_params': {'return_all_predictions': False}}
INFO3:lightautoml.addons.utilization.utilization:Found reader_params in kwargs, need to combine
INFO3:lightautoml.addons.utilization.utilization:Merged variant for reader_params = {'n_jobs': 4, 'random_state': 42}


[07:17:42] Stdout logging level is INFO.


INFO:lightautoml.automl.presets.base:Stdout logging level is INFO.


[07:17:42] Task: binary



INFO:lightautoml.automl.presets.base:Task: binary



[07:17:42] Start automl preset with listed constraints:


INFO:lightautoml.automl.presets.base:Start automl preset with listed constraints:


[07:17:42] - time: 299.99 seconds


INFO:lightautoml.automl.presets.base:- time: 299.99 seconds


[07:17:42] - CPU: 4 cores


INFO:lightautoml.automl.presets.base:- CPU: 4 cores


[07:17:42] - memory: 16 GB



INFO:lightautoml.automl.presets.base:- memory: 16 GB



[07:17:42] [1mTrain data shape: (44939, 83)[0m



INFO:lightautoml.reader.base:[1mTrain data shape: (44939, 83)[0m

INFO3:lightautoml.reader.base:Feats was rejected during automatic roles guess: []


[07:17:58] Layer [1m1[0m train process start. Time left 283.92 secs


INFO:lightautoml.automl.base:Layer [1m1[0m train process start. Time left 283.92 secs


[07:18:02] Start fitting [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m ...


INFO:lightautoml.ml_algo.base:Start fitting [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m ...
DEBUG:lightautoml.ml_algo.base:Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [95, 96, 97, 98], 'embed_sizes': array([11, 11, 11,  5], dtype=int32), 'data_size': 99}
INFO2:lightautoml.ml_algo.base:===== Start working with [1mfold 0[0m for [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m =====
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 1e-05 score = 0.8570695569050832
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 5e-05 score = 0.8651004179787073
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 0.0001 score = 0.8686367583845654
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 0.0005 score = 0.8748398997850754
INFO3:lightautoml.ml_algo.torch_based.linear_mode

[07:18:19] Fitting [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m finished. score = [1m0.8817612834449979[0m


INFO:lightautoml.ml_algo.base:Fitting [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m finished. score = [1m0.8817612834449979[0m


[07:18:19] [1mLvl_0_Pipe_0_Mod_0_LinearL2[0m fitting and predicting completed


INFO:lightautoml.ml_algo.base:[1mLvl_0_Pipe_0_Mod_0_LinearL2[0m fitting and predicting completed


[07:18:19] Time left 262.78 secs



INFO:lightautoml.automl.base:Time left 262.78 secs



[07:18:22] Start fitting [1mLvl_0_Pipe_1_Mod_0_LightGBM[0m ...


INFO:lightautoml.ml_algo.base:Start fitting [1mLvl_0_Pipe_1_Mod_0_LightGBM[0m ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task': 'train', 'learning_rate': 0.03, 'num_leaves': 32, 'feature_fraction': 0.7, 'bagging_fraction': 0.7, 'bagging_freq': 1, 'max_depth': -1, 'verbosity': -1, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'min_split_gain': 0.0, 'zero_as_missing': False, 'num_threads': 2, 'max_bin': 255, 'min_data_in_bin': 3, 'num_trees': 1200, 'early_stopping_rounds': 200, 'random_state': 42}
INFO2:lightautoml.ml_algo.base:===== Start working with [1mfold 0[0m for [1mLvl_0_Pipe_1_Mod_0_LightGBM[0m =====
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's auc: 0.885849
DEBUG:lightautoml.ml_algo.boost_lgbm:[200]	valid's auc: 0.888384
DEBUG:lightautoml.ml_algo.boost_lgbm:[300]	valid's auc: 0.888928
DEBUG:lightautoml.ml_algo.boost_lgbm:[400]	valid's auc: 0.88922
DEBUG:lightautoml.ml

[07:18:51] Time limit exceeded after calculating fold 1

[07:18:51] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = 0.8938905757984704
[07:18:51] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed
[07:18:51] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 1.00 secs
[07:18:51] Copying TaskTimer may affect the parent PipelineTimer, so copy will create new unlimited TaskTimer
INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-506d1f68-f9e8-44b9-a09b-d22c359cc9eb
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's auc: 0.883725
DEBUG:lightautoml.ml_algo.boost_lgbm:[200]	valid's auc: 0.885732
DEBUG:lightautoml.ml_algo.boost_lgbm:[300]	valid's auc: 0.885442
DEBUG:lightautoml.ml_algo.boost_lgbm:[400]	valid's auc: 0.885301
DEBUG:lightautoml.ml_algo.boost_lgbm:[500]	valid's auc: 0.885009
DEBUG:lightautoml.ml_algo.boost_lgbm:Early stopping, best iteration is:
[356]	valid's auc: 0.886118
INFO:optuna.study.study:Trial 0 finished with value: 0.8861184522916979 and parameters: {'feature_fraction': 0.6872700594236812, 'num_leaves': 244, 'bagging_fraction': 0.8659969709057025, 'min_sum_hessian_in_leaf': 0.24810409748678125, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07}. Best is trial 0 with value: 0.8861184522916979

[07:19:46] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
INFO2:lightautoml.ml_algo.tuning.optuna:The set of hyperparameters {'feature_fraction': 0.6872700594236812, 'num_leaves': 244, 'bagging_fraction': 0.8659969709057025, 'min_sum_hessian_in_leaf': 0.24810409748678125, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07}
 achieve 0.8861 auc
[07:19:46] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task_type': 'CPU', 'thread_count': 2, 'random_seed': 42, 'num_trees': 5000, 'learning_rate': 0.03, 'l2_leaf_reg': 0.01, 'bootstrap_type': 'Bernoulli', 'grow_policy': 'SymmetricTree', 'max_depth': 5, 'min_data_in_leaf': 1, 'one_hot_max_size': 10, 'fold_permutation_block': 1, 'boosting_type': 'Plain', 'boost_from_average': True, 'od_type': 'Iter', 'od_wait': 100, 'max_bin': 32, 'feature_border_type': 'GreedyLogSum', 'nan_mode': 'Min', 'verbose': 100, 'allow_writing_files': False}
INFO2:lightautoml.ml_algo.base:===== Start working with [1mfold 0[0m for [1mLvl_0_Pipe_1_Mod_2_CatBoost[0m =====
INFO3:lightautoml.ml_algo.boost_cb:0:	test: 0.7628845	best: 0.7628845 (0)	total: 67.5ms	remaining: 5m 37s
DEBUG:lightautoml.ml_algo.boost_cb:100:	test: 0.8753086	best: 0.8753086 (100)	total: 1.57s	remaining: 1m 16s
DEBUG:lightautoml.ml_algo.boost_cb:200:	test: 0.881

[07:20:42] Time limit exceeded after calculating fold 2

[07:20:42] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = 0.8919156073152174
[07:20:42] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[07:20:42] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 1.00 secs
INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-9fcb5525-1e93-4bc4-b32e-179cdf5648e2
INFO3:lightautoml.ml_algo.boost_cb:0:	test: 0.7481555	best: 0.7481555 (0)	total: 13.8ms	remaining: 1m 8s
DEBUG:lightautoml.ml_algo.boost_cb:100:	test: 0.8745042	best: 0.8745042 (100)	total: 1.34s	remaining: 1m 4s
DEBUG:lightautoml.ml_algo.boost_cb:200:	test: 0.8804225	best: 0.8804225 (200)	total: 2.66s	remaining: 1m 3s
DEBUG:lightautoml.ml_algo.boost_cb:300:	test: 0.8836618	best: 0.8837374 (296)	total: 4.03s	remaining: 1m 2s
DEBUG:lightautoml.ml_algo.boost_cb:400:	test: 0.8851430	best: 0.8851430 (400)	total: 5.35s	remaining: 1m 1s
DEBUG:lightautoml.ml_algo.boost_cb:500:	test: 0.8862520	best: 0.8862545 (499)	total: 6.67s	remaining: 59.9s
DEBUG:lightautoml.ml_algo.boost_cb:600:	test: 0.8873430	best: 0.8873430 (600)	total: 7.9

[07:21:06] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost completed
INFO2:lightautoml.ml_algo.tuning.optuna:The set of hyperparameters {'max_depth': 4, 'l2_leaf_reg': 3.6010467344475403, 'min_data_in_leaf': 15}
 achieve 0.8895 auc
[07:21:06] Time left 96.20 secs

[07:21:06] Time limit exceeded in one of the tasks. AutoML will blend level 1 models.

[07:21:06] Layer 1 training completed.

[07:21:06] Blending: optimization starts with equal weights and score 0.8878270225483917
[07:21:06] Blending: iteration 0: score = 0.8933582621082622, weights = [0.  0.5 0.5]
[07:21:06] Blending: iteration 1: score = 0.8933582621082622, weights = [0.  0.5 0.5]
[07:21:06] Blending: no score update. Terminated

[07:21:06] Automl preset training completed in 204.40 seconds

[07:21:06] Model description:
Final prediction for new objects (level 0) = 
	 0.50000 * (2 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
	 0.50000 * (3 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) 

CPU times: user 4min 57s, sys: 6.45 s, total: 5min 4s
Wall time: 3min 24s

## Prediction for test data

In [None]:
test_pred = automl.predict(test_df)
print(f'Prediction for te_data:\n{test_pred}\nShape = {test_pred.shape}')

Prediction for te_data:
array([[0.00656395],
       [0.00816865],
       [0.00441861],
       ...,
       [0.00490397],
       [0.02964572],
       [0.01784633]], dtype=float32)
Shape = (19260, 1)


In [None]:
print(f'OOF score: {roc_auc_score(train_df[TARGET_NAME].values, oof_pred.data[:, 0])}')

ValueError: Input contains NaN.

In [None]:
oof_pred.data[:, 0]

array([0.44725886,        nan, 0.00581809, ..., 0.02756848,        nan,
       0.00322664], dtype=float32)

In [None]:
np.count_nonzero(~np.isnan(oof_pred.data[:, 0]))

26964

In [None]:
np.count_nonzero(np.isnan(oof_pred.data[:, 0])) # NaNs

17975

In [None]:
np.count_nonzero(np.isnan(test_pred.data[:, 0])) # NaNs

0
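The counts above show that the `ValueError` comes from NaN entries in the OOF predictions, while the test predictions are complete. A likely explanation, given the "Time limit exceeded after calculating fold ..." messages in the training log, is that rows from folds that were never fitted have no OOF prediction; this is an inference, not something the log states directly. One way to still get an OOF score is to evaluate only the rows that have a prediction and report how many rows that covers. The sketch below assumes NumPy; `score_without_nan` and `pairwise_auc` are hypothetical helpers, and the tiny AUC implementation is meant for small demo inputs only.

```python
import numpy as np


def pairwise_auc(y_true, y_score):
    """ROC-AUC as P(score_pos > score_neg), ties counted as 0.5.

    Builds a full positives-by-negatives matrix, so use it on small inputs.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def score_without_nan(y_true, y_pred, metric):
    """Apply `metric` to rows with a non-NaN prediction.

    Returns (score, coverage), where coverage is the share of rows used.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_pred)
    return metric(y_true[mask], y_pred[mask]), float(mask.mean())


# Toy data: two of six predictions are missing
y_demo = np.array([1, 0, 0, 1, 0, 1])
p_demo = np.array([0.9, 0.2, np.nan, 0.4, 0.5, np.nan])
auc_demo, coverage = score_without_nan(y_demo, p_demo, pairwise_auc)
print(auc_demo, coverage)
```

In the notebook the equivalent call would be `score_without_nan(train_df[TARGET_NAME].values, oof_pred.data[:, 0], roc_auc_score)`. Such a score covers only 26964 of the 44939 training rows, so it is not directly comparable to a full five-fold OOF estimate.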

In [None]:
print(f'Test score: {roc_auc_score(test_df[TARGET_NAME].values, test_pred.data[:, 0])}')

Test score: 0.885788495114911


In [None]:
test_auc = roc_auc_score(test_df[TARGET_NAME].values, test_pred.data[:, 0])
test_acc = accuracy_score(test_df[TARGET_NAME].values, test_pred.data[:, 0] > 0.5)

print("LightAutoML test metrics: AUC={}, acc={}".format(test_auc, test_acc))

LightAutoML test metrics: AUC=0.885788495114911, acc=0.9280373831775701


## Model analysis

In [None]:
print(automl.create_model_str_desc())

Final prediction for new objects = 
	1.00000 * 1 averaged models with config = "conf_0_sel_type_0.yml" and different CV random_states. Their structures: 

	    Model #0.
		Final prediction for new objects (level 0) = 
			 0.11268 * (4 averaged models Lvl_0_Pipe_0_Mod_0_LinearL2) +
			 0.52931 * (2 averaged models Lvl_0_Pipe_1_Mod_0_LightGBM) +
			 0.35802 * (3 averaged models Lvl_0_Pipe_1_Mod_2_CatBoost) 






LightAutoML offers 2 different methods for calculating feature importances:
- Fast (`fast`) - this method uses feature importances from the feature selector LGBM model inside LightAutoML. It is extremely fast and works almost always (the exceptions are when feature selection is turned off or the selector was removed from the final models together with all GBM models). There is no need to use new labelled data.
- Accurate (`accurate`) - this method calculates *feature permutation importances* for the whole LightAutoML model based on the **new labelled data**. It always works but can take a long time to finish (depending on the model structure, the size of the new labelled dataset etc.).
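The permutation idea behind the accurate method can be sketched in a few lines of NumPy: shuffle one feature column at a time on labelled data and measure how much the metric drops. This is a generic illustration of the technique, not LightAutoML's implementation; the function name, the toy model and the accuracy metric are all assumptions made for the example.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in `metric` after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    base = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base - metric(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances


def predict_toy(X):
    """Hypothetical model that looks only at feature 0."""
    return (X[:, 0] >= 10).astype(int)


def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))


# Feature 0 determines the label; feature 1 is irrelevant
X_toy = np.column_stack([np.arange(20.0), np.tile([0.0, 1.0], 10)])
y_toy = (X_toy[:, 0] >= 10).astype(int)
imp = permutation_importance_sketch(predict_toy, X_toy, y_toy, accuracy)
print(imp)
```

The cost grows with the number of features, the number of repeats and the prediction time of the model, which is why this kind of calculation can be slow for a large ensemble.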

In [None]:
%%time

# Fast feature importances calculation
fast_fi = automl.get_feature_scores('fast')
fast_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)

AttributeError: 'NoneType' object has no attribute 'set_index'

In [None]:
fast_fi
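The `AttributeError` above means that `get_feature_scores('fast')` returned `None` in this run, which matches the caveat that the fast method is not always available. A defensive pattern is to check for `None` and fall back to the accurate method. The sketch below uses a stub object so that it runs on its own; `feature_scores_with_fallback` and `StubModel` are hypothetical, and passing the labelled data as the second argument together with `silent=False` is an assumption about the `get_feature_scores` signature that should be checked against the installed LightAutoML version.

```python
def feature_scores_with_fallback(model, labelled_data):
    """Try the 'fast' method first; fall back to 'accurate' if it returns None."""
    scores = model.get_feature_scores('fast')
    if scores is None:
        # The accurate method needs labelled data to permute and score
        scores = model.get_feature_scores('accurate', labelled_data, silent=False)
    return scores


class StubModel:
    """Hypothetical stand-in for a fitted model whose fast importances are missing."""

    def get_feature_scores(self, calc_method='fast', data=None, silent=True):
        if calc_method == 'fast':
            return None
        # Made-up importances, for the demo only
        return [('age', 0.02), ('bmi', 0.01)]


demo_scores = feature_scores_with_fallback(StubModel(), labelled_data=[])
print(demo_scores)
```

With the real model, the call would be `feature_scores_with_fallback(automl, test_df)`, and plotting would proceed only once a non-`None` result is returned.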