# Regression on the Yb-based OLED dataset with LightAutoML

Designed by Koshelev Daniil

https://github.com/Lamblador/Yb_OLED_Dataset

This notebook shows how to read and use the YbOLED dataset with the LightAutoML package.

# Import and installation of modules

In [1]:
!pip install lightautoml
from IPython.display import clear_output
clear_output()

In [2]:
import pandas as pd
import numpy as np
import torch
import sklearn

In [226]:
!git clone https://github.com/Lamblador/Yb_OLED_Dataset/

Cloning into 'Yb_OLED_Dataset'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 16 (delta 3), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (16/16), 49.00 KiB | 1.09 MiB/s, done.
Resolving deltas: 100% (3/3), done.


# Initializing LightAutoML

In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Standard python libraries
import os
import requests

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.tabular_presets import TabularAutoML, TabularUtilizedAutoML
from lightautoml.tasks import Task
from lightautoml.report.report_deco import ReportDeco, ReportDecoUtilized
from lightautoml.addons.tabular_interpretation import SSWARM

Standard parameters

In [228]:
N_THREADS = 4  # number of CPU cores to use
N_FOLDS = 5  # number of folds in cross-validation
RANDOM_STATE = 42  # random seed for reproducibility
TEST_SIZE = 0.2  # fraction of the data held out for testing
TIMEOUT = 300  # training time budget in seconds
TARGET_NAME = 'ECE  uW/W'  # name of the target column
# You can choose 'ECE  uW/W', 'Max. irradiance  uW/cm2', 'Uon ' or 'EQE  %'

In [229]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

In [230]:
task = Task('reg')  # regression task

### Read dataset

In [231]:
data = pd.read_csv('/content/Yb_OLED_Dataset/yb_oled_data_short2.csv', delimiter=';')  # read the dataset file
data.drop('year', axis=1, inplace=True)  # drop the year column: it carries no useful information for ML
data.dropna(subset=[TARGET_NAME], inplace=True)  # drop all rows without a target value

data.head()

Unnamed: 0,HIL tikness nm,HIL HOMO,HIL LUMO,HTL tikness nm,HTL HOMO,HTL LUMO,HTL HOMO-EML HOMO,HTL LUMO-EML LUMO,EML tikness nm,EML HOMO,...,Max. irradiance uW/cm2,EQE %,ECE uW/W,Uon,t us,QY %,hole mobilty cm2/Vs,electron mobility cm2/Vs,total cm2/Vs,Pixel size mm2
5,30,-5.5,-2.4,0,,,,,50,,...,0.8,,0.1,10.0,24.0,1.2,,,,
8,25,-5.2,-2.4,0,,,0.72,0.26,40,-5.92,...,22.48,,51.0,7.6,,2.4,6.24e-08,2.69e-12,3.21e-06,
9,25,-5.2,-2.4,0,,,0.72,0.26,40,-5.92,...,12.13,,44.0,7.7,,1.41,,,2.69e-05,
10,25,-5.2,-2.4,0,,,0.72,0.26,40,-5.92,...,9.6,,42.0,7.9,,1.33,,,1.3e-07,
11,25,-5.5,-2.4,0,,,0.42,0.26,40,-5.92,...,19.29,,47.0,7.7,,1.92,,,4.5e-08,
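
The column names used in the code contain repeated or trailing spaces (for example `'ECE  uW/W'`), while the table above shows them with single spaces, so a mistyped `TARGET_NAME` is easy to miss. A minimal sketch of a lookup that ignores whitespace differences; `find_column` is a hypothetical helper, not part of pandas or LightAutoML, and the column list is a toy stand-in for `data.columns`:

```python
def find_column(columns, name):
    """Return the single column whose name matches `name` once whitespace is normalized."""
    def normalize(text):
        # Collapse runs of whitespace and strip both ends
        return " ".join(str(text).split())

    wanted = normalize(name)
    matches = [col for col in columns if normalize(col) == wanted]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one column matching {name!r}, found {matches}")
    return matches[0]


# Toy column list standing in for data.columns (spellings are assumptions)
columns = ["HIL tikness nm", "ECE  uW/W", "Uon ", "EQE  %"]
print(repr(find_column(columns, "ECE uW/W")))
```

With the real data, the call would look like `TARGET_NAME = find_column(data.columns, 'ECE uW/W')`.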


Train/test split

In [232]:
train_data, test_data = train_test_split(
    data,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE
)

print(f'Data is split. Parts sizes: train_data = {train_data.shape}, test_data = {test_data.shape}')

train_data.head()

Data is split. Parts sizes: train_data = (21, 38), test_data = (6, 38)
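
The part sizes printed above follow from simple arithmetic. Assuming scikit-learn rounds the test part up when `test_size` is a fraction and gives the remainder to the training part, the sketch below reproduces the sizes in plain Python (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of the train and test parts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test part is rounded up
    n_train = n_samples - n_test               # the rest goes to training
    return n_train, n_test


# 21 + 6 = 27 rows remain after dropping rows without the target
print(split_sizes(27, 0.2))
```

With only six test rows, any metric computed on the test part should be read as a rough estimate.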


Unnamed: 0,HIL tikness nm,HIL HOMO,HIL LUMO,HTL tikness nm,HTL HOMO,HTL LUMO,HTL HOMO-EML HOMO,HTL LUMO-EML LUMO,EML tikness nm,EML HOMO,...,Max. irradiance uW/cm2,EQE %,ECE uW/W,Uon,t us,QY %,hole mobilty cm2/Vs,electron mobility cm2/Vs,total cm2/Vs,Pixel size mm2
36,50,-5.2,-2.3,20,-5.2,-2.3,0.627,0.895,30,-5.827,...,17.0,0.06,188.0,4.5,14.2,0.91,6.47e-05,6.08e-05,,12.0
37,50,-5.2,-2.3,20,-5.2,-2.3,0.691,0.947,30,-5.891,...,7.0,0.12,429.0,4.5,14.4,0.95,0.000155,0.000101,,12.0
32,50,-5.2,-2.3,15,-5.8,-2.2,-0.662,0.45,23,-5.138,...,4.0,0.0003,93.0,3.9,10.0,0.5,,,,12.0
46,50,-5.2,-2.3,20,-5.2,-2.3,,,23,,...,16.0,0.00045,140.0,4.0,,0.8,,,,12.0
8,25,-5.2,-2.4,0,,,0.72,0.26,40,-5.92,...,22.48,,51.0,7.6,,2.4,6.24e-08,2.69e-12,3e-06,


Roles of the columns: **target** is the value to predict; **drop** lists the columns to ignore.

In [233]:
# TARGET_NAME = 'Max. irradiance  uW/cm2'  # other options: 'ECE  uW/W', 'EQE  %', 'Uon'
roles = {
    'target': TARGET_NAME,
    'drop': ['Max. irradiance  uW/cm2', 'EQE  %', 'Uon']  # update this list if you change TARGET_NAME
}
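
Because the other measured device characteristics have to be dropped whenever the target changes, the `drop` list can be derived from the target name instead of being edited by hand. A minimal sketch, assuming the four output columns are spelled as in the cell above (`OUTPUT_COLUMNS` and `build_roles` are hypothetical names; the notebook writes one column both as `'Uon'` and `'Uon '`, so check `data.columns` for the exact spelling):

```python
# Assumed spellings of the four output columns; verify against data.columns
OUTPUT_COLUMNS = ['ECE  uW/W', 'Max. irradiance  uW/cm2', 'EQE  %', 'Uon']


def build_roles(target_name, output_columns=OUTPUT_COLUMNS):
    """Build a roles dict that predicts one output column and drops the others."""
    if target_name not in output_columns:
        raise ValueError(f'unknown target: {target_name!r}')
    return {
        'target': target_name,
        'drop': [col for col in output_columns if col != target_name],
    }


demo_roles = build_roles('ECE  uW/W')
print(demo_roles)
```

Dropping the remaining outputs keeps the model from learning one measured result from another measured on the same device.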

In [234]:
automl = TabularAutoML( #tabular auto ml pipeline class
    task = task,
    timeout = TIMEOUT,
    cpu_limit = N_THREADS,
    reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
)

## Model training

In [235]:
%%time
out_of_fold_predictions = automl.fit_predict(train_data, roles = roles, verbose = 1)

[15:05:25] Stdout logging level is INFO.
[15:05:25] Task: reg
[15:05:25] Start automl preset with listed constraints:
[15:05:25] - time: 300.00 seconds
[15:05:25] - CPU: 4 cores
[15:05:25] - memory: 16 GB
[15:05:25] Train data shape: (21, 38)
INFO3:lightautoml.reader.base:Feats was rejected during automatic roles guess: []
[15:05:31] Layer 1 train process start. Time left 293.93 secs


[15:05:31] Start fitting Lvl_0_Pipe_0_Mod_0_LinearL2 ...
DEBUG:lightautoml.ml_algo.base:Training params: {'tol': 1e-06, 'max_iter': 100, 'cs': [1e-05, 5e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000], 'early_stopping': 2, 'categorical_idx': [62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76], 'embed_sizes': array([ 2,  3,  2,  2,  3,  3,  2,  2,  2,  2,  3,  9,  3,  3, 11],
      dtype=int32), 'data_size': 77}
INFO2:lightautoml.ml_algo.base:===== Start working with fold 0 for Lvl_0_Pipe_0_Mod_0_LinearL2 =====
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 1e-05 score = -26241.170440328693
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 5e-05 score = -25645.78291510234
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear model: C = 0.0001 score = -24955.98777728643
INFO3:lightautoml.ml_algo.torch_based.linear_model:Linear
[15:05:33] Fitting Lvl_0_Pipe_0_Mod_0_LinearL2 finished. score = -10059.717400570567
[15:05:33] Lvl_0_Pipe_0_Mod_0_LinearL2 fitting and predicting completed
[15:05:33] Time left 291.45 secs

INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:[200]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:Early stopping, best iteration is:
[1]	valid's l2: 18833.6
[15:05:34] Selector_LightGBM fitting and predicting completed
[15:05:34] Start fitting Lvl_0_Pipe_1_Mod_0_LightGBM ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task': 'train', 'learning_rate': 0.01, 'num_leaves': 16, 'feature_fraction': 0.9, 'bagging_fraction': 0.9, 'bagging_freq': 1, 'max_depth': -1, 'verbosity': -1, 'reg_alpha': 1, 'reg_lambda': 0.0, 'min_split_gain': 0.0, 'zero_as_missing': False, 'num_threads': 2, 'max_bin': 255, 'min_data_in_bin': 3, 'num_trees': 3000, 'early_stopping_rounds': 200, 'random_state': 42}
INFO2:lightautoml.ml_algo.base:===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_0_LightGBM =====
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:[200]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:Early stopping, best iteration is:
[1]	valid's l2: 18833.6
INFO2:lightautoml.ml_algo.base:===== Start working with
[15:05:34] Fitting Lvl_0_Pipe_1_Mod_0_LightGBM finished. score = -14707.618009136819
[15:05:34] Lvl_0_Pipe_1_Mod_0_LightGBM fitting and predicting completed


[15:05:34] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ... Time budget is 64.78 secs
INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-fa06d5be-45dd-4747-8724-01a87d5b9aa4
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 200 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:[200]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:Early stopping, best iteration is:
[1]	valid's l2: 18833.6
INFO:optuna.study.study:Trial 0 finished with value: -26416.482234039308 and parameters: {'feature_fraction': 0.6872700594236812, 'num_leaves': 244, 'bagging_fraction': 0.8659969709057025, 'min_sum_hessian_in_leaf': 0.24810409748678125, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07}. Best is trial 0 with value: -26416.482234039308.
INFO3:lightautoml.ml_algo.tuning.optuna:Trial
[15:05:48] Hyperparameters optimization for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM completed
INFO2:lightautoml.ml_algo.tuning.optuna:The set of hyperparameters {'feature_fraction': 0.6872700594236812, 'num_leaves': 244, 'bagging_fraction': 0.8659969709057025, 'min_sum_hessian_in_leaf': 0.24810409748678125, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07}
 achieve -26416.4822 mse
[15:05:48] Start fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task': 'train', 'learning_rate': 0.05, 'num_leaves': 244, 'feature_fraction': 0.6872700594236812, 'bagging_fraction': 0.8659969709057025, 'bagging_freq': 1, 'max_depth': -1, 'verbosity': -1, 'reg_alpha': 2.5361081166471375e-07, 'reg_lambda': 2.5348407664333426e-07, 'min_split_gain': 0.0, 'zero_as_missing': False, 'num_threads': 2, 'max_bin': 255, 'min_data_in_bin': 3, 'num_trees': 3000, 'early_stopping_rounds': 100, 'random_state': 42, 'min_sum_hessian_in_leaf': 0.24810409748678125}
INFO2:lightautoml.ml_algo.base:===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM =====
INFO3:lightautoml.ml_algo.boost_lgbm:Training until validation scores don't improve for 100 rounds
DEBUG:lightautoml.ml_algo.boost_lgbm:[100]	valid's l2: 18833.6
DEBUG:lightautoml.ml_algo.boost_lgbm:Early stopping, best iteration is:
[1]	valid's l2:
[15:05:48] Fitting Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM finished. score = -14707.618009136819
[15:05:48] Lvl_0_Pipe_1_Mod_1_Tuned_LightGBM fitting and predicting completed


[15:05:48] Start fitting Lvl_0_Pipe_1_Mod_2_CatBoost ...
DEBUG:lightautoml.ml_algo.base:Training params: {'task_type': 'CPU', 'thread_count': 2, 'random_seed': 42, 'num_trees': 2000, 'learning_rate': 0.05, 'l2_leaf_reg': 0.01, 'bootstrap_type': 'Bernoulli', 'grow_policy': 'SymmetricTree', 'max_depth': 5, 'min_data_in_leaf': 1, 'one_hot_max_size': 10, 'fold_permutation_block': 1, 'boosting_type': 'Plain', 'boost_from_average': True, 'od_type': 'Iter', 'od_wait': 300, 'max_bin': 32, 'feature_border_type': 'GreedyLogSum', 'nan_mode': 'Min', 'verbose': 100, 'allow_writing_files': False}
INFO2:lightautoml.ml_algo.base:===== Start working with fold 0 for Lvl_0_Pipe_1_Mod_2_CatBoost =====
INFO3:lightautoml.ml_algo.boost_cb:0:	learn: 97.1823097	test: 162.1961564	best: 162.1961564 (0)	total: 1.13ms	remaining: 2.26s
DEBUG:lightautoml.ml_algo.boost_cb:100:	learn: 94.9625467	test: 156.1957214	best: 156.1957214 (100)	total: 19.9ms	remaining: 375ms
DEBUG:li
[15:05:49] Fitting Lvl_0_Pipe_1_Mod_2_CatBoost finished. score = -13496.901950175428
[15:05:49] Lvl_0_Pipe_1_Mod_2_CatBoost fitting and predicting completed
[15:05:49] Start hyperparameters optimization for Lvl_0_Pipe_1_Mod_3_Tuned_CatBoost ... Time budget is 208.00 secs
INFO:optuna.storages._in_memory:A new study created in memory with name: no-name-b8ce2ba2-101c-4e85-ab67-5a76a1c4f483
INFO3:lightautoml.ml_algo.boost_cb:0:	learn: 97.2782733	test: 162.2731881	best: 162.2731881 (0)	total: 128us	remaining: 258ms
DEBUG:lightautoml.ml_algo.boost_cb:100:	learn: 94.9712357	test: 156.2761517	best: 156.2761517 (100)	total: 7.71ms	remaining: 145ms
DEBUG:lightautoml.ml_algo.boost_cb:200:	learn: 94.9625076	test: 156.1641004	best: 156.1641004 (200)	total: 13.7ms	remaining: 123ms
DEBUG:lightautoml.ml_algo.boost_cb:300:	learn: 94.9624664	test: 156.1618839	best: 156.1618839 (300)	total: 21.2ms	remaining: 119ms
DEBUG:lightautoml.ml_algo.boost_cb:400:	learn: 94.9624662	test: 156.1618400	best: 156.1618400 (400)	total: 28.8ms	remaining: 115ms
DEBUG:lightautoml.ml_algo.boost_cb:500:	learn: 94.9624662	test: 156.1618391	best:
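
The scores in the log appear to be negative mean squared errors (the tuner line labels its value `mse`), so they are easier to read after conversion to a root-mean-square error in the units of the target. A small sketch using the three out-of-fold scores reported in the log; the sign convention is inferred from the log itself and is an assumption:

```python
import math

# Out-of-fold scores copied from the training log (assumed to be negative MSE)
scores = {
    'LinearL2': -10059.717400570567,
    'LightGBM': -14707.618009136819,
    'CatBoost': -13496.901950175428,
}

# Flip the sign, then take the square root to get an error in target units
rmse = {name: math.sqrt(-score) for name, score in scores.items()}

for name, value in rmse.items():
    print(f'{name}: RMSE ~ {value:.1f} uW/W')
```

On this scale the linear model has the smallest out-of-fold error of the three, which is plausible for a training set of only 21 rows.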

## Model prediction

In [236]:
%%time

test_predictions = automl.predict(test_data)
print(f'Prediction for test_data:\n{test_predictions}\nShape = {test_predictions.shape}')

In [237]:
y_true = test_data[TARGET_NAME]
y_pred = test_predictions.data
y_true

In [238]:
print(automl.create_model_str_desc())

## Model prediction accuracy by MAPE

In [239]:
from sklearn.metrics import mean_absolute_percentage_error
mape_score = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction, not a percentage
print(mape_score)
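
MAPE is the mean of the absolute errors divided by the absolute true values. The plain-Python sketch below shows the computation on made-up numbers (not taken from the dataset); scikit-learn returns the value as a fraction, so 0.1 means 10 %. The function name `mape` and the demo values are illustrative assumptions:

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes no true value is zero."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios)


# Illustrative values only
y_true_demo = [100.0, 200.0, 400.0]
y_pred_demo = [110.0, 180.0, 400.0]
print(mape(y_true_demo, y_pred_demo))  # relative errors are 0.1, 0.1 and 0.0
```

Because each error is divided by the true value, devices with a small target value weigh heavily in this metric.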

## Feature analysis

In [None]:
%%time

# Accurate feature importance calculation with detailed info (permutation importances); this can take a long time
accurate_fi = automl.get_feature_scores('accurate', test_data, silent = True)

In [None]:
accurate_fi.set_index('Feature')['Importance'].plot.bar(figsize = (30, 10), grid = True)
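
The idea behind permutation importance, which the comment above names as the method behind the `'accurate'` mode, is to shuffle one feature at a time and measure how much the error grows. The self-contained sketch below illustrates the idea on a toy model; it is not LightAutoML's implementation, and `permutation_importance`, `toy_model` and the data are hypothetical:

```python
import random


def mse(y_true, y_pred):
    """Mean squared error of two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y_true, n_repeats=10, seed=42):
    """Average increase in MSE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = mse(y_true, [predict(row) for row in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
            increases.append(mse(y_true, [predict(row) for row in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    # The toy model depends on the first feature only
    return 2.0 * row[0]


rows = [[float(i), float(i % 3)] for i in range(10)]
y = [toy_model(row) for row in rows]
importances = permutation_importance(toy_model, rows, y)
print(importances)
```

Shuffling the second feature leaves the predictions unchanged, so its importance is zero, while shuffling the first one raises the error.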

# Model saving

In [None]:
import joblib
joblib.dump(automl, 'model.pkl')  # save the model
# automl = joblib.load('model.pkl')  # load the model