# Tutorial 1: Basics


In this tutorial you will learn how to:
* run LightAutoML GPU version training on tabular data
* obtain feature importances and reports
* configure resource usage in LightAutoML


### 0.1. Import libraries

Here we import the libraries used in this kernel:
- Standard Python libraries for timing, working with the OS, etc.
- Essential Python DS libraries such as numpy, pandas, scikit-learn and torch (torch is used in section 0.3 to limit the thread count)
- LightAutoML modules: the AutoML preset, the task and the report generation module

In [None]:
# Standard python libraries
import os
import time

# Essential DS libraries
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import torch

# LightAutoML presets, task and report generation
from lightautoml.automl.presets.gpu.tabular_gpu_presets import TabularAutoML_gpu
from lightautoml.tasks import Task
from lightautoml.report.gpu import ReportDeco

### 0.2. Constants

Here we set up the constants used in the kernel:
- `N_THREADS` - number of vCPUs for LightAutoML model creation
- `N_FOLDS` - number of folds in LightAutoML inner CV
- `RANDOM_STATE` - random seed for better reproducibility
- `TEST_SIZE` - holdout data part size
- `TIMEOUT` - time limit in seconds for the model to train
- `TARGET_NAME` - target column name in dataset

In [None]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TEST_SIZE = 0.2
TIMEOUT = 300
TARGET_NAME = 'class'

In [None]:
DATASET_DIR = './data/'
DATASET_NAMES = ['higgs.csv', 'Fashion-MNIST.csv']
DATASET_FULLNAME = [os.path.join(DATASET_DIR, name) for name in DATASET_NAMES]

### 0.3. Imported models setup

For better reproducibility, fix the numpy random seed and cap the number of threads Torch may use (by default it usually tries to use all the threads on the server):

In [None]:
np.random.seed(RANDOM_STATE)
torch.set_num_threads(N_THREADS)

### 0.4. Data loading
Let's check the data we have:

In [None]:
data = pd.read_csv('./data/higgs.csv')
data.head()

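# Dataset description: source path, target column, task type and CSV parsing options
data_info_ = {
    'path': 'openml/higgs.csv',
    'target': 'class',
    'task_type': 'binary',
    'read_csv_params': {'na_values': '?'}
}

# Some columns mark missing values with '?'; replace them with NaN and cast to float32
for col in data.columns:
    if data[col].isin(['?']).any():
        data[col] = data[col].replace('?', np.nan).astype(np.float32)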

### 0.5. Data splitting for train-holdout
As we have only one file with target values, we can split it into 80%-20% for holdout usage:

In [None]:
tr_data, te_data = train_test_split(
    data, 
    test_size=TEST_SIZE, 
    stratify=data['class'], 
    random_state=RANDOM_STATE
)

print(f'Data split. Part sizes: tr_data = {tr_data.shape}, te_data = {te_data.shape}')

tr_data.head()
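
To illustrate what the `stratify` argument is for, here is a minimal pure-Python sketch of a stratified split on toy labels. The helper `stratified_split` and the 70/30 toy labels are made up for this illustration; this is not how scikit-learn implements `train_test_split`, only a picture of the idea that each class keeps roughly the same share in both parts.

```python
import random
from collections import Counter


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so that each class keeps a similar share."""
    rng = random.Random(seed)
    # Group row indices by class label
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Split every class separately with the same test fraction
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 70 rows of class 0 and 30 rows of class 1
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
train_share = Counter(labels[i] for i in train_idx)
test_share = Counter(labels[i] for i in test_idx)
print(train_share, test_share)
```

Both parts keep the 70/30 class ratio, which is the property the holdout split above relies on for a fair ROC AUC comparison.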

## 1. Task definition

### 1.1. Task type


In the cell below we create a Task object, the class that defines which task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found [here](https://lightautoml.readthedocs.io/en/latest/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task) in our documentation):

In [None]:
task = Task('binary', device='gpu')

### 1.2. Feature roles setup

To solve the task, we need to set up column roles. The **only role you must set up is the target role**; everything else (drop, numeric, categorical, group, weights etc.) is up to the user, because LightAutoML models have automatic column typization inside:

In [None]:
roles = {
    'target': 'class',
}
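
As a sketch of what a fuller roles dict could look like, the example below adds a `drop` role next to the mandatory target. The column name `row_id` is hypothetical and does not exist in the dataset used here, and the sketch assumes the GPU preset accepts the role names listed above.

```python
# Minimal roles dict: only the target role is mandatory
roles_minimal = {
    'target': 'class',
}

# Extended variant. 'row_id' is a hypothetical identifier column used only
# for illustration; it is not part of the dataset from this tutorial.
roles_extended = {
    'target': 'class',
    'drop': ['row_id'],
}

# Columns that are not mentioned in any role are left to automatic typization
explicit_columns = {roles_extended['target'], *roles_extended['drop']}
print(sorted(explicit_columns))
```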

### 1.3. LightAutoML model creation - TabularAutoML preset

In the next cell we create a LightAutoML model in just a few lines with the `TabularAutoML_gpu` class, the GPU counterpart of the `TabularAutoML` preset. Its default model structure is shown in the image below:

<img src="../../imgs/tutorial_blackbox_pipeline.png" alt="TabularAutoML preset pipeline" style="width:85%;"/>

Let's discuss the params we can set up:
- `task` - the type of the ML task (the only **must have** parameter)
- `timeout` - time limit in seconds for the model to train
- `cpu_limit` - vCPU count for the model to use
- `reader_params` - parameter overrides for the Reader object inside the preset, which works on the first step of data preparation: automatic feature typization, preliminary filtering of almost-constant features, correct CV setup etc. For example, we set `n_jobs` threads for the typization algo, `cv` folds and `random_state` as the seed inside CV.

**Important note**: the `reader_params` key is one of the YAML config keys used inside the `TabularAutoML` preset. [More details](https://github.com/sberbank-ai-lab/LightAutoML/blob/master/lightautoml/automl/presets/tabular_config.yml) on the config structure, with explanatory comments, can be found at the attached link. Each key from this config can be overridden with user settings during preset object initialization. For more info about the different parameter settings (for example, the ML algos that can be used in `general_params->use_algos`), please take a look at our [article on TowardsDataScience](https://towardsdatascience.com/lightautoml-preset-usage-tutorial-2cce7da6f936).

Moreover, to receive the automatic report for our model we will use the `ReportDeco` decorator (see section 4.1) and work with the decorated version in the same way as with the usual one.

In [None]:
automl = TabularAutoML_gpu(
    task=task,
    timeout=TIMEOUT
)

## 2. AutoML training

To run AutoML training, use the `fit_predict` method with the following arguments:

- `train_data` - dataset to train on.
- `roles` - roles dict.
- `verbose` - controls the verbosity: the higher, the more messages.
    - `<1`: messages are not displayed;
    - `>=1`: the computation process for layers is displayed;
    - `>=2`: the information about folds processing is also displayed;
    - `>=3`: the hyperparameters optimization process is also displayed;
    - `>=4`: the training process for every algorithm is displayed.

Note: out-of-fold predictions are calculated during training and returned from the `fit_predict` method.
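
To make the term "out-of-fold" concrete, here is a conceptual pure-Python sketch. It is not LightAutoML's internal code: the helper `out_of_fold_predictions` is hypothetical, the fold assignment is a simple modulo rule, and the "model" is just the mean target of the training folds. The point is only that every row is predicted by a model that never saw that row.

```python
def out_of_fold_predictions(targets, n_folds):
    """Predict each row with a 'model' fitted on the other folds only.

    The 'model' is the mean target of the training folds; it stands in
    for a real learner.
    """
    n = len(targets)
    oof = [None] * n
    for fold in range(n_folds):
        # Rows with index % n_folds == fold form the validation fold
        valid_idx = [i for i in range(n) if i % n_folds == fold]
        train_targets = [targets[i] for i in range(n) if i % n_folds != fold]
        fold_model = sum(train_targets) / len(train_targets)
        for i in valid_idx:
            oof[i] = fold_model
    return oof


targets = [0, 1, 1, 0, 1, 0]
oof = out_of_fold_predictions(targets, n_folds=3)
print(oof)
```

Because each prediction comes from a model trained without the corresponding row, a score computed on such predictions is a fairer estimate than a score on plain training predictions.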

In [None]:
%%time 
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)

## 3. Prediction on holdout and model evaluation

In [None]:
%%time

te_pred = automl.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')

In [None]:
# The target column name is taken from the roles dict defined in section 1.2
print(f'OOF score: {roc_auc_score(tr_data[roles["target"]].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[roles["target"]].values, te_pred.data[:, 0])}')
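
For intuition about the metric printed above, ROC AUC can be read as the share of (positive, negative) pairs in which the positive row receives the higher score. The sketch below computes it that way on four toy rows. The function `roc_auc` is written for this illustration only; it is quadratic in the number of rows, so `roc_auc_score` remains the right tool for real data.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError('Both classes are required to compute ROC AUC.')
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```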

## 4. Model analysis

### 4.1. Reports

You can obtain the description of the resulting pipeline:

In [None]:
print(automl.create_model_str_desc())

LightAutoML also provides `ReportDeco` for this purpose; use it to build reports:

In [None]:
RD = ReportDeco(output_path = 'tabularAutoML_model_report')

automl_rd = RD(
    TabularAutoML_gpu(
        task = task, 
        timeout = TIMEOUT,
        cpu_limit = N_THREADS,
        reader_params = {'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE}
    )
)

In [None]:
%%time
oof_pred = automl_rd.fit_predict(tr_data, roles = roles, verbose = 1)

The report is now available in the `tabularAutoML_model_report` folder:

In [None]:
!ls tabularAutoML_model_report
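
The wrapping pattern used above (a decorator object that exposes the same `fit_predict` and `predict` methods as the model and writes files as a side effect) can be pictured with a toy example. Everything in this sketch is hypothetical: `ToyModel`, `ToyReportWrapper` and the `report.txt` file name are invented for illustration and say nothing about how `ReportDeco` is implemented or which files it produces.

```python
import os
import tempfile


class ToyModel:
    """Stand-in for an AutoML preset; predicts the mean training target."""

    def fit_predict(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return [self.mean_] * len(targets)

    def predict(self, n_rows):
        return [self.mean_] * n_rows


class ToyReportWrapper:
    """Hypothetical wrapper: same interface as the model, plus a report file."""

    def __init__(self, output_path):
        self.output_path = output_path

    def __call__(self, model):
        # Calling the wrapper with a model returns the decorated object
        self.model = model
        return self

    def fit_predict(self, targets):
        preds = self.model.fit_predict(targets)
        os.makedirs(self.output_path, exist_ok=True)
        with open(os.path.join(self.output_path, 'report.txt'), 'w') as f:
            f.write(f'rows={len(targets)} mean_prediction={preds[0]}\n')
        return preds

    def predict(self, n_rows):
        return self.model.predict(n_rows)


# Write the toy report into a temporary directory
report_dir = os.path.join(tempfile.mkdtemp(), 'toy_report')
wrapped = ToyReportWrapper(output_path=report_dir)(ToyModel())
oof = wrapped.fit_predict([0, 1, 1, 0])
print(os.listdir(report_dir))
```

The caller uses `wrapped` exactly like the undecorated model, which is why the cells in this section look the same as the ones in sections 2 and 3.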

In [None]:
%%time

te_pred = automl_rd.predict(te_data)
print(f'Prediction for te_data:\n{te_pred}\nShape = {te_pred.shape}')

In [None]:
# The target column name is taken from the roles dict defined in section 1.2
print(f'OOF score: {roc_auc_score(tr_data[roles["target"]].values, oof_pred.data[:, 0])}')
print(f'HOLDOUT score: {roc_auc_score(te_data[roles["target"]].values, te_pred.data[:, 0])}')

## 5. Multi-GPU results

Here is an example of how to run a multi-GPU configuration.

In [None]:
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

In [None]:
# Set CUDA_VISIBLE_DEVICES to the GPU ids you want to use for training

# Same binary task as before, but on the multi-GPU device
task = Task('binary', device='mgpu')

automl = TabularAutoML_gpu(task=task,     
    timeout=TIMEOUT,
    config_path='./data/pf.yml')

In [None]:
%%time
oof_pred = automl.fit_predict(tr_data, roles = roles, verbose = 1)
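
The comment about `CUDA_VISIBLE_DEVICES` in the cell above can be sketched as follows. The value `'0,1'` is a hypothetical choice for a machine with at least two GPUs; the variable has to be set before any CUDA-backed library initializes the driver, so in a notebook this belongs in one of the first cells.

```python
import os

# Hypothetical choice: expose only the first two GPUs to this process.
# Set this before any CUDA-backed library initializes the driver.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

visible_gpu_ids = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
print(f'GPUs visible to this process: {visible_gpu_ids}')
```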

## Additional materials

- [Official LightAutoML github repo](https://github.com/sberbank-ai-lab/LightAutoML)
- [LightAutoML documentation](https://lightautoml.readthedocs.io/en/latest)