## Hello colleagues 😎

## This notebook is a copy of @tsano430's [notebook](https://www.kaggle.com/tsano430/lightautoml-starter); only the LightAutoML part is changed, so please upvote the original before reading this one. It is really amazing 👍

## There are 3 changes here, which should improve the score:
- Fixed the loss in the `Task` object: it should be MAE, because the evaluation metric is MAE
- Fixed the roles: when `breath_id` is set as the group column, it is automatically excluded from the feature set, so there is no need to drop it manually
- Changed the params of the `TabularAutoML` run: increased the tuning time limit and removed the slow CatBoost models to make the model faster

## Please enjoy and do not forget to upvote us on [Github](https://github.com/sberbank-ai-lab/LightAutoML) ⭐️

## References

- https://www.kaggle.com/alexryzhkov/tps-july-21-lightautoml-baseline
- https://lightautoml.readthedocs.io/en/latest/
- https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models
- https://www.kaggle.com/junhyeok99/tensorflow
- https://www.kaggle.com/tolgadincer/tensorflow-bidirectional-lstm-0-234

## LightAutoML installation

In [None]:
!pip install -U lightautoml

## Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
import torch
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task
from lightautoml.dataset.roles import CategoryRole

Here we set up the constants used in the kernel:

- `N_THREADS` - number of vCPUs for LightAutoML model creation
- `N_FOLDS` - number of folds in LightAutoML inner CV
- `RANDOM_STATE` - random seed for better reproducibility
- `TIMEOUT` - time limit in seconds for model training
- `TARGET_NAME` - target column name in the dataset

In [None]:
N_THREADS = 4
N_FOLDS = 5
RANDOM_STATE = 42
TIMEOUT = 36000
TARGET_NAME = 'pressure'

In [None]:
# fix the NumPy seed for reproducibility
np.random.seed(RANDOM_STATE)
# limit the number of CPU threads used by torch
torch.set_num_threads(N_THREADS)

## Data loading

In [None]:
train = pd.read_csv('../input/ventilator-pressure-prediction/train.csv')
test = pd.read_csv('../input/ventilator-pressure-prediction/test.csv')
sample_sub = pd.read_csv('../input/ventilator-pressure-prediction/sample_submission.csv')

In [None]:
print(train.shape)
train.head()

In [None]:
print(test.shape)
test.head()

## Add LSTM and bidirectional LSTM results

The predictions used here come from the notebooks [Tensorflow](https://www.kaggle.com/junhyeok99/tensorflow) and [Tensorflow Bidirectional LSTM (0.234)](https://www.kaggle.com/tolgadincer/tensorflow-bidirectional-lstm-0-234); thanks to their authors.

In [None]:
# LSTM
train_lstm = pd.read_csv('../input/googlebrainlstm/lstm_train.csv')
test_lstm = pd.read_csv('../input/googlebrainlstm/lstm_test.csv')
train['lstm_pred'] = train_lstm['pressure']
test['lstm_pred'] = test_lstm['pressure']

# Bidirectional LSTM
train_bilstm = pd.read_csv('../input/googlebrainbilstm/bilstm_train.csv')
test_bilstm = pd.read_csv('../input/googlebrainbilstm/bilstm_test.csv')
train['bilstm_pred'] = train_bilstm['pressure']
test['bilstm_pred'] = test_bilstm['pressure']
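
The assignments above rely on the prediction files having exactly the same row order as `train.csv` and `test.csv`. A minimal sketch of a key-based alternative is shown below, assuming (this is not confirmed by the source) that each prediction file also carries the `id` column. The helper name `add_pred_column` and the toy frames are hypothetical.

```python
import pandas as pd

def add_pred_column(df, pred_df, new_col, key='id', pred_col='pressure'):
    """Attach pred_df[pred_col] to df as new_col, matching rows by key."""
    # Hypothetical helper: assumes pred_df has a unique `key` column.
    mapping = pred_df.set_index(key)[pred_col]
    if not mapping.index.is_unique:
        raise ValueError(f'duplicate {key} values in the prediction file')
    out = df.copy()
    out[new_col] = out[key].map(mapping)
    if out[new_col].isna().any():
        raise ValueError(f'some {key} values have no prediction')
    return out

# Toy data: the prediction file is stored in a different row order.
toy_train = pd.DataFrame({'id': [1, 2, 3], 'u_in': [0.1, 0.2, 0.3]})
toy_lstm = pd.DataFrame({'id': [3, 1, 2], 'pressure': [30.0, 10.0, 20.0]})

toy_train = add_pred_column(toy_train, toy_lstm, 'lstm_pred')
print(toy_train)
```

If the files are known to share the row order of the competition data, the positional assignment used in the cell above gives the same result.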

## Feature engineering

The feature engineering below is taken from the notebook [Ventilator Pressure Prediction: EDA, FE and models](https://www.kaggle.com/artgor/ventilator-pressure-prediction-eda-fe-and-models); thanks to its author.

In [None]:
# lag features, rewritten from this notebook: https://www.kaggle.com/patrick0302/add-lag-u-in-as-new-feat
# some ideas are taken from this notebook: https://www.kaggle.com/mst8823/google-brain-lightgbm-baseline
train['last_value_u_in'] = train.groupby('breath_id')['u_in'].transform('last')
train['u_in_lag1'] = train.groupby('breath_id')['u_in'].shift(1)
train['u_out_lag1'] = train.groupby('breath_id')['u_out'].shift(1)
train['u_in_lag_back1'] = train.groupby('breath_id')['u_in'].shift(-1)
train['u_out_lag_back1'] = train.groupby('breath_id')['u_out'].shift(-1)
train['u_in_lag2'] = train.groupby('breath_id')['u_in'].shift(2)
train['u_out_lag2'] = train.groupby('breath_id')['u_out'].shift(2)
train['u_in_lag_back2'] = train.groupby('breath_id')['u_in'].shift(-2)
train['u_out_lag_back2'] = train.groupby('breath_id')['u_out'].shift(-2)
train = train.fillna(0)

train['R__C'] = train["R"].astype(str) + '__' + train["C"].astype(str)

# max value of u_in and u_out for each breath
train['breath_id__u_in__max'] = train.groupby(['breath_id'])['u_in'].transform('max')
train['breath_id__u_out__max'] = train.groupby(['breath_id'])['u_out'].transform('max')

# difference between consecutive values
train['u_in_diff1'] = train['u_in'] - train['u_in_lag1']
train['u_out_diff1'] = train['u_out'] - train['u_out_lag1']
train['u_in_diff2'] = train['u_in'] - train['u_in_lag2']
train['u_out_diff2'] = train['u_out'] - train['u_out_lag2']
# from here: https://www.kaggle.com/yasufuminakama/ventilator-pressure-lstm-starter
train.loc[train['time_step'] == 0, 'u_in_diff1'] = 0
train.loc[train['time_step'] == 0, 'u_out_diff1'] = 0

# difference between the current value of u_in and the max value within the breath
train['breath_id__u_in__diffmax'] = train.groupby(['breath_id'])['u_in'].transform('max') - train['u_in']
train['breath_id__u_in__diffmean'] = train.groupby(['breath_id'])['u_in'].transform('mean') - train['u_in']

# https://www.kaggle.com/c/ventilator-pressure-prediction/discussion/273974
train['u_in_cumsum'] = train.groupby(['breath_id'])['u_in'].cumsum()
train['time_step_cumsum'] = train.groupby(['breath_id'])['time_step'].cumsum()
# https://www.kaggle.com/yasufuminakama/ventilator-pressure-lstm-starter
train['breath_time'] = train['time_step'] - train.groupby('breath_id')['time_step'].shift(1)

In [None]:
# the same feature engineering for the test data
test['last_value_u_in'] = test.groupby('breath_id')['u_in'].transform('last')
test['u_in_lag1'] = test.groupby('breath_id')['u_in'].shift(1)
test['u_out_lag1'] = test.groupby('breath_id')['u_out'].shift(1)
test['u_in_lag_back1'] = test.groupby('breath_id')['u_in'].shift(-1)
test['u_out_lag_back1'] = test.groupby('breath_id')['u_out'].shift(-1)
test['u_in_lag2'] = test.groupby('breath_id')['u_in'].shift(2)
test['u_out_lag2'] = test.groupby('breath_id')['u_out'].shift(2)
test['u_in_lag_back2'] = test.groupby('breath_id')['u_in'].shift(-2)
test['u_out_lag_back2'] = test.groupby('breath_id')['u_out'].shift(-2)
test = test.fillna(0)

test['R__C'] = test["R"].astype(str) + '__' + test["C"].astype(str)

test['breath_id__u_in__max'] = test.groupby(['breath_id'])['u_in'].transform('max')
test['breath_id__u_out__max'] = test.groupby(['breath_id'])['u_out'].transform('max')

test['u_in_diff1'] = test['u_in'] - test['u_in_lag1']
test['u_out_diff1'] = test['u_out'] - test['u_out_lag1']
test['u_in_diff2'] = test['u_in'] - test['u_in_lag2']
test['u_out_diff2'] = test['u_out'] - test['u_out_lag2']
test.loc[test['time_step'] == 0, 'u_in_diff1'] = 0
test.loc[test['time_step'] == 0, 'u_out_diff1'] = 0

test['breath_id__u_in__diffmax'] = test.groupby(['breath_id'])['u_in'].transform('max') - test['u_in']
test['breath_id__u_in__diffmean'] = test.groupby(['breath_id'])['u_in'].transform('mean') - test['u_in']

test['u_in_cumsum'] = test.groupby(['breath_id'])['u_in'].cumsum()
test['time_step_cumsum'] = test.groupby(['breath_id'])['time_step'].cumsum()

test['breath_time'] = test['time_step'] - test.groupby('breath_id')['time_step'].shift(1)
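
The train and test cells above repeat the same transformations line by line. One way to keep them in sync is to wrap the steps in a function and call it for both frames. The sketch below covers only the lag, lead and difference features; `add_lag_features` and the toy frame are hypothetical names introduced for illustration.

```python
import pandas as pd

def add_lag_features(df, cols=('u_in', 'u_out'), lags=(1, 2)):
    """Add per-breath lag, lead and difference features for the given columns."""
    out = df.copy()
    for col in cols:
        for lag in lags:
            # shift(lag) looks back, shift(-lag) looks ahead, both within one breath
            out[f'{col}_lag{lag}'] = out.groupby('breath_id')[col].shift(lag)
            out[f'{col}_lag_back{lag}'] = out.groupby('breath_id')[col].shift(-lag)
    # rows without a neighbour inside the breath get 0, as in the cells above
    out = out.fillna(0)
    for col in cols:
        for lag in lags:
            out[f'{col}_diff{lag}'] = out[col] - out[f'{col}_lag{lag}']
    return out

# Toy data: two breaths with three time steps each
toy = pd.DataFrame({
    'breath_id': [1, 1, 1, 2, 2, 2],
    'u_in': [1.0, 2.0, 4.0, 5.0, 7.0, 8.0],
    'u_out': [0, 0, 1, 0, 0, 1],
})
toy_fe = add_lag_features(toy)
print(toy_fe[['breath_id', 'u_in', 'u_in_lag1', 'u_in_lag_back1', 'u_in_diff1']])
```

Calling the same function on `train` and on `test` would then produce identically named columns in both frames.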

## Feature engineering (LSTM prediction lags)

In [None]:
train['lstm_pred_lag1'] = train.groupby('breath_id')['lstm_pred'].shift(1)
train['lstm_pred_lag_back1'] = train.groupby('breath_id')['lstm_pred'].shift(-1)
train['bilstm_pred_lag1'] = train.groupby('breath_id')['bilstm_pred'].shift(1)
train['bilstm_pred_lag_back1'] = train.groupby('breath_id')['bilstm_pred'].shift(-1)
train['lstm_pred_lag2'] = train.groupby('breath_id')['lstm_pred'].shift(2)
train['lstm_pred_lag_back2'] = train.groupby('breath_id')['lstm_pred'].shift(-2)
train['bilstm_pred_lag2'] = train.groupby('breath_id')['bilstm_pred'].shift(2)
train['bilstm_pred_lag_back2'] = train.groupby('breath_id')['bilstm_pred'].shift(-2)
train = train.fillna(0)

In [None]:
test['lstm_pred_lag1'] = test.groupby('breath_id')['lstm_pred'].shift(1)
test['lstm_pred_lag_back1'] = test.groupby('breath_id')['lstm_pred'].shift(-1)
test['bilstm_pred_lag1'] = test.groupby('breath_id')['bilstm_pred'].shift(1)
test['bilstm_pred_lag_back1'] = test.groupby('breath_id')['bilstm_pred'].shift(-1)
test['lstm_pred_lag2'] = test.groupby('breath_id')['lstm_pred'].shift(2)
test['lstm_pred_lag_back2'] = test.groupby('breath_id')['lstm_pred'].shift(-2)
test['bilstm_pred_lag2'] = test.groupby('breath_id')['bilstm_pred'].shift(2)
test['bilstm_pred_lag_back2'] = test.groupby('breath_id')['bilstm_pred'].shift(-2)
test = test.fillna(0)

In [None]:
train.info()

## LightAutoML model building

### Task setup

In the cell below we create the `Task` object, which defines the task the LightAutoML model should solve, with a specific loss and metric if necessary (more info can be found [here](https://lightautoml.readthedocs.io/en/latest/generated/lightautoml.tasks.base.Task.html#lightautoml.tasks.base.Task) in our documentation):

In [None]:
task = Task('reg', loss='mae', metric='mae')
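
The loss is set to MAE to match the evaluation metric. As a small illustration of why the choice matters (a generic NumPy sketch, not LightAutoML code): the constant prediction that minimizes MAE is the median of the target, while the one that minimizes MSE is the mean, so on skewed targets the two losses pull a model toward different values. The toy target and the grid of candidate constants are made up for the example.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 50.0])  # toy target with one large value

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# try constant predictions on a grid from 0 to 50 with step 0.1
candidates = np.linspace(0, 50, 501)
best_for_mae = candidates[np.argmin([mae(y, c) for c in candidates])]
best_for_mse = candidates[np.argmin([mse(y, c) for c in candidates])]

print('median:', np.median(y), 'best constant for MAE:', best_for_mae)
print('mean:', np.mean(y), 'best constant for MSE:', best_for_mse)
```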

### Feature roles setup

To solve the task, we need to set up column roles. The only role you must set is the target role; everything else (drop, numeric, categorical, group, weights etc.) is up to the user, since LightAutoML models detect column types automatically:

In [None]:
roles = {
    'drop': ['id'],
    'group': 'breath_id', # for group k-fold
    'target': TARGET_NAME
}
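
Using `breath_id` as the group for group k-fold means that all rows of one breath land in the same fold, so a validation fold never contains time steps of a breath that was also used for training. The sketch below illustrates the idea with plain NumPy; it is not LightAutoML's internal implementation, and `group_kfold_indices` is a hypothetical helper.

```python
import numpy as np

def group_kfold_indices(groups, n_folds, seed=42):
    """Yield (train_idx, valid_idx) pairs where no group is split across folds."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique_groups)
    # deal the shuffled groups into n_folds roughly equal parts
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    row_fold = np.array([fold_of_group[g] for g in groups])
    for fold in range(n_folds):
        yield np.where(row_fold != fold)[0], np.where(row_fold == fold)[0]

# Toy data: 6 breaths with 3 time steps each
breath_id = np.repeat(np.arange(6), 3)
for train_idx, valid_idx in group_kfold_indices(breath_id, n_folds=3):
    shared = set(breath_id[train_idx]) & set(breath_id[valid_idx])
    valid_breaths = sorted(int(b) for b in set(breath_id[valid_idx]))
    print('valid breaths:', valid_breaths, 'shared with train:', len(shared))
```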

### LightAutoML model creation - TabularAutoML preset

In the next cell we create a LightAutoML model with the `TabularAutoML` class, a preset with a default model structure like in the image below, in just several lines:

![LightAutoML model](https://raw.githubusercontent.com/sberbank-ai-lab/LightAutoML/master/imgs/tutorial_blackbox_pipeline.png "LightAutoML model")

Let's discuss the params we can set up:

- `task` - the type of the ML task (the only mandatory parameter)
- `timeout` - time limit in seconds for model training
- `cpu_limit` - number of vCPUs for the model to use
- `reader_params` - parameters of the Reader object inside the preset, which handles the first step of data preparation: automatic feature type detection, preliminary removal of almost-constant features, correct CV setup etc. For example, we set `n_jobs` threads for the type detection algorithm, the number of `cv` folds and `random_state` as the CV seed.
- `general_params` - we use the `use_algos` key to set the model structure (linear and LGBM models on the first level and their weighted composition on the second). This setup only speeds up the kernel; you can remove `general_params` if you want the whole LightAutoML model to run.
- `tuning_params` - we use the `max_tuning_time` key to set the time limit in seconds for hyperparameter tuning.

In [None]:
%%time

# Fitting
automl = TabularAutoML(task=task,
                       timeout=TIMEOUT,
                       cpu_limit=N_THREADS,
                       reader_params={'n_jobs': N_THREADS, 'cv': N_FOLDS, 'random_state': RANDOM_STATE},
                       general_params={'use_algos': [['linear_l2', 'lgb', 'lgb_tuned']]},
                       tuning_params={'max_tuning_time': 1800},
                      )
automl.fit_predict(train, roles=roles)
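
`fit_predict` returns out-of-fold predictions for the training rows, which can be used to estimate the score locally. The competition scores only the inspiratory phase (rows with `u_out == 0`), so a masked MAE is closer to the leaderboard metric than the MAE over all rows. Below is a minimal sketch with toy arrays standing in for the real columns; `masked_mae` is a hypothetical helper, and rows with NaN predictions (possible if some folds were not fitted before the timeout) are skipped.

```python
import numpy as np

def masked_mae(y_true, y_pred, u_out):
    """MAE over rows with u_out == 0 and a non-NaN prediction."""
    y_true, y_pred, u_out = map(np.asarray, (y_true, y_pred, u_out))
    mask = (u_out == 0) & ~np.isnan(y_pred)
    return np.mean(np.abs(y_true[mask] - y_pred[mask]))

# Toy stand-ins for train['pressure'], the OOF predictions and train['u_out']
y_true = np.array([10.0, 12.0, 15.0, 8.0, 6.0])
oof = np.array([11.0, 12.0, np.nan, 5.0, 6.0])
u_out = np.array([0, 0, 0, 1, 1])

print(masked_mae(y_true, oof, u_out))
```

With the notebook's objects this could be called as `masked_mae(train[TARGET_NAME], oof_pred.data[:, 0], train['u_out'])`, where `oof_pred` is the value returned by `automl.fit_predict(train, roles=roles)`.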

In [None]:
# Prediction
test_pred = automl.predict(test)
sample_sub[TARGET_NAME] = test_pred.data[:, 0]

## Feature importance

For feature importance calculation we have 2 different methods in LightAutoML:

- Fast (`fast`) - this method uses the feature importances from the feature selector LGBM model inside LightAutoML. It works extremely fast and is almost always available (the exceptions are cases where feature selection is turned off, or the selector was removed from the final models together with all GBM models). It does not need new labelled data.
- Accurate (`accurate`) - this method calculates permutation importances of the features for the whole LightAutoML model based on new labelled data. It always works but can take a lot of time to finish (depending on the model structure, the size of the new labelled dataset etc.).
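
To make the second method more concrete, the idea behind permutation importance can be sketched as follows: shuffle one feature column at a time on labelled data and measure how much the error grows. This is a generic NumPy illustration with a toy least-squares model, not LightAutoML's implementation; `permutation_importance_mae` and the toy data are hypothetical.

```python
import numpy as np

def permutation_importance_mae(predict, X, y, n_repeats=5, seed=42):
    """Increase in MAE after shuffling each column of X (larger = more important)."""
    rng = np.random.default_rng(seed)
    base = np.mean(np.abs(y - predict(X)))
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores[j] += np.mean(np.abs(y - predict(X_perm))) - base
    return scores / n_repeats

# Toy data: the target depends strongly on column 0, weakly on column 1, not on column 2
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Toy model: ordinary least squares fitted with NumPy
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importance = permutation_importance_mae(lambda data: data @ coef, X, y)
print(importance)
```

The cost grows with the number of features and repeats, because every shuffle requires a new prediction pass, which is why this kind of method is slower than reading importances from an already fitted model.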

In [None]:
fi_score = automl.get_feature_scores('fast').sort_values('Importance', ascending=True)

In [None]:
plt.figure(figsize=(10, 30))
fi_score.set_index('Feature')['Importance'].plot.barh(fontsize=16)
plt.title('Feature importance', fontsize=18)
plt.show()

## Create submission file

In [None]:
sample_sub.head()

In [None]:
sample_sub.to_csv('submission.csv', index=False)