# 🚀 Python package-style code
### With package code stored in datasets: LightGBM and TabNet

This notebook contains the code for model training and inference.   
Kaggle code is normally written in **ipynb** (notebook) style.    
I changed the code to **py package** style, which makes it easier to run training from shell commands.

The code is based on the original notebook below. Thanks to @chumajin.

[[Notebook] Reference Notebook by chumajin](https://www.kaggle.com/chumajin/optiver-realized-ensemble-tabnet-and-lgbm)

------

### This notebook is the code for the final prediction.
#### Anyone can download my public code dataset 'volatility'.
#### It includes:
* **preprocessing and feature engineering**
* **lgbm train and predict**
* **tabnet train and predict**

The 'volatility_2021.ipynb' notebook in the volatility code dataset is the local version of this notebook.  
Enjoy!

----
### My Public datasets
`1. Volatility` : source code of py files
* prepare : preprocessing and feature engineering
* light_gbm : train and predict
* tabnet : train and predict


`2. volatility-data` : feature engineered data
* preprocessed_test.csv
* preprocessed_train.csv

----

## 0. Prepare
### 0-1. Install and Import

In [None]:
!pip install ../input/pytorchtabnet/pytorch_tabnet-3.1.1-py3-none-any.whl

In [None]:
import os, sys, shutil, glob
import numpy as np
import pandas as pd
from tqdm import tqdm

### 0-2. Make directory for working

In [None]:
def make_directory(folder):
    if os.path.isdir(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)
    print(f"build directory : < {folder} >")

In [None]:
# tmp path
submission_dir = '/kaggle/working'
working_dir = '/kaggle/working_space'
tmp_output_dir = '/kaggle/tmp_output'
tmp_data_dir = '/kaggle/tmp'
original_data_dir = '/kaggle/input/optiver-realized-volatility-prediction'

# source and working_space path
source_path = '/kaggle/input/volatility/src'
prepare_path = os.path.join(working_dir, 'prepare')
lgbm_path = os.path.join(working_dir, 'light_gbm')
tabnet_path = os.path.join(working_dir, 'tabnet')

# make directory
make_directory(tmp_output_dir)
make_directory(tmp_data_dir)

# copy code to working_space (copytree requires that dst does not exist yet)
if os.path.isdir(working_dir):
    shutil.rmtree(working_dir)
shutil.copytree(src=source_path, dst=working_dir)

### 0-3. Preprocessing and Feature Engineering

```py
!python preprocessing.py --data_dir='location of raw data'\
                         --tmp_data_dir='location for feature engineered data'\
                         --train_data\
                         --test_data
# --train_data, --test_data : store_true flags that select whether to process train / test data

```
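
As an illustration of how `store_true` flags like these behave, here is a minimal `argparse` sketch. It is not the actual content of `preprocessing.py`: the function name `build_parser` and the help texts are hypothetical, and only the flag names are taken from the `preprocessing.py` call in this section.

```python
import argparse


def build_parser():
    # Hypothetical parser: the real preprocessing.py may declare its flags differently.
    parser = argparse.ArgumentParser(description="preprocessing and feature engineering (sketch)")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="location of the raw data")
    parser.add_argument("--tmp_data_dir", type=str, required=True,
                        help="location for the feature engineered data")
    # store_true flags take no value: they are False unless the flag is passed
    parser.add_argument("--train_data", action="store_true",
                        help="preprocess the train data")
    parser.add_argument("--test_data", action="store_true",
                        help="preprocess the test data")
    return parser


# Same arguments as a test-only preprocessing run
args = build_parser().parse_args([
    "--data_dir=/kaggle/input/optiver-realized-volatility-prediction",
    "--tmp_data_dir=/kaggle/tmp",
    "--test_data",
])
print(args.train_data, args.test_data)
```

With only `--test_data` passed, `args.train_data` stays `False`, so a script written this way would skip the train data.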

In [None]:
%cd $prepare_path

# preprocessing test data

!python preprocessing.py --data_dir=$original_data_dir\
                         --tmp_data_dir=$tmp_data_dir\
                         --test_data

# only the test data is preprocessed here
# the train set uses the preprocessed train data from the volatility-data dataset

## 1. Model Inference

### 1-1. LightGBM



In [None]:
%cd $lgbm_path

# lightgbm prediction

lgb_model_dir = 'models'

!python predict_test.py --data=$tmp_data_dir\
                        --save_dir=$tmp_output_dir\
                        --save_sub\
                        --model_dir=$lgb_model_dir
print('predict by lightgbm')

### 1-2. TabNet

A trained TabNet model is saved as a **zip** file.  
However, zip files are automatically unpacked when a dataset is mounted,  
so the model folders need to be zipped again.

Use **shutil.make_archive**, not the **!zip** command.  
I referred to the `save_model` and `load_model` functions in the original TabNet code.  

https://github.com/dreamquark-ai/tabnet/blob/develop/pytorch_tabnet/abstract_model.py
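
The re-zipping step can be sketched as a small helper that also checks the archive content. This is only a sketch under assumptions: the helper name `rezip_model_dir` is hypothetical, and the two expected file names are what `save_model` in `abstract_model.py` appears to write, so please verify them against your installed `pytorch_tabnet` version. The demo uses placeholder files, not a real model.

```python
import os
import shutil
import tempfile
import zipfile

# Assumption: file names that pytorch_tabnet's save_model appears to write
EXPECTED_FILES = {"model_params.json", "network.pt"}


def rezip_model_dir(model_dir):
    """Create <model_dir>.zip from an unpacked model directory and return its path."""
    # root_dir=model_dir puts the files at the top level of the archive
    zip_path = shutil.make_archive(model_dir, "zip", model_dir)
    with zipfile.ZipFile(zip_path) as archive:
        missing = EXPECTED_FILES - set(archive.namelist())
    if missing:
        raise FileNotFoundError(f"{zip_path} is missing {sorted(missing)}")
    return zip_path


# Demo on a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as root:
    fold_dir = os.path.join(root, "fold_0")
    os.makedirs(fold_dir)
    for name in EXPECTED_FILES:
        with open(os.path.join(fold_dir, name), "w") as f:
            f.write("placeholder")

    zip_path = rezip_model_dir(fold_dir)
    zip_name = os.path.basename(zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archived = sorted(archive.namelist())

print(zip_name, archived)
```

The archive is written next to the directory (`fold_0.zip` beside `fold_0/`), so the directory itself is left untouched.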

In [None]:
model_dir = '0825_1015'

model_folder = os.path.join(tabnet_path, 'models', model_dir, 'tabnet_' + model_dir)
model_folder

In [None]:
%cd $model_folder

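# re-zip only the unpacked fold directories (skip any existing .zip files)
fold_dirs = sorted(p for p in glob.glob('./*') if os.path.isdir(p))
for i, model in enumerate(fold_dirs):
    print(f"zip model fold {i} again")
    shutil.make_archive(model, "zip", model)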

In [None]:
%cd $tabnet_path

# tabnet prediction

preprocessed_train_data_path = '/kaggle/input/volatility-data/preprocessed_data'

!python predict_test.py --train_data=$preprocessed_train_data_path\
                        --test_data=$tmp_data_dir\
                        --save_dir=$tmp_output_dir\
                        --save_sub\
                        --model_dir=$model_dir

print('predict by tabnet')

In [None]:
%cd /kaggle/working

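# ensemble: simple average of the 'target' column over all saved predictions
# (assumes every prediction file has its rows in the same order)
results = sorted(glob.glob(tmp_output_dir + '/*'))
submission = pd.read_csv(results[0])
targets = 0
for result in results:
    target = pd.read_csv(result)['target']
    targets += target
targets = targets/len(results)
submission['target'] = targets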
submission.to_csv("submission.csv", index=False)
submission
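
The averaging above adds the `target` columns row by row, so it relies on every prediction file having its rows in the same order. If that is not guaranteed, the rows can be matched by an id column first. The sketch below assumes the prediction files contain a `row_id` column next to `target` (an assumption, since only `target` is read above); the function name `average_predictions` is hypothetical and the numbers are made up for illustration.

```python
import pandas as pd


def average_predictions(frames, key="row_id", target="target"):
    """Average the target column of several prediction frames, matching rows by key."""
    stacked = pd.concat([frame[[key, target]] for frame in frames], ignore_index=True)
    # sort=False keeps the keys in order of first appearance
    return stacked.groupby(key, sort=False)[target].mean().reset_index()


# Made-up predictions from two models, with rows in a different order
lgbm_pred = pd.DataFrame({"row_id": ["0-4", "0-32"], "target": [0.002, 0.004]})
tabnet_pred = pd.DataFrame({"row_id": ["0-32", "0-4"], "target": [0.006, 0.004]})

blend = average_predictions([lgbm_pred, tabnet_pred])
print(blend)
```

Here `0-4` averages 0.002 and 0.004, and `0-32` averages 0.004 and 0.006, regardless of the row order in each file.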

In [None]:
# delete tmp directory

if os.path.isdir(working_dir):
    shutil.rmtree(working_dir)
if os.path.isdir(tmp_output_dir):
    shutil.rmtree(tmp_output_dir)
if os.path.isdir(tmp_data_dir):
    shutil.rmtree(tmp_data_dir)

In [None]:
# complete