With the LGBM baseline from ref [1] in place, the next "headache" is tuning our model for better performance.

Several ideas came to mind just in time 😉: Is our data clean enough? Do we need normalization/standardization? Will certain LightGBM hyperparams give better results? And since the data is time-related, is time series analysis an option?

**Version Notes**
- V2.0 - Model Tuning | 3rd Sep 2021
- V1.0 - 1st published & submitted version | 29th Aug 2021

**Work Principles**
- A clean & comfortable format to help our brains digest
- Occam's Razor - Non sunt multiplicanda entia sine necessitate
- Refactoring 

**General Notes**
- This notebook focuses on ideas for tuning the model from ref [1], so it is mostly code snippets, which you can "digest" together with [1] to build a submission.

# Overview
- Libs Import
- Clean Data
    - Remove Columns
    - Imputing
    - Normalization
    - Pipeline
- GridSearchCV & RandomizedSearchCV
    - GridSearch
    - RandomSearch
    - Cloud Computing
- Summary
- Reference

# Libs Import

In [None]:
# Import order: data manipulation -> machine/deep learning -> utilities/helpers/improvement -> configuration
import pandas as pd
import numpy as np
import scipy as sc
import lightgbm as lgb  # needed by the search cells below
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV  # needed by the grid search cell below

import warnings
warnings.filterwarnings('ignore')
pd.options.display.max_columns = None # Show all columns, no truncation
pd.options.display.max_rows = None

In [None]:
# Here, we use the intermediate result from ref [1] (after feature processing)
train = pd.read_csv('../input/optiverrealizedvolatilitypredictionpreprocess/train_dataset.csv')
test = pd.read_csv('../input/optiverrealizedvolatilitypredictionpreprocess/test_dataset.csv')

# Clean Data

NaN or None values are a classic problem we need to handle in our dataset, like washing your veggies 🥬 before cutting them for dinner.

Let's see how "dirty" they are now...

**Remove Columns (as you reason/want)**

In [None]:
train.head()

In [None]:
train.columns

Oops. The 1st col looks weird, right... "Unnamed: 0". Where does it come from?

It appears because index=False was not set when the file was saved, as in *to_csv('submission.csv', index=False)*. Without it, pandas writes the index as an extra, unnamed column.
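To see the effect in isolation, here is a minimal sketch on a toy DataFrame (the column values are made up, and an in-memory buffer stands in for a real file):

```python
import io
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values)
df = pd.DataFrame({"stock_id": [0, 1], "target": [0.004, 0.002]})

# Saved WITH the index: reading it back yields an extra "Unnamed: 0" column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# Saved with index=False: only the real columns come back
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # ['Unnamed: 0', 'stock_id', 'target']
print(cols_without_index)  # ['stock_id', 'target']
```

If you cannot re-save the file, dropping the column after loading works just as well.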

Let's use it as an example for a snippet on removing columns.

In [None]:
remove_cols = ["Unnamed: 0"]
train = train.drop(remove_cols, axis=1)

In [None]:
train.head()

**Remove NaN/None Values**

In [None]:
# Check how many null values for each col
train.isnull().sum()

Several cols, like trade_log_return_realized_volatility_450, contain around 1700 null values. Compared with 428932 rows, that's not a large number.

So let's impute them rather than delete the rows.

**Imputing**

In [None]:
# Get X out of train
X = train.drop(['row_id', 'target', 'time_id'], axis = 1)

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X.iloc[:, :] = imp.fit_transform(X)

In [None]:
# Let's verify
X.isnull().sum()
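What does the "mean" strategy actually fill in? A tiny sketch on a toy frame (columns `a` and `b` are made up for illustration) shows each NaN being replaced by the mean of the non-missing values in its own column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one gap per column (hypothetical values)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)

# Per-column means learned during fit: a -> 2.0, b -> 6.0
print(imp.statistics_)
print(filled)
```

The learned means live in `imp.statistics_`, so the same fitted imputer can later be reused on other data.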

**Normalization**

Checking the stats of the train dataset, we find that the mean and std vary quite widely across columns, e.g. wap1_sum, wap1_sum_150...

This could make model fitting slower and less accurate.

In [None]:
train.describe()

In [None]:
scaler = StandardScaler()
# Don't forget to keep 'stock_id' out: it is an identifier, so its numeric value should stay as is
X_features = X.loc[:, X.columns != 'stock_id'].copy()  # copy to avoid writing into a view of X
X_features.iloc[:, :] = scaler.fit_transform(X_features)

In [None]:
# You see, every col now has mean around 0 and std around 1
X_features.describe()
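For reference, standardization is just z = (x - mean) / std per column. A small sketch on a toy array (values made up) checks that StandardScaler matches the manual formula, assuming the population std (ddof=0), which is also what `np.std` uses by default:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column feature (hypothetical values)
x = np.array([[1.0], [2.0], [3.0], [10.0]])

# Manual standardization: subtract the column mean, divide by the column std (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# Same thing via scikit-learn
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
```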

**Pipeline it!**

We are engineers! 👷 No need to wait: build a pipeline whenever possible.

In [None]:
X_features = X.loc[:, X.columns != 'stock_id'].copy()  # copy to avoid writing into a view of X
steps = [
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler())
]
pipe = Pipeline(steps)
X_features.iloc[:, :] = pipe.fit_transform(X_features)
X.loc[:, X.columns != 'stock_id'] = X_features

Now we have kind of "washed" our veggies and pipelined the steps, which we hope will improve our model.
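One thing the cells above do not cover is the test set. A common approach (an assumption here, not something taken from ref [1]) is to fit the pipeline on train only and then apply the same fitted pipeline to test, so both share the same means and scales. A minimal sketch with toy frames:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real train/test features (hypothetical values)
train_toy = pd.DataFrame({"wap1_sum": [100.0, np.nan, 300.0], "spread": [0.1, 0.2, 0.3]})
test_toy = pd.DataFrame({"wap1_sum": [np.nan, 200.0], "spread": [0.2, 0.4]})

pipe = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
])

# Learn the imputation means and scaling statistics from train only
train_scaled = pipe.fit_transform(train_toy)

# Reuse the statistics learned on train; nothing is refit on test
test_scaled = pipe.transform(test_toy)

print(train_scaled)
print(test_scaled)
```

Calling `fit_transform` on test instead would compute separate statistics for it, and the two sets would no longer be on the same scale.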

# GridSearchCV & RandomizedSearchCV (with GCP)

To be honest, it took me a while to understand [all params listed for LGBM model](https://lightgbm.readthedocs.io/en/latest/Parameters.html). The basic idea, as I understand it, is that simple models are combined into a "powerful" one with a certain tree structure, hence params like "depth", "leaf"...

OK, enough, let's not waste time. We can run Grid and Random Search in parallel while reading posts about these params, right?

**GridSearchCV**

Let's try different boosting strategies and see if there are any differences, along with different learning rates and the 2 fractions, which should be the "promising" ones.

In [None]:
X = train.drop(['row_id', 'target', 'time_id'], axis = 1)
y = train['target']

param_grid = {
    'boosting_type': ['gbdt', 'rf', 'dart', 'goss'],
    'learning_rate': [0.1, 0.01, 0.001],
    'num_leaves': [31, 100, 200, 500, 1000],
    'feature_fraction': [0.5, 0.8, 1.0],
    # Note: per the LightGBM docs, bagging_fraction only takes effect when bagging_freq > 0
    'bagging_fraction': [0.5, 0.8, 1.0]
}

model = lgb.LGBMModel(objective='rmse')
gs = GridSearchCV(model, param_grid=param_grid, scoring='neg_root_mean_squared_error', verbose=2, refit=True, cv=5, n_jobs=-1)
gs.fit(X, y)
print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))

Here is the result:    
Best score reached: -0.0010287521953893668 with params: {'bagging_fraction': 0.5, 'boosting_type': 'gbdt', 'feature_fraction': 0.8, 'learning_rate': 0.1, 'num_leaves': 500}
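Before launching a grid like this, it helps to know how many fits it implies. A quick sketch using scikit-learn's `ParameterGrid` on the same grid, with cv=5 as in the search above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'boosting_type': ['gbdt', 'rf', 'dart', 'goss'],
    'learning_rate': [0.1, 0.01, 0.001],
    'num_leaves': [31, 100, 200, 500, 1000],
    'feature_fraction': [0.5, 0.8, 1.0],
    'bagging_fraction': [0.5, 0.8, 1.0],
}

# Number of parameter combinations: 4 * 3 * 5 * 3 * 3
n_candidates = len(ParameterGrid(param_grid))

# Each combination is fitted once per fold (cv=5), not counting the final refit
n_fits = n_candidates * 5

print(n_candidates, n_fits)  # 540 2700
```

Every extra value in any list multiplies the total, which is why the grid gets expensive so quickly.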

**RandomizedSearchCV**

Params like num_leaves can take values over a wide range, so it is quite hard for us to pick reasonable ones by hand. So let's leave them to RandomizedSearchCV here.

In [None]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV

X = train.drop(['row_id', 'target', 'time_id'], axis = 1)
y = train['target']

param_grid = {
    'num_leaves': sp_randint(10, 1000), 
    'min_child_samples': sp_randint(10, 500), 
    'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
    'bagging_fraction': sp_uniform(loc=0.2, scale=0.8), 
    'feature_fraction': sp_uniform(loc=0.4, scale=0.5),
    'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
    'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]
}

model = lgb.LGBMModel(objective='rmse')
gs = RandomizedSearchCV(model, param_grid, n_iter=1000, scoring='neg_root_mean_squared_error', verbose=2, refit=True, cv=5, n_jobs=-1)
gs.fit(X, y)
print('Best score reached: {} with params: {} '.format(gs.best_score_, gs.best_params_))

Here is the result:  
Best score reached: -0.001018671586219258 with params: {'bagging_fraction': 0.8202697666845151, 'feature_fraction': 0.6674647807946869, 'min_child_samples': 154, 'min_child_weight': 1e-05, 'num_leaves': 491, 'reg_alpha': 0, 'reg_lambda': 10}
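To get a feel for what the distributions above produce, here is a small sketch that draws a few candidates with scikit-learn's `ParameterSampler` (only three of the params are shown, and `random_state=0` is an arbitrary choice for repeatability). Note that `sp_uniform(loc, scale)` covers [loc, loc + scale] and `sp_randint(low, high)` excludes `high`:

```python
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import ParameterSampler

param_distributions = {
    'num_leaves': sp_randint(10, 1000),                  # integers in [10, 999]
    'feature_fraction': sp_uniform(loc=0.4, scale=0.5),  # floats in [0.4, 0.9]
    'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100],      # lists are sampled uniformly
}

# Draw 5 candidate settings, the same way RandomizedSearchCV draws its n_iter candidates
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))

for s in samples:
    print(s)
```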

**Using Cloud Computing**

To shorten the run time and fit in more trials, I decided to use GCP (a computing resource upgrade option you can find under "File", top-left). Since the search code runs in parallel, more CPUs mean faster runs. I chose 96 vCPUs, and around 5000 fits completed in around 1 hour.

For more details on balancing what we need against pricing, check [this](https://cloud.google.com/compute/all-pricing).

I think cloud computing is a valuable, even necessary, skill for us to master -- there seems to be no way to run such a heavy computing load locally in a reasonable time.
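A rough back-of-the-envelope for that run, assuming the roughly 5000 fits correspond to n_iter=1000 with cv=5, that all 96 vCPUs stayed busy for the whole hour, and ignoring the final refit and LightGBM's own multithreading (so treat it as an order-of-magnitude estimate only):

```python
n_iter = 1000   # candidates drawn by the randomized search
cv = 5          # folds per candidate
n_fits = n_iter * cv

vcpus = 96
wall_clock_seconds = 60 * 60  # around 1 hour

# Average CPU time consumed per fit under the assumptions above
cpu_seconds_per_fit = vcpus * wall_clock_seconds / n_fits

print(n_fits, cpu_seconds_per_fit)
```

Under the same assumptions, the same workload on a 4-core machine would take roughly 24 times longer, which is the motivation for renting the bigger instance.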

# Summary

After applying all of the above (except StandardScaler, which worsened the score), the leaderboard score improved a bit, around top 48% from 42%. Not an impressive one, right?

I tried several more rounds of GridSearch and RandomSearch, but found no better "magic" hyperparams for the LGBM model.

Checking back on [the best-score ranked posts](https://www.kaggle.com/c/optiver-realized-volatility-prediction/code?competitionId=27233&sortBy=scoreAscending), I found that most of them mention an NN (Neural Network) combined with LGBM. I guess this result could be around the limit for the LGBM-only approach.

There should be some room for using the information from time series. A post I found here -- [Exploring time_id relationship](https://www.kaggle.com/stassl/exploring-time-id-relationships) -- leaves an open end for everyone to try.

# Reference

[[1] [Reproduction & Explanation] LGBM Baseline](https://www.kaggle.com/austinzhao/reproduction-explanation-lgbm-baseline?scriptVersionId=73453864)