## ML modelling

Here we run cross-validation over both a short time horizon (30-minute intervals, used for the marginal liquidity cost) and a long time horizon (1-day intervals). Both are needed because the initial short-horizon calculation captured only temporal effects. Capturing the volume effects described in the literature required a longer time horizon.


References:
* [Impact Cost calculation](https://economictimes.indiatimes.com/definition/impact-cost#:~:text=Definition%3A%20Impact%20cost%20is%20the,liquidity%20condition%20on%20the%20counter.&text=This%20is%20a%20cost%20that,to%20lack%20of%20market%20liquidity).
* [Limit Order Books](https://www.imperial.ac.uk/media/imperial-college/research-centres-and-groups/cfm-imperial-institute-of-quantitative-finance/events/imperial-eth-2016/Julius-Bonart.pdf)
* [Paper on gap K-fold cross-validation techniques](https://arxiv.org/pdf/1905.11744.pdf)
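
The gap K-fold idea from the last reference can be sketched as follows. This is a minimal illustration only: `gap_kfold_indices` and the sizes used are hypothetical, and nothing here is confirmed to match what `ml.get_cv_results` does internally. Each test fold is a contiguous block of rows, and a few rows on either side of it are dropped from the training set so that neighbouring, autocorrelated observations do not leak into training.

```python
import numpy as np


def gap_kfold_indices(n_samples, n_splits=5, gap=1):
    """Yield (train_idx, test_idx) pairs with contiguous test folds.

    `gap` rows on each side of the test fold are excluded from training.
    Hypothetical helper, written for illustration only.
    """
    indices = np.arange(n_samples)
    # np.array_split returns n_splits contiguous, near-equal blocks
    for test_idx in np.array_split(indices, n_splits):
        lo = max(test_idx[0] - gap, 0)
        hi = min(test_idx[-1] + gap, n_samples - 1)
        # Keep only rows strictly outside the padded test window
        train_mask = (indices < lo) | (indices > hi)
        yield indices[train_mask], test_idx


# Assumed sizes: 10 days of 30-minute rows, 5 folds, a gap of 2 rows
folds = list(gap_kfold_indices(n_samples=480, n_splits=5, gap=2))
for train_idx, test_idx in folds:
    print(f"train={len(train_idx)} test={len(test_idx)}")
```

A larger `gap` trades training data for a cleaner separation between folds, so a sensible value depends on how quickly the autocorrelation in the series decays.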

In [1]:
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
import etl
import config as cfg
import liquidity_costs as lc
import feature_engineering as fe
import ml_modelling as ml
from sklearn.preprocessing import MinMaxScaler
from tabulate import tabulate

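# Target scaler: maps the observed target range onto [-100, 100]
scaler = MinMaxScaler((-100, 100))

asks_merged_df, bids_merged_df = fe.get_data(sin_cos_transform=True)

# fit_transform re-fits the scaler on each frame, so asks and bids are scaled independently
for df in [asks_merged_df, bids_merged_df]:
    df[cfg.Y_COL] = scaler.fit_transform(df[cfg.Y_COL])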
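
As a standalone illustration of what `MinMaxScaler((-100, 100))` does to a target column, the sketch below uses made-up values rather than the notebook's data. It compares the scaler's output with the min-max formula written out by hand and shows how scaled values can be mapped back to the original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.array([0.5, 2.0, 3.5, 8.0])  # made-up target values

lo, hi = -100, 100
scaler = MinMaxScaler((lo, hi))

# scikit-learn scalers expect a 2D array of shape (n_samples, n_features)
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()

# The same result from the min-max formula written out by hand
y_manual = (y - y.min()) / (y.max() - y.min()) * (hi - lo) + lo

# inverse_transform maps scaled values back to the original units
y_back = scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(y_back)
```

Because the scaler stores the minimum and maximum seen during fitting, error metrics computed on the scaled target are expressed in scaled units, not in the original cost units.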

## CV prediction 30-min intervals 

In [2]:
scores = ml.get_cv_results(asks_merged_df, "30min")

Running GridSearchCV for ExtraTreesRegressor.
Fitting 5 folds for each of 42 candidates, totalling 210 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    3.7s
[Parallel(n_jobs=3)]: Done 205 out of 210 | elapsed:   19.1s remaining:    0.5s
[Parallel(n_jobs=3)]: Done 210 out of 210 | elapsed:   20.2s finished
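
The log above reports 42 candidates evaluated over 5 folds, i.e. 42 × 5 = 210 fits. A grid search of that shape could look like the sketch below. The parameter grid, the synthetic data, and the scoring metric are all assumptions chosen so that the candidate count matches the log; the actual grid inside `ml.get_cv_results` is not shown in this notebook. Only the estimator and the `random_state` value are taken from the output.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Hypothetical grid: 7 depths x 6 ensemble sizes = 42 candidates
param_grid = {
    "max_depth": [2, 3, 4, 5, 6, 7, 8],
    "n_estimators": [2, 4, 8, 16, 32, 64],
}
n_candidates = len(ParameterGrid(param_grid))
print(f"{n_candidates} candidates x 5 folds = {n_candidates * 5} fits")

search = GridSearchCV(
    ExtraTreesRegressor(random_state=121301),
    param_grid,
    cv=5,  # 5 folds, as in the log
    scoring="neg_mean_squared_error",  # assumed metric
)
search.fit(X, y)
print(search.best_params_)
```

Passing an integer `cv` uses plain unshuffled K-fold for a regressor. For time-ordered data, a splitter that respects ordering or leaves a gap between folds can be passed as `cv` instead.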


In [4]:
scores.head(3)

| Unnamed: 0 | estimator | min_score | mean_score | max_score | std_score | max_depth | n_estimators | random_state |
|---|---|---|---|---|---|---|---|---|
| 12 | ExtraTreesRegressor | 2601.71 | 1169.22 | 23.4365 | -1170.15 | 3 | 16 | 121301 |
| 13 | ExtraTreesRegressor | 2808.37 | 1187.55 | 24.0981 | -1229.93 | 3 | 32 | 121301 |
| 18 | ExtraTreesRegressor | 3526.15 | 1499.12 | 24.4908 | -1616.75 | 4 | 16 | 121301 |


## CV prediction 1-day intervals

Each row covers a 30-minute interval, so one full day spans 48 time steps.

In [None]:
scores_daily_models_dict = ml.get_scores_daily_models(asks_merged_df, bids_merged_df)
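
The 48-steps-per-day figure follows from 24 hours divided by 30 minutes. The sketch below checks that arithmetic with pandas on a made-up 30-minute index and groups the rows into days. It is an illustration only: the column name `y` and the dates are hypothetical, and how `ml.get_scores_daily_models` actually builds its daily windows is not shown here.

```python
import numpy as np
import pandas as pd

# Number of 30-minute rows in one day
steps_per_day = pd.Timedelta("1D") // pd.Timedelta("30min")
print(steps_per_day)

# Made-up series: 3 full days of 30-minute rows
idx = pd.date_range("2021-01-01", periods=steps_per_day * 3, freq="30min")
df = pd.DataFrame({"y": np.arange(len(idx), dtype=float)}, index=idx)

# Group the 30-minute rows into calendar days
daily = df["y"].resample("1D").agg(["count", "mean"])
print(daily)
```

A `count` below 48 for any day would indicate missing intervals, which is worth checking before treating 48 consecutive rows as one day.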