# To Do List

## High Priority
1. CHECK AGAIN TS/GR SOLUTIONS
1. Hyperparameter Optimization
    - BayesSearchCV
1. Model Evaluation + Model Ensembling + Feature Selection (Feature Importance)
    - Model Ensembling:
        - How to specialize models?
        - Split dataset by volatility regimes?
            - Split dataset by looking at quantiles of std(mid)
            - 3 quantiles: low / normal / high volatility regimes
            - Do feature selection and build a model for each regime
        - Do model averaging/ensembling using Linear Regression/Ridge Regression?
    - Cumulative $R^2$ Plotting:
        - Implement cumulative $R^2$ in a time-series cross-validation setting
        - Plot cumulative $R^2$ for various models on each CV fold to see which models are best on each fold
        - Do this for each model/feature set across all 5 folds of the CV set
        - Use learning_curve (scikit-learn)?
    - Check how to combine different models built on different sets of features
        - How to select the features for each model?
    - Feature Selection (Feature Importance):
        - Check RFECV to do Feature Selection
        - Do Feature Importance with XGBoost using Cross-Validation:
            - Does the importance change on different folds? If yes, probably you need to use different features for different periods
1. Features:
    1. Bukosabino "Blackmagic"
        - Check Bukosabino GitHub "Blackmagic"
        - Use his standardization techniques as well? (same file of Blackmagic)
    1. Use Robust Z-Score?
        - Used by Sam (GR)
        - $\text{Robust Z-Score} = \frac{x-\text{median}}{\text{MAD (Median Abs Deviation)}}$
    1. New Features
        - Volatility / Std
        - Diffs
        - Sum
        - Ratios
    1. Select Different Features:
        - Select fewer order book levels and see if anything changes
    1. Reduce Memory Footprint:
        - Try using Differences/Returns in time instead of Z-Score
    1. Transform Features:
        - Check if features are non-normally distributed (using a histogram)
        - Apply Log/Box-Cox Transformation?
1. Cross-Validation
    - Change $R^2$ to use $\overline{y}=0$ (true metric used by XTX)
    - Change Cross-Validation Method:
        - Test other CV strategies to get a better estimate for OOS
        - Change TimeSeriesSplit to something else
        - Test Stratified Sampling / StratifiedKFold
    - Shuffling dataset gives better $R^2$: Is it because I use the future to forecast the present?
    - Adding lags doesn't improve $R^2$: Is it because samples are independent?
    - Select best features for each fold + Train single model on each fold/feature set + Ensemble
1. MAE / Huber Loss
    - Test Huber Loss with XGBRegressor
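
The cross-validation items above ask for an $R^2$ with $\overline{y}=0$ and a cumulative version of it. A minimal sketch, assuming the metric is $1 - \sum (y-\hat{y})^2 / \sum y^2$ and that samples are already in time order. The function names `r2_zero_baseline` and `cumulative_r2` are hypothetical, and this is not the official XTX scoring code.

```python
import numpy as np

def r2_zero_baseline(y_true, y_pred):
    """R^2 against a constant-zero baseline: 1 - SSE / sum(y^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum(y_true ** 2)
    return 1.0 - sse / sst

def cumulative_r2(y_true, y_pred):
    """Zero-baseline R^2 computed on samples 0..t, for every t."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    cum_sse = np.cumsum((y_true - y_pred) ** 2)
    cum_sst = np.cumsum(y_true ** 2)
    # Return NaN while the cumulative sum of y^2 is still zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(cum_sst > 0, 1.0 - cum_sse / cum_sst, np.nan)

# Toy example: predictions are exactly half of the target.
y = [1.0, -1.0, 2.0]
y_hat = [0.5, -0.5, 1.0]
print(r2_zero_baseline(y, y_hat))  # 0.75
print(cumulative_r2(y, y_hat))
```

Applying `cumulative_r2` to the held-out predictions of each fold gives one curve per model and fold, which can then be plotted side by side.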

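The volatility-regime split from the ensembling item can be sketched as below. It assumes a pandas Series of mid prices called `mid`, a rolling window of 50 samples, and tercile cut points. All three are illustrative choices, not values taken from these notes, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def label_volatility_regimes(mid, window=50):
    """Label each sample low/normal/high by terciles of rolling std(mid)."""
    vol = mid.rolling(window).std()
    # qcut splits the non-NaN values into 3 roughly equally populated bins;
    # the first window-1 samples have no rolling std and stay NaN.
    return pd.qcut(vol, q=3, labels=["low", "normal", "high"])

# Synthetic mid price whose noise level grows over time (illustration only).
rng = np.random.default_rng(0)
steps = rng.normal(scale=np.linspace(0.1, 1.0, 600))
mid = pd.Series(100 + np.cumsum(steps))

regimes = label_volatility_regimes(mid)
print(regimes.value_counts())
```

Each regime label can then index its own training subset for feature selection and model fitting. To avoid look-ahead, the quantile cut points would probably need to be estimated on the training fold only and then applied to the validation fold.
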
## Low Priority
1. Remove Outliers
1. Test Neural Networks
1. New Models:
    - Fix models that aren't working + Test new models:
        - ExtraTrees
        - Random Forests
        - GBM
        - Neural Networks
        - Lasso
1. Feature Selection:
    - CCF & Stationary Data?
        - You should have checked if data was stationary before doing CCF (Cross-Correlation Function)
1. Multi-class Classification:
    - Treat problem as multi-class classification problem?
1. Train on true (reconstructed) $y$, not winsorized $y$
    - clip prediction output to $[-5,5]$?

# BayesSearchCV
- Bayesian optimization offered by `Scikit-Optimize`/`skopt`
- Check file `skopt_sklearn-gridsearchcv-replacement.ipynb`

## Check Scikit-Learn GridSearchCV Interface
- Check interesting options offered by GridSearchCV to be replicated with BayesSearchCV:
    - plotting
    - scores
    - partial results via callbacks
    - save results to be resumed
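
For reference, a minimal sketch of how GridSearchCV exposes scores after fitting, through `cv_results_`, `best_params_` and `best_score_`. The Ridge estimator, the `alpha` grid and the synthetic data are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data (illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="r2",
)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to inspect or save.
results = pd.DataFrame(search.cv_results_)
cols = ["param_alpha", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols])
print(search.best_params_, search.best_score_)
```

BayesSearchCV is meant as a drop-in replacement for GridSearchCV, so the same attributes should be available there once `fit` has finished. Writing `results` to disk with `results.to_csv(...)` is one simple way to keep scores between sessions.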

## Progress monitoring/control via `callbacks`

Monitor the progress of BayesSearchCV with an event handler that is called on every step of the subspace exploration:
- `n_jobs=1`: the event handler is called after every evaluation of a model configuration
- `n_jobs>1` (parallel mode): the event handler is called after `n_jobs` model configurations have been evaluated in parallel
- Exploration stops if the callback returns `True`. E.g., stop exploration early if the accuracy obtained is sufficiently high.

In [None]:
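from skopt import BayesSearchCV
from sklearn.svm import SVC

# X, y: feature matrix and labels, assumed to be defined in an earlier cell

searchcv = BayesSearchCV(
    SVC(gamma='scale'),
    search_spaces={'C': (0.01, 100.0, 'log-uniform')},
    n_iter=10,
    cv=3
)

# callback handler: receives the optimizer result after each step
def on_step(optim_result):
    # skopt minimizes the negated CV score, so the best score so far is -fun
    # (searchcv.best_score_ may not be set until fit() has returned)
    score = -optim_result.fun
    print("best score: %s" % score)
    if score >= 0.98:
        print('Interrupting!')
        return True

searchcv.fit(X, y, callback=on_step)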

# Ideas

## Bukosabino (GR-29th)
- https://medium.com/@bukosabino/financial-forecasting-challenge-by-g-research-8792c5344ae9
- https://github.com/bukosabino/financial-forecasting-challenge-gresearch

### Feature Engineering
- Black magic: I added **rolling_mean**, **inverse_rolling_mean**, **diffs**, **cumsum** and **shift** for all features order by 'Stock' with different periods. When I included them my results grew up greatly.
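
A minimal pandas sketch of this kind of per-stock feature generation, assuming a DataFrame with a `Stock` column and rows already sorted by time within each stock. The helper name `add_group_features`, the column suffixes and the periods are hypothetical, and this is not Bukosabino's actual code. `inverse_rolling_mean` is left out because its definition is not given in these notes.

```python
import pandas as pd

def add_group_features(df, cols, group_col="Stock", periods=(2, 5)):
    """Add per-group cumsum, rolling mean, diff and shift columns."""
    out = df.copy()
    grouped = df.groupby(group_col)
    for col in cols:
        out[f"{col}_cumsum"] = grouped[col].cumsum()
        for p in periods:
            # transform keeps the original row order and index
            out[f"{col}_rollmean_{p}"] = grouped[col].transform(
                lambda s: s.rolling(p).mean()
            )
            out[f"{col}_diff_{p}"] = grouped[col].diff(p)
            out[f"{col}_shift_{p}"] = grouped[col].shift(p)
    return out

# Toy data: two stocks with three observations each.
df = pd.DataFrame({
    "Stock": ["A", "A", "A", "B", "B", "B"],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
feats = add_group_features(df, cols=["x"], periods=(2,))
print(feats)
```

Grouping first means no value leaks from one stock into another: the first rows of each stock get NaN for the rolling, diff and shift columns instead of borrowing from the previous stock.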

### Models
- Tuned different boosting models like **XGBoost** and **LightGBM**
- XGBoost always had better results than LightGBM.
- **Stacked the best models** using different weights

### Ideas with Bad Results:
- **RNN/Recurrent Neural Networks** using Keras library (**LSTMs** with different approaches)
- Delete and/or **clip outlier** values through some observations and using statistical methods like standard deviation
- Feature engineering with decomposition methods such as **PCA**, **SVD** or **TSNE** for different groups of features from the original dataset
- **Data normalization**
- **ExtraTreesRegressor**, **RandomForestRegressor**, and different **linear regressions**

## Feature Selection: Feature Importance

In [None]:
from sklearn.ensemble import RandomForestRegressor

# X_train, y_train and feature_names are assumed to be defined in an earlier cell
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

print("Features sorted by their score:")
scores = [(round(float(imp), 4), name) for imp, name in zip(rf.feature_importances_, feature_names)]
print(sorted(scores, reverse=True))
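
To check whether importance changes between folds, as asked in the to-do list, the same idea can be repeated inside a time-series split. A minimal sketch on synthetic data: RandomForestRegressor stands in for XGBoost here, and the feature names `f0`..`f5` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data (illustration only).
X, y = make_regression(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"f{i}" for i in range(X.shape[1])]

rows = []
for fold, (train_idx, _) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[train_idx], y[train_idx])
    rows.append(pd.Series(rf.feature_importances_, index=feature_names, name=fold))

# One row per fold, one column per feature.
importances = pd.DataFrame(rows)
print(importances.round(4))
# A large spread across folds suggests the importance is not stable over time.
print(importances.std().sort_values(ascending=False))
```

Note that TimeSeriesSplit uses expanding training windows, so later folds include the data of earlier ones. Fitting on non-overlapping blocks instead may give a sharper view of period-specific importance.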