In [1]:
# hide
import pandas as pd
from nbdev.showdoc import *

# Synthetic Data Generation with NumerBlox

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can be a great way to improve performance simply by having more data to train. We will both cover ways to generate synthetic target variables and features.

## 0. Download and load

In [2]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame

In [3]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version=3)

2022-05-05 12:40:27,886 INFO numerapi.utils: target file already exists
2022-05-05 12:40:27,887 INFO numerapi.utils: download complete


2022-05-05 12:40:29,469 INFO numerapi.utils: target file already exists
2022-05-05 12:40:29,471 INFO numerapi.utils: download complete


In [4]:
dataf = create_numerframe("synth_test/numerai_training_data.parquet")

In [5]:
dataf.head(2)

Unnamed: 0_level_0,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.25,0.25,0.25,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,1.0,1.0,1.0,1.0,0.833333,0.666667,0.833333,0.666667,0.833333,0.666667


## 1. Synthetic target (Bayesian GMM)

First we will tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` allows you to generate a new target variable based on a given target. The preprocessor sample the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) which is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [Github Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).

In [6]:
from numerblox.preprocessing import BayesianGMMTargetProcessor

In [7]:
show_doc(BayesianGMMTargetProcessor)

<h2 id="BayesianGMMTargetProcessor" class="doc_header"><code>class</code> <code>BayesianGMMTargetProcessor</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L302" class="source_link" style="float:right">[source]</a></h2>

> <code>BayesianGMMTargetProcessor</code>(**`target_col`**:`str`=*`'target'`*, **`feature_names`**:`list`=*`None`*, **`n_components`**:`int`=*`6`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param feature_names: Selection of features used for Bayesian GMM. All features by default.
:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

In [8]:
dataf.head()

Unnamed: 0_level_0,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
n003bba8a98662e4,1,train,1.0,0.5,1.0,1.0,0.0,0.0,1.0,1.0,...,0.25,0.25,0.25,0.0,0.166667,0.0,0.166667,0.0,0.166667,0.0
n003bee128c2fcfc,1,train,0.5,1.0,0.25,0.75,0.0,0.75,0.5,0.75,...,1.0,1.0,1.0,1.0,0.833333,0.666667,0.833333,0.666667,0.833333,0.666667
n0048ac83aff7194,1,train,0.5,0.25,0.75,0.0,0.75,0.0,0.75,0.75,...,0.5,0.25,0.25,0.25,0.5,0.333333,0.5,0.333333,0.5,0.333333
n00691bec80d3e02,1,train,1.0,0.5,0.5,0.75,0.0,1.0,0.25,1.0,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5
n00b8720a2fdc4f2,1,train,1.0,0.75,1.0,1.0,0.0,0.0,1.0,0.5,...,0.5,0.5,0.5,0.5,0.666667,0.5,0.5,0.5,0.666667,0.5


In [9]:
bgmm = BayesianGMMTargetProcessor(target_col="target_nomi_20")
test_columns = ['era', 'data_type', 'feature_dichasial_hammier_spawner',
                'feature_rheumy_epistemic_prancer', 'target',
                'target_nomi_20', 'target_paul_20']
sample_dataf = NumerFrame(dataf[test_columns].sample(100).fillna(0.5))
fake_dataf = bgmm(sample_dataf)

Generating fake target:   0%|          | 0/92 [00:00<?, ?it/s]

In [10]:
sample_dataf.head()

Unnamed: 0_level_0,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,target,target_nomi_20,target_paul_20,target_nomi_20_fake
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
naa66d893c9d675e,321,train,0.5,0.5,0.5,0.5,0.5,0.5
na511dc5503842c6,472,train,0.0,1.0,0.5,0.5,0.5,0.5
n9c281a44edd8f50,530,train,0.5,1.0,0.5,0.5,0.5,0.5
n06e7e03c16fde5a,569,train,0.25,0.75,0.5,0.5,0.5,0.5
nb82b454df84a09a,340,train,1.0,0.75,0.5,0.5,0.5,0.5


The new target will be suffixed by `_fake` to distinguish it from the original targets.

In [11]:
fake_dataf.get_target_data.head(2)

Unnamed: 0_level_0,target,target_nomi_20,target_paul_20,target_nomi_20_fake
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
naa66d893c9d675e,0.5,0.5,0.5,0.5
na511dc5503842c6,0.5,0.5,0.5,0.5


Note that you can easily generate multiple fake targets in a loop.

In [12]:
for target_col in sample_dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    sample_dataf = bgmm(sample_dataf)
sample_dataf.get_target_data.head(2)

Generating fake target:   0%|          | 0/92 [00:00<?, ?it/s]

Generating fake target:   0%|          | 0/92 [00:00<?, ?it/s]

Generating fake target:   0%|          | 0/92 [00:00<?, ?it/s]

Unnamed: 0_level_0,target,target_nomi_20,target_paul_20,target_nomi_20_fake,target_fake,target_paul_20_fake
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
naa66d893c9d675e,0.5,0.5,0.5,0.5,0.5,0.5
na511dc5503842c6,0.5,0.5,0.5,0.5,0.5,0.5


## 2. DeepDreamGenerator

In [13]:
from numerblox.preprocessing import DeepDreamGenerator

In [14]:
show_doc(DeepDreamGenerator)

<h2 id="DeepDreamGenerator" class="doc_header"><code>class</code> <code>DeepDreamGenerator</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L180" class="source_link" style="float:right">[source]</a></h2>

> <code>DeepDreamGenerator</code>(**`model_path`**:`str`, **`batch_size`**:`int`=*`200000`*, **`steps`**:`int`=*`5`*, **`step_size`**:`float`=*`0.01`*, **`feature_names`**:`list`=*`None`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic eras using DeepDream technique. 

Based on implementation by nemethpeti: 

https://github.com/nemethpeti/numerai/blob/main/DeepDream/deepdream.py

:param model_path: Path to trained DeepDream model. Example can be downloaded from 

https://github.com/nemethpeti/numerai/blob/main/DeepDream/model.h5 

:param batch_size: How much synthetic data to process in each batch. 

:param steps: Number of gradient ascent steps to perform. More steps will lead to more augmentation. 

:param step_size: How much to augment the batch based on computed gradients. 

Like with the number of steps, a larger step size will lead to more dramatic changes to the input features. 

The default parameters are found to work well in practice, but could be further optimized.

### 2.1. Simple data generation example

For our example we will use the model open sourced by [nemethpeti](https://github.com/nemethpeti) which you can download [here](https://github.com/nemethpeti/numerai/blob/main/DeepDream/model.h5). This model works on the v3 medium feature set. We therefore use v3 data in this example. The v3 medium feature set can be easily retrieved using `NumeraiClassicDownloader`.

In [15]:
#hide_output
feature_set = dl.get_classic_features(filename="v3/features.json")
feature_names = feature_set['feature_sets']['medium']

2022-05-05 12:40:43,868 INFO numerapi.utils: target file already exists
2022-05-05 12:40:43,869 INFO numerapi.utils: download complete


In [16]:
ddg = DeepDreamGenerator(model_path="../test_assets/deepdream_model.h5",
                         feature_names=feature_names)

2022-05-05 12:40:43.915881: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Let's try to generate features from a small subset of 100 rows.

In [17]:
sample_dataf_2 = NumerFrame(dataf.sample(100))

In [18]:
dreamed_dataf = ddg.transform(sample_dataf_2)

Deepdreaming Synthetic Batches:   0%|          | 0/1 [00:00<?, ?it/s]

The new dreamed `NumerFrame` consists of the original data and 100 new additional rows. Note that targets are the same.

Also, `era`, `data_type` and any other columns besides features and targets will be `NaN`s.

In [19]:
print(dreamed_dataf.shape)
dreamed_dataf.tail()

(199, 1073)


Unnamed: 0,era,data_type,feature_dichasial_hammier_spawner,feature_rheumy_epistemic_prancer,feature_pert_performative_hormuz,feature_hillier_unpitied_theobromine,feature_perigean_bewitching_thruster,feature_renegade_undomestic_milord,feature_koranic_rude_corf,feature_demisable_expiring_millepede,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
94,,,0.0,0.0,0.926837,0.701231,0.988952,0.195643,0.780995,0.321617,...,0.5,0.0,0.5,0.25,0.5,0.166667,0.5,0.166667,0.5,0.166667
95,,,0.480184,0.427976,0.223835,0.183721,0.761205,0.395356,0.238707,0.99749,...,1.0,0.75,1.0,0.75,1.0,0.833333,0.833333,0.833333,0.833333,0.666667
96,,,0.970524,0.246957,0.463982,0.280058,1.0,0.766258,0.239566,0.723194,...,0.75,0.75,0.5,0.75,0.666667,0.666667,0.666667,0.666667,0.5,0.666667
97,,,0.01046,1.0,0.247279,0.204357,0.757548,0.702845,1.0,0.468156,...,0.25,0.25,0.25,0.25,0.333333,0.333333,0.333333,0.333333,0.333333,0.333333
98,,,0.736778,0.750979,0.007622,0.220263,0.268438,0.741399,0.044169,0.976056,...,0.5,0.5,0.25,0.5,0.5,0.5,0.333333,0.333333,0.5,0.5


To only get new synthetic data use `.get_synthetic_batch`.


In [20]:
synth_dataf = ddg.get_synthetic_batch(sample_dataf_2)

Deepdreaming Synthetic Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [21]:
print(synth_dataf.shape)
synth_dataf.head()

(99, 441)


Unnamed: 0,feature_abstersive_emotional_misinterpreter,feature_accessorial_aroused_crochet,feature_acerb_venusian_piety,feature_affricative_bromic_raftsman,feature_agile_unrespited_gaucho,feature_agronomic_cryptal_advisor,feature_alkaline_pistachio_sunstone,feature_altern_unnoticed_impregnation,feature_ambisexual_boiled_blunderer,feature_amoebaean_wolfish_heeler,...,target_paul_20,target_paul_60,target_george_20,target_george_60,target_william_20,target_william_60,target_arthur_20,target_arthur_60,target_thomas_20,target_thomas_60
0,0.0,0.51751,0.590794,0.0,1.0,0.108258,0.042904,0.531423,0.994126,0.322113,...,0.5,0.5,0.75,0.75,0.666667,0.5,0.5,0.5,0.666667,0.666667
1,0.007362,0.018246,0.816708,0.0,0.050006,0.079645,0.055798,0.0,0.0,0.0,...,0.5,0.5,0.25,0.25,0.5,0.5,0.333333,0.333333,0.333333,0.333333
2,1.0,0.731188,0.384121,0.424877,1.0,0.120615,0.01253,0.010341,0.433381,0.050186,...,0.5,0.75,0.75,0.5,0.666667,0.666667,0.666667,0.666667,0.5,0.666667
3,0.00971,0.026749,0.858364,0.420578,0.720584,0.041099,0.0,0.010843,0.733748,0.47773,...,0.75,0.5,0.5,0.5,0.5,0.5,0.5,0.333333,0.5,0.5
4,0.458052,0.623646,0.371411,0.0,0.0,0.740971,1.0,0.746225,1.0,0.0,...,0.75,0.75,0.75,0.75,0.666667,1.0,0.666667,0.833333,0.666667,0.666667


### 2.2. Improve Numerai performance with synthetic data.

Now we will demonstrate that `DeepDreamGenerator` will actually lead to better validation performance when the synthetic data is mixed with the original Numerai data. We hope to see more experiments by the Numerai community with v4 data, different feature sets, etc. Hope this section makes it easy for participants to get started and to setup up your own experiments.

In [22]:
import numpy as np
import lightgbm as lgb

from numerblox.preprocessing import FeatureSelectionPreProcessor
from numerblox.evaluation import NumeraiClassicEvaluator

In [23]:
dataf = create_numerframe("synth_test/numerai_training_data.parquet")
val_dataf = create_numerframe("deepdream_eval_test/numerai_validation_data.parquet")

In [24]:
ddg = DeepDreamGenerator("../test_assets/deepdream_model.h5", batch_size=200_000)

We will train a LightGBM model and evaluate the most common metrics for both original data and original data + ~5% synthetic data (500000 rows).

In [25]:
def train_and_evaluate(dataf: NumerFrame, val_dataf: NumerFrame, feature_names: list):
    """ Train LightGBM model with proper parameters and evaluate. """
    X_train, y_train = dataf.get_feature_target_pair(multi_target=False)
    X_val, y_val = val_dataf.get_feature_target_pair(multi_target=False)
    lgb_model = lgb.LGBMRegressor(random_state=42, n_estimators=2000, max_depth=5,
                                  learning_rate=.01, colsample_bytree=.1)
    X_train = NumerFrame(X_train[feature_names])
    X_val = NumerFrame(X_val[feature_names])
    lgb_model.fit(X_train, y_train)

    X_val['prediction'] = lgb_model.predict(X_val)
    X_val['era'] = val_dataf['era']
    X_val['target'] = y_val
    X_val['random_pred'] = np.random.uniform(len(y_val))

    evaluator = NumeraiClassicEvaluator(fast_mode=True)
    results = evaluator.full_evaluation(X_val, example_col="random_pred",
                                        target_col='target', pred_cols=['prediction'])
    return results

As reference, We train a LightGBM model on the medium feature set with no added synthetic data.

In [26]:
results = train_and_evaluate(dataf, val_dataf, feature_names=feature_names)



Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

2022-05-05 12:47:03,481 INFO numexpr.utils: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-05-05 12:47:03,483 INFO numexpr.utils: NumExpr defaulting to 8 threads.


`FeatureSelectionPreProcessor` selects features and keep all other columns, like target, era and data type. We will use this to filter on the medium feature set. Then we randomly take 500000 rows to generate synthetic data on.

In [27]:
sample_dataf_dream = FeatureSelectionPreProcessor(feature_cols=feature_names)(dataf)
sample_dataf_dream = sample_dataf_dream.sample(500_000, random_state=42)

A simple call to the `DeepDreamGenerator` will generate a full new dataset from the input DataFrame. In this case, we generate 500000 new rows and add it to our full Numerai v3 dataset.

In [28]:
sample_dream_dataf = ddg.get_synthetic_batch(sample_dataf_dream)
dream_dataf = pd.concat([dataf, sample_dream_dataf])

Deepdreaming Synthetic Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Now let's train on the full dataset again with 500000 additional synthetic rows and compare results.

In [29]:
dream_results = train_and_evaluate(dream_dataf, val_dataf,
                                   feature_names=feature_names)



Evaluation:   0%|          | 0/1 [00:00<?, ?it/s]

In [30]:
results

Unnamed: 0,target,mean,std,sharpe,max_drawdown,apy,mmc_mean,mmc_std,mmc_sharpe,corr_with_example_preds
prediction,target,0.020226,0.032328,0.625658,-0.192148,155.268204,0.015526,0.024817,0.62565,-3.1303469999999996e-19


In [31]:
dream_results

Unnamed: 0,target,mean,std,sharpe,max_drawdown,apy,mmc_mean,mmc_std,mmc_sharpe,corr_with_example_preds
prediction,target,0.020592,0.030987,0.664546,-0.182301,160.191965,0.015807,0.023788,0.664535,-3.0722660000000002e-18


Note that with the added synthetic data the mean correlation improves, along with a higher Sharpe and lower max. drawdown. There seems to be a sweet spot when mixing in synthetic data. Add too little and no significant improvement occurs, but adding too much synthetic data can also hurt performance. You can play with this for yourself. Also try to adjust the `steps` and `step_size` parameters in `DeepDreamGenerator`.

## 3. UMAPFeatureGenerator

UMAP is a feature reduction technique that can be used to generate synthetic features. In other words, we create new representations of the existing features and add them to our dataset.

We will perform UMAP on the training and validation data combined. Note that the data created with `DeepDreamGenerator` is included in this dataset. Then, once again we train a model on it and evaluate results.

In [32]:
from numerblox.preprocessing import UMAPFeatureGenerator

`n_components` denotes the amount of additional features we are generating.

In [33]:
n_components = 3
umap_gen = UMAPFeatureGenerator(n_components=n_components, n_neighbors=9)

In [34]:
test_data = create_numerframe("../test_assets/mini_numerai_version_2_data.parquet")

In [35]:
test_data = umap_gen(test_data)

OMP: Info #271: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


The new features follow the naming convention `f"feature_umap_{i}"`. All new components are scaled between 0 and 1.

In [36]:
umap_features = [f"feature_umap_{i}" for i in range(n_components)]
test_data[umap_features].head(3)

Unnamed: 0_level_0,feature_umap_0,feature_umap_1,feature_umap_2
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
n559bd06a8861222,0.887313,0.365509,1.0
n9d39dea58c9e3cf,1.0,0.779677,0.732083
nb64f06d3a9fc9f1,0.879256,0.073302,0.174605


Contrast this with the deep dream results.

After you're done all the downloaded files can be cleaned up with `.remove_base_directory()`.

In [37]:
# Clean up environment
dl.remove_base_directory()