In [13]:
# hide
from nbdev.showdoc import *

# Synthetic Data Generation with NumerBlox

This example notebook covers ways to generate synthetic data using `numerblox` components. Synthetic data can improve model performance by giving models more representative data to train on. We cover ways to generate both synthetic target variables and synthetic features.


**WARNING:** Fitting these models can take quite some time on full Numerai datasets. It is recommended to use these preprocessors stand-alone and only apply them within a `ModelPipeline` if you already have trained models to load and generate batches of synthetic data.

## 0. Download and load

In [14]:
from numerblox.download import NumeraiClassicDownloader
from numerblox.numerframe import create_numerframe, NumerFrame
from numerblox.preprocessing import SyntheticDataGenerator, BayesianGMMTargetProcessor

In [15]:
dl = NumeraiClassicDownloader(directory_path="synth_test")
dl.download_training_data(version=4)

2022-04-18 15:08:53,289 INFO numerapi.utils: target file already exists
2022-04-18 15:08:53,290 INFO numerapi.utils: download complete


2022-04-18 15:08:54,349 INFO numerapi.utils: target file already exists
2022-04-18 15:08:54,350 INFO numerapi.utils: download complete


In [16]:
dataf = create_numerframe("synth_test/train.parquet")

In [17]:
dataf.head(2)

| id | era | data_type | feature_honoured_observational_balaamite | feature_polaroid_vadose_quinze | feature_untidy_withdrawn_bargeman | feature_genuine_kyphotic_trehala | feature_unenthralled_sportful_schoolhouse | feature_divulsive_explanatory_ideologue | feature_ichthyotic_roofed_yeshiva | feature_waggly_outlandish_carbonisation | ... | target_paul_v4_20 | target_paul_v4_60 | target_george_v4_20 | target_george_v4_60 | target_william_v4_20 | target_william_v4_60 | target_arthur_v4_20 | target_arthur_v4_60 | target_thomas_v4_20 | target_thomas_v4_60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n003bba8a98662e4 | 1 | train | 1.0 | 0.5 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0.5 | 0.25 | 0.25 | 0.0 | 0.333333 | 0.0 | 0.5 | 0.5 | 0.166667 | 0.0 |
| n003bee128c2fcfc | 1 | train | 0.5 | 1.0 | 0.25 | 0.75 | 0.0 | 0.75 | 0.5 | 0.75 | ... | 0.75 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 0.833333 | 0.666667 | 0.833333 | 0.666667 |


In [18]:
# Sample for testing
test_columns = ['era', 'data_type', 'feature_honoured_observational_balaamite',
                'feature_polaroid_vadose_quinze', 'target',
                'target_nomi_v4_20', 'target_nomi_v4_60']
dataf = NumerFrame(dataf[test_columns].sample(10000))

## 1. Synthetic target (Bayesian GMM)

First we tackle the problem of creating a synthetic target column to improve model performance. `BayesianGMMTargetProcessor` generates a new target variable based on a given target. The preprocessor samples the target from a [Bayesian Gaussian Mixture model](https://scikit-learn.org/stable/modules/generated/sklearn.mixture.BayesianGaussianMixture.html) that is fitted on coefficients from a [regularized linear model (Ridge regression)](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html).

This implementation is based on a [GitHub Gist by Michael Oliver (mdo)](https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93).
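
To make the idea concrete, here is a minimal sketch of the two steps described above (per-era Ridge coefficients, then a Bayesian Gaussian Mixture fitted on those coefficients), written directly against scikit-learn. It is not the `numerblox` implementation: the toy data, the sizes, the helper name `make_fake_target`, and the choice to score features with a single sampled coefficient vector are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy stand-in for Numerai data (assumed sizes): 20 eras, 200 rows each, 5 features.
n_eras, rows_per_era, n_features = 20, 200, 5
true_coef = rng.normal(size=n_features)

era_coefs = []
for _ in range(n_eras):
    X = rng.uniform(0, 1, size=(rows_per_era, n_features))
    # Each era gets a slightly perturbed feature/target relationship plus noise.
    era_coef = true_coef + rng.normal(scale=0.1, size=n_features)
    y = X @ era_coef + rng.normal(scale=0.1, size=rows_per_era)
    # Step 1: fit a regularized linear model per era and keep its coefficients.
    era_coefs.append(Ridge(alpha=1.0).fit(X, y).coef_)
era_coefs = np.vstack(era_coefs)  # shape: (n_eras, n_features)

# Step 2: fit a Bayesian Gaussian Mixture on the per-era coefficient vectors.
bgmm = BayesianGaussianMixture(n_components=3, random_state=0)
bgmm.fit(era_coefs)


def make_fake_target(X_new, mixture):
    """Hypothetical helper: sample one coefficient vector and score the features with it."""
    sampled_coef, _ = mixture.sample(1)  # returns (samples, component labels)
    return X_new @ sampled_coef[0]


X_new = rng.uniform(0, 1, size=(rows_per_era, n_features))
fake_target = make_fake_target(X_new, bgmm)
```

The sketch produces a continuous score. The processor's output shown later in this notebook takes the same discrete values as the original target, so the real implementation evidently does further post-processing that is not reproduced here.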

In [19]:
show_doc(BayesianGMMTargetProcessor)

<h2 id="BayesianGMMTargetProcessor" class="doc_header"><code>class</code> <code>BayesianGMMTargetProcessor</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L232" class="source_link" style="float:right">[source]</a></h2>

> <code>BayesianGMMTargetProcessor</code>(**`target_col`**:`str`=*`'target'`*, **`n_components`**:`int`=*`6`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic (fake) target using a Bayesian Gaussian Mixture model. 

Based on Michael Oliver's GitHub Gist implementation: 

https://gist.github.com/the-moliver/dcdd2862dc2c78dda600f1b449071c93

:param target_col: Column from which to create fake target. 

:param n_components: Number of components for fitting Bayesian Gaussian Mixture Model.

In [20]:
bgmm = BayesianGMMTargetProcessor(target_col="target_nomi_v4_60")
fake_dataf = bgmm(dataf)

The new target column is suffixed with `_fake` to distinguish it from the original targets.

In [21]:
fake_dataf.get_target_data.head(2)

| id | target | target_nomi_v4_20 | target_nomi_v4_60 | target_nomi_v4_60_fake |
|---|---|---|---|---|
| n458579a5705ffe2 | 0.5 | 0.5 | 0.75 | 0.75 |
| nf30a7195098a8fe | 0.0 | 0.0 | 0.0 | 0.5 |


Note that you can easily generate multiple fake targets in a loop.

In [22]:
for target_col in dataf.target_cols:
    bgmm = BayesianGMMTargetProcessor(target_col=target_col)
    dataf = bgmm(dataf)
dataf.get_target_data.head(2)

| id | target | target_nomi_v4_20 | target_nomi_v4_60 | target_nomi_v4_60_fake | target_fake | target_nomi_v4_20_fake |
|---|---|---|---|---|---|---|
| n458579a5705ffe2 | 0.5 | 0.5 | 0.75 | 0.5 | 0.25 | 0.5 |
| nf30a7195098a8fe | 0.0 | 0.0 | 0.0 | 0.5 | 0.75 | 0.5 |


## 2. Synthetic data (SDV)

In [23]:
show_doc(SyntheticDataGenerator)

<h2 id="SyntheticDataGenerator" class="doc_header"><code>class</code> <code>SyntheticDataGenerator</code><a href="https://github.com/crowdcent/numerblox/tree/master/numerblox/preprocessing.py#L180" class="source_link" style="float:right">[source]</a></h2>

> <code>SyntheticDataGenerator</code>(**`model_path`**:`str`, **`model_name`**=*`'CTGAN'`*, **`rows_per_era`**:`int`=*`5400`*, **`eras_to_add`**:`int`=*`1`*) :: [`BaseProcessor`](/numerbloxpreprocessing.html#BaseProcessor)

Generate synthetic eras. Uses SDV (sdv.dev) under the hood.

:param model_name: Exact class name of a model supported on sdv. 

:param model_path: Either: 

1. Path to trained model. 

2. Path to where you want to save the fitted model. 

If model_path does not point to a valid file, a new model will be initialized, fitted and saved.
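
The `model_path` behaviour described above is a load-or-fit pattern. Below is a minimal sketch of that pattern using only the standard library. The `fit_model` function is a hypothetical stand-in for fitting a real SDV model, and `load_or_fit` is likewise illustrative and not part of `numerblox`. JSON is used here only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def fit_model(values):
    """Hypothetical stand-in for fitting a generative model: it only stores the mean."""
    return {"mean": sum(values) / len(values)}


def load_or_fit(model_path, values):
    """Load a saved model if the file exists, otherwise fit a new one and save it."""
    path = Path(model_path)
    if path.is_file():
        return json.loads(path.read_text()), "loaded"
    model = fit_model(values)
    path.write_text(json.dumps(model))
    return model, "fitted"


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.json"
    # No file yet: a new model is fitted and saved.
    _, first_status = load_or_fit(model_file, [0.0, 0.5, 1.0])
    # File exists now: the saved model is loaded and the new data is ignored.
    reloaded, second_status = load_or_fit(model_file, [9.0])
```

With this pattern, the expensive fit happens once and later runs reuse the saved model, which fits the warning at the top of this notebook about long fitting times.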

In [24]:
# Clean up environment
# dl.remove_base_directory()