<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

In [None]:
#| include: false
from nbdev.showdoc import *

## Overview: The NumerFrame

`NumerFrame` is a data structure that extends `pd.DataFrame` with functionality convenient for Numerai users. The main benefits include:
1. Automatic tracking of feature, target, prediction and other columns, with easy retrieval of these data slices.
2. Adding, exporting and importing metadata, which can also be updated or manipulated dynamically within your Numerai data pipeline.
3. Automatic recognition of the era column (`era`, `friday_date` or `date`) by other library functionality.
4. Integration with other library components (i.e. `preprocessing`, `model`, `modelpipeline`, `postprocessing`, `evaluation` and `submission`) to create more solid inference pipelines and increase reliability.

In addition, all functionality of Pandas DataFrames remains available in the `NumerFrame`. You therefore don't have to create new pipelines to process your data when using `NumerFrame`.

We adopt the convention:
 1. All feature column names should start with `'feature'`.
 2. All target column names should start with `'target'`.
 3. All prediction column names should start with `'prediction'`.
 4. Data should contain an `'era'`, `'friday_date'` or `'date'` column, as is almost always the case with Numerai datasets.

Every column for which these conditions do not hold will be classified as an `'aux'` column.
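
The naming convention above can be illustrated with plain pandas. This is only a minimal sketch of prefix-based grouping; `classify_columns` and `ERA_CANDIDATES` are hypothetical names used for illustration and are not part of the library.

```python
import pandas as pd

# Era column names mentioned in the convention above.
ERA_CANDIDATES = ("era", "friday_date", "date")


def classify_columns(columns):
    """Group column names by prefix; everything else is treated as 'aux'."""
    groups = {"feature": [], "target": [], "prediction": [], "aux": []}
    for col in columns:
        for prefix in ("feature", "target", "prediction"):
            if col.startswith(prefix):
                groups[prefix].append(col)
                break
        else:
            # No known prefix matched, so this is an auxiliary column.
            groups["aux"].append(col)
    return groups


df = pd.DataFrame(
    columns=["id", "era", "feature_a", "feature_b", "target", "target_x_20", "prediction_m1"]
)
groups = classify_columns(df.columns)
# Pick the first era candidate that is present, if any.
era_col = next((c for c in ERA_CANDIDATES if c in df.columns), None)
print(groups)
print(era_col)
```

Note that under this convention the era column itself has no special prefix, so it ends up in the `aux` group.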

In [1]:
#| echo: false
#| output: asis
show_doc(NumerFrame)

---

### NumerFrame

>      NumerFrame (*args, **kwargs)

Data structure which extends Pandas DataFrames and
allows for additional Numerai specific functionality.

`create_numerframe` automatically recognizes your data file format, loads it into a `NumerFrame`, allows for column selection before loading and optionally adds metadata.

Supported file formats are `.csv`, `.parquet`, `.pkl`, `.pickle`, `.xls`, `.xlsx`, `.xlsm`, `.xlsb`, `.odf`, `.ods` and `.odt`. If the file format for your use case is missing, feel free to create a GitHub issue or submit a pull request. See `README.md` for more information on contributing.
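
To give an idea of what format recognition by file extension can look like, here is a rough sketch that maps a suffix to a pandas reader. `pick_reader` and the suffix sets are hypothetical; how `create_numerframe` does this internally is not shown in this document.

```python
from pathlib import Path

import pandas as pd

# Suffixes handled by pandas' Excel and pickle readers (assumed grouping).
EXCEL_SUFFIXES = {".xls", ".xlsx", ".xlsm", ".xlsb", ".odf", ".ods", ".odt"}
PICKLE_SUFFIXES = {".pkl", ".pickle"}


def pick_reader(file_path):
    """Return the pandas loading function that matches the file suffix."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv
    if suffix == ".parquet":
        return pd.read_parquet
    if suffix in PICKLE_SUFFIXES:
        return pd.read_pickle
    if suffix in EXCEL_SUFFIXES:
        return pd.read_excel
    raise NotImplementedError(f"Unsupported file format: '{suffix}'")


reader = pick_reader("test_assets/mini_numerai_version_2_data.parquet")
print(reader.__name__)
```

Extra `*args` and `**kwargs` could then be forwarded to the selected reader, which matches the documented behaviour of passing them on to the Pandas loading function.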

In [2]:
#| echo: false
#| output: asis
show_doc(create_numerframe)

---

### create_numerframe

>      create_numerframe (file_path:str, metadata:dict=None, columns:list=None,
>                         *args, **kwargs)

Convenience function to initialize a NumerFrame.
Supports the most used file formats for Pandas DataFrames (.csv, .parquet, .xls, .pkl, etc.).
For more details, see https://pandas.pydata.org/docs/reference/io.html

:param file_path: Relative or absolute path to data file. 

:param metadata: Metadata to be stored in NumerFrame.meta. 

:param columns: Which columns to read (All by default). 

*args, **kwargs will be passed to Pandas loading function.

## NumerFrame Usage

A `NumerFrame` object can be initialized from memory just like you would with a Pandas DataFrame.
You then have the option to add metadata with `.add_metadata`. All metadata will be stored in the `meta` attribute.

### 1. Initialize from memory

In [None]:
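import uuid

import numpy as np
import pandas as pd

# Ten feature columns and one unique ID per row
test_features = [f"feature_{letter}" for letter in "ABCDEFGHIK"]
id_col = [uuid.uuid4().hex for _ in range(100)]

# Random DataFrame with features, IDs, three targets and a date (era) column
dataf = pd.DataFrame(np.random.uniform(size=(100, 10)), columns=test_features)
dataf["id"] = id_col
dataf[["target", "target_1", "target_2"]] = np.random.normal(size=(100, 3))
dataf["date"] = range(100)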

In [None]:
metadata = {
    "version": 42,
    "additional_info": "test_model",
    "multi_target": False,
    "tournament_type": "random",
}
memory_dataf = NumerFrame(dataf)
memory_dataf.add_metadata(metadata)
assert memory_dataf.meta.version == 42
assert memory_dataf.meta.tournament_type == "random"

Metadata is stored in `.meta` and can be accessed as a dictionary or as attributes.

In [None]:
memory_dataf.meta

{'era_col': 'date',
 'era_col_verified': True,
 'version': 42,
 'additional_info': 'test_model',
 'multi_target': False,
 'tournament_type': 'random'}

In [None]:
memory_dataf.meta.version

42

In [None]:
memory_dataf.meta['version']

42

In [None]:
assert memory_dataf.meta.version == memory_dataf.meta['version']
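
The dual dictionary and attribute access shown above can be mimicked with a small `dict` subclass. This is only an illustrative sketch; `AttrDict` is a hypothetical class and not necessarily how `.meta` is implemented.

```python
class AttrDict(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError as err:
            raise AttributeError(name) from err

    def __setattr__(self, name, value):
        # Store attribute assignments as dictionary items.
        self[name] = value


meta = AttrDict(version=42, tournament_type="random")
meta.multi_target = False
print(meta.version, meta["version"], meta["multi_target"])
```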

### 2. Initialize from file path

You can also use the convenience function `create_numerframe` to initialize a `NumerFrame` directly from a file. Think of it as a dynamic `pd.read_csv`, `pd.read_parquet`, etc. that also accepts metadata.

In [None]:
metadata = {
    "version": 2,
    "multi_target": False,
    "tournament_type": "classic",
    "era_col": "era"
}

num_dataf = create_numerframe(
    "test_assets/mini_numerai_version_2_data.parquet",
    metadata=metadata,
)
assert num_dataf.meta.version == 2
assert num_dataf.meta.era_col == "era"
assert not num_dataf.meta.multi_target
num_dataf.head(2)

| id | era | data_type | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | ... | target_paul_20 | target_paul_60 | target_george_20 | target_george_60 | target_william_20 | target_william_60 | target_arthur_20 | target_arthur_60 | target_thomas_20 | target_thomas_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 297 | train | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.5 | 1.0 | 0.25 | ... | 0.0 | 0.5 | 0.25 | 0.5 | 0.0 | 0.5 | 0.166667 | 0.5 | 0.333333 | 0.5 |
| n9d39dea58c9e3cf | 3 | train | 0.75 | 0.5 | 0.75 | 1.0 | 0.5 | 0.25 | 0.5 | 0.0 | ... | 0.5 | 0.75 | 0.5 | 0.5 | 0.666667 | 0.666667 | 0.5 | 0.666667 | 0.5 | 0.666667 |


### 3. Example functionality

In [None]:
num_dataf.meta

{'era_col': 'era',
 'era_col_verified': True,
 'version': 2,
 'additional_info': 'test_model',
 'multi_target': False,
 'tournament_type': 'classic'}

`.get_feature_data` will retrieve all columns where the column name starts with `feature`.

In [None]:
num_dataf.get_feature_data.head(2)

| id | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | feature_unscheduled_malignant_shingling | feature_clawed_unwept_adaptability | ... | feature_unpruned_pedagoguish_inkblot | feature_forworn_hask_haet | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.5 | 1.0 | 0.25 | 0.25 | 0.75 | ... | 0.75 | 0.0 | 1.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 1.0 | 0.25 |
| n9d39dea58c9e3cf | 0.75 | 0.5 | 0.75 | 1.0 | 0.5 | 0.25 | 0.5 | 0.0 | 1.0 | 0.25 | ... | 1.0 | 1.0 | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.0 |


`.get_target_data` retrieves all columns whose name starts with `"target"`.

In [None]:
num_dataf.get_target_data.head(2)

| id | target | target_nomi_20 | target_nomi_60 | target_jerome_20 | target_jerome_60 | target_janet_20 | target_janet_60 | target_ben_20 | target_ben_60 | target_alan_20 | ... | target_paul_20 | target_paul_60 | target_george_20 | target_george_60 | target_william_20 | target_william_60 | target_arthur_20 | target_arthur_60 | target_thomas_20 | target_thomas_60 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.25 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 0.25 | 0.5 | 0.5 | ... | 0.0 | 0.5 | 0.25 | 0.5 | 0.0 | 0.5 | 0.166667 | 0.5 | 0.333333 | 0.5 |
| n9d39dea58c9e3cf | 0.5 | 0.5 | 0.75 | 0.5 | 0.75 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | ... | 0.5 | 0.75 | 0.5 | 0.5 | 0.666667 | 0.666667 | 0.5 | 0.666667 | 0.5 | 0.666667 |


`.get_single_target_data` only retrieves the column `"target"`.

In [None]:
num_dataf.get_single_target_data.head(2)

| id | target |
| --- | --- |
| n559bd06a8861222 | 0.25 |
| n9d39dea58c9e3cf | 0.5 |


`.get_pattern_data` allows you to get columns based on a certain pattern. In this example we retrieve all 20-day targets.

In [None]:
num_dataf.get_pattern_data("_20").head(2)

| id | target_nomi_20 | target_jerome_20 | target_janet_20 | target_ben_20 | target_alan_20 | target_paul_20 | target_george_20 | target_william_20 | target_arthur_20 | target_thomas_20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.0 | 0.5 | 0.25 | 0.5 | 0.0 | 0.25 | 0.0 | 0.166667 | 0.333333 |
| n9d39dea58c9e3cf | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.666667 | 0.5 | 0.5 |
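
For intuition, pattern-based column selection can be approximated in plain pandas with a substring match on the column names. This sketch assumes simple substring matching, which may differ from how `.get_pattern_data` matches patterns; `select_by_pattern` is a hypothetical helper.

```python
import pandas as pd


def select_by_pattern(dataf, pattern):
    """Return all columns whose name contains the given substring."""
    cols = [col for col in dataf.columns if pattern in col]
    return dataf.loc[:, cols]


df = pd.DataFrame(
    {
        "era": [1, 1],
        "target_nomi_20": [0.25, 0.5],
        "target_nomi_60": [0.5, 0.75],
        "target_ben_20": [0.0, 0.5],
    }
)
twenty_day = select_by_pattern(df, "_20")
print(list(twenty_day.columns))
```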


`.get_era_batch` will return a `tf.Tensor` or `np.array` with feature data and target data for one or more eras. Convenient for creating neural network DataGenerators.

In [None]:
X_era, y_era = num_dataf.get_era_batch(['0003'], convert_to_tf=True, dtype=tf.float16)
X_era

<tf.Tensor: shape=(1, 1050), dtype=float16, numpy=array([[0.75, 0.5 , 0.75, ..., 1.  , 0.75, 1.  ]], dtype=float16)>

For people training an autoencoder combined with an MLP (AE-MLP), the returned target can contain three elements: features, targets and targets again. To get this, set `aemlp_batch=True`.
More info on this setup: [AutoEncoder and multitask MLP on new dataset forum post](https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338).

In [None]:
_, y_era_aemlp = num_dataf.get_era_batch(['0003'], convert_to_tf=True, aemlp_batch=True, dtype=tf.float16)
y_era_aemlp

[<tf.Tensor: shape=(1, 1050), dtype=float16, numpy=array([[0.75, 0.5 , 0.75, ..., 1.  , 0.75, 1.  ]], dtype=float16)>,
 <tf.Tensor: shape=(1, 21), dtype=float16, numpy=
 array([[0.5   , 0.5   , 0.75  , 0.5   , 0.75  , 0.5   , 0.5   , 0.5   ,
         0.5   , 0.5   , 0.5   , 0.5   , 0.75  , 0.5   , 0.5   , 0.6665,
         0.6665, 0.5   , 0.6665, 0.5   , 0.6665]], dtype=float16)>,
 <tf.Tensor: shape=(1, 21), dtype=float16, numpy=
 array([[0.5   , 0.5   , 0.75  , 0.5   , 0.75  , 0.5   , 0.5   , 0.5   ,
         0.5   , 0.5   , 0.5   , 0.5   , 0.75  , 0.5   , 0.5   , 0.6665,
         0.6665, 0.5   , 0.6665, 0.5   , 0.6665]], dtype=float16)>]
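
The idea behind era batches, including the three-element AE-MLP target, can be sketched with NumPy only. The function below is a simplified, hypothetical stand-in for illustration; it is not the library's `.get_era_batch` and it omits the TensorFlow conversion.

```python
import numpy as np
import pandas as pd


def era_batch(dataf, eras, era_col="era", aemlp_batch=False, dtype=np.float32):
    """Return (X, y) arrays for the rows that belong to the given eras."""
    mask = dataf[era_col].isin(eras)
    feature_cols = [c for c in dataf.columns if c.startswith("feature")]
    target_cols = [c for c in dataf.columns if c.startswith("target")]
    X = dataf.loc[mask, feature_cols].to_numpy(dtype=dtype)
    y = dataf.loc[mask, target_cols].to_numpy(dtype=dtype)
    if aemlp_batch:
        # AE-MLP setup: reconstruct the features and predict the targets twice.
        y = [X, y, y]
    return X, y


def iter_era_batches(dataf, era_col="era"):
    """Yield one (X, y) batch per era, e.g. to feed a data generator."""
    for era in dataf[era_col].unique():
        yield era_batch(dataf, [era], era_col=era_col)


df = pd.DataFrame(
    {
        "era": ["0001", "0001", "0002"],
        "feature_a": [0.25, 0.5, 0.75],
        "feature_b": [1.0, 0.0, 0.5],
        "target": [0.5, 0.25, 0.75],
    }
)
X, y = era_batch(df, ["0001"])
_, y_aemlp = era_batch(df, ["0001"], aemlp_batch=True)
print(X.shape, y.shape, len(y_aemlp))
```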

`.aux_cols` denotes all columns that are not features, targets or prediction columns.

In [None]:
num_dataf.aux_cols

['era', 'data_type']

In [None]:
num_dataf.get_aux_data.head(2)

| id | era | data_type |
| --- | --- | --- |
| n559bd06a8861222 | 297 | train |
| n9d39dea58c9e3cf | 3 | train |


In [None]:
num_dataf['prediction_1'] = np.random.uniform(size=len(num_dataf))
num_dataf['prediction_2'] = np.random.uniform(size=len(num_dataf))

To track new columns like prediction columns, make sure to initialize a new `NumerFrame`. Prediction columns can easily be retrieved with `.get_prediction_data`, or with `.get_prediction_aux_data` if you also want columns like `era` and `data_type`. This can be handy for ensembling and submission use cases.

In [None]:
num_dataf = NumerFrame(num_dataf)

In [None]:
num_dataf.get_prediction_data.head(2)

| id | prediction_1 | prediction_2 |
| --- | --- | --- |
| n559bd06a8861222 | 0.969691 | 0.712817 |
| n9d39dea58c9e3cf | 0.562595 | 0.364946 |


In [None]:
num_dataf.get_prediction_aux_data.head(2)

| id | prediction_1 | prediction_2 | era | data_type |
| --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.969691 | 0.712817 | 297 | train |
| n9d39dea58c9e3cf | 0.562595 | 0.364946 | 3 | train |
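
As an example of why keeping `era` next to the predictions is useful, the sketch below ensembles two prediction columns by ranking them within each era and averaging the ranks. This is just one possible ensembling scheme, written in plain pandas with made-up data; the column name `prediction_ensemble` is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "era": [1, 1, 2, 2],
        "prediction_1": [0.25, 0.75, 0.5, 1.0],
        "prediction_2": [0.75, 0.25, 0.0, 0.5],
    },
    index=pd.Index(["a", "b", "c", "d"], name="id"),
)

pred_cols = [c for c in df.columns if c.startswith("prediction")]
# Percentile-rank every prediction column within its era, then average per row.
ranked = df.groupby("era")[pred_cols].rank(pct=True)
df["prediction_ensemble"] = ranked.mean(axis=1)
print(df)
```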


Arbitrary metadata can be imported into the `NumerFrame` from a `.json` file. All metadata can also be exported to a `.json` file.

In [None]:
num_dataf.export_json_metadata("config.json")

In [None]:
num_dataf.import_json_metadata("config.json")

In [None]:
assert num_dataf.meta.version == 2
assert not num_dataf.meta.multi_target
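
Conceptually, exporting and importing metadata is a JSON round trip. The following sketch shows that round trip with the standard library only, using a temporary directory and a plain dictionary as a stand-in for `.meta`; it does not call the `NumerFrame` methods themselves.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the metadata stored in `.meta`.
meta = {"era_col": "era", "version": 2, "multi_target": False}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = Path(tmp_dir) / "config.json"
    # Export: serialize the metadata to disk.
    config_path.write_text(json.dumps(meta, indent=2))
    # Import: read it back into a dictionary.
    restored = json.loads(config_path.read_text())

print(restored)
```

Only JSON-serializable values (strings, numbers, booleans, lists, dictionaries and `None`) survive such a round trip unchanged.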

Because `NumerFrame` inherits from `pd.DataFrame` you still have all functionality of a normal DataFrame at your disposal, like copying.

In [None]:
dataf2 = num_dataf.copy()
assert dataf2.equals(num_dataf)

`NumerFrame` dynamically tracks which feature, target, aux and prediction columns there are when initialized. For example, here we add a new prediction column. Upon initialization the column will be contained in `prediction_cols`. Prediction columns are all column names that start with `prediction`.

In [None]:
num_dataf.loc[:, "prediction_test_1"] = np.random.uniform(size=len(num_dataf))
new_dataset = NumerFrame(num_dataf)
assert "prediction_test_1" in new_dataset.prediction_cols
assert new_dataset.meta.version == 2

Arbitrary columns can be retrieved with `.get_column_selection`. The input argument can be either a string or a list of column names.

In [None]:
selection1 = num_dataf.get_column_selection("era")
selection1.head(2)

| id | era |
| --- | --- |
| n559bd06a8861222 | 297 |
| n9d39dea58c9e3cf | 3 |


In [None]:
selection2 = num_dataf.get_column_selection(["era", "prediction_test_1"])
selection2.head(2)

| id | era | prediction_test_1 |
| --- | --- | --- |
| n559bd06a8861222 | 297 | 0.06453 |
| n9d39dea58c9e3cf | 3 | 0.128838 |


In [None]:
#| include: false
for sel in [selection1, selection2]:
    assert isinstance(sel, NumerFrame)

For convenience we can get a feature, target pair with one method. If `multi_target=True` all columns where the column name starts with `target` will be retrieved.

In [None]:
features, single_target = num_dataf.get_feature_target_pair(multi_target=False)
features.head(2)

| id | feature_dichasial_hammier_spawner | feature_rheumy_epistemic_prancer | feature_pert_performative_hormuz | feature_hillier_unpitied_theobromine | feature_perigean_bewitching_thruster | feature_renegade_undomestic_milord | feature_koranic_rude_corf | feature_demisable_expiring_millepede | feature_unscheduled_malignant_shingling | feature_clawed_unwept_adaptability | ... | feature_unpruned_pedagoguish_inkblot | feature_forworn_hask_haet | feature_drawable_exhortative_dispersant | feature_metabolic_minded_armorist | feature_investigatory_inerasable_circumvallation | feature_centroclinal_incentive_lancelet | feature_unemotional_quietistic_chirper | feature_behaviorist_microbiological_farina | feature_lofty_acceptable_challenge | feature_coactive_prefatorial_lucy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n559bd06a8861222 | 0.25 | 0.75 | 0.25 | 0.75 | 0.25 | 0.5 | 1.0 | 0.25 | 0.25 | 0.75 | ... | 0.75 | 0.0 | 1.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 1.0 | 0.25 |
| n9d39dea58c9e3cf | 0.75 | 0.5 | 0.75 | 1.0 | 0.5 | 0.25 | 0.5 | 0.0 | 1.0 | 0.25 | ... | 1.0 | 1.0 | 0.25 | 0.5 | 0.0 | 0.25 | 0.75 | 1.0 | 0.75 | 1.0 |


In [None]:
single_target.head(2)

| id | target |
| --- | --- |
| n559bd06a8861222 | 0.25 |
| n9d39dea58c9e3cf | 0.5 |


-----------------------------------------------