<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

In [None]:
#| include: false

## Overview

The postprocessing procedure is similar to preprocessing. Preprocessors manipulate and/or add `feature` columns, while postprocessors manipulate and/or add `prediction` columns.

Every postprocessor should inherit from `BasePostProcessor`, take a `NumerFrame` as input, and return a `NumerFrame`. A postprocessor adds or manipulates one or more columns whose names start with `prediction`.

In [None]:
#| include: false
from nbdev.showdoc import *

## 0. BasePostProcessor

Some characteristics are particular to postprocessors and do not belong in the `Processor` base class.
This functionality is implemented in `BasePostProcessor`.

In [1]:
#| echo: false
#| output: asis
show_doc(BasePostProcessor)

---

### BasePostProcessor

>      BasePostProcessor (final_col_name:str)

Base class for postprocessing objects.

Postprocessors manipulate or introduce new prediction columns in a NumerFrame.

## 1. Common postprocessing steps

We invite the Numerai community to develop new postprocessors so that everyone can benefit from new insights and research.
This section implements commonly used postprocessing for Numerai.

## 1.0. Tournament agnostic

Postprocessing that works for both Numerai Classic and Numerai Signals.

### 1.0.1. Standardization

Standardizing is essential for reliably combining Numerai predictions. It is a default postprocessor for `ModelPipeline`.

In [2]:
#| echo: false
#| output: asis
show_doc(Standardizer)

---

### Standardizer

>      Standardizer (cols:list=None)

Uniform standardization of prediction columns.
All values should lie in the range [0...1].

:param cols: All prediction columns that should be standardized. Use all prediction columns by default.

In [None]:
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = [0, 1, 2, 3] * 25
test_dataf = NumerFrame(df)

In [None]:
std = Standardizer()
std.transform(test_dataf).get_prediction_data.head(2)

Unnamed: 0,prediction_A,prediction_B,prediction_C,prediction_D,prediction_E
0,0.32,0.2,0.08,0.68,0.4
1,0.16,0.88,0.2,0.08,0.88
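
The values in the output above are multiples of 0.04 (that is, 1/25, with 25 rows per era in the random frame), which is consistent with a percentile rank computed within each era. The following is a minimal pandas sketch of that idea, not the library's implementation; the function name `standardize_by_era` and the default `era_col="era"` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def standardize_by_era(df: pd.DataFrame, cols: list, era_col: str = "era") -> pd.DataFrame:
    """Replace each column in `cols` with its percentile rank within each era.

    Hypothetical helper for illustration; not part of the library.
    """
    out = df.copy()
    # rank(pct=True) maps the values of every era onto (0, 1]
    out[cols] = df.groupby(era_col)[cols].rank(pct=True)
    return out


rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "prediction_A": rng.uniform(size=100),
        "prediction_B": rng.normal(size=100),  # different scale on purpose
        "era": [0, 1, 2, 3] * 25,
    }
)
standardized = standardize_by_era(demo, ["prediction_A", "prediction_B"])
print(standardized.head(2))
```

After this step both columns share the same uniform distribution per era, regardless of their original scale.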


### 1.0.2. Ensembling

Predictions from several models can be ensembled in multiple ways. We provide the most common use cases here.

#### 1.0.2.1. Simple Mean

In [3]:
#| echo: false
#| output: asis
show_doc(MeanEnsembler)

---

### MeanEnsembler

>      MeanEnsembler (final_col_name:str, cols:list=None,
>                     standardize:bool=False)

Take simple mean of multiple cols and store in new col.

:param final_col_name: Name of new averaged column.
final_col_name should start with "prediction". 

:param cols: Column names to average. 

:param standardize: Whether to standardize by era before averaging. Highly recommended as columns that are averaged may have different distributions.
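
To illustrate why standardizing before averaging matters, here is a rough pandas sketch of a simple mean with optional per-era standardization. It assumes standardization means a per-era percentile rank; `mean_ensemble` and its defaults are hypothetical names for this example, not the library's API.

```python
import pandas as pd


def mean_ensemble(df, cols, final_col_name="prediction_mean", standardize=False, era_col="era"):
    """Average `cols` row-wise, optionally after a per-era percentile rank.

    Hypothetical helper for illustration; not part of the library.
    """
    out = df.copy()
    values = df.groupby(era_col)[cols].rank(pct=True) if standardize else df[cols]
    out[final_col_name] = values.mean(axis=1)
    return out


demo = pd.DataFrame(
    {
        "prediction_A": [0.1, 0.9, 0.4, 0.6],
        "prediction_B": [10.0, 30.0, 20.0, 40.0],  # much larger scale than A
        "era": [1, 1, 2, 2],
    }
)
cols = ["prediction_A", "prediction_B"]
plain = mean_ensemble(demo, cols)
ranked = mean_ensemble(demo, cols, standardize=True)
print(plain["prediction_mean"].tolist())
print(ranked["prediction_mean"].tolist())
```

Without standardization the larger-scale column dominates the average; with it, both columns contribute equally.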

#### 1.0.2.2. Donate's formula

This weighted-averaging method is mainly suited to multiple models trained with a time series cross-validation scheme. The earlier models are trained on less data, so they receive a lower weight than the later models.

Source: [Yirun Zhang in his winning solution for the Jane Street 2021 Kaggle competition](https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp).
Based on a [paper by Donate et al.](https://doi.org/10.1016/j.neucom.2012.02.053)

In [4]:
#| echo: false
#| output: asis
show_doc(DonateWeightedEnsembler)

---

### DonateWeightedEnsembler

>      DonateWeightedEnsembler (final_col_name:str, cols:list=None)

Weighted average as per Donate et al.'s formula
Paper Link: https://doi.org/10.1016/j.neucom.2012.02.053
Code source: https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp

Weightings for 5 folds: [0.0625, 0.0625, 0.125, 0.25, 0.5]

:param cols: Prediction columns to ensemble.
Uses all prediction columns by default. 

:param final_col_name: New column name for ensembled values.

In [None]:
#| include: false
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = range(100)
test_dataf = NumerFrame(df)

For 5 folds, the weightings are `[0.0625, 0.0625, 0.125, 0.25, 0.5]`.
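
These weights can be derived for any number of folds. The sketch below assumes the scheme used in the linked Kaggle notebook: fold `j` of `n` gets weight `1 / 2**(n + 1 - j)`, and the first fold shares the weight of the second so that the weights sum to 1. The helper names `donate_weights` and `donate_average` are made up for this example.

```python
def donate_weights(n_folds: int) -> list:
    """Weights 1 / 2**(n + 1 - j) for folds j = 1..n.

    Fold 1 reuses the weight of fold 2, which makes the weights sum to 1.
    Hypothetical helper for illustration; not part of the library.
    """
    weights = []
    for j in range(1, n_folds + 1):
        j = max(j, 2)  # first fold shares the exponent of the second fold
        weights.append(1 / 2 ** (n_folds + 1 - j))
    return weights


def donate_average(values: list) -> float:
    """Weighted average of per-fold predictions, ordered from oldest to newest fold."""
    weights = donate_weights(len(values))
    return sum(w * v for w, v in zip(weights, values))


print(donate_weights(5))  # [0.0625, 0.0625, 0.125, 0.25, 0.5]
```

Each later fold receives twice the weight of the one before it, so the most recent model contributes half of the final prediction.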

In [None]:
w_5_fold = [0.0625, 0.0625, 0.125, 0.25, 0.5]
donate = DonateWeightedEnsembler(
    cols=test_dataf.prediction_cols, final_col_name="prediction"
)
ensembled = donate(test_dataf).get_prediction_data
assert np.isclose(  # compare floats with a tolerance rather than exact equality
    ensembled["prediction"][0],
    np.sum([w * elem for w, elem in zip(w_5_fold, ensembled[test_features].iloc[0])]),
)
ensembled.head(2)

Unnamed: 0,prediction_A,prediction_B,prediction_C,prediction_D,prediction_E,prediction
0,0.111924,0.935528,0.853572,0.351036,0.158973,0.339408
1,0.846985,0.748842,0.83988,0.781556,0.354063,0.577145


#### 1.0.2.3. Geometric Mean

Take the geometric mean of multiple prediction columns: the n-th root of the product of the n values.

**More info on Geometric mean:**
- [Wikipedia](https://en.wikipedia.org/wiki/Geometric_mean)
- [Investopedia](https://www.investopedia.com/terms/g/geometricmean.asp)

In [5]:
#| echo: false
#| output: asis
show_doc(GeometricMeanEnsembler)

---

### GeometricMeanEnsembler

>      GeometricMeanEnsembler (final_col_name:str, cols:list=None)

Calculate the weighted Geometric mean.

:param cols: Prediction columns to ensemble.
Uses all prediction columns by default. 

:param final_col_name: New column name for ensembled values.

In [None]:
geo_mean = GeometricMeanEnsembler(final_col_name="prediction_geo")
ensembled = geo_mean(test_dataf).get_prediction_data
ensembled.head(2)

Unnamed: 0,prediction_A,prediction_B,prediction_C,prediction_D,prediction_E,prediction,prediction_geo
0,0.111924,0.935528,0.853572,0.351036,0.158973,0.339408,0.346401
1,0.846985,0.748842,0.83988,0.781556,0.354063,0.577145,0.681875
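
The first row of the output above can be checked by hand. The sketch below computes an unweighted geometric mean through logarithms, which avoids underflow when many small values are multiplied. Applied to the five columns `prediction_A` to `prediction_E` of row 0, it gives roughly the `prediction_geo` value shown, which suggests (but does not prove) that the ensembler averaged only those five columns. `geometric_mean` is a hypothetical helper, not the library's implementation.

```python
import math


def geometric_mean(values) -> float:
    """n-th root of the product of n positive values, computed via logs.

    Hypothetical helper for illustration; not part of the library.
    """
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean requires a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# prediction_A..prediction_E of row 0 in the table above
row_0 = [0.111924, 0.935528, 0.853572, 0.351036, 0.158973]
print(round(geometric_mean(row_0), 4))  # 0.3464
```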


### 1.0.3. Neutralization and penalization

#### 1.0.3.1. Feature Neutralization

Classic feature neutralization (subtracting linear model from scores).

New column name for neutralized values will be `{pred_name}_neutralized_{PROPORTION}`. `pred_name` should start with `'prediction'`.

Optionally, you can run feature neutralization on the GPU using [cupy](https://docs.cupy.dev/en/stable/overview.html) by setting `cuda=True`. Make sure you have `cupy` installed with the correct CUDA Toolkit version. More information: [docs.cupy.dev/en/stable/install.html](https://docs.cupy.dev/en/stable/install.html)

[Detailed explanation of Feature Neutralization by Katsu1110](https://www.kaggle.com/code1110/janestreet-avoid-overfit-feature-neutralization)
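
The core idea, subtracting a linear model from the scores, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: it fits predictions on the features by least squares and subtracts `proportion` times the fitted values. The library version may additionally work per era, normalise the predictions, and rescale the result; those steps are omitted here. The function name `neutralize` is chosen for this example.

```python
import numpy as np


def neutralize(preds: np.ndarray, features: np.ndarray, proportion: float = 0.5) -> np.ndarray:
    """Subtract `proportion` of the least-squares linear fit of `preds` on `features`.

    Hypothetical helper for illustration; not part of the library.
    """
    coef, *_ = np.linalg.lstsq(features, preds, rcond=None)
    return preds - proportion * (features @ coef)


rng = np.random.default_rng(42)
features = rng.normal(size=(200, 5))
# Predictions that depend linearly on the features plus noise
preds = features @ rng.normal(size=5) + rng.normal(size=200)

fully_neutral = neutralize(preds, features, proportion=1.0)
# With proportion=1.0 the result has no remaining linear relation to any feature
print(np.abs(features.T @ fully_neutral).max())
```

A `proportion` below 1 removes only part of the linear exposure, trading some feature-driven signal for lower feature risk.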

In [6]:
#| echo: false
#| output: asis
show_doc(FeatureNeutralizer)

---

### FeatureNeutralizer

>      FeatureNeutralizer (feature_names:list=None, pred_name:str='prediction',
>                          proportion:float=0.5, suffix:str=None, cuda=False)

Classic feature neutralization by subtracting linear model.

:param feature_names: List of column names to neutralize against. Uses all feature columns by default. 

:param pred_name: Prediction column to neutralize. 

:param proportion: Number in range [0...1] indicating how much to neutralize. 

:param suffix: Optional suffix that is added to new column name. 

:param cuda: Do neutralization on the GPU 

Make sure you have CuPy installed when setting cuda to True. 

Installation docs: docs.cupy.dev/en/stable/install.html

In [None]:
test_dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataf.loc[:, "prediction"] = np.random.uniform(size=len(test_dataf))

In [None]:
ft = FeatureNeutralizer(
    feature_names=test_dataf.feature_cols, pred_name="prediction", proportion=0.8
)
new_dataf = ft.transform(test_dataf)

In [None]:
# Note: `x in series` tests index labels, not values, so check the value range instead.
neutralized = new_dataf.get_prediction_data["prediction_neutralized_0.8"]
assert "prediction_neutralized_0.8" in new_dataf.prediction_cols
assert np.isclose(neutralized.min(), 0.0)
assert np.isclose(neutralized.max(), 1.0)

Generated columns and data can be easily retrieved from the `NumerFrame`.

In [None]:
new_dataf.prediction_cols

['prediction', 'prediction_neutralized_0.8']

In [None]:
new_dataf.get_prediction_data.head(3)

Unnamed: 0,prediction,prediction_neutralized_0.8
0,0.516076,0.461802
1,0.356837,0.29497
2,0.283496,0.184947


In [None]:
#| include: false
# ft = FeatureNeutralizer(
#     feature_names=test_dataf.feature_cols, pred_name="prediction",
#     proportion=0.8, cuda=True
# )
# new_dataf_cuda = ft.transform(test_dataf)
# new_dataf_cuda.head(2)

#### 1.0.3.2. Feature Penalization

In [7]:
#| echo: false
#| output: asis
show_doc(FeaturePenalizer)

---

### FeaturePenalizer

>      FeaturePenalizer (max_exposure:float, feature_names:list=None,
>                        pred_name:str='prediction', suffix:str=None)

Feature penalization with TensorFlow.

Source (by jrb): https://github.com/jonrtaylor/twitch/blob/master/FE_Clipping_Script.ipynb

Source of first PyTorch implementation (by Michael Oliver / mdo): https://forum.numer.ai/t/model-diagnostics-feature-exposure/899/12

:param feature_names: List of column names to reduce feature exposure. Uses all feature columns by default. 

:param pred_name: Prediction column to neutralize. 

:param max_exposure: Number in range [0...1] giving the level to which the maximum feature exposure should be reduced.
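
To judge what a given `max_exposure` means for your predictions, it can help to measure feature exposure directly. The sketch below is not the penalization itself. It assumes feature exposure is the Pearson correlation between the predictions and each feature, computed per era, and reports the largest absolute value. `max_feature_exposure` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd


def max_feature_exposure(df, feature_cols, pred_col="prediction", era_col="era"):
    """Per era, the largest absolute correlation between predictions and any feature.

    Hypothetical helper for illustration; not part of the library.
    """
    per_era = {
        era: era_df[feature_cols].corrwith(era_df[pred_col]).abs().max()
        for era, era_df in df.groupby(era_col)
    }
    return pd.Series(per_era, name="max_feature_exposure")


rng = np.random.default_rng(1)
feature_cols = ["feature_a", "feature_b", "feature_c"]
demo = pd.DataFrame(rng.uniform(size=(60, 3)), columns=feature_cols)
demo["era"] = np.repeat([1, 2, 3], 20)
# Predictions that lean heavily on feature_a
demo["prediction"] = 0.5 * demo["feature_a"] + 0.5 * rng.uniform(size=60)

exposure = max_feature_exposure(demo, feature_cols)
print(exposure)
```

Comparing this number before and after penalization shows whether the exposure actually dropped to the requested level.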

In [None]:
#| include: false
test_dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataf.loc[:, "prediction"] = np.random.uniform(size=len(test_dataf))
# ft = FeaturePenalizer(pred_name='prediction', max_exposure=0.8)
# new_dataf = ft.transform(test_dataf)

## 1.1. Numerai Classic

Postprocessing steps that are specific to Numerai Classic.

In [None]:
# 1.1.
# No Numerai Classic specific postprocessors implemented yet.

## 1.2. Numerai Signals

Postprocessors that are specific to Numerai Signals.

In [None]:
# 1.2.
# No Numerai Signals specific postprocessors implemented yet.

## 2. Custom PostProcessors

As with preprocessors, there is an almost unlimited number of ways to postprocess data. We (once again) invite the Numerai community to develop Numerai Classic and Signals postprocessors.

A new Postprocessor should inherit from `BasePostProcessor` and implement a `transform` method. The `transform` method should take a `NumerFrame` as input and return a `NumerFrame` object as output. A template for this is given below.

We recommend adding `@typechecked` at the top of a new postprocessor to enforce types and provide useful debugging stacktraces.

To enable fancy logging output, add the `@display_processor_info` decorator to the `transform` method.

In [8]:
#| echo: false
#| output: asis
show_doc(AwesomePostProcessor)

---

### AwesomePostProcessor

>      AwesomePostProcessor (final_col_name:str, *args, **kwargs)

TEMPLATE - Do some awesome postprocessing.

:param final_col_name: Column name to store manipulated or ensembled predictions in.
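
As a rough illustration of this contract, the sketch below implements a small clipping postprocessor. So that it runs without the library installed, it uses a stand-in base class and a plain pandas `DataFrame`. Both `StandInPostProcessor` and `ClipPostProcessor` are hypothetical. In practice you would inherit from `BasePostProcessor`, accept and return a `NumerFrame`, and optionally add the decorators mentioned above.

```python
import pandas as pd


class StandInPostProcessor:
    """Hypothetical stand-in for `BasePostProcessor` so this sketch is self-contained."""

    def __init__(self, final_col_name: str):
        if not final_col_name.startswith("prediction"):
            raise ValueError("final_col_name should start with 'prediction'.")
        self.final_col_name = final_col_name

    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def __call__(self, dataf: pd.DataFrame) -> pd.DataFrame:
        return self.transform(dataf)


class ClipPostProcessor(StandInPostProcessor):
    """Hypothetical example: clip a prediction column to [lower, upper]."""

    def __init__(self, final_col_name="prediction_clipped", pred_name="prediction",
                 lower=0.05, upper=0.95):
        super().__init__(final_col_name=final_col_name)
        self.pred_name = pred_name
        self.lower = lower
        self.upper = upper

    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy and store the result in a new prediction column
        dataf = dataf.copy()
        dataf[self.final_col_name] = dataf[self.pred_name].clip(self.lower, self.upper)
        return dataf


demo = pd.DataFrame({"prediction": [0.01, 0.5, 0.99]})
clipped = ClipPostProcessor()(demo)
print(clipped)
```

The pattern is the same for any custom postprocessor: read existing prediction columns, write the result to a column whose name starts with `prediction`, and return the frame.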

------------------------------------------------------