In [None]:
#| include: false
%load_ext autoreload
%autoreload 2

In [None]:
#| default_exp postprocessing

## Overview

The postprocessing procedure is similar to preprocessing. Preprocessors manipulate and/or add `feature` columns, while postprocessors manipulate and/or add `prediction` columns.

Every postprocessor should inherit from `BasePostProcessor`. A postprocessor takes a `NumerFrame` as input and returns a `NumerFrame`. It adds or manipulates one or more columns whose names start with `prediction`.

In [None]:
#| include: false
from nbdev.showdoc import *

In [None]:
#| export
import scipy
import numpy as np
import pandas as pd
import tensorflow as tf
import scipy.stats as sp
from tqdm.auto import tqdm
from typeguard import typechecked
from rich import print as rich_print
from scipy.stats.mstats import gmean
from sklearn.preprocessing import MinMaxScaler

from numerblox.numerframe import NumerFrame, create_numerframe
from numerblox.preprocessing import BaseProcessor, display_processor_info

## 0. BasePostProcessor

Some functionality is specific to postprocessors and does not belong in the `BaseProcessor` base class.
This functionality is implemented in `BasePostProcessor`.

In [None]:
#| export
class BasePostProcessor(BaseProcessor):
    """
    Base class for postprocessing objects.

    Postprocessors manipulate or introduce new prediction columns in a NumerFrame.
    """
    def __init__(self, final_col_name: str):
        super().__init__()
        self.final_col_name = final_col_name
        if not final_col_name.startswith("prediction"):
            rich_print(f":warning: WARNING: final_col_name should start with 'prediction'. Column output will be: '{final_col_name}'. :warning:")

    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        ...

## 1. Common postprocessing steps

We invite the Numerai community to develop new postprocessors so that everyone can benefit from new insights and research.
This section implements commonly used postprocessing for Numerai.

## 1.0. Tournament agnostic

Postprocessing that works for both Numerai Classic and Numerai Signals.

### 1.0.1. Standardization

Standardizing is an essential step in order to reliably combine Numerai predictions. It is a default postprocessor for `ModelPipeline`.
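
As a minimal sketch of the idea, the snippet below percentile-ranks a prediction column within each era using plain pandas. The toy data and the column names `era` and `prediction` are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy data: two eras whose predictions live on very different scales.
df = pd.DataFrame(
    {
        "era": [1, 1, 1, 1, 2, 2],
        "prediction": [0.9, 0.1, 0.5, 0.3, 10.0, -2.0],
    }
)

# Percentile rank within each era: rank divided by the era's row count.
ranked = df.groupby("era")["prediction"].rank(pct=True)
print(ranked.tolist())
# [1.0, 0.25, 0.75, 0.5, 1.0, 0.5]
```

After ranking, every era is mapped onto the same scale regardless of the raw prediction values, which is what makes columns comparable before ensembling.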

In [None]:
#| export
@typechecked
class Standardizer(BasePostProcessor):
    """
    Uniform standardization of prediction columns.
    Predictions are percentile-ranked per era, so all output values lie in the range [0...1].

    :param cols: All prediction columns that should be standardized. Use all prediction columns by default.
    """

    def __init__(self, cols: list = None):
        super().__init__(final_col_name="prediction")
        self.cols = cols

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        cols = dataf.prediction_cols if not self.cols else self.cols
        dataf.loc[:, cols] = dataf.groupby(dataf.meta.era_col)[cols].rank(pct=True)
        return NumerFrame(dataf)

In [None]:
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = [0, 1, 2, 3] * 25
test_dataf = NumerFrame(df)

In [None]:
std = Standardizer()
std.transform(test_dataf).get_prediction_data.head(2)

| | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E |
|---|---|---|---|---|---|
| 0 | 0.32 | 0.2 | 0.08 | 0.68 | 0.4 |
| 1 | 0.16 | 0.88 | 0.2 | 0.08 | 0.88 |


### 1.0.2. Ensembling

Multiple prediction results can be ensembled in multiple ways. We provide the most common use cases here.

#### 1.0.2.1. Simple Mean

In [None]:
#| export
@typechecked
class MeanEnsembler(BasePostProcessor):
    """
    Take simple mean of multiple cols and store in new col.

    :param final_col_name: Name of new averaged column.
    final_col_name should start with "prediction". \n
    :param cols: Column names to average. \n
    :param standardize: Whether to standardize by era before averaging. Highly recommended as columns that are averaged may have different distributions.
    """

    def __init__(
        self, final_col_name: str, cols: list = None, standardize: bool = False
    ):
        self.cols = cols
        self.standardize = standardize
        super().__init__(final_col_name=final_col_name)

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        cols = self.cols if self.cols else dataf.prediction_cols
        if self.standardize:
            to_average = dataf.groupby(dataf.meta.era_col)[cols].rank(pct=True)
        else:
            to_average = dataf[cols]
        dataf.loc[:, self.final_col_name] = to_average.mean(axis=1)
        rich_print(
            f":stew: Ensembled [blue]'{cols}'[/blue] with simple mean and saved in [bold]'{self.final_col_name}'[/bold] :stew:"
        )
        return NumerFrame(dataf)

#### 1.0.2.2. Donate's formula

This method for weighted averaging is mostly suitable if you have multiple models trained on a time series cross validation scheme. The first models will be trained on less data so we want to give them a lower weighting compared to the later models.

Source: [Yirun Zhang in his winning solution for the Jane Street 2021 Kaggle competition](https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp).
Based on a [paper by Donate et al.](https://doi.org/10.1016/j.neucom.2012.02.053)
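
The weighting scheme can be sketched as a standalone function. The name `donate_weights` is hypothetical; the logic mirrors the `_get_weights` method of the class defined below. Each weight is half of the next one, and the first weight is set equal to the second so that the weights sum to 1.

```python
def donate_weights(n: int) -> list:
    """Exponential weights for n models, ordered from oldest to newest."""
    weights = []
    for j in range(1, n + 1):
        # The first model reuses the exponent of the second model,
        # which makes the weights sum to exactly 1.
        j = 2 if j == 1 else j
        weights.append(1 / (2 ** (n + 1 - j)))
    return weights


print(donate_weights(5))
# [0.0625, 0.0625, 0.125, 0.25, 0.5]
```

The most recent model always receives a weight of 0.5, the one before it 0.25, and so on.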

In [None]:
#| export
@typechecked
class DonateWeightedEnsembler(BasePostProcessor):
    """
    Weighted average as per Donate et al.'s formula
    Paper Link: https://doi.org/10.1016/j.neucom.2012.02.053
    Code source: https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp

    Weightings for 5 folds: [0.0625, 0.0625, 0.125, 0.25, 0.5]

    :param cols: Prediction columns to ensemble.
    Uses all prediction columns by default. \n
    :param final_col_name: New column name for ensembled values.
    """
    def __init__(self, final_col_name: str, cols: list = None):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols
        # The number of columns is only known up front when `cols` is given.
        # Otherwise the weights are computed in `transform`.
        self.n_cols = len(cols) if cols else None
        self.weights = self._get_weights(self.n_cols) if cols else None

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        cols = self.cols if self.cols else dataf.prediction_cols
        weights = self.weights if self.weights else self._get_weights(len(cols))
        dataf.loc[:, self.final_col_name] = np.average(
            dataf.loc[:, cols], weights=weights, axis=1
        )
        rich_print(
            f":stew: Ensembled [blue]'{cols}'[/blue] with [bold]{self.__class__.__name__}[/bold] and saved in [bold]'{self.final_col_name}'[/bold] :stew:"
        )
        return NumerFrame(dataf)

    @staticmethod
    def _get_weights(n_cols: int) -> list:
        """Exponential weights that sum to 1. The first two weights are equal."""
        weights = []
        for j in range(1, n_cols + 1):
            j = 2 if j == 1 else j
            weights.append(1 / (2 ** (n_cols + 1 - j)))
        return weights

In [None]:
#| include: false
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = range(100)
test_dataf = NumerFrame(df)

For 5 folds, the weightings are `[0.0625, 0.0625, 0.125, 0.25, 0.5]`.

In [None]:
w_5_fold = [0.0625, 0.0625, 0.125, 0.25, 0.5]
donate = DonateWeightedEnsembler(
    cols=test_dataf.prediction_cols, final_col_name="prediction"
)
ensembled = donate(test_dataf).get_prediction_data
# Compare with a tolerance, since floating point summation order may differ.
assert np.isclose(
    ensembled["prediction"][0],
    np.sum([w * elem for w, elem in zip(w_5_fold, ensembled[test_features].iloc[0])]),
)
ensembled.head(2)

| | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction |
|---|---|---|---|---|---|---|
| 0 | 0.111924 | 0.935528 | 0.853572 | 0.351036 | 0.158973 | 0.339408 |
| 1 | 0.846985 | 0.748842 | 0.83988 | 0.781556 | 0.354063 | 0.577145 |


#### 1.0.2.3. Geometric Mean


Ensemble multiple prediction columns by taking the n-th root of the product of their values, where n is the number of columns.

**More info on Geometric mean:**
- [Wikipedia](https://en.wikipedia.org/wiki/Geometric_mean)
- [Investopedia](https://www.investopedia.com/terms/g/geometricmean.asp)
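
To illustrate how the geometric mean differs from the arithmetic mean, here is a small sketch on made-up values using `scipy.stats.gmean`. The array `preds` is toy data, not output from any model.

```python
import numpy as np
from scipy.stats import gmean

# Each row holds the predictions of two models for one sample.
preds = np.array([[0.2, 0.8], [0.5, 0.5]])

arithmetic = preds.mean(axis=1)
geometric = gmean(preds, axis=1)

# First row: sqrt(0.2 * 0.8) = 0.4, which is below the arithmetic mean of 0.5.
# Second row: both means agree because the two values are identical.
print(arithmetic, geometric)
```

The geometric mean is pulled towards the smaller values, so it rewards agreement between models. It is only meaningful for positive inputs, which is one more reason to standardize predictions to the range (0, 1] first.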

In [None]:
#| export
@typechecked
class GeometricMeanEnsembler(BasePostProcessor):
    """
    Calculate the (unweighted) geometric mean.

    :param cols: Prediction columns to ensemble.
    Uses all prediction columns by default. \n
    :param final_col_name: New column name for ensembled values.
    """

    def __init__(self, final_col_name: str, cols: list = None):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols

    @display_processor_info
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        cols = self.cols if self.cols else dataf.prediction_cols
        new_col = dataf.loc[:, cols].apply(gmean, axis=1)
        dataf.loc[:, self.final_col_name] = new_col
        rich_print(
            f":stew: Ensembled [blue]'{cols}'[/blue] with [bold]{self.__class__.__name__}[/bold] and saved in [bold]'{self.final_col_name}'[/bold] :stew:"
        )
        return NumerFrame(dataf)

In [None]:
geo_mean = GeometricMeanEnsembler(final_col_name="prediction_geo")
ensembled = geo_mean(test_dataf).get_prediction_data
ensembled.head(2)

| | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction | prediction_geo |
|---|---|---|---|---|---|---|---|
| 0 | 0.111924 | 0.935528 | 0.853572 | 0.351036 | 0.158973 | 0.339408 | 0.346401 |
| 1 | 0.846985 | 0.748842 | 0.83988 | 0.781556 | 0.354063 | 0.577145 | 0.681875 |


### 1.0.3. Neutralization and penalization

#### 1.0.3.1. Feature Neutralization

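Classic feature neutralization fits a linear model of the features to the prediction scores and subtracts a proportion of that fit from the scores.

The new column name for neutralized values will be `{pred_name}_neutralized_{PROPORTION}`, followed by `_{suffix}` if a suffix is given. `pred_name` should start with `'prediction'`.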

Optionally, you can run feature neutralization on the GPU using [cupy](https://docs.cupy.dev/en/stable/overview.html) by setting `cuda=True`. Make sure you have `cupy` installed with the correct CUDA Toolkit version. More information: [docs.cupy.dev/en/stable/install.html](https://docs.cupy.dev/en/stable/install.html)

[Detailed explanation of Feature Neutralization by Katsu1110](https://www.kaggle.com/code1110/janestreet-avoid-overfit-feature-neutralization)
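
The core linear algebra can be sketched with NumPy alone. The standalone `neutralize` function and the random data below are assumptions for illustration; the function follows the same pseudo-inverse projection used by the class in this section, but leaves out the per-era grouping, rank normalization, and rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
exposures = rng.normal(size=(200, 3))  # toy feature matrix
scores = exposures @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=200)


def neutralize(scores: np.ndarray, exposures: np.ndarray, proportion: float) -> np.ndarray:
    """Subtract a proportion of the least-squares fit of exposures to scores."""
    projection = exposures @ (np.linalg.pinv(exposures) @ scores)
    return scores - proportion * projection


full = neutralize(scores, exposures, proportion=1.0)
half = neutralize(scores, exposures, proportion=0.5)

# With proportion=1.0 the result is orthogonal to every feature column.
print(np.abs(exposures.T @ full).max())
```

With `proportion=1.0` the linear relationship with the features is removed entirely. With `proportion=0.5` the dot product of each feature with the scores is halved.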

In [None]:
#| export
@typechecked
class FeatureNeutralizer(BasePostProcessor):
    """
    Classic feature neutralization by subtracting linear model.

    :param feature_names: List of column names to neutralize against. Uses all feature columns by default. \n
    :param pred_name: Prediction column to neutralize. \n
    :param proportion: Number in range [0...1] indicating how much to neutralize. \n
    :param suffix: Optional suffix that is added to new column name. \n
    :param cuda: Do neutralization on the GPU \n
    Make sure you have CuPy installed when setting cuda to True. \n
    Installation docs: docs.cupy.dev/en/stable/install.html
    """
    def __init__(
        self,
        feature_names: list = None,
        pred_name: str = "prediction",
        proportion: float = 0.5,
        suffix: str = None,
        cuda: bool = False,
    ):
        self.pred_name = pred_name
        self.proportion = proportion
        assert (
            0.0 <= proportion <= 1.0
        ), f"'proportion' should be a float in range [0...1]. Got '{proportion}'."
        self.new_col_name = (
            f"{self.pred_name}_neutralized_{self.proportion}_{suffix}"
            if suffix
            else f"{self.pred_name}_neutralized_{self.proportion}"
        )
        super().__init__(final_col_name=self.new_col_name)
        self.feature_names = feature_names
        self.cuda = cuda

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        feature_names = self.feature_names if self.feature_names else dataf.feature_cols
        neutralized_preds = dataf.groupby(dataf.meta.era_col).apply(
            lambda x: self.normalize_and_neutralize(x, [self.pred_name], feature_names)
        )
        dataf.loc[:, self.new_col_name] = MinMaxScaler().fit_transform(
            neutralized_preds
        )
        rich_print(
            f":robot: Neutralized [bold blue]'{self.pred_name}'[/bold blue] with proportion [bold]'{self.proportion}'[/bold] :robot:"
        )
        rich_print(
            f"New neutralized column = [bold green]'{self.new_col_name}'[/bold green]."
        )
        return NumerFrame(dataf)

    def neutralize(self, dataf: pd.DataFrame, columns: list, by: list) -> pd.DataFrame:
        """ Neutralize on CPU. """
        scores = dataf[columns]
        exposures = dataf[by].values
        scores = scores - self.proportion * exposures.dot(
            np.linalg.pinv(exposures).dot(scores)
        )
        return scores / scores.std()

    def neutralize_cuda(self, dataf: pd.DataFrame, columns: list, by: list) -> np.ndarray:
        """ Neutralize on GPU. """
        try:
            import cupy
        except ImportError:
            raise ImportError("CuPy not installed. Set cuda=False or install CuPy. Installation docs: docs.cupy.dev/en/stable/install.html")
        scores = cupy.array(dataf[columns].values)
        exposures = cupy.array(dataf[by].values)
        scores = scores - self.proportion * exposures.dot(
            cupy.linalg.pinv(exposures).dot(scores)
        )
        return cupy.asnumpy(scores / scores.std())

    @staticmethod
    def normalize(dataf: pd.DataFrame) -> np.ndarray:
        normalized_ranks = (dataf.rank(method="first") - 0.5) / len(dataf)
        return sp.norm.ppf(normalized_ranks)

    def normalize_and_neutralize(
        self, dataf: pd.DataFrame, columns: list, by: list
    ) -> pd.DataFrame:
        dataf[columns] = self.normalize(dataf[columns])
        neutralization_func = self.neutralize if not self.cuda else self.neutralize_cuda
        dataf[columns] = neutralization_func(dataf, columns, by)
        return dataf[columns]

In [None]:
test_dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataf.loc[:, "prediction"] = np.random.uniform(size=len(test_dataf))

In [None]:
ft = FeatureNeutralizer(
    feature_names=test_dataf.feature_cols, pred_name="prediction", proportion=0.8
)
new_dataf = ft.transform(test_dataf)

In [None]:
assert "prediction_neutralized_0.8" in new_dataf.prediction_cols
neutralized = new_dataf.get_prediction_data["prediction_neutralized_0.8"]
# `x in Series` checks the index, not the values, so compare min and max instead.
assert np.isclose(neutralized.min(), 0.0)
assert np.isclose(neutralized.max(), 1.0)

Generated columns and data can be easily retrieved from the `NumerFrame`.

In [None]:
new_dataf.prediction_cols

['prediction', 'prediction_neutralized_0.8']

In [None]:
new_dataf.get_prediction_data.head(3)

| | prediction | prediction_neutralized_0.8 |
|---|---|---|
| 0 | 0.516076 | 0.461802 |
| 1 | 0.356837 | 0.29497 |
| 2 | 0.283496 | 0.184947 |


In [None]:
#| include: false
#| cuda_test
# ft = FeatureNeutralizer(
#     feature_names=test_dataf.feature_cols, pred_name="prediction",
#     proportion=0.8, cuda=True
# )
# new_dataf_cuda = ft.transform(test_dataf)
# new_dataf_cuda.head(2)

#### 1.0.3.2. Feature Penalization
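
Feature penalization caps the exposure of a prediction to each feature at `max_exposure`. As a rough sketch of what "exposure" means here, the NumPy snippet below computes the Pearson correlation between a prediction and every feature column, then clips those values to obtain target exposures. The function name `feature_exposures`, the toy data, and the cap of 0.1 are assumptions for illustration; the TensorFlow optimization that moves the prediction towards the targets is not shown.

```python
import numpy as np


def feature_exposures(features: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Pearson correlation of the prediction with each feature column."""
    x = features - features.mean(axis=0)
    x = x / np.linalg.norm(x, axis=0)
    y = prediction - prediction.mean()
    y = y / np.linalg.norm(y)
    return x.T @ y


rng = np.random.default_rng(42)
features = rng.uniform(size=(500, 4))
# The toy prediction leans heavily on the first feature.
prediction = 0.7 * features[:, 0] + 0.3 * rng.uniform(size=500)

exposures = feature_exposures(features, prediction)
# Target exposures: the starting exposures clipped to [-0.1, 0.1].
target = np.clip(exposures, -0.1, 0.1)
print(exposures.round(2), target.round(2))
```

Exposures that already fall inside the allowed band are left untouched. Only the ones exceeding the cap are reduced.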

In [None]:
#| export
@typechecked
class FeaturePenalizer(BasePostProcessor):
    """
    Feature penalization with TensorFlow.

    Source (by jrb): https://github.com/jonrtaylor/twitch/blob/master/FE_Clipping_Script.ipynb

    Source of first PyTorch implementation (by Michael Oliver / mdo): https://forum.numer.ai/t/model-diagnostics-feature-exposure/899/12

    :param feature_names: List of column names to reduce feature exposure. Uses all feature columns by default. \n
    :param pred_name: Prediction column to penalize. \n
    :param max_exposure: Number in range [0...1] indicating the maximum feature exposure allowed after penalization. \n
    :param suffix: Optional suffix that is added to new column name.
    """
    def __init__(
        self,
        max_exposure: float,
        feature_names: list = None,
        pred_name: str = "prediction",
        suffix: str = None,
    ):
        self.pred_name = pred_name
        self.max_exposure = max_exposure
        assert (
            0.0 <= max_exposure <= 1.0
        ), f"'max_exposure' should be a float in range [0...1]. Got '{max_exposure}'."
        self.new_col_name = (
            f"{self.pred_name}_penalized_{self.max_exposure}_{suffix}"
            if suffix
            else f"{self.pred_name}_penalized_{self.max_exposure}"
        )
        super().__init__(final_col_name=self.new_col_name)

        self.feature_names = feature_names

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        feature_names = (
            dataf.feature_cols if not self.feature_names else self.feature_names
        )
        penalized_data = self.reduce_all_exposures(
            dataf=dataf, column=self.pred_name, neutralizers=feature_names
        )
        dataf.loc[:, self.new_col_name] = penalized_data[self.pred_name]
        return NumerFrame(dataf)

    def reduce_all_exposures(
        self,
        dataf: NumerFrame,
        column: str = "prediction",
        neutralizers: list = None,
        normalize=True,
        gaussianize=True,
    ) -> pd.DataFrame:
        if neutralizers is None:
            neutralizers = [x for x in dataf.columns if x.startswith("feature")]
        neutralized = []

        for era in tqdm(dataf[dataf.meta.era_col].unique()):
            dataf_era = dataf[dataf[dataf.meta.era_col] == era]
            scores = dataf_era[[column]].values
            exposure_values = dataf_era[neutralizers].values

            if normalize:
                scores2 = []
                for x in scores.T:
                    x = (scipy.stats.rankdata(x, method="ordinal") - 0.5) / len(x)
                    if gaussianize:
                        x = scipy.stats.norm.ppf(x)
                    scores2.append(x)
                scores = np.array(scores2)[0]

            scores, weights = self._reduce_exposure(
                scores, exposure_values, len(neutralizers), None
            )

            scores /= tf.math.reduce_std(scores)
            scores -= tf.reduce_min(scores)
            scores /= tf.reduce_max(scores)
            neutralized.append(scores.numpy())

        predictions = pd.DataFrame(
            np.concatenate(neutralized), columns=[column], index=dataf.index
        )
        return predictions

    def _reduce_exposure(self, prediction, features, input_size=50, weights=None):
        model = tf.keras.models.Sequential(
            [
                tf.keras.layers.Input(input_size),
                tf.keras.experimental.LinearModel(use_bias=False),
            ]
        )
        feats = tf.convert_to_tensor(features - 0.5, dtype=tf.float32)
        pred = tf.convert_to_tensor(prediction, dtype=tf.float32)
        if weights is None:
            optimizer = tf.keras.optimizers.Adamax()
            start_exp = self.__exposures(feats, pred[:, None])
            target_exps = tf.clip_by_value(
                start_exp, -self.max_exposure, self.max_exposure
            )
            self._train_loop(model, optimizer, feats, pred, target_exps)
        else:
            model.set_weights(weights)
        return pred[:, None] - model(feats), model.get_weights()

    def _train_loop(self, model, optimizer, feats, pred, target_exps):
        for i in range(1000000):
            loss, grads = self.__train_loop_body(model, feats, pred, target_exps)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            if loss < 1e-7:
                break

    @tf.function(experimental_relax_shapes=True)
    def __train_loop_body(self, model, feats, pred, target_exps):
        with tf.GradientTape() as tape:
            exps = self.__exposures(feats, pred[:, None] - model(feats, training=True))
            loss = tf.reduce_sum(
                tf.nn.relu(tf.nn.relu(exps) - tf.nn.relu(target_exps))
                + tf.nn.relu(tf.nn.relu(-exps) - tf.nn.relu(-target_exps))
            )
        return loss, tape.gradient(loss, model.trainable_variables)

    @staticmethod
    @tf.function(experimental_relax_shapes=True, experimental_compile=True)
    def __exposures(x, y):
        x = x - tf.math.reduce_mean(x, axis=0)
        x = x / tf.norm(x, axis=0)
        y = y - tf.math.reduce_mean(y, axis=0)
        y = y / tf.norm(y, axis=0)
        return tf.matmul(x, y, transpose_a=True)

In [None]:
#| include: false
#| cuda_test
test_dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataf.loc[:, "prediction"] = np.random.uniform(size=len(test_dataf))
# ft = FeaturePenalizer(pred_name='prediction', max_exposure=0.8)
# new_dataf = ft.transform(test_dataf)

## 1.1. Numerai Classic

Postprocessors that are specific to Numerai Classic.

In [None]:
# 1.1.
# No Numerai Classic specific postprocessors implemented yet.

## 1.2. Numerai Signals

Postprocessors that are specific to Numerai Signals.

In [None]:
# 1.2.
# No Numerai Signals specific postprocessors implemented yet.

## 2. Custom PostProcessors

As with preprocessors, there is an almost unlimited number of ways to postprocess data. We (once again) invite the Numerai community to develop Numerai Classic and Signals postprocessors.

A new postprocessor should inherit from `BasePostProcessor` and implement a `transform` method. The `transform` method should take a `NumerFrame` as input and return a `NumerFrame` object as output. A template for this is given below.

We recommend adding `@typechecked` at the top of a new postprocessor to enforce types and provide useful debugging stacktraces.

To enable fancy logging output, add the `@display_processor_info` decorator to the `transform` method.

In [None]:
#| export
@typechecked
class AwesomePostProcessor(BasePostProcessor):
    """
    TEMPLATE - Do some awesome postprocessing.

    :param final_col_name: Column name to store manipulated or ensembled predictions in.
    """

    def __init__(self, final_col_name: str, *args, **kwargs):
        super().__init__(final_col_name=final_col_name)

    @display_processor_info
    def transform(self, dataf: NumerFrame, *args, **kwargs) -> NumerFrame:
        # Do processing
        ...
        # Add new column(s) for manipulated data
        dataf.loc[:, self.final_col_name] = ...
        ...
        # Parse all contents to the next pipeline step
        return NumerFrame(dataf)

------------------------------------------------------