In [36]:
%load_ext autoreload
%autoreload 2
%load_ext nb_black
%load_ext lab_black


In [37]:
# default_exp postprocessing


# Postprocessing

The postprocessing procedure is very similar to preprocessing.

The only difference between a postprocessing step and a preprocessing step is that preprocessing works on `feature` columns while postprocessing manipulates `prediction` columns.

We inherit from `BasePostProcessor` for postprocessing. A postprocessor takes a `NumerFrame` or `DataFrame` as input and returns a `NumerFrame` to which one or more new columns, each prefixed with `prediction`, have been added.

In [38]:
# hide
from nbdev.showdoc import *


In [39]:
#export
import scipy
import numpy as np
import pandas as pd
import tensorflow as tf
from typing import Union
import scipy.stats as sp
from tqdm.auto import tqdm
from typeguard import typechecked
from rich import print as rich_print
from scipy.stats.mstats import gmean
from sklearn.preprocessing import MinMaxScaler

from numerai_blocks.numerframe import NumerFrame, create_numerframe
from numerai_blocks.preprocessing import BaseProcessor, display_processor_info


## 0. BasePostProcessor

Some characteristics are specific to postprocessors and do not belong in the `BaseProcessor` base class.
This functionality is implemented in `BasePostProcessor`.

In [40]:
#export
class BasePostProcessor(BaseProcessor):
    """
    Base class for postprocessing objects.
    Postprocessors manipulate or ensemble prediction column(s)
    and add them to a Dataset.
    """
    def __init__(self, final_col_name: str):
        super().__init__()
        self.final_col_name = final_col_name
        assert final_col_name.startswith("prediction"), f"final_col name should start with 'prediction'. Got {final_col_name}"

    def transform(self, dataset: Union[pd.DataFrame, NumerFrame], *args, **kwargs) -> NumerFrame:
        ...


## 1. Common postprocessing steps

### 1.1. Version agnostic

#### 1.1.0. Standardization

Standardizing is an essential step in order to combine Numerai predictions and is a default postprocessor for `ModelPipeline`.

In [41]:
# export
@typechecked
class Standardizer(BaseProcessor):
    """
    Uniform standardization of prediction columns (percentile rank per era).
    All input values must lie in the range [0...1].
    """
    def __init__(self, cols: list = None):
        super().__init__()
        self.cols = cols

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        cols = dataf.prediction_cols if not self.cols else self.cols
        for col in cols:
            assert dataf[col].between(0, 1).all(), f"All values should only contain values between 0 and 1. Does not hold for '{col}'"
        dataf.loc[:, cols] = dataf.groupby('era')[cols].rank(pct=True)
        return NumerFrame(dataf)


In [42]:
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = [0, 1, 2, 3] * 25
test_dataset = NumerFrame(df)


In [43]:
std = Standardizer()
std.transform(test_dataset).get_prediction_data.head(2)

|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E |
|---|---|---|---|---|---|
| 0 | 0.96 | 0.92 | 0.28 | 0.64 | 0.84 |
| 1 | 1.0 | 0.6 | 0.72 | 0.12 | 0.96 |
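
To make the per-era ranking concrete, here is a minimal sketch that uses plain pandas instead of `NumerFrame`. The toy values are made up for illustration. Each value is replaced by its percentile rank within its own era, so every era ends up on the same uniform scale.

```python
import pandas as pd

# Toy predictions for two eras (illustrative values, not real Numerai data)
preds = pd.DataFrame(
    {
        "era": [0, 0, 0, 0, 1, 1, 1, 1],
        "prediction_A": [0.1, 0.4, 0.2, 0.9, 0.5, 0.3, 0.8, 0.7],
    }
)

# Percentile rank within each era, the same groupby/rank call that
# Standardizer.transform applies to every prediction column
ranked = preds.groupby("era")["prediction_A"].rank(pct=True)
print(ranked.tolist())
# [0.25, 0.75, 0.5, 1.0, 0.5, 0.25, 1.0, 0.75]
```

The smallest value in an era maps to `1 / n` and the largest to `1.0`, where `n` is the number of rows in that era.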



#### 1.1.1. Ensembling

Predictions from multiple models can be ensembled in many ways; we provide the most common approaches here.

##### Simple Mean

In [44]:
#export
@typechecked
class MeanEnsembler(BasePostProcessor):
    """ Take simple mean of multiple cols and store in new col. """
    def __init__(self, cols: list, final_col_name: str):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        dataf.loc[:, self.final_col_name] = dataf.loc[:, self.cols].mean(axis=1)
        rich_print(f":stew: Ensembled [blue]'{self.cols}'[/blue] with simple mean and saved in [bold]'{self.final_col_name}'[/bold] :stew:")
        return NumerFrame(dataf)
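
The core of `MeanEnsembler.transform` is a single row-wise mean. The sketch below reproduces that step on a plain pandas `DataFrame`; the column values and the `prediction_mean` name are made up for illustration.

```python
import pandas as pd

# Two toy prediction columns (illustrative values)
preds = pd.DataFrame(
    {
        "prediction_A": [0.2, 0.6],
        "prediction_B": [0.4, 1.0],
    }
)
cols = ["prediction_A", "prediction_B"]

# Row-wise mean over the selected columns, stored under a name that
# starts with 'prediction' as BasePostProcessor requires
preds.loc[:, "prediction_mean"] = preds.loc[:, cols].mean(axis=1)
print(preds)
```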


##### Donate's formula

In [45]:
#export
@typechecked
class DonateWeightedEnsembler(BasePostProcessor):
    """
    Weighted average as per Donate et al.'s formula
    https://doi.org/10.1016/j.neucom.2012.02.053
    [0.0625, 0.0625, 0.125, 0.25, 0.5] for 5 fold
    Source: https://www.kaggle.com/gogo827jz/jane-street-supervised-autoencoder-mlp
    """
    def __init__(self, cols: list, final_col_name: str):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols
        self.n_cols = len(cols)
        self.weights = self._get_weights()

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        dataf.loc[:, self.final_col_name] = np.average(dataf.loc[:, self.cols],
                                                       weights=self.weights, axis=1)
        rich_print(f":stew: Ensembled [blue]'{self.cols}'[/blue] with [bold]{self.__class__.__name__}[/bold] and saved in [bold]'{self.final_col_name}'[/bold] :stew:")
        return NumerFrame(dataf)

    def _get_weights(self) -> list:
        weights = []
        for j in range(1, self.n_cols+1):
            j = 2 if j == 1 else j
            weights.append(1 / (2**(self.n_cols + 1 - j)))
        return weights


In [46]:
# Random DataFrame
test_features = [f"prediction_{l}" for l in "ABCDE"]
df = pd.DataFrame(np.random.uniform(size=(100, 5)), columns=test_features)
df["target"] = np.random.normal(size=100)
df["era"] = range(100)
test_dataset = NumerFrame(df)


In [47]:
# [0.0625, 0.0625, 0.125, 0.25, 0.5] for 5 fold
w_5_fold = [0.0625, 0.0625, 0.125, 0.25, 0.5]
donate = DonateWeightedEnsembler(cols=test_dataset.prediction_cols, final_col_name='prediction')
ensembled = donate(test_dataset).get_prediction_data
# Compare with a tolerance: exact float equality is fragile for weighted sums
assert np.isclose(ensembled['prediction'][0], np.dot(w_5_fold, ensembled[test_features].iloc[0]))
ensembled.head(2)

|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction |
|---|---|---|---|---|---|---|
| 0 | 0.965094 | 0.400478 | 0.022906 | 0.560659 | 0.086063 | 0.271408 |
| 1 | 0.32681 | 0.271006 | 0.053055 | 0.04691 | 0.401494 | 0.25647 |
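
The weighting scheme in `_get_weights` can be checked in isolation. The sketch below is a standalone re-implementation in plain Python; the function name `donate_weights` is ours, not part of the library. Column `j` gets weight `1 / 2**(n + 1 - j)`, except that the first column shares the weight of the second, which makes the weights sum to exactly 1.

```python
def donate_weights(n_cols: int) -> list:
    """Weights 1 / 2**(n_cols + 1 - j) for j = 1..n_cols, with j = 1 treated as j = 2."""
    weights = []
    for j in range(1, n_cols + 1):
        j = max(j, 2)  # the first column shares the weight of the second
        weights.append(1 / 2 ** (n_cols + 1 - j))
    return weights


for n in (2, 3, 5):
    w = donate_weights(n)
    print(n, w, sum(w))
# 2 [0.5, 0.5] 1.0
# 3 [0.25, 0.25, 0.5] 1.0
# 5 [0.0625, 0.0625, 0.125, 0.25, 0.5] 1.0
```

Later columns receive exponentially more weight, so the order of `cols` matters when constructing the ensembler.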



##### Geometric Mean


In [48]:
#export
@typechecked
class GeometricMeanEnsembler(BasePostProcessor):
    """
    Calculate the row-wise geometric mean of multiple prediction columns.
    """
    def __init__(self, cols: list, final_col_name: str):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols
        self.n_cols = len(cols)

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame], *args, **kwargs) -> NumerFrame:
        new_col = dataf.loc[:, self.cols].apply(gmean, axis=1)
        dataf.loc[:, self.final_col_name] = new_col
        rich_print(f":stew: Ensembled [blue]'{self.cols}'[/blue] with [bold]{self.__class__.__name__}[/bold] and saved in [bold]'{self.final_col_name}'[/bold] :stew:")
        return NumerFrame(dataf)


In [49]:
geo_mean = GeometricMeanEnsembler(cols=test_dataset.prediction_cols, final_col_name='prediction')
ensembled = geo_mean(test_dataset).get_prediction_data
ensembled.head(2)

|   | prediction_A | prediction_B | prediction_C | prediction_D | prediction_E | prediction |
|---|---|---|---|---|---|---|
| 0 | 0.965094 | 0.400478 | 0.022906 | 0.560659 | 0.086063 | 0.211896 |
| 1 | 0.32681 | 0.271006 | 0.053055 | 0.04691 | 0.401494 | 0.154663 |
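
For intuition on how the geometric mean differs from the simple mean, here is a small NumPy-only sketch with made-up values. The geometric mean of `n` values is the `n`-th root of their product. It never exceeds the arithmetic mean, and a single zero in a row pulls the result to zero.

```python
import numpy as np

# Three toy rows with two predictions each (illustrative values)
preds = np.array([[1.0, 4.0], [2.0, 8.0], [0.5, 0.0]])

# Geometric mean per row: n-th root of the product of the n values
geo = np.prod(preds, axis=1) ** (1 / preds.shape[1])

# Arithmetic mean per row, for comparison
arith = preds.mean(axis=1)

print(geo)    # approximately [2. 4. 0.]
print(arith)  # [2.5  5.   0.25]
```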



#### 1.1.2. Feature Neutralization

Classic feature neutralization: within each era, fit a linear model of the predictions on the features and subtract a proportion of that fit from the predictions.

In [50]:
#export
@typechecked
class FeatureNeutralizer(BasePostProcessor):
    """
    Classic feature neutralization
    Subtracting Linear model.
    :param feature_names: List of column names to neutralize against.
    :param pred_name: Prediction column to neutralize.
    :param era_col: Numerai era column
    :param proportion: Number in range [0...1] indicating how much to neutralize.
    """
    def __init__(self,
                 feature_names: list = None,
                 pred_name: str = "prediction",
                 era_col: str = "era",
                 proportion: float = 0.5):
        self.pred_name = pred_name
        self.proportion = proportion
        assert 0. <= proportion <= 1., f"'proportion' should be a float in range [0...1]. Got '{proportion}'."
        self.new_col_name = f"{self.pred_name}_neutralized_{self.proportion}"
        super().__init__(final_col_name=self.new_col_name)

        self.feature_names = feature_names
        self.era_col = era_col

    @display_processor_info
    def transform(self, dataf: NumerFrame) -> NumerFrame:
        feature_names = self.feature_names if self.feature_names else dataf.feature_cols
        neutralized_preds = dataf.groupby(self.era_col)\
            .apply(lambda x: self.normalize_and_neutralize(x, [self.pred_name], feature_names))
        dataf.loc[:, self.new_col_name] = MinMaxScaler().fit_transform(neutralized_preds)
        rich_print(f":robot: Neutralized [bold blue]'{self.pred_name}'[/bold blue] with proportion [bold]'{self.proportion}'[/bold] :robot:")
        rich_print(f"New neutralized column = [bold green]'{self.new_col_name}'[/bold green].")
        return NumerFrame(dataf)

    def neutralize(self, dataf: pd.DataFrame, columns: list, by: list) -> pd.DataFrame:
        scores = dataf[columns]
        exposures = dataf[by].values
        scores = scores - self.proportion * exposures.dot(np.linalg.pinv(exposures).dot(scores))
        return scores / scores.std()

    @staticmethod
    def normalize(dataf: pd.DataFrame) -> np.ndarray:
        normalized_ranks = (dataf.rank(method="first") - 0.5) / len(dataf)
        return sp.norm.ppf(normalized_ranks)

    def normalize_and_neutralize(self, dataf: pd.DataFrame, columns: list, by: list) -> pd.DataFrame:
        # Convert the scores to a normal distribution
        dataf[columns] = self.normalize(dataf[columns])
        dataf[columns] = self.neutralize(dataf, columns, by)
        return dataf[columns]


In [51]:
test_dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataset.loc[:, 'prediction'] = np.random.uniform(size=len(test_dataset))


In [52]:
ft = FeatureNeutralizer(feature_names=test_dataset.feature_cols, pred_name='prediction', proportion=0.8)
new_dataset = ft.transform(test_dataset);


In [53]:
assert "prediction_neutralized_0.8" in new_dataset.prediction_cols
# `x in Series` tests the index labels, not the values, so check min and max directly
neutralized = new_dataset.get_prediction_data['prediction_neutralized_0.8']
assert np.isclose(neutralized.min(), 0.)
assert np.isclose(neutralized.max(), 1.)


In [54]:
new_dataset.prediction_cols

['prediction', 'prediction_neutralized_0.8']


In [55]:
new_dataset.get_prediction_data.head(3)

|   | prediction | prediction_neutralized_0.8 |
|---|---|---|
| 0 | 0.694602 | 0.538198 |
| 1 | 0.681106 | 0.461802 |
| 2 | 0.309822 | 0.29497 |
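
The neutralization step itself is a least-squares projection. The sketch below isolates it with NumPy on synthetic data for a single era; the helper name, the random data and the coefficients are assumptions for illustration. It also skips the rank-to-Gaussian normalization and the final min-max scaling that `FeatureNeutralizer` applies, and it uses NumPy's population standard deviation where the class uses the pandas default.

```python
import numpy as np


def neutralize(scores: np.ndarray, exposures: np.ndarray, proportion: float = 1.0) -> np.ndarray:
    """Subtract `proportion` of the linear fit of scores on exposures."""
    # Least-squares fit of the scores on the features via the pseudo-inverse
    fitted = exposures @ (np.linalg.pinv(exposures) @ scores)
    residual = scores - proportion * fitted
    return residual / residual.std()


rng = np.random.default_rng(42)
exposures = rng.normal(size=(200, 3))  # synthetic "features" for one era
scores = exposures @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=200)

full = neutralize(scores, exposures, proportion=1.0)
half = neutralize(scores, exposures, proportion=0.5)

# With proportion=1.0 the result is orthogonal to every feature column
print(np.abs(exposures.T @ full).max())
# With proportion=0.5 part of the linear relationship remains
print(np.abs(exposures.T @ half).max())
```

A `proportion` of 1.0 removes all linear dependence on the chosen features, while smaller values trade some of that protection for more of the original signal.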



#### 1.1.3. Feature Penalization

In [56]:
#export
@typechecked
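class FeaturePenalizer(BasePostProcessor):
    """
    Feature penalization with TensorFlow.
    :param model_list: List of model names. One penalized column is added per name.
    :param max_exposure: Number in range [0...1] capping the exposure to each feature.
    :param risky_feature_names: Features to penalize against. Defaults to all feature columns.
    :param pred_name: Prediction column to penalize.
    :param era_col: Numerai era column
    """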
    def __init__(self, model_list: list, max_exposure: float,
                 risky_feature_names: list = None, pred_name: str = "prediction", era_col: str = 'era'):
        self.pred_name = pred_name
        self.max_exposure = max_exposure
        assert 0. <= max_exposure <= 1., f"'max_exposure' should be a float in range [0...1]. Got '{max_exposure}'."
        self.new_col_name = f"{self.pred_name}_penalized_{self.max_exposure}"
        super().__init__(final_col_name=self.new_col_name)

        self.model_list = model_list
        self.risky_feature_names = risky_feature_names
        self.era_col = era_col

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame]) -> NumerFrame:
        risky_feature_names = dataf.feature_cols if not self.risky_feature_names else self.risky_feature_names
        for model_name in tqdm(self.model_list, desc="Feature Penalization"):
            penalized_data = self.reduce_all_exposures(
                            df=dataf,
                            column=self.pred_name,
                            neutralizers=risky_feature_names,
                        )
            new_pred_col = f"prediction_{self.pred_name}_{model_name}_FP_{self.max_exposure}"
            dataf.loc[:, new_pred_col] = penalized_data[self.pred_name]
        return NumerFrame(dataf)

    def reduce_all_exposures(self, df: pd.DataFrame,
                             column: str = "prediction",
                             neutralizers: list = None,
                             normalize=True,
                             gaussianize=True,
                             ):
        if neutralizers is None:
            neutralizers = [x for x in df.columns if x.startswith("feature")]
        neutralized = []

        for era in tqdm(df[self.era_col].unique()):
            df_era = df[df[self.era_col] == era]
            scores = df_era[[column]].values
            exposure_values = df_era[neutralizers].values

            if normalize:
                scores2 = []
                for x in scores.T:
                    x = (scipy.stats.rankdata(x, method='ordinal') - .5) / len(x)
                    if gaussianize:
                        x = scipy.stats.norm.ppf(x)
                    scores2.append(x)
                scores = np.array(scores2)[0]

            scores, weights = self._reduce_exposure(scores, exposure_values,
                                                    len(neutralizers), None)

            scores /= tf.math.reduce_std(scores)
            scores -= tf.reduce_min(scores)
            scores /= tf.reduce_max(scores)
            neutralized.append(scores.numpy())

        predictions = pd.DataFrame(np.concatenate(neutralized),
                                   columns=[column], index=df.index)
        return predictions

    def _reduce_exposure(self, prediction, features, input_size=50, weights=None):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Input(input_size),
            tf.keras.experimental.LinearModel(use_bias=False),
        ])
        feats = tf.convert_to_tensor(features - 0.5, dtype=tf.float32)
        pred = tf.convert_to_tensor(prediction, dtype=tf.float32)
        if weights is None:
            optimizer = tf.keras.optimizers.Adamax()
            start_exp = self.__exposures(feats, pred[:, None])
            target_exps = tf.clip_by_value(start_exp, -self.max_exposure, self.max_exposure)
            self._train_loop(model, optimizer, feats, pred, target_exps)
        else:
            model.set_weights(weights)
        return pred[:,None] - model(feats), model.get_weights()


    def _train_loop(self, model, optimizer, feats, pred, target_exps):
        for i in range(1000000):
            loss, grads = self.__train_loop_body(model, feats, pred, target_exps)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
            if loss < 1e-7:
                break

    @tf.function(experimental_relax_shapes=True)
    def __train_loop_body(self, model, feats, pred, target_exps):
        with tf.GradientTape() as tape:
            exps = self.__exposures(feats, pred[:, None] - model(feats, training=True))
            loss = tf.reduce_sum(tf.nn.relu(tf.nn.relu(exps) - tf.nn.relu(target_exps)) +
                                 tf.nn.relu(tf.nn.relu(-exps) - tf.nn.relu(-target_exps)))
        return loss, tape.gradient(loss, model.trainable_variables)

    @staticmethod
    @tf.function(experimental_relax_shapes=True, experimental_compile=True)
    def __exposures(x, y):
        x = x - tf.math.reduce_mean(x, axis=0)
        x = x / tf.norm(x, axis=0)
        y = y - tf.math.reduce_mean(y, axis=0)
        y = y / tf.norm(y, axis=0)
        return tf.matmul(x, y, transpose_a=True)


In [57]:
# TODO Test Feature penalizer
test_dataset = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
test_dataset.loc[:, 'prediction'] = np.random.uniform(size=len(test_dataset))


In [58]:
ft = FeaturePenalizer(model_list=[], pred_name='prediction', max_exposure=0.8)
new_dataset = ft.transform(test_dataset);

Feature Penalization: 0it [00:00, ?it/s]
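
The exposure computation in `FeaturePenalizer` centers each column and scales it to unit norm before taking the dot product, which amounts to the Pearson correlation between each feature and the prediction. The sketch below reproduces that step and the clipping of target exposures in NumPy instead of TensorFlow. The synthetic data and the `max_exposure` value are assumptions for illustration, and the training loop is not reproduced here.

```python
import numpy as np


def exposures(features: np.ndarray, prediction: np.ndarray) -> np.ndarray:
    """Correlation of every feature column with the prediction column."""
    x = features - features.mean(axis=0)
    x = x / np.linalg.norm(x, axis=0)
    y = prediction - prediction.mean(axis=0)
    y = y / np.linalg.norm(y, axis=0)
    return x.T @ y  # shape: (n_features, 1)


rng = np.random.default_rng(0)
features = rng.uniform(size=(500, 4))  # synthetic features for one era
coefs = np.array([[0.6], [0.0], [-0.3], [0.1]])  # made-up linear signal
prediction = features @ coefs + rng.normal(scale=0.1, size=(500, 1))

exps = exposures(features, prediction)

# Target exposures: the starting exposures clipped to +/- max_exposure
max_exposure = 0.2  # illustrative value
target_exps = np.clip(exps, -max_exposure, max_exposure)

print(exps.ravel())
print(target_exps.ravel())
```

Features whose starting exposure is already inside the allowed band keep their exposure as the target; only the ones outside it are pulled back to the cap.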


### 1.2. Version 1 specific

### 1.3. Version 2 specific

### 1.4. Signals specific

## 2. Custom PostProcessors

There is an almost unlimited number of ways to postprocess data. We invite the Numerai community to develop Numerai Classic and Signals postprocessors for `numerai-blocks`.

A new PostProcessor should inherit from `BasePostProcessor` and implement a `transform` method. The `transform` method should take a `NumerFrame` or `DataFrame` as input and return a `NumerFrame` object as output. A template is given below.

We recommend adding `@typechecked` at the top of a new PostProcessor class to enforce types and provide useful debugging stacktraces.

To enable fancy logging output, add the `@display_processor_info` decorator to the `transform` method.

Note that arbitrary metadata can be added or changed in the `NumerFrame` class during a postprocessing step.



In [59]:
#export
@typechecked
class AwesomePostProcessor(BasePostProcessor):
    """
    - TEMPLATE -
    Do some awesome postprocessing.
    :param final_col_name: Column name to store manipulated or ensembled predictions in.
    """
    def __init__(self, final_col_name: str, *args, **kwargs):
        super().__init__(final_col_name=final_col_name)

    @display_processor_info
    def transform(self, dataf: Union[pd.DataFrame, NumerFrame], *args, **kwargs) -> NumerFrame:
        # Do processing
        ...
        # Add new column(s) for manipulated data (optional)
        dataf.loc[:, self.final_col_name] = ...
        ...
        # Parse all contents to the next pipeline step
        return NumerFrame(dataf)
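
As one possible way to fill in the template, here is a sketch of a median ensembler. So that it runs without `numerai-blocks` installed, it uses a hypothetical stand-in base class that copies only the column-name check from `BasePostProcessor`, and it returns a plain `DataFrame` rather than a `NumerFrame`. Both class names are ours and are not part of the library.

```python
import pandas as pd


class StandInBasePostProcessor:
    """Hypothetical stand-in for BasePostProcessor (name check only)."""

    def __init__(self, final_col_name: str):
        assert final_col_name.startswith("prediction"), (
            f"final_col name should start with 'prediction'. Got {final_col_name}"
        )
        self.final_col_name = final_col_name


class MedianEnsembler(StandInBasePostProcessor):
    """Hypothetical example: row-wise median of several prediction columns."""

    def __init__(self, cols: list, final_col_name: str):
        super().__init__(final_col_name=final_col_name)
        self.cols = cols

    def transform(self, dataf: pd.DataFrame) -> pd.DataFrame:
        dataf.loc[:, self.final_col_name] = dataf.loc[:, self.cols].median(axis=1)
        # A real postprocessor would return NumerFrame(dataf) here
        return dataf


preds = pd.DataFrame(
    {
        "prediction_A": [0.1, 0.9],
        "prediction_B": [0.5, 0.2],
        "prediction_C": [0.3, 0.4],
    }
)
ensembler = MedianEnsembler(cols=list(preds.columns), final_col_name="prediction_median")
result = ensembler.transform(preds)
print(result["prediction_median"].tolist())
# [0.3, 0.4]
```

In the real library the class would inherit from `BasePostProcessor`, carry the `@typechecked` decorator, and decorate `transform` with `@display_processor_info`, as in the template above.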


------------------------------------------------------

In [60]:
# hide
# Run this cell to sync all changes with library
from nbdev.export import notebook2script

notebook2script()
