In [1]:
# hide
from nbdev.showdoc import *

# Constructing Model Pipelines

In [2]:
from numerai_blocks.download import NumeraiClassicDownloader
from numerai_blocks.numerframe import create_numerframe
from numerai_blocks.model_pipeline import ModelPipeline, ModelPipelineCollection

## Why bother using ModelPipeline?

**ModelPipeline**

This framework lets you define `ModelPipeline` objects, which are composed of Preprocessors, Models and Postprocessors. To make predictions, a `ModelPipeline` takes a `NumerFrame` as input and returns a `NumerFrame` with prediction columns added.

`ModelPipeline` ensures that all processing steps run in the correct order and gives you a concise overview of your full pipeline. This simplifies your weekly inference setup and lets you scale more comfortably to multiple models.

To make each run easier to follow, many components of a typical pipeline also perform data integrity checks and display which step is being performed. These displays help you identify slow implementations and other bottlenecks.
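To illustrate the ordering idea only, here is a minimal sketch in plain pandas. `ToyPipeline`, `fill_missing`, `mean_model` and `rank_scale` are hypothetical names made up for this example; they are not part of `numerai_blocks` and do not reflect how `ModelPipeline` is implemented internally.

```python
import pandas as pd


class ToyPipeline:
    """Hypothetical stand-in for illustration; NOT the numerai_blocks ModelPipeline."""

    def __init__(self, preprocessors=None, models=None, postprocessors=None):
        self.preprocessors = preprocessors or []
        self.models = models or []
        self.postprocessors = postprocessors or []

    def __call__(self, dataf: pd.DataFrame) -> pd.DataFrame:
        # Always run the stages in the same order:
        # preprocessing -> model prediction -> postprocessing.
        for step in self.preprocessors + self.models + self.postprocessors:
            dataf = step(dataf)
        return dataf


def fill_missing(dataf):
    # Toy preprocessor: replace missing feature values with 0.5.
    return dataf.fillna(0.5)


def mean_model(dataf):
    # Toy "model": the prediction is the mean of all feature columns.
    feature_cols = [c for c in dataf.columns if c.startswith("feature_")]
    return dataf.assign(prediction_mean=dataf[feature_cols].mean(axis=1))


def rank_scale(dataf):
    # Toy postprocessor: percentile-rank the predictions.
    return dataf.assign(
        prediction_mean_ranked=dataf["prediction_mean"].rank(pct=True)
    )


toy_dataf = pd.DataFrame(
    {"feature_a": [0.0, 0.5, None], "feature_b": [1.0, 0.25, 0.75]}
)
toy_pipeline = ToyPipeline(
    preprocessors=[fill_missing],
    models=[mean_model],
    postprocessors=[rank_scale],
)
toy_result = toy_pipeline(toy_dataf)
print(toy_result)
```

The input frame comes back with the prediction columns appended, which mirrors the `NumerFrame` in, `NumerFrame` out contract described above.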

**ModelPipelineCollection**

Multiple `ModelPipeline` objects can be combined into a `ModelPipelineCollection`. This is convenient if you use the same starting dataset but have multiple pipelines with different Preprocessors, Models and/or Postprocessors.
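Conceptually, a collection applies several pipelines to one shared starting dataset. The sketch below shows that idea with plain functions and a dictionary. All names in it (`pipeline_mean`, `pipeline_max`, `toy_collection`) are hypothetical, and the real `ModelPipelineCollection` API may differ.

```python
import pandas as pd


def pipeline_mean(dataf):
    # Hypothetical pipeline 1: predict the row-wise mean of the features.
    return dataf.assign(
        prediction_mean=dataf[["feature_a", "feature_b"]].mean(axis=1)
    )


def pipeline_max(dataf):
    # Hypothetical pipeline 2: predict the row-wise maximum of the features.
    return dataf.assign(
        prediction_max=dataf[["feature_a", "feature_b"]].max(axis=1)
    )


# A plain dict stands in for a collection of named pipelines.
toy_collection = {"mean": pipeline_mean, "max": pipeline_max}

shared_dataf = pd.DataFrame({"feature_a": [0.0, 0.5], "feature_b": [1.0, 0.25]})

# Each pipeline receives its own copy of the same starting dataset.
results = {name: pipe(shared_dataf.copy()) for name, pipe in toy_collection.items()}
print(results["mean"])
print(results["max"])
```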

## 0. Download live data

In [3]:
# Download most recent live data
downloader = NumeraiClassicDownloader("pipeline_test")
downloader.download_live_data()

# Initialize NumerFrame from parquet file path
dataf = create_numerframe('pipeline_test/numerai_live_data.parquet')

2022-02-17 17:23:36,095 INFO numerapi.utils: target file already exists
2022-02-17 17:23:36,097 INFO numerapi.utils: download complete


------------------------------------------------------------------
## Example 1. CatBoost model (.joblib) with 0.5 feature neutralization

A very common use case is to predict with a single model on all features and then apply some feature neutralization. This can be set up in a few lines of code.

1. Use `SingleModel`, which handles prediction logic for several formats (`.joblib`, `.cbm`, `.pickle`, `.pkl`, `.lgb` and `.h5`).
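For readers unfamiliar with feature neutralization, the following NumPy sketch shows the general technique: remove a proportion of the part of the predictions that a linear fit on the features can explain. The `neutralize` function is a hypothetical helper written for this example. It is not the `FeatureNeutralizer` implementation, which may additionally group by era or rescale the result.

```python
import numpy as np


def neutralize(predictions, features, proportion=0.5):
    """Hypothetical helper: subtract `proportion` of the linear feature exposure."""
    # Least-squares fit of the predictions on the features.
    coef, *_ = np.linalg.lstsq(features, predictions, rcond=None)
    # The fitted values are the part of the predictions explained by the features.
    exposure = features @ coef
    return predictions - proportion * exposure


rng = np.random.default_rng(0)
toy_features = rng.random((100, 5))
toy_predictions = rng.random(100)

half_neutral = neutralize(toy_predictions, toy_features, proportion=0.5)
full_neutral = neutralize(toy_predictions, toy_features, proportion=1.0)

# With proportion=1.0 the result is orthogonal to every feature column.
print(np.abs(toy_features.T @ full_neutral).max())
```

With `proportion=0.5`, the result sits halfway between the raw predictions and the fully neutralized ones, so some feature exposure is kept.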

In [4]:
from numerai_blocks.model import SingleModel
from numerai_blocks.postprocessing import FeatureNeutralizer

In [5]:
joblib_model = SingleModel("../nbs/test_assets/joblib_v2_example_model.joblib",
                    model_name="joblib")
neutralizer = FeatureNeutralizer(pred_name="prediction_joblib",
                                 proportion=0.5)

In [6]:
pipeline = ModelPipeline(models=[joblib_model],
                         postprocessors=[neutralizer])

In [7]:
prediction_dataf = pipeline(dataf)

Preprocessing:: 0it [00:00, ?it/s]

Model prediction:   0%|          | 0/1 [00:00<?, ?it/s]

2022-02-17 17:23:37,153 INFO numexpr.utils: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-02-17 17:23:37,154 INFO numexpr.utils: NumExpr defaulting to 8 threads.


Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

In [8]:
prediction_dataf.get_prediction_data.head(2)

| id | prediction_joblib | prediction_joblib_neutralized_0.5 |
|---|---|---|
| n0001e4a82d5531c | 0.480704 | 0.466076 |
| n000ace6d1f6367e | 0.834582 | 0.583167 |


--------------------------------------------------
## Example 2. Ensembling multiple models

In [9]:
from numerai_blocks.model import RandomModel
from numerai_blocks.postprocessing import MeanEnsembler

In [10]:
random_model = RandomModel()

In [11]:
pipeline = ModelPipeline(models=[joblib_model, random_model],
                         postprocessors=[MeanEnsembler(cols=['prediction_joblib',
                                                            'prediction_random'],
                                                       final_col_name="prediction_ensemble")]
                         )

In [12]:
multi_model_dataf = pipeline(dataf)

Preprocessing:: 0it [00:00, ?it/s]

Model prediction:   0%|          | 0/2 [00:00<?, ?it/s]

Postprocessing:   0%|          | 0/1 [00:00<?, ?it/s]

In [13]:
multi_model_dataf.get_prediction_data.head(3)

| id | prediction_joblib | prediction_random | prediction_ensemble |
|---|---|---|---|
| n0001e4a82d5531c | 0.480704 | 0.238104 | 0.359404 |
| n000ace6d1f6367e | 0.834582 | 0.502623 | 0.668602 |
| n000ae61e2b11e0a | 0.459723 | 0.102098 | 0.28091 |
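The ensemble values in the output above are consistent with a plain row-wise mean of the two prediction columns. The pandas sketch below reproduces them from the displayed numbers. It assumes simple averaging and does not call `MeanEnsembler` itself, whose internals are not shown here.

```python
import pandas as pd

# Values copied from the output above.
preds = pd.DataFrame(
    {
        "prediction_joblib": [0.480704, 0.834582, 0.459723],
        "prediction_random": [0.238104, 0.502623, 0.102098],
    },
    index=pd.Index(
        ["n0001e4a82d5531c", "n000ace6d1f6367e", "n000ae61e2b11e0a"], name="id"
    ),
)

# Assumption: the ensemble is the unweighted mean across the listed columns.
cols = ["prediction_joblib", "prediction_random"]
preds["prediction_ensemble"] = preds[cols].mean(axis=1)
print(preds.round(6))
```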


## Example 3. ModelPipelineCollection use case