# Mini-batching

In its purest form, online machine learning encompasses models that learn from one sample at a time. This is the design used in `creme`.

The main downside of single-instance processing is that it doesn't scale to big data. Indeed, processing one sample at a time means that we are unable to use [vectorisation](https://www.wikiwand.com/en/Vectorization) and other computational tools that are taken for granted in batch learning. On top of this, processing a large dataset in `creme` essentially involves a Python `for` loop, which might be too slow for some use cases. However, this doesn't mean that `creme` is slow. In fact, for processing a single instance, `creme` is a couple of orders of magnitude faster than libraries such as scikit-learn, PyTorch, and TensorFlow. This is because `creme` is designed from the ground up to process a single instance, whereas most other libraries are designed around batches of data. The two approaches make different compromises, and the best choice depends on your use case.

To offer the best of both worlds, `creme` provides some limited support for mini-batch learning. Some of `creme`'s estimators implement `*_many` methods on top of their `*_one` counterparts. For instance, `preprocessing.StandardScaler` has a `fit_many` method as well as a `transform_many` method, in addition to `fit_one` and `transform_one`. Each mini-batch method takes as input a `pandas.DataFrame`. Supervised estimators also take as input a `pandas.Series` of target values. We use `pandas.DataFrame` rather than `numpy.ndarray` because the former allows us to name each feature. This in turn allows us to offer a uniform interface for both single-instance and mini-batch learning.
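To make the relationship between the `*_one` and `*_many` methods more concrete, here is a minimal sketch of a scaler whose two `fit` methods update the same per-feature statistics. It is not `creme`'s implementation: `RunningScaler` is a hypothetical class, and a plain dict mapping feature names to lists stands in for a `pandas.DataFrame` so that the example only needs the standard library. A dataframe should work as well, since its `items()` method also iterates over (column name, column) pairs.

```python
class RunningScaler:
    """Toy scaler (hypothetical, not creme's code) with statistics keyed by feature name."""

    def __init__(self):
        self.counts = {}
        self.means = {}
        self.m2s = {}  # running sums of squared deviations from the mean

    def fit_one(self, x):
        # A single sample is treated as a mini-batch of size one.
        return self.fit_many({name: [value] for name, value in x.items()})

    def fit_many(self, X):
        # X maps each feature name to a column of values.
        for name, column in X.items():
            n_b = len(column)
            if n_b == 0:
                continue
            mean_b = sum(column) / n_b
            m2_b = sum((v - mean_b) ** 2 for v in column)
            n_a = self.counts.get(name, 0)
            mean_a = self.means.get(name, 0.0)
            m2_a = self.m2s.get(name, 0.0)
            # Merge the batch statistics into the running statistics.
            n = n_a + n_b
            delta = mean_b - mean_a
            self.counts[name] = n
            self.means[name] = mean_a + delta * n_b / n
            self.m2s[name] = m2_a + m2_b + delta ** 2 * n_a * n_b / n
        return self

    def transform_one(self, x):
        out = {}
        for name, value in x.items():
            n = self.counts.get(name, 0)
            std = (self.m2s[name] / n) ** 0.5 if n else 0.0
            # Unseen or constant features are mapped to 0.
            out[name] = (value - self.means[name]) / std if std else 0.0
        return out


one = RunningScaler()
for a, b in zip([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]):
    one.fit_one({'a': a, 'b': b})

many = RunningScaler()
many.fit_many({'a': [1.0, 2.0], 'b': [10.0, 20.0]})
many.fit_many({'b': [30.0, 40.0], 'a': [3.0, 4.0]})  # column order does not matter

print(one.means, many.means)
```

Because the statistics are looked up by name, both scalers end up in the same state, whether the data arrived one sample at a time or in batches with reordered columns.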

As an example, we will build a simple pipeline that scales the data and trains a logistic regression. The `compose.Pipeline` class can be applied to mini-batches, as long as each of its steps supports them.

In [1]:
from creme import compose
from creme import linear_model
from creme import preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

For this example, we will use `datasets.Higgs`.

In [2]:
from creme import datasets

dataset = datasets.Higgs()
dataset

Higgs dataset

              Task  Binary classification                                                       
 Number of samples  11,000,000                                                                  
Number of features  28                                                                          
            Sparse  False                                                                       
              Path  /Users/mhalford/creme_data/Higgs/HIGGS.csv.gz                               
               URL  https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
              Size  2.55 GB                                                                     
        Downloaded  True                                                                        

The easiest way to read the data in a mini-batch fashion is to use the `read_csv` function from `pandas`.

In [8]:
import pandas as pd

names = [
    'target', 'lepton pT', 'lepton eta', 'lepton phi',
    'missing energy magnitude', 'missing energy phi',
    'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 'jet 1 b-tag',
    'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag',
    'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag',
    'jet 4 pt', 'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag',
    'm_jj', 'm_jjj', 'm_lv', 'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb'
]

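# Read the first 300,000 rows in chunks of 8,096 rows each
for x in pd.read_csv(dataset.path, names=names, chunksize=8096, nrows=300_000):
    y = x.pop('target')
    # Predict before fitting, so that each chunk is scored before it is learned from
    y_pred = model.predict_proba_many(x)
    model.fit_many(x, y)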

Unnamed: 0,lepton pT,lepton eta,lepton phi,missing energy magnitude,missing energy phi,jet 1 pt,jet 1 eta,jet 1 phi,jet 1 b-tag,jet 2 pt,...,jet 4 eta,jet 4 phi,jet 4 b-tag,m_jj,m_jjj,m_lv,m_jlv,m_bb,m_wbb,m_wwbb
299552,0.791697,-0.019534,1.739907,1.885742,-1.399785,0.94557,2.328995,0.167582,0.0,0.995519,...,-0.934888,0.18074,3.101961,2.647392,1.798047,1.000372,1.093855,0.872866,1.387072,1.298226
299553,2.281208,0.085655,-0.224405,1.40488,-0.47712,1.537719,-1.931964,1.244179,0.0,1.119282,...,0.749913,-0.447523,3.101961,0.906534,1.839424,0.979172,0.945455,0.570161,1.423344,1.280748
299554,0.434647,-1.028565,-0.504055,0.637968,0.035387,0.776188,-1.028874,0.500207,2.173076,0.858536,...,-0.33942,-0.99134,3.101961,0.847959,1.156364,0.97957,0.633138,0.957975,0.77951,0.715258
299555,0.317887,-1.534055,-0.597829,1.649802,1.057176,0.785165,0.528757,-0.313164,2.173076,0.766752,...,-1.28634,-1.097328,0.0,0.462453,0.901873,1.000208,0.650438,1.587968,1.151113,0.904281
299556,0.321181,0.438232,1.634483,1.267926,0.344081,1.056781,-1.080366,0.965328,1.086538,1.488557,...,-0.799971,-0.711109,0.0,0.470469,0.69418,0.987412,0.818274,1.000118,0.913945,0.829425
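The loop above hands the model one block of rows at a time. As a rough illustration of what chunked reading amounts to, the following sketch does the same thing with the standard library only. The `iter_chunks` helper is hypothetical and is not part of `pandas` or `creme`. Each mini-batch is a dict mapping column names to lists, and every value is assumed to be numeric.

```python
import csv
import io
from itertools import islice


def iter_chunks(file, names, chunksize):
    """Yield mini-batches as dicts mapping column names to lists of floats."""
    reader = csv.reader(file)
    while True:
        # Pull at most `chunksize` rows; an empty list means the file is exhausted.
        rows = list(islice(reader, chunksize))
        if not rows:
            break
        yield {name: [float(row[i]) for row in rows] for i, name in enumerate(names)}


# A tiny in-memory CSV stands in for the real file.
raw = io.StringIO("1,0.5,2.0\n0,1.5,4.0\n1,2.5,6.0\n")

for X in iter_chunks(raw, names=['target', 'a', 'b'], chunksize=2):
    y = X.pop('target')
    print(len(y), X)
```

The last chunk is simply shorter than the others when the number of rows is not a multiple of the chunk size.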


If you are familiar with scikit-learn, you might be aware that [some](https://scikit-learn.org/stable/modules/computing.html#incremental-learning) of its estimators have a `partial_fit` method, which is similar to `creme`'s `fit_many` method. Here are some advantages that `creme` has over scikit-learn:

- We guarantee that `creme` is just as fast as scikit-learn, if not faster. The differences are negligible, but are slightly in favor of `creme`.
- We take dataframes as input, which allows us to name each feature. The benefit is that you can add, remove, or permute features between batches and everything will keep working.
- Estimators that support mini-batches also support single instance learning. This means that you can enjoy the best of both worlds. For instance, you can train with mini-batches and use `predict_one` to make predictions.
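The last point can be illustrated with a toy model that is trained on mini-batches and queried one sample at a time. This is only a sketch under simplifying assumptions: `TinyLogReg` is a hypothetical class, it performs plain averaged gradient steps, and it expects mini-batches as dicts of lists rather than dataframes. It does not reflect how `linear_model.LogisticRegression` is implemented.

```python
import math


class TinyLogReg:
    """Toy logistic regression (hypothetical) with weights keyed by feature name."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}
        self.intercept = 0.0

    def predict_proba_one(self, x):
        # Features that were never seen during training contribute nothing.
        z = self.intercept + sum(
            self.weights.get(name, 0.0) * value for name, value in x.items()
        )
        return 1.0 / (1.0 + math.exp(-z))

    def fit_many(self, X, y):
        # X maps feature names to lists of values; y is a list of 0/1 labels.
        n = len(y)
        if n == 0:
            return self
        rows = [{name: column[i] for name, column in X.items()} for i in range(n)]
        errors = [self.predict_proba_one(row) - target for row, target in zip(rows, y)]
        # One gradient step on the log loss, averaged over the mini-batch.
        for name, column in X.items():
            grad = sum(e * v for e, v in zip(errors, column)) / n
            self.weights[name] = self.weights.get(name, 0.0) - self.lr * grad
        self.intercept -= self.lr * sum(errors) / n
        return self


toy_model = TinyLogReg(lr=0.5)
for _ in range(50):
    toy_model.fit_many({'a': [1.0, -1.0, 2.0, -2.0], 'b': [0.5, 0.5, -0.5, -0.5]}, [1, 0, 1, 0])
    # Same features, permuted column order
    toy_model.fit_many({'b': [0.5, -0.5, 0.5, -0.5], 'a': [1.0, 1.0, -1.0, -1.0]}, [1, 1, 0, 0])

# Train with mini-batches, predict with a single instance
print(toy_model.predict_proba_one({'a': 1.5, 'b': 0.0}))
print(toy_model.predict_proba_one({'a': -1.5, 'b': 0.0}))
```

Since the label only depends on the sign of `a` in this toy data, the first probability should land above 0.5 and the second below it.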

Note that you can check which estimators can process mini-batches programmatically:

In [None]:
import importlib
import inspect

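def can_mini_batch(obj):
    # An estimator supports mini-batches if it exposes a `fit_many` method
    return hasattr(obj, 'fit_many')

for module in importlib.import_module('creme').__all__:
    # `inspect.getmembers` returns (name, object) pairs for the members that satisfy the predicate
    for obj in inspect.getmembers(importlib.import_module(f'creme.{module}'), can_mini_batch):
        print(obj)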

Because mini-batch learning isn't treated as a first-class citizen, some of `creme`'s functionality requires extra work in order to play nicely with mini-batches. For instance, the objects from the `metrics` module have an `update` method that takes as input a single pair `(y_true, y_pred)`. This might change in the future, depending on the demand.
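In the meantime, one workaround is to loop over the pairs of a mini-batch and update the metric one pair at a time. The sketch below uses a hypothetical stand-in `Accuracy` class and a hypothetical `update_many` helper, so that it runs without `creme`. It assumes that the predictions are already labels rather than probabilities.

```python
class Accuracy:
    """Hypothetical stand-in for a metric whose `update` takes one (y_true, y_pred) pair."""

    def __init__(self):
        self.n_correct = 0
        self.n_total = 0

    def update(self, y_true, y_pred):
        self.n_correct += int(y_true == y_pred)
        self.n_total += 1
        return self

    def get(self):
        return self.n_correct / self.n_total if self.n_total else 0.0


def update_many(metric, y_true_batch, y_pred_batch):
    # Feed the metric one pair at a time; any two iterables of equal length will do.
    for y_true, y_pred in zip(y_true_batch, y_pred_batch):
        metric.update(y_true, y_pred)
    return metric


metric = Accuracy()
update_many(metric, [True, False, True], [True, True, True])
update_many(metric, [False], [False])
print(metric.get())
```

This reintroduces a Python loop per sample, so it trades some of the speed of mini-batching for compatibility with the single-pair interface.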

We plan to promote more models to the mini-batch regime. However, we will only be doing so for the methods that benefit the most from it, as well as those that are most popular. Indeed, `creme`'s core philosophy will remain to cater to single instance learning.