# GeoAML - AutoML pipeline evaluation based on inference time

The Open-Earth-Monitor Cyberinfrastructure project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No. 101059548.

One of the goals of the OEMC project is streamlining ML procedures for geospatial data. Geospatial applications of ML often have the end goal of mapping predictions across a large spatial (or spatiotemporal) extent immediately upon fitting a model. This is often necessary even in the prototyping phase, to assess problems with the modelling approach that might not be apparent from the training dataset. In these situations the area to be mapped often contains orders of magnitude more samples than the training dataset, and the time needed to produce (infer) a full map can be a significant hurdle to the usefulness of a model.

To overcome this in an AutoML pipeline, we can directly include inference time as a metric to be tracked and optimized for. We demonstrate this with two prominent AutoML frameworks: [PyCaret](https://pycaret.gitbook.io/) and [FLAML](https://microsoft.github.io/FLAML/).
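
As a rough illustration of what "inference time as a metric" means, the sketch below times a single `predict` call with the standard library. The names `time_inference` and `MeanModel` are hypothetical and are not part of PyCaret, FLAML, or `automl-utils`; any fitted estimator with a `predict` method could take the place of the stand-in model.

```python
import time


def time_inference(predict, X, repeats=3):
    """Return the fastest wall-clock time (seconds) of `predict(X)` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(X)
        timings.append(time.perf_counter() - start)
    # the minimum is the run least affected by background load on the machine
    return min(timings)


class MeanModel:
    """Hypothetical stand-in for a fitted estimator with a `predict` method."""

    def predict(self, X):
        return [sum(row) / len(row) for row in X]


# stand-in for the pixels of a map that needs to be predicted
X_map = [[float(i), float(i + 1)] for i in range(10_000)]

seconds = time_inference(MeanModel().predict, X_map)
print(f"inference took {seconds:.4f} s for {len(X_map)} rows")
```

Taking the minimum over a few repeats is one possible design choice; a mean or median would be equally defensible, depending on how noisy the machine is.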

## Setting up

This notebook requires the following libraries installed in your environment:
  - `scikit-learn`
  - `xgboost`
  - `lightgbm`
  - `numpy`
  - `pandas`
  - `requests`
  - `pycaret`
  - `flaml[automl]`

You can either `pip install` or `conda install` these yourself, or if you are working from a constrained environment like Colab, uncomment the install lines in the following code block.

Additionally, we provide a small module called `automl-utils`, included in the notebook repository. This module contains some thin wrappers and helpers to assist with integrating an inference time metric with AutoML (most notably when working with PyCaret, which doesn't normally pass the estimator to metric functions). You can install this package from git (by uncommenting the corresponding line in the following code cell), or from a local copy of the notebook repository (by running `python -m pip install .` from the `automl-utils` directory).

In [1]:
### uncomment the following to install dependencies
# !python -m pip install scikit-learn xgboost lightgbm numpy pandas requests pycaret "flaml[automl]"

### uncomment the following to install the automl-utils module
### (the URL is quoted so the shell does not interpret the `&`)
# !python -m pip install "git+https://github.com/Open-Earth-Monitor/showcase#egg=automl-utils&subdirectory=automl-utils"



First, we need to fetch some data to work with. The dataset presented here was prepared within the context of [OEMC Hackathon 2023](https://earthmonitor.org/events/hackathon2023/). We can download it from Zenodo and inspect it with Pandas.

In [2]:
import pandas as pd
import requests

ZENODO_URL = "https://zenodo.org/records/13874505"

DATA_FILE = "./train.csv"

resp = requests.get(f"{ZENODO_URL}/files/train.csv?download=1")
resp.raise_for_status()  # fail early if the download did not succeed

with open(DATA_FILE, "wb") as dst:
    dst.write(resp.content)

df_train = pd.read_csv(DATA_FILE)

df_train

Unnamed: 0,sample_id,station,month,fapar,modis_blue,modis_red,modis_nir,modis_mir,modis_evi,modis_ndvi,...,dtm_slope,dtm_aspect-cosine,dtm_aspect-sine,dtm_downlslope.curvature,dtm_upslope.curvature,dtm_elevation,dtm_cti,dtm_neg.openess,dtm_pos.openess,dtm_vbf
0,0,52,2,0.310634,235.0,545.0,1306.0,1414.0,1484.0,4108.0,...,10.0,-4753.0,-876.0,-13.0,13.0,351.0,-2414.0,153.0,155.0,14.0
1,1,14,9,0.699500,355.0,531.0,3348.0,786.0,5060.0,7156.0,...,11.0,-3071.0,945.0,-22.0,40.0,125.0,-412.0,152.0,155.0,10.0
2,2,52,3,0.353572,276.0,642.0,1496.0,1364.0,1614.0,4086.0,...,10.0,-4753.0,-876.0,-13.0,13.0,351.0,-2414.0,153.0,155.0,14.0
3,3,73,11,0.260067,519.0,1196.0,3256.0,1247.0,3112.0,4628.0,...,1.0,5685.0,788.0,0.0,0.0,584.0,3795.0,157.0,157.0,257.0
4,4,14,3,0.779333,327.0,528.0,3106.0,988.0,4572.0,7048.0,...,11.0,-3071.0,945.0,-22.0,40.0,125.0,-412.0,152.0,155.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3456,3456,7,10,0.027000,811.0,1169.0,1800.0,2672.0,1163.0,2045.0,...,1.0,-9449.0,-38.0,-1.0,3.0,1652.0,100.0,156.0,157.0,373.0
3457,3457,26,8,0.036196,563.0,1366.0,2776.0,2432.0,1999.0,3316.0,...,1.0,4335.0,900.0,-6.0,-1.0,1322.0,2011.0,157.0,156.0,540.0
3458,3458,56,8,0.969277,167.0,250.0,3366.0,525.0,5704.0,8624.0,...,1.0,-3085.0,416.0,-4.0,0.0,12.0,2786.0,157.0,157.0,653.0
3459,3459,23,3,0.536160,257.0,542.0,2104.0,887.0,2918.0,6168.0,...,1.0,9548.0,-270.0,-8.0,3.0,42.0,400.0,157.0,156.0,875.0


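We'll set aside two things for later: the `station` column (used to group cross-validation folds) and the feature columns `X` (used for timed inference and for FLAML). We then drop the columns that should not act as predictors. After that we can begin building our AutoML pipeline, first with PyCaret.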

In [3]:
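# station IDs, kept for grouped cross-validation
groups = df_train.station.copy()
# feature columns only: everything after sample_id, station, month and fapar
X = df_train[df_train.columns[4:]]

# drop identifier columns so that only the target and the features remain
del df_train["sample_id"], df_train["month"], df_train["station"]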

df_train

Unnamed: 0,fapar,modis_blue,modis_red,modis_nir,modis_mir,modis_evi,modis_ndvi,modis_lst_day_p05,modis_lst_day_p50,modis_lst_day_p95,...,dtm_slope,dtm_aspect-cosine,dtm_aspect-sine,dtm_downlslope.curvature,dtm_upslope.curvature,dtm_elevation,dtm_cti,dtm_neg.openess,dtm_pos.openess,dtm_vbf
0,0.310634,235.0,545.0,1306.0,1414.0,1484.0,4108.0,13656.0,14032.0,14384.0,...,10.0,-4753.0,-876.0,-13.0,13.0,351.0,-2414.0,153.0,155.0,14.0
1,0.699500,355.0,531.0,3348.0,786.0,5060.0,7156.0,14904.0,15200.0,15296.0,...,11.0,-3071.0,945.0,-22.0,40.0,125.0,-412.0,152.0,155.0,10.0
2,0.353572,276.0,642.0,1496.0,1364.0,1614.0,4086.0,14400.0,14480.0,14888.0,...,10.0,-4753.0,-876.0,-13.0,13.0,351.0,-2414.0,153.0,155.0,14.0
3,0.260067,519.0,1196.0,3256.0,1247.0,3112.0,4628.0,13264.0,13696.0,13976.0,...,1.0,5685.0,788.0,0.0,0.0,584.0,3795.0,157.0,157.0,257.0
4,0.779333,327.0,528.0,3106.0,988.0,4572.0,7048.0,15136.0,15200.0,15296.0,...,11.0,-3071.0,945.0,-22.0,40.0,125.0,-412.0,152.0,155.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3456,0.027000,811.0,1169.0,1800.0,2672.0,1163.0,2045.0,14456.0,14832.0,15272.0,...,1.0,-9449.0,-38.0,-1.0,3.0,1652.0,100.0,156.0,157.0,373.0
3457,0.036196,563.0,1366.0,2776.0,2432.0,1999.0,3316.0,15472.0,15792.0,16248.0,...,1.0,4335.0,900.0,-6.0,-1.0,1322.0,2011.0,157.0,156.0,540.0
3458,0.969277,167.0,250.0,3366.0,525.0,5704.0,8624.0,14680.0,14960.0,15056.0,...,1.0,-3085.0,416.0,-4.0,0.0,12.0,2786.0,157.0,157.0,653.0
3459,0.536160,257.0,542.0,2104.0,887.0,2918.0,6168.0,14504.0,14832.0,14952.0,...,1.0,9548.0,-270.0,-8.0,3.0,42.0,400.0,157.0,156.0,875.0


## PyCaret

PyCaret is a highly accessible AutoML framework designed for rapid experimentation and prototyping. It handles many of the common preprocessing tasks internally (to a degree), like data cleaning and feature engineering, and makes it simple to build a workable prototype model relatively quickly, even on lower-end hardware.

We will be using PyCaret's OOP API which allows us to contain a full AutoML experiment in a single object. We will also inspect what kind of metrics the experiment tracks by default.

In [4]:
from pycaret.regression import RegressionExperiment

exp = RegressionExperiment()

# setting up an experiment also outputs a summary of the setup
exp.setup(
    df_train,
    target="fapar",
    fold_groups=groups,
    fold_strategy="groupkfold",
)

# inspect the metrics included by default
exp.get_metrics()

Unnamed: 0,Description,Value
0,Session id,7872
1,Target,fapar
2,Target type,Regression
3,Original data shape,"(3461, 33)"
4,Transformed data shape,"(3461, 33)"
5,Transformed train set shape,"(2422, 33)"
6,Transformed test set shape,"(1039, 33)"
7,Numeric features,32
8,Preprocess,True
9,Imputation type,simple


ID,Name,Display Name,Score Function,Scorer,Target,Args,Greater is Better,Custom
mae,MAE,MAE,<function mean_absolute_error at 0x7f50ade44f40>,neg_mean_absolute_error,pred,{},False,False
mse,MSE,MSE,<function mean_squared_error at 0x7f50ade45300>,neg_mean_squared_error,pred,{},False,False
rmse,RMSE,RMSE,<function mean_squared_error at 0x7f50ade45300>,neg_root_mean_squared_error,pred,{'squared': False},False,False
r2,R2,R2,<function r2_score at 0x7f50ade45b20>,r2,pred,{},True,False
rmsle,RMSLE,RMSLE,<function RMSLEMetricContainer.__init__.<local...,"make_scorer(root_mean_squared_log_error, great...",pred,{},False,False
mape,MAPE,MAPE,<function MAPEMetricContainer.__init__.<locals...,"make_scorer(mean_absolute_percentage_error, gr...",pred,{},False,False


PyCaret allows us to easily add custom metrics to the experiment, but does not natively pass the estimator object to the metric function (which is necessary to measure inference time). We will circumvent this by using the `automl_utils` module, and add an instance of its `InferenceTimer` as a custom metric to the experiment.

In [5]:
import automl_utils

inference_time_metric_pycaret = automl_utils.InferenceTimer(
    X,  # specify the dataset for timed inference
    target_lib="pycaret",  # specify that the metric will be used with PyCaret
)

exp.add_metric(
    "inference_time",
    "inference time",
    inference_time_metric_pycaret,
    greater_is_better=False,
)

exp.get_metrics()

ID,Name,Display Name,Score Function,Scorer,Target,Args,Greater is Better,Custom
mae,MAE,MAE,<function mean_absolute_error at 0x7f50ade44f40>,neg_mean_absolute_error,pred,{},False,False
mse,MSE,MSE,<function mean_squared_error at 0x7f50ade45300>,neg_mean_squared_error,pred,{},False,False
rmse,RMSE,RMSE,<function mean_squared_error at 0x7f50ade45300>,neg_root_mean_squared_error,pred,{'squared': False},False,False
r2,R2,R2,<function r2_score at 0x7f50ade45b20>,r2,pred,{},True,False
rmsle,RMSLE,RMSLE,<function RMSLEMetricContainer.__init__.<local...,"make_scorer(root_mean_squared_log_error, great...",pred,{},False,False
mape,MAPE,MAPE,<function MAPEMetricContainer.__init__.<locals...,"make_scorer(mean_absolute_percentage_error, gr...",pred,{},False,False
inference_time,inference time,inference time,<automl_utils.InferenceTimer object at 0x7f50a...,"make_scorer(inference_time, greater_is_better=...",pred,{},False,True


To work with this setup, we also need to patch the estimator objects we will use for the experiment using the `automl_utils` module. This works with any `scikit-learn`-compatible estimator.

In [6]:
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

estimators = [
    automl_utils.patch_estimator(e)
    for e in (
        LGBMRegressor(),
        XGBRegressor(),
        RandomForestRegressor(),
    )
]

estimators

[LGBMRegressor_Patched4InferenceTimer(),
 XGBRegressor_Patched4InferenceTimer(base_score=None, booster=None,
                                     callbacks=None, colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, device=None,
                                     early_stopping_rounds=None,
                                     enable_categorical=False, eval_metric=None,
                                     feature_types=None, gamma=None,
                                     grow_policy=None, importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_bin=None,
                                     max_cat_threshold=None,
                                     max_cat_to_onehot=None, max_delta_step=None,
                                     max_depth=None, max_leaves=None,
                                     m

We can now run our AutoML experiment and get a ranking of our models based on multiple metrics, including inference time.

In [7]:
best = exp.compare_models(
    include=estimators,
    turbo=True,
)

best

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,inference time,TT (Sec)
0,Light Gradient Boosting Machine,0.0707,0.0158,0.1074,0.749,0.0746,0.5389,0.0255,15.408
1,Extreme Gradient Boosting,0.074,0.0181,0.1141,0.72,0.0785,0.4245,0.02,0.477
2,Random Forest Regressor,0.0731,0.018,0.1111,0.7094,0.0766,0.4457,0.1036,0.863
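
To relate these numbers back to the mapping problem, the measured inference times can be scaled to the size of a full map. The sketch below does this under several loudly stated assumptions: the "inference time" column is in seconds, it was measured on all 3461 rows reported in the experiment summary, prediction time grows linearly with the number of rows, and the target raster (10,000 x 10,000 pixels) is entirely hypothetical.

```python
def extrapolate_seconds(seconds_timed, n_timed_rows, n_map_pixels):
    """Linearly scale a measured inference time to a larger number of rows."""
    return seconds_timed / n_timed_rows * n_map_pixels


# "inference time" column of the comparison table (assumed to be seconds)
measured = {
    "Light Gradient Boosting Machine": 0.0255,
    "Extreme Gradient Boosting": 0.02,
    "Random Forest Regressor": 0.1036,
}
N_TIMED_ROWS = 3461         # rows reported in the experiment summary (assumed timing set)
N_MAP_PIXELS = 100_000_000  # hypothetical raster of 10,000 x 10,000 pixels

estimates = {
    name: extrapolate_seconds(seconds, N_TIMED_ROWS, N_MAP_PIXELS)
    for name, seconds in measured.items()
}

for name, est in estimates.items():
    print(f"{name}: ~{est / 60:.1f} min")
```

Real mapping workloads are usually processed in tiles and in parallel, so these figures are only useful for comparing models against each other, not as absolute predictions.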


## FLAML

FLAML is a modular and efficient AutoML framework with support for a broad range of ML tasks. It is one of the few frameworks in this space that passes the estimator to metric functions, allowing for direct inference time optimization without any helpers. However, we will again use the `automl_utils.InferenceTimer` helper for convenience.
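
For reference, a helper-free metric might look roughly like the sketch below. It follows the custom metric signature described in FLAML's documentation (returning the loss to minimize plus a dict of values to log), but it times prediction on the validation split rather than on a fixed dataset. `ConstantRegressor` is a hypothetical stand-in for the estimator FLAML would pass in.

```python
import time


def inference_time_metric(
    X_val, y_val, estimator, labels, X_train, y_train,
    weight_val=None, weight_train=None, *args,
):
    """FLAML-style custom metric that uses prediction time as the loss."""
    start = time.perf_counter()
    estimator.predict(X_val)
    seconds_elapsed = time.perf_counter() - start
    # first value: loss to minimize; second value: metrics to log
    return seconds_elapsed, {"inference_time": seconds_elapsed}


class ConstantRegressor:
    """Hypothetical stand-in for a fitted estimator."""

    def predict(self, X):
        return [0.0 for _ in X]


# call the metric by hand, the way the framework would during a trial
loss, logged = inference_time_metric(
    [[1.0], [2.0]], [0.1, 0.2], ConstantRegressor(), None, [[3.0]], [0.3]
)
print(loss, logged)
```

Timing on a fixed dataset, as `InferenceTimer` does with `X`, has the advantage that results stay comparable across trials even when the validation split changes.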

Let's set up our AutoML experiment with FLAML.

In [8]:
from flaml import AutoML

inference_time_metric_flaml = automl_utils.InferenceTimer(
    X,
    target_lib="flaml",  # specify that the metric will be used with FLAML
)

settings = {
    "time_budget": 10,
    "metric": inference_time_metric_flaml,  # define metric as optimization target
    "estimator_list": [  # specify estimator types
        "lgbm",
        "rf",
        "xgboost",
    ],
    "task": "regression",
}

automl = AutoML(**settings)

automl

We can now use the `automl` object to optimize an ML pipeline purely for inference time.

In [9]:
y = df_train["fapar"]

automl.fit(X, y)

[flaml.automl.logger: 06-03 21:49:39] {1752} INFO - task = regression
[flaml.automl.logger: 06-03 21:49:39] {1763} INFO - Evaluation method: holdout
[flaml.automl.logger: 06-03 21:49:39] {1862} INFO - Minimizing error metric: customized metric
[flaml.automl.logger: 06-03 21:49:39] {1979} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost']
[flaml.automl.logger: 06-03 21:49:39] {2282} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 06-03 21:49:39] {2417} INFO - Estimated sufficient time budget=438s. Estimated necessary time budget=0s.
[flaml.automl.logger: 06-03 21:49:39] {2466} INFO -  at 0.1s,	estimator lgbm's best error=0.0028,	best estimator lgbm's best error=0.0028
[flaml.automl.logger: 06-03 21:49:39] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 06-03 21:49:40] {2466} INFO -  at 0.2s,	estimator lgbm's best error=0.0020,	best estimator lgbm's best error=0.0020
[flaml.automl.logger: 06-03 21:49:40] {2282} INFO - iteration 2, cur

While this works, optimizing an ML pipeline purely for inference time is rarely useful on its own, since it ignores predictive accuracy altogether. Instead, we can define a more complex metric that combines a traditional loss function with inference time.

In [10]:
import numpy as np

# if we omit the target_lib argument, we get a generic inference timer
# that takes only an estimator and outputs number of seconds elapsed
inference_timer = automl_utils.InferenceTimer(X)


def custom_metric_flaml(X_val, y_val, estimator, *args, **kwargs):
    y_pred = estimator.predict(X_val)
    mae = np.abs(y_pred - y_val).mean()
    seconds_elapsed = inference_timer(estimator)

    # combine MAE and inference time into a single metric to optimize for
    loss = mae * seconds_elapsed * 1000  # magnify by 1000 for more readable results

    # return a single loss value as the optimization target
    # and a dict of metrics to display during optimization
    return loss, {
        "MAE": mae,
        "inference_time": seconds_elapsed,
    }


automl.fit(X, y, metric=custom_metric_flaml)

[flaml.automl.logger: 06-03 21:49:50] {1752} INFO - task = regression
[flaml.automl.logger: 06-03 21:49:50] {1763} INFO - Evaluation method: holdout
[flaml.automl.logger: 06-03 21:49:50] {1862} INFO - Minimizing error metric: customized metric
[flaml.automl.logger: 06-03 21:49:50] {1979} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost']
[flaml.automl.logger: 06-03 21:49:50] {2282} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 06-03 21:49:50] {2417} INFO - Estimated sufficient time budget=651s. Estimated necessary time budget=1s.
[flaml.automl.logger: 06-03 21:49:50] {2466} INFO -  at 0.2s,	estimator lgbm's best error=0.8768,	best estimator lgbm's best error=0.8768
[flaml.automl.logger: 06-03 21:49:50] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 06-03 21:49:50] {2466} INFO -  at 0.2s,	estimator lgbm's best error=0.8768,	best estimator lgbm's best error=0.8768
[flaml.automl.logger: 06-03 21:49:50] {2282} INFO - iteration 2, cur
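
To get a feel for how the product-form loss in `custom_metric_flaml` trades accuracy against speed, it can be applied by hand to the MAE and inference time values from the PyCaret comparison table. This is only an illustration: those values come from PyCaret's grouped cross-validation with default hyperparameters, not from the FLAML run, and the helper name `combined_loss` is made up for this sketch.

```python
def combined_loss(mae, seconds_elapsed, scale=1000):
    """Product-form loss as in `custom_metric_flaml`: lower is better."""
    return mae * seconds_elapsed * scale


# (MAE, inference time) pairs copied from the PyCaret comparison table
table = {
    "lgbm": (0.0707, 0.0255),
    "xgboost": (0.074, 0.02),
    "rf": (0.0731, 0.1036),
}

losses = {name: combined_loss(mae, sec) for name, (mae, sec) in table.items()}
winner = min(losses, key=losses.get)

for name, loss in sorted(losses.items(), key=lambda kv: kv[1]):
    print(f"{name}: {loss:.2f}")
print("lowest combined loss:", winner)
```

Because the two terms are multiplied, a model that is twice as fast can afford roughly twice the MAE. Where accuracy should weigh more heavily, a weighted sum or an exponent on one of the terms would be alternative designs to consider.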