Copyright (c) Microsoft Corporation. All rights reserved. 

Licensed under the MIT License.

# Customize your AutoML with FLAML


## Introduction and Preparation

This notebook walks through several ways to customize FLAML's AutoML, including:
- **Optimization metric**
- **Learner** and its **search space**
- **Resampling strategy**
- **Ensemble**
- **Extra fit arguments**

### FLAML installation
FLAML requires `Python>=3.7`. To run this notebook example, please install flaml with the `notebook` option:
```bash
pip install flaml[notebook]
```

In [None]:
%pip install flaml[notebook]==1.1.2

### Load data and preprocess

Download [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given the information of the scheduled departure.

In [1]:
from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

load dataset from ./openml_ds1169.pkl
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)


## 1. Optimization Metric

It's easy to customize the optimization metric. As an example, the custom metric function below combines the training loss and the validation loss into the final loss to minimize.

In [2]:
def custom_metric(X_val, y_val, estimator, labels, X_train, y_train,
                  weight_val=None, weight_train=None, config=None,
                  groups_val=None, groups_train=None):
    from sklearn.metrics import log_loss
    import time
    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels,
                         sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels,
                          sample_weight=weight_train)
    alpha = 0.5
    return val_loss * (1 + alpha) - alpha * train_loss, {
        "val_loss": val_loss, "train_loss": train_loss, "pred_time": pred_time
    }
    # two elements are returned:
    # the first element is the metric to minimize as a float number,
    # the second element is a dictionary of the metrics to log

We can then pass this custom metric function to automl's `fit` method.

In [3]:
# import the AutoML class from the flaml package
from flaml import AutoML
automl = AutoML()
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": custom_metric,  # pass the custom metric function here
    "task": 'classification',  # task type
    "log_file_name": 'airlines_experiment_custom_metric.log',  # flaml log file
}

automl.fit(X_train=X_train, y_train=y_train, **settings)

[flaml.automl.automl: 01-06 15:42:10] {2625} INFO - task = classification
[flaml.automl.automl: 01-06 15:42:10] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 01-06 15:42:10] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 01-06 15:42:10] {2757} INFO - Minimizing error metric: customized metric
[flaml.automl.automl: 01-06 15:42:10] {2902} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 01-06 15:42:10] {3203} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 01-06 15:42:10] {3340} INFO - Estimated sufficient time budget=25247s. Estimated necessary time budget=582s.
[flaml.automl.automl: 01-06 15:42:10] {3387} INFO -  at 0.6s,	estimator lgbm's best error=0.6647,	best estimator lgbm's best error=0.6647
[flaml.automl.automl: 01-06 15:42:10] {3203} INFO - iteration 1, current learner lgbm
[flaml.automl.automl: 01-06 15:42:10] {3387} INFO -  at 0.6s,	estimator lgbm
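
The value returned by `custom_metric` can be read as the validation loss plus a penalty on the gap between validation and training loss, since `val_loss * (1 + alpha) - alpha * train_loss` equals `val_loss + alpha * (val_loss - train_loss)`. The sketch below checks that identity on made-up loss values. The helper names `combined_loss` and `gap_penalized_loss` and all numbers are illustrative assumptions, not part of FLAML.

```python
import math


def combined_loss(val_loss, train_loss, alpha=0.5):
    """Same arithmetic as the first value returned by custom_metric."""
    return val_loss * (1 + alpha) - alpha * train_loss


def gap_penalized_loss(val_loss, train_loss, alpha=0.5):
    """Algebraically equivalent form: validation loss plus a gap penalty."""
    return val_loss + alpha * (val_loss - train_loss)


# Made-up losses: two models with the same validation loss,
# one of which fits the training data much more tightly.
well_matched = combined_loss(val_loss=0.66, train_loss=0.64)
overfit = combined_loss(val_loss=0.66, train_loss=0.40)

# Both forms agree, and the larger train/validation gap costs more.
assert math.isclose(well_matched, gap_penalized_loss(0.66, 0.64))
assert overfit > well_matched
print(f"well matched: {well_matched:.2f}, overfit: {overfit:.2f}")
```

With equal validation loss, the model whose training loss is far lower ends up with the larger final loss, so this metric favors configurations that generalize rather than memorize.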

## 2. Learner and Search Space

Some experienced AutoML users have a preferred model to tune, or already have a reasonably well hand-tuned model before launching the AutoML experiment. They need to search for good configurations of that customized model alongside the standard built-in learners.

FLAML can incorporate customized or new learners provided by users (preferably with a scikit-learn-style API) at run time, as demonstrated below.

### Example of Regularized Greedy Forest

[Regularized Greedy Forest](https://arxiv.org/abs/1109.0887) (RGF) is a machine learning method not currently included in FLAML. RGF has many tuning parameters, the most critical of which are `[max_leaf, n_iter, n_tree_search, opt_interval, min_samples_leaf]`. To run a customized/new learner, the user needs to provide the following information:
* an implementation of the customized/new learner
* a list of hyperparameter names and types
* rough ranges of the hyperparameters (i.e., upper/lower bounds)
* initial values corresponding to low cost for cost-related hyperparameters (e.g., the initial values for `max_leaf` and `n_iter` should be small)

In this example, the above information for RGF is wrapped in a Python class called `MyRegularizedGreedyForest` that exposes the hyperparameters.
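
To make the idea of a rough range on a log scale concrete, here is a conceptual sketch of drawing an integer uniformly in log space between two bounds. `sample_log_int` is a hypothetical helper written for illustration. It is not FLAML's `tune.lograndint` implementation, and the exclusive upper bound is an assumption of this sketch.

```python
import math
import random


def sample_log_int(lower, upper, rng):
    """Draw an integer in [lower, upper) uniformly on a log scale."""
    log_value = rng.uniform(math.log(lower), math.log(upper))
    # Clamp on both sides to guard against floating-point rounding
    # at the boundaries of the range.
    return max(lower, min(int(math.exp(log_value)), upper - 1))


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_log_int(4, 32768, rng) for _ in range(1000)]

# On a log scale, roughly half of the draws fall below the geometric
# mean of the bounds, so small (cheap) values are explored often.
geometric_mean = math.sqrt(4 * 32768)
share_small = sum(s < geometric_mean for s in samples) / len(samples)
print(f"share below {geometric_mean:.0f}: {share_small:.2f}")
```

A linear-scale draw over the same bounds would almost never produce values in the low hundreds, which is why wide ranges for cost-related hyperparameters are usually searched on a log scale.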

In [None]:
%pip install rgf-python

In [5]:
''' SKLearnEstimator is the super class for a sklearn learner '''
from flaml.model import SKLearnEstimator
from flaml import tune
from flaml.data import CLASSIFICATION


class MyRegularizedGreedyForest(SKLearnEstimator):
    def __init__(self, task='binary', **config):
        '''Constructor
        
        Args:
            task: A string of the task type, one of
                'binary', 'multiclass', 'regression'
            config: A dictionary containing the hyperparameter names
                and 'n_jobs' as keys. n_jobs is the number of parallel threads.
        '''

        super().__init__(task, **config)

        # task is 'binary' or 'multiclass' for a classification task
        if task in CLASSIFICATION:
            from rgf.sklearn import RGFClassifier

            self.estimator_class = RGFClassifier
        else:
            from rgf.sklearn import RGFRegressor
            
            self.estimator_class = RGFRegressor

    @classmethod
    def search_space(cls, data_size, task):
        '''[required method] search space

        Returns:
            A dictionary of the search space. 
            Each key is the name of a hyperparameter, and value is a dict with
                its domain (required) and low_cost_init_value, init_value,
                cat_hp_cost (if applicable).
                e.g.,
                {'domain': tune.randint(lower=1, upper=10), 'init_value': 1}.
        '''
        space = {        
            'max_leaf': {'domain': tune.lograndint(lower=4, upper=data_size[0]), 'init_value': 4, 'low_cost_init_value': 4},
            'n_iter': {'domain': tune.lograndint(lower=1, upper=data_size[0]), 'init_value': 1, 'low_cost_init_value': 1},
            'n_tree_search': {'domain': tune.lograndint(lower=1, upper=32768), 'init_value': 1, 'low_cost_init_value': 1},
            'opt_interval': {'domain': tune.lograndint(lower=1, upper=10000), 'init_value': 100},
            'learning_rate': {'domain': tune.loguniform(lower=0.01, upper=20.0)},
            'min_samples_leaf': {'domain': tune.lograndint(lower=1, upper=20), 'init_value': 20},
        }
        return space

    @classmethod
    def size(cls, config):
        '''[optional method] memory size of the estimator in bytes
        
        Args:
            config - the dict of the hyperparameter config

        Returns:
            A float of the memory size required by the estimator to train the
            given config
        '''
        max_leaves = int(round(config['max_leaf']))
        n_estimators = int(round(config['n_iter']))
        return (max_leaves * 3 + (max_leaves - 1) * 4 + 1.0) * n_estimators * 8

    @classmethod
    def cost_relative2lgbm(cls):
        '''[optional method] relative cost compared to lightgbm
        '''
        return 1.0
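
As a quick check of the `size` estimate, the same formula can be evaluated outside the class. The standalone function `estimated_size_bytes` is a hypothetical copy of that arithmetic, and the second configuration is made up for comparison.

```python
def estimated_size_bytes(config):
    """Standalone copy of the arithmetic in MyRegularizedGreedyForest.size."""
    max_leaves = int(round(config["max_leaf"]))
    n_estimators = int(round(config["n_iter"]))
    return (max_leaves * 3 + (max_leaves - 1) * 4 + 1.0) * n_estimators * 8


# The low-cost starting point from search_space: max_leaf=4, n_iter=1.
initial = estimated_size_bytes({"max_leaf": 4, "n_iter": 1})
# A larger, made-up configuration for comparison.
larger = estimated_size_bytes({"max_leaf": 1000, "n_iter": 500})

print(initial)  # (4*3 + 3*4 + 1.0) * 1 * 8 = 200.0
print(larger)   # (3000 + 3996 + 1.0) * 500 * 8 = 27988000.0
```

The estimate grows linearly in both `max_leaf` and `n_iter`.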

### Add Customized Learner and Run FLAML AutoML

After adding RGF to the list of learners, we run AutoML, tuning the hyperparameters of RGF as well as those of the default learners.

In [6]:
automl = AutoML()
automl.add_learner(learner_name='RGF', learner_class=MyRegularizedGreedyForest)

In [7]:
settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": 'accuracy', 
    "estimator_list": ['RGF', 'lgbm', 'rf', 'xgboost'],  # list of ML learners
    "task": 'classification',  # task type    
    "log_file_name": 'airlines_experiment_custom_learner.log',  # flaml log file 
    "log_training_metric": True,  # whether to log training metric
}

automl.fit(X_train=X_train, y_train=y_train, **settings)

[flaml.automl.automl: 01-06 15:42:23] {2625} INFO - task = classification
[flaml.automl.automl: 01-06 15:42:23] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 01-06 15:42:23] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 01-06 15:42:23] {2757} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.automl: 01-06 15:42:23] {2902} INFO - List of ML learners in AutoML Run: ['RGF', 'lgbm', 'rf', 'xgboost']
[flaml.automl.automl: 01-06 15:42:23] {3203} INFO - iteration 0, current learner RGF
[flaml.automl.automl: 01-06 15:42:24] {3340} INFO - Estimated sufficient time budget=354648s. Estimated necessary time budget=355s.
[flaml.automl.automl: 01-06 15:42:24] {3387} INFO -  at 1.4s,	estimator RGF's best error=0.3840,	best estimator RGF's best error=0.3840
[flaml.automl.automl: 01-06 15:42:24] {3203} INFO - iteration 1, current learner RGF
[flaml.automl.automl: 01-06 15:42:25] {3387} INFO -  at 2.0s,	estimator RGF's best error=0.3840,	best estimator RGF's b

## 3. Resampling Strategy
Keyword arguments related to the resampling strategy in FLAML:
* `eval_method`
* `split_ratio`
* `n_splits`
* `split_type`
* `X_val` and `y_val`

Detailed documentation on these arguments is available on this [page](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML/#resampling-strategy).
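
As a sketch of how these keyword arguments fit together, the two settings dictionaries below request cross-validation and a holdout split, respectively. The specific values (5 folds, a 0.2 ratio) are illustrative assumptions rather than recommendations. Consult the linked documentation for the accepted values of each argument.

```python
# Illustrative settings only; the numbers are assumptions for this sketch.
cv_settings = {
    "time_budget": 10,          # total running time in seconds
    "task": "classification",
    "eval_method": "cv",        # evaluate each configuration with cross-validation
    "n_splits": 5,              # number of folds
    "split_type": "stratified",
}

holdout_settings = {
    "time_budget": 10,
    "task": "classification",
    "eval_method": "holdout",   # evaluate on a single held-out split
    "split_ratio": 0.2,         # fraction of the data held out for validation
}

# Either dictionary would be passed the same way as in the other cells, e.g.
# automl.fit(X_train=X_train, y_train=y_train, **cv_settings)
# Passing X_val and y_val to fit supplies a user-chosen validation set instead.
```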

## 4. Ensemble
To use a [stacked ensemble after the model search in FLAML](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#ensemble), set `ensemble` to `True` or a dict.


In [8]:
from sklearn.linear_model import LogisticRegression
automl = AutoML()
settings = {
    "time_budget": 10,  # total running time in seconds
    "ensemble": {
        "final_estimator": LogisticRegression(),
        "passthrough": False,
    },
}

automl.fit(X_train=X_train, y_train=y_train, **settings)

[flaml.automl.automl: 01-06 15:42:35] {2625} INFO - task = classification
[flaml.automl.automl: 01-06 15:42:35] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 01-06 15:42:35] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 01-06 15:42:35] {2757} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl.automl: 01-06 15:42:35] {2902} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 01-06 15:42:35] {3203} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 01-06 15:42:35] {3340} INFO - Estimated sufficient time budget=12146s. Estimated necessary time budget=280s.
[flaml.automl.automl: 01-06 15:42:35] {3387} INFO -  at 0.4s,	estimator lgbm's best error=0.3580,	best estimator lgbm's best error=0.3580
[flaml.automl.automl: 01-06 15:42:35] {3203} INFO - iteration 1, current learner lgbm
[flaml.automl.automl: 01-06 15:42:35] {3387} INFO -  at 0.5s,	estimator lgbm's best 
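
The roles of `final_estimator` and `passthrough` can be illustrated without any library: in a stacked ensemble, the base models' predictions become the input features of the final estimator, and `passthrough` decides whether the original features are appended. The function `build_meta_features` and the toy numbers are hypothetical. This is a conceptual sketch of stacking in the scikit-learn sense, not FLAML's internal code.

```python
def build_meta_features(base_predictions, original_rows, passthrough=False):
    """Assemble the rows a final estimator would be trained on.

    base_predictions: one list of per-sample predictions per base model.
    original_rows: the original feature rows, one list per sample.
    """
    meta_rows = []
    for i, row in enumerate(original_rows):
        stacked = [preds[i] for preds in base_predictions]
        # With passthrough, the original features follow the predictions.
        meta_rows.append(stacked + list(row) if passthrough else stacked)
    return meta_rows


# Toy data: two base models, three samples, two original features each.
predictions = [[0.9, 0.2, 0.6], [0.8, 0.3, 0.4]]
rows = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]

print(build_meta_features(predictions, rows))
print(build_meta_features(predictions, rows, passthrough=True))
```

With `passthrough` set to `False`, as in the cell above, the final estimator sees only one column per base model.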

## 5. Adding Extra Fit Arguments
You can specify the different arguments needed by different estimators using the `fit_kwargs_by_estimator` argument.

In [None]:
%pip install flaml[catboost]==1.1.2

In [None]:
from flaml.automl.data import load_openml_dataset
from flaml import AutoML

X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir="./")

automl = AutoML()
automl_settings = {
    "task": "classification",
    "time_budget": 10,
    "estimator_list": ["catboost", "rf"],
    "fit_kwargs_by_estimator": {
        "catboost": {
            "verbose": True,  # setting the verbosity of catboost to True
        }
    },
}
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)
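
Conceptually, `fit_kwargs_by_estimator` is a nested dictionary keyed by estimator name, so each learner receives only its own extra arguments. The sketch below mimics that lookup in plain Python. `kwargs_for` is a hypothetical helper, and merging with shared arguments is an assumption made for illustration rather than a description of FLAML's internals.

```python
def kwargs_for(estimator_name, shared_kwargs, fit_kwargs_by_estimator):
    """Merge shared fit arguments with the ones specific to one estimator."""
    merged = dict(shared_kwargs)
    merged.update(fit_kwargs_by_estimator.get(estimator_name, {}))
    return merged


fit_kwargs_by_estimator = {"catboost": {"verbose": True}}
shared = {}  # no arguments shared by all estimators in this example

print(kwargs_for("catboost", shared, fit_kwargs_by_estimator))  # {'verbose': True}
print(kwargs_for("rf", shared, fit_kwargs_by_estimator))        # {}
```

Estimators without an entry, such as `rf` here, simply receive no extra arguments.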

## Interested in Knowing More?
Find more about the customization choices available in FLAML's `AutoML` module [here](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#customize-automlfit), and in FLAML's `Tune` module [here](https://microsoft.github.io/FLAML/docs/Use-Cases/Tune-User-Defined-Function#advanced-tuning-options).