Copyright (c) Microsoft Corporation. All rights reserved. 

# Further Acceleration of AutoML with FLAML


## Introduction and preparation

Beyond its fast hyperparameter optimization (HPO) methods, FLAML offers several features that can further accelerate an AutoML task:
- Constraints on training time or inference time
- Warm start
- Parallelization

### FLAML installation
FLAML requires `Python>=3.7`. To run this notebook example, install flaml with the `notebook`, `blendsearch`, and `ray` options:
```bash
pip install flaml[notebook,blendsearch,ray]
```

In [None]:
%pip install flaml[notebook,blendsearch,ray]==1.1.2

### Load data and preprocess

Download the [Airlines dataset](https://www.openml.org/d/1169) from OpenML. The task is to predict whether a given flight will be delayed, given information about its scheduled departure.

In [1]:
from flaml.data import load_openml_dataset
X_train, X_test, y_train, y_test = load_openml_dataset(dataset_id=1169, data_dir='./')

load dataset from ./openml_ds1169.pkl
Dataset name: airlines
X_train.shape: (404537, 7), y_train.shape: (404537,);
X_test.shape: (134846, 7), y_test.shape: (134846,)


## 1. Enabling constraints during AutoML in FLAML
**Overview:** FLAML supports [four types of constraints](https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#constraint):

(1) Constraints on the AutoML process via `time_budget` and/or `max_iter`.

(2) Constraints on the constructor arguments of the estimators.
Some constraints on the estimator can be implemented via a custom learner. For example, a custom learner can add a monotonicity constraint to XGBoost. This approach works for any constraint that is an argument of the underlying estimator's constructor.

(3) Constraints on the models tried in AutoML.
You can limit the number of models tried, as well as the training time and prediction time of each model.
* `train_time_limit`: maximum training time per model, in seconds.
* `pred_time_limit`: maximum prediction time per instance, in seconds.

(4) Constraints on the metrics of the ML model tried in AutoML.

### 1.1 Imposing constraints on training time and prediction time per model 
You can limit the training time and prediction time of each model via the following keyword arguments:
* `train_time_limit`: maximum training time per model, in seconds.
* `pred_time_limit`: maximum prediction time per instance, in seconds.
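
To make the unit of `pred_time_limit` concrete, the sketch below measures an average per-instance prediction time: the wall-clock time of one batch prediction divided by the number of rows. The helper `per_instance_pred_time` and the stand-in `dummy_predict` are hypothetical names used only for illustration; this is not how FLAML measures prediction time internally.

```python
import time


def per_instance_pred_time(predict, X):
    """Return the average wall-clock seconds spent per row when calling predict(X)."""
    start = time.perf_counter()
    predict(X)
    elapsed = time.perf_counter() - start
    return elapsed / len(X)


# Stand-in for a trained model's predict method (hypothetical, illustration only).
def dummy_predict(rows):
    return [sum(row) > 1.0 for row in rows]


rows = [[0.1 * i, 0.2 * i] for i in range(1000)]
latency = per_instance_pred_time(dummy_predict, rows)

# Compare against a per-instance limit of 0.1 seconds, as in "pred_time_limit": 0.1.
within_limit = latency <= 0.1
print(f"{latency:.2e} s per instance, within limit: {within_limit}")
```

Because the time is averaged over a batch, a limit such as 0.1 seconds per instance is loose for most tabular models; the second example in this section uses the much tighter 0.001.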

In [2]:
from flaml import AutoML
fast_automl_1 = AutoML()
settings = {
    "time_budget": 10,  # total running time in seconds
    "task": 'classification',  # task type
    "train_time_limit": 1,  # training time limit per model, in seconds
    "pred_time_limit": 0.1,  # prediction time limit per instance, in seconds
}
fast_automl_1.fit(X_train, y_train, **settings)

[flaml.automl.automl: 02-04 09:17:47] {2625} INFO - task = classification
[flaml.automl.automl: 02-04 09:17:47] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 02-04 09:17:47] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 02-04 09:17:48] {2757} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl.automl: 02-04 09:17:48] {2902} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'catboost', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 02-04 09:17:48] {3203} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 02-04 09:17:48] {3340} INFO - Estimated sufficient time budget=22614s. Estimated necessary time budget=555s.
[flaml.automl.automl: 02-04 09:17:48] {3387} INFO -  at 0.6s,	estimator lgbm's best error=0.3580,	best estimator lgbm's best error=0.3580
[flaml.automl.automl: 02-04 09:17:48] {3203} INFO - iteration 1, current learner lgbm
[flaml.automl.automl: 02-04 09:17:48] {3387} INFO -  at 0.6s,	estimator 

### 1.2 Imposing constraints on one or more metrics of the ML models tried in AutoML

Suppose your AutoML task must satisfy the following constraints:
- Each model trains in at most 1 second.
- Each model predicts in at most 0.001 seconds per instance.
- Both the training loss and the validation loss are no greater than 0.1.

Let's see how FLAML can help you do it.

In [22]:
def custom_metric(X_val, y_val, estimator, labels, X_train, y_train,
                  weight_val=None, weight_train=None, config=None,
                  groups_val=None, groups_train=None):
    from sklearn.metrics import log_loss
    import time
    # time the prediction on the validation set to get the per-instance latency
    start = time.time()
    y_pred = estimator.predict_proba(X_val)
    pred_time = (time.time() - start) / len(X_val)
    val_loss = log_loss(y_val, y_pred, labels=labels,
                         sample_weight=weight_val)
    y_pred = estimator.predict_proba(X_train)
    train_loss = log_loss(y_train, y_pred, labels=labels,
                          sample_weight=weight_train)
    # two elements are returned:
    # the first element is the metric to minimize as a float number,
    # the second element is a dictionary of the metrics to log
    return val_loss, {
        "val_loss": val_loss, "train_loss": train_loss, "pred_time": pred_time
    }

fast_automl_2 = AutoML()
metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]
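settings = {
    "time_budget": 10,  # total running time in seconds
    "metric": custom_metric,  # pass the custom metric function here
    "task": 'classification',  # task type
    "train_time_limit": 1,  # training time limit per model, in seconds
    "pred_time_limit": 0.001,  # prediction time limit per instance, in seconds
    "metric_constraints": metric_constraints,  # constraints on the logged metrics
}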
fast_automl_2.fit(X_train, y_train, **settings)

[flaml.automl.automl: 01-06 13:58:20] {2625} INFO - task = classification
[flaml.automl.automl: 01-06 13:58:20] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 01-06 13:58:20] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 01-06 13:58:20] {2757} INFO - Minimizing error metric: customized metric
[flaml.automl.automl: 01-06 13:58:20] {2902} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 01-06 13:58:20] {3203} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 01-06 13:58:20] {3340} INFO - Estimated sufficient time budget=22591s. Estimated necessary time budget=521s.
[flaml.automl.automl: 01-06 13:58:20] {3387} INFO -  at 1.0s,	estimator lgbm's best error=0.6640,	best estimator lgbm's best error=0.6640
[flaml.automl.automl: 01-06 13:58:20] {3203} INFO - iteration 1, current learner lgbm
[flaml.automl.automl: 01-06 13:58:20] {3387} INFO -  at 1.1s,	estimator lgbm
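
Each entry in `metric_constraints` is a 3-tuple: the name of a metric logged by the custom metric function, an inequality sign, and a threshold. The sketch below only illustrates how such tuples read when checked against a dictionary of logged metrics. The helper `satisfies` and the two trial dictionaries are hypothetical and made up for illustration; this is not FLAML's internal mechanism for handling constraints during the search.

```python
import operator

# Map the inequality strings used in the constraint tuples to Python operators.
_OPS = {"<=": operator.le, ">=": operator.ge}


def satisfies(metrics, constraints):
    """Return True if every (name, sign, threshold) constraint holds for `metrics`."""
    return all(_OPS[sign](metrics[name], bound) for name, sign, bound in constraints)


metric_constraints = [("train_loss", "<=", 0.1), ("val_loss", "<=", 0.1)]

# Hypothetical logged metrics for two trials (values invented for illustration).
trial_a = {"val_loss": 0.08, "train_loss": 0.05, "pred_time": 2e-6}
trial_b = {"val_loss": 0.30, "train_loss": 0.09, "pred_time": 2e-6}

print(satisfies(trial_a, metric_constraints))  # True
print(satisfies(trial_b, metric_constraints))  # False
```

The constraint names must match keys of the dictionary returned as the second element of the custom metric, which is why `custom_metric` logs both `train_loss` and `val_loss`.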

## 2. Warm start

Leveraging results from previous AutoML experiments is a good way to warm start a new experiment.

You can warm start an AutoML job by providing starting points of hyperparameter configurations for each estimator. For example, if you have run an AutoML job for one hour and, after checking the results, would like to run it for another two hours, you can use the best configuration found for each estimator as the starting point of the new run.


In [23]:
warmstarted_automl = AutoML()
settings = {
    "time_budget": 10,  # total running time in seconds
    "starting_points": fast_automl_2.best_config_per_estimator,
}
warmstarted_automl.fit(X_train, y_train, **settings)

[flaml.automl.automl: 01-06 13:58:32] {2625} INFO - task = classification
[flaml.automl.automl: 01-06 13:58:32] {2627} INFO - Data split method: stratified
[flaml.automl.automl: 01-06 13:58:32] {2630} INFO - Evaluation method: holdout
[flaml.automl.automl: 01-06 13:58:32] {2757} INFO - Minimizing error metric: 1-roc_auc
[flaml.automl.automl: 01-06 13:58:32] {2902} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 01-06 13:58:32] {3203} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 01-06 13:58:32] {3340} INFO - Estimated sufficient time budget=19213s. Estimated necessary time budget=443s.
[flaml.automl.automl: 01-06 13:58:32] {3387} INFO -  at 0.9s,	estimator lgbm's best error=0.3282,	best estimator lgbm's best error=0.3282
[flaml.automl.automl: 01-06 13:58:32] {3203} INFO - iteration 1, current learner lgbm
[flaml.automl.automl: 01-06 13:58:32] {3387} INFO -  at 1.0s,	estimator lgbm's best 
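
When the earlier run happened in a different session, the starting points have to be saved and reloaded first. The sketch below assumes that `best_config_per_estimator` is a dictionary mapping estimator names to hyperparameter dictionaries (possibly `None` for learners that were never tried) and that the values are JSON-serializable. The configuration values shown are hypothetical stand-ins, and dropping the `None` entries is a conservative choice made for this sketch rather than a documented requirement.

```python
import json
import os
import tempfile

# Hypothetical stand-in for `fast_automl_2.best_config_per_estimator`.
best_config_per_estimator = {
    "lgbm": {"n_estimators": 4, "num_leaves": 4, "learning_rate": 0.1},
    "rf": None,  # learner that was never tried in the earlier run
}

# Keep only the estimators that actually have a configuration.
starting_points = {
    name: config
    for name, config in best_config_per_estimator.items()
    if config is not None
}

# Save the starting points at the end of the first session ...
path = os.path.join(tempfile.mkdtemp(), "starting_points.json")
with open(path, "w") as f:
    json.dump(starting_points, f)

# ... and reload them at the start of the next one.
with open(path) as f:
    restored = json.load(f)

print(restored)
```

The reloaded dictionary can then be passed as `"starting_points"` in the settings of the new run, in the same way as in the cell above.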

## 3. Fast AutoML with Parallelization 

Parallelization is one potentially effective way to reduce the wall-clock time of an AutoML run.

When you have parallel resources, you can either spend them on training each model and keep the model search sequential, or run multiple trials in parallel.
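
The trade-off between the two options can be sketched as simple arithmetic. The sketch assumes that `n_jobs` is the number of CPU cores used by each training job and `n_concurrent_trials` is the number of trials run at the same time, so that a run asks for roughly their product. The helper `cpu_plan` is hypothetical and not part of FLAML or Ray.

```python
def cpu_plan(num_cpus, n_jobs, n_concurrent_trials):
    """Summarize how a CPU budget is split between per-trial threads and parallel trials."""
    requested = n_jobs * n_concurrent_trials
    return {
        "requested": requested,                  # cores the run asks for in total
        "fits": requested <= num_cpus,           # True if the request fits the budget
        "idle": max(num_cpus - requested, 0),    # cores left unused
    }


# Parallel search: 2 concurrent trials with 2 cores each on a 4-core budget.
print(cpu_plan(num_cpus=4, n_jobs=2, n_concurrent_trials=2))
# {'requested': 4, 'fits': True, 'idle': 0}

# Sequential search: all 4 cores go to a single training job at a time.
print(cpu_plan(num_cpus=4, n_jobs=4, n_concurrent_trials=1))
# {'requested': 4, 'fits': True, 'idle': 0}
```

Both plans use the whole budget. Which one finishes faster depends on how well each learner scales with more threads versus how much the search benefits from trying more configurations at once.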

(1) To do parallel tuning with Ray, install the `ray` and `blendsearch` options:
```bash
pip install flaml[blendsearch,ray]
```

In [26]:
import ray
from flaml import AutoML
ray.shutdown()
ray.init(num_cpus=4)
automl = AutoML()
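settings = {
    "time_budget": 30,  # total running time in seconds
    "n_jobs": 2,  # number of CPU cores used by each training job
    "n_concurrent_trials": 2,  # number of trials run in parallel
}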
automl.fit(X_train, y_train, **settings)

2023-01-06 13:59:41,314	INFO tune.py:747 -- Total run time: 32.40 seconds (31.81 seconds for the tuning loop).
[flaml.automl.automl: 01-06 13:59:41] {3503} INFO - selected model: None
[flaml.automl.automl: 01-06 14:00:46] {3647} INFO - retrain xgb_limitdepth for 64.6s
[flaml.automl.automl: 01-06 14:00:46] {3654} INFO - retrained model: XGBClassifier(base_score=0.5, booster='gbtree', callbacks=[],
              colsample_bylevel=0.745214496109651, colsample_bynode=1,
              colsample_bytree=0.7487835192665638, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=0, gpu_id=-1, grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.05512783571344645,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=7, max_leaves=0,
              min_child_weight=2.3004805741571643, missing=nan,
              monoton

(2) To do parallel tuning with Spark, install the `spark` and `blendsearch` options and set `use_spark` to `True`. A more detailed example is in this notebook:
[Parallel tuning with Spark](https://colab.research.google.com/github/microsoft/FLAML/blob/tutorial-aaai23/notebook/integrate_spark.ipynb)
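
As a rough sketch, the settings for a Spark-backed run could mirror the Ray cell above with `use_spark` switched on. The name `spark_settings` is illustrative, and the `fit` call is left commented out because it assumes a working Spark environment; refer to the linked notebook for a complete, tested setup.

```python
# Settings mirroring the Ray cell above, with Spark as the parallel backend.
spark_settings = {
    "time_budget": 30,         # total running time in seconds
    "task": "classification",  # task type
    "n_concurrent_trials": 2,  # number of trials run in parallel
    "use_spark": True,         # run the parallel trials on Spark
}

# Assumed usage, left commented out because it needs a working Spark setup:
# automl = AutoML()
# automl.fit(X_train, y_train, **spark_settings)
```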
