We use a small dataset (Iris, or a subsample of an Airbnb listings dataset) to illustrate how different AutoML frameworks work: each framework performs model selection on the training set and is then evaluated on the test set. The error metric is the balanced error rate: for each class, we average the false positive rate and the false negative rate, and then we average these per-class values across all classes.
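
The metric described above can be sketched as follows. This is a minimal illustration that assumes a one-vs-rest treatment of each class; the function name balanced_error_rate is hypothetical, and it is not necessarily how util.error is implemented.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Average of (FPR + FNR) / 2 over classes, treating each class one-vs-rest.

    Illustrative sketch only; not taken from the util module used in this notebook.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for c in np.unique(y_true):
        pos = y_true == c          # samples that truly belong to class c
        neg = ~pos                 # samples that belong to any other class
        # false negative rate: class-c samples predicted as something else
        fnr = np.sum(pos & (y_pred != c)) / pos.sum()
        # false positive rate: other samples predicted as class c
        fpr = np.sum(neg & (y_pred == c)) / neg.sum() if neg.sum() > 0 else 0.0
        per_class.append(0.5 * (fpr + fnr))
    return float(np.mean(per_class))

# Small three-class example with two mistakes out of six predictions.
example_error = balanced_error_rate([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(example_error)
```

A perfect prediction gives 0.0 under this definition, and each class contributes equally regardless of how many samples it has.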

In [15]:
import sys
import pandas as pd
import os
import time
import numpy as np
import multiprocessing as mp

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [16]:
automl_path = 'oboe/automl/'
sys.path.append(automl_path)
from auto_learner import AutoLearner
import util

# disable warnings
import warnings
warnings.filterwarnings('ignore')

Prepare the dataset: we use either Iris or Airbnb, selected by the dataset variable below.

In [17]:
dataset = "airbnb" # airbnb or iris
airbnb_dataset_size = 1000 # number of points to keep in subsampling

if dataset == "airbnb":
    df_airbnb = pd.read_csv("airbnb.csv", index_col=None, header=0)
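    # NaN never compares equal to itself, so `price == np.nan` matches no rows;
    # drop rows with a missing price via dropna instead
    df_airbnb.dropna(subset=["price"], inplace=True)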
    features_real = [
      "host_listings_count",
      "host_total_listings_count",
      "accommodates",
      "bathrooms",
      "bedrooms",
      "guests_included",
      "extra_people",
      "minimum_nights",
      "maximum_nights",
      "availability_30",
      "availability_60",
      "availability_90",
      "availability_365",
      "number_of_reviews",
      "review_scores_rating",
      "review_scores_accuracy",
      "review_scores_cleanliness",
      "review_scores_checkin",
      "review_scores_communication",
      "review_scores_location",
      "price"
    ]

    label = ["review_scores_value"]
    x = df_airbnb[features_real].values
    y = df_airbnb[label].values.flatten()
    
    np.random.seed(0)
    idx_to_keep = np.random.choice(np.arange(y.shape[0]), size=airbnb_dataset_size, replace=False)
    x = x[idx_to_keep]
    y = y[idx_to_keep]
    

elif dataset == "iris":
    data = load_iris()
    x = np.array(data['data'])
    y = np.array(data['target'])
    
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)

# Part I: auto-sklearn

In [18]:
import autosklearn.classification
from autosklearn.metrics import balanced_accuracy

A wrapper function that configures and fits the auto-sklearn learner.

In [19]:
def AutoSklearn(total_runtime, train_features, train_labels):
    clf = autosklearn.classification.AutoSklearnClassifier(
            time_left_for_this_task=total_runtime,
            include_preprocessors=["no_preprocessing"],
            include_estimators = ["adaboost","gaussian_nb", "extra_trees", "gradient_boosting", 
                                 "liblinear_svc", "libsvm_svc","random_forest",
                                 "k_nearest_neighbors","decision_tree"],
    )
        
    clf.fit(train_features, train_labels, metric=balanced_accuracy)    
    return clf

Run auto-sklearn for 30 seconds.

In [20]:
runtime = 30
clf = AutoSklearn(runtime, x_train, y_train)

Time limit for a single run is higher than total time limit. Capping the limit for a single run to the total time given to SMAC (29.593113)
1
['/tmp/autosklearn_tmp_64801_2482/.auto-sklearn/ensembles/1.0000000000.ensemble', '/tmp/autosklearn_tmp_64801_2482/.auto-sklearn/ensembles/1.0000000001.ensemble', '/tmp/autosklearn_tmp_64801_2482/.auto-sklearn/ensembles/1.0000000002.ensemble', '/tmp/autosklearn_tmp_64801_2482/.auto-sklearn/ensembles/1.0000000003.ensemble']


In [21]:
y_pred_autosklearn = clf.predict(x_test)

Show which models the learner has picked.

In [22]:
clf.show_models()

"[(0.660000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'categorical_encoding:__choice__': 'one_hot_encoding', 'classifier:__choice__': 'random_forest', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'no_preprocessing', 'rescaling:__choice__': 'standardize', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'True', 'classifier:random_forest:bootstrap': 'True', 'classifier:random_forest:criterion': 'gini', 'classifier:random_forest:max_depth': 'None', 'classifier:random_forest:max_features': 0.5, 'classifier:random_forest:max_leaf_nodes': 'None', 'classifier:random_forest:min_impurity_decrease': 0.0, 'classifier:random_forest:min_samples_leaf': 1, 'classifier:random_forest:min_samples_split': 2, 'classifier:random_forest:min_weight_fraction_leaf': 0.0, 'classifier:random_forest:n_estimators': 100, 'categorical_encoding:one_hot_encoding:minimum_fraction': 0.01},\ndataset_properties={\n  'task': 2,\n  'sparse': False,\n  'multilabel': False,\n  'mul

Show the error on the test set.

In [23]:
util.error(y_test, y_pred_autosklearn, 'classification')

0.36055756843800324

# Part II: TPOT

TPOT is an AutoML tool that optimizes machine learning pipelines by genetic programming.

In [24]:
from tpot import TPOTClassifier

Run TPOT for 30 seconds.

In [25]:
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, max_time_mins=.5)
tpot.fit(x_train, y_train)

0.5264128666666666 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.9000000000000001, min_samples_leaf=1, min_samples_split=17, n_estimators=100)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=1000000,
               max_eval_time_mins=5, max_time_mins=0.5, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=20,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [26]:
y_pred_tpot = tpot.predict(x_test)

Show the error on the test set.

In [27]:
util.error(y_test, y_pred_tpot, 'classification')

0.3794949037476211

# Part III: Oboe

## Oboe Example 1: build an ensemble of models

In [28]:
#experimental settings
VERBOSE = False #whether to print out information indicating current fitting progress
N_CORES = 1 #number of cores
RUNTIME_BUDGET = 30

In [29]:
#optional: limit the types of algorithms
s = ['AB', 'ExtraTrees', 'GNB', 'KNN', 'RF', 'DT']

In [30]:
#autolearner arguments
autolearner_kwargs = {
    'p_type': 'classification',
    'runtime_limit': RUNTIME_BUDGET,
    'verbose': VERBOSE,
    'selection_method': 'min_variance',
    'algorithms': s,
    'stacking_alg': 'greedy',
    'n_cores': N_CORES,
    'build_ensemble': True,
}

In [31]:
# initialize the autolearner class
m = AutoLearner(**autolearner_kwargs)

In [33]:
# fit autolearner on training set and record runtime
start = time.time()
m.fit(x_train, y_train)
elapsed_time = time.time() - start

In [68]:
# use the fitted autolearner for prediction on test set
y_predicted = m.predict(x_test)
print("prediction error: {}".format(util.error(y_test, y_predicted, 'classification')))
print("elapsed time: {}".format(elapsed_time))
print("individual accuracies of selected models: {}".format(m.get_model_accuracy(y_test)))

prediction error: 0.26755324989020646
elapsed time: 20.5676531791687
individual accuracies of selected models: [0.2690905888596106, 0.4200885668276973, 0.4208333333333333, 0.2690905888596106]


In [69]:
# get names of the selected machine learning models
m.get_models()

{'ensemble method': 'greedy selection',
 'base learners': {'GNB': [{}, {}],
  'ExtraTrees': [{'min_samples_split': 0.1, 'criterion': 'gini'},
   {'min_samples_split': 64, 'criterion': 'gini'}]}}

## Oboe Example 2: just select a collection of promising models without building an ensemble afterwards

In [23]:
#experimental settings
VERBOSE = False #whether to print out information indicating current fitting progress
N_CORES = 1 #number of cores
RUNTIME_BUDGET = 30

In [24]:
#optional: limit the types of algorithms
s = ['AB', 'ExtraTrees', 'GNB', 'KNN', 'RF', 'DT']

In [25]:
#autolearner arguments
autolearner_kwargs = {
    'p_type': 'classification',
    'runtime_limit': RUNTIME_BUDGET,
    'verbose': VERBOSE,
    'selection_method': 'min_variance',
    'algorithms': s,
    'stacking_alg': 'greedy',
    'n_cores': N_CORES,
    'build_ensemble': False,
}

In [26]:
# initialize the autolearner class
m = AutoLearner(**autolearner_kwargs)

In [27]:
# fit autolearner on training set and record runtime
start = time.time()
m.fit(x_train, y_train)
elapsed_time = time.time() - start

In [28]:
# use the fitted autolearner for prediction on test set
y_predicted = m.predict(x_test)
 
print("elapsed time: {}".format(elapsed_time))
print("accuracies of selected models: {}".format(m.get_model_accuracy(y_test)))

elapsed time: 11.177161693572998
accuracies of selected models: [0.3487545289855073, 0.3487545289855073, 0.3577999194847021, 0.280049590103938, 0.35440821256038646, 0.2487973027375201, 0.26432028619528614, 0.25504844642072905, 0.26432028619528614, 0.34665106682769725, 0.34665106682769725, 0.34665106682769725, 0.34665106682769725, 0.41362218196457323, 0.2690905888596106, 0.40555303945249593, 0.41857387278582936, 0.5, 0.41021286231884063, 0.2563190784658176, 0.40940633691992384, 0.41021286231884063, 0.2563190784658176, 0.3210897471087688, 0.3773461517347387, 0.3773461517347387, 0.3773461517347387, 0.3773461517347387, 0.27385014090177134, 0.1856392274191187, 0.27680015737080954, 0.27520632045088567, 0.2823746980676329, 0.26924933208900603, 0.2413370571658615, 0.4131514419557898, 0.3722901570048309, 0.37120571658615137, 0.28960849436392916, 0.386694847020934, 0.41977153784218996, 0.3922327898550725, 0.3927083333333333, 0.2823746980676329, 0.26924933208900603, 0.27385014090177134, 0.1856392

Note that if we do not build an ensemble, we do not get a single accuracy value. Instead, we have a collection of fitted models, each with its own reported value.
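
Since there is no single number in this setting, one option is to summarize the per-model values ourselves. The sketch below uses the first five values from the output above (the printed list is truncated, so the rest are omitted). The helper name summarize and the lower_is_better flag are hypothetical; whether lower is better depends on whether the reported values are errors or accuracies, which is an assumption here.

```python
import numpy as np

# First five per-model values copied from the printed output above.
scores = np.array([
    0.3487545289855073,
    0.3487545289855073,
    0.3577999194847021,
    0.280049590103938,
    0.35440821256038646,
])

def summarize(scores, lower_is_better=True):
    """Return the index and value of the best model plus simple summary statistics.

    lower_is_better=True assumes the values are error rates (an assumption).
    """
    scores = np.asarray(scores, dtype=float)
    best = int(np.argmin(scores) if lower_is_better else np.argmax(scores))
    return {
        "best_index": best,
        "best": float(scores[best]),
        "mean": float(scores.mean()),
        "n_models": int(scores.size),
    }

summary = summarize(scores)
print(summary)
```

Note that picking the best model by its test-set value, as done here for illustration, would give an optimistic estimate if that value were then reported as the final result.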

The following shows which models we have picked.

In [29]:
m.get_models()

{'DT': [{'min_samples_split': 128},
  {'min_samples_split': 128},
  {'min_samples_split': 64},
  {'min_samples_split': 32},
  {'min_samples_split': 256},
  {'min_samples_split': 16},
  {'min_samples_split': 8},
  {'min_samples_split': 4},
  {'min_samples_split': 0.01},
  {'min_samples_split': 0.001},
  {'min_samples_split': 1e-05},
  {'min_samples_split': 0.0001},
  {'min_samples_split': 2},
  {'min_samples_split': 512},
  {'min_samples_split': 1024}],
 'RF': [{'min_samples_split': 128, 'criterion': 'gini'},
  {'min_samples_split': 2, 'criterion': 'gini'},
  {'min_samples_split': 2, 'criterion': 'entropy'},
  {'min_samples_split': 4, 'criterion': 'gini'},
  {'min_samples_split': 4, 'criterion': 'entropy'},
  {'min_samples_split': 8, 'criterion': 'gini'},
  {'min_samples_split': 8, 'criterion': 'entropy'},
  {'min_samples_split': 16, 'criterion': 'gini'},
  {'min_samples_split': 16, 'criterion': 'entropy'},
  {'min_samples_split': 32, 'criterion': 'gini'},
  {'min_samples_split': 32, 'c