# <font color='#B31B1B'> AutoML </font>

In this section we'll dive into the world of automated machine learning! The high-level goal of AutoML is, believe it or not, to automate the process of selecting machine learning models, hyperparameters, and even pre-processing pipelines. A large chunk of this notebook is adapted from the excellent examples in the OBOE repository.

In [1]:
from oboe import AutoLearner, error  # This may take around 15 seconds at first run.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

import warnings
warnings.filterwarnings('ignore')

### <font color='#B31B1B'> Wisconsin Breast Cancer Data </font>
We are going to use the [Wisconsin Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It contains measurements computed from images of breast cancer cells. The goal is to predict whether a tumor is benign or malignant.

The images look like this one:

![title](media/breast_cancer.png)

In [2]:
data = load_breast_cancer()
x = np.array(data['data'])
y = np.array(data['target'])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
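
Because the split above does not fix `random_state`, every run of the notebook draws a different train/test partition, so the accuracies reported in this notebook will shift slightly from run to run. A minimal sketch of a reproducible, class-balanced split follows. The seed value `0`, the use of `stratify`, and the shortened variable names are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

x, y = load_breast_cancer(return_X_y=True)

# Fixing random_state makes the split repeatable; stratify=y keeps the
# benign/malignant ratio (almost) the same in both partitions.
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

print(x_tr.shape, x_te.shape)
```

With a fixed seed, two people running the notebook get the same partition, which makes model comparisons easier to discuss.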

In [3]:
print(data['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

### <font color='#B31B1B'> Baseline Approach </font>
We've learned some fancy machine learning approaches. As a baseline, let's fit a random forest to predict whether each tumor is benign or malignant. We'll start by just using the defaults of scikit-learn's random forest implementation.

In [4]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(x_train, y_train)

RandomForestClassifier()

In [5]:
#Compute our test set accuracy
(clf.predict(x_test) == y_test).mean()

0.956140350877193
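
To put this number in context, it helps to translate accuracy back into counts. The sketch below assumes the default behavior of `train_test_split`, which rounds the test fraction up and so gives 114 test samples out of 569. The standard-error formula is the usual binomial approximation and is included only as a rough yardstick.

```python
import math

n_samples = 569
n_test = math.ceil(0.2 * n_samples)   # test_size=0.2, rounded up
accuracy = 0.956140350877193          # baseline test accuracy from above

n_correct = round(accuracy * n_test)
n_wrong = n_test - n_correct

# Binomial standard error of an accuracy estimated on n_test samples.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(f"{n_correct}/{n_test} correct, {n_wrong} mistakes")
print(f"one sample is worth {1 / n_test:.4f} accuracy; std. error ~ {std_err:.3f}")
```

On a test set this small, a difference of one percentage point between two models amounts to roughly one tumor, which is well within the noise of the split.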

Let's see if we can beat this baseline with some tools from AutoML.

### <font color='#B31B1B'> Automated Hyperparameter Tuning </font>
The simplest form of automated machine learning is to automatically pick model hyperparameters. In general, there are two popular approaches: grid search and randomized grid search. Check out this <a href='https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf'> paper </a> for an analysis of the differences. Luckily, scikit-learn has implementations of both!

![title](media/grid_search.png)

In [6]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

Let's check out what parameters we can tinker with in our random forest model.

In [7]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We are going to define the hyperparameter grid search space. For grid search we need to define every value we want to test.

In [8]:
# Random forest hyperparameters to search over
search_parameters_space = {
    "n_estimators": [1, 10, 100, 300],  # number of trees in the forest
    "bootstrap": [True, False],  # whether each tree is fit on a bootstrap sample
    "ccp_alpha": [0.0, 0.01, 0.1, 1, 10],  # cost-complexity pruning strength
}
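
Grid search is exhaustive: it tries the Cartesian product of all listed values. As a quick sanity check on the cost, the sketch below enumerates the grid with the standard library. The 5-fold figure assumes scikit-learn's default `cv=5`, and the helper variable names are illustrative.

```python
from itertools import product

search_parameters_space = {
    "n_estimators": [1, 10, 100, 300],
    "bootstrap": [True, False],
    "ccp_alpha": [0.0, 0.01, 0.1, 1, 10],
}

names = list(search_parameters_space)
candidates = [
    dict(zip(names, values))
    for values in product(*search_parameters_space.values())
]

n_folds = 5  # assumed: scikit-learn's default number of CV folds
print(len(candidates), "candidates")
print(len(candidates) * n_folds, "model fits, plus one final refit on all training data")
print(candidates[0])
```

Adding one more value to any list multiplies the total, which is why grids grow expensive quickly.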

We create the grid search by passing the estimator, the grid search parameter dictionary, and the scoring function we want to optimize.

In [9]:
grid = GridSearchCV(estimator=clf, 
                    param_grid=search_parameters_space,
                    scoring="accuracy",
                    n_jobs=-1)

In [10]:
grid

GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'ccp_alpha': [0.0, 0.01, 0.1, 1, 10],
                         'n_estimators': [1, 10, 100, 300]},
             scoring='accuracy')

`GridSearchCV` works like any other estimator: it has a `fit` method that performs the search.

**Note**: If the next step takes too long, your laptop may not be powerful enough to run this search. Try restarting your notebook (you might have to kill it and restart) and shrinking the search, for example by dropping the largest `n_estimators` value from the grid.

In [11]:
%%time
grid.fit(x_train, y_train)

CPU times: user 773 ms, sys: 105 ms, total: 878 ms
Wall time: 8.75 s


GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'ccp_alpha': [0.0, 0.01, 0.1, 1, 10],
                         'n_estimators': [1, 10, 100, 300]},
             scoring='accuracy')

Now we can see the performance of the best hyperparameter combination found by the grid search.

In [12]:
print(grid.best_score_)

0.9692307692307693


We can see the best params:

In [13]:
grid.best_params_

{'bootstrap': False, 'ccp_alpha': 0.0, 'n_estimators': 300}

And we can use the best trained estimator:

In [14]:
grid.best_estimator_ # this gets the best model all packaged for us.

RandomForestClassifier(bootstrap=False, n_estimators=300)

After fitting, `GridSearchCV` stores the results for every combination in the attribute `cv_results_`, which we can view as a dataframe for convenience.

In [15]:
pd.DataFrame(grid.cv_results_).sort_values(by="rank_test_score").head()
# shows the cross-validated scores of the five best combinations in our grid search

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_bootstrap | param_ccp_alpha | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 0.705604 | 0.011363 | 0.057265 | 0.003251 | False | 0.0 | 300 | {'bootstrap': False, 'ccp_alpha': 0.0, 'n_esti... | 0.989011 | 0.967033 | 0.956044 | 0.967033 | 0.967033 | 0.969231 | 0.010767 | 1 |
| 1 | 0.024883 | 0.000886 | 0.002305 | 0.000184 | True | 0.0 | 10 | {'bootstrap': True, 'ccp_alpha': 0.0, 'n_estim... | 0.967033 | 0.978022 | 0.956044 | 0.978022 | 0.956044 | 0.967033 | 0.009829 | 2 |
| 22 | 0.218859 | 0.006316 | 0.01533 | 0.001111 | False | 0.0 | 100 | {'bootstrap': False, 'ccp_alpha': 0.0, 'n_esti... | 0.967033 | 0.967033 | 0.956044 | 0.967033 | 0.967033 | 0.964835 | 0.004396 | 3 |
| 2 | 0.218427 | 0.001245 | 0.015191 | 0.00095 | True | 0.0 | 100 | {'bootstrap': True, 'ccp_alpha': 0.0, 'n_estim... | 0.978022 | 0.945055 | 0.956044 | 0.967033 | 0.967033 | 0.962637 | 0.011207 | 4 |
| 3 | 0.665528 | 0.0117 | 0.047676 | 0.006044 | True | 0.0 | 300 | {'bootstrap': True, 'ccp_alpha': 0.0, 'n_estim... | 0.978022 | 0.945055 | 0.956044 | 0.967033 | 0.967033 | 0.962637 | 0.011207 | 4 |
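
The summary columns are derived from the per-split scores. As an illustration, the sketch below recomputes `mean_test_score`, `std_test_score`, and the winner from the split scores of three rows in the table above, using only the standard library. The population standard deviation is assumed here because it reproduces the reported values.

```python
from statistics import mean, pstdev

# Per-fold accuracies copied from the table above, keyed by row index.
split_scores = {
    23: [0.989011, 0.967033, 0.956044, 0.967033, 0.967033],
    1: [0.967033, 0.978022, 0.956044, 0.978022, 0.956044],
    22: [0.967033, 0.967033, 0.956044, 0.967033, 0.967033],
}

summary = {
    row: (mean(scores), pstdev(scores)) for row, scores in split_scores.items()
}

# The best candidate is the one with the highest mean validation score.
best_row = max(summary, key=lambda row: summary[row][0])

for row, (avg, std) in summary.items():
    print(row, round(avg, 6), round(std, 6))
print("best row:", best_row)
```

Note that the top candidates are separated by less than one standard deviation of their fold scores, so the ranking among them is not very stable.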


For randomized grid search, we define a distribution to sample each parameter from instead of listing every possible value.

In [16]:
from scipy.stats import randint as sp_randint
from scipy.stats import uniform 

param_dist_random = {
    "bootstrap": [True, False],
    "ccp_alpha": uniform(loc=0.1, scale=2),
    "n_estimators": sp_randint(1, 1000),
}

In [17]:
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_dist_random,
    scoring="accuracy",
    n_jobs=-1,
    n_iter=50)

In [18]:
%%time
random_search.fit(x_train, y_train)

CPU times: user 1.54 s, sys: 58.6 ms, total: 1.6 s
Wall time: 51.4 s


RandomizedSearchCV(estimator=RandomForestClassifier(), n_iter=50, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'ccp_alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x163d8a130>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x163da16a0>},
                   scoring='accuracy')

However, this is a cautionary tale: you have to pick your distributions well. The randomized search ends up with a lower test accuracy than our untuned baseline:

In [19]:
(random_search.predict(x_test) == y_test).mean()

0.9385964912280702

In [20]:
random_search.best_estimator_

RandomForestClassifier(bootstrap=False, ccp_alpha=0.27856553766808234,
                       n_estimators=710)
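
One likely reason for the drop: `uniform(loc=0.1, scale=2)` only produces values in [0.1, 2.1], so the search never tried the small `ccp_alpha` values that grid search preferred. When a parameter's useful values span several orders of magnitude, sampling on a log scale is a common alternative. The sketch below illustrates the difference with the standard library only. The bounds `1e-4` and `1e-1` and both helper functions are arbitrary illustrative choices. (`scipy.stats.loguniform` provides a comparable distribution that can be passed to `RandomizedSearchCV`.)

```python
import math
import random

rng = random.Random(0)

def sample_uniform(loc, scale):
    """Draw from [loc, loc + scale], mirroring scipy's uniform(loc, scale)."""
    return loc + scale * rng.random()

def sample_log_uniform(low, high):
    """Draw so that log(value) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

uniform_draws = [sample_uniform(0.1, 2) for _ in range(1000)]
log_draws = [sample_log_uniform(1e-4, 1e-1) for _ in range(1000)]

print("uniform range:    ", min(uniform_draws), max(uniform_draws))
print("log-uniform range:", min(log_draws), max(log_draws))
print("uniform draws below 0.1:", sum(d < 0.1 for d in uniform_draws))
```

The uniform sampler never produces a value below 0.1, while the log-uniform sampler spreads its draws evenly across each order of magnitude between the bounds.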

## <font color='#B31B1B'> OBOE </font>
Hyperparameter tuning can only get you so far, and unfortunately manually testing every model with every hyperparameter setting can be time-consuming and a drain on computational resources. Instead, we'll turn to OBOE, an AutoML tool based on generalized low rank models. OBOE will help pick both the hyperparameters AND the model itself. For more information, check out the paper <a href='https://arxiv.org/abs/1808.03233'> here</a>. You can install OBOE using pip (check out the <a href= 'https://github.com/udellgroup/oboe'> github page </a> for more details).

![title](media/oboe.png)
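
The intuition behind the low rank approach is collaborative filtering: if the errors of many models across many datasets are well approximated by a low rank matrix, then measuring a few cheap models on a new dataset is enough to estimate how the remaining models would do. The toy sketch below illustrates this with rank 1 and made-up numbers. The model factors and observed errors are hypothetical, and this is not OBOE's actual implementation.

```python
# Hypothetical rank-1 latent factors for four models, as if learned offline
# from an error matrix over many datasets (error ~ dataset_factor * model_factor).
model_factors = {"RF": 0.9, "GBT": 1.0, "Logit": 1.4, "KNN": 1.6}

# Made-up validation errors measured on a new dataset for two cheap models.
observed = {"RF": 0.045, "Logit": 0.07}

def fit_dataset_factor(observed, model_factors):
    """Least-squares estimate of the new dataset's single latent factor."""
    numerator = sum(err * model_factors[m] for m, err in observed.items())
    denominator = sum(model_factors[m] ** 2 for m in observed)
    return numerator / denominator

dataset_factor = fit_dataset_factor(observed, model_factors)

# Predict the error of every model, including the ones we never ran.
predicted = {m: dataset_factor * f for m, f in model_factors.items()}

untested = [m for m in model_factors if m not in observed]
next_model = min(untested, key=predicted.get)

print("estimated dataset factor:", round(dataset_factor, 4))
print("predicted errors:", {m: round(e, 4) for m, e in predicted.items()})
print("most promising untested model:", next_model)
```

In this toy setting, two measurements are enough to rank the untested models, which is the kind of saving that makes a fixed runtime budget workable.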

In [21]:
from oboe import AutoLearner, error  # This may take around 15 seconds at first run.

In [22]:
method = 'Oboe'
problem_type = 'classification'

m = AutoLearner(p_type=problem_type, runtime_limit=30, method=method, verbose=False)
m.fit(x_train, y_train)



{'ranks': [8, 9, 9],
 'runtime_limits': [1, 2, 4],
 'validation_loss': [0.5,
  0.020000000000000018,
  0.060000000000000026,
  0.05136363636363636],
 'filled_new_row': [array([[ 4.84731972e-01,  5.09012629e-01, -3.58292541e-02,
          -2.19050322e-01, -2.47586852e-01,  4.96896229e-01,
           4.97452990e-01, -1.04984631e-01, -2.86319762e-01,
          -3.10185933e-01,  6.49313311e-02,  6.83359657e-02,
           7.98323521e-02,  7.37782016e-02,  7.99442820e-02,
           1.17227625e-01,  9.22195189e-02,  1.05909439e-01,
           4.92541813e-01,  5.07353721e-01,  6.53291678e-02,
           6.50020833e-02,  6.49313311e-02,  6.49313311e-02,
          -2.25282176e-02, -6.27848089e-04, -2.14766790e-02,
          -3.30322039e-03, -2.64395814e-02, -8.01301427e-04,
          -4.36383859e-02, -7.25822025e-03, -4.82271852e-02,
          -1.11749801e-02, -8.59159526e-02, -3.59560815e-02,
          -1.04333248e-01, -5.36676983e-02, -3.80741335e-02,
           7.91809133e-03,  2.86574878e-

In [23]:
#Let's check out our test set accuracy
(m.predict(x_test) == y_test).mean()

0.9473684210526315

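With a 30-second budget and no manual tuning, OBOE reaches about 94.7% test accuracy on this split. That is within one test sample (out of 114) of our random forest baseline at 95.6%, and ahead of the randomized search at 93.9%.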

In [24]:
# get names of the selected machine learning models
m.get_models()

{'ensemble method': 'select at most 5 pipelines with smallest cv error',
 'base learners': {'GBT': [{'learning_rate': 0.05,
    'max_depth': 3,
    'max_features': 'log2'},
   {'learning_rate': 0.1, 'max_depth': 3, 'max_features': 'log2'}],
  'RF': [{'min_samples_split': 64, 'criterion': 'gini'}],
  'Logit': [{'C': 2, 'solver': 'liblinear', 'penalty': 'l2'}],
  'ExtraTrees': [{'min_samples_split': 0.1, 'criterion': 'gini'}]}}
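
The returned dictionary groups the ensemble members by algorithm. Below is a small sketch of flattening it into one entry per pipeline, with the output above pasted in as a literal (no OBOE call is made here, and the variable names are illustrative).

```python
models = {
    "ensemble method": "select at most 5 pipelines with smallest cv error",
    "base learners": {
        "GBT": [
            {"learning_rate": 0.05, "max_depth": 3, "max_features": "log2"},
            {"learning_rate": 0.1, "max_depth": 3, "max_features": "log2"},
        ],
        "RF": [{"min_samples_split": 64, "criterion": "gini"}],
        "Logit": [{"C": 2, "solver": "liblinear", "penalty": "l2"}],
        "ExtraTrees": [{"min_samples_split": 0.1, "criterion": "gini"}],
    },
}

# One (algorithm, hyperparameters) pair per ensemble member.
pipelines = [
    (algorithm, hyperparameters)
    for algorithm, configs in models["base learners"].items()
    for hyperparameters in configs
]

for algorithm, hyperparameters in pipelines:
    print(f"{algorithm:<10} {hyperparameters}")
print(len(pipelines), "pipelines in the ensemble")
```

The count matches the "at most 5 pipelines" rule stated in the `ensemble method` entry.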

So far, we haven't played around with all the different settings OBOE gives us. Let's check out what we can change:

In [25]:
AutoLearner?

We can start by setting some experiment limits (this is important because the quality of our solution generally improves with more computation time).

In [26]:
#experimental settings
VERBOSE = False #whether to print out information indicating current fitting progress
N_CORES = 1 #number of cores
RUNTIME_BUDGET = 30 #runtime limit in seconds

We can even limit the types of models we want to consider (this is great for applications where we might need an interpretable model!).

In [27]:
#optional: limit the types of algorithms
s = ['AB', 'ExtraTrees', 'GNB', 'KNN', 'RF', 'DT']

We can also decide whether or not we want to build an ensemble solution (for this demo, we'll build an ensemble).

In [28]:
#autolearner arguments
autolearner_kwargs = {
    'p_type': 'classification',
    'method': method,
    'runtime_limit': RUNTIME_BUDGET,
    'verbose': VERBOSE,
    'selection_method': 'ED',
    'algorithms': s,
    'stacking_alg': 'greedy',
    'n_cores': N_CORES,
    'build_ensemble': True,
}

In [29]:
m = AutoLearner(**autolearner_kwargs)

start = time.time()
m.fit(x_train, y_train)
print('It took ',time.time() - start,' seconds to train our AutoML model!')

It took  27.175869941711426  seconds to train our AutoML model!


In [30]:
#Let's check out our test set accuracy
(m.predict(x_test) == y_test).mean()

0.956140350877193

Our test accuracy is now 95.6%, matching the random forest baseline, even though we've restricted the kinds of models we're considering!

In [31]:
# get names of the selected machine learning models
m.get_models()

{'ensemble method': 'select at most 5 pipelines with smallest cv error',
 'base learners': {'AB': [{'n_estimators': 100, 'learning_rate': 1.5},
   {'n_estimators': 50, 'learning_rate': 1},
   {'n_estimators': 100, 'learning_rate': 1}],
  'ExtraTrees': [{'min_samples_split': 0.01, 'criterion': 'gini'},
   {'min_samples_split': 16, 'criterion': 'gini'}]}}

## <font color='#B31B1B'> Tensor OBOE </font>
So far we've only seen OBOE pick a model and its hyperparameters, but there's an extension called <a href='https://people.ece.cornell.edu/cy/_papers/tensor_oboe.pdf'> TensorOBOE </a> that can pick the pre-processing pipeline too. Programmatically, the code is nearly identical; we just need to change our method.

In [32]:
method = 'tensoroboe'  # now using TensorOboe

In [33]:
# We can again just use it straight out of the box without checking out the hyperparameters
m = AutoLearner(p_type='classification', runtime_limit=50, method=method, verbose=True)
m.fit(x_train, y_train, categorical=None) # TensorOboe accepts the list of feature types

rank for EM-Tucker imputation: (20, 4, 2, 2, 8, 20)
shape of the error tensor: (551, 4, 2, 2, 8, 183)
Loading latent factors from storage ...
Loading saved runtime predictors ...

Shape of training dataset: 455 data points, 30 features
Splitting training set into training and validation ..
Predicting pipeline running time ..
runtime limit of initial round: 32.0 seconds
fitting and kfold_fit_validating the best-on-average pipeline
Pipeline fitting completed.
Fitted an ensemble with size 1
having a capped running time of 32 seconds
Fitted an ensemble with size 1
Fitted an ensemble with size 1
Fitted an ensemble with size 1
Fitted an ensemble with size 1
Fitted an ensemble with size 1
Doubling process started ...
Fitting with ranks=(20, 4, 2, 2, 8, 18), t=32.0

Single round runtime target: 32.0
Fitting AutoLearner with maximum runtime 32.0 seconds
Selecting an initial set of models to evaluate ...
greedy_initialization
[0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
Sampling 8 entries of new row...
hav

{'ranks': [(20, 4, 2, 2, 8, 18)],
 'runtime_limits': [32.0],
 'validation_loss': [0.5, 0.0029411764705882305, 0.0],
 'filled_new_row': [array([[nan, nan, nan, ..., nan, nan, nan]])],
 'predicted_new_row': [array([[-0.01471546, -0.00767647,  0.0423596 , ..., -0.01764392,
          -0.01993165, -0.01932104]])],
 'actual_runtimes': [10.237134218215942],
 'sampled_indices': [{8541,
   11711,
   12255,
   12256,
   12257,
   12258,
   12259,
   12260,
   12271,
   19479}],
 'models': [<oboe.ensemble.Ensemble at 0x163d3f6d0>]}
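
The verbose log reports an error tensor of shape `(551, 4, 2, 2, 8, 183)`. One plausible reading, assumed here and not confirmed by the log itself, is that the first axis indexes training datasets and the remaining axes index the choices for imputer, encoder, standardizer, dimensionality reducer, and estimator. Under that assumption, the number of candidate pipelines is the product of the last five axes:

```python
import math

error_tensor_shape = (551, 4, 2, 2, 8, 183)

# Assumed axis meaning; the stage names follow the keys that appear in the
# output of m.get_models() for TensorOboe.
stage_names = ["imputer", "encoder", "standardizer", "dim_reducer", "estimator"]
choices_per_stage = dict(zip(stage_names, error_tensor_shape[1:]))

n_pipelines = math.prod(choices_per_stage.values())

print(choices_per_stage)
print(n_pipelines, "candidate pipelines")
```

A space of that size is far too large to evaluate exhaustively within a 50-second budget, which is consistent with the log showing only a handful of sampled entries.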

In [34]:
#Note that our output now includes pre-processing!
m.get_models()

{'ensemble method': 'select at most 5 pipelines with smallest cv error',
 'base learners': [{'imputer': {'algorithm': 'SimpleImputer',
    'hyperparameters': {'strategy': 'median'}},
   'encoder': {'algorithm': 'OneHotEncoder',
    'hyperparameters': {'handle_unknown': 'ignore', 'sparse': 0}},
   'standardizer': {'algorithm': 'StandardScaler', 'hyperparameters': {}},
   'dim_reducer': {'algorithm': 'SelectKBest', 'hyperparameters': {'k': 22}},
   'estimator': {'algorithm': 'ExtraTrees',
    'hyperparameters': {'min_samples_split': 1e-05, 'criterion': 'entropy'}}},
  {'imputer': {'algorithm': 'SimpleImputer',
    'hyperparameters': {'strategy': 'most_frequent'}},
   'encoder': {'algorithm': None},
   'standardizer': {'algorithm': None},
   'dim_reducer': {'algorithm': 'PCA',
    'hyperparameters': {'n_components': 15}},
   'estimator': {'algorithm': 'ExtraTrees',
    'hyperparameters': {'min_samples_split': 1e-05, 'criterion': 'entropy'}}},
  {'imputer': {'algorithm': 'SimpleImputer',
 