# OnePanel AutoML 0.1a

AutoML is a framework for building automated machine learning pipelines easily and declaratively. Pipelines currently run locally; running them on a cluster is planned (TBD).

The framework can be easily extended with new features.
Currently, AutoML is integrated with popular open-source machine learning libraries such as Scikit-learn and Hyperopt.

In [2]:
import sys
# AutoML uses Python's logging module
import logging

# Various sklearn models and metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
from xgboost.sklearn import XGBClassifier

# AutoML classes
from automl.pipeline import LocalExecutor, Pipeline, PipelineStep, PipelineData
from automl.data.dataset import Dataset
from automl.model import ModelSpace, CV, Validate, ChooseBest
from automl.hyperparam.templates import (random_forest_hp_space, 
                                         knn_hp_space, svc_kernel_hp_space, 
                                         grad_boosting_hp_space, 
                                         xgboost_hp_space)
from automl.feature.generators import FormulaFeatureGenerator
from automl.feature.selector import FeatureSelector
from automl.hyperparam.hyperopt import Hyperopt
from automl.combinators import RandomChoice

logging.basicConfig(level=logging.INFO)
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

# Core concepts
Key concepts in AutoML are:
* `Pipeline` - a machine learning pipeline. It executes its steps in order, passing each step's output as the input to the next step.
* `PipelineStep` - a single step; every `Pipeline` consists of steps. AutoML provides many different steps out of the box.
* `Executor` - executes a pipeline. Currently AutoML provides `LocalExecutor`, which runs a pipeline locally. Future versions will have a built-in `DistributedExecutor`.
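
To make the three concepts above more concrete, here is a minimal pure-Python sketch of how `>>` and `<<` can be wired up with operator overloading. It is not AutoML's implementation: `ToyStep`, `ToyPipeline` and `ToyExecutor` are hypothetical names, and the only behavior assumed is the one described above (each step receives the previous step's output plus a shared context).

```python
class ToyStep:
    """Hypothetical stand-in for PipelineStep: a name plus a callable."""
    def __init__(self, name, func):
        self.name = name
        self.func = func


class ToyPipeline:
    """Hypothetical stand-in for Pipeline: an ordered list of steps."""
    def __init__(self):
        self.steps = []

    def __rshift__(self, step):
        # `pipeline >> step` appends the step and returns the pipeline,
        # so several `>>` calls can be chained
        self.steps.append(step)
        return self


class ToyExecutor:
    """Hypothetical stand-in for LocalExecutor."""
    def __init__(self, inp=None):
        self.inp = inp
        self.context = {}

    def __lshift__(self, pipeline):
        # `executor << pipeline` runs every step in order, feeding each
        # step's output into the next one
        value = self.inp
        for step in pipeline.steps:
            value = step.func(value, self.context)
        return self.context, value


context, result = ToyExecutor(3) << (
    ToyPipeline()
    >> ToyStep('add_one', lambda inp, ctx: inp + 1)
    >> ToyStep('double', lambda inp, ctx: inp * 2))
print(result)  # (3 + 1) * 2 = 8
```

The parentheses around the pipeline matter: `<<` and `>>` share the same precedence in Python and associate left to right, so without them the executor would be combined with the bare `ToyPipeline()` first.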

AutoML can easily be extended by implementing your own `PipelineStep`s. Next, we will use various built-in `PipelineStep`s to create an automated classification pipeline.

In [3]:
# Let's create a dataset first
x, y = make_classification(
      n_samples=1000,
      n_features=40,
      n_informative=2,
      n_redundant=10,
      flip_y=0.05)

# We will use the AutoML Dataset class to wrap our data
# into a structure that AutoML understands
data = Dataset(x, y)

# Next, we define our ModelSpace. ModelSpace is initialized with a list of tuples.
# The first element of each tuple should be an sklearn-like estimator with a fit method.
# The second is a dictionary of model parameters. Here we do not define parameters
# explicitly, but use hyperparameter templates from AutoML. Those templates can be
# used later by the Hyperopt step to find the best model parameters automatically
model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (KNeighborsClassifier, knn_hp_space(lambda key: key)),
      (XGBClassifier, xgboost_hp_space())
  ]


# Create executor, initialize it with our classification dataset 
# and set total number of epochs to 2 (the pipeline will be run two times in a row).
# We can load any pipeline into executor using << operator like below:
context, pipeline_data = LocalExecutor(data, epochs=2) << \
    (Pipeline() # Here we define the pipeline. Steps can be added to pipeline using >> operator
     # First we define our ModelSpace. We wrap it with PipelineStep class 
     # and set initializer=True so that ModelSpace step will be run only at the first epoch
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     # But we are not obliged to wrap all steps with PipelineStep.
     # This will be done automatically if we do not need to set any special parameters 
     # We use FormulaFeatureGenerator to create arithmetic combinations of features from the dataset
     >> FormulaFeatureGenerator(['+', '-', '*']) 
     # Next we use Hyperopt to find the best combination of hyperparameters for each model
     # We use test set validation with ROC AUC metric as a score function.
     # CV could be used instead of Validate to perform cross-validation
     >> Hyperopt(Validate(test_size=0.1, metrics=roc_auc_score), max_evals=5) 
     # Then we choose the best performing model we found
     >> ChooseBest(1)
     # And select 10 best features
     >> FeatureSelector(10))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 40, new feature number - 50
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x117422be0>, 'max_features': <hyperopt.pyll.base.Apply object at 0x117422f98>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x11742a2e8>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x11742a6a0>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x11742a7f0>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x11742a908>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_

hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.190084
hyperopt.tpe - INFO - tpe_transform took 0.001345 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.077880
hyperopt.tpe - INFO - tpe_transform took 0.001224 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.077880
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
Hyperopt - INFO - {'max_depth': <hyperopt.pyll.base.Apply object at 0x11742ae10>, 'learning_rate': <hyperopt.pyll.base.Apply object at 0x11742af98>, 'n_estimators': <hyperopt.pyll.base.Apply object at 0x11743a1d0>, 'gamma': <hyperopt.pyll.base.Apply object at 0x11743a358>, 'min_child_weight': <hyperopt.pyll.base.Apply object at 0x11743a4e0>, 'max_delta_step': 0, 'subsample': <hyperopt.pyll.base.Apply object at 0x11743a630>, 'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x11743a780>, 'colsample_bylevel': <hyperopt.pyll.base.Apply object at 0x11743a8d0>, 'reg_alpha': <hyperopt.pyll.bas

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=26, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=70, n_jobs=1,
            oob_score=False, random_state=3, verbose=False,
            warm_start=False) 0.9315535929345644
(1000, 10)
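
The interaction between `epochs` and `initializer=True` described in the comments above can be pictured with a small pure-Python loop. This is only a sketch under assumptions, not AutoML's actual executor: `run_epochs` and its tuple-based step format are hypothetical, and it assumes that the output of one epoch is fed into the next.

```python
def run_epochs(steps, data, epochs):
    """Run `steps` for a number of epochs.

    steps: list of (name, func, initializer) tuples, where func takes
    (data, context) and returns the new data.
    """
    context = {}
    executed = []  # (epoch, step name) pairs, in execution order
    for epoch in range(epochs):
        for name, func, initializer in steps:
            if initializer and epoch > 0:
                # Initializer steps run only during the first epoch
                continue
            data = func(data, context)
            executed.append((epoch, name))
    return data, executed


steps = [
    ('model space', lambda data, context: data, True),
    ('generate features', lambda data, context: data + 1, False),
]
result, executed = run_epochs(steps, 0, epochs=2)
print(executed)
# [(0, 'model space'), (0, 'generate features'), (1, 'generate features')]
print(result)  # 2
```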

# Extending AutoML

First, let's see how `PipelineStep`s are created by building a simple hello-world pipeline.

In [4]:
# Let's create a simple pipeline
pipeline = Pipeline() >> PipelineStep('hello_step', lambda inp, context: print("Hello!"))

# And execute it locally
LocalExecutor() << pipeline

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'hello_step'
100%|██████████| 1/1 [00:00<00:00, 845.80it/s]

Hello!

(<automl.pipeline.PipelineContext at 0x11741b550>, None)

As you can see, steps are added to a pipeline using the `>>` operator. A pipeline may contain any number of steps. A `PipelineStep` is constructed by passing a step name and a `callable`, which is executed when the `Pipeline` is run by an `Executor`. Note that all `Pipeline`s are lazy: the steps inside are executed only when the `Pipeline` is loaded into an `Executor`.
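
The laziness mentioned above can be illustrated without AutoML at all. In the sketch below, `LazyPipeline` and `record` are hypothetical helpers written for this example; the point is only that building the pipeline stores the callables, and nothing is invoked until `run` is called.

```python
calls = []  # filled in by the steps as they execute


def record(name):
    """Build a step that notes its own name when it is actually run."""
    def step(inp, context):
        calls.append(name)
        return inp
    return step


class LazyPipeline:
    def __init__(self):
        self.steps = []

    def __rshift__(self, func):
        self.steps.append(func)  # store only, do not call
        return self

    def run(self, inp=None):
        context = {}
        for func in self.steps:
            inp = func(inp, context)
        return inp


pipeline = LazyPipeline() >> record('first') >> record('second')
calls_before_run = list(calls)  # nothing has executed yet
pipeline.run()
calls_after_run = list(calls)   # both steps ran, in order
print(calls_before_run, calls_after_run)
```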

`PipelineStep` syntax is pretty verbose, but it can be simplified. You can pass any `callable` to a pipeline and it will be wrapped into a `PipelineStep` automatically. A step function should take two arguments: `input` and `context`. `input` is supplied through the executor's parameters; `context` contains global variables available to every step. If a step returns a value, it should wrap it in the `PipelineData` class. The `input` passed to an `Executor` is wrapped in `PipelineData` automatically.
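
As a rough illustration of the `input` / `context` contract, the following pure-Python sketch wraps the raw input once and shares a single context dictionary between steps. `ToyData` and `run_steps` are hypothetical names, and the `dataset` and `return_val` attributes are modeled only on how `PipelineData` is used elsewhere in this notebook, so treat the details as assumptions.

```python
class ToyData:
    """Hypothetical stand-in for PipelineData."""
    def __init__(self, dataset, return_val=None):
        self.dataset = dataset
        self.return_val = return_val


def run_steps(steps, raw_input):
    """Run plain callables in order, sharing one context between them."""
    context = {}               # global state visible to every step
    data = ToyData(raw_input)  # the raw input is wrapped automatically
    for step in steps:
        data = step(data, context)
    return context, data


def scale(inp, context):
    context['scaled'] = True   # leave a note for later steps
    return ToyData([v * 10 for v in inp.dataset])


def total(inp, context):
    # Reads the flag that the previous step stored in the shared context
    return ToyData(inp.dataset,
                   return_val=(sum(inp.dataset), context['scaled']))


context, out = run_steps([scale, total], [1, 2, 3])
print(out.dataset)     # [10, 20, 30]
print(out.return_val)  # (60, True)
```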

In [5]:
# We create two steps that add 1 and 2 to input data
plus_one = PipelineStep('plus_one', lambda inp, context: inp.dataset + 1)
plus_two = PipelineStep('plus_two', lambda inp, context: inp.dataset + 2)

LocalExecutor(0) << \
    (Pipeline()
     # We use RandomChoice combinator to choose randomly between two steps while executing the pipeline
     >> RandomChoice([plus_one, plus_two]))

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'RandomChoice'
100%|██████████| 1/1 [00:00<00:00, 1019.27it/s]


(<automl.pipeline.PipelineContext at 0x117422780>, 1)
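
A combinator such as `RandomChoice` is itself just a callable that delegates to one of the steps it holds. The sketch below shows one way this could look; `ToyRandomChoice` is a hypothetical name and the optional `rng` parameter is an addition for reproducibility, not something the AutoML API is known to offer.

```python
import random


class ToyRandomChoice:
    """Pick one of the given steps at random each time it is called."""
    def __init__(self, steps, rng=None):
        self.steps = list(steps)
        self.rng = rng or random.Random()

    def __call__(self, inp, context):
        step = self.rng.choice(self.steps)
        return step(inp, context)


def plus_one(inp, context):
    return inp + 1


def plus_two(inp, context):
    return inp + 2


choice = ToyRandomChoice([plus_one, plus_two], rng=random.Random(0))
results = {choice(0, {}) for _ in range(50)}
print(results)  # every value is either 1 or 2
```

Because the combinator has the same `(inp, context)` signature as any other step, it can be placed in a pipeline wherever a single step is expected.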

It is recommended to create complex callables for `PipelineStep`s as classes:

In [6]:
class ComplexStep:
    def __init__(self):
        print("Initializing ComplexStep")
        
    def __call__(self, inp, context):
        print(inp)
        return inp
    
LocalExecutor() << (Pipeline() >> ComplexStep())

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'ComplexStep'
100%|██████████| 1/1 [00:00<00:00, 969.11it/s]

Initializing ComplexStep
<automl.pipeline.PipelineData object at 0x1173adeb8>

(<automl.pipeline.PipelineContext at 0x117422828>,
 <automl.pipeline.PipelineData at 0x1173adeb8>)