# Onepanel AutoML 0.1a - Kaggle Dataset Example

Here we use AutoML to solve a classification task on the [Titanic](https://www.kaggle.com/c/titanic) dataset from Kaggle. First, let's load the data.

In [38]:
import pandas as pd

data = pd.read_csv('./data/train.csv', parse_dates=[2])
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [87]:
import sys
import numpy as np
# AutoML uses Python's logging module
import logging

# Various sklearn models and metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, accuracy_score
from xgboost.sklearn import XGBClassifier

# AutoML classes
from automl.pipeline import LocalExecutor, Pipeline, PipelineStep, PipelineData
from automl.data.dataset import Dataset
from automl.model import ModelSpace, CV, Validate, ChooseBest
from automl.hyperparam.templates import (random_forest_hp_space, 
                                         knn_hp_space, svc_hp_space, 
                                         grad_boosting_hp_space, 
                                         xgboost_hp_space)
from automl.feature.generators import FormulaFeatureGenerator
from automl.feature.selector import FeatureSelector
from automl.hyperparam.hyperopt import Hyperopt
from automl.combinators import RandomChoice

logging.basicConfig(level=logging.INFO)
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

# Preprocessing


In [80]:
from sklearn.preprocessing import LabelEncoder

def preprocess_data(df):
    """Preprocess data and create AutoML Dataset"""
    encoder = LabelEncoder()
    result = df.copy()
    result.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    result['Sex'] = encoder.fit_transform(result['Sex'])
    result['Embarked'] = encoder.fit_transform(result['Embarked'].astype(str))
    # Fill missing ages with the median (assignment avoids chained in-place modification)
    result['Age'] = result['Age'].fillna(result['Age'].median())
    result['Age'] = pd.cut(result['Age'], 10, labels=range(0,10)).astype(int)
    result['Pclass'] = result['Pclass'].astype(int)
    
    result['FamilySize'] = result['SibSp'] + result['Parch'] + 1
    result['IsAlone'] = 0
    result.loc[result['FamilySize'] == 1, 'IsAlone'] = 1
    
    result['Fare'] = pd.cut(result['Fare'], 10, labels=range(0,10)).astype(int)
    
    
    return Dataset(result.drop(['Survived'], axis=1),
                   result['Survived'])

dataset = preprocess_data(data)
dataset.data.dtypes

Pclass        int64
Sex           int64
Age           int64
SibSp         int64
Parch         int64
Fare          int64
Embarked      int64
FamilySize    int64
IsAlone       int64
dtype: object
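
The `Age` and `Fare` columns above are discretized with `pd.cut`, which splits the observed value range into equal-width intervals. A minimal sketch on made-up values (the toy series and the choice of four bins are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Toy ages, invented for illustration only
ages = pd.Series([2.0, 22.0, 38.0, 80.0])

# Split the range [2, 80] into 4 equal-width bins labelled 0..3,
# mirroring the pd.cut(..., 10, labels=range(0, 10)) calls above
age_bins = pd.cut(ages, 4, labels=list(range(4))).astype(int)

print(age_bins.tolist())  # → [0, 1, 1, 3]
```

Equal-width bins are sensitive to outliers: a single very large fare stretches the range and pushes most rows into the lowest bins. `pd.qcut`, which bins by quantiles, is a possible alternative when that matters.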

In [105]:
from sklearn.model_selection import cross_val_score
# Baseline: XGBoost with default parameters, mean score over 5-fold cross-validation
xgb = XGBClassifier()
np.mean(cross_val_score(xgb, dataset.data, dataset.target, cv=5))

0.80590466584887432

In [83]:
# Next, we define our ModelSpace. ModelSpace is initialized with a list of tuples.
# The first element of each tuple should be an sklearn-like estimator with a fit method.
# The second is the model parameter dictionary. Here we do not define parameters
# explicitly, but use hyperparameter templates from AutoML. Those templates are
# used later by the Hyperopt step to find the best model parameters automatically.
model_list = [
    (RandomForestClassifier, random_forest_hp_space()),
    (KNeighborsClassifier, knn_hp_space(lambda key: key)),
    (XGBClassifier, xgboost_hp_space())
]


# Create the executor, initialize it with our classification dataset
# and set the total number of epochs to 3 (the pipeline will be run three times in a row).
# We can load any pipeline into the executor using the << operator, as below:
context, pipeline_data = LocalExecutor(dataset, epochs=3) << \
    (Pipeline() # Here we define the pipeline. Steps are added to it using the >> operator
     # First we define our ModelSpace. We wrap it in the PipelineStep class
     # and set initializer=True so that the ModelSpace step runs only in the first epoch
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     # We are not obliged to wrap every step in PipelineStep.
     # Wrapping is done automatically when no special parameters are needed.
     # FormulaFeatureGenerator creates arithmetic combinations of dataset features
     # (disabled here; uncomment to enable)
     #>> FormulaFeatureGenerator(['+', '-', '*'])
     # Next we use Hyperopt to find the best combination of hyperparameters for each model.
     # We use hold-out (test set) validation with accuracy as the score function.
     # CV could be used instead of Validate to perform cross-validation
     >> Hyperopt(Validate(test_size=0.1, metrics=accuracy_score), max_evals=60)
     # Then we choose the best-performing model we found
     >> ChooseBest(1)
     # Optionally, select the 10 best features (disabled here; uncomment to enable)
     #>> FeatureSelector(10)
    )

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 9, new feature number - 19
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004544 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.004900 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.233333
hyperopt.tpe - INFO - tpe_transform took 0.004704 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.004924 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.188889
hyperopt.tpe - INFO - tpe_transform took 0.005194 seconds
hype

hyperopt.tpe - INFO - tpe_transform took 0.003255 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.003261 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003328 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003435 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003125 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003126 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003137 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_transform took 0.003359 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.222222
hyperopt.tpe - INFO - tpe_

hyperopt.tpe - INFO - tpe_transform took 0.005548 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.188889
hyperopt.tpe - INFO - tpe_transform took 0.005675 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005486 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.006190 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005635 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005444 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005519 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005472 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.177

hyperopt.tpe - INFO - tpe_transform took 0.004721 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.005326 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.188889
hyperopt.tpe - INFO - tpe_transform took 0.004799 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.188889
hyperopt.tpe - INFO - tpe_transform took 0.004829 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.188889
hyperopt.tpe - INFO - tpe_transform took 0.004757 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004979 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004818 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.005081 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_

hyperopt.tpe - INFO - tpe_transform took 0.003213 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003099 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003149 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003208 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.003145 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.003167 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.003581 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.003219 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.166

hyperopt.tpe - INFO - tpe_transform took 0.006296 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005443 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005530 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005658 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005411 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005553 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005414 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005599 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best lo

hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004988 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004838 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004910 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.005137 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004719 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004805 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.005017 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.166667
hyperopt.tpe - INFO - tpe_transform took 0.004760 second

hyperopt.tpe - INFO - tpe_transform took 0.003150 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003124 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003185 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003155 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003197 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003446 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003219 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.200000
hyperopt.tpe - INFO - tpe_transform took 0.003133 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss

hyperopt.tpe - INFO - tpe_transform took 0.005700 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005553 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005918 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005647 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005785 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005445 seconds
hyperopt.tpe - INFO - TPE using 16/16 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.006004 seconds
hyperopt.tpe - INFO - TPE using 17/17 trials with best loss 0.177778
hyperopt.tpe - INFO - tpe_transform took 0.005759 seconds
hyperopt.tpe - INFO - TPE using 18/18 trials with bes

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features=0.635895756849388,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=115, n_jobs=1, oob_score=False, random_state=3,
            verbose=False, warm_start=False) 0.7777777777777778
(891, 9)
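
The loss values in the log above are consistent with a 90-row hold-out set: with 891 rows and `test_size=0.1`, a split that rounds the test size up (as scikit-learn's `train_test_split` does) holds out 90 rows. The check below is a sketch under two assumptions that the source does not confirm: that `Validate` splits this way, and that the loss logged by Hyperopt is `1 - accuracy`.

```python
import math

n_rows = 891          # rows in the preprocessed dataset
test_size = 0.1       # fraction passed to Validate
n_test = math.ceil(n_rows * test_size)  # assumed rounding rule, gives 90

# Distinct "best loss" values copied from the log above
logged_losses = [0.233333, 0.222222, 0.200000, 0.188889, 0.177778, 0.166667]

# If loss = misclassified / n_test, then loss * n_test should be close to an integer
errors = [round(loss * n_test) for loss in logged_losses]
for loss, err in zip(logged_losses, errors):
    assert abs(err / n_test - loss) < 1e-5
    print(f"loss {loss:.6f} ~ {err}/{n_test} misclassified")

# The final reported score fits the same pattern: 70 correct out of 90
final_score = 0.7777777777777778
print(round(final_score * n_test))  # → 70
```

If these assumptions hold, scores on this split move in steps of about 0.011, so small differences between models amount to one or two rows. Using `CV` instead of `Validate` may give a steadier estimate.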





# Extending AutoML

First, let's see how `PipelineStep`s work by building a simple hello-world pipeline.

In [29]:
# Let's create a simple pipeline
pipeline = Pipeline() >> PipelineStep('hello_step', lambda inp, context: print("Hello!"))

# And execute it locally
LocalExecutor() << pipeline

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'hello_step'
100%|██████████| 1/1 [00:00<00:00, 1305.82it/s]

Hello!





(<automl.pipeline.PipelineContext at 0x7fbd91124390>, None)

As you can see, steps are added to a pipeline with the `>>` operator. A pipeline may contain any number of steps. A `PipelineStep` is constructed from a step name and a `callable`, which is executed when the `Pipeline` is run by an `Executor`. Note that every `Pipeline` is lazy: its steps are executed only when the `Pipeline` is loaded into an `Executor`.
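
To make the operator syntax and the laziness concrete, here is a toy re-implementation of the idea in plain Python. `ToyPipeline` and `ToyExecutor` are hypothetical names invented for this sketch; they are not AutoML classes and say nothing about how AutoML is implemented internally.

```python
class ToyPipeline:
    """Hypothetical stand-in for Pipeline: it only collects steps."""

    def __init__(self):
        self.steps = []

    def __rshift__(self, step):
        # `pipeline >> step` appends the step and returns the pipeline for chaining
        self.steps.append(step)
        return self


class ToyExecutor:
    """Hypothetical stand-in for an executor: runs steps when a pipeline is loaded."""

    def __init__(self, data=None):
        self.data = data

    def __lshift__(self, pipeline):
        # `executor << pipeline` is the point where the steps actually execute
        result = self.data
        for step in pipeline.steps:
            result = step(result)
        return result


calls = []  # records which steps have run, to demonstrate laziness

def add_one(x):
    calls.append("add_one")
    return x + 1

def double(x):
    calls.append("double")
    return x * 2

pipeline = ToyPipeline() >> add_one >> double
print(calls)                        # → [] (defining the pipeline runs nothing)
print(ToyExecutor(3) << pipeline)   # → 8
print(calls)                        # → ['add_one', 'double']
```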

`PipelineStep` syntax is fairly verbose, but it can be simplified: you can pass any `callable` to a pipeline and it will be wrapped in a `PipelineStep` automatically. A step function should take two arguments, `input` and `context`. `input` is supplied through the executor's parameters; `context` holds global variables available to every step. If a step returns a value, it should wrap it in the `PipelineData` class. The `input` passed to an `Executor` is wrapped in `PipelineData` automatically.
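
The `(input, context)` calling convention can be sketched without AutoML. In the toy below, `ToyData` and `run_steps` are hypothetical helpers, and the context is a plain dictionary, whereas the real library uses its own `PipelineData` and `PipelineContext` objects.

```python
class ToyData:
    """Hypothetical stand-in for PipelineData: a thin wrapper around a value."""

    def __init__(self, dataset):
        self.dataset = dataset


def run_steps(steps, value):
    """Run callables in order, passing (input, context) to each one."""
    context = {}           # shared by all steps
    inp = ToyData(value)   # the executor input is wrapped before the first step
    for step in steps:
        inp = step(inp, context)
    return context, inp


def remember_size(inp, context):
    # Publish a value that later steps can read
    context["n_items"] = len(inp.dataset)
    return inp


def scale_by_size(inp, context):
    # Read what the earlier step stored and return a new wrapped value
    n = context["n_items"]
    return ToyData([x / n for x in inp.dataset])


ctx, result = run_steps([remember_size, scale_by_size], [2, 4, 6, 8])
print(ctx)             # → {'n_items': 4}
print(result.dataset)  # → [0.5, 1.0, 1.5, 2.0]
```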

In [10]:
# We create two steps that add 1 and 2 to input data
plus_one = PipelineStep('plus_one', lambda inp, context: inp.dataset + 1)
plus_two = PipelineStep('plus_two', lambda inp, context: inp.dataset + 2)

LocalExecutor(0) << \
    (Pipeline()
     # We use RandomChoice combinator to choose randomly between two steps while executing the pipeline
     >> RandomChoice([plus_one, plus_two]))

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'RandomChoice'
100%|██████████| 1/1 [00:00<00:00, 896.22it/s]


(<automl.pipeline.PipelineContext at 0x7f277b4310b8>, 1)
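
A combinator such as `RandomChoice` is itself just a callable that delegates to one of the steps it holds. The sketch below illustrates that idea with a hypothetical `ToyRandomChoice` class; the `seed` parameter is an addition made for reproducibility here, not a documented AutoML option.

```python
import random


class ToyRandomChoice:
    """Hypothetical combinator: picks one of several steps each time it is called."""

    def __init__(self, steps, seed=None):
        self.steps = list(steps)
        self.rng = random.Random(seed)  # private RNG so runs can be reproduced

    def __call__(self, inp, context):
        chosen = self.rng.choice(self.steps)
        return chosen(inp, context)


def toy_plus_one(inp, context):
    return inp + 1


def toy_plus_two(inp, context):
    return inp + 2


choice = ToyRandomChoice([toy_plus_one, toy_plus_two], seed=0)

# Starting from 0, every run yields either 1 or 2, depending on the step drawn
results = [choice(0, {}) for _ in range(20)]
print(results)
```

This matches the output above, where an input of `0` produced `1`, meaning `plus_one` was drawn in that run.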

It is recommended to create complex callables for `PipelineStep`s as classes:

In [12]:
class ComplexStep:
    def __init__(self):
        print("Initializing ComplexStep")
        
    def __call__(self, inp, context):
        print(inp)
        return inp
    
LocalExecutor() << (Pipeline() >> ComplexStep())

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/1 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'ComplexStep'
100%|██████████| 1/1 [00:00<00:00, 1423.73it/s]

Initializing ComplexStep
<automl.pipeline.PipelineData object at 0x7f277afe0748>





(<automl.pipeline.PipelineContext at 0x7f277bf5ca20>,
 <automl.pipeline.PipelineData at 0x7f277afe0748>)
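
One reason to prefer classes is that configuration and state live in `__init__`, which runs once when the pipeline is defined (note that "Initializing ComplexStep" is printed only once above), while `__call__` does the per-run work. A minimal sketch with a hypothetical `ClipStep`, called directly here rather than through an executor:

```python
class ClipStep:
    """Hypothetical step: clips numeric values to a configured range."""

    def __init__(self, low, high):
        # Configuration is fixed once, when the step object is created
        self.low = low
        self.high = high
        self.calls = 0  # state that persists between invocations

    def __call__(self, inp, context):
        # The actual work happens each time the step is invoked
        self.calls += 1
        return [min(max(x, self.low), self.high) for x in inp]


step = ClipStep(low=0, high=10)
print(step([-5, 3, 42], context={}))  # → [0, 3, 10]
print(step([11, 7], context={}))      # → [10, 7]
print(step.calls)                     # → 2
```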