# Onepanel AutoML 0.1.5 - Kaggle Dataset Example

Here we use AutoML to solve a classification task on the classic [Titanic](https://www.kaggle.com/c/titanic) dataset from Kaggle. First, let's load the data.

In [1]:
import pandas as pd

data = pd.read_csv('./data/train.csv', parse_dates=[2])
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
import sys
import numpy as np
# AutoML uses Python's logging module
import logging

# Various sklearn models and metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, make_scorer
from xgboost.sklearn import XGBClassifier

# AutoML classes
from automl.pipeline import LocalExecutor, Pipeline, PipelineStep, PipelineData
from automl.data.dataset import Dataset
from automl.model import ModelSpace, CV, Validate, ChooseBest
from automl.hyperparam.templates import (random_forest_hp_space, 
                                         knn_hp_space, svc_kernel_hp_space, 
                                         grad_boosting_hp_space, 
                                         xgboost_hp_space)
from automl.feature.generators import FormulaFeatureGenerator, PolynomialGenerator
from automl.feature.selector import FeatureSelector
from automl.hyperparam.optimization import Hyperopt
from automl.combinators import RandomChoice

logging.basicConfig(level=logging.INFO)
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

# Preprocessing


No matter how automated our process is, the data may still need some preprocessing. Good old feature engineering can also help a lot. We skip the exploratory data analysis and feature engineering stages for brevity. If you are interested, we suggest looking at some examples among the contest's [kernels](https://www.kaggle.com/c/titanic/kernels).

In [3]:
from sklearn.preprocessing import LabelEncoder

def preprocess_data(df):
    """Preprocess data and create AutoML Dataset"""
    encoder = LabelEncoder()
    result = df.copy()
    
    # drop columns we won't be using
    result.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    
    # transform Sex column into numeric categories
    result['Sex'] = encoder.fit_transform(result['Sex'])
    
    # do the same with Embarked column
    result['Embarked'] = encoder.fit_transform(result['Embarked'].astype(str))
    
    # Replace missing Ages with the median value and "pack"
    # Age into 10 equal-width bins spanning the observed range.
    # For example, the lowest tenth of the age range goes into bin 0.
    result['Age'] = result['Age'].fillna(result['Age'].median())
    result['Age'] = pd.cut(result['Age'], 10, labels=range(0,10)).astype(int)
    
    # Pack Fare into 10 bins
    result['Fare'] = pd.cut(result['Fare'], 10, labels=range(0,10)).astype(int)
    
    # transform Pclass type to int
    result['Pclass'] = result['Pclass'].astype(int)
    
    # add some useful predictive features that may come
    # to mind during data analysis
    result['FamilySize'] = result['SibSp'] + result['Parch'] + 1
    result['IsAlone'] = 0
    result.loc[result['FamilySize'] == 1, 'IsAlone'] = 1
    
    return Dataset(result.drop(['Survived'], axis=1),
                   result['Survived'])

dataset = preprocess_data(data)
print(f"Features: {dataset.columns}")

Features: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'IsAlone']
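
The binning step is worth a closer look. The following minimal sketch uses made-up ages rather than the Titanic data. It shows that `pd.cut` with an integer bin count splits the observed range into equal-width intervals, so the bin edges depend on the minimum and maximum values present in the column.

```python
import pandas as pd

# Made-up ages for illustration only (not taken from the Titanic data).
ages = pd.Series([0.5, 8.0, 22.0, 35.0, 80.0])

# Ten equal-width bins over the observed range [0.5, 80.0];
# each bin is about (80.0 - 0.5) / 10 = 7.95 years wide.
binned = pd.cut(ages, 10, labels=range(0, 10)).astype(int)

# Show each age next to the bin label it was assigned.
for age, bin_label in zip(ages, binned):
    print(age, "->", bin_label)
```

Because the edges are derived from the data, the same call on a different sample (for example, a test set with a different maximum age) can produce different bins. Passing explicit bin edges to `pd.cut` is one way to keep them fixed.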


Now let's fit a simple XGBoost model with default parameters and see how it scores on the dataset.

In [4]:
from sklearn.model_selection import cross_val_score
xgb = XGBClassifier()
np.mean(cross_val_score(xgb, dataset.data, dataset.target))

0.81144781144781142

Ok, 81% accuracy with the defaults. Let's go on to AutoML Pipelines and see if we can improve the results
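
For a baseline that is easier to compare with later results, it can help to state the number of folds and the metric explicitly instead of relying on library defaults. Here is a minimal sketch, assuming synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost; neither is part of the original notebook run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook this would be
# dataset.data and dataset.target.
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

model = GradientBoostingClassifier(random_state=0)

# Explicit fold count and metric, matching the accuracy scorer
# used in the AutoML pipelines.
scores = cross_val_score(model, X, y, cv=5,
                         scoring=make_scorer(accuracy_score))

print(f"mean accuracy: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```

Reporting the spread across folds alongside the mean gives a rough sense of whether a later gain of one or two percentage points is larger than fold-to-fold noise.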

In [5]:
# Next, we define our ModelSpace. ModelSpace is initialized by a list of tuples.
# First element of each tuple should be an sklearn-like estimator with fit method
# The second one is model parameter dictionary. Here we do not define parameters 
# explicitly, but use hyperparameter templates from AutoML. Those templates can be
# used later by Hyperopt step to find best model parameters automatically
model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (KNeighborsClassifier, knn_hp_space(lambda key: key)),
      (XGBClassifier, xgboost_hp_space())
  ]


# Create executor, initialize it with our classification dataset 
# and set total number of epochs to 2 (the pipeline will be run two times in a row).
# We can load any pipeline into executor using << operator like below:
context, pipeline_data = LocalExecutor(dataset, epochs=2) << \
    (Pipeline() # Here we define the pipeline. Steps can be added to pipeline using >> operator
     # First we define our ModelSpace. We wrap it with PipelineStep class 
     # and set initializer=True so that ModelSpace step will be run only at the first epoch
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     # But we are not obliged to wrap all steps with PipelineStep.
     # This will be done automatically if we do not need to set any special parameters 
     # We use PolynomialGenerator to create polynomial combinations of the features from the dataset
     >> PolynomialGenerator()
     # Next we use Hyperopt to find the best combination of hyperparameters for each model
     # We use cross-validation (CV) with the accuracy metric as the score function.
     # Validate could be used instead of CV to perform test set validation
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=20)
     # Then we choose the best performing model we found
     >> ChooseBest(1)
     # And select the 20 best features
     >> FeatureSelector(20))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Framework version: v0.1.5
LocalExecutor - INFO - Starting AutoML Epoch #1
LocalExecutor - INFO - Dataset columns: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'IsAlone']
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'SklearnFeatureGenerator'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x1147959e8>, 'max_features': <hyperopt.pyll.base.Apply object at 0x114795da0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x1147c40b8>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x1147c41d0>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x1147c43c8>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x1147c44e0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'skle


  █████╗ ██╗   ██╗████████╗ ██████╗ ███╗   ███╗██╗
 ██╔══██╗██║   ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
 ███████║██║   ██║   ██║   ██║   ██║██╔████╔██║██║
 ██╔══██║██║   ██║   ██║   ██║   ██║██║╚██╔╝██║██║
 ██║  ██║╚██████╔╝   ██║   ╚██████╔╝██║ ╚═╝ ██║███████╗
 ╚═╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝ ╚═╝     ╚═╝╚══════╝



hyperopt.tpe - INFO - tpe_transform took 0.004156 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.181818
hyperopt.tpe - INFO - tpe_transform took 0.004237 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004591 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004515 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004059 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004067 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004074 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.004502 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.176207


hyperopt.tpe - INFO - tpe_transform took 0.005452 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005175 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.004998 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.004948 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.007446 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005935 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005151 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005070 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.1

hyperopt.tpe - INFO - tpe_transform took 0.003750 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.003971 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.003685 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.003576 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.003932 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.003899 seconds
hyperopt.tpe - INFO - TPE using 16/16 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.004088 seconds
hyperopt.tpe - INFO - TPE using 17/17 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.004361 seconds
hyperopt.tpe - INFO - TPE using 18/18 trials with bes

hyperopt.tpe - INFO - tpe_transform took 0.004577 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.004660 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.004599 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.005671 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.004783 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.203143
hyperopt.tpe - INFO - tpe_transform took 0.004695 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.203143
hyperopt.tpe - INFO - tpe_transform took 0.005785 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.203143
hyperopt.tpe - INFO - tpe_transform took 0.006638 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.1

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=2, max_features=0.1300828845731834,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=319, n_jobs=1, oob_score=False, random_state=0,
            verbose=False, warm_start=False) 0.8260381593714928
(891, 20)
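
The `>>` and `<<` syntax used above relies on Python operator overloading. The sketch below is a toy illustration of that idea, not AutoML's actual implementation: the names `Step`, `ToyPipeline`, and `ToyExecutor` are hypothetical, and the real classes carry more state (a context, initializer steps, and so on).

```python
class Step:
    """A named unit of work that transforms a value."""

    def __init__(self, name, func):
        self.name = name
        self.func = func


class ToyPipeline:
    """Collects steps via the >> operator (hypothetical class)."""

    def __init__(self):
        self.steps = []

    def __rshift__(self, other):
        # Wrap bare callables automatically, mirroring the idea that
        # explicit wrapping is only needed for special parameters.
        step = other if isinstance(other, Step) else Step(other.__name__, other)
        self.steps.append(step)
        return self  # returning self lets calls chain: p >> a >> b


class ToyExecutor:
    """Runs a pipeline on its input for a number of epochs via <<."""

    def __init__(self, data, epochs=1):
        self.data = data
        self.epochs = epochs

    def __lshift__(self, pipeline):
        value = self.data
        for _ in range(self.epochs):
            # Each epoch feeds the previous epoch's output back in.
            for step in pipeline.steps:
                value = step.func(value)
        return value


def double(x):
    return x * 2


result = ToyExecutor(3, epochs=2) << (
    ToyPipeline() >> Step("add one", lambda x: x + 1) >> double
)
print(result)
```

The parentheses around the pipeline matter: `<<` and `>>` have the same precedence and associate left to right, so without them the executor would try to consume the empty pipeline first.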


OK, we have reached better accuracy than the default XGBoost model on the plain dataset (about 82.6% versus 81.1%). However, the first pipeline did a good job of generating various features for our dataset, but it was not really designed to search for the best model. Now let's create a pipeline that searches for the best model on a fixed dataset.

Please note that increasing the `max_evals` parameter of `Hyperopt` can lead to finding better model parameters, but we use modest values here for demonstration purposes.
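
To illustrate why a larger evaluation budget can only help the best result found, here is a small sketch. It swaps in plain random search for the TPE algorithm that the logs show Hyperopt using, and the `loss` function is a made-up stand-in for one minus accuracy (the form the logged "best loss" values appear to take).

```python
import random


def loss(params):
    """Toy loss with its minimum at x = 0.3; stands in for 1 - accuracy."""
    return (params["x"] - 0.3) ** 2


def random_search(max_evals, seed=0):
    """Return (best_loss, best_params) after max_evals random trials."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_evals):
        candidate = {"x": rng.uniform(0.0, 1.0)}
        value = loss(candidate)
        # Keep the candidate only if it beats the best seen so far.
        if best is None or value < best[0]:
            best = (value, candidate)
    return best


best_20, _ = random_search(max_evals=20)
best_50, _ = random_search(max_evals=50)
print(best_20, best_50)
```

With the same seed, the 50-trial run sees the same first 20 candidates as the 20-trial run, so its best loss can never be worse. The price is run time, which grows roughly linearly with the number of evaluations.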

In [6]:
context, pipeline_data = LocalExecutor(pipeline_data.dataset, epochs=1) << \
    (Pipeline()
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=50)
     >> ChooseBest(1))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Framework version: v0.1.5
LocalExecutor - INFO - Starting AutoML Epoch #1
LocalExecutor - INFO - Dataset columns: ['Age']
  0%|          | 0/3 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x1147959e8>, 'max_features': <hyperopt.pyll.base.Apply object at 0x114795da0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x1147c40b8>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x1147c41d0>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x1147c43c8>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x1147c44e0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004771 seconds
hyperopt.tpe - INFO - TPE using 0 trial


  █████╗ ██╗   ██╗████████╗ ██████╗ ███╗   ███╗██╗
 ██╔══██╗██║   ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
 ███████║██║   ██║   ██║   ██║   ██║██╔████╔██║██║
 ██╔══██║██║   ██║   ██║   ██║   ██║██║╚██╔╝██║██║
 ██║  ██║╚██████╔╝   ██║   ╚██████╔╝██║ ╚═╝ ██║███████╗
 ╚═╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝ ╚═╝     ╚═╝╚══════╝



hyperopt.tpe - INFO - tpe_transform took 0.004314 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.203143
hyperopt.tpe - INFO - tpe_transform took 0.003817 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.203143
hyperopt.tpe - INFO - tpe_transform took 0.004985 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.190797
hyperopt.tpe - INFO - tpe_transform took 0.003829 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.190797
hyperopt.tpe - INFO - tpe_transform took 0.005366 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.190797
hyperopt.tpe - INFO - tpe_transform took 0.004957 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.190797
hyperopt.tpe - INFO - tpe_transform took 0.003651 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.190797
hyperopt.tpe - INFO - tpe_transform took 0.005338 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.190797


Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.neighbors.classification.KNeighborsClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.001306 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.002850 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.314254
hyperopt.tpe - INFO - tpe_transform took 0.002563 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.196409
hyperopt.tpe - INFO - tpe_transform took 0.002489 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.196409
hyperopt.tpe - INFO - tpe_transform took 0.003026 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.196409
hyperopt.tpe - INFO - tpe_transform took 0.002179 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.196409
hyperopt.tpe - INFO - tpe_transform took 0.002440 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.196409
hyperopt.tpe - INFO - tpe_tra

hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.006257 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.194164
hyperopt.tpe - INFO - tpe_transform took 0.004806 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.194164
hyperopt.tpe - INFO - tpe_transform took 0.004320 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.193042
hyperopt.tpe - INFO - tpe_transform took 0.004426 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.193042
hyperopt.tpe - INFO - tpe_transform took 0.006391 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.193042
hyperopt.tpe - INFO - tpe_transform took 0.004402 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.193042
hyperopt.tpe - INFO - tpe_transform took 0.004800 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.193042
hyperopt.tpe - INFO - tpe_transform took 0.004697 second

hyperopt.tpe - INFO - tpe_transform took 0.004748 seconds
hyperopt.tpe - INFO - TPE using 48/48 trials with best loss 0.184063
hyperopt.tpe - INFO - tpe_transform took 0.004590 seconds
hyperopt.tpe - INFO - TPE using 49/49 trials with best loss 0.184063
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
 67%|██████▋   | 2/3 [03:47<01:53, 113.69s/it]LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.095468560708176,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=20,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=False, warm_start=False) - 0.8215488215488215
ChooseBest - INFO - KNeighborsClassifier(algorithm='

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.095468560708176,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=20,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=0,
            verbose=False, warm_start=False) 0.8215488215488215
(891, 20)


We've got a nice improvement over the 81% baseline. That's enough to get into the top 1% in the Kaggle Titanic demo competition.

## Voting feature selection

AutoML also allows you to select features that perform well for most models in the model space. Features that have greater importance in multiple models will have greater weight.


The first $N$ features are selected from the following well-ordered set:

$\mathbb{F} = \operatorname{softmax}(\mathbb{I}) \circ \mathbb{S}$,

where
* $\mathbb{I} \in \mathbb{R}$ represents a model-specific feature score set
* $\mathbb{S} \in \mathbb{R}$ is a set of model scores according to some scoring function $s(x, m): \mathbb{M} \rightarrow \mathbb{R}$ ($x$ is a dataset, $m$ is a model, $\mathbb{M}$ is a model space)
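
One possible reading of the formula above can be sketched in NumPy. This is an interpretation, not AutoML's code: the function name `voting_select`, the per-model softmax, and summing the weighted votes across models are all assumptions made for illustration.

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(v - np.max(v))
    return e / e.sum()


def voting_select(importances, model_scores, n):
    """Pick n feature indices by score-weighted importance votes.

    importances:  shape (n_models, n_features), model-specific scores
    model_scores: shape (n_models,), one score per model
    """
    importances = np.asarray(importances, dtype=float)
    model_scores = np.asarray(model_scores, dtype=float)
    # Normalize each model's importances so models are comparable.
    normalized = np.apply_along_axis(softmax, 1, importances)
    # Weight each model's vote by its score, then add up the votes
    # (the summation across models is an assumption of this sketch).
    votes = (normalized * model_scores[:, None]).sum(axis=0)
    # Indices of the n highest-voted features, best first.
    return np.argsort(votes)[::-1][:n]


# Made-up importances for two models over three features.
importances = [[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3]]
model_scores = [0.8, 0.5]

selected = voting_select(importances, model_scores, n=2)
print(selected)
```

In this toy input the first model scores higher, so its preferred feature (index 0) outranks the second model's preferred feature (index 1).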

In [7]:
from automl.hyperparam.optimization import Hyperopt
from sklearn.linear_model import LogisticRegression
from automl.feature.selector import VotingFeatureSelector
from automl.feature.generators import FormulaFeatureGenerator

model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (LogisticRegression, {}),
      (XGBClassifier, xgboost_hp_space())
  ]

dataset = preprocess_data(data)

context, pipeline_data = LocalExecutor(dataset, epochs=20) << \
    (Pipeline()
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     >> FormulaFeatureGenerator()
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=1)
     >> ChooseBest(4, by_largest_score=False)
     >> VotingFeatureSelector(feature_to_select=5, reverse_score=True))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

print("Selected features:")
for col in pipeline_data.dataset.columns:
    print(col)

LocalExecutor - INFO - Framework version: v0.1.5
LocalExecutor - INFO - Starting AutoML Epoch #1
LocalExecutor - INFO - Dataset columns: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'IsAlone']
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 9, new feature number - 10
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x116b91b00>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x116b915c0>, 'ver


  █████╗ ██╗   ██╗████████╗ ██████╗ ███╗   ███╗██╗
 ██╔══██╗██║   ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
 ███████║██║   ██║   ██║   ██║   ██║██╔████╔██║██║
 ██╔══██║██║   ██║   ██║   ██║   ██║██║╚██╔╝██║██║
 ██║  ██║╚██████╔╝   ██║   ╚██████╔╝██║ ╚═╝ ██║███████╗
 ╚═╝  ╚═╝ ╚═════╝    ╚═╝    ╚═════╝ ╚═╝     ╚═╝╚══════╝



Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
Hyperopt - INFO - {}
Hyperopt - INFO - {'max_depth': <hyperopt.pyll.base.Apply object at 0x116b914a8>, 'learning_rate': <hyperopt.pyll.base.Apply object at 0x116b91ba8>, 'n_estimators': <hyperopt.pyll.base.Apply object at 0x116b91da0>, 'gamma': <hyperopt.pyll.base.Apply object at 0x116b91f28>, 'min_child_weight': <hyperopt.pyll.base.Apply object at 0x116b9e278>, 'max_delta_step': 0, 'subsample': <hyperopt.pyll.base.Apply object at 0x116b9eac8>, 'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x116b9ec50>, 'colsample_bylevel': <hyperopt.pyll.base.Apply object at 0x116b9ec18>, 'reg_alpha': <hyperopt.pyll.base.Apply object at 0x116b9e828>, 'reg_lambda': <hyperopt.pyll.base.Apply object at 0x116b9efd0>, 'scale_pos_weight': 1, 'base_score': 0.5, 'seed': <hyperopt.pyll.base.Apply object at 0x116b9ea58>}
Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hy

Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004605 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
 60%|██████    | 3/5 [00:01<00:01,  1.70it/s]LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - <class 'sklearn.linear_model.logistic.LogisticRegression'> - 0
ChooseBest - INFO - XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.9850874752008347,
       colsample_bytree=0.6564900316641631, gamma=0.10876807548294232,
       learning_rate=0.0004325663616031183, max_delta_step=0, max_depth=10,
       min_child_weight=66, missing=None, n_estimators=6000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.11458809561923787, reg_lambda=1.2379096181902154,
       scale_pos_weight=1, seed=4, si

 60%|██████    | 3/5 [00:01<00:01,  1.60it/s]LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - <class 'sklearn.linear_model.logistic.LogisticRegression'> - 0
ChooseBest - INFO - XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.8132079355507038,
       colsample_bytree=0.7229112540223576, gamma=0.0011881315904272904,
       learning_rate=0.000546077539864241, max_delta_step=0, max_depth=4,
       min_child_weight=33, missing=None, n_estimators=5000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.03551121080733793, reg_lambda=1.4999454203143583,
       scale_pos_weight=1, seed=2, silent=True,
       subsample=0.788327214362534) - 0.7867564534231201
ChooseBest - INFO - RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.43088429080032586,
            max_leaf_nodes=None, min_impurity_decreas

ChooseBest - INFO - RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features=0.27398126117665655,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=13, n_jobs=1, oob_score=False, random_state=1,
            verbose=False, warm_start=False) - 0.8002244668911335
LocalExecutor - INFO - Running step 'VotingFeatureSelector'
100%|██████████| 5/5 [00:01<00:00,  3.39it/s]
LocalExecutor - INFO - Starting AutoML Epoch #8
LocalExecutor - INFO - Dataset columns: ['Pclass', 'Sex', '(Pclass_mul_Sex)', '(Pclass_mul_(Pclass_mul_Sex))', '((Pclass_mul_Sex)_add_FamilySize)']
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'

LocalExecutor - INFO - Running step 'VotingFeatureSelector'
100%|██████████| 5/5 [00:07<00:00,  1.56s/it]
LocalExecutor - INFO - Starting AutoML Epoch #10
LocalExecutor - INFO - Dataset columns: ['Sex', '(Pclass_mul_Sex)', '(Pclass_mul_(Pclass_mul_Sex))', '((Pclass_mul_Sex)_add_FamilySize)', '(Sex_add_Pclass)']
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'b

100%|██████████| 5/5 [00:02<00:00,  2.23it/s]
LocalExecutor - INFO - Starting AutoML Epoch #12
LocalExecutor - INFO - Dataset columns: ['Sex', '(Pclass_mul_Sex)', '(Pclass_mul_(Pclass_mul_Sex))', '((Pclass_mul_Sex)_add_FamilySize)', '(Sex_add_Pclass)', '((Sex_add_Pclass)_add_((Pclass_mul_Sex)_add_FamilySize))']
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 6, new feature number - 7
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'b

  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x116b91b00>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x116b915c0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>

  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x116b91b00>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x116b915c0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>

  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x116b91b00>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x116b915c0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>

  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x116a12b38>, 'max_features': <hyperopt.pyll.base.Apply object at 0x116a12ef0>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x116b91780>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x116b91128>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x116b91b00>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x116b915c0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>

<class 'sklearn.linear_model.logistic.LogisticRegression'> 0
XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.5010372927131221,
       colsample_bytree=0.6363781436333136, gamma=1.4143689967206658,
       learning_rate=0.0006961769815068519, max_delta_step=0, max_depth=1,
       min_child_weight=51, missing=None, n_estimators=4800, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=7.514229992362662e-06, reg_lambda=2.644368927737067,
       scale_pos_weight=1, seed=3, silent=True,
       subsample=0.9947229032889922) 0.7463524130190797
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.16941997001422793,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=13,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=154, n_jobs=1, oob_score=False, random_state=3




5 features were selected from an initial set of 9.

# Reproducible preprocessing
After you're done with the AutoML model search, it may be useful to reproduce the resulting feature generation process.

In [9]:
from automl.feature.generators import Preprocessing

original_dataset = preprocess_data(data)

# Let's recreate all useful features found in the AutoML Pipeline
preprocessing = Preprocessing()
final_data = preprocessing.reproduce(pipeline_data.dataset, original_dataset)
final_data

array([[  2.,   6.,   6.,  36.,  38.],
       [  3.,   6.,   1.,   6.,   9.],
       [  4.,   8.,   3.,  24.,  28.],
       ..., 
       [  7.,  14.,   3.,  42.,  49.],
       [ -3.,  -3.,   5., -15., -18.],
       [ -1.,   1.,   7.,   7.,   6.]], dtype=float32)
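
To make the idea of reproducing generated features more concrete, here is a toy sketch that rebuilds a feature from a name such as `((Pclass_mul_Sex)_add_FamilySize)`, which appears in the log output earlier. It is not how `Preprocessing.reproduce` works internally. The helper `evaluate` is hypothetical, the naming convention is inferred from the logged column names, and only the `add` and `mul` operators seen there are handled.

```python
import pandas as pd

# Only the two operators that appear in the logged feature names.
OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def evaluate(name, df):
    """Evaluate a generated feature name against the columns of df."""
    if not name.startswith("("):
        return df[name]  # a plain, original column
    inner = name[1:-1]  # strip the outer parentheses
    depth = 0
    for i, ch in enumerate(inner):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "_" and depth == 0:
            # A top-level underscore may start an operator token.
            for op, fn in OPS.items():
                token = f"_{op}_"
                if inner.startswith(token, i):
                    left, right = inner[:i], inner[i + len(token):]
                    return fn(evaluate(left, df), evaluate(right, df))
    raise ValueError(f"cannot parse feature name: {name!r}")


# Made-up rows for illustration only.
toy = pd.DataFrame({"Pclass": [3, 1], "Sex": [1, 0], "FamilySize": [2, 2]})
feature = evaluate("((Pclass_mul_Sex)_add_FamilySize)", toy)
print(feature.tolist())
```

Encoding the recipe in the column name means the same features can be rebuilt on new data (for example, the Kaggle test set) without rerunning the search.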