# Onepanel AutoML 0.1a - Kaggle Dataset Example

Here we use AutoML to solve a classification task on the classic [Titanic](https://www.kaggle.com/c/titanic) dataset from Kaggle. First, let's load the training data.

In [1]:
import pandas as pd

data = pd.read_csv('./data/train.csv', parse_dates=[2])
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
import sys
import numpy as np
# AutoML uses Python's logging module
import logging

# Various sklearn models and metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, make_scorer
from xgboost.sklearn import XGBClassifier

# AutoML classes
from automl.pipeline import LocalExecutor, Pipeline, PipelineStep, PipelineData
from automl.data.dataset import Dataset
from automl.model import ModelSpace, CV, Validate, ChooseBest
from automl.hyperparam.templates import (random_forest_hp_space, 
                                         knn_hp_space, svc_kernel_hp_space, 
                                         grad_boosting_hp_space, 
                                         xgboost_hp_space)
from automl.feature.generators import FormulaFeatureGenerator, PolynomialGenerator
from automl.feature.selector import FeatureSelector
from automl.hyperparam.hyperopt import Hyperopt
from automl.combinators import RandomChoice

logging.basicConfig(level=logging.INFO)
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]



# Preprocessing


No matter how automated our process is, the data may still need some preprocessing. Good old feature engineering can also help a lot. We skip the exploratory data analysis and feature engineering stages for brevity. If you are interested, we suggest looking up some examples in the contest's [kernels](https://www.kaggle.com/c/titanic/kernels).

In [3]:
from sklearn.preprocessing import LabelEncoder

def preprocess_data(df):
    """Preprocess data and create AutoML Dataset"""
    encoder = LabelEncoder()
    result = df.copy()
    
    # drop columns we won't be using
    result.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    
    # transform Sex column into numeric categories
    result['Sex'] = encoder.fit_transform(result['Sex'])
    
    # do the same with Embarked column
    result['Embarked'] = encoder.fit_transform(result['Embarked'].astype(str))
    
    # Replace missing Ages with the median value and "pack"
    # Age into 10 equal-width bins. For example, with ages
    # spanning 0-100, all ages from 0-10 would be packed into bin 0.
    result['Age'] = result['Age'].fillna(result['Age'].median())
    result['Age'] = pd.cut(result['Age'], 10, labels=range(0,10)).astype(int)
    
    # Pack Fare into 10 bins
    result['Fare'] = pd.cut(result['Fare'], 10, labels=range(0,10)).astype(int)
    
    # transform Pclass type to int
    result['Pclass'] = result['Pclass'].astype(int)
    
    # add some useful predictive features that may come
    # to mind during data analysis
    result['FamilySize'] = result['SibSp'] + result['Parch'] + 1
    result['IsAlone'] = 0
    result.loc[result['FamilySize'] == 1, 'IsAlone'] = 1
    
    return Dataset(result.drop(['Survived'], axis=1),
                   result['Survived'])

dataset = preprocess_data(data)
print(f"Features: {dataset.columns}")

Features: ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'FamilySize', 'IsAlone']
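
To make the binning step concrete, here is a minimal standard-library sketch of equal-width binning with right-closed intervals, similar in spirit to what `pd.cut(..., 10)` does above. The helper name `equal_width_bin` and the sample ages are illustrative assumptions, and the sketch ignores the small lower-edge adjustment that pandas applies.

```python
import math


def equal_width_bin(values, n_bins):
    """Assign each value to one of n_bins equal-width, right-closed bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into bin 0.
        return [0 for _ in values]
    bins = []
    for v in values:
        # Right-closed intervals: a value on an edge belongs to the lower bin.
        idx = math.ceil((v - lo) / width) - 1
        # The minimum value would get index -1, so clamp it into bin 0.
        bins.append(min(max(idx, 0), n_bins - 1))
    return bins


# Hypothetical ages spanning 0-80, so each bin is 8 years wide.
ages = [0.0, 8.0, 8.1, 40.0, 80.0]
print(equal_width_bin(ages, 10))  # [0, 0, 1, 4, 9]
```

Note that the bin width depends on the observed range of the column, so the same age can land in a different bin on a different dataset.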


Now let's fit a simple XGBoost model with default parameters and see how it scores on the dataset.

In [4]:
from sklearn.model_selection import cross_val_score
xgb_model = XGBClassifier()
np.mean(cross_val_score(xgb_model, dataset.data, dataset.target))

0.81144781144781142

OK, 81% accuracy with the defaults. Let's move on to AutoML pipelines and see if we can improve the results.
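
The pipelines in this notebook are composed with the `>>` and `<<` operators. As a rough illustration of how such chaining can work in Python, here is a small sketch built on `__rshift__` and `__lshift__`. The classes `Step`, `MiniPipeline` and `MiniExecutor` are hypothetical and are not AutoML's actual implementation; in particular, feeding each epoch's output into the next epoch is an assumption of this sketch.

```python
class Step:
    """A named pipeline step wrapping a function (hypothetical helper)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func


class MiniPipeline:
    """Collects steps with >> and runs them in order."""

    def __init__(self):
        self.steps = []

    def __rshift__(self, step):
        # pipeline >> step appends the step and returns the pipeline,
        # so calls can be chained left to right.
        self.steps.append(step)
        return self

    def run(self, data):
        for step in self.steps:
            data = step.func(data)
        return data


class MiniExecutor:
    """Runs a pipeline on some data for a number of epochs."""

    def __init__(self, data, epochs=1):
        self.data = data
        self.epochs = epochs

    def __lshift__(self, pipeline):
        # executor << pipeline runs the pipeline `epochs` times,
        # passing each epoch's output to the next epoch (an assumption).
        data = self.data
        for _ in range(self.epochs):
            data = pipeline.run(data)
        return data


result = MiniExecutor([1, 2, 3], epochs=2) << (
    MiniPipeline()
    >> Step("double", lambda xs: [2 * x for x in xs])
    >> Step("increment", lambda xs: [x + 1 for x in xs])
)
print(result)  # [7, 11, 15]
```

The parentheses around the pipeline matter: `<<` and `>>` share the same precedence, so without them the executor would consume `MiniPipeline()` before any step is attached.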

In [7]:
# Next, we define our ModelSpace. ModelSpace is initialized by a list of tuples.
# First element of each tuple should be an sklearn-like estimator with fit method
# The second one is model parameter dictionary. Here we do not define parameters 
# explicitly, but use hyperparameter templates from AutoML. Those templates can be
# used later by Hyperopt step to find best model parameters automatically
model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (KNeighborsClassifier, knn_hp_space(lambda key: key)),
      (XGBClassifier, xgboost_hp_space())
  ]


# Create executor, initialize it with our classification dataset 
# and set total number of epochs to 2 (the pipeline will be run two times in a row).
# We can load any pipeline into executor using << operator like below:
context, pipeline_data = LocalExecutor(dataset, epochs=2) << \
    (Pipeline() # Here we define the pipeline. Steps can be added to pipeline using >> operator
     # First we define our ModelSpace. We wrap it with PipelineStep class 
     # and set initializer=True so that ModelSpace step will be run only at the first epoch
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     # But we are not obliged to wrap all steps with PipelineStep.
     # This will be done automatically if we do not need to set any special parameters 
     # We use PolynomialGenerator to create polynomial combinations of the features from the dataset
     >> PolynomialGenerator()
     # Next we use Hyperopt to find the best combination of hyperparameters for each model
     # We use cross-validation (CV) with accuracy metric as a score function.
     # Validate could be used instead of CV to perform test set validation
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=20)
     # Then we choose the best performing model we found
     >> ChooseBest(1)
     # And select 20 best features
     >> FeatureSelector(20))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'SklearnFeatureGenerator'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x10ed188d0>, 'max_features': <hyperopt.pyll.base.Apply object at 0x112000c88>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x112000dd8>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x112000550>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x112000438>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x1120002b0>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004746 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transfo

hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005660 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.004721 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.004724 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.004514 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005353 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005894 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.178451
hyperopt.tpe - INFO - tpe_transform took 0.005794 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.176207
hyperopt.tpe - INFO - tpe_transform took 0.00440

hyperopt.tpe - INFO - tpe_transform took 0.004364 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.210999
hyperopt.tpe - INFO - tpe_transform took 0.002823 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.210999
hyperopt.tpe - INFO - tpe_transform took 0.005069 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.210999
hyperopt.tpe - INFO - tpe_transform took 0.002273 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.209877
hyperopt.tpe - INFO - tpe_transform took 0.003099 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.209877
hyperopt.tpe - INFO - tpe_transform took 0.002829 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.209877
hyperopt.tpe - INFO - tpe_transform took 0.002238 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best loss 0.209877
hyperopt.tpe - INFO - tpe_transform took 0.003078 seconds
hyperopt.tpe - INFO - TPE using 16/16 trials with best 

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=11, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=2935, n_jobs=1,
            oob_score=False, random_state=0, verbose=False,
            warm_start=False) 0.8181818181818182
(891, 20)





OK, we have reached slightly better accuracy than the default XGBoost model on the plain dataset. However, while the first pipeline we launched did a good job of generating various features for our dataset, it was not really designed to search for the best model. Now let's create a pipeline that searches for the best model on a fixed dataset.

Please note that increasing the `max_evals` parameter for `Hyperopt` can lead to finding better model parameters, but we use modest values here for demonstration purposes.
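
To illustrate why a larger evaluation budget can help, here is a minimal sketch that uses plain random search instead of Hyperopt's TPE algorithm. The function `random_search`, the toy loss and the search space are all hypothetical; only the idea of a `max_evals` budget comes from the pipeline itself.

```python
import random


def random_search(objective, space, max_evals, seed=0):
    """Sample max_evals random parameter sets and keep the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        # Draw each parameter uniformly from its (low, high) range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_loss(params):
    # Hypothetical loss with its minimum at x = 3.
    return (params["x"] - 3.0) ** 2


space = {"x": (0.0, 10.0)}
_, loss_small = random_search(toy_loss, space, max_evals=5)
_, loss_large = random_search(toy_loss, space, max_evals=50)
print(loss_small, loss_large)
```

Because both runs share the same seed, the larger run sees every sample of the smaller one plus 45 more, so its best loss can only be equal or lower. With different seeds this holds only on average.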

In [6]:
context, pipeline_data = LocalExecutor(pipeline_data.dataset, epochs=1) << \
    (Pipeline()
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=50)
     >> ChooseBest(1))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/3 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x110afd710>, 'max_features': <hyperopt.pyll.base.Apply object at 0x110b129e8>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x110b12278>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x110b12630>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x110b126a0>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x110ae5f60>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004085 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.004677 seconds
hyperopt.tpe - INFO - TPE using 1/1 t

hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002789 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002775 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002217 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002209 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002209 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.003297 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002305 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.002389 se

hyperopt.tpe - INFO - tpe_transform took 0.005166 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.206510
hyperopt.tpe - INFO - tpe_transform took 0.005136 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.206510
hyperopt.tpe - INFO - tpe_transform took 0.004515 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.004597 seconds
hyperopt.tpe - INFO - TPE using 16/16 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.006371 seconds
hyperopt.tpe - INFO - TPE using 17/17 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.006206 seconds
hyperopt.tpe - INFO - TPE using 18/18 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.005871 seconds
hyperopt.tpe - INFO - TPE using 19/19 trials with best loss 0.204265
hyperopt.tpe - INFO - tpe_transform took 0.005204 seconds
hyperopt.tpe - INFO - TPE using 20/20 trials with bes

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=3, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=77, n_jobs=1,
            oob_score=False, random_state=1, verbose=False,
            warm_start=False) 0.8047138047138048
(891, 20)





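A good run of this pipeline gives a nice improvement, enough to get into the top 1% in the Kaggle Titanic demo competition. Note that the search is randomized: the run shown above scored about 0.80, slightly below the 0.82 from the first pipeline.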

## Voting feature selection

AutoML also allows you to select features that perform well for most models in the model space. Features that have greater importance in multiple models will have greater weight.


The first $N$ features are selected from the following well-ordered set:

$\mathbb{F} = \mathrm{softmax}(\mathbb{I}) \circ \mathbb{S}$,

where 
* $\mathbb{I} \subset \mathbb{R}$ represents a model-specific set of feature scores
* $\mathbb{S} \subset \mathbb{R}$ is a set of model scores according to some scoring function $s(x, m): \mathbb{M} \rightarrow \mathbb{R}$ ($x$ is a dataset, $m$ is a model, $\mathbb{M}$ is a model space)
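
One possible reading of the formula above can be sketched in plain Python: each model's feature importances are passed through a softmax, scaled by that model's score, and summed across models before the top $N$ features are taken. The summation across models, the helper names and the numbers below are assumptions made for illustration, not a description of how `VotingFeatureSelector` is implemented.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def vote_features(importances, scores, n):
    """Rank features by score-weighted softmax importance.

    importances: {model name: [importance per feature]}
    scores:      {model name: model score}
    """
    n_features = len(next(iter(importances.values())))
    weights = [0.0] * n_features
    for model, imp in importances.items():
        for i, p in enumerate(softmax(imp)):
            # Each model votes with its softmax importance times its score.
            weights[i] += p * scores[model]
    ranked = sorted(range(n_features), key=lambda i: weights[i], reverse=True)
    return ranked[:n], weights


# Hypothetical importances for three features from two models.
importances = {
    "forest": [0.6, 0.3, 0.1],
    "boosting": [0.5, 0.1, 0.4],
}
scores = {"forest": 0.80, "boosting": 0.82}

selected, weights = vote_features(importances, scores, n=2)
print(selected)  # [0, 2]
```

Feature 2 beats feature 1 here because the better-scoring model ranks it highly, even though the other model prefers feature 1.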

In [10]:
from automl.hyperparam.hyperopt import Hyperopt
from sklearn.linear_model import LogisticRegression
from automl.feature.selector import VotingFeatureSelector
from automl.feature.generators import FormulaFeatureGenerator

model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (LogisticRegression, {}),
      (XGBClassifier, xgboost_hp_space())
  ]

dataset = preprocess_data(data)

context, pipeline_data = LocalExecutor(dataset, epochs=20) << \
    (Pipeline()
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     >> FormulaFeatureGenerator()
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=1)
     >> ChooseBest(4, by_largest_score=False)
     >> VotingFeatureSelector(feature_to_select=5, reverse_score=True))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

print("Selected features:")
for col in pipeline_data.dataset.columns:
    print(col)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 9, new feature number - 10
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x11380c550>, 'max_features': <hyperopt.pyll.base.Apply object at 0x11380cc50>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x11380cf60>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x1158120b8>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x115812208>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x115812320>, 'verbose': False, 'criterion': 'gini'}
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_t

Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004786 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
Hyperopt - INFO - {}
Hyperopt - INFO - {'max_depth': <hyperopt.pyll.base.Apply object at 0x115812978>, 'learning_rate': <hyperopt.pyll.base.Apply object at 0x115812438>, 'n_estimators': <hyperopt.pyll.base.Apply object at 0x115812eb8>, 'gamma': <hyperopt.pyll.base.Apply object at 0x115812748>, 'min_child_weight': <hyperopt.pyll.base.Apply object at 0x1158125f8>, 'max_delta_step': 0, 'subsample': <hyperopt.pyll.base.Apply object at 0x115812b00>, 'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x115812e80>, 'colsample_bylevel': <hyperopt.pyll.base.Apply object at 0x1157e5160>, 'reg_alpha': <hyperopt.pyll.base.Apply object at 0x1157e52e8>, 'reg_lambda': <hyperopt.pyll.base.Apply object at 

Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004019 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
 60%|██████    | 3/5 [00:04<00:03,  1.65s/it]LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - <class 'sklearn.linear_model.logistic.LogisticRegression'> - 0
ChooseBest - INFO - RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.4580048724069996,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=1557, n_jobs=1, oob_score=False, random_state=2,
            verbose=False, warm_start=False) - 0.8013468013468014
ChooseBest - INFO - XGBCl

ChooseBest - INFO - XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.6083115341159874,
       colsample_bytree=0.9958321234616289, gamma=0.008418514642201332,
       learning_rate=0.015280550283312879, max_delta_step=0, max_depth=8,
       min_child_weight=6, missing=None, n_estimators=400, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.00018721999851741222, reg_lambda=1.1821199127464457,
       scale_pos_weight=1, seed=4, silent=True,
       subsample=0.8433763925951976) - 0.8226711560044894
LocalExecutor - INFO - Running step 'VotingFeatureSelector'
100%|██████████| 5/5 [00:06<00:00,  1.23s/it]
LocalExecutor - INFO - Starting AutoML Epoch #8
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new f

LocalExecutor - INFO - Running step 'VotingFeatureSelector'
100%|██████████| 5/5 [00:03<00:00,  1.58it/s]
LocalExecutor - INFO - Starting AutoML Epoch #10
  0%|          | 0/5 [00:00<?, ?it/s]LocalExecutor - INFO - Running step 'model space'
PipelineStep - INFO - Initializer step model space was already run, skipping
LocalExecutor - INFO - Running step 'FormulaFeatureGenerator'
FormulaFeatureGenerator - INFO - Generated new features. Old feature number - 5, new feature number - 6
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - {'n_estimators': <hyperopt.pyll.base.Apply object at 0x11380c550>, 'max_features': <hyperopt.pyll.base.Apply object at 0x11380cc50>, 'max_depth': <hyperopt.pyll.base.Apply object at 0x11380cf60>, 'min_samples_split': 2, 'min_samples_leaf': <hyperopt.pyll.base.Apply object at 0x1158120b8>, 'bootstrap': <hyperopt.pyll.base.Apply object at 0x115812208>, 'oob_score': False, 'n_jobs': 1, 'random_state': <hyperopt.pyll.base.Apply object at 0x115812320>

Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004273 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
Hyperopt - INFO - {}
Hyperopt - INFO - {'max_depth': <hyperopt.pyll.base.Apply object at 0x115812978>, 'learning_rate': <hyperopt.pyll.base.Apply object at 0x115812438>, 'n_estimators': <hyperopt.pyll.base.Apply object at 0x115812eb8>, 'gamma': <hyperopt.pyll.base.Apply object at 0x115812748>, 'min_child_weight': <hyperopt.pyll.base.Apply object at 0x1158125f8>, 'max_delta_step': 0, 'subsample': <hyperopt.pyll.base.Apply object at 0x115812b00>, 'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x115812e80>, 'colsample_bylevel': <hyperopt.pyll.base.Apply object at 0x1157e5160>, 'reg_alpha': <hyperopt.pyll.base.Apply object at 0x1157e52e8>, 'reg_lambda': <hyperopt.pyll.base.Apply object at 

Hyperopt - INFO - {'max_depth': <hyperopt.pyll.base.Apply object at 0x115812978>, 'learning_rate': <hyperopt.pyll.base.Apply object at 0x115812438>, 'n_estimators': <hyperopt.pyll.base.Apply object at 0x115812eb8>, 'gamma': <hyperopt.pyll.base.Apply object at 0x115812748>, 'min_child_weight': <hyperopt.pyll.base.Apply object at 0x1158125f8>, 'max_delta_step': 0, 'subsample': <hyperopt.pyll.base.Apply object at 0x115812b00>, 'colsample_bytree': <hyperopt.pyll.base.Apply object at 0x115812e80>, 'colsample_bylevel': <hyperopt.pyll.base.Apply object at 0x1157e5160>, 'reg_alpha': <hyperopt.pyll.base.Apply object at 0x1157e52e8>, 'reg_lambda': <hyperopt.pyll.base.Apply object at 0x1157e5470>, 'scale_pos_weight': 1, 'base_score': 0.5, 'seed': <hyperopt.pyll.base.Apply object at 0x1157e5518>}
Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004559 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperop

Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.005034 seconds
hyperopt.tpe - INFO - TPE using 0 trials
Hyperopt - INFO - Reversing best score bask to original form as reverse_score=True
 60%|██████    | 3/5 [00:01<00:01,  1.68it/s]LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - <class 'sklearn.linear_model.logistic.LogisticRegression'> - 0
ChooseBest - INFO - RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=4, max_features=0.1915193703167708,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=35,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=246, n_jobs=1, oob_score=False, random_state=0,
            verbose=False, warm_start=False) - 0.7934904601571269
ChooseBest - INFO - XGBClassi

ChooseBest - INFO - Final model scores:
ChooseBest - INFO - <class 'sklearn.linear_model.logistic.LogisticRegression'> - 0
ChooseBest - INFO - RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.32984866125793166,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=7,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=447, n_jobs=1, oob_score=False, random_state=2,
            verbose=False, warm_start=False) - 0.7867564534231201
ChooseBest - INFO - XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.6411443264170507,
       colsample_bytree=0.714607443333956, gamma=0.0010914663424632753,
       learning_rate=0.0012251494596873583, max_delta_step=0, max_depth=5,
       min_child_weight=23, missing=None, n_estimators=4000, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
   

ChooseBest - INFO - XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.9409202069976053,
       colsample_bytree=0.9178224275088251, gamma=0.0008887275353809392,
       learning_rate=0.033426701248356304, max_delta_step=0, max_depth=6,
       min_child_weight=4, missing=None, n_estimators=4600, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.008681998079279135, reg_lambda=2.202762869564714,
       scale_pos_weight=1, seed=2, silent=True,
       subsample=0.7584289020642088) - 0.7946127946127945
LocalExecutor - INFO - Running step 'VotingFeatureSelector'
100%|██████████| 5/5 [00:03<00:00,  1.40it/s]

<class 'sklearn.linear_model.logistic.LogisticRegression'> 0
RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=242, n_jobs=1,
            oob_score=False, random_state=4, verbose=False,
            warm_start=False) 0.7890011223344556
XGBClassifier(base_score=0.5, booster='gbtree',
       colsample_bylevel=0.9409202069976053,
       colsample_bytree=0.9178224275088251, gamma=0.0008887275353809392,
       learning_rate=0.033426701248356304, max_delta_step=0, max_depth=6,
       min_child_weight=4, missing=None, n_estimators=4600, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0.008681998079279135, reg_lambda=2.202762869564714,
       scale_pos_weight=1, seed=2, silent=True,
   




5 features were selected from an initial set of 9.

# Reproducible preprocessing
After you're done with the AutoML model search, it may be useful to reproduce the resulting feature generation process.

In [13]:
from automl.feature.generators import Preprocessing

original_dataset = preprocess_data(data)

# Let's recreate all the useful features found in the AutoML Pipeline
preprocessing = Preprocessing()
final_data = preprocessing.reproduce(pipeline_data.dataset, original_dataset)
final_data

array([[  4.        ,   2.        ,   1.33333337,   2.66666675,   8.        ],
       [  0.        ,   0.        ,   0.        ,   4.        ,   0.        ],
       [  0.        ,   0.        ,   0.        ,   1.        ,   0.        ],
       ..., 
       [  0.        ,   0.        ,   0.        ,   1.        ,   0.        ],
       [  6.        ,   3.        ,   9.        ,   6.        ,  18.        ],
       [  6.        ,   3.        ,   3.        ,   4.        ,  18.        ]], dtype=float32)
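
The general idea of replaying a feature recipe on fresh data can be sketched as follows. This is a hypothetical illustration of the concept only: `apply_recipe`, the recipe format and the sample rows are assumptions and do not reflect how `Preprocessing.reproduce` works internally.

```python
def apply_recipe(rows, recipe):
    """Build one output row per input row by applying each recipe function.

    rows:   list of dicts mapping column name to value
    recipe: list of (feature name, function taking a row dict)
    """
    return [[func(row) for _name, func in recipe] for row in rows]


# A recipe records how each selected feature is derived from raw columns.
recipe = [
    ("FamilySize", lambda r: r["SibSp"] + r["Parch"] + 1),
    ("Pclass*Sex", lambda r: r["Pclass"] * r["Sex"]),
]

# Hypothetical rows standing in for training data and unseen data.
train_rows = [{"SibSp": 1, "Parch": 0, "Pclass": 3, "Sex": 1}]
new_rows = [{"SibSp": 0, "Parch": 2, "Pclass": 1, "Sex": 0}]

print(apply_recipe(train_rows, recipe))  # [[2, 3]]
print(apply_recipe(new_rows, recipe))    # [[3, 0]]
```

Keeping the recipe separate from the data is what makes it possible to apply the same transformations to a test set later.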