# Onepanel AutoML 0.1a - Kaggle Dataset Example

Here we use AutoML to solve a classification task on the classic [Titanic](https://www.kaggle.com/c/titanic) dataset from Kaggle. First, let's load the data.

In [1]:
import pandas as pd

data = pd.read_csv('./data/train.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [23]:
import sys
import numpy as np
# AutoML uses Python's logging module
import logging

# Various sklearn models and metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, make_scorer
from xgboost.sklearn import XGBClassifier

# AutoML classes
from automl.pipeline import LocalExecutor, Pipeline, PipelineStep, PipelineData
from automl.data.dataset import Dataset
from automl.model import ModelSpace, CV, Validate, ChooseBest
from automl.hyperparam.templates import (random_forest_hp_space, 
                                         knn_hp_space, svc_hp_space, 
                                         grad_boosting_hp_space, 
                                         xgboost_hp_space)
from automl.feature.generators import FormulaFeatureGenerator, PolynomialGenerator
from automl.feature.selector import FeatureSelector
from automl.hyperparam.hyperopt import Hyperopt
from automl.combinators import RandomChoice

logging.basicConfig(level=logging.INFO)
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# handler.setLevel(logging.DEBUG)

# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)

# Set STDERR handler as the only handler 
logger.handlers = [handler]

# Preprocessing


No matter how automated our process is, the data may still need some preprocessing. Good old feature engineering can also help a lot. We skip the exploratory data analysis and feature engineering stages for brevity. If you are interested, we suggest looking up some examples in the contest's [kernels](https://www.kaggle.com/c/titanic/kernels).

In [28]:
from sklearn.preprocessing import LabelEncoder

def preprocess_data(df):
    """Preprocess data and create AutoML Dataset"""
    encoder = LabelEncoder()
    result = df.copy()
    
    # drop columns we won't be using
    result.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
    
    # transform Sex column into numeric categories
    result['Sex'] = encoder.fit_transform(result['Sex'])
    
    # do the same with Embarked column
    result['Embarked'] = encoder.fit_transform(result['Embarked'].astype(str))
    
    # Replace missing Ages with the median value and "pack"
    # Age into 10 equal-width bins spanning the observed age range.
    # For example, ages spanning 0-80 give bins about 8 years wide.
    result['Age'] = result['Age'].fillna(result['Age'].median())
    result['Age'] = pd.cut(result['Age'], 10, labels=range(0,10)).astype(int)
    
    # Pack Fare into 10 bins
    result['Fare'] = pd.cut(result['Fare'], 10, labels=range(0,10)).astype(int)
    
    # transform Pclass type to int
    result['Pclass'] = result['Pclass'].astype(int)
    
    # add some useful predictive features that may come
    # to mind during data analysis
    result['FamilySize'] = result['SibSp'] + result['Parch'] + 1
    result['IsAlone'] = 0
    result.loc[result['FamilySize'] == 1, 'IsAlone'] = 1
    
    return Dataset(result.drop(['Survived'], axis=1),
                   result['Survived'])

dataset = preprocess_data(data)
dataset.data.head()

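| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | IsAlone |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 2 | 1 | 0 | 0 | 2 | 2 | 0 |
| 1 | 1 | 0 | 4 | 1 | 0 | 1 | 0 | 2 | 0 |
| 2 | 3 | 0 | 3 | 0 | 0 | 0 | 2 | 1 | 1 |
| 3 | 1 | 0 | 4 | 1 | 0 | 1 | 2 | 2 | 0 |
| 4 | 3 | 1 | 4 | 0 | 0 | 0 | 2 | 1 | 1 |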
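
The binning step in `preprocess_data` relies on `pd.cut` with an integer bin count, which splits the observed value range into equal-width intervals (not equal-sized groups of passengers). The sketch below uses a small made-up list of ages, chosen only so that the range runs from 0.42 to 80, to show which bin index each value lands in.

```python
import pandas as pd

# Illustrative ages only (not read from train.csv)
ages = pd.Series([0.42, 5.0, 9.0, 22.0, 35.0, 80.0])

# Same call shape as in preprocess_data: 10 equal-width bins labelled 0..9
binned = pd.cut(ages, 10, labels=range(0, 10))
bin_ids = binned.astype(int).tolist()

# Each bin covers one tenth of the observed range
width = (ages.max() - ages.min()) / 10

print(bin_ids)
print(round(width, 3))
```

Note that age 9 already falls into bin 1 here, because the bin width is set by the observed range rather than by round decades.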


Now let's fit a simple XGBoost model with default parameters and see how it scores on the dataset.

In [4]:
from sklearn.model_selection import cross_val_score
xgb_model = XGBClassifier()
np.mean(cross_val_score(xgb_model, dataset.data, dataset.target))

0.81940713961728329

OK, about 82% accuracy with the defaults. Let's move on to AutoML pipelines and see if we can improve the results.
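
When `cross_val_score` is called without `scoring`, it uses the estimator's own `score` method, which is mean accuracy for classifiers. The default number of folds has changed between scikit-learn releases (3 before version 0.22, 5 since), so baseline numbers are easier to compare when both are spelled out. The sketch below shows this on synthetic data with a `RandomForestClassifier` standing in for XGBoost; the data and model settings are assumptions for illustration, not the Titanic setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features (not the real data)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# State the fold count and the metric explicitly instead of relying on defaults
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
mean_accuracy = scores.mean()

print(scores.shape, round(mean_accuracy, 3))
```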

In [24]:
# Next, we define our ModelSpace. ModelSpace is initialized with a list of tuples.
# The first element of each tuple should be an sklearn-like estimator with a fit method.
# The second is a dictionary of model parameters. Here we do not define the parameters
# explicitly but use hyperparameter templates from AutoML. The Hyperopt step can
# later use those templates to find the best model parameters automatically.
model_list = [
      (RandomForestClassifier, random_forest_hp_space()),
      (KNeighborsClassifier, knn_hp_space(lambda key: key)),
      (XGBClassifier, xgboost_hp_space())
  ]


# Create executor, initialize it with our classification dataset 
# and set total number of epochs to 2 (the pipeline will be run two times in a row).
# We can load any pipeline into executor using << operator like below:
context, pipeline_data = LocalExecutor(dataset, epochs=2) << \
    (Pipeline() # Here we define the pipeline. Steps can be added to pipeline using >> operator
     # First we define our ModelSpace. We wrap it with PipelineStep class 
     # and set initializer=True so that ModelSpace step will be run only at the first epoch
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     # But we are not obliged to wrap all steps with PipelineStep.
     # This will be done automatically if we do not need to set any special parameters 
     # We use PolynomialGenerator to create polynomial combinations of the features from the dataset
     >> PolynomialGenerator()
     # Next we use Hyperopt to find the best combination of hyperparameters for each model.
     # We use cross-validation (CV) with the accuracy metric as the score function.
     # Validate could be used instead of CV to perform test set validation
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=20)
     # Then we choose the best performing model we found
     >> ChooseBest(1)
     # And select the 20 best features
     >> FeatureSelector(20))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/5 [00:00<?, ?it/s]
LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'SklearnFeatureGenerator'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.005084 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.005467 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.199776
hyperopt.tpe - INFO - tpe_transform took 0.008861 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.177329
hyperopt.tpe - INFO - tpe_transform took 0.007315 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.177329
hyperopt.tpe - INFO - tpe_transform took 0.006723 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.177329
hyperopt.tpe - INFO - tpe_transform took 0

hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.191919
hyperopt.tpe - INFO - tpe_transform took 0.006910 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.191919
hyperopt.tpe - INFO - tpe_transform took 0.006010 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.191919
hyperopt.tpe - INFO - tpe_transform took 0.006188 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.191919
hyperopt.tpe - INFO - tpe_transform took 0.006788 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.184063
hyperopt.tpe - INFO - tpe_transform took 0.006775 seconds
hyperopt.tpe - INFO - TPE using 13/13 trials with best loss 0.184063
hyperopt.tpe - INFO - tpe_transform took 0.006480 seconds
hyperopt.tpe - INFO - TPE using 14/14 trials with best loss 0.184063
hyperopt.tpe - INFO - tpe_transform took 0.008462 seconds
hyperopt.tpe - INFO - TPE using 15/15 trials with best loss 0.184063
hyperopt.tpe - INFO - tpe_transform took 0.007

hyperopt.tpe - INFO - tpe_transform took 0.004017 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials with best loss 0.216611
hyperopt.tpe - INFO - tpe_transform took 0.003805 seconds
hyperopt.tpe - INFO - TPE using 6/6 trials with best loss 0.216611
hyperopt.tpe - INFO - tpe_transform took 0.004434 seconds
hyperopt.tpe - INFO - TPE using 7/7 trials with best loss 0.216611
hyperopt.tpe - INFO - tpe_transform took 0.004521 seconds
hyperopt.tpe - INFO - TPE using 8/8 trials with best loss 0.198653
hyperopt.tpe - INFO - tpe_transform took 0.004610 seconds
hyperopt.tpe - INFO - TPE using 9/9 trials with best loss 0.198653
hyperopt.tpe - INFO - tpe_transform took 0.004199 seconds
hyperopt.tpe - INFO - TPE using 10/10 trials with best loss 0.198653
hyperopt.tpe - INFO - tpe_transform took 0.003879 seconds
hyperopt.tpe - INFO - TPE using 11/11 trials with best loss 0.198653
hyperopt.tpe - INFO - tpe_transform took 0.003779 seconds
hyperopt.tpe - INFO - TPE using 12/12 trials with best loss 0.1

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=3, max_features=0.40674571689587147,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=2,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=37, n_jobs=1, oob_score=False, random_state=2,
            verbose=False, warm_start=False) 0.8249158249158249
(891, 19)
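
The `>>` and `<<` operators in the cell above are ordinary Python operator overloads. The sketch below shows one way such chaining could work, using the hypothetical classes `Step`, `MiniPipeline` and `MiniExecutor`. These are not AutoML's classes, and AutoML's actual implementation may differ; the `initializer` flag here only mimics the "run in the first epoch only" behaviour described in the comments.

```python
class Step:
    """Hypothetical pipeline step wrapping a function applied to the data."""
    def __init__(self, name, func, initializer=False):
        self.name = name
        self.func = func
        self.initializer = initializer


class MiniPipeline:
    """Hypothetical pipeline that collects steps via the >> operator."""
    def __init__(self):
        self.steps = []

    def __rshift__(self, step):
        # pipeline >> step appends the step and returns the pipeline,
        # so several >> calls can be chained
        self.steps.append(step)
        return self


class MiniExecutor:
    """Hypothetical executor that runs a pipeline via the << operator."""
    def __init__(self, data, epochs=1):
        self.data = data
        self.epochs = epochs

    def __lshift__(self, pipeline):
        value = self.data
        for epoch in range(self.epochs):
            for step in pipeline.steps:
                # initializer steps run in the first epoch only
                if step.initializer and epoch > 0:
                    continue
                value = step.func(value)
        return value


result = MiniExecutor([1, 2, 3], epochs=2) << (
    MiniPipeline()
    >> Step('offset', lambda xs: [x + 10 for x in xs], initializer=True)
    >> Step('double', lambda xs: [2 * x for x in xs])
)
print(result)  # [44, 48, 52]
```

The parentheses around the pipeline matter: `<<` and `>>` share the same precedence and associate left to right, so without them the executor would be applied to a bare `MiniPipeline()` before any step is attached.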


OK, we have reached better accuracy than the default XGBoost model on the plain dataset. The first pipeline did a good job of generating features for our dataset, but it was not really designed to search for the best model. Now let's create a pipeline that searches for the best model on a fixed dataset.

Please note that increasing the `max_evals` parameter for `Hyperopt` can lead to better model parameters, but we use modest values here for demonstration purposes.
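
To see why a larger evaluation budget can only help the best result found, here is a minimal sketch that uses plain random search rather than the TPE algorithm Hyperopt runs. The `toy_loss` function and the `depth` parameter are hypothetical stand-ins for a real cross-validated loss. The `best loss` values in the logs look like 1 - accuracy, but that is an assumption the notebook does not confirm.

```python
import random


def toy_loss(depth):
    """Hypothetical loss (think 1 - accuracy) with its minimum at depth == 6."""
    return 0.17 + 0.002 * (depth - 6) ** 2


def random_search(max_evals, seed=0):
    """Sample `max_evals` candidate depths and keep the best one seen."""
    rng = random.Random(seed)
    best_depth, best_loss = None, float('inf')
    for _ in range(max_evals):
        depth = rng.randint(1, 30)
        loss = toy_loss(depth)
        if loss < best_loss:
            best_depth, best_loss = depth, loss
    return best_depth, best_loss


# With the same seed, the 50-trial run repeats the first 20 trials and then
# continues, so its best loss can never be worse than the 20-trial result
_, loss_20 = random_search(20)
_, loss_50 = random_search(50)
print(loss_50 <= loss_20)  # True
```

The extra trials cost time in proportion to their number, which is why the notebook keeps `max_evals` modest.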

In [27]:
context, pipeline_data = LocalExecutor(pipeline_data.dataset, epochs=1) << \
    (Pipeline()
     >> PipelineStep('model space', ModelSpace(model_list), initializer=True)
     >> Hyperopt(CV(scoring=make_scorer(accuracy_score)), max_evals=50)
     >> ChooseBest(1))

for result in pipeline_data.return_val:
    print(result.model, result.score)
print(pipeline_data.dataset.data.shape)

LocalExecutor - INFO - Starting AutoML Epoch #1
  0%|          | 0/3 [00:00<?, ?it/s]
LocalExecutor - INFO - Running step 'model space'
LocalExecutor - INFO - Running step 'Hyperopt'
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.ensemble.forest.RandomForestClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004551 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.005586 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.007511 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.205387
hyperopt.tpe - INFO - tpe_transform took 0.006692 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.199776
hyperopt.tpe - INFO - tpe_transform took 0.006507 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with best loss 0.182941
hyperopt.tpe - INFO - tpe_transform took 0.005235 seconds
hyperopt.tpe - INFO - TPE using 5/5 trials wit

Hyperopt - INFO - 0.7901234567901234
Hyperopt - INFO - 0.7901234567901234
Hyperopt - INFO - 0.7890011223344557
Hyperopt - INFO - 0.7890011223344556
Hyperopt - INFO - 0.7890011223344556
Hyperopt - INFO - 0.7878787878787877
Hyperopt - INFO - 0.7867564534231201
Hyperopt - INFO - 0.7789001122334455
Hyperopt - INFO - Running hyperparameter optimization for <class 'sklearn.neighbors.classification.KNeighborsClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.004069 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.004954 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.290685
hyperopt.tpe - INFO - tpe_transform took 0.006655 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.199776
hyperopt.tpe - INFO - tpe_transform took 0.004256 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with best loss 0.199776
hyperopt.tpe - INFO - tpe_transform took 0.004565 seconds
hyperopt.tpe - INFO - TPE using 4/4 trials with be

Hyperopt - INFO - 0.7867564534231201
Hyperopt - INFO - 0.7845117845117846
Hyperopt - INFO - 0.7845117845117845
Hyperopt - INFO - 0.7789001122334455
Hyperopt - INFO - 0.7093153759820426
Hyperopt - INFO - 0.7093153759820426
Hyperopt - INFO - 0.7093153759820426
Hyperopt - INFO - 0.7070707070707071
Hyperopt - INFO - 0.7070707070707071
Hyperopt - INFO - 0.7070707070707071
Hyperopt - INFO - 0.7070707070707071
Hyperopt - INFO - 0.7070707070707071
Hyperopt - INFO - Running hyperparameter optimization for <class 'xgboost.sklearn.XGBClassifier'>
hyperopt.tpe - INFO - tpe_transform took 0.005085 seconds
hyperopt.tpe - INFO - TPE using 0 trials
hyperopt.tpe - INFO - tpe_transform took 0.006369 seconds
hyperopt.tpe - INFO - TPE using 1/1 trials with best loss 0.223345
hyperopt.tpe - INFO - tpe_transform took 0.007089 seconds
hyperopt.tpe - INFO - TPE using 2/2 trials with best loss 0.175084
hyperopt.tpe - INFO - tpe_transform took 0.005959 seconds
hyperopt.tpe - INFO - TPE using 3/3 trials with bes

Hyperopt - INFO - 0.7934904601571269
Hyperopt - INFO - 0.7934904601571269
Hyperopt - INFO - 0.7912457912457912
Hyperopt - INFO - 0.7901234567901234
Hyperopt - INFO - 0.7901234567901234
Hyperopt - INFO - 0.7890011223344556
Hyperopt - INFO - 0.787878787878788
Hyperopt - INFO - 0.77665544332211
Hyperopt - INFO - 0.7755331088664422
Hyperopt - INFO - 0.7755331088664422
Hyperopt - INFO - 0.6161616161616161
Hyperopt - INFO - 0.6161616161616161
Hyperopt - INFO - 0.6161616161616161
Hyperopt - INFO - 0.6161616161616161
Hyperopt - INFO - 0.6161616161616161
 67%|██████▋   | 2/3 [01:26<00:43, 43.36s/it]
LocalExecutor - INFO - Running step 'ChooseBest'
ChooseBest - INFO - Final model scores:
ChooseBest - INFO - RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features=0.11839601117622323,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weigh

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features=0.11839601117622323,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=13, n_jobs=1, oob_score=False, random_state=4,
            verbose=False, warm_start=False) 0.830527497194164
(891, 19)

We've got a nice improvement. That's enough to get into the top 1% in the Kaggle Titanic demo competition.