# Auto-sklearn

#### Author's description:

auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator

#### Useful links:

[install link](https://automl.github.io/auto-sklearn/master/installation.html),
[git](https://github.com/automl/auto-sklearn),
[manual](https://automl.github.io/auto-sklearn/master/manual.html),
[parallel instances](https://automl.github.io/auto-sklearn/master/examples/example_parallel_manual_spawning.html),
[parallel runs on one machine](https://automl.github.io/auto-sklearn/master/examples/example_parallel_n_jobs.html),
[cross validation](https://automl.github.io/auto-sklearn/master/examples/example_crossvalidation.html),
[feature types](https://automl.github.io/auto-sklearn/master/examples/example_feature_types.html)

## Install and import

Note that we install with Python's `subprocess` module rather than Jupyter's **!** shell syntax. Domino can run these notebooks as [jobs](https://support.dominodatalab.com/hc/en-us/articles/360023696651-Jobs) (batch or scheduled), which turns the notebook into an executable script. For that to work, every cell must contain code that can run in a plain .py file.
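
Since shell commands have to go through Python for the notebook to run as a script, a small wrapper can make them easier to reuse. The following is a minimal sketch, not part of the original notebook: the helper name `run_command` is hypothetical, and it folds stderr into stdout and returns the exit code so that a failed command is visible in job logs.

```python
import subprocess
import sys


def run_command(args):
    """Run a command without a shell and return (exit_code, combined_output)."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so errors are not lost
    )
    return completed.returncode, completed.stdout.decode("utf-8")


# Harmless demonstration: ask the current interpreter to print a message.
code, output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(code, output.strip())
```

The install step could then be written as `run_command([sys.executable, '-m', 'pip', 'install', 'auto-sklearn'])`, which targets the interpreter running the notebook. Whether `sudo` is still required depends on the environment.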

In [1]:
import subprocess

completed = subprocess.run(['sudo', 'pip', 'install', 'auto-sklearn'], \
                           stdout=subprocess.PIPE,)
print(completed.stdout.decode('utf-8'))

Collecting auto-sklearn
  Downloading auto-sklearn-0.6.0.tar.gz (3.9 MB)
Collecting scikit-learn<0.22,>=0.21.0
  Downloading scikit_learn-0.21.3-cp36-cp36m-manylinux1_x86_64.whl (6.7 MB)
Collecting lockfile
  Downloading lockfile-0.12.2-py2.py3-none-any.whl (13 kB)
Collecting joblib
  Downloading joblib-0.14.1-py2.py3-none-any.whl (294 kB)
Collecting liac-arff
  Downloading liac-arff-2.4.0.tar.gz (15 kB)
Collecting ConfigSpace<0.5,>=0.4.0
  Downloading ConfigSpace-0.4.12.tar.gz (966 kB)
Collecting pynisher>=0.4.2
  Downloading pynisher-0.5.0.tar.gz (5.0 kB)
Collecting pyrfr<0.9,>=0.7
  Downloading pyrfr-0.8.0.tar.gz (293 kB)
Collecting smac==0.8
  Downloading smac-0.8.0.tar.gz (94 kB)
Collecting sphinx_rtd_theme
  Downloading sphinx_rtd_theme-0.4.3-py2.py3-none-any.whl (6.4 MB)
Building wheels for collected packages: auto-sklearn, liac-arff, ConfigSpace, pynisher, pyrfr, smac
  Building wheel for auto-sklearn (setup.py): started
  Building wheel for auto-sklearn (setup.py): finished wi

In [2]:
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
import pandas as pd
import numpy as np

In [3]:
#the original notebook was created on 0.5.2
autosklearn.__version__

'0.6.0'

In [4]:
#there can be a lot of warnings in auto-sklearn
#especially if you overwrite existing files
#turning off for demo purposes

import warnings
warnings.filterwarnings("ignore")

## Take a look at the classification function

auto-sklearn is mostly a wrapper around scikit-learn. The authors did not intend to give users direct control over details such as the modeling algorithm and typical hyperparameter choices; that control sits several layers deep in the [SMAC](https://automl.github.io/SMAC3/stable/index.html) space and scenario settings. The user can control the time it takes to build the ensemble, the resampling strategy, and the parallelization of the work across CPUs on the machine. These are demonstrated below.
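
The three user-facing controls map onto a handful of constructor arguments. As a rough sketch, the hypothetical helper `build_automl_settings` below (not part of auto-sklearn) gathers them into one dictionary. The keyword names are the ones used in the cells of this notebook; the validation rule is an assumption added for illustration.

```python
def build_automl_settings(total_seconds, per_run_seconds, n_jobs=1,
                          strategy="holdout", strategy_args=None):
    """Collect the time, resampling, and parallelization controls as keyword arguments."""
    if per_run_seconds > total_seconds:
        raise ValueError("per_run_seconds cannot exceed total_seconds")
    if strategy_args is None:
        strategy_args = {"train_size": 0.67}
    return {
        "time_left_for_this_task": total_seconds,   # overall time budget (seconds)
        "per_run_time_limit": per_run_seconds,      # budget for one model fit (seconds)
        "resampling_strategy": strategy,            # e.g. 'holdout' or 'cv'
        "resampling_strategy_arguments": strategy_args,
        "n_jobs": n_jobs,                           # parallel workers on this machine
    }


settings = build_automl_settings(60, 30, n_jobs=4)
print(settings)
# With auto-sklearn installed, the dict could be unpacked into the estimator:
# automl = autosklearn.classification.AutoSklearnClassifier(**settings)
```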

In [5]:
?autosklearn.classification.AutoSklearnClassifier

Init signature:
autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    initial_configurations_via_metalearning=25,
    ensemble_size:int=50,
    ensemble_nbest=50,
    ensemble_memory_limit=1024,
    seed=1,
    ml_memory_limit=3072,
    include_estimators=None,

## Heart Disease

#### Load the heart disease dataset

Note that in this cell we are calling **sklearn.model_selection.train_test_split()** twice and creating two sets of heart disease (hd) data for model fitting and testing. One is for the hd data without one hot encoding (ohe) and the other has the ohe columns. 

auto-sklearn accepts a list of categorical features and has several methods for treating categorical data. In this notebook we try both approaches - building ohe columns ourselves and letting auto-sklearn do its thing.
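
The list of feature types can be derived from the column names instead of being typed by hand. This is a minimal sketch with a hypothetical helper, `build_feat_type`. The labels 'Categorical' and 'Numerical' are the ones this notebook later passes through `feat_type`. Treating `ca` (a 0-3 count) as numerical is an assumption that mirrors that later cell.

```python
def build_feat_type(columns, categorical):
    """Return one 'Categorical'/'Numerical' label per column, in column order."""
    categorical = set(categorical)
    unknown = categorical - set(columns)
    if unknown:
        raise ValueError("unknown categorical columns: %s" % sorted(unknown))
    return ["Categorical" if c in categorical else "Numerical" for c in columns]


feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Assumption: 'ca' is left as numerical; add it to the list to treat it as categorical.
feat_type = build_feat_type(feature_names, ['cp', 'restecg', 'slope', 'thal'])
print(feat_type)
```

The order of the labels must match the column order of the array passed to **fit()**, which is why the helper walks the column list rather than the categorical list.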

In [6]:
'''
/mnt/data/raw/heart.csv

attribute documentation:
      age: age in years
      sex: sex (1 = male; 0 = female)
      cp: chest pain type
        -- Value 1: typical angina
        -- Value 2: atypical angina
        -- Value 3: non-anginal pain
        -- Value 4: asymptomatic
     trestbps: resting blood pressure (in mm Hg on admission to the 
        hospital)
     chol: serum cholesterol in mg/dl
     fbs: (fasting blood sugar > 120 mg/dl)  (1 = true; 0 = false)
     restecg: resting electrocardiographic results
        -- Value 0: normal
        -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST 
                    elevation or depression of > 0.05 mV)
        -- Value 2: showing probable or definite left ventricular hypertrophy
                    by Estes' criteria
     thalach: maximum heart rate achieved
     exang: exercise induced angina (1 = yes; 0 = no)
     oldpeak = ST depression induced by exercise relative to rest
     slope: the slope of the peak exercise ST segment
        -- Value 1: upsloping
        -- Value 2: flat
        -- Value 3: downsloping
     ca: number of major vessels (0-3) colored by fluoroscopy
     thal: 
         3 = normal; 
         6 = fixed defect; 
         7 = reversible defect
     target: diagnosis of heart disease (angiographic disease status)
        -- Value 0: < 50% diameter narrowing
        -- Value 1: > 50% diameter narrowing
 '''

#load and clean the data----------------------

#column names
names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang', \
         'oldpeak','slope','ca','thal','target']

#load data from Domino project directory
hd_data = pd.read_csv("../data/raw/heart.csv", header=None, names=names)

#in case some data comes in as string
#convert to numeric and coerce errors to NaN
for col in hd_data.columns:  # Iterate over chosen columns
    hd_data[col] = pd.to_numeric(hd_data[col], errors='coerce')
    
#drop nulls
hd_data.dropna(inplace=True)

#non-ohe data---------------------------------
   
#load the X and y set as a numpy array
X_hd = hd_data.drop('target', axis=1).values
y_hd = hd_data['target'].values

#build the train and test sets
X_hd_train, X_hd_test, y_hd_train, y_hd_test = \
    sklearn.model_selection.train_test_split(X_hd, y_hd, random_state=1)

#now do ohe-----------------------------------

#function to do one hot encoding for categorical columns
def create_dummies(data, cols, drop1st=True):
    for c in cols:
        dummies_df = pd.get_dummies(data[c], prefix=c, drop_first=drop1st)  
        data=pd.concat([data, dummies_df], axis=1)
        data = data.drop([c], axis=1)
    return data

cat_cols = ['cp', 'restecg', 'slope', 'ca', 'thal']
hd_data = create_dummies(hd_data, cat_cols)
    
#load the X and y set as a numpy array
X_hd_ohe = hd_data.drop('target', axis=1).values
y_hd_ohe = hd_data['target'].values

#build the train and test sets
X_hd_ohe_train, X_hd_ohe_test, y_hd_ohe_train, y_hd_ohe_test = \
    sklearn.model_selection.train_test_split(X_hd_ohe, y_hd_ohe, \
                                             random_state=1)

#### Function to delete the output and temp directories of auto-sklearn

You'll need to clear the previous folders to avoid overwrite errors. Alternatively, you can create new output directories.

In [7]:
def cleanup(directories_, delete_):
    for d in directories_:
        if delete_:
            print('deleting', d)
            completed = subprocess.run(
                ['rm', '-rf', d],
                stdout=subprocess.PIPE,
            )
            print(completed.stdout.decode('utf-8'))
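
The helper above shells out to `rm -rf`, which assumes a Unix-like system. A portable variant, sketched here under the hypothetical name `cleanup_portable`, uses `shutil.rmtree` from the standard library and reports which directories it actually removed. The demonstration runs on throwaway temporary directories rather than on the results folders.

```python
import os
import shutil
import tempfile


def cleanup_portable(directories, delete=True):
    """Remove each directory tree if it exists; return the paths removed."""
    removed = []
    for d in directories:
        if delete and os.path.isdir(d):
            shutil.rmtree(d)
            removed.append(d)
    return removed


# Demonstration on throwaway directories rather than real results folders.
base = tempfile.mkdtemp()
demo_dirs = [os.path.join(base, "tmp_demo"), os.path.join(base, "out_demo")]
for d in demo_dirs:
    os.makedirs(d)
print(cleanup_portable(demo_dirs))
shutil.rmtree(base)
```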

#### Build a model on ohe data with holdout

In [8]:
%%time

#set and clear the output directories
directories = ['../results/tmp_hd_holdout', \
               '../results/out_hd_holdout']
cleanup(directories, True)

#build the auto-sklearn routine
automl_hd_ohe = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
    tmp_folder=directories[0],
    output_folder=directories[1],
    disable_evaluator_output=False,
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.67}
)

#call it
automl_hd_ohe.fit(X_hd_ohe_train, y_hd_ohe_train, \
                  dataset_name='heart_disease')

#save the predictions
predictions_hd_ohe = automl_hd_ohe.predict(X_hd_ohe_test)

deleting ../results/tmp_hd_holdout

deleting ../results/out_hd_holdout

CPU times: user 10.4 s, sys: 412 ms, total: 10.8 s
Wall time: 57.6 s


#### Fitting with autosklearn

A common mistake is to call **fit_ensemble()** after already running **fit()**. **fit()** both optimizes the machine learning models and builds an ensemble out of them. To disable ensembling when running **fit()** (with parallel instances for example) set ensemble_size to 0. Then **fit_ensemble()** would be needed once all models have been built.

To save fitted models, use typical [pickle procedures](https://scikit-learn.org/stable/modules/model_persistence.html#persistence-example).
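
A minimal sketch of that pickle round trip follows. The helper names `save_model` and `load_model` are hypothetical, and a plain dictionary stands in for the fitted estimator so the example runs without auto-sklearn installed.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a fitted estimator (any picklable object) to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an estimator previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be a fitted estimator
# such as automl_hd_ohe.
stand_in = {"name": "heart_disease", "accuracy": 0.75}
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(stand_in, model_path)
print(load_model(model_path) == stand_in)
```

Pickled models should be loaded with the same library versions that wrote them, and only from files you trust.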

#### Metrics

Accuracy, sprint statistics (from **sprint_statistics()**), and model details (from **show_models()**) are available.

Later we will run auto-sklearn in parallel. Note the number of models built here and compare it to the number built with parallelization turned on. 

The model details give you insight into what auto-sklearn is doing under the hood. You can see the modeling algorithm used and all the parameter settings. 
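
To compare run counts between the serial and parallel fits programmatically, the text report can be turned into a dictionary. The parser below is a sketch with a hypothetical name, `parse_sprint_statistics`. It assumes the `key: value` layout printed in this notebook, which may differ in other auto-sklearn versions.

```python
def parse_sprint_statistics(text):
    """Turn the 'key: value' lines of a sprint_statistics() report into a dict."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if not value:          # header line such as 'auto-sklearn results:'
            continue
        try:
            stats[key] = int(value)
        except ValueError:
            try:
                stats[key] = float(value)
            except ValueError:
                stats[key] = value   # keep non-numeric values as text
    return stats


# Sample report text in the layout shown in this notebook's output.
report = """auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.880000
  Number of target algorithm runs: 24
  Number of successful target algorithm runs: 24
"""
stats = parse_sprint_statistics(report)
print(stats["Number of target algorithm runs"])
```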

In [9]:
print('Accuracy:')
print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
                                     predictions_hd_ohe))
print(' ')
print('-----------------------------------------')
print(' ')
print('Sprint Stats:')
print(automl_hd_ohe.sprint_statistics())
print(' ')
print('-----------------------------------------')
print(' ')
print('Model Details:')
print(automl_hd_ohe.show_models())

Accuracy:
0.75
 
-----------------------------------------
 
Sprint Stats:
auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.880000
  Number of target algorithm runs: 24
  Number of successful target algorithm runs: 24
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

 
-----------------------------------------
 
Model Details:
[(0.100000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'categorical_encoding:__choice__': 'no_encoding', 'classifier:__choice__': 'extra_trees', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'fast_ica', 'rescaling:__choice__': 'robust_scaler', 'classifier:extra_trees:bootstrap': 'True', 'classifier:extra_trees:criterion': 'gini', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.9708954776493797, 'classifier:extra_trees:max_lea

#### Do the same thing (build a model on ohe data with holdout) but this time with parallelization turned on

In [10]:
%%time

#set and clear the output directories
directories_parallel = ['../results/tmp_hd_holdout_parallel', \
                        '../results/out_hd_holdout_parallel']
cleanup(directories_parallel, True)

#build the auto-sklearn routine
automl_hd_ohe_p = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
    tmp_folder=directories_parallel[0],
    output_folder=directories_parallel[1],
    disable_evaluator_output=False,
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.67},
    
    #turn on parallelization
    n_jobs=4,
    seed=5,
    
    delete_output_folder_after_terminate=False,
    delete_tmp_folder_after_terminate=False,
)

#call it
automl_hd_ohe_p.fit(X_hd_ohe_train, y_hd_ohe_train, \
                    dataset_name='heart_disease')

#save the predictions
predictions_hd_ohe_p = automl_hd_ohe_p.predict(X_hd_ohe_test)

deleting ../results/tmp_hd_holdout_parallel

deleting ../results/out_hd_holdout_parallel

CPU times: user 3.57 s, sys: 128 ms, total: 3.7 s
Wall time: 57.9 s


In [11]:
print('Accuracy:')
print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
                                     predictions_hd_ohe_p))
print(' ')
print('-----------------------------------------')
print(' ')
print('Sprint Stats:')
print(automl_hd_ohe_p.sprint_statistics())

Accuracy:
0.75
 
-----------------------------------------
 
Sprint Stats:
auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.866667
  Number of target algorithm runs: 55
  Number of successful target algorithm runs: 51
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 4
  Number of target algorithms that exceeded the memory limit: 0



In [12]:
print('Model Details:')
print(automl_hd_ohe_p.show_models())

Model Details:
[(0.320000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'categorical_encoding:__choice__': 'one_hot_encoding', 'classifier:__choice__': 'multinomial_nb', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'liblinear_svc_preprocessor', 'rescaling:__choice__': 'minmax', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'False', 'classifier:multinomial_nb:alpha': 92.44256225709728, 'classifier:multinomial_nb:fit_prior': 'False', 'preprocessor:liblinear_svc_preprocessor:C': 11433.184251681738, 'preprocessor:liblinear_svc_preprocessor:dual': 'False', 'preprocessor:liblinear_svc_preprocessor:fit_intercept': 'True', 'preprocessor:liblinear_svc_preprocessor:intercept_scaling': 1, 'preprocessor:liblinear_svc_preprocessor:loss': 'squared_hinge', 'preprocessor:liblinear_svc_preprocessor:multi_class': 'ovr', 'preprocessor:liblinear_svc_preprocessor:penalty': 'l1', 'preprocessor:liblinear_svc_preprocessor:tol': 0.00014616338472666772},
dataset_

#### Try with feat_type option instead of ohe (still using parallel and holdout)

In [13]:
%%time

#set and clear the output directories
directories_parallel_ft = ['../results/tmp_hd_holdout_parallel_ft', \
                           '../results/out_hd_holdout_parallel_ft']
cleanup(directories_parallel_ft, True)

#build the auto-sklearn routine
automl_hd_ft_p = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
    tmp_folder=directories_parallel_ft[0],
    output_folder=directories_parallel_ft[1],
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default argument setting
    # for AutoSklearnClassifier. It is explicitly specified in this example
    # for demonstration purposes.
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.67},
    n_jobs=4,
    seed=5,
    delete_output_folder_after_terminate=False,
    delete_tmp_folder_after_terminate=False,
)

feat_type = ['Numerical','Numerical','Categorical','Numerical',\
             'Numerical', 'Numerical','Categorical', 'Numerical',\
             'Numerical','Numerical', 'Categorical','Numerical',\
             'Categorical']

#call it
automl_hd_ft_p.fit(X_hd_train, y_hd_train, \
                   dataset_name='heart_disease', feat_type=feat_type)

#save the predictions
predictions_hd_ft_p = automl_hd_ft_p.predict(X_hd_test)

deleting ../results/tmp_hd_holdout_parallel_ft

deleting ../results/out_hd_holdout_parallel_ft

CPU times: user 3.72 s, sys: 352 ms, total: 4.07 s
Wall time: 57.3 s


In [14]:
print('Accuracy:')
print(sklearn.metrics.accuracy_score(y_hd_test, \
                                     predictions_hd_ft_p))
print(' ')
print('-----------------------------------------')
print(' ')
print('Sprint Stats:')
print(automl_hd_ft_p.sprint_statistics())

Accuracy:
0.7894736842105263
 
-----------------------------------------
 
Sprint Stats:
auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.866667
  Number of target algorithm runs: 54
  Number of successful target algorithm runs: 50
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 4
  Number of target algorithms that exceeded the memory limit: 0



In [15]:
print('Model Details:')
print(automl_hd_ft_p.show_models())

Model Details:
[(0.100000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'categorical_encoding:__choice__': 'no_encoding', 'classifier:__choice__': 'passive_aggressive', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'random_trees_embedding', 'rescaling:__choice__': 'none', 'classifier:passive_aggressive:C': 3.562059440549482e-05, 'classifier:passive_aggressive:average': 'False', 'classifier:passive_aggressive:fit_intercept': 'True', 'classifier:passive_aggressive:loss': 'squared_hinge', 'classifier:passive_aggressive:tol': 0.0004234555532193723, 'preprocessor:random_trees_embedding:bootstrap': 'False', 'preprocessor:random_trees_embedding:max_depth': 2, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 9, 'preprocessor:random_trees_embedding:min_samples_split': 18, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 84},
data

#### Try with CV instead of Holdout (using ohe and parallel)

CV requires an extra step to fit our ensemble on all the data. During **fit()**, models are fit on individual cross-validation folds. To use all available data, we call **refit()**, which trains all models in the final ensemble on the whole dataset. Also, when using CV, **fit()** changes the data in place, but **refit()** needs the original data, so we pass copies made with **copy()**. In practice, you might want to reload the data instead.
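
To see why the extra **refit()** step matters, it helps to look at how much data each fold-level model is trained on. The sketch below is an illustration in plain Python with a hypothetical helper, `kfold_indices`. It splits indices in order without shuffling and is not auto-sklearn's internal splitter.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in kfold_indices(10, 5):
    print(len(train), validation)
```

With 5 folds, every model is trained on only four fifths of the samples, so retraining the chosen ensemble members on the full training set uses data that no single fold-level model saw in its entirety.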

In [25]:
# %%time

# #set and clear the output directories
# directories_parallel_cv = ['../results/tmp_hd_holdout_parallel_cv', \
#                            '../results/out_hd_holdout_parallel_cv']
# cleanup(directories_parallel_cv, True)

# #build the auto-sklearn routine
# automl_hd_cv_p = autosklearn.classification.AutoSklearnClassifier(
#     time_left_for_this_task=60,
#     per_run_time_limit=30,
#     tmp_folder=directories_parallel_cv[0],
#     output_folder=directories_parallel_cv[1],
#     disable_evaluator_output=False,
#     # 5-fold cross-validation replaces the default 'holdout' strategy
#     # used in the earlier cells.
#     resampling_strategy='cv',
#     resampling_strategy_arguments={'folds': 5},
#     n_jobs=4,
#     seed=5,
#     delete_output_folder_after_terminate=False,
#     delete_tmp_folder_after_terminate=False,
# )

# #call it
# automl_hd_cv_p.fit(X_hd_ohe_train.copy(), y_hd_ohe_train.copy(), \
#                    dataset_name='heart_disease')
# automl_hd_cv_p.refit(X_hd_ohe_train.copy(), y_hd_ohe_train.copy())

# #save the predictions
# predictions_hd_cv_p = automl_hd_cv_p.predict(X_hd_ohe_test)

In [26]:
# print('Accuracy:')
# print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
#                                      predictions_hd_cv_p))
# print(' ')
# print('-----------------------------------------')
# print(' ')
# print('Sprint Stats:')
# print(automl_hd_cv_p.sprint_statistics())

In [27]:
# print('Model Details:')
# print(automl_hd_cv_p.show_models())

## Breast Cancer

#### Load the breast cancer data

In [19]:
from sklearn.datasets import load_breast_cancer

'''
Attribute Information:

1) ID number 
2) Diagnosis (M = malignant, B = benign) 
3-32) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 
b) texture (standard deviation of gray-scale values) 
c) perimeter 
d) area 
e) smoothness (local variation in radius lengths) 
f) compactness (perimeter^2 / area - 1.0) 
g) concavity (severity of concave portions of the contour) 
h) concave points (number of concave portions of the contour) 
i) symmetry 
j) fractal dimension ("coastline approximation" - 1)
'''

#load from sklearn
X_bc, y_bc = sklearn.datasets.load_breast_cancer(return_X_y=True)

#build the train and test sets
X_bc_train, X_bc_test, y_bc_train, y_bc_test = \
    sklearn.model_selection.train_test_split(X_bc, y_bc, random_state=1)

#### Build a model using holdout and parallelization

In [20]:
%%time

#set and clear the output directories
directories_bc = ['../results/tmp_bc', '../results/out_bc']
cleanup(directories_bc, True)

#build the auto-sklearn routine
automl_bc = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=30,
    tmp_folder=directories_bc[0],
    output_folder=directories_bc[1],
    disable_evaluator_output=False,
    # 'holdout' with 'train_size'=0.67 is the default argument setting
    # for AutoSklearnClassifier. It is explicitly specified in this example
    # for demonstration purposes.
    resampling_strategy='holdout',
    resampling_strategy_arguments={'train_size': 0.67},
    n_jobs=4,
    seed=5,
    delete_output_folder_after_terminate=False,
    delete_tmp_folder_after_terminate=False,
)

#call it
automl_bc.fit(X_bc_train, y_bc_train, dataset_name='breast_cancer')

#save the predictions
predictions_bc = automl_bc.predict(X_bc_test)

deleting ../results/tmp_bc

deleting ../results/out_bc

CPU times: user 3.27 s, sys: 388 ms, total: 3.66 s
Wall time: 56.7 s


In [21]:
print('Accuracy:')
print(sklearn.metrics.accuracy_score(y_bc_test, \
                                     predictions_bc))
print(' ')
print('-----------------------------------------')
print(' ')
print('Sprint Stats:')
print(automl_bc.sprint_statistics())

Accuracy:
0.958041958041958
 
-----------------------------------------
 
Sprint Stats:
auto-sklearn results:
  Dataset name: breast_cancer
  Metric: accuracy
  Best validation score: 0.992908
  Number of target algorithm runs: 43
  Number of successful target algorithm runs: 38
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 4
  Number of target algorithms that exceeded the memory limit: 1



In [22]:
print('Model Details:')
print(automl_bc.show_models())

Model Details:
[(0.120000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'categorical_encoding:__choice__': 'no_encoding', 'classifier:__choice__': 'extra_trees', 'imputation:strategy': 'median', 'preprocessor:__choice__': 'random_trees_embedding', 'rescaling:__choice__': 'robust_scaler', 'classifier:extra_trees:bootstrap': 'True', 'classifier:extra_trees:criterion': 'entropy', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.6093972073864385, 'classifier:extra_trees:max_leaf_nodes': 'None', 'classifier:extra_trees:min_impurity_decrease': 0.0, 'classifier:extra_trees:min_samples_leaf': 5, 'classifier:extra_trees:min_samples_split': 8, 'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:extra_trees:n_estimators': 100, 'preprocessor:random_trees_embedding:bootstrap': 'True', 'preprocessor:random_trees_embedding:max_depth': 9, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding

## Model Run and Accuracy Stats

All in one place for easier comparison.
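
A compact table can make the comparison easier to scan than the raw reports. The sketch below uses a hypothetical formatter, `format_comparison`, and hard-codes the validation scores and test accuracies printed by the cells above, so the numbers will differ on a fresh run.

```python
# (run label, best validation score, test accuracy), copied from the cell outputs above
results = [
    ("HD holdout", 0.880000, 0.75),
    ("HD holdout parallel", 0.866667, 0.75),
    ("HD holdout feat_type parallel", 0.866667, 0.7894736842105263),
    ("BC holdout parallel", 0.992908, 0.958041958041958),
]


def format_comparison(rows):
    """Return aligned text lines showing validation score, test accuracy, and gap."""
    lines = ["%-32s %10s %10s %8s" % ("run", "validation", "test", "gap")]
    for label, validation, test in rows:
        lines.append("%-32s %10.3f %10.3f %8.3f"
                     % (label, validation, test, validation - test))
    return lines


print("\n".join(format_comparison(results)))
```

In every run recorded here the validation score is higher than the test accuracy, which is a reminder to judge the models on the held-out test set rather than on the internal validation score.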

In [30]:
print("-----------Heart Disease---------------")
print(' ')
print(' ')

print("Model stats HD Holdout:")
print(automl_hd_ohe.sprint_statistics())
print(' ')
print("Accuracy score HD Holdout:")
print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
                                     predictions_hd_ohe))

print(' ')
print('-----------------------------------------')
print(' ')

print("Model stats HD Holdout Parallel:")
print(automl_hd_ohe_p.sprint_statistics())
print("Accuracy score HD Holdout Parallel:")
print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
                                     predictions_hd_ohe_p))

print(' ')
print('-----------------------------------------')
print(' ')

#holdout parallel feat_type
print("Model stats HD Holdout Feature Type Parallel:")
print(automl_hd_ft_p.sprint_statistics())
print(' ')
print("Accuracy score HD Holdout Feature Type Parallel:")
print(sklearn.metrics.accuracy_score(y_hd_test, \
                                     predictions_hd_ft_p))

# print(' ')
# print('-----------------------------------------')
# print(' ')

# #cross validation parallel
# print("Model stats HD CV Parallel:")
# print(automl_hd_cv_p.sprint_statistics())
# print(' ')
# print("Accuracy score HD CV Parallel:")
# print(sklearn.metrics.accuracy_score(y_hd_ohe_test, \
#                                      predictions_hd_cv_p))

print(' ')
print(' ')

print("-----------Breast Cancer---------------")
print(' ')
print(' ')

print("Model stats BC Holdout Parallel:")
print(automl_bc.sprint_statistics())
print(' ')
print("Accuracy score BC Holdout Parallel:")
print(sklearn.metrics.accuracy_score(y_bc_test, \
                                     predictions_bc))

-----------Heart Disease---------------
 
 
Model stats HD Holdout:
auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.880000
  Number of target algorithm runs: 24
  Number of successful target algorithm runs: 24
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 0
  Number of target algorithms that exceeded the memory limit: 0

 
Accuracy score HD Holdout:
0.75
 
-----------------------------------------
 
Model stats HD Holdout Parallel:
auto-sklearn results:
  Dataset name: heart_disease
  Metric: accuracy
  Best validation score: 0.866667
  Number of target algorithm runs: 55
  Number of successful target algorithm runs: 51
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 4
  Number of target algorithms that exceeded the memory limit: 0

Accuracy score HD Holdout Parallel:
0.75
 
-----------------------------------------
 
Mode

## Save to Domino Stats File

To keep things simple, we pick one of the hd models. Saving stats to this file [allows Domino to track and trend them in the Experiment Manager](https://support.dominodatalab.com/hc/en-us/articles/204348169-Diagnostic-statistics-with-dominostats-json) when this notebook is run as a batch or scheduled job.

In [24]:
hd_acc = sklearn.metrics.accuracy_score(y_hd_ohe_test, \
                                        predictions_hd_ohe_p)
bc_acc = sklearn.metrics.accuracy_score(y_bc_test, \
                                        predictions_bc)

import json
with open('../dominostats.json', 'w') as f:
    f.write(json.dumps( {"HD_ACC": hd_acc, "BC_ACC": bc_acc}))