# OPTaaS Scikit-learn Pipelines

Using the OPTaaS Python Client, you can optimize any scikit-learn pipeline. For each step or estimator in the pipeline, OPTaaS just needs to know what parameters to optimize and what constraints will apply to them.

Your pipeline can even include **optional** steps (such as feature selection), and **choice** steps (such as choosing between a set of classifiers).

We have provided pre-defined parameters and constraints for some of the most widely used estimators, such as Random Forest and XGBoost. The example below demonstrates how to use them. See also our [tutorial on defining your own custom optimizable estimators](sklearn_custom.ipynb).

## Load your dataset

We will run a classification pipeline on the German Credit Data, available [here](https://newonlinecourses.science.psu.edu/stat857/node/215/). The data contains 1000 rows, with 20 feature columns and one target column (`Creditability`) containing two classes.

In [1]:
import pandas as pd

# Load the dataset; 'Creditability' is the target column, every other column is a feature
data = pd.read_csv('german_credit.csv')
features = data.drop(columns=['Creditability'])
target = data['Creditability']

## Specify your estimators

Our pipeline will include:

- An optional feature selection step using PCA

- A choice of classifier from: Random Forest, Extra Trees and Linear SVC

In [2]:
from mindfoundry.optaas.client.sklearn_pipelines.estimators.pca import PCA
from mindfoundry.optaas.client.sklearn_pipelines.estimators.ensemble import RandomForestClassifier, ExtraTreesClassifier
from mindfoundry.optaas.client.sklearn_pipelines.estimators.svc import LinearSVC
from mindfoundry.optaas.client.sklearn_pipelines.mixin import choice, optional_step

estimators = [
    ('feature_selection', optional_step(PCA())),
    ('classification', choice(
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        LinearSVC()
    ))
]
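
To get a feel for the structural part of this search space, the sketch below enumerates the pipeline shapes implied by one optional step and one three-way choice. It uses plain strings rather than the OPTaaS estimator classes, and the variable names are made up for this illustration; the hyperparameters of each estimator are optimized on top of these shapes.

```python
from itertools import product

# An optional step is either absent (None) or present.
feature_selection_options = [None, 'PCA']
# A choice step picks exactly one of its alternatives.
classifier_options = ['RandomForestClassifier', 'ExtraTreesClassifier', 'LinearSVC']

# Every combination of "PCA or not" with one classifier.
variants = [
    [name for name in (selector, classifier) if name is not None]
    for selector, classifier in product(feature_selection_options, classifier_options)
]

for steps in variants:
    print(' -> '.join(steps))
```

This gives six shapes in total (two options for the first step times three classifiers), which matches the mix of pipelines with and without a `feature_selection` step seen in the recommendations later on.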

## Connect to the OPTaaS server using your API Key

We now create a client and connect to the web service that will perform our optimization. You will need to supply your personal API key. Keep your key private and do not commit it to your version control system.

In [3]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('https://optaas.mindfoundry.ai', '<Your OPTaaS API key>')
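
One common way to keep the key out of your notebook and your repository is to read it from an environment variable. The sketch below assumes a variable named `OPTAAS_API_KEY`; both that name and the `load_api_key` helper are our own choices for illustration and are not part of the OPTaaS client.

```python
import os


def load_api_key(env_var='OPTAAS_API_KEY'):
    """Return the API key stored in `env_var`, failing loudly if it is missing."""
    api_key = os.environ.get(env_var, '').strip()
    if not api_key:
        raise RuntimeError(f'Set the {env_var} environment variable to your OPTaaS API key')
    return api_key


# Usage (the variable must be set in your shell before starting the notebook):
# client = OPTaaSClient('https://optaas.mindfoundry.ai', load_api_key())
```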

## Create your Sklearn Task

We don't need to worry about specifying all the parameters and constraints - they are generated from the estimators and any additional parameters we provide. In this case we provide the `feature_count` (the number of feature columns in our dataset), which is used to bound parameters such as PCA's `n_components`.

In [4]:
task = client.create_sklearn_task(
    title='My Sklearn Task', 
    estimators=estimators,
    feature_count=len(features.columns)
)

display(task.parameters)
display(task.constraints)

[{'id': 'feature_selection',
  'items': [{'default': 'auto',
    'enum': ['arpack', 'auto', 'full', 'randomized'],
    'id': 'feature_selection__svd_solver',
    'name': 'svd_solver',
    'type': 'categorical'},
   {'choices': [{'id': 'feature_selection__n_components_int',
      'maximum': 20,
      'minimum': 1,
      'name': 'n_components_int',
      'type': 'integer'},
     {'id': 'feature_selection__n_components_float',
      'maximum': 0.9999999999999999,
      'minimum': 5e-324,
      'name': 'n_components_float',
      'type': 'number'},
     {'id': 'feature_selection__n_components_mle',
      'name': 'n_components_mle',
      'type': 'constant',
      'value': 'mle'}],
    'id': 'feature_selection__n_components',
    'includeInDefault': False,
    'name': 'n_components',
    'optional': True,
    'type': 'choice'},
   {'default': False,
    'id': 'feature_selection__whiten',
    'name': 'whiten',
    'type': 'boolean'},
   {'default': 0.0,
    'id': 'feature_selection__tol',
  

["if #feature_selection__svd_solver == 'arpack' then ( #feature_selection__n_components_int < 20 ) && #feature_selection__tol is_present",
 "if ( #feature_selection__svd_solver == 'auto' ) || ( #feature_selection__svd_solver == 'randomized' ) then #feature_selection__n_components is_absent || ( #feature_selection__n_components == #feature_selection__n_components_int )",
 'if #classification__0__bootstrap == false then #classification__0__oob_score != true',
 'if #classification__1__bootstrap == false then #classification__1__oob_score != true',
 "if #classification__2__dual == true then #classification__2__penalty == 'l2'",
 "if #classification__2__dual == false then #classification__2__loss == 'squared_hinge'",
 "if #classification__2__multi_class == 'ovr' then ( #classification__2__dual is_present && #classification__2__penalty is_present ) && #classification__2__loss is_present",
 "if #classification__2__multi_class == 'crammer_singer' then ( #classification__2__dual is_absent && #c

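The generated constraints are conditional rules over parameter values. As a rough illustration only (this is not how OPTaaS evaluates constraints, and `satisfies_bootstrap_rule` is a hypothetical helper), the Random Forest rule `if #classification__0__bootstrap == false then #classification__0__oob_score != true` can be read as the following predicate on a configuration's values:

```python
def satisfies_bootstrap_rule(params):
    """Check the rule: if bootstrap == false then oob_score != true."""
    if params['bootstrap'] is False:
        return params['oob_score'] is not True
    # The rule says nothing when bootstrap is true.
    return True


print(satisfies_bootstrap_rule({'bootstrap': True, 'oob_score': True}))
print(satisfies_bootstrap_rule({'bootstrap': False, 'oob_score': False}))
print(satisfies_bootstrap_rule({'bootstrap': False, 'oob_score': True}))
```

The rule mirrors scikit-learn itself: out-of-bag scoring relies on the rows left out of each bootstrap sample, so a forest with `bootstrap=False` and `oob_score=True` is rejected when it is fitted. Encoding this as a constraint keeps such configurations from being suggested in the first place.
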
## Generate the first pipeline

We get the first configuration from OPTaaS and use it to generate a pipeline:

In [5]:
configuration = task.generate_configurations()[0]
pipeline = task.make_pipeline(configuration)

display(pipeline)

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_l...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

## Run your pipeline and calculate the result

We define a function that cross-validates our pipeline and returns the mean score across folds, using micro-averaged F1 as the metric:

In [6]:
from sklearn.model_selection import cross_val_score

def get_result(pipeline):
    scores = cross_val_score(pipeline, features, target, scoring='f1_micro')
    mean_score = scores.mean()
    return mean_score

mean_score = get_result(pipeline)

display(f'Mean score: {mean_score:.3f}')

'Mean score: 0.684'
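
With `scoring='f1_micro'`, precision and recall are computed from true positives, false positives and false negatives pooled across all classes. For a single-label problem like this one, the pooled F1 equals plain accuracy, so the score above can be read as the fraction of correctly classified rows. The sketch below checks this on a small made-up label set; `micro_f1` and `accuracy` are helpers written for this illustration, not scikit-learn functions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only (6 of 8 predictions are correct).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers coincide.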

## Record the result in OPTaaS and generate the next pipeline

In [7]:
# Recording the score returns the next configuration for us to try
configuration = task.record_result(configuration, score=mean_score)
pipeline = task.make_pipeline(configuration)

## Repeat as necessary

In [8]:
number_of_iterations = 12

for _ in range(number_of_iterations):
    display("OPTaaS recommendation:", pipeline)
    # Score the current recommendation
    mean_score = get_result(pipeline)
    display(f'Mean score: {mean_score:.3f}')
    # Report the score and build the next recommended pipeline
    configuration = task.record_result(configuration, score=mean_score)
    pipeline = task.make_pipeline(configuration)

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
           max_depth=88, max_features=0.8347809045688854,
           max_leaf_nodes=None, min_impurity_decrease=0.7945820764661936,
           min_impurity_split=None, min_samples_leaf=14,
           min_samples_split=8,
           min_weight_fraction_leaf=0.0900039391054861, n_estimators=14,
           n_jobs=1, oob_score=True, random_state=None, verbose=0,
           warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=0.6971715530810846,
            max_leaf_nodes=None, min_impurity_decrease=0.11162448326459773,
            min_impurity_split=None, min_samples_leaf=0.07487...            n_jobs=1, oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features='sqrt',
            max_leaf_nodes=6580, min_impurity_decrease=0.24478696548136247,
            min_impurity_split=None, min_samples_leaf=16,
            min_samples_split=8,
            min_weight_fraction_leaf=0.3843510059011472, n_estimators=167,
            n_jobs=1, oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.9728838363652157,
            min_impurity_split=None, min_samples_leaf=0.24365195679239393...            n_jobs=1, oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=942,
           min_impurity_decrease=0.8632421089796949,
           min_impurity_split=None, min_samples_leaf=15,
           min_samples_split=0.20737671197302388,
           min_weight_fraction_leaf=0.13565324953115132, n_estimators=457,
           n_jobs=1, oob_score=True, random_state=None, verbose=0,
           warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='entropy',
           max_depth=90, max_features=0.003646868480661869,
           max_leaf_nodes=95, min_impurity_decrease=0.1978619738503251,
           min_impurity_split=None, min_samples_leaf=3,
         ...6,
           n_jobs=1, oob_score=True, random_state=None, verbose=0,
           warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power=13, n_components=None, random_state=None,
  svd_solver='full', tol=0.3607130914658494, whiten=True)), ('classification', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
           max_depth=27, max_features='sqrt', ...0,
           n_jobs=1, oob_score=True, random_state=None, verbose=0,
           warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=19, max_features=0.2980724139146793,
            max_leaf_nodes=None, min_impurity_decrease=0.6131634029682336,
            min_impurity_split=None, min_samples_leaf=0.45927...            n_jobs=1, oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power=7, n_components=None, random_state=None,
  svd_solver='full', tol=0.7032527640397433, whiten=False)), ('classification', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='entropy',
           max_depth=None, max_features=0.1...9,
           n_jobs=1, oob_score=True, random_state=None, verbose=0,
           warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power=55, n_components='mle', random_state=None,
  svd_solver='full', tol=0.1699494461779577, whiten=True)), ('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='a...
            n_jobs=1, oob_score=True, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=0.047590580336390964, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=24, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.722'

'OPTaaS recommendation:'

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power=54, n_components=None, random_state=None,
  svd_solver='auto', tol=None, whiten=False)), ('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='sqrt', max_l...            n_jobs=1, oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

'Mean score: 0.700'
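
How many iterations count as "necessary" is up to you. Instead of a fixed count, one option is a simple patience rule that stops once recent scores no longer improve on the earlier best. The sketch below is one possible rule and not an OPTaaS feature; `should_stop`, `patience` and `min_improvement` are names invented for this illustration, and the scores are the rounded values displayed above.

```python
def should_stop(scores, patience=5, min_improvement=1e-3):
    """Return True when none of the last `patience` scores beat the earlier best."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_improvement


# Scores from the run above, starting with the first pipeline (rounded to 3 decimals).
scores = [0.684] + [0.700] * 10 + [0.722, 0.700]

print(should_stop(scores[:7]))  # after seven scores, none of the last five improved
print(should_stop(scores))      # the late 0.722 counts as a recent improvement
```

On this particular run, a patience of 5 would have stopped the loop well before the 0.722 result appeared, so treat the patience value as a trade-off between evaluation cost and the chance of finding a late improvement.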

## Complete your task

In [9]:
task.complete()

best_result, best_configuration = task.get_best_result_and_configuration()
best_pipeline = task.make_pipeline(best_configuration)

display(best_result, best_pipeline)

{ 'configuration': '9298d7c8-f6a1-4559-8d09-5fe35c2dd080',
  'id': 2373,
  'score': 0.7219884555213896,
  'user_defined_data': None}

Pipeline(memory=None,
     steps=[('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=0.047590580336390964, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=24, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])