# OPTaaS Scikit-learn Pipelines

### <span style="color:red">Note:</span> To run this notebook, you need an API Key. You can get one <a href="mailto:charles.brecque@mindfoundry.ai">here</a>.

Using the OPTaaS Python Client, you can optimize any scikit-learn pipeline. For each step or estimator in the pipeline, OPTaaS just needs to know what parameters to optimize and what constraints will apply to them.

Your pipeline can even include **optional** steps (such as feature selection), **choice** steps (such as choosing between a set of classifiers) and **nested** pipelines.

We have provided pre-defined parameters and constraints for some of the most widely used estimators, such as Random Forest and XGBoost. The example below demonstrates how to use them. See also our [tutorial on defining your own custom optimizable estimators](Custom Scikit-learn Estimators.ipynb).

## Load your dataset

We will run a classification pipeline using the German Credit Data available [here](https://newonlinecourses.science.psu.edu/stat857/node/215/). The dataset contains 1000 rows, with 20 feature columns and one target column, `Creditability`, which has two classes.

In [1]:
import pandas as pd

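# Load the German Credit dataset (adjust the path to wherever your copy lives)
data = pd.read_csv('../data/german_credit.csv')
# Every column except 'Creditability' is a feature; 'Creditability' is the target
features = data.drop(columns=['Creditability'])
target = data['Creditability']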
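
Before optimizing, it can help to confirm that the feature/target split has the expected shape. The following is a minimal sketch on a tiny made-up frame: only the `Creditability` column name comes from the dataset, while the other column names, the values and the `split_features_target` helper are assumptions for illustration.

```python
import pandas as pd


def split_features_target(data, target_column):
    """Split a DataFrame into (features, target), failing early if the column is missing."""
    if target_column not in data.columns:
        raise KeyError(f"Target column {target_column!r} not found")
    features = data.drop(columns=[target_column])
    target = data[target_column]
    return features, target


# Tiny made-up sample standing in for german_credit.csv (hypothetical columns and values)
sample = pd.DataFrame({
    'Creditability': [1, 0, 1, 1],
    'Duration': [18, 9, 12, 24],
    'Amount': [1049, 2799, 841, 2122],
})

sample_features, sample_target = split_features_target(sample, 'Creditability')
print(sample_features.shape)         # (number of rows, number of feature columns)
print(sample_target.value_counts())  # number of rows per class
```

Running the same two checks on the real `features` and `target` should report 1000 rows, 20 feature columns and 2 classes if the file matches the description above.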

## Create your OptimizablePipeline

Our pipeline will include:

- An optional feature selection step using PCA

- A choice of classifier from: Random Forest, Extra Trees and Gradient Boosting

In [2]:
from mindfoundry.optaas.client.sklearn_pipelines.estimators.pca import PCA
from mindfoundry.optaas.client.sklearn_pipelines.estimators.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from mindfoundry.optaas.client.sklearn_pipelines.mixin import OptimizablePipeline, choice, optional_step

optimizable_pipeline = OptimizablePipeline([
    ('feature_selection', optional_step(PCA())),
    ('classification', choice(
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier()
    ))
])
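
To picture what the optional step and the choice step mean, the sketch below enumerates the six pipeline shapes they cover, using plain scikit-learn classes rather than the OPTaaS wrappers. It is only an illustration of the structure of the search space, not of how OPTaaS builds or samples pipelines internally. The `build_candidate` helper and the value `n_components=10` are assumptions chosen for the example.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.pipeline import Pipeline


def build_candidate(use_pca, classifier):
    """Build one concrete pipeline: PCA is included only if use_pca is True."""
    steps = []
    if use_pca:
        # n_components=10 is an arbitrary illustrative value
        steps.append(('feature_selection', PCA(n_components=10)))
    steps.append(('classification', classifier))
    return Pipeline(steps)


classifier_classes = [
    RandomForestClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
]

# 2 options for the optional step x 3 options for the choice step
candidates = [
    build_candidate(use_pca, classifier_class())
    for use_pca in (True, False)
    for classifier_class in classifier_classes
]

for candidate in candidates:
    print([name for name, _ in candidate.steps],
          type(candidate.steps[-1][1]).__name__)
```

On top of these structural variants, each estimator also contributes its own hyperparameters to the search.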

## Connect to the OPTaaS server using your API Key

We now create a client and connect to the web service that will perform our optimization. You will need to enter your personal API key. Keep your key private and don't commit it to your version control system.

In [3]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('https://optaas.mindfoundry.ai', '<Your OPTaaS API key>')
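
One way to keep the key out of the notebook (and therefore out of version control) is to read it from an environment variable. This is a minimal sketch: the variable name `OPTAAS_API_KEY` and the `load_api_key` helper are assumptions for illustration, not part of the OPTaaS client.

```python
import os


def load_api_key(env_var='OPTAAS_API_KEY'):
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical convention, not something OPTaaS requires.
    """
    api_key = os.environ.get(env_var, '').strip()
    if not api_key:
        raise RuntimeError(
            f'Set the {env_var} environment variable to your OPTaaS API key'
        )
    return api_key


# Usage with the client created above (left as a comment: it needs a real key):
# client = OPTaaSClient('https://optaas.mindfoundry.ai', load_api_key())
```

Failing loudly when the variable is missing avoids sending an empty key to the server by accident.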

## Create your Sklearn Task

We don't need to specify the parameters and constraints ourselves: they are generated from our `OptimizablePipeline`. Some estimators need additional kwargs, e.g. `feature_count`, which is required by PCA.

If we do need to optimize any additional parameters that are outside of our pipeline, we can include them in `additional_parameters` and `additional_constraints`.

In [4]:
from mindfoundry.optaas.client.parameter import IntParameter
from mindfoundry.optaas.client.constraint import Constraint

my_extra_param = IntParameter('extra', id='extra', minimum=0, maximum=10)
my_extra_constraint = Constraint(my_extra_param != 7)

task = client.create_sklearn_task(
    title='My Sklearn Task', 
    pipeline=optimizable_pipeline,
    feature_count=len(features.columns),
    additional_parameters=[my_extra_param],
    additional_constraints=[my_extra_constraint],
    target_score=1.0  # optional: this is the best possible score
)

display(task.parameters)
display(task.constraints)

[{'id': 'pipeline',
  'name': 'pipeline',
  'type': 'group',
  'items': [{'id': 'pipeline__feature_selection',
    'name': 'feature_selection',
    'type': 'group',
    'optional': True,
    'items': [{'id': 'pipeline__feature_selection__n_components',
      'name': 'n_components',
      'type': 'integer',
      'minimum': 1,
      'maximum': 20},
     {'id': 'pipeline__feature_selection__whiten',
      'name': 'whiten',
      'type': 'boolean',
      'default': False}]},
   {'id': 'classification',
    'name': 'classification',
    'type': 'choice',
    'choices': [{'id': 'pipeline__classification__0',
      'name': '0',
      'type': 'group',
      'items': [{'id': 'pipeline__classification__0__max_features',
        'name': 'max_features',
        'type': 'categorical',
        'default': 'auto',
        'enum': ['auto', 'sqrt', 'log2']},
       {'id': 'pipeline__classification__0__min_samples_split',
        'name': 'min_samples_split',
        'type': 'integer',
        'default':

['#extra != 7']

## Define your scoring function

We define a function that cross-validates a pipeline and returns the mean and the variance of the per-fold scores:

In [5]:
from sklearn.model_selection import cross_val_score

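def scoring_function(pipeline):
    # Cross-validate the candidate pipeline using the micro-averaged F1 score
    scores = cross_val_score(pipeline, features, target, scoring='f1_micro')
    # Return the mean fold score and the variance of the scores across folds
    return scores.mean(), scores.var()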
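
The sketch below illustrates three properties of this kind of scoring function:

- For a single-label target like ours, the micro-averaged F1 score equals plain accuracy.
- `scores.var()` is NumPy's population variance (`ddof=0`).
- The default number of folds depends on the scikit-learn version (3 before 0.22, 5 from 0.22 onwards), so passing `cv` explicitly keeps scores comparable across environments.

The synthetic dataset, the stand-in classifier and the choice of `cv=5` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the German Credit features and target
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# Same model, same deterministic folds, two different scoring names
f1_micro_scores = cross_val_score(model, X, y, scoring='f1_micro', cv=5)
accuracy_scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)

mean_score = f1_micro_scores.mean()
variance = f1_micro_scores.var()  # population variance: divides by the number of folds

print('Per-fold f1_micro:', f1_micro_scores)
print('Per-fold accuracy:', accuracy_scores)
print('Mean:', mean_score, 'Variance:', variance)
```

If you prefer the sample variance, `scores.var(ddof=1)` divides by the number of folds minus one instead.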

## Run your task

We run the task for 20 iterations and review the results:

In [6]:
best_result, best_pipeline = task.run(scoring_function, 20)

display('Best Score:', best_result.score)
display('Best Pipeline:', best_pipeline)

Running task "My Sklearn Task" for 20 iterations
(or until target score 1.0 is reached)

Iteration: 0    Score: 0.692992393591196    Variance: 0.0008240181111248598
Configuration: {'pipeline': {'feature_selection': {'n_components': 10, 'whiten': False}, 'classification': {'2': {'max_features': 'auto', 'min_samples_split': 2, 'min_samples_leaf': 1, 'criterion': 'friedman_mse', 'min_weight_fraction_leaf': 0.0, 'min_impurity_decrease': 0.0, 'learning_rate': 0.1, 'n_estimators': 100, 'subsample': 1.0}}}, 'additional': {'extra': 5}}

Iteration: 1    Score: 0.7079984175792559    Variance: 0.001638023934747848
Configuration: {'pipeline': {'feature_selection': {'n_components': 10, 'whiten': False}, 'classification': {'1': {'max_features': 'auto', 'min_samples_split': 2, 'min_samples_leaf': 1, 'criterion': 'gini', 'min_weight_fraction_leaf': 0.0, 'min_impurity_decrease': 0.0, 'bootstrap': False, 'n_estimators': 10, 'max_leaf_nodes': 5005, 'max_depth': 50}}}, 'additional': {'extra': 5}}

...

Iteration: 18    Score: 0.7259984535433638    Variance: 0.0002552035255792119
Configuration: {'pipeline': {'classification': {'2': {'max_features': 'log2', 'min_samples_split': 20, 'min_samples_leaf': 12, 'criterion': 'friedman_mse', 'min_weight_fraction_leaf': 0.22770891172170005, 'min_impurity_decrease': 0.7623844285402049, 'learning_rate': 0.04607140948395827, 'n_estimators': 391, 'subsample': 0.5073692775798349}}}, 'additional': {'extra': 4}}

Iteration: 19    Score: 0.7189854525183866    Variance: 0.0003222472595104568
Configuration: {'pipeline': {'classification': {'2': {'max_features': 'log2', 'min_samples_split': 3, 'min_samples_leaf': 13, 'criterion': 'friedman_mse', 'min_weight_fraction_leaf': 0.1900392486697446, 'min_impurity_decrease': 0.7648193759348436, 'learning_rate': 0.011896091807658965, 'n_estimators': 371, 'subsample': 0.8016405875262133, 'max_depth': 91}}}, 'additional': {'extra': 4}}

Task Completed



'Best Score:'

0.7389904874934815

'Best Pipeline:'

Pipeline(memory=None,
     steps=[('classification', GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.04607140948395827, loss='deviance',
              max_depth=81, max_features='log2', max_leaf_nodes=None,
              min_impurity_decrease=0.7623844285402049,
              min_...auto', random_state=None,
              subsample=0.3557249351363936, verbose=0, warm_start=False))])