# OPTaaS Scikit-learn Pipelines

### <span style="color:red">Note:</span> To run this notebook, you need an API Key. If you don't have one, <a href="mailto:optaas@mindfoundry.ai">email us to start your free trial</a>.

Using the OPTaaS Python Client, you can optimize any scikit-learn pipeline. For each step or estimator in the pipeline, OPTaaS only needs to know which parameters to optimize and which constraints apply to them.

Your pipeline can even include **optional** steps (such as feature selection), **choice** steps (such as choosing between a set of classifiers) and **nested** pipelines.

We have provided pre-defined parameters and constraints for some of the most widely used estimators, such as Random Forest and XGBoost. The example below demonstrates how to use them. See also our [tutorial on defining your own custom optimizable estimators](sklearn_custom.ipynb).

## Load your dataset

We will run a classification pipeline on the [German Credit Data](https://newonlinecourses.science.psu.edu/stat857/node/215/). The data contains 1000 rows, with 20 feature columns and 1 target column (`Creditability`) that has 2 classes.

In [1]:
import pandas as pd

data = pd.read_csv('../data/german_credit.csv')
# Everything except the target column is used as a feature.
features = data.drop(columns=['Creditability'])
target = data['Creditability']

## Create your OptimizablePipeline

Our pipeline will include:

- An optional feature selection step using PCA

- A choice of classifier from: Random Forest, Extra Trees and Linear SVC

In [2]:
from mindfoundry.optaas.client.sklearn_pipelines.estimators.pca import PCA
from mindfoundry.optaas.client.sklearn_pipelines.estimators.ensemble import RandomForestClassifier, ExtraTreesClassifier
from mindfoundry.optaas.client.sklearn_pipelines.estimators.svc import LinearSVC
from mindfoundry.optaas.client.sklearn_pipelines.mixin import OptimizablePipeline, choice, optional_step

optimizable_pipeline = OptimizablePipeline([
    ('feature_selection', optional_step(PCA())),
    ('classification', choice(
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        LinearSVC()
    ))
])
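To make the effect of `optional_step` and `choice` concrete, here is a minimal sketch in plain Python that enumerates the structural variants of the pipeline above. It does not use the OPTaaS client: the names `feature_selection_options` and `classifier_options` are illustrative only, and the hyperparameters inside each estimator are ignored.

```python
from itertools import product

# Illustrative only: the optional step is either present or absent,
# and exactly one classifier is chosen.
feature_selection_options = ["PCA", None]
classifier_options = ["RandomForestClassifier", "ExtraTreesClassifier", "LinearSVC"]

variants = [
    [step for step in (feature_selection, classifier) if step is not None]
    for feature_selection, classifier in product(feature_selection_options, classifier_options)
]

for variant in variants:
    print(" -> ".join(variant))
```

This yields six pipeline shapes (two options for the optional step times three classifiers). On top of these shapes, the hyperparameters of each estimator are optimized as well.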

## Connect to the OPTaaS server using your API Key

We now create a client and connect to the web service that will perform our optimization. You will need to enter your personal API key. Keep your key private and do not commit it to your version control system.

In [3]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('https://optaas.mindfoundry.ai', '<Your OPTaaS API key>')
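One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `OPTAAS_API_KEY`; both that name and the `load_api_key` helper are hypothetical and not part of the OPTaaS client.

```python
import os


def load_api_key(env_var="OPTAAS_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your OPTaaS API key")
    return api_key


# Usage (assumes the variable has been set in your shell beforehand):
# client = OPTaaSClient('https://optaas.mindfoundry.ai', load_api_key())
```

Failing loudly when the variable is missing avoids sending an empty key to the server.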

## Create your Sklearn Task

We don't need to specify the parameters and constraints ourselves: they are generated from our `OptimizablePipeline`. Some estimators require additional kwargs, e.g. PCA requires `feature_count`.

To optimize parameters that are outside of our pipeline, we can include them in `additional_parameters` and `additional_constraints`.

In [4]:
from mindfoundry.optaas.client.parameter import IntParameter
from mindfoundry.optaas.client.constraint import Constraint

my_extra_param = IntParameter('extra', id='extra', minimum=0, maximum=10)
my_extra_constraint = Constraint(my_extra_param != 7)

task = client.create_sklearn_task(
    title='My Sklearn Task', 
    pipeline=optimizable_pipeline,
    feature_count=len(features.columns),
    additional_parameters=[my_extra_param],
    additional_constraints=[my_extra_constraint],
)

display(task.parameters)
display(task.constraints)

[{'id': 'pipeline',
  'items': [{'id': 'pipeline__feature_selection',
    'items': [{'id': 'pipeline__feature_selection__n_components',
      'maximum': 20,
      'minimum': 1,
      'name': 'n_components',
      'type': 'integer'},
     {'default': False,
      'id': 'pipeline__feature_selection__whiten',
      'name': 'whiten',
      'type': 'boolean'}],
    'name': 'feature_selection',
    'optional': True,
    'type': 'group'},
   {'choices': [{'id': 'pipeline__classification__0',
      'items': [{'default': 'auto',
        'enum': ['auto', 'sqrt', 'log2'],
        'id': 'pipeline__classification__0__max_features',
        'name': 'max_features',
        'type': 'categorical'},
       {'default': 2,
        'distribution': 'Uniform',
        'id': 'pipeline__classification__0__min_samples_split',
        'maximum': 20,
        'minimum': 2,
        'name': 'min_samples_split',
        'type': 'integer'},
       {'default': 1,
        'id': 'pipeline__classification__0__min_samples_

['#extra != 7']
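To illustrate what the additional parameter and the constraint `#extra != 7` mean together, the following plain-Python sketch lists the values that remain valid for `extra`. It is not OPTaaS code, and it assumes that both `minimum` and `maximum` are inclusive bounds.

```python
# Stand-ins for the bounds of the 'extra' parameter defined above
# (assumed inclusive on both ends).
minimum, maximum = 0, 10


def satisfies_constraint(extra):
    """Mirror of the constraint '#extra != 7'."""
    return extra != 7


allowed_values = [value for value in range(minimum, maximum + 1) if satisfies_constraint(value)]
print(allowed_values)
```

Under these assumptions, every integer from 0 to 10 except 7 is a valid value for `extra`.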

## Define your scoring function

We define a function that runs a candidate pipeline under cross-validation and returns the mean micro-averaged F1 score across the folds:

In [5]:
from sklearn.model_selection import cross_val_score

def scoring_function(pipeline):
    """Return the mean cross-validated micro-F1 score of the pipeline."""
    scores = cross_val_score(pipeline, features, target, scoring='f1_micro')
    return scores.mean()
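The `f1_micro` scorer pools true positives, false positives and false negatives over all classes before computing F1. For a single-label problem such as this one, that value coincides with accuracy. The sketch below checks this on a small made-up example in plain Python; the helper names `micro_f1` and `accuracy` and the label lists are illustrative and not part of scikit-learn.

```python
import math


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for truth, prediction in zip(y_true, y_pred):
            if prediction == label and truth == label:
                tp += 1
            elif prediction == label:
                fp += 1
            elif truth == label:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 1]

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

Each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two measures agree.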

## Run your task

We run the task for 20 iterations and review the results:

In [6]:
best_result, best_pipeline = task.run(scoring_function, 20)

display('Best Score:', best_result.score)
display('Best Pipeline:', best_pipeline)

Running task "My Sklearn Task" for 20 iterations
(no score threshold set)

Iteration: 0    Score: 0.7239634844425265
Configuration: {'pipeline': {'classification': {'0': {'max_features': 'auto', 'min_samples_split': 2, 'min_samples_leaf': 1, 'criterion': 'gini', 'min_weight_fraction_leaf': 0.0, 'min_impurity_decrease': 0.0, 'bootstrap': True, 'n_estimators': 10, 'max_leaf_nodes': 5005}}}, 'additional': {'extra': 5}}

Iteration: 1    Score: 0.7259894625164085
Configuration: {'pipeline': {'classification': {'2': {'C': 1.3988132093705725, 'tol': 0.13663947006138843}}}, 'additional': {'extra': 4}}

Iteration: 2    Score: 0.699999400598203
Configuration: {'pipeline': {'classification': {'1': {'max_features': 'sqrt', 'min_samples_split': 3, 'min_samples_leaf': 1, 'criterion': 'gini', 'min_weight_fraction_leaf': 0.3774311709623272, 'min_impurity_decrease': 0.9865344170385548, 'bootstrap': True, 'n_estimators': 487, 'max_leaf_nodes': 4423, 'max_depth': 74}}}, 'additional': {'extra': 2}}

Itera

'Best Score:'

0.7509755264246283

'Best Pipeline:'

Pipeline(memory=None,
     steps=[('feature_selection', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
  svd_solver='auto', tol=0.0, whiten=True)), ('classification', LinearSVC(C=1.9210798580837665, class_weight=None, dual=True,
     fit_intercept=True, intercept_scaling=1, loss='squared_hinge',
     max_iter=1000, multi_class='ovr', penalty='l2', random_state=None,
     tol=0.5035867339618288, verbose=0))])
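The configurations printed in the log can be read back into a short summary. The sketch below rests on two assumptions inferred from the output above rather than from the OPTaaS documentation: the choice index (`'0'`, `'1'`, `'2'`) follows the order of the estimators passed to `choice`, and a missing `feature_selection` key means the optional step was left out. The `describe_configuration` helper is hypothetical.

```python
# Assumption: indices follow the order of the estimators passed to choice().
CLASSIFIERS = ["RandomForestClassifier", "ExtraTreesClassifier", "LinearSVC"]


def describe_configuration(configuration):
    """Summarize one configuration dict from the task log."""
    pipeline = configuration["pipeline"]
    # Exactly one choice is expected under 'classification'.
    (index, params), = pipeline["classification"].items()
    return {
        # Assumption: the optional step is absent when its key is missing.
        "uses_feature_selection": "feature_selection" in pipeline,
        "classifier": CLASSIFIERS[int(index)],
        "classifier_params": params,
        "extra": configuration["additional"]["extra"],
    }


# Configuration copied from iteration 1 of the log above.
config = {
    'pipeline': {'classification': {'2': {'C': 1.3988132093705725, 'tol': 0.13663947006138843}}},
    'additional': {'extra': 4},
}

summary = describe_configuration(config)
print(summary["classifier"], summary["uses_feature_selection"])
```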