# OPTaaS Scikit-learn Pipelines

You can use OPTaaS to optimize any scikit-learn pipeline. The OPTaaS Python client will generate parameters and constraints for you based on the estimators that you want to include in your pipeline.

## Load your dataset

Let's load a classic dataset, "iris", which comes packaged with scikit-learn. It contains 50 samples from each of three species of iris flower, with four measurements per sample (sepal length, sepal width, petal length and petal width). So we have 150 rows (50 per species), 4 input columns (the four features), and one target column (the species).

In [1]:
from sklearn import datasets

iris = datasets.load_iris()
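
As a quick sanity check of the description above, the following sketch inspects the shape of the loaded data and counts the rows per class. It uses only scikit-learn and NumPy; the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# 150 rows, 4 feature columns
n_rows, n_features = iris.data.shape

# Number of rows for each of the three target classes
rows_per_class = np.bincount(iris.target)

print(n_rows, n_features)
print(iris.feature_names)
print(dict(zip(iris.target_names, rows_per_class)))
```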

## Create your estimators

We will use PCA to reduce the input dimensions, and then predict the type of flower with a Random Forest Classifier.

You can also use ExtraTreesClassifier, LinearSVC and VotingClassifier.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

estimators = [('pca', PCA()), ('rf', RandomForestClassifier())]
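
The step names chosen here (`pca`, `rf`) matter because scikit-learn addresses pipeline parameters as `<step name>__<parameter>`, which is the same naming that appears in the OPTaaS parameters later on (for example `pca__svd_solver`). A minimal sketch using plain scikit-learn, independent of OPTaaS:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

estimators = [('pca', PCA()), ('rf', RandomForestClassifier())]
pipeline = Pipeline(estimators)

# Keys containing '__' are the tunable parameters of the individual steps
step_parameter_names = sorted(
    name for name in pipeline.get_params() if '__' in name
)
print(step_parameter_names)

# The same naming scheme is used to set a parameter on a specific step
pipeline.set_params(rf__n_estimators=25)
print(pipeline.named_steps['rf'].n_estimators)
```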

## Connect to the OPTaaS server using your API Key

We now create a client, and connect to the web service that will perform our optimization. You will need to input your personal API key. Make sure you keep your key private and don't commit it to your version control system. 

In [3]:
from mindfoundry.optaas.client.sklearn_pipelines.client import OPTaaSSklearnClient

client = OPTaaSSklearnClient('<URL of your OPTaaS server>', '<Your OPTaaS API key>')
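
One way to keep the key out of version control is to read it from environment variables instead of typing it into the notebook. The sketch below is only an illustration: the variable names `OPTAAS_URL` and `OPTAAS_API_KEY` and the helper `load_optaas_credentials` are hypothetical and not part of the OPTaaS client.

```python
import os


def load_optaas_credentials(environ=os.environ):
    """Return (url, api_key) from the environment, or fail with a clear message.

    OPTAAS_URL and OPTAAS_API_KEY are hypothetical names chosen for this example.
    """
    try:
        return environ['OPTAAS_URL'], environ['OPTAAS_API_KEY']
    except KeyError as missing:
        raise RuntimeError(
            f'Missing environment variable: {missing.args[0]}'
        ) from None


# Usage, assuming both variables are set in your shell:
# url, api_key = load_optaas_credentials()
# client = OPTaaSSklearnClient(url, api_key)
```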

## Create your Task

We create the task by providing our estimators and the number of input features. The parameters and constraints will be generated automatically.

In [4]:
task = client.create_sklearn_task(
    title='My Pipeline Optimization', 
    estimators=estimators,
    feature_count=len(iris.feature_names)
)

display(task.parameters)

[{'default': 'auto',
  'enum': ['arpack', 'auto', 'full', 'randomized'],
  'id': 'pca__svd_solver',
  'name': 'pca__svd_solver',
  'type': 'categorical'},
 {'choices': [{'id': 'pca__n_components_int',
    'maximum': 4,
    'minimum': 1,
    'name': 'pca__n_components_int',
    'type': 'integer'},
   {'id': 'pca__n_components_float',
    'maximum': 0.9999999999999999,
    'minimum': 5e-324,
    'name': 'pca__n_components_float',
    'type': 'number'},
   {'id': 'pca__n_components_mle',
    'name': 'pca__n_components_mle',
    'type': 'constant',
    'value': 'mle'}],
  'id': 'pca__n_components',
  'includeInDefault': False,
  'name': 'pca__n_components',
  'optional': True,
  'type': 'choice'},
 {'default': False,
  'id': 'pca__whiten',
  'name': 'pca__whiten',
  'type': 'boolean'},
 {'default': 0.0,
  'id': 'pca__tol',
  'maximum': 1,
  'minimum': 0,
  'name': 'pca__tol',
  'optional': True,
  'type': 'number'},
 {'id': 'pca__iterated_power',
  'includeInDefault': False,
  'maximum': 9

## Generate the first configuration

Here OPTaaS gives us the first configuration (set of parameter values) for our problem. 

In [5]:
configuration = task.generate_configuration()
display(configuration)

{ 'id': '0d56a685-4159-4ece-b67c-545d746216fb',
  'type': 'default',
  'values': { 'pca__svd_solver': 'auto',
              'pca__tol': 0.0,
              'pca__whiten': False,
              'rf__bootstrap': True,
              'rf__criterion': 'gini',
              'rf__max_features': {'rf__max_features_float': 0.5},
              'rf__min_impurity_decrease': 0.0,
              'rf__min_samples_leaf': 1,
              'rf__min_samples_split': 2,
              'rf__min_weight_fraction_leaf': 0.0,
              'rf__n_estimators': 10,
              'rf__oob_score': False}}
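
Note that some values in the configuration above are nested, such as `{'rf__max_features_float': 0.5}`, because the parameter is a choice between several alternatives. The sketch below shows one way such nested values could be collapsed into plain keyword arguments. It is an assumption made for illustration, not a description of how `task.make_pipeline` works internally, and `flatten_choice_values` is a hypothetical helper.

```python
def flatten_choice_values(values):
    """Collapse single-entry dicts such as {'rf__max_features_float': 0.5} to 0.5."""
    flat = {}
    for name, value in values.items():
        if isinstance(value, dict) and len(value) == 1:
            # Keep only the value of the alternative that was chosen
            (value,) = value.values()
        flat[name] = value
    return flat


# A subset of the values from the configuration shown above
example_values = {
    'pca__svd_solver': 'auto',
    'rf__max_features': {'rf__max_features_float': 0.5},
    'rf__n_estimators': 10,
}

keyword_arguments = flatten_choice_values(example_values)
print(keyword_arguments)
```

A dictionary of this shape could then be passed to a scikit-learn pipeline with `pipeline.set_params(**keyword_arguments)`.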

## Use the configuration to create and run a pipeline

Our task can parse the configuration and use it to generate a Pipeline, which we can then run on our "iris" dataset. 

We wrap this in a function so that we can call it multiple times.

In [6]:
from sklearn.model_selection import cross_val_score

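def get_result(configuration):
    # Build a scikit-learn Pipeline from the OPTaaS configuration
    pipeline = task.make_pipeline(configuration)
    # cv=3 matches the three fold scores shown below; scikit-learn >= 0.22 defaults to 5 folds
    scores = cross_val_score(pipeline, iris.data, iris.target, scoring='f1_micro', cv=3)
    mean_score = scores.mean()
    return list(scores), mean_score

scores, mean_score = get_result(configuration)
display(scores)
display(mean_score)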

[0.9607843137254902, 0.9019607843137255, 0.9583333333333334]

0.9403594771241831
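
For a single-label multiclass problem like this one, the micro-averaged F1 score requested with `scoring='f1_micro'` is equal to plain accuracy, because every misclassified sample counts once as a false positive and once as a false negative. The sketch below checks this on a small made-up example in pure Python (the labels are invented for illustration), and also averages the three fold scores shown above.

```python
from statistics import mean


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    true_pos = false_pos = false_neg = 0
    for label in set(y_true) | set(y_pred):
        for truth, prediction in zip(y_true, y_pred):
            if prediction == label and truth == label:
                true_pos += 1
            elif prediction == label:
                false_pos += 1
            elif truth == label:
                false_neg += 1
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))

# The value reported to OPTaaS is the mean of the per-fold scores
fold_scores = [0.9607843137254902, 0.9019607843137255, 0.9583333333333334]
mean_fold_score = mean(fold_scores)
print(mean_fold_score)
```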

## Record the result in OPTaaS and generate the next configuration to try

The score we obtained must now be passed back to OPTaaS, which will use it to generate and return a new configuration.

You can store additional user-defined data along with your result, but that's entirely optional and for your convenience only. For example, you can store the individual score for each cross-validation fold so you can easily reference it later.

In [7]:
configuration = task.record_result(configuration, score=mean_score, user_defined_data=scores)
display(configuration)

{ 'id': '0c4c1603-b450-45a2-a965-6432bb97c1b0',
  'type': 'exploration',
  'values': { 'pca__iterated_power': 20,
              'pca__n_components': { 'pca__n_components_float': 0.5204439958233601},
              'pca__svd_solver': 'full',
              'pca__whiten': False,
              'rf__bootstrap': True,
              'rf__criterion': 'entropy',
              'rf__max_features': {'rf__max_features_float': 0.866710457590781},
              'rf__max_leaf_nodes': 370,
              'rf__min_impurity_decrease': 0.927463600189984,
              'rf__min_samples_leaf': 10,
              'rf__min_samples_split': 19,
              'rf__min_weight_fraction_leaf': 0.15123365104252168,
              'rf__n_estimators': 83,
              'rf__oob_score': True}}

## Repeat as necessary

Repeat the process until you're happy with the result or you've run out of time.

In [None]:
number_of_iterations = 10  # example budget; pick a value that suits your time constraints

for _ in range(number_of_iterations):
    scores, mean_score = get_result(configuration)
    configuration = task.record_result(configuration, score=mean_score, user_defined_data=scores)
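
The loop above follows a suggest, evaluate, record pattern. To show that pattern in a form that runs without an OPTaaS server, the sketch below swaps in a hypothetical local stand-in, `RandomSearchTask`, which proposes random values instead of using OPTaaS, and stops on either an iteration limit or a time budget. The class, the toy scoring function, and the `optimize` helper are all assumptions made for illustration.

```python
import random
import time


class RandomSearchTask:
    """Hypothetical stand-in for an OPTaaS task: suggests random values for one parameter."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.results = []  # (score, configuration) pairs recorded so far

    def generate_configuration(self):
        return {'values': {'x': self._rng.uniform(-1.0, 1.0)}}

    def record_result(self, configuration, score):
        # Store the result, then suggest the next configuration to try
        self.results.append((score, configuration))
        return self.generate_configuration()

    def best(self):
        return max(self.results, key=lambda pair: pair[0])


def toy_score(configuration):
    # Toy objective with its maximum of 1.0 at x = 0
    x = configuration['values']['x']
    return 1.0 - x * x


def optimize(task, evaluate, max_iterations, time_budget_seconds):
    """Run the suggest/evaluate/record loop until either limit is reached."""
    deadline = time.monotonic() + time_budget_seconds
    configuration = task.generate_configuration()
    iterations = 0
    while iterations < max_iterations and time.monotonic() < deadline:
        score = evaluate(configuration)
        configuration = task.record_result(configuration, score)
        iterations += 1
    return iterations


demo_task = RandomSearchTask(seed=0)
completed = optimize(demo_task, toy_score, max_iterations=25, time_budget_seconds=5.0)
best_score, best_configuration = demo_task.best()
print(completed, best_score, best_configuration)
```

With a real task, `evaluate` would be a function like `get_result` and the suggestions would come from OPTaaS rather than from a random number generator.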

## Complete your task

Tell OPTaaS that you're satisfied and you won't run any more experiments, then go ahead and get your final result out. Job done!

In [9]:
task.complete()

best_result, best_configuration = task.get_best_result_and_configuration()
display(best_result)
display(best_configuration)

{ 'configuration': '0d56a685-4159-4ece-b67c-545d746216fb',
  'id': 101,
  'score': 0.9403594771241831,
  'user_defined_data': [ 0.9607843137254902,
                         0.9019607843137255,
                         0.9583333333333334]}

{ 'id': '0d56a685-4159-4ece-b67c-545d746216fb',
  'type': 'default',
  'values': { 'pca__svd_solver': 'auto',
              'pca__tol': 0.0,
              'pca__whiten': False,
              'rf__bootstrap': True,
              'rf__criterion': 'gini',
              'rf__max_features': {'rf__max_features_float': 0.5},
              'rf__min_impurity_decrease': 0.0,
              'rf__min_samples_leaf': 1,
              'rf__min_samples_split': 2,
              'rf__min_weight_fraction_leaf': 0.0,
              'rf__n_estimators': 10,
              'rf__oob_score': False}}