# OPTaaS Tutorial

OPTaaS is an optimization service developed by Mind Foundry. It can optimize any function you like; all it needs to know is the parameter space over which the function operates.

The procedure is simple: you create a task, which is the definition of the parameters of the function, and send it to OPTaaS. The service will then suggest a configuration (set of values for your parameters) to evaluate. You evaluate the configuration locally and send the score you obtained back to OPTaaS. You then repeat the process until you're happy that the score is good, or that there hasn't been enough progress in a while. 
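
The suggest/evaluate/report cycle described above can be sketched without any service at all. The snippet below is only an illustration of the loop's shape: `RandomSearchOptimizer` is a hypothetical stand-in that suggests random values, and it is not how OPTaaS chooses configurations. The toy `evaluate` function and the parameter names `x` and `y` are likewise assumptions made for the example.

```python
import random


class RandomSearchOptimizer:
    """Hypothetical stand-in for an optimization service (plain random search)."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds              # {'name': (low, high)}
        self.rng = random.Random(seed)
        self.history = []                 # list of (score, configuration)

    def suggest(self):
        # Propose a value for every parameter, drawn uniformly from its range.
        return {name: self.rng.uniform(low, high)
                for name, (low, high) in self.bounds.items()}

    def report(self, configuration, score):
        # Remember the score obtained for a suggested configuration.
        self.history.append((score, configuration))

    def best(self):
        # Highest score wins; the key avoids comparing dicts on ties.
        return max(self.history, key=lambda entry: entry[0])


def evaluate(configuration):
    # Toy function to maximise; its peak (score 0) is at x=1, y=-2.
    return -((configuration['x'] - 1) ** 2 + (configuration['y'] + 2) ** 2)


optimizer = RandomSearchOptimizer({'x': (-5, 5), 'y': (-5, 5)})
for _ in range(50):
    configuration = optimizer.suggest()       # 1. get a configuration
    score = evaluate(configuration)           # 2. evaluate it locally
    optimizer.report(configuration, score)    # 3. send the score back

best_score, best_configuration = optimizer.best()
```

The rest of this tutorial follows the same three steps, with OPTaaS taking the place of the stand-in class.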

Here we demonstrate how to use OPTaaS to optimize a scikit-learn classification pipeline.

## Load your dataset

Let's load a classic dataset, "iris", which comes packaged with scikit-learn. The dataset covers three types of iris flower, with 50 samples of each type and four measurements per sample (sepal length, sepal width, petal length and petal width). So we have 150 rows (50 per flower type), 4 input columns (the four features), and one target column (the type of flower).

In [1]:
from sklearn import datasets

iris = datasets.load_iris()
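
As a quick sanity check of the numbers quoted above, the following sketch counts the rows, the features and the samples per flower type. It only uses attributes of the object returned by `load_iris()`; the variable names are our own.

```python
from collections import Counter

from sklearn import datasets

iris = datasets.load_iris()

# iris.data is a 2-D array: one row per flower, one column per measurement.
n_rows, n_features = iris.data.shape

# iris.target holds an integer label per row; target_names maps it to a name.
samples_per_class = Counter(str(iris.target_names[label]) for label in iris.target)

print(n_rows, n_features)
print(dict(samples_per_class))
```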

## Create your pipeline

We can then assemble a simple pipeline: we reduce the input dimensions with PCA, and then predict the type of flower with a Random Forest Classifier.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])

Pipeline(memory=None,
     steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('rf', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
          ...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

## Define the parameters to be optimized

Let's now define the parameters for our task. A list of available parameter types can be found [here](autogen/mindfoundry.optaas.client.html#module-mindfoundry.optaas.client.parameter).

We can use *constraints* to make sure that the parameter values are always valid, e.g. that `max_features` in our classifier will never be larger than `n_components` in PCA (see also: [Constraints](constraints.ipynb)).
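
To make the example rule concrete, here is the same condition written as a plain Python check on a dictionary of parameter values. This is only an illustration of what the rule means, not how OPTaaS represents or evaluates constraints; the helper name `satisfies_constraint` is made up for this sketch.

```python
def satisfies_constraint(values):
    """Return True if max_features (when present) does not exceed n_components."""
    max_features = values.get('rf__max_features')
    if max_features is None:
        # The parameter is optional: when it is absent there is nothing to check.
        return True
    return max_features <= values['pca__n_components']


# After PCA keeps 3 components, the forest can use at most 3 features.
print(satisfies_constraint({'pca__n_components': 3, 'rf__max_features': 2}))
print(satisfies_constraint({'pca__n_components': 1, 'rf__max_features': 4}))
print(satisfies_constraint({'pca__n_components': 1}))
```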

scikit-learn pipelines follow a naming convention for their parameters, so we abide by that convention here. This allows us to plug the generated values directly into `Pipeline.set_params()`.
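
The convention is `<step name>__<parameter name>`, with a double underscore separating the two. A minimal sketch, using a throwaway pipeline (`demo_pipeline` is our own name) with the same step names as above:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])

# 'pca__n_components' targets the n_components parameter of the 'pca' step,
# 'rf__n_estimators' targets the n_estimators parameter of the 'rf' step.
demo_pipeline.set_params(pca__n_components=2, rf__n_estimators=50)

print(demo_pipeline.named_steps['pca'].n_components)
print(demo_pipeline.named_steps['rf'].n_estimators)
```

Because a configuration's values are keyed by these same names, a whole dictionary can be passed at once with `set_params(**values)`.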

Note that you don't have to define every single parameter of your pipelines, just the ones you would like to optimize. If you don't know what parameters will lead to better results, give them all to OPTaaS and let it do the hard work for you. However, the more parameters you have, the harder it is to optimize over them, so you will need more iterations to achieve comparable results. 

In [3]:
from mindfoundry.optaas.client.parameter import IntParameter, BoolParameter, CategoricalParameter, Distribution
from mindfoundry.optaas.client.constraint import Constraint

feature_count = len(iris.feature_names)
n_components = IntParameter('pca__n_components', minimum=1, maximum=feature_count, default=feature_count)
max_features = IntParameter('rf__max_features', minimum=1, maximum=feature_count, optional=True)
# When max_features is present, it must not exceed the number of PCA components
constraint = Constraint(when=max_features.is_present(), then=max_features <= n_components)

parameters = [
    # PCA parameters
    n_components,
    BoolParameter('pca__whiten', default=False),
    
    # Random Forest parameters
    max_features,
    IntParameter('rf__n_estimators', minimum=10, maximum=1000, default=10, distribution=Distribution.LOGUNIFORM),
    CategoricalParameter('rf__criterion', values=['gini', 'entropy'], default='gini'),
    IntParameter('rf__max_depth', minimum=1, maximum=100, optional=True, default=1),
    IntParameter('rf__min_samples_split', minimum=2, maximum=20),
    IntParameter('rf__min_samples_leaf', minimum=1, maximum=20),
]

[{'id': '2381133733056', 'name': 'pca__n_components', 'type': 'integer', 'minimum': 1, 'maximum': 4},
 {'id': '2381133733112', 'name': 'pca__whiten', 'type': 'boolean', 'default': False},
 {'id': '2381133726832', 'name': 'rf__n_estimators', 'type': 'integer', 'default': 10, 'minimum': 10, 'maximum': 1000, 'distribution': 'LogUniform'},
 {'id': '2381133728288', 'name': 'rf__criterion', 'type': 'categorical', 'default': 'gini', 'enum': ['gini', 'entropy']},
 {'id': '2381133728344', 'name': 'rf__max_features', 'type': 'number', 'optional': True, 'minimum': 0.0, 'maximum': 1.0},
 {'id': '2381133728400', 'name': 'rf__max_depth', 'type': 'integer', 'optional': True, 'default': 1, 'minimum': 1, 'maximum': 100},
 {'id': '2381133728456', 'name': 'rf__min_samples_split', 'type': 'integer', 'minimum': 2, 'maximum': 20},
 {'id': '2381133728512', 'name': 'rf__min_samples_leaf', 'type': 'integer', 'minimum': 1, 'maximum': 20}]
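
To get a feel for why more parameters call for more iterations, the sketch below counts how many distinct combinations the ranges in the parameter list above allow. It is a rough back-of-the-envelope figure under stated assumptions: it ignores the constraint, and it counts "absent" as one extra choice for each optional parameter.

```python
import math

# Number of possible values per parameter, taken from the ranges defined above.
value_counts = {
    'pca__n_components': 4,             # 1..4
    'pca__whiten': 2,                   # True / False
    'rf__max_features': 4 + 1,          # 1..4, or absent
    'rf__n_estimators': 1000 - 10 + 1,  # 10..1000
    'rf__criterion': 2,                 # 'gini' / 'entropy'
    'rf__max_depth': 100 + 1,           # 1..100, or absent
    'rf__min_samples_split': 20 - 2 + 1,  # 2..20
    'rf__min_samples_leaf': 20,         # 1..20
}

total_combinations = math.prod(value_counts.values())
print(f'{total_combinations:,} possible configurations')

# Dropping a single parameter shrinks the space by that parameter's count.
without_n_estimators = total_combinations // value_counts['rf__n_estimators']
print(f'{without_n_estimators:,} without rf__n_estimators')
```

Even this small pipeline has roughly three billion combinations, which is why an exhaustive search is not practical and why each extra parameter makes the problem harder.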

## Connect to the OPTaaS server using your API Key

We now create a client, and connect to the web service that will perform our optimization. You will need to input your personal API key. Make sure you keep your key private and don't commit it to your version control system. 

In [4]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('<URL of your OPTaaS server>', '<Your OPTaaS API key>')
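
One common way to keep the key out of version control is to read it from environment variables instead of writing it in the notebook. The sketch below assumes two variable names, `OPTAAS_URL` and `OPTAAS_API_KEY`, which are our own choice and not something OPTaaS requires; the helper `load_optaas_credentials` is likewise hypothetical.

```python
import os


def load_optaas_credentials(environ=os.environ):
    """Read the server URL and API key from the environment (assumed variable names)."""
    url = environ.get('OPTAAS_URL')
    api_key = environ.get('OPTAAS_API_KEY')
    if not url or not api_key:
        raise RuntimeError('Set OPTAAS_URL and OPTAAS_API_KEY before running this notebook')
    return url, api_key


# Usage, once both variables are set in your shell:
# url, api_key = load_optaas_credentials()
# client = OPTaaSClient(url, api_key)
```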

## Create your Task

In [5]:
task = client.create_task(
    title='My Pipeline Optimization', 
    parameters=parameters,
    constraints=[constraint]
)

## Generate the first configuration

Here OPTaaS gives us the first configuration (set of parameter values) for our problem. 

In [6]:
configuration = task.generate_configuration()

{ 'id': '327e3b4c-87ea-4790-b268-f4d185df87ee',
  'type': 'default',
  'values': { 'pca__n_components': 3,
              'pca__whiten': False,
              'rf__criterion': 'gini',
              'rf__max_depth': 1,
              'rf__max_features': 2,
              'rf__min_samples_leaf': 10,
              'rf__min_samples_split': 11,
              'rf__n_estimators': 10}}

## Use the configuration to set pipeline parameters and calculate the result

We can plug this configuration into our pipeline, run it and score it. We wrap this in a function so that we can call it multiple times.

In [7]:
from sklearn.model_selection import cross_val_score

def get_result(configuration):
    pipeline.set_params(**configuration.values)
    scores = cross_val_score(pipeline, iris.data, iris.target, scoring='f1_micro')
    mean_score = scores.mean()
    return list(scores), mean_score

scores, mean_score = get_result(configuration)

([0.7058823529411765, 0.7647058823529412, 0.6666666666666666],
 0.7124183006535948)

## Record the result in OPTaaS and generate the next configuration to try

The score we obtained must now be passed back to OPTaaS, which will use it to generate and return a new configuration.

You can store additional user-defined data along with your result, but that's entirely optional and for your convenience only. For example, you can store the individual scores for each cross-validation fold so you can easily reference them later.

In [8]:
configuration = task.record_result(configuration, score=mean_score, user_defined_data=scores)

{ 'id': '1f8287b1-82b4-49da-8888-c0e614b0ce8a',
  'type': 'exploration',
  'values': { 'pca__n_components': 1,
              'pca__whiten': True,
              'rf__criterion': 'gini',
              'rf__max_depth': 69,
              'rf__min_samples_leaf': 13,
              'rf__min_samples_split': 10,
              'rf__n_estimators': 46}}

## Repeat as necessary

Repeat the process until you're happy with the result or you've run out of time. There is no single answer to how many iterations you need. One approach is to watch how much the score improves at each iteration and stop once it hasn't improved in a while. The right budget also depends on how long you are prepared to wait and on how much each evaluation costs on your hardware, especially if you're using a cloud computing service. In general, more iterations give better results, and harder problems (i.e. problems with more parameters) need more iterations to find a decent solution.

In [9]:
number_of_iterations = 20  # example budget; adjust to the time and compute you can afford

for _ in range(number_of_iterations):
    scores, mean_score = get_result(configuration)
    configuration = task.record_result(configuration, score=mean_score, user_defined_data=scores)
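
The "stop when you haven't improved in a while" rule can be written as a small helper. This is a sketch under our own assumptions: higher scores are better, and `patience` and `min_improvement` are hypothetical knobs introduced for the example, not OPTaaS settings.

```python
def should_stop(score_history, patience=10, min_improvement=0.0):
    """Return True when the last `patience` scores did not beat the earlier best."""
    if len(score_history) <= patience:
        # Not enough results yet to judge whether progress has stalled.
        return False
    best_before = max(score_history[:-patience])
    best_recent = max(score_history[-patience:])
    return best_recent <= best_before + min_improvement


# Usage inside the loop above (shown as comments because it needs a live task):
# history = []
# for _ in range(number_of_iterations):
#     scores, mean_score = get_result(configuration)
#     history.append(mean_score)
#     if should_stop(history, patience=10):
#         break
#     configuration = task.record_result(configuration, score=mean_score, user_defined_data=scores)
```

A larger `patience` makes the loop more tolerant of long flat stretches at the price of more evaluations.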

## Complete your task

Tell OPTaaS that you're satisfied and you won't run any more experiments, then go ahead and get your final result out. Job done!

In [10]:
task.complete()

best_result, best_configuration = task.get_best_result_and_configuration()

Best result: 0.9403594771241831  [0.9411764705882353, 0.9215686274509803, 0.9583333333333334]
Best configuration: { 'id': 'a0caee4e-1ffa-4793-8dff-a59642009936',
  'type': 'exploration',
  'values': { 'pca__n_components': 2,
              'pca__whiten': False,
              'rf__criterion': 'entropy',
              'rf__min_samples_leaf': 2,
              'rf__min_samples_split': 13,
              'rf__n_estimators': 987}}
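
A natural next step is to fit a final model with the best values. The sketch below copies the values from the output above into a literal dictionary so that it runs on its own; in the notebook you would pass `best_configuration.values` instead, as was done in `get_result`. Scores will vary from run to run because the random forest is not seeded.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

iris = datasets.load_iris()
pipeline = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier())])

# Values copied from the best configuration printed above.
best_values = {
    'pca__n_components': 2,
    'pca__whiten': False,
    'rf__criterion': 'entropy',
    'rf__min_samples_leaf': 2,
    'rf__min_samples_split': 13,
    'rf__n_estimators': 987,
}

pipeline.set_params(**best_values)    # same mechanism as in get_result
pipeline.fit(iris.data, iris.target)  # train on the full dataset

# Predict the flower type for the first five rows as a smoke test.
predicted = pipeline.predict(iris.data[:5])
print([str(iris.target_names[label]) for label in predicted])
```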
