# OPTaaS Scikit-learn Custom Optimizable Estimators

Using the OPTaaS Python Client, you can optimize any scikit-learn pipeline. For each step or estimator in the pipeline, OPTaaS just needs to know what parameters to optimize and what constraints will apply to them.

We have provided pre-defined parameters and constraints for some of the most widely used estimators, such as Random Forest and XGBoost, but you can easily optimize your own custom estimator using our `OptimizableBaseEstimator` mixin. Here's an example:

## Creating an Optimizable PCA

We will take scikit-learn's PCA class and make it optimizable by OPTaaS. First we create a class that extends both scikit-learn's base PCA class and our `OptimizableBaseEstimator` mixin. The mixin declares an abstract method, `make_parameters_and_constraints`, that we need to implement:

In [1]:
from sklearn.decomposition import PCA as BasePCA

from mindfoundry.optaas.client.sklearn_pipelines.mixin import OptimizableBaseEstimator, ParametersAndConstraints
from mindfoundry.optaas.client.sklearn_pipelines.parameter_maker import SklearnParameterMaker

class OptimizablePCA(BasePCA, OptimizableBaseEstimator):
    def make_parameters_and_constraints(self, sk: SklearnParameterMaker, **kwargs) -> ParametersAndConstraints:
        pass

## Define some Parameters

Using the [scikit-learn docs](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) as a guide, we create some OPTaaS parameters for optimization. These will be returned from `make_parameters_and_constraints` (we'll leave the constraints as an empty list for now).

You'll notice the first argument to the method is a `SklearnParameterMaker`. We will use this to create our parameters, i.e. we call `sk.CategoricalParameter` instead of just `CategoricalParameter`.

This ensures that each parameter is automatically assigned the correct default value, which is taken from the estimator's `get_params` method. Therefore, any parameter values you specify in the estimator's constructor will override the scikit-learn defaults.

In [2]:
class OptimizablePCA(BasePCA, OptimizableBaseEstimator):
    def make_parameters_and_constraints(self, sk: SklearnParameterMaker, **kwargs) -> ParametersAndConstraints:
        svd_solver = sk.CategoricalParameter("svd_solver", values=['arpack', 'auto', 'full', 'randomized'])
        whiten = sk.BoolParameter('whiten')
        tol = sk.FloatParameter('tol', minimum=0, maximum=1, optional=True)
        
        return [svd_solver, whiten, tol], []

# This estimator will use 'full' as the default value for svd_solver
OptimizablePCA(svd_solver='full')

OptimizablePCA(copy=True, iterated_power='auto', n_components=None,
        random_state=None, svd_solver='full', tol=0.0, whiten=False)
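
To see where that default comes from, it can help to look at `get_params` on the plain scikit-learn class. The snippet below is a minimal sketch that uses only scikit-learn's `PCA` (no OPTaaS classes), and assumes that the parameter maker simply reads the same dictionary:

```python
from sklearn.decomposition import PCA

# get_params() reports the values the estimator was constructed with.
default_params = PCA().get_params()
overridden_params = PCA(svd_solver='full').get_params()

print(default_params['svd_solver'])     # scikit-learn's own default: 'auto'
print(overridden_params['svd_solver'])  # the constructor argument wins: 'full'
```

Any value passed to the constructor therefore replaces the library default in the dictionary that the defaults are read from.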

## Multi-type Parameters

Some parameters are multi-type, e.g. `n_components` can be an integer, a float, or the constant `'mle'`. We model this using a `ChoiceParameter`, and our `SklearnParameterMaker` will handle setting the default value correctly. We will do the same for `iterated_power`, which can be either an integer or the constant `'auto'`.

In order to set the maximum int value for `n_components`, we also need to know how many features are in our dataset. We will therefore expect a `feature_count` value to be provided when the task is created (in the `OPTaaSClient.create_sklearn_task` method), and this will be made available to us here in `kwargs`.
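
In practice the feature count would normally be read from the training data rather than typed in by hand. A minimal sketch, assuming the features are held in a two-dimensional NumPy array (the array `X` and its random contents are hypothetical stand-ins for a real dataset):

```python
import numpy

# Hypothetical training data: 100 samples, 20 features.
rng = numpy.random.default_rng(0)
X = rng.normal(size=(100, 20))

# One column per feature, so the column count is the value to supply
# as feature_count when the task is created.
feature_count = X.shape[1]
print(feature_count)
```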

The range of floating point values for `n_components` needs to be >0 and <1. Since OPTaaS ranges are inclusive of the bounds, we use numpy to generate the smallest value above 0 and the largest value below 1.
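
As a quick check of that trick, the sketch below uses only `numpy.nextafter` to show that the two bounds sit strictly inside the open interval (0, 1) and that no representable float lies between each bound and the excluded endpoint (the variable names are illustrative):

```python
import numpy

# Smallest float above 0 and largest float below 1.
lower = numpy.nextafter(0.0, 1)
upper = numpy.nextafter(1.0, 0)

print(lower, upper)

# Both bounds are strictly inside (0, 1) ...
assert 0.0 < lower and upper < 1.0
# ... and stepping once more towards the endpoint lands exactly on it.
assert numpy.nextafter(lower, 0) == 0.0
assert numpy.nextafter(upper, 1) == 1.0
```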

In [3]:
import numpy

class OptimizablePCA(BasePCA, OptimizableBaseEstimator):
    def make_parameters_and_constraints(self, sk: SklearnParameterMaker, **kwargs) -> ParametersAndConstraints:
        feature_count = self._get_required_kwarg(kwargs, 'feature_count')

        mle = sk.ConstantParameter('n_components_mle', value='mle')
        n_components_int = sk.IntParameter('n_components_int', minimum=1, maximum=feature_count)
        n_components_float = sk.FloatParameter('n_components_float', minimum=numpy.nextafter(0.0, 1),
                                               maximum=numpy.nextafter(1.0, 0))
        n_components = sk.ChoiceParameter('n_components', optional=True,
                                          choices=[n_components_int, n_components_float, mle])

        iterated_power_auto = sk.ConstantParameter('iterated_power_auto', value='auto')
        iterated_power_int = sk.IntParameter('iterated_power_int', minimum=0, maximum=99)
        iterated_power = sk.ChoiceParameter('iterated_power', choices=[iterated_power_auto, iterated_power_int])

        svd_solver = sk.CategoricalParameter("svd_solver", values=['arpack', 'auto', 'full', 'randomized'])
        whiten = sk.BoolParameter('whiten')
        tol = sk.FloatParameter('tol', minimum=0, maximum=1, optional=True)
        
        return [svd_solver, whiten, tol, n_components, iterated_power], []

## Constraints

Finally, we implement some constraints that will prevent OPTaaS from generating invalid configurations. Here we specify how the `svd_solver` value affects other parameters:

In [4]:
import numpy

from mindfoundry.optaas.client.constraint import Constraint

class OptimizablePCA(BasePCA, OptimizableBaseEstimator):
    def make_parameters_and_constraints(self, sk: SklearnParameterMaker, **kwargs) -> ParametersAndConstraints:
        feature_count = self._get_required_kwarg(kwargs, 'feature_count')

        mle = sk.ConstantParameter('n_components_mle', value='mle')
        n_components_int = sk.IntParameter('n_components_int', minimum=1, maximum=feature_count)
        n_components_float = sk.FloatParameter('n_components_float', minimum=numpy.nextafter(0.0, 1),
                                               maximum=numpy.nextafter(1.0, 0))
        n_components = sk.ChoiceParameter('n_components', optional=True,
                                          choices=[n_components_int, n_components_float, mle])

        iterated_power_auto = sk.ConstantParameter('iterated_power_auto', value='auto')
        iterated_power_int = sk.IntParameter('iterated_power_int', minimum=0, maximum=99)
        iterated_power = sk.ChoiceParameter('iterated_power', choices=[iterated_power_auto, iterated_power_int])

        svd_solver = sk.CategoricalParameter("svd_solver", values=['arpack', 'auto', 'full', 'randomized'])
        whiten = sk.BoolParameter('whiten')
        tol = sk.FloatParameter('tol', minimum=0, maximum=1, optional=True)
        
        return [svd_solver, whiten, tol, n_components, iterated_power], [
            Constraint(when=svd_solver == 'arpack', then=(n_components_int < feature_count) & tol.is_present()),
            Constraint(when=(svd_solver == 'auto') | (svd_solver == 'randomized'),
                       then=n_components.is_absent() | (n_components == n_components_int))
        ]
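
To make the two constraints concrete, here is a plain-Python reading of them that does not use OPTaaS at all. The function `is_valid_pca_config` is a hypothetical helper written for this illustration; it represents an absent optional parameter as `None`, and it assumes that the `<` comparison in the first constraint can only hold when the integer choice is selected (how OPTaaS itself evaluates a comparison on an unselected choice is not covered here):

```python
from typing import Optional, Union

NComponents = Optional[Union[int, float, str]]


def is_valid_pca_config(svd_solver: str, n_components: NComponents,
                        tol: Optional[float], feature_count: int) -> bool:
    """Illustrative check of the two constraints; None means 'absent'."""
    is_int_choice = isinstance(n_components, int) and not isinstance(n_components, bool)

    if svd_solver == 'arpack':
        # Needs an integer n_components below feature_count, and tol present.
        # Assumption: a non-integer or absent n_components fails the comparison.
        if not (is_int_choice and n_components < feature_count and tol is not None):
            return False

    if svd_solver in ('auto', 'randomized'):
        # n_components must be absent or come from the integer choice.
        if not (n_components is None or is_int_choice):
            return False

    return True


print(is_valid_pca_config('arpack', 5, 0.0, feature_count=20))
print(is_valid_pca_config('randomized', 'mle', 0.0, feature_count=20))
```

The first call satisfies both constraints, while the second violates the `'auto'`/`'randomized'` rule because `'mle'` is not the integer choice.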

## Creating our Task

We now create a task using our new estimator. As the output shows, the parameters and constraints have been generated as expected, and the defaults reflect the values passed to the constructor (`svd_solver='full'` and `n_components='mle'`).

In [5]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('https://optaas.mindfoundry.ai', '<Your OPTaaS API key>')

task = client.create_sklearn_task(
    title='My Task with OptimizablePCA', 
    estimators=[('pca', OptimizablePCA(n_components='mle', svd_solver='full'))],
    feature_count=20
)

display(task.parameters, task.constraints)

[{'id': 'pca',
  'items': [{'default': 'full',
    'enum': ['arpack', 'auto', 'full', 'randomized'],
    'id': 'pca__svd_solver',
    'name': 'svd_solver',
    'type': 'categorical'},
   {'default': False,
    'id': 'pca__whiten',
    'name': 'whiten',
    'type': 'boolean'},
   {'default': 0.0,
    'id': 'pca__tol',
    'maximum': 1,
    'minimum': 0,
    'name': 'tol',
    'optional': True,
    'type': 'number'},
   {'choices': [{'id': 'pca__n_components_int',
      'maximum': 20,
      'minimum': 1,
      'name': 'n_components_int',
      'type': 'integer'},
     {'id': 'pca__n_components_float',
      'maximum': 0.9999999999999999,
      'minimum': 5e-324,
      'name': 'n_components_float',
      'type': 'number'},
     {'id': 'pca__n_components_mle',
      'name': 'n_components_mle',
      'type': 'constant',
      'value': 'mle'}],
    'default': '#pca__n_components_mle',
    'id': 'pca__n_components',
    'name': 'n_components',
    'optional': True,
    'type': 'choice'},
   {

["if #pca__svd_solver == 'arpack' then ( #pca__n_components_int < 20 ) && #pca__tol is_present",
 "if ( #pca__svd_solver == 'auto' ) || ( #pca__svd_solver == 'randomized' ) then #pca__n_components is_absent || ( #pca__n_components == #pca__n_components_int )"]

## Optional Estimators

Any estimator can be made an optional step in a pipeline by calling `optional_step(estimator)`, as demonstrated [here](sklearn.ipynb).

However, if you want your estimator to **always** be optional, you can use the `OptionalStepMixin` instead of `OptimizableBaseEstimator`:

In [6]:
from mindfoundry.optaas.client.sklearn_pipelines.mixin import OptionalStepMixin

class OptionalPCA(BasePCA, OptionalStepMixin):
    pass