### KDD 2022 Hands-on Tutorial on "Gradual AutoML using Lale"

# 10. Schemas and Their Uses

Lale operators come with JSON Schemas that describe the shape of their inputs, outputs, and hyperparameters.

The [New Operators](08_newops.ipynb) notebook contains an introduction to writing these schemas.
In this notebook, we explore how Lale uses the schemas.
In particular, Lale uses the schemas to generate early, informative error messages, generate documentation, and generate search spaces for AutoML.

This notebook has the following sections:

- [10.1 Validation / Error Messages](#10.1-Validation-/-Error-Messages)
- [10.2 Automatically Generating Documentation](#10.2-Automatically-Generating-Documentation)
- [10.3 AutoML: Generating Search Spaces](#10.3-AutoML:-Generating-Search-Spaces)

## 10.1 Validation / Error Messages

Schemas are used to validate both the pipeline shape and the hyperparameters for each operator.

### 10.1.1 Data Schemas

Each operator comes with schemas that describe the operator's allowed input and output types.  In addition to simple static JSON schemas, Lale operators can also include a `transform_schema` method that can dynamically compute the output schema based on the actual shape of the input schema.
Given these schemas, Lale can check that a pipeline is well-formed:  the output of each operator must be compatible with the input schema of the operator it is piped to.
This is done using a _subschema_ entailment check.
If one of the required entailments fails, Lale raises an error that names the offending operator and argument and shows both schemas, making it easy to find the problem. More information about this check can be found in [the bibliography below](#Bibliography).

As an example:

```python
from lale.lib.sklearn import PCA
from lale.lib.sklearn import TfidfVectorizer
from lale.settings import set_disable_data_schema_validation, set_disable_hyperparams_schema_validation
import lale.datasets.openml
import pandas as pd
import numpy as np

# enable schema validation explicitly for the notebook, in case it was disabled for performance
set_disable_data_schema_validation(False)
set_disable_hyperparams_schema_validation(False)

# Load sample data
(train_X, train_y), (test_X, test_y) = lale.datasets.openml.fetch(
    "credit-g", "classification", preprocess=True
)

# create a pipeline that tries to extract TF-IDF vector information from strings,
# but erroneously applies it to the result of principal component analysis (which returns numbers)

pipeline = PCA() >> TfidfVectorizer()
try:
    pipeline.fit(np.array(train_X), np.array(train_y))
except ValueError as e:
    print(e)
```

```
TfidfVectorizer.fit() invalid X, the schema of the actual data is not a subschema of the expected schema of the argument.
actual_schema = {
    "description": "Features; the outer array is over samples.",
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
expected_schema = {
    "description": "Features; the outer array is over samples.",
    "anyOf": [
        {"type": "array", "items": {"type": "string"}},
        {
            "type": "array",
            "items": {
                "type": "array",
                "minItems": 1,
                "maxItems": 1,
                "items": {"type": "string"},
            },
        },
    ],
}
```
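
To see why the check in the error message above fails, the following is a minimal sketch of a subschema check over a tiny subset of JSON Schema (`type`, `items`, `minItems`, `maxItems`, `anyOf`). The function `is_subschema` is a hypothetical helper written for this illustration; it is not Lale's implementation, and it is deliberately conservative (it may answer `False` for some schemas that a complete checker would accept).

```python
import math


def is_subschema(sub, sup):
    """Conservatively check that every value allowed by `sub` is allowed by `sup`.

    Toy subset of JSON Schema: only `type`, `items`, `minItems`, `maxItems`,
    and `anyOf` are understood; all other keywords are ignored.
    """
    # A union on the left is covered only if each of its alternatives is covered.
    if "anyOf" in sub:
        return all(is_subschema(alt, sup) for alt in sub["anyOf"])
    # A union on the right: require a single alternative to cover `sub`
    # (sound but incomplete, which is acceptable for a sketch).
    if "anyOf" in sup:
        return any(is_subschema(sub, alt) for alt in sup["anyOf"])
    sup_type = sup.get("type")
    if sup_type is None:
        return True  # no type constraint on the right: anything is accepted
    sub_type = sub.get("type")
    if sub_type != sup_type and (sub_type, sup_type) != ("integer", "number"):
        return False
    if sup_type == "array":
        # The left side must not allow shorter or longer arrays than the right.
        if sub.get("minItems", 0) < sup.get("minItems", 0):
            return False
        if sub.get("maxItems", math.inf) > sup.get("maxItems", math.inf):
            return False
        return is_subschema(sub.get("items", {}), sup.get("items", {}))
    return True


# Schemas copied from the error message above (descriptions omitted).
actual_schema = {
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
expected_schema = {
    "anyOf": [
        {"type": "array", "items": {"type": "string"}},
        {
            "type": "array",
            "items": {
                "type": "array",
                "minItems": 1,
                "maxItems": 1,
                "items": {"type": "string"},
            },
        },
    ],
}

print(is_subschema(actual_schema, expected_schema))  # False
```

Neither alternative of the expected schema accepts arrays of numbers, so the entailment fails, which mirrors the `ValueError` reported for `TfidfVectorizer.fit()`.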


### 10.1.2 Hyperparameters

Schemas are also used to describe the hyperparameters for an individual operator, as well as their allowed values.
These schemas are typically more complex than input/output schemas, and are heavily used in Lale.
The input schema can also include data dependencies (the schema can refer to the shape of the actual data at runtime).  
When an operator is configured, the specified hyperparameter values are checked against the schema.
Note that Lale hyperparameter schemas often include side constraints, so validation checks that the hyperparameter configuration as a whole is valid, not just each individual value.
Additionally, if the configuration is invalid, Lale uses the schema to automatically suggest fixes (similar alternative configurations that are valid).
For example:

```python
from lale.lib.sklearn import RandomForestClassifier
import jsonschema

try:
    rfc = RandomForestClassifier(bootstrap=False, oob_score=True)
except jsonschema.ValidationError as e:
    print(e)
```

```
Invalid configuration for RandomForestClassifier(bootstrap=False, oob_score=True) due to constraint out of bag estimation only available if bootstrap=True.
Some possible fixes include:
- set bootstrap=True
- set oob_score=False
Schema of failing constraint: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.random_forest_classifier.html#constraint-2
Invalid value: {'bootstrap': False, 'oob_score': True, 'n_estimators': 100, 'criterion': 'gini', 'max_depth': None, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_weight_fraction_leaf': 0.0, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'n_jobs': None, 'random_state': None, 'verbose': 0, 'warm_start': False, 'class_weight': None, 'ccp_alpha': 0.0, 'max_samples': None}
```
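
The side constraint behind the message above can be pictured as a small JSON Schema over the whole hyperparameter dictionary. The sketch below is an illustration only: the constraint `OOB_CONSTRAINT` is modeled on the error message rather than copied from Lale, and `satisfies` and `check_constraint` are hypothetical helpers, not the algorithm Lale uses to find fixes.

```python
# Assumed shape of the constraint: at least one alternative must hold.
OOB_CONSTRAINT = {
    "description": "Out of bag estimation only available if bootstrap=True.",
    "anyOf": [
        {"type": "object", "properties": {"bootstrap": {"enum": [True]}}},
        {"type": "object", "properties": {"oob_score": {"enum": [False]}}},
    ],
}


def satisfies(hyperparams, alternative):
    """True if every property named by the alternative has an allowed value."""
    return all(
        hyperparams[name] in spec["enum"]
        for name, spec in alternative["properties"].items()
        if name in hyperparams  # JSON Schema only constrains properties that are present
    )


def check_constraint(hyperparams, constraint):
    """Return [] if the constraint holds, else one suggested fix per alternative."""
    if any(satisfies(hyperparams, alt) for alt in constraint["anyOf"]):
        return []
    fixes = []
    for alt in constraint["anyOf"]:
        changes = [
            f"{name}={spec['enum'][0]!r}" for name, spec in alt["properties"].items()
        ]
        fixes.append("set " + ", ".join(changes))
    return fixes


print(check_constraint({"bootstrap": False, "oob_score": True}, OOB_CONSTRAINT))
# ['set bootstrap=True', 'set oob_score=False']
print(check_constraint({"bootstrap": True, "oob_score": True}, OOB_CONSTRAINT))
# []
```

Each value is individually legal here (both hyperparameters are Booleans); only the combination is rejected, which is why the check has to look at the configuration as a whole.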


## 10.2 Automatically Generating Documentation

In addition to helping find problems quickly, schemas are used to automatically generate documentation suitable for Sphinx (and for hosting, for example, on [Read the Docs](https://lale.readthedocs.io/)).
After an operator is created (by calling `make_operator` and/or `customize_schema`), the Lale convention is to call
`lale.docstrings.set_docstrings` on the variable that holds the resulting operator.

For example:

```python
AwesomeOp = lale.operators.make_operator(
    _AwesomeOpImpl, _combined_schemas_for_awesome_op
)
lale.docstrings.set_docstrings(AwesomeOp)
```

When run by Sphinx (during documentation generation), this function rebinds the name (`AwesomeOp` in this example) so that it refers to a dynamically created class carrying the documentation for the `AwesomeOp` operator.
This class is created by a compilation process that analyzes the schemas and creates methods as appropriate.
For example, the compilation process creates a fake `__init__` method with arguments corresponding to all of the hyperparameters described in the schema, with appropriate descriptions/types.
Similarly, if a `fit` method is available, the fit input schema is translated into appropriate documentation for the dynamically created `fit` method, and so on for the other methods.
Note that these methods have stubs for their implementations; they exist simply to enable Sphinx to generate appropriate documentation for them.
When not running under Sphinx, the call to `lale.docstrings.set_docstrings` has no effect.
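
The idea of compiling a schema into a documented stub can be sketched with the standard `inspect` module. Everything here is an assumption made for illustration: `make_doc_class` and `AWESOME_HYPERPARAMS` are hypothetical, and the code does not reproduce what `lale.docstrings.set_docstrings` does internally.

```python
import inspect

# Hypothetical hyperparameter schema for the AwesomeOp example.
AWESOME_HYPERPARAMS = {
    "description": "Hypothetical operator used for illustration.",
    "properties": {
        "n_components": {
            "type": "integer",
            "default": 2,
            "description": "Number of components to keep.",
        },
        "whiten": {
            "type": "boolean",
            "default": False,
            "description": "Whether to rescale the components.",
        },
    },
}


def make_doc_class(name, hyperparam_schema):
    """Build a stub class whose `__init__` documents the given hyperparameters."""
    params = [inspect.Parameter("self", inspect.Parameter.POSITIONAL_OR_KEYWORD)]
    doc_lines = ["Parameters", "----------"]
    for hp_name, hp_schema in hyperparam_schema["properties"].items():
        default = hp_schema.get("default", inspect.Parameter.empty)
        params.append(
            inspect.Parameter(hp_name, inspect.Parameter.KEYWORD_ONLY, default=default)
        )
        doc_lines.append(f"{hp_name} : {hp_schema.get('type', 'any')}")
        doc_lines.append(f"    {hp_schema.get('description', '')}")

    def __init__(self, **hyperparams):
        pass  # stub: exists only so that documentation tools can inspect it

    # Advertise the schema-derived signature and docstring instead of **hyperparams.
    __init__.__signature__ = inspect.Signature(params)
    __init__.__doc__ = "\n".join(doc_lines)
    class_doc = hyperparam_schema.get("description", "")
    return type(name, (), {"__init__": __init__, "__doc__": class_doc})


AwesomeOpDoc = make_doc_class("AwesomeOp", AWESOME_HYPERPARAMS)
print(inspect.signature(AwesomeOpDoc.__init__))
print(AwesomeOpDoc.__init__.__doc__)
```

A documentation tool that introspects `AwesomeOpDoc` sees one keyword argument per hyperparameter, with the default and description taken from the schema, even though the method body does nothing.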

As an example, the automatically generated documentation for [PCA](https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.pca.html) includes: 

<img src="readthedocs_pca.png" width="600" />

Note how the constructor arguments of the generated PCA documentation class reflect the schema's hyperparameters, along with defaults if specified. The description of each hyperparameter is compiled from the schema. In the example shown, `n_components` allows four different types of values, each of which is clearly explained. The entry also cross-references the side constraints that mention `n_components`, which are included further down in the documentation.

Similarly, `fit` and `predict` methods are documented, along with compiled information about the shape of their input and output, and any fit parameters (additional arguments that the operator supports passing to the `fit` method).
The `description` strings in the schemas are leveraged throughout to provide useful descriptions to the user.

Automatically generating information in this fashion ensures that the documentation and the code stay synchronized, and that we are validating exactly what is specified in the documentation.

## 10.3 AutoML: Generating Search Spaces

The most sophisticated use Lale makes of schemas is to automatically generate search spaces from the hyperparameter schemas of the operators in a pipeline.
This enables AutoML to automatically tune the (unconfigured) hyperparameters of the operators in a pipeline.
It also works in conjunction with selecting from different choices encoded in the pipeline using the Lale choice combinator (`|`).
The planned pipeline is encoded into an optimizer search space via a compilation process.

Each operator schema is turned into a simplified representation that is easier to communicate to search space optimizers.
Explicitly configured hyperparameter values are taken into account.
This simplified form for each operator is combined into a bigger search space specification that encodes the shape of the pipeline (and the choices it may embed).
This representation is then compiled into the form required by the specified backend (such as Hyperopt, SMAC, or grid search). For example, for scikit-learn's [GridSearchCV](https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.grid_search_cv.html#lale.lib.lale.grid_search_cv.GridSearchCV), a list of grids is produced (a discretized representation of the encoded search space).
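
As a rough picture of the grid-search case, the sketch below discretizes two hyperparameter schemas into a list of grids, one per alternative of a choice. The schemas `TOY_TREE` and `TOY_LINEAR` and the helper `schema_to_grid` are hypothetical, the plain `minimum`/`maximum` keywords stand in for whatever ranges the real schemas declare, and the even spacing is an arbitrary choice, not Lale's discretization strategy.

```python
# Hypothetical hyperparameter schemas for two alternatives of a choice.
TOY_TREE = {
    "properties": {
        "criterion": {"enum": ["gini", "entropy"], "default": "gini"},
        "max_depth": {"type": "integer", "minimum": 1, "maximum": 5, "default": 3},
    }
}
TOY_LINEAR = {
    "properties": {
        "C": {"type": "number", "minimum": 0.0, "maximum": 1.0, "default": 1.0},
        "fit_intercept": {"type": "boolean", "default": True},
    }
}


def schema_to_grid(hyperparam_schema, configured=None, num_samples=3):
    """Turn one hyperparameter schema into a dict of candidate value lists."""
    configured = configured or {}
    grid = {}
    for name, spec in hyperparam_schema["properties"].items():
        if name in configured:
            # Explicitly configured values are pinned, not searched.
            grid[name] = [configured[name]]
        elif "enum" in spec:
            grid[name] = list(spec["enum"])
        elif "minimum" in spec and "maximum" in spec:
            # Sample the range at evenly spaced points.
            low, high = spec["minimum"], spec["maximum"]
            step = (high - low) / (num_samples - 1)
            values = [low + i * step for i in range(num_samples)]
            if spec.get("type") == "integer":
                values = sorted({round(v) for v in values})
            grid[name] = values
        else:
            # Nothing to search over: fall back to the default.
            grid[name] = [spec["default"]]
    return grid


# One grid per alternative; `criterion` was fixed by the user for the tree.
grids = [
    schema_to_grid(TOY_TREE, configured={"criterion": "entropy"}),
    schema_to_grid(TOY_LINEAR),
]
for grid in grids:
    print(grid)
```

Because each alternative of a choice has its own hyperparameters, the result is a list of grids rather than a single grid, which matches the list-of-dictionaries form that scikit-learn's `GridSearchCV` accepts for `param_grid`.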

Each time the specified AutoML optimizer samples a point in the search space, Lale runs a reverse compilation to decode that point into a trainable pipeline.
It fits that pipeline to obtain a trained pipeline and evaluates it to obtain metrics, which certain optimizers (such as [Hyperopt](https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.hyperopt.html#lale.lib.lale.hyperopt.Hyperopt)) take into consideration when sampling the next point in the search space.

<img src="workflow_enc_dec.png" width="360" />
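
The decode-and-evaluate loop can be sketched as follows. This is a toy under stated assumptions: the flat `op__hyperparameter` encoding, the `SEARCH_SPACE` contents, and the helpers `enumerate_points`, `decode`, `toy_score`, and `search` are all hypothetical, and `toy_score` is a made-up stand-in for fitting a pipeline and computing a metric. The loop enumerates the points exhaustively, as grid search would; an optimizer such as Hyperopt would instead use earlier scores to pick the next point.

```python
import itertools

# Hypothetical encoded search space: one grid per alternative of a choice.
SEARCH_SPACE = [
    {"op": ["knn"], "knn__n_neighbors": [1, 5, 9]},
    {"op": ["tree"], "tree__max_depth": [2, 4], "tree__criterion": ["gini", "entropy"]},
]


def enumerate_points(grids):
    """Yield every point of every grid as a flat dict."""
    for grid in grids:
        names = list(grid)
        for values in itertools.product(*(grid[n] for n in names)):
            yield dict(zip(names, values))


def decode(point):
    """Map a flat point back to (operator name, hyperparameter dict)."""
    op = point["op"]
    prefix = op + "__"
    hyperparams = {
        key[len(prefix):]: value
        for key, value in point.items()
        if key.startswith(prefix)
    }
    return op, hyperparams


def toy_score(op, hyperparams):
    """Made-up stand-in for 'fit the decoded pipeline, then compute a metric'."""
    return -sum(v for v in hyperparams.values() if isinstance(v, int))


def search(grids, score):
    """Decode and score each point; return the best (score, op, hyperparams)."""
    best = None
    for point in enumerate_points(grids):
        op, hyperparams = decode(point)
        result = (score(op, hyperparams), op, hyperparams)
        if best is None or result[0] > best[0]:
            best = result
    return best


print(search(SEARCH_SPACE, toy_score))  # (-1, 'knn', {'n_neighbors': 1})
```

The optimizer only ever sees flat points; `decode` is what turns a point back into something that looks like a configured operator, mirroring the reverse compilation step described above.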

## Bibliography

The following paper has more information about subschema entailment checks:

```bibtex
@InProceedings{habib_et_al_2021,
  title = "Finding Data Compatibility Bugs with {JSON} Subschema Checking",
  author = "Habib, Andrew and Shinnar, Avraham and Hirzel, Martin and Pradel, Michael",
  booktitle = "International Symposium on Software Testing and Analysis (ISSTA)",
  year = 2021,
  pages = "620--632",
  url = "https://doi.org/10.1145/3460319.3464796" }
```

The following paper has more information about how Lale generates search spaces:

```bibtex
@InProceedings{baudart_et_al_2021,
  title = "Pipeline Combinators for Gradual {AutoML}",
  author = "Baudart, Guillaume and Hirzel, Martin and Kate, Kiran and Ram, Parikshit and Shinnar, Avraham and Tsay, Jason",
  booktitle = "Advances in Neural Information Processing Systems (NeurIPS)",
  year = 2021,
  url = "https://proceedings.neurips.cc/paper/2021/file/a3b36cb25e2e0b93b5f334ffb4e4064e-Paper.pdf" }
```