# Schemas and their uses

Lale operators come with JSON Schemas that describe the shape of their inputs, their outputs, and their hyper-parameters.

The [New Operators](08_newops.ipynb) notebook contains an introduction to writing these schemas.
In this notebook, we explore how Lale uses these schemas.
In particular, Lale uses the schemas to produce early, informative error messages, generate documentation, and generate search spaces for AutoAI.

## Validation / Error Messages

Schemas are used to validate both the pipeline shape and the hyperparameters for each operator.

### Pipeline shape

Each operator comes with schemas that describe the operator's allowed input and output types.  In addition to simple static JSON schemas, Lale operators can also include a `transform_schema` method, which dynamically computes the output schema from the actual input schema.
Given these schemas, Lale can check that a pipeline is well-formed:  the output of each operator must be compatible with the input schema of the operator it is piped to.
This is done using a _subschema_ entailment check.
If one of the required entailments fails, an error is raised that shows both schemas, making it easy to find the problem.  More information about this check can be found in [the bibliography below](#Bibliography).

As an example:

```python
from lale.lib.sklearn import PCA
from lale.lib.sklearn import TfidfVectorizer
from lale.settings import set_disable_data_schema_validation, set_disable_hyperparams_schema_validation
import lale.datasets.openml
import pandas as pd
import numpy as np

# enable schema validation explicitly for the notebook, in case it was disabled for performance
set_disable_data_schema_validation(False)
set_disable_hyperparams_schema_validation(False)

# Load sample data
(train_X, train_y), (test_X, test_y) = lale.datasets.openml.fetch(
    'credit-g', 'classification', preprocess=True)

# create a pipeline that tries to extract TF-IDF vector information from strings,
# but erroneously applies it to the result of principal component analysis (which returns numbers)

pipeline = PCA() >> TfidfVectorizer()
pipeline.fit(np.array(train_X), np.array(train_y))
```

```
ValueError: TfidfVectorizer.fit() invalid X, the schema of the actual data is not a subschema of the expected schema of the argument.
actual_schema = {
    "description": "Features; the outer array is over samples.",
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
expected_schema = {
    "description": "Features; the outer array is over samples.",
    "anyOf": [
        {"type": "array", "items": {"type": "string"}},
        {
            "type": "array",
            "items": {
                "type": "array",
                "minItems": 1,
                "maxItems": 1,
                "items": {"type": "string"},
            },
        },
    ],
}
```
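
To give an intuition for the check behind this error, here is a minimal sketch of a subschema test over a tiny fragment of JSON Schema (`type`, `items`, `minItems`, `maxItems`, and `anyOf`). It is an illustration written for this notebook and not Lale's implementation: the function name `toy_is_subschema` and the `text_schema` example are hypothetical, and a real check has to handle far more of JSON Schema (see the [bibliography](#Bibliography)).

```python
def toy_is_subschema(sub, sup):
    """Return True if every value accepted by `sub` is also accepted by `sup`.

    Toy version: understands only `type`, `items`, `minItems`, `maxItems`,
    and `anyOf`, and ignores every other keyword (such as `description`).
    """
    # A union on the left must be covered alternative by alternative.
    if "anyOf" in sub:
        return all(toy_is_subschema(alt, sup) for alt in sub["anyOf"])
    # A union on the right is covered if a single alternative covers `sub`
    # (conservative: may answer False where a full check would answer True).
    if "anyOf" in sup:
        return any(toy_is_subschema(sub, alt) for alt in sup["anyOf"])
    if "type" not in sup:
        # The sketch assumes that constraints always come with a `type`.
        return True
    sub_type, sup_type = sub.get("type"), sup["type"]
    if sub_type != sup_type and (sub_type, sup_type) != ("integer", "number"):
        return False
    if sup_type == "array":
        # The length bounds of `sub` must lie within those of `sup`.
        if sub.get("minItems", 0) < sup.get("minItems", 0):
            return False
        if sub.get("maxItems", float("inf")) > sup.get("maxItems", float("inf")):
            return False
        return toy_is_subschema(sub.get("items", {}), sup.get("items", {}))
    return True


# The two schemas from the error message above (descriptions omitted).
actual_schema = {
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
expected_schema = {
    "anyOf": [
        {"type": "array", "items": {"type": "string"}},
        {
            "type": "array",
            "items": {
                "type": "array",
                "minItems": 1,
                "maxItems": 1,
                "items": {"type": "string"},
            },
        },
    ],
}
# Hypothetical schema of a plain list of strings, for comparison.
text_schema = {"type": "array", "items": {"type": "string"}}

print(toy_is_subschema(actual_schema, expected_schema))  # False
print(toy_is_subschema(text_schema, expected_schema))    # True
```

The numeric output of `PCA` matches neither alternative of the expected schema, so the check fails, whereas a list of strings matches the first alternative.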

### Hyperparameters

Schemas are also used to describe the hyperparameters for an individual operator, as well as their allowed values.
These schemas are typically more complex than input/output schemas, and are heavily used in Lale.
The schemas can also include data dependencies: a schema can refer to the actual shape of the data at runtime.

These schemas are used to validate that an operator is initialized with proper values.
When an operator is configured, the specified parameter values are checked against the schema.
Note that Lale hyperparameter schemas typically include side-constraints, so validation can check that the entire parameter set is valid, not just the individual values.

Additionally, if the configuration is invalid, we can use the schema to automatically generate suggested fixes: alternative, similar, configurations that are valid.

For example:

```python
from lale.lib.sklearn import RandomForestClassifier
rfc = RandomForestClassifier(bootstrap=False, oob_score=True)
```
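
The configuration above combines `bootstrap=False` with `oob_score=True`, although in scikit-learn out-of-bag scoring is only available when bootstrapping is enabled. The following sketch shows how a side-constraint of this kind can be written as a schema, and how valid alternatives can be found by changing one value at a time. Everything in it is an assumption made for illustration: `toy_schema`, `is_valid`, and `suggest_fixes` are hypothetical names, the validator covers only a few JSON Schema keywords, and Lale's own schemas and suggestion mechanism may work differently.

```python
def is_valid(value, schema):
    """Validate against a tiny JSON Schema subset:
    `enum`, `allOf`, `anyOf`, and `type: object` with `properties`."""
    if "enum" in schema and value not in schema["enum"]:
        return False
    if "allOf" in schema and not all(is_valid(value, s) for s in schema["allOf"]):
        return False
    if "anyOf" in schema and not any(is_valid(value, s) for s in schema["anyOf"]):
        return False
    if schema.get("type") == "object":
        if not isinstance(value, dict):
            return False
        for name, sub in schema.get("properties", {}).items():
            if name in value and not is_valid(value[name], sub):
                return False
    return True


# Hypothetical schema: per-parameter domains plus one side-constraint.
toy_schema = {
    "allOf": [
        {
            "type": "object",
            "properties": {
                "bootstrap": {"enum": [True, False]},
                "oob_score": {"enum": [True, False]},
            },
        },
        {
            "description": "Out-of-bag scoring requires bootstrap=True.",
            "anyOf": [
                {"type": "object", "properties": {"bootstrap": {"enum": [True]}}},
                {"type": "object", "properties": {"oob_score": {"enum": [False]}}},
            ],
        },
    ]
}


def suggest_fixes(params, schema, domains):
    """Yield valid configurations that differ from `params` in exactly one value."""
    for name, choices in domains.items():
        for choice in choices:
            candidate = {**params, name: choice}
            if candidate != params and is_valid(candidate, schema):
                yield candidate


bad = {"bootstrap": False, "oob_score": True}
domains = {"bootstrap": [True, False], "oob_score": [True, False]}
print(is_valid(bad, toy_schema))  # False: each value is fine, the combination is not
fixes = list(suggest_fixes(bad, toy_schema, domains))
print(fixes)
# [{'bootstrap': True, 'oob_score': True}, {'bootstrap': False, 'oob_score': False}]
```

Each value is individually allowed, so only the side-constraint can reject the configuration; this is why validating the whole parameter set matters.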

## Automatically generating documentation

In addition to quickly finding problems, schemas are used to automatically generate documentation suitable for Sphinx (and for hosting, for example, on [Read The Docs](https://lale.readthedocs.io/)).

After an operator is created by calling `make_operator` (and/or `customize_schema`), the Lale convention is to call
`lale.docstrings.set_docstrings` on the variable that holds the resulting operator.

For example:
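```
import lale.docstrings
import lale.operators

# _AwesomeOpImpl and _combined_schemas_for_awesome_op are defined earlier in the module
AwesomeOp = lale.operators.make_operator(
    _AwesomeOpImpl, _combined_schemas_for_awesome_op)

lale.docstrings.set_docstrings(AwesomeOp)
```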

This function, when run by Sphinx during documentation generation, rebinds the name (`AwesomeOp` in this example) to a dynamically created class that carries the documentation for the `AwesomeOp` operator.
This class is created by a compilation process, where the schemas are analyzed, and methods are created as appropriate.  For example, a fake `__init__` method is created, with arguments corresponding to all of the parameters described in the hyperparameter schema, with appropriate descriptions/types.

Similarly, if a `fit` method is available, the fit input schema is translated into appropriate documentation for the dynamically created `fit` method, and so on for the other methods.

Note that these methods have stubs for their implementations; they exist simply to enable sphinx to generate appropriate documentation for them.  When not running under sphinx, the call to `lale.docstrings.set_docstrings` has no effect.
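
As a rough illustration of this idea, the sketch below compiles a hyperparameter schema into a numpydoc-style docstring and attaches it to a stub `__init__` on a dynamically created class. It is a simplified sketch under stated assumptions, not the code in `lale.docstrings`: the helper names (`describe_type`, `make_init_doc`, `make_doc_class`) and the `ToyPCA` schema are hypothetical.

```python
def describe_type(spec):
    """Render the allowed values of one hyperparameter schema as text."""
    if "enum" in spec:
        return " or ".join(repr(v) for v in spec["enum"])
    if "anyOf" in spec:
        return ", or ".join(describe_type(s) for s in spec["anyOf"])
    return spec.get("type", "any")


def make_init_doc(op_name, schema):
    """Build a numpydoc-style docstring from a hyperparameter schema."""
    lines = [f"Hyperparameters of {op_name}.", "", "Parameters", "----------"]
    for name, spec in schema["properties"].items():
        header = f"{name} : {describe_type(spec)}"
        if "default" in spec:
            header += f", default {spec['default']!r}"
        lines.append(header)
        lines.append(f"    {spec.get('description', '')}".rstrip())
    return "\n".join(lines)


def make_doc_class(op_name, schema):
    """Create a class whose stub `__init__` exists only to carry documentation."""
    def __init__(self, **hyperparams):
        pass  # stub: never meant to be called

    __init__.__doc__ = make_init_doc(op_name, schema)
    return type(op_name, (), {"__init__": __init__})


# Hypothetical schema, loosely modeled on a dimensionality-reduction operator.
toy_schema = {
    "type": "object",
    "properties": {
        "n_components": {
            "description": "Number of components to keep.",
            "anyOf": [{"type": "integer"}, {"enum": [None]}],
            "default": None,
        },
        "whiten": {
            "description": "Whether to rescale the components.",
            "type": "boolean",
            "default": False,
        },
    },
}

DocClass = make_doc_class("ToyPCA", toy_schema)
doc = DocClass.__init__.__doc__
print(doc)
```

Because the docstring is derived from the same schema that drives validation, the two cannot drift apart.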

As an example, the automatically generated documentation for [PCA](https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.pca.html) includes: 

![Read The Docs PCA snapshot](readthedocs_pca.png)

Note how the constructor arguments for the generated documented PCA class reflect the schema's hyperparameters, along with defaults if specified.  The description of the parameters is compiled from the schema.  In the example shown, `n_components` is seen to allow four different types of values, each of which is clearly explained.  It also cross-references to the multi-parameter constraints which mention `n_components`, and which are included further down in the documentation.

Similarly, fit and predict methods are documented, along with compiled information about the shape of their input and output, and any fit parameters (additional arguments that the operator supports passing to the `fit` method).
The `description` strings in the schemas are leveraged throughout to provide useful descriptions to the user.

Automatically generating information in this fashion ensures that the documentation and the code stay synchronized, and that we are validating exactly what is specified in the documentation.

## AutoAI: generating search spaces

The most sophisticated use Lale makes of schemas is to automatically generate search spaces from the hyperparameter schemas of the operators in a pipeline.  This enables `AutoAI`, automatically optimizing the (unconfigured) hyperparameters of the operators in a pipeline.  It also supports selecting from different choices encoded in the pipeline using the Lale choice combinator (`|`).

This is done via a complex compilation process.  Each operator schema is turned into a simplified representation, that is easier to communicate to search space optimizers.  Explicitly configured parameter values are taken into account.

This simplified form for each operator is combined into a bigger search space specification that encodes the shape of the pipeline (and the choices it may embed).

This representation is then compiled into the form required by the specified backend (such as Hyperopt, SMAC, or grid search).  For example, for scikit-learn's GridSearchCV, a list of grids is produced (with a discretized representation of the encoded search space).  After the specified hyperparameter optimizer is run, a trained pipeline with hyperparameters set (and algorithmic choices made) is returned.  More details are available in the papers listed in the [bibliography below](#Bibliography).
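
The steps above can be sketched for the grid-search case as follows. This is a toy model under stated assumptions, not Lale's compiler: the helper names (`discretize`, `operator_grid`, `choice_grids`, `expand`) and the two schemas are hypothetical, only `enum` and bounded `integer` hyperparameters are supported, and the `operator__parameter` naming merely mimics the convention used by scikit-learn parameter grids.

```python
from itertools import product


def discretize(spec, samples=3):
    """Turn one hyperparameter schema into a finite list of candidate values."""
    if "enum" in spec:
        return list(spec["enum"])
    if spec.get("type") == "integer":
        lo, hi = spec["minimum"], spec["maximum"]
        step = max(1, (hi - lo) // (samples - 1))
        return list(range(lo, hi + 1, step))[:samples]
    raise ValueError(f"unsupported schema in this sketch: {spec}")


def operator_grid(schema, configured=None):
    """Explicitly configured values are kept fixed; the rest are searched."""
    configured = configured or {}
    return {
        name: [configured[name]] if name in configured else discretize(spec)
        for name, spec in schema["properties"].items()
    }


def choice_grids(alternatives):
    """A choice between operators yields one grid per alternative."""
    return [
        {f"{op}__{name}": values
         for name, values in operator_grid(schema, cfg).items()}
        for op, (schema, cfg) in alternatives.items()
    ]


def expand(grids):
    """Enumerate every concrete configuration described by a list of grids."""
    for grid in grids:
        names = list(grid)
        for combo in product(*(grid[n] for n in names)):
            yield dict(zip(names, combo))


# Hypothetical hyperparameter schemas for two alternative classifiers.
tree_schema = {"properties": {
    "max_depth": {"type": "integer", "minimum": 1, "maximum": 9},
    "criterion": {"enum": ["gini", "entropy"]},
}}
knn_schema = {"properties": {
    "n_neighbors": {"type": "integer", "minimum": 1, "maximum": 9},
}}

# Models a choice between a tree with criterion fixed to "gini" and an unconfigured k-NN.
grids = choice_grids({
    "tree": (tree_schema, {"criterion": "gini"}),
    "knn": (knn_schema, {}),
})
print(grids)
# [{'tree__max_depth': [1, 5, 9], 'tree__criterion': ['gini']},
#  {'knn__n_neighbors': [1, 5, 9]}]
configs = list(expand(grids))
print(len(configs))  # 6
```

Keeping one grid per alternative avoids mixing hyperparameters of operators that never occur together in the same pipeline.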


## Bibliography

The following paper has more information about subschema entailment checks:
```bibtex
@inproceedings{10.1145/3460319.3464796,
author = {Habib, Andrew and Shinnar, Avraham and Hirzel, Martin and Pradel, Michael},
title = {Finding Data Compatibility Bugs with JSON Subschema Checking},
year = {2021},
isbn = {9781450384599},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3460319.3464796},
doi = {10.1145/3460319.3464796},
booktitle = {Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis},
pages = {620–632},
numpages = {13},
keywords = {data compatibility bugs, subschema checking, JSON schema},
location = {Virtual, Denmark},
series = {ISSTA 2021}
}
```

The following paper has more information about how Lale generates search spaces:

```bibtex
@InProceedings{baudart_et_al_2021,
  title = "Pipeline Combinators for Gradual {AutoML}",
  author = "Baudart, Guillaume and Hirzel, Martin and Kate, Kiran and Ram, Parikshit and Shinnar, Avraham and Tsay, Jason",
  booktitle = "Advances in Neural Information Processing Systems (NeurIPS)",
  year = 2021,
  url = "https://proceedings.neurips.cc/paper/2021/file/a3b36cb25e2e0b93b5f334ffb4e4064e-Paper.pdf" }
```