## Dataset error example in middle of pipeline

Start by loading the California Housing dataset, which is a
two-dimensional array of numbers.  One of the columns, longitude,
contains negative numbers.

In [1]:
import pandas as pd
import lale.datasets as ds
(train_X, train_y), (test_X, test_y) = ds.california_housing_df()
schema_X = ds.data_schemas.to_schema(train_X)
schema_y = ds.data_schemas.to_schema(train_y)
pd.concat([train_X.head(3), train_y.head(3)], axis=1)

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03,1.03
1,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16,3.821
2,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48,1.726


### Scikit-learn version

Train an RFE (recursive feature elimination) operator. This internally
trains its argument, a random forest of 10 trees. Training the forest
takes a few seconds. If we used more trees, training the forest would
take longer.

In [2]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor as Forest
trainable_rfe = RFE(estimator=Forest(n_estimators=10))

In [3]:
%%time
trained_rfe = trainable_rfe.fit(train_X, train_y)

CPU times: user 4.44 s, sys: 172 ms, total: 4.61 s
Wall time: 4.68 s


The resulting features still include the longitude with its
negative numbers.

In [4]:
support = trained_rfe.get_support()
columns = [col for i, col in enumerate(train_X.columns) if support[i]]
train_X2 = pd.DataFrame(data=trained_rfe.transform(train_X), columns=columns)
pd.concat([train_X2.head(3), train_y.head(3)], axis=1)

Unnamed: 0,MedInc,AveOccup,Latitude,Longitude,target
0,3.2596,3.691814,32.71,-117.03,1.03
1,3.8125,1.738095,33.77,-118.16,3.821
2,4.1563,2.723214,34.66,-120.48,1.726


Compose a pipeline, `trainable`, with two steps, RFE and NMF, where the
output of RFE is piped into the input of NMF. NMF stands for non-negative
matrix factorization and requires a non-negative matrix as its input.
When we try to fit the pipeline, scikit-learn first spends some time
fitting the upstream RFE, which here wraps a forest of 100 trees. After
that, it attempts to fit the NMF on the output from RFE. That output
contains negative numbers, so NMF throws an exception.

In [5]:
import sys
import sklearn.pipeline
from sklearn.decomposition import NMF

In [6]:
trainable = sklearn.pipeline.make_pipeline(
    RFE(estimator=Forest(n_estimators=100)), NMF())

In [7]:
%%time
try:
    trainable.fit(train_X, train_y)
except ValueError as e:
    message = str(e)
print(message, file=sys.stderr)

CPU times: user 42.5 s, sys: 969 ms, total: 43.4 s
Wall time: 43.9 s


Negative values in data passed to NMF (input X)


In [8]:
assert message == 'Negative values in data passed to NMF (input X)'
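
The exception above only surfaces after the expensive RFE step has been trained. Since RFE merely selects a subset of the input columns, any input column with negative values can potentially reach NMF. The following is a minimal sketch of a cheap pre-flight check, assuming the data is available as a plain dict of column lists. The helper name `negative_columns` is hypothetical and is not part of scikit-learn or Lale.

```python
def negative_columns(columns):
    """Return the names of columns that contain at least one negative value."""
    return [name for name, values in columns.items()
            if any(v < 0 for v in values)]

# A few columns from the first three rows of train_X shown earlier.
sample = {
    'MedInc': [3.2596, 3.8125, 4.1563],
    'AveOccup': [3.691814, 1.738095, 2.723214],
    'Latitude': [32.71, 33.77, 34.66],
    'Longitude': [-117.03, -118.16, -120.48],
}
offending = negative_columns(sample)
print(offending)
```

A check like this runs in time linear in the data size, which is negligible compared to training the forest.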

### Lale version

Lale supports JSON Schema validation of pipelines. We compose the same
trainable pipeline of RFE and NMF as before. But rather than trying to
fit it, we call `validate_schema`. This checks the schemas at each step
of the pipeline and detects that the output from RFE is not a
subschema of the input to NMF.

In [9]:
import lale.operators
import lale.helpers
from lale.lib.sklearn import RFE, NMF

In [10]:
trainable = lale.operators.make_pipeline(
    RFE(estimator=Forest(n_estimators=100)), NMF())

In [11]:
%%time
try:
    trainable.validate_schema(schema_X, schema_y)
except lale.helpers.SubschemaError as e:
    message = str(e)
print(message, file=sys.stderr)

CPU times: user 62.5 ms, sys: 0 ns, total: 62.5 ms
Wall time: 71.7 ms


Expected to_schema(data) to be a subschema of NMF.input_schema_fit().
to_schema(data) = {
    'type': 'object',
    'additionalProperties': false,
    'required': ['X', 'y'],
    'properties': {
        'X': {
            '$schema': 'http://json-schema.org/draft-04/schema#',
            'type': 'array',
            'items': {
                'type': 'array',
                'items': {
                    'type': 'number'}}},
        'y': {
            '$schema': 'http://json-schema.org/draft-04/schema#',
            'type': 'array',
            'minItems': 16512,
            'maxItems': 16512,
            'items': {
                'description': 'target',
                'type': 'number'}}},
}
NMF.input_schema_fit() = {
    '$schema': 'http://json-schema.org/draft-04/schema#',
    'type': 'object',
    'required': ['X'],
    'additionalProperties': false,
    'properties': {
        'X': {
            'type': 'array',
            'items': {
                'type': 'array',
           

In [12]:
assert message.find('be a subschema of NMF.input_schema_fit()') != -1
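
To illustrate what a subschema check means, here is a toy sketch for numeric schemas only. It assumes a schema is a dict with optional inclusive `'minimum'` and `'maximum'` keys. The function `is_number_subschema` is hypothetical and far simpler than whatever Lale does internally. The `'minimum': 0.0` on the NMF side is also an assumption, since the printed schema above is cut off.

```python
import math

def is_number_subschema(sub, sup):
    """Toy check: every number accepted by `sub` is also accepted by `sup`.

    Only handles {'type': 'number'} schemas with optional inclusive
    'minimum' and 'maximum' bounds.
    """
    if sub.get('type') != 'number' or sup.get('type') != 'number':
        raise ValueError('this sketch only handles number schemas')
    sub_lo = sub.get('minimum', -math.inf)
    sub_hi = sub.get('maximum', math.inf)
    sup_lo = sup.get('minimum', -math.inf)
    sup_hi = sup.get('maximum', math.inf)
    # The range of `sub` must lie entirely within the range of `sup`.
    return sup_lo <= sub_lo and sub_hi <= sup_hi

rfe_output_items = {'type': 'number'}                  # may be negative
nmf_input_items = {'type': 'number', 'minimum': 0.0}   # assumed: non-negative only

forward = is_number_subschema(rfe_output_items, nmf_input_items)
backward = is_number_subschema(nmf_input_items, rfe_output_items)
print(forward, backward)  # False True
```

The check needs only the schemas, not the data values, which is why it can run without training anything.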

## Hyperparameter error example in middle of pipeline

First, we load the drugs.com dataset.

In [13]:
from lale.datasets.uci.uci_datasets import fetch_drugscom
from sklearn.model_selection import train_test_split
import sys
import warnings
warnings.filterwarnings("ignore")
train_X, train_y, test_X, test_y = fetch_drugscom()
print(f'shapes: train_X {train_X.shape}, train_y {train_y.shape}')

shapes: train_X (161297, 5), train_y (161297,)


### Scikit-learn version

This example gives a baseline: it uses only scikit-learn, not Lale.
First, import a few things from scikit-learn.

In [14]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf
from sklearn.linear_model import LogisticRegression as LR

Second, instantiate a trainable pipeline that applies Tfidf on the
`'review'` column of the input data, followed by LogisticRegression.
Since there is no training happening, this is very fast. However,
there is a mistake in this code: LR does not support `solver='adam'`.
Unfortunately, scikit-learn does not report the error at this point.

In [15]:
%%time
trainable = make_pipeline(
    ColumnTransformer([
        ('txt', Tfidf(max_features=1000), 'review')]),
    LR(solver='adam'))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 310 µs


Third, try to train that pipeline. The Tfidf gets trained first,
because training LR requires the data as transformed by the trained
Tfidf. In this example, training Tfidf is slow. And because of the
mistake with `solver='adam'` in the previous cell, training LR fails.

In [16]:
%%time
try:
    trainable.fit(train_X, train_y)
except ValueError as e:
    message = str(e)
print(message, file=sys.stderr)
assert message.startswith('Logistic Regression supports only solvers in')

CPU times: user 13.1 s, sys: 1.53 s, total: 14.6 s
Wall time: 15.2 s


Logistic Regression supports only solvers in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'], got adam.


### Lale version

First, import the Project operator from Lale, and wrap the imported
scikit-learn operators (LR and Tfidf) to augment them with JSON
schemas.

In [17]:
from lale.lib.lale import Project
import lale.helpers
lale.helpers.wrap_imported_operators()

Second, try to instantiate a pipeline as before. In particular, the
code has the same mistake as before, passing `solver='adam'` to
LR. But unlike in pure scikit-learn, here, the mistake gets caught
earlier, using JSON Schema validation when the operators are
instantiated. For this example, that saves a lot of time, since
there is no need to train Tfidf to catch the error.

In [18]:
%%time
from jsonschema import ValidationError
try:
    trainable = (Project(columns=['review'])
              >> Tfidf(max_features=1000)
              >> LR(solver='adam'))
except ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.startswith("Invalid configuration for LR(solver='adam')")

CPU times: user 31.2 ms, sys: 15.6 ms, total: 46.9 ms
Wall time: 54.7 ms


Invalid configuration for LR(solver='adam') due to invalid value solver=adam.
Schema of argument solver: {
    'description': 'Algorithm for optimization problem.',
    'enum': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'default': 'liblinear',
}
Value: adam
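
The difference between the two versions is when validation happens: at construction time rather than at `fit` time. The following is a minimal sketch of that idea in plain Python. The class `EagerLR` is a hypothetical stand-in, not Lale's implementation, and it checks a single enum by hand instead of using JSON Schema.

```python
class EagerLR:
    """Hypothetical stand-in that validates hyperparameters on construction."""

    VALID_SOLVERS = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

    def __init__(self, solver='liblinear'):
        # Fail immediately, before any upstream operator has been trained.
        if solver not in self.VALID_SOLVERS:
            raise ValueError(
                f'invalid value solver={solver}, '
                f'expected one of {self.VALID_SOLVERS}')
        self.solver = solver

try:
    EagerLR(solver='adam')
    error = None
except ValueError as e:
    error = str(e)
print(error)
```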


## Interactive documentation

Data scientists can interactively query the JSON Schemas of individual
operators. For example, they can find out all the hyperparameters
along with their defaults.

In [19]:
LR.hyperparam_defaults()

{'solver': 'liblinear',
 'penalty': 'l2',
 'dual': False,
 'C': 1.0,
 'tol': 0.0001,
 'fit_intercept': True,
 'intercept_scaling': 1.0,
 'class_weight': None,
 'random_state': None,
 'max_iter': 100,
 'multi_class': 'ovr',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None}

Here is an example of a categorical hyperparameter, which JSON Schema
represents using an enum.

In [20]:
LR.hyperparam_schema('solver')

{'description': 'Algorithm for optimization problem.',
 'enum': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
 'default': 'liblinear'}

Here is an example of a continuous hyperparameter, which JSON Schema
represents as a number. The `'minimum'` specifies what is valid (will
not raise an exception), whereas the `'minimumForOptimizer'` and
`'maximumForOptimizer'` specify what is relevant (a range that makes
sense for automated search tools).

In [21]:
LR.hyperparam_schema('C')

{'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
 'type': 'number',
 'distribution': 'loguniform',
 'minimum': 0.0,
 'exclusiveMinimum': True,
 'default': 1.0,
 'minimumForOptimizer': 0.03125,
 'maximumForOptimizer': 32768}
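
As a rough illustration of how a search tool might use these fields, the sketch below draws log-uniform samples from the optimizer range. This is one plausible interpretation of `'distribution': 'loguniform'`, not a description of how any particular optimizer consumes Lale schemas. The function `sample_loguniform` is hypothetical.

```python
import math
import random

schema_C = {
    'type': 'number',
    'distribution': 'loguniform',
    'minimum': 0.0,
    'exclusiveMinimum': True,
    'minimumForOptimizer': 0.03125,
    'maximumForOptimizer': 32768,
}

def sample_loguniform(schema, rng):
    """Sample so that log(value) is uniform over the optimizer range."""
    lo = math.log(schema['minimumForOptimizer'])
    hi = math.log(schema['maximumForOptimizer'])
    return math.exp(rng.uniform(lo, hi))

rng = random.Random(0)  # fixed seed for reproducibility
samples = [sample_loguniform(schema_C, rng) for _ in range(1000)]
print(min(samples), max(samples))
```

Every sample lies inside the optimizer range, which in turn lies inside the valid range, so none of them would fail validation.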

## More hyperparameter error examples

Since the schema of the `C` hyperparameter of `LR` specifies an
exclusive minimum of zero, passing zero is not valid. Lale internally
calls an off-the-shelf JSON Schema validator when an operator gets
configured with concrete hyperparameter values.

In [22]:
try:
    LR(C=0.0)
except ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.startswith('Invalid configuration for LR(C=0.0)')

Invalid configuration for LR(C=0.0) due to invalid value C=0.0.
Schema of argument C: {
    'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
    'type': 'number',
    'distribution': 'loguniform',
    'minimum': 0.0,
    'exclusiveMinimum': true,
    'default': 1.0,
    'minimumForOptimizer': 0.03125,
    'maximumForOptimizer': 32768,
}
Value: 0.0
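
The schemas shown here follow JSON Schema draft-04, where `'exclusiveMinimum'` is a boolean that modifies `'minimum'` rather than a number of its own. The following is a minimal sketch of that rule in plain Python. The helper `check_minimum` is hypothetical, and a real validator handles many more keywords.

```python
def check_minimum(value, schema):
    """Draft-04 semantics: 'exclusiveMinimum' is a boolean modifier of 'minimum'."""
    if 'minimum' not in schema:
        return True
    if schema.get('exclusiveMinimum', False):
        return value > schema['minimum']
    return value >= schema['minimum']

schema_C = {'type': 'number', 'minimum': 0.0, 'exclusiveMinimum': True}
zero_ok = check_minimum(0.0, schema_C)
tiny_ok = check_minimum(1e-9, schema_C)
print(zero_ok, tiny_ok)  # False True
```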


Besides per-hyperparameter types, there are also conditional
inter-hyperparameter constraints. These are checked using the
same call to an off-the-shelf JSON Schema validator.

In [23]:
try:
    LR(LR.solver.sag, LR.penalty.l1)
except ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.find('support only l2 penalties') != -1

Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 penalties.
Schema of constraint 1: {
    'description': 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.',
    'anyOf': [{
        'type': 'object',
        'properties': {
            'solver': {
                'not': {
                    'enum': ['newton-cg', 'sag', 'lbfgs']}}}}, {
        'type': 'object',
        'properties': {
            'penalty': {
                'enum': ['l2']}}}],
}
Value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}


There are even constraints that affect three different hyperparameters.

In [24]:
try:
    LR(LR.penalty.l2, LR.solver.sag, dual=True)
except ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.find('dual formulation is only implemented for') != -1

Invalid configuration for LR(penalty='l2', solver='sag', dual=True) due to constraint the dual formulation is only implemented for l2 penalty with the liblinear solver.
Schema of constraint 2: {
    'description': 'The dual formulation is only implemented for l2 penalty with the liblinear solver.',
    'anyOf': [{
        'type': 'object',
        'properties': {
            'dual': {
                'enum': [false]}}}, {
        'type': 'object',
        'properties': {
            'penalty': {
                'enum': ['l2']},
            'solver': {
                'enum': ['liblinear']}}}],
}
Value: {'penalty': 'l2', 'solver': 'sag', 'dual': True, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}
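
The constraints printed above encode an implication as a disjunction: "if `dual` is true, then the penalty is l2 and the solver is liblinear" becomes "`dual` is false, or (penalty is l2 and solver is liblinear)". The sketch below evaluates such constraints with a tiny hand-written evaluator. It assumes only the keywords `enum`, `not`, `anyOf`, and `properties` matter and ignores `type`. The function `satisfies` is hypothetical and is no substitute for a real JSON Schema validator.

```python
def satisfies(value, schema):
    """Evaluate a tiny subset of JSON Schema: enum, not, anyOf, properties."""
    if 'enum' in schema and value not in schema['enum']:
        return False
    if 'not' in schema and satisfies(value, schema['not']):
        return False
    if 'anyOf' in schema and not any(
            satisfies(value, option) for option in schema['anyOf']):
        return False
    if isinstance(value, dict):
        # Properties absent from the value are not constrained.
        for name, sub in schema.get('properties', {}).items():
            if name in value and not satisfies(value[name], sub):
                return False
    return True

l2_only = {'anyOf': [
    {'properties': {'solver': {'not': {'enum': ['newton-cg', 'sag', 'lbfgs']}}}},
    {'properties': {'penalty': {'enum': ['l2']}}}]}
dual_only = {'anyOf': [
    {'properties': {'dual': {'enum': [False]}}},
    {'properties': {'penalty': {'enum': ['l2']},
                    'solver': {'enum': ['liblinear']}}}]}

print(satisfies({'solver': 'sag', 'penalty': 'l1'}, l2_only))                 # False
print(satisfies({'solver': 'sag', 'penalty': 'l2', 'dual': True}, dual_only)) # False
print(satisfies({'solver': 'liblinear', 'penalty': 'l2', 'dual': True}, dual_only))  # True
```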


## Dataset error example for individual operator

Lale uses JSON Schema validation not only for hyperparameters but also
for data. The dataset `train_X` is multimodal: some columns contain
text strings whereas others contain numbers.

In [25]:
ds.data_schemas.to_schema(train_X)

{'$schema': 'http://json-schema.org/draft-04/schema#',
 'type': 'array',
 'items': {'type': 'array',
  'minItems': 5,
  'maxItems': 5,
  'items': [{'description': 'drugName', 'type': 'string'},
   {'description': 'condition',
    'anyOf': [{'type': 'string'}, {'enum': [nan]}]},
   {'description': 'review', 'type': 'string'},
   {'description': 'date', 'type': 'string'},
   {'description': 'usefulCount', 'type': 'integer', 'minimum': 0}]},
 'minItems': 161297,
 'maxItems': 161297}

Since `train_X` contains strings but `LR` expects only numbers, the
call to `fit` reports a type error.

In [26]:
trainable_lr = LR()
try:
    trainable_lr.fit(train_X, train_y)
except ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.startswith('Failed validating input_schema_fit for LR')

Failed validating input_schema_fit for LR due to 'Valsartan' is not of type 'number'

Failed validating 'type' in schema['properties']['X']['items']['items']:
    {'type': 'number'}

On instance['X'][0][0]:
    'Valsartan'
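
The error message pinpoints the offending cell by its path, `instance['X'][0][0]`. A minimal sketch of how such a path can be found, assuming the data is a plain list of rows: the helper `first_non_number` is hypothetical, and apart from `'Valsartan'` the sample row is made up for illustration.

```python
def first_non_number(X):
    """Return (row, col, value) of the first non-numeric cell, or None."""
    for i, row in enumerate(X):
        for j, cell in enumerate(row):
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(cell, bool) or not isinstance(cell, (int, float)):
                return i, j, cell
    return None

# Made-up row except for 'Valsartan', which appears in the error above.
mixed = [['Valsartan', 'some review text', 27]]
numeric = [[5.0, 3.4, 1.6, 0.4], [6.3, 3.3, 4.7, 1.6]]

print(first_non_number(mixed))    # (0, 0, 'Valsartan')
print(first_non_number(numeric))  # None
```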


Load a purely numerical dataset instead.

In [27]:
from lale.datasets import load_iris_df
(train_X, train_y), (test_X, test_y) = load_iris_df()
train_X.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.0,3.4,1.6,0.4
1,6.3,3.3,4.7,1.6
2,5.1,3.4,1.5,0.2
3,4.8,3.0,1.4,0.1
4,6.7,3.1,4.7,1.5


Training LR with the Iris dataset works fine.

In [28]:
trained_lr = trainable_lr.fit(train_X, train_y)

## Lifecycle error example

Lale encourages separating the lifecycle states, here represented
by `trainable_lr` vs. `trained_lr`. The `predict` method should
only be called on a trained model.

In [29]:
predicted = trained_lr.predict(test_X)
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]


On the other hand, the `predict` method should not be called on a trainable model.

In [30]:
import warnings
warnings.filterwarnings("error", category=DeprecationWarning)
try:
    predicted = trainable_lr.predict(test_X)
except DeprecationWarning as w:
    message = str(w)
print(message, file=sys.stderr)
assert message.startswith('The `predict` method is deprecated on a trainable')
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

The `predict` method is deprecated on a trainable operator, because the learned coefficients could be accidentally overwritten by retraining. Call `predict` on the trained operator returned by `fit` instead.

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
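
The separation of lifecycle states can be sketched with two plain Python classes, where `fit` returns a new trained object instead of mutating the trainable one. The classes `TrainableMean` and `TrainedMean` are hypothetical and only illustrate the pattern. Unlike Lale, which still allows `predict` on a trainable operator with a deprecation warning, this sketch omits the method entirely.

```python
class TrainedMean:
    """Trained state: holds learned parameters and can predict."""

    def __init__(self, mean):
        self._mean = mean

    def predict(self, X):
        return [self._mean for _ in X]


class TrainableMean:
    """Trainable state: can be fit, but has no predict method."""

    def fit(self, X, y):
        # Return a fresh trained object; the trainable one stays unchanged.
        return TrainedMean(sum(y) / len(y))


trainable = TrainableMean()
trained = trainable.fit([[1], [2], [3]], [1.0, 2.0, 3.0])
predictions = trained.predict([[4], [5]])
print(predictions)                    # [2.0, 2.0]
print(hasattr(trainable, 'predict'))  # False
```

Because retraining produces a new object, the coefficients held by an existing trained model cannot be overwritten by accident.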



