In [1]:
from lale.datasets.uci.uci_datasets import fetch_drugscom
from sklearn.model_selection import train_test_split
import sys
import warnings
warnings.filterwarnings("ignore")
train_X, train_y, test_X, test_y = fetch_drugscom()
print(f'shapes: train_X {train_X.shape}, train_y {train_y.shape}')

shapes: train_X (161297, 5), train_y (161297,)


### Scikit-learn error example

This example gives a baseline: it uses only scikit-learn, not Lale.
First, import a few things from scikit-learn.

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf
from sklearn.linear_model import LogisticRegression as LR

Second, instantiate a trainable pipeline that applies Tfidf on the
`'review'` column of the input data, followed by LogisticRegression.
Since there is no training happening, this is very fast. However,
there is a mistake in this code: LR does not support `solver='adam'`.
Unfortunately, scikit-learn does not report the error at this point.

In [3]:
%%time
trainable = make_pipeline(
    ColumnTransformer([
        ('txt', Tfidf(max_features=1000), 'review')]),
    LR(solver='adam'))

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 259 µs


Third, try to train that pipeline. The Tfidf gets trained first,
because training LR requires the data as transformed by the trained
Tfidf. In this example, training Tfidf is slow. And because of the
mistake with `solver='adam'` in the previous cell, training LR fails.

In [4]:
%%time
try:
    trainable.fit(train_X, train_y)
except ValueError as e:
    print(e, file=sys.stderr)

CPU times: user 12.2 s, sys: 1.2 s, total: 13.4 s
Wall time: 13.4 s


Logistic Regression supports only solvers in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'], got adam.


### The same example in Lale

First, import the Project operator from Lale, and wrap the imported
scikit-learn operators (LR and Tfidf) to augment them with JSON
schemas.

In [5]:
from lale.lib.lale import Project
import lale.helpers
lale.helpers.wrap_imported_operators()

Second, try to instantiate a pipeline as before. In particular, the
code has the same mistake as before, passing `solver='adam'` to
LR. But unlike in pure scikit-learn, here, the mistake gets caught
earlier, using JSON Schema validation when the operators are
instantiated. For this example, that saves a lot of time, since
there is no need to train Tfidf to catch the error.

In [6]:
%%time
from jsonschema import ValidationError
try:
    trainable = (Project(columns=['review'])
              >> Tfidf(max_features=1000)
              >> LR(solver='adam'))
except ValidationError as e:
    print(e.message, file=sys.stderr)

CPU times: user 46.9 ms, sys: 0 ns, total: 46.9 ms
Wall time: 41.2 ms


Invalid configuration for LR(solver='adam') due to invalid value solver=adam.
Schema of argument solver: {
    'description': 'Algorithm for optimization problem.',
    'enum': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'default': 'liblinear',
}
Value: adam
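
The difference between the two behaviors comes down to when the hyperparameters are checked. The following is a minimal standard-library sketch of that idea, not Lale's or scikit-learn's actual implementation: the names `SOLVER_SCHEMA`, `check_enum`, `LazyEstimator`, and `EagerEstimator` are hypothetical, and only the enum check is modeled.

```python
# Hypothetical schema, modeled on the `solver` schema printed above.
SOLVER_SCHEMA = {
    "enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "default": "liblinear",
}


def check_enum(name, value, schema):
    """Raise ValueError if value is not one of the schema's enum members."""
    if value not in schema["enum"]:
        raise ValueError(
            f"invalid value {name}={value}, expected one of {schema['enum']}"
        )


class LazyEstimator:
    """Stores hyperparameters unchecked and validates them only inside fit."""

    def __init__(self, solver="liblinear"):
        self.solver = solver

    def fit(self, X, y):
        check_enum("solver", self.solver, SOLVER_SCHEMA)  # error surfaces late
        return self


class EagerEstimator:
    """Validates hyperparameters as soon as the object is constructed."""

    def __init__(self, solver="liblinear"):
        check_enum("solver", solver, SOLVER_SCHEMA)  # error surfaces early
        self.solver = solver


# Construction succeeds even though the solver name is wrong ...
lazy = LazyEstimator(solver="adam")
try:
    lazy.fit([[0.0]], [0])
    lazy_error = None
except ValueError as e:
    lazy_error = str(e)  # ... and the mistake is only reported by fit.

# Here the same mistake is reported before any training could start.
try:
    EagerEstimator(solver="adam")
    eager_error = None
except ValueError as e:
    eager_error = str(e)

print("lazy: ", lazy_error)
print("eager:", eager_error)
```

In a pipeline, the lazy variant only reaches its check after every earlier step has been trained, which is where the time goes in the scikit-learn example.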


### Interactive documentation

Data scientists can interactively query the JSON Schemas of individual
operators. For example, they can find out all the hyperparameters
along with their defaults.

In [7]:
LR.hyperparam_defaults()

{'solver': 'liblinear',
 'penalty': 'l2',
 'dual': False,
 'C': 1.0,
 'tol': 0.0001,
 'fit_intercept': True,
 'intercept_scaling': 1.0,
 'class_weight': None,
 'random_state': None,
 'max_iter': 100,
 'multi_class': 'ovr',
 'verbose': 0,
 'warm_start': False,
 'n_jobs': None}
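
One way to picture where such a dictionary can come from is to read the `default` entry of each property in a hyperparameter schema. The sketch below assumes a hand-written, abbreviated schema (`HYPERPARAM_SCHEMA`) and a hypothetical helper `defaults_from_schema`; it is an illustration, not a description of how `hyperparam_defaults` is implemented.

```python
# Abbreviated, hand-written schema for illustration only.
HYPERPARAM_SCHEMA = {
    "type": "object",
    "properties": {
        "solver": {
            "enum": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
            "default": "liblinear",
        },
        "C": {
            "type": "number",
            "minimum": 0.0,
            "exclusiveMinimum": True,
            "default": 1.0,
        },
        "max_iter": {"type": "integer", "default": 100},
    },
}


def defaults_from_schema(schema):
    """Collect the 'default' of every property that declares one."""
    return {
        name: prop["default"]
        for name, prop in schema["properties"].items()
        if "default" in prop
    }


defaults = defaults_from_schema(HYPERPARAM_SCHEMA)
print(defaults)
```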

Here is an example of a categorical hyperparameter, which JSON Schema
represents using an enum.

In [8]:
LR.hyperparam_schema('solver')

{'description': 'Algorithm for optimization problem.',
 'enum': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
 'default': 'liblinear'}

Here is an example of a continuous hyperparameter, which JSON Schema
represents as a number. The `'minimum'` specifies what is valid (will
not raise an exception), whereas the `'minimumForOptimizer'` and
`'maximumForOptimizer'` specify what is relevant (a range that makes
sense for automated search tools).

In [9]:
LR.hyperparam_schema('C')

{'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
 'type': 'number',
 'distribution': 'loguniform',
 'minimum': 0.0,
 'exclusiveMinimum': True,
 'default': 1.0,
 'minimumForOptimizer': 0.03125,
 'maximumForOptimizer': 32768}
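
The distinction between valid and relevant values can be sketched with the standard library alone. The code below copies the schema printed above into a plain dict; the helpers `is_valid` and `sample_for_optimizer` are hypothetical, and the sampling strategy (log-uniform over the optimizer range) is an assumption about how a search tool might use the `'distribution'` entry, not a description of any particular optimizer.

```python
import math
import random

# The schema of `C` as printed above, written as a plain Python dict.
C_SCHEMA = {
    "type": "number",
    "distribution": "loguniform",
    "minimum": 0.0,
    "exclusiveMinimum": True,
    "default": 1.0,
    "minimumForOptimizer": 0.03125,
    "maximumForOptimizer": 32768,
}


def is_valid(value, schema):
    """Validity: the hard bound that a validator would enforce."""
    if schema.get("exclusiveMinimum"):
        return value > schema["minimum"]
    return value >= schema["minimum"]


def sample_for_optimizer(schema, rng):
    """Relevance: draw log-uniformly from the narrower search range."""
    lo = schema["minimumForOptimizer"]
    hi = schema["maximumForOptimizer"]
    value = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    # Clamp to guard against floating-point rounding at the boundaries.
    return min(max(value, lo), hi)


rng = random.Random(0)
samples = [sample_for_optimizer(C_SCHEMA, rng) for _ in range(5)]

print(is_valid(0.0, C_SCHEMA))   # zero is excluded by the exclusive minimum
print(is_valid(1e-9, C_SCHEMA))  # valid, although outside the search range
print(samples)
```

A value such as `1e-9` passes validation but would never be proposed by the sampler, which is the sense in which the optimizer bounds describe relevance rather than validity.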

### More hyperparameter error examples

Since the schema of the `C` hyperparameter of `LR` specifies an
exclusive minimum of zero, passing zero is not valid. Lale internally
calls an off-the-shelf JSON Schema validator when an operator gets
configured with concrete hyperparameter values.

In [10]:
try:
    LR(C=0.0)
except ValidationError as e:
    print(e.message, file=sys.stderr)

Invalid configuration for LR(C=0.0) due to invalid value C=0.0.
Schema of argument C: {
    'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',
    'type': 'number',
    'distribution': 'loguniform',
    'minimum': 0.0,
    'exclusiveMinimum': true,
    'default': 1.0,
    'minimumForOptimizer': 0.03125,
    'maximumForOptimizer': 32768,
}
Value: 0.0


Besides per-hyperparameter types, there are also conditional
inter-hyperparameter constraints. These are checked using the
same call to an off-the-shelf JSON Schema validator.

In [11]:
try:
    LR(LR.solver.sag, LR.penalty.l1)
except ValidationError as e:
    print(e.message, file=sys.stderr)

Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 penalties.
Schema of constraint 1: {
    'description': 'The newton-cg, sag, and lbfgs solvers support only l2 penalties.',
    'anyOf': [{
        'type': 'object',
        'properties': {
            'solver': {
                'not': {
                    'enum': ['newton-cg', 'sag', 'lbfgs']}}}}, {
        'type': 'object',
        'properties': {
            'penalty': {
                'enum': ['l2']}}}],
}
Value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}
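
To see why this configuration fails, it can help to evaluate the constraint by hand. The sketch below implements a tiny, hypothetical subset of JSON Schema (`type: object`, `enum`, `not`, `anyOf`, and `properties` only) in a helper named `matches`; a real validator covers far more of the specification, and this is not the code Lale runs.

```python
def matches(schema, instance):
    """Check an instance against a tiny subset of JSON Schema.

    Supported keywords: 'type' (only 'object'), 'enum', 'not', 'anyOf',
    and 'properties'. Everything else is ignored.
    """
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if "not" in schema and matches(schema["not"], instance):
        return False
    if "anyOf" in schema and not any(
        matches(sub, instance) for sub in schema["anyOf"]
    ):
        return False
    if isinstance(instance, dict):
        for key, sub in schema.get("properties", {}).items():
            # As in JSON Schema, absent properties are not checked.
            if key in instance and not matches(sub, instance[key]):
                return False
    return True


# The constraint printed above: either the solver is not one of the
# l2-only solvers, or the penalty is l2.
L2_ONLY_CONSTRAINT = {
    "anyOf": [
        {
            "type": "object",
            "properties": {
                "solver": {"not": {"enum": ["newton-cg", "sag", "lbfgs"]}}
            },
        },
        {"type": "object", "properties": {"penalty": {"enum": ["l2"]}}},
    ],
}

sag_l1 = matches(L2_ONLY_CONSTRAINT, {"solver": "sag", "penalty": "l1"})
sag_l2 = matches(L2_ONLY_CONSTRAINT, {"solver": "sag", "penalty": "l2"})
liblinear_l1 = matches(
    L2_ONLY_CONSTRAINT, {"solver": "liblinear", "penalty": "l1"}
)
print(sag_l1, sag_l2, liblinear_l1)
```

The `anyOf` encodes an implication: a configuration is rejected only when it fails both branches, that is, when it uses one of the listed solvers and a penalty other than l2.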


There are even constraints that affect three different hyperparameters.

In [12]:
try:
    LR(LR.penalty.l2, LR.solver.sag, dual=True)
except ValidationError as e:
    print(e.message, file=sys.stderr)

Invalid configuration for LR(penalty='l2', solver='sag', dual=True) due to constraint the dual formulation is only implemented for l2 penalty with the liblinear solver.
Schema of constraint 2: {
    'description': 'The dual formulation is only implemented for l2 penalty with the liblinear solver.',
    'anyOf': [{
        'type': 'object',
        'properties': {
            'dual': {
                'enum': [false]}}}, {
        'type': 'object',
        'properties': {
            'penalty': {
                'enum': ['l2']},
            'solver': {
                'enum': ['liblinear']}}}],
}
Value: {'penalty': 'l2', 'solver': 'sag', 'dual': True, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'ovr', 'verbose': 0, 'warm_start': False, 'n_jobs': None}


### Dataset schema error examples

Lale uses JSON Schema validation not only for hyperparameters but also
for data. The dataset `train_X` is multimodal: some columns contain
text strings whereas others contain numbers.

In [13]:
from lale.datasets import data_schemas
data_schemas.to_schema(train_X)

{'$schema': 'http://json-schema.org/draft-04/schema#',
 'type': 'array',
 'items': {'type': 'array',
  'minItems': 5,
  'maxItems': 5,
  'items': [{'description': 'drugName', 'type': 'string'},
   {'description': 'condition',
    'anyOf': [{'type': 'string'}, {'enum': [nan]}]},
   {'description': 'review', 'type': 'string'},
   {'description': 'date', 'type': 'string'},
   {'description': 'usefulCount', 'type': 'integer', 'minimum': 0}]},
 'minItems': 161297,
 'maxItems': 161297}
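
A schema of this shape can be derived mechanically from the data. The following standard-library sketch infers a similar schema from a list of rows; the helper `infer_schema` and the two toy rows are made up for illustration, and the real `to_schema` handles pandas and NumPy types, missing values, and value ranges that this sketch ignores.

```python
def json_type(value):
    """Map a Python value to a JSON Schema type name."""
    if isinstance(value, bool):  # bool is a subclass of int, so check first
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value {value!r}")


def infer_schema(rows, column_names):
    """Describe a table as an array of fixed-length rows, one entry per column."""
    n_cols = len(column_names)
    columns = []
    for j, name in enumerate(column_names):
        types = sorted({json_type(row[j]) for row in rows})
        if len(types) == 1:
            columns.append({"description": name, "type": types[0]})
        else:
            columns.append(
                {"description": name, "anyOf": [{"type": t} for t in types]}
            )
    return {
        "type": "array",
        "items": {
            "type": "array",
            "minItems": n_cols,
            "maxItems": n_cols,
            "items": columns,
        },
        "minItems": len(rows),
        "maxItems": len(rows),
    }


# Toy rows, invented for this sketch (not taken from the dataset).
toy_rows = [("drug A", "worked well", 12), ("drug B", "no effect", 0)]
toy_schema = infer_schema(toy_rows, ["drugName", "review", "usefulCount"])
print(toy_schema)
```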

Since `train_X` contains strings but `LR` expects only numbers, the
call to `fit` reports a type error.

In [14]:
trainable_lr = LR()
try:
    trainable_lr.fit(train_X, train_y)
except ValidationError as e:
    print(e.message, file=sys.stderr)

Failed validating input_schema_fit for LR due to 'Valsartan' is not of type 'number'

Failed validating 'type' in schema['properties']['X']['items']['items']:
    {'type': 'number'}

On instance['X'][0][0]:
    'Valsartan'
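
The error message pinpoints the offending value by its path, here row 0 and column 0 of `X`. A minimal sketch of that kind of check, using only the standard library, is shown below; the helper `first_non_number` and the toy data are hypothetical, and a real JSON Schema validator reports such paths in a more general way.

```python
def first_non_number(X):
    """Return (row, column, value) of the first non-numeric cell, or None."""
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            # bool is a subclass of int, but it is not a JSON number.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return i, j, value
    return None


# Toy data, invented for this sketch.
mixed_X = [["some drug", 3], ["another drug", 7]]
numeric_X = [[5.0, 3.4], [6.3, 3.3]]

mixed_result = first_non_number(mixed_X)
numeric_result = first_non_number(numeric_X)

if mixed_result is not None:
    i, j, value = mixed_result
    print(f"instance['X'][{i}][{j}]: {value!r} is not of type 'number'")
```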


Load a purely numerical dataset instead.

In [15]:
from lale.datasets import load_iris_df
(train_X, train_y), (test_X, test_y) = load_iris_df()
train_X.head()

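|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|-------------------|------------------|-------------------|------------------|
| 0 | 5.0               | 3.4              | 1.6               | 0.4              |
| 1 | 6.3               | 3.3              | 4.7               | 1.6              |
| 2 | 5.1               | 3.4              | 1.5               | 0.2              |
| 3 | 4.8               | 3.0              | 1.4               | 0.1              |
| 4 | 6.7               | 3.1              | 4.7               | 1.5              |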


Training LR with the Iris dataset works fine.

In [16]:
trained_lr = trainable_lr.fit(train_X, train_y)

### Lifecycle error example

Lale encourages separating the lifecycle states, here represented
by `trainable_lr` vs. `trained_lr`. The `predict` method should
only be called on a trained model.

In [17]:
predicted = trained_lr.predict(test_X)
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]


On the other hand, the `predict` method should not be called on a trainable model.

In [18]:
import warnings
warnings.filterwarnings("error", category=DeprecationWarning)
try:
    predicted = trainable_lr.predict(test_X)
except DeprecationWarning as w:
    print(str(w), file=sys.stderr)
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]


The `predict` method is deprecated on a trainable operator, because the learned coefficients could be accidentally overwritten by retraining. Call `predict` on the trained operator returned by `fit` instead.
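
The separation of lifecycle states can be made explicit with two classes, where `fit` returns a new object instead of mutating the receiver. The sketch below is a deliberately stricter toy than the behavior shown above: the hypothetical `TrainableMajority` has no `predict` method at all, whereas the trainable operator above still offers a deprecated one. The majority-class learner is only a stand-in for a real model.

```python
from collections import Counter


class TrainableMajority:
    """Trainable state: no learned state yet; offers fit but not predict."""

    def fit(self, X, y):
        # Learn the most frequent label and hand it to a separate object,
        # so refitting never overwrites an already trained model.
        most_common_label = Counter(y).most_common(1)[0][0]
        return TrainedMajority(most_common_label)


class TrainedMajority:
    """Trained state: holds the learned label and offers predict."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


trainable = TrainableMajority()
trained = trainable.fit([[0], [1], [2]], ["a", "b", "a"])
predictions = trained.predict([[3], [4]])
print(predictions)

# Refitting yields a second trained object and leaves the first untouched.
retrained = trainable.fit([[0], [1], [2]], ["b", "b", "a"])
print(trained.label, retrained.label)
```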
