# Workflows using Wrappers and Delta States

Using the functions in `autora.workflow`, we can build decorators for theorists or experimentalists which
will keep them compatible with future developments in the architecture and naming of `State` objects.


In [None]:
from collections import ChainMap
import dataclasses
from typing import Optional

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

from autora.variable import VariableCollection, Variable

For this example, we'll use a polynomial of degree 3 as our "ground truth" function. We're also using pandas
DataFrames and Series as our data interchange format.

In [None]:
coefs = [432, -144, -3, 1] # from https://www.maa.org/sites/default/files/0025570x28304.di021116.02p0130a.pdf
def ground_truth(x: pd.Series) -> pd.Series:
    y = pd.Series(coefs[0] + coefs[1] * x + coefs[2] * x**2 + coefs[3] * x**3, name="y")
    return y

rng = np.random.default_rng(1)
def noisy_observation(x: pd.Series) -> pd.Series:
    y = ground_truth(x) + rng.normal(0, 1000, len(x))
    return y

We define a two-part AER pipeline consisting of an experiment runner and a theorist (we always reuse the seed
conditions).

The key point is that both the experiment runner and the theorist are functions which:
- operate on the `State`, and
- return a modified object of the **same type** as that `State`.

We define the state as a simple dataclass with fields representing the variables,
parameters, experimental data, (possibly) conditions, and (possibly) a model.

This state has no "history"; it represents a snapshot of the data at one time. Other exemplar state objects are
available in the subpackage `autora.workflow.state` and include some with in-built histories.

In [None]:
@dataclasses.dataclass
class CustomState:
    variables: VariableCollection
    params: dict
    experimental_data: pd.DataFrame
    conditions: Optional[pd.Series] = None
    model: Optional[BaseEstimator] = None

s = CustomState(
    variables=VariableCollection(independent_variables=[Variable("x")], dependent_variables=[Variable("y")]),
    params={},
    conditions=pd.Series(data=np.linspace(-15,15,101), name="x"),
    experimental_data = pd.DataFrame(columns=["x","y"]),
)
s

CustomState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), params={}, experimental_data=Empty DataFrame
Columns: [x, y]
Index: [], conditions=0     -15.0
1     -14.7
2     -14.4
3     -14.1
4     -13.8
       ... 
96     13.8
97     14.1
98     14.4
99     14.7
100    15.0
Name: x, Length: 101, dtype: float64, model=None)
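
Because the state is a plain dataclass, an update can be expressed as a new snapshot rather than an in-place change. The sketch below illustrates this with `dataclasses.replace` from the standard library. `ToyState` is a hypothetical stand-in for `CustomState` and is not part of autora.

```python
import dataclasses
from typing import Optional

import pandas as pd


@dataclasses.dataclass
class ToyState:
    """Hypothetical, simplified stand-in for CustomState (illustration only)."""
    experimental_data: pd.DataFrame
    model: Optional[str] = None


before = ToyState(experimental_data=pd.DataFrame({"x": [0.0], "y": [1.0]}))

# dataclasses.replace builds a new object of the same type with one field changed;
# the original snapshot is left untouched.
after = dataclasses.replace(before, model="fitted")

print(before.model, after.model)
```

Fields which are not replaced are carried over by reference, so the two snapshots share the same `experimental_data` object.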

Given this state, we define a two part AER pipeline consisting of an experiment runner and a theorist. We'll just
reuse the initial seed `conditions` in this example.

First we define and test the experiment runner.

The key part here is that the experiment runner is written as a function of the state's `conditions` field which
returns a `GeneralDelta` object. The `wrap_to_use_state` decorator lets us call it on the whole `State` and get back an
updated state.

In [None]:
from autora.workflow.state.delta import GeneralDelta, wrap_to_use_state

@wrap_to_use_state
def experiment_runner(conditions) -> GeneralDelta:
    x = conditions
    y = noisy_observation(x)
    experimental_data = pd.DataFrame.merge(x, y, left_index=True, right_index=True)
    return GeneralDelta(kind="extend", experimental_data=experimental_data)

experiment_runner(s)

OrderedDict([('conditions', <Parameter "conditions">)])
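
To make the idea of such a wrapper concrete, here is a minimal sketch of how a decorator of this kind might work. It is an assumption for illustration and not the autora implementation: `toy_wrap_to_use_state`, `ToyState` and `toy_runner` are hypothetical names, the "delta" is just a plain dict, and only "extend"-style updates on data frames are handled.

```python
import dataclasses
import inspect
from functools import wraps

import pandas as pd


def toy_wrap_to_use_state(f):
    """Hypothetical wrapper (NOT autora's): lets f(field, ...) be called as f(state)."""
    parameter_names = set(inspect.signature(f).parameters)

    @wraps(f)
    def wrapped(state):
        # Pass only the state fields which the function asks for by name.
        kwargs = {
            field.name: getattr(state, field.name)
            for field in dataclasses.fields(state)
            if field.name in parameter_names
        }
        delta = f(**kwargs)  # in this sketch a delta is a dict of new rows per field
        # "Extend" semantics: append the new rows to the existing data frames.
        updates = {
            name: pd.concat([getattr(state, name), new], ignore_index=True)
            for name, new in delta.items()
        }
        return dataclasses.replace(state, **updates)

    return wrapped


@dataclasses.dataclass
class ToyState:
    conditions: pd.Series
    experimental_data: pd.DataFrame


@toy_wrap_to_use_state
def toy_runner(conditions):
    # Toy "experiment": y is exactly twice x.
    return {"experimental_data": pd.DataFrame({"x": conditions, "y": 2 * conditions})}


s0 = ToyState(
    conditions=pd.Series([1.0, 2.0], name="x"),
    experimental_data=pd.DataFrame({"x": [0.0], "y": [0.0]}),
)
s1 = toy_runner(s0)  # called with the whole state, returns a new state
s2 = toy_runner(s1)
print(len(s0.experimental_data), len(s1.experimental_data), len(s2.experimental_data))
```

The function being wrapped never sees the state object itself, which is what keeps it independent of how the state is structured or named.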


When we run the experiment runner, we can see the updated state object which is returned – it has new experimental data.

In [None]:
experiment_runner(s)

CustomState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), params={}, experimental_data=        x            y
0   -10.0  1434.444796
1    -9.8   488.295916
2    -9.6  1322.337241
3    -9.4  1908.779605
4    -9.2  1107.121583
..    ...          ...
96    9.2  -199.926131
97    9.4   192.309452
98    9.6 -1407.268729
99    9.8  1502.302238
100  10.0  1712.073367

[101 rows x 2 columns], conditions=0     -15.0
1     -14.7
2     -14.4
3     -14.1
4     -13.8
       ... 
96     13.8
97     14.1
98     14.4
99     14.7
100    15.0
Name: x, Length: 101, dtype: float64, model=None)



Now we define a theorist, which fits a linear regression on polynomial features of the data up to degree 5. We define a
regressor and a function to return its feature names and coefficients, and then the theorist which uses them.

In [None]:
# Completely standard scikit-learn pipeline regressor
regressor = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())

def get_equation(r):
    t = r.named_steps['polynomialfeatures'].get_feature_names_out()
    c = r.named_steps['linearregression'].coef_
    return pd.DataFrame({"t": t, "coefficient": c.reshape(t.shape)})

def theorist(state: CustomState, params: Optional[dict] = None) -> CustomState:
    if params is None:
        params = {}
    # Parameters passed directly take precedence over those stored on the state
    params_ = ChainMap(params, state.params)
    ivs = [v.name for v in state.variables.independent_variables]
    dvs = [v.name for v in state.variables.dependent_variables]
    X, y = state.experimental_data[ivs], state.experimental_data[dvs]
    model = regressor.fit(X, y, **params_)
    # Return a new state of the same type, with only the model replaced
    new_state = dataclasses.replace(state, model=model)
    return new_state

Now we run the theorist on the result of the experiment_runner (by chaining the two functions).

In [None]:
t = theorist(experiment_runner(s))

The fitted coefficients are:

In [None]:
print(get_equation(t.model))

Now we can define the simplest pipeline which runs the experiment runner and theorist in sequence and returns the
updated state:

In [None]:
def pipeline(state: CustomState) -> CustomState:
    s_ = state
    t_ = experiment_runner(s_)
    u_ = theorist(t_)
    return u_

Running this pipeline is the same as running the individual steps – just pass the state object.

In [None]:
rng = np.random.default_rng(1)  # reset the RNG for reproducibility
u = pipeline(s)

To show what's happening, we'll show the data, best fit model and ground truth:

In [None]:
def show_best_fit(state):
    # Use only the state passed in, not the global variables `u` and `t`
    state.experimental_data.plot.scatter("x", "y", s=1, alpha=0.5, c="gray")
    plt.plot(state.conditions, state.model.predict(pd.DataFrame(state.conditions)), label="best fit")
    plt.plot(state.conditions, ground_truth(state.conditions), label="ground truth")
    plt.legend()
    print(get_equation(state.model))

show_best_fit(u)

We can use this pipeline to make a trivial cycle, where we keep gathering data until we reach 1000 datapoints. Any
condition defined on the state object could be used here, though.

In [None]:
v = s
while len(v.experimental_data) < 1_000:  # any condition on the state can be used here.
    v = pipeline(v)
show_best_fit(v)

We can redefine the pipeline as a generator, which can be operated on using iteration tools:

In [None]:
from typing import Iterator

def cycle(state: CustomState) -> Iterator[CustomState]:
    s_ = state
    while True:
        s_ = experiment_runner(s_)
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(s)

for i in range(1000):
    t = next(cycle_generator)
show_best_fit(t)
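
As a sketch of what "iteration tools" can mean here, the standard library's `itertools.islice` takes a fixed number of steps from an endless generator, and a generator expression with `next` stops at the first state meeting a condition. `toy_cycle` is a hypothetical generator whose integer state counts observations, standing in for `cycle`.

```python
import itertools
from typing import Iterator


def toy_cycle(state: int) -> Iterator[int]:
    """Hypothetical endless generator: each step adds 101 'observations'."""
    while True:
        state = state + 101
        yield state


# Take exactly the first 10 states without writing an explicit counter loop.
first_ten = list(itertools.islice(toy_cycle(0), 10))

# Or stop at the first state which satisfies a condition.
enough = next(s for s in toy_cycle(0) if s >= 1_000)

print(first_ten[-1], enough)
```

Both forms leave the generator itself unchanged; the stopping rule lives entirely in the code which consumes it.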

You can also define a cycle (or a sequence of steps) that yields the intermediate results.

In [None]:
from typing import Iterator

v0 = s
def cycle(state: CustomState) -> Iterator[CustomState]:
    s_ = state
    while True:
        print("#-- running experiment_runner --#\n")
        s_ = experiment_runner(s_)
        yield s_
        print("#-- running theorist --#\n")
        s_ = theorist(s_)
        yield s_

cycle_generator = cycle(v0)

At the outset, we have no model and an empty `experimental_data` dataframe.

In [None]:
print(f"{v0.model=}, \n{v0.experimental_data.shape=}")

In the first `next`, we only run the `experiment_runner`:

In [None]:
v1 = next(cycle_generator)
print(f"{v1.model=}, \n{v1.experimental_data.shape=}")

In the next step, we run the theorist on that data, but we don't add any new data:

In [None]:
v2 = next(cycle_generator)
print(f"{v2.model=}, \n{v2.experimental_data.shape=}")

In the next step, we run the experiment runner again and gather more observations:

In [None]:
v3 = next(cycle_generator)
print(f"{v3.model=}, \n{v3.experimental_data.shape=}")
