# The `State` mechanism

A `State` is an object representing data from an experiment, like the conditions, observed experiment data and models. 
In the AutoRA framework, experimentalists, experiment runners and theorists are functions which 
- operate on `States` and 
- return `States`.

The `autora.state` submodule provides classes and functions to help build these functions. 

## Basic Aim: $f(S) = S^\prime$

The AutoRA State mechanism is an implementation of the functional programming paradigm. It distinguishes between:
- Data – stored as an immutable `State`
- Procedures – functions which act on `State` objects to add new data and return a new `State`.

Procedures generate data. Some common procedures which appear in AutoRA experiments, and the data they produce, are:

| Procedure         | Data            |
|-------------------|-----------------|
| Experimentalist   | Conditions      |
| Experiment Runner | Experiment Data |
| Theorist          | Model           |

The data produced by each procedure $f$ can be seen as additions to the existing data. Each procedure $f$:
- Takes in existing Data $S$
- Adds new data $\Delta S$
- Returns an updated state of the Data $S^\prime$  

$$
\begin{aligned}
f(S) &= S + \Delta S \\
     &= S^\prime
\end{aligned}
$$
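
The relation $f(S) = S + \Delta S$ can be illustrated without AutoRA at all. The following is a minimal, standard-library-only sketch: `ToyState`, `ToyDelta` and `add_conditions` are hypothetical names invented for this illustration and are not part of `autora.state`. The sketch also assumes that new conditions replace the old ones. It only shows that a procedure leaves its input untouched and returns a new object.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyDelta:
    """Hypothetical stand-in for a delta: the new data produced by one procedure."""
    conditions: tuple = ()


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for an immutable state with a single field."""
    conditions: tuple = ()

    def __add__(self, delta: ToyDelta) -> "ToyState":
        # Assumption of this sketch: new conditions replace the old ones
        return replace(self, conditions=delta.conditions)


def add_conditions(state: ToyState) -> ToyState:
    """A procedure f: takes S, builds a delta, and returns S' = S + delta."""
    delta = ToyDelta(conditions=(1.0, 2.0, 3.0))
    return state + delta


toy_s = ToyState()
toy_s_prime = add_conditions(toy_s)
print(toy_s)        # The original state is unchanged
print(toy_s_prime)  # The new state carries the added conditions
```

Because both classes are frozen dataclasses, any attempt to assign to a field raises an error, so the only way to "change" the state is to return a new one.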

AutoRA includes:
- Classes to represent the Data $S$ – the `State` object (and the derived `StandardState` – a pre-defined version 
with the common fields needed for cyclical experiments)  
- Functions to make it easier to write procedures of the form $f(S) = S^\prime$

In [None]:
import numpy as np
import pandas as pd
import autora.state
from autora.variable import VariableCollection, Variable

## `State` objects
TODO: write this part

In [None]:
s_0 = autora.state.StandardState(
    variables=VariableCollection(
        independent_variables=[Variable("x", value_range=(-10, 10))],
        dependent_variables=[Variable("y")]
    ),
    conditions=pd.DataFrame({"x":[]}),
    experiment_data=pd.DataFrame({"x":[], "y":[]}),
    models=[]
)

## `Variable` and `VariableCollection`
TODO: move this to a different file

## Making a function of the form $f(S) = S^\prime$

There are several equivalent ways to make a function of the form $f(S) = S^\prime$. These are (from 
simplest but most restrictive, to most complex but with the greatest flexibility):
- Use the `autora.state.on_state` decorator
- Write the function so that it accepts a `StandardState` and updates this with a `Delta`

There are also special cases, like the `autora.state.estimator_on_state` wrapper for `scikit-learn` estimators.  

Say you have a function to generate new experimental conditions, given some variables.

In [None]:
def generate_conditions(variables, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    conditions = pd.DataFrame()                             # Create a DataFrame to hold the results  
    for iv in variables.independent_variables:              # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        conditions[iv.name] = c                             #  - Save the new values to the DataFrame
    return conditions

Both approaches are applied to `generate_conditions` below, starting with the simpler one.

### Use the `autora.state.on_state` decorator

`autora.state.on_state` is a wrapper for functions. It changes a function's signature so that the wrapped function 
accepts a `State` as its first argument and returns a new `State`. 

The most concise way to use it is as a decorator where the function is defined. The `output` argument specifies how 
the returned values are mapped to fields on the `State`.

In [None]:
@autora.state.on_state(output=["conditions"])
def generate_conditions(variables, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    conditions = pd.DataFrame()                             # Create a DataFrame to hold the results  
    for iv in variables.independent_variables:              # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        conditions[iv.name] = c                             #  - Save the new values to the DataFrame
    return conditions

# Example
generate_conditions(s_0)

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-10, 10), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=          x
0  5.479121
1 -1.222431
2  7.171958
3  3.947361
4 -8.116453, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

Equivalently, you can modify `generate_conditions` to return a dictionary whose keys are the names of the `State` 
fields to update. In that case, `on_state` is used without the `output` argument: 

In [None]:
@autora.state.on_state
def generate_conditions(variables, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    conditions = pd.DataFrame()                             # Create a DataFrame to hold the results  
    for iv in variables.independent_variables:              # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        conditions[iv.name] = c                             #  - Save the new values to the DataFrame
    return {"conditions": conditions}                       # Return a dictionary with the appropriate name

# Example
generate_conditions(s_0)

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-10, 10), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=          x
0  5.479121
1 -1.222431
2  7.171958
3  3.947361
4 -8.116453, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

#### Deep dive: `autora.state.on_state`
The decorator notation is equivalent to the following:

In [None]:
def generate_conditions_inner(variables, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    result = pd.DataFrame()                                 # Create a DataFrame to hold the results
    for iv in variables.independent_variables:              # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        result[iv.name] = c                                 #  - Save the new values to the DataFrame
    return result

generate_conditions = autora.state.on_state(generate_conditions_inner, output=["conditions"])

# Example
generate_conditions(s_0, random_state=180)

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-10, 10), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=          x
0  1.521127
1  3.362120
2  1.065391
3 -5.844244
4 -6.444732, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

During the `generate_conditions(s_0, random_state=180)` call, `autora.state.on_state` does the following:
- Inspects the signature of `generate_conditions_inner` to see which variables are required – in this case:
    - `variables`, 
    - `num_samples` and 
    - `random_state`.
- Looks for fields with those names on `s_0`:
    - Finds a field called `variables`.
- Calls `generate_conditions_inner` with those fields as arguments, plus any arguments specified in the 
`generate_conditions` call (here just `random_state`). Parameters supplied by neither (here `num_samples`) keep their 
default values.
- Converts the returned value `result` into `Delta(conditions=result)` using the name specified in `output=["conditions"]`
- Returns `s_0 + Delta(conditions=result)`
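
The steps above can be sketched in a few lines of standard-library Python. This is an illustrative re-implementation under simplifying assumptions, not the code in `autora.state`: `toy_on_state` and `double_all` are hypothetical names, the state is a plain dictionary rather than a `State` object, and "adding" the delta is modelled as a dictionary merge.

```python
import inspect


def toy_on_state(function, output):
    """Hypothetical, simplified analogue of a state wrapper (not AutoRA's code)."""
    parameter_names = inspect.signature(function).parameters  # Names the function accepts

    def wrapped(state: dict, **kwargs):
        # Look for state fields whose names match the function's parameters
        from_state = {k: v for k, v in state.items() if k in parameter_names}
        # Explicit keyword arguments take precedence over values from the state
        # (an assumption of this sketch)
        arguments = {**from_state, **kwargs}
        result = function(**arguments)
        # Convert the bare return value into a delta, using the requested field name
        delta = {output[0]: result}
        # Return a new state; the input dictionary is not modified
        return {**state, **delta}

    return wrapped


def double_all(variables, factor=2):
    return [v * factor for v in variables]


toy_wrapped = toy_on_state(double_all, output=["conditions"])
toy_s_0 = {"variables": [1, 2, 3], "conditions": []}
toy_s_1 = toy_wrapped(toy_s_0, factor=10)
print(toy_s_1)  # {'variables': [1, 2, 3], 'conditions': [10, 20, 30]}
```

The call passes only the state and `factor`; `variables` is picked up from the state because its name matches a parameter of `double_all`.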

### Modify `generate_conditions` to accept a `StandardState` and update this with a `Delta`

As an alternative to the `autora.state.on_state` wrapper, you can write a function which takes and returns 
`State` objects directly. The result is the same. 

In [None]:
def generate_conditions(state: autora.state.StandardState, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    conditions = pd.DataFrame()                             # Create a DataFrame to hold the results  
    for iv in state.variables.independent_variables:        # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        conditions[iv.name] = c                             #  - Save the new values to the DataFrame
    delta = autora.state.Delta(conditions=conditions)       # Construct a new Delta representing the updated data
    new_state = state + delta                               # Construct a new state, "adding" the Delta
    return new_state

# Example
generate_conditions(s_0)

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-10, 10), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=          x
0  5.479121
1 -1.222431
2  7.171958
3  3.947361
4 -8.116453, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

### Special case: `autora.state.estimator_on_state` for `scikit-learn` estimators

The "theorist" component in an AutoRA cycle is often a `scikit-learn` compatible estimator which implements a curve 
fitting function like a linear, logistic or symbolic regression. `scikit-learn` estimators are classes, and they have
 a specific wrapper: `autora.state.estimator_on_state`, used as follows:

In [None]:
from sklearn.linear_model import LinearRegression


estimator = LinearRegression(fit_intercept=True)       # Initialize the regressor with all its parameters
theorist = autora.state.estimator_on_state(estimator)  # Wrap the estimator


# Example
variables = s_0.variables          # Reuse the variables from before 
xs = np.linspace(-10, 10, 101)     # Make an array of x-values 
noise = np.random.default_rng(179).normal(0., 0.5, xs.shape)  # Gaussian noise
ys = (3.5 * xs + 2. + noise)       # Calculate y = 3.5 x + 2 + noise  

s_1 = autora.state.StandardState(  # Initialize the State with those data
    variables=variables,
    experiment_data=pd.DataFrame({"x":xs, "y":ys}),
)
s_1_prime = theorist(s_1)         # Run the theorist
print(f"Returned models: "
      f"{s_1_prime.models}")      
print(f"Last model's coefficients: "
      f"y = {s_1_prime.models[-1].coef_[0]} x + {s_1_prime.models[-1].intercept_}")

Returned models: [LinearRegression()]
Last model's coefficients: y = [3.49729147] x + [1.99930059]


During the `theorist(s_1)` call, `autora.state.estimator_on_state` does the following:
- Gets the names of the independent and dependent variables from `s_1.variables`
- Gathers the values of those variables from `s_1.experiment_data`
- Passes those values to the `LinearRegression().fit(x, y)` method
- Constructs `Delta(models=[LinearRegression()])` with the fitted regressor
- Returns `s_1 + Delta(models=[LinearRegression()])`
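
The same sequence can be sketched with plain Python objects. Everything in this block is hypothetical and for illustration only: `toy_estimator_on_state`, `ToyLineFit` and the dictionary-based state are stand-ins, not AutoRA or `scikit-learn` code. The sketch assumes that the fitted model is appended to the existing list of models, and it copies the estimator before fitting so that each call yields an independent model.

```python
import copy


class ToyLineFit:
    """Hypothetical estimator with a scikit-learn-like `fit` method (1-D least squares)."""

    def fit(self, x, y):
        n = len(x)
        mean_x = sum(x) / n
        mean_y = sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        self.coef_ = sxy / sxx                          # Slope of the fitted line
        self.intercept_ = mean_y - self.coef_ * mean_x  # Intercept of the fitted line
        return self


def toy_estimator_on_state(estimator):
    def theorist(state: dict) -> dict:
        # Get the names of the independent and dependent variables
        iv, dv = state["independent_variable"], state["dependent_variable"]
        # Gather the values of those variables from the experiment data
        x = state["experiment_data"][iv]
        y = state["experiment_data"][dv]
        # Fit a copy of the estimator, leaving the original untouched
        fitted = copy.deepcopy(estimator).fit(x, y)
        # Return a new state with the fitted model appended
        return {**state, "models": state["models"] + [fitted]}

    return theorist


toy_theorist = toy_estimator_on_state(ToyLineFit())
toy_state = {
    "independent_variable": "x",
    "dependent_variable": "y",
    "experiment_data": {"x": [0.0, 1.0, 2.0, 3.0], "y": [2.0, 5.5, 9.0, 12.5]},
    "models": [],
}
toy_state_prime = toy_theorist(toy_state)
toy_model = toy_state_prime["models"][-1]
print(toy_model.coef_, toy_model.intercept_)
```

The toy data lie exactly on the line $y = 3.5 x + 2$, so the fit recovers that slope and intercept.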

## Example
Sebastian wishes to run an experiment. He knows:
- which variables he wants to investigate: 
    - $x$, the independent variable, is a number in the range $-10$ to $10$,
    - $y$, the dependent variable, is a number with an unknown range.

and will use this knowledge to **initialize a `State` object**.

He has planned procedures for:
- making a list of conditions to observe, 
- running the experiment, given the list of conditions,
- generating a model to describe the data

and he will write each of these down as a **function**.

### Initialize the `State` object
Sebastian writes down the current state of his knowledge about the problem in a `State` object.

However, he doesn't yet know which conditions to look at – those will be generated by his procedures. 
Nor does he have any experiment data. So he initializes DataFrames to hold those results, but 
leaves both empty. Likewise, he doesn't have any models right now, so he creates an empty list for those.

In [None]:
import pandas as pd
from autora.state import StandardState
from autora.variable import VariableCollection, Variable

s_0 = StandardState(
    variables=VariableCollection(
        independent_variables=[Variable("x", value_range=(-10, 10))],
        dependent_variables=[Variable("y")]
    ),
    conditions=pd.DataFrame({"x":[]}),
    experiment_data=pd.DataFrame({"x":[], "y":[]}),
    models=[]
)

### Write "experimentalist" procedure for generating conditions

Sebastian writes down the procedure for making a list of conditions to observe. He writes this as a function 
which acts on the things he knows from the state, and returns a dataframe with the new conditions. 

In [None]:
import numpy as np


def generate_conditions(variables, num_samples=5, random_state=42):
    rng = np.random.default_rng(random_state)               # Initialize a random number generator
    conditions = pd.DataFrame()                             # Create a DataFrame to hold the results  
    for iv in variables.independent_variables:              # Loop through the independent variables
        c = rng.uniform(*iv.value_range, size=num_samples)  #  - Generate a uniform sample from the range
        conditions[iv.name] = c                             #  - Save the new values to the DataFrame
    return conditions

# Example
generate_conditions(s_0.variables)

          x
0  5.479121
1 -1.222431
2  7.171958
3  3.947361
4 -8.116453


Finally, he "wraps" the `generate_conditions` function using a utility from the `autora.state` submodule, to make his
 finished experimentalist. The purpose of the wrapper is to turn the basic function he wrote into one which accepts 
 a `State` object as input and returns a `State` object.

In [None]:
from autora.state import on_state


experimentalist = on_state(  # Utility which adds the `State` functionality to a function
    generate_conditions,     # Pass in the basic `generate_conditions` function
    output=["conditions"]    # Say that the value returned from `generate_conditions` should be 
                             # used as `conditions` on the State
)

# Example
experimentalist(s_0)

StandardState(variables=VariableCollection(independent_variables=[Variable(name='x', value_range=(-10, 10), allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], dependent_variables=[Variable(name='y', value_range=None, allowed_values=None, units='', type=<ValueType.REAL: 'real'>, variable_label='', rescale=1, is_covariate=False)], covariates=[]), conditions=          x
0  5.479121
1 -1.222431
2  7.171958
3  3.947361
4 -8.116453, experiment_data=Empty DataFrame
Columns: [x, y]
Index: [], models=[])

### Write "experiment runner" procedure for gathering observations 

In [None]:
from autora.experimentalist.random_ import random_pool  # Standard random-sampling experimentalist
from autora.state import on_state

experimentalist = on_state(function=random_pool, output=["conditions"])
s_1 = experimentalist(s_0, random_state=42)
s_1


## Theoretical Overview

The fundamental idea is this:
- We define a "state" object $S$ which can be modified with a "delta" (a new result) $\Delta S$.
- A new state at some point $i+1$ is $$S_{i+1} = S_i + \Delta S_{i+1}$$
- The cycle state after $n$ steps is thus $$S_n = S_{0} +  \sum^{n}_{i=1} \Delta S_{i}$$

To represent $S$ and $\Delta S$ in code, you can use `autora.state.State` and `autora.state.Delta`
respectively. To operate on these, we define functions.

- Each operation in an AER cycle (theorist, experimentalist, experiment_runner, etc.) is implemented as a
function with $n$ arguments $s_j$ which are members of $S$ and $m$ others $a_k$ which are not.
  $$ f(s_1, ..., s_n, a_1, ..., a_m) \rightarrow \Delta S_{i+1}$$
- There is a wrapper function $w$ (`autora.state.wrap_to_use_state`) which changes the signature of $f$ to
require $S$ and adds the resulting $\Delta S_{i+1}$ to it
  $$w\left[f(s_1, ..., s_n, a_1, ..., a_m) \rightarrow \Delta
S_{i+1}\right] \rightarrow \left[ f^\prime(S_i, a_1, ..., a_m) \rightarrow S_{i} + \Delta
S_{i+1} = S_{i+1}\right]$$

- Assuming that the other arguments $a_k$ are provided by partial evaluation of the $f^\prime$, the full AER cycle can
then be represented as:
  $$S_n = f_n^\prime(...f_2^\prime(f_1^\prime(S_0)))$$
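
As a sketch of this composition, the cycle can be written as a fold over a list of wrapped functions, with `functools.partial` supplying the extra arguments $a_k$. The functions `add_sample` and `run_cycle` and the tuple-based state are hypothetical, chosen only to keep the example self-contained.

```python
from functools import partial, reduce


def add_sample(state: tuple, value: float) -> tuple:
    """A wrapped step f': takes the state S_i and returns S_{i+1} = S_i + delta."""
    return state + (value,)  # Tuple concatenation builds a new state


def run_cycle(initial_state, steps):
    """Apply the steps in order: f_n'(...f_2'(f_1'(S_0)))."""
    return reduce(lambda state, step: step(state), steps, initial_state)


# Partial evaluation fixes the extra argument, leaving functions of the state only
toy_steps = [partial(add_sample, value=v) for v in (0.5, 1.5, 2.5)]
toy_s_n = run_cycle((), toy_steps)
print(toy_s_n)  # (0.5, 1.5, 2.5)
```

Each step only ever sees the state returned by the previous one, which is why the steps can be reordered or repeated without any shared mutable data.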

There are additional helper functions to wrap common experimentalists, experiment runners and theorists so that we
can define a full AER cycle using python notation as shown in the following example.

## Example

First initialize the State. In this case, we use the pre-defined `StandardState` which implements the standard AER
naming convention.
There are two variables: `x` with a range [-10, 10] and `y` with an unspecified range.

In [None]:
from autora.state import StandardState
from autora.variable import VariableCollection, Variable

s_0 = StandardState(
    variables=VariableCollection(
        independent_variables=[Variable("x", value_range=(-10, 10))],
        dependent_variables=[Variable("y")]
    )
)

Specify the experimentalist. Use a standard function `random_pool`.
This gets 5 independent random samples (by default, configurable using an argument)
from the value_range of the independent variables, and returns them in a DataFrame.
To make this work as a function on the State objects, we wrap it in the `on_state` function.

In [None]:
from autora.experimentalist.random_ import random_pool
from autora.state import on_state

experimentalist = on_state(function=random_pool, output=["conditions"])
s_1 = experimentalist(s_0, random_state=42)
s_1

Specify the experiment runner. This calculates a linear function, adds noise, and assigns the value to the `y` column
in a new DataFrame.

In [None]:
from autora.state import on_state
import numpy as np
import pandas as pd


@on_state(output=["experiment_data"])
def experiment_runner(conditions: pd.DataFrame, c=(2, 4), random_state=None):
    rng = np.random.default_rng(random_state)
    x = conditions["x"]
    noise = rng.normal(0, 1, len(x))
    y = c[0] + (c[1] * x) + noise
    observations = conditions.assign(y = y)
    return observations

# Which does the following:
experiment_runner(s_1, random_state=43)

A completely analogous definition, using the separate `@inputs_from_state` and `@outputs_to_delta(...)` decorators
rather than the combined `@on_state(...)` decorator, would be:

In [None]:
from autora.state import inputs_from_state, outputs_to_delta


@inputs_from_state
@outputs_to_delta("experiment_data")
def experiment_runner_alt_1(conditions: pd.DataFrame, c=(2, 4), random_state=None):
    x = conditions["x"]
    rng = np.random.default_rng(random_state)
    noise = rng.normal(0, 1, len(x))
    y = c[0] + (c[1] * x) + noise
    xy = conditions.assign(y = y)
    return xy

# Which does the following:
experiment_runner_alt_1(s_1, random_state=42)

Or alternatively:

In [None]:
def experiment_runner_alt_2_core(conditions: pd.DataFrame, c=(2, 4), random_state=None):
    x = conditions["x"]
    rng = np.random.default_rng(random_state)
    noise = rng.normal(0, 1, len(x))
    y = c[0] + (c[1] * x) + noise
    xy = conditions.assign(y = y)
    return xy

experiment_runner_alt_2 = on_state(experiment_runner_alt_2_core, output=["experiment_data"])
experiment_runner_alt_2(s_1)

Specify a theorist, using a standard LinearRegression from scikit-learn.

In [None]:
from sklearn.linear_model import LinearRegression
from autora.state import estimator_on_state

theorist = estimator_on_state(LinearRegression(fit_intercept=True))

Now we can run the theorist on the output from the experiment_runner,
which itself uses the output from the experimentalist.

In [None]:
theorist(experiment_runner(experimentalist(s_0)))

If we like, we can run the experimentalist, experiment_runner and theorist ten times.

In [None]:
s_ = s_0
for i in range(10):
    s_ = experimentalist(s_, random_state=180+i)
    s_ = experiment_runner(s_, random_state=2*180+i)
    s_ = theorist(s_)

The experiment_data has 50 entries (10 cycles and 5 samples per cycle):

In [None]:
s_.experiment_data

The fitted coefficients are close to the original values (intercept = 2, gradient = 4):

In [None]:
print(s_.model.intercept_, s_.model.coef_)