In [1]:
import pathlib
import numpy as np
import pandas as pd
import os

from functools import partial
from sklearn.linear_model import LogisticRegression
from autora.variable import DV, IV, ValueType, VariableCollection
from autora.experimentalist.pipeline import PoolPipeline
from autora.experimentalist.pool import gridsearch_pool
from autora.experimentalist.filter import weber_filter
from autora.experimentalist.sampler import random_sampler, uncertainty_sampler

# Introduction
This notebook demonstrates the use of the `PoolPipeline` class to create Experimentalists. Experimentalists consist of two main components:
1. Condition Generation - Creating combinations of independent variables to test.
2. Experimental Design - Ensuring conditions meet design constraints.

The `PoolPipeline` class allows us to define a series of functions to generate and process a pool of conditions that conform to an experimental design.


## Implementation

The `PoolPipeline` class takes two types of inputs:
1. Pool - A pool of conditions or a function that generates it.
2. Pipes - An arbitrary number of functions applied to the pool in sequence.
    * Examples of pipes include samplers, conditional filters, and sequencers.

<blockquote>

```Python
# Initialize the Pipeline
pipeline = PoolPipeline(Pool, *Pipes)

# Run the pipeline
conditions = pipeline.run()
```
</blockquote>
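
Conceptually, the pool is passed through each pipe in turn. The sketch below illustrates that idea in plain Python. The names `run_pipeline`, `keep_even`, and `first_three` are hypothetical and made up for this illustration; they are not part of the `autora` API, and the actual `PoolPipeline` implementation may differ.

```python
def run_pipeline(pool, *pipes):
    """Hypothetical stand-in for a pool pipeline: apply each pipe in order."""
    # If the pool is a callable, call it to create a fresh pool of conditions.
    conditions = pool() if callable(pool) else pool
    for pipe in pipes:
        conditions = pipe(conditions)
    return list(conditions)


def keep_even(values):
    """Hypothetical filter pipe: keep only even values."""
    return (v for v in values if v % 2 == 0)


def first_three(values):
    """Hypothetical sampler pipe: keep the first three values."""
    return list(values)[:3]


result = run_pipeline(lambda: range(10), keep_even, first_three)
print(result)
```

Each pipe only needs to accept an iterable of conditions and return an iterable, which is what makes filters and samplers interchangeable in this sketch.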




## Example 1: Exhaustive Pool with Random Sampler
The examples in this notebook will create a Weber line-lengths experiment. The Weber experiment tests human detection of differences between the lengths of two lines. The first example will sample a pool with simple random sampling. We will first define the independent and dependent variables (IVs and DVs, respectively).


In [4]:
# Specify dependent and independent variables
# Specify independent variables
iv1 = IV(
    name="S1",
    allowed_values=np.linspace(0, 5, 5),
    units="intensity",
    variable_label="Stimulus 1 Intensity",
)

iv2 = IV(
    name="S2",
    allowed_values=np.linspace(0, 5, 5),
    units="intensity",
    variable_label="Stimulus 2 Intensity",
)

# The experimentalist pipeline doesn't actually use DVs; they are specified here
# only as an example.
dv1 = DV(
    name="difference_detected",
    value_range=(0, 1),
    units="probability",
    variable_label="P(difference detected)",
    type=ValueType.PROBABILITY,
)

# Variable collection with ivs and dvs
metadata = VariableCollection(
    independent_variables=[iv1, iv2],
    dependent_variables=[dv1],
)

Next we set up the `PoolPipeline` with three functions:
1. `gridsearch_pool` - Generates an exhaustive pool of condition combinations using the Cartesian product of discrete IV values.
   - The discrete IV values are specified with the `allowed_values` attribute when defining the IVs.
2. `weber_filter` - Filters the pool to enforce the experimental design constraint IV1 <= IV2.
3. `random_sampler` - Randomly samples conditions from the filtered pool.
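
To get a feel for the first two steps, the following plain-Python sketch builds the Cartesian product of the allowed values and keeps only the pairs with S1 <= S2. It uses `itertools.product` as a stand-in and is an assumption about the behavior described above, not the actual code of `gridsearch_pool` or `weber_filter`.

```python
from itertools import product

# Same values as np.linspace(0, 5, 5), written out explicitly.
s1_values = [0.0, 1.25, 2.5, 3.75, 5.0]
s2_values = [0.0, 1.25, 2.5, 3.75, 5.0]

# Exhaustive pool: every combination of S1 and S2.
full_pool = list(product(s1_values, s2_values))

# Design constraint: keep only conditions where S1 <= S2.
filtered_pool = [(s1, s2) for s1, s2 in full_pool if s1 <= s2]

print(len(full_pool), len(filtered_pool))  # 25 15
```

With five allowed values per variable, the exhaustive pool has 25 conditions, of which 15 satisfy the constraint.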

Functions that require keyword arguments are wrapped with `functools.partial` before being passed to `PoolPipeline`.
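
As a quick reminder of what `partial` does, the sketch below binds a keyword argument ahead of time so that the resulting callable only needs the conditions as input. The function `sample_n` is a hypothetical sampler written for this illustration and is not the library's `random_sampler`.

```python
import random
from functools import partial


def sample_n(conditions, n):
    """Hypothetical sampler: draw n conditions without replacement."""
    return random.sample(list(conditions), n)


# Bind n=3 now; the pipeline can later call sample_three(conditions).
sample_three = partial(sample_n, n=3)

picked = sample_three(range(10))
print(picked)
```

The bound callable takes a single positional argument, which is the shape a pipe is expected to have.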

In [5]:
## Set up pipeline functions with the partial function
# Pool Function
pooler_callable = partial(gridsearch_pool, ivs=metadata.independent_variables)
# Random Sampler
sampler = partial(random_sampler, n=10)

# Initialize the pipeline
pipeline_random_samp = PoolPipeline(
    pooler_callable,
    weber_filter, # Filter that selects conditions with IV1 <= IV2
    sampler,
)

The pipeline can be run by calling the `run` method.

The pipeline is run twice below to illustrate that random sampling is performed. Rerunning the cell will produce different results.


In [13]:
# Run the Pipeline
results1 = pipeline_random_samp.run()
results2 = pipeline_random_samp.run()
print('Sampled Conditions:')
print(f' Run 1: {results1}\n',
      f'Run 2: {results2}')

Sampled Conditions:
 Run 1: [(0.0, 0.0), (0.0, 1.25), (0.0, 2.5), (0.0, 3.75), (1.25, 5.0), (1.25, 2.5), (1.25, 1.25), (3.75, 5.0), (2.5, 2.5), (5.0, 5.0)]
 Run 2: [(2.5, 5.0), (0.0, 0.0), (1.25, 5.0), (0.0, 2.5), (0.0, 3.75), (1.25, 1.25), (2.5, 3.75), (0.0, 1.25), (3.75, 5.0), (0.0, 5.0)]


An alternative is to pass an already instantiated pool iterator, as demonstrated below. Here `gridsearch_pool` is not wrapped with `partial`; instead, it is called before the `PoolPipeline` is initialized. `gridsearch_pool` returns an iterator over the exhaustive pool, which leads to unexpected behavior when the pipeline is run multiple times.

In [15]:
## Set up the pool as an instantiated iterator (without partial)
# Pool Function
pooler_iterator = gridsearch_pool(metadata.independent_variables)

# Initialize the pipeline
pipeline_random_samp2 = PoolPipeline(
    pooler_iterator,
    weber_filter, # Filter that selects conditions with IV1 <= IV2
    sampler, # Sampler defined in the first implementation example
)
# Run the Pipeline
results1 = pipeline_random_samp2.run()
results2 = pipeline_random_samp2.run()
print('Sampled Conditions:')
print(f' Run 1: {results1}\n',
      f'Run 2: {results2}')


Sampled Conditions:
 Run 1: [(2.5, 3.75), (3.75, 3.75), (1.25, 1.25), (1.25, 3.75), (0.0, 5.0), (3.75, 5.0), (0.0, 3.75), (5.0, 5.0), (1.25, 2.5), (1.25, 5.0)]
 Run 2: []


The second run returns an empty list. This is because the iterator is exhausted after the first run and no longer yields results. If the pipeline needs to be run multiple times, pass the pool function as a callable created with `partial`, so that a fresh iterator is created at the start of each run.
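
The same behavior can be reproduced with standard Python iterators, independent of `autora`. The sketch below uses `itertools.product` as a stand-in for the pool function, which is an assumption made for illustration.

```python
from functools import partial
from itertools import product

values = [0, 1, 2]

# An instantiated iterator can only be consumed once.
pool_iterator = product(values, values)
first_pass = list(pool_iterator)
second_pass = list(pool_iterator)
print(len(first_pass), len(second_pass))  # 9 0

# A callable builds a fresh iterator every time it is called.
pool_callable = partial(product, values, values)
first_call = list(pool_callable())
second_call = list(pool_callable())
print(len(first_call), len(second_call))  # 9 9
```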


## Example 2: Exhaustive Pool with Uncertainty Sampler
The next example will sample a pool with uncertainty sampling. Uncertainty sampling requires a model that returns probabilities of class predictions. We will use synthetic data from a Weber experiment to train a simple logistic regression model.


In [20]:
# Load the data
datafile_path = pathlib.Path(os.path.abspath("")).parent.parent.joinpath("example/sklearn/darts/weber_data.csv")
data = pd.read_csv(datafile_path)
X = data[["S1", "S2"]]
y = data["difference_detected"] # Probability that a difference is detected
y_classified = np.where(y >= .5, 1, 0)

# Train logistic regression model
logireg_model = LogisticRegression()
logireg_model.fit(X, y_classified)

The implementation is the same as in Example 1, except that we change the sampling function. The `uncertainty_sampler` is passed the logistic regression model, and additional keyword arguments specify that 10 conditions are sampled using the 'least_confident' measure. The uncertainty sampler uses the [alipy QueryInstanceUncertainty](http://parnec.nuaa.edu.cn/_upload/tpl/02/db/731/template731/pages/huangsj/alipy/page_reference/api_classes/api_query_strategy.query_labels.QueryInstanceUncertainty.html) class to select the samples with the greatest uncertainty.
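
For intuition, the least-confident measure is commonly defined as one minus the probability of the most likely class, so conditions whose predictions are closest to 50/50 score highest. The sketch below implements that textbook definition in plain Python. The function names and the probabilities are hypothetical, made up for illustration, and the alipy implementation used by `uncertainty_sampler` may differ in its details.

```python
def least_confident_scores(probabilities):
    """Uncertainty per sample: 1 - (probability of the most likely class)."""
    return [1.0 - max(row) for row in probabilities]


def select_most_uncertain(conditions, probabilities, n):
    """Return the n conditions with the highest least-confident score."""
    scores = least_confident_scores(probabilities)
    ranked = sorted(range(len(conditions)), key=lambda i: scores[i], reverse=True)
    return [conditions[i] for i in ranked[:n]]


# Made-up class probabilities for four conditions (not model output).
conditions = [(0.0, 0.0), (0.0, 5.0), (2.5, 3.75), (1.25, 2.5)]
probabilities = [[0.95, 0.05], [0.02, 0.98], [0.55, 0.45], [0.70, 0.30]]

selected = select_most_uncertain(conditions, probabilities, n=2)
print(selected)  # [(2.5, 3.75), (1.25, 2.5)]
```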

In [21]:
# Set up pipeline functions with the partial function
# Pool Function
pooler_callable = partial(gridsearch_pool, ivs=metadata.independent_variables)
# Uncertainty Sampler
sampler = partial(uncertainty_sampler, model=logireg_model, n=10, measure='least_confident')

# Initialize pipeline
pipeline_uncertainty_samp = PoolPipeline(
    pooler_callable,
    weber_filter, # Filter that selects conditions with IV1 <= IV2
    sampler,
)

In [25]:
# Run the Pipeline
results1 = pipeline_uncertainty_samp.run()
print('Sampled Conditions:')
print(results1)

Sampled Conditions:
[[1.25 5.  ]
 [3.75 5.  ]
 [2.5  2.5 ]
 [0.   1.25]
 [1.25 3.75]
 [2.5  3.75]
 [1.25 1.25]
 [0.   0.  ]
 [1.25 2.5 ]
 [2.5  5.  ]]




# Writing custom functions
