![logo](https://github.com/CitrineInformatics/community-tools/blob/master/templates/fig/citrine_banner_2.png?raw=true)


# Sequential Learning Workshop
*Authors: Edward Kim, Enze Chen, Nils Persson*

In this notebook, we will cover how to perform **sequential learning** (SL) using the [Citrination API](http://citrineinformatics.github.io/python-citrination-client/). [Sequential learning](https://citrine.io/platform/sequential-learning/) is the key workflow that lets machine learning algorithms and in-lab experiments iteratively inform each other.

To replace the need for an actual laboratory or simulation, this demo uses an existing dataset from the Open Citrination platform, with measurements of *steel fatigue strength across 437 experiments spanning 23 processing and formulation variables*. 

To simulate this experiment, we will redact the output measurement (Fatigue Strength) from all but 25 random experiments from the bottom quartile of performance. Each new experiment will be selected from the list of 412 other "unmeasured" points using the Citrination platform's design algorithm, with the goal of *maximizing Fatigue Strength*.

## Table of contents<a name="toc"></a>
1. [Setup](#1)
1. [Get Training Data](#2)
1. [Initial Measurements](#3)
1. [Run Sequential Learning](#4)
    1. [Design](#4.1)
    1. [Measure (and re-train)](#4.2)
    1. [Repeat](#4.3)
1. [Conclusion](#5)

## 1. Setup<a name="1"></a>
---
[Back to TOC](#toc)

This notebook uses some convenience functions to wrap several API endpoints. These are contained in the file `steel_fatigue_wrapper_class.py` and imported below. Review the docstrings and code in that file to learn more.

In [None]:
# IPython magic settings
%matplotlib inline
%load_ext autoreload
%autoreload 2

# Local helper module (the star import is also relied on below for os, np, pd, plt, pif, uuid4, and CitrinationClient)
from steel_fatigue_wrapper_class import *   # Helper functions to wrap several API endpoints together

### Initialize the CitrinationClient

Initializing a `CitrinationClient` requires two arguments, `api_key` and `site`.

If the following cell runs successfully, you will see `Client created successfully!`

In [None]:
# Initialize the CitrinationClient with your API key and deployment
site = "https://citrination.com"
client = CitrinationClient(api_key=os.environ.get('CITRINATION_API_KEY'),
                           site=site)
verify_client(client)
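
The cell above reads the API key from the `CITRINATION_API_KEY` environment variable. Because `os.environ.get` returns `None` when the variable is unset, a missing key may only surface later as an authentication error. A minimal sketch of an early check is shown here; `get_api_key` is a hypothetical helper, not part of the Citrination client or the wrapper module.

```python
import os


def get_api_key(var_name="CITRINATION_API_KEY"):
    """Return the API key stored in the environment, or fail with a clear message.

    Hypothetical helper for illustration; not part of the Citrination client.
    """
    api_key = os.environ.get(var_name)
    if not api_key:
        raise RuntimeError(
            "Environment variable {} is not set; export your API key first.".format(var_name)
        )
    return api_key


# Possible usage with the client from the cell above (left commented out here):
# client = CitrinationClient(api_key=get_api_key(), site=site)
```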

## 2. Get Training Data<a name="2"></a>
---
[Back to TOC](#toc)

Since we don't have access to an actual experiment or simulation, we will use data from an existing public dataset on steel fatigue.

In [None]:
pd.set_option('display.max_columns', 500)
orig_dataset_id = 150670    
df_steel = get_steel_dataset(client, orig_dataset_id)
ordered_cols = df_steel.columns.to_list()
print("{} entries spanning {} dimensions.".format(df_steel.shape[0], df_steel.shape[1]-5))
df_steel.sample(4)

### Plot histogram of Fatigue Strength values

In [None]:
plt.rcParams.update({'figure.figsize':(8, 7), 'font.size':14, 'lines.markersize':8})
df_steel['Fatigue Strength'].hist(bins=20)
plt.xlabel('Fatigue Strength (MPa)')
plt.ylabel('Number of Entries')
plt.show()

### Analyze simple statistics

In [None]:
df_steel['Fatigue Strength'].describe()

### Generate Training Set

We will select 25 points from the bottom 25% of the dataset (in terms of Fatigue Strength) to simulate an initial experimental design space. We will have access to the Fatigue Strength of these 25 training points, but it will be redacted from the remaining 412. Thus, our initial model will be constructed on these below-average candidates.

To "measure" a new candidate, we simply look up its Fatigue Strength from the original dataset. This process (the splitting and the "measurement") uses the functionality of our `SearchClient` under the hood.
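
To make the split concrete, here is a minimal, standard-library sketch of the idea on synthetic numbers. It is not the implementation of `split_dataset`; the function name `split_bottom_quartile`, the rank-based quartile, and the randomly generated strengths are all assumptions made for illustration.

```python
import random


def split_bottom_quartile(values, num_train=25, seed=0):
    """Keep num_train random values from the lowest quarter; redact all others.

    Hypothetical helper: returns the set of kept indices and a copy of
    `values` in which every redacted ("unmeasured") entry is None.
    """
    # Indices sorted from the weakest to the strongest value
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    bottom_quartile = ranked[: len(values) // 4]
    rng = random.Random(seed)
    train_idx = set(rng.sample(bottom_quartile, num_train))
    redacted = [v if i in train_idx else None for i, v in enumerate(values)]
    return train_idx, redacted


# Synthetic stand-in for the 437 Fatigue Strength values (made-up numbers)
data_rng = random.Random(42)
toy_strengths = [data_rng.gauss(550.0, 150.0) for _ in range(437)]

train_idx, redacted = split_bottom_quartile(toy_strengths)
print(len(train_idx), "measured,", redacted.count(None), "redacted")
```

With 437 entries and `num_train=25`, this leaves 25 measured values and 412 redacted ones, matching the counts described above.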

In [None]:
# Set a cutoff value for Fatigue Strength
target_col = 'Fatigue Strength'
target_max = np.percentile(df_steel[target_col], 25) # 25th percentile (bottom quartile) of fatigue strength

# Split and redact original dataset
all_pifs = split_dataset(client,
                         orig_dataset_id,
                         target_col,
                         target_max,
                         num_train=25)

## 3. Initial Measurements<a name="3"></a>
---
[Back to TOC](#toc)

We'll now write our initial training data to a JSON file and upload it to Citrination using our `client`. This involves creating a new Dataset, then defining a DataView to run predict and design services.

**Start from here if you want to start over with a new SL run**.

In [None]:
random_string = str(uuid4())[:6]
meas_dataset_name = "SL_demo_dataset_{}".format(random_string)

# Write to file
os.makedirs('temp', exist_ok=True)
dataset_file = os.path.join("temp", meas_dataset_name+".json")
with open(dataset_file, "w") as f:
    f.write(pif.dumps(all_pifs, indent=4))

# Upload to Citrination
dataset_id = upload_data_and_get_id(client,
                                    meas_dataset_name,
                                    dataset_file,
                                    create_new_version=True)

print("Dataset created: {}/datasets/{}".format(site, dataset_id))
print('The name is "{}."'.format(meas_dataset_name))

### Create a DataView

We now create a DataView to model our initial training data and run design services. We will select the `chemical formula` and the processing variables as inputs, and set the `Fatigue Strength` as an output.

In [None]:
search_template_client = client.data_views.search_template_client
avail_cols = search_template_client.get_available_columns(dataset_id)

excluded_cols = \
    [col for col in avail_cols if ('Area Proportion' in col
                                   or 'Reduction Ratio' in col
                                   or 'composition' in col
                                   or 'Fatigue Strength' in col
                                   or 'Sample Number' in col)]

input_cols = [col for col in avail_cols if col not in excluded_cols]

print('Inputs:\n{}'.format(input_cols))
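
The substring-based exclusion in the cell above can also be written as a small reusable function. The sketch below is an equivalent formulation using `any()`; the function name `select_inputs` and the example column names are made up for illustration and are not the actual columns returned by `get_available_columns`.

```python
EXCLUDED_SUBSTRINGS = (
    "Area Proportion",
    "Reduction Ratio",
    "composition",
    "Fatigue Strength",
    "Sample Number",
)


def select_inputs(columns, excluded=EXCLUDED_SUBSTRINGS):
    """Return the columns that contain none of the excluded substrings."""
    return [col for col in columns if not any(sub in col for sub in excluded)]


# Hypothetical column names, for illustration only
example_cols = [
    "formula",
    "Property Fatigue Strength",
    "Property Sample Number",
    "Property Tempering Temperature",
    "Property Reduction Ratio",
]
print(select_inputs(example_cols))
```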

In [None]:
# Make a data view on Citrination, return/print the ID

view_name = "SL_demo_view_{}".format(random_string)

view_id = build_view_and_get_id(client,
                                dataset_id,
                                view_name,
                                view_desc='DataView for SL demo for Fatigue Strength.',
                                input_keys=input_cols,
                                output_keys=['Property Fatigue Strength',
                                             'Property Sample Number'],
                                model_type='default')

print("Data view created: {}/data_views/{}".format(site, view_id))
print('The name is "{}."'.format(view_name))

While model training proceeds, we can explore the DataView we just created.

## 4. Run Sequential Learning<a name="4"></a>
---
[Back to TOC](#toc)

We're now ready to run sequential learning (SL). SL consists of three main phases, repeated in a loop:

- **design**: generate new candidates to test in the lab (or *in silico*)
- **measure**: test those new candidates and add the results to your dataset
- **retrain**: re-train the machine learning model using the new measurements

That's really all there is to it! We will manage the entire sequential learning process through an object called an `SL_run`, which has methods to run each of these steps: `.design()` for the design phase and `.measure()` for the measure and re-train phases. Let's instantiate one of those right now.

In [None]:
meas_cols = ordered_cols+['iter']
SLR = SL_run(client=client,
    view_id=str(view_id),
    dataset_id=str(dataset_id),
    orig_dataset_id=orig_dataset_id,
    all_dataset_cols=meas_cols,
    target=["Property Fatigue Strength", "Max"],
    score_type="MLI",
    sampler='This view')

SLR.measurements[meas_cols].head(3)
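
Before running each phase against the platform, it may help to see the design, measure, and retrain cycle in miniature. The sketch below is a toy, standard-library loop on a one-dimensional problem. It does not use `SL_run` or the Citrination design algorithm: `hidden_lab`, `predict`, and the optimistic "predicted value plus uncertainty" score are all assumptions chosen to keep the example short.

```python
import random


def hidden_lab(x):
    """Stand-in for a real experiment: the true response, unknown to the model."""
    return -(x - 0.7) ** 2


def predict(x, measured):
    """Toy surrogate: copy the nearest measured value; uncertainty grows with distance."""
    nearest_x = min(measured, key=lambda m: abs(m - x))
    return measured[nearest_x], abs(nearest_x - x)


rng = random.Random(0)
pool = [i / 100 for i in range(101)]  # every candidate input

# Start from a handful of poor performers, as in this demo
measured = {x: hidden_lab(x) for x in rng.sample(pool[:25], 5)}
history = [max(measured.values())]  # best value found so far

for _ in range(10):
    unmeasured = [x for x in pool if x not in measured]

    # Design: favor candidates that are predicted to be good or are highly uncertain
    def score(x):
        mean, uncertainty = predict(x, measured)
        return mean + uncertainty

    candidate = max(unmeasured, key=score)

    # Measure: "run the experiment" by looking up the hidden response
    measured[candidate] = hidden_lab(candidate)

    # Retrain: implicit here, because predict() always reads the latest measurements
    history.append(max(measured.values()))

print("best at start:", history[0], "best at end:", history[-1])
```

The toy surrogate has no training step of its own, so "retrain" reduces to using the updated measurements; a real model would be refit at that point.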

### 4.1 Design<a name="4.1"></a>

In [None]:
design_effort = 10 # An integer between 1 and 30
SLR.design(design_effort=design_effort)
cand_cols = input_cols + ['Property Fatigue Strength', 
                          'Uncertainty in Property Fatigue Strength', 
                          'citrine_score']
SLR.candidates[cand_cols]
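
The candidate table pairs each predicted Fatigue Strength with an uncertainty and a score. As a rough illustration of how a prediction and its uncertainty can combine into one number, the sketch below computes a likelihood of improvement, assuming that `"MLI"` stands for maximum likelihood of improvement and that each prediction is Gaussian. This is not necessarily how the platform computes `citrine_score`, and the candidate values are invented.

```python
import math


def likelihood_of_improvement(mean, std, best_so_far):
    """Probability that Y > best_so_far when Y ~ Normal(mean, std**2)."""
    if std <= 0:
        return 1.0 if mean > best_so_far else 0.0
    z = (mean - best_so_far) / std
    # Standard normal CDF evaluated at z, written with the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical candidates: (predicted Fatigue Strength, uncertainty), both in MPa
candidates = {"A": (500.0, 10.0), "B": (480.0, 60.0), "C": (450.0, 5.0)}
best_measured = 510.0  # hypothetical best measurement so far, in MPa

scores = {
    name: likelihood_of_improvement(mean, std, best_measured)
    for name, (mean, std) in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this made-up example, candidate B ranks first even though A has the higher predicted value, because B's larger uncertainty leaves more probability of beating the current best.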

### 4.2 Measure (and re-train)<a name="4.2"></a>

In [None]:
SLR.measure()
SLR.measurements[meas_cols].tail(3)

### 4.3 Repeat<a name="4.3"></a>
---
[Back to TOC](#toc)

From here on out, we can repeat this process as many times as we want (and ultimately run until it converges to a specified tolerance). Each repetition is a `.design()` and `.measure()` cycle. To wrap things up, we'll run a loop of a few more iterations.

In [None]:
repeat_iters = 6
for i in range(repeat_iters):
    SLR.design(design_effort=design_effort)
    SLR.measure()
    SLR.plot_sl_results();

In [None]:
# Inspect the new data in a DataFrame
SLR.measurements[meas_cols].tail(repeat_iters+3)
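
The loop above runs a fixed number of iterations. One possible way to stop at a specified tolerance instead is to track the best value found after each iteration and stop once it has stalled. The sketch below assumes such a list is available; `has_converged`, `tol`, `patience`, and the sample values are hypothetical and not part of `SL_run`.

```python
def has_converged(best_so_far, tol=1.0, patience=3):
    """True when the best value improved by less than tol over the last `patience` iterations.

    best_so_far: running maximum of the target after each iteration.
    """
    if len(best_so_far) <= patience:
        return False  # not enough history to judge yet
    return best_so_far[-1] - best_so_far[-1 - patience] < tol


# Hypothetical running-best Fatigue Strength values (MPa), one per iteration
best_history = [400.0, 455.0, 610.0, 610.0, 610.5, 610.5]

print(has_converged(best_history[:4]))  # still improving quickly
print(has_converged(best_history))      # improvement has stalled
```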

## 5. Conclusion<a name="5"></a>
---
[Back to TOC](#toc)

After running this demo, you should have a sense of the steps involved in a sequential learning cycle, namely that it consists of **design** and **measure** phases. Design runs on the Citrination platform: it trains a model to fit your existing data and returns candidates to measure. Measure is the phase where you (the experimentalist or computationalist) go and run your experiment!

A few key takeaways from this demo:
* Building a model on Citrination is as easy as defining inputs, outputs, and latent variables.
* Design runs return candidates based on predicted output *and* uncertainty.
* Well-calibrated prediction uncertainties are vital to this process.