In [1]:
from citrination_client import CitrinationClient
from os import environ

from citrination_client import PifSystemQuery, PifSystemReturningQuery
from citrination_client import FieldQuery, ValueQuery, NameQuery
from citrination_client import PropertyQuery, DataQuery, DatasetQuery, ChemicalFieldQuery, Filter
client = CitrinationClient(environ['CITRINATION_API_KEY'], 'https://citrination.com')

# Experimental design with Citrination

In this notebook, we demonstrate how to use Citrination to select experiments (or simulations, or literature searches) to maximize the impact on an optimization problem.

## Case study: thermoelectrics

Given a list of thermoelectric candidates, how many experiments do we need to find the best one?

Here, _best_ is measured by the thermoelectric figure of merit, ZT:
$$ ZT = \frac{\sigma S^2 T}{\lambda}, $$
where $\sigma$ is the electrical conductivity, $S$ is the Seebeck coefficient, $T$ is the temperature, and $\lambda$ is the thermal conductivity. A higher ZT is better.
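
As a quick sanity check on the formula, the sketch below evaluates ZT for one set of transport values. The function name `figure_of_merit` and the input numbers are illustrative assumptions (round values chosen for easy arithmetic), not entries from the dataset.

```python
def figure_of_merit(sigma, seebeck, temperature, kappa):
    """Return the dimensionless figure of merit ZT = sigma * S^2 * T / lambda.

    sigma:       electrical conductivity in S/m
    seebeck:     Seebeck coefficient in V/K
    temperature: absolute temperature in K
    kappa:       thermal conductivity (lambda in the text) in W/(m K)
    """
    return sigma * seebeck ** 2 * temperature / kappa

# Illustrative values only (assumed, not taken from the dataset).
zt = figure_of_merit(sigma=1.0e5, seebeck=200e-6, temperature=300.0, kappa=1.5)
print("ZT = {:.2f}".format(zt))  # prints "ZT = 0.80"
```

Because the Seebeck coefficient enters squared, doubling it quadruples ZT, while doubling the thermal conductivity halves it.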

# Getting the data

We begin by querying for the thermoelectric dataset, which is on Citrination [here](https://citrination.com/datasets/150888).  From each record we want the chemical formula and the ZT value.

In [2]:
dataset_id = 150888

system_query = PifSystemQuery(
    chemical_formula=ChemicalFieldQuery(
        extract_as="formula"
    ),
    properties=PropertyQuery(
        name=FieldQuery(
            filter=[Filter(equal="ZT")]
        ),
        value=FieldQuery(
            extract_as="ZT"
        )
    )
)
thermoelectric_query = PifSystemReturningQuery(
    random_seed=0,
    query=DataQuery(
        # Reuse dataset_id defined above instead of repeating the literal
        dataset=DatasetQuery(
            id=[Filter(equal=str(dataset_id))]
        ),
        system=system_query
    )
)

Let's run it and see what we get:

In [3]:
search_result = client.search.pif_search(thermoelectric_query)
print("We found {} records".format(search_result.total_num_hits))
print([x.extracted for x in search_result.hits[0:2]])

We found 165 records
[{'formula': 'La0.98Sr0.02CoO3', 'ZT': '0.026234732'}, {'formula': 'Zr0.4Hf0.4Ti0.2NiSn', 'ZT': '0.030958179'}]


This dataset already contains every ZT value, so we hide most of them to stand in for experiments that have not been run yet:

In [4]:
from random import shuffle, seed
seed(1)
full_data = [x.extracted.copy() for x in search_result.hits]
shuffle(full_data)
known_subset = full_data[:20]
unknown_subset = full_data[20:]
for x in unknown_subset: 
    del x['ZT']

Our goal is to pick the best material to measure next from `unknown_subset`.

# Training a model on known data

We train a model using the csv -> dataset -> data_view workflow described in the [modeling tutorial](https://github.com/CitrineInformatics/learn-citrination/blob/master/MLonCitrination.ipynb).

## Create a csv

The csv needs headers that conform to our [CSV template](http://help.citrination.com/knowledgebase/articles/1188136-citrine-template-csv-csv).

In [5]:
def write_csv(name, rows):
    with open(name, "w") as f:
        f.write("FORMULA, PROPERTY: ZT \n")
        for row in rows:
            f.write("{formula:s}, {ZT:s}\n".format(**row))
write_csv('known_thermoelectrics.csv', known_subset)
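
Hand-formatted rows break if a value ever contains a comma. A minimal alternative sketch using Python's standard `csv` module, which quotes such fields automatically, is shown below. The function name `write_csv_quoted` is hypothetical, and the header is written without the padding spaces used above on the assumption that the ingester accepts it; check the template documentation before relying on that.

```python
import csv
import io

def write_csv_quoted(handle, rows):
    """Write formula/ZT rows to an open text handle, letting csv handle quoting."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["FORMULA", "PROPERTY: ZT"])
    for row in rows:
        writer.writerow([row["formula"], row["ZT"]])

# Two records copied from the search output above; in the notebook
# this list would be `known_subset`.
sample_rows = [
    {"formula": "La0.98Sr0.02CoO3", "ZT": "0.026234732"},
    {"formula": "Zr0.4Hf0.4Ti0.2NiSn", "ZT": "0.030958179"},
]

# Write to an in-memory buffer so the result can be inspected directly.
buffer = io.StringIO()
write_csv_quoted(buffer, sample_rows)
print(buffer.getvalue())
```

To write a file instead, pass a handle opened with `open('known_thermoelectrics.csv', 'w', newline='')`.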

The rest of the model building process is on the website:
 1. Go to the [Add Datasets](https://citrination.com/add_data) page and upload `known_thermoelectrics.csv` using the `Citrine: Template CSV` ingester from the drop-down menu.
 1. Go to the [data views page](https://citrination.com/data_views) and click "Create new data view".
 1. Search for the property name "ZT" and select the dataset you just created.  Advance with the "NEXT >" button in the top right corner.
 1. Follow the guide to create a data view that has `formula` as an input and `ZT` as an output.

# Apply the model to unknown data

First, we'll use the trained model to make predictions via the API.  Change the `view_id` below to point to your view.

In [9]:
view_id = "3904"

inputs = [{"formula": "TiO2"}]
prediction = client.models.predict(view_id, inputs, method="scalar")

print(inputs[0]['formula'])
zt_prediction = prediction[0].get_value('Property ZT')  # exposes .value (estimate) and .loss (uncertainty)
print("We predict the ZT of {} (@ 300K) to be {} +/- {}".format(inputs[0]['formula'], zt_prediction.value, zt_prediction.loss))


## Maximum likelihood of improvement

This tutorial is about _experimental design_, so we need to pick a criterion for experimental selection.

There are many, but a straightforward and powerful one is "maximum likelihood of improvement": pick the candidate with the highest probability of exceeding the best value seen so far.  This probability is easy to compute if we assume the output distribution is normal.
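
For reference, the expression follows directly from the normality assumption. Suppose the prediction for a candidate is $Y \sim \mathcal{N}(\mu, s^2)$, where $\mu$ is the predicted mean, $s$ is the predicted standard deviation (written $s$ here to avoid confusion with the conductivity $\sigma$ above; it is the `sigma` argument in the code), and $b$ is the baseline. With $\Phi$ the standard normal cumulative distribution function, $\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(x/\sqrt{2}\right)\right]$, the probability of beating the baseline is

$$ P(Y > b) = 1 - \Phi\left(\frac{b - \mu}{s}\right) = \Phi\left(\frac{\mu - b}{s}\right) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{\mu - b}{s\sqrt{2}}\right)\right]. $$

When $\mu = b$ the probability is exactly 50%; when $\mu < b$ it is below 50% and grows as $s$ increases.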

In [10]:
from scipy.special import erf
from math import sqrt

def probability_improvement(mean, sigma, baseline):
    return float(0.5 * (1.0 + erf((mean - baseline) / (sigma * sqrt(2.0)))))

What is the baseline?  The largest value in the known data:

In [11]:
baseline_ZT = max(float(x['ZT']) for x in known_subset)
print("The highest ZT value in the known subset is {}".format(baseline_ZT))

The highest ZT value in the known subset is 0.424000225


Now let's screen the unknown materials for likelihood of improvement:

In [36]:
predictions = client.models.predict(view_id, unknown_subset)
for p in predictions:
    p_value = p.get_value('Property ZT')
    p.add_value('LI', probability_improvement(float(p_value.value), float(p_value.loss), baseline_ZT))

Pandas can help us look at the result.  The top values for "LI" are the ones we should try next.

In [46]:
import pandas as pd

top_predictions = []
for p in predictions:
    li, form, zt = p.get_value('LI'), p.get_value('formula'), p.get_value('Property ZT')
    if li > 1e-03:
        top_predictions.append((form.value, [zt.value, zt.loss], li))
        
df = pd.DataFrame(top_predictions, columns=['formula', 'Property ZT', 'LI'])
print(df)
df['Property ZT'] = df['Property ZT'].map(lambda x: "{:5.2f} +/- {:5.2f}".format(*x))
df['LI'] = df['LI'].map(lambda x: "{:5.3f}".format(x))

df.sort_values('LI', axis=0, ascending=False)[0:5]


            formula                                 Property ZT        LI
0             WO2.9  [0.15556628550000007, 0.11014407093559876]  0.007402
1         CeFe4Sb12  [0.20955596754687503, 0.10286526296804725]  0.018548
2           WO2.722  [0.15561337937500003, 0.10955761375265435]  0.007148
3     In0.25Co4Sb12    [0.29919832456250023, 0.125702231635684]  0.160395
4       LaFe3CoSb12          [0.195377252, 0.09730643457811816]  0.009399
5      In0.3Co4Sb12    [0.29919832456250023, 0.125702231635684]  0.160395
6      In0.1Co4Sb12      [0.2129342765625, 0.07349687579251576]  0.002041
7  CeFe3.5Co0.5Sb12  [0.20955596754687503, 0.10286526296804725]  0.018548
8       CeFe3CoSb12  [0.20955596754687503, 0.10286526296804725]  0.018548
9      In0.2Co4Sb12    [0.29919832456250023, 0.125702231635684]  0.160395


            formula    Property ZT     LI
3     In0.25Co4Sb12  0.30 +/- 0.13   0.16
5      In0.3Co4Sb12  0.30 +/- 0.13   0.16
9      In0.2Co4Sb12  0.30 +/- 0.13   0.16
1         CeFe4Sb12  0.21 +/- 0.10  0.019
7  CeFe3.5Co0.5Sb12  0.21 +/- 0.10  0.019


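Every one of these candidates has a likelihood of improvement below 50%, which means its predicted ZT lies below the highest value in the known subset.  In that regime a larger predicted uncertainty raises the likelihood of improvement, so the ranking favors materials with high model uncertainty as well as high predicted ZT.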
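
The effect of uncertainty on the ranking can be checked numerically. The sketch below re-implements `probability_improvement` with the standard library's `math.erf` so that it runs on its own, reproduces the top-ranked value from the table, and then evaluates hypothetical candidates that share one predicted mean below the baseline. The mean and uncertainties used in the loop are made up for illustration.

```python
from math import erf, sqrt

def probability_improvement(mean, sigma, baseline):
    """P(value > baseline) for a normal prediction with the given mean and std."""
    return 0.5 * (1.0 + erf((mean - baseline) / (sigma * sqrt(2.0))))

baseline = 0.424000225  # highest ZT in the known subset (from the output above)

# Reproduce the top-ranked row, In0.25Co4Sb12, from its predicted mean and uncertainty.
li_top = probability_improvement(0.29919832456250023, 0.125702231635684, baseline)
print("LI = {:.3f}".format(li_top))  # prints "LI = 0.160"

# Hypothetical candidates: same predicted mean, increasing uncertainty.
lis = []
for sigma in (0.05, 0.10, 0.20):
    li = probability_improvement(0.30, sigma, baseline)
    lis.append(li)
    print("sigma = {:.2f} -> LI = {:.3f}".format(sigma, li))
```

With the mean held fixed below the baseline, each increase in `sigma` yields a larger likelihood of improvement, which is why the criterion also rewards uncertain predictions.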