In [1]:
from citrination_client import CitrinationClient
from os import environ
from citrination_client import PifQuery, SystemQuery, PropertyQuery, ChemicalFieldOperation, FieldOperation, Filter
client = CitrinationClient(environ['CITRINATION_API_KEY'], 'https://citrination.com')

# Machine learning on Citrination

Citrination will automatically generate machine learning models when given sufficient metadata:
 1. A list of records (pifs)
 1. Identification of columns as inputs or outputs
 1. [Implicit] consistency of unlisted conditions

## CSV to Models

User-defined machine learning is exposed via the "csv2models" tool:
 1. Put data in rows
 1. Label columns
 1. ...
 1. Models!
 
In this tutorial, we'll generate a valid CSV from a query.  You can also use any CSV you might have by setting the column names.

## Example: density from Materials Project

We'll train a model that maps chemical formula to density, using [data](https://citrination.com/search/simple?property=density&includedDatasets=150675) from the [Materials Project](https://materialsproject.org/).

Let's start with a simple query that extracts the density along with the formula:

In [2]:
system_query = SystemQuery(
    chemical_formula=ChemicalFieldOperation(
        extract_as="formula"
    ),
    properties=PropertyQuery(
        name=FieldOperation(
            filter=[Filter(equal="density")]
        ),
        value=FieldOperation(
            extract_as="density"
        )
    )
)

Materials Project is big, so we'll pull out just 500 records for now.  If we don't draw them randomly, they'll all be `Al` and `As` and `Cs`.

In [3]:
test_query = PifQuery(
    include_datasets=[150675],
    size=500,
    random_results=True,
    system=system_query
)

Let's see what we've got:

In [4]:
search_result = client.search(test_query)
print("We found {} records".format(search_result.total_num_hits))
print([x.extracted for x in search_result.hits[0:2]])

We found 52265 records
[{'density': '7.176336140014397', 'formula': 'Bi2Se3'}, {'density': '15.843994143005123', 'formula': 'Pu5Pt3'}]
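
Note that the extracted values come back as strings, not numbers. A minimal sketch of converting them for local use, run against the two hits printed above rather than a live query (the helper name `to_typed` is hypothetical):

```python
# Sample of `hit.extracted` values, copied from the search output above.
rows = [
    {"density": "7.176336140014397", "formula": "Bi2Se3"},
    {"density": "15.843994143005123", "formula": "Pu5Pt3"},
]


def to_typed(row):
    """Return (formula, density) with the density parsed as a float."""
    return row["formula"], float(row["density"])


typed_rows = [to_typed(r) for r in rows]

# With real numbers, comparisons behave numerically instead of lexically.
densest = max(typed_rows, key=lambda pair: pair[1])
print(densest[0])
```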


Now we just need to format the results as a CSV with `INPUT:<name>` and `OUTPUT:<name>` headers.  Note: don't use any spaces (sorry!).

In [5]:
def write_csv(name, rows):
    with open(name, "w") as f:
        f.write("INPUT:CHEMICAL_FORMULA,OUTPUT:Density-g/cm3\n")
        for row in rows:
            f.write("{formula:s}, {density:s}\n".format(**row))
write_csv('density.csv', [x.extracted for x in search_result.hits])

Upload that CSV to the [models page](https://citrination.com/models/).
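
The `write_csv` helper above builds each line by hand. As an alternative, here is a minimal sketch that uses the standard `csv` module and rejects headers containing spaces. The function name `rows_to_csv` and its parameters are hypothetical, and unlike the helper above it writes no space after the comma; whether the upload tool cares about that space is not verified here.

```python
import csv
import io


def rows_to_csv(rows, input_name="CHEMICAL_FORMULA", output_name="Density-g/cm3"):
    """Serialize extracted rows to CSV text with INPUT:/OUTPUT: headers."""
    headers = ["INPUT:" + input_name, "OUTPUT:" + output_name]
    for header in headers:
        # The tutorial asks for headers without spaces, so fail early.
        if " " in header:
            raise ValueError("column name contains a space: {!r}".format(header))
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers)
    for row in rows:
        writer.writerow([row["formula"], row["density"]])
    return buffer.getvalue()


# One row copied from the search output earlier in this tutorial.
sample = [{"density": "7.176336140014397", "formula": "Bi2Se3"}]
print(rows_to_csv(sample))
```

Writing the result to disk is then a single `open(name, "w").write(...)` call, and the `csv` writer takes care of quoting if a value ever contains a comma.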

## Data science

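We can do better than that! The query below adds a second property filter that keeps only records whose `Energy above convex hull` is at most `1.0e-9`, that is, compounds that are effectively on the hull (stable).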

In [6]:
stable_query = SystemQuery(
    chemical_formula=ChemicalFieldOperation(
        extract_as="formula"
    ),
    properties=[
        PropertyQuery(
            name=FieldOperation(
                filter=[Filter(equal="density")]
            ),
            value=FieldOperation(
                extract_as="density"
            ),
            logic="MUST"
        ),
        PropertyQuery(
            name=FieldOperation(
                filter=[Filter(equal="Energy above convex hull")]
            ),
            value=FieldOperation(
                filter=[Filter(max=1.0e-9)]
            ),
            logic="MUST"
        )
    ]
)

In [7]:
better_query = PifQuery(
    include_datasets=[150675],
    random_results=True,
    size=500,
    system=stable_query
)
better_result = client.search(better_query)
print("We found {} records".format(better_result.total_num_hits))
write_csv('better_density.csv', [x.extracted for x in better_result.hits])

We found 31512 records


## Applying the model

We can use the model to make predictions through the client.  The `predict` method expects the name of the model and a list of inputs, where each input is a map from property names to property values.

The result is a dictionary with a `candidates` member that is a list of maps from property names to values.  However, the values here are pairs of the form `(expected value, uncertainty)`.

In [8]:
inputs = [{"CHEMICAL_FORMULA": "AlCu"},]
resp = client.predict("betterdensitydemo", inputs)
prediction = resp['candidates'][0]['Density']
print("We predict the density of {} to be {} +/- {}".format(inputs[0]['CHEMICAL_FORMULA'], *prediction))

We predict the density of AlCu to be 5.786268548675726 +/- 1.264895144698526
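
To make the `(expected value, uncertainty)` pair easier to read, it can be turned into a range. The sketch below works on a mock of the response shape described above, filled with the AlCu numbers just printed. Treating the uncertainty as a symmetric half-width is an assumption, and `interval` and `n_sigma` are hypothetical names.

```python
# Mock of the response shape described above; the numbers are the AlCu
# prediction printed by the previous cell, not a fresh API call.
resp = {"candidates": [{"Density": [5.786268548675726, 1.264895144698526]}]}


def interval(pair, n_sigma=1.0):
    """Turn an (expected value, uncertainty) pair into (low, high) bounds.

    Assumes the uncertainty is a symmetric half-width around the value.
    """
    value, uncertainty = pair
    return value - n_sigma * uncertainty, value + n_sigma * uncertainty


low, high = interval(resp["candidates"][0]["Density"])
print("Density between {:.2f} and {:.2f}".format(low, high))
```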


### Elemental properties

The model uses average elemental properties, based on [Magpie](https://bitbucket.org/wolverton/magpie), to featurize the chemical formula.  The predictions contain those features, along with any other latent features:

In [9]:
print(list(resp['candidates'][0].items())[0:5])

[('CHEMICAL_FORMULA_Number_l1', [21.0, 0.0]), ('CHEMICAL_FORMULA_MeltingT_l1', [1145.62, 0.0]), ('CHEMICAL_FORMULA_NdValence_l1', [5.0, 0.0]), ('CHEMICAL_FORMULA_BoilingT_l1', [2996.0, 0.0]), ('CHEMICAL_FORMULA_MendeleevNumber_l1', [68.5, 0.0])]
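
To give a feel for what an average elemental property is, here is a minimal sketch that computes the composition-weighted mean atomic number of a formula. It is not the Magpie implementation: the parser handles only flat formulas without parentheses, the lookup table covers just the two elements in the example, and `parse_formula` and `mean_property` are hypothetical names. For `AlCu` it gives 21.0, which agrees with the `CHEMICAL_FORMULA_Number_l1` value above, though that the service computes the feature this way is an assumption.

```python
import re

# Atomic numbers for the two elements in the example (periodic-table values).
ATOMIC_NUMBER = {"Al": 13, "Cu": 29}


def parse_formula(formula):
    """Parse a flat formula such as 'Al2Cu' into {element: count}."""
    counts = {}
    # An element symbol followed by an optional integer or decimal count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[element] = counts.get(element, 0.0) + (float(count) if count else 1.0)
    return counts


def mean_property(formula, table):
    """Composition-weighted mean of an elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    return sum(table[element] * n for element, n in counts.items()) / total


print(mean_property("AlCu", ATOMIC_NUMBER))  # 21.0
```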


## Design

Now that we have a model, we can use it to search the space of materials for high-density candidates.  Creating a good sampler is generally hard, so here we'll just screen the first 10,000 compounds in our ICSD list with the model.

In [24]:
with open("./example_data/icsd.dat", "r") as f:
    compounds = [x.split()[0] for x in f.readlines()]
inputs = [{"CHEMICAL_FORMULA": x} for x in compounds[:10000]]
results = client.predict("betterdensitydemo", inputs)['candidates']
best = max(results, key=lambda x: x['Density'][0])
print("Highest density compound is {} with rho={} +/- {}".format(
    best['CHEMICAL_FORMULA'][0], best['Density'][0], best['Density'][1]
))

Highest density compound is HfPt3 with rho=18.575539743320043 +/- 1.986902621143938
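
Since every prediction carries an uncertainty, one option is to rank candidates by a conservative lower bound rather than by the expected value alone. The sketch below does this on mock candidates that reuse the two predictions printed in this tutorial. The single-element list for `CHEMICAL_FORMULA` is an assumed shape, ranking by `value - n_sigma * uncertainty` is a design choice that is not part of the original screen, and `lower_bound` is a hypothetical helper.

```python
import heapq

# Mock candidates in the shape returned by `predict`; both entries reuse
# numbers printed earlier in this tutorial rather than new predictions.
# The one-element list for CHEMICAL_FORMULA is an assumed shape.
results = [
    {"CHEMICAL_FORMULA": ["AlCu"], "Density": [5.786268548675726, 1.264895144698526]},
    {"CHEMICAL_FORMULA": ["HfPt3"], "Density": [18.575539743320043, 1.986902621143938]},
]


def lower_bound(candidate, n_sigma=1.0):
    """Expected density minus n_sigma times its uncertainty."""
    value, uncertainty = candidate["Density"]
    return value - n_sigma * uncertainty


# nlargest avoids sorting the full list when only the top few are needed.
ranked = heapq.nlargest(2, results, key=lower_bound)
for candidate in ranked:
    print(candidate["CHEMICAL_FORMULA"][0], round(lower_bound(candidate), 2))
```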
