In [1]:
from os import environ

from citrination_client import CitrinationClient
from citrination_client import PifSystemQuery, PifSystemReturningQuery
from citrination_client import FieldQuery, ValueQuery, NameQuery
from citrination_client import PropertyQuery, DataQuery, DatasetQuery, ChemicalFieldQuery, Filter
from pypif import pif

client = CitrinationClient(environ['CITRINATION_API_KEY'], 'https://citrination.com')

# Machine learning on Citrination

Citrination will automagically generate machine learning models when given sufficient metadata:
 1. A list of records (pifs)
 1. Identification of columns as inputs or outputs
 1. [Implicit] the assumption that any conditions not listed are consistent across records

## CSV to Models

User-defined machine learning is exposed via "data views":
 1. Put data into a CSV
 1. Upload as a dataset
 1. Include in a view
 1. ...
 1. Models!
 
In this tutorial, we'll generate a valid CSV from a query.  You can also use any CSV you might have lying around.

## Example: density from MaterialsProject

We'll train a model that maps chemical formula to density using [data](https://citrination.com/search/simple?property=density&includedDatasets=150675) from the [Materials Project](https://materialsproject.org/).

Let's start with a simple query that extracts the density along with the formula:

In [2]:
# Extract the formula and every value of the "Density" property from each record
system_query = PifSystemQuery(
    chemical_formula=ChemicalFieldQuery(
        extract_as="formula"
    ),
    properties=PropertyQuery(
        name=FieldQuery(
            filter=[Filter(equal="Density")]
        ),
        value=FieldQuery(
            extract_as="density",
            extract_all=True
        )
    )
)

The Materials Project is big, so we'll pull out just 500 records for now.  If we don't draw them randomly, they'll all be `Al` and `As` and `Cs`.

In [3]:
dataset_id = '150675'

test_query = PifSystemReturningQuery(
                size=500,
                random_results=True,
                query=DataQuery(
                    dataset=DatasetQuery(
                        id=[Filter(equal=dataset_id)]
                    ),
                    system=system_query
                ))

Let's see what we've got:

In [6]:
search_result = client.search.pif_search(test_query)
print("We found {} records".format(len(search_result.hits)))
print([x.extracted for x in search_result.hits[0:2]])

We found 500 records
[{'density': ['6.741920951056403'], 'formula': 'W25O73'}, {'density': ['4.035981296421682'], 'formula': 'Li4VCr3O8'}]
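
Because the query sets `extract_all=True`, each `density` comes back as a list of strings rather than a number. The following is a minimal sketch of converting such records to floats for a quick sanity check. The helper name `density_values` is hypothetical, and the sample records are copied from the output above.

```python
from statistics import mean


def density_values(records):
    """Return the first density of each extracted record as a float.

    Hypothetical helper: assumes each record looks like the `extracted`
    dictionaries printed above, with 'density' holding a list of strings
    (or a single string).
    """
    values = []
    for record in records:
        density = record.get('density')
        if not density:
            continue  # skip records that carry no density
        first = density[0] if isinstance(density, list) else density
        values.append(float(first))
    return values


# Sample records copied from the printed search output
sample = [
    {'density': ['6.741920951056403'], 'formula': 'W25O73'},
    {'density': ['4.035981296421682'], 'formula': 'Li4VCr3O8'},
]
values = density_values(sample)
print("min={:.3f} max={:.3f} mean={:.3f}".format(min(values), max(values), mean(values)))
```

With the real results, `density_values([x.extracted for x in search_result.hits])` would give the full list to summarize.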


Now we just need to format it as a CSV. The CSV needs headers that conform to our [CSV template](http://help.citrination.com/knowledgebase/articles/1188136-citrine-template-csv-csv).

In [7]:
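def write_csv(name, rows):
    with open(name, "w") as f:
        f.write("FORMULA, PROPERTY: Density\n")
        for row in rows:
            density = row.get('density')
            # With extract_all=True the density is a list of strings; write its first value
            if isinstance(density, list):
                density = density[0] if density else ''
            f.write("{}, {}\n".format(row.get('formula'), density))

write_csv('density.csv', [x.extracted for x in search_result.hits])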
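
Before uploading, it can help to read the file back and confirm that every density parses as a number. This is a sketch only: `check_density_csv` is a hypothetical helper, it assumes the two-column header that `write_csv` produces, and it does not reproduce whatever validation Citrination applies on upload.

```python
import csv
import os
import tempfile


def check_density_csv(path):
    """Return (line_number, row) pairs whose density does not parse as a float."""
    bad_rows = []
    with open(path, newline="") as f:
        # skipinitialspace drops the blank that follows each comma
        reader = csv.reader(f, skipinitialspace=True)
        header = next(reader)
        if header != ["FORMULA", "PROPERTY: Density"]:
            raise ValueError("unexpected header: {}".format(header))
        for line_number, row in enumerate(reader, start=2):
            try:
                float(row[1])
            except (IndexError, ValueError):
                bad_rows.append((line_number, row))
    return bad_rows


# Small sample file: the second data row holds a stringified list, not a number
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w") as f:
        f.write("FORMULA, PROPERTY: Density\n")
        f.write("W25O73, 6.741920951056403\n")
        f.write("Li4VCr3O8, ['4.035981296421682']\n")
    problems = check_density_csv(path)

print(problems)
```

An empty list from `check_density_csv('density.csv')` means every row has a numeric density.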

Upload that CSV on the [new dataset page](https://citrination.com/datasets/new), making sure to select `Citrine: Template CSV` from the dropdown menu. This creates a dataset of records with the chemical formula and density taken from the CSV.

To train the models, we create a _data view_ based on the dataset we just created.  To create a data view:
 1. Go to the [data views page](https://citrination.com/data_views) and click "Create new dataset"
 1. Search for the property name "Density" and select the dataset you created before.  Advance with the "NEXT >" button in the top right corner
 1. Select the "Chemical formula" and "Density" properties (or "Include all")
 1. The Chemical formula should be recognized as an "Inorganic Chemical Formula" and "Input"; click the right arrow to advance to the next property
 1. The density should be recognized as a "Real" and "Output".  Enter "Infinity" for the Max value
 1. Review the annotations, click "Next >", name your data view, and click "Save"

When the models are done training, you'll have access to predictions, model reports, and other analysis via the new data view.

## Data science

We can do better than that!  Many of the DFT records are unstable or metastable.  What we really want are densities of stable phases, so let's filter on the energy above the convex hull.

In [8]:
stable_query = PifSystemQuery(
                    chemical_formula=ChemicalFieldQuery(
                        extract_as='formula'
                    ),
                    properties=[
                        PropertyQuery(
                            name=FieldQuery(
                                filter=[Filter(equal="Density")]),
                            value=FieldQuery(
                                extract_as="density",
                                logic="MUST")
                        ),
                        PropertyQuery(
                            name=FieldQuery(
                                filter=[Filter(equal="Energy Above Convex Hull")]),
                            value=FieldQuery(
                                extract_as="EACH",
                                filter=[Filter(max='0.000000001')],
                                logic="MUST")
                        )]
)
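
The `max` filter in the query keeps only records whose energy above the convex hull is essentially zero. The same idea can be sketched locally on extracted dictionaries. Here `is_stable` is a hypothetical helper, the sketch assumes `max` is an inclusive upper bound, and the records are made up for illustration.

```python
HULL_TOLERANCE = 1e-9  # same threshold as the `max` filter in the query


def is_stable(record, tolerance=HULL_TOLERANCE):
    """True when the record's energy above the convex hull is within tolerance."""
    energy = record.get('EACH')
    if isinstance(energy, list):
        energy = energy[0] if energy else None
    return energy is not None and float(energy) <= tolerance


# Made-up records for illustration only; they are not Materials Project data
records = [
    {'formula': 'A2B', 'density': '5.1', 'EACH': '0.0'},
    {'formula': 'AB3', 'density': '4.2', 'EACH': '0.087'},
    {'formula': 'AB', 'density': '3.9'},  # no hull energy reported
]
stable = [r['formula'] for r in records if is_stable(r)]
print(stable)
```

Records without an `EACH` value are treated as not stable, which mirrors the `logic="MUST"` requirement in the query.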



Let's re-run with this new query, saving to `better_density.csv`.

In [10]:
dataset_id = 150675
query_size = 5000

better_query = PifSystemReturningQuery(
                size=query_size,
                random_results=True,
                query=DataQuery(
                    dataset=DatasetQuery(
                        id=[Filter(equal=str(dataset_id))]
                    ),
                    system=stable_query
                ))


better_result = client.search.pif_search(better_query)

print("We found {} records".format(len(better_result.hits)))
print([x.extracted for x in better_result.hits[0:2]])
write_csv('better_density.csv', [x.extracted for x in better_result.hits])

We found 5000 records


## Applying the model

We can use the model to make predictions through the client.  The `predict` method expects the ID number of the data view and a list of inputs, where each input is a map from property names to property values.

The result is a list with one prediction per input.  Calling `get_value` on a prediction with a property name returns an object whose `value` and `loss` attributes hold the expected value and its uncertainty.

In [22]:
inputs = [{"Chemical formula": "AlCu"},]
resp = client.models.predict("27", inputs)
print(len(inputs), len(resp))
prediction = resp[0].get_value('Density')
print("We predict the density of {} to be {} +/- {}".format(inputs[0]['Chemical formula'], prediction.value, prediction.loss))

1 1
We predict the density of AlCu to be 7.464366567871429 +/- 2.178253940554062
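
To judge how much to trust a prediction, it can be useful to express the uncertainty as a fraction of the predicted value. A minimal sketch, using the numbers printed above; `relative_uncertainty` is a hypothetical helper, not part of the client.

```python
def relative_uncertainty(value, loss):
    """Uncertainty as a fraction of the predicted value."""
    if value == 0:
        raise ValueError("relative uncertainty is undefined for a zero prediction")
    return abs(loss / value)


# Numbers copied from the AlCu prediction printed above
value, loss = 7.464366567871429, 2.178253940554062
print("Relative uncertainty: {:.0%}".format(relative_uncertainty(value, loss)))
```

For AlCu this comes to a little under a third of the predicted density, so the prediction is best read as a rough estimate.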


### Elemental properties

The model uses average elemental properties, based on [magpie](https://bitbucket.org/wolverton/magpie), to featurize the chemical formula.  The predictions include those features, along with any other latent features:

In [40]:
keys = list(resp[0].all_keys())
print([[key, resp[0].get_value(key).value] for key in keys[0:5]])

[['Min atomic radius plus max electronegativity difference for Chemical formula dopants', 0.0], ['mean of Elemental atomic volume for Chemical formula dopants', 0.0], ['mean of Non-dimensional work function for Chemical formula', 0.7004071932745081], ['mean of DFT volume ratio for Chemical formula dopants', 0.0], ['mean of Shear Modulus Melting Temp Product for Chemical formula dopants', 0.0]]
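
To make "mean of an elemental property" concrete, here is a rough sketch that computes a composition-weighted mean for a simple formula. It is an illustration, not magpie's or Citrination's implementation: the parser `parse_formula`, the helper `mean_property`, and the choice of atomic mass as the property are all assumptions, as is weighting by composition.

```python
import re

# Standard atomic masses in u, listed only for the elements used here
ATOMIC_MASS = {"Al": 26.9815, "Cu": 63.546, "O": 15.999, "W": 183.84}


def parse_formula(formula):
    """Split a simple formula such as 'W25O73' into {element: amount}.

    Handles element symbols followed by optional numbers; parentheses and
    other nested notation are not supported.
    """
    composition = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        composition[symbol] = composition.get(symbol, 0.0) + float(amount or 1)
    return composition


def mean_property(formula, table):
    """Composition-weighted mean of an elemental property."""
    composition = parse_formula(formula)
    total = sum(composition.values())
    return sum(table[element] * n for element, n in composition.items()) / total


print(parse_formula("W25O73"))
print(mean_property("AlCu", ATOMIC_MASS))
```

Each "mean of ..." feature in the output above presumably follows the same pattern with a different elemental property table.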


## Design

Now that we have a model, we can optimize it over the space of materials.  Creating a good sampler is generally hard, so here we'll just screen our model over the compounds in ICSD.

In [44]:
with open("./example_data/icsd.dat", "r") as f:
    compounds = [x.split()[0] for x in f.readlines()]
inputs = [{"Chemical formula": x} for x in compounds[:1000]]
resp = client.models.predict("27", inputs)
results = [{"formula": r.get_value('Chemical formula').value,
            "value": r.get_value('Density').value,
            "loss": r.get_value('Density').loss} for r in resp]
# Pick the candidate with the highest predicted density
best = max(results, key=lambda x: x['value'])
print("Highest density compound is {} with rho={} +/- {}".format(
    best['formula'], best['value'], best['loss']
))

Highest density compound is Np1Al99 with rho=14.450434349941387 +/- 6.354853591092307
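
The top hit carries a large uncertainty relative to its predicted value, so ranking on the prediction alone may favor candidates the model knows little about. One common alternative, sketched below, ranks by a conservative bound of `value - k * loss`. The helper `rank_conservatively` and the second candidate are hypothetical, and the choice of `k` is a judgment call.

```python
def rank_conservatively(results, k=1.0):
    """Sort candidates by value - k * loss, highest first."""
    return sorted(results, key=lambda r: r['value'] - k * r['loss'], reverse=True)


candidates = [
    # Copied from the screening output above
    {"formula": "Np1Al99", "value": 14.450434349941387, "loss": 6.354853591092307},
    # Made-up entry for illustration: lower prediction, much smaller uncertainty
    {"formula": "hypothetical-A", "value": 12.0, "loss": 1.0},
]
ranked = rank_conservatively(candidates)
print([c["formula"] for c in ranked])
```

With `k=1.0` the made-up candidate ranks first, because its lower bound of 11.0 exceeds the roughly 8.1 lower bound of `Np1Al99`. Applying the same function to `results` from the screen would reorder the real candidates.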
