In [1]:
from os import environ

from citrination_client import (
    CitrinationClient, PifQuery, SystemQuery, PropertyQuery,
    ChemicalFieldOperation, FieldOperation, Filter,
)

# The API key is read from the environment rather than hard-coded.
client = CitrinationClient(environ['CITRINATION_API_KEY'], 'https://citrination.com')

# Machine learning on Citrination

Citrination will automagically generate machine learning models when given sufficient metadata:
 1. A list of records (PIFs)
 1. Identification of columns as inputs or outputs
 1. [Implicit] consistency of unlisted conditions

## CSV to Models

User-defined machine learning is exposed via the "csv2models" tool:
 1. Put data in rows
 1. Label columns
 1. ...
 1. Models!
 
In this tutorial, we'll generate a valid CSV from a query.  You can also use any CSV you might have by setting the column names.
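
If you start from a CSV of your own, "setting the column names" amounts to rewriting its header row. Here is a minimal sketch of that relabeling, using the `INPUT:<name>` / `OUTPUT:<name>` header form that appears later in this tutorial. `label_header` is a hypothetical helper, not part of `citrination_client`. Replacing spaces with underscores is an assumption made here to respect the no-spaces note, and which names csv2models actually accepts is not confirmed by this notebook.

```python
def label_header(columns, outputs):
    """Prefix each column name with INPUT: or OUTPUT: (hypothetical helper)."""
    labeled = []
    for name in columns:
        # Assumption: spaces are swapped for underscores, since the tutorial
        # says header names must not contain spaces.
        clean = name.strip().replace(" ", "_")
        prefix = "OUTPUT:" if name in outputs else "INPUT:"
        labeled.append(prefix + clean)
    return labeled


print(label_header(["formula", "density"], outputs={"density"}))
```

Any column not listed in `outputs` is treated as an input, which matches the two-role labeling described above.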

## Example: density from Materials Project

We'll train a model from chemical formula to density using [data](https://citrination.com/search/simple?property=density&includedDatasets=150675) from the [Materials Project](https://materialsproject.org/).

Let's start with a simple query that extracts the density along with the formula:

In [2]:
system_query = SystemQuery(
    chemical_formula=ChemicalFieldOperation(
        extract_as="formula"
    ),
    properties=PropertyQuery(
        name=FieldOperation(
            filter=[Filter(equal="density")]
        ),
        value=FieldOperation(
            extract_as="density"
        )
    )
)

Materials Project is big, so we'll just pull out 100 records for now.  If we don't draw them randomly, they'll all be `Al` and `As` and `Cs`.

In [3]:
test_query = PifQuery(
    include_datasets=[150675],
    size=100,
    random_results=True,
    system=system_query
)

Let's see what we've got:

In [4]:
search_result = client.search(test_query)
print("We found {} records".format(search_result.total_num_hits))
print([x.extracted for x in search_result.hits[0:2]])

We found 52265 records
[{'formula': 'Cs', 'density': '1.8998152107659778'}, {'formula': 'Cs', 'density': '1.8998152107659778'}]
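
The two rows printed above are identical, and `density` arrives as a string. Here is a minimal sketch of one way to tidy the extracted rows before writing them out, assuming each row is a dict with `formula` and `density` keys as in the printed output. `dedupe_rows` is a hypothetical helper, not part of `citrination_client`.

```python
def dedupe_rows(rows):
    """Drop exact duplicate (formula, density) rows, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        # Fail early (ValueError) if a density value is not numeric text.
        float(row["density"])
        key = (row["formula"], row["density"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sample = [
    {"formula": "Cs", "density": "1.8998152107659778"},
    {"formula": "Cs", "density": "1.8998152107659778"},
]
print(dedupe_rows(sample))
```

The density values are left as strings so the rows can still be passed to string-based formatting. Whether duplicates should be dropped at all depends on the dataset, so treat this as optional.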


Now we just need to format it as a CSV with `INPUT:<name>` and `OUTPUT:<name>` headers.  Note: don't use any spaces (sorry!).

In [5]:
def write_csv(name, rows):
    with open(name, "w") as f:
        f.write("INPUT:CHEMICAL_FORMULA,OUTPUT:Density-g/cm3\n")
        for row in rows:
            f.write("{formula:s},{density:s}\n".format(**row))
write_csv('density.csv', [x.extracted for x in search_result.hits])
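
The writer above builds each line by hand. As an alternative sketch, Python's standard `csv` module takes care of quoting, and the header names can be checked for spaces before anything is written. The helper name `write_labeled_csv` and its parameters are hypothetical. The `INPUT:`/`OUTPUT:` header form and the default column names are copied from the cell above.

```python
import csv


def write_labeled_csv(path, rows, input_name="CHEMICAL_FORMULA",
                      output_name="Density-g/cm3"):
    """Write rows of {'formula', 'density'} dicts with INPUT:/OUTPUT: headers."""
    header = ["INPUT:" + input_name, "OUTPUT:" + output_name]
    for column in header:
        # The tutorial warns against spaces, so reject them before writing.
        if " " in column:
            raise ValueError("column name contains a space: " + repr(column))
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as f:
        # lineterminator="\n" matches the line endings of the hand-written version.
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(header)
        for row in rows:
            writer.writerow([row["formula"], row["density"]])
```

It can be called the same way as `write_csv`, for example `write_labeled_csv('density.csv', [x.extracted for x in search_result.hits])`.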

Upload that CSV to the [models page](https://citrination.com/models/).

In [8]:
candidates = [{"CHEMICAL_FORMULA": "Fe"},]
resp = client.predict("densitydemo", candidates)
print(resp.content)

Posting to https://citrination.com/api/csv_to_models/densitydemo/predict with data={"usePrior": true, "candidates": [{"CHEMICAL_FORMULA": "Fe"}], "predictionSource": "scalar"}
b'Prediction failed: null'


## Data science

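We can do better than that! Let's restrict the training data to stable compounds by also requiring the "Energy above convex hull" property to be at most 0.001.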

In [7]:
stable_query = SystemQuery(
    chemical_formula=ChemicalFieldOperation(
        extract_as="formula"
    ),
    properties=[
        PropertyQuery(
            name=FieldOperation(
                filter=[Filter(equal="density")]
            ),
            value=FieldOperation(
                extract_as="density"
            ),
            logic="MUST"
        ),
        PropertyQuery(
            name=FieldOperation(
                filter=[Filter(equal="Energy above convex hull")]
            ),
            value=FieldOperation(
                filter=[Filter(max=0.001)]
            ),
            logic="MUST"
        )
    ]
)

In [8]:
better_query = PifQuery(
    include_datasets=[150675],
    random_results=True,
    system=stable_query
)
better_result = client.search(better_query)
print("We found {} records".format(better_result.total_num_hits))
write_csv('better_density.csv', [x.extracted for x in better_result.hits])

We found 52265 records
