# PyCC Data Views API Tutorial

*Authors: Enze Chen, Eric Lundberg*

In this notebook, we will cover how to *create* a data view using the [Citrination API](http://citrineinformatics.github.io/python-citrination-client/). Data views hold the configuration needed to train machine learning models and identify relationships in your data. Previously, data views could only be created through the online user interface; they can now also be created from Python through API calls. We will demonstrate this functionality using the [Band gaps from Strehlow and Cook](https://citrination.com/datasets/1160/show_search?searchMatchOption=fuzzyMatch) dataset, where we will create a view mapping:

$$\text{Chemical formula (inorganic) + Crystallinity (categorical)} \longrightarrow \text{Band gap (real)}$$

## Learning objectives
By the end of this tutorial, you will know how to:
* Create DataViewBuilder objects
* Create new data views from existing data using the DataViewsClient
* Perform operations on views using the DataViewsClient

## Background knowledge
In order to get the most out of this tutorial, you should already be familiar with the following:
* Creating and accessing datasets through the API ([documentation](http://citrineinformatics.github.io/python-citrination-client/tutorial/data_examples.html) and [tutorial](1_data_client_api_tutorial.ipynb))
* What the data views [front-end UI](https://citrination.com/data_views) looks like

## Imports

In [1]:
# Standard packages
import json
import os
import time
import uuid # generating random IDs

# Third-party packages
from citrination_client import (CitrinationClient, CategoricalDescriptor,
                                InorganicDescriptor, RealDescriptor)
from citrination_client.views.data_view_builder import DataViewBuilder

## Data view builder
This class handles the configuration for data views and returns a **configuration** object that is an input for the data views client. The configuration specifies the datasets, model, and descriptors. Some of the important parameters to note are:
* **dataset_ids**: An array of strings, one for each dataset ID that should be included in the view.
* **model_type**: A string of either `linear`, which uses linear regression, or `default`, which uses a random forest.
* **descriptors**: A descriptor instance, which could be `{RealDescriptor, InorganicDescriptor, OrganicDescriptor, CategoricalDescriptor, or AlloyCompositionDescriptor}`.
    * Note: Chemical formulas for the API take the key `formula`.
* **roles**: A role for each descriptor, as a string, which could be `{input, output, latentVariable, ignored}`.
* **group_by**: A Boolean for whether or not to group by a descriptor during cross-validation (`default = False`).
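
Because roles are passed as plain strings, a typo is easy to make. The sketch below is a small pre-flight check we might run before calling `add_descriptor()`. The helper `find_invalid_roles` and the `ALLOWED_ROLES` tuple are hypothetical and not part of `citrination_client`; the check also assumes that role strings must match the spellings listed above exactly.

```python
# Hypothetical helper, not part of citrination_client.
# Assumption: role strings must match the listed spellings exactly.
ALLOWED_ROLES = ('input', 'output', 'latentVariable', 'ignored')


def find_invalid_roles(role_by_key):
    """Return (key, role) pairs whose role is not one of ALLOWED_ROLES."""
    return [(key, role) for key, role in role_by_key.items()
            if role not in ALLOWED_ROLES]


planned_roles = {
    'Property Crystallinity': 'input',
    'formula': 'input',
    'Property Band gap': 'Output',  # deliberate typo: wrong capitalization
}
invalid = find_invalid_roles(planned_roles)
print(invalid)
```

An empty list means every planned role uses one of the listed strings.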

In [9]:
# Create ML configuration
dv_builder = DataViewBuilder()
dv_builder.dataset_ids(['1160']) # ID number for band gaps dataset
dv_builder.model_type('default') # random forest

# Define descriptors
crystallinity = ['Single crystalline', 'Polycrystalline', 'Amorphous'] # Obtained from dataset
desc_crystal = CategoricalDescriptor(key='Property Crystallinity',
                                     categories=crystallinity)
dv_builder.add_descriptor(descriptor=desc_crystal,
                          role='input')

desc_formula = InorganicDescriptor(key='formula',
                                   threshold=1) # threshold <= 1; hidden in future releases
dv_builder.add_descriptor(descriptor=desc_formula,
                          role='input')

desc_bandgap = RealDescriptor(key='Property Band gap',
                              lower_bound=0.0,
                              upper_bound=1e9,
                              units='eV')
dv_builder.add_descriptor(descriptor=desc_bandgap,
                          role='output')

# Build the configuration once all the pieces are in place
dv_config = dv_builder.build()

In [10]:
desc_bandgap

{
  "descriptor_key": "Property Band gap",
  "category": "Real",
  "lower_bound": 0.0,
  "upper_bound": 1000000000.0,
  "units": "eV"
}

## Data view client
After obtaining your customized configuration, you have to initialize a data views client instance in order to create a data view from the configuration you built. The `create()` method returns the ID for the data view, which you will need for subsequent analysis and retraining.

In [15]:
# Instantiate the base CitrinationClient
client = CitrinationClient(os.environ['CITRINATION_API_KEY'], 'https://citrination.com')

# Instantiate the DataViewsClient
dv_client = client.data_views

# Create a data view using the above configuration and store the ID
view_name = 'PyCC View ' + str(uuid.uuid4()) # random name to avoid clashes
view_desc = 'This view was created by the PyCC API tutorial.'
dv_id = dv_client.create(configuration=dv_config,
                         name=view_name,
                         description=view_desc)
print('ID:', dv_id, 'Name:', view_name)

ID: 9217 Name: PyCC View b7051eda-2d20-4f4e-a78c-98f9c0282487
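
The client above reads the API key directly from `os.environ`, which raises a bare `KeyError` when the variable is missing. As a minimal sketch, a small wrapper can give a more descriptive error. The function name `read_api_key` is our own and not part of the client library.

```python
import os


def read_api_key(env_var='CITRINATION_API_KEY'):
    """Return the API key stored in env_var, or raise a descriptive error."""
    api_key = os.environ.get(env_var, '').strip()
    if not api_key:
        raise RuntimeError(
            'Set the {} environment variable before creating the client.'.format(env_var))
    return api_key


# Usage with the client from this notebook (commented out; needs a real key):
# client = CitrinationClient(read_api_key(), 'https://citrination.com')
```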


## Data view properties and analysis
Now that the view is on your Citrination site, you can use the ID to do a variety of analyses. For example, you can obtain the metadata in JSON format for easy extraction.

In [16]:
view_metadata = dv_client.get(dv_id)
print('Name of view: {}'.format(view_metadata['name']))
print('Column names: {}'.format(view_metadata['selected_columns']))
print('Descriptor roles: {}'.format(view_metadata['configuration']['roles']))

Name of view: PyCC View b7051eda-2d20-4f4e-a78c-98f9c0282487
Column names: ['Property Crystallinity', 'formula', 'Property Band gap']
Descriptor roles: {'Property Band gap': 'output', 'formula': 'input', 'Property Crystallinity': 'input'}
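
Since the metadata is a nested dictionary, it is straightforward to pull out, for example, only the input or output columns. The sketch below uses a hand-written sample shaped like the fields printed above rather than a live API response; the helper `columns_by_role` is hypothetical.

```python
# Sample shaped like the fields printed above; in the notebook this would
# come from dv_client.get(dv_id).
sample_metadata = {
    'name': 'PyCC View b7051eda-2d20-4f4e-a78c-98f9c0282487',
    'selected_columns': ['Property Crystallinity', 'formula', 'Property Band gap'],
    'configuration': {
        'roles': {'Property Band gap': 'output',
                  'formula': 'input',
                  'Property Crystallinity': 'input'},
    },
}


def columns_by_role(metadata, role):
    """Return the selected columns that have the given role, in column order."""
    roles = metadata['configuration']['roles']
    return [col for col in metadata['selected_columns'] if roles.get(col) == role]


input_columns = columns_by_role(sample_metadata, 'input')
output_columns = columns_by_role(sample_metadata, 'output')
print('Inputs: {}'.format(input_columns))
print('Outputs: {}'.format(output_columns))
```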


### Check status of services
If there's a lot of data, training might take some time, and you might want to check when `predict` services are ready. Other possible services include `experimental_design`, `data_reports`, and `model_reports`.

In [23]:
vars(dv_client.get_data_view_service_status(dv_id)._model_reports)

{'_ready': True,
 '_context': 'success',
 '_event': <citrination_client.models.event.Event at 0x1150d5a58>,
 '_reason': 'Model reports are now available'}

In [17]:
print(dv_client.get_data_view_service_status(dv_id).predict.reason)

Predict services are ready.
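
If training is still in progress, we may want to wait for a service instead of checking once. The following is a minimal polling sketch with a timeout. The helper `wait_until` is hypothetical, and the commented usage assumes the status object exposes a `ready` property mirroring the `_ready` field shown above.

```python
import time


def wait_until(is_ready, timeout_s=600.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call is_ready() until it returns True or timeout_s elapses.

    Returns True if the check succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Usage with the client from this notebook (commented out; needs network access).
# Assumption: the service status exposes a `ready` property.
# ready = wait_until(
#     lambda: dv_client.get_data_view_service_status(dv_id).predict.ready)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.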


### Retraining a view
Once you've updated an included dataset by uploading more data (from the literature, sequential learning, etc.), you can easily retrain your models.

In [24]:
dv_client.retrain(dv_id)

True

### Submitting a prediction request
Predictions are requested with the `submit_predict_request()` method, which returns an ID for the prediction request.

In [25]:
candidate_formula = 'GaN'
candidates = [{'formula':candidate_formula, 'Property Crystallinity':'Single crystalline'}]
predict_id = dv_client.submit_predict_request(dv_id,
                                              candidates,
                                              prediction_source='scalar',
                                              use_prior=False)

# Use a loop to monitor status
while True:
    predict_status = dv_client.check_predict_status(dv_id, predict_id)
    print('Prediction job status for {}: {}'.format(predict_id, predict_status['status']))
    if predict_status['status'] == 'Finished':
        break
    time.sleep(5)

predict_results = predict_status['results']
predict_value = predict_results['candidates'][0]['Property Band gap']
predict_loss = predict_results['loss'][0]['Property Band gap']
print('The predicted bandgap for {} is {} +/- {}'.format(candidate_formula, predict_value, predict_loss))

Prediction job status for 86be64e6-a153-41b6-9531-f7a80954e390: Accepted
Prediction job status for 86be64e6-a153-41b6-9531-f7a80954e390: Finished
The predicted bandgap for GaN is 3.420499661246613 +/- 0.13075207058150282
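
When several candidates are submitted at once, the value and loss for each can be collected in one pass. The sketch below works on a hand-written sample with the same `candidates` and `loss` layout used above. The helper `summarize_predictions` is hypothetical, and it assumes results come back in the same order as the submitted candidates.

```python
def summarize_predictions(results, formulas, key='Property Band gap'):
    """Pair each formula with its predicted value and loss, by position.

    Assumption: results['candidates'] and results['loss'] are ordered the
    same way as the submitted candidates.
    """
    rows = []
    for formula, candidate, loss in zip(formulas, results['candidates'], results['loss']):
        rows.append((formula, candidate[key], loss[key]))
    return rows


# Sample shaped like predict_status['results'], trimmed to the output key.
sample_results = {
    'candidates': [{'Property Band gap': 3.420499661246613}],
    'loss': [{'Property Band gap': 0.13075207058150282}],
}
summary = summarize_predictions(sample_results, ['GaN'])
for formula, value, loss in summary:
    print('{}: {:.2f} +/- {:.2f} eV'.format(formula, value, loss))
```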


In [28]:
predict_results

{'candidates': [{'mean of Miracle Ratio for formula dopants': 0.0,
   'Maximum radius ratio for formula': 1.7183098591549295,
   'mean of Elastic Poisson Ratio for formula': 0.27781211902485525,
   'mean of Elemental polarizability for formula': 4.609999999999999,
   'mean of Packing density for formula': 410.3878115564787,
   'mean of Elemental crystal structure (space group) for formula': 129.0,
   'mean of Number of unfilled f valence electrons for formula dopants': 0.0,
   'mean of DFT volume ratio for formula dopants': 0.0,
   'mean of Valence electron density for formula': 0.44186192995716805,
   'mean of Modulii sum for formula': 138.15649798051794,
   'mean of Non-dimensional liquid range for formula dopants': 0.0,
   'formula dopant stoichiometry': 0.0,
   'Maximum weight fraction for formula dopants': 0.0,
   'Maximum atomic fraction for formula dopants': 0.0,
   'mean of Elemental melting temperature for formula': 182.98000000000002,
   'mean of Miracle Ratio for formula': 0

### Deleting a view
You can delete views very easily through the API, so handle with care!

In [29]:
# dv_client.delete(dv_id)

## Conclusion
To recap, this notebook went through the steps for creating a data view using the API.
1. First, we used the DataViewBuilder object to specify the configuration.
2. Then, we created the view with the DataViewsClient, which trains the model; this is simple as long as the configuration is correct.
3. Lastly, we explored some of the post-processing capabilities, such as retraining and submitting predictions.

## Additional resources
It's now possible to conduct the major aspects of the Citrination workflow through the API, which should increase the speed and flexibility of informatics approaches. Some other topics that might interest you include:
* More details regarding client functions in the [code base](https://github.com/CitrineInformatics/python-citrination-client/blob/master/citrination_client/views/client.py).
* [DataClient](http://citrineinformatics.github.io/python-citrination-client/tutorial/data_examples.html) - This allows you to create datasets and upload PIF data (only) using the API.
  * There is also a corresponding [tutorial](1_data_client_api_tutorial.ipynb).
* [ModelsClient](http://citrineinformatics.github.io/python-citrination-client/tutorial/models_examples.html) - This allows you to submit predict and design runs using the API.
  * There is also a corresponding [tutorial](3_models_client_api_tutorial.ipynb).
* Other examples on [learn-citrination](https://github.com/CitrineInformatics/learn-citrination).