![Banner logo](https://raw.githubusercontent.com/CitrineInformatics/community-tools/master/templates/fig/citrine_banner_2.png)

# PyCC Data Client Tutorial

*Authors: Enze Chen, Max Gallant*

In this notebook, we will cover how to use the [Citrination API](http://citrineinformatics.github.io/python-citrination-client/) to upload and manage datasets on Citrination. Getting your data on Citrination will allow you to keep your data organized in one place and enable you to perform machine learning (ML) on the data. The application programming interface (API) lets you do all of this from Python scripts instead of the web UI.

## Table of contents
1. [Learning outcomes](#Learning-outcomes)
1. [Background knowledge](#Background-knowledge)
1. [Imports](#Python-package-imports)
1. [Initialization](#Initialize-the-PyCC)
1. [Data Client](#Data-client)
1. [Conclusion](#Conclusion)
1. [Additional resources](#Additional-resources)

## Learning outcomes

[Back to ToC](#Table-of-contents)

By the end of this tutorial, you will know how to:
* Initialize the Python Citrination Client (PyCC).
* Create a new dataset and upload data to that dataset using the [`DataClient`](http://citrineinformatics.github.io/python-citrination-client/tutorial/data_examples.html) sub-client.
* Retrieve data from a dataset and update dataset properties using the `DataClient`.

## Background knowledge

[Back to ToC](#Table-of-contents)

In order to get the most out of this tutorial, you should already be familiar with the following:
* The Physical Information File (PIF) schema. 
  * [Documentation](http://citrineinformatics.github.io/pif-documentation/schema_definition/index.html)
  * [Publication](https://www.cambridge.org/core/journals/mrs-bulletin/article/beyond-bulk-single-crystals-a-data-format-for-all-materials-structurepropertyprocessing-relationships/AADBAEDA62B0391D708CF02269989E8B)
  * [Example](https://github.com/CitrineInformatics/learn-citrination/blob/master/AdvancedPif.ipynb)
* What the datasets [front-end UI](https://citrination.com/datasets) looks like.

## Python package imports

[Back to ToC](#Table-of-contents)

In [None]:
# Standard packages
from os import environ  # get environment variables
from time import sleep  # wait time
from uuid import uuid4  # generating random IDs

# Third-party packages
from citrination_client import *

## Initialize the PyCC

[Back to ToC](#Table-of-contents)

Assuming that this is the very first time you're interacting with the Citrination API, we will first go over how to properly initialize the client that handles all communication. Most APIs require a key for access, and the PyCC is no exception. You can find your API key by navigating to [Citrination](https://citrination.com), clicking your username in the top-right corner, clicking "Account Settings," and then looking under your Email. Copy this key to your clipboard (`Ctrl+C`).

Since the key is linked to your specific user profile, *you should never hard-code or expose your API key in your code.* Instead, first store the API key in your [environment variables](https://medium.com/@himanshuagarwal1395/setting-up-environment-variables-in-macos-sierra-f5978369b255) like so (for Macs):
* In Terminal, type `vim ~/.bash_profile` (or use an editor of your choice).
* In that file, press `i` (insert mode) and add the line `export CITRINATION_API_KEY="paste_your_api_key"`.
* Save and exit (`Esc`, `:wq`, `Enter`).
* Open up a new Terminal and load this notebook one more time.

Instructions for setting environment variables in Windows can be found online, for example at [this site](https://www.computerhope.com/issues/ch000549.htm).

Now when you're coding, you can initialize the PyCC using the following syntax:

In [None]:
site = "https://citrination.com"  # site you want to access; we'll use the public site
client = CitrinationClient(api_key=environ.get('CITRINATION_API_KEY'), 
                           site=site)
client # reveal the attributes

The first argument is your API key, which you've stored in your system, and the second argument is your site URL. This example uses the public Citrination site, and different sites have different API keys, so pay attention to what you have listed in your `~/.bash_profile`. 

**Key takeaway**: Never expose your API key in your code.
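
One practical consequence of reading the key from the environment is that `environ.get()` silently returns `None` when the variable is missing. The sketch below shows one way to fail early with a readable message instead. The helper name `load_api_key` and its `env` parameter are illustrative assumptions and are not part of the PyCC.

```python
import os


def load_api_key(var_name="CITRINATION_API_KEY", env=None):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    # `env` defaults to the real environment; a plain dict can be passed for testing
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            "Environment variable {} is not set; add it to your shell profile "
            "and restart the notebook.".format(var_name)
        )
    return key


# Hypothetical usage (requires citrination_client and a real key):
# client = CitrinationClient(api_key=load_api_key(), site="https://citrination.com")
```

The key itself still never appears in the notebook; only the name of the variable does.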

## Data client

[Back to ToC](#Table-of-contents)

Once the base client is initialized, the [`DataClient`](http://citrineinformatics.github.io/python-citrination-client/tutorial/data_examples.html) can be easily accessed using the `.data` attribute.

In [None]:
data_client = client.data
data_client  # reveal the methods

### Create a dataset
Before you can upload data, you have to create an empty dataset to store the files in. The `create_dataset()` method of the `DataClient` does exactly this and returns a [`Dataset`](http://citrineinformatics.github.io/python-citrination-client/modules/data/datasets.html) object. The method has the following inputs:
* `name`: A string for the name of the dataset. It cannot be the same as that of an existing dataset that you own.
* `description`: A string for the description of the dataset.
* `public`: A Boolean indicating whether to make the dataset public (`default=False`).

In [None]:
data_name = 'PyCC Dataset ' + str(uuid4())[:6]
data_desc = 'This dataset was created by the PyCC API tutorial.'
dataset = data_client.create_dataset(name=data_name, 
                                     description=data_desc)

Once you've created the `Dataset` object, you can obtain the dataset ID from its `.id` attribute. You will need this ID for subsequent operations.

In [None]:
dataset_id = dataset.id
dataset_time = dataset.created_at
print('Dataset {} was created at {}.'.format(dataset_id, dataset_time))
print('It can be accessed at {}/datasets/{}'.format(site, dataset_id))

If you click on the above URL, it will take you to the dataset on Citrination, which at this point should be empty.

### Upload data to a dataset
The `upload()` method of the `DataClient` allows you to upload a file or a directory to a dataset. The method has the following inputs:
* `dataset_id`: The integer value of the ID of the dataset to which you will be uploading data.
* `source_path`: The path to the file or directory you want to upload.
* `dest_path`: The name of the file or directory as it should appear on Citrination (`default=None`).

The returned [`UploadResult`](http://citrineinformatics.github.io/python-citrination-client/modules/data/data_client.html#citrination_client.data.upload_result.UploadResult) object tracks the number of successful and failed uploads. You can also use the `get_ingest_status()` method to check the status of ingest.

*Note*: Any file format can be uploaded, but the current CitrinationClient (v5.0.1) only supports the [ingestion](https://help.citrination.com/knowledgebase/articles/1195249-citrination-file-ingesters) (i.e. "processing") of PIF files. 

In [None]:
# Upload a single file
upload_result = data_client.upload(dataset_id=dataset_id, 
                                   source_path='test_pif.json')
print('Successful upload? {}'.format(upload_result.successful())) # Boolean; True if none fail

# Upload a directory; each file is recursively added and has the folder name as a prefix
upload_result = data_client.upload(dataset_id=dataset_id, 
                                   source_path='test_pif_dir/')
print('Number of successful uploads: {}'.format(len(upload_result.successes))) # list of successful files

# Check ingest status with loop
while True:
    ingest_status = data_client.get_ingest_status(dataset_id=dataset_id)
    if ingest_status == 'Finished':
        print('Ingestion complete!')
        print('Dataset URL: {}/datasets/{}'.format(site, dataset_id))
        break
    else:
        print('Waiting for data ingest...')
        sleep(10)

**Verify**: If you go back to the dataset in the UI and refresh the page, you should find it populated with PIF records!
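
The polling loop above runs forever if ingestion never reaches `'Finished'`. A minimal sketch of the same idea with an upper bound on the waiting time is shown below. The helper `wait_for_ingest` and its parameters are hypothetical; it takes any zero-argument callable that returns the status string, so it does not depend on the PyCC itself. Only the `'Finished'` status is taken from this tutorial.

```python
import time


def wait_for_ingest(get_status, timeout=600, interval=10, sleep=time.sleep):
    """Poll get_status() until it returns 'Finished' or the timeout elapses.

    Returns True if ingestion finished in time, False otherwise.
    """
    waited = 0  # seconds spent sleeping so far
    while True:
        if get_status() == "Finished":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


# Hypothetical usage with the objects defined earlier in this notebook:
# done = wait_for_ingest(lambda: data_client.get_ingest_status(dataset_id=dataset_id))
```

Passing `sleep` as a parameter is a design choice that makes the helper easy to test without actually waiting.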

### Retrieving data: File download URLs
The more common way to retrieve data from datasets on Citrination is to request download URLs. The `get_dataset_files()` method can be used to get a list of [`DatasetFile`](http://citrineinformatics.github.io/python-citrination-client/modules/data/datasets.html#citrination_client.data.dataset_file.DatasetFile) objects from a dataset. The method has the following inputs:
* `dataset_id`: The integer value of the ID of the dataset that you're retrieving data from.
* `glob`: A [regex](https://ryanstutorials.net/regular-expressions-tutorial/) used to select one or more files in the dataset (`default='.'`).
* `is_dir`: A Boolean indicating whether or not the supplied pattern should be treated as a directory to search in (`default=False`).
* `version_number`: The integer value of the version number of the dataset to retrieve files from (`default=None`).

In [None]:
regex = 'pif'  # matches files with 'pif' in the name
dataset_files = data_client.get_dataset_files(dataset_id=dataset_id, 
                                              glob=regex)
print('The regex \'{}\' matched {} files in dataset {}.'.format(regex, 
                                                                len(dataset_files), 
                                                                dataset_id))
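
To get a feel for which file paths a pattern would select, the matching can be approximated locally with Python's `re` module. This is only a sketch: it assumes the server-side matching behaves like `re.search`, which this tutorial does not confirm, and both the helper `preview_matches` and the file paths are made up for illustration.

```python
import re


def preview_matches(pattern, paths):
    """Return the paths that contain a match for the regular expression."""
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.search(p)]


# Made-up file paths for illustration only
example_paths = ["test_pif.json", "test_pif_dir/a.json", "pif_notes.txt", "summary.csv"]

pif_only = preview_matches("pif", example_paths)        # paths containing 'pif'
json_only = preview_matches(r"\.json$", example_paths)  # paths ending in '.json'
everything = preview_matches(".", example_paths)        # '.' matches any non-empty path
```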

[`DatasetFile`](http://citrineinformatics.github.io/python-citrination-client/modules/data/datasets.html#citrination_client.data.dataset_file.DatasetFile) objects have `path` and `url` attributes that can then be accessed. There is also a `download_files()` method with the following parameters:
* `dataset_files`: A list of `DatasetFile` objects.
* `destination`: The path to the desired local download destination (`default='.'`).

In [None]:
print('The first file in the dataset is "{}"'.format(dataset_files[0].path))

# Download all files, preserving the same file organization
data_client.download_files(dataset_files=dataset_files, 
                           destination='./downloads/')

### Retrieving data: PIF retrieval
Another way to retrieve data is to request the contents of a single PIF record in JSON format. The `get_pif()` method takes in the following parameters and returns a [pypif](https://github.com/CitrineInformatics/pypif) [`pif`](http://citrineinformatics.github.io/pif-documentation/schema_definition/index.html) object.
* `dataset_id`: The integer value of the ID of the dataset that you're retrieving data from.
* `uid`: A string representing the uid of the PIF to retrieve.
* `dataset_version`: The integer value of the version number of the dataset to retrieve the PIF from (`default=None`).

*Note*: Because the `uid` is only revealed through the web UI and [`SearchClient`](http://citrineinformatics.github.io/python-citrination-client/tutorial/search_examples.html), `get_pif()` is not commonly used when working solely with the `DataClient`.

In [None]:
pif_uid = 'test_uid'  # this UID was set in the PIF
my_pif = data_client.get_pif(dataset_id=dataset_id, 
                             uid=pif_uid)
print('The chemical formula of this PIF is {}.'.format(my_pif.chemical_formula))
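
For orientation, the sketch below shows what a minimal PIF record with a `uid` could look like as raw JSON, read with the standard `json` module. The record contents, including the formula, are hypothetical and are not the actual `test_pif.json` used in this tutorial. Note that the JSON key is `chemicalFormula`, whereas the pypif object above exposes it as `chemical_formula`.

```python
import json

# Hypothetical contents of a minimal PIF file; the formula is an assumption
pif_text = """
{
  "category": "system.chemical",
  "uid": "test_uid",
  "chemicalFormula": "NaCl"
}
"""

record = json.loads(pif_text)
uid = record["uid"]
formula = record["chemicalFormula"]
print("PIF {} has chemical formula {}.".format(uid, formula))
```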

### Modify a dataset
You can easily modify datasets on Citrination with the `update_dataset()` method. It takes as inputs:
* `dataset_id`: The integer value of the ID of the dataset that you're updating.
* `name`: A string for the new name of the dataset (`default=None`).
* `description`: A string for the new description of the dataset (`default=None`).
* `public`: A Boolean indicating whether the dataset should be public (`default=None`).

In [None]:
new_name = 'PyCC Dataset New Name ' + str(uuid4())[:6]
public_flag = False
new_dataset = data_client.update_dataset(dataset_id=dataset_id, 
                                         name=new_name, 
                                         public=public_flag)
print('Dataset {} is now named "{}."'.format(dataset_id, new_dataset.name))

If you just want to see all the files in a dataset that match a particular pattern, you can use the `list_files()` method. It takes in the first three arguments of the `get_dataset_files()` method and returns a list of file paths.

In [None]:
print('Files list: {0}.'.format(data_client.list_files(dataset_id=dataset_id, 
                                                       glob='.')))

The `create_dataset_version()` method of the `DataClient` creates a new version of a dataset. Note that creating a new version deletes all records from the old version, so handle with care!

In [None]:
dataset_version = data_client.create_dataset_version(dataset_id=dataset_id)
print('Dataset {} is now version {}.'.format(dataset_id, dataset_version.number))

### Delete a dataset

Finally, if you wish to delete a dataset that you own, you can use the `delete_dataset()` method of the `DataClient`. As this is a permanent deletion, please handle with care!

In [None]:
# data_client.delete_dataset(dataset_id=dataset_id)
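
Because deletion is permanent, it can help to put a small guard in front of the call. The sketch below only calls `delete_dataset()` when the caller re-types the dataset ID. The helper name `delete_dataset_if_confirmed` and its `typed_id` parameter are hypothetical and not part of the PyCC; the helper works with any object that offers a `delete_dataset(dataset_id=...)` method.

```python
def delete_dataset_if_confirmed(data_client, dataset_id, typed_id):
    """Call delete_dataset() only when typed_id matches dataset_id exactly."""
    if str(typed_id).strip() != str(dataset_id):
        return False  # IDs differ, so nothing is deleted
    data_client.delete_dataset(dataset_id=dataset_id)
    return True


# Hypothetical usage in an interactive session:
# delete_dataset_if_confirmed(data_client, dataset_id, input('Type the dataset ID to confirm: '))
```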

## Conclusion

[Back to ToC](#Table-of-contents)

To recap, this notebook went through the steps for managing data on Citrination using the `DataClient`. The topics covered included:
* How to properly initialize the Python Citrination Client with your API key.
* How to create a new dataset.
* How to upload data to the dataset.
* How to retrieve data from the dataset.
* How to modify the properties of the dataset.
* How to delete a dataset.

## Additional resources

[Back to ToC](#Table-of-contents)

It's now possible to conduct the major aspects of the Citrination workflow through the API, which should increase the speed and flexibility of informatics approaches. Some other topics that might interest you include:
* [DataViewsClient](http://citrineinformatics.github.io/python-citrination-client/tutorial/view_examples.html) - This allows you to build views (i.e. train ML models) using the API.
  * There is also a corresponding [tutorial](2_data_views_client_api_tutorial.ipynb).
* [SearchClient](http://citrineinformatics.github.io/python-citrination-client/tutorial/search_examples.html) - This gives you a flexible and fast way to access PIF data on Citrination.
  * There is also a corresponding [tutorial](4_search_client_api_tutorial.ipynb).