# Publishing datasets to Foundry
3/17/21

This notebook serves as a quick walkthrough for publishing datasets via Foundry.

This code requires the newest version of Foundry (v0.0.7). If you're running locally, be sure you've updated the `foundry_ml` package.


In [None]:
!pip install foundry_ml --upgrade
!pip install mdf_connect_client

In [49]:
from foundry import Foundry

f = Foundry(no_browser=True, no_local_server=True)

The following cell contains variables for all of the possible arguments you could pass to `f.publish()`, for illustrative purposes.

In [40]:
# Load the metadata that describes your dataset. It can be a hand-written dict
#   like this one, or it can be loaded from a file, etc.
example_iris_metadata = {
    "inputs":["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
    "input_descriptions":["sepal length in unit(cm)", "sepal width in unit(cm)", "petal length in unit(cm)", "petal width in unit(cm)"],
    "input_units":["cm","cm","cm","cm"],
    "outputs":["y"],
    "output_descriptions":["flower type"],
    "output_units":[],
    "output_labels":["setosa","versicolor", "virginica"],
    "short_name":"iris_example",
    "package_type":"tabular"
}

# this should be a Globus endpoint. In this example, I use a dataset on our temp
#   server. You can also make your local machine a Globus endpoint
data_source = "https://app.globus.org/file-manager?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=%2Ffoundry%2F_test_blaiszik_foundry_iris_v1.2%2F"

# full title
title = "Scourtas example iris dataset"

# authors to list 
authors = ["A Scourtas", "B Blaiszik"]

# shorthand title (optional)
short_name = "colab_example_AS_iris"

# affiliations of authors (optional)
# In the same order as authors list. If a different number of affiliations are 
#   given, all affiliations will be applied to all authors.
affiliations = ["Globus Labs, UChicago"]

# publisher of the data (optional)
# The default is MDF
publisher = "Materials Data Facility"

# publication year (optional)
# The default is the current calendar year
publication_year = 2021

We won't use all of these variables in our call to `f.publish()`, because many of the default values for the parameters (such as "MDF" for `publisher`) work well for our use case. 

However, the metadata, data source, title, and authors are all required.
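Since the metadata can also come from a file, here is a minimal sketch of loading it from JSON and running a quick consistency check before publishing. The helper names `load_metadata` and `find_length_mismatches` are hypothetical and not part of Foundry. The check itself (that `input_descriptions` and `input_units` have one entry per input) is an assumption about what well-formed metadata looks like, not a rule taken from the Foundry documentation.

```python
import json
import tempfile
from pathlib import Path


def load_metadata(path):
    """Read a metadata dict from a JSON file (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def find_length_mismatches(metadata):
    """Return a list of problems found; an empty list means none.

    Assumption: each per-input list should have one entry per input.
    """
    problems = []
    n_inputs = len(metadata.get("inputs", []))
    for key in ("input_descriptions", "input_units"):
        n = len(metadata.get(key, []))
        if n != n_inputs:
            problems.append(f"{key} has {n} entries, expected {n_inputs}")
    return problems


# A trimmed version of the iris metadata used in this notebook.
metadata = {
    "inputs": ["sepal length (cm)", "sepal width (cm)"],
    "input_descriptions": ["sepal length in unit(cm)", "sepal width in unit(cm)"],
    "input_units": ["cm", "cm"],
    "short_name": "iris_example",
    "package_type": "tabular",
}

# Round-trip through a temporary JSON file to show the loading step.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "iris_metadata.json"
    path.write_text(json.dumps(metadata), encoding="utf-8")
    loaded = load_metadata(path)

problems = find_length_mismatches(loaded)
print(problems)
```

If the list of problems is empty, the loaded dict can be passed to `f.publish()` in place of a hand-written one.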

Before publishing, you first need to join a Globus group. You can [join this group](https://app.globus.org/groups/cc192dca-3751-11e8-90c1-0a7c735d220a/about) to get started.

In [41]:
# publish to Foundry! returns a result object we can inspect
res = f.publish(example_iris_metadata, data_source, title, authors, short_name=short_name)

In [42]:
# check if publication request was valid
res['success']

True

In [43]:
res

{'error': None,
 'source_id': '_test_colab_example_iris_v1.1',
 'status_code': 202,
 'success': True}
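The result shown above is a plain dictionary, so it can be checked in code before moving on. The sketch below is one way to do that. `extract_source_id` is a hypothetical helper, and the example dictionary is copied from the output above so the snippet runs without contacting any service.

```python
def extract_source_id(res):
    """Return the source_id from a publish result, or raise if it failed.

    Hypothetical helper; it relies only on the keys visible in the
    result printed in this notebook.
    """
    if not res.get("success"):
        raise RuntimeError(
            f"Publish request failed (status {res.get('status_code')}): "
            f"{res.get('error')}"
        )
    return res["source_id"]


# Copied from the output above, so no network call is needed here.
example_res = {
    "error": None,
    "source_id": "_test_colab_example_iris_v1.1",
    "status_code": 202,
    "success": True,
}

example_source_id = extract_source_id(example_res)
print(example_source_id)
```

Raising on failure keeps a rejected request from silently flowing into later status checks.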

We can use the `source_id` from the `res` result to check the status of our submission. The `source_id` is a unique identifier based on the title and version of your dataset.

In [45]:
source_id = res['source_id']

# check the status of the submission
f.check_status(source_id=source_id)


Status of TEST submission _test_colab_example_iris_v1.1 (scourtas_colab_example_iris_publish)
Submitted by Aristana Scourtas at 2021-03-17T19:41:51.869694Z

Submission initialization has not started yet.
Cancellation of previous submissions has not started yet.
Connect data download has not started yet.
Data transfer to primary destination has not started yet.
Metadata extraction has not started yet.
Dataset curation has not started yet.
MDF Search ingestion has not started yet.
Data transfer to secondary destinations has not started yet.
MDF Publish publication has not started yet.
Citrine upload has not started yet.
Materials Resource Registration has not started yet.
Post-processing cleanup has not started yet.

This submission is still processing.



Success! The dataset is now in the curation phase, and needs to be approved by an admin/curator before publishing. You can check back in on the status using the `check_status()` function at any time.

You can pass `short=True` to `check_status()` to print a short finished/processing message, which can be useful for checking the status of many datasets at once. Pass `raw=True` to get the raw, full status result (the default, `False`, is recommended).
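As a rough sketch of the "many datasets at once" case, the loop below calls `check_status()` with `short=True` for each `source_id`. The `check_many` helper is hypothetical, and `_StubClient` is a stand-in written only so the snippet runs offline. Its printed message is made up and does not reflect real Foundry output. In practice you would pass the `f` object created earlier instead of the stub.

```python
class _StubClient:
    """Offline stand-in for a Foundry client (illustration only)."""

    def __init__(self):
        self.calls = []

    def check_status(self, source_id, short=False, raw=False):
        # Record the call rather than contacting any service.
        self.calls.append((source_id, short))
        print(f"{source_id}: stub status (short={short})")


def check_many(client, source_ids):
    """Print a short status line for each submission (hypothetical helper)."""
    for source_id in source_ids:
        client.check_status(source_id=source_id, short=True)


client = _StubClient()
check_many(
    client,
    ["_test_colab_example_iris_v1.1", "_test_colab_example_iris_v1.3"],
)
```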

To re-publish a dataset, pass `update=True` to `f.publish()`.

In [48]:
# republishing same dataset -- note the version increments automatically
res = f.publish(example_iris_metadata, data_source, title, authors, short_name=short_name, update=True)
res

{'error': None,
 'source_id': '_test_colab_example_iris_v1.3',
 'status_code': 202,
 'success': True}
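To confirm that a re-publish produced a newer version, one option is to compare the version suffixes of the two `source_id` values. The sketch below assumes that a `source_id` always ends in `_v<major>.<minor>`, which is inferred only from the two outputs in this notebook and is not a documented guarantee. `parse_version` is a hypothetical helper.

```python
import re


def parse_version(source_id):
    """Extract (major, minor) from a source_id ending in `_v<major>.<minor>`.

    Assumption: the suffix format is inferred from this notebook's outputs.
    """
    match = re.search(r"_v(\d+)\.(\d+)$", source_id)
    if match is None:
        raise ValueError(f"No version suffix found in {source_id!r}")
    return int(match.group(1)), int(match.group(2))


first = parse_version("_test_colab_example_iris_v1.1")
second = parse_version("_test_colab_example_iris_v1.3")

# Tuples compare element by element, so this checks major, then minor.
print(second > first)
```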

If you have any questions, please reach out to Aristana Scourtas (aristana [at] uchicago.edu), Ben Blaiszik (blaiszik [at] uchicago.edu), or KJ Schmidt (kjschmidt [at] uchicago.edu).