![foundry_logo](../../assets/foundry.png)

## Publishing HDF5 data to Foundry
6/13/2022

This notebook walks through:

1. Creating metadata
2. Publishing to Foundry

This code uses the latest version of Foundry (v0.2.0). If you're running locally, be sure you've updated the `foundry_ml` package.

This notebook is set up to run as a [Google Colaboratory](https://colab.research.google.com/notebooks/intro.ipynb#scrollTo=5fCEDCU_qrC0) notebook, which allows you to run Python code in the browser, or as a [Jupyter](https://jupyter.org/) notebook, which runs locally on your machine.

For help getting started with Globus, see the [primary publishing notebook](dataset_publishing.ipynb).

### Setup

In [None]:
%pip install --upgrade --quiet foundry_ml mdf_connect_client

#### Data format and naming

Right now Foundry accepts only HDF5 files and tabular data in JSON format (either [JSON Lines](https://jsonlines.org/) or standard multiline JSON). JSON data files are better supported at the moment.
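
To make the difference between the two JSON layouts concrete, here is a minimal sketch using only the Python standard library. The records and field names are hypothetical, and the exact record layout Foundry expects is not specified here, so treat this as an illustration of the file formats rather than a Foundry requirement.

```python
import json

# Hypothetical tabular records; the field names are illustrative only.
records = [
    {"sample_id": 1, "x": 0.5, "y": 1.2},
    {"sample_id": 2, "x": 0.7, "y": 0.9},
]

# JSON Lines: one complete JSON object per line.
json_lines_text = "\n".join(json.dumps(record) for record in records)

# Standard multiline JSON: a single array spread over several lines.
multiline_text = json.dumps(records, indent=2)

# Both layouts decode back to the same list of records.
from_lines = [json.loads(line) for line in json_lines_text.splitlines()]
from_multiline = json.loads(multiline_text)
```

JSON Lines can be read one record at a time, whereas a multiline JSON array has to be parsed as a whole.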

Eventually, we will also accept tabular data in CSV format, file data (e.g. whole images, zipped archives), and other data formats.

Datasets and associated metadata should follow the example in our [gitbook](https://ai-materials-and-chemistry.gitbook.io/foundry/v/docs/concepts/foundry-datasets) and the '*metadata*' section of the publishing example below. Filenames should correspond to the 'splits' as specified in the metadata to be accepted by the system.

Your metadata can be a separate file (titled "foundry_metadata.json") or you can use a string literal (see example below under [*Arguments needed for publishing*](#arguments-needed-for-publishing)).
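
If you prefer the separate-file option, the sketch below writes a metadata dictionary to `foundry_metadata.json` with the standard library. The dictionary is abbreviated for illustration, and the assumption that the file holds the same structure as the in-notebook dictionary is ours; where the file must live relative to your data is not covered here.

```python
import json
from pathlib import Path

# Abbreviated, illustrative metadata; a real dataset needs the full set of fields.
metadata = {
    "short_name": "segmentation-dev",
    "data_type": "hdf5",
    "splits": [{"type": "train", "path": "foundry.hdf5", "label": "train"}],
}

# Write the metadata to the filename mentioned in the text above.
metadata_path = Path("foundry_metadata.json")
metadata_path.write_text(json.dumps(metadata, indent=2))

# Read it back to confirm the file contains valid JSON.
reloaded = json.loads(metadata_path.read_text())
```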

### Foundry instantiation

We need to pass `no_browser=True` and `no_local_server=True` to the constructor in order for the code to work in Jupyter notebooks or Google Colab.

In [None]:
from foundry import Foundry

f = Foundry(no_browser=True, no_local_server=True)

### Arguments needed for publishing

This section describes and defines the variables for all of the possible arguments you could pass to `f.publish()`, for illustrative purposes.

#### Metadata

The metadata describe the dataset inputs and outputs, along with the type of dataset (e.g. "tabular") and other relevant info. The metadata should meet the schema defined [here](https://github.com/materials-data-facility/data-schemas/blob/master/schemas/projects.json#L111) under "foundry".

In [None]:
example_metadata = {
    "short_name": "segmentation-dev",
    "data_type": "hdf5",
    "task_type": ["unsupervised", "segmentation"],
    "domain": ["materials science", "chemistry"],
    "n_items": 100,
    "splits": [{
        "type": "train",
        "path": "foundry.hdf5",
        "label": "train"
    }],
    "keys": [{
        "key": ["/data/arr1"],
        "type": "input",
        "description": "input, unlabeled images"
    }, {
        "key": ["/data/arr2"],
        "type": "input",
        "description": "another array containing input data"
    }, {
        "key": ["/data/labeled"],
        "type": "target",
        "description": "target, labeled images"
    }]
}
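
Before publishing, it can help to sanity-check the metadata locally. The helper below is a hypothetical convenience function, not part of `foundry_ml`; the list of expected fields and the allowed key types (`"input"`, `"target"`) are assumptions inferred from the example above, not taken from the official schema.

```python
def check_metadata(metadata):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    # Assumed set of top-level fields, based on the example metadata above.
    for field in ("short_name", "data_type", "splits", "keys"):
        if field not in metadata:
            problems.append(f"missing field: {field}")
    # Every split should name the file it refers to.
    for split in metadata.get("splits", []):
        if not split.get("path"):
            problems.append(f"split without a path: {split}")
    # Every key entry should be a list of names with a recognised type.
    for entry in metadata.get("keys", []):
        if entry.get("type") not in ("input", "target"):
            problems.append(f"unexpected key type: {entry.get('type')}")
        if not isinstance(entry.get("key"), list):
            problems.append(f"'key' should be a list: {entry.get('key')}")
    return problems


# A small sample in the same shape as the example metadata.
sample_metadata = {
    "short_name": "segmentation-dev",
    "data_type": "hdf5",
    "splits": [{"type": "train", "path": "foundry.hdf5", "label": "train"}],
    "keys": [{"key": ["/data/arr1"], "type": "input", "description": "images"}],
}
sample_problems = check_metadata(sample_metadata)
```

An empty list means none of these basic checks failed; it does not guarantee that the metadata passes the full schema validation.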

In [None]:
f.list()

In [None]:
f.load("foundry_wei_atom_locating_benchmark_v1.1", globus=False)