Description
We would like to make it easier for administrators of Marble nodes to load data onto their nodes. We should also define metadata requirements that should accompany all data on Marble nodes, to make the data easier to manage and search going forward.
The workflow must describe (and automate where possible) the following steps; a rough sketch of the middle steps follows the list:
- download the data from an external source
- parse the metadata from the data and translate it to a STAC entry (by applying the appropriate STAC extensions)
- organize the STAC entries into collections and describe the metadata for the whole collection
- move the raw data to a location that is served by the THREDDS server
- add the STAC entries and collections to the STAC catalog
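
A minimal sketch of the middle steps (STAC entry, collection, catalog), assuming pystac (https://github.com/stac-utils/pystac) as the tooling; every ID, URL, and metadata value below is a hypothetical placeholder, and a live node would more likely POST the records to its STAC API than write static files:

```python
# Hedged sketch: build a STAC item, group it into a collection, and save a
# catalog. All identifiers, URLs, and values are hypothetical placeholders.
from datetime import datetime

import pystac

# Parse the metadata from the data: in practice these values would come
# from e.g. the netCDF global attributes of the downloaded file.
item = pystac.Item(
    id="example-dataset-19790101",
    geometry={
        "type": "Polygon",
        "coordinates": [[[-180, -90], [180, -90], [180, 90],
                         [-180, 90], [-180, -90]]],
    },
    bbox=[-180.0, -90.0, 180.0, 90.0],
    datetime=datetime(1979, 1, 1),
    properties={},  # fields from the appropriate STAC extensions go here
)

# Point the item's asset at the raw data as served by THREDDS.
item.add_asset(
    "data",
    pystac.Asset(
        href="https://example.org/thredds/fileServer/datasets/example.nc",
        media_type="application/x-netcdf",
        roles=["data"],
    ),
)

# Organize items into a collection and describe collection-level metadata.
collection = pystac.Collection(
    id="example-collection",
    description="Hypothetical collection of daily climate data",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent([[datetime(1979, 1, 1), None]]),
    ),
)
collection.add_item(item)

# Add the collection to the catalog and write everything out.
catalog = pystac.Catalog(id="marble", description="Marble node STAC catalog")
catalog.add_child(collection)
catalog.normalize_and_save("catalog", catalog_type=pystac.CatalogType.SELF_CONTAINED)
```

Note that a THREDDS server typically exposes the same file through several access services (e.g. HTTP file download and OPeNDAP), so a real item would likely carry one asset per access method.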
Suggested steps to take for this project:
1. Research the following if you are not already familiar (a small search example follows the links):
   - THREDDS data server: https://www.unidata.ucar.edu/software/tds/
   - STAC: https://stacspec.org/en
     - Item: https://github.com/radiantearth/stac-spec/blob/master/item-spec/item-spec.md
     - Collection: https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md
     - Catalog: https://github.com/radiantearth/stac-spec/blob/master/catalog-spec/catalog-spec.md
     - API: https://github.com/radiantearth/stac-api-spec
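
   To make the API spec concrete, a hedged sketch of an item search; the root URL and collection ID are placeholders, while the `/search` endpoint and its parameters come from the item-search part of the spec linked above:

   ```python
   # Hedged sketch of a STAC API item search; api_root is hypothetical.
   import requests

   api_root = "https://example.org/stac"
   response = requests.post(
       f"{api_root}/search",
       json={
           "collections": ["example-collection"],  # hypothetical collection ID
           "bbox": [-80, 43, -79, 44],             # lon/lat bounding box
           "datetime": "1979-01-01T00:00:00Z/1980-01-01T00:00:00Z",
           "limit": 10,
       },
       timeout=30,
   )
   response.raise_for_status()
   for feature in response.json()["features"]:
       print(feature["id"])
   ```
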
2. Investigate the current data ingestion software options (a minimal download sketch follows the links):
   - scripts used to download CMIP6 data: https://github.com/DACCS-Climate/cmip6_utils
   - scripts used to download NEX_GDDP_CMIP6 data: https://github.com/DACCS-Climate/NEX_GDDP_CMIP6
   - generic download scripts: https://github.com/Ouranosinc/miranda
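
   For comparison with the repositories above, a minimal, hypothetical version of the plain-HTTP download step; production scripts additionally have to handle concerns such as authentication, retries, and checksum verification:

   ```python
   # Hedged sketch of a generic streaming download; url and dest are
   # supplied by the caller.
   import requests

   def download(url: str, dest: str) -> None:
       """Stream the file at `url` to the local path `dest`."""
       with requests.get(url, stream=True, timeout=60) as resp:
           resp.raise_for_status()
           with open(dest, "wb") as out:
               for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB
                   out.write(chunk)
   ```
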
3. Write a report (for internal use only) that describes:
   - whether each of the workflow steps can be automated with existing software or would require new software
   - the limits of the automation software for each step (what sorts of data and metadata it can handle, how efficient it is, etc.)
   - what information the tutorials should describe (especially for steps that cannot be automated)
   - what metadata should be included in each STAC entry and collection to ensure that the data is easy to search for (an illustrative example follows)
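
   As a starting point only, the kind of item properties such requirements might name; `start_datetime`, `end_datetime`, and `license` are core STAC common metadata, while the `cmip6:`-prefixed fields are hypothetical placeholders for whatever extension prefix the project adopts:

   ```python
   # Illustrative, not prescriptive: candidate searchable item properties.
   properties = {
       "start_datetime": "1979-01-01T00:00:00Z",  # core STAC common metadata
       "end_datetime": "2014-12-31T23:59:59Z",
       "license": "CC-BY-4.0",
       # Hypothetical controlled-vocabulary fields for climate model output:
       "cmip6:experiment_id": "historical",
       "cmip6:source_id": "CanESM5",
       "cmip6:variable_id": "tas",
       "cmip6:frequency": "day",
   }
   ```
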
4. Develop any additional software and wrapper scripts required to automate as much of the work as possible.
   - to be compatible with the birdhouse-deploy software, these scripts must be packaged as a Docker image
5. Write tutorials describing the suggested data ingestion workflow. These should include:
   - how to run any automated sections (with a detailed description of the configuration options)
   - how to perform any manual steps required by the workflow
6. Write a data-checking script that checks that the data on a given Marble node conforms to the metadata requirements outlined in the report from step 3 (a sketch of such a checker follows this list).
   - this script should be runnable offline and should print informative messages that help a node administrator fix their data
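
A minimal sketch of such a checker, assuming pystac and a catalog on local disk; REQUIRED_PROPERTIES is a hypothetical stand-in for whatever field list the report from step 3 fixes:

```python
# Hedged sketch of the offline metadata checker.
import sys

import pystac

REQUIRED_PROPERTIES = ["start_datetime", "end_datetime", "license"]  # placeholder

def check_catalog(path: str) -> int:
    """Return the number of items that violate the metadata requirements."""
    errors = 0
    catalog = pystac.Catalog.from_file(path)
    for collection in catalog.get_children():
        for item in collection.get_items():
            missing = [p for p in REQUIRED_PROPERTIES if p not in item.properties]
            if missing:
                errors += 1
                print(f"{collection.id}/{item.id}: missing properties: {', '.join(missing)}")
    return errors

if __name__ == "__main__":
    sys.exit(1 if check_catalog(sys.argv[1]) else 0)
```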
Deliverables:
- report
- tutorials (to be hosted on https://github.com/DACCS-Climate/marble-tutorials)
- software to automate as much of the data ingestion process as possible (packaged as a Docker image)
- software to check that all metadata conforms to the expected requirements
Participants/Roles:
- Student (TBD): research and software development
- Steve Easterbrook: consult on metadata requirements