<img src="../_static/DREGS_logo_v2.png" width="300"/>

# Working with pipeline datasets

This tutorial focuses on how to register data from a complete end-to-end
pipeline into the data registry. A "pipeline" in this context is any collection
of datasets that are inter-dependent, i.e., the output data from one process
feeds into the next process as its starting point. For example, a pipeline could
start with raw imagery from a telescope, which is then reduced and fed into a
piece of software that outputs a human-friendly value-added catalog. Or a
pipeline could come from a numerical simulation, starting with the simulation's
initial conditions, which feed into an N-body code, whose output is passed to a
structure finder and reduced to a halo catalog.

In the DESC data registry nomenclature, each stage of a pipeline is an
"**execution**", the data product(s) produced during each execution are "**datasets**",
and executions are linked to one another via "**dependencies**".
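
To make the three terms concrete, here is a minimal sketch in plain Python. It is a toy model, not the `dataregistry` schema or API: the `Execution` and `Dataset` classes, the `dependencies` helper, and all the names used are hypothetical and exist only for illustration. It follows the simulation example above.

```python
from dataclasses import dataclass, field


@dataclass
class Execution:
    name: str
    # Names of the datasets this execution consumes (hypothetical field).
    input_datasets: list = field(default_factory=list)


@dataclass
class Dataset:
    name: str
    execution: str  # name of the execution that produced this dataset


def dependencies(executions, datasets):
    """Return sorted (upstream, downstream) pairs of execution names."""
    producer = {ds.name: ds.execution for ds in datasets}
    links = set()
    for ex in executions:
        for ds_name in ex.input_datasets:
            # An execution depends on whichever execution produced its input.
            links.add((producer[ds_name], ex.name))
    return sorted(links)


executions = [
    Execution("generate-ics"),
    Execution("n-body", input_datasets=["initial_conditions"]),
    Execution("structure-finder", input_datasets=["snapshots"]),
]
datasets = [
    Dataset("initial_conditions", execution="generate-ics"),
    Dataset("snapshots", execution="n-body"),
    Dataset("halo_catalog", execution="structure-finder"),
]

links = dependencies(executions, datasets)
for upstream, downstream in links:
    print(f"{upstream} -> {downstream}")
```

In this toy model the dependencies are not stored separately; they follow from which execution produced each input dataset.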

### What we cover in this tutorial

In this tutorial we will learn how to:

- Register a series of dependent datasets from a pipeline into the data registry.

### Before we begin

If you want to run this tutorial interactively and haven't done so already, check out the getting setup page in the docs.

A quick way to check that everything is set up correctly is to run the first cell below, which should load the `dataregistry` package and print the package version.

In [None]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)
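
If the import above fails, a check along these lines can tell you whether the package is visible to the current Python environment at all. This is a sketch using only the standard library; the helper name `package_available` is hypothetical.

```python
import importlib.util


def package_available(name):
    """Return True if the import system can locate a top-level package."""
    return importlib.util.find_spec(name) is not None


if package_available("dataregistry"):
    print("dataregistry is importable")
else:
    print("dataregistry not found; check that the right environment is active")
```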

# A pipeline example

For this example we have a pipeline comprising three stages.

In the first stage, three datasets are produced (a directory structure and two
individual files). The data output from the first stage feeds into the second
stage as input, which in turn produces its own output (in this case a directory
structure). Finally, the output data from stage two is fed into the third stage
as input, which produces its own output dataset directory structure. Thus our
three stages have a simple sequential linking structure:
`Stage1 -> Stage2` and `Stage2 -> Stage3`.
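
The linking structure just described can also be written down as plain Python data. The sketch below is illustrative and independent of `dataregistry`; the dictionary layout is an assumption made for this example, and `graphlib` requires Python 3.9 or later. It counts the datasets and derives an order in which the stages can be processed.

```python
from graphlib import TopologicalSorter

# Outputs of each stage, using the dataset names from this tutorial.
stage_outputs = {
    "Stage1": ["Dataset 1.1", "Dataset 1.2", "Dataset 1.3"],
    "Stage2": ["Dataset 2.1"],
    "Stage3": ["Dataset 3.1"],
}

# Each stage mapped to the set of stages whose output it consumes.
upstream = {
    "Stage1": set(),
    "Stage2": {"Stage1"},
    "Stage3": {"Stage2"},
}

n_datasets = sum(len(outputs) for outputs in stage_outputs.values())
order = list(TopologicalSorter(upstream).static_order())

print(n_datasets)  # 5
print(order)       # ['Stage1', 'Stage2', 'Stage3']
```

Registering in this order means every stage's inputs already exist by the time the stage itself is registered.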

Below is a graphical representation of the setup.

<img src="../_static/pipeline_example.png" width="600"/>

How then would we go about inputting the five datasets from this pipeline into the DESC data registry?

### Connect to the data registry

To begin, we establish a link to the data registry using the `DataRegistry` class (the Getting Started tutorial goes through this step in more detail).

Note that we set a global `owner` and `owner_type` here, which will be inherited automatically during each `register_dataset` call.

In [None]:
from dataregistry import DataRegistry

# Establish connection to database, setting a default owner and owner_type for all registered datasets in this instance.
datareg = DataRegistry(owner="DESC CO Group", owner_type="group")

### Register the executions and datasets with DataRegistry

Now we can create our database entries, starting with an `execution` entry to
represent the first stage of our pipeline.

In [None]:
ex1_id = datareg.Registrar.register_execution(
   "pipeline-stage-1",
   description="The first stage of my pipeline",
)

Here `ex1_id` is the data registry's index for this execution, which we will reference later.

Next, we register the datasets associated with the output of `pipeline-stage-1`. Note that we mark them as "dummy" datasets, which means that no data is copied (the data does not even need to exist); only a database entry is created.

In [None]:
dataset_id1 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p1/",
   "0.0.1",
   description="A directory structure output from pipeline stage 1",
   old_location="/somewhere/on/machine/my-dataset/",
   execution_id=ex1_id,
   name="Dataset 1.1",
   is_overwritable=True,
   is_dummy=True
)

dataset_id2 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p2.db",
   "0.0.1",
   description="A file output from pipeline stage 1",
   old_location="/somewhere/on/machine/other-datasets/database.db",
   execution_id=ex1_id,
   name="Dataset 1.2",
   is_overwritable=True,
   is_dummy=True
)

dataset_id3 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p3.hdf5",
   "0.0.1",
   description="Another file output from pipeline stage 1",
   old_location="/somewhere/on/machine/other-datasets/info.hdf5",
   execution_id=ex1_id,
   name="Dataset 1.3",
   is_overwritable=True,
   is_dummy=True
)
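
The three calls above differ only in their path, description, old location, and name, so they could equally be driven from a list. The sketch below is one possible way to do that, not a feature of `dataregistry`: the helper `register_stage_outputs`, the dictionary keys, and the stand-in `fake_register_dataset` are all hypothetical. The stand-in only records calls and returns made-up IDs so that the example runs without a database.

```python
def register_stage_outputs(register_dataset, execution_id, outputs, version="0.0.1"):
    """Register every output of one stage and return the resulting IDs.

    `register_dataset` is any callable that accepts the same arguments as
    the `register_dataset` calls used in this tutorial.
    """
    dataset_ids = []
    for out in outputs:
        dataset_id = register_dataset(
            out["relative_path"],
            version,
            description=out["description"],
            old_location=out["old_location"],
            execution_id=execution_id,
            name=out["name"],
            is_overwritable=True,
            is_dummy=True,
        )
        dataset_ids.append(dataset_id)
    return dataset_ids


# Stand-in for demonstration only: records each call and hands out fake IDs.
calls = []


def fake_register_dataset(relative_path, version, **kwargs):
    calls.append((relative_path, version, kwargs))
    return len(calls)


stage1_outputs = [
    {
        "relative_path": "pipeline_tutorial/dataset_1p1/",
        "description": "A directory structure output from pipeline stage 1",
        "old_location": "/somewhere/on/machine/my-dataset/",
        "name": "Dataset 1.1",
    },
    {
        "relative_path": "pipeline_tutorial/dataset_1p2.db",
        "description": "A file output from pipeline stage 1",
        "old_location": "/somewhere/on/machine/other-datasets/database.db",
        "name": "Dataset 1.2",
    },
    {
        "relative_path": "pipeline_tutorial/dataset_1p3.hdf5",
        "description": "Another file output from pipeline stage 1",
        "old_location": "/somewhere/on/machine/other-datasets/info.hdf5",
        "name": "Dataset 1.3",
    },
]

stage1_ids = register_stage_outputs(fake_register_dataset, 1, stage1_outputs)
print(stage1_ids)  # [1, 2, 3] with the stand-in above
```

In a live session you would pass `datareg.Registrar.register_dataset` and `ex1_id` in place of the stand-in and the literal `1`.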

Now we register the `execution` for stage two of our pipeline. Because we pass
the stage-one datasets as `input_datasets`, this will automatically generate a
dependency between the two executions.

In [None]:
ex2_id = datareg.Registrar.register_execution(
   "pipeline-stage-2",
   description="The second stage of my pipeline",
   input_datasets=[dataset_id1, dataset_id2, dataset_id3],
)

To finish, we repeat the process for the remaining datasets and the
remaining execution.

In [None]:
dataset_id4 = datareg.Registrar.register_dataset(
    "pipeline_tutorial/dataset_2p1",
    "0.0.1",
    description="A directory structure output from pipeline stage 2",
    old_location="/somewhere/on/machine/my-second-dataset/",
    execution_id=ex2_id,
    name="Dataset 2.1",
    is_overwritable=True,
    is_dummy=True
)

ex3_id = datareg.Registrar.register_execution(
    "pipeline-stage-3",
    description="The third stage of my pipeline",
    input_datasets=[dataset_id4],
)

dataset_id5 = datareg.Registrar.register_dataset(
    "pipeline_tutorial/dataset_3p1",
    "0.0.1",
    description="A directory structure output from pipeline stage 3",
    old_location="/somewhere/on/machine/my-third-dataset/",
    execution_id=ex3_id,
    name="Dataset 3.1",
    is_overwritable=True,
    is_dummy=True
)

# Querying a pipeline dataset

Coming soon