<div style="overflow: hidden;">
    <img src="images/DREGS_logo_v2.png" width="300" style="float: left; margin-right: 10px;">
</div>

# Getting started with the DESC data registry (part 2)

Here we continue our getting started tutorial, introducing "executions" and "dependencies".

### What we cover in this tutorial

In this tutorial we will learn how to:

- Create a new execution and assign datasets to it
- Connect executions through dependencies

### Before we begin

If you want to run this tutorial interactively and haven't done so already, check out the [getting set up](https://lsstdesc.org/dataregistry/tutorial_setup.html) page in the documentation.

A quick way to check that everything is set up correctly is to run the first cell below, which should load the `dataregistry` package and print the package version.

In [None]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)

## Executions

In `dataregistry` nomenclature, a grouping of datasets is an "execution". For example, when multiple datasets are produced by a DESC pipeline stage or a numerical simulation, that parent pipeline stage or simulation is the "execution", and the individual datasets are associated with it as child members.

Executions have their own entries and associated metadata in the data registry. An execution entry must be registered first; then, during dataset registration, we can associate the datasets with their parent execution entry.

By default, if a dataset is not assigned a user-created execution during registration, a standalone execution is generated for it automatically. Therefore, if your execution produces a single output (i.e., the dataset you are registering), you do not need to create a separate execution entry for it. Every dataset must have an execution because dependencies, which we cover later in the tutorial, are linked through executions.

To create an execution we do the following:

In [None]:
from dataregistry import DataRegistry

# Establish connection to database (using defaults)
datareg = DataRegistry()

# Register a new execution
ex1_id = datareg.Registrar.register_execution(
   "pipeline-stage-1",
   description="The first stage of my pipeline",
)

Here `ex1_id` is the `DataRegistry` index for this execution, which we will need in order to associate datasets with it.

All executions require a `name`, in our case "pipeline-stage-1". We have also provided an optional description. For a full list of metadata options for executions see the reference documentation [here](https://lsstdesc.org/dataregistry/reference_python.html).

Now when we register a new dataset relating to this execution we just need to provide the execution ID, e.g.,

In [None]:
dataset_id1 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p1/",
   "0.0.1",
   description="A directory structure output from pipeline stage 1",
   old_location="/somewhere/on/machine/my-dataset/",
   execution_id=ex1_id,
   name="Dataset 1.1",
   is_overwritable=True,
   is_dummy=True
)

This is largely the same as registering a dataset in the previous tutorial; however, we now manually specify the parent execution (`execution_id=ex1_id`).

Note that `is_dummy=True` is a flag that ignores the data at `old_location` (i.e., nothing is copied) and just creates an entry in the database. It is for testing purposes only.
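To picture the rule that every dataset ends up with a parent execution, here is a minimal toy model in plain Python. It is only a sketch and not how `dataregistry` is implemented: the dictionaries, the two `toy_` functions, the auto-generated execution name, and the `standalone_dataset.txt` path are all hypothetical and exist purely for illustration.

```python
import itertools

# Toy in-memory stand-in for the registry tables (NOT the dataregistry implementation)
toy_execution_ids = itertools.count(1)
toy_dataset_ids = itertools.count(1)
toy_executions = {}  # execution_id -> execution name
toy_datasets = {}    # dataset_id -> (relative_path, execution_id)


def toy_register_execution(name):
    """Create an execution entry and return its ID."""
    execution_id = next(toy_execution_ids)
    toy_executions[execution_id] = name
    return execution_id


def toy_register_dataset(relative_path, execution_id=None):
    """Create a dataset entry, generating a standalone execution if none is given."""
    if execution_id is None:
        # Hypothetical naming scheme for the automatically generated execution
        execution_id = toy_register_execution(f"auto-execution-for-{relative_path}")
    dataset_id = next(toy_dataset_ids)
    toy_datasets[dataset_id] = (relative_path, execution_id)
    return dataset_id


# One dataset grouped under a user-created execution, one registered on its own
stage1 = toy_register_execution("pipeline-stage-1")
grouped = toy_register_dataset("pipeline_tutorial/dataset_1p1/", execution_id=stage1)
standalone = toy_register_dataset("pipeline_tutorial/standalone_dataset.txt")

# Both datasets have a parent execution
print(toy_executions[toy_datasets[grouped][1]])
print(toy_executions[toy_datasets[standalone][1]])
```

In this sketch the standalone dataset still points at an execution entry, which is what later allows it to take part in dependencies.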

## Dependencies

Executions represent groupings of datasets connected to a single "run" (such as the output from a single stage of a DESC pipeline). Datasets can also be linked to one another via "dependencies", which mark a dataset as a precursor within the scope of a larger pipeline. Dependencies between datasets are created through their executions. Note, however, that not every dataset in the precursor execution has to be a dependency of the datasets in the following execution.

As with executions, dependencies have their own entries in the data registry. However, they are generated automatically when executions are registered (via the `input_datasets` option), so the user never needs to create dependencies directly.
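As a rough mental model of dependencies being created as a side effect of registering an execution, consider the toy class below. It is a sketch under stated assumptions: `ToyRegistry`, its fields, and the dataset ID `101` are hypothetical, and the real registry's schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory registry; not the dataregistry implementation."""

    executions: dict = field(default_factory=dict)    # execution_id -> name
    dependencies: list = field(default_factory=list)  # (input_dataset_id, execution_id)

    def register_execution(self, name, input_datasets=()):
        """Register an execution and record one dependency per input dataset."""
        execution_id = len(self.executions) + 1
        self.executions[execution_id] = name
        for dataset_id in input_datasets:
            self.dependencies.append((dataset_id, execution_id))
        return execution_id


toy = ToyRegistry()
toy_ex1 = toy.register_execution("pipeline-stage-1")
# 101 is a made-up ID standing in for a dataset produced by the first stage
toy_ex2 = toy.register_execution("pipeline-stage-2", input_datasets=[101])

print(toy.dependencies)
```

The caller only ever passes `input_datasets`; the dependency records themselves are bookkeeping handled inside `register_execution`.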

Take for example this simple pipeline, with three stages. Dataset 1.1 created from the first execution is a precursor dataset to the second execution, and Dataset 2.1 is a precursor dataset to the third execution:

<div style="overflow: hidden;">
    <img src="images/pipeline_example.png" width="800" style="float: left; margin-right: 10px;">
</div>

If the DESC CO Group wanted to enter this pipeline into the data registry, they would do it like so:

In [None]:
from dataregistry import DataRegistry

# Establish connection to database, setting a default owner and owner_type for all registered datasets in this instance.
datareg = DataRegistry(owner="DESC CO Group", owner_type="group")

# Create execution for first pipeline stage
ex1_id = datareg.Registrar.register_execution(
   "pipeline-stage-1"
)

# Register datasets with first pipeline stage.
dataset_id1 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p1/",
   "0.0.1",
   execution_id=ex1_id,
   is_overwritable=True,
   is_dummy=True
)

dataset_id2 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p2.db",
   "0.0.1",
   execution_id=ex1_id,
   is_overwritable=True,
   is_dummy=True
)

dataset_id3 = datareg.Registrar.register_dataset(
   "pipeline_tutorial/dataset_1p3.hdf5",
   "0.0.1",
   execution_id=ex1_id,
   is_overwritable=True,
   is_dummy=True
)

# Create execution for second pipeline stage
ex2_id = datareg.Registrar.register_execution(
   "pipeline-stage-2",
   input_datasets=[dataset_id1]
)

# Register datasets with second pipeline stage
dataset_id4 = datareg.Registrar.register_dataset(
    "pipeline_tutorial/dataset_2p1",
    "0.0.1",
    execution_id=ex2_id,
    is_overwritable=True,
    is_dummy=True
)

# Create execution for third pipeline stage
ex3_id = datareg.Registrar.register_execution(
    "pipeline-stage-3",
    input_datasets=[dataset_id4],
)

# Register datasets with third pipeline stage
dataset_id5 = datareg.Registrar.register_dataset(
    "pipeline_tutorial/dataset_3p1",
    "0.0.1",
    execution_id=ex3_id,
    is_overwritable=True,
    is_dummy=True
)

For clarity, we have skipped all the optional entries, such as `description` for datasets or `execution_start` for executions. However, we recommend being as thorough as possible when registering your entries in the registry.

Note that we never explicitly created any dependencies; they are created automatically because of the lines `input_datasets=[dataset_id1]` and `input_datasets=[dataset_id4]`.

During pipeline queries, these dependencies are used internally to return all the datasets associated with a given pipeline.
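To illustrate how dependency records make such a lookup possible, the sketch below walks a toy version of the three-stage pipeline backwards from its final dataset. It assumes, as described in the text, that Dataset 1.1 feeds the second stage and Dataset 2.1 feeds the third. The integer IDs, the two dictionaries, and the `upstream_datasets` function are hypothetical and are not part of the `dataregistry` API.

```python
from collections import deque

# Toy records with made-up IDs: datasets 1-3 come from stage 1,
# dataset 4 from stage 2, and dataset 5 from stage 3.
dataset_execution = {1: "stage-1", 2: "stage-1", 3: "stage-1", 4: "stage-2", 5: "stage-3"}

# Execution -> IDs of its input datasets (the dependency records)
execution_inputs = {"stage-1": [], "stage-2": [1], "stage-3": [4]}


def upstream_datasets(dataset_id):
    """Return the IDs of every dataset the given dataset depends on, nearest first."""
    found = []
    seen = set()
    queue = deque(execution_inputs[dataset_execution[dataset_id]])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        found.append(current)
        # Follow the inputs of the execution that produced this precursor
        queue.extend(execution_inputs[dataset_execution[current]])
    return found


print(upstream_datasets(5))  # → [4, 1]
```

Datasets 2 and 3 do not appear in the result because, although they share an execution with dataset 1, they were never listed as inputs to a later stage.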

## Querying executions and pipelines