<img src="../_static/DREGS_logo.png" width="300"/>

# Working with pipeline datasets

This tutorial focuses on how to register data into `DREGS` from a
complete end-to-end pipeline. A "pipeline" in this context is any collection of
datasets that are inter-dependent, i.e., the output data from one process feeds
into the next process as its starting point. For example, a pipeline could
start with some raw imagery from a telescope; this raw imagery is then reduced
and fed into a piece of software that outputs a human-friendly value-added
catalog. Or a pipeline could come from a numerical simulation, starting with the
simulation's initial conditions, which feed into an N-body code, whose output
then feeds into a structure finder and is reduced to a halo catalog.

In the DESC data registry nomenclature, each stage of a pipeline is an
"**execution**", the data product(s) produced during each execution are "**datasets**",
and executions are linked to one another via "**dependencies**".

### What we cover in this tutorial

In this tutorial we will learn how to:

- Register a series of dependent datasets from a pipeline into DREGS

### Before we begin

If you haven't done so already, check out the [getting set up](file:///home/mcalpine/Documents/dataregistry/dataregistry/docs/build/html/tutorial_setup.html) page from the docs if you want to run this tutorial interactively.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [11]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)

Working with dataregistry version: 0.2.2
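
If the import fails, the traceback can be noisy. As a minimal sketch (the helper name `is_installed` is hypothetical and not part of `dataregistry`), the standard library can be used to check whether the package is available before importing it:

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("dataregistry"):
    import dataregistry

    print("Working with dataregistry version:", dataregistry.__version__)
else:
    print("dataregistry is not installed; see the setup page before continuing.")
```

This only confirms that the package can be found; it does not check that a database connection can be made.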


## A pipeline example

For this example we have a pipeline comprising three stages.

In the first stage three datasets are produced (a directory structure and two
individual files). The data output from the first stage
feeds into the second stage as input, which in turn produces its own output (in
this case a directory structure). Finally, the output data from stage two is
fed into the third stage as input and produces its own output dataset directory
structure. Thus our three stages have a simple sequential linking structure;
`Stage1 -> Stage2` and `Stage2 -> Stage3`.

Below is a graphical representation of the setup.

<img src="../_static/pipeline_example.png" width="600"/>
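
The same structure can be written down as plain Python data, which makes the stage-to-stage links explicit. This is only an illustrative sketch: the stage and dataset labels are chosen for this example, and nothing here touches the registry.

```python
# Illustrative labels for the stages and datasets in the figure above.
# Each stage lists the datasets it reads and the datasets it writes.
pipeline = {
    "Stage1": {
        "inputs": [],
        "outputs": ["Dataset 1.1", "Dataset 1.2", "Dataset 1.3"],
    },
    "Stage2": {
        "inputs": ["Dataset 1.1", "Dataset 1.2", "Dataset 1.3"],
        "outputs": ["Dataset 2.1"],
    },
    "Stage3": {
        "inputs": ["Dataset 2.1"],
        "outputs": ["Dataset 3.1"],
    },
}

# Map every dataset to the stage that produced it.
produced_by = {
    dataset: stage
    for stage, io in pipeline.items()
    for dataset in io["outputs"]
}

# A stage depends on every stage that produced one of its inputs.
links = sorted(
    {
        (produced_by[dataset], stage)
        for stage, io in pipeline.items()
        for dataset in io["inputs"]
    }
)

print(len(produced_by), "datasets")
for upstream, downstream in links:
    print(f"{upstream} -> {downstream}")
```

Five datasets and two links come out of this description, matching the figure.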

How then would we go about inputting the five datasets from this pipeline into the DESC data registry?

### Connect to the database

To begin we need to get set up by importing the `DREGS` class and connecting to the database (see the "Getting started with DREGS" tutorial for a bit more detail about this stage).

In [2]:
from dataregistry import DREGS

# Establish connection to database (using defaults) 
dregs = DREGS()

### Register the executions and datasets with DREGS

Now we can enter our database entries, starting with an `execution` entry to
represent the first stage of our pipeline.

In [3]:
ex1_id = dregs.Registrar.register_execution(
   "pipeline-stage-1",
   description="The first stage of my pipeline",
)

Here `ex1_id` is the `DREGS` index for this execution, which we will reference later.

Next, we register the datasets associated with the output of
`pipeline-stage-1`. By default (as we have not specified
otherwise) each dataset will be entered with `owner=$USER` and `owner_type=user`. Note that we mark them as "dummy" datasets, which means that no data is copied; only a database entry is created.

In [4]:
dataset_id1 = dregs.Registrar.register_dataset(
   "dregs_pipeline_tutorial/dataset_1p1/",
   "0.0.1",
   description="A directory structure output from pipeline stage 1",
   old_location="/somewhere/on/machine/my-dataset/",
   execution_id=ex1_id,
   name="Dataset 1.1",
   is_overwritable=True,
   is_dummy=True
)

dataset_id2 = dregs.Registrar.register_dataset(
   "dregs_pipeline_tutorial/dataset_1p2.db",
   "0.0.1",
   description="A file output from pipeline stage 1",
   old_location="/somewhere/on/machine/other-datasets/database.db",
   execution_id=ex1_id,
   name="Dataset 1.2",
   is_overwritable=True,
   is_dummy=True
)

dataset_id3 = dregs.Registrar.register_dataset(
   "dregs_pipeline_tutorial/dataset_1p3.hdf5",
   "0.0.1",
   description="Another file output from pipeline stage 1",
   old_location="/somewhere/on/machine/other-datasets/info.hdf5",
   execution_id=ex1_id,
   name="Dataset 1.3",
   is_overwritable=True,
   is_dummy=True
)

Now we register the `execution` for stage two of our pipeline. Because we pass
the stage-one datasets as `input_datasets`, this will automatically generate a
dependency between the two executions.

In [5]:
ex2_id = dregs.Registrar.register_execution(
   "pipeline-stage-2",
   description="The second stage of my pipeline",
   input_datasets=[dataset_id1, dataset_id2, dataset_id3],
)

To finish, we repeat the process for the remaining datasets and the
remaining execution.

In [6]:
dataset_id4 = dregs.Registrar.register_dataset(
    "dregs_pipeline_tutorial/dataset_2p1",
    "0.0.1",
    description="A directory structure output from pipeline stage 2",
    old_location="/somewhere/on/machine/my-second-dataset/",
    execution_id=ex2_id,
    name="Dataset 2.1",
    is_overwritable=True,
    is_dummy=True
)

ex3_id = dregs.Registrar.register_execution(
    "pipeline-stage-3",
    description="The third stage of my pipeline",
    input_datasets=[dataset_id4],
)

dataset_id5 = dregs.Registrar.register_dataset(
    "dregs_pipeline_tutorial/dataset_3p1",
    "0.0.1",
    description="A directory structure output from pipeline stage 3",
    old_location="/somewhere/on/machine/my-third-dataset/",
    execution_id=ex3_id,
    name="Dataset 3.1",
    is_overwritable=True,
    is_dummy=True
)
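
The registration calls above have to happen in a particular order: an execution must exist before its output datasets can reference it through `execution_id`, and those datasets must exist before a later execution can list them in `input_datasets`. As a sketch of that ordering constraint (using the standard-library `graphlib` module, Python 3.9+, and reusing the variable names above purely as string labels), a valid order can be computed like this:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key must be registered after everything in its set.
# The strings mirror the variable names used in the cells above.
must_follow = {
    "dataset_id1": {"ex1_id"},
    "dataset_id2": {"ex1_id"},
    "dataset_id3": {"ex1_id"},
    "ex2_id": {"dataset_id1", "dataset_id2", "dataset_id3"},
    "dataset_id4": {"ex2_id"},
    "ex3_id": {"dataset_id4"},
    "dataset_id5": {"ex3_id"},
}

# static_order() yields the nodes so that predecessors always come first.
order = list(TopologicalSorter(must_follow).static_order())
print(order)
```

The three stage-one datasets may appear in any relative order, but every other entry has a fixed position, which is the sequence followed in this tutorial.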

## Querying a pipeline dataset

Queries are built from one or more filters, so we start by creating a filter on the dataset name.

In [7]:
# Create a filter that queries on the dataset name
f = dregs.Query.gen_filter('dataset.name', '==', 'my_desc_dataset')

As in SQL, column names can be given either with or without their table-name prefix, for example `name` rather than `dataset.name`. However, the unprefixed form is only valid if the column name is unique across all tables, which `name` is not. We recommend always being explicit and including the table name in filters.
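
To see why `name` on its own is ambiguous, the sketch below groups a handful of qualified column names by their bare column name. The names are copied from the `Query.get_all_columns()` output shown later in this tutorial; the helper `is_ambiguous` is hypothetical and is not part of `DREGS`.

```python
from collections import defaultdict

# A few qualified column names taken from the get_all_columns() output.
columns = [
    "dataset.dataset_id",
    "dataset.name",
    "dataset.relative_path",
    "execution.execution_id",
    "execution.name",
    "dataset_alias.dataset_id",
]

# Group the table names by bare (unprefixed) column name.
tables_by_column = defaultdict(list)
for qualified in columns:
    table, column = qualified.split(".", 1)
    tables_by_column[column].append(table)


def is_ambiguous(column: str) -> bool:
    """True if the bare column name appears in more than one table."""
    return len(tables_by_column.get(column, [])) > 1


print("name:", is_ambiguous("name"))
print("relative_path:", is_ambiguous("relative_path"))
```

Here `name` appears in both the `dataset` and `execution` tables, whereas `relative_path` appears only in `dataset`.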

Now we can pass this filter through to a query using the `Query` extension of the `DREGS` class, e.g.,

In [8]:
# Query the database
results = dregs.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.relative_path'], [f])

This takes a list of the column names we want returned and a list of filter objects for the query.

A SQLAlchemy object is returned; we can look at the results like so:

In [9]:
for r in results:
    print(r)

(5, 'my_desc_dataset', 'my_desc_project/my_desc_dataset')
(6, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(7, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(8, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(9, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(18, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(10, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(12, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(14, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(16, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(17, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
(19, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
(11, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
(13, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
(15, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
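
Each printed row holds the requested columns in the order they were asked for, so the results can be post-processed with ordinary Python. As a minimal sketch, assuming each row unpacks like a tuple of `(dataset_id, name, relative_path)`, the snippet below groups a few rows copied from the output above by their relative path:

```python
from collections import defaultdict

# A few rows copied from the output above: (dataset_id, name, relative_path).
rows = [
    (5, "my_desc_dataset", "my_desc_project/my_desc_dataset"),
    (6, "my_desc_dataset", "dregs_nersc_tutorial/my_desc_dataset"),
    (7, "my_desc_dataset", "dregs_nersc_tutorial/my_desc_dataset"),
    (17, "my_desc_dataset", "dregs_nersc_tutorial/my_updated_desc_dataset"),
]

# Collect the dataset IDs that share each relative path.
ids_by_path = defaultdict(list)
for dataset_id, _name, relative_path in rows:
    ids_by_path[relative_path].append(dataset_id)

for relative_path, dataset_ids in ids_by_path.items():
    print(relative_path, sorted(dataset_ids))
```

If the rows are needed more than once, it may be safest to copy them into a list first (`rows = list(results)`), since some database result objects can only be iterated a single time.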


To get a list of all columns in the database, along with what table they belong to, you can use the `Query.get_all_columns()` function, i.e.,

In [10]:
print(dregs.Query.get_all_columns())

['dataset.dataset_id', 'dataset.name', 'dataset.relative_path', 'dataset.version_major', 'dataset.version_minor', 'dataset.version_patch', 'dataset.version_string', 'dataset.version_suffix', 'dataset.dataset_creation_date', 'dataset.is_archived', 'dataset.is_external_link', 'dataset.is_overwritable', 'dataset.is_overwritten', 'dataset.is_valid', 'dataset.register_date', 'dataset.creator_uid', 'dataset.access_API', 'dataset.execution_id', 'dataset.description', 'dataset.owner_type', 'dataset.owner', 'dataset.data_org', 'dataset.nfiles', 'dataset.total_disk_space', 'execution.execution_id', 'execution.description', 'execution.register_date', 'execution.execution_start', 'execution.name', 'execution.locale', 'execution.configuration', 'execution.creator_uid', 'dataset_alias.dataset_alias_id', 'dataset_alias.alias', 'dataset_alias.dataset_id', 'dataset_alias.supersede_date', 'dataset_alias.register_date', 'dataset_alias.creator_uid', 'dependency.dependency_id', 'dependency.register_date', 