<img src="../_static/DREGS_logo.png" width="300"/>

# Working with pipeline datasets

This tutorial focuses on how to register data into `DREGS` from a
complete end-to-end pipeline. A "pipeline" in this context is any collection of
datasets that are inter-dependent, i.e., the output data from one process feeds
into the next process as its starting point. For example, a pipeline could
start with some raw imagery from a telescope, which is then reduced and fed
into a piece of software that outputs a human-friendly value-added catalog.
Or a pipeline could come from a numerical simulation, starting with the
simulation's initial conditions, which feed into an N-body code, whose output
in turn feeds into a structure finder and is reduced to a halo catalog.

In the DESC data registry nomenclature, each stage of a pipeline is an
"execution", the data product(s) produced during each execution are "datasets",
and executions are linked to one another via "dependencies".

### What we cover in this tutorial

In this tutorial we will learn how to:

- Register a series of dependent datasets from a pipeline into DREGS

### Before we begin

If you want to run this tutorial interactively and haven't done so already, check out the [getting set up](file:///home/mcalpine/Documents/dataregistry/dataregistry/docs/build/html/tutorial_setup.html) page from the docs.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [1]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)

Working with dataregistry version: 0.2.2


## A pipeline example

For this example, we have a pipeline comprising three stages. In the first
stage three datasets are produced: one is a directory structure, and the
remaining two are individual files. The data output from the first stage
feeds into the second stage as input, which in turn produces its own output (in
this case a directory structure). Finally, the output data from stage two is
fed into the third stage as input, which produces its own output dataset
directory structure. Thus our three stages have a simple sequential linking
structure: `Stage1 -> Stage2` and `Stage2 -> Stage3`.

Below is a graphical representation of the setup.

<img src="../_static/pipeline_example.png" width="300"/>
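
To make the nomenclature concrete before touching the registry, the three-stage example can be sketched as plain Python data. This is only an illustrative in-memory model, not the `dataregistry` API. The `Dataset` and `Execution` classes, the `upstream_stages` helper, all dataset names, and the stage names `pipeline-stage-2` and `pipeline-stage-3` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data product produced by one execution (hypothetical model)."""
    name: str
    produced_by: str  # name of the execution that created this dataset


@dataclass
class Execution:
    """One stage of the pipeline (hypothetical model)."""
    name: str
    input_datasets: list = field(default_factory=list)  # datasets consumed


# Stage 1 consumes nothing and produces three datasets:
# one directory structure and two individual files (names are made up).
stage1 = Execution("pipeline-stage-1")
stage1_outputs = [
    Dataset("stage1_directory", stage1.name),
    Dataset("stage1_file_a", stage1.name),
    Dataset("stage1_file_b", stage1.name),
]

# Stage 2 takes the stage 1 outputs as input and produces one directory.
stage2 = Execution("pipeline-stage-2", input_datasets=stage1_outputs)
stage2_outputs = [Dataset("stage2_directory", stage2.name)]

# Stage 3 takes the stage 2 output as input and produces one directory.
stage3 = Execution("pipeline-stage-3", input_datasets=stage2_outputs)
stage3_outputs = [Dataset("stage3_directory", stage3.name)]


def upstream_stages(execution):
    """Return the names of the executions this execution depends on."""
    return sorted({d.produced_by for d in execution.input_datasets})


for stage in (stage1, stage2, stage3):
    print(stage.name, "depends on", upstream_stages(stage))
```

The dependencies are never stored explicitly here: they follow from which datasets each execution takes as input, which mirrors the idea that executions are linked through the datasets passed between them.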

How then would we go about inputting this pipeline into the DESC data registry?

To begin, we need to get set up by importing the `DREGS` class. As with the
standalone dataset example, we assume the default DREGS configuration
(see the :ref:`Installation page <installation>` for more details).

.. code-block:: python

   from dataregistry import DREGS

   # Establish connection to database (using defaults) 
   dregs = DREGS()

Now we can enter our database entries, starting with an `execution` entry to
represent the first stage of our pipeline.

.. code-block:: python

   ex1_id = dregs.Registrar.register_execution(
       "pipeline-stage-1",
       description="The first stage of my pipeline",
   )

where ``ex1_id`` is the `DREGS` index for this execution which we will reference later.

Next, we register the datasets associated with the output of
``pipeline-stage-1``. Each dataset by default (as we have not specified
otherwise) will be entered with ``owner=$USER`` and ``owner_type=user``.



In [4]:
from dataregistry import DREGS

# Establish connection to database (using defaults) 
dregs = DREGS()

However, if you are not connecting to the default DREGS database, or if your configuration file is located somewhere other than your `$HOME` directory, you must provide that information when creating the object, e.g.,

In [26]:
# When your configuration file is not in the default location
# dregs = DREGS(config_file="/path/to/config")

# If you are connecting to a database other than the DREGS default, you need to specify the schema name to connect to
# dregs = DREGS(schema_version="myschema")

# If you want to specify the root_dir that data is copied to (this should only be changed for testing purposes)
# dregs = DREGS(root_dir="/my/root/dir")

When creating a `DREGS` class instance that you intend to use for registering datasets, you may optionally set a default `owner` and/or `owner_type` that will be inherited by all datasets registered through that instance. See the section "*Registering new datasets with DREGS*" below for details about `owner` and `owner_type`.

In [2]:
# Setting a global owner and owner_type default value for all datasets that will be registered during this instance
# dregs = DREGS(owner="desc", owner_type="production")

## Registering new datasets with DREGS

Now that we have made our connection to the database we can register some datasets using the `Registrar` extension of the `DREGS` class.

In [5]:
# Create a small dummy text file
with open("dummy_dataset.txt", "w") as f:
    f.write("some data")

# Add new entry.
dataset_id = dregs.Registrar.register_dataset(
    "dregs_nersc_tutorial/my_desc_dataset",
    "1.0.0",
    description="An output from some DESC code",
    owner="DESC",
    owner_type="group",
    is_overwritable=True,
    old_location="dummy_dataset.txt"
)

This will register a new dataset into DREGS. A few notes:

### The relative path

Datasets are registered in the data registry under a relative path. For those interested, the eventual full path for the dataset will be `<root_dir>/<owner_type>/<owner>/<relative_path>`. The relative path is one of the two required parameters you must specify when registering a dataset (in the example here our relative path is `dregs_nersc_tutorial/my_desc_dataset`).

By default, datasets are not overwritable, so the relative path must be unique for a given `owner` and `owner_type` (we have made this example dataset overwritable so that it does not raise an error when the tutorial is run more than once).
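
As a small illustration of the path layout described above, the sketch below composes `<root_dir>/<owner_type>/<owner>/<relative_path>` from its parts. The `full_dataset_path` helper is hypothetical and is not part of `dataregistry`, and `/my/root/dir` is only a placeholder for the real root directory.

```python
from pathlib import PurePosixPath


def full_dataset_path(root_dir, owner_type, owner, relative_path):
    """Compose <root_dir>/<owner_type>/<owner>/<relative_path>."""
    return PurePosixPath(root_dir) / owner_type / owner / relative_path


# "/my/root/dir" is a placeholder root_dir, not the real NERSC location.
path = full_dataset_path(
    "/my/root/dir", "group", "DESC", "dregs_nersc_tutorial/my_desc_dataset"
)
print(path)
```

Because `owner` and `owner_type` are part of the path, the same relative path can exist under two different owners without clashing.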

### The version string

The second required parameter is the version string, in semantic versioning format, i.e., MAJOR.MINOR.PATCH. There is also the optional `version_suffix` parameter, which can be used to add a suffix to the version string.

### Owner and Owner type

Datasets are registered under a given `owner`. This can be any string, but it should be informative. If no `owner` is specified, and no global `owner` was set when the `DREGS` instance was created, `$USER` is used as the default.

One further level of classification is the `owner_type`, which can be one of `user`, `group`, `project` or `production`. If `owner_type` is not specified, and no global `owner_type` was set when the `DREGS` instance was created, `user` is the default. Note that production datasets can never be overwritten (even if `is_overwritable=True`).
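
The fallback order for `owner` and `owner_type`, and the special case for production datasets, can be sketched as follows. This is an assumption-laden illustration of the rules stated above, not the package's actual code. `resolve_owner` and `can_overwrite` are hypothetical helpers, and the `"unknown"` fallback for a missing `$USER` is my own addition.

```python
import os

VALID_OWNER_TYPES = ("user", "group", "project", "production")


def resolve_owner(owner=None, owner_type=None,
                  global_owner=None, global_owner_type=None):
    """Pick the explicit value, else the global default, else the built-in one."""
    # "unknown" is an assumed fallback for environments where $USER is unset.
    resolved_owner = owner or global_owner or os.environ.get("USER", "unknown")
    resolved_type = owner_type or global_owner_type or "user"
    if resolved_type not in VALID_OWNER_TYPES:
        raise ValueError(f"owner_type must be one of {VALID_OWNER_TYPES}")
    return resolved_owner, resolved_type


def can_overwrite(owner_type, is_overwritable):
    """Production datasets are never overwritable, whatever the flag says."""
    return is_overwritable and owner_type != "production"


print(resolve_owner("DESC", "group"))
print(resolve_owner(global_owner="desc", global_owner_type="production"))
print(can_overwrite("production", True))
```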

### Copying the data

Registering a dataset with DREGS does two things: it creates an entry in the DESC data registry database with the appropriate metadata, and it (optionally) copies the dataset contents to a central shared space at NERSC (i.e., under `<root_dir>`).

If the data are already at the correct relative path under the NERSC shared space, then only the relative path needs to be provided for the dataset to be registered. For most users, however, the data will likely need to be copied from another location at NERSC to the DREGS shared space. That initial location may be specified using the `old_location` parameter.

In our example we created a dummy text file as our dataset and ingested it into the data registry; however, this can be any file or directory at NERSC (directories are copied recursively).
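
The copy step can be pictured with a short standard-library sketch. It is not how `dataregistry` implements the copy internally (the source does not show that). `copy_dataset` is a hypothetical helper, and a temporary directory stands in for `<root_dir>`.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dataset(old_location, destination):
    """Copy a file, or recursively copy a directory, to `destination`."""
    src, dest = Path(old_location), Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dest)  # recursive copy of a directory tree
    else:
        shutil.copy2(src, dest)  # single file, preserving metadata
    return dest


# Demonstrate inside a throwaway directory standing in for <root_dir>.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "dummy_dataset.txt"
    source.write_text("some data")
    target = Path(tmp) / "root_dir" / "group" / "DESC" / "my_desc_dataset"
    copied = copy_dataset(source, target)
    copied_text = copied.read_text()
    print(copied_text)
```

The original file is left in place, so the copy under the shared space is an independent snapshot of the data at registration time.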

### Extra options

All the `register_dataset` parameters we do not explicitly specify revert to their default values. For a full list of these options see the documentation [here](http://lsstdesc.org/dataregistry/reference_python.html#the-dregs-class). We would advise you to be as precise as possible when creating entries within DREGS. 

## Updating a previously registered dataset with a newer version

If a dataset previously registered with DREGS gets updated, you have three options for how to handle the new entry into DREGS:

- You can enter it as a completely new standalone dataset with no links to the previous dataset
- You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)
- You can enter it as a new version of the previous dataset (recommended)

Unless you are overwriting a previous dataset, you cannot enter a new dataset (even an updated version) using the same relative path. However, datasets can share the same `name` field, which is what we'll use to keep our updated dataset connected to our previous one.

Note that in the original dataset we did not specify `name` during registration. The default name for a dataset is the file or directory name taken from its relative path. In our example above this would be `my_desc_dataset`.

The combination of `name`, `version` and `version_suffix` for any dataset in DREGS must be unique. As we are updating a dataset with the same name, we have to make sure to update the version to a new value. One handy feature is automatic version "bumping" for datasets, i.e., rather than specifying a new version string manually, one can select "major", "minor" or "patch" for the version string to automatically bump it up. In our case, selecting "minor" will automatically generate the version "1.1.0".

In [6]:
# Add new entry for an updated dataset with an updated version.
dataset_id = dregs.Registrar.register_dataset(
    "dregs_nersc_tutorial/my_updated_desc_dataset",
    "minor", # Automatically bumps to "1.1.0"
    description="An output from some DESC code (updated)",
    is_overwritable=True,
    old_location="dummy_dataset.txt",
    name="my_desc_dataset" # Using this name links it to the previous dataset.
)
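
The bumping rule can be illustrated with a few lines of Python. This sketch assumes the usual semantic versioning convention that bumping a component resets the components to its right to zero. The source only confirms the `1.0.0` to `1.1.0` case, and `bump_version` is a hypothetical helper rather than a `dataregistry` function.

```python
def bump_version(current, part):
    """Return `current` (MAJOR.MINOR.PATCH) with `part` bumped by one."""
    major, minor, patch = (int(x) for x in current.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor' or 'patch'")


print(bump_version("1.0.0", "minor"))  # 1.1.0, matching the example above
print(bump_version("1.0.0", "major"))
print(bump_version("1.2.3", "patch"))
```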

## Querying the data registry

Now that we have covered the basics of dataset registration, we can have a look at how to query entries in the DREGS database.

Queries are constructed from one or more boolean logic "filters", which translate to SQL `WHERE` clauses in the code. 

For example, a filter that queries for all datasets in DREGS with the name "my_desc_dataset" is created as follows:

In [7]:
# Create a filter that queries on the dataset name
f = dregs.Query.gen_filter('dataset.name', '==', 'my_desc_dataset')

As in SQL, column names may be given with or without their table-name prefix, e.g., `name` rather than `dataset.name`. However, the unprefixed form is only valid if the column name is unique across all tables, which `name` is not. We recommend always being explicit and including the table name in filters.

Now we can pass this filter through to a query using the `Query` extension of the `DREGS` class, e.g.,

In [12]:
# Query the database
results = dregs.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.relative_path'], [f])

`find_datasets` takes a list of the column names we want returned and a list of filter objects for the query.

A SQLAlchemy object is returned, which we can iterate over to view the results:

In [13]:
for r in results:
    print(r)

(5, 'my_desc_dataset', 'my_desc_project/my_desc_dataset')
(6, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(7, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(8, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(9, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(10, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(12, 'my_desc_dataset', 'dregs_nersc_tutorial/my_desc_dataset')
(11, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
(13, 'my_desc_dataset', 'dregs_nersc_tutorial/my_updated_desc_dataset')
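
To see what a filter corresponds to on the SQL side, here is a standalone sketch using an in-memory SQLite database. It does not use `dataregistry` or the real DREGS schema: the three-column `dataset` table and its rows are made up for illustration, and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (dataset_id INTEGER, name TEXT, relative_path TEXT)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        (1, "my_desc_dataset", "dregs_nersc_tutorial/my_desc_dataset"),
        (2, "my_desc_dataset", "dregs_nersc_tutorial/my_updated_desc_dataset"),
        (3, "another_dataset", "dregs_nersc_tutorial/another_dataset"),
    ],
)

# The filter ('dataset.name', '==', 'my_desc_dataset') plays the role of
# this WHERE clause; the value is bound as a parameter, not pasted in.
rows = conn.execute(
    "SELECT dataset.dataset_id, dataset.name, dataset.relative_path "
    "FROM dataset WHERE dataset.name = ? ORDER BY dataset.dataset_id",
    ("my_desc_dataset",),
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Each returned row is a tuple with one element per requested column, in the order the columns were listed, which is the same shape as the query results printed above.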


To get a list of all columns in the database, along with the table each belongs to, you can use the `Query.get_all_columns()` function, i.e.,

In [None]:
print(dregs.Query.get_all_columns())