<img src="../_static/DREGS_logo_v2.png" width="300"/>

# Getting started with the DESC data registry

The *DESC data registry* is a means of storing and keeping track of DESC-related datasets, i.e., where they are, when they were produced and what precursor datasets they depend on.

This is a quick tutorial to get you started using the `dataregistry` at NERSC, both for entering your own datasets into the registry and for finding out what datasets already exist through queries.

### What we cover in this tutorial

In this tutorial we will learn how to:

- Connect to the DESC data registry using the `DataRegistry` class
- Register a dataset
- Perform a simple query 

### Before we begin

If you want to run this tutorial interactively and haven't done so already, check out the "getting setup" page from the docs first.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [None]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)

## The DataRegistry class

The top-level `DataRegistry` class provides a quick and easy-to-use interface to `dataregistry` functionality.

Upon creation, the `DataRegistry` class automatically establishes a connection to the DESC data registry database using the information in your `~/.config_reg_access` and `~/.pgpass` files.

In many cases, in particular when you only need query functionality, or when you're working on your own at NERSC, a simple call with no additional arguments is adequate, e.g.,

In [None]:
from dataregistry import DataRegistry

# Establish connection to database (using defaults)
datareg = DataRegistry()

However, if you are not connecting to the default database schema, or if your configuration file is located somewhere other than your `$HOME` directory, you must provide that information during the object creation, e.g.,

In [None]:
# When your configuration file is not in the default location
# datareg = DataRegistry(config_file="/path/to/config")

# If you are connecting to a database schema other than the default, you need to specify the schema name to connect to
# datareg = DataRegistry(schema_version="myschema")

# If you want to specify the root_dir that data is copied to (this should only be changed for testing purposes)
# datareg = DataRegistry(root_dir="/my/root/dir")

When creating a `DataRegistry` instance that you intend to use for registering datasets, you may optionally set a default `owner` and/or `owner_type` that will be inherited by all datasets registered through that instance. See the section "*Registering new datasets with DataRegistry*" below for details about `owner` and `owner_type`.

In [None]:
# Setting a global owner and owner_type default value for all datasets that will be registered during this instance
# datareg = DataRegistry(owner="desc", owner_type="group")

## Registering new datasets with DataRegistry

Now that we have made our connection to the database we can register some datasets using the `Registrar` extension of the `DataRegistry` class.

In [None]:
# Create a small text file as some example data
with open("dummy_dataset.txt", "w") as f:
    f.write("some data")

# Add new entry.
dataset_id = datareg.Registrar.register_dataset(
    "nersc_tutorial/my_desc_dataset",
    "1.0.0",
    description="An output from some DESC code",
    owner="DESC",
    owner_type="group",
    is_overwritable=True,
    old_location="dummy_dataset.txt"
)

This will register a new dataset. A few notes:

### The relative path

Datasets are registered at the data registry shared space under a relative path. For those interested, the eventual full path for the dataset will be `<root_dir>/<owner_type>/<owner>/<relative_path>`. The relative path is one of the two required parameters you must specify when registering a dataset (in the example here our relative path is `nersc_tutorial/my_desc_dataset`).
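
To make the path layout concrete, here is a minimal sketch of how the pieces combine. The helper `full_dataset_path` is hypothetical (it is not part of `dataregistry`), and the root directory is just the placeholder used earlier in this tutorial.

```python
import posixpath


def full_dataset_path(root_dir, owner_type, owner, relative_path):
    """Hypothetical helper: build <root_dir>/<owner_type>/<owner>/<relative_path>."""
    return posixpath.join(root_dir, owner_type, owner, relative_path)


# Values taken from the registration example above; the root_dir is a placeholder
example_path = full_dataset_path(
    "/my/root/dir", "group", "DESC", "nersc_tutorial/my_desc_dataset"
)
print(example_path)  # /my/root/dir/group/DESC/nersc_tutorial/my_desc_dataset
```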

### The version string

The second required parameter is the version string, in semantic versioning format, i.e., MAJOR.MINOR.PATCH. There is also the optional `version_suffix` parameter, which can be used to add a suffix to the version string.

### Owner and Owner type

Datasets are registered under a given `owner`. This can be any string, but it should be informative. If no `owner` is specified, and no global `owner` was set when the `DataRegistry` instance was created, `$USER` is used as the default.

One further level of classification is the `owner_type`, which can be one of `user`, `group`, `project` or `production`. If `owner_type` is not specified, and no global `owner_type` was set when the `DataRegistry` instance was created, `user` is the default. Note that `owner_type=production` datasets can only go into the production schema, and can never be overwritten (see next tutorial for more information on the production schema).
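
The order in which these defaults apply can be sketched as follows. This is only an illustration of the rules described above, not the package's own code: the function `resolve_owner` and its arguments are hypothetical.

```python
import os

VALID_OWNER_TYPES = ("user", "group", "project", "production")


def resolve_owner(owner=None, owner_type=None, global_owner=None,
                  global_owner_type=None, current_user=None):
    """Hypothetical sketch: explicit value, then global default, then fallback."""
    if current_user is None:
        # Stand-in for $USER; "unknown_user" is an assumption for this sketch
        current_user = os.environ.get("USER", "unknown_user")

    resolved_owner = owner or global_owner or current_user
    resolved_type = owner_type or global_owner_type or "user"

    if resolved_type not in VALID_OWNER_TYPES:
        raise ValueError(f"owner_type must be one of {VALID_OWNER_TYPES}")
    return resolved_owner, resolved_type


# Nothing specified anywhere: falls back to the user name and "user"
print(resolve_owner(current_user="alice"))
# Global defaults set at DataRegistry creation are inherited
print(resolve_owner(global_owner="desc", global_owner_type="group", current_user="alice"))
# Values passed at registration take precedence over the global defaults
print(resolve_owner(owner="DESC", owner_type="project", global_owner="desc",
                    global_owner_type="group", current_user="alice"))
```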

### Overwriting datasets

By default, datasets are not overwritable. In that scenario you will need to choose a combination of `relative_path`, `owner` and `owner_type` that is not already taken in the database. For our example we have made the dataset overwritable so that it does not raise an error when the tutorial is run multiple times. Note that when a dataset has `is_overwritable=True`, the data in the shared space will be overwritten with each registration, but the entry in the data registry database is never lost (a new unique entry is created each time, and the 'old' entries will have their `is_overwritten` value set to `true`).
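
The bookkeeping described here can be pictured with a toy in-memory model. Everything in this sketch (the `entries` list, the `toy_register` function and its dictionary fields) is invented for illustration and does not reflect how `dataregistry` is implemented.

```python
entries = []  # toy stand-in for the dataset table in the database


def toy_register(relative_path, owner, owner_type, is_overwritable=False):
    """Toy model: re-registering marks old entries overwritten, never deletes them."""
    key = (relative_path, owner, owner_type)
    live = [e for e in entries if e["key"] == key and not e["is_overwritten"]]

    # A taken path can only be reused if the existing entry allows overwriting
    if any(not e["is_overwritable"] for e in live):
        raise ValueError("path already taken by a non-overwritable dataset")

    for e in live:
        e["is_overwritten"] = True  # the old entry is kept, just flagged

    entry = {
        "dataset_id": len(entries) + 1,
        "key": key,
        "is_overwritable": is_overwritable,
        "is_overwritten": False,
    }
    entries.append(entry)
    return entry["dataset_id"]


first = toy_register("nersc_tutorial/my_desc_dataset", "DESC", "group", is_overwritable=True)
second = toy_register("nersc_tutorial/my_desc_dataset", "DESC", "group", is_overwritable=True)
print([(e["dataset_id"], e["is_overwritten"]) for e in entries])
```

In this toy model both entries remain in the list after the second registration; only the first one is flagged as overwritten.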

### Copying the data

Registering a dataset does two things: it creates an entry in the DESC data registry database with the appropriate metadata, and it (optionally) copies the dataset contents to a central shared space at NERSC (i.e., under `<root_dir>`).

If the data are already at the correct relative path under the NERSC shared space, then only the relative path needs to be provided, and the dataset will then be registered. However, most users will likely need to copy the data from another location at NERSC to the data registry shared space. That initial location may be specified using the `old_location` parameter.

In our example we have created a dummy text file as our dataset and ingested it into the data registry. However, the dataset can be any file or directory at NERSC (directories will be recursively copied).
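
For intuition, the copy step can be approximated with the standard library. This is a sketch under assumptions: `copy_into_shared_space` is a hypothetical helper, the destination layout is only illustrative, and a temporary directory stands in for the shared space.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_shared_space(old_location, destination):
    """Hypothetical helper: copy a file, or recursively copy a directory."""
    old_location = Path(old_location)
    destination = Path(destination)
    destination.parent.mkdir(parents=True, exist_ok=True)

    if old_location.is_dir():
        # Directories are copied recursively
        shutil.copytree(old_location, destination, dirs_exist_ok=True)
    else:
        shutil.copy2(old_location, destination)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Recreate the dummy dataset from this tutorial
    source = tmp / "dummy_dataset.txt"
    source.write_text("some data")

    # A temporary directory plays the role of <root_dir> here
    target = copy_into_shared_space(
        source, tmp / "root_dir" / "group" / "DESC" / "nersc_tutorial" / "my_desc_dataset"
    )
    copied_text = target.read_text()

print(copied_text)
```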

### Extra options

All the `register_dataset` parameters we do not explicitly specify revert to their default values. For a full list of these options see the documentation [here](http://lsstdesc.org/dataregistry/reference_python.html#the-dregs-class). We would advise you to be as precise as possible when creating entries within the data registry. 

## Updating a previously registered dataset with a newer version

If a dataset previously registered with the data registry gets updated, you have three options for how to handle the new entry:

- You can enter it as a completely new standalone dataset with no links to the previous dataset
- You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)
- You can enter it as a new version of the previous dataset (recommended)

Unless you are overwriting a previous dataset (when `is_overwritable=True`), you cannot enter a new dataset (even an updated version) using the same relative path. However, datasets can share the same `name` field, which is what we'll use to keep our updated dataset connected to our previous one.

Note that in our test dataset we did not specify `name` during registration. The default name for a dataset is the file or directory name taken from its relative path. In our example above this would be `my_desc_dataset`.
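
As a small illustration of that default, the last component of the relative path can be extracted like this. The helper `default_dataset_name` is hypothetical and only mirrors the rule stated above.

```python
import posixpath


def default_dataset_name(relative_path):
    """Hypothetical helper: use the final path component as the dataset name."""
    # Strip any trailing slash so directory-style paths behave the same way
    return posixpath.basename(relative_path.rstrip("/"))


default_name = default_dataset_name("nersc_tutorial/my_desc_dataset")
print(default_name)  # my_desc_dataset
```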

The combination of `name`, `version` and `version_suffix` for any dataset must be unique. As we are updating a dataset with the same name, we have to make sure to update the version to a new value. One handy feature is automatic version "bumping" for datasets, i.e., rather than specifying a new version string manually, one can select "major", "minor" or "patch" for the version string to automatically bump it up. In our case, selecting "minor" will automatically generate the version "1.1.0".

In [None]:
# Add new entry for an updated dataset with an updated version.
dataset_id = datareg.Registrar.register_dataset(
    "nersc_tutorial/my_updated_desc_dataset",
    "minor", # Automatically bumps to "1.1.0"
    description="An output from some DESC code (updated)",
    is_overwritable=True,
    old_location="dummy_dataset.txt",
    name="my_desc_dataset" # Using this name links it to the previous dataset.
)
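
The bumping rule itself follows the usual semantic versioning convention, where bumping one part resets the parts after it to zero. A minimal sketch is shown below; `bump_version` is a hypothetical function written for this tutorial, not the routine used inside `dataregistry`.

```python
def bump_version(previous, part):
    """Hypothetical helper: bump one part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(x) for x in previous.split("."))

    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor' or 'patch'")


print(bump_version("1.0.0", "minor"))  # 1.1.0
```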

## Querying the data registry

Now that we have covered the basics of dataset registration, we can have a look at how to query entries in the database.

Queries are constructed from one or more boolean logic "filters", which translate to SQL `WHERE` clauses in the code. 

For example, a filter that queries for all datasets in the registry with the name "my_desc_dataset" can be created as follows:

In [None]:
# Create a filter that queries on the dataset name
f = datareg.Query.gen_filter('dataset.name', '==', 'my_desc_dataset')

As in SQL, column names can be given either with or without their table name as a prefix, for example `name` rather than `dataset.name`. However, the unprefixed form is only valid if the column name is unique across all tables, which `name` is not. We recommend always being explicit and including the table name in filters.
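
As a rough SQL analogy for the filter above, the sketch below uses an in-memory SQLite database. The table layout, the second table `other_table` and the rows are all made up for illustration; the real registry schema and the SQL that `dataregistry` generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy tables: both have a "name" column, so an unprefixed "name" is ambiguous
conn.execute(
    "CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, name TEXT, relative_path TEXT)"
)
conn.execute("CREATE TABLE other_table (other_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [
        (1, "my_desc_dataset", "nersc_tutorial/my_desc_dataset"),
        (2, "something_else", "nersc_tutorial/other"),
    ],
)

# The filter ('dataset.name', '==', 'my_desc_dataset') plays the role of this WHERE clause
rows = conn.execute(
    "SELECT dataset.dataset_id, dataset.name, dataset.relative_path "
    "FROM dataset WHERE dataset.name = ?",
    ("my_desc_dataset",),
).fetchall()
print(rows)

# Without the table prefix, a query spanning both tables cannot resolve "name"
try:
    conn.execute("SELECT name FROM dataset, other_table")
    ambiguous = False
except sqlite3.OperationalError as err:
    ambiguous = True
    print("Query rejected:", err)
```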

Now we can pass this filter through to a query using the `Query` extension of the `DataRegistry` class, e.g.,

In [None]:
# Query the database
results = datareg.Query.find_datasets(['dataset.dataset_id', 'dataset.name', 'dataset.relative_path'], [f])

`find_datasets` takes a list of the column names we want returned and a list of filter objects for the query.

It returns a SQLAlchemy object, which we can loop over to look at the results:

In [None]:
for r in results:
    print(r)

To get a list of all columns in the database, along with what table they belong to, you can use the `Query.get_all_columns()` function, i.e.,

In [None]:
print(datareg.Query.get_all_columns())