<div style="overflow: hidden;">
    <img src="images/DREGS_logo_v2.png" width="300" style="float: left; margin-right: 10px;">
</div>

# Getting started: Part 1 - Registering datasets

The *DESC data registry* is a means of storing and keeping track of DESC-related datasets, i.e., where they are, when they were produced and what precursor datasets they depend on.

This is a quick tutorial to get you started using the `dataregistry` at NERSC, both for entering your own datasets into the registry and for finding out what datasets already exist through queries.

### What we cover in this tutorial

In this tutorial we will learn how to:

1) Connect to the DESC data registry using the `DataRegistry` class
2) Register a dataset
3) Update a registered dataset with a new version
4) Modify a previously registered dataset with updated metadata
5) Delete a dataset
6) Special cases
    * Registering external datasets
    * Manually specifying the relative path

### Before we begin

If you want to run this tutorial interactively and haven't done so already, see the [getting set up](https://lsstdesc.org/dataregistry/tutorial_setup.html) page of the documentation.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [None]:
import dataregistry
print("Working with dataregistry version:", dataregistry.__version__)

## 1) The DataRegistry class

The top-level `DataRegistry` class provides a quick and easy-to-use interface to `dataregistry` functionality.

In many cases, in particular when you only need query functionality, or when you're working on your own at NERSC, a simple call with no additional arguments is adequate, e.g.,

In [None]:
from dataregistry import DataRegistry

# Establish connection to database (using defaults)
datareg = DataRegistry()

With no arguments, the `DataRegistry` class will automatically attempt to:
- establish a connection to the registry database using the information in your `~/.config_reg_access` and `~/.pgpass` files
- connect to the default database schema
- use the default NERSC "`site`" for the `root_dir`

The root directory (`root_dir`) is the base path under which all ingested data will be copied. Other than for testing, this should generally be the NERSC `site` address.
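
Before relying on the defaults, it can help to confirm that the two access files named above are actually present. The snippet below is a minimal sketch of such a check; the helper `find_access_files` is hypothetical (it is not part of `dataregistry`), and it only tests that the files exist, not that their contents are valid.

```python
import os
from pathlib import Path

# The two files named above; only their presence is checked, not their contents.
ACCESS_FILES = (".config_reg_access", ".pgpass")


def find_access_files(home=None):
    """Return {filename: True/False} for the default access files under `home`.

    Hypothetical helper for illustration; not part of the dataregistry package.
    """
    home = Path(home) if home is not None else Path(os.path.expanduser("~"))
    return {name: (home / name).is_file() for name in ACCESS_FILES}


for name, found in find_access_files().items():
    print(f"~/{name}: {'found' if found else 'missing'}")
```

If either file is reported missing, the setup page linked in "Before we begin" is the place to start.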

### Special cases away from the defaults

If you are not connecting to the default database schema, or if your configuration file is located somewhere other than your `$HOME` directory, you must provide that information during the object creation, e.g.,

In [None]:
# When your configuration file is not in the default location
# datareg = DataRegistry(config_file="/path/to/config")

# If you are connecting to a database schema other than the default, you need to specify the schema name to connect to
# datareg = DataRegistry(schema="myschema")

# If you want to specify the root_dir that data is copied to (this should only be changed for testing purposes)
# datareg = DataRegistry(root_dir="/my/root/dir")

### Setting a universal `owner` and/or `owner_type` for datasets

When creating a `DataRegistry` instance that you intend to use for registering datasets, you may optionally set a default `owner` and/or `owner_type` that will be inherited by all datasets registered through that instance. See the section "*Registering new datasets with the `DataRegistry` class*" below for details about `owner` and `owner_type`.

In [None]:
# Setting a global owner and owner_type default value for all datasets that will be registered during this instance
# datareg = DataRegistry(owner="desc", owner_type="group")

## 2) Registering new datasets with the `DataRegistry` class

Now that we have made our connection to the database, we can register some datasets using the `Registrar` extension of the `DataRegistry` class.

In [None]:
# Create a small text file to act as some example data
with open("dummy_dataset.txt", "w") as f:
    f.write("some data")

# Add new entry.
dataset_id, execution_id = datareg.Registrar.dataset.register(
    "nersc_tutorial:my_desc_dataset",
    "1.0.0",
    description="An output from some DESC code",
    owner="DESC",
    owner_type="group",
    is_overwritable=True,
    old_location="dummy_dataset.txt"
)

# This is the unique identifier assigned to the dataset from the registry
print(f"Dataset {dataset_id} created")

# This is the id of the execution the dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {execution_id}")

This will register a new dataset. A few notes:

### The dataset name (mandatory)

The first of the two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (there is nothing special about the `:` here; the `name` can be any legal string). Choose a convenient, descriptive name that is meaningful to human readers. The combination of `name`, `version` and `version_suffix` must be unique in the database. The dataset `name` makes it easy to retrieve the dataset for querying and updating.

### The version string (mandatory)

The second required parameter is the version string, in semantic versioning format, i.e., MAJOR.MINOR.PATCH. There is also an optional `version_suffix` parameter, which may be used to further identify the dataset, e.g., with a value like "rc1" to make it clear that it is only a release candidate, possibly not in its final form.
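
To make the MAJOR.MINOR.PATCH format concrete, here is a minimal sketch of a check you could run on a version string before registering. The helper `parse_version` is hypothetical and is not the registry's own validation, which may be stricter or looser (for instance, this sketch rejects leading zeros, which is an assumption).

```python
import re

# Three dot-separated non-negative integers without leading zeros.
# Assumption: this mirrors common semantic-versioning rules, not necessarily
# the exact rules enforced by dataregistry.
_VERSION_RE = re.compile(r"(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)")


def parse_version(version):
    """Return (major, minor, patch) as integers, or raise ValueError."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"{version!r} is not of the form MAJOR.MINOR.PATCH")
    return tuple(int(part) for part in match.groups())


print(parse_version("1.0.0"))  # (1, 0, 0)
```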


### Owner and Owner type

Datasets are registered under a given `owner`. This can be any string, however it should be informative. If no `owner` is specified, and no global `owner` was set when the `DataRegistry` instance was created, `$USER` is used as the default.

One further level of classification is the `owner_type`, which can be one of `user`, `group`, `project` or `production`. If `owner_type` is not specified, and no global `owner_type` was set when the `DataRegistry` instance was created, `user` is the default.

Note that `owner_type="production"` datasets can only go into the production schema, and can never be overwritten (see the "[Production Schema](https://github.com/LSSTDESC/dataregistry/tree/main/docs/source/tutorial_notebooks/production_scheme.ipynb)" tutorial for more information).
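
The order of precedence described above (explicit argument, then the instance-wide default, then `$USER` or `user`) can be sketched as follows. The function `resolve_owner` and its parameter names are hypothetical and written only for illustration; they are not the package's internal implementation.

```python
import os

# The four owner types listed above.
VALID_OWNER_TYPES = ("user", "group", "project", "production")


def resolve_owner(owner=None, owner_type=None,
                  global_owner=None, global_owner_type=None):
    """Illustrative resolution of owner and owner_type (hypothetical helper).

    Explicit values win, then the defaults set on the DataRegistry instance,
    then $USER for the owner and "user" for the owner type.
    """
    if owner is None:
        owner = global_owner if global_owner is not None else os.environ.get("USER")
    if owner_type is None:
        owner_type = global_owner_type if global_owner_type is not None else "user"
    if owner_type not in VALID_OWNER_TYPES:
        raise ValueError(f"owner_type must be one of {VALID_OWNER_TYPES}")
    return owner, owner_type


print(resolve_owner(owner="DESC", owner_type="group"))  # ('DESC', 'group')
```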

### Overwriting datasets

By default, datasets in the data registry, once registered, are not overwritable. You can change this behavior by setting `is_overwritable=True` when registering your datasets. If `is_overwritable=True` on one of your previous datasets, you can register a new dataset with the same combination of `relative_path`, `owner` and `owner_type` as before (be warned that any previous data stored under this path will be deleted first). 

Note that whilst the data in the shared space will be overwritten with each registration when `is_overwritable=True`, the original entries in the data registry database are never lost (a new unique entry is created each time, and the 'old' entries will obtain `True` for their `is_overwritten` column).

### Copying the data

Registering a dataset does two things: it creates an entry in the DESC data registry database with the appropriate metadata, and it (optionally) copies the dataset contents to the `root_dir`.

If the data are already at the correct relative path within the `root_dir`, then only the relative path needs to be provided, and the dataset will then be registered. However, most users will likely need to copy their data from another location to the `root_dir`. That initial location may be specified using the `old_location` parameter.

In our example we created a dummy text file as our dataset and ingested it into the data registry, but the dataset can be any file or directory (directories are copied recursively).

### Extra options

All the `Registrar.dataset.register()` parameters we do not explicitly specify revert to their default values. For a full list of these options see the documentation [here](http://lsstdesc.org/dataregistry/reference_python.html#the-dregs-class). We would advise you to be as precise as possible when creating entries within the data registry. 

## 3) Updating a previously registered dataset with a newer version

If you have a dataset that has been previously registered within the data registry, and that dataset has updates, you have three options for how to handle the new entry:

1. You can enter it as a completely new standalone dataset with no links to the previous dataset
2. You can enter it as a new version of the previous dataset (recommended)
3. You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)

For 1. simply follow the procedure above for registering a new dataset.

For 2. we register a new dataset as before, making sure to keep the same dataset `name`, but updating the dataset `version`. One can update the `version` in two ways: manually entering a new version string, or having the `dataregistry` automatically "bump" the dataset version by selecting either "major", "minor" or "patch" for the version string. For example, let's register an updated version of our dataset, bumping the minor tag (i.e., bumping `1.0.0` -> `1.1.0`).

In [None]:
# Create a small text file to act as some updated example data
with open("updated_dummy_dataset.txt", "w") as f:
    f.write("some updated data")

# Add new entry for an updated dataset with an updated version.
updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(
    "nersc_tutorial:my_desc_dataset",
    "minor", # Automatically bumps to "1.1.0"
    description="An output from some DESC code (updated)",
    is_overwritable=True,
    old_location="updated_dummy_dataset.txt",
)

# This is the unique identifier assigned to the updated dataset from the registry
print(f"Dataset {updated_dataset_id} created")

# This is the id of the execution the updated dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {updated_execution_id}")

Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.
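
For readers unfamiliar with version bumping, the sketch below shows the usual semantic-versioning convention, in which bumping one component resets the components to its right. The function `bump_version` is a hypothetical stand-alone illustration, not the code `dataregistry` runs internally; it is only assumed to agree with the `1.0.0` -> `1.1.0` behaviour shown in this tutorial.

```python
def bump_version(current, part):
    """Return `current` with the given part ("major", "minor" or "patch") bumped.

    Hypothetical helper for illustration only.
    """
    major, minor, patch = (int(piece) for piece in current.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor' or 'patch'")


print(bump_version("1.0.0", "minor"))  # 1.1.0
print(bump_version("1.1.0", "patch"))  # 1.1.1
```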

For 3., to update a previous dataset and overwrite the existing data, we have to pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example,

In [None]:
# Create a small text file to act as some further updated example data
with open("updated_dummy_dataset_again.txt", "w") as f:
    f.write("some further updated data")

# Add new entry for an updated dataset with an updated version.
updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(
    "nersc_tutorial:my_desc_dataset",
    "patch", # Automatically bumps to "1.1.1"
    description="An output from some DESC code (further updated)",
    is_overwritable=True,
    old_location="updated_dummy_dataset_again.txt",
    relative_path="nersc_tutorial:my_desc_dataset_1.1.0",
)

# This is the unique identifier assigned to the updated dataset from the registry
print(f"Dataset {updated_dataset_id} created")

# This is the id of the execution the updated dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {updated_execution_id}")

This will create a new dataset entry with version `1.1.1`, but the new data will have overwritten the data stored for version `1.1.0`.

## 4) Modifying the metadata of a previously registered dataset

Once a dataset has been registered with the `dataregistry` it can still be modified, however only for certain metadata columns.

We can check and see what metadata columns we are allowed to modify by running

In [None]:
# What columns in the dataset table are modifiable?
print(datareg.Registrar.dataset.get_modifiable_columns())

To modify the metadata of a dataset entry we call the `.modify()` function for the appropriate schema table, which accepts the unique entry ID of the dataset we wish to modify and a key-value pair dictionary with the modifications we wish to make.

For example, if we want to update the `description` column of our first example entry above we would do

In [None]:
# A key-value dict of the columns we want to update, with their new values
update_dict = {"description": "My new updated description"}

datareg.Registrar.dataset.modify(dataset_id, update_dict)
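
Since only certain columns can be modified, it may be convenient to check an update dictionary against the modifiable columns before calling `.modify()`. The sketch below uses a hypothetical helper, `split_updates`, and a stand-in column list; the real list comes from `get_modifiable_columns()`, which is assumed here to return an iterable of column names.

```python
def split_updates(update_dict, modifiable_columns):
    """Split an update dict into (accepted updates, rejected column names).

    Hypothetical helper for illustration; not part of dataregistry.
    """
    allowed = set(modifiable_columns)
    accepted = {key: value for key, value in update_dict.items() if key in allowed}
    rejected = sorted(key for key in update_dict if key not in allowed)
    return accepted, rejected


# Stand-in list for illustration only; query the registry for the real one.
example_columns = ["description"]

accepted, rejected = split_updates(
    {"description": "My new updated description", "version": "2.0.0"},
    example_columns,
)
print(accepted)  # {'description': 'My new updated description'}
print(rejected)  # ['version']
```

In a live session one might pass `datareg.Registrar.dataset.get_modifiable_columns()` as the second argument and only call `.modify()` with the accepted entries.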

## 5) Deleting a dataset in the dataregistry

To delete a dataset entry from the `dataregistry` we call the `.delete()` function, which accepts one argument, the `dataset_id` of the entry you wish to delete, e.g.,

In [None]:
# Delete dataset with entry ID == dataset_id
datareg.Registrar.dataset.delete(dataset_id)

Note that this will remove the dataset data stored under the `root_dir`; however, the entry within the registry database will remain (with an updated status indicating that the dataset was deleted).

## 6) Special cases

### Registering external datasets

Typically when we register datasets we are asking the `dataregistry` to collate provenance data for the dataset and to physically manage the data (i.e., to copy the data to the central `root_dir`).

However the `dataregistry` can also accept entries for "external" datasets, e.g., those located off site and not controlled by DESC. These will only be entries in the database, for querying purposes, and no physical data (managed by the `dataregistry`) will be associated with those entries.

When registering an external dataset into the `dataregistry` you must provide either a `contact_email` or `url` during registration, and set `location_type="external"`. The remaining provenance information is the same as before.

For example

In [None]:
# Add new external dataset entry.
datareg.Registrar.dataset.register(
    "nersc_tutorial/external_dataset",
    "0.0.1",
    description="Images from some external observatory",
    is_overwritable=True,
    location_type="external",
    url="www.data.com",
)
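
The requirement stated above (an external dataset needs a `contact_email` or a `url`) can be expressed as a small pre-registration check. This is a sketch with a hypothetical helper, `check_external_args`; how `dataregistry` itself reports a missing value is not shown in this tutorial.

```python
def check_external_args(location_type, contact_email=None, url=None):
    """Raise ValueError if an external dataset lacks both contact_email and url.

    Hypothetical helper for illustration; not part of dataregistry.
    """
    if location_type == "external" and not (contact_email or url):
        raise ValueError(
            "External datasets need a contact_email or a url at registration"
        )
    return True


# Passes: a url is supplied, as in the registration example above.
check_external_args("external", url="www.data.com")
```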

### Specifying the relative path

Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. 

By default, the `relative_path` is constructed from the `name`, `version` and `version_suffix` (if there is one), in the format `relative_path=<name>_<version>_<version_suffix>`. However, one can also manually select the `relative_path` during registration, for example

In [None]:
# Add new entry with a manual relative path.
datareg.Registrar.dataset.register(
    "nersc_tutorial:my_desc_dataset_with_relative_path",
    "1.0.0",
    is_dummy=True, # For testing purposes, means we need no actual data to work (only a database entry is created)
    relative_path="nersc_tutorial/my_desc_dataset",
)

will register a dataset under the `relative_path` of `nersc_tutorial/my_desc_dataset`.

For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable). 
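
The two path rules above can be sketched in a few lines. Both helpers below are hypothetical illustrations of the stated formats, not the package's own code, and the `root_dir` and `schema` values in the usage lines are placeholders taken from the earlier commented-out examples.

```python
from pathlib import PurePosixPath


def default_relative_path(name, version, version_suffix=None):
    """Build <name>_<version>[_<version_suffix>] (illustrative helper)."""
    parts = [name, version]
    if version_suffix:
        parts.append(version_suffix)
    return "_".join(parts)


def full_path(root_dir, schema, owner_type, owner, relative_path):
    """Build <root_dir>/<schema>/<owner_type>/<owner>/<relative_path>."""
    return str(PurePosixPath(root_dir) / schema / owner_type / owner / relative_path)


print(default_relative_path("nersc_tutorial:my_desc_dataset", "1.1.0"))
# nersc_tutorial:my_desc_dataset_1.1.0

print(full_path("/my/root/dir", "myschema", "group", "DESC",
                "nersc_tutorial/my_desc_dataset"))
# /my/root/dir/myschema/group/DESC/nersc_tutorial/my_desc_dataset
```

Because `owner_type` and `owner` are part of the full path, two datasets may share a `relative_path` only if they differ in at least one of those two values.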

You can leave `name` as `None` when registering using a manual `relative_path`, in which case the `name` will be constructed automatically from the `relative_path`. However, we always recommend being explicit and choosing a `name` as well.