<div style="overflow: hidden;">
    <img src="images/DREGS_logo_v2.png" width="300" style="float: left; margin-right: 10px;">
</div>

# Getting started: Part 2 - A closer look at datasets

The *DESC data registry* is a means of storing and keeping track of DESC-related datasets, i.e., where they are, when they were produced, and what precursor datasets they depend on.

This is a quick tutorial to get you started using the `dataregistry` at NERSC, both for entering your own datasets into the registry and for finding out, through queries, what datasets already exist.

### What we cover in this tutorial

In this tutorial we will learn about:

1) Tagging datasets with `keywords`
2) The `relative_path` and where datasets are stored
3) The dataset `status` and history

### Before we begin

If you haven't done so already, check out the [getting setup](https://lsstdesc.org/dataregistry/tutorial_setup.html) page from the documentation if you want to run this tutorial interactively.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [None]:
# Come up with a random owner name to avoid clashes
from random import randint
import os
# Fall back to a placeholder if USER is unset, so the concatenation cannot fail
OWNER = "tutorial_" + os.environ.get('USER', 'unknown') + '_' + str(randint(0, int(1e6)))

import dataregistry
print(f"Working with dataregistry version: {dataregistry.__version__} as random owner {OWNER}")

**Note** that some of the cells below may fail, especially if run multiple times. This is most likely due to clashes with the unique constraints within the database (the error output should indicate which). If this happens, either (1) rerun the cell above to get a new random owner and establish a new database connection with it, or (2) manually change the conflicting database column(s) that clash during registration.

## 1) Tagging datasets with keywords

In [None]:
from dataregistry import DataRegistry

# Establish connection to the tutorial schema
datareg = DataRegistry(schema="tutorial_working", owner=OWNER)

To make datasets broadly easier for people to find, they can be tagged with one or more keywords.

Keywords are restricted to those within a predefined list. To see the list, run

In [None]:
print(datareg.Registrar.dataset.get_keywords())

Or run `dregs show keywords` from the command line.

Keywords are passed as a list of strings during dataset registration. For example


In [None]:
# Add new dataset entry with keywords.
dataset_id, execution_id = datareg.Registrar.dataset.register(
    "nersc_tutorial:keywords_dataset",
    "0.0.1",
    description="A dataset with some keywords tagged",
    is_overwritable=True,
    keywords=["simulation"],
    location_type="dummy", # for testing, means we need no data
)

# This is the unique identifier the registry assigned to the new dataset
print(f"Dataset {dataset_id} created")

# This is the id of the execution the new dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {execution_id}")

will create a dataset tagged with the keyword "simulation".

We can also append keywords to a dataset after it has been registered using the `add_keywords()` function, providing a list of the new keywords we want to add, e.g.,

In [None]:
# List of keywords to add to dataset
updated_keywords = ["observation"]

datareg.Registrar.dataset.add_keywords(dataset_id, updated_keywords)

Note that keywords are never duplicated: if you call `add_keywords()` with a list containing a keyword that is already tagged on that dataset, no duplicate keyword entry is created.
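
To make the no-duplicates behaviour concrete, here is a minimal pure-Python sketch of the merge semantics described above. It is only an illustration: `merge_keywords` is a hypothetical helper, not part of `dataregistry`, and the registry's actual implementation may differ.

```python
# Illustrative only: NOT the dataregistry implementation.
def merge_keywords(existing, new):
    """Return the existing keywords plus any new ones, without duplicates."""
    merged = list(existing)
    for keyword in new:
        # Skip keywords that are already tagged
        if keyword not in merged:
            merged.append(keyword)
    return merged


tagged = ["simulation"]
tagged = merge_keywords(tagged, ["observation", "simulation"])
print(tagged)  # ['simulation', 'observation']
```

Here "simulation" appears in both lists but ends up tagged only once.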

We can return all datasets tagged with certain keywords with a simple query, which we cover in the next tutorial.

## 2) The relative path and where datasets are stored

The files and directories of registered datasets are stored under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC.

By default, when not manually specified, the `relative_path` is constructed from the `name` and `version`, in the format `relative_path=.gen_paths/<name>_<version>/`. 
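
As an illustration of that format, the sketch below builds the default path from a name and version. `default_relative_path` is a hypothetical helper written for this tutorial, not a `dataregistry` function, and the dataset name is made up.

```python
# Illustrative only: mirrors the format quoted above, not the library code.
def default_relative_path(name: str, version: str) -> str:
    """Build the autogenerated relative path for a dataset."""
    return f".gen_paths/{name}_{version}/"


print(default_relative_path("my_dataset", "1.0.0"))  # .gen_paths/my_dataset_1.0.0/
```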

One can manually select the `relative_path` during registration if they explicitly care about where the data is located relative to the `root_dir`, for example

In [None]:
# Add new entry with a manual relative path.
datareg.Registrar.dataset.register(
    "nersc_tutorial:my_desc_dataset_with_relative_path",
    "1.0.0",
    relative_path=f"NERSC_tutorial/{OWNER}/my_desc_dataset",
    location_type="dummy", # for testing, means we need no actual data to exist
)

will register a dataset under the `relative_path` of `NERSC_tutorial/<OWNER>/my_desc_dataset`, where `<OWNER>` is the random owner name generated at the start of this tutorial.

If the registered dataset was a single file, the `relative_path` will be the explicit (relative) pathname to that file, e.g., `.gen_paths/mydataset_1.0.0/myfile.txt` or `my/manual/path/myfile.txt`. If the registered dataset was a directory, the `relative_path` is the pathname to the directory containing the dataset contents.

For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. Naturally, the `relative_path` you select cannot already be taken by another dataset (an error will be raised in this case), and any manually specified `relative_path` cannot start with `.gen_paths` as this directory is reserved for autogenerated `relative_path`s.
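
The path layout and the reserved-prefix rule can be sketched in a few lines of plain Python. This is an assumption-laden illustration rather than the registry's own code: `build_full_path` and `check_manual_relative_path` are hypothetical helpers, and every value used (including the `"user"` owner type and the root directory) is a placeholder.

```python
import posixpath

RESERVED_PREFIX = ".gen_paths"  # reserved for autogenerated relative paths


def check_manual_relative_path(relative_path: str) -> None:
    """Reject manual relative paths that start with the reserved prefix."""
    if relative_path.startswith(RESERVED_PREFIX):
        raise ValueError(
            f"relative_path cannot start with '{RESERVED_PREFIX}': {relative_path}"
        )


def build_full_path(root_dir, schema, owner_type, owner, relative_path):
    """Join the components in the order given in the text above."""
    return posixpath.join(root_dir, schema, owner_type, owner, relative_path)


# All values below are placeholders, not real registry settings
relative_path = "NERSC_tutorial/tutorial_alice_42/my_desc_dataset"
check_manual_relative_path(relative_path)  # passes silently
full_path = build_full_path(
    "/path/to/root_dir", "tutorial_working", "user", "tutorial_alice_42", relative_path
)
print(full_path)
```

In practice there is no need to assemble this path yourself, since the registry provides a helper for it.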

When you overwrite a previous dataset entry using the `replace()` function, the original `relative_path` at registration (automatically generated or manual) will be used.

One can construct the full absolute path to a dataregistry file using the `get_dataset_absolute_path()` helper function, e.g.,

In [None]:
# Find the full absolute path to a dataset using the dataset id
absolute_path = datareg.Query.get_dataset_absolute_path(dataset_id)

print(f"The absolute path for {dataset_id} is {absolute_path}")

## 3) The dataset `status` and history

Datasets can go through multiple changes from the point they are registered. A partial history of the dataset and its current status is embedded within the metadata.

### The dataset status

The `status` column for a dataset is an integer bitmask that reports the current state of the dataset. The four bits are, in order:

- bit 0: `valid`: This should generally always be true. If the valid bit is false, it means that there was an issue during registration, most likely a failure during data copying.
- bit 1: `deleted`: If this bit is true, the files belonging to the dataset have been deleted from the `root_dir`, but the entry in the database persists.
- bit 2: `archived`: If this bit is true, the dataset has been archived from the `root_dir` to `archive_path`, but the entry in the database persists.
- bit 3: `replaced`: When a dataset is overwritten using the `replace()` function, a new entry is formed in the database each time, with the old entry pointing to the new one. Replaced entries will have true for their `replaced` bit.

There are utilities in the `dataregistry` package to help decipher a dataset's status. For example

In [None]:
from dataregistry.registrar.dataset_util import get_dataset_status

# `get_dataset_status` takes a dataset `status` value and the name of a bit
# (e.g. 'valid'), and returns whether that bit is True or False
dataset_status = 1

# Is dataset valid?
print(f"Dataset is valid: {get_dataset_status(dataset_status, 'valid')}")

# Is dataset replaced?
print(f"Dataset is replaced: {get_dataset_status(dataset_status, 'replaced')}")
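
For intuition, the bit layout listed above can also be decoded by hand with ordinary bitwise operations. This is a standalone sketch: `STATUS_BITS` and `decode_status` are hypothetical names, not part of `dataregistry`, and in practice the `get_dataset_status` helper should be preferred.

```python
# Illustrative only: bit positions follow the list in "The dataset status".
STATUS_BITS = {"valid": 0, "deleted": 1, "archived": 2, "replaced": 3}


def decode_status(status: int) -> dict:
    """Return a mapping of bit name to True/False for a status bitmask."""
    return {name: bool(status & (1 << bit)) for name, bit in STATUS_BITS.items()}


# status = 1 (binary 0001): only the `valid` bit is set
print(decode_status(1))

# status = 9 (binary 1001): `valid` and `replaced` are both set
print(decode_status(9))
```

A valid dataset that has since been replaced therefore has `status = 9`, while a valid dataset whose files were deleted has `status = 3`.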

### The dataset history

There are some columns for datasets that track a rudimentary history:

- `register_date`: Stores when the dataset was registered
- `creation_date`: When the dataset itself was created, read automatically from the file metadata (note this can be manually overwritten during dataset registration)
- `archive_date`: If the dataset has been archived out of the registry `root_dir`, this is when it happened
- `delete_date`: If the dataset has been deleted, this is when it happened. There is also `delete_uid`, which is the user ID of the person that deleted the data.