<div style="overflow: hidden;">
    <img src="images/DREGS_logo_v2.png" width="300" style="float: left; margin-right: 10px;">
</div>

# Getting started: Part 1 - Registering datasets

The *DESC data registry* is a means of storing and keeping track of DESC-related datasets, i.e., where they are, when they were produced and what precursor datasets they depend on.

This is a quick tutorial to get you started using the `dataregistry` at NERSC, both for entering your own datasets into the registry and for finding out what datasets already exist through queries.

### What we cover in this tutorial

In this tutorial we will learn how to:

1) Connect to the DESC data registry using the `DataRegistry` class
2) Register a dataset
3) Update a registered dataset with a new version
4) Replace a dataset
5) Modify a previously registered dataset with updated metadata
6) Delete a dataset

### Before we begin

To run this tutorial interactively, first check out the [getting setup](https://lsstdesc.org/dataregistry/tutorial_setup.html) page of the documentation, if you haven't done so already.

A quick way to check everything is set up correctly is to run the first cell below, which should load the `dataregistry` package, and print the package version.

In [None]:
# Come up with a random owner name to avoid clashes
from random import randint
import os
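# Fall back to a placeholder if $USER is unset (otherwise the concatenation raises a TypeError)
OWNER = "tutorial_" + os.environ.get('USER', 'unknown') + '_' + str(randint(0,int(1e6)))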

import dataregistry
print(f"Working with dataregistry version: {dataregistry.__version__} as random owner {OWNER}")

> **Note:** Some of the cells below may fail, especially if they are run multiple times. The most likely cause is a clash with the unique constraints within the database (hopefully the error output is informative). If this happens, either (1) rerun the cell above to generate a new random owner and then rerun the connection cell in Section 1 so that the new owner takes effect, or (2) manually change the conflicting database column(s) that clash during registration.
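
To see what such a clash looks like in miniature, here is a self-contained sketch that uses Python's built-in `sqlite3` module rather than the registry itself. The table layout and the `try_register` helper are hypothetical. The only detail taken from this tutorial is that the combination of `name`, `version`, `owner` and `owner_type` must be unique (see Section 2).

```python
import sqlite3

# Toy in-memory table; the real registry database and schema are different.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset (
        name TEXT, version TEXT, owner TEXT, owner_type TEXT,
        UNIQUE (name, version, owner, owner_type)
    )
    """
)


def try_register(name, version, owner, owner_type):
    """Hypothetical helper: insert a row, return False on a uniqueness clash."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dataset VALUES (?, ?, ?, ?)",
                (name, version, owner, owner_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = try_register("my_dataset", "1.0.0", "tutorial_alice_1", "group")
rerun = try_register("my_dataset", "1.0.0", "tutorial_alice_1", "group")
new_owner = try_register("my_dataset", "1.0.0", "tutorial_alice_2", "group")
print(first, rerun, new_owner)
```

Running the same registration twice fails the second time, while changing the owner makes the combination unique again. This is why a fresh random owner usually resolves clashes in this notebook.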

## 1) The DataRegistry class

The top-level `DataRegistry` class provides a quick and easy-to-use interface to `dataregistry` functionality.

In many cases, in particular when you only need query functionality, or when you're working on your own at NERSC, a simple call with no additional arguments is adequate, e.g.,

```python
from dataregistry import DataRegistry

# Create connection object
datareg = DataRegistry()
```

By default, initializing the `DataRegistry` class without arguments will:  

- Attempt to establish a connection to the registry database using credentials from your `~/.config_reg_access` and `~/.pgpass` files.  
- Connect to the default LSST DESC namespace.  
- Set the default `root_dir` to the NERSC `site`.  
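
If the connection fails, a quick first check is whether the two credential files named above exist. The snippet below is a minimal sketch using only the standard library. The helper `find_credential_files` is hypothetical and is not part of `dataregistry`; it only tests for the presence of the files, not their contents.

```python
from pathlib import Path


def find_credential_files(home=None):
    """Report which of the two credential files exist under `home`."""
    home = Path(home) if home is not None else Path.home()
    names = (".config_reg_access", ".pgpass")
    return {name: (home / name).is_file() for name in names}


for name, present in find_credential_files().items():
    print(f"~/{name}: {'found' if present else 'missing'}")
```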

The root directory (`root_dir`) serves as the base path where all ingested data will be stored. Outside of testing, this should typically point to the NERSC `site` location.  

> **Note:** The code snippet above is in a markdown cell and is not meant to be executed. Instead, we will run the modified code below to connect to the tutorial namespace.

For more details on alternative ways to connect to the registry using the `DataRegistry` object, refer to the **Advanced Tutorial** section.

### Connecting to the Tutorial Namespace  

In these tutorials, we will connect to the tutorial namespace to avoid modifying the default LSST DESC namespace with test entries. If you're practicing with `dataregistry` outside of these notebooks, you are welcome to use the tutorial schemas for your own entries.  

Within these notebooks, the `reg_reader` account has write access to the tutorial schema for testing purposes. However, only the `reg_writer` account has permission to write to the LSST DESC namespace.  

For more details on permissions and dataset management, refer to the **Advanced Tutorial on Datasets**.

Let's start by connecting to the tutorial namespace:

In [None]:
from dataregistry import DataRegistry

# Connect to the tutorial namespace and set the default owner to `OWNER`
datareg = DataRegistry(namespace="tutorial", owner=OWNER)

## 2) Registering new datasets with the `DataRegistry` class

First, a bit of setup: we create some temporary files to serve as dummy data for this tutorial.

In [None]:
import tempfile

# Create three temporary text files with example data
temp_files = [tempfile.NamedTemporaryFile(delete=False, mode='wb') for i in range(3)]

for i, temp_file in enumerate(temp_files):
    temp_file.write(f"This is some temporary data, number {i}".encode())
    temp_file.close()

In [None]:
# Add new entry.
dataset_id, execution_id = datareg.registrar.dataset.register(
    "nersc_tutorial:my_first_desc_dataset",           # `name`
    "1.0.0",                                          # `version`
    description="An output from some DESC code",
    owner_type="group",
    is_overwritable=True,
    old_location=temp_files[0].name
)

# This is the unique identifier assigned to the dataset from the registry
print(f"Dataset {dataset_id} created")

# This is the id of the execution the dataset belongs to (see advanced registering tutorial)
print(f"Dataset assigned to execution {execution_id}")

The code above registers a new dataset. Here are some key details:  

### Dataset Name (Mandatory)  

The first required argument for the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_first_desc_dataset`.  
- This name is primarily for human readability and can be any valid string.  
- However, special characters (`&*/\?$`) and spaces **are not allowed**.  
- The combination of `name`, `version`, `owner`, and `owner_type` must be **unique** in the database.  

A well-chosen `name` makes it easy to retrieve the dataset for queries and updates.  
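
The character restriction above can be checked before calling `register()`. The following is a minimal sketch, not the registry's own validation: `is_valid_dataset_name` is a hypothetical helper that only tests for the characters listed in this section, and the registry may apply additional rules.

```python
# Characters this tutorial lists as disallowed in dataset names, plus the space
FORBIDDEN_CHARACTERS = set("&*/\\?$ ")


def is_valid_dataset_name(name):
    """Return True if `name` contains none of the forbidden characters."""
    return not FORBIDDEN_CHARACTERS.intersection(name)


for candidate in ("nersc_tutorial:my_first_desc_dataset", "my dataset", "run/42"):
    print(f"{candidate!r}: {is_valid_dataset_name(candidate)}")
```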

### Version String (Mandatory)  

The second required parameter is the version string, which follows semantic versioning:  
- Format: `MAJOR.MINOR.PATCH` (e.g., `1.0.0`).  

### Owner and Owner Type  

Each dataset is registered under an owner, which should be an informative string.  
- If no `owner` is specified, and no global `owner` was set when the `DataRegistry` instance was created, the default is `$USER`.  

Datasets are further classified by `owner_type`, which can be: `user` (default), `group`, `project` or `production`.
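
The order in which the owner is chosen can be sketched as a small function. This is an illustration of the rule described above, not the registry's code: `resolve_owner` and its parameter names are hypothetical, and rejecting an unknown `owner_type` with a `ValueError` is an assumption made for this sketch.

```python
import os

OWNER_TYPES = ("user", "group", "project", "production")


def resolve_owner(owner=None, default_owner=None, owner_type="user"):
    """Pick the owner: explicit value, then the global default, then $USER."""
    if owner_type not in OWNER_TYPES:
        raise ValueError(f"owner_type must be one of {OWNER_TYPES}")
    if owner is None:
        owner = default_owner if default_owner is not None else os.environ.get("USER")
    return owner, owner_type


# An explicit owner wins over the global default set at connection time
print(resolve_owner(owner="desc_group", default_owner="tutorial_owner", owner_type="group"))
# Without an explicit owner, the global default is used
print(resolve_owner(default_owner="tutorial_owner"))
```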

### Overwriting Datasets  

By default, datasets **cannot** be overwritten once registered.  
- To allow overwriting, set `is_overwritable=True` when registering the dataset.  
- If a dataset was previously registered with `is_overwritable=True`, it can be replaced (see *Replacing a Dataset* below).  

### Copying the Data  

Registering a dataset performs two actions:  
1. Creates an entry in the DESC data registry database with metadata.  
2. (Optionally) Copies the dataset contents from `old_location` to the `root_dir`. This is optional because the data may already be on location (see advanced registering tutorial for these cases).

In our example, we registered a dummy text file, but any file or directory can be ingested (directories are copied recursively).  

> **Note:** The data registry does not support symbolic links (symlinks) for dataset registration.  
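
The copy step can be pictured with a standard-library sketch. The helper `copy_into_root` and the `group/my_dataset` relative path are hypothetical, and the registry's real implementation and directory layout may differ. Refusing symlinks with an error is an assumption chosen to mirror the note above.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_root(old_location, root_dir, relative_path):
    """Copy a file or directory tree to root_dir/relative_path (hypothetical helper)."""
    src = Path(old_location)
    if src.is_symlink():
        # Assumption: this sketch simply refuses symbolic links
        raise ValueError(f"{src} is a symbolic link, which is not supported")
    dest = Path(root_dir) / relative_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    if src.is_dir():
        shutil.copytree(src, dest)  # recursive copy of the whole tree
    else:
        shutil.copy2(src, dest)  # single file
    return dest


# Demo with throwaway directories standing in for the source data and `root_dir`
with tempfile.TemporaryDirectory() as scratch:
    source = Path(scratch) / "source_data"
    (source / "sub").mkdir(parents=True)
    (source / "sub" / "table.txt").write_text("dummy data")
    root_dir = Path(scratch) / "root"

    copied = copy_into_root(source, root_dir, "group/my_dataset")
    print(sorted(p.relative_to(copied).as_posix() for p in copied.rglob("*")))
```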

### Extra Options  

Any parameters not explicitly specified in `registrar.dataset.register()` revert to their default values. For a full list of available options, refer to the documentation: [Data Registry API Reference](http://lsstdesc.org/dataregistry/reference_python.html#the-dregs-class).  

> **Tip:** Be as precise as possible when creating entries in the data registry to ensure clarity and consistency.  


## 3) Updating a Previously Registered Dataset with a Newer Version  

If you have a dataset already registered in the data registry, but the dataset has now been updated, registering the new version is straightforward.  

To update a dataset, register it again using the same dataset `name` but with a new `version`. There are two ways to update the version:  
1. **Manually** enter a new version string.  
2. **Automatically increment** the version using the data registry’s built-in versioning system. You can choose to bump one of the following:  
   - `"major"` (e.g., `1.0.0 → 2.0.0`)  
   - `"minor"` (e.g., `1.0.0 → 1.1.0`)  
   - `"patch"` (e.g., `1.0.0 → 1.0.1`)  
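
The three bump rules can be written out in a few lines. This is a sketch of the arithmetic only, with a hypothetical helper `bump_version`. It follows the usual semantic-versioning convention of resetting the lower components to zero, which matches the examples above but is otherwise an assumption about how the registry behaves.

```python
def bump_version(current, part):
    """Return `current` (MAJOR.MINOR.PATCH) with `part` incremented."""
    major, minor, patch = (int(piece) for piece in current.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError("part must be 'major', 'minor' or 'patch'")


for part in ("major", "minor", "patch"):
    print(f"1.0.0 bumped by {part}: {bump_version('1.0.0', part)}")
```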

For example, let's register an updated version of our dataset, increasing the minor version (`1.0.0 → 1.1.0`).  

> **Note:** Both versions (`1.0.0` and `1.1.0`) will remain stored within `root_dir`, sharing the same dataset name but existing as separate database entries.  


In [None]:
# Add new entry for an updated dataset with an updated version.
updated_dataset_id, updated_execution_id = datareg.registrar.dataset.register(
    "nersc_tutorial:my_first_desc_dataset",
    "minor", # Automatically bumps to "1.1.0"
    description="An output from some DESC code (updated)",
    owner_type="group",
    is_overwritable=True,
    old_location=temp_files[1].name,
)

# This is the unique identifier assigned to the updated dataset from the registry
print(f"Dataset {updated_dataset_id} created")

# This is the id of the execution the updated dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {updated_execution_id}")

## 4) Replacing a Dataset  

Instead of updating a dataset to a new version (where each version is retained in the registry), you can replace an existing dataset using the `replace()` method of the `Registrar` class.  

Replacing a dataset is similar to registering a new entry, with one key difference: `replace()` first deletes any previous data files associated with the dataset. It then registers a new entry that keeps the same `name`, `version`, `owner`, `owner_type` and `relative_path` (the dataset's location within `root_dir`; see the advanced registering tutorial for details).

> **Important:** The dataset being replaced must have been registered with `is_overwritable=True`.  

While the actual data files are overwritten, previous database entries are never lost:  
- A new unique entry is created each time a dataset is replaced.  
- The original entry is marked with a `replaced` status.  
- The `replace_id` column links the old dataset to the new one that replaced it.  
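
The bookkeeping in the list above can be modelled with a toy in-memory table. Everything here is a hypothetical stand-in for the real database: the `Entry` class, the string statuses, and the choice to store `replace_id` on the old entry (pointing at its replacement) are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Entry:
    dataset_id: int
    name: str
    version: str
    status: str = "valid"
    replace_id: Optional[int] = None  # id of the entry that replaced this one


_next_id = count(1)
table = {}


def register(name, version):
    """Create a new entry with a fresh unique id."""
    entry = Entry(next(_next_id), name, version)
    table[entry.dataset_id] = entry
    return entry


def replace(old_id):
    """Register a replacement and link the old entry to it."""
    old = table[old_id]
    new = register(old.name, old.version)
    old.status = "replaced"
    old.replace_id = new.dataset_id
    return new


original = register("my_dataset", "1.0.0")
replacement = replace(original.dataset_id)
print(original)
print(replacement)
```

Both rows survive in the table, so the history of a dataset can still be traced after its files have been overwritten.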

To replace the first entry we made, we register new data under the same `name`, `version`, `owner`, and `owner_type`. This ensures the correct dataset is located and replaced.  

In [None]:
# Add new entry, overwriting the data in the `root_dir`.
updated_dataset_id, updated_execution_id = datareg.registrar.dataset.replace(
    "nersc_tutorial:my_first_desc_dataset",
    "1.0.0",                                          
    description="An output from some DESC code (further updated)",
    owner_type="group",
    is_overwritable=True,
    old_location=temp_files[2].name,
)

# This is the unique identifier assigned to the replaced dataset from the registry
print(f"Dataset {updated_dataset_id} created")

# This is the id of the execution the replaced dataset belongs to (see next tutorial)
print(f"Dataset assigned to execution {updated_execution_id}")

### Restrictions on Dataset Replacement  

Only the following datasets can be replaced:  
- Valid datasets registered with `is_overwritable=True`.  
- Invalid datasets (datasets that failed to register due to copying errors or interruptions).  

> **Note:** Deleted or archived datasets **cannot** be replaced. See the advanced registering tutorial for more details on dataset `status`.  
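
These rules amount to a small decision function. The sketch below uses plain strings for the status, which is an assumption; how the registry actually encodes `status` is covered in the advanced registering tutorial. `can_replace` is a hypothetical helper.

```python
def can_replace(status, is_overwritable):
    """Apply the replacement rules from this section to one dataset."""
    if status == "invalid":
        return True  # failed registrations can always be replaced
    if status == "valid":
        return is_overwritable  # valid datasets need is_overwritable=True
    return False  # deleted, archived, and anything else


for status, overwritable in [("valid", True), ("valid", False), ("invalid", False), ("deleted", True)]:
    print(f"status={status}, is_overwritable={overwritable}: {can_replace(status, overwritable)}")
```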

## 5) Modifying the metadata of a previously registered dataset

After a dataset has been registered in the `dataregistry`, certain metadata columns can still be modified.  

We can check which metadata columns we are allowed to modify by running:

In [None]:
# What columns in the dataset table are modifiable?
print(datareg.registrar.dataset.get_modifiable_columns())

To modify the metadata of a dataset entry, use the `.modify()` function for the appropriate table (in our example, the `dataset` table).

This function requires:

1) The unique entry ID of the dataset you wish to modify.
2) A key-value pair dictionary containing the metadata changes.
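
Before sending an update, it can help to compare the dictionary against the list of modifiable columns. The sketch below is a hypothetical pre-check, not part of `dataregistry`. The column set shown is a placeholder; in practice you would pass the result of `get_modifiable_columns()`.

```python
def check_update(update_dict, modifiable_columns):
    """Raise if `update_dict` touches a column that is not modifiable."""
    rejected = sorted(set(update_dict) - set(modifiable_columns))
    if rejected:
        raise ValueError(f"Columns not modifiable: {rejected}")
    return dict(update_dict)


# Placeholder list; use datareg.registrar.dataset.get_modifiable_columns() in practice
modifiable = {"description"}

print(check_update({"description": "My new updated description"}, modifiable))
```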

For instance, if we want to update the description column of our first dataset entry, we would use the following code:

In [None]:
# A key-value dict of the columns we want to update, with their new values
update_dict = {"description": "My new updated description"}

datareg.registrar.dataset.modify(dataset_id, update_dict)

## 6) Deleting a dataset in the dataregistry

To delete a dataset entry from the data registry, use the `.delete()` function, which requires four arguments: `name`, `version_string`, `owner` and `owner_type`.

For example:

In [None]:
# Delete a dataset
datareg.registrar.dataset.delete("nersc_tutorial:my_first_desc_dataset", "1.0.0", OWNER, "group")

Important Notes:
- The files and directories associated with the dataset in the `root_dir` will be removed.
- However, the registry database entry will remain, with its `status` updated to indicate that the dataset has been deleted.
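
The two notes above describe a "soft delete": the data goes, the record stays. Here is a toy illustration using a temporary file and a dictionary in place of the real `root_dir` and database. The `delete_dataset` helper, the dictionary layout and the `"deleted"` status string are all hypothetical.

```python
import os
import tempfile

# A throwaway file stands in for the dataset's data in `root_dir`
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"dummy data")

entry = {"name": "my_dataset", "version": "1.0.0", "status": "valid", "path": handle.name}


def delete_dataset(entry):
    """Remove the data on disk but keep the record, flagged as deleted."""
    if os.path.exists(entry["path"]):
        os.remove(entry["path"])
    entry["status"] = "deleted"
    return entry


delete_dataset(entry)
print(entry["status"], os.path.exists(entry["path"]))
```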