Swap name and relative_path as the mandatory entry #123

Open
wants to merge 8 commits into
base: main
130 changes: 113 additions & 17 deletions docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@
"4) Modify a previously registered dataset with updated metadata\n",
"5) Delete a dataset\n",
"6) Special cases\n",
" * Registering external datasets \n",
" * Registering external datasets\n",
" * Manually specifying the relative path\n",
"\n",
"### Before we begin\n",
"\n",
Expand Down Expand Up @@ -166,7 +167,7 @@
"\n",
"# Add new entry.\n",
"dataset_id, execution_id = datareg.Registrar.dataset.register(\n",
" \"nersc_tutorial/my_desc_dataset\",\n",
" \"nersc_tutorial:my_desc_dataset\",\n",
" \"1.0.0\",\n",
" description=\"An output from some DESC code\",\n",
" owner=\"DESC\",\n",
Expand All @@ -191,13 +192,13 @@
"source": [
"This will register a new dataset. A few notes:\n",
"\n",
"### The relative path\n",
"### The dataset name (mandatory)\n",
"\n",
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n",
"The first of the two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (there is nothing special about the `:` here; the `name` can be any legal string). Choose any convenient, evocative, human-readable name, but note that the special characters `&*/\\?$` are not allowed in the `name` string. The combination of `name`, `version` and `version_suffix` must be unique in the database.\n",
"\n",
"The relative path is one of the two required parameters you must specify when registering a dataset (in the example here our relative path is `nersc_tutorial/my_desc_dataset`).\n",
"The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n",
"\n",
"### The version string\n",
"### The version string (mandatory)\n",
"\n",
"The second required parameter is the version string, in the semantic format, i.e., MAJOR.MINOR.PATCH. There exists also an optional ``version_suffix`` parameter, which may be used to further identify the dataset, e.g. with a value like \"rc1\" to make it clear it's only a release candidate, possibly not in its final form.\n",
"\n",
Expand Down Expand Up @@ -238,15 +239,11 @@
"\n",
"If you have a dataset that has been previously registered within the data registry, and that dataset has updates, you have three options for how to handle the new entry:\n",
"\n",
"- You can enter it as a completely new standalone dataset with no links to the previous dataset\n",
"- You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n",
"- You can enter it as a new version of the previous dataset (recommended)\n",
"1. You can enter it as a new version of the previous dataset (recommended in most situations)\n",
"2. You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n",
"3. You can enter it as a completely new standalone dataset with no links to the previous dataset\n",
"\n",
"Unless you are overwriting a previous dataset (when `is_overwritable=True`), you cannot enter a new dataset (even an updated version) using the same relative path. However, datasets can share the same `name` field, which acts as a reference for the dataset, and which is what we'll use to keep our updated dataset connected to our previous one.\n",
"\n",
"Note that for our test dataset entry above we did not specify a `name` during registration. The default `name` for a dataset is the file or directory name, extracted from the relative path. In our example above this was `my_desc_dataset`. If you are unsure of what name you specified for your previous dataset that you want to overwrite, you will need to query the database first to find out (see next tutorial for queries).\n",
"\n",
"The combination of `name`, `version` and `version_suffix` for any dataset must be unique. As we are updating a dataset with the same name, we have to make sure to update the version to a new value. One handy feature is automatic version \"bumping\" for datasets, i.e., rather than specifying a new version string manually, one can select \"major\", \"minor\" or \"patch\" for the version string to automatically bump that property of the version string up. In our case, selecting \"minor\" will automatically generate the version \"1.1.0\"."
"For 1. we register a new dataset as before, making sure to keep the same dataset `name`, but updating the dataset `version`. One can update the `version` in two ways: manually entering a new version string, or having the `dataregistry` automatically \"bump\" the dataset version by selecting either \"major\", \"minor\" or \"patch\" for the version string. For example, let's register an updated version of our dataset, bumping the minor tag (i.e., bumping `1.0.0` -> `1.1.0`)."
]
},
{
Expand All @@ -258,14 +255,55 @@
},
"outputs": [],
"source": [
"# Create a small text file as some example data\n",
"with open(\"updated_dummy_dataset.txt\", \"w\") as f:\n",
" f.write(\"some updated data\")\n",
"\n",
"# Add new entry for an updated dataset with an updated version.\n",
"updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n",
" \"nersc_tutorial/my_updated_desc_dataset\",\n",
" \"nersc_tutorial:my_desc_dataset\",\n",
" \"minor\", # Automatically bumps to \"1.1.0\"\n",
" description=\"An output from some DESC code (updated)\",\n",
" is_overwritable=True,\n",
" old_location=\"dummy_dataset.txt\",\n",
" name=\"my_desc_dataset\" # Using this name links it to the previous dataset.\n",
" old_location=\"updated_dummy_dataset.txt\",\n",
")\n",
"\n",
"# This is the unique identifier assigned to the updated dataset from the registry\n",
"print(f\"Dataset {updated_dataset_id} created\")\n",
"\n",
"# This is the id of the execution the updated dataset belongs to (see next tutorial)\n",
"print(f\"Dataset assigned to execution {updated_execution_id}\")"
]
},
{
"cell_type": "markdown",
"id": "d60f1bbb-adcb-4847-9e65-d8d14d928041",
"metadata": {},
"source": [
"Note that both sets of data, from versions `1.0.0` and `1.1.0`, still exist, and they are linked through the dataset `name`.\n",
"\n",
"For 2., to update a previous dataset and overwrite the existing data, we have to pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2494eeea-b53f-472b-aafd-88216cb36494",
"metadata": {},
"outputs": [],
"source": [
"# Create a small text file as some example data\n",
"with open(\"updated_dummy_dataset_again.txt\", \"w\") as f:\n",
" f.write(\"some further updated data\")\n",
"\n",
"# Add a new entry (version 1.1.1) whose data overwrites the data stored under an existing relative path.\n",
"updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n",
" \"nersc_tutorial:my_desc_dataset\",\n",
" \"patch\", # Automatically bumps to \"1.1.1\"\n",
" description=\"An output from some DESC code (further updated)\",\n",
" is_overwritable=True,\n",
" old_location=\"updated_dummy_dataset_again.txt\",\n",
" relative_path=\"nersc_tutorial:my_desc_dataset_1.1.0\",\n",
")\n",
"\n",
"# This is the unique identifier assigned to the updated dataset from the registry\n",
Expand All @@ -275,6 +313,16 @@
"print(f\"Dataset assigned to execution {updated_execution_id}\")"
]
},
{
"cell_type": "markdown",
"id": "dbdff8de-973e-4aa6-90ba-42f448d87222",
"metadata": {},
"source": [
"will create a new dataset, version `1.1.1`, but the new data has overwritten the data for version `1.1.0`.\n",
"\n",
"For 3. simply follow the procedure above for registering a new dataset."
]
},
{
"cell_type": "markdown",
"id": "1cc01c4e-8404-4fb4-b082-8969546f3ffc",
Expand Down Expand Up @@ -385,6 +433,54 @@
" url=\"www.data.com\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e95c0aa9-3b05-4b80-a91c-e6f643365bcb",
"metadata": {},
"source": [
"### Specifying the relative path\n",
"\n",
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. \n",
"\n",
"By default, the `relative_path` is constructed from the `name`, `version` and `version_suffix` (if there is one), in the format `relative_path=<name>_<version>_<version_suffix>`. However, one can also manually select the `relative_path` during registration, for example"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eda857c-e9b5-40ab-999f-65b858831a08",
"metadata": {},
"outputs": [],
"source": [
"# Add new entry with a manual relative path.\n",
"datareg.Registrar.dataset.register(\n",
" \"nersc_tutorial:my_desc_dataset_with_relative_path\",\n",
" \"1.0.0\",\n",
" is_dummy=True, # For testing purposes, means we need no actual data to work (only a database entry is created)\n",
" relative_path=\"nersc_tutorial/my_desc_dataset\",\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8dbb1c0a-70b8-452b-8b13-a98901359afa",
"metadata": {},
"source": [
"will register a dataset under the `relative_path` of `nersc_tutorial/my_desc_dataset`.\n",
"\n",
"For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable). \n",
"\n",
"You can leave `name` as `None` when registering using a manual `relative_path`, in which case the `name` is constructed automatically from the `relative_path`. However, we always recommend being explicit and also choosing a `name`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2222b9e-c986-4595-9404-2fd56eb40c5f",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
Expand Down
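The automatic version "bump" described in the tutorial text above can be sketched as follows. This is only an illustration and not the registry's own `_bump_version`: the function name `bump_version` is hypothetical, and resetting the lower fields to zero on a major or minor bump is an assumption based on the usual semantic versioning convention.

```python
def bump_version(previous, part):
    """Return a new MAJOR.MINOR.PATCH string with `part` bumped (hypothetical helper)."""
    major, minor, patch = (int(field) for field in previous.split("."))

    if part == "major":
        # Assumption: a major bump resets minor and patch to zero
        return f"{major + 1}.0.0"
    if part == "minor":
        # Assumption: a minor bump resets patch to zero
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"

    raise ValueError(f"Unknown version part: {part}")


# The two bumps used in the tutorial cells
print(bump_version("1.0.0", "minor"))  # 1.1.0
print(bump_version("1.1.0", "patch"))  # 1.1.1
```

In the tutorial the same effect is obtained by passing `"minor"` or `"patch"` as the version argument to `register()`.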
56 changes: 39 additions & 17 deletions src/dataregistry/registrar/dataset.py
Expand Up @@ -11,13 +11,15 @@
_bump_version,
_copy_data,
_form_dataset_path,
_name_from_relpath,
_relpath_from_name,
_parse_version_string,
_read_configuration_file,
get_directory_info,
_name_from_relpath
)
from .dataset_util import set_dataset_status, get_dataset_status

_ILLEGAL_NAME_CHAR = ["$","*","&","/","?","\\"]

class DatasetTable(BaseTable):
def __init__(self, db_connection, root_dir, owner, owner_type, execution_table):
Expand All @@ -29,10 +31,9 @@ def __init__(self, db_connection, root_dir, owner, owner_type, execution_table):

def register(
self,
relative_path,
name,
version,
version_suffix=None,
name=None,
creation_date=None,
description=None,
execution_id=None,
Expand All @@ -55,6 +56,7 @@ def register(
location_type="dataregistry",
url=None,
contact_email=None,
relative_path=None,
):
"""
Create a new dataset entry in the DESC data registry.
Expand All @@ -70,10 +72,9 @@ def register(

Parameters
----------
relative_path** : str
name** : str
version** : str
version_suffix** : str, optional
name** : str, optional
creation_date** : datetime, optional
description** : str, optional
execution_id** : int, optional
Expand Down Expand Up @@ -109,6 +110,7 @@ def register(
url**: str, optional
For `location_type="external"` only
contact_email**: str, optional
relative_path** : str, optional

Returns
-------
Expand All @@ -118,6 +120,16 @@ def register(
The execution ID associated with the dataset
"""

# If `old_location` is None, we need a relative path
if old_location is None and location_type == "dataregistry":
if relative_path is None:
raise ValueError("`relative_path` is required when `old_location` is None")

# If no name, we need a `relative_path`
if name is None:
if relative_path is None:
raise ValueError("`relative_path` is required when `name` is None")

# If external dataset, check for either a `url` or `contact_email`
if location_type == "external":
if url is None and contact_email is None:
Expand Down Expand Up @@ -161,19 +173,8 @@ def register(
"Only owner_type='production' can go in the production schema"
)

# If `name` not passed, automatically generate a name from the relative path
if name is None:
name = _name_from_relpath(relative_path)

# Look for previous entries. Fail if not overwritable
dataset_table = self._get_table_metadata("dataset")
previous = self._find_previous(relative_path, owner, owner_type)

if previous is None:
print(f"Dataset {relative_path} exists, and is not overwritable")
return None, None

# Deal with version string (non-special case)
dataset_table = self._get_table_metadata("dataset")
if version not in ["major", "minor", "patch"]:
v_fields = _parse_version_string(version)
version_string = version
Expand All @@ -185,6 +186,27 @@ def register(
f"{v_fields['major']}.{v_fields['minor']}.{v_fields['patch']}"
)

# If `relative_path` not passed, automatically generate one from the
# name, version and version_suffix
if relative_path is None:
relative_path = _relpath_from_name(name, version_string, version_suffix)

# If `name` is None, generate it from the `relative_path`
if name is None:
name = _name_from_relpath(relative_path)

# Make sure `name` is legal (i.e., no illegal characters)
for i_char in _ILLEGAL_NAME_CHAR:
if i_char in name:
raise ValueError(f"Cannot have character {i_char} in name string")

# Look for previous entries. Fail if not overwritable
previous = self._find_previous(relative_path, owner, owner_type)

if previous is None:
print(f"Dataset {relative_path} exists, and is not overwritable")
return None, None

# If no execution_id is supplied, create a minimal entry
if execution_id is None:
if execution_name is None:
Expand Down
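The name check added to `register()` in this file can be tried in isolation. The sketch below copies the character list from `_ILLEGAL_NAME_CHAR` and wraps the loop in a standalone function; the function name `check_name` is hypothetical and is not part of the `dataregistry` package.

```python
# Characters that may not appear in a dataset name (copied from the diff above)
ILLEGAL_NAME_CHARS = ["$", "*", "&", "/", "?", "\\"]


def check_name(name):
    """Raise ValueError if `name` contains a forbidden character (hypothetical helper)."""
    for char in ILLEGAL_NAME_CHARS:
        if char in name:
            raise ValueError(f"Cannot have character {char} in name string")
    return name


# The new tutorial name passes, because ":" is not in the forbidden list
check_name("nersc_tutorial:my_desc_dataset")

# The old path-style value would be rejected as a name, because it contains "/"
try:
    check_name("nersc_tutorial/my_desc_dataset")
except ValueError as err:
    print(err)
```

This is presumably why the tutorial switches from `nersc_tutorial/my_desc_dataset` to `nersc_tutorial:my_desc_dataset` once the first argument becomes the `name` rather than the `relative_path`.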
29 changes: 29 additions & 0 deletions src/dataregistry/registrar/registrar_util.py
Expand Up @@ -12,6 +12,7 @@
"_form_dataset_path",
"get_directory_info",
"_name_from_relpath",
"_relpath_from_name",
"_copy_data",
]
VERSION_SEPARATOR = "."
Expand Down Expand Up @@ -340,3 +341,31 @@ def _compute_checksum(file_path):
)

raise Exception(e)

def _relpath_from_name(name, version, version_suffix):
"""
Construct a relative path from the name and version of a dataset.

We use this when the `relative_path` is not explicitly defined.

Parameters
----------
name : str
Dataset name
version : str
Dataset version
version_suffix : str
Dataset version suffix

Returns
-------
relative_path : str
Automatically generated `relative_path`
"""

if version_suffix is not None:
relative_path = f"{name}_{version}_{version_suffix}"
else:
relative_path = f"{name}_{version}"

return relative_path
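To see what the default `relative_path` looks like for the tutorial datasets, the helper above can be reproduced as a standalone function. The body follows the diff; only the public-style name `relpath_from_name` and the default value for `version_suffix` are changes made here for illustration.

```python
def relpath_from_name(name, version, version_suffix=None):
    """Standalone copy of the logic in `_relpath_from_name`, for illustration only."""
    if version_suffix is not None:
        return f"{name}_{version}_{version_suffix}"
    return f"{name}_{version}"


# No suffix: matches the `relative_path` passed in the tutorial's overwrite example
print(relpath_from_name("nersc_tutorial:my_desc_dataset", "1.1.0"))
# nersc_tutorial:my_desc_dataset_1.1.0

# With a suffix such as "rc1"
print(relpath_from_name("nersc_tutorial:my_desc_dataset", "1.1.0", "rc1"))
# nersc_tutorial:my_desc_dataset_1.1.0_rc1
```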
6 changes: 3 additions & 3 deletions src/dataregistry/schema/schema.yaml
Expand Up @@ -177,13 +177,13 @@ dataset:
description: "Unique identifier for this dataset"
name:
type: "String"
description: "Any convenient, evocative name for the human. Note the combination of name, version and version_suffix must be unique. If None name is generated from the relative path."
description: "Any convenient, evocative name for the human. Note the combination of name, version and version_suffix must be unique."
nullable: False
cli_optional: True
relative_path:
type: "String"
description: "Relative path storing the data, relative to `<root_dir>`"
description: "Relative path storing the data, relative to `<root_dir>`. If None, the relative path is generated automatically from the provided name and version."
nullable: False
cli_optional: True
version_major:
type: "Integer"
description: "Major version in semantic string (i.e., X.x.x)"
Expand Down
6 changes: 3 additions & 3 deletions src/dataregistry_cli/cli.py
Expand Up @@ -123,10 +123,10 @@ def get_parser():

# Entries unique to registering the dataset when using the CLI
arg_register_dataset.add_argument(
"relative_path",
"name",
help=(
"Destination for the dataset within the data registry. Path is"
"relative to <registry root>/<owner_type>/<owner>."
"Any convenient, evocative name for the human. "
"Note the combination of name, version and version_suffix must be unique."
Collaborator

Concerning the question you raised about uniqueness: I don't think that name, version and version_suffix need to be unique across the entire database. It's enough if (name, version, version_suffix, owner, owner_type) is unique

Collaborator Author

Will do this in another PR

),
type=str,
)
Expand Down
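The uniqueness rule suggested in the review comment above, where the tuple (name, version, version_suffix, owner, owner_type) is unique rather than (name, version, version_suffix) alone, could look roughly like the sketch below. It is purely illustrative and is not how `dataregistry` implements the check: the `DatasetKey` type, the `claim` function and the in-memory set are all hypothetical stand-ins for a database lookup, and the owner names are made up.

```python
from typing import NamedTuple, Optional


class DatasetKey(NamedTuple):
    """Hypothetical composite key proposed in the review comment."""

    name: str
    version: str
    version_suffix: Optional[str]
    owner: str
    owner_type: str


def claim(seen, key):
    """Record `key`, raising ValueError if the same combination was already claimed."""
    if key in seen:
        raise ValueError(f"Dataset {key} already exists")
    seen.add(key)


seen = set()

# Same name and version, but different owners: both are accepted under this rule
claim(seen, DatasetKey("my_desc_dataset", "1.0.0", None, "alice", "user"))
claim(seen, DatasetKey("my_desc_dataset", "1.0.0", None, "bob", "user"))

# Repeating an existing combination is rejected
try:
    claim(seen, DatasetKey("my_desc_dataset", "1.0.0", None, "alice", "user"))
except ValueError as err:
    print(err)
```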