LSSTDESC · stuartmcalpine · May 15, 2024
diff --git a/docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb b/docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb
@@ -25,7 +25,8 @@
     "4) Modify a previously registered dataset with updated metadata\n",
     "5) Delete a dataset\n",
     "6) Special cases\n",
-    "    * Registering external datasets \n",
+    "    * Registering external datasets\n",
+    "    * Manually specifying the relative path\n",
     "\n",
     "### Before we begin\n",
     "\n",
@@ -166,7 +167,7 @@
     "\n",
     "# Add new entry.\n",
     "dataset_id, execution_id = datareg.Registrar.dataset.register(\n",
-    "    \"nersc_tutorial/my_desc_dataset\",\n",
+    "    \"nersc_tutorial:my_desc_dataset\",\n",
     "    \"1.0.0\",\n",
     "    description=\"An output from some DESC code\",\n",
     "    owner=\"DESC\",\n",
@@ -191,13 +192,13 @@
    "source": [
     "This will register a new dataset. A few notes:\n",
     "\n",
-    "### The relative path\n",
+    "### The dataset name (mandatory)\n",
     "\n",
-    "Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n",
+    "The first of the two mandatory arguments to `register()` is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (there is nothing special about the `:` here; the `name` can be any legal string). Choose any convenient, evocative, human-readable name, but note that the special characters `&*/\\?$` may not appear in the `name` string. The combination of `name`, `version` and `version_suffix` must be unique in the database.\n",
     "\n",
-    "The relative path is one of the two required parameters you must specify when registering a dataset (in the example here our relative path is `nersc_tutorial/my_desc_dataset`).\n",
+    "The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n",
     "\n",
-    "### The version string\n",
+    "### The version string (mandatory)\n",
     "\n",
     "The second required parameter is the version string, in the semantic format, i.e., MAJOR.MINOR.PATCH. There exists also an optional ``version_suffix`` parameter, which may be used to further identify the dataset, e.g. with a value like \"rc1\" to make it clear it's only a release candidate, possibly not in its final form.\n",
     "\n",
@@ -238,15 +239,11 @@
     "\n",
     "If you have a dataset that has been previously registered within the data registry, and that dataset has updates, you have three options for how to handle the new entry:\n",
     "\n",
-    "- You can enter it as a completely new standalone dataset with no links to the previous dataset\n",
-    "- You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n",
-    "- You can enter it as a new version of the previous dataset (recommended)\n",
+    "1. You can enter it as a new version of the previous dataset (recommended in most situations)\n",
+    "2. You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n",
+    "3. You can enter it as a completely new standalone dataset with no links to the previous dataset\n",
     "\n",
-    "Unless you are overwriting a previous dataset (when `is_overwritable=True`), you cannot enter a new dataset (even an updated version) using the same relative path. However, datasets can share the same `name` field, which acts as a reference for the dataset, and which is what we'll use to keep our updated dataset connected to our previous one.\n",
-    "\n",
-    "Note that for our test dataset entry above we did not specify a `name` during registration. The default `name` for a dataset is the file or directory name, extracted from the relative path. In our example above this was `my_desc_dataset`. If you are unsure of what name you specified for your previous dataset that you want to overwrite, you will need to query the database first to find out (see next tutorial for queries).\n",
-    "\n",
-    "The combination of `name`, `version` and `version_suffix` for any dataset must be unique. As we are updating a dataset with the same name, we have to make sure to update the version to a new value. One handy feature is automatic version \"bumping\" for datasets, i.e., rather than specifying a new version string manually, one can select \"major\", \"minor\" or \"patch\" for the version string to automatically bump that property of the version string up. In our case, selecting \"minor\" will automatically generate the version \"1.1.0\"."
+    "For 1. we register a new dataset as before, making sure to keep the same dataset `name`, but updating the dataset `version`. One can update the `version` in two ways: manually entering a new version string, or having the `dataregistry` automatically \"bump\" the dataset version by selecting either \"major\", \"minor\" or \"patch\" for the version string. For example, let's register an updated version of our dataset, bumping the minor tag (i.e., bumping `1.0.0` -> `1.1.0`)."
    ]
   },
   {
@@ -258,14 +255,55 @@
    },
    "outputs": [],
    "source": [
+    "# Create a small text file to serve as example data\n",
+    "with open(\"updated_dummy_dataset.txt\", \"w\") as f:\n",
+    "    f.write(\"some updated data\")\n",
+    "\n",
     "# Add new entry for an updated dataset with an updated version.\n",
     "updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n",
-    "    \"nersc_tutorial/my_updated_desc_dataset\",\n",
+    "    \"nersc_tutorial:my_desc_dataset\",\n",
     "    \"minor\", # Automatically bumps to \"1.1.0\"\n",
     "    description=\"An output from some DESC code (updated)\",\n",
     "    is_overwritable=True,\n",
-    "    old_location=\"dummy_dataset.txt\",\n",
-    "    name=\"my_desc_dataset\" # Using this name links it to the previous dataset.\n",
+    "    old_location=\"updated_dummy_dataset.txt\",\n",
+    ")\n",
+    "\n",
+    "# This is the unique identifier assigned to the updated dataset from the registry\n",
+    "print(f\"Dataset {updated_dataset_id} created\")\n",
+    "\n",
+    "# This is the id of the execution the updated dataset belongs to (see next tutorial)\n",
+    "print(f\"Dataset assigned to execution {updated_execution_id}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d60f1bbb-adcb-4847-9e65-d8d14d928041",
+   "metadata": {},
+   "source": [
+    "Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.\n",
+    "\n",
+    "For 2., to update a previous dataset and overwrite the existing data, we have to pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2494eeea-b53f-472b-aafd-88216cb36494",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a small text file to serve as example data\n",
+    "with open(\"updated_dummy_dataset_again.txt\", \"w\") as f:\n",
+    "    f.write(\"some further updated data\")\n",
+    "\n",
+    "# Add a new entry (bumped patch version) whose data overwrites the data stored for version 1.1.0.\n",
+    "updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n",
+    "    \"nersc_tutorial:my_desc_dataset\",\n",
+    "    \"patch\", # Automatically bumps to \"1.1.1\"\n",
+    "    description=\"An output from some DESC code (further updated)\",\n",
+    "    is_overwritable=True,\n",
+    "    old_location=\"updated_dummy_dataset_again.txt\",\n",
+    "    relative_path=\"nersc_tutorial:my_desc_dataset_1.1.0\",\n",
     ")\n",
     "\n",
     "# This is the unique identifier assigned to the updated dataset from the registry\n",
@@ -275,6 +313,16 @@
     "print(f\"Dataset assigned to execution {updated_execution_id}\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "dbdff8de-973e-4aa6-90ba-42f448d87222",
+   "metadata": {},
+   "source": [
+    "will create a new dataset entry, version `1.1.1`, whose data overwrites the data previously stored for version `1.1.0`.\n",
+    "\n",
+    "For 3. simply follow the procedure above for registering a new dataset."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "1cc01c4e-8404-4fb4-b082-8969546f3ffc",
@@ -385,6 +433,54 @@
     "    url=\"www.data.com\",\n",
     ")"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e95c0aa9-3b05-4b80-a91c-e6f643365bcb",
+   "metadata": {},
+   "source": [
+    "### Specifying the relative path\n",
+    "\n",
+    "Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. \n",
+    "\n",
+    "By default, the `relative_path` is constructed from the `name`, `version` and `version_suffix` (if there is one), in the format `relative_path=<name>_<version>_<version_suffix>`. However, one can also manually select the `relative_path` during registration, for example"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2eda857c-e9b5-40ab-999f-65b858831a08",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Add new entry with a manual relative path.\n",
+    "datareg.Registrar.dataset.register(\n",
+    "    \"nersc_tutorial:my_desc_dataset_with_relative_path\",\n",
+    "    \"1.0.0\",\n",
+    "    is_dummy=True, # For testing purposes, means we need no actual data to work (only a database entry is created)\n",
+    "    relative_path=\"nersc_tutorial/my_desc_dataset\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8dbb1c0a-70b8-452b-8b13-a98901359afa",
+   "metadata": {},
+   "source": [
+    "will register a dataset under the `relative_path` of `nersc_tutorial/my_desc_dataset`.\n",
+    "\n",
+    "For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable). \n",
+    "\n",
+    "You can leave `name` as `None` when registering with a manual `relative_path`, in which case the `name` is constructed automatically from the `relative_path`. However, we always recommend being explicit and choosing a `name` as well."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2222b9e-c986-4595-9404-2fd56eb40c5f",
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {

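The version "bumping" described in the tutorial text can be illustrated with a minimal standalone sketch. `bump_version` is a hypothetical helper written for this note, not the package's `_bump_version`; resetting the lower fields to zero on a bump follows the usual semantic-versioning convention and is an assumption about the registry's behaviour.

```python
def bump_version(previous, part):
    """Return `previous` (MAJOR.MINOR.PATCH) with the field `part` incremented.

    Hypothetical illustration only; lower fields reset to zero (assumed).
    """
    major, minor, patch = (int(field) for field in previous.split("."))

    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"

    raise ValueError(f"Unknown version part: {part!r}")


# The two bumps used in the tutorial cells
print(bump_version("1.0.0", "minor"))  # 1.1.0
print(bump_version("1.1.0", "patch"))  # 1.1.1
```

These two calls match the bumps shown in the notebook cells (`1.0.0` -> `1.1.0` -> `1.1.1`).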
diff --git a/src/dataregistry/registrar/dataset.py b/src/dataregistry/registrar/dataset.py
@@ -11,13 +11,15 @@
     _bump_version,
     _copy_data,
     _form_dataset_path,
-    _name_from_relpath,
+    _relpath_from_name,
     _parse_version_string,
     _read_configuration_file,
     get_directory_info,
+    _name_from_relpath,
 )
 from .dataset_util import set_dataset_status, get_dataset_status
 
+_ILLEGAL_NAME_CHAR = ["$", "*", "&", "/", "?", "\\"]
 
 class DatasetTable(BaseTable):
     def __init__(self, db_connection, root_dir, owner, owner_type, execution_table):
@@ -29,10 +31,9 @@ def __init__(self, db_connection, root_dir, owner, owner_type, execution_table):
 
     def register(
         self,
-        relative_path,
+        name,
         version,
         version_suffix=None,
-        name=None,
         creation_date=None,
         description=None,
         execution_id=None,
@@ -55,6 +56,7 @@ def register(
         location_type="dataregistry",
         url=None,
         contact_email=None,
+        relative_path=None,
     ):
         """
         Create a new dataset entry in the DESC data registry.
@@ -70,10 +72,9 @@ def register(
 
         Parameters
         ----------
-        relative_path** : str
+        name** : str
         version** : str
         version_suffix** : str, optional
-        name** : str, optional
         creation_date** : datetime, optional
         description** : str, optional
         execution_id** : int, optional
@@ -109,6 +110,7 @@ def register(
         url**: str, optional
             For `location_type="external"` only
         contact_email**: str, optional
+        relative_path** : str, optional
 
         Returns
         -------
@@ -118,6 +120,16 @@ def register(
             The execution ID associated with the dataset
         """
 
+        # If `old_location` is None, we need a relative path
+        if old_location is None and location_type == "dataregistry":
+            if relative_path is None:
+                raise ValueError("`relative_path` is required when `old_location` is None")
+
+        # If no name, we need a `relative_path`
+        if name is None:
+            if relative_path is None:
+                raise ValueError("`relative_path` is required when `name` is None")
+
         # If external dataset, check for either a `url` or `contact_email`
         if location_type == "external":
             if url is None and contact_email is None:
@@ -161,19 +173,8 @@ def register(
                     "Only owner_type='production' can go in the production schema"
                 )
 
-        # If `name` not passed, automatically generate a name from the relative path
-        if name is None:
-            name = _name_from_relpath(relative_path)
-
-        # Look for previous entries. Fail if not overwritable
-        dataset_table = self._get_table_metadata("dataset")
-        previous = self._find_previous(relative_path, owner, owner_type)
-
-        if previous is None:
-            print(f"Dataset {relative_path} exists, and is not overwritable")
-            return None, None
-
         # Deal with version string (non-special case)
+        dataset_table = self._get_table_metadata("dataset")
         if version not in ["major", "minor", "patch"]:
             v_fields = _parse_version_string(version)
             version_string = version
@@ -185,6 +186,27 @@ def register(
                 f"{v_fields['major']}.{v_fields['minor']}.{v_fields['patch']}"
             )
 
+        # If `relative_path` not passed, automatically generate one from the
+        # name, version and version_suffix
+        if relative_path is None:
+            relative_path = _relpath_from_name(name, version_string, version_suffix)
+
+        # If `name` is None, generate it from the `relative_path`
+        if name is None:
+            name = _name_from_relpath(relative_path)
+
+        # Make sure `name` is legal (i.e., no illegal characters)
+        for i_char in _ILLEGAL_NAME_CHAR:
+            if i_char in name:
+                raise ValueError(f"Cannot have character {i_char} in name string")
+
+        # Look for previous entries. Fail if not overwritable
+        previous = self._find_previous(relative_path, owner, owner_type)
+
+        if previous is None:
+            print(f"Dataset {relative_path} exists, and is not overwritable")
+            return None, None
+
         # If no execution_id is supplied, create a minimal entry
         if execution_id is None:
             if execution_name is None:

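For reviewers, the order in which `name` and `relative_path` are now resolved can be sketched in isolation. This is a simplified, standalone approximation of the logic in this hunk, not the registry code itself: `resolve_name_and_path` is a hypothetical function, and deriving the name from the last path component is an assumption about what `_name_from_relpath` does.

```python
ILLEGAL_NAME_CHARS = ["$", "*", "&", "/", "?", "\\"]


def resolve_name_and_path(name, version, version_suffix=None, relative_path=None):
    """Fill in whichever of `name` / `relative_path` was not supplied."""
    if name is None and relative_path is None:
        raise ValueError("`relative_path` is required when `name` is None")

    # Default relative path: <name>_<version>[_<version_suffix>]
    if relative_path is None:
        parts = [name, version]
        if version_suffix is not None:
            parts.append(version_suffix)
        relative_path = "_".join(parts)

    # Default name: last component of the relative path (assumed behaviour)
    if name is None:
        name = relative_path.rstrip("/").split("/")[-1]

    # Reject names containing any illegal character
    for char in ILLEGAL_NAME_CHARS:
        if char in name:
            raise ValueError(f"Cannot have character {char} in name string")

    return name, relative_path


print(resolve_name_and_path("nersc_tutorial:my_desc_dataset", "1.0.0"))
print(resolve_name_and_path(None, "1.0.0", relative_path="nersc_tutorial/my_desc_dataset"))
```

Because the default path is built from the name, an explicit `relative_path` is the only way to place data under a path containing `/`, since `/` is not permitted in the name.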
diff --git a/src/dataregistry/registrar/registrar_util.py b/src/dataregistry/registrar/registrar_util.py
@@ -12,6 +12,7 @@
     "_form_dataset_path",
     "get_directory_info",
     "_name_from_relpath",
+    "_relpath_from_name",
     "_copy_data",
 ]
 VERSION_SEPARATOR = "."
@@ -340,3 +341,31 @@ def _compute_checksum(file_path):
         )
 
         raise Exception(e)
+
+def _relpath_from_name(name, version, version_suffix):
+    """
+    Construct a relative path from the name and version of a dataset.
+
+    We use this when the `relative_path` is not explicitly defined.
+
+    Parameters
+    ----------
+    name : str
+        Dataset name
+    version : str
+        Dataset version
+    version_suffix : str
+        Dataset version suffix
+
+    Returns
+    -------
+    relative_path : str
+        Automatically generated `relative_path`
+    """
+
+    if version_suffix is not None:
+        relative_path = f"{name}_{version}_{version_suffix}"
+    else:
+        relative_path = f"{name}_{version}"
+
+    return relative_path
diff --git a/src/dataregistry/schema/schema.yaml b/src/dataregistry/schema/schema.yaml
@@ -177,13 +177,13 @@ dataset:
     description: "Unique identifier for this dataset"
   name:
     type: "String"
-    description: "Any convenient, evocative name for the human. Note the combination of name, version and version_suffix must be unique. If None name is generated from the relative path."
+    description: "Any convenient, evocative name for the human. Note the combination of name, version and version_suffix must be unique."
     nullable: False
-    cli_optional: True
   relative_path:
     type: "String"
-    description: "Relative path storing the data, relative to `<root_dir>`"
+    description: "Relative path storing the data, relative to `<root_dir>`. If None, the relative path is generated automatically from the provided name, version and version_suffix (if any)."
     nullable: False
+    cli_optional: True
   version_major:
     type: "Integer"
     description: "Major version in semantic string (i.e., X.x.x)"

diff --git a/src/dataregistry_cli/cli.py b/src/dataregistry_cli/cli.py
@@ -123,10 +123,10 @@ def get_parser():
 
     # Entries unique to registering the dataset when using the CLI
     arg_register_dataset.add_argument(
-        "relative_path",
+        "name",
         help=(
-            "Destination for the dataset within the data registry. Path is"
-            "relative to <registry root>/<owner_type>/<owner>."
+            "Any convenient, evocative name for the human. "
+            "Note the combination of name, version and version_suffix must be unique."
         ),
         type=str,
     )
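A note on multi-line help strings: Python joins adjacent string literals with no separator, so a help text split across lines needs an explicit space at each break. The sketch below uses a small standalone parser to show this; the parser, its `prog` name and the `version` positional are hypothetical and do not reproduce the real `dataregistry` CLI.

```python
import argparse

# Adjacent literals are concatenated as-is, hence the trailing space
# at the end of the first fragment.
name_help = (
    "Any convenient, evocative name for the human. "
    "Note the combination of name, version and version_suffix must be unique."
)

parser = argparse.ArgumentParser(prog="register-demo")  # hypothetical parser
parser.add_argument("name", help=name_help, type=str)
parser.add_argument("version", help="Semantic version string", type=str)

# Parse an explicit argument list rather than sys.argv, for demonstration
args = parser.parse_args(["nersc_tutorial:my_desc_dataset", "1.0.0"])
print(args.name, args.version)
```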