Swap `name` and `relative_path` as the mandatory entry #123

stuartmcalpine · 2024-05-15T13:39:23Z

When registering a dataset, `name` is now required and `relative_path` is generated from the name, rather than the other way round.

Changelog

`name` and `version` are now the two mandatory columns when registering a dataset; previously `relative_path` was mandatory in place of `name`.
If `relative_path` is not passed (it is now optional), it is generated automatically in the format `relative_path=<name>_<version_string>_<version_suffix>`.
You can still choose the `relative_path=` location when registering; it is just optional now. If you pass a `relative_path`, you can leave the `name` field as `None`, and the name will be generated from the relative path as before.
Updated the documentation to reflect the changes.
Updated CI tests to reflect the changes, and added some for new cases.
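
For illustration, the automatic `relative_path` format described in the changelog could be sketched roughly as below. This is only a sketch: `default_relative_path` is a hypothetical helper, not the project's own `_relpath_from_name`, and dropping the suffix when it is empty is an assumption the changelog does not confirm.

```python
from typing import Optional


def default_relative_path(
    name: str, version_string: str, version_suffix: Optional[str] = None
) -> str:
    """Build a relative_path of the form <name>_<version_string>_<version_suffix>.

    Hypothetical helper for illustration only. Omitting the suffix when it
    is None or empty is an assumption, not behaviour stated in the PR.
    """
    parts = [name, version_string]
    if version_suffix:
        parts.append(version_suffix)
    return "_".join(parts)
```

With these assumptions, a dataset named `my_dataset` at version `1.0.0` with suffix `test` would get the path `my_dataset_1.0.0_test`.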

Thoughts

Do we want name/version/version_suffix to be unique per user? It may be frustrating to think of an increasingly complex name each time if the unique constraint is over the whole database. This was also true when the name was generated from the relative path, though a clash was less likely.
People could technically choose the same name (by mistake) and happen to choose a different version, which will cause confusion down the line. This again maybe points to having names only unique to users.

JoanneBogart

This looks generally ok, pending discussion of repercussions, if any, of only requiring (name, version, version_suffix, owner, owner_type) to be unique. If this seems safe it should be documented somewhere, probably in DatasetTable.register. And the uniqueness constraint in load_schema.py will have to be modified. (It would be better still if we had a way to describe uniqueness constraints in schema.yaml rather than having just that piece of the database description in code.) See also my comment about _relpath_from_name. Probably just the docstring needs to be changed.

src/dataregistry/registrar/registrar_util.py

JoanneBogart · 2024-05-16T23:44:35Z

src/dataregistry_cli/cli.py

-            "Destination for the dataset within the data registry. Path is"
-            "relative to <registry root>/<owner_type>/<owner>."
+            "Any convenient, evocative name for the human."
+            "Note the combination of name, version and version_suffix must be unique."


Concerning the question you raised about uniqueness: I don't think that name, version and version_suffix need to be unique across the entire database. It's enough if (name, version, version_suffix, owner, owner_type) is unique.

Will do this in another PR
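
To make the proposed per-owner constraint concrete, here is a minimal sketch using the standard-library `sqlite3` module. It is not the project's `load_schema.py`: the table layout, the `make_registry` and `register` helpers, and storing a missing `version_suffix` as an empty string are all assumptions made for the example.

```python
import sqlite3


def make_registry() -> sqlite3.Connection:
    """Create an in-memory table with the five-column uniqueness constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE dataset (
            name TEXT NOT NULL,
            version TEXT NOT NULL,
            -- Assumption: an absent suffix is stored as '' rather than NULL,
            -- because SQLite treats NULLs as distinct in UNIQUE constraints.
            version_suffix TEXT NOT NULL DEFAULT '',
            owner TEXT NOT NULL,
            owner_type TEXT NOT NULL,
            UNIQUE (name, version, version_suffix, owner, owner_type)
        )
        """
    )
    return conn


def register(conn, name, version, owner, owner_type, version_suffix=""):
    """Insert a row; return True on success, False if the combination exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dataset "
                "(name, version, version_suffix, owner, owner_type) "
                "VALUES (?, ?, ?, ?, ?)",
                (name, version, version_suffix, owner, owner_type),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Under this sketch, two different owners can both register `my_dataset` at version `1.0.0`, while the same owner registering that combination twice is rejected.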

JoanneBogart

There are a couple of simple things to change. There is also the issue of how to handle `overwritable`. See the comment added to issue #109.

JoanneBogart · 2024-05-17T19:13:43Z

docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb

    "\n",
-    "Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n",
+    "The first of two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (note there is nothing special about the `:` here, the `name` can be any legal string). This should be any convenient, evocative name for the human. Note the combination of `name`, `version` and `version_suffix` must be unique in the database. The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n",


I think we should at least recommend that people stay away from characters in name which could cause problems as part of a file path or URL since we may be generating a path including that string. That would include at least white space characters, &, $, ?, / and \. In fact it probably is best to check for them and refuse to register if name contains any of them.

They are now illegal

I don't see any exclusion of white space characters. Can you add them, or at least blank, to the list? It would be good to also exclude tab, carriage return and linefeed if possible.
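
As a rough sketch of the check being asked for here, the name could be rejected when it contains any of the characters listed above. `validate_name` is a hypothetical function and the exact character set is taken from this thread, not from the merged code.

```python
import re

# Characters named in the review: any whitespace (blank, tab, carriage
# return, linefeed), plus &, $, ?, / and \.
_ILLEGAL_NAME_CHARS = re.compile(r"[\s&$?/\\]")


def validate_name(name: str) -> None:
    """Raise ValueError if the dataset name contains an illegal character.

    Hypothetical helper for illustration; the project's real check may differ.
    """
    match = _ILLEGAL_NAME_CHARS.search(name)
    if match:
        raise ValueError(
            f"Illegal character {match.group()!r} in dataset name {name!r}"
        )
```

Note that the tutorial name `nersc_tutorial:my_desc_dataset` passes this check, since `:` is not in the excluded set.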

docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb

JoanneBogart · 2024-05-17T19:21:06Z

docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb

+   "source": [
+    "Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.\n",
+    "\n",
+    "For 3., to update a previous dataset and overwrite the existing data, we have the pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example"


This would change to 2. if we reorder as I suggested.
And I guess you meant "we have to pass the relative_path", not "we have the pass...".
But I don't think this is necessary (if we put some constraints on overwriting) or even desirable: it could cause problems when relative_path was not generated by the user for previous versions; they won't know what it is without looking it up first.
I'll write more about this in a separate comment in #109.

Reordered comment.

Yes, I agree that needing to know the relative_path is a bit annoying, but that was also true when the name was autogenerated and you needed to look it up first.

Will continue the discussion in the issue.

src/dataregistry/registrar/dataset.py

stuartmcalpine added 4 commits May 15, 2024 15:38

Swap name and relative_path to be the mandatory entry

ee6d795

Fix to new format

ee2defd

Fix CLI to new fomat

d5ff7ae

Allow for auto name when relative_path is passed

a0fc9bb

JoanneBogart reviewed May 17, 2024

stuartmcalpine added 3 commits May 17, 2024 16:22

Update some test names

8133ebd

Update docs

4399617

Address reviewer comments

ffd8c9e

JoanneBogart requested changes May 17, 2024

Address reviewer comments

a96d5f4
