Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap name and relative_path as the mandatory entry #123

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

stuartmcalpine
Copy link
Collaborator

@stuartmcalpine stuartmcalpine commented May 15, 2024

When registering a dataset, now name is required and relative_path is generated from the name, as opposed to the other way round.

Changelog

  • name and version are now the two mandatory columns when registering a dataset, as opposed to name and relative_path
  • If relative_path is not passed (as it is now optional), it is generated automatically in the format relative_path=<name>_<version_string>_<version_suffix>
  • You still have the option of choosing the relative_path= location when registering, it's now just optional. If you have a relative_path chosen, you can leave the name field as None, and the name will be generated from the relative path as before.
  • Updated the documentation to reflect the changes.
  • Updated CI tests to reflect the changes, and added some for new cases.

Thoughts

  • Do we want the name/version/version_suffix to be unique for a user, it may be frustrating to think of a increasingly complex name each time if the unique constraint is over the whole database. This is still true if the name was generated from the relative path from before, though less likely.
  • People could technically chose the same name (by mistake), and happen to chose a different version. Which will cause confusion down the line, which again maybe points to having names only unique to users.

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks generally ok, pending discussion of repercussions, if any, of only requiring (name, version, version_suffix, owner, owner_type) to be unique. If this seems safe it should be documented somewhere, probably in DatasetTable.register. And the uniqueness constraint in load_schema.py will have to be modified. (It would be better still if we had a way to describe uniqueness constraints in schema.yaml rather than having just that piece of the database description in code.) See also my comment about _relpath_from_name. Probably just the docstring needs to be changed.

src/dataregistry/registrar/registrar_util.py Outdated Show resolved Hide resolved
"Destination for the dataset within the data registry. Path is"
"relative to <registry root>/<owner_type>/<owner>."
"Any convenient, evocative name for the human."
"Note the combination of name, version and version_suffix must be unique."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerning the question you raised about uniqueness: I don't think that name, version and version_suffix need to be unique across the entire database. It's enough if (name, version, version_suffix, owner, owner_type) is unique

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do this in another PR

Copy link
Collaborator

@JoanneBogart JoanneBogart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple simple things to change. There is also the issue of how to handle overwritable. See comment added to issue #109

"\n",
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n",
"The first of two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (note there is nothing special about the `:` here, the `name` can be any legal string). This should be any convenient, evocative name for the human. Note the combination of `name`, `version` and `version_suffix` must be unique in the database. The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least recommend that people stay away from characters in name which could cause problems as part of a file path or URL since we may be generating a path including that string. That would include at least white space characters, &, $, ?, / and \. In fact it probably is best to check for them and refuse to register if name contains any of them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are now illegal

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any exclusion of white space characters. Can you add them, or at least blank, to the list? It would be good to also exclude tab, carriage return and linefeed if possible.

"source": [
"Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.\n",
"\n",
"For 3., to update a previous dataset and overwrite the existing data, we have the pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would change to 2. if we reorder as I suggested.
And I guess you meant "we have to pass the relative_path", not "we have to the pass...".
But I don't think this is necessary (if we put some constraints on overwriting) or even desirable: it could cause problems when relative_path was not generated by the user for previous versions; they won't know what it is without looking it up first.
I'll write more about this in a separate comment in #109

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reordered comment.

Yes I agree knowing the relative_path is a bit annoying, but that was also true when the name was autogenerated and you needed to look it up first.

Will continue discussion with issue

src/dataregistry/registrar/dataset.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants