-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Swap name
and relative_path
as the mandatory entry
#123
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks generally ok, pending discussion of repercussions, if any, of only requiring (name, version, version_suffix, owner, owner_type) to be unique. If this seems safe it should be documented somewhere, probably in DatasetTable.register. And the uniqueness constraint in load_schema.py will have to be modified. (It would be better still if we had a way to describe uniqueness constraints in schema.yaml rather than having just that piece of the database description in code.) See also my comment about _relpath_from_name. Probably just the docstring needs to be changed.
"Destination for the dataset within the data registry. Path is" | ||
"relative to <registry root>/<owner_type>/<owner>." | ||
"Any convenient, evocative name for the human." | ||
"Note the combination of name, version and version_suffix must be unique." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concerning the question you raised about uniqueness: I don't think that name, version and version_suffix need to be unique across the entire database. It's enough if (name, version, version_suffix, owner, owner_type) is unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do this in another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple simple things to change. There is also the issue of how to handle overwritable
. See comment added to issue #109
"\n", | ||
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n", | ||
"The first of two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (note there is nothing special about the `:` here, the `name` can be any legal string). This should be any convenient, evocative name for the human. Note the combination of `name`, `version` and `version_suffix` must be unique in the database. The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should at least recommend that people stay away from characters in name
which could cause problems as part of a file path or URL since we may be generating a path including that string. That would include at least white space characters, &
, $
, ?
, /
and \
. In fact it probably is best to check for them and refuse to register if name
contains any of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are now illegal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any exclusion of white space characters. Can you add them, or at least blank, to the list? It would be good to also exclude tab, carriage return and linefeed if possible.
docs/source/tutorial_notebooks/getting_started_1_register_datasets.ipynb
Outdated
Show resolved
Hide resolved
"source": [ | ||
"Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.\n", | ||
"\n", | ||
"For 3., to update a previous dataset and overwrite the existing data, we have the pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would change to 2. if we reorder as I suggested.
And I guess you meant "we have to pass the relative_path
", not "we have to the pass...".
But I don't think this is necessary (if we put some constraints on overwriting) or even desirable: it could cause problems when relative_path
was not generated by the user for previous versions; they won't know what it is without looking it up first.
I'll write more about this in a separate comment in #109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordered comment.
Yes I agree knowing the relative_path
is a bit annoying, but that was also true when the name
was autogenerated and you needed to look it up first.
Will continue discussion with issue
When registering a dataset, now
name
is required andrelative_path
is generated from thename
, as opposed to the other way round.Changelog
name
andversion
are now the two mandatory columns when registering a dataset, as opposed toname
andrelative_path
relative_path
is not passed (as it is now optional), it is generated automatically in the formatrelative_path=<name>_<version_string>_<version_suffix>
relative_path=
location when registering, it's now just optional. If you have arelative_path
chosen, you can leave thename
field asNone
, and thename
will be generated from the relative path as before.Thoughts