Swap `name` and `relative_path` as the mandatory entry #123
base: main
Changes from 7 commits
ee6d795
ee2defd
d5ff7ae
a0fc9bb
8133ebd
4399617
ffd8c9e
a96d5f4
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -25,7 +25,8 @@ | |
"4) Modify a previously registered dataset with updated metadata\n", | ||
"5) Delete a dataset\n", | ||
"6) Special cases\n", | ||
" * Registering external datasets \n", | ||
" * Registering external datasets\n", | ||
" * Manually specifying the relative path\n", | ||
"\n", | ||
"### Before we begin\n", | ||
"\n", | ||
|
@@ -166,7 +167,7 @@ | |
"\n", | ||
"# Add new entry.\n", | ||
"dataset_id, execution_id = datareg.Registrar.dataset.register(\n", | ||
" \"nersc_tutorial/my_desc_dataset\",\n", | ||
" \"nersc_tutorial:my_desc_dataset\",\n", | ||
" \"1.0.0\",\n", | ||
" description=\"An output from some DESC code\",\n", | ||
" owner=\"DESC\",\n", | ||
|
@@ -191,13 +192,11 @@ | |
"source": [ | ||
"This will register a new dataset. A few notes:\n", | ||
"\n", | ||
"### The relative path\n", | ||
"### The dataset name (mandatory)\n", | ||
"\n", | ||
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable, see below). \n", | ||
"The first of two mandatory arguments to the `register()` function is the dataset `name`, which in our example is `nersc_tutorial:my_desc_dataset` (note there is nothing special about the `:` here, the `name` can be any legal string). This should be any convenient, evocative name for the human. Note the combination of `name`, `version` and `version_suffix` must be unique in the database. The dataset `name` allows for an easy retrieval of the dataset for querying and updating.\n", | ||
"\n", | ||
"The relative path is one of the two required parameters you must specify when registering a dataset (in the example here our relative path is `nersc_tutorial/my_desc_dataset`).\n", | ||
"\n", | ||
"### The version string\n", | ||
"### The version string (mandatory)\n", | ||
"\n", | ||
"The second required parameter is the version string, in the semantic format, i.e., MAJOR.MINOR.PATCH. There exists also an optional ``version_suffix`` parameter, which may be used to further identify the dataset, e.g. with a value like \"rc1\" to make it clear it's only a release candidate, possibly not in its final form.\n", | ||
"\n", | ||
|
@@ -238,15 +237,13 @@ | |
"\n", | ||
"If you have a dataset that has been previously registered within the data registry, and that dataset has updates, you have three options for how to handle the new entry:\n", | ||
"\n", | ||
"- You can enter it as a completely new standalone dataset with no links to the previous dataset\n", | ||
"- You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n", | ||
"- You can enter it as a new version of the previous dataset (recommended)\n", | ||
"\n", | ||
"Unless you are overwriting a previous dataset (when `is_overwritable=True`), you cannot enter a new dataset (even an updated version) using the same relative path. However, datasets can share the same `name` field, which acts as a reference for the dataset, and which is what we'll use to keep our updated dataset connected to our previous one.\n", | ||
"1. You can enter it as a completely new standalone dataset with no links to the previous dataset\n", | ||
"2. You can enter it as a new version of the previous dataset (recommended)\n", | ||
JoanneBogart marked this conversation as resolved.
|
||
"3. You can overwrite the existing dataset with the new data (only if the previous dataset was entered with `is_overwritable=True`)\n", | ||
"\n", | ||
"Note that for our test dataset entry above we did not specify a `name` during registration. The default `name` for a dataset is the file or directory name, extracted from the relative path. In our example above this was `my_desc_dataset`. If you are unsure of what name you specified for your previous dataset that you want to overwrite, you will need to query the database first to find out (see next tutorial for queries).\n", | ||
"For 1. simply follow the procedure above for registering a new dataset.\n", | ||
"\n", | ||
"The combination of `name`, `version` and `version_suffix` for any dataset must be unique. As we are updating a dataset with the same name, we have to make sure to update the version to a new value. One handy feature is automatic version \"bumping\" for datasets, i.e., rather than specifying a new version string manually, one can select \"major\", \"minor\" or \"patch\" for the version string to automatically bump that property of the version string up. In our case, selecting \"minor\" will automatically generate the version \"1.1.0\"." | ||
"For 2. we register a new dataset as before, making sure to keep the same dataset `name`, but updating the dataset `version`. One can update the `version` in two ways: manually entering a new version string, or having the `dataregistry` automatically \"bump\" the dataset version by selecing either \"major\", \"minor\" or \"patch\" for the version string. For example, lets register an updated version of our dataset, bumping the minor tag (i.e., bumping `1.0.0` -> `1.1.0`)." | ||
] | ||
}, | ||
{ | ||
|
@@ -258,14 +255,17 @@ | |
}, | ||
"outputs": [], | ||
"source": [ | ||
"# Create an empty text file as some example data\n", | ||
"with open(\"updated_dummy_dataset.txt\", \"w\") as f:\n", | ||
" f.write(\"some updated data\")\n", | ||
"\n", | ||
"# Add new entry for an updated dataset with an updated version.\n", | ||
"updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n", | ||
" \"nersc_tutorial/my_updated_desc_dataset\",\n", | ||
" \"nersc_tutorial:my_desc_dataset\",\n", | ||
" \"minor\", # Automatically bumps to \"1.1.0\"\n", | ||
" description=\"An output from some DESC code (updated)\",\n", | ||
" is_overwritable=True,\n", | ||
" old_location=\"dummy_dataset.txt\",\n", | ||
" name=\"my_desc_dataset\" # Using this name links it to the previous dataset.\n", | ||
" old_location=\"updated_dummy_dataset.txt\",\n", | ||
")\n", | ||
"\n", | ||
"# This is the unique identifier assigned to the updated dataset from the registry\n", | ||
|
@@ -275,6 +275,52 @@ | |
"print(f\"Dataset assigned to execution {updated_execution_id}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "d60f1bbb-adcb-4847-9e65-d8d14d928041", | ||
"metadata": {}, | ||
"source": [ | ||
"Note that both sets of data, from version `1.0.0` and `1.1.0` still exist, and they are linked through the dataset `name`.\n", | ||
"\n", | ||
"For 3., to update a previous dataset and overwrite the existing data, we have the pass the `relative_path` of the existing dataset (see Section 6 for more details on the `relative_path`). For example" | ||
This would change to 2. if we reorder as I suggested.
Reordered comment. Yes I agree knowing the Will continue discussion with issue |
||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2494eeea-b53f-472b-aafd-88216cb36494", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Create an empty text file as some example data\n", | ||
"with open(\"updated_dummy_dataset_again.txt\", \"w\") as f:\n", | ||
" f.write(\"some further updated data\")\n", | ||
"\n", | ||
"# Add new entry for an updated dataset with an updated version.\n", | ||
"updated_dataset_id, updated_execution_id = datareg.Registrar.dataset.register(\n", | ||
" \"nersc_tutorial:my_desc_dataset\",\n", | ||
" \"patch\", # Automatically bumps to \"1.1.1\"\n", | ||
" description=\"An output from some DESC code (further updated)\",\n", | ||
" is_overwritable=True,\n", | ||
" old_location=\"updated_dummy_dataset_again.txt\",\n", | ||
" relative_path=\"nersc_tutorial:my_desc_dataset_1.1.0\",\n", | ||
")\n", | ||
"\n", | ||
"# This is the unique identifier assigned to the updated dataset from the registry\n", | ||
"print(f\"Dataset {updated_dataset_id} created\")\n", | ||
"\n", | ||
"# This is the id of the execution the updated dataset belongs to (see next tutorial)\n", | ||
"print(f\"Dataset assigned to execution {updated_execution_id}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "dbdff8de-973e-4aa6-90ba-42f448d87222", | ||
"metadata": {}, | ||
"source": [ | ||
"will create a new dataset, version `1.1.1`, but the new data has overwritten the data for version `1.1.0`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1cc01c4e-8404-4fb4-b082-8969546f3ffc", | ||
|
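The automatic version "bump" described in the notebook text can be pictured with a minimal sketch. This is not the `dataregistry` implementation: `bump_version` is a hypothetical helper, and resetting the lower components to zero is an assumption based on the usual semantic-versioning convention. It is only meant to show how `"minor"` turns `1.0.0` into `1.1.0` and how `"patch"` then gives `1.1.1`.

```python
def bump_version(current: str, part: str) -> str:
    """Return `current` (MAJOR.MINOR.PATCH) with the requested part incremented.

    Hypothetical helper for illustration only; lower-order parts are reset
    to zero, following the usual semantic-versioning convention (assumption).
    """
    major, minor, patch = (int(piece) for piece in current.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown version part: {part!r}")


print(bump_version("1.0.0", "minor"))  # 1.1.0
print(bump_version("1.1.0", "patch"))  # 1.1.1
```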
@@ -385,6 +431,54 @@ | |
" url=\"www.data.com\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e95c0aa9-3b05-4b80-a91c-e6f643365bcb", | ||
"metadata": {}, | ||
"source": [ | ||
"### Specifying the relative path\n", | ||
"\n", | ||
"Datasets are registered within the registry under a path relative to the root directory (`root_dir`), which, by default, is a shared space at NERSC. \n", | ||
"\n", | ||
"By default, the `relative_path` is constructed from the `name`, `version` and `version_suffix` (if there is one), in the format `relative_path=<name>_<version>_<version_suffix>`. However, one can also manually select the `relative_path` during registration, for example" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "2eda857c-e9b5-40ab-999f-65b858831a08", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Add new entry with a manual relative path.\n", | ||
"datareg.Registrar.dataset.register(\n", | ||
" \"nersc_tutorial:my_desc_dataset_with_relative_path\",\n", | ||
" \"1.0.0\",\n", | ||
" is_dummy=True, # For testing purposes, means we need no actual data to work (only a database entry is created)\n", | ||
" relative_path=\"nersc_tutorial/my_desc_dataset\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8dbb1c0a-70b8-452b-8b13-a98901359afa", | ||
"metadata": {}, | ||
"source": [ | ||
"will register a dataset under the `relative_path` of `nersc_tutorial/my_desc_dataset`.\n", | ||
"\n", | ||
"For those interested, the eventual full path for the dataset will be `<root_dir>/<schema>/<owner_type>/<owner>/<relative_path>`. This means that the combination of `relative_path`, `owner` and `owner_type` must be unique within the registry, and therefore cannot already be taken when you register a new dataset (an exception to this is if you allow your datasets to be overwritable). \n", | ||
"\n", | ||
"You can leave `name` as `None` when registering using a manual `relative_path`, which will construct the `name` automatically from the `relative_path`- However we always recommend being explicit and choosing a `name` also." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d2222b9e-c986-4595-9404-2fd56eb40c5f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
|
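The default `relative_path` format quoted in the notebook, `<name>_<version>_<version_suffix>`, can be sketched as follows. `default_relative_path` is a hypothetical helper, not a function from `dataregistry`, and dropping the trailing suffix when none is given is inferred from the `nersc_tutorial:my_desc_dataset_1.1.0` path used in the overwrite example.

```python
from typing import Optional


def default_relative_path(
    name: str, version: str, version_suffix: Optional[str] = None
) -> str:
    """Join name, version and (if present) version_suffix with underscores.

    Hypothetical helper that mirrors the format described in the tutorial text.
    """
    parts = [name, version]
    if version_suffix:
        parts.append(version_suffix)
    return "_".join(parts)


print(default_relative_path("nersc_tutorial:my_desc_dataset", "1.1.0"))
print(default_relative_path("nersc_tutorial:my_desc_dataset", "1.1.0", "rc1"))
```

The first call reproduces the path passed as `relative_path` in the overwrite example; the second shows where a suffix such as `rc1` would land.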
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -123,10 +123,10 @@ def get_parser(): | |
|
||
# Entries unique to registering the dataset when using the CLI | ||
arg_register_dataset.add_argument( | ||
"relative_path", | ||
"name", | ||
help=( | ||
"Destination for the dataset within the data registry. Path is" | ||
"relative to <registry root>/<owner_type>/<owner>." | ||
"Any convenient, evocative name for the human." | ||
"Note the combination of name, version and version_suffix must be unique." | ||
Concerning the question you raised about uniqueness: I don't think that name, version and version_suffix need to be unique across the entire database. It's enough if (name, version, version_suffix, owner, owner_type) is unique.
Will do this in another PR. |
||
), | ||
type=str, | ||
) | ||
|
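The uniqueness rule proposed in the review thread, that the tuple (name, version, version_suffix, owner, owner_type) is unique rather than (name, version, version_suffix) alone, can be illustrated with an in-memory sketch. `DatasetKey` and `try_register` are hypothetical names, the real registry would enforce this in the database, and the `"user"` value for `owner_type` is only a placeholder.

```python
from typing import NamedTuple, Optional, Set


class DatasetKey(NamedTuple):
    """Fields that together identify a dataset under the proposed rule."""

    name: str
    version: str
    version_suffix: Optional[str]
    owner: str
    owner_type: str


def try_register(existing: Set[DatasetKey], key: DatasetKey) -> bool:
    """Add `key` to `existing` and return True, or return False if it is taken."""
    if key in existing:
        return False
    existing.add(key)
    return True


registry: Set[DatasetKey] = set()
first = DatasetKey("my_desc_dataset", "1.0.0", None, "alice", "user")
other_owner = DatasetKey("my_desc_dataset", "1.0.0", None, "bob", "user")

print(try_register(registry, first))        # True: new entry
print(try_register(registry, first))        # False: exact duplicate
print(try_register(registry, other_owner))  # True: same name, different owner
```

Under the stricter rule currently stated in the help text, the third registration would be rejected as well.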
I think we should at least recommend that people stay away from characters in `name` which could cause problems as part of a file path or URL, since we may be generating a path including that string. That would include at least white space characters, `&`, `$`, `?`, `/` and `\`. In fact it probably is best to check for them and refuse to register if `name` contains any of them.
They are now illegal.
I don't see any exclusion of white space characters. Can you add them, or at least blank, to the list? It would be good to also exclude tab, carriage return and linefeed if possible.
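A check along the lines requested in this thread might look like the sketch below. It is not the code in this PR: `find_illegal_characters` and `validate_dataset_name` are hypothetical names, and the rejected set is simply the characters listed in the review (`&`, `$`, `?`, `/`, `\`) plus anything `str.isspace()` reports as white space, which covers blank, tab, carriage return and linefeed.

```python
# Characters named in the review as risky inside a file path or URL.
FORBIDDEN_CHARACTERS = set("&$?/\\")


def find_illegal_characters(name: str) -> list:
    """Return the sorted, de-duplicated illegal characters found in `name`."""
    return sorted(
        {ch for ch in name if ch in FORBIDDEN_CHARACTERS or ch.isspace()}
    )


def validate_dataset_name(name: str) -> None:
    """Raise ValueError if `name` contains a forbidden or white space character."""
    illegal = find_illegal_characters(name)
    if illegal:
        raise ValueError(
            f"Dataset name {name!r} contains illegal characters: {illegal!r}"
        )


# The tutorial's example name passes, since ':' is not in the list above.
validate_dataset_name("nersc_tutorial:my_desc_dataset")

# Names with a blank or a path separator are reported.
print(find_illegal_characters("my dataset"))
print(find_illegal_characters("nersc_tutorial/my_desc_dataset"))
```

Reporting the offending characters in the error message makes it easier for a user to see why a registration was refused.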