[Proposal] Dataset identifier #53

Closed
odscrachel opened this issue May 16, 2023 · 9 comments · Fixed by #119
Labels
metadata Issues related to common, core metadata proposal New feature or request

Comments

@odscrachel
Contributor

What is the context or reason for the change?

There is a need to have a unique id per dataset and per resource. This will act as a parent id per dataset.

What is your proposed change?

Create a dataset identifier field, identifier, with the description ‘An id for this dataset. This identifier must be unique within the data catalogue it is stored in, and it is recommended that an identifier is chosen with a high likelihood of being globally unique.’
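
As an illustration of the uniqueness requirement in the proposed description, here is a minimal sketch of how a catalogue could check for repeated identifiers. The record layout, the example.org URLs, and the function name `find_duplicate_identifiers` are assumptions made for this example; they are not part of the proposal.

```python
from collections import Counter


def find_duplicate_identifiers(catalogue):
    """Return, in sorted order, every identifier used by more than one record."""
    counts = Counter(record["identifier"] for record in catalogue)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


# Hypothetical catalogue records; only the "identifier" key matters here.
catalogue = [
    {"identifier": "https://example.org/datasets/flood-hazard", "title": "Flood hazard"},
    {"identifier": "https://example.org/datasets/exposure", "title": "Building exposure"},
    {"identifier": "https://example.org/datasets/flood-hazard", "title": "Flood hazard (copy)"},
]

duplicates = find_duplicate_identifiers(catalogue)
print(duplicates)
```

A check like this only covers uniqueness within one catalogue. Global uniqueness cannot be verified locally, which is why the description can only recommend choosing an identifier that is likely to be globally unique.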

Why is this not covered by the existing model?

The current model contains the fields dataset title and description, but no identifier field.

@odscrachel odscrachel added the proposal New feature or request label May 16, 2023
@matamadio matamadio changed the title Dataset identifier [Proposal] Dataset identifier May 16, 2023
@duncandewhurst
Contributor

Looks good and aligns with DCAT's identifier property.

@stufraser1
Member

stufraser1 commented May 23, 2023

I support the use of a dataset identifier, but the question is whether this should be a numeric UID or a URL to the dataset.

The WB data catalogue creates a UID when we upload to that catalogue. The original schema used a URL to account for the possibility that this schema would be used in multiple catalogues, so that the dataset could be found across the Web. The drawback, of course, is that URLs can break or change.

I think we should support URL as a dataset identifier, and where data is uploaded to a catalogue which uses its own UID system, that would also be appended.

Either way, these probably need to be added retrospectively once a dataset is uploaded, unless the catalogue creates it on upload (as is the case for the WB data catalog).

@duncandewhurst
Contributor

Looking at other standards:

  • DCAT recommends the use of HTTP URIs, but permits any literal value
  • Project Open Data recommends the use of HTTP URIs, but permits any string value

I suggest that we follow a similar approach:

| Field | Title | Description | Type |
|---|---|---|---|
| identifier | Identifier | A unique identifier for the dataset. Use of an HTTP URI is recommended. | String |

That way:

  • if a publisher cannot provide an HTTP URI for some reason, they can still publish their metadata using RDLS.
  • permitting a string rather than any literal value prevents difficulties that come when trying to compare the same identifier formatted as a string in one place and as a number in another.
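
To make the two points above concrete, here is a minimal sketch of a check that accepts any non-empty string and separately reports whether the value has the recommended HTTP URI form. The function name `check_identifier`, the three result labels, and the example values are assumptions made for illustration, not part of RDLS.

```python
from urllib.parse import urlparse


def check_identifier(value):
    """Classify an identifier value.

    Returns "invalid" for anything that is not a non-empty string,
    "http-uri" for the recommended form, and "string" otherwise.
    """
    if not isinstance(value, str) or not value.strip():
        return "invalid"
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "http-uri"
    return "string"


# The comparison problem that a string-only type avoids: the "same"
# identifier stored as a number in one place and as a string in another
# does not compare equal.
numbers_and_strings_match = (42 == "42")

print(check_identifier("https://example.org/datasets/1"))
print(check_identifier("0042"))
print(check_identifier(42))
```

Restricting the type to strings also preserves values such as "0042", whose leading zeros would be lost if the identifier were stored as a number.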

I figure that a catalog system's own UID would be part of that catalog's metadata, to which the RDLS metadata will be added/integrated, so I don't think we need to support multiple identifiers in RDLS.

Sound good?

@odscrachel odscrachel added the metadata Issues related to common, core metadata label May 30, 2023
@stufraser1
Member

We need to clarify how this works with model.id, e.g. #85.

@duncandewhurst
Contributor

If I understood correctly, this issue is about an identifier for the dataset that the RDLS metadata describes, whilst #85 is about referencing the dataset's source model, which wouldn't be catalogued using RDLS. Did I miss something?

@pslh

pslh commented Jun 7, 2023

I note that some (but obviously not all) datasets have a DOI for this purpose.
Regarding the use of catalogue UIDs, I also wonder what happens if datasets are cohosted in different repositories/databases, for example where a national dataset is published and maintained by a national authority but also shared with international entities including GFDRR.

@duncandewhurst
Contributor

@stufraser1 @matamadio can you advise on whether it would be useful to link to listings of the same dataset in other catalogs from within the RDLS metadata?


Logging some initial research:

In data.gov, there is an Identifier field in the metadata that contains a URL to a JSON representation of the dataset in the catalog it was originally published in (example)

There are two IANA Link Relations that might be relevant:

  • canonical: Designates the preferred version of a resource (the IRI and its contents).
  • duplicate: Refers to a resource whose available representations are byte-for-byte identical with the corresponding representations of the context IRI.
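
As a rough sketch of how these two link relations could be used to point at listings of the same dataset in other catalogs, consider the record below. The `links` array, its `rel` and `href` keys, the example URLs, and the helper `hrefs_for` are all hypothetical; nothing in this thread establishes that RDLS has such a structure.

```python
# Hypothetical record: the "links" array and its keys are assumptions
# made for illustration, not existing RDLS fields.
dataset = {
    "identifier": "https://data.example.gov/datasets/national-flood-map",
    "links": [
        # Preferred listing, here the original publisher's catalog.
        {"rel": "canonical", "href": "https://data.example.gov/datasets/national-flood-map"},
        # Byte-for-byte identical copy listed in another catalog.
        {"rel": "duplicate", "href": "https://catalogue.example.org/datasets/0001"},
    ],
}


def hrefs_for(record, rel):
    """Return the href of every link in the record with the given relation."""
    return [link["href"] for link in record.get("links", []) if link.get("rel") == rel]


print(hrefs_for(dataset, "canonical"))
print(hrefs_for(dataset, "duplicate"))
```

Note that `duplicate` is a strict relation: it only applies when the copies are byte-for-byte identical, so it would not fit a co-hosted dataset that has been reformatted or updated in one catalog only.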

@stufraser1
Member

> I note that some (but obviously not all) datasets have a DOI for this purpose. Regarding use of catalogue UIDs I also wonder what happens if datasets are cohosted in different repositories/databases for example where a national dataset is published and maintained by a national authority but also shared with international entities including GFDRR.

For co-hosting datasets we should be able to use the URI. I think this is addressed by the proposal at #53 (comment) to use an HTTP URI.
Data uploaded to WB Data Catalog will receive a UID for that catalog, so would have an internal UID generated automatically. This may not be the case for other systems where RDLS may be used in future, but the string Identifier field could be used for that purpose.

@duncandewhurst
Contributor

In the scenario where a dataset is first listed in a national authority's data catalog and then added to the World Bank's data catalog, which HTTP URI do we want to see in identifier: the URI of the dataset in the national authority's data catalog, or its URI in the World Bank catalog?
