identifiers #223

Open
bertvannuffelen opened this issue Mar 25, 2022 · 4 comments
Labels
release:3.0.0 https://semiceu.github.io/DCAT-AP/releases/3.0.0 status:fixed This issue has been fixed in a draft.

Comments

@bertvannuffelen
Contributor

This is a broad issue to capture questions and opinions on identifiers. During the webinar of 10 March 2022, the WG discussed the role of dct:identifier and adms:identifier in identifying datasets throughout the harvesting of catalogues.

To streamline the discussion, the WG agreed with the view that dct:identifier is the identifier assigned by the "owner/first publisher" of the dataset. This removes an ambiguity in the definition of dct:identifier, which could also be interpreted as the identifier assigned by the catalogue the dataset is currently part of.

This issue is to collect the community feedback on this topic. We will also provide a coherent proposal based on the WG discussion that has taken place.

@bertvannuffelen
Contributor Author

bertvannuffelen commented Apr 25, 2022

Dear community,

a proposal for the guidelines, open for comments, can be found at:
https://github.com/SEMICeu/DCAT-AP/blob/2.x.y-draft/releases/2.x.y/usageguide-identifiers.md

As no agreement was reached on the status of this proposal during the last webinar, it has been shifted to a future release.
This is also a renewed invitation to comment on the proposal.

@jakubklimek
Contributor

The Czech data catalog implements what is to be avoided by the guidelines - it mints an IRI for a harvested dataset regardless of its original IRI. If there was an original IRI, it is preserved in dct:identifier.

This is not to argue that the approach is correct, but I would like to take this opportunity to mention the arguments behind this implementation, which I did not find in the guidelines.

  1. Guaranteed dereferenceability of the IRIs. The source catalog assigns IRIs to datasets, but does not implement their dereferenceability, or the dereferenceability of other IRIs - distributions, data services, etc. The national catalog does that, but that only works with IRIs in its domain.
  2. Security (Trustworthiness of the registered catalogs) - By assigning new (publisher-scoped) IRIs and processing the metadata instead of taking it unaltered when harvesting the datasets, we can avoid one publisher stating (intentionally, or by mistake) something about a dataset of another publisher without their knowledge, which could affect query results on the single National Open Data Catalog SPARQL endpoint. Admittedly, this goes against the open-world assumption, but in the context of a public administration system, this is something we want to avoid rather than encourage.
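The approach described above (mint a catalogue-scoped IRI on harvest, keep the original IRI in dct:identifier) can be sketched roughly as follows. This is only an illustration, not the Czech catalogue's actual code: the base IRI, the dict-based record layout, the `local_id` key, and the `remint` function are all assumptions made for the example.

```python
import hashlib

# Hypothetical base IRI of the harvesting catalogue (an assumption, not from the thread).
CATALOGUE_BASE = "https://catalogue.example.org/dataset/"


def remint(record: dict, publisher_id: str) -> dict:
    """Return a copy of a harvested record with a catalogue-minted, publisher-scoped IRI.

    `record` is a plain dict; "iri" holds the source IRI if the source provided one,
    otherwise "local_id" holds some source-side key (both keys are hypothetical).
    """
    original_iri = record.get("iri")
    source_key = original_iri or record["local_id"]
    # Hash publisher + source key so that re-harvesting yields the same IRI.
    digest = hashlib.sha256(f"{publisher_id}|{source_key}".encode("utf-8")).hexdigest()[:16]

    minted = {k: v for k, v in record.items() if k not in ("iri", "local_id")}
    minted["iri"] = f"{CATALOGUE_BASE}{publisher_id}/{digest}"
    if original_iri:
        # Preserve the original IRI, as described in the comment above.
        minted["dct:identifier"] = original_iri
    return minted
```

Because the minted IRI always lives under the publisher's own path segment, a record harvested from one publisher cannot end up describing a dataset IRI that belongs to another publisher.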

@bertvannuffelen
Contributor Author

@jakubklimek, I understand the arguments.

And exactly because of these experiences, the guidelines propose that harvesters and portal owners ensure that all identifiers are included in adms:identifier.
If every portal did that, a list of equivalent identifiers would build up dynamically.
That in turn makes it possible to implement deduplication algorithms, trusted cross-references throughout the harvesting network, and so on.

It affects neither the portal user experience nor publishers (it is only technical support for the harvesting community), but the potential is high.
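As a rough illustration of the deduplication potential: if each portal exposes the owner's dct:identifier plus every other identifier it knows of in adms:identifier, records can be grouped by shared identifiers with a simple union-find. The record layout, the `portal` key, and the `group_equivalent` function below are assumptions for the sketch, not something defined by DCAT-AP or the guidelines.

```python
from collections import defaultdict


def group_equivalent(records):
    """Group portal records that share an identifier, directly or transitively.

    Each record is a dict with "portal", "dct:identifier" and an optional list
    under "adms:identifier" (a hypothetical, simplified layout).
    Returns a sorted list of sorted portal-name lists, one per dataset.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for rec in records:
        ids = [rec["dct:identifier"], *rec.get("adms:identifier", [])]
        find(ids[0])  # register records that carry a single identifier
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec["dct:identifier"])].append(rec["portal"])
    return sorted(sorted(portals) for portals in groups.values())


records = [
    {"portal": "source", "dct:identifier": "ds-1"},
    {"portal": "national", "dct:identifier": "ds-1", "adms:identifier": ["nat-9"]},
    {"portal": "eu", "dct:identifier": "eu-3", "adms:identifier": ["nat-9"]},
    {"portal": "other", "dct:identifier": "ds-2"},
]
```

Here the "eu" record does not carry the owner's identifier at all, yet it is still linked to the same dataset through the shared "nat-9" entry, which is the kind of cross-reference the equivalent-identifier list makes possible.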

@bertvannuffelen bertvannuffelen added release:3.0.0 https://semiceu.github.io/DCAT-AP/releases/3.0.0 status:fixed This issue has been fixed in a draft. labels Feb 1, 2024
@bertvannuffelen
Contributor Author

This issue will be closed, as a reference to the assessment/proposal is now in the specification. The assessment/proposal has not been included in full, but this way readers of the specification can find it more easily and take its considerations into account in their implementations.
