Add areaID to Distribution Entity #40

Closed
bart-v opened this issue Oct 29, 2020 · 8 comments

@bart-v

bart-v commented Oct 29, 2020

https://github.com/CatalogueOfLife/coldp#distribution does not contain a proper way to add an area (GUID) identifier

@mdoering
Member

The intention was for area to hold the ID for all gazetteers other than TEXT.
Adding areaID might be less ambiguous, but means area and areaID would be mutually exclusive.

@bart-v
Author

bart-v commented Oct 29, 2020

I don't see why they would necessarily be mutually exclusive.
area could be a human readable version of the areaID

@mdoering
Member

That's true. But the ID would dictate what the area really is, not the human "label". The label would not be relevant for sharing and would be ignored in favour of the key and its normative title, potentially even translated into different languages. We don't share labels for ranks, statuses or other vocabularies as part of an archive.

@bart-v
Author

bart-v commented Oct 29, 2020

OK, so where is it explained that "area" can contain an identifier?

@mdoering
Member

Nowhere it seems ;) I will update the readme which is all we have at this stage.

@dhobern

dhobern commented Oct 29, 2020

I've been making use of Distribution records in my datasets and agree that there is room for improvement. Right now, the closest thing that we have to a unique identifier for the area is the combination of the gazetteer and the area. Different areas might have the same "area" value in different gazetteers. My use case is to write a command line tool for editing the contents of COLDP files directly and there are only two or three areas where I'm hitting current problems. This is one of them.
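
A minimal sketch of the uniqueness point, assuming hypothetical Distribution rows with `taxonID`, `gazetteer` and `area` fields. The gazetteer name `custom` and all values are made up for illustration; only the idea of keying on the (gazetteer, area) pair comes from the discussion.

```python
from collections import defaultdict

# Hypothetical Distribution rows; values are illustrative only.
rows = [
    {"taxonID": "t1", "gazetteer": "iso", "area": "CA"},
    {"taxonID": "t2", "gazetteer": "custom", "area": "CA"},
    {"taxonID": "t3", "gazetteer": "iso", "area": "MY-12"},
]


def ambiguous_areas(rows):
    """Return area values that occur under more than one gazetteer."""
    gazetteers_by_area = defaultdict(set)
    for row in rows:
        gazetteers_by_area[row["area"]].add(row["gazetteer"])
    return {
        area: sorted(gazetteers)
        for area, gazetteers in gazetteers_by_area.items()
        if len(gazetteers) > 1
    }


def area_key(row):
    """Composite key: the area value is only unique within its gazetteer."""
    return (row["gazetteer"], row["area"])


print(ambiguous_areas(rows))
print(sorted({area_key(row) for row in rows}))
```

Keyed on `area` alone, the two "CA" rows would be conflated; the composite key keeps them apart.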

I think we should distinguish four strings and discuss which of these should be embedded in the Distribution record - if we are not careful, we will also need a Region or Area record to make sure that we have all we need. The four strings are:

  1. Human readable name for region - "Sabah".
  2. Code for region in gazetteer - "MY-12" for Sabah in ISO.
  3. Genuinely unique ID for region within the dataset - even if the dataset uses multiple gazetteers
  4. URI or other GUID for region - ideally linking to much more information.

My personal preference would be to engineer this whole space rather better and for TDWG to host explicit and consistent Gazetteers that include all this information and probably also shape file data for the TDWG geographic region list, for the ISO list, etc. Then users could reference these from inside our YAML metadata or else supply their own equivalent Gazetteer definitions inside the COLDP package. That's a little vague but I could explain it further.

Then the patterns for supplying distribution information inside COLDP could be as follows.

1 - Default minimum - text-only distribution information provided denormalised for each record

  • Nothing required in YAML
  • Everything is supplied as free text in Distribution.area

2 - Default recommended - Explicit URI-based pointers to a gazetteer

  • Nothing required in YAML
  • YAML may identify Gazetteer(s) used (for robustness)
  • GUID for a known region is provided in each Distribution.areaID
  • Free text or additional information may be provided in Distribution.area

3 - Alternative - user supplies custom gazetteer

  • YAML indicates file in COLDP file that serves as Gazetteer for dataset
  • Local Gazetteer includes as a minimum an ID and Name for each area referenced by Distribution.areaID in one or more records but recommended to include other information comparable with the standard (TDWG) gazetteers
  • Distribution.areaID points to IDs of local gazetteer records
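
As a rough illustration of the third pattern, the sketch below checks that every `areaID` used in Distribution records resolves to a record in a local gazetteer. The column names (`ID`, `name`, `taxonID`, `areaID`, `area`), the CSV layout and the sample IDs are all assumptions for illustration, not part of any published ColDP definition.

```python
import csv
import io

# Hypothetical file contents; a real package would read these from disk.
GAZETTEER_CSV = "ID,name\nR1,Sabah\nR2,Sarawak\n"
DISTRIBUTION_CSV = (
    "taxonID,areaID,area\n"
    "t1,R1,\n"
    "t2,R3,somewhere in Borneo\n"
    "t3,,widespread\n"
)


def unresolved_area_ids(gazetteer_text, distribution_text):
    """List (taxonID, areaID) pairs whose areaID is absent from the gazetteer.

    Rows with an empty areaID are treated as free-text-only records
    (the first pattern) and are skipped.
    """
    known = {row["ID"] for row in csv.DictReader(io.StringIO(gazetteer_text))}
    missing = []
    for row in csv.DictReader(io.StringIO(distribution_text)):
        area_id = row["areaID"]
        if area_id and area_id not in known:
            missing.append((row["taxonID"], area_id))
    return missing


print(unresolved_area_ids(GAZETTEER_CSV, DISTRIBUTION_CSV))
```

A check like this would be one way for a tool to assert that there is a defined link from each Distribution record to an area.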

@mdoering
Member

I don't think GUIDs or URLs are always the way to go.

I much prefer the idea of reusing existing standards and combining a local ID (area) provided by the standard with the namespace (gazetteer) within which these values are unique. It is also much more in harmony with the rest of the standard, which nowhere mandates GUIDs or URLs as identifiers.

I like the idea of describing the gazetteer in the YAML file, but probably only for custom additions to the standard ones ColDP already lists. Better than YAML is to have a selected number of supported gazetteers, which allows us to know exactly what we are dealing with and to use supporting shape files, hierarchies, translated human labels or whatever else we want and can get hold of.

Is it useful to share Germany as the area when you already have the ISO country code DE for sharing? You introduce options for inconsistency. I found that the previous COL distribution model suffered mostly from its inconsistent use.

That said, I am open to having both area and areaID, as the latter is easier to understand. Shall we?

@mdoering mdoering reopened this Oct 29, 2020
@dhobern

dhobern commented Oct 29, 2020

It has certainly seemed surprising to me that in just this one place, we drop the use of ID and simply have "area". As I noted, there is an issue with the possible non-uniqueness of area ids across multiple gazetteers so I would like to have a mechanism to guarantee/assert that there is a defined link from a Distribution record to an area.

I started saying something in my previous note to this issue but then got sidetracked. We need to be able to include the relevant external ids like DE for Germany in the ISO gazetteer. We also need to have an ID for the area that will be unique in the context of the COLDP package. If only one gazetteer is used, this may happen naturally, but even so, the implication is that users need to check both the area(ID) and the gazetteer to be sure that the area is the one intended. In my case, I've started including an additional region.csv file in the package (could easily be distribution.csv) to capture the information that I feel I must include to make my COLDP package meaningful (and also so that it is possible to generate human-readable versions simply from joining data within the package).
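
A minimal sketch of the join described above, assuming a hypothetical `region.csv` layout with a package-local `ID`, the `gazetteer`, the external `code` and a human-readable `name`. The column names and local IDs (`r1`, `r2`) are invented for illustration; the DE and MY-12 codes are the examples used in this thread.

```python
import csv
import io

# Hypothetical file contents; a real package would read these from disk.
REGION_CSV = (
    "ID,gazetteer,code,name\n"
    "r1,iso,DE,Germany\n"
    "r2,iso,MY-12,Sabah\n"
)
DISTRIBUTION_CSV = "taxonID,areaID\nt1,r1\nt1,r2\nt2,r2\n"


def readable_distribution(region_text, distribution_text):
    """Join Distribution rows to regions and build human-readable lines."""
    regions = {
        row["ID"]: row for row in csv.DictReader(io.StringIO(region_text))
    }
    lines = []
    for dist in csv.DictReader(io.StringIO(distribution_text)):
        region = regions.get(dist["areaID"])
        if region is None:
            # Fall back to the raw ID when the region is not defined locally.
            label = dist["areaID"]
        else:
            label = f'{region["name"]} ({region["gazetteer"]}:{region["code"]})'
        lines.append(f'{dist["taxonID"]}: {label}')
    return lines


for line in readable_distribution(REGION_CSV, DISTRIBUTION_CSV):
    print(line)
```

Because the local `ID` is unique within the package, the lookup does not need to consult the gazetteer to disambiguate the area.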

Going with area (human text name) and areaID (gazetteer specific ID) seems a good step forward.
