specific recommendations for metadata distribution info #4

Closed
mbjones opened this issue Dec 1, 2018 · 36 comments
Labels
Extension Leading Practice Recommended practices for repository implementation proposed decision
Milestone

Comments

@mbjones
Collaborator

mbjones commented Dec 1, 2018

Some schema.org providers use DataDownload to provide an explicit distribution link to an associated metadata file that might have more detailed metadata in a common XML format like ISO-19115. Is there a clear convention on how one can indicate that a particular distribution represents the metadata for the package? In DataONE's ORE-based packaging format, we use the cito:documents property and its inverse to indicate that a specific metadata document provides documentation for a particular set of data files. Would that be reasonable here, or is there a schema.org property that I missed?

@mbjones mbjones added the enhancement New feature or request label Dec 1, 2018
@ashepherd
Member

Good topic! It doesn't seem like schema.org has a defined way of describing related resources. It looks like this is being discussed by the W3C Data Exchange Working Group w3c/dxwg#482 where ORE got mentioned w3c/dxwg#482 (comment)

@dr-shorthair

@sjskhalsa

dxwg#482 is about distributions in which the files in a package are distinct but all are necessary. A pointer to richer metadata is different. Perhaps https://www.w3.org/TR/vocab-dcat-2/#Property:resource_relation with the sub-property dct:isReferencedBy?

@mbjones
Collaborator Author

mbjones commented Dec 2, 2018

Note that @smrgeoinfo incorporated such a link in IEDA JSON-LD as an entry in the distribution section for one of their landing pages, like:

{
    "@type": "DataDownload",
    "additionalType": "http://www.w3.org/ns/dcat#DataCatalog",
    "encodingFormat": "text/xml",
    "name": "ISO Metadata Document",
    "url": "http://get.iedadata.org/metadata/iso/usap/609070iso.xml"
}

That provides the link, but there is no way to semantically distinguish it from any of the other data files in the Dataset. The name is just a string value, so we'd like to be able to identify the metadata via a specific @type, which could be an additionalType if we want to keep DataDownload as the main type. As an aside, I'm not sure what setting additionalType to DataCatalog is meant to convey here, but that may be a separate discussion.
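
To make the problem concrete, here is a minimal Python sketch of what a harvesting client faces. The second distribution entry is adapted from the IEDA example above (additionalType omitted); the first entry, a CSV data file at example.org, is invented for illustration, as are the helper and variable names.

```python
import json

# Hypothetical Dataset: the CSV entry is invented; the XML entry is adapted
# from the IEDA example (additionalType omitted for brevity).
doc = json.loads("""
{
  "@type": "Dataset",
  "distribution": [
    {"@type": "DataDownload", "encodingFormat": "text/csv",
     "name": "Data file", "url": "https://example.org/data.csv"},
    {"@type": "DataDownload", "encodingFormat": "text/xml",
     "name": "ISO Metadata Document",
     "url": "http://get.iedadata.org/metadata/iso/usap/609070iso.xml"}
  ]
}
""")


def types_of(node):
    """Return the @type of a node as a set, whether it is a string or a list."""
    t = node.get("@type", [])
    return {t} if isinstance(t, str) else set(t)


# Every entry carries the same type, so @type alone cannot single out metadata.
distinct_types = {frozenset(types_of(d)) for d in doc["distribution"]}

# The only remaining signal is a fragile match on the free-text name.
guessed_metadata = [d["url"] for d in doc["distribution"]
                    if "metadata" in d.get("name", "").lower()]
```

The string match works here only because the name happens to contain the word "metadata", which is exactly the kind of guesswork a dedicated type or property would remove.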

@ashepherd
Member

ashepherd commented Dec 2, 2018

One approach might be to use cito:documents on the DataDownload (being a package) to link to the related resources. This seems like a good place for an extension term, where we can link cito:documents to schema.org/hasPart with something like this (the examples below aren't fleshed out):

geosci:documents a owl:ObjectProperty ;
  skos:exactMatch cito:documents ;
  skos:broadMatch schemaorg:hasPart ;
  rdfs:range schemaorg:MediaObject ;
  rdfs:domain schemaorg:MediaObject .
  
{
    "@type": "DataDownload",
    "encodingFormat": "text/xml",
    "name": "ISO Metadata Document",
    "url": "http://get.iedadata.org/metadata/iso/usap/609070iso.xml",
    "geosci:documents":[
      { "@type": "MediaObject", "name": "data file", "encodingFormat": "text/tsv", "url": ... },
      { "@type": "MediaObject", "name": "metadata file", "encodingFormat": "text/xml", "url": ... },
      { "@type": "MediaObject", "name": "supplemental document", "encodingFormat": "application/pdf", "url": ... },
      ...
    ]
}

[edited]
After thinking about this, I wondered if we could 'pun' cito:documents instead of minting a new property geosci:documents. But cito:documents doesn't define a domain and range, so I think the approach above is worth exploring: a new property might be in order so that we don't create semantic entanglements for the CiTO ontology through punning.

@mbjones
Collaborator Author

mbjones commented Dec 3, 2018

@ashepherd That's interesting. Wouldn't the list of MediaObjects in the geosci:documents also be listed in the JSON-LD as more DataDownload objects to which we could attach cito:isDocumentedBy pointers back to the metadata object? Or are you thinking this new list of MediaObject would replace the DataDownload list? The IEDA landing page I linked earlier shows such a list.

@smrgeoinfo
Contributor

The DCAT revision group seems to be of a mind that distributions are different representations of the described resource content (some level of 'information equivalence', see w3c/dxwg#531). The metadata is a description of the resource, not a representation of its actual information content. I think what is really needed is a related resource pattern that provides a qualified association, something like xlink:href, xlink:role or the qualifiedAssociations in PROV. Schema.org Role might do the trick. If one took this pattern to heart, it could be used for distributions as well, where distribution is just one of the roles for a related resource.

p.s. I agree dcat:DataCatalog is an odd property to put on a link to a metadata record describing the resource, but I'm pretty sure I got the recommendation to use that from DataOne (DV?).

@ashepherd
Member

@smrgeoinfo do you get the sense that data packaging is out of scope for what the DCAT group is thinking about?

@mbjones I was thinking those MediaObject could be in the same JSON-LD document like IEDA has above (but I guess they could live elsewhere too). Maybe my approach isn't the best if data packaging isn't aligned with the meaning of distributions like @smrgeoinfo mentions.

@smrgeoinfo
Contributor

It's hard to say, but it seems that they're taking a pretty narrow interpretation of distribution.

In the long run, I think a broader concept of 'related resource links' that include properties specifying what the links are about would be a better solution, since it could be used for landing pages, FTP directories, data packages, services for visualization and subsetting, and applicable specifications. In the ISO 19115 world, distributionInfo does get used to link to a broad array of resources related to using the resource that the metadata describes. The dct:conformsTo property could be used to identify a specification that describes the link function/role, and DCAT has this property on datasets and distributions. The boundary between distributions and related resources is pretty fuzzy. The DCAT profiles ontology group has stepped into that one big time with their ResourceDescriptor class, which is basically a set of related resources about a profile (but not distributions!): https://github.com/w3c/dxwg/issues/573

@ashepherd
Member

Will review in EarthCube P419. @fils mentioned the Digital Object model coming out of the RDA Data Foundation & Terminology Group, and we were going to inspect how it relates to the updates to DCAT, if it's useful.

I'm curious if anyone else has heard of this and if there are any big differences between it and OAI-ORE.

@mbjones
@smrgeoinfo

@smrgeoinfo
Contributor

I'm not really up to speed on recent Digital Object architecture, but from a quick look at Larry L's paper and C2CAMP, it looks like the basic concepts overlap: packaging data with metadata. What I don't see is any specification of precisely what the metadata for both file and data typing would look like, and the exercise is academic until that exists.

@datadavev
Collaborator

At least for metadata describing the same dataset as the schema:Dataset (e.g. an ISO 19139 document describes the same dataset as the schema:Dataset instance, which was perhaps even derived from the ISO document), the schema:encoding property ("A media object that encodes this CreativeWork") is appropriate. Its value is an instance of MediaObject, which should include a contentUrl and an encodingFormat, e.g.:

{
    "@type":"Dataset",
    ...
    "encoding":{
        "@type": "MediaObject",
        "contentUrl":"https://example.org/link/to/iso.xml",
        "encodingFormat":"http://www.isotc211.org/2005/gmd",
        "description":"ISO TC211 XML rendering of metadata.",
        "dateModified":"2019-06-12T14:44:15Z"
    },
    ...
}

The schema:distribution property of the Dataset, whose value is an instance of schema:DataDownload ("A dataset in downloadable form"), aligns well with the requirement to identify both how a data component of a Dataset can be retrieved and the format in which it will be expressed. For example, a CSV component of a Dataset could be specified as:

{
    "@type":"Dataset",
    ...
    "distribution": [
        {
            "@type":"DataDownload",
            "contentUrl": "https://example.org/link/to/data",
            "encodingFormat":"data format",
            "identifier": {
                "@type": ["PropertyValue", "datacite:ResourceIdentifier"],
                "datacite:usesIdentifierScheme": { 
                    "@id": "datacite:doi" 
                },
                "propertyID":"DOI",
                "url": "https://doi.org/10.1234/blah",
                "value": "10.1234/blah"
            },
            "encoding": {
                "@type": "MediaObject",
                "contentUrl":"https://example.org/link/to/data.csv",
                "encodingFormat":"text/csv",
                "description":"Comma separated data",
                "dateModified":"2019-06-12T14:44:15Z"            
            }
        }
    ],
    ...
}

Note that the encoding property of the DataDownload instance is used to describe the format of the data component. Multiple components of the Dataset can be specified with an array of DataDownload instances in the distribution property.

The premise of these suggestions is that encoding identifies alternate encodings for the schema.org Dataset metadata instance, and distribution identifies how to retrieve components of the Dataset.
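
A harvester applying that premise could be sketched roughly as follows. This is only an illustration in Python: the helper names as_list and split_links are hypothetical, and the example node reuses the example.org URLs from the snippets above.

```python
def as_list(value):
    """Normalize a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def split_links(dataset):
    """Return (metadata_urls, data_urls) for a Dataset node.

    Assumes the convention proposed above: `encoding` points at alternate
    encodings of the metadata, `distribution` at the data components.
    """
    metadata_urls = [m["contentUrl"] for m in as_list(dataset.get("encoding"))
                     if "contentUrl" in m]
    data_urls = [d["contentUrl"] for d in as_list(dataset.get("distribution"))
                 if "contentUrl" in d]
    return metadata_urls, data_urls


# Example node built from the example.org URLs used in the snippets above.
dataset = {
    "@type": "Dataset",
    "encoding": {
        "@type": "MediaObject",
        "contentUrl": "https://example.org/link/to/iso.xml",
        "encodingFormat": "http://www.isotc211.org/2005/gmd",
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "contentUrl": "https://example.org/link/to/data.csv",
            "encodingFormat": "text/csv",
        }
    ],
}

metadata_urls, data_urls = split_links(dataset)
```

Normalizing through as_list matters because JSON-LD allows either a single object or an array for both properties.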

@ashepherd
Member

I like Dave's proposal to use encoding on Dataset for referencing other related/associated MediaObjects. I do wonder about the use of encoding on DataDownload, since it is itself a MediaObject and already has all the encoding properties needed to describe its encoding directly. We can also have multiple distributions (as an array):

{
    "@type":"Dataset",
    ...
    "distribution": [
        {
            "@type":"DataDownload",
            "contentUrl": "https://example.org/link/to/data.csv",
            "encodingFormat":"text/csv",
            "description":"Data as Comma separated data",
            "dateModified":"2019-06-12T14:44:15Z",
            "identifier": {
                "@type": ["PropertyValue", "datacite:ResourceIdentifier"],
                "datacite:usesIdentifierScheme": { 
                    "@id": "datacite:doi" 
                },
                "propertyID":"DOI",
                "url": "https://doi.org/10.1234/blah",
                "value": "10.1234/blah"
            }
        },
        {
            "@type":"DataDownload",
            "contentUrl": "https://example.org/link/to/data.nc",
            "encodingFormat":"application/x-netcdf4",
            "description":"Data as NetCDF4",
            "dateModified":"2019-06-12T14:44:15Z",
            "identifier": {
                "@type": ["PropertyValue", "datacite:ResourceIdentifier"],
                "datacite:usesIdentifierScheme": { 
                    "@id": "datacite:doi" 
                },
                "propertyID":"DOI",
                "url": "https://doi.org/10.1234/blah-nc4",
                "value": "10.1234/blah-nc4"
            }
        }
    ],
    ...
}

@datadavev, have you seen an example where a specific distribution has another MediaObject? I just want to make sure we are clear on the distinction.

@datadavev
Collaborator

@ashepherd, I totally agree, my mistake there. Adding a separate encoding property is superfluous for a DataDownload instance.

There are a couple of situations where additional encodings may perhaps be appropriate - 1) the content is available in different formats from different contentUrls and all have the same identifier; or 2) as for 1 but with content negotiation so same contentUrl, but there's a desire to provide separate descriptions for the different formats.

@ashepherd
Member

@datadavev, those are good examples! If you feel up to it, could you open another pull request adding them as an example below the basic DataDownload example and just above this section: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#accessing-data-through-a-service-endpoint

@smrgeoinfo, does this help address your thoughts about MIME Type (application/xml) and format (ISO 19115-2)?

Should we consider making encodingFormat an array?

{
    "@type":"Dataset",
    ...
    "encoding":{
        "@type": "MediaObject",
        "contentUrl":"https://example.org/link/to/iso.xml",
        "encodingFormat": ["application/xml", "http://www.isotc211.org/2005/gmd"],
        "description":"ISO TC211 XML rendering of metadata.",
        "dateModified":"2019-06-12T14:44:15Z"
    },
    ...
}

@smrgeoinfo
Contributor

smrgeoinfo commented Aug 24, 2019

I really don't think using sdo:encoding to link to another metadata record about the same resource is consistent with the intention of sdo:encoding. The closest match I can see is sdo:subjectOf (scope note "A CreativeWork or Event about this Thing"). The ISO metadata record is a separate CreativeWork.Dataset that is about the same thing that the sdo:Dataset record describes. With subjectOf, it might look like this:

{
    "@type":"Dataset",
    ...
    "subjectOf":{
        "@type": "Dataset",
        "url":"https://example.org/link/to/iso.xml",
        "encodingFormat": ["application/xml", "http://www.isotc211.org/2005/gmd"],
        "description":"ISO TC211 XML rendering of metadata.",
        "dateModified":"2019-06-12T14:44:15Z"
    },
    ...
}

Of course this would be even better if there were a subtype of 'Dataset' for data that is about other datasets, i.e. metadata; lacking that, we have to rely on the encoding format including a URI for the metadata scheme, as in the example.
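
A client relying on that convention might look something like this rough Python sketch. The function name metadata_records and the as_list helper are hypothetical; the node reuses the example.org values from the snippet above.

```python
GMD = "http://www.isotc211.org/2005/gmd"


def as_list(value):
    """Normalize a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def metadata_records(dataset, scheme_uri):
    """Return the url of each subjectOf entry whose encodingFormat lists scheme_uri."""
    return [work.get("url")
            for work in as_list(dataset.get("subjectOf"))
            if scheme_uri in as_list(work.get("encodingFormat"))]


# Node mirroring the subjectOf example above (elided properties left out).
dataset = {
    "@type": "Dataset",
    "subjectOf": {
        "@type": "Dataset",
        "url": "https://example.org/link/to/iso.xml",
        "encodingFormat": ["application/xml", GMD],
        "description": "ISO TC211 XML rendering of metadata.",
    },
}

iso_records = metadata_records(dataset, GMD)
```

Searching for the bare MIME type application/xml would match too many things, which is why the scheme URI carries the weight here.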

@ashepherd
Member

@smrgeoinfo can you tell us how you interpret the sdo:encoding definition? That might help us understand where you disagree.

@smrgeoinfo
Contributor

The definition reads: "A media object that encodes this CreativeWork. This property is a synonym for associatedMedia."
It seems clear to me that this property is intended to specify how a representation of a resource is encoded (syntax of serialization). The essential question is: "is the metadata record about a resource a representation of the resource, or a distinct resource whose topic is some other resource?"

My thinking is that the metadata record is a separate resource, thus NOT an encoding of the resource that a sdo:Dataset object is about. Simple test is how you would use them-- can you do any scientific analysis with a metadata record about a dataset, or do you need an actual representation of the data?

@datadavev
Collaborator

Consider though that an so:Dataset instance and an ISO XML metadata record are both used to describe the same dataset and the so:Dataset instance can be derived from the metadata record. Hence the metadata record is an alternate encoding of the so:Dataset instance and vice-versa.

@smrgeoinfo
Contributor

Key question: is the subject of sdo:encoding SELF (i.e. the sdo:Dataset instance), or the dataset that the sdo:Dataset instance describes? As usual, the documentation for the element is unclear: what does 'this CreativeWork' refer to? SELF, or a dataset that SELF is about? sdo:MediaObject is defined as "A media object, such as an image, video, or audio object embedded in a web page or a downloadable dataset"; is that what the sdo:encoding definition is referring to? Googling 'media object' is interesting.

I think the more useful and consistent interpretation is that the subject of the elements in the sdo:Dataset instance is the dataset, not the sdo:Dataset instance. ISO 19115 makes this distinction clear with properties named gmd:metadata... (not always consistently...), and DCAT distinguishes dcat:CatalogRecord and dcat:Resource. Schema.org has some proposed properties on sdo:Dataset with sd... prefixes (sdDatePublished, sdLicense, sdPublisher) that appear to be about the metadata record ('structured data') as opposed to the dataset that is the subject of the record.

@datadavev
Collaborator

datadavev commented Dec 2, 2019

I wrote up three approaches to providing external links to metadata associated with an SO:Dataset:

  1. Using subjectOf to indicate the SO:Dataset is the subject of an SO:CreativeWork or one of its derivatives.
  2. Using the inverse of 1, about
  3. Using encoding to indicate the referenced SO:MediaObject is an alternative encoding of the SO:Dataset document.

See: https://so-tools.readthedocs.io/en/latest/external_metadata.html

The write up focussed on functionality being implemented for harvesting in DataONE, so emphasis is more on functionality that can be leveraged by that infrastructure, though is intended to be generally applicable.

In summary each approach will work, with about being perhaps the easiest for more complex datasets (though really no different to subjectOf).

It would not be difficult to recast that document to align with the Guideline document, which I'm happy to do if there's broader agreement.

@smrgeoinfo
Contributor

+1 for the subjectOf approach.

In the 'about' example, a client parsing the so:Dataset/hasPart links would face a problem determining which part is actually the data (as opposed to metadata describing the data). One could probably infer the correct answer, but it requires client developers to write more code.

In the encoding approach, I think that since the information content of the data and of the metadata describing the data are different, they are not encodings of the same resource. The data is about something in the world, the metadata is about that data.

@datadavev
Collaborator

hasPart is just an aggregation mechanism, asserting that those pieces are part of the bundle. It makes no assertion about the type of relationship. The about relation asserts that one of those pieces is about something (e.g. the entire dataset or specific components). Determining which item is the metadata about the SO:Dataset should be straightforward, for example:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX SO:   <https://schema.org/>

SELECT ?metadata ?about
WHERE {
    ?about rdf:type SO:Dataset .
    ?metadata SO:about ?about .
}

would return the id of the metadata and the SO:Dataset respectively.

So I think the subjectOf and about approaches are pretty much equivalent, just looking at things in a different direction.
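
The equivalence can be illustrated with a small Python sketch that reads both directions into the same (metadata, dataset) pairs. The helper names and the example.org identifiers are hypothetical, and unlike the SPARQL query above this sketch does not check that the described node is typed as a Dataset.

```python
def as_list(value):
    """Normalize a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_id(value):
    """Return the identifier of a reference given as a node object or a bare IRI."""
    return value["@id"] if isinstance(value, dict) else value


def documentation_pairs(nodes):
    """Return a set of (metadata_id, dataset_id) pairs from a flat list of nodes.

    Both directions are read: `X subjectOf M` and `M about X`.
    """
    pairs = set()
    for node in nodes:
        for work in as_list(node.get("subjectOf")):
            pairs.add((ref_id(work), node["@id"]))
        for subject in as_list(node.get("about")):
            pairs.add((node["@id"], ref_id(subject)))
    return pairs


DATASET = "https://example.org/dataset/1"      # hypothetical identifier
METADATA = "https://example.org/metadata/1"    # hypothetical identifier

# The same statement expressed from the Dataset side...
with_subject_of = [{"@id": DATASET, "@type": "Dataset",
                    "subjectOf": {"@id": METADATA}}]

# ...and from the metadata document side.
with_about = [{"@id": DATASET, "@type": "Dataset"},
              {"@id": METADATA, "@type": "CreativeWork",
               "about": {"@id": DATASET}}]
```

Both inputs reduce to the same single pair, so the choice between the two properties mostly affects which node carries the statement.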

@ashepherd ashepherd added Leading Practice Recommended practices for repository implementation Extension labels Dec 2, 2019
@ashepherd
Member

Implemented schema:subjectOf at BCO-DMO in this way. Curious about comments:

Just showing the url, name, distribution, and subjectOf properties for brevity.

I typed the metadata as DataDownload (which I'm not sold on), because it was the only type that let me specify contentUrl, contentSize, and encodingFormat together. I also investigated using DigitalDocument or MediaObject and decided that DataDownload was the better fit, but it still doesn't seem right.

NOTE: I also used the additionalType property to specify the flavor of XML I was using for our ISO 19115-2 metadata record. @smrgeoinfo, curious your thoughts on that.

{
  "@context": {
    "@vocab": "https://schema.org/"
  },
  "@type": "Dataset",
  "@id": "https://www.bco-dmo.org/dataset/3300",
  "url": "https://www.bco-dmo.org/dataset/3300",
  "name": "Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project)",
  "distribution": [
    {
      "@id": "http://lod.bco-dmo.org/id/dataset-file/751117",
      "@type": "DataDownload",
      "name": "Tab Separated Values (.tsv)",
      "contentUrl": "https://darchive.mblwhoilibrary.org/bitstream/1912/10794/1/dataset-3300_larval-krill-pigments__v1.tsv",
      "contentSize": "15541",
      "encodingFormat": "text/tab-separated-values",
      "creativeWorkStatus": "Published"
    },
    {
      "@id": "http://lod.bco-dmo.org/id/dataset-file/751118",
      "@type": "DataDownload",
      "name": "Portable Document Format (.pdf)",
      "contentUrl": "https://darchive.mblwhoilibrary.org/bitstream/1912/10794/2/Dataset_description.pdf",
      "contentSize": "55872",
      "encodingFormat": "application/pdf",
      "creativeWorkStatus": "Published"
    }
  ],
  "subjectOf": [
    {
      "@id": "http://lod.bco-dmo.org/id/dataset-file/751119",
      "@type": "DataDownload",
      "name": "ISO 19115-2 (NOAA Profile)",
      "contentUrl": "https://darchive.mblwhoilibrary.org/bitstream/1912/10794/3/NOAA_ISO19115-2.xml",
      "contentSize": "72285",
      "encodingFormat": "application/xml",
      "encodesCreativeWork": "https://www.bco-dmo.org/dataset/3300",
      "creativeWorkStatus": "Published",
      "additionalType": "http://www.isotc211.org/2005/gmd-noaa"
    }
  ]
...
}

@ashepherd ashepherd added this to the ESIP Winter Meeting milestone Dec 19, 2019
@smrgeoinfo
Contributor

@ashepherd I think your 'sdo:subjectOf' example looks good. My only suggestion is to add "http://www.isotc211.org/2005/gmd" as an encoding format, since I assume the NOAA profile is consistent with base gmd. I'm guessing a client would likely look for a particular XML metadata format by searching for the URI of the XML schema to which it conforms; it could be in sdo:encodingFormat or sdo:additionalType. They might be fine with any flavor of ISO 19139 XML... Perhaps we can recommend conventions for recording encoding scheme hierarchies like application/xml --> http://www.isotc211.org/2005/gmd --> http://www.isotc211.org/2005/gmd-noaa.

XML world:
mimeType --> schema --> profile
RDF world:
mimeType --> rdfVocabulary --> profile
Text data:
mimeType --> convention
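
As a rough illustration of the client-side search, here is a Python sketch that looks for a format URI in either sdo:encodingFormat or sdo:additionalType. The helper names are hypothetical; the record reuses the encodingFormat and additionalType values from the BCO-DMO example.

```python
def as_list(value):
    """Normalize a JSON-LD value that may be absent, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def declared_formats(node):
    """Collect every format identifier given via encodingFormat or additionalType."""
    return (set(as_list(node.get("encodingFormat")))
            | set(as_list(node.get("additionalType"))))


def matches(node, accepted):
    """True if the node declares at least one of the accepted format identifiers."""
    return bool(declared_formats(node) & set(accepted))


# Values taken from the subjectOf entry in the BCO-DMO example.
record = {
    "@type": "DataDownload",
    "encodingFormat": "application/xml",
    "additionalType": "http://www.isotc211.org/2005/gmd-noaa",
}

finds_profile = matches(record, {"http://www.isotc211.org/2005/gmd-noaa"})
finds_base_schema = matches(record, {"http://www.isotc211.org/2005/gmd"})
```

With exact URI matching, a client asking only for the base schema URI does not find this record, which is why each level of the hierarchy that a record genuinely satisfies would need to be listed explicitly.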

@smrgeoinfo
Contributor

Note to above: my assumption was that gmd-noaa was schema-valid against gmd (Type 1 profile), but it's actually not ISO schema-valid, so the http://www.isotc211.org/2005/gmd URI shouldn't be included.

@jyucsiro

jyucsiro commented Jan 7, 2020

for profiles, see this W3C spec https://www.w3.org/TR/dx-prof/
See example 2 for a relevant example: https://www.w3.org/TR/dx-prof/#eg-conformance-to-profile

with this

{
    "@type":"Dataset",
    ...
    "subjectOf":{
        "@type": "Dataset",
        "url":"https://example.org/link-to-resource",
        "encodingFormat": ["some mime-type", "some format"],
        "conformsTo": "http://example.org/profile/x" 
    },
    ...
}

where http://example.org/profile/x identifies the relevant profile. possibly use an associated SHACL to validate...

@charlesvardeman
Collaborator

Bioschemas.org draft profile for DataRecord is probably relevant.
https://bioschemas.org/specifications/drafts/DataRecord/
Use of additional property to link to gene ontology. Mailing list discussion thread.
https://lists.w3.org/Archives/Public/public-bioschemas/2018Nov/0008.html

@charlesvardeman
Collaborator

Schema.org WebAPI issue also has a relevant discussion with respect to using mime type for encoding formats (as well as additional properties).
schemaorg/schemaorg#1423 (comment)

@ashepherd
Member

@ashepherd
Member

@mbjones mbjones modified the milestones: ESIP Winter Meeting, v1.1 Jan 9, 2020
@mbjones mbjones removed the enhancement New feature or request label Jan 9, 2020
@mbjones
Collaborator Author

mbjones commented Jan 22, 2020

@ashepherd I reviewed the ADR for this, and it generally looks good. I think we need to still:

  • Update metadata section and SHACL
    It looks like we haven't updated the Metadata section of the guide to reflect this new proposed ADR, so that would need to be done before we close this issue.

I have created a feature branch feature_4_dataset_metadata_distributions to start incorporating these changes. I added a diagram, revised the text to be consistent with the ADR, and removed the SHACL block, which seems like a more advanced topic than would be needed in the guide itself. @datadavev it would be great if you could edit this branch with any changes you see are needed, as I modified a bunch of your text.

I created a PR #81 from this branch for review and commenting, but please feel free to edit the branch directly.

Other issues

We also discussed controlled vocabularies for the encodingFormat, and it seems there are multiple options, and if we want to make a recommendation for that, we should do so in a separate issue.

We also agreed that following the development of dct:conformsTo as it evolves would be good, but for now it was not ready to recommend as it is just a W3C Note.

@mbjones
Collaborator Author

mbjones commented Jan 22, 2020

@ashepherd The diagram I made for the metadata section is here: https://www.lucidchart.com/invitations/accept/b1db3455-e7a1-486e-9f54-2e3bc692450c

I couldn't seem to gain access to your folder with my free Lucid account. Probably missing something simple. Can you move it over?

@smrgeoinfo
Contributor

The schema.org "Values expected to be one of these types" for subjectOf are CreativeWork or Event, but the structured data testing tool doesn't throw an error for @type DataDownload, so it's not breaking.
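
One possible reading of why no error is raised: in the schema.org type hierarchy, DataDownload is a subtype of MediaObject, which is itself a subtype of CreativeWork. A minimal Python sketch of that check follows, using a hand-copied fragment of the hierarchy. The PARENTS table and the helper functions are illustrative only and are not part of any schema.org tooling.

```python
# Hand-copied fragment of the schema.org type hierarchy (illustrative only).
PARENTS = {
    "DataDownload": "MediaObject",
    "MediaObject": "CreativeWork",
    "Dataset": "CreativeWork",
    "CreativeWork": "Thing",
    "Event": "Thing",
}

# The expected value types listed for subjectOf.
EXPECTED_FOR_SUBJECT_OF = ("CreativeWork", "Event")


def is_subtype(type_name, ancestor):
    """Walk up PARENTS from type_name and report whether ancestor is reached."""
    current = type_name
    while current is not None:
        if current == ancestor:
            return True
        current = PARENTS.get(current)
    return False


def allowed_as_subject_of(type_name):
    """True if type_name is, or inherits from, an expected type for subjectOf."""
    return any(is_subtype(type_name, expected)
               for expected in EXPECTED_FOR_SUBJECT_OF)
```

Under this fragment, both DataDownload and Dataset qualify as subjectOf values, while a bare Thing does not.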

ADR looks good to me

@rduerr
Collaborator

rduerr commented Feb 3, 2020

This ADR is probably fine as is - but the bigger issue of typing multiple kinds of relationships hasn't been resolved. Is that a new issue?

@mbjones
Collaborator Author

mbjones commented Feb 27, 2020

Decision on 02-27 call; merge to develop.

@mbjones
Collaborator Author

mbjones commented Feb 28, 2020

Done, merged to develop. Closing issue.

8 participants