Intention of adding provenance of DOIs with isBasedOn? #72

Closed
alko-k opened this issue Jan 8, 2020 · 27 comments · Fixed by #134
Assignees
Labels
accepted decision Issues on which a decision was accepted for release. enhancement New feature or request Update Documentation updates to the guidance docs
Milestone

Comments

@alko-k

alko-k commented Jan 8, 2020

Hi again @ashepherd,
is there any intention to add the isBasedOn schema.org property to refer to older DOIs in the full dataset JSON?

Thanks
Alexandra

@ashepherd
Member

Great idea, @alko-k! Could you write up a proposal here with an example we could use in the documentation?

@alko-k
Author

alko-k commented Jan 14, 2020

Thanks Adam,
I will write up an example. For now, this is what Google suggests:
- Use the isBasedOn property in cases where the republished dataset (including its metadata) has been changed significantly.
- When a dataset derives from or aggregates several originals, use the isBasedOn property.

I am also adding the schema.org GitHub issue for isBasedOn: schemaorg/schemaorg#1993

@mbjones
Collaborator

mbjones commented Feb 28, 2020

schema:isBasedOn is a reasonable although lightweight provenance statement. In our other work, we use PROV-O predicates like prov:wasDerivedFrom, prov:used, and prov:wasGeneratedBy to express a more nuanced set of relationships among source and derived objects and the processes that were run. It seems like schema:isBasedOn is equivalent to prov:wasDerivedFrom, but lacks the ability to link to the processes that generated the derived data from the source data.

Here's an example data package in which we've embedded the PROV-O properties in our ORE manifest for the data package. You can look at the RDF triples we're using with a tool like rapper:

$ rapper -o turtle https://cn.dataone.org/cn/v2/object/resource_map_urn:uuid:c2e7831c-3e38-4ac1-a0b5-dff3a00ad9f1

So, I'd like to see our guidance recommend PROV-O vocabularies for provenance, with a recommendation that schema:isBasedOn could also be used and should be considered equivalent to prov:wasDerivedFrom.
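
As a rough illustration of that recommendation, a dataset record could state its source with both predicates at once. This is only a sketch under assumptions, not agreed guidance: the DOIs are placeholders, and `derived_dataset` is a hypothetical helper that does not come from any library.

```python
import json

PROV = "http://www.w3.org/ns/prov#"


def derived_dataset(dataset_id, source_id):
    """Build a JSON-LD Dataset that states its source with both predicates.

    Hypothetical helper for illustration only.
    """
    # A node reference ({"@id": ...}) makes the value an IRI rather than a string literal.
    source = {"@id": source_id}
    return {
        "@context": {"@vocab": "https://schema.org/", "prov": PROV},
        "@id": dataset_id,
        "@type": "Dataset",
        "isBasedOn": source,            # lightweight schema.org statement
        "prov:wasDerivedFrom": source,  # treated here as the equivalent PROV-O statement
    }


# Placeholder DOIs, not real identifiers.
doc = derived_dataset(
    "https://doi.org/10.xxxx/Dataset-2",
    "https://doi.org/10.xxxx/Dataset-1",
)
print(json.dumps(doc, indent=2))
```

Consumers that only understand schema.org can read `isBasedOn`, while PROV-aware consumers can read `prov:wasDerivedFrom`; both point at the same source node.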

@mbjones mbjones modified the milestones: v1.1, v1.2 Feb 28, 2020
@alko-k
Author

alko-k commented Mar 4, 2020

Thanks @mbjones for all your insight and examples. There is one small issue, though: markup with the additional prov properties will not pass the 'structured-data/testing-tool' that Google provides...

@mbjones
Collaborator

mbjones commented Mar 4, 2020

Yeah, we have encountered that issue of the Google SDTT throwing an error when it encounters types outside of schema.org. It is annoying for sure. We have discussed that with Google, and they indicate that the Google tools ignore those type errors and that they still import documents with other types, but they ignore the other types. We've asked them to change them to Warnings, but they have indicated that the SDTT is focused on Google's import, and so they want to keep those as errors. For our recommendations, we've decided to 1) mostly recommend schema.org types, but 2) to go ahead and recommend other types when needed if there isn't something suitable in schema.org. Our recommendations on external vocabularies in the @type field are being discussed in issue #74 and our proposed language is about to be merged in PR #95, and our take is written up in the decision record on schema.org/additionalType.

@ThomasThelen

ThomasThelen commented Apr 6, 2020

Other projects have had similar goals of using schema.org to describe science artifacts, and they all seem to trickle external vocabularies in as they need to describe specifics. One example of a specification that mixes schema.org and W3C PROV is RO-Crate. You can see that they used mostly schema in this example but also brought in prov (and used it side by side with schema).

Normal W3C PROV

As JSON-LD,

{
   "@context":[
      {
         "prov":"http://www.w3.org/ns/prov#"
      }
   ],
   "@graph":[
      {
         "@id":"plot.py_execution",
         "@type":"prov:Activity",
         "prov:used":"daily-total-female-births.csv"
      },
      {
         "@id":"daily-total-female-births.csv",
         "@type":"prov:Entity"
      },
      {
         "@id":"female-daily-births.png",
         "@type":"prov:Entity",
         "prov:wasGeneratedBy":"plot.py_execution"
      }
   ]
}
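
To make the lineage in that graph concrete, here is a minimal sketch of how a consumer might follow prov:wasGeneratedBy and prov:used from the plot back to its input. It assumes the compact keys exactly as written above and does no real JSON-LD processing; `inputs_of` is a hypothetical helper.

```python
import json

# Same graph as the example above, with the type written as a valid compact IRI.
DOC = json.loads("""
{
  "@context": [{"prov": "http://www.w3.org/ns/prov#"}],
  "@graph": [
    {"@id": "plot.py_execution", "@type": "prov:Activity",
     "prov:used": "daily-total-female-births.csv"},
    {"@id": "daily-total-female-births.csv", "@type": "prov:Entity"},
    {"@id": "female-daily-births.png", "@type": "prov:Entity",
     "prov:wasGeneratedBy": "plot.py_execution"}
  ]
}
""")


def inputs_of(doc, output_id):
    """Return the ids used by the activity that generated `output_id`."""
    nodes = {node["@id"]: node for node in doc["@graph"]}
    activity_id = nodes[output_id].get("prov:wasGeneratedBy")
    if activity_id is None:
        return []  # no recorded generating activity
    used = nodes[activity_id].get("prov:used", [])
    # A single value may be a bare string; normalise to a list.
    return [used] if isinstance(used, str) else list(used)


sources = inputs_of(DOC, "female-daily-births.png")
print(sources)
```

A production consumer would more likely expand the document with a JSON-LD processor first, so that prefixes and aliases do not matter.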

Minimal Extension to ProvONE

Note the provone namespace

{
   "@context":[
      {
         "prov":"http://www.w3.org/ns/prov#"
      },
      {
         "provone":"http://purl.dataone.org/provone/2015/01/15/ontology#"
      }
   ],
   "@graph":[
      {
         "@id":"plot.py_execution",
         "@type":"provone:Execution",
         "prov:used":"daily-total-female-births.csv"
      },
      {
         "@id":"daily-total-female-births.csv",
         "@type":"provone:Data"
      },
      {
         "@id":"female-daily-births.png",
         "@type":"prov:Entity",
         "prov:wasGeneratedBy":"plot.py_execution"
      }
   ]
}

@mbjones
Collaborator

mbjones commented Jul 25, 2020

Began a new branch https://github.com/ESIPFed/science-on-schema.org/tree/feature_72_provenance for editing the Guide and a new proposed provenance ADR for how we recommend handling provenance information. Editing is not complete, still working on:

  • Explanatory text
  • Figures showing the examples
  • Examples for each of the types of provenance relationships

@mbjones
Collaborator

mbjones commented Jul 28, 2020

@ashepherd @datadavev @fils @alko-k @smrgeoinfo Completed first draft of the provenance proposal. Please review the:

@amoeba @csjx @gothub @mpsaloha Given your familiarity with our use of PROV and ProvONE in DataONE, I would appreciate it if you could look this over as well. You'll note that I omitted the use of prov:qualifiedAssociation, and so I'd like to discuss the reasoning implications of omitting this. I'm not happy with it, but at the same time doing it correctly complicated the graph enough that I question whether it is sensible in the schema.org context. We may want to use another predicate than prov:hadPlan that we can equate via punning to the more complicated PROV model with Association.

@datadavev
Collaborator

See also: schemaorg/schemaorg#1905

@mbjones
Collaborator

mbjones commented Jul 28, 2020

@davev thanks for the pointer on schema:Action; I was unaware of that. I think it could be used in place of provone:Execution, albeit with less semantic specificity. Maybe schema:CreateAction would be better. We could, I suppose, propose a schema:ExecuteAction term in schema.org as a subclass of schema:Action, which would be an equivalent class to provone:Execution. There doesn't seem to be an equivalent property to prov:hadPlan, but we might be able to make schema:instrument work to indicate the role of software if you don't worry over semantics too much. Here's my original proposed structure:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": 
      {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": "https://doi.org/10.xxxx/Dataset-1"
      }
}

And here's the same structure rewritten with only schema.org using CreateAction in place of Execution:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "schema": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "https://doi.org/10.xxxx/Dataset-2",
      "@type": "https://schema.org/Dataset",
      "https://schema.org/name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
      "schema:isBasedOn": "https://doi.org/10.xxxx/Dataset-1"
    },
    {
      "@id": "https://example.org/executions/execution-42",
      "@type": "schema:CreateAction",
      "schema:instrument": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
      "schema:object": "https://doi.org/10.xxxx/Dataset-1",
      "schema:result": "https://doi.org/10.xxxx/Dataset-2"
    }
  ]
}

I think schema:result is the inverse of prov:wasGeneratedBy, so that works. It's a little more convoluted to have to create the implicit graph because there is no equivalent property to prov:wasGeneratedBy, or at least schema:result is the inverse, making the attachment to schema:Dataset different. However, I'm really unclear if this is the proper use of schema:object, where I used it in place of prov:used. The definition of schema:object is:

The object upon which the action is carried out, whose state is kept intact or changed. Also known as the semantic roles patient, affected or undergoer (which change their state) or theme (which doesn't). e.g. John read a book.

Which is confusingly similar to schema:result to me. They use the book as an example for both result and object, so I'm not sure really what they represent. Thoughts?

If we did all of this with schema.org, I'd want to be explicit in the guide as to the intended mapping to PROV so that equivalence could be had with people using the more precise PROV and ProvONE vocabularies. I think by comparing them to more explicit vocabularies we can make our intended interpretation clear. Feedback appreciated.
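
One way to make the intended mapping explicit would be a small crosswalk that rewrites the PROV/ProvONE structure into the schema.org-only form. The sketch below is an assumption drawn from this thread, not an official crosswalk: the `PROV_TO_SCHEMA` table and the `to_schema_only` helper are hypothetical, and the dataset name is shortened.

```python
# Assumed term mapping drawn from the discussion above; not an official crosswalk.
PROV_TO_SCHEMA = {
    "prov:wasDerivedFrom": "isBasedOn",
    "prov:used": "object",
    "prov:hadPlan": "instrument",
    "provone:Execution": "CreateAction",
}


def to_schema_only(dataset):
    """Rewrite a Dataset with a nested prov:wasGeneratedBy into a schema.org-only @graph."""
    execution = dataset["prov:wasGeneratedBy"]
    action = {
        "@id": execution["@id"],
        "@type": PROV_TO_SCHEMA[execution["@type"]],
        # schema:result points from the action to the dataset,
        # the inverse direction of prov:wasGeneratedBy.
        "result": dataset["@id"],
    }
    for prov_term in ("prov:hadPlan", "prov:used"):
        if prov_term in execution:
            action[PROV_TO_SCHEMA[prov_term]] = execution[prov_term]

    node = {"@id": dataset["@id"], "@type": dataset["@type"], "name": dataset["name"]}
    if "prov:wasDerivedFrom" in dataset:
        node["isBasedOn"] = dataset["prov:wasDerivedFrom"]
    return {"@context": {"@vocab": "https://schema.org/"}, "@graph": [node, action]}


prov_doc = {
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    "name": "Example derived dataset",  # shortened placeholder name
    "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
    "prov:wasGeneratedBy": {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": "https://doi.org/10.xxxx/Dataset-1",
    },
}
schema_doc = to_schema_only(prov_doc)
```

The conversion is lossy in one direction: a schema:CreateAction does not say that the action was a script execution, which is the loss of semantic specificity mentioned above.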

@datadavev
Collaborator

Your example seems ok to me. I read schema:object as a target (input) of some Action, and schema:result as an outcome (output) of an Action.

That said, what is the practical benefit of using only schema.org terms when there is an established practice using the prov and provone semantics? Would it be better (less ambiguous) to promote the use of prov and provone and provide a mapping of terms from those to schema for consumers not familiar with prov?

@datadavev
Collaborator

btw, this is another way of writing your second example above to be slightly more Dataset centric:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "resultOf": {
      "@reverse": "result"
    }
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "resultOf": {
    "@id": "https://example.org/executions/execution-42",
    "@type": "CreateAction",
    "instrument": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "object": "https://doi.org/10.xxxx/Dataset-1"
  }
}
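
For readers unfamiliar with @reverse, the following sketch shows roughly what the `resultOf` term means once unpacked: the nested action gains a forward `result` edge pointing back at the dataset. It is a simplified stand-in for JSON-LD expansion, assuming a single nested object per reverse term; `flatten_reverse` is a hypothetical helper and the dataset name is shortened.

```python
def flatten_reverse(doc):
    """Turn nested values of @reverse terms into forward edges on the nested node.

    Simplified illustration; a real JSON-LD processor handles many more cases.
    """
    context = doc["@context"]
    # Collect terms defined as {"@reverse": "<forward property>"}.
    reverse_terms = {
        term: definition["@reverse"]
        for term, definition in context.items()
        if isinstance(definition, dict) and "@reverse" in definition
    }
    subject = {
        key: value
        for key, value in doc.items()
        if key != "@context" and key not in reverse_terms
    }
    graph = [subject]
    for term, forward in reverse_terms.items():
        if term in doc:
            nested = dict(doc[term])
            nested[forward] = doc["@id"]  # e.g. the action's result is the dataset
            graph.append(nested)
    return graph


doc = {
    "@context": {
        "@vocab": "https://schema.org/",
        "resultOf": {"@reverse": "result"},
    },
    "@id": "https://doi.org/10.xxxx/Dataset-2",
    "@type": "Dataset",
    "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
    "name": "Example derived dataset",  # shortened placeholder name
    "resultOf": {
        "@id": "https://example.org/executions/execution-42",
        "@type": "CreateAction",
        "object": "https://doi.org/10.xxxx/Dataset-1",
    },
}
graph = flatten_reverse(doc)
```

The flattened form carries the same statements as the two-node @graph version earlier in the thread, which is why the Dataset-centric layout loses nothing.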

@amoeba
Contributor

amoeba commented Jul 28, 2020

This looks really good and the edits to the Dataset guide look and read great.

Something that stands out to me is the shape of the prov:wasGeneratedBy example, specifically the "@id": "https://example.org/executions/execution-42", triple. If I ran into that in a guide I wouldn't know what to do because of the made-up URI. I don't know if non-resolving URIs really fit into the Schema.org pattern or SOSO for that matter.

I might flatten it, like:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R"
}

(The prov:wasGeneratedBy triple now isn't really valid as the object isn't really a prov:Activity.)

I can see you're trying to find a way to capture an execution explicitly and my example makes the execution implicit and vague. Another property, like foo:wasDerivedBy might make this a little more clear that prov:wasDerivedFrom and prov:wasGeneratedBy are connected but ultimately my example is less rich.

While Schema.org tends to be pretty flat, SOSO doesn't really shy away from it, so an alternative to my super flat example might look like:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": {
    "@id": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
    "@type": "Foo",
    "foo:used": "https://doi.org/10.xxxx/Dataset-1"
  }
}

But I don't think we have the terms to do this right now.

@mbjones
Collaborator

mbjones commented Jul 28, 2020

@amoeba Thanks, Bryce. I agree about the @id for the execution instance. I seriously considered making it a blank node by omitting the @id because people often don't track executions. They do, however, track execution times and other properties, so it would be nice to have something to hang those properties on, and to differentiate multiple executions of the same script (especially for model runs, etc). But there's been a lot of discussion in this group about avoiding blank nodes, so I thought it prudent to put in some stand-in for the execution identifier. I would prefer to leave it out though.

@mbjones
Collaborator

mbjones commented Jul 29, 2020

@PaoloMissier @ludaesch do you have any thoughts on this issue discussing provenance representation in schema.org and PROV/ProvONE? See in particular: #72 (comment) and the comments that follow.

@mpsaloha
Contributor

mpsaloha commented Jul 30, 2020 via email

@mbjones
Collaborator

mbjones commented Jul 30, 2020

Thanks for the comments @mpsaloha. I was equating schema:isBasedOn with prov:wasDerivedFrom, whereas we have interpreted the subproperty prov:wasRevisionOf to specialize the property to the narrower case where the new entity is both derived from the original AND it represents a new version of the same entity. So, all revisions are derivations, but not all derivations are revisions. In DataONE, we interpret prov:wasRevisionOf to mean that the new object is meant to explicitly replace the original version, and is wholly substitutable for the original. We use that to hide older versions of Datasets in search results. And there are definitely broader uses of prov:wasDerivedFrom, such as when two data sources are combined into an integrated whole, but the new Dataset is not meant to replace the original per se. So I'd like us to be able to express both the derived from and replaces semantics from PROV.
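
The distinction can be sketched with a few triples. This is an illustration only, with made-up identifiers and hypothetical helper names; it is not how DataONE implements search. It shows the subproperty entailment (every prov:wasRevisionOf implies a prov:wasDerivedFrom) and the idea of hiding only the objects that a revision replaces.

```python
# (subject, predicate, object) triples; identifiers are made up for illustration.
TRIPLES = [
    ("ds:v2", "prov:wasRevisionOf", "ds:v1"),       # v2 replaces v1
    ("ds:merged", "prov:wasDerivedFrom", "ds:v2"),  # merged combines two sources
    ("ds:merged", "prov:wasDerivedFrom", "ds:other"),
]


def with_inferred_derivations(triples):
    """Add prov:wasDerivedFrom for every prov:wasRevisionOf (subproperty entailment)."""
    inferred = {
        (s, "prov:wasDerivedFrom", o)
        for s, p, o in triples
        if p == "prov:wasRevisionOf"
    }
    return set(triples) | inferred


def visible_in_search(triples):
    """Hide datasets that a revision explicitly replaces; keep derivation sources."""
    nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
    replaced = {o for _, p, o in triples if p == "prov:wasRevisionOf"}
    return sorted(nodes - replaced)


closure = with_inferred_derivations(TRIPLES)
visible = visible_in_search(TRIPLES)
print(visible)
```

Only `ds:v1` is hidden, because it is the only object of a revision statement; the sources of `ds:merged` stay visible since derivation alone does not imply replacement.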

I looked for a subproperty in SO that was equivalent to prov:wasRevisionOf, and didn't find a match. There could be one though. The closest thing I could find is that there is schema:UpdateAction which is meant to explicitly be an action in which the schema:result replaces the schema:object, but because these same properties are used in all schema:Action classes, such as schema:CreateAction, the interpretation of the schema:result as a "replacement" only applies within the context of schema:UpdateAction. So, I couldn't find a dedicated subproperty indicating replacement semantics in SO, and I left the prov:wasRevisionOf for the time being. Maybe there's another approach.

@mpsaloha
Contributor

mpsaloha commented Jul 30, 2020 via email

@datadavev
Collaborator

Perhaps SO:ReplaceAction (The act of editing a recipient by replacing an old object with a new object) with its replacee and replacer corresponds with prov:wasRevisionOf? Though the description doesn't necessarily mean the replacer is a revision, could be just a new instance.

@mpsaloha
Contributor

mpsaloha commented Jul 30, 2020 via email

@datadavev
Collaborator

Yep, agreed. The act of substitution is clear, but the notion that the replacement is a revision of the replacee is not.

Dave-- I think we'd then lose the notion of the replacer being a revision_of rather than simply substitute_for the replacee. The example they give of changing movies is very different from the notion that the derived entity contains significant components of the original entity. Mark

@rduerr
Collaborator

rduerr commented Jul 31, 2020

Of all of the options above the one that is most human understandable is:

{
  "@context": {
    "@vocab": "https://schema.org/",
    "prov": "http://www.w3.org/ns/prov#",
    "provone": "http://purl.dataone.org/provone/2015/01/15/ontology#"
  },
  "@id": "https://doi.org/10.xxxx/Dataset-2",
  "@type": "Dataset",
  "name": "Removal of organic carbon by natural bacterioplankton communities as a function of pCO2 from laboratory experiments between 2012 and 2016",
  "prov:wasDerivedFrom": "https://doi.org/10.xxxx/Dataset-1",
  "isBasedOn": "https://doi.org/10.xxxx/Dataset-1",
  "prov:wasGeneratedBy": 
      {
        "@id": "https://example.org/executions/execution-42",
        "@type": "provone:Execution",
        "prov:hadPlan": "https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R",
        "prov:used": "https://doi.org/10.xxxx/Dataset-1"
      }
}

And I think an @id is necessary because I can see people needing to query for every dataset/product that used a particular commonly used script as part of a processing chain, when that script is found to have a bug in it that requires reprocessing everything that used it.
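
A minimal sketch of that kind of query, assuming the harvested records keep the nested prov:wasGeneratedBy shape shown above. All IRIs here are hypothetical placeholders and `datasets_using_plan` is an illustrative helper, not an existing API.

```python
def datasets_using_plan(records, plan_iri):
    """Return ids of datasets whose generating execution had `plan_iri` as its plan."""
    hits = []
    for record in records:
        execution = record.get("prov:wasGeneratedBy")
        if isinstance(execution, dict) and execution.get("prov:hadPlan") == plan_iri:
            hits.append(record["@id"])
    return hits


# Hypothetical harvested records; every IRI below is a placeholder.
BUGGY_SCRIPT = "https://example.org/scripts/process-script.R"
RECORDS = [
    {"@id": "https://example.org/dataset/a",
     "prov:wasGeneratedBy": {"@id": "https://example.org/executions/1",
                             "prov:hadPlan": BUGGY_SCRIPT}},
    {"@id": "https://example.org/dataset/b",
     "prov:wasGeneratedBy": {"@id": "https://example.org/executions/2",
                             "prov:hadPlan": "https://example.org/scripts/other.R"}},
    {"@id": "https://example.org/dataset/c",
     "prov:wasGeneratedBy": {"@id": "https://example.org/executions/3",
                             "prov:hadPlan": BUGGY_SCRIPT}},
    {"@id": "https://example.org/dataset/d"},  # no provenance recorded
]

affected = datasets_using_plan(RECORDS, BUGGY_SCRIPT)
print(affected)
```

Because each execution has its own @id, the same lookup could also report which runs need to be repeated, not just which datasets are affected.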

@smrgeoinfo
Contributor

smrgeoinfo commented Jul 31, 2020

I edited @rduerr 's example to preserve the formatting in the JSON, no content change.

+1 for that encoding approach

@mbjones
Collaborator

mbjones commented Aug 4, 2020

Discussed the proposal and ADR during the SOSO call on Aug 3. General consensus that the use of PROV-O and ProvONE predicates was preferred because of their increased semantic precision. We agreed to move towards approving the ADR, but will give people another week or so to comment. @mbjones will prepare a PR with minor revisions shortly thereafter.

@rduerr
Collaborator

rduerr commented Aug 5, 2020

I looked at the ADR and updated text - looks good to me.

@mbjones
Collaborator

mbjones commented Aug 6, 2020

Uploaded the current ProvONE OWL file to COR for better community visibility and navigation. See:
http://cor.esipfed.org/ont?iri=http://purl.dataone.org/provone/2015/01/15/ontology%23

@mbjones mbjones added the enhancement New feature or request label Sep 25, 2020
@mbjones mbjones self-assigned this Sep 25, 2020
@ashepherd ashepherd added this to In progress in science-on-schema.org Dec 11, 2020
@mbjones mbjones added accepted decision Issues on which a decision was accepted for release. Update Documentation updates to the guidance docs labels Jan 22, 2021
@mbjones mbjones linked a pull request Jan 22, 2021 that will close this issue
@mbjones
Collaborator

mbjones commented Jan 27, 2021

PR #134 merged in the accepted provenance features into develop and will now be included in the release, so closing this issue.

9 participants