Alignment with Bioschemas profile Workflow 0.5 #81

Merged
merged 18 commits into from Jul 30, 2020

Conversation

stain
Contributor

@stain stain commented Jun 8, 2020

Updating our Workflow section to map closer to BioSchemas profile for Workflows.

It also removes wf4ever references for Workflow and Script.

Reflecting the desire to avoid @type: [arrays] (which issue, @ptsefton?), this also reduces the previous triple-typing @type: [File, SoftwareSourceCode, Script] to just @type: SoftwareSourceCode (that change needs further changes elsewhere).

Work in progress

  • Update URIs for released Workflow 0.5 profile
  • Check if profile recommends @type: Workflow or @type: SoftwareSourceCode
  • Check with BioSchemas community what temporary namespace to use
  • array-less @type overall in RO-Crate (not blocker)

Author vs creator

https://bioschemas.org/profiles/Workflow/0.4-DRAFT-2020_05_11/ specifies creator rather than author. However, we have made https://researchobject.github.io/ro-crate/1.0/ consistently use author for other types, including the Dataset itself, and https://bioschemas.org/profiles/ScholarlyArticle/0.1-DRAFT-2019_03_15/ also uses author. On the other hand, https://bioschemas.org/profiles/Dataset/0.3-RELEASE-2019_06_14/ recommends creator rather than author.

Occasionally the author of a workflow may be different from the creator, e.g. Alice writes the workflow in Galaxy, then Bob rewrites it in Snakemake, which is quite a different workflow language. However the conceptual workflow could remain the same. As a workflow is typed as http://schema.org/SoftwareSourceCode then perhaps it makes most sense to note who made the code lines as the creator - so I left our workflow examples to also use that.

See section Authoring in our PAV paper for discussion of author vs creator vs curator vs contributor.

Multiple type array

Removing the multiple @type: [arrays] meant I also got rid of WorkflowSketch so the diagrams are now in a sense untyped except for their about relation to a workflow, which again is now just a SoftwareSourceCode.

When we change this generally in RO-Crate we also have to soften the requirement that data entities listed under hasPart must have type File, as workflows are generally saved in files. (The same applies to ImageObject and ScholarlyArticle if embedded.)

Note that the FormalParameter proposal uses additionalType and format with links to the EDAM ontology. In the example I used them as full URIs, but I am not sure if we need to recommend their @type: Thing contextual entities as in my example, as their URIs generally give a readable description (e.g. http://edamontology.org/format_1929 )
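To make the trade-off concrete, here is a minimal Python sketch that lists the URIs used in format or additionalType that have no contextual entity of their own in the crate. The entity identifiers, the graph layout and the helper name are assumptions for illustration, not something this pull request prescribes; only the EDAM URI comes from the text above.

```python
def undescribed_uris(graph, properties=("format", "additionalType")):
    """Return URIs referenced by the given properties that have no entity in the graph."""
    known = {entity["@id"] for entity in graph}
    found = set()
    for entity in graph:
        for prop in properties:
            value = entity.get(prop)
            values = value if isinstance(value, list) else [value]
            for item in values:
                # Accept both {"@id": "..."} references and plain URI strings.
                uri = item.get("@id") if isinstance(item, dict) else item
                if isinstance(uri, str) and uri not in known:
                    found.add(uri)
    return sorted(found)


# Hypothetical FormalParameter entity; "#input-sequences" is an invented identifier.
graph = [
    {
        "@id": "#input-sequences",
        "@type": "FormalParameter",
        "name": "sequences",
        "format": {"@id": "http://edamontology.org/format_1929"},
    }
]
without_thing = undescribed_uris(graph)

# Adding a @type: Thing contextual entity makes the reference self-describing.
graph.append(
    {"@id": "http://edamontology.org/format_1929", "@type": "Thing", "name": "FASTA"}
)
with_thing = undescribed_uris(graph)
```

If the specification ends up not recommending the Thing entities, a check like this would simply be dropped; if it does, it shows which references still need describing.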

Script vs Workflow

Removing the wf4ever terms and the multiple types makes it harder to distinguish workflows from scripts. Perhaps that was always tricky, e.g. https://snakemake.readthedocs.io/ workflows look a lot like scripts anyway.

It is unclear from https://bioschemas.org/profiles/Workflow/0.4-DRAFT-2020_05_11/ if they are proposing a new type Workflow (to become http://schema.org/Workflow) or specifying how https://schema.org/SoftwareSourceCode should be used under this profile.

This text suggests the second:

This Profile fits into the schema.org hierarchy as follows:
Thing > CreativeWork > SoftwareSourceCode

compared to https://bioschemas.org/profiles/ChemicalSubstance/0.4-RELEASE/

This Profile fits into the schema.org hierarchy as follows:
Thing > BioChemEntity > ChemicalSubstance

However clarity needs to be sought from BioSchemas as this is inconsistent across their site - perhaps @alaninmcr @AlasdairGray can chip in.

Assuming this, this pull request uses just SoftwareSourceCode and removes the previous distinction between Script and Workflow from wf4ever.

BioSchemas as an optional profile in RO-Crate

https://bioschemas.org/profiles/Workflow/0.4-DRAFT-2020_05_11/ specifies these mandatory properties:

  • creator
  • dateCreated
  • input
  • license
  • name
  • output
  • programmingLanguage
  • sdPublisher
  • url
  • version

I think having all of this information is a bit excessive for any RO-Crate that happens to have a workflow, as our other types are not as restrictive. Therefore I added the BioSchemas compliance as a new, in a way optional section.

However I did use the word SHOULD in this wording, so it may need to be softened to make it clear they don't have to follow this section?

To comply with the BioSchemas Workflow profile, where possible, data entities representing workflows SHOULD describe these properties and their related contextual entities (..)
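As a rough illustration of what "where possible" could mean in tooling, the following Python sketch reports which of the mandatory properties listed above are missing from a workflow entity, without treating that as an error. The entity contents, its identifiers and the helper name are made up for the example.

```python
# Mandatory properties as listed above for the Workflow 0.4 draft profile.
MANDATORY = (
    "creator", "dateCreated", "input", "license", "name",
    "output", "programmingLanguage", "sdPublisher", "url", "version",
)


def missing_properties(entity):
    """Return the mandatory properties that are absent or empty on a JSON-LD entity."""
    return [p for p in MANDATORY if entity.get(p) in (None, "", [], {})]


# Hypothetical, deliberately incomplete workflow entity.
workflow = {
    "@id": "workflow/analysis.cwl",
    "@type": "SoftwareSourceCode",
    "name": "Example analysis workflow",
    "creator": {"@id": "#alice"},
    "license": {"@id": "https://spdx.org/licenses/Apache-2.0"},
}

missing = missing_properties(workflow)
for prop in missing:
    # A SHOULD-level check: report, but do not reject the crate.
    print(f"workflow entity lacks BioSchemas property: {prop}")
```

A crate with such a workflow would still be a valid RO-Crate under this reading; it just would not claim compliance with the BioSchemas profile.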

Namespaces

This pull request is work in progress because it reflects changes for the planned Workflow 0.5 DRAFT release, which adds the FormalParameter type for inputs and outputs. Once that is released on bioschemas.org we can insert the date in this @context mapping:

          "input": "https://bioschemas.org/profiles/Workflow/0.5-DRAFT-2020_xx_xx/#input",
          "output": "https://bioschemas.org/profiles/Workflow/0.5-DRAFT-2020_xx_xx/#output",
          "format": "https://bioschemas.org/profiles/Workflow/0.5-DRAFT-2020_xx_xx/#format",
          "FormalParameter": "https://bioschemas.org/profiles/Workflow/0.5-DRAFT-2020_xx_xx/#FormalParameter",

It is not pretty, but as these terms will be proposed to schema.org, we don't know if they will change in the process (e.g. format might be dropped for http://schema.org/encodingFormat and input might become http://schema.org/inputParameter rather than the intended http://schema.org/input).

These URIs are all 404 now, so the idea was to map to https://bioschemas.org/profiles/Workflow/0.4-DRAFT-2020_05_11/#input etc - even if strangely there is no id="input" HTML anchor on that page (there probably should be! Views, @AlasdairGray ?).

If we release RO-Crate 1.1 we have to be stable in what we map to, just like https://w3id.org/ro/crate/1.0/context has a fixed mapping to https://schema.org/version/5.0/ terms - these crates might end up on tape drives etc. and should be able to have a long life.

I've updated the context for schema.org release 8.0 which adds some extra terms (see diff). I also added a new isBasedOn property to reflect basing our context on schema.org, pcdm and the BioSchemas Workflow profile (again once the 0.5 URI is known).

stain and others added 9 commits June 8, 2020 14:13
Remove Workflow and Script
As discussed in meeting 2020-05-28 we should reduce/avoid need for @type array

aligning with BioSchemas Workflow profile means we don't need wf4ever terms for Workflow or Script or Sketch, however at the cost of losing some precision
This must be augmented for pending changes in 0.5 for `FormalParameter`
@stain stain added the enhancement New feature or request label Jun 8, 2020
@stain stain marked this pull request as draft June 8, 2020 17:32
@stain stain added this to the RO-Crate 1.1 milestone Jun 10, 2020
@stain stain added this to In progress in RO-Crate specifications Jun 10, 2020
@stain
Contributor Author

stain commented Jun 24, 2020

It was decided in BioSchemas Workflow working group that Workflow will be a new type. This makes it more important to know which URI to refer to it with.

Other proposed types from BioSchemas, e.g. https://bioschemas.org/types/Taxon/0.3-RELEASE-2019_11_18/ have a separate namespace

Canonical URL
https://bioschemas.org/Taxon

however no equivalent seems to exist for its new properties, e.g. https://bioschemas.org/childTaxon does not exist, whereas https://bioschemas.org/Taxon#childTaxon does refer to the right HTML row:

                    <tr id="childTaxon">
                       <th style="color: #0B794B;">childTaxon</th>
                       <td>
                         <a style="color: #0B794B;" href="/types/drafts/Taxon">Taxon</a> or<br/>
                         <a href="http://schema.org/Text">Text</a> or<br/>
                         <a href="http://schema.org/URL">URL</a>
                       </td>
                       <td>
                         Closest child taxa of the taxon in question. <br/>
                         Inverse property: <span style="color: #0B794B;">parentTaxon</span>
                       </td>
                     </tr>

So assuming it will appear on the bioschemas.org website, perhaps we should use https://bioschemas.org/Workflow as type and the new properties in the style of https://bioschemas.org/Workflow#input ? We still refer to which version of the profile at the isBasedOn link.
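A small Python sketch of that naming style: the type maps to its canonical URL and each new property to a fragment on it. Only the Workflow, input and output URIs follow directly from the proposal above; the function name is invented, and where format and FormalParameter would live under this scheme is left open here.

```python
BIOSCHEMAS = "https://bioschemas.org/"


def term_mapping(type_name, properties):
    """Build @context entries for a type and its properties in the Type#property style."""
    type_uri = BIOSCHEMAS + type_name
    mapping = {type_name: type_uri}
    for prop in properties:
        mapping[prop] = f"{type_uri}#{prop}"
    return mapping


context_terms = term_mapping("Workflow", ["input", "output"])
for term, uri in context_terms.items():
    print(f'"{term}": "{uri}",')
```

The profile version is deliberately absent from these URIs; as noted above, it would be recorded through the isBasedOn link instead.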

..although it is consistent with the http://schema.org/Grant example 3,
see schemaorg/schemaorg#383

Using https://bioschemas.org/Workflow#input etc
Added a softer distinction between workflow and script
@ptsefton
Contributor

I thought that we resolved to keep @type arrays, so that it is easy to identify Data Entities; that was my recollection of the conclusion in the last call. If you remove "File" then it makes it harder for developers to identify things that might need to be fetched from their @id URIs and to build interfaces like the one in Describo.

@stain
Contributor Author

stain commented Jul 2, 2020

@ptsefton Reflecting #83 I changed it to use the @type arrays again in fe5ec8f including File in all examples. Now just waiting for @alaninmcr to push Workflow and FormalParameter onto https://bioschemas.org/types/
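To illustrate why keeping File in the @type array helps consumers, here is a hedged Python sketch that picks out the entities a tool might want to fetch. The graph contents and identifiers are invented; the only assumption taken from the discussion is that data entities carry File among their types.

```python
def types_of(entity):
    """Return an entity's @type as a list, whether it was a string or an array."""
    value = entity.get("@type", [])
    return [value] if isinstance(value, str) else list(value)


# Hypothetical fragment of an RO-Crate @graph.
graph = [
    {"@id": "workflow/analysis.cwl", "@type": ["File", "SoftwareSourceCode"]},
    {"@id": "workflow/diagram.svg", "@type": ["File", "ImageObject"],
     "about": {"@id": "workflow/analysis.cwl"}},
    {"@id": "#alice", "@type": "Person", "name": "Alice"},
]

# Entities typed as File are the ones that may need fetching from their @id.
fetchable = [entity["@id"] for entity in graph if "File" in types_of(entity)]
```

Without File in the array, the same tool would need a list of every type that can be stored as a file, which is the difficulty raised in the comment above.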

@alaninmcr
Contributor

@alaninmcr has sent the Workflow and FormalParameter types to bioschemas. I am waiting for their response.

@stain
Contributor Author

stain commented Jul 20, 2020

The latest discussion in BioSchemas/bioschemas.github.io#304 and BioSchemas/schemaorg#7 (incl. @ljgarcia @alaninmcr @AlasdairGray) concludes that Workflow should be renamed to ComputationalWorkflow, leaving Workflow free for "doing stuff in the lab" kind of workflows - which may or may not later become a superclass of ComputationalWorkflow.

Those pull requests for updating bioschemas.org are blocked mainly by that rename.

@stain
Contributor Author

stain commented Jul 22, 2020

@stain stain marked this pull request as ready for review July 22, 2020 15:59
stain added a commit to ResearchObject/ro-crate-py that referenced this pull request Jul 22, 2020
@stain stain changed the title WIP: Alignment with Bioschemas profile Workflow 0.5 Alignment with Bioschemas profile Workflow 0.5 Jul 30, 2020
@stain stain merged commit 2d5293a into master Jul 30, 2020
RO-Crate specifications automation moved this from In progress to Done Jul 30, 2020
@stain stain deleted the bioschemas-workflow-0.5 branch July 30, 2020 15:09
@stain stain mentioned this pull request Sep 15, 2020
@stain stain mentioned this pull request Sep 24, 2020
10 tasks
@stain
Contributor Author

stain commented Sep 30, 2020

See also review edits in #100

lrodrin pushed a commit to inab/ro-crate-py that referenced this pull request Nov 4, 2020

3 participants