T2: Data #2

Open
RenskeW opened this issue Dec 3, 2022 · 8 comments

RenskeW commented Dec 3, 2022

Input and (intermediate) output data.

  • D1 Identification: PID, version, name, and description of the dataset. Preferred citation of the data. When the data is not FAIR: URL and download date as an alternative to PID and version. When the dataset is a subset of a larger collection (e.g. a database): PID of the database, database version and download date, and the query or filtering strategy which produced the dataset.
  • D2 File characteristics: Filename, format, creation and last modification timestamps, size, and checksum.
  • D3 Access: URL to a downloadable form of the data. License.
  • D4 Mapping: The workflow and step parameters for which the data is an input or output.

RenskeW commented Dec 17, 2022

What is represented in CWLProv RO Bundle in RDF:

D2 File characteristics:

  • creation date of (intermediate) output files
  • filename
  • checksum (id of entity in RDF graph)

D4 Mapping:

  • Data linked to workflow parameter
  • Data linked to step parameter

In addition, the following can be found in primary-job.json:

  • D1 Identification: Structured annotations about the input data. There are no clear guidelines in the latest CWL standards (v1.2) for these annotations, but if workflow authors add them to their data they will be contained in the input parameter file (primary-job.json).
  • D2 File characteristics:
    • Format of inputs of type File, via the format metadata field.
    • Size is contained in both primary-job.json (input files) and primary-output.json (output files).
    • Checksum: For input files in primary-job.json, and output files in primary-output.json
  • D3 Access: Similar to D1, this information can be added with structured annotations, but the CWL Standards do not specify how, e.g. which ontology and terms should be used.
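
To illustrate the D2 fields listed above, here is a minimal sketch that walks a parsed primary-job.json (or primary-output.json) and collects the basename, format, size and checksum of every CWL File object it finds. The field names are those of the CWL File type; the helper names and the example content are hypothetical and not taken from a real run.

```python
import json
from typing import Any, Dict, Iterator, List


def iter_files(node: Any) -> Iterator[dict]:
    """Yield every CWL File object nested anywhere in a parsed JSON document."""
    if isinstance(node, dict):
        if node.get("class") == "File":
            yield node
        # Keep descending so that secondaryFiles and nested records are found too.
        for value in node.values():
            yield from iter_files(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_files(item)


def file_characteristics(job: dict) -> List[Dict[str, Any]]:
    """Collect the D2 fields that a CWL File object may carry (None if absent)."""
    fields = ("basename", "format", "size", "checksum")
    return [{f: file_obj.get(f) for f in fields} for file_obj in iter_files(job)]


# Hypothetical example content, standing in for a real primary-job.json.
example_job = json.loads("""
{
  "reads": {
    "class": "File",
    "location": "data/ab/abc123",
    "basename": "reads.fastq",
    "format": "http://edamontology.org/format_1930",
    "size": 2048,
    "checksum": "sha1$abc123"
  },
  "threads": 4
}
""")

for entry in file_characteristics(example_job):
    print(entry)
```

Fields that the workflow author or the engine did not record simply come back as None, which makes gaps in D2 easy to spot.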


RenskeW commented Jan 3, 2023

Results of analysis of RO-Crates converted by runcrate from CWLProv RO Bundles:

  • D2 - File characteristics (see Scenario 1):
    • checksum of files is included as the ID of the entity, but it is not explicitly specified that the ID corresponds to the checksum (nor is this the case in CWLProv).
    • filename is not included in ro-crate-metadata.json, and since primary-job.json is not included in the RO-Crate, this information can no longer be retrieved.
    • creation date of (intermediate) output files is not represented in ro-crate-metadata.json.
    • format of input and output files is not included in ro-crate-metadata.json (only their expected formats are described for the parameters for which they are values, if explicitly specified in the workflow description).
  • D4 - Mapping: The parameter values are linked to the CommandLineTool and Workflow parameters they correspond to (see Scenario 4).

Given that primary-job.json and primary-output.json are not included in the RO-Crates generated by runcrate, all information they contain that is not carried over to ro-crate-metadata.json is lost.
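
A check like the one described above could be sketched as follows: load ro-crate-metadata.json and report, for each File entity, which D2 properties are absent. The property names in D2_PROPERTIES are an assumption (schema.org-style terms plus sha1); this thread does not state which terms runcrate uses, so adjust them to the crate at hand. The example graph is hypothetical.

```python
from typing import Dict, List

# Assumed property names for filename, creation date, size, format and checksum.
D2_PROPERTIES = ("alternateName", "dateCreated", "contentSize", "encodingFormat", "sha1")


def is_file(entity: dict) -> bool:
    """True if the entity's @type is, or includes, 'File'."""
    types = entity.get("@type", [])
    if isinstance(types, str):
        types = [types]
    return "File" in types


def missing_d2(metadata: dict) -> Dict[str, List[str]]:
    """Map each File entity's @id to the D2 properties it lacks."""
    report: Dict[str, List[str]] = {}
    for entity in metadata.get("@graph", []):
        if is_file(entity):
            missing = [p for p in D2_PROPERTIES if p not in entity]
            if missing:
                report[entity.get("@id", "<no id>")] = missing
    return report


# Hypothetical crate content; a real one would be read with json.load().
example_metadata = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
        {"@id": "abc123", "@type": "File"},
        {"@id": "def456", "@type": "File", "alternateName": "out.txt",
         "dateCreated": "2023-01-03T10:00:00", "contentSize": 12,
         "encodingFormat": "text/plain", "sha1": "def456"},
    ]
}

print(missing_d2(example_metadata))
```

Note that the workflow file itself is typically also typed as File, so it will show up in the report unless it is filtered out.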


RenskeW commented Jan 3, 2023

Suggested enhancement 1:

Add the creation datetime for (at least) the (intermediate) output files to ro-crate-metadata.json.


RenskeW commented Jan 3, 2023

Suggested enhancement 2:

Add the filename of input and (intermediate) output files and directories to ro-crate-metadata.json.


RenskeW commented Jan 3, 2023

Suggested enhancement 3:

Add the size of input and output files, contained in primary-job.json/primary-output.json, to ro-crate-metadata.json.


RenskeW commented Jan 3, 2023

Suggested enhancement 4:

Add the format of input and output files, contained in primary-job.json/primary-output.json, to ro-crate-metadata.json.


RenskeW commented Mar 18, 2023

UPDATE: Checksum and basename now included in ro-crate-metadata.json.


simleo commented Nov 9, 2023

ResearchObject/runcrate#69 and ResearchObject/runcrate#70 also added creation time, size and format.
