Adding RO-Crate for the Autosubmit workflow that uses mHM and their test data domains #61

kinow · 2023-06-07T10:01:42Z

This adds the first RO-Crate produced with Autosubmit using mHM, a mesoscale hydrological model: https://mhm.pages.ufz.de/mhm/stable/

They provide test data to run the model, which is downloaded and used in the workflow. The PR with RO-Crate included in the model workflow is this one: kinow/auto-mhm-test-domains#12

And this is the MR that adds RO-Crate to Autosubmit (under review): https://earth.bsc.es/gitlab/es/autosubmit/-/merge_requests/317

There were about 27 MB of log files in the tmp directory, which I truncated to save space in this repository (find . -type f -exec truncate -c -s 0 {} \;).

Thanks!
-Bruno

simleo · 2023-06-07T10:24:07Z

There are several problems with the crate:

The conformsTo of ro-crate-metadata.json should only list https://w3id.org/ro/crate/1.1 and https://w3id.org/workflowhub/workflow-ro-crate/1.0
ro-crate-metadata.json should not have license, mainEntity and mentions; mentions goes in the root data entity and needs to be referenced by @id
Person's affiliation should just be https://ror.org/05sd8tv96, no mailto
"conf/minimal.yml" has a null encodingFormat; null should never appear in the RO-Crate JSON-LD
actionStatus should be CompletedActionStatus (assuming there were no errors), or it can be omitted
The action's instrument must be the workflow (workflow.yml), not https://mhm.pages.ufz.de/mhm/stable/. The latter could be listed in the workflow's softwareRequirements, but adding the SoftwareApplication type to the workflow.
the input and output properties belong in the workflow, not in the action. The action should have object and result, pointing to the actual input and output data entities or PropertyValue instances
The property to link a parameter value to the corresponding formal parameter is exampleOfWork, not exampleWork
The scripts referenced by workflow.yml should be in the crate: e.g. templates/sim.sh, templates/graph.sh, etc. In particular there should be something that calls mhm, I guess.
- I will update the Autosubmit code to create entities pointing to the scripts on Git/Subversion/Local file system
- Will also explain it in a) Autosubmit docs, b) RO-Crate paper, c) Zenodo
- Use https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html for git-based URIs, e.g. https://n2t.net/swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml – use at least commit and path (the link would resolve as soon as the git repo is archived)
- .. or https://w3id.org/cwl/view permalinks for CWL workflows.
- https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html has URIs for git commits and files within. A bit many options - would recommend at least commit and path -- you can get that independent of where the git repo was pushed or not. Similar to https://w3id.org/cwl/view permalinks which you could also use for CWL files

I'm attaching a "corrected" version of the metadata file for comparison. I could not add the File and PropertyValue instances to the action's input and output because I don't know which is which.

ro-crate-metadata.json.zip

kinow · 2023-06-10T23:26:06Z

Some progress today. Found a possible bug in Autosubmit which may require a hotfix or a workaround to retrieve the workflow status programmatically. Will complete the other pending items tomorrow or next week.

actionStatus should be CompletedActionStatus (assuming there were no errors), or it can be omitted

https://earth.bsc.es/gitlab/es/autosubmit/-/issues/1058

The latter could be listed in the workflow's softwareRequirements, but adding the SoftwareApplication type to the workflow.

Updated the instrument to point to workflow.yml, but deleted the software requirement to simplify the RO-Crate.

kinow · 2023-06-13T20:18:01Z

Hi @simleo ,

About this last item

The scripts referenced by workflow.yml should be in the crate: e.g. templates/sim.sh, templates/graph.sh, etc. In particular there should be something that calls mhm, I guess.

I'm thinking how to handle this one.

I understand that it would make sense to include the templates/sim.sh script, for example. However, these scripts are hosted outside of the workflow, and imported automatically by Autosubmit (it's defined in Autosubmit's configuration).

Here's an example of what a template script like that could look like:

#!/bin/bash

set -e -o pipefail

source ../hpc-platforms/mn4/load_modules_1.0.sh

run_model.sh %SOME_CONFIGURATION_PARAM%

The tricky part is that source call. It's quite common for template scripts to include other shell scripts, or even depend on other Python or R scripts in the project directory (which can be quite huge, with lots of submodules checked out).

IOW, it would be very hard to include every file used to run the script, even after pre-processing the file (when Autosubmit replaces variables like %SOME_CONFIGURATION_PARAM%).
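To illustrate the limitation, here is a minimal sketch of %NAME% placeholder substitution. It is not Autosubmit's actual implementation; the helper name render_template and the value domain_1 are hypothetical. Pre-processing resolves the placeholder, but the file pulled in by source is still an external dependency that the rendered script alone does not capture.

```python
import re

# Hypothetical helper for illustration only; Autosubmit's real
# pre-processing may work differently.
PLACEHOLDER = re.compile(r"%([A-Za-z0-9_.]+)%")


def render_template(text, params):
    """Replace %NAME% placeholders with values from params.

    Placeholders with no matching parameter are left untouched.
    """
    def substitute(match):
        key = match.group(1)
        return str(params[key]) if key in params else match.group(0)

    return PLACEHOLDER.sub(substitute, text)


script = (
    "source ../hpc-platforms/mn4/load_modules_1.0.sh\n"
    "run_model.sh %SOME_CONFIGURATION_PARAM%\n"
)
rendered = render_template(script, {"SOME_CONFIGURATION_PARAM": "domain_1"})
print(rendered)
```

The source line survives rendering unchanged, which is why including only the rendered script would not tell a reader everything that ran.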

The link that we have between the workflow configuration and the workflow template scripts is the project type and the commit of the project used (which I will have to confirm in which log file that's logged).

I could add these files to the archive, but I feel like they would be misleading to users as they wouldn't, in fact, be really helpful to know what was used to run the task commands/scripts.

I hope that makes sense? Any thoughts about this last pending item?

Thanks!
Bruno

simleo · 2023-06-15T13:38:41Z

If it's not a matter of linking to single isolated scripts, things get more complicated. In the prediction pipeline crate, the CWL workflow file points to Docker images with the required environment. I don't know if Autosubmit has the equivalent of CWL's dockerPull, but then such an image would need to exist in the first place. The backtrackbb example crate adopts a different approach, basically including required scripts, libraries and other requirements directly in the crate. I don't know if these were looked for automatically or added manually, perhaps @rsirvent can provide some advice here.

kinow · 2023-06-15T14:12:39Z

I don't know if Autosubmit has the equivalent of CWL's dockerPull

Unfortunately it doesn't have an equivalent. It only executes a Shell script - and the Shell script can use variables from the workflow configuration like %MYVAR% - in a platform via SSH/Slurm/etc. That's the gist of how it works.

The backtrackbb example crate adopts a different approach, basically including required scripts, libraries and other requirements directly in the crate. I don't know if these were looked for automatically or added manually,

I can add automatically the template scripts you mentioned above, but the main issue is that I am not able to automatically detect every script used by users (and I suspect most users are not aware either, since large workflows normally have multiple groups working, importing submodules from other projects).

These templates are located in the proj/ folder by default. This folder contains proj/**/*.sh that are shell scripts, but also can contain several other files (some large binaries like wrf_model.exe) that are compiled for the workflow and probably should not be included in the RO-Crate.

perhaps @rsirvent can provide some advice here.

That'd be great! 🙏 🙇

Thanks @simleo !

simleo · 2023-06-19T12:19:28Z

I can add automatically the template scripts you mentioned above, but the main issue is that I am not able to automatically detect every script used by users (and I suspect most users are not aware either, since large workflows normally have multiple groups working, importing submodules from other projects).

Adding what can be added is good. If not every script down the chain is included in the crate, however, there should be some kind of references to the submodules that have to be imported, so that at least someone with access to the system can reproduce the computation. I mean, if the RO-Crate is not portable to a different system, there has to be at least enough information for it to be meaningful in the same system where it was created.

kinow · 2023-06-19T12:43:54Z

I mean, if the RO-Crate is not portable to a different system, there has to be at least enough information for it to be meaningful in the same system where it was created.

Yes, that's what I was suggesting by adding the Git repository information. The tasks/jobs in Autosubmit need a Shell script to be executed. But to specify this shell script, users must add a project. Autosubmit project is configured like this:

PROJECT:
  TYPE: git # or local, or subversion...
GIT:
  PROJECT_ORIGIN: "https://github.com/kinow/auto-mhm-test-domains.git"
  PROJECT_BRANCH: "rocrate"
  PROJECT_COMMIT: ""
  PROJECT_SUBMODULES: ""
  FETCH_SINGLE_BRANCH: True

And the logs must contain the git commit hash. So given an Autosubmit RO-Crate file, users must be able to read the configuration, and find the reference to where to locate the shell scripts (git, local folder, or subversion).

And to re-run the workflow, users would have to unzip the folder, have a compatible version of Autosubmit, run autosubmit create $expid, followed by autosubmit run $expid.

The first command name, autosubmit create, is a bit misleading, as it does not create the workflow. It creates the project, i.e. it will clone the project (or copy a local folder), and preprocess the shell scripts, preparing everything for autosubmit run to be able to run the workflow.

rsirvent · 2023-06-20T09:26:06Z

Folks, sorry I completely missed this thread. Next time maybe you can ping me through Slack.

What I do in COMPSs is let the user decide what to include in the crate, mainly through 2 terms in the ro-crate-info.yaml that we require: the "files" and the "sources_dir". Essentially, for each directory specified in "sources_dir", everything on it is added recursively. The "files" term is more about adding one by one any files wanted.

You can find more detailed information at: https://compss-doc.readthedocs.io/en/stable/Sections/05_Tools/04_Workflow_Provenance.html#previous-needed-information

kinow · 2023-06-23T16:23:57Z

Hi @rsirvent

Folks, sorry I completely missed this thread. Next time maybe you can ping me through Slack.

Not a problem, will ping you over there next time 🙂

What I do in COMPSs is let the user decide what to include in the crate, mainly through 2 terms in the ro-crate-info.yaml that we require: the "files" and the "sources_dir". Essentially, for each directory specified in "sources_dir", everything on it is added recursively. The "files" term is more about adding one by one any files wanted.

You can find more detailed information at: https://compss-doc.readthedocs.io/en/stable/Sections/05_Tools/04_Workflow_Provenance.html#previous-needed-information

That's similar to what we implemented in Autosubmit. Asking users to provide globs, and then we locate files and create the entities into the JSON-LD metadata using ro-crate-py.
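A rough sketch of the glob-based approach, using only the standard library and plain dictionaries instead of ro-crate-py so that it stays self-contained. The function name file_entities and the patterns are assumptions for illustration, not the code in the merge request.

```python
import mimetypes
from pathlib import Path


def file_entities(root, patterns):
    """Build JSON-LD File entities for files under root matching glob patterns.

    Assumption: IDs are paths relative to the crate root. encodingFormat is
    only set when a MIME type can be guessed, so no null values are emitted.
    """
    root = Path(root)
    entities = []
    seen = set()
    for pattern in patterns:
        for path in sorted(root.glob(pattern)):
            if not path.is_file() or path in seen:
                continue
            seen.add(path)
            entity = {
                "@id": path.relative_to(root).as_posix(),
                "@type": "File",
                "name": path.name,
            }
            mime, _ = mimetypes.guess_type(path.name)
            if mime is not None:
                entity["encodingFormat"] = mime
            entities.append(entity)
    return entities
```

For example, file_entities(experiment_dir, ["conf/*.yml", "plot/*.pdf"]) would return one entity per matching file, ready to be added to the @graph.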

Thanks @rsirvent !

kinow · 2023-07-02T11:57:10Z

Just finished updating the Autosubmit branch in GitLab to add the Autosubmit Project as a SoftwareSourceCode.

I first tried to use Software Heritage persistent IDs, but the main issue I had is that most of the workflows I have live in private repositories that will probably never be indexed by Software Heritage.

I looked at the other suggestions but couldn't find something simple that would work with GitLab/GitHub/local git repo/local directory/subversion. Then reading the SoftwareSourceCode, I realized it matched exactly what I was trying to model for Autosubmit.

The project is now added as SoftwareSourceCode, where the git commit is added as version, the URL as codeRepository, and I've also added product/runtime (Autosubmit and its version) and a description as abstract (explaining what was the Project in Autosubmit.)

I found a bug in one of the dependencies of Autosubmit that's preventing me from generating the final RO-Crate (ro-crate-py 0.8.0 worked grand BTW), so from Monday I will send a patch for this dependency (it's internal to the BSC), and then start writing the unit tests.

I believe by the end of this week (next week tops?) I should have the tests, the docs, and the code updated addressing the last point here. Then everything will be ready for review 👍

e.g.

{
    "@id": "#file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
    "@type": "SoftwareSourceCode",
    "abstract": "The Autosubmit project. It contains the templates used\n                    by Autosubmit for the scripts used in the workflow, as well as any other\n                    source code used by the scripts (i.e. any files sourced, or other source\n                    code compiled or executed in the workflow).",
    "codeRepository": "file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
    "codeSampleType": "template",
    "name": "#file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
    "programmingLanguage": "Any",
    "runtimePlatform": "Autosubmit f4.0.0b",
    "sdDatePublished": "2023-07-02T11:08:17+00:00",
    "targetProduct": "Autosubmit",
    "version": ""
},

Following the codeRepository link (and the version if the project type is GIT, and not LOCAL) gives the user the exact files used for running the tasks (the /templates/some-shell-file.sh files.)
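A sketch of how an entity like the one above could be derived from the PROJECT/GIT configuration shown earlier in the thread. The function name project_entity, the commit argument (the hash recovered from the logs or the working copy), and the LOCAL.PROJECT_PATH key are assumptions for illustration; this is not the code in the merge request. The sketch leaves out version when no commit is known.

```python
def project_entity(config, autosubmit_version, commit=None):
    """Describe the Autosubmit Project as a SoftwareSourceCode entity.

    config is the parsed experiment configuration (a nested dict).
    commit is the resolved Git commit hash, if known.
    """
    project_type = config["PROJECT"]["TYPE"].lower()
    if project_type == "git":
        url = config["GIT"]["PROJECT_ORIGIN"]
        # Prefer a commit pinned in the configuration, else the resolved one.
        commit = config["GIT"].get("PROJECT_COMMIT") or commit
    elif project_type == "local":
        # Assumed key name for local projects.
        url = "file://" + config["LOCAL"]["PROJECT_PATH"]
        commit = None
    else:
        raise ValueError(f"unsupported project type: {project_type}")

    entity = {
        "@id": url,
        "@type": "SoftwareSourceCode",
        "name": url,
        "codeRepository": url,
        "codeSampleType": "template",
        "targetProduct": "Autosubmit",
        "runtimePlatform": f"Autosubmit {autosubmit_version}",
    }
    if commit:
        entity["version"] = commit
    return entity
```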

-Bruno

simleo · 2023-07-03T09:04:37Z

What will this SoftwareSourceCode entity be linked to, and how? How do consumers know that given a codeRepository of file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/ they have to look for something like /templates/some-shell-file.sh (you mean under auto-mhm-test-domains?) By the way, if this is a data entity, the @id does not have to start with a #.

kinow · 2023-07-03T09:51:22Z

What will this SoftwareSourceCode entity be linked to, and how?

Hmmm, good question. I don't believe there's a similar use case from another workflow manager that I could copy. Perhaps under hasPart of the MainEntity, and also as object of the CreateAction?

How do consumers know that given a codeRepository of file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/

Consumers of Autosubmit configuration will have to understand the basics of how Autosubmit works to use RO-Crates, I am afraid. Assuming a user knows how to read and run Autosubmit workflows, upon getting an RO-Crate I assume they would look at the "Project" and its "Project Type", and then look for the "Git" or "Local" configuration.

they have to look for something like /templates/some-shell-file.sh (you mean under auto-mhm-test-domains?)

Exactly. Autosubmit users wouldn't expect that the templates are part of the workflow. They are part of the Project, and the Project is copied by Autosubmit to a working directory.

The workflow configuration says what is the template used in each task of the workflow. Autosubmit users know that the template is hosted elsewhere (Git normally, but also local paths in an HPC).

I thought about it for a while, and that's the best way I could find to model what we have in Autosubmit in Schema.org/RO-Crate profiles. And this works well for us because this way we are able to distribute RO-Crates for workflows where the configuration is fine to be shared with others. But the templates (and possibly confidential information about the applications being executed, like weather/climate models) remain private, unless the user has access to the "Project" link being shared.

By the way, if this is a data entity, the @id does not have to start with a #.

Ah, never noticed the distinction. So # are for entities linked within the context of the graph, and non-# for data?

Thanks @simleo !

kinow · 2023-07-03T09:55:51Z

Also, these "Projects" can contain large files, Singularity containers, etc., used as part of the workflow -- invoked somewhere in the main shell script, or in one of the many source'd files, or invoked by another fortran/c code that calls that… that's why we consider that part almost like a separate application, that runs for the task of the workflow.

Autosubmit does not keep track of the data used by workflow tasks (like snakemake or CWL, for instance), and it also doesn't have the granularity to point what tools are being used in a task. In our configuration model, we only know that a Shell script (or R or Python) is copied from a Project and is executed for a task in the workflow, and what are the dependencies for this task.

kinow · 2023-07-09T15:23:40Z

Hi @simleo ,

I finished:

- updating the AS paper section
- updating the mHM branch for the AS test workflow at "add ro-crate configuration" kinow/auto-mhm-test-domains#12
- updating the Autosubmit code to include the Project information (search for codeRepository, then you can copy the ID and see that it is a new entity, and it's added as part of the main entity) https://earth.bsc.es/gitlab/es/autosubmit/-/merge_requests/317

Over the next week I will try to start writing the unit tests. I've attached the RO-Crate file so you can have a look (if you have time).

Thanks! 👋

ro-crate-metadata.json.zip

simleo · 2023-07-10T10:35:35Z

Rather than posting an ro-crate metadata file you should push a new commit containing all changes when you have a new version of the RO-Crate. Anyway, I downloaded the metadata file and several of the problems I previously reported are still present:

The ro-crate-metadata.json entity has the same conformsTo as the root data entity. Instead, its conformsTo should list "https://w3id.org/ro/crate/1.1" and "https://w3id.org/workflowhub/workflow-ro-crate/1.0".
The Person's affiliation has to be {"@id": "https://ror.org/05sd8tv96"}, with no mailto
The action's instrument must be the workflow, i.e. "workflow.yml"
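The three points above could be checked mechanically before pushing a new crate. This is a minimal sketch over the @graph list of the metadata file; the function names are hypothetical and the checks cover only what is listed here, not the full profile.

```python
RO_CRATE = "https://w3id.org/ro/crate/1.1"
WORKFLOW_RO_CRATE = "https://w3id.org/workflowhub/workflow-ro-crate/1.0"


def as_list(value):
    """Normalize a JSON-LD value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_ids(value):
    """Collect the @id of every {"@id": ...} reference in a value."""
    return [v["@id"] for v in as_list(value) if isinstance(v, dict) and "@id" in v]


def review_problems(graph, workflow_id="workflow.yml"):
    """Return a list of messages for the three issues raised in the review."""
    problems = []
    for entity in graph:
        types = as_list(entity.get("@type"))
        if entity.get("@id") == "ro-crate-metadata.json":
            if set(ref_ids(entity.get("conformsTo"))) != {RO_CRATE, WORKFLOW_RO_CRATE}:
                problems.append("metadata descriptor: unexpected conformsTo")
        if "Person" in types:
            affiliations = as_list(entity.get("affiliation"))
            ids = ref_ids(affiliations)
            # Every affiliation must be an {"@id": ...} reference, and not a mailto.
            if len(ids) != len(affiliations) or any(i.startswith("mailto:") for i in ids):
                problems.append(f"{entity.get('@id')}: affiliation must be an @id reference, no mailto")
        if "CreateAction" in types:
            if ref_ids(entity.get("instrument")) != [workflow_id]:
                problems.append(f"{entity.get('@id')}: instrument must be {workflow_id}")
    return problems
```

An empty result means none of these three problems were found; it says nothing about the rest of the crate.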

There's also a problem with the workflow's outputs. The result entries in the action are of type File, but they also have a value, which is used in PropertyValue instead. Their IDs are like #/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-pv, but they are not contextual entities, so they should not start with a #. So, instead of:

{
    "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-pv",
    "@type": "File",
    "exampleOfWork": {
        "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-param"
    },
    "name": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif",
    "value": 2
}

They should look like:

{
    "@id": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif",
    "@type": "File",
    "exampleOfWork": {
        "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-param"
    },
    "name": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif"
}

The @id of the corresponding formal parameter (linked to via exampleOfWork) seems also weird. Such IDs are more usually tied to the parameter's role, e.g. #plot_1, #plot_2. Also, why these full paths? Can't these files be included in the RO-Crate? Perhaps in the plot subdirectory, where now there is a a00d_20230606_1538.pdf that is not referenced anywhere in the RO-Crate metadata?

kinow · 2023-07-10T10:43:19Z

Oh d'oh! Forgot this was a pull request and not an issue. Sorry @simleo!

…est data domains

kinow · 2023-07-10T15:38:33Z

docs/examples/autosubmit/auto-mhm-test-domains/ro-crate-metadata.json

+            "sdDatePublished": "2023-07-10T15:33:06+00:00",
+            "targetProduct": "Autosubmit",
+            "version": "9ab36021b2f342e6a5d3a9fd74e5f8b9e25cc51a"
+        },


This ☝️ is the Autosubmit project, linking the workflow configuration with the project configuration (that contains the template scripts used to execute the workflow, for example).

It is a part of "@id": "workflow.yml" (under hasPart)

kinow · 2023-07-10T15:44:03Z

Anyway, I downloaded the metadata file and several of the problems I previously reported are still present:

The ro-crate-metadata.json entity has the same conformsTo as the root data entity. Instead, its conformsTo should list "https://w3id.org/ro/crate/1.1" and "https://w3id.org/workflowhub/workflow-ro-crate/1.0".
The Person's affiliation has to be {"@id": "https://ror.org/05sd8tv96"}, with no mailto
The action's instrument must be the workflow, i.e. "workflow.yml"

Apologies @simleo. I had fixed it in the Project used in Autosubmit, but forgot to sync my workflow with the project (we need to run a command in Autosubmit when we want that).

I executed everything using the latest code from Git, and at least the author part was correct. I had not correctly understood the conformsTo, but now that should be fixed too, as well as the action's instrument.

There's also a problem with the workflow's outputs.

For this one I think I will need a bit more time to read it calmly (with ☕) and understand how this should be fixed. Will ping you once I have pushed a fix for this.

Thank you, and apologies for sending the wrong version and for attaching the file instead of pushing to this branch 😬 🙇

-Bruno

simleo · 2023-07-11T13:14:03Z

Thanks for the update, Bruno. A couple of things I've noticed in this version:

The startTime and endTime for the action are null (they were fine in the previous version)
In some entities you're using additionalType instead of encodingFormat, e.g.:

{
    "@id": "plot/a00d_20230606_1538.pdf",
    "@type": "File",
    "additionalType": "application/pdf",
    ...
}

Should be:

{
    "@id": "plot/a00d_20230606_1538.pdf",
    "@type": "File",
    "encodingFormat": "application/pdf",
    ...
}

In general, when you're adding MIME types and similar, the property you want is encodingFormat.

kinow · 2023-07-11T17:32:32Z

@simleo

The startTime and endTime for the action are null (they were fine in the previous version)

Ah! Good catch. I hadn't executed the workflow again, but I had cleaned the database so there were no start/end dates.

Does the specification define what implementations must do in this case? If someone tries to create an archive for a workflow that has not started, or is running, then should startTime be omitted if not started and endTime omitted if started but not ended?

In some entities you're using additionalType instead of encodingFormat, e.g.:
(...)
In general, when you're adding MIME types and similar, the property you want is encodingFormat.

Oh, #TIL. Easy to fix, let me try doing that now and pushing a change.

Thank you!!!

…eters, handle missing start/end times, and the additional/encoding of files

kinow · 2023-07-11T21:34:21Z

Hi @simleo ,

I created an isolated environment to re-generate the RO-Crate archive, so now there should be fewer issues with my updates here (hopefully).

I have also tried to address all your feedback. I hope I got the FormalParameters/ParameterValues, and input/result correct this time.

Cheers,
Bruno

simleo · 2023-07-12T08:12:53Z

Does the specification define what implementations must do in this case? If someone tries to create an archive for a workflow that has not started, or is running, then should startTime be omitted if not started and endTime omitted if started but not ended?

The specification (note that Process Run Crate is inherited by the other two profiles) says endTime SHOULD be present and startTime MAY be present; it does not say anything about when you should record such information, but you certainly cannot record it before it's available. Why would you try to create a Workflow Run RO-Crate for a workflow that hasn't started yet?

kinow · 2023-07-12T08:24:44Z

Thanks for the answer @simleo !

Why would you try to create a Workflow Run RO-Crate for a workflow that hasn't started yet?

I understand ideally you shouldn't do that. But in Autosubmit (and in other wfms) you have something like "runs" of a workflow, and you can "clean" or "restart" between these runs.

Normally the work flow with AS is:

1. Create workflow (autosubmit expid ...)
2. Sync the Project code from Git (autosubmit create and/or autosubmit refresh)
3. Run (autosubmit run)
4. Fix issues in configuration
5. Start over (autosubmit create)

It may happen that users are in the state above where they have prepared a new run, but not executed it (i.e. a step 6. would probably be autosubmit run again, or some command to inspect or validate the config...)

The archival functionality does not read past runs, only the current prepared run. And in this case there is no startTime nor endTime. However, there are traces for previous runs.

We allow users to archive experiments like this, since they are useful to visualize previous runs. Once such an experiment is unarchived, users would look at the current state and continue from where they stopped or, more likely, force a Git sync to start afresh and prepare a new run.

I will add a check in the function that archives RO-Crate to make sure it only runs when there is a startTime then, and alert users they must have at least initialized their workflow runs in order to archive as RO-Crate (they will be able to archive it anyway without the RO-Crate, using the traditional ZIP archives that we had).

I can add a check (as I said above) for that, but since the spec says SHOULD and MAY, I think it's fine if I just don't fill it as it's done currently?

Cheers

-Bruno

p.s. this is exactly what happened to me, I had the traces, but no current execution information as I had restarted the experiment run

simleo · 2023-07-12T09:22:33Z

I can add a check (as I said above) for that, but since the spec says SHOULD and MAY, I think it's fine if I just don't fill it as it's done currently?

The important thing is not to add properties with a null value. If the info is not available, then don't add the property at all.
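As a minimal sketch of that rule, a small filter can drop unset properties before the entity is written. The helper name without_nulls is hypothetical.

```python
def without_nulls(entity):
    """Return a copy of the entity without properties whose value is None."""
    return {key: value for key, value in entity.items() if value is not None}


# A run that was prepared but never executed has no start/end times.
action = without_nulls({
    "@id": "#create-action",
    "@type": "CreateAction",
    "instrument": {"@id": "workflow.yml"},
    "startTime": None,
    "endTime": None,
})
print(action)
```

The resulting action simply has no startTime or endTime keys, instead of carrying null values into the JSON-LD.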

kinow · 2023-07-12T09:31:04Z

I can add a check (as I said above) for that, but since the spec says SHOULD and MAY, I think it's fine if I just don't fill it as it's done currently?

The important thing is not to add properties with a null value. If the info is not available, then don't add the property at all.

Brilliant. Then I don't have to change anything now as I did it last night 👍 https://earth.bsc.es/gitlab/es/autosubmit/-/merge_requests/317/diffs?commit_id=560051e7fa68fe57750551e182cbbffda09d8479

simleo · 2023-07-12T10:47:25Z

There are now problems with FormalParameter instances:

Their @id does not start with #. It has to start with # since these are "internal" contextual entities. The current crate breaks runcrate report because ro-crate-py automatically adds a leading # when the entities are added, but then the exampleOfWork link does not work anymore. This problem was not present in the previous version.
encodingFormat is being used where additionalType should be used instead. Again, this problem was not present in the previous version.

For instance, this:

{
    "@id": "DEFAULT.EXPID-param",
    "@type": "FormalParameter",
    "encodingFormat": "String",
    "name": "DEFAULT.EXPID",
    "valueRequired": "True"
}

Should be:

{
    "@id": "#DEFAULT.EXPID-param",
    "@type": "FormalParameter",
    "additionalType": "String",
    "name": "DEFAULT.EXPID",
    "valueRequired": "True"
}
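If the IDs are fixed after the fact, the references have to be renamed together with the entities, otherwise the exampleOfWork links break in the way described above. The following is a sketch under that assumption; the function name is hypothetical and it only handles FormalParameter entities.

```python
def add_hash_to_formal_parameters(graph):
    """Prefix FormalParameter IDs with '#' and update every reference to them."""
    renamed = {}
    for entity in graph:
        types = entity["@type"] if isinstance(entity["@type"], list) else [entity["@type"]]
        if "FormalParameter" in types and not entity["@id"].startswith("#"):
            renamed[entity["@id"]] = "#" + entity["@id"]

    def rewrite(value):
        # Walk nested dicts and lists, renaming any "@id" found in the map.
        if isinstance(value, dict):
            return {
                key: renamed.get(item, item) if key == "@id" else rewrite(item)
                for key, item in value.items()
            }
        if isinstance(value, list):
            return [rewrite(item) for item in value]
        return value

    return [rewrite(entity) for entity in graph]
```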

simleo · 2023-07-12T12:35:59Z

Uh, by the way. String is not an RO-Crate type. You should use Text instead. So the above entity becomes:

{
    "@id": "#DEFAULT.EXPID-param",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "DEFAULT.EXPID",
    "valueRequired": "True"
}

kinow · 2023-07-12T13:48:09Z

Uh, by the way. String is not an RO-Crate type. You should use Text instead. So the above entity becomes:
{
    "@id": "#DEFAULT.EXPID-param",
    "@type": "FormalParameter",
    "additionalType": "Text",
    "name": "DEFAULT.EXPID",
    "valueRequired": "True"
}

Ah, thanks @simleo !

I had left a note for later in this part of the Autosubmit rocrate code (fixed now after your feedback):

I thought it would be a good feature in ro-crate-py to provide a function/map of types between Python and what's expected in RO-Crate. WDYT?

kinow · 2023-07-12T13:49:31Z

Will try fixing the additionalType & contentType and re-generate everything from scratch to push a new commit.

I will try runcrate report too before pinging you again. I think I tried that at some point but haven't tried lately (maybe that'd be helpful to @jmfernandez as he asked about something to validate RO-Crates).

Thanks @simleo !

simleo · 2023-07-12T14:25:26Z

I thought it would be a good feature in ro-crate-py to provide a function/map of types between Python and what's expected in RO-Crate. WDYT?

There's no such thing as "what's expected in RO-Crate" I'm afraid. I'm referring to these recommendations for mapping CWL types in the WRROC web site (sorry for not mentioning that before). In principle you could use a different convention for Autosubmit, but I think Text is the appropriate Schema.org match for strings.
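On the Autosubmit side, such a convention could be as small as a lookup table. The mapping below is an assumption that follows the same idea as the CWL recommendations (Schema.org data types); it is not part of ro-crate-py, and the function name is hypothetical.

```python
# Assumed convention: Python scalar types mapped to Schema.org data types.
SCHEMA_ORG_TYPES = {
    bool: "Boolean",
    int: "Integer",
    float: "Float",
    str: "Text",
}


def additional_type(value):
    """Return the Schema.org type name for a Python value, or None if unmapped.

    The lookup uses the exact type, so True maps to Boolean rather than
    Integer even though bool is a subclass of int.
    """
    return SCHEMA_ORG_TYPES.get(type(value))
```

When the function returns None, the caller would leave additionalType out rather than write a null.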

Will try fixing the additionalType & contentType

contentType?

kinow · 2023-07-12T14:35:54Z

Will try fixing the additionalType & contentType

contentType?

Sorry, encodingFormat.

kinow · 2023-07-12T17:56:14Z

Their @id does not start with #. It has to start with # since these are "internal" contextual entities. The current crate breaks runcrate report because ro-crate-py automatically adds a leading # when the entities are added, but then the exampleOfWork link does not work anymore. This problem was not present in the previous version.

I ran pip install runcrate and then tested it by pointing it to my working copy of this repo & branch:

$ runcrate report ~/Development/python/workspace/workflow-run-crate/docs/examples/autosubmit/auto-mhm-test-domains/
action: #create-action
  instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
  started: 2023-07-11T23:15:52
  ended: 2023-07-11T23:15:30
  inputs:
    a000Traceback (most recent call last):
  File "/home/kinow/mambaforge/envs/autosubmit4/bin/runcrate", line 8, in <module>
    sys.exit(cli())
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/cli.py", line 87, in report
    dump_crate_actions(crate)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/report.py", line 79, in dump_crate_actions
    dump_action(a, control_action=ca, f=f)
  File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/report.py", line 48, in dump_action
    f.write(f" <- {p.id}")
AttributeError: 'str' object has no attribute 'id'

Good thing you were able to tell it was because of the missing #, @simleo , because I would have no idea from looking just at the log output 😅
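A check for references that point at no entity could surface this kind of problem with a clearer message before running runcrate report. This is a sketch, not part of runcrate or ro-crate-py; the function name and the default list of properties are assumptions.

```python
def dangling_references(
    graph,
    properties=("exampleOfWork", "instrument", "object", "result", "input", "output"),
):
    """List (entity id, property, target id) for references with no matching entity."""
    known = {entity["@id"] for entity in graph}
    missing = []
    for entity in graph:
        for prop in properties:
            value = entity.get(prop)
            values = value if isinstance(value, list) else [value]
            for item in values:
                # Only {"@id": ...} dictionaries are references.
                if isinstance(item, dict) and "@id" in item and item["@id"] not in known:
                    missing.append((entity["@id"], prop, item["@id"]))
    return missing
```

With a parameter value whose exampleOfWork points at DEFAULT.EXPID-param while the entity is stored as #DEFAULT.EXPID-param, the function reports that link as dangling.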

I fixed the ids, reran everything, and here's the output of the runcrate report:

$ runcrate report ~/autosubmit/a000/
action: #create-action
  instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
  started: 2023-07-12T19:51:16
  ended: 2023-07-12T19:51:04
  inputs:
    a000 <- #DEFAULT.EXPID-param
    local <- #DEFAULT.HPCARCH-param
    /home/kinow/autosubmit/a000/proj/git_project/conf/bootstrap <- #DEFAULT.CUSTOM_CONFIG-param
    19910101 19930101 <- #EXPERIMENT.DATELIST-param
    standard <- #EXPERIMENT.CALENDAR-param
    0 <- #EXPERIMENT.CHUNKSIZE-param
    0 <- #EXPERIMENT.NUMCHUNKS-param
    year <- #EXPERIMENT.CHUNKSIZEUNIT-param
    fc0 <- #EXPERIMENT.MEMBERS-param
    4.0.0b <- #CONFIG.AUTOSUBMIT_VERSION-param
    20 <- #CONFIG.TOTALJOBS-param
    20 <- #CONFIG.MAXWAITINGJOBS-param
    git <- #PROJECT.PROJECT_TYPE-param
    git_project <- #PROJECT.PROJECT_DESTINATION-param
    https://github.com/kinow/auto-mhm-test-domains.git <- #GIT.PROJECT_ORIGIN-param
    rocrate <- #GIT.PROJECT_BRANCH-param
     <- #GIT.PROJECT_COMMIT-param
     <- #GIT.PROJECT_SUBMODULES-param
    True <- #GIT.FETCH_SINGLE_BRANCH-param
    develop <- #MHM.BRANCH_NAME-param
    1 <- #MHM.DOMAIN-param
    2 <- #MHM.EVAL_PERIOD_DURATION_YEARS-param
  outputs:
    proj/git_project/docs/plot_1993_1995.gif <- #proj/git_project/docs/plot_1993_1995.gif-param
    proj/git_project/docs/plot_1991_1993.gif <- #proj/git_project/docs/plot_1991_1993.gif-param

https://earth.bsc.es/gitlab/es/autosubmit/-/commit/20213eda3a2060dc3df4d4325e8ea9005cdb67b9

kinow

After pushing the latest commits, here's the result of runcrate report:

$ runcrate report ~/Development/python/workspace/workflow-run-crate/docs/examples/autosubmit/auto-mhm-test-domains/
action: #create-action
  instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
  started: 2023-07-12T20:08:46
  ended: 2023-07-12T20:08:34
  inputs:
    a000 <- #DEFAULT.EXPID-param
    local <- #DEFAULT.HPCARCH-param
    /home/kinow/autosubmit/a000/proj/git_project/conf/bootstrap <- #DEFAULT.CUSTOM_CONFIG-param
    19910101 19930101 <- #EXPERIMENT.DATELIST-param
    standard <- #EXPERIMENT.CALENDAR-param
    0 <- #EXPERIMENT.CHUNKSIZE-param
    0 <- #EXPERIMENT.NUMCHUNKS-param
    year <- #EXPERIMENT.CHUNKSIZEUNIT-param
    fc0 <- #EXPERIMENT.MEMBERS-param
    4.0.0b <- #CONFIG.AUTOSUBMIT_VERSION-param
    20 <- #CONFIG.TOTALJOBS-param
    20 <- #CONFIG.MAXWAITINGJOBS-param
    git <- #PROJECT.PROJECT_TYPE-param
    git_project <- #PROJECT.PROJECT_DESTINATION-param
    https://github.com/kinow/auto-mhm-test-domains.git <- #GIT.PROJECT_ORIGIN-param
    rocrate <- #GIT.PROJECT_BRANCH-param
     <- #GIT.PROJECT_COMMIT-param
     <- #GIT.PROJECT_SUBMODULES-param
    True <- #GIT.FETCH_SINGLE_BRANCH-param
    develop <- #MHM.BRANCH_NAME-param
    1 <- #MHM.DOMAIN-param
    2 <- #MHM.EVAL_PERIOD_DURATION_YEARS-param
  outputs:
    proj/git_project/docs/plot_1993_1995.gif <- #proj/git_project/docs/plot_1993_1995.gif-param
    proj/git_project/docs/plot_1991_1993.gif <- #proj/git_project/docs/plot_1991_1993.gif-param

kinow · 2023-07-12T18:13:57Z

docs/examples/autosubmit/auto-mhm-test-domains/ro-crate-metadata.json

+            "additionalType": "Text",
+            "name": "DEFAULT.EXPID",
+            "valueRequired": "True"
+        },


The IDs of the FormalParameters are fixed, @simleo. ☝️

They also now use additionalType: "Text" instead of encodingFormat: "String".
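For reference, the corrected shape of one FormalParameter entity can be sketched as a small Python helper. This is only an illustration, not the Autosubmit implementation: the function name is hypothetical, the "#<name>-param" ID pattern is taken from the runcrate report output, and the string-valued valueRequired mirrors the diff above.

```python
import json


def formal_parameter(name, additional_type="Text", required=True):
    """Build a FormalParameter entity as a plain dict (hypothetical helper)."""
    return {
        # The leading '#' makes this a local identifier within the crate.
        "@id": f"#{name}-param",
        "@type": "FormalParameter",
        # additionalType carries the data type; encodingFormat is not used here.
        "additionalType": additional_type,
        "name": name,
        # Kept as a string to mirror the diff above.
        "valueRequired": "True" if required else "False",
    }


print(json.dumps(formal_parameter("DEFAULT.EXPID"), indent=4))
```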

simleo · 2023-07-13T12:40:06Z

OK, the metadata looks conformant now.

kinow · 2023-07-13T12:52:05Z

Thank you @simleo !!!

kinow requested a review from simleo June 7, 2023 10:01

kinow mentioned this pull request Jun 27, 2023

add ro-crate configuration kinow/auto-mhm-test-domains#12

Merged

Adding RO-Crate for the Autosubmit workflow that uses mHM and their test data domains

fd9b130

kinow force-pushed the add-autosubmit-rocrate branch from e9de112 to fd9b130 Compare July 10, 2023 13:58

Re-generate RO-Crate after fixing conformsTo, and author/publisher.

1c2812e

kinow force-pushed the add-autosubmit-rocrate branch from 1c8c56b to 1c2812e Compare July 10, 2023 15:34

Re-run everything from scratch, try to fix the input and output parameters, handle missing start/end times, and the additional/encoding of files

78a2c83

kinow force-pushed the add-autosubmit-rocrate branch from b35221e to 78a2c83 Compare July 11, 2023 21:26

Correct FormalParameters (ID and encodingFormat/additionalType), use Text instead of String).

3076526

simleo merged commit 2d4a5f5 into ResearchObject:main Jul 13, 2023

kinow deleted the add-autosubmit-rocrate branch July 13, 2023 17:08
