Adding RO-Crate for the Autosubmit workflow that uses mHM and their test data domains #61
There are several problems with the crate:
I'm attaching a "corrected" version of the metadata file for comparison. I could not add the |
Some progress today. Found a possible bug in Autosubmit which may require a hotfix or a workaround to retrieve the workflow status programmatically. Will complete the other pending items tomorrow or next week.
https://earth.bsc.es/gitlab/es/autosubmit/-/issues/1058
Updated the instrument to point to |
Hi @simleo , About this last item
I'm thinking how to handle this one. I understand that it would make sense to include the Here's an example of what a template script like that could look like:
#!/bin/bash
set -e -o pipefail
source ../hpc-platforms/mn4/load_modules_1.0.sh
run_model.sh %SOME_CONFIGURATION_PARAM%
The tricky part is that IOW, it would be very hard to include every file used to run the script, even after pre-processing the file (when Autosubmit replaces variables like The link that we have between the workflow configuration and the workflow template scripts is the project type and the commit of the project used (which I will have to confirm in which log file that's logged). I could add these files to the archive, but I feel like they would be misleading to users as they wouldn't, in fact, be really helpful to know what was used to run the task commands/scripts.
I hope that makes sense? Any thoughts about this last pending item? Thanks!
If it's not a matter of linking to single isolated scripts, things get more complicated. In the prediction pipeline crate, the CWL workflow file points to Docker images with the required environment. I don't know if Autosubmit has the equivalent of CWL's |
Unfortunately it doesn't have an equivalent. It only executes a Shell script - and the Shell script can use variables from the workflow configuration like
I can add automatically the template scripts you mentioned above, but the main issue is that I am not able to automatically detect every script used by users (and I suspect most users are not aware either, since large workflows normally have multiple groups working, importing submodules from other projects). These templates are located in the
That'd be great! 🙏 🙇 Thanks @simleo ! |
Adding what can be added is good. If not every script down the chain is included in the crate, however, there should be some kind of references to the submodules that have to be imported, so that at least someone with access to the system can reproduce the computation. I mean, if the RO-Crate is not portable to a different system, there has to be at least enough information for it to be meaningful in the same system where it was created. |
Yes, that's what I was suggesting by adding the Git repository information. The tasks/jobs in Autosubmit need a Shell script to be executed. But to specify this shell script, users must add a project. An Autosubmit project is configured like this:
PROJECT:
  TYPE: git # or local, or subversion...
GIT:
  PROJECT_ORIGIN: "https://github.com/kinow/auto-mhm-test-domains.git"
  PROJECT_BRANCH: "rocrate"
  PROJECT_COMMIT: ""
  PROJECT_SUBMODULES: ""
  FETCH_SINGLE_BRANCH: True
And the logs must contain the git commit hash. So given an Autosubmit RO-Crate file, users must be able to read the configuration, and find the reference to where to locate the shell scripts (git, local folder, or subversion). And to re-run the workflow, users would have to unzip the folder, have a compatible version of Autosubmit, run The first command name,
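To make the lookup described above concrete, here is a minimal sketch of how a consumer might find where the template scripts live once the configuration has been loaded into a dictionary. The function name `describe_project_source` is hypothetical and not part of Autosubmit; the sketch also assumes the project type may appear under either `TYPE` (as in the snippet above) or `PROJECT_TYPE`, and it only handles the `git` case.

```python
from typing import Any, Dict


def describe_project_source(config: Dict[str, Any]) -> Dict[str, str]:
    """Summarize where the template scripts of a project are hosted.

    Hypothetical helper: `config` is assumed to be the workflow
    configuration already parsed into nested dictionaries.
    """
    project = config.get("PROJECT", {})
    # Assumption: accept both spellings of the project type key.
    project_type = str(project.get("PROJECT_TYPE", project.get("TYPE", ""))).lower()
    if project_type == "git":
        git = config.get("GIT", {})
        return {
            "type": "git",
            "origin": git.get("PROJECT_ORIGIN", ""),
            "branch": git.get("PROJECT_BRANCH", ""),
            # May be empty in the configuration; the logs would then be
            # the place to look for the commit hash.
            "commit": git.get("PROJECT_COMMIT", ""),
        }
    # Other project types (local, subversion...) are not modelled here.
    return {"type": project_type}


example_config = {
    "PROJECT": {"TYPE": "git"},
    "GIT": {
        "PROJECT_ORIGIN": "https://github.com/kinow/auto-mhm-test-domains.git",
        "PROJECT_BRANCH": "rocrate",
        "PROJECT_COMMIT": "",
    },
}
print(describe_project_source(example_config))
```

An empty `commit` value here mirrors the configuration above, where the commit is left blank and has to be recovered from elsewhere.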
Folks, sorry I completely missed this thread. Next time maybe you can ping me through Slack. What I do in COMPSs is let the user decide what to include in the crate, mainly through 2 terms in the ro-crate-info.yaml that we require: the "files" and the "sources_dir". Essentially, for each directory specified in "sources_dir", everything on it is added recursively. The "files" term is more about adding one by one any files wanted. You can find more detailed information at: https://compss-doc.readthedocs.io/en/stable/Sections/05_Tools/04_Workflow_Provenance.html#previous-needed-information |
Hi @rsirvent
Not a problem, will ping you over there next time 🙂
That's similar to what we implemented in Autosubmit. Asking users to provide globs, and then we locate files and create the entities into the JSON-LD metadata using ro-crate-py. Thanks @rsirvent ! |
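The glob-based approach mentioned above could look roughly like the sketch below. It uses only the standard library and builds plain JSON-LD dictionaries, so it is not the Autosubmit implementation and does not use the ro-crate-py API; the function name `files_to_entities` and the example patterns are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def files_to_entities(root: Path, patterns: Iterable[str]) -> List[Dict[str, str]]:
    """Expand user-provided globs under `root` into File entities.

    Hypothetical helper: each matching file becomes a JSON-LD dictionary
    whose @id is the path relative to the crate root.
    """
    entities: Dict[str, Dict[str, str]] = {}
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                relative = path.relative_to(root).as_posix()
                # Keyed by @id so overlapping globs do not create duplicates.
                entities[relative] = {
                    "@id": relative,
                    "@type": "File",
                    "name": path.name,
                }
    return [entities[key] for key in sorted(entities)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "plot").mkdir()
        (root / "plot" / "example.pdf").write_text("dummy")
        (root / "notes.txt").write_text("dummy")
        for entity in files_to_entities(root, ["plot/*.pdf", "*.txt"]):
            print(entity)
```

Deduplicating by `@id` keeps the metadata valid when two patterns match the same file.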
Just finished updating the Autosubmit branch in GitLab to add the Autosubmit Project as a I first tried to use SoftwareHeritage persistent ID's, but the main issue I had is that for most of the workflows I have the repositories are private and won't probably ever be indexed by SoftwareHeritage. I looked at the other suggestions but couldn't find something simple that would work with GitLab/GitHub/local git repo/local directory/subversion. Then reading the The project is now added as I found a bug in one of the dependencies of Autosubmit that's preventing me from generating the final RO-Crate (ro-crate-py 0.8.0 worked grand BTW), so from Monday I will send a patch for this dependency (it's internal to the BSC), and then start writing the unit tests. I believe by the end of this week (next week tops?) I should have the tests, the docs, and the code updated addressing the last point here. Then everything will be ready for review 👍 e.g.
{
  "@id": "#file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
  "@type": "SoftwareSourceCode",
  "abstract": "The Autosubmit project. It contains the templates used\n by Autosubmit for the scripts used in the workflow, as well as any other\n source code used by the scripts (i.e. any files sourced, or other source\n code compiled or executed in the workflow).",
  "codeRepository": "file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
  "codeSampleType": "template",
  "name": "#file:///home/kinow/Development/python/workspace/auto-mhm-test-domains/",
  "programmingLanguage": "Any",
  "runtimePlatform": "Autosubmit f4.0.0b",
  "sdDatePublished": "2023-07-02T11:08:17+00:00",
  "targetProduct": "Autosubmit",
  "version": ""
},
Following the
-Bruno
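As an illustration of how an entity like the one above might be assembled, here is a minimal sketch that builds the dictionary from a repository location, a commit, and an Autosubmit version. The function `project_entity` and its parameters are hypothetical; the property names are copied from the example above, and storing the commit hash in `version` is an assumption based on the later revision of this crate.

```python
from typing import Dict


def project_entity(code_repository: str, commit: str, autosubmit_version: str) -> Dict[str, str]:
    """Build a SoftwareSourceCode entity describing an Autosubmit project.

    Hypothetical helper mirroring the properties shown in the example
    above; it is not the code used by Autosubmit.
    """
    identifier = f"#{code_repository}"
    return {
        "@id": identifier,
        "@type": "SoftwareSourceCode",
        "codeRepository": code_repository,
        "codeSampleType": "template",
        "name": identifier,
        "programmingLanguage": "Any",
        "runtimePlatform": f"Autosubmit {autosubmit_version}",
        "targetProduct": "Autosubmit",
        # Assumption: the commit hash identifies the exact project state.
        "version": commit,
    }


entity = project_entity(
    "https://github.com/kinow/auto-mhm-test-domains.git",
    "9ab36021b2f342e6a5d3a9fd74e5f8b9e25cc51a",
    "4.0.0b",
)
print(entity["@id"], entity["version"])
```

Because the entity only points at the repository, the templates themselves can stay private while the crate is shared.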
What will this |
Hmmm, good question. I don't believe there's a similar use case from another workflow manager that I could copy. Perhaps under
Consumers of Autosubmit configuration will have to understand the basics of how Autosubmit works to use RO-Crates, I am afraid. Assuming a user knows how to read and run Autosubmit workflows, upon getting an RO-Crate I assume they would look at the "Project" and its "Project Type", and then look for the "Git" or "Local" configuration.
Exactly. Autosubmit users wouldn't expect that the templates are part of the workflow. They are part of the Project, and the Project is copied by Autosubmit to a working directory. The workflow configuration says which template is used in each task of the workflow. Autosubmit users know that the template is hosted elsewhere (Git normally, but also local paths in an HPC). I thought about it for a while, and that's the best way I could find to model what we have in Autosubmit in Schema.org/RO-Crate profiles. And this works well for us because this way we are able to distribute RO-Crates for workflows where the configuration is fine to be shared with others. But the templates (and possibly confidential information about the applications being executed, like weather/climate models) remain private, unless the user has access to the "Project" link being shared.
Ah, never noticed the distinction. So Thanks @simleo ! |
Also, these "Projects" can contain large files, Singularity containers, etc., used as part of the workflow -- invoked somewhere in the main shell script, or in one of the many Autosubmit does not keep track of the data used by workflow tasks (as Snakemake or CWL do, for instance), and it also doesn't have the granularity to point to which tools are being used in a task. In our configuration model, we only know that a Shell script (or R or Python) is copied from a Project and is executed for a task in the workflow, and what the dependencies for this task are.
Hi @simleo, I finished
Over the next week I will try to start writing the unit tests. I've attached the RO-Crate file so you can have a look (if you have time). Thanks! 👋 |
Rather than posting an ro-crate metadata file you should push a new commit containing all changes when you have a new version of the RO-Crate. Anyway, I downloaded the metadata file and several of the problems I previously reported are still present:
There's also a problem with the workflow's outputs. The
{
  "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-pv",
  "@type": "File",
  "exampleOfWork": {
    "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-param"
  },
  "name": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif",
  "value": 2
}
They should look like:
{
  "@id": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif",
  "@type": "File",
  "exampleOfWork": {
    "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-param"
  },
  "name": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif"
}
The
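The difference between the two entities above can be expressed as a small transformation. This is a minimal sketch, assuming the only changes needed are the ones visible in the comparison: drop the leading `#` and the `-pv` suffix from the `@id`, and remove `value`. The function name `as_output_file` is hypothetical.

```python
from typing import Any, Dict


def as_output_file(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Turn a value-style output entity into a plain File data entity."""
    # A File entity describes the file itself, so it carries no "value".
    fixed = {key: val for key, val in entity.items() if key != "value"}
    identifier = fixed["@id"]
    if identifier.startswith("#"):
        identifier = identifier[1:]
    if identifier.endswith("-pv"):
        identifier = identifier[: -len("-pv")]
    fixed["@id"] = identifier
    return fixed


wrong = {
    "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-pv",
    "@type": "File",
    "exampleOfWork": {
        "@id": "#/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif-param"
    },
    "name": "/home/kinow/autosubmit/a00a/proj/git_project/docs/plot_1993_1995.gif",
    "value": 2,
}
print(as_output_file(wrong))
```

The `exampleOfWork` link to the parameter is left untouched, as in the corrected example.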
Oh d'oh! Forgot this was a pull request and not an issue. Sorry @simleo!
e9de112 to fd9b130
1c8c56b to 1c2812e
"sdDatePublished": "2023-07-10T15:33:06+00:00",
"targetProduct": "Autosubmit",
"version": "9ab36021b2f342e6a5d3a9fd74e5f8b9e25cc51a"
},
This ☝️ is the Autosubmit project, linking the workflow configuration with the project configuration (that contains the template scripts used to execute the workflow, for example).
It is a part of "@id": "workflow.yml" (under hasPart)
Apologies @simleo. I had fixed it in the Project used in Autosubmit, but forgot to sync my workflow with the project (we need to run a command in Autosubmit when we want that). I executed everything using the latest code from Git, and at least the author part was correct. I had not correctly understood the
For this one I think I will need a bit more of time to read it with calm and ☕ and understand how this should be fixed. Will ping you once I have pushed a fix for this. Thank you, and apologies for sending the wrong version and for attaching the file instead of pushing to this branch 😬 🙇 -Bruno |
Thanks for the update, Bruno. A couple of things I've noticed in this version:
{
  "@id": "plot/a00d_20230606_1538.pdf",
  "@type": "File",
  "additionalType": "application/pdf",
  ...
}
Should be:
{
  "@id": "plot/a00d_20230606_1538.pdf",
  "@type": "File",
  "encodingFormat": "application/pdf",
  ...
}
In general, when you're adding MIME types and similar, the property you want is
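The correction above can be sketched with Python's standard `mimetypes` module, which guesses a MIME type from a file name. The helper name `file_entity` is hypothetical; the only point it illustrates is that the guessed MIME type goes into `encodingFormat` rather than `additionalType`.

```python
import mimetypes
from typing import Dict


def file_entity(path: str) -> Dict[str, str]:
    """Build a File entity, recording the MIME type in encodingFormat."""
    entity = {"@id": path, "@type": "File"}
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is not None:
        # Leave the property out entirely when the type cannot be guessed.
        entity["encodingFormat"] = mime_type
    return entity


print(file_entity("plot/a00d_20230606_1538.pdf"))
```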
Ah! Good catch. I hadn't executed the workflow again, but I had cleaned the database so there were no start/end dates. Does the specification define what implementations must do in this case? If someone tries to create an archive for a workflow that has not started, or is running, then should
Oh, #TIL. Easy to fix, let me try doing that now and pushing a change. Thank you!!! |
…eters, handle missing start/end times, and the additional/encoding of files
b35221e to 78a2c83
Hi @simleo, I created an isolated environment to re-generate the RO-Crate archive, so now there should be fewer issues with my updates here (hopefully). I have also tried to address all your feedback. I hope I got the FormalParameters/ParameterValues, and input/result correct this time. Cheers,
The specification (note that Process Run Crate is inherited by the other two profiles) says |
Thanks for the answer @simleo !
I understand ideally you shouldn't do that. But in Autosubmit (and in other WfMSs) you have something like "runs" of a workflow, and you can "clean" or "restart" between these runs. Normally the workflow with AS is:
It may happen that users are in the state above where they have prepared a new run, but not executed it (i.e. a step 6. would probably be The archival functionality does not read past runs, only the current prepared run. And in this case there is no We allow users to archive experiments like this, since they are useful to visualize previous runs. Once such experiment is unarchived, users would look at the current state and continue from where they stopped or, more likely, force a Git sync to start afresh and prepare a new run.
I can add a check (as I said above) for that, but since the spec says SHOULD and MAY, I think it's fine if I just don't fill it as it's done currently? Cheers -Bruno p.s. this is exactly what happened to me, I had the traces, but no current execution information as I had restarted the experiment run |
The important thing is not to add properties with a |
Brilliant. Then I don't have to change anything now as I did it last night 👍 https://earth.bsc.es/gitlab/es/autosubmit/-/merge_requests/317/diffs?commit_id=560051e7fa68fe57750551e182cbbffda09d8479 |
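The approach settled on in this exchange, leaving the times unfilled when the run has no start or end information, could be sketched as follows. This is not the code from the linked merge request; the function name `create_action` is hypothetical, and the sketch assumes `startTime` and `endTime` are the properties in question.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def create_action(start: Optional[datetime], end: Optional[datetime]) -> Dict[str, str]:
    """Build a CreateAction entity, omitting times that are not known."""
    action = {"@id": "#create-action", "@type": "CreateAction"}
    # Only add a property when there is a real value for it.
    if start is not None:
        action["startTime"] = start.isoformat()
    if end is not None:
        action["endTime"] = end.isoformat()
    return action


# A run that was prepared but never executed: no times at all.
print(create_action(None, None))
# A run with a known start time only.
print(create_action(datetime(2023, 7, 12, 19, 51, 4, tzinfo=timezone.utc), None))
```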
There are now problems with
For instance, this:
{
  "@id": "DEFAULT.EXPID-param",
  "@type": "FormalParameter",
  "encodingFormat": "String",
  "name": "DEFAULT.EXPID",
  "valueRequired": "True"
}
Should be:
{
  "@id": "#DEFAULT.EXPID-param",
  "@type": "FormalParameter",
  "additionalType": "String",
  "name": "DEFAULT.EXPID",
  "valueRequired": "True"
}
Uh, by the way.
{
  "@id": "#DEFAULT.EXPID-param",
  "@type": "FormalParameter",
  "additionalType": "Text",
  "name": "DEFAULT.EXPID",
  "valueRequired": "True"
}
Ah, thanks @simleo ! I had left a note for later in this part of the Autosubmit rocrate code (fixed now after your feedback): I thought it would be a good feature in ro-crate-py to provide a function/map of types between Python and what's expected in RO-Crate. WDYT? |
Will try fixing the additionalType & contentType and re-generate everything from scratch to push a new commit. I will try Thanks @simleo ! |
There's no such thing as "what's expected in RO-Crate" I'm afraid. I'm referring to these recommendations for mapping CWL types in the WRROC web site (sorry for not mentioning that before). In principle you could use a different convention for Autosubmit, but I think
contentType? |
Sorry, encodingFormat. |
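The type map suggested earlier in this thread might look like the sketch below. It is not part of ro-crate-py. Only the `Text` mapping is confirmed by this conversation; the `Integer`, `Float`, and `Boolean` names are assumed to follow the same convention, and falling back to `Text` for unknown types is a further assumption. Both function names are hypothetical.

```python
from typing import Any, Dict

# Assumed mapping from Python types to additionalType values.
ADDITIONAL_TYPES = {
    str: "Text",
    bool: "Boolean",
    int: "Integer",
    float: "Float",
}


def additional_type(value: Any) -> str:
    """Return the additionalType name for a configuration value."""
    # Looking up the exact type keeps True/False from matching int.
    return ADDITIONAL_TYPES.get(type(value), "Text")


def formal_parameter(name: str, value: Any) -> Dict[str, str]:
    """Build a FormalParameter entity with a '#'-prefixed identifier."""
    return {
        "@id": f"#{name}-param",
        "@type": "FormalParameter",
        "additionalType": additional_type(value),
        "name": name,
    }


print(formal_parameter("DEFAULT.EXPID", "a000"))
print(formal_parameter("CONFIG.TOTALJOBS", 20))
```

The `#` prefix matches the corrected identifiers discussed above, since a formal parameter is a contextual entity rather than a file in the crate.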
$ runcrate report ~/Development/python/workspace/workflow-run-crate/docs/examples/autosubmit/auto-mhm-test-domains/
action: #create-action
instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
started: 2023-07-11T23:15:52
ended: 2023-07-11T23:15:30
inputs:
a000Traceback (most recent call last):
File "/home/kinow/mambaforge/envs/autosubmit4/bin/runcrate", line 8, in <module>
sys.exit(cli())
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/cli.py", line 87, in report
dump_crate_actions(crate)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/report.py", line 79, in dump_crate_actions
dump_action(a, control_action=ca, f=f)
File "/home/kinow/mambaforge/envs/autosubmit4/lib/python3.9/site-packages/runcrate/report.py", line 48, in dump_action
f.write(f" <- {p.id}")
AttributeError: 'str' object has no attribute 'id'
Good thing you were able to tell it was because of the missing I fixed the ids, reran everything, and here's the output of the runcrate report:
$ runcrate report ~/autosubmit/a000/
action: #create-action
instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
started: 2023-07-12T19:51:16
ended: 2023-07-12T19:51:04
inputs:
a000 <- #DEFAULT.EXPID-param
local <- #DEFAULT.HPCARCH-param
/home/kinow/autosubmit/a000/proj/git_project/conf/bootstrap <- #DEFAULT.CUSTOM_CONFIG-param
19910101 19930101 <- #EXPERIMENT.DATELIST-param
standard <- #EXPERIMENT.CALENDAR-param
0 <- #EXPERIMENT.CHUNKSIZE-param
0 <- #EXPERIMENT.NUMCHUNKS-param
year <- #EXPERIMENT.CHUNKSIZEUNIT-param
fc0 <- #EXPERIMENT.MEMBERS-param
4.0.0b <- #CONFIG.AUTOSUBMIT_VERSION-param
20 <- #CONFIG.TOTALJOBS-param
20 <- #CONFIG.MAXWAITINGJOBS-param
git <- #PROJECT.PROJECT_TYPE-param
git_project <- #PROJECT.PROJECT_DESTINATION-param
https://github.com/kinow/auto-mhm-test-domains.git <- #GIT.PROJECT_ORIGIN-param
rocrate <- #GIT.PROJECT_BRANCH-param
<- #GIT.PROJECT_COMMIT-param
<- #GIT.PROJECT_SUBMODULES-param
True <- #GIT.FETCH_SINGLE_BRANCH-param
develop <- #MHM.BRANCH_NAME-param
1 <- #MHM.DOMAIN-param
2 <- #MHM.EVAL_PERIOD_DURATION_YEARS-param
outputs:
proj/git_project/docs/plot_1993_1995.gif <- #proj/git_project/docs/plot_1993_1995.gif-param
proj/git_project/docs/plot_1991_1993.gif <- #proj/git_project/docs/plot_1991_1993.gif-param
https://earth.bsc.es/gitlab/es/autosubmit/-/commit/20213eda3a2060dc3df4d4325e8ea9005cdb67b9
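The earlier traceback suggests that a reference to a parameter could not be resolved to an entity. One way to catch this kind of mismatch before running a report is a small pre-check over the metadata graph. This is a minimal sketch under that assumption; the function `dangling_references` is hypothetical and is not part of runcrate or ro-crate-py, and the `-pv` identifier in the example is made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def dangling_references(
    graph: List[Dict[str, Any]], prop: str = "exampleOfWork"
) -> List[Tuple[str, str]]:
    """List (entity id, target id) pairs whose target is not in the graph."""
    known = {entity["@id"] for entity in graph}
    missing = []
    for entity in graph:
        references = entity.get(prop, [])
        if isinstance(references, dict):
            references = [references]
        for reference in references:
            if (
                isinstance(reference, dict)
                and "@id" in reference
                and reference["@id"] not in known
            ):
                missing.append((entity["@id"], reference["@id"]))
    return missing


graph = [
    # Parameter declared without the leading '#'...
    {"@id": "DEFAULT.EXPID-param", "@type": "FormalParameter"},
    # ...but referenced with it, so the link does not resolve.
    {
        "@id": "#DEFAULT.EXPID-pv",
        "@type": "PropertyValue",
        "exampleOfWork": {"@id": "#DEFAULT.EXPID-param"},
    },
]
print(dangling_references(graph))
```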
use Text instead of String).
After pushing the latest commits, here's the result of runcrate report:
$ runcrate report ~/Development/python/workspace/workflow-run-crate/docs/examples/autosubmit/auto-mhm-test-domains/
action: #create-action
instrument: workflow.yml (['File', 'SoftwareSourceCode', 'ComputationalWorkflow'])
started: 2023-07-12T20:08:46
ended: 2023-07-12T20:08:34
inputs:
a000 <- #DEFAULT.EXPID-param
local <- #DEFAULT.HPCARCH-param
/home/kinow/autosubmit/a000/proj/git_project/conf/bootstrap <- #DEFAULT.CUSTOM_CONFIG-param
19910101 19930101 <- #EXPERIMENT.DATELIST-param
standard <- #EXPERIMENT.CALENDAR-param
0 <- #EXPERIMENT.CHUNKSIZE-param
0 <- #EXPERIMENT.NUMCHUNKS-param
year <- #EXPERIMENT.CHUNKSIZEUNIT-param
fc0 <- #EXPERIMENT.MEMBERS-param
4.0.0b <- #CONFIG.AUTOSUBMIT_VERSION-param
20 <- #CONFIG.TOTALJOBS-param
20 <- #CONFIG.MAXWAITINGJOBS-param
git <- #PROJECT.PROJECT_TYPE-param
git_project <- #PROJECT.PROJECT_DESTINATION-param
https://github.com/kinow/auto-mhm-test-domains.git <- #GIT.PROJECT_ORIGIN-param
rocrate <- #GIT.PROJECT_BRANCH-param
<- #GIT.PROJECT_COMMIT-param
<- #GIT.PROJECT_SUBMODULES-param
True <- #GIT.FETCH_SINGLE_BRANCH-param
develop <- #MHM.BRANCH_NAME-param
1 <- #MHM.DOMAIN-param
2 <- #MHM.EVAL_PERIOD_DURATION_YEARS-param
outputs:
proj/git_project/docs/plot_1993_1995.gif <- #proj/git_project/docs/plot_1993_1995.gif-param
proj/git_project/docs/plot_1991_1993.gif <- #proj/git_project/docs/plot_1991_1993.gif-param
"additionalType": "Text",
"name": "DEFAULT.EXPID",
"valueRequired": "True"
},
IDs of the FormalParameter entities fixed, @simleo. ☝️
Also the additionalType: "Text", instead of encodingFormat: "String".
OK, the metadata looks conformant now.
Thank you @simleo !!!
This adds the first RO-Crate produced with Autosubmit using mHM, a mesoscale hydrological model: https://mhm.pages.ufz.de/mhm/stable/
They provide test data to run the model, which is downloaded and used in the workflow. The PR with RO-Crate included in the model workflow is this one: kinow/auto-mhm-test-domains#12
And this is the MR that adds RO-Crate to Autosubmit (under review): https://earth.bsc.es/gitlab/es/autosubmit/-/merge_requests/317
There were about 27 MB of log files in the tmp directory, which were truncated to save space in this repository (find . -type f -exec truncate -c -s 0 {} \;).
Thanks!
-Bruno