Recipe testing and comparison for release 2.7.0 #2881
Comments
@sloosvel I am in dire pain after realizing blithering DKRZ's SLURM emails me for every recipe 😵💫 |
@sloosvel what are these jobs up to?
|
You can comment that if it's not useful to you, to me it was!
I think there is a limit on the number of jobs an account can run simultaneously on Levante. They will be pending until other jobs finish, I guess |
On Levante, a user can't have more than 20 Slurm jobs running at a time. As soon as a job is finished, the next one should start |
Cheers! More emails then 🤦♂️ 🤣 |
OK guys - first (and only) recipe running session, 2022-10-26 13:13:41.568698
Successfully run recipes: 122 out of 127 (final)
Recipes that failed with DiagnosticError: 0 out of 127 (1 fixed, not PR-ed yet)
Recipes that failed because of Missing Data: 2 out of 127 (final)
Recipes that failed for other reasons: 3 out of 127 (final)
Obsolete/resolved issues comment: The Julia ones are totally my bad - forgot to install Julia after installing esmvaltool, the autoassess ones are either of the old bug that @alistairsellar is fixing now, or they need aux data that is only on JASMIN, the ones of Missing Data are bothering me badly - since I have turned on auto downloads but they are still missing data, what do you guys recommend doing about those? @sloosvel @remi-kazeroni @bouweandela ? I will post detailed postmortems for the ones that have failed for odd reasons below 👍 |
Postmortem of failed recipes OTHER THAN Missing Data
Recipes that failed with DiagnosticError: 0 out of 127 (1 fixed, not yet PR-ed)
Recipes that failed for other reasons or are still running: 1 out of 127
|
Hi @valeriupredoi, great job with the testing! I forgot to mention but we have a central pool of downloaded data on Levante at |
For this one, I would recommend using:
|
Indeed, cheers @remi-kazeroni - smpi is a memory gobbler - I restarted it on SLURM and promptly got kicked out coz mem limit (this time around I think all data has been downloaded, hence it went to intensive processing). I'll resubmit with mem reqs. What do you recommend about those that really-really are missing data? |
even with 512G still fails out of MEM 😮 |
oh crap, forgot to change the partition 😶🌫️ |
You can try with 1024G then! But that's the highest available |
totally user-side - forgot to change the partition to |
I never managed to run the smpi recipes, @remi-kazeroni did it for me in the last release. Maybe the batch script settings for this recipe can be changed in #2883 |
with correct SLURM settings as recommended by @remi-kazeroni (:beer:) those smpi monsters are happily plodding along now - yes, we should change the settings for sure. @sloosvel how did you fix the runs for those recipes that really-really don't have data, like the ones I found in #2881 (comment)? |
I don't have a definitive answer for the really-really missing data cases. As said in this comment, you could try to rerun the recipes adding these paths to your config file. But that data pool is 2 releases old. One could argue that we should delete it and re-download everything as
Taking a closer look at some of these (currently) 13 cases:
|
I think for recipe_climate_change_hotspot.yml, I ended up running it on JASMIN |
Hi @remi-kazeroni @sloosvel awesome, thanks a lot! Here's the thing(s):
I'll have a closer look at the meehl and schlund ones, and will ping @schlunma asap |
Yes, the version of recipe_flato13ipcc.yml currently in #2156 is running. The cost is removing/commenting out the data sets that do not work on Levante (and fixing a wrong time period for one model). There was already some discussion on how to deal with such cases, and if I remember right, @axel-lauer, who is the maintainer of the original recipe_flato13ipcc.yml, did not agree with removing data sets? It should also be noted that the option --skip_nonexistent does not work for all diagnostics in recipe_flato13ipcc.yml, because several of them need data sets from e.g. two different experiments, and it does not work if only one is there. Therefore I was going to ask in this issue which version of recipe_flato13ipcc.yml should end up in #2156. (Unfortunately I'm also not completely done with some issues in recipe_flato13ipcc_figures_938_941.yml; I hope to finish them soon). |
V, can adapt the permission to |
@schlunma Manu, they are here |
@katjaweigel many thanks for your clarification! I will consider this recipe at-risk for now, and will not faff about it until you guys fix it - not the first and not the last time we include not really fully working recipes in a release 😁 |
|
Hi @valeriupredoi please let me know if you want to schedule a call, I have to say that I am quite confused by all your issues. I did not run into any of that. |
Hi @sloosvel - many thanks, am back on track now, no need for a call just yet, maybe if you could keep an eye on this issue if I ask for some help, that'd be awesome 🍺 👍 |
OK comparison tool is now plodding along nicely - I have also added the instructions in the issue description - we can use that description to hatch us a nice doc entry - the next RM should not go through the Gates of VM Purgatory like I did yesterday 👍 |
Comparison results
Run command and output stored
Per recipe result
Legend:
122 out of 127 final
Result
We need to look at plots for 34 recipes; we're good to go for 85 recipes; 3 have no reference in 2.6.0 |
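As a quick consistency check on the numbers quoted in this comment, the three comparison categories should add up to the number of successfully run recipes. The snippet below is only a sketch; the variable names are made up for illustration and are not taken from the comparison tool.

```python
# Counts quoted in the comparison summary (variable names are illustrative only).
needs_plot_inspection = 34   # recipes whose plots differ from v2.6.0
identical_to_reference = 85  # recipes that are good to go
no_reference_in_v260 = 3     # recipes without a v2.6.0 reference
successful_runs = 122        # recipes that ran successfully
total_recipes = 127          # recipes in the release test

compared = needs_plot_inspection + identical_to_reference + no_reference_in_v260

# Every successfully run recipe should fall in exactly one category.
assert compared == successful_runs

print(f"compared: {compared} of {total_recipes} recipes")
print(f"to inspect: {needs_plot_inspection / compared:.0%}")
```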
@schlunma Why are the |
Oh wait, is that because you're trying to combine data downloaded with the ESGF DRS with the DKRZ DRS and we don't support per path DRS settings? ESMValGroup/ESMValCore#129 |
Exactly. We found that using the full paths with |
Actually, the cleanest way to make it work is just to set |
We tried that too, but if I remember correctly there was a problem with this. Could be that it was an issue because at the beginning downloading was very slow (so there was no chance that slow recipes would run), which should be fixed now. I will give it another try sometime in the future. It would be really great if you could do something about ESMValGroup/ESMValCore#129 🚀 |
alright folks, as we have seen in #2881 (comment) we need to look at some recipes that have not had the same plots as in v2.6.0, these are 34 party poopers:
To quickly identify differing plots please have a look at this log https://esmvaltool.dkrz.de/shared/esmvaltool/compare270output_trimmed.txt We can have a look at them in the run list for v2.7.0 https://esmvaltool.dkrz.de/shared/esmvaltool/v2.7.0/debug.html vs the v2.6.0 one https://esmvaltool.dkrz.de/shared/esmvaltool/v2.6.0/debug.html - I will start having a look myself but by all means, @ESMValGroup/esmvaltool-developmentteam I could really use a hand here, especially since you (as recipe maintainers/developers) know these things well, they're all beetles and bugs on coloured paper to me 😁 |
Is there a log file where we can see the differences? My recipe has a lot of plots and if just one of them differs as the list says it'd be easier to just look at that one =D |
logfile coming right away - I'll post it in the comment above 🍺 |
before I post the log (currently curating it) to not lose you, Tina, here's the only bitty plot that differs for your recipe:
|
You can pass that recipe, that's just a different sorting for the diff ensemble members in the histogram and looks diff cause they're not labeled. Cheers for the special extract for me ;) |
@bettina-gier - legend, many thanks! 🍺 |
@valeriupredoi do you mind if I run recipe_climate_change_hotspot in jasmin, so that I can at least upload the results for this version? |
@sloosvel go for it! Cheers 🍺 - but am planning on releasing tonight, it looks promising. If you run it then upload results to the v2.7.0 that'd be awesome, and the release is not affecting that 👍 It'd be great if you were around to approve the last PR, the one changing the version number, in an hour or so; no probs if not, I will ask @bouweandela |
@valeriupredoi for
|
@TomasTorsvik brilliant, many thanks for looking! And a positive difference too, thanks to your PR 🍺
Would you be OK to open an issue about this, please? And tag @ledm so we can fix that in 2.8. Many thanks! 🍺 |
@valeriupredoi the same applies for |
Fantastic, cheers @TomasTorsvik 🍺 |
@valeriupredoi I'm not sure, but it seems the difference in |
Eagle-eyed, man 🦅 - coastline contours are more pronounced in 2.7 - cartopy change most probably, but they look the same to me! |
OK this concludes the release testing marathon! Good news is there are not many bad apples among the recipes, bad news is there are a couple - see #2881 (comment) - we found a couple MAGICs project R recipes that look dubious and opened at least one issue about #2890 - but since these recipes are unmaintained, developers who wrote them have in the meantime left the institutes they're listed under etc I am not going to hold the release for some Da Vinci Code-style tracking down; we need to think what we do with such recipes. Oh and the ocean recipes by @ledm need some TLC but he's told me this for a while now, we should get together one time and fix them, no major bugs, but old crap that needs updating. I declare this Tool ready for release! Many thanks to all who helped during this testing process @sloosvel @remi-kazeroni @schlunma @bettina-gier @TomasTorsvik and @bouweandela of course 😁 🍻 |
it's out and about! 🍺 https://pypi.org/project/ESMValTool/2.7.0/ |
@bettina-gier Would it be possible to sort the ensemble members in a way that is stable between runs? With the upcoming more regular recipe testing that @ehogan et al are working on, as described in #2723, issues like this will keep popping up. |
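One way to get an ordering that is stable between runs is to sort the ensemble member labels with a deterministic "natural" key before plotting. The sketch below assumes the members are identified by labels such as `r1i1p1f1`; the helper name `ensemble_sort_key` is hypothetical and not part of the diagnostic in question.

```python
import re


def ensemble_sort_key(member: str) -> tuple:
    """Return a sort key that orders numbers inside a label numerically.

    Hypothetical helper: 'r10i1p1f1' is split into
    ('r', 10, 'i', 1, 'p', 1, 'f', 1, ''), so that r2 sorts before r10.
    """
    parts = re.split(r"(\d+)", member)
    # re.split with a capturing group alternates text (even index)
    # and digit runs (odd index), so the tuple types always line up.
    return tuple(int(p) if i % 2 else p for i, p in enumerate(parts))


members = ["r10i1p1f1", "r2i1p1f1", "r1i1p1f1"]
ordered = sorted(members, key=ensemble_sort_key)
print(ordered)
```

Sorting with an explicit key like this does not depend on the order in which files were found on disk, which is the kind of run-to-run variation that makes unlabeled histograms look different.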
Sister and logical evolution of #2852 - I am commencing testing and comparison of recipes and recipe results in order to release 2.7.0 at the end of this week (hopefully). System parameters below, work done on DKRZ/Levante: submit files in /home/b/b382109/submit, output in /scratch/b/b382109/esmvaltool_output
System and settings
conda/mamba
Git branch and state
Date: 25 October 2022 14:22 BST
Environment
On Levante:
Environment file
ToolEnv270Test.yml
Extraneous file movements
I moved the autoassess-specific files to /home/b/b382109/autoassess_files - run was successful for AA recipes then 👍
Ad-hoc hacks (code changes)
/home/b/b382109/ESMValTool/esmvaltool/diag_scripts/land_carbon_cycle/diag_global_turnover.py l.278: replace .outline_patch with .spines["geo"], as suggested by @zklaus in "Recipe recipe_carvalhais14nat.yml fails at plotting in diagnostic" #2886 (comment) (cheers, dude!) - this will have to be PR-ed
Mods to config user file
Added DKRZ downloaded data pool as:
as @schlunma and @remi-kazeroni have suggested 🍺
Recipe runs
Recipe runs results (as of final on 27 October 2022) are listed in #2881 (comment) (with very many thanks to @remi-kazeroni for running the impossible to run ones!) and are as follows:
(*) means not counting/counting the one that had a DiagnosticError and was fixed but not PR-ed
Running the comparison
Login and access to the DKRZ esmvaltool VM
Results from recipe runs are stored on the VM; login with:
Get and install miniconda on VM
E.g.
scp Miniconda3-py39_4.12.0-Linux-x86_64.sh b382109@esmvaltool.dkrz.de:~
from a file already on Levante.
Setting up the input files
If you wrote the recipe run output to the Levante /scratch partition, be aware that the data will be removed after two weeks, so you will have to move the output data to the /work partition, via e.g. a nohup job:
/work is visible by the VM so you can run the compare tool straight on the VM.
NOTE: do not store final release results on the VM including /preproc/ dirs; the total size for all the recipes' output, including /preproc/ dirs, is in the 4.5TB ballpark, much too high for the VM storage capacity
Running compare tool at VM
tool270Compare
release270stable
pip install imagehash
Input/output/run
/work/bd0854/b382109/v270 (contains preproc/ dirs too, 122 recipes)
/mnt/esmvaltool_disk2/shared/esmvaltool/v2.6.0rc4 (does not contain preproc/ dirs)
nohup python ESMValTool/esmvaltool/utils/testing/regression/compare.py /mnt/esmvaltool_disk2/shared/esmvaltool/v2.6.0rc4 /work/bd0854/b382109/v270 > compare270output.txt
Sanity check, as outputted by compare.py
First pass result
Running compare.py results in a few recipes being not-OK (NOK) wrt plots differing from the previous release v2.6.0, summary in #2881 (comment)
Detailed plots inspection
Inspection of the plots that differ for the 34 recipes is happening in #2881 (comment)
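The `pip install imagehash` step above suggests that plots are compared via perceptual image hashes. To illustrate why such a comparison can flag a plot as different even for small visual changes (e.g. more pronounced coastlines), here is a dependency-free sketch of a simple average hash. It is not the code in compare.py; the function names are made up, and the input is assumed to be an already downsampled grayscale image given as a 2D list.

```python
def average_hash(pixels):
    """Return one bit per pixel: True where the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return tuple(value > mean for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equally long hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


reference = [[10, 10, 200, 200], [10, 10, 200, 200]]
brighter = [[20, 20, 210, 210], [20, 20, 210, 210]]  # same structure, shifted
changed = [[10, 200, 200, 200], [10, 10, 200, 200]]  # one pixel differs

print(hamming_distance(average_hash(reference), average_hash(brighter)))
print(hamming_distance(average_hash(reference), average_hash(changed)))
```

A distance of zero means the two images have the same coarse structure; any non-zero distance is a reason to look at the plots by eye, which is what the detailed inspection is for.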