Test ERR_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio failing #2003

Closed
jedwards4b opened this issue May 11, 2023 · 7 comments
Labels
type: bug something is working incorrectly

Comments

@jedwards4b
Contributor

Brief summary of bug

The rpointer.lnd file points to ./init_generated_files/finidat_interp_dest.nc,
but this file does not exist for run2.

General bug information

CTSM version you are using: ctsm5.1.dev122

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: ERR test


@jedwards4b jedwards4b added the type: bug something is working incorrectly label May 11, 2023
@ekluzek
Contributor

ekluzek commented May 11, 2023

@jedwards4b it looks to me like this may be a new test that is only now being added, which would explain why we haven't seen this before. I didn't see this test in the baselines on cheyenne for cesm2_3 tags. If it was working in the past, I'd like to compare against a previous working version, but if no such version exists, that approach isn't going to help me. So I wanted to understand the history of this test...

@jedwards4b
Contributor Author

It's in the cmeps test list; here is a baseline:
/glade/p/cesmdata/cseg/cmeps_baselines/cesm2_3_beta08_cmeps0.13.47/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio/

@billsacks
Member

If the last time this passed was cesm2_3_beta08 or thereabouts, then it used ctsm5.1.dev082, which predates the introduction of init_generated_files (which @mvertens and I implemented in ctsm5.1.dev106).

That said, I just poked around a bit in /glade/scratch/jedwards/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u and the associated archive directory (I'm not sure if that's the right one to be looking at?), and I'm confused about what's happening here. I see that rpointer.lnd in the archive directory (/glade/scratch/jedwards/archive/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u/rest/0001-01-04-00000) does indeed point to ./init_generated_files/finidat_interp_dest.nc. My recollection is that the rpointer file points to that file after init_interp is run in model initialization, but should then be updated to point to the actual restart file by the end of the run. That is what I see in /glade/scratch/jedwards/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u/run/rpointer.lnd, which points to ./ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u.clm2.r.0001-01-04-00000.nc.

So the question to me is how the older rpointer.lnd got copied to the archive directory instead of the updated one.
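The comparison above (does the file named in each rpointer.lnd actually exist next to it?) could be scripted roughly as follows. This is only a sketch: the function name `rpointer_target_exists` is hypothetical, and it assumes that the first non-empty line of an rpointer file is a path resolved relative to the directory holding that rpointer file.

```python
from pathlib import Path


def rpointer_target_exists(rpointer_path):
    """Return (target, exists) for the file named inside an rpointer file.

    Assumption: the first non-empty line of the rpointer file is a path,
    resolved relative to the directory that holds the rpointer file.
    """
    rpointer = Path(rpointer_path)
    # Keep only non-empty lines, stripped of surrounding whitespace.
    lines = [ln.strip() for ln in rpointer.read_text().splitlines() if ln.strip()]
    if not lines:
        raise ValueError(f"{rpointer} is empty")
    target = lines[0]
    return target, (rpointer.parent / target).exists()
```

Calling it once on the rpointer.lnd in the run directory and once on the copy in the archive directory would show which of the two names a file that is missing.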

@jedwards4b I see some more recent ERR tests in your scratch directory that look like they passed. Can you confirm whether this is still an issue, or whether it might have been a fluke or something you fixed somehow?

@jedwards4b
Contributor Author

It's interesting that ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_124427_2661ny passed. I made a change to restart_tests.py in cime which, before the run ended, I decided wouldn't work, so I didn't expect that run to pass. I wonder if there is a race condition between copying the rpointer file to the archive directory and updating it.

@billsacks
Member

It looks like you may have run ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u multiple times, and I wonder if something got messed up from doing that? I see two sets of logs in the archive/logs directory from that case, and the time stamps on the restart files in /glade/scratch/jedwards/archive/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u/rest/0001-01-04-00000 have a confusing pattern: most of them are from 11:27 or 11:28, but the rpointer.lnd is from 11:40, and the restart files in /glade/scratch/jedwards/ERR_Vnuopc_Ld5.f09_t061.B1850MOM.cheyenne_intel.allactive-defaultio.20230511_103340_ih649u/run are from a few minutes after that. So it feels like something went wrong on the archiving side, but I can't tell what.
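A time-stamp pattern like this is easier to see with the files sorted by modification time. The sketch below is one way to list them; `files_by_mtime` is a hypothetical helper, not part of cime, and it only looks at regular files directly inside the given directory.

```python
from datetime import datetime
from pathlib import Path


def files_by_mtime(directory):
    """Return (name, modification time) pairs for regular files, oldest first."""
    entries = [
        (p.name, datetime.fromtimestamp(p.stat().st_mtime))
        for p in Path(directory).iterdir()
        if p.is_file()
    ]
    # Sort on the timestamp so that outliers end up at one end of the list.
    return sorted(entries, key=lambda item: item[1])
```

Run against the rest/0001-01-04-00000 archive directory, a file written well after the others (such as the rpointer.lnd described above) would appear at the end of the list.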

@jedwards4b
Contributor Author

I tried again from scratch and this time it passed.

@billsacks
Member

Great, thanks a lot @jedwards4b!
