
DAS scripting needs to clean up bkg files from previous segment before GEOSgcm.x runs #188

Open
bena-nasa opened this issue Jun 13, 2022 · 1 comment

bena-nasa commented Jun 13, 2022

@rtodling
For reasons not pertinent to this issue, I was attempting to run essentially the develop branch of GEOSadas with the basic C48f test case, which apparently has not been tested in this branch.
I hit an issue that you will need to resolve: the bkg files needed by mkiau.x must be properly removed after it runs and BEFORE GEOSgcm.x runs. Here are the details:

When I ran that test case, I was seeing errors from netCDF at 3z while GEOSgcm.x was running: it was trying to inquire for the varid, given a name and an ncid, in a file it was in the process of writing, and the error code said the variable did not exist in the file. I put a print in to determine which variable it was and, as you can see, it was EVAP:

 AGCM Date: 2019/01/17  Time: 02:52:30  Throughput(days/day)[Avg Tot Run]:    303.1    338.0    338.8  TimeRemaining(Est) 000:01:12   26.8% :  14.9% Mem Comm:Used

 Writing:    549 Slices to File:  C48f_ben.inst3_3d_asm_Np.20190117_0300z.nc4
 Writing:   1083 Slices to File:  C48f_ben.inst3_3d_asm_Nv.20190117_0300z.nc4
 Writing:    378 Slices to File:  C48f_ben.tavg3_3d_cld_Cp.20190117_0130z.nc4
 Writing:    365 Slices to File:  C48f_ben.tavg3_3d_mst_Ne.20190117_0130z.nc4
 Writing:    729 Slices to File:  C48f_ben.bkg.eta.20190117_0300z.nc4
 Writing:     52 Slices to File:  C48f_ben.bkg.sfc.20190117_0300z.nc4
 Writing:    290 Slices to File:  C48f_ben.cbkg.eta.20190117_0300z.nc4
 Writing:    587 Slices to File:  C48f_ben.vtx.mix.20190117_03z.nc4
 Writing:    729 Slices to File:  C48f_ben.asm.eta.20190117_0300z.nc4
 bmaa failed write variable EVAP
pe=00085 FAIL at line=00030    NetCDF4_put_var.H                        <status=-49>
pe=00085 FAIL at line=00842    ServerThread.F90                         <status=-49>
pe=00085 FAIL at line=00138    BaseServer.F90                           <status=-49>

This was weird; the only plausible way it could fail to find the variable in the file is if the file already existed. So I put more prints in and said: if it tries to open an already existing file whose name contains the experiment id, stop. I saw this:

 AGCM Date: 2019/01/17  Time: 02:52:30  Throughput(days/day)[Avg Tot Run]:    311.5    353.6    354.4  TimeRemaining(Est) 000:01:10   31.8% :  28.1% Mem Comm:Used

 Writing:    549 Slices to File:  C48f_ben.inst3_3d_asm_Np.20190117_0300z.nc4
 Writing:   1083 Slices to File:  C48f_ben.inst3_3d_asm_Nv.20190117_0300z.nc4
 Writing:    378 Slices to File:  C48f_ben.tavg3_3d_cld_Cp.20190117_0130z.nc4
 Writing:    365 Slices to File:  C48f_ben.tavg3_3d_mst_Ne.20190117_0130z.nc4
 Writing:    729 Slices to File:  C48f_ben.bkg.eta.20190117_0300z.nc4
 Writing:     52 Slices to File:  C48f_ben.bkg.sfc.20190117_0300z.nc4
 Writing:    290 Slices to File:  C48f_ben.cbkg.eta.20190117_0300z.nc4
 Writing:    587 Slices to File:  C48f_ben.vtx.mix.20190117_03z.nc4
 Writing:    729 Slices to File:  C48f_ben.asm.eta.20190117_0300z.nc4
pe=00049 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.bkg.eta.20190117_0300z.nc4>
pe=00006 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.cbkg.eta.20190117_0300z.nc4>
pe=00095 FAIL at line=00265    NetCDF4_FileFormatter.F90                <file exists: C48f_ben.bkg.sfc.20190117_0300z.nc4>

I thought that was odd: why does the file exist? I re-ran the experiment and stopped it as soon as the GSI started. When I did an

ls C48f_ben*.nc4

in the fvwork I saw this:

(noback/fvwork.48856) > ls C48f_ben.*.nc4
C48f_ben.bkg.eta.20190116_2100z.nc4  C48f_ben.bkg.sfc.20190116_2100z.nc4  C48f_ben.cbkg.eta.20190116_2100z.nc4
C48f_ben.bkg.eta.20190117_0000z.nc4  C48f_ben.bkg.sfc.20190117_0000z.nc4  C48f_ben.cbkg.eta.20190117_0000z.nc4
C48f_ben.bkg.eta.20190117_0300z.nc4  C48f_ben.bkg.sfc.20190117_0300z.nc4  C48f_ben.cbkg.eta.20190117_0300z.nc4

So those files were already there at the time the experiment was created; I realized they must be the backgrounds from the previous segment needed by mkiau.x. If you look at the History.rc.tmpl you get with the develop branch of GEOSadas, you will see that the bkg.sfc collection has an "EVAP" variable, and that collection does not start writing until 3z to produce the backgrounds for the next segment, which is exactly when GEOSgcm.x was crashing. BUT the bkg files that get copied in when the experiment is created, to produce the increments for the current segment, do not have EVAP.

So what is going on is that at 3z History tries to write the bkg files, but they already exist; when a file already exists, the server just opens it and tries to write into it, so of course the varid inquiry for EVAP fails!
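For reference, the status=-49 in the traceback is the netCDF library's NC_ENOTVAR ("variable not found") code. Here is a minimal sketch of the failure mode, using the Python netCDF4 bindings rather than the actual Fortran MAPL server code, and with a made-up file layout:

```python
# Minimal sketch of the failure mode (Python netCDF4 bindings, made-up layout;
# the real code path is the Fortran MAPL o-server, not this).
from netCDF4 import Dataset

leftover = "C48f_ben.bkg.sfc.20190117_0300z.nc4"

# Stand-in for the background copied in from the previous segment: no EVAP in it.
with Dataset(leftover, "w") as nc:
    nc.createDimension("lat", 4)
    nc.createVariable("TS", "f4", ("lat",))

# What effectively happens when History's target file already exists: the file is
# opened for writing instead of being created, and the variable is looked up by name.
with Dataset(leftover, "a") as nc:
    evap = nc.variables["EVAP"]  # KeyError here, the analogue of NC_ENOTVAR (-49)
```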

This is really a problem with the DAS scripting. The scripting should be removing the old background files before GEOSgcm.x runs; a file that History is going to write should not already be there. The fact that this worked before means you were just lucky and apparently were not changing the contents of the bkg collections.
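Just to illustrate the ordering I mean (the real scripting is csh, and the experiment id, working directory, and patterns below are placeholders), something equivalent to this would have to happen after mkiau.x has consumed the backgrounds and before GEOSgcm.x starts:

```python
# Hypothetical sketch of the cleanup step; FVWORK and the experiment id are
# placeholders for whatever the DAS scripting actually uses.
import glob, os

fvwork = os.environ.get("FVWORK", ".")   # segment working directory (placeholder)
expid = "C48f_ben"                       # experiment id (placeholder)

# Remove the backgrounds left over from the previous segment so that History
# creates every 3z background file from scratch in this segment.
for collection in ("bkg.eta", "bkg.sfc", "cbkg.eta"):
    for path in glob.glob(os.path.join(fvwork, f"{expid}.{collection}.*.nc4")):
        os.remove(path)
```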

In addition, HistoryGridComp should check, when it decides to write a file, whether that file already exists and error out, since otherwise the problem only surfaces at a different point in the code where the error is far less clear. I will make that change in our development branch so that the pre-existing file is caught, and reported as already existing, at the moment History decides it is time to write a new file, rather than during the actual writing, where the error is more confusing.
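Roughly, the guard I have in mind is the following (the actual change belongs in the Fortran History/MAPL code; this only illustrates the logic):

```python
import os

def start_history_file(path: str) -> None:
    # Fail at the moment History decides to start a new output file, with a message
    # that names the file, instead of failing later inside a varid lookup.
    if os.path.exists(path):
        raise RuntimeError(f"History output file already exists: {path}")
    # ... otherwise create the file and define its dimensions/variables ...
```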


rtodling commented Nov 7, 2022

What I don't understand is: why would overwriting a file depend on what the existing file has? Unless NC4 overwriting is a very different beast from binary overwriting. Binary overwriting of a sequential file could not care less what was in the original file. But perhaps NC4 opens the content list ... and then, sure, if something got added or removed it would be a problem.
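If that picture is right, the difference would look something like this (Python netCDF4, made-up file names, just to check my understanding):

```python
from netCDF4 import Dataset

# Binary overwrite: the old contents are simply truncated away, so it cannot
# matter what the previous segment wrote.
with open("bkg.bin", "wb") as f:
    f.write(b"new segment")

# NetCDF "overwrite" of an existing file: the header (dimensions, variables,
# attributes) is read back, and subsequent writes must match that inventory.
with Dataset("old.nc4", "w") as nc:          # stand-in for the file from the last cycle
    nc.createDimension("lat", 4)
    nc.createVariable("TS", "f4", ("lat",))

with Dataset("old.nc4", "a") as nc:
    print(list(nc.variables))                # only what was defined originally: ['TS']
```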

In any case, I will work the removal of the files from the cycle before the model starts.

Thanks.
