[Bug]: e3sm_to_cmip exception running bundles on Unified 1.9.2 #543

Closed
forsyth2 opened this issue Jan 30, 2024 · 14 comments
Labels
semver: bug Bug fix (will increment patch version)

Comments

@forsyth2
Collaborator

forsyth2 commented Jan 30, 2024

What happened?

I was running the "c. test final Unified" steps of https://e3sm-project.github.io/zppy/_build/html/main/dev_guide/release_testing.html, for Unified 1.9.2 (that is, testing what was actually released).

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_complete_run_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts/
$ grep -v "OK" *status
# Nothing shows up. Good, complete_run ran successfully.
$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
bundle1.status:ERROR
ts_land_monthly_1850-1851-0002.status:ERROR (5)
$ grep -n ts_land_monthly_1850-1851-0002 bundle1.o463341 
1327:=== ts_land_monthly_1850-1851-0002.bash ===
1377:2024-01-30 03:14:58,326 [INFO]: __main__.py(__init__:147) >>     * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1378:2024-01-30 03:14:58,326 [INFO]: __main__.py(__init__:147) >>     * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1379:2024-01-30 03:14:58,326_326:INFO:__init__:    * output_path='/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002'
1445:mv: cannot stat '/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/lnd/180x360_aave/cmip_ts/monthly/tmp_ts_land_monthly_1850-1851-0002/CMIP6/CMIP/*/*/*/*/*/*/*/*/*.nc': No such file or directory
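
The "cannot stat" message from mv indicates that the wildcard pattern matched no files, so the shell passed the literal pattern through to mv. A minimal Python sketch for checking the same pattern directly is below. `find_cmor_outputs` is a hypothetical helper (not part of zppy or e3sm_to_cmip), and the eight wildcard directory levels are copied from the pattern in the log line above.

```python
import glob
import os


def find_cmor_outputs(tmp_dir):
    """Return the .nc files matching the glob used by the failing mv command.

    Hypothetical helper for illustration only.
    """
    # Eight wildcard directory levels below CMIP6/CMIP, as in the logged pattern.
    wildcards = ["*"] * 8
    pattern = os.path.join(tmp_dir, "CMIP6", "CMIP", *wildcards, "*.nc")
    return sorted(glob.glob(pattern))
```

An empty list for the tmp_ts_land_monthly_1850-1851-0002 directory would suggest that CMORization produced no output files before the mv step ran.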

I see the following in the output file:

2024-01-30 03:15:06,173_173:INFO:cmorize:lai: creating CMOR variable with CMOR axis objects.
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/site-packages/e3sm_to_cmip/__main__.py", line 912, in _run_parallel
    out = res.result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
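
The frames above stop at `raise self._exception`, which is where `Future.result()` re-raises whatever exception the worker raised. The excerpt does not include the final line naming that exception. The sketch below is a generic illustration of this standard-library behavior, not e3sm_to_cmip's actual `_run_parallel`. `run_all` and the handler names are hypothetical, and a thread pool is used only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(handlers):
    """Run each callable in a pool and collect failures instead of raising.

    `handlers` maps a label (for example a variable name) to a zero-argument
    callable. Hypothetical helper for illustration only.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(fn): name for name, fn in handlers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                # result() re-raises the worker's exception in the caller.
                results.append((name, future.result()))
            except Exception as exc:
                # Keep the label so the log shows which handler failed and why.
                errors.append((name, repr(exc)))
    return results, errors
```

Recording the label next to the exception text makes it easier to tell which variable's handler failed when several run in parallel.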

However, this appears to happen elsewhere without causing complete failures:

$ grep -n "concurrent/futures/_base.py" bundle1.o463341 
1021:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
1023:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
1430:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 458, in result
1432:  File "/lcrc/soft/climate/e3sm-unified/base/envs/e3sm_unified_1.9.2_chrysalis/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
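
To see which tasks the four matching lines belong to, each hit can be attributed to the nearest preceding section header. A minimal sketch, assuming every task in the bundle log starts with a header of the form `=== name.bash ===` as in the grep output earlier in this report; `hits_by_section` is a hypothetical helper.

```python
import re

# Header format taken from the log line "=== ts_land_monthly_1850-1851-0002.bash ===".
SECTION_RE = re.compile(r"^=== (?P<name>\S+) ===$")


def hits_by_section(lines, needle):
    """Map each section name to the 1-based line numbers that contain `needle`.

    Hits before the first header are stored under the key None.
    """
    current = None
    hits = {}
    for lineno, line in enumerate(lines, start=1):
        match = SECTION_RE.match(line.strip())
        if match:
            current = match.group("name")
        elif needle in line:
            hits.setdefault(current, []).append(lineno)
    return hits
```

Calling it on the lines of bundle1.o463341 with the needle "concurrent/futures/_base.py" would group lines 1021, 1023, 1430 and 1432 under their task names, which can then be compared against the status files.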

What machine were you running on?

Chrysalis

Environment

E3SM Unified 1.9.2 (zppy v2.3.0)

What command did you run?

zppy -c tests/integration/generated/test_bundles_chrysalis.cfg

Copy your cfg file

[default]
case = v2.LR.historical_0201
constraint = ""
dry_run = "False"
environment_commands = ""
input = "/lcrc/group/e3sm/ac.forsyth2/E3SMv2/v2.LR.historical_0201"
input_subdir = archive/atm/hist
mapping_file = "map_ne30pg2_to_cmip6_180x360_aave.20200201.nc"
# To run this test, edit `output` and `www` in this file, along with `actual_images_dir` in test_bundles.py
output = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201"
partition = "compute"
qos = "regular"
walltime = "07:00:00"
www = "/lcrc/group/e3sm/public_html/diagnostic_output/ac.forsyth2/zppy_test_bundles_www/test_unified_1.9.2"

[bundle]

  [[ bundle2 ]]
  nodes = 2
  walltime = "00:59:00"

[climo]
active = True
bundle = "bundle1"
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"

  [[ atm_monthly_diurnal_8xdaily_180x360_aave ]]
  frequency = "diurnal_8xdaily"
  input_files = "eam.h4"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

[ts]
active = True
bundle = "bundle1"
years = "1850:1854:2",

  [[ atm_monthly_180x360_aave ]]
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  ts_fmt = "cmip"

  [[ atm_daily_180x360_aave ]]
  frequency = "daily"
  input_files = "eam.h1"
  input_subdir = "archive/atm/hist"
  vars = "PRECT"

  [[ atm_monthly_glb ]]
  bundle = "bundle2" # Override bundle1
  frequency = "monthly"
  input_files = "eam.h0"
  input_subdir = "archive/atm/hist"
  mapping_file = "glb"
  years = "1850:1860:5",

  [[ land_monthly ]]
  extra_vars = "landfrac"
  frequency = "monthly"
  input_files = "elm.h0"
  input_subdir = "archive/lnd/hist"
  vars = "FSH,LAISHA,LAISUN,RH2M"
  ts_fmt = "cmip"

  [[ rof_monthly ]]
  bundle = "bundle3" # Override bundle1, let bundle1 finish first because "e3sm_diags: atm_monthly_180x360_aave_mvm" requires "ts: atm_monthly_180x360_aave"
  extra_vars = 'areatotal2'
  frequency = "monthly"
  input_files = "mosart.h0"
  input_subdir = "archive/rof/hist"
  mapping_file = ""
  vars = "RIVER_DISCHARGE_OVER_LAND_LIQ"

[tc_analysis]
active = True
bundle = "bundle3" # Let bundle1 finish first because "e3sm_diags: atm_monthly_180x360_aave_mvm" requires "ts: atm_monthly_180x360_aave"
scratch = "/lcrc/globalscratch/ac.forsyth2/"
years = "1850:1852:2",

[e3sm_diags]
active = True
grid = '180x360_aave'
ref_final_yr = 2014
ref_start_yr = 1985
sets = "lat_lon","zonal_mean_xy","zonal_mean_2d","polar","cosp_histogram","meridional_mean_2d","enso_diags","qbo","diurnal_cycle","annual_cycle_zonal_mean","streamflow", "zonal_mean_2d_stratosphere", "tc_analysis",
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1854:2", "1850:1854:4",

  [[ atm_monthly_180x360_aave ]]
  bundle = "bundle1"
  climo_diurnal_frequency = "diurnal_8xdaily"
  climo_diurnal_subsection = "atm_monthly_diurnal_8xdaily_180x360_aave"
  sets = "polar","enso_diags","diurnal_cycle",

  [[ atm_monthly_180x360_aave_mvm ]]
  # Test model-vs-model using the same files as the reference
  bundle = "bundle3"
  climo_subsection = "atm_monthly_180x360_aave"
  diff_title = "Difference"
  ref_final_yr = 1851
  ref_name = "v2.LR.historical_0201"
  ref_start_yr = 1850
  ref_years = "1850-1851",
  reference_data_path = "/lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/v2.LR.historical_0201/post/atm/180x360_aave/clim"
  run_type = "model_vs_model"
  sets = "polar","enso_diags","streamflow","tc_analysis",
  short_ref_name = "v2.LR.historical_0201"
  swap_test_ref = False
  tag = "model_vs_model"
  ts_num_years_ref = 2
  ts_subsection = "atm_monthly_180x360_aave"

[mpas_analysis]
active = False

[global_time_series]
active = True
atmosphere_only = True
bundle = "bundle2"
experiment_name = "v2.LR.historical_0201"
figstr = "v2_historical_0201"
ts_num_years = 5
walltime = "00:30:00" # bundle2 should take walltime from "ts: atm_monthly_glb", i.e., "02:00:00"
years = "1850-1860",

[ilamb]
active = True
# No bundle, let bundle1 finish first because "ilamb" requires "ts: atm_monthly_180x360_aave"
grid = '180x360_aave'
short_name = 'v2.LR.historical_0201'
ts_num_years = 2
years = "1850:1852:2",

What jobs are failing?

$ cd /lcrc/group/e3sm/ac.forsyth2/zppy_test_bundles_output/test_unified_1.9.2/v2.LR.historical_0201/post/scripts
$ grep -v "OK" *status
bundle1.status:ERROR
ts_land_monthly_1850-1851-0002.status:ERROR (5)
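
An equivalent check in Python, assuming each status file holds a single short line such as "OK" or "ERROR (5)" (inferred from the grep output above); `failed_statuses` is a hypothetical helper.

```python
from pathlib import Path


def failed_statuses(scripts_dir):
    """Return {file name: status text} for *status files that do not contain 'OK'."""
    failed = {}
    for path in sorted(Path(scripts_dir).glob("*status")):
        text = path.read_text().strip()
        if "OK" not in text:
            failed[path.name] = text
    return failed
```

An empty dictionary corresponds to grep printing nothing, as in the complete_run case.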

What stack trace are you encountering?

No response

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Jan 30, 2024
@forsyth2
Collaborator Author

forsyth2 commented Feb 2, 2024

Strangely, this error does in fact occur on Chrysalis using Unified 1.9.2rc3. I know I tested that successfully though. Therefore, something has changed to affect past versions.

I will try on Perlmutter too.

@forsyth2
Collaborator Author

forsyth2 commented Feb 2, 2024

@chengzhuzhang Interestingly, I don't actually see this error on Perlmutter. Since Perlmutter is the primary machine people use bundles on, I suppose we can mark this lower priority.

@xylar
Contributor

xylar commented Feb 5, 2024

@forsyth2 and @chengzhuzhang, I've been keeping an eye on this. Is this something you expect to have diagnosed and fixed soon? @wlin7 found another bug in MPAS-Analysis, MPAS-Dev/MPAS-Analysis#981, that will require another bug-fix release of E3SM-Unified. I could include a fix for this if need be.

@forsyth2
Collaborator Author

forsyth2 commented Feb 5, 2024

@xylar I think we decided it's lower priority. I'm not quite sure what would cause this issue.

@xylar
Contributor

xylar commented Feb 5, 2024

Okay, just wanted to check.

@chengzhuzhang
Collaborator

@forsyth2 I think there are two issues (#544 and #546) that we should look into, to figure out whether they are user errors or need a fix in zppy.

@forsyth2
Collaborator Author

> that will require another bug-fix release of E3SM-Unified

@xylar Do you have a timeline/expected deadline for this?

For reference, our prioritized list for zppy:

@xylar
Contributor

xylar commented Feb 13, 2024

I have already tested 1.9.3rc1. It fixed the MPAS-Analysis issue it was meant to fix.

I would be willing to wait until early next week and then make a second, hopefully final, rc. But I don't want a process that snowballs and takes 2 months like 1.9.2 did (which was partly because of the holidays).

@forsyth2
Collaborator Author

@chengzhuzhang How urgent are we deeming the above issues? I think it's unlikely they could all be fixed by next week.

@chengzhuzhang
Collaborator

chengzhuzhang commented Feb 13, 2024

@forsyth2 #424 (I need to run the test suite and make fixes)
and #548 are ready to review. Please help review and integrate.

Have you had a chance to look at the other two (#544 and #546)? If not I will try to look into both and see if quick fixes are possible.

And I don't think #543 is a priority.

@chengzhuzhang
Collaborator

@xylar thanks for the heads-up. e3sm_diags will have a new release as well. I will work with @tomvothecoder to have the release candidate ready by this week.

@forsyth2
Collaborator Author

> #424 (I need to run the test suite and make fixes)
> and #548 are ready to review. Please help review and integrate.

I will test/code-review those tomorrow morning, I'm out-of-office this afternoon.

> Have you had a chance to look at the other two (#544 and #546)? If not I will try to look into both and see if quick fixes are possible.

Not yet. I will try to take a look at those tomorrow too.

> And I don't think #543 is a priority.

Sounds good.

@xylar
Contributor

xylar commented Feb 14, 2024

Okay, I'll expect zppy and e3sm_diags RCs by sometime next week and I can make an E3SM-Unified rc2 after that.

@forsyth2
Collaborator Author

In testing #424, I got the bundles test passing. I think the failure was a combination of two issues: 1) the "cannot stat" error occurs for two land variables, so I removed those; 2) at some point I accidentally updated the expected bundles files to the output of a non-merged PR, so I updated the expected files again.

In any case, since bundles is passing, I'm closing this.


3 participants