
CDAT Migration: Refactor annual_cycle_zonal_mean set #798

Merged

Conversation

chengzhuzhang
Contributor

Description

Refactor annual_cycle_zonal_mean with xarray/xcdat.
The driver is fairly short and has a unique _create_annual_cycle function.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have added tests that prove my fix is effective or that my feature works
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@chengzhuzhang chengzhuzhang force-pushed the refactor/669-annual_cycle_zonal_mean branch from 7bc2657 to ebe73f1 Compare March 22, 2024 19:57
@chengzhuzhang
Contributor Author

Basic driver and plotting scripts are working, though only with multiprocessing = False. If I switch it on, I hit the errors below. They appear to come from ds = xc.open_mfdataset(**args), which was newly added to read multi-month data and concatenate it into an annual-cycle time series.

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
2024-03-22 15:14:20,314 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
Traceback (most recent call last):
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 68, in run_diag
    ds_test = test_ds.get_climo_dataset(var_key, "ANNUALCYCLE")
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 365, in get_climo_dataset
    ds = self._get_climo_dataset(season)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 393, in _get_climo_dataset
    ds = self._open_annual_cycle_climo_dataset(filepath)
  File "/global/homes/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/dataset_xr.py", line 425, in _open_annual_cycle_climo_dataset
    ds = xc.open_mfdataset(**args)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xcdat/dataset.py", line 277, in open_mfdataset
    ds = xr.open_mfdataset(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/backends/api.py", line 1053, in open_mfdataset
    combined = combine_by_coords(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 958, in combine_by_coords
    concatenated_grouped_by_data_vars = tuple(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 959, in <genexpr>
    _combine_single_variable_hypercube(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 630, in _combine_single_variable_hypercube
    concatenated = _combine_nd(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 232, in _combine_nd
    combined_ids = _combine_all_along_first_dim(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 267, in _combine_all_along_first_dim
    new_combined_ids[new_id] = _combine_1d(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/combine.py", line 290, in _combine_1d
    combined = concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 252, in concat
    return _dataset_concat(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/concat.py", line 526, in _dataset_concat
    merged_vars, merged_indexes = merge_collected(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 290, in merge_collected
    merged_vars[name] = unique_variable(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/merge.py", line 137, in unique_variable
    out = out.compute()
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 547, in compute
    return new.load(**kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/variable.py", line 520, in load
    loaded_data, *_ = chunkmanager.compute(self._data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/daskmanager.py", line 70, in compute
    return compute(*data, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/context.py", line 281, in _Popen
    return Popen(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 956, in new_fork
    _on_forked_process(setup_tracing=apply_arg_patch and not is_subprocess_fork)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 232, in _on_forked_process
    pydevd.settrace_forked(setup_tracing=setup_tracing)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 3134, in settrace_forked
    settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2821, in settrace
    _locked_settrace(
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 2902, in _locked_settrace
    py_db.connect(host, port)  # Note: connect can raise error.
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/pydevd.py", line 1421, in connect
    s = start_client(host, port)
  File "/global/u2/c/chengzhu/.vscode-server/extensions/ms-python.debugpy-2024.2.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_comm.py", line 493, in start_client
    s.connect((host, port))
TimeoutError: timed out
80.67s - Could not connect to 127.0.0.1: 49425

@chengzhuzhang chengzhuzhang added the cdat-migration-fy24 CDAT Migration FY24 Task label Mar 22, 2024
@chengzhuzhang
Contributor Author

Current results with one variable: https://portal.nersc.gov/cfs/e3sm/cdat-migration-fy24/669-annual_cycle_zonal_mean/viewer/

Other TODO items:

  • refine axis config for plot
  • fix viewer
  • Verify all variable runs

@tomvothecoder
Collaborator

tomvothecoder commented Apr 11, 2024

Basic driver and plotting scripts are working, though only with multiprocessing = False. If I switch it on, I hit the errors below. They appear to come from ds = xc.open_mfdataset(**args), which was newly added to read multi-month data and concatenate it into an annual-cycle time series. (Full traceback quoted above.)

This issue seems to be related to these:

  1. Performance
  2. Conflicts with multiprocessing scheduler using context of fork when calling to_netcdf()

I'm currently debugging and will push fixes.

@tomvothecoder tomvothecoder force-pushed the refactor/669-annual_cycle_zonal_mean branch from ebe73f1 to 629b8e3 Compare April 11, 2024 20:20
Collaborator

@tomvothecoder tomvothecoder left a comment

This commit fixes the multiprocessing=True TimeoutError issue in this comment.

RE: #798 (comment)

Current results with one variable: portal.nersc.gov/cfs/e3sm/cdat-migration-fy24/669-annual_cycle_zonal_mean/viewer

Other TODO items:

* refine axis config for plot

* fix viewer

* Verify all variable runs

I think the only remaining items are the last two bullets.

Comment on lines 447 to 507
# NOTE: This GitHub issue explains why the "coords" and "compat" args
# are defined as they are below: https://github.com/xCDAT/xcdat/issues/641
args = {
    "paths": filepath,
    "decode_times": False,
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}
Collaborator

Notable change: I am going to remove "chunks": "auto" because we end up loading the dataset into memory anyway, which means downstream computational operations all run serially within a single process.

Comment on lines 478 to 527
# NOTE: There seems to be an issue with `open_mfdataset()` and
# using the multiprocessing scheduler defined in e3sm_diags,
# resulting in timeouts and resource locking.
# To avoid this, we load the multi-file dataset into memory before
# performing downstream operations.
# Related GH issue: https://github.com/pydata/xarray/issues/3781
ds.load(scheduler="sync")

Collaborator

Notable change.

Comment on lines 123 to 130
# --------------------------------------------------------------------------
plt.xticks(time, X_TICKS)
lat_formatter = LatitudeFormatter() # type: ignore
ax.yaxis.set_major_formatter(lat_formatter)
ax.tick_params(labelsize=8.0, direction="out", width=1)
ax.xaxis.set_ticks_position("bottom")
ax.yaxis.set_ticks_position("left")
Collaborator

I added this block of code from the old plotter because it was missing here.

Collaborator

I think it fixes the "refine axis config for plot" todo item in this comment.

_save_data_metrics_and_plots(
parameter,
plot_func,
var_key,
test_zonal_mean.to_dataset(),
ref_zonal_mean.to_dataset(),
diff,
metrics_dict={},
metrics_dict=None,
Collaborator

metrics_dict can be set to None after removing the metrics_dict arg from the plot function.

@@ -36,7 +34,6 @@ def plot(
da_test: xr.DataArray,
da_ref: xr.DataArray,
da_diff: xr.DataArray,
metrics_dict: MetricsDict,
Collaborator

Removed unused metrics_dict arg.

Comment on lines 416 to 427
def _open_annual_cycle_climo_dataset(self, filepath: str) -> xr.Dataset:
    """Open 12 monthly mean climatology dataset.

    Parameters
    ----------
    filepath : str
        The path to the climatology datasets.
    """
    args = {"paths": filepath, "decode_times": False, "add_bounds": ["X", "Y"]}
    ds = xc.open_mfdataset(**args)
    return ds

Collaborator

I also replaced _open_annual_cycle_climo_dataset() with an updated version of _open_climo_dataset() that supports multi-file datasets.
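
For reference, a minimal sketch of what the multi-file climatology open could look like, combining the args and the load workaround discussed in this review (the helper name is illustrative; the actual _open_climo_dataset() implementation may differ):

import xcdat as xc


def open_annual_cycle_climo(filepath: str):
    """Open 12 monthly climatology files as a single dataset (sketch)."""
    args = {
        "paths": filepath,
        "decode_times": False,
        "add_bounds": ["X", "Y"],
        # See https://github.com/xCDAT/xcdat/issues/641 for why "coords" and
        # "compat" are set this way.
        "coords": "minimal",
        "compat": "override",
    }
    ds = xc.open_mfdataset(**args)

    # Load into memory with the synchronous scheduler to avoid conflicts with
    # the multiprocessing scheduler used by e3sm_diags.
    # Related GH issue: https://github.com/pydata/xarray/issues/3781
    ds.load(scheduler="sync")

    return ds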

@tomvothecoder
Collaborator

@chengzhuzhang you can pick this set back up. I did not make any progress since our last meeting on 4/15/24 (notes). Specifically, there is still a problem related to:

multiprocessing = True threw a timeout error; fixed by loading the multi-file dataset into memory (which conflicts with the Dask multiprocessing scheduler)

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jul 10, 2024

  1. The viewer is fixed in 322.
  2. I can confirm that with multiprocessing on, it still ran into an error:
2024-07-10 11:27:10,547 [ERROR]: run.py(run_diags:91) >> Error traceback:
Traceback (most recent call last):
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/run.py", line 89, in run_diags
    params_results = main(params)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/e3sm_diags_driver.py", line 371, in main
    parameters_results = _run_with_dask(parameters)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/e3sm_diags_driver.py", line 316, in _run_with_dask
    results = bag.map(CoreParameter._run_diag).compute(num_workers=num_workers)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 342, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/dask/base.py", line 628, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
  3. A full run with all variables running in series also stopped midway.
  4. Errors also occur for 3 variables and are data specific:
2024-07-10 12:29:55,272 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: SCO
2024-07-10 12:30:46,299 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global
2024-07-10 12:30:50,654 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
TypeError: float() argument must be a string or a real number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag
    _run_diags_annual_cycle(
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 142, in _run_diags_annual_cycle
    test_zonal_mean = test_zonal_mean.sel(lat=(-60, 60))
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1617, in sel
    ds = self._to_temp_dataset().sel(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataset.py", line 3074, in sel
    query_results = map_index_queries(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexing.py", line 193, in map_index_queries
    results.append(index.sel(labels, **options))
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 748, in sel
    label_array = normalize_label(label, dtype=self.coord_dtype)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 545, in normalize_label
    value = np.asarray(value, dtype=dtype)
ValueError: setting an array element with a sequence.
2024-07-10 12:30:50,730 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: TCO
2024-07-10 12:31:24,528 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global
2024-07-10 12:31:26,916 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
TypeError: float() argument must be a string or a real number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag
    _run_diags_annual_cycle(
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 142, in _run_diags_annual_cycle
    test_zonal_mean = test_zonal_mean.sel(lat=(-60, 60))
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 1617, in sel
    ds = self._to_temp_dataset().sel(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataset.py", line 3074, in sel
    query_results = map_index_queries(
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexing.py", line 193, in map_index_queries
    results.append(index.sel(labels, **options))
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 748, in sel
    label_array = normalize_label(label, dtype=self.coord_dtype)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/indexes.py", line 545, in normalize_label
    value = np.asarray(value, dtype=dtype)
ValueError: setting an array element with a sequence.
2024-07-10 12:31:26,916 [INFO]: annual_cycle_zonal_mean_driver.py(run_diag:56) >> Variable: SST
2024-07-10 12:32:03,765 [INFO]: annual_cycle_zonal_mean_driver.py(_run_diags_annual_cycle:124) >> Selected region: global
2024-07-10 12:32:06,626 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' test variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_test.nc
2024-07-10 12:32:06,778 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' ref variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_ref.nc
2024-07-10 12:32:06,783 [INFO]: io.py(_write_to_netcdf:134) >> 'SST' diff variable output saved in: /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global_diff.nc
2024-07-10 12:32:06,783 [INFO]: io.py(_save_data_metrics_and_plots:66) >> Metrics saved in /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean/annual_cycle_zonal_mean/SST_CL_HadISST/HadISST_CL-SST-ANNUALCYCLE-global.json
2024-07-10 12:32:07,551 [ERROR]: core_parameter.py(_run_diag:341) >> Error in e3sm_diags.driver.annual_cycle_zonal_mean_driver
Traceback (most recent call last):
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/parameter/core_parameter.py", line 338, in _run_diag
    single_result = module.run_diag(self)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 76, in run_diag
    _run_diags_annual_cycle(
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/annual_cycle_zonal_mean_driver.py", line 167, in _run_diags_annual_cycle
    _save_data_metrics_and_plots(
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/driver/utils/io.py", line 81, in _save_data_metrics_and_plots
    plot_func(*args)
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py", line 67, in plot
    _add_colormap(
  File "/global/u2/c/chengzhu/e3sm_diags/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py", line 112, in _add_colormap
    var = var.transpose("lat", "time")
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/dataarray.py", line 3022, in transpose
    dims = tuple(utils.infix_dims(dims, self.dims, missing_dims))
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/utils.py", line 814, in infix_dims
    existing_dims = drop_missing_dims(dims_supplied, dims_all, missing_dims)
  File "/global/cfs/cdirs/e3sm/zhang40/conda_envs/e3sm_diags_dev_654_zonal_mean_xy/lib/python3.10/site-packages/xarray/core/utils.py", line 906, in drop_missing_dims
    raise ValueError(
ValueError: Dimensions {'lat'} do not exist. Expected one or more of ('time', 'latitude')

@chengzhuzhang
Contributor Author

When multiprocessing=True is set, the error concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending. remains. Without loading the dataset into memory (i.e. ds.load(scheduler="sync")), it raises a TimeoutError instead.

@@ -109,7 +109,8 @@ def _add_colormap(
# Add the contour plot
# --------------------------------------------------------------------------
ax = fig.add_axes(DEFAULT_PANEL_CFG[subplot_num], projection=None)
var = var.transpose("lat", "time")
# var = var.transpose("lat", "time")
var = var.transpose(var.dims[1], var.dims[0])
Contributor Author

One SST dataset has "latitude" instead of "lat" as the dimension name. This code change avoided referencing the dimension name explicitly.

@chengzhuzhang
Contributor Author

More updates: the TimeoutError came from driver/utils/regrid.py:

ds_a_regrid = ds_a_new.regridder.horizontal(
    var_key, output_grid, tool=tool, method=method
)

@@ -413,6 +413,14 @@ def _get_climo_dataset(self, season: str) -> xr.Dataset:
# ds = ds[[self.var, 'lat_bnds', 'lon_bnds']]
ds = ds[[self.var] + keep_bnds]
Contributor Author

This line keeps only the variable (after derivation) and the bounds-related data variables. It removes excess data to reduce memory usage. Note that this change in dataset_xr.py may affect other sets.
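
A hedged sketch of the subsetting described here, using the keep_bnds construction shown later in this PR (variable names other than self.var are illustrative):

# Keep only the target variable (after derivation) plus bounds variables
# (e.g. lat_bnds, lon_bnds) so that loading into memory stays small.
all_vars = list(ds.data_vars.keys())
keep_bnds = [var for var in all_vars if "bnd" in var or "bounds" in var]
ds = ds[[self.var] + keep_bnds]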

# To avoid this, we load the multi-file dataset into memory before
# performing downstream operations.
# Related GH issue: https://github.com/pydata/xarray/issues/3781
ds.load(scheduler="sync")
Contributor Author

@chengzhuzhang chengzhuzhang Jul 12, 2024

This is needed to resolve the conflict between multiprocessing and Dask. It has to be placed here so that only the needed variables are loaded; otherwise, out-of-memory errors still occur, as @tomvothecoder noted.

@chengzhuzhang
Contributor Author

  • f0a80c9 and 15811b8 together address the error where multiprocessing = True threw a timeout error or concurrent.futures.process.BrokenProcessPool. It is fixed by first reducing the dataset size and then loading the multi-file dataset into memory (the latter because Python's multiprocessing conflicts with the Dask multiprocessing scheduler).

  • Regression testing caught an error in main which is addressed by fix save_ncfiles for annual_cycle_zonal_mean #822

  • The regression results mostly matched, except for the AODVIS variable, for which the development branch uses the test data for both the test and ref plots.

@tomvothecoder tomvothecoder force-pushed the refactor/669-annual_cycle_zonal_mean branch from f1dc8eb to 2cf65c4 Compare July 15, 2024 20:41
@tomvothecoder tomvothecoder changed the base branch from cdat-migration-fy24 to main July 15, 2024 20:42
@tomvothecoder tomvothecoder changed the base branch from main to cdat-migration-fy24 July 15, 2024 20:42
@tomvothecoder tomvothecoder force-pushed the refactor/669-annual_cycle_zonal_mean branch 2 times, most recently from 2c188cd to befef87 Compare July 15, 2024 20:53
…igned time coords

- Update `annual_cycle_zonal_mean_plot.py` to convert time coordinates to month integers
@tomvothecoder
Collaborator

tomvothecoder commented Jul 18, 2024

@tomvothecoder Thank you for troubleshooting. I was testing both datasets for the same variable, but the example 2 dataset should be retired (I replaced this dataset in lat_lon but missed this instance in annual_cycle_zonal_mean). As you pointed out, the created dataset is correct; the first time step gives the March mean. We could add a fix to align time (that should fix the plot, whose x-axis ticks start from January). Since this dataset is retired, I think we should just focus on example 1 for now. (I should remember to update the main branch with the new data in the .cfg.)

I just pushed a fix to issue 1 in this commit: 159cdf5 (#798).

It involves setting decode_times=True to properly concatenate the time coordinates. I found that no downstream operations are affected by this change except the annual_cycle_zonal_mean plotter, which uses the time coordinates for plotting. I had to update the plotter to extract the months to use as x-axis values.

Also, I updated the comment above describing how CDAT replaces time coordinates with month integers in _create_annual_cycle() as a workaround to this issue.

@chengzhuzhang
Contributor Author

@tomvothecoder When testing with decode_times = False, I found that for example 1, the decoded time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that the time variable's units are standard for ncclimo-generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out.

@tomvothecoder
Collaborator

@tomvothecoder When testing with decode_times = False, I found that for example 1, the decoded time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that the time variable's units are standard for ncclimo-generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out.

Did you mean decode_times=True? If so, I will take a closer look.

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jul 22, 2024

Did you mean decode_times=True? If so, I will take a closer look.

Yes!
I think the climatology data can't be decoded correctly by cftime somehow.

@tomvothecoder
Collaborator

tomvothecoder commented Jul 22, 2024

@tomvothecoder When testing with decode_times = False, I found that for example 1, the decoded time is just not right. For instance, for the January mean climatology file, the time was decoded as time (time) object 2000-07-02 00:30:00. I also found that the time variable's units are standard for ncclimo-generated climatology files for model and obs data. Not sure why MERRA2_Aerosols stands out.

I verified that cftime is decoding the time coordinates correctly. The issue is that the raw time coordinates are not correct relative to the "units" attribute (10782720, 'minutes since 1980-01-01 00:30:00'). The time axis is also missing the "calendar" attribute, with "standard" being subbed in as the default.

I don't think this was caught in the CDAT codebase because the _create_annual_cycle() function avoids this issue by opening each dataset individually, replacing the time coordinate with the month integer, then concatenating the datasets into a single dataset along the time axis.

Although I'm not a fan of a custom I/O function to handle data quality issues, we have to implement a function similar to _create_annual_cycle() as a workaround for this specific case.
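
A rough sketch of what such a workaround could look like (this is an assumption about the shape of the function, not the actual CDAT _create_annual_cycle(); the sorted-glob input is illustrative):

import xarray as xr


def create_annual_cycle(paths):
    # Open each monthly climatology file individually, replace its single time
    # coordinate with the month integer (1-12), then concatenate along time.
    datasets = []
    for month, path in enumerate(sorted(paths), start=1):
        ds = xr.open_dataset(path, decode_times=False)
        ds = ds.assign_coords(time=[month])
        datasets.append(ds)

    return xr.concat(datasets, dim="time")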

cftime decoding -- cftime.DatetimeGregorian(2000, 7, 2, 0, 30, 0, 0, has_year_zero=False)

from glob import glob

import cftime
import xcdat as xc

args = {
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}

filepath = "/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_[0-1][0-9]_*climo.nc"
paths = sorted(glob(filepath))

# filepath 1: '/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/MERRA2_Aerosols/MERRA2_Aerosols_01_198001_202101_climo.nc'
ds_raw_time = xc.open_mfdataset(paths[0], **args, decode_times=False)

# 10782720
time_int = ds_raw_time.time.values.item()
# 'minutes since 1980-01-01 00:30:00'
units = ds_raw_time.time.units
# None, so "standard"
calendar = ds_raw_time.time.attrs.get("calendar", "standard")

# cftime.DatetimeGregorian(2000, 7, 2, 0, 30, 0, 0, has_year_zero=False)
cftime.num2date(time_int, units, calendar=calendar)

datetime.datetime decoding -- datetime.datetime(2000, 7, 2, 0, 30)

import datetime

first_step = datetime.datetime(1980, 1, 1, hour=0, minute=30)
time_delta = datetime.timedelta(minutes=10782720)

# datetime.datetime(2000, 7, 2, 0, 30)
print(first_step + time_delta)

@tomvothecoder
Collaborator

tomvothecoder commented Jul 22, 2024

Although I'm not a fan of a custom I/O function to handle data quality issues, we have to implement a function similar to _create_annual_cycle() as a workaround for this specific case.

Actually, the easier thing to do is to ignore the decoded time values, since they aren't used, and to assume the order is 1-12 (Jan to Dec), as the CDAT code does. The main caveat is that the time coordinates must be in ascending order, which they are when opening the datasets in Xarray/xCDAT with decode_times=True.

The only change needed is to update time_months in the plotter to range(1, 13).

# Make sure the months are in order to cover cases where the climatology
# spans more than 1 year, resulting in months being out of order.
# e.g., [3, 4, 5,...1, 2] -> [1,2,3, 4, 5,...]
time_months = sorted([t.dt.month for t in time])
var = var.squeeze()
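
A minimal sketch of the proposed change, reusing the variable names from the snippet above (the surrounding plotter code is assumed):

# Ignore the decoded time values and assume the 12 climatology time steps are
# ordered Jan-Dec, mirroring how the CDAT code uses month integers.
time_months = list(range(1, 13))
var = var.squeeze()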

@chengzhuzhang
Contributor Author

@tomvothecoder I was searching for a code example that reads data using open_mfdataset with a specified order of files: https://stackoverflow.com/questions/75241585/using-xarrays-open-mfdataset-to-open-a-series-of-nc-files

import numpy as np
import pandas as pd
import xarray

ds = xarray.open_mfdataset(
    [f'{i}.nc' for i in range(10)],
    concat_dim=[
        pd.Index(np.arange(10), name="new_dim"),
    ],
    combine="nested",
)

Though I think your solution actually works okay, given that decode_times=True yields time coordinates in ascending order (even though the decoded month value doesn't match the actual climatology month). Updating time_months in the plotter to range(1, 13) puts back the correct month index. I will do another regression test to confirm.

@chengzhuzhang
Contributor Author

@tomvothecoder I'm retesting this set with all variables and realized that the memory issue came back. Then I tested again with the commit that resolved the memory issue (15811b8): no errors. Some change between f2c3568 and 15811b8 brought back the issue. I doubt decode_times is the cause, though.

@tomvothecoder
Collaborator

@tomvothecoder I'm retesting this set with all variables and realized that the memory issue came back. Then I tested again with the commit that resolved the memory issue (15811b8): no errors. Some change between f2c3568 and 15811b8 brought back the issue. I doubt decode_times is the cause, though.

Besides the recent plotter update, decode_times=True is the only other change from commit 159cdf5 (#798). Maybe decoding times is introducing overhead, although it should be lazy in xCDAT. Also, if climatology files are being used, the number of time coordinates to decode should be minimal. More debugging is needed here.

@chengzhuzhang
Contributor Author

Changing back to decode_times = False did not help. And sadly, some git history was wiped out by a few force-pushes. I tried reverting to recent commits, but concurrent.futures.process.BrokenProcessPool always occurs. I'm kind of running out of debugging methods.

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jul 25, 2024

Not sure of the best way to continue troubleshooting after ruling out the args change for open_mfdataset. What I did is swap the dataset_xr.py from commit 15811b8 into the latest code (I did need to edit it slightly to make it work, i.e. change CLIMO_FREQ to Climo_Freq). No memory issue. At least this narrows down the problematic file, and I suspect some changes merged in other PRs introduced the memory problem. I'm stepping through the diffs to see what might be the cause.

The file diff for dataset_xr.py is here https://www.diffchecker.com/mTw8AWif/

- Due to incorrectly updating `keep_bnds` logic
- Add `_encode_time_coords()` to work around the cftime issue `ValueError: "months since" units only allowed for "360_day" calendar`
@tomvothecoder
Collaborator

tomvothecoder commented Jul 25, 2024

I was actually in the middle of debugging here with my comment. I resolved the multiprocessing issue; it was my fault :(

Issues I resolved in f9a9ea7 (#798)

  1. Slow .load() performance and sometimes multiprocessing issue (concurrent.futures.process.BrokenProcessPool)

    • Root cause: My mistake here, and sorry for removing git history with rebasing. I accidentally committed incorrect logic, keep_bnds = [var for var in all_vars if "bnd" or "bounds" in var], which kept all variables in the dataset before .load().
    • Solution: Update it to keep_bnds = [var for var in all_vars if "bnd" in var or "bounds" in var] (see the sketch at the end of this comment).
  2. With decode_times=True, I get ValueError: 'months since' units only allowed for '360_day' calendar for the TCO and SCO reference variables when writing out to netCDF

    • Root cause: The source dataset ('/global/cfs/cdirs/e3sm/diagnostics/observations/Atm/climatology/OMI-MLS/OMI-MLS_01_200501_201701_climo.nc') has the units 'months since 2005-01-01 00:00:00' and is missing the "calendar" attribute ("standard" is used as a default). Once again, the CDAT code does not run into this issue because it replaces time coordinates with month integers.
    • Solution: Added _encode_time_coords() to driver to encode time coordinates to month integers
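
For reference, a small sketch of why the incorrect expression kept everything (the variable list is illustrative):

all_vars = ["PRECT", "lat_bnds", "lon_bnds", "time_bnds"]

# Buggy: "bnd" is a non-empty string, so `"bnd" or "bounds" in var` is always
# truthy and every variable is kept.
buggy = [var for var in all_vars if "bnd" or "bounds" in var]
print(buggy)  # ['PRECT', 'lat_bnds', 'lon_bnds', 'time_bnds']

# Fixed: test each substring against the variable name.
fixed = [var for var in all_vars if "bnd" in var or "bounds" in var]
print(fixed)  # ['lat_bnds', 'lon_bnds', 'time_bnds']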

@tomvothecoder
Collaborator

tomvothecoder commented Jul 25, 2024

I re-ran the regression test notebook with the latest commit. I am still getting the following diffs:

AODVIS

Comparing:
/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc 
 /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc
AODVIS
var_key AODVIS

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
 x: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
 y: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...

ALBEDO -- It just looks like np.inf is being used in xCDAT while np.nan is used with CDAT. I recall this happening in other regression tests. Replacing np.inf with np.nan resolves this issue and vice versa.

Comparing:
/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-ANNUALCYCLE-global_ref.nc 
 /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-Annual-Cycle_test.nc
ALBEDO
var_key ALBEDO

Not equal to tolerance rtol=1e-05, atol=0

x and y nan location mismatch:
 x: array([[0.69877 , 0.695266, 0.68627 , ...,      inf,      inf,      inf],
       [0.712032, 0.706896, 0.69354 , ...,      inf,      inf,      inf],
       [0.765447, 0.743142, 0.738787, ..., 0.752918, 0.751204, 0.833122],...
 y: array([[0.69877 , 0.695266, 0.68627 , ...,      nan,      nan,      nan],
       [0.712033, 0.706896, 0.69354 , ...,      nan,      nan,      nan],
       [0.765447, 0.743142, 0.738787, ..., 0.752918, 0.751204, 0.833123],...
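
For the comparison itself, a hedged sketch of normalizing np.inf to np.nan before re-checking (paths follow the comparison above; the normalization step is mine, not part of the regression notebook):

import numpy as np
import xcdat as xc

dev_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-ANNUALCYCLE-global_ref.nc"
main_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/CERES-EBAF-TOA-v4.1/ceres_ebaf_toa_v4.1-ALBEDO-Annual-Cycle_test.nc"

var_a = xc.open_dataset(dev_path)["ALBEDO"]
var_b = xc.open_dataset(main_path)["ALBEDO"]

# Treat np.inf (xCDAT output) and np.nan (CDAT output) as the same missing
# value before comparing.
a = np.where(np.isinf(var_a.values), np.nan, var_a.values)
np.testing.assert_allclose(a, var_b.values, rtol=1e-05, atol=0)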

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jul 25, 2024

@tomvothecoder this is a big relief! I skimmed through the file several times and noticed the changed line keep_bnds = [var for var in all_vars if "bnd" or "bound" in var], but was not careful enough to catch the problem! No worries about the AODVIS variable. I will update the .cfg file to replace this obs source with two new data sources.

@tomvothecoder
Collaborator

tomvothecoder commented Jul 25, 2024

I added a debug script for AODVIS that compares the max, min, sum, and mean. All of the values look close.

I think the max relative diff is large because the values are close to 0.

import numpy as np
import xcdat as xc

dev_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc"
main_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc"


var_a = xc.open_dataset(dev_path)["AODVIS"]
var_b = xc.open_dataset(main_path)["AODVIS"]

"""
Floating point comparison

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
 x: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
 y: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
"""
np.testing.assert_allclose(var_a, var_b)

# Get the max of all values
# -------------------------
# 0.28664299845695496
print(var_a.max().item())
# 0.2866430557436412
print(var_b.max().item())

# Get the min of all values
# -------------------------
# 0.0
print(var_a.min().item())
# 0.0
print(var_b.min().item())

# Get the sum of all values
# -------------------------
# 224.2569122314453
print(var_a.sum().item())
# 224.25691348856003
print(var_b.sum().item())

# Get the mean of all values
# -------------------------
# 0.10382264107465744
print(var_a.mean().item())
# 0.1038226451335926
print(var_b.mean().item())


# %%
# Get the max absolute diff
# -------------------------
# 0.12250582128763199
print((var_a - var_b).max().item())

@chengzhuzhang
Contributor Author

chengzhuzhang commented Jul 25, 2024

I think the max relative diff is large because the values are close to 0.

Yeah, the values and metrics all look very close. Based on the plots I saw earlier, the months were off.
Anyway, based on the comments from #624, I retired AODVIS from MACv1 in lat_lon but missed the annual_cycle_zonal_mean set. I made the update in 7ba0900.

@chengzhuzhang
Contributor Author

@tomvothecoder I think we can merge after the CI/CD tests are completed!

@chengzhuzhang chengzhuzhang merged commit 8a8dafa into cdat-migration-fy24 Jul 25, 2024
2 of 4 checks passed
@chengzhuzhang chengzhuzhang deleted the refactor/669-annual_cycle_zonal_mean branch July 25, 2024 21:31
Labels
cdat-migration-fy24 CDAT Migration FY24 Task