In [1]:
import xarray as xr
import xcdat as xc
import numpy as np
import xskillscore as xs 

Description: the metrics table from e3sm_diags v3.0.0 shows small differences in the regridded reference statistics (e.g. Ref._mean, and therefore Mean_Bias) compared to e3sm_diags v2. Test_mean and Test_STD are unchanged.


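V3:

| Variables | Unit | Test_mean | Ref._mean | Mean_Bias | Test_STD | Ref._STD | RMSE | Correlation |
|---|---|---|---|---|---|---|---|---|
| SST global HadISST_CL | degC | 20.256 | 18.777 | 1.48 | 8.178 | 9.464 | 1.055 | 0.992 |
| SST global HadISST_PI | degC | 20.256 | 19.058 | 1.199 | 8.178 | 8.853 | 1.233 | 0.991 |
| SST global HadISST_PD | degC | 20.256 | 18.885 | 1.372 | 8.178 | 9.47 | 1.082 | 0.992 |

V2:

| Variables | Unit | Test_mean | Ref._mean | Mean_Bias | Test_STD | Ref._STD | RMSE | Correlation |
|---|---|---|---|---|---|---|---|---|
| SST global HadISST_CL | degC | 20.256 | 18.698 | 1.559 | 8.178 | 9.536 | 1.054 | 0.992 |
| SST global HadISST_PI | degC | 20.256 | 18.978 | 1.279 | 8.178 | 8.933 | 1.232 | 0.991 |
| SST global HadISST_PD | degC | 20.256 | 18.807 | 1.45 | 8.178 | 9.543 | 1.082 | 0.992 |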
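
To put a number on the "small difference", the snippet below subtracts the V2 Ref._mean from the V3 Ref._mean for each reference dataset. The values are copied by hand from the two tables; the dictionary names are only for illustration.

```python
# Ref._mean values (degC) copied from the V3 and V2 metrics tables.
ref_mean_v3 = {"HadISST_CL": 18.777, "HadISST_PI": 19.058, "HadISST_PD": 18.885}
ref_mean_v2 = {"HadISST_CL": 18.698, "HadISST_PI": 18.978, "HadISST_PD": 18.807}

# Shift of the regridded reference mean between the two versions.
shift = {
    name: round(ref_mean_v3[name] - ref_mean_v2[name], 3)
    for name in ref_mean_v3
}

for name, value in shift.items():
    print(f"{name}: {value:+.3f} degC")
```

For all three reference datasets the regridded reference mean moves by roughly 0.08 degC, while Test_mean stays the same.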



Summary: the small difference comes from a change in the regridding routine. Both versions use bilinear regridding, but in the new code base a mask must be explicitly added to the dataset before it is passed to the ESMF regridder. Otherwise, more data is treated as missing, i.e. missing data bleeds into the regridded data.
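
As a toy illustration of what "bleeding" means, the sketch below interpolates between one valid value and one missing value. It is a generic weighted-sum example in plain Python, not a description of how ESMF or xesmf handle masks internally; the function names and the renormalization step are assumptions made only for this illustration.

```python
import math


def interp_plain(left, right, frac):
    """Plain linear interpolation: a NaN neighbour makes the result NaN."""
    return (1.0 - frac) * left + frac * right


def interp_masked(left, right, frac):
    """Linear interpolation that skips NaN neighbours and renormalizes.

    This is one possible way to use validity information; it is an
    assumption for illustration, not ESMF's algorithm.
    """
    pairs = [(1.0 - frac, left), (frac, right)]
    valid = [(w, v) for w, v in pairs if w > 0.0 and not math.isnan(v)]
    total = sum(w for w, _ in valid)
    if total == 0.0:
        return math.nan  # nothing valid to interpolate from
    return sum(w * v for w, v in valid) / total


ocean, land = 20.0, math.nan  # land is missing in an SST field
plain = interp_plain(ocean, land, 0.25)
masked = interp_masked(ocean, land, 0.25)
print("plain :", plain)
print("masked:", masked)
```

In the plain version the destination point is lost even though 75% of its weight comes from a valid ocean value; once the regridder knows which source points are invalid, it can treat them deliberately instead of letting the NaN propagate through the weighted sum.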

Solutions:
1. In the xcdat regridder, add a `mask` before passing the data to xesmf.
2. In e3sm_diags, add a `mask` before calling xcdat.
3. Use the `conservative_normed` method for SST. This requires dropping the lat bounds from the HadISST data, because they are in descending order (already fixed on the LCRC inputdata server); that is a separate issue the xcdat team is addressing.

Data for testing is available from: https://web.lcrc.anl.gov/public/e3sm/zhang40/cdat-migration-fy24/test_data/

In [2]:
f_a = '/Users/zhang40/Downloads/HadISST_CL-SST-ANN-global_test.nc'
f_b = '/Users/zhang40/Downloads/HadISST_CL-SST-ANN-global_ref.nc'

In [3]:
sst_a = xr.open_dataset(f_a)
sst_b = xr.open_dataset(f_b)
var = 'SST'

In [4]:
sst_a = sst_a.bounds.add_missing_bounds()
sst_b = sst_b.bounds.add_missing_bounds()

weights = sst_a.spatial.get_weights(["X", "Y"], data_var=var)

output_grid = sst_a.regridder.grid
# Regridding without a mask
sst_b_regrid_bilinear = sst_b.regridder.horizontal(
            var, output_grid, tool='xesmf', method='bilinear'
        )

sst_b_regrid_conservative_normed = sst_b.regridder.horizontal(
            var, output_grid, tool='xesmf', method='conservative_normed'
        )
result_xr1 = xs.rmse(sst_a[var], sst_b_regrid_bilinear[var], dim=["lat", "lon"], weights=weights, skipna=True)
result_xr2 = xs.rmse(sst_a[var], sst_b_regrid_conservative_normed[var], dim=["lat", "lon"], weights=weights, skipna=True)


print('When no mask is explicitly added:')
print('weighted mean, bilinear:', sst_b_regrid_bilinear[var].weighted(weights).mean().values, result_xr1.values)
print('weighted mean, conserve:', sst_b_regrid_conservative_normed[var].weighted(weights).mean().values, result_xr2.values)


When no mask is explicitly added:
weighted mean, bilinear: 18.77674568342201 1.4763235405423747
weighted mean, conserve: 18.646808919906057 1.4764820110242953


In [5]:
# Add a mask variable to the dataset to regrid with a mask. This helps
# prevent missing values (`np.nan`) from bleeding into the
# regridding.
# https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html#Regridding-with-a-mask
# sst_b["mask"] = xr.where(~np.isnan(sst_b[var]), 1, 0)
# Below creates a True/False boolean mask, which may be faster and use less memory.
sst_b["mask"] = ~np.isnan(sst_b[var])
sst_b_regrid_bilinear = sst_b.regridder.horizontal(
            var, output_grid, tool='xesmf', method='bilinear'
        )

sst_b_regrid_conservative_normed = sst_b.regridder.horizontal(
            var, output_grid, tool='xesmf', method='conservative_normed'
        )
result_xr1 = xs.rmse(sst_a[var], sst_b_regrid_bilinear[var], dim=["lat", "lon"], weights=weights, skipna=True)
result_xr2 = xs.rmse(sst_a[var], sst_b_regrid_conservative_normed[var], dim=["lat", "lon"], weights=weights, skipna=True)

print('With mask explicitly added:')
print('weighted mean and rmse, bilinear:', sst_b_regrid_bilinear[var].weighted(weights).mean().values, result_xr1.values)
print('weighted mean and rmse, conserve:', sst_b_regrid_conservative_normed[var].weighted(weights).mean().values, result_xr2.values)



With mask explicitly added:
weighted mean and rmse, bilinear: 18.673915615671618 1.4764820110242953
weighted mean and rmse, conserve: 18.646808919906057 1.4764820110242953
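
For readers who want to check the two statistics by hand, here is a minimal NumPy sketch of an area-weighted mean and an area-weighted RMSE that skip missing cells. It assumes both are normalized by the sum of weights over the valid cells; that convention, the function names, and the small test arrays are assumptions for illustration and may not match the exact implementation in xarray or xskillscore.

```python
import numpy as np


def weighted_mean(field, weights):
    """Weighted mean over cells where `field` is not NaN."""
    valid = ~np.isnan(field)
    w = np.where(valid, weights, 0.0)
    values = np.where(valid, field, 0.0)
    return np.sum(w * values) / np.sum(w)


def weighted_rmse(a, b, weights):
    """Weighted RMSE over cells where both `a` and `b` are not NaN."""
    valid = ~np.isnan(a) & ~np.isnan(b)
    w = np.where(valid, weights, 0.0)
    sq_err = np.where(valid, (a - b) ** 2, 0.0)
    return np.sqrt(np.sum(w * sq_err) / np.sum(w))


# Tiny made-up 2x2 fields; one reference cell is missing.
test_field = np.array([[20.0, 22.0], [24.0, 26.0]])
ref_field = np.array([[19.0, np.nan], [23.0, 25.0]])
area = np.array([[1.0, 1.0], [2.0, 2.0]])

ref_mean = weighted_mean(ref_field, area)
rmse = weighted_rmse(test_field, ref_field, area)
print("weighted mean:", ref_mean)
print("weighted rmse:", rmse)
```

Because missing cells drop out of both the numerator and the weight sum, any change in which regridded cells are NaN changes the set of cells that contribute to the statistics, which is consistent with the weighted mean shifting between the unmasked and masked bilinear runs.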
