Where are all the ocean variables? #34

Open
rabernat opened this issue Feb 3, 2020 · 71 comments
@rabernat

rabernat commented Feb 3, 2020

I started to look at the LENS AWS data and discovered there is very little available:

import intake_esm
import intake
url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(url)
col.search(component='ocn').df
	component	frequency	experiment	variable	path
0	ocn	monthly	20C	SALT	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SAL...
1	ocn	monthly	20C	SSH	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SSH...
2	ocn	monthly	20C	SST	s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SST...
3	ocn	monthly	CTRL	SALT	s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SA...
4	ocn	monthly	CTRL	SSH	s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SS...
5	ocn	monthly	CTRL	SST	s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-SS...
6	ocn	monthly	RCP85	SALT	s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...
7	ocn	monthly	RCP85	SSH	s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...
8	ocn	monthly	RCP85	SST	s3://ncar-cesm-lens/ocn/monthly/cesmLE-RCP85-S...

There are only 3 variables: SALT (3D), SSH (2D), and SST (2D).

At minimum, I would also like to have THETA (3D), UVEL (3D), VVEL (3D), and WVEL (3D), and all the surface fluxes of heat and freshwater. Beyond that, it would be ideal to also have the necessary variables to reconstruct the tracer and momentum budgets.

Are there plans to add more data?

@jeffdlb
Contributor

jeffdlb commented Feb 3, 2020

Ryan-

You are correct there is not much ocean data there. The plan was to add additional data on request, so thanks for your request. We do have room in our AWS allocation to add more, and I would be glad for us to do so. I will discuss this with some team members at our meeting on Thursday. Besides the THETA, UVEL, VVEL, WVEL fields you mentioned, can you indicate which specific other variables you desire? Gary Strand's list of variable names is at http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html

@rabernat
Author

rabernat commented Feb 3, 2020

Thanks @jeffdlb! I will review the list of fields and get back to you. Our general interest is ocean heat and salt budgets.

@jeffdlb
Contributor

jeffdlb commented Feb 3, 2020

PS. We currently only have monthly ocean data, whereas the other realms also have some daily or 6-hour data. Is monthly sufficient for your use case?

@rabernat
Author

rabernat commented Feb 3, 2020

For coarse-resolution (non eddy-resolving) models, the oceans tend not to have too much sub-monthly variability. If we did need daily data, it would just be surface fluxes. Monthly should be fine for the other stuff.

@rabernat
Author

Ok, here is my best guess at identifying the variables we would need for the heat and salt budgets. Would be good for someone with more POP experience (e.g. @matt-long) to verify.

THETA
UVEL
UVEL2
VVEL
VVEL2
WVEL
FW
HDIFB_SALT
HDIFB_TEMP
HDIFE_SALT
HDIFE_TEMP
HDIFN_SALT
HDIFN_TEMP
HMXL
HOR_DIFF
KAPPA_ISOP
KAPPA_THIC
KPP_SRC_SALT
KPP_SRC_TEMP
RESID_S
RESID_T
QFLUX
SHF
SHF_QSW
SFWF
SFWF_WRST
SSH
TAUX
TAUX2
TAUY
TAUY2
VNT_ISOP
VNT_SUBM
UES
UET
VNS
VNT
WTS
WTT

@jeffdlb
Contributor

jeffdlb commented Feb 12, 2020 via email

@matt-long
Collaborator

TEMP is the POP variable name for potential temperature (not THETA).

Many of the data not available on glade are available on HPSS
[HSI]: /CCSM/csm/CESM-CAM5-BGC-LE/ocn/proc/tseries/monthly

And (possibly) on NCAR Campaign storage here
/glade/campaign/cesm/collections/cesmLE
(accessible on Casper)

Note that to close a tracer budget you need
Advection
UE{S,T}
VN{S,T}
WT{S,T}

Lateral diffusion (GM, submeso)
HDIF{E,N,B}_{S,T}

Vertical mixing
KPP_SRC_{SALT,TEMP}
DIA_IMPVF_{SALT,TEMP} # implicit diabatic vertical mixing, I think you missed this one

Surface fluxes # some choices
SHF_QSW # I don't think we save fully 3D QSW, unfortunately
QSW_HTP # top layer
QSW_HBL # boundary layer
SHF
QFLUX
SFWF

Inventory
TEMP
SALT
SSH # free-surface deviations impact the tracer mass in top layer, where dz = dz + SSH
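
For reference, a minimal sketch of pulling these budget terms into one dataset through the same AWS catalog used above, assuming the variables eventually get published there. Actually closing the budget still requires the POP grid metrics, flux-divergence operations, and care with the staggered vertical dimensions (z_t vs z_w_top/z_w_bot), so this only gathers the pieces:

import intake
import xarray as xr

url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(url)

heat_budget_vars = [
    'UET', 'VNT', 'WTT',                        # advective fluxes
    'HDIFE_TEMP', 'HDIFN_TEMP', 'HDIFB_TEMP',   # lateral diffusion (GM, submeso)
    'KPP_SRC_TEMP', 'DIA_IMPVF_TEMP',           # vertical mixing
    'SHF', 'SHF_QSW', 'QFLUX',                  # surface fluxes
    'TEMP', 'SSH',                              # inventory
]
cat = col.search(component='ocn', frequency='monthly',
                 experiment='20C', variable=heat_budget_vars)
dsets = cat.to_dataset_dict(zarr_kwargs={'consolidated': True},
                            storage_options={'anon': True})
ds = xr.merge(dsets.values())   # one dataset holding whichever terms exist
print(sorted(ds.data_vars))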

@bonnland
Collaborator

According to Gary Strand, all of the LENS data is available on Glade here:

/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE

When I look at what is available for the ocean, I find this:

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls annual/
DIA_IMPVF_DIC/		HDIFB_DOC/	    HDIFE_O2/		J_Fe/		      KPP_SRC_Fe/      VN_DIC/		WT_DOC/
DIA_IMPVF_DIC_ALT_CO2/	HDIFB_Fe/	    HDIFN_DIC/		J_NH4/		      KPP_SRC_O2/      VN_DIC_ALT_CO2/	WT_Fe/
DIA_IMPVF_DOC/		HDIFB_O2/	    HDIFN_DIC_ALT_CO2/	J_NO3/		      UE_DIC/	       VN_DOC/		WT_O2/
DIA_IMPVF_Fe/		HDIFE_DIC/	    HDIFN_DOC/		J_PO4/		      UE_DIC_ALT_CO2/  VN_Fe/
DIA_IMPVF_O2/		HDIFE_DIC_ALT_CO2/  HDIFN_Fe/		J_SiO3/		      UE_DOC/	       VN_O2/
HDIFB_DIC/		HDIFE_DOC/	    HDIFN_O2/		KPP_SRC_DIC/	      UE_Fe/	       WT_DIC/
HDIFB_DIC_ALT_CO2/	HDIFE_Fe/	    J_ALK/		KPP_SRC_DIC_ALT_CO2/  UE_O2/	       WT_DIC_ALT_CO2/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls monthly
ALK/		   DIC_ALT_CO2/       get	      MELTH_F/	       photoC_diat/  SENH_F/	 TAUY/			      VISOP/
ATM_CO2/	   DOC/		      HBLT/	      MOC/	       photoC_diaz/  SFWF/	 TAUY2/			      VNS/
BSF/		   DpCO2/	      HMXL/	      N_HEAT/	       photoC_sp/    SHF/	 TBLT/			      VVEL/
CFC11/		   DpCO2_ALT_CO2/     IAGE/	      N_SALT/	       PREC_F/	     SHF_QSW/	 TEMP/			      WISOP/
CFC_ATM_PRESS/	   ECOSYS_ATM_PRESS/  IFRAC/	      O2/	       QFLUX/	     SNOW_F/	 tend_zint_100m_ALK/	      WTS/
CFC_IFRAC/	   ECOSYS_IFRAC/      INT_DEPTH/      O2_CONSUMPTION/  QSW_HBL/      spCaCO3/	 tend_zint_100m_DIC/	      WVEL/
CFC_XKW/	   ECOSYS_XKW/	      IOFF_F/	      O2_PRODUCTION/   QSW_HTP/      spChl/	 tend_zint_100m_DIC_ALT_CO2/  XBLT/
CO2STAR/	   EVAP_F/	      Jint_100m_ALK/  O2SAT/	       RHO/	     SSH/	 tend_zint_100m_DOC/	      XMXL/
DCO2STAR/	   FG_ALT_CO2/	      Jint_100m_DIC/  O2_ZMIN/	       RHO_VINT/     SSH2/	 tend_zint_100m_O2/	      zsatarag/
DCO2STAR_ALT_CO2/  FG_CO2/	      Jint_100m_DOC/  O2_ZMIN_DEPTH/   ROFF_F/	     SST/	 TLT/			      zsatcalc/
DIA_DEPTH/	   FvICE_ALK/	      Jint_100m_O2/   pCO2SURF/        SALT/	     STF_CFC11/  TMXL/
diatChl/	   FvICE_DIC/	      LWDN_F/	      PD/	       SALT_F/	     STF_O2/	 UES/
diazChl/	   FvPER_ALK/	      LWUP_F/	      PH/	       SCHMIDT_CO2/  TAUX/	 UISOP/
DIC/		   FvPER_DIC/	      MELT_F/	      PH_ALT_CO2/      SCHMIDT_O2/   TAUX2/	 UVEL/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ ls daily
CaCO3_form_zint/  diazC_zint_100m/  ECOSYS_XKW_2/  nday1/	      spCaCO3_zint_100m/  SST/	     TAUY_2/	zooC_zint_100m/
diatChl_SURF/	  DpCO2_2/	    FG_CO2_2/	   photoC_diat_zint/  spChl_SURF/	  SST2/      WVEL_50m/
diatC_zint_100m/  ecosys/	    HBLT_2/	   photoC_diaz_zint/  spC_zint_100m/	  STF_O2_2/  XBLT_2/
diazChl_SURF/	  ECOSYS_IFRAC_2/   HMXL_2/	   photoC_sp_zint/    SSH_2/		  TAUX_2/    XMXL_2/

casper26:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE/ocn/proc/tseries$ 

If someone can help decipher these variables and determine if they are worth publishing, I would be happy to work on getting them onto AWS.

@bonnland
Collaborator

@jeffdlb @rabernat On second glance, it appears that my directory listing above is missing some variables. I'm waiting to hear from Gary for clarification.

@rabernat
Author

@bonnland -- were you the one who produced the original S3 LENS datasets? If so, it would be nice to build on that effort. My impression from @jeffdlb is that they have a pipeline set up; they just need to find the data! Maybe you were part of that... sorry for my ignorance.

As for the missing variables, I guess I would just request that you take the intersection between my requested list and what is actually available. I think that the list of monthly and daily variables you showed above is a great start. I would use nearly all of it.

@bonnland
Collaborator

@rabernat I just got word from Gary; I was originally in a slightly different folder. I have the correct folder path now, and all 273 monthly ocean variables appear to be present.

I was part of the original data publishing, so I know parts of the workflow. The most time consuming part is creating the CSV file describing an intake-esm catalog, which I did not originally take part in. The catalog is used to load data into xarray and then write out to Zarr. We have the file paths now; I just need to research how to construct the remaining fields for the CSV file.

@bonnland
Collaborator

bonnland commented Mar 5, 2020

@rabernat I've loaded some variables, and the datasets are big. A single variable will take over 2TB. Here are some stats for five of the variables:

[Screenshot: table of uncompressed dataset sizes for five of the 3D ocean variables, totaling roughly 20.95 TB]

Note that these sizes are uncompressed sizes, and they will be smaller on disk.

Is there a priority ordering that makes sense if we can initially publish just a subset? Anderson believes that if the available space on AWS has not changed, we have around 30 TB available.

@jeffdlb Do you know more exactly how much space is left on S3, and when we might get more?

@jeffdlb
Contributor

jeffdlb commented Mar 6, 2020 via email

@andersy005
Contributor

Is 20.95TB the total for all the new ocean variables, or only for a subset?
If subset, can you estimate the total (uncompressed) for all the new vars?

The 20.95 TB is for only 5 variables.

Of the 39 variables listed in #34 (comment), we found 38 variables. A back-of-the-envelope calculation shows that their total uncompressed size would be ~170 TB.
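
(Roughly: 20.95 TB for 5 variables is ~4.2 TB per 3D variable across the three experiments, and ~38 mostly 3D variables then comes out on the order of 160–170 TB uncompressed.)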

@rabernat
Author

rabernat commented Mar 6, 2020

Do we have any idea what typical zarr + zlib compression rates are for these datasets? I would not be surprised to see a factor of 2 or more.

@jeffdlb
Contributor

jeffdlb commented Mar 6, 2020

@rabernat The one data point I have so far is for atm/monthly/cesmLE-RCP85-TREFHT.zarr:
5.5 GB storage
10.1 GB uncompressed

@jeffdlb
Contributor

jeffdlb commented Mar 6, 2020

I will ask Joe & Ana whether we can have up to ~150 TB more. If not, we may need to prioritize.

@rabernat Do you know of any other expected users of these ocean variables? We might need to have some good justification for this >2x allocation increase.

@rabernat
Author

rabernat commented Mar 6, 2020

TEMP, UVEL, VVEL, WVEL, SHF, and SFWF would be the bare minimum I think.

Will try to get a sense of other potential users.

@dbonan

dbonan commented Mar 6, 2020

I've been using some of the ocean output from the CESM-LE. I've mainly been looking at overturning, heat transport, and surface forcing (i.e., MOC, SHF, UVEL, VVEL, TEMP, SST, SALT). I know there would be a lot of interest in biogeochemical variables, too. I agree it would be nice to have this on AWS data storage!

@cspencerjones
Contributor

cspencerjones commented Mar 6, 2020

I would definitely use it if available! SHF, SFWF, UVEL, VVEL, VNS, VNT, TEMP and SALT at least would be helpful. But also TAUX, TAUY, UES, UET and PD would be good too.

@jbusecke

jbusecke commented Mar 6, 2020

I would definitely be keen to look at some biogeochemical variables, like DIC, DOC, and O2. The full O2 budget would be dope, but I presume that is a lot of data (I'm not exactly sure which terms are needed, but it seems they are usually the ones with '_O2' appended, e.g. VN_O2, UE_O2, etc.). Thanks for pinging me.

@bonnland
Collaborator

bonnland commented Mar 6, 2020

@rabernat I'm in the process of creating the Zarr files for TEMP, UVEL, VVEL, WVEL, SHF, and SFWF, just as an initial test. I've discovered in the process that the coordinate dimension describing vertical levels has different names depending on the variable. For example:

 UVEL                  (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 12, 30, 384, 320), meta=np.ndarray>

 WVEL                  (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 12, 60, 384, 320), meta=np.ndarray>

The chunk size for UVEL is 30 because we were originally thinking of splitting the 60 vertical levels into two chunks. We could do the same for WVEL; we just need to be careful about using the different coordinate dimension names when we specify chunk sizes.

Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct? Also, is there perhaps a better chunking strategy than what we are considering here?
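
(For what it's worth, one way to handle the differing names when specifying chunks is to key the chunk spec on whichever vertical dimension the variable actually carries; a sketch with illustrative sizes, not the production settings:)

# Illustrative helper: chunk time, keep the horizontal grid whole, and
# chunk (or not) whichever vertical dimension the variable uses.
VERTICAL_DIMS = ('z_t', 'z_t_150m', 'z_w', 'z_w_top', 'z_w_bot')

def chunk_spec(da, time_chunk=12, vertical_chunk=-1):
    chunks = {'member_id': 1, 'time': time_chunk, 'nlat': -1, 'nlon': -1}
    for dim in da.dims:
        if dim in VERTICAL_DIMS:
            chunks[dim] = vertical_chunk   # -1 keeps all levels in one chunk
    return {k: v for k, v in chunks.items() if k in da.dims}

# e.g. ds['WVEL'].chunk(chunk_spec(ds['WVEL'], vertical_chunk=30))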

@mnlevy1981

@bonnland the different vertical coordinates signify different locations in the level: z_t is the center, while z_w_top is the top of the level and z_w_bot is the bottom of the level. Most variables will be at cell centers, i.e. z_t, though some of those are only saved in the top 150m (z_t_150m). Note that this last dimension is only 15 levels, rather than the 60 levels comprising the other dimensions.

That's a longwinded way of saying

Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct?

Keep them distinct, please

@rabernat
Author

rabernat commented Mar 7, 2020

Keep them distinct, please

👍. When producing analysis-ready data, we should always think very carefully before changing any of the metadata.

z_t is the center, while z_w_top is the top of the level and z_w_bot is the bottom of the level.

Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use nlon and nlat as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?

@mnlevy1981

mnlevy1981 commented Mar 8, 2020

Going a bit off topic, but I find POP to be pretty inconsistent about its dimension naming conventions. In the vertical, it uses different dimension names for the different grid positions. But in the horizontal, it is perfectly happy to use nlon and nlat as the dimensions for all the variables, regardless of whether they are at cell center, corner, face, etc. @mnlevy1981, do you have any insight into this choice?

I don't know for sure, but I suspect that POP (or an ancestor of POP) originally had z_w = z_t + 1 and then someone realized that all the variables output on the interface could be sorted into either the "0 at the surface" bucket or the "0 at the ocean floor" bucket so there was a chance to save some memory in output files by splitting the z_w coordinate into z_w_top and z_w_bot (and at that point, z_t and z_w were already ingrained in the code so it wasn't worth condensing to nz). Meanwhile, the two horizontal coordinate spaces (TLAT, TLONG and ULAT, ULONG)* always had the same dimensions because of the periodic nature of the horizontal grid. That's pure speculation, though.

* Going even further off topic, the inconsistency that trips me up is trying to remember when I need the "g" in "lon"... going off memory, I'm 80% sure it's nlon but TLONG and ULONG. I see the last two get shortened to tlon and ulon in random scripts often enough that I need to stop and think about it.

@bonnland
Collaborator

@rabernat I'm finishing up code for processing and publishing the ocean variables. I'd like to see what difference zlib compression makes. Are there any special parameters needed, or just use all defaults for the compression? Do you have an example of specifying this compression choice?
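
(For reference, one way to specify zlib compression when writing with xarray is via the per-variable encoding, sketched below for a dataset ds that is assumed to be the one about to be written; as the next comment notes, the default Blosc compressor ended up being kept instead.)

from numcodecs import Zlib

# ds is the xarray Dataset about to be written (assumed to exist here).
encoding = {name: {'compressor': Zlib(level=4)} for name in ds.data_vars}
ds.to_zarr('cesmLE-20C-SFWF.zarr', mode='w', encoding=encoding, consolidated=True)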

@andersy005
Contributor

We may want to stick with the default compressor because it appears to be providing a pretty good compression ratio:

In [1]: import zarr

In [2]: zstore = "/glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr"

In [3]: ds = zarr.open_consolidated(zstore)

In [5]: ds["SFWF"].info
Out[5]:
Name               : /SFWF
Type               : zarr.core.Array
Data type          : float32
Shape              : (40, 1872, 384, 320)
Chunk shape        : (40, 12, 384, 320)
Order              : C
Read-only          : False
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.ConsolidatedMetadataStore
Chunk store type   : zarr.storage.DirectoryStore
No. bytes          : 36805017600 (34.3G)
Chunks initialized : 156/156
In [7]: !du -h {zstore}
2.5K    /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr/time_bound
....
13G /glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr

@bonnland bonnland self-assigned this Mar 16, 2020
@bonnland
Collaborator

@rabernat I should be transferring the following to AWS sometime today and tomorrow: TEMP, UVEL, VVEL, WVEL, VNS, VNT, SHF, SFWF. All will cover the CTRL, RCP85, and 20C experiments. @andersy005 should be updating the AWS intake catalog when the transfer is complete.

@bonnland
Collaborator

Actually, it looks like we inadvertently wrote out the Zarr files with incorrect metadata. It is going to take a few more days to re-write and then transfer to AWS.

@rabernat
Author

I don't think I can make that decision for you. Simply stating that, due to the coronavirus pandemic and associated impacts on my time (enormous new child care responsibilities, remote teaching, etc.), I personally won't be able to do much on this until May (post spring semester).

@cspencerjones
Contributor

Thanks very much for doing this! I will try to make a start with what's there sometime next week. If it is easy to upload TAUX, TAUY, those would also be helpful to have (though I can start without them if you'd prefer to wait until I've tried it).

@bonnland
Collaborator

@cspencerjones Thanks for offering to check things. It takes a good chunk of CPU hours to produce these files, so I'd feel better knowing there isn't some glitch in what we have so far that makes these data difficult to use.

I will create and upload TAUX and TAUY, hopefully by Tuesday, and I'll respond here when they are ready. It would be great to see if you can use them successfully before creating more Zarr files.

@rabernat
Author

I did find some time to simply open up some data. Overall it looks good! Thanks for making this happen. Based on this quick look, I do have some feedback.

Let's consider, for example, WVEL, the vertical velocity:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
s3_path = 's3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'
ds = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
ds

Which gives the following long output:

<xarray.Dataset>
Dimensions:               (d2: 2, lat_aux_grid: 395, member_id: 1, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 21612, transport_comp: 5, transport_reg: 2, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
  * lat_aux_grid          (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
  * member_id             (member_id) int64 1
  * moc_z                 (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
  * time                  (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
  * z_t                   (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
  * z_t_150m              (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot               (z_w_bot) float32 1000.0 2000.0 ... 549999.06
  * z_w_top               (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
Dimensions without coordinates: d2, moc_comp, nlat, nlon, transport_comp, transport_reg
Data variables:
    ANGLE                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ANGLET                (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DXT                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DXU                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DYT                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    DYU                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HT                    (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HTE                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HTN                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HU                    (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HUS                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    HUW                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    KMT                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    KMU                   (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    REGION_MASK           (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    T0_Kelvin             float64 ...
    TAREA                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    TLAT                  (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    TLONG                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    UAREA                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULAT                  (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    ULONG                 (nlat, nlon) float64 dask.array<chunksize=(384, 320), meta=np.ndarray>
    WVEL                  (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 480, 1, 384, 320), meta=np.ndarray>
    cp_air                float64 ...
    cp_sw                 float64 ...
    days_in_norm_year     timedelta64[ns] ...
    dz                    (z_t) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    dzw                   (z_w) float32 dask.array<chunksize=(60,), meta=np.ndarray>
    fwflux_factor         float64 ...
    grav                  float64 ...
    heat_to_PW            float64 ...
    hflux_factor          float64 ...
    latent_heat_fusion    float64 ...
    latent_heat_vapor     float64 ...
    mass_to_Sv            float64 ...
    moc_components        (moc_comp) |S256 dask.array<chunksize=(3,), meta=np.ndarray>
    momentum_factor       float64 ...
    nsurface_t            float64 ...
    nsurface_u            float64 ...
    ocn_ref_salinity      float64 ...
    omega                 float64 ...
    ppt_to_salt           float64 ...
    radius                float64 ...
    rho_air               float64 ...
    rho_fw                float64 ...
    rho_sw                float64 ...
    salinity_factor       float64 ...
    salt_to_Svppt         float64 ...
    salt_to_mmday         float64 ...
    salt_to_ppt           float64 ...
    sea_ice_salinity      float64 ...
    sflux_factor          float64 ...
    sound                 float64 ...
    stefan_boltzmann      float64 ...
    time_bound            (time, d2) object dask.array<chunksize=(10806, 2), meta=np.ndarray>
    transport_components  (transport_comp) |S256 dask.array<chunksize=(5,), meta=np.ndarray>
    transport_regions     (transport_reg) |S256 dask.array<chunksize=(2,), meta=np.ndarray>
    vonkar                float64 ...
Attributes:
    Conventions:               CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
    NCO:                       4.3.4
    calendar:                  All years have exactly  365 days.
    cell_methods:              cell_methods = time: mean ==> the variable val...
    contents:                  Diagnostic and Prognostic Variables
    history:                   Thu Oct 10 08:38:35 2013: /glade/apps/opt/nco/...
    intake_esm_varname:        WVEL
    nco_openmp_thread_number:  1
    revision:                  $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
    source:                    CCSM POP2, the CCSM Ocean Component
    tavg_sum:                  2678400.0
    tavg_sum_qflux:            2678400.0
    title:                     b.e11.B1850C5CN.f09_g16.005

Based on this, I have two suggestions.

  1. All variables but WVEL should be coordinates, not data variables. This is easily accomplished with the following code:
    coord_vars = [vname for vname in ds.data_vars if 'time' not in ds[vname].dims]
    ds_fixed = ds.set_coords(coord_vars)
    It should be possible to fix this issue just by rewriting the zarr metadata, rather than re-outputting the whole dataset.
  2. The chunk choice on WVEL (and presumably other 3D variables) is, in my view, less than ideal:
    WVEL (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 480, 1, 384, 320), meta=np.ndarray>
    
    First, the chunks are on the large side (235.93 MB). Second, each vertical level is in a separate chunk, while 20 years of time are stored contiguously. If I want to get a complete 3D field for a single timestep, I therefore have to download over 14 GB of data. I recognize that the choice of chunks is subjective and depends on the use case. However, based on my experience working with ocean model output, I think the most common use case is to access all vertical levels in a single contiguous chunk. (This corresponds to how netCDF files are commonly output and is what people are used to.) I would recommend instead using chunks ds.WVEL.chunk({'time': 6, 'z_w_top': -1, 'nlon': -1, 'nlat': -1}), which would produce ~175 MB chunks.

I hope this feedback is useful.

@bonnland
Collaborator

bonnland commented Apr 12, 2020

@rabernat That is helpful feedback, and worth talking about IMHO, thank you. I will move forward with the chunking you suggest if I don't hear any objections in the next day or so.

This issue of which variables should be coordinates has come up before in discussions with @andersy005. In the original NetCDF files, these extra variables differ across variables, and possibly across ensemble members (for example, ULAT, ULONG, TLAT, and TLONG are missing in some cases). The differences can apparently prevent concatenation into Xarray objects from working properly. I'm not as clear as Anderson on the potential problems. At any rate, it's good that we can address the metadata later if needed. It means I can move forward with creating these variables now.

@rabernat
Author

The differences can apparently prevent concatenation into Xarray objects from working properly. I'm not as clear as Anderson on the potential problems

I can see how this could cause problems. However, I personally prefer to have all that stuff as coordinates. It's easy enough to just .reset_coords(drop=True) before any merge / alignment operations.

An even better option is to just drop all of the non-dimension coordinates before writing the zarr data, and then saving them to a standalone grid dataset, which can be brought in as needed for geometric calculations. That's what we did, for example, with the MITgcm LLC4320 dataset. You're currently wasting a non-negligible amount of space by storing all of these duplicate TAREA etc. variables in each of the ocean datasets.
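
A minimal sketch of that split, with an illustrative filter for "grid" variables (everything time-independent that is not a dimension coordinate) and illustrative local output paths:

# ds is a single-variable POP dataset as loaded above.
grid_vars = [v for v in ds.variables
             if 'time' not in ds[v].dims and v not in ds.dims]

grid = ds[grid_vars]               # static grid metrics and scalar constants
ds_thin = ds.drop_vars(grid_vars)  # just the field plus time/member coords

grid.to_zarr('ocn-grid.zarr', mode='w', consolidated=True)
ds_thin.to_zarr('cesmLE-CTRL-WVEL.zarr', mode='w', consolidated=True)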

@bonnland
Collaborator

OK, that sounds like good advice. I'm assuming that removal of these variables is also something that can be done retroactively. Be sure to let me know if this is not the case; otherwise I will go ahead with the same procedure we've been using for now (since we have to go back anyway to fix the metadata for our other Zarr stores).

@rabernat
Author

I'm assuming that removal of these variables is also something that can be done retroactively.

Should be as simple as deleting the directories for those variables and re-consolidating metadata.
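
For example, something along these lines on a directory-backed copy of a store (path taken from the earlier comment; the variable list is illustrative):

import shutil
import zarr

store = '/glade/scratch/bonnland/lens-aws/ocn/monthly/cesmLE-20C-SFWF.zarr'
for var in ['TAREA', 'TLAT', 'TLONG']:        # grid variables to drop
    shutil.rmtree(f'{store}/{var}', ignore_errors=True)

# Rebuild .zmetadata so the consolidated view no longer references them.
zarr.consolidate_metadata(store)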

@andersy005
Contributor

andersy005 commented Apr 12, 2020

You're currently wasting a non-negligible amount of space by storing all of these duplicate TAREA etc. variables in each of the ocean datasets.

It turns out that these variables consume ~20 MB per zarr store.

An even better option is to just drop all of the non-dimension coordinates before writing the zarr data, and then saving them to a standalone grid dataset, which can be brought in as needed for geometric calculations.

👍. Would this grid dataset include static variables only? It appears that the LLC4320_grid includes time as well. By static variables I am referring to scalars and time-independent variables:

>>> print(grid_vars)
['hflux_factor', 'nsurface_u', 'DXU', 'latent_heat_vapor', 'salt_to_Svppt', 'DYT', 'TLONG', 'DYU', 'HTE', 'rho_air', 'HU', 'ULONG', 'DXT', 'rho_sw', 'HUS', 'HUW', 'moc_components', 'TAREA', 'ULAT', 'REGION_MASK', 'grav', 'transport_regions', 'KMU', 'sound', 'omega', 'ANGLET', 'HT', 'UAREA', 'heat_to_PW', 'days_in_norm_year', 'salt_to_ppt', 'dzw', 'sea_ice_salinity', 'cp_air', 'salt_to_mmday', 'dz', 'fwflux_factor', 'TLAT', 'HTN', 'mass_to_Sv', 'radius', 'latent_heat_fusion', 'T0_Kelvin', 'salinity_factor', 'sflux_factor', 'transport_components', 'KMT', 'rho_fw', 'cp_sw', 'ocn_ref_salinity', 'vonkar', 'nsurface_t', 'ANGLE', 'stefan_boltzmann', 'ppt_to_salt', 'momentum_factor']

Removing these grid variables produces a clean xarray dataset:

<xarray.Dataset>
Dimensions:       (d2: 2, lat_aux_grid: 395, member_id: 40, moc_z: 61, nlat: 384, nlon: 320, time: 1872, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
  * z_t           (z_t) float32 500.0 1500.0 2500.0 ... 512502.8 537500.0
  * z_t_150m      (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
  * moc_z         (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
  * z_w_top       (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot       (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
  * lat_aux_grid  (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
  * z_w           (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * time          (time) object 1850-02-01 00:00:00 ... 2006-01-01 00:00:00
  * member_id     (member_id) int64 1 2 3 4 5 6 7 ... 34 35 101 102 103 104 105
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound    (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
    VVEL          (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
Attributes:
    nsteps_total:              750
    nco_openmp_thread_number:  1
    cell_methods:              cell_methods = time: mean ==> the variable val...
    tavg_sum:                  2592000.0
    tavg_sum_qflux:            2592000.0
    source:                    CCSM POP2, the CCSM Ocean Component
    contents:                  Diagnostic and Prognostic Variables

@jeffdlb
Contributor

jeffdlb commented Apr 12, 2020

@rabernat wrote:

The chunk choice on WVEL (and presumably other 3D variables) is, in my view, less than ideal...
First, the chunks are on the large side (235.93 MB). Second, each vertical level is in a separate chunk, while 20 years of time are stored contiguously.

FYI, for the 3D atmospheric data (at least monthly Q), each chunk contains all ensemble members, 12 months of data, and 2 levels:

<xarray.DataArray 'Q' (member_id: 40, time: 1032, lev: 30, lat: 192, lon: 288)>
dask.array<zarr, shape=(40, 1032, 30, 192, 288), dtype=float32, chunksize=(40, 12, 2, 192, 288), chunktype=numpy.ndarray>

If we were to put all 30 levels in one chunk then we'd need to divide something else by a factor of ~15. Perhaps the x-y dimension should be 4x4 chunks instead of global?

I know Anderson was striving for 100MB chunks but haven't checked the size of these. The ocean data have, I think, 60 levels instead of 30, so the problem is even worse.

Also, @jhamman stated at the start of this project that it is possible to re-chunk under the hood if we don't like the arrangement, but I'm curious about how you do that in practice given the immutability of objects in an object store.
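
(For reference, one of those Q chunks is 40 × 12 × 2 × 192 × 288 × 4 bytes ≈ 212 MB uncompressed, so it is already roughly double the ~100 MB target.)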

@cspencerjones
Contributor

I also just opened the data and had a look. I agree with Ryan that rechunking so that each chunk contains all vertical levels would be very helpful: oceanographers like to plot sections! I don't object to chunking more in time in order to achieve this. I also think that it's sensible to continue chunking by memberID, because I will want to write and test my code for one member and then operate on all the members only once or twice. I'll probably hold off doing anything more until this is a bit more sorted out. Thanks to everyone for putting in this effort!

@andersy005
Contributor

As an update I have re-chunked the data accordingly for all ocean variables:

<xarray.Dataset>
Dimensions:       (d2: 2, lat_aux_grid: 395, member_id: 40, moc_z: 61, nlat: 384, nlon: 320, time: 1872, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
  * z_t           (z_t) float32 500.0 1500.0 2500.0 ... 512502.8 537500.0
  * z_t_150m      (z_t_150m) float32 500.0 1500.0 2500.0 ... 13500.0 14500.0
  * moc_z         (moc_z) float32 0.0 1000.0 2000.0 ... 525000.94 549999.06
  * z_w_top       (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot       (z_w_bot) float32 1000.0 2000.0 3000.0 ... 525000.94 549999.06
  * lat_aux_grid  (lat_aux_grid) float32 -79.48815 -78.952896 ... 89.47441 90.0
  * z_w           (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * time          (time) object 1850-02-01 00:00:00 ... 2006-01-01 00:00:00
  * member_id     (member_id) int64 1 2 3 4 5 6 7 ... 34 35 101 102 103 104 105
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    time_bound    (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
    VVEL          (member_id, time, z_t, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>

As you can see, I removed the grid variables. I could use some feedback on my comment above in #34 (comment) regarding what needs to go into a standalone grid dataset. The re-chunked data are residing on GLADE for now, and I am ready to transfer them to S3 once the grid dataset has been sorted out.

@jeffdlb
Contributor

jeffdlb commented Apr 13, 2020

The re-chunked data are residing on GLADE for now, and am ready to transfer them to S3 once the grid dataset has been sorted out.

Does @jhamman have a strategy for re-chunking in place directly on AWS S3? I suspect this would require reading data from the old objects, creating the new objects in a separate bucket as scratch space, deleting the old objects, copying the new objects to the main bucket, deleting the new objects from the scratch bucket. I can create a scratch bucket under our AWS account if desired.
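
A rough sketch of that sequence for a single store, assuming write credentials and a hypothetical scratch bucket name; in practice a Dask cluster would be needed for anything this size:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem()   # credentials with write access to both buckets

src = fs.get_mapper('s3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr')
dst = fs.get_mapper('s3://ncar-cesm-lens-scratch/ocn/monthly/cesmLE-CTRL-WVEL.zarr')  # hypothetical scratch bucket

ds = xr.open_zarr(src, consolidated=True)
ds = ds.chunk({'time': 6, 'z_w_top': -1, 'nlat': -1, 'nlon': -1})
for v in ds.variables:
    ds[v].encoding.pop('chunks', None)   # let to_zarr use the new chunking

ds.to_zarr(dst, mode='w', consolidated=True)
# ...then delete the old objects and copy the rewritten store into the main
# bucket (e.g. with `aws s3 rm --recursive` / `aws s3 cp --recursive`).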

@rabernat
Author

This is a minor nit, but I personally prefer time_bound to also be in coords, not data_vars. Then you will just have one data variable per dataset, which has a nice, clean feel.

@rabernat
Author

Also, there appear to be quite a few coordinates that are not used by the data variables. These could probably be removed as well.

@bonnland
Collaborator

bonnland commented Apr 13, 2020

I have created the Zarr files for TAUX and TAUY, but I chose to place all members in a single chunk because the chunks are so much smaller (these are 2D variables, so each chunk would be 1/60 the size of a 3D variable chunk).

But because I didn't perform the same metadata operations as @andersy005, and because they are fast to recreate, I will let Anderson make these also.

@andersy005
Contributor

As an update, I updated the chunking scheme for all existing ocean variables on AWS-S3, removed the grid variables from the zarr stores, and created a standalone grid zarr store:

In [2]: import intake
   ...: url = 'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json'
   ...: col = intake.open_esm_datastore(url)
   ...: subset = col.search(component='ocn')

In [3]: subset.unique(columns=['variable', 'experiment', 'frequency'])
Out[3]:
{'variable': {'count': 11,
  'values': ['SALT',
   'SFWF',
   'SHF',
   'SSH',
   'SST',
   'TEMP',
   'UVEL',
   'VNS',
   'VNT',
   'VVEL',
   'WVEL']},
 'experiment': {'count': 3, 'values': ['20C', 'CTRL', 'RCP85']},
 'frequency': {'count': 1, 'values': ['monthly']}}
In [1]: import s3fs
   ...: import xarray as xr
   ...:
   ...: fs = s3fs.S3FileSystem(anon=True)
   ...: s3_path = 's3://ncar-cesm-lens/ocn/monthly/cesmLE-CTRL-WVEL.zarr'
   ...: ds = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
   ...: ds
Out[1]:
<xarray.Dataset>
Dimensions:     (d2: 2, member_id: 1, nlat: 384, nlon: 320, time: 21612, z_w_top: 60)
Coordinates:
  * member_id   (member_id) int64 1
  * time        (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
    time_bound  (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
  * z_w_top     (z_w_top) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
Dimensions without coordinates: d2, nlat, nlon
Data variables:
    WVEL        (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>
Attributes:
    Conventions:               CF-1.0; http://www.cgd.ucar.edu/cms/eaton/netc...
    NCO:                       4.3.4
    calendar:                  All years have exactly  365 days.
    cell_methods:              cell_methods = time: mean ==> the variable val...
    contents:                  Diagnostic and Prognostic Variables
    nco_openmp_thread_number:  1
    revision:                  $Id: tavg.F90 41939 2012-11-14 16:37:23Z mlevy...
    source:                    CCSM POP2, the CCSM Ocean Component
    tavg_sum:                  2678400.0
    tavg_sum_qflux:            2678400.0
    title:                     b.e11.B1850C5CN.f09_g16.005

In [2]: s3_path = 's3://ncar-cesm-lens/ocn/grid.zarr'

In [3]: grid = xr.open_zarr(fs.get_mapper(s3_path), consolidated=True)
In [6]: xr.merge([ds, grid])
Out[6]:
<xarray.Dataset>
Dimensions:               (d2: 2, lat_aux_grid: 395, member_id: 1, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 21612, transport_comp: 5, transport_reg: 2, z_t: 1, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
  * member_id             (member_id) int64 1
  * time                  (time) object 0400-02-01 00:00:00 ... 2201-01-01 00:00:00
    time_bound            (time, d2) object dask.array<chunksize=(6, 2), meta=np.ndarray>
  * z_w_top               (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
    ANGLE                 (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    ANGLET                (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    DXT                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    DXU                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    DYT                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    DYU                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HT                    (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HTE                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HTN                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HU                    (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HUS                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    HUW                   (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    KMT                   (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
    KMU                   (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
    REGION_MASK           (nlat, nlon) float64 dask.array<chunksize=(192, 320), meta=np.ndarray>
    T0_Kelvin             float64 ...
    TAREA                 (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    TLAT                  (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    TLONG                 (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    UAREA                 (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    ULAT                  (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    ULONG                 (nlat, nlon) float64 dask.array<chunksize=(192, 160), meta=np.ndarray>
    cp_air                float64 ...
    cp_sw                 float64 ...
    days_in_norm_year     timedelta64[ns] ...
    dz                    (z_t) float32 dask.array<chunksize=(1,), meta=np.ndarray>
    dzw                   (z_w) float32 dask.array<chunksize=(60,), meta=np.ndarray>
    fwflux_factor         float64 ...
    grav                  float64 ...
    heat_to_PW            float64 ...
    hflux_factor          float64 ...
  * lat_aux_grid          (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
    latent_heat_fusion    float64 ...
    latent_heat_vapor     float64 ...
    mass_to_Sv            float64 ...
    moc_components        (moc_comp) |S256 dask.array<chunksize=(3,), meta=np.ndarray>
  * moc_z                 (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
    momentum_factor       float64 ...
    nsurface_t            float64 ...
    nsurface_u            float64 ...
    ocn_ref_salinity      float64 ...
    omega                 float64 ...
    ppt_to_salt           float64 ...
    radius                float64 ...
    rho_air               float64 ...
    rho_fw                float64 ...
    rho_sw                float64 ...
    salinity_factor       float64 ...
    salt_to_Svppt         float64 ...
    salt_to_mmday         float64 ...
    salt_to_ppt           float64 ...
    sea_ice_salinity      float64 ...
    sflux_factor          float64 ...
    sound                 float64 ...
    stefan_boltzmann      float64 ...
    transport_components  (transport_comp) |S256 dask.array<chunksize=(5,), meta=np.ndarray>
    transport_regions     (transport_reg) |S256 dask.array<chunksize=(2,), meta=np.ndarray>
    vonkar                float64 ...
  * z_t                   (z_t) float32 500.0
  * z_t_150m              (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * z_w_bot               (z_w_bot) float32 1000.0 2000.0 ... 549999.06
Dimensions without coordinates: d2, moc_comp, nlat, nlon, transport_comp, transport_reg
Data variables:
    WVEL                  (member_id, time, z_w_top, nlat, nlon) float32 dask.array<chunksize=(1, 6, 60, 384, 320), meta=np.ndarray>

@jeffdlb
Contributor

jeffdlb commented Apr 28, 2020

As an update, I updated the chunking scheme for all existing ocean variables on AWS-S3, removed the grid variables from the zarr stores, and created a standalone grid zarr store

@andersy005 Did you have to create the new Zarr on GLADE and then delete/upload/replace the Zarr stores on S3, or was it possible to re-chunk in place on AWS?

@jeffdlb
Contributor

jeffdlb commented Apr 28, 2020

I am updating the dataset landing page to include the new variables.

QUESTION: We added VNS & VNT (salt and heat fluxes in y-direction). Shouldn't we also include UES & UET (salt and heat fluxes in x-direction), and maybe WTS & WTT (fluxes across top face)? I don't see how only one component of the flux vectors can be useful.

@bonnland
Collaborator

bonnland commented Apr 28, 2020

I am updating the dataset landing page to include the new variables.

Hi Jeff, those variables are actually in transit now. I was going to announce their availability for performance testing after the transfer was completed. Once they have been transferred, I will update the catalog for AWS users. The variables in transit are:

3D variables: DIC, DOC, UES, UET, WTS, WTT, PD

2D variables: TAUX, TAUY, TAUX2, TAUY2, QFLUX, FW, HMXL, QSW_HTP, QSW_HBL, SHF_QSW, SFWF_WRST, RESID_S, RESID_T

It has been an uphill climb to understand the difficulties of creating very large Zarr stores; the Dask workers were bogging down and crashing at first, but eventually I began understanding what configurations would lead to successful Zarr saves.

@jeffdlb
Contributor

jeffdlb commented Apr 28, 2020

@bonnland Excellent! Thank you very much. I will update the landing page to include those (but not publish until you are ready).

@jeffdlb
Contributor

jeffdlb commented Apr 28, 2020

FYI the draft unpublished landing page with recent updates is temporarily at
CESM_LENS_on_AWS.20200428.htm

@bonnland
Collaborator

@cspencerjones @rabernat @jbusecke Transfer of new ocean data is complete and available on Amazon AWS. It would be very helpful if someone could try a nontrivial computation with the data to make sure performance based on our chunking scheme is adequate.

I've confirmed that the Binder notebook on Amazon works (see the README.md for the link), and the variables are visible in the catalog. Here is what I got:

import intake
intakeEsmUrl = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
col = intake.open_esm_datastore(intakeEsmUrl)

subset = col.search(component='ocn')
subset.unique(columns=['variable', 'experiment', 'frequency'])

{'variable': {'count': 32,
  'values': ['DIC',
   'DOC',
   'FW',
   'HMXL',
   'O2',
   'PD',
   'QFLUX',
   'QSW_HBL',
   'QSW_HTP',
   'RESID_S',
   'RESID_T',
   'SALT',
   'SFWF',
   'SFWF_WRST',
   'SHF',
   'SHF_QSW',
   'SSH',
   'SST',
   'TAUX',
   'TAUX2',
   'TAUY',
   'TAUY2',
   'TEMP',
   'UES',
   'UET',
   'UVEL',
   'VNS',
   'VNT',
   'VVEL',
   'WTS',
   'WTT',
   'WVEL']},
 'experiment': {'count': 3, 'values': ['20C', 'CTRL', 'RCP85']},
 'frequency': {'count': 1, 'values': ['monthly']}}
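
For anyone who wants to try, here is a minimal sketch of one possible check: an area-weighted global-mean SST time series using the standalone grid store. The weighting details are illustrative, not a validated analysis:

import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
ds = xr.open_zarr(fs.get_mapper('s3://ncar-cesm-lens/ocn/monthly/cesmLE-20C-SST.zarr'),
                  consolidated=True)
grid = xr.open_zarr(fs.get_mapper('s3://ncar-cesm-lens/ocn/grid.zarr'),
                    consolidated=True)

wgt = grid.TAREA.where(grid.KMT > 0)                       # ocean points only
sst_mean = (ds.SST * wgt).sum(['nlat', 'nlon']) / wgt.sum(['nlat', 'nlon'])
sst_mean = sst_mean.mean('member_id').load()               # pulls many chunks from S3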

@cspencerjones
Contributor

I tried a few things with the data this morning, including calculating density from temperature and salinity and plotting sections, transforming some variables to density coordinates and plotting time means, etc. I tried using multiple workers as well. This worked OK, and I think the performance is adequate.

@bonnland
Collaborator

bonnland commented Apr 30, 2020

That's great to hear; we can tentatively move forward with the remaining variables requested so far. They are all 3D variables:

UVEL2, VVEL2
HDIFB_SALT, HDIFB_TEMP,
HDIFE_SALT, HDIFE_TEMP
HDIFN_SALT, HDIFN_TEMP
KAPPA_ISOP, KAPPA_THIC
KPP_SRC_SALT, KPP_SRC_TEMP
VNT_ISOP, VNT_SUBM
HOR_DIFF

I've spent some time looking at MOC, which has a different dimension structure than the other variables. Any thoughts on chunking are appreciated. At first glance, it seems we want to chunk in time and leave all other dimensions unchunked, aiming for a chunk size between 100 and 200 MB.

netcdf b.e11.B20TRLENS_RCP85.f09_g16.xbmb.010.pop.h.MOC.192001-202912 {
dimensions:
	d2 = 2 ;
	time = UNLIMITED ; // (1320 currently)
	moc_comp = 3 ;
	transport_comp = 5 ;
	transport_reg = 2 ;
	lat_aux_grid = 395 ;
	moc_z = 61 ;
	nlon = 320 ;
	nlat = 384 ;

	float MOC(time, transport_reg, moc_comp, moc_z, lat_aux_grid) ;
		MOC:_FillValue = 9.96921e+36f ;
		MOC:long_name = "Meridional Overturning Circulation" ;
		MOC:units = "Sverdrups" ;
		MOC:coordinates = "lat_aux_grid moc_z moc_components transport_region time" ;
		MOC:cell_methods = "time: mean" ;
		MOC:missing_value = 9.96921e+36f ;
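
A rough size check for that scheme (illustrative arithmetic, not a final choice):

# One MOC time slice: transport_reg * moc_comp * moc_z * lat_aux_grid * 4 bytes
bytes_per_month = 2 * 3 * 61 * 395 * 4             # ~0.58 MB
months_per_chunk = 240                              # e.g. 20 years
print(bytes_per_month * months_per_chunk / 1e6)     # ~139 MB, inside the 100-200 MB target
# i.e. roughly ds.MOC.chunk({'time': 240}) with all other dimensions left whole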

@jeffdlb
Contributor

jeffdlb commented May 5, 2020

FYI the draft unpublished landing page with recent updates is temporarily at
CESM_LENS_on_AWS.20200428.htm

Now that the new data have been uploaded, I believe I can publish this draft as the new landing page.
QUESTION: Does the page need to say anything about the new approach to repeated grid variables, or is that completely transparent to the user?

@bonnland
Collaborator

bonnland commented May 5, 2020

QUESTION: Does the page need to say anything about the new approach to repeated grid variables, or is that completely transparent to the user?

There are still small inconsistencies to work out, AFAIK. Unless I am mistaken, Anderson republished all the ocean data with the grid variables removed, but the grid variables still coexist with the data in the atmospheric stores, and those grid variables are probably distinct from the ocean ones.

The separate grid variables have been pushed to AWS, but they don't quite fit into our catalog framework yet, which is not general enough to handle variables that are shared across experiments (CTRL, 20C, RCP85, etc.). So users can't load the grid variables until we generalize the catalog logic to make them available.

And I'm not yet clear on whether transparent loading of these variables is a simple matter. A simpler option, from a data-provider engineering perspective, would be to modify the Kay notebook to show how grid variables are loaded for area-based computations, which would require republishing the atmosphere variables. So, some kinks are left to work out.
