# Using the ACCESS-NRI Intake catalog as an alternative to the COSIMA Cookbook

This notebook demonstrates how to use the ACCESS-NRI Intake catalog and highlights key similarities to, and differences from, the COSIMA cookbook.

This notebook is a concise version of the longer [ACCESS-NRI Intake catalog documentation](https://access-nri-intake-catalog.readthedocs.io/) and the related [COSIMA training workshop](https://github.com/ACCESS-Hive/cosima-training-workshop-2023/blob/main/Intake.ipynb). Users are encouraged to refer to these for more detail and demonstrations. At the time of writing (Oct 2023), the ACCESS-NRI Intake catalog is under testing and feedback from users is requested.

Requirements: The `conda/analysis3` (tested on `analysis3-23.07`) module from `/g/data/hh5/public/modules`. 

# Too Long; Didn't Read:

The most commonly used function from the COSIMA cookbook is `querying.getvar`, e.g.:

```python
import cosima_cookbook as cc

session = cc.database.create_session()

da = cc.querying.getvar(
    expt="expt0", 
    variable="var0", 
    session=session, 
    frequency="1 monthly"
)
```

Using the ACCESS-NRI Intake catalog, the same data can be obtained with:

```python
import intake

catalog = intake.cat.access_nri

ds = catalog["expt0"].search(
    variable="var0", 
    frequency="1mon"
).to_dask()

da = ds["var0"]
```

# Start a dask Client

This is not specific to using the ACCESS-NRI Intake catalog. You would do the same thing using the COSIMA cookbook.

In [1]:
from os import environ
environ["PYTHONWARNINGS"] = "ignore"

In [2]:
from dask.distributed import Client

client = Client(threads_per_worker=1)
client

Client: connection method: Cluster object, cluster type: distributed.LocalCluster, status: running, using processes: True
Dashboard: /proxy/8787/status
Workers: 12, total threads: 12 (1 per worker), total memory: 46.00 GiB (3.83 GiB per worker)


# Opening and searching the catalog

To use the ACCESS-NRI Intake catalog, we need to import `intake`

In [3]:
import intake
import cosima_cookbook as cc # We'll also load the cookbook for comparisons later

We can open the catalog as follows. This is somewhat similar to starting a database session with the COSIMA cookbook. The returned object `catalog` is an instance of the ACCESS-NRI Intake catalog that we can use to find and load data.

In [4]:
catalog = intake.cat.access_nri

Printing the `catalog` object will return a dataframe of experiments that you can browse:

In [5]:
catalog

name,model,description,realm,frequency,variable
01deg_jra55v13_ryf9091,{ACCESS-OM2},{0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.3 RYF9091 repeat year forcing (May 1990 to Apr 1991)},"{seaIce, ocean}","{1mon, 3mon, fx, 1day, 3hr}","{temp_submeso, average_T2, vvel_m, ty_trans_nrho_submeso, aice_m, eta_global, surface_salt, dzt, flatn_ai_m, fswup_m, sw_heat, swflx, temp_yflux_adv, sig2_m, alidf_ai_m, sfc_hflux_pme, wfiform, fm..."
01deg_jra55v140_iaf,{ACCESS-OM2},{Cycle 1 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{seaIce, ocean}","{fx, 1mon, 1day}","{average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, aice_m, eta_global, surface_salt, dzt, flatn_ai_m, fswup_m, swflx, temp_yflux_adv, alidf_ai_m, sfc_hflux_pme, wfiform, temp_int_rhodz, fme..."
01deg_jra55v140_iaf_cycle2,{ACCESS-OM2},{Cycle 2 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{seaIce, ocean}","{fx, 1mon, 1day}","{average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, aice_m, eta_global, surface_salt, dzt, flatn_ai_m, fswup_m, swflx, temp_yflux_adv, alidf_ai_m, sfc_hflux_pme, wfiform, temp_int_rhodz, fme..."
01deg_jra55v140_iaf_cycle3,{ACCESS-OM2},{Cycle 3 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{seaIce, ocean}","{fx, 1mon, 1day}","{average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, aice_m, eta_global, surface_salt, dzt, flatn_ai_m, fswup_m, swflx, temp_yflux_adv, alidf_ai_m, sfc_hflux_pme, wfiform, temp_int_rhodz, fme..."
01deg_jra55v140_iaf_cycle4,{ACCESS-OM2},{Cycle 4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{seaIce, ocean}","{1mon, fx, 6hr, 1day, 3hr}","{average_T2, eta_nonbouss, src03, alvdr_ai, temp_yflux_adv, alidf_ai_m, det_xflux_adv, meltt, no3_intmld, uarea, total_ocean_heat, uvel_h, v, HTE, strength_m, dyt, total_ocean_pme_river, total_oce..."
01deg_jra55v140_iaf_cycle4_jra55v150_extension,{ACCESS-OM2},{Extensions of cycle 4 of 0.1 degree ACCESS-OM2 + WOMBAT BGC global model configuration with JRA55-do v1.5.0 and v1.5.0.1 interannual forcing},"{seaIce, ocean}","{subhr, fx, 1mon, 1day}","{no3_zflux_adv, average_T2, src05, eta_nonbouss, vvel_m, pprod_gross_int100, ty_trans_nrho_submeso, fe_yflux_adv, albsni, aice_m, eta_global, dic, npp3d, surface_salt, npp1, src03, dzt, flatn_ai_m..."
01deg_jra55v150_iaf_cycle1,{ACCESS-OM2},{Cycle 1 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.5.0 OMIP2 interannual forcing},"{seaIce, ocean}","{fx, 1mon, 1day}","{average_T2, eta_nonbouss, ty_trans_nrho_submeso, aice_m, surface_salt, dzt, swflx, wfiform, sfc_hflux_pme, temp_int_rhodz, evap, dxt, bmf_v, v, dyt, fprec, u, dyu, geolon_t, sens_heat, pme_river,..."
025deg_jra55_iaf_omip2_cycle1,{ACCESS-OM2},{Cycle 1/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{seaIce, ocean}","{fx, 1yr, 1mon, 1day}","{temp_submeso, average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, flwup_ai_m, aice_m, eta_global, strcorx_m, salt_vdiffuse_diff_cbt_conv, fswup_m, sw_heat, flatn_ai_m, swflx, neutral_diffusi..."
025deg_jra55_iaf_omip2_cycle2,{ACCESS-OM2},{Cycle 1/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{seaIce, ocean}","{fx, 1yr, 1mon, 1day}","{temp_submeso, average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, flwup_ai_m, aice_m, eta_global, strcorx_m, salt_vdiffuse_diff_cbt_conv, fswup_m, sw_heat, flatn_ai_m, swflx, neutral_diffusi..."
025deg_jra55_iaf_omip2_cycle3,{ACCESS-OM2},{Cycle 3/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{seaIce, ocean}","{fx, 1yr, 1mon, 1day}","{temp_submeso, average_T2, eta_nonbouss, vvel_m, ty_trans_nrho_submeso, flwup_ai_m, aice_m, eta_global, strcorx_m, salt_vdiffuse_diff_cbt_conv, fswup_m, sw_heat, flatn_ai_m, swflx, neutral_diffusi..."


You can also search based on the columns in this dataframe to find experiments that are relevant to you. For example, you might be interested in all ACCESS-OM2 experiments that have the variable `"surface_salt"` at daily frequency. There are 6 such experiments currently available through the catalog:

In [6]:
catalog.search(model="ACCESS-OM2", variable="surface_salt", frequency="1day")

name,model,description,realm,frequency,variable
01deg_jra55v13_ryf9091,{ACCESS-OM2},{0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.3 RYF9091 repeat year forcing (May 1990 to Apr 1991)},{ocean},{1day},{surface_salt}
01deg_jra55v140_iaf,{ACCESS-OM2},{Cycle 1 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},{ocean},{1day},{surface_salt}
01deg_jra55v140_iaf_cycle2,{ACCESS-OM2},{Cycle 2 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},{ocean},{1day},{surface_salt}
01deg_jra55v140_iaf_cycle3,{ACCESS-OM2},{Cycle 3 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},{ocean},{1day},{surface_salt}
01deg_jra55v140_iaf_cycle4,{ACCESS-OM2},{Cycle 4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},{ocean},{1day},{surface_salt}
01deg_jra55v140_iaf_cycle4_jra55v150_extension,{ACCESS-OM2},{Extensions of cycle 4 of 0.1 degree ACCESS-OM2 + WOMBAT BGC global model configuration with JRA55-do v1.5.0 and v1.5.0.1 interannual forcing},{ocean},{1day},{surface_salt}


In this way, the catalog provides similar functionality to the COSIMA cookbook Database Explorer tool.

# Opening data

There are [multiple ways](https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/quickstart.html#loading-intake-sources) to open data from the experiments in `catalog`. Here we'll demonstrate how to do this when you know the name of the experiment you are interested in, since this is typical for COSIMA users.

For example, we can open monthly data for the `surface_salt` variable in the `01deg_jra55v13_ryf9091` experiment as follows:

In [7]:
experiment = "01deg_jra55v13_ryf9091"
variable = "surface_salt"

In [8]:
data_ic = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask()

In [9]:
data_ic["surface_salt"]

| | Array | Chunk |
|---|---|---|
| Bytes | 121.67 GiB | 2.32 MiB |
| Shape | (3360, 2700, 3600) | (1, 675, 900) |
| Dask graph | 53760 chunks in 2233 graph layers | 53760 chunks in 2233 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


We can open the same data, using the COSIMA cookbook function `querying.getvar`:

In [10]:
session = cc.database.create_session()

data_cc = cc.querying.getvar(
    expt=experiment,
    variable=variable,
    session=session,
    frequency="1 monthly",
)

In [11]:
data_cc

| | Array | Chunk |
|---|---|---|
| Bytes | 121.67 GiB | 2.32 MiB |
| Shape | (3360, 2700, 3600) | (1, 675, 900) |
| Dask graph | 53760 chunks in 2233 graph layers | 53760 chunks in 2233 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


# Some important differences

There are a few differences in functionality between the ACCESS-NRI Intake catalog and the COSIMA cookbook that users should be aware of.

## 1. The cookbook returns DataArrays, whereas the catalog returns Datasets

This is because with the catalog you can load multiple variables into a single dataset with a single call (when these variables are in the same file). E.g.:

In [12]:
data_ic_multivar = catalog[experiment].search(
    variable=["surface_salt", "surface_temp"], 
    frequency="1mon"
).to_dask()

In [13]:
data_ic_multivar

Each of the two data variables in the returned Dataset has the following dask array layout:

| | Array | Chunk |
|---|---|---|
| Bytes | 121.67 GiB | 2.32 MiB |
| Shape | (3360, 2700, 3600) | (1, 675, 900) |
| Dask graph | 53760 chunks in 2233 graph layers | 53760 chunks in 2233 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |
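
If you want the catalog to hand back a DataArray, as `getvar` does, you can select the variable from the returned Dataset. The helper below is a minimal sketch of that pattern. The name `getvar_from_catalog` is hypothetical (it is not part of the cookbook or the catalog), and the sketch assumes the search matches a single dataset so that `to_dask()` succeeds.

```python
def getvar_from_catalog(catalog, expt, variable, frequency, **to_dask_kwargs):
    """Return one variable as a DataArray, mimicking cookbook-style usage.

    Hypothetical convenience wrapper: `catalog` is assumed to behave like
    `intake.cat.access_nri`, i.e. indexing by experiment name gives an object
    with `.search(...)`, whose result has `.to_dask(...)`.
    """
    # Restrict the experiment to the requested variable and frequency
    subset = catalog[expt].search(variable=variable, frequency=frequency)
    # to_dask() returns a Dataset; indexing by name selects the DataArray
    ds = subset.to_dask(**to_dask_kwargs)
    return ds[variable]
```

With the objects defined earlier in this notebook, this could be called as `getvar_from_catalog(catalog, experiment, "surface_salt", "1mon")`.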


## 2. The catalog knows which files make up distinct datasets

The following fails with the cookbook because it doesn't know which frequency of data to load:

In [14]:
data_cc_multifreq = cc.querying.getvar(
    expt=experiment,
    variable=variable,
    session=session,
)

QueryWarning: Your query returns files with differing frequencies: {'1 monthly', '1 daily'}. This could lead to unexpected behaviour! Disambiguate by passing frequency= to getvar, specifying the desired frequency.

Additionally, there are cases where the cookbook will [silently load and concatenate data on different grids](https://github.com/COSIMA/cosima-recipes/issues/229), because it does not know which files in the database should/shouldn't be concatenated together.

The catalog knows which files make up distinct datasets and provides methods to open multiple datasets from a single query. We can run the equivalent of the cell above with the catalog by using `to_dataset_dict()` rather than `to_dask()`. Doing so returns a dictionary containing Datasets of the variable at all the available frequencies (daily and monthly in this case).

In [15]:
data_ic_multifreq = catalog[experiment].search(variable=variable).to_dataset_dict()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'file_id.frequency'


In [16]:
data_ic_multifreq

{'ocean_daily.1day': <xarray.Dataset>
 Dimensions:       (time: 81580, yt_ocean: 2700, xt_ocean: 3600)
 Coordinates:
   * xt_ocean      (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
   * yt_ocean      (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
   * time          (time) object 1956-04-01 12:00:00 ... 2179-12-31 12:00:00
 Data variables:
     surface_salt  (time, yt_ocean, xt_ocean) float32 dask.array<chunksize=(1, 675, 900), meta=np.ndarray>
 Attributes:
     filename:                        ocean_daily.nc
     title:                           ACCESS-OM2-01
     grid_type:                       mosaic
     grid_tile:                       1
     intake_esm_vars:                 ['surface_salt']
     intake_esm_attrs:realm:          ocean
     intake_esm_attrs:frequency:      1day
     intake_esm_attrs:filename:       ocean_daily.nc
     intake_esm_attrs:file_id:        ocean_daily
     intake_esm_attrs:_data_format_:  netcdf
     intake_esm_dataset_key
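
Because the dictionary keys follow the `'file_id.frequency'` pattern reported above, picking out the dataset for one frequency is a matter of splitting each key. The function below is a small sketch of that idea. `select_by_frequency` is a hypothetical helper, not part of the catalog API, and it assumes the key pattern shown in the output above.

```python
def select_by_frequency(datasets, frequency):
    """Return the entries of `datasets` whose key ends with '.<frequency>'.

    The result is keyed by file_id only. Hypothetical helper for illustration.
    """
    selected = {}
    for key, ds in datasets.items():
        # Keys look like 'ocean_daily.1day'; split on the last '.' only,
        # in case a file_id itself contains a dot
        file_id, _, freq = key.rpartition(".")
        if freq == frequency:
            selected[file_id] = ds
    return selected
```

For example, `select_by_frequency(data_ic_multifreq, "1day")` would return only the daily dataset from the dictionary above.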

Alternatively, multiple datasets can be opened directly into an [xarray-datatree](https://xarray-datatree.readthedocs.io/en/latest/) by calling `to_datatree` rather than `to_dataset_dict`. In an upcoming release, it will be easier for users to control how the groups are structured in the datatree. E.g.:

In [17]:
data_ic_datatree = catalog[experiment].search(variable=variable).to_datatree()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'file_id/frequency'


In [18]:
data_ic_datatree

Daily data:

| | Array | Chunk |
|---|---|---|
| Bytes | 2.88 TiB | 2.32 MiB |
| Shape | (81580, 2700, 3600) | (1, 675, 900) |
| Dask graph | 1305280 chunks in 1789 graph layers | 1305280 chunks in 1789 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |

Monthly data:

| | Array | Chunk |
|---|---|---|
| Bytes | 121.67 GiB | 2.32 MiB |
| Shape | (3360, 2700, 3600) | (1, 675, 900) |
| Dask graph | 53760 chunks in 2233 graph layers | 53760 chunks in 2233 graph layers |
| Data type | float32 numpy.ndarray | float32 numpy.ndarray |


## 3. The frequency vocabulary is different

With the cookbook `querying.getvar` function, the provided frequency must be one of (where `<int>` is an integer):

```python
"static"
"<int> hourly"
"<int> daily"
"<int> monthly"
"<int> yearly"
```  

In the catalog, frequency follows a standard vocabulary that is very similar to CMIP6:

```python
"fx" # fixed
"subhr" # subhourly
"<int>hr" # hourly
"<int>day" # daily
"<int>mon" # monthly
"<int>yr" # yearly
"<int>dec" # decadal
```
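
When porting existing cookbook code, it can help to translate frequency strings mechanically. The function below is a minimal sketch based only on the two vocabularies listed above. `cookbook_to_catalog_frequency` is a hypothetical helper that is not provided by either package. The catalog-only values `"subhr"` and `"<int>dec"` have no counterpart in the cookbook list, so they are not handled.

```python
import re

# Cookbook unit word -> catalog suffix, following the two lists above
_UNIT_TO_SUFFIX = {"hourly": "hr", "daily": "day", "monthly": "mon", "yearly": "yr"}


def cookbook_to_catalog_frequency(frequency):
    """Translate e.g. '1 monthly' to '1mon' and 'static' to 'fx'."""
    if frequency == "static":
        return "fx"
    # Cookbook frequencies have the form '<int> <unit>'
    match = re.fullmatch(r"(\d+) (hourly|daily|monthly|yearly)", frequency)
    if match is None:
        raise ValueError(f"Unrecognised cookbook frequency: {frequency!r}")
    count, unit = match.groups()
    return f"{count}{_UNIT_TO_SUFFIX[unit]}"
```

For example, `cookbook_to_catalog_frequency("1 monthly")` gives the `"1mon"` value used in the catalog searches in this notebook.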

## 4. Passing keyword arguments is different

With the cookbook `querying.getvar` function, additional keyword arguments to be passed to xarray's `open_mfdataset` can be passed directly to the `getvar` function. For example, one might specify the chunks and combining arguments using:

In [19]:
data_cc_kw = cc.querying.getvar(
    expt=experiment,
    variable=variable,
    session=session,
    frequency="1 monthly",
    chunks={"xt_ocean": -1, "yt_ocean": -1},
    compat="override",
    data_vars="minimal",
    coords="minimal"
)

With the catalog, keyword arguments for xarray's `open_dataset` and `combine_by_coords` functions are passed separately to `to_dask` (or `to_dataset_dict`). For example, to do the same as the previous cell:

In [20]:
xarray_open_kwargs=dict(
    chunks={"xt_ocean": -1, "yt_ocean": -1}
)
xarray_combine_by_coords_kwargs=dict(
    compat="override",
    data_vars="minimal",
    coords="minimal"
)

data_ic_kw = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask(
    xarray_open_kwargs=xarray_open_kwargs,
    xarray_combine_by_coords_kwargs=xarray_combine_by_coords_kwargs,
)
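
If you already have a flat dictionary of `getvar`-style keyword arguments, it can be split into the two groups programmatically. The sketch below does this using the keyword names accepted by xarray's `combine_by_coords`. `split_getvar_kwargs` is a hypothetical helper, and it assumes that everything else is meant for `open_dataset`. Arguments specific to `open_mfdataset`, such as `parallel` or `preprocess`, would need separate handling.

```python
# Keyword arguments accepted by xarray.combine_by_coords
COMBINE_KEYS = {"compat", "data_vars", "coords", "fill_value", "join", "combine_attrs"}


def split_getvar_kwargs(kwargs):
    """Split flat keyword arguments into the two dictionaries used by to_dask.

    Hypothetical helper: anything not recognised as a combine_by_coords
    argument is assumed to be an open_dataset argument.
    """
    open_kwargs = {k: v for k, v in kwargs.items() if k not in COMBINE_KEYS}
    combine_kwargs = {k: v for k, v in kwargs.items() if k in COMBINE_KEYS}
    return {
        "xarray_open_kwargs": open_kwargs,
        "xarray_combine_by_coords_kwargs": combine_kwargs,
    }
```

The returned dictionary can be unpacked straight into the call, e.g. `.to_dask(**split_getvar_kwargs(my_kwargs))`, where `my_kwargs` holds the extra arguments you previously passed to `getvar`.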

## 5. You cannot search by start and end date using the catalog

With the cookbook, it's common to specify a time *range* in the `getvar` query. E.g.:

In [21]:
start_time = "2000-01-01"
end_time = "2180-01-01"

data_cc_times = cc.querying.getvar(
    expt=experiment,
    variable=variable,
    session=session,
    frequency="1 monthly",
    start_time=start_time,
    end_time=end_time,
)

data_cc_times = data_cc_times.sel(time=slice(start_time, end_time))

It's not possible to query on a time range with the Intake catalog. Instead, open the full dataset and then select the time range with xarray (this takes a few seconds longer):

In [22]:
data_ic = catalog[experiment].search(
    variable=variable, 
    frequency="1mon"
).to_dask()

data_ic_times = data_ic.sel(time=slice(start_time, end_time))

This difference is acceptable because datasets are opened [lazily](https://docs.xarray.dev/en/stable/user-guide/dask.html#parallel-computing-with-dask) and in parallel, so opening all files and then reducing the time range with xarray's `sel` method doesn't add much overhead. In most cases where the overhead of opening the files seems large, it can be reduced through sensible choices of the keyword arguments provided to `open_dataset` and `combine_by_coords`; see the xarray documentation on [Reading multi-file datasets](https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets) for details.

# Tips, gotchas and workarounds

## 1. Speeding up opening your datasets

Try passing the following argument to your `to_dask` or `to_dataset_dict` call:

```python
xarray_combine_by_coords_kwargs=dict(
    compat="override",
    data_vars="minimal",
    coords="minimal"
)
```

See the xarray documentation on [Reading multi-file datasets](https://docs.xarray.dev/en/stable/user-guide/io.html#reading-multi-file-datasets) for more details about these arguments.

## 2. Choosing chunksizes

Choosing chunk sizes well when you open datasets can greatly improve the speed of your analysis. Check out the [Chunking tutorial](https://access-nri-intake-catalog.readthedocs.io/en/latest/usage/chunking.html) in the ACCESS-NRI Intake catalog documentation.

## 3. Loading time-invariant variables

Many COSIMA experiments include multiple repeated files containing the same fixed-frequency data (e.g. grid information). Both the cookbook and the catalog will fail to concatenate these files, since they contain no clear dimension to concatenate along. The workaround with the cookbook is to specify the argument `n=1`, which tells the cookbook to return only data from the first file, e.g.

In [23]:
data_cc_fixed = cc.querying.getvar(
    expt=experiment,
    variable="area_t",
    session=session,
    n=1,
)

There is no equivalent `n` argument with the catalog, but you can further restrict your search so that it returns only one file. For example, you could return only the file in the `output000` directory with:

In [24]:
data_ic_fixed = catalog[experiment].search(
    variable='area_t',
    path=".*output000.*"
).to_dask()
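
To get a feel for which files a pattern such as `".*output000.*"` selects, you can try it on a few paths first. The sketch below uses Python's `re` module as a stand-in, so the catalog's own matching may differ in detail. The first path is taken from this experiment's file listing, while the `output001` path is a hypothetical sibling added for illustration.

```python
import re

paths = [
    # Path as it appears in the experiment's dataframe
    "/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ocean/ocean_grid.nc",
    # Hypothetical path for a later output directory (assumed for illustration)
    "/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output001/ocean/ocean_grid.nc",
]

# Same pattern as passed to `search(path=...)` above
pattern = re.compile(".*output000.*")

# Keep only the paths in which the pattern is found
matching = [p for p in paths if pattern.search(p)]
```

Here `matching` contains only the `output000` path, which is why the restricted search returns a single file that can be opened without concatenation.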

## 4. Determining what can be searched upon in an experiment

You can see what can be `search`ed on within an experiment with:

In [25]:
catalog[experiment].df.columns.tolist()

['path',
 'realm',
 'variable',
 'frequency',
 'start_date',
 'end_date',
 'variable_long_name',
 'variable_standard_name',
 'variable_cell_methods',
 'variable_units',
 'filename',
 'file_id']

It can also be helpful to look at the `catalog[experiment].df` object itself, which is a dataframe of all of the files in the experiment and their metadata.

In [26]:
catalog[experiment].df.head()

Unnamed: 0,path,realm,variable,frequency,start_date,end_date,variable_long_name,variable_standard_name,variable_cell_methods,variable_units,filename,file_id
0,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-01.nc,seaIce,"[time_bounds, TLON, TLAT, ULON, ULAT, NCAT, tmask, blkmask, tarea, uarea, dxt, dyt, dxu, dyu, HTN, HTE, ANGLE, ANGLET, hi_m, hs_m, Tsfc_m, aice_m, uvel_m, vvel_m, uatm_m, vatm_m, fswup_m, sst_m, s...",1mon,"1900-01-01, 00:00:00","1900-02-01, 00:00:00","[boundaries for time-averaging interval, T grid center longitude, T grid center latitude, U grid center longitude, U grid center latitude, category maximum thickness, ocean grid mask, ice grid blo...","[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]","[, , , , , , , , , , , , , , , , , , time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, tim...","[days since 1900-01-01 00:00:00, degrees_east, degrees_north, degrees_east, degrees_north, m, , , m^2, m^2, m, m, m, m, m, m, radians, radians, m, m, C, 1, m/s, m/s, m/s, m/s, W/m^2, C, ppt, m/s, ...",iceh.1900-01.nc,iceh_XXXX_XX
1,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-02.nc,seaIce,"[time_bounds, TLON, TLAT, ULON, ULAT, NCAT, tmask, blkmask, tarea, uarea, dxt, dyt, dxu, dyu, HTN, HTE, ANGLE, ANGLET, hi_m, hs_m, Tsfc_m, aice_m, uvel_m, vvel_m, uatm_m, vatm_m, fswup_m, sst_m, s...",1mon,"1900-02-01, 00:00:00","1900-03-01, 00:00:00","[boundaries for time-averaging interval, T grid center longitude, T grid center latitude, U grid center longitude, U grid center latitude, category maximum thickness, ocean grid mask, ice grid blo...","[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]","[, , , , , , , , , , , , , , , , , , time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, tim...","[days since 1900-01-01 00:00:00, degrees_east, degrees_north, degrees_east, degrees_north, m, , , m^2, m^2, m, m, m, m, m, m, radians, radians, m, m, C, 1, m/s, m/s, m/s, m/s, W/m^2, C, ppt, m/s, ...",iceh.1900-02.nc,iceh_XXXX_XX
2,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ice/OUTPUT/iceh.1900-03.nc,seaIce,"[time_bounds, TLON, TLAT, ULON, ULAT, NCAT, tmask, blkmask, tarea, uarea, dxt, dyt, dxu, dyu, HTN, HTE, ANGLE, ANGLET, hi_m, hs_m, Tsfc_m, aice_m, uvel_m, vvel_m, uatm_m, vatm_m, fswup_m, sst_m, s...",1mon,"1900-03-01, 00:00:00","1900-04-01, 00:00:00","[boundaries for time-averaging interval, T grid center longitude, T grid center latitude, U grid center longitude, U grid center latitude, category maximum thickness, ocean grid mask, ice grid blo...","[, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]","[, , , , , , , , , , , , , , , , , , time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, tim...","[days since 1900-01-01 00:00:00, degrees_east, degrees_north, degrees_east, degrees_north, m, , , m^2, m^2, m, m, m, m, m, m, radians, radians, m, m, C, 1, m/s, m/s, m/s, m/s, W/m^2, C, ppt, m/s, ...",iceh.1900-03.nc,iceh_XXXX_XX
3,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ocean/ocean.nc,ocean,"[temp, pot_temp, salt, age_global, u, v, wt, dzt, rho, pot_rho_0, tx_trans, ty_trans, ty_trans_submeso, tx_trans_rho, ty_trans_rho, ty_trans_nrho_submeso, temp_xflux_adv, temp_yflux_adv, average_T...",3mon,"1900-01-01, 00:00:00","1900-04-01, 00:00:00","[Conservative temperature, Potential temperature, Practical Salinity, Age (global), i-current, j-current, dia-surface velocity T-points, t-cell thickness, in situ density, potential density refere...","[, sea_water_potential_temperature, sea_water_salinity, sea_water_age_since_surface_contact, sea_water_x_velocity, sea_water_y_velocity, , cell_thickness, , sea_water_potential_density, ocean_mass...","[time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, time: mean, tim...","[deg_C, degrees K, psu, yr, m/sec, m/sec, m/sec, m, kg/m^3, kg/m^3, kg/s, kg/s, kg/s, kg/s, kg/s, kg/s, Watts, Watts, days since 1900-01-01 00:00:00, days since 1900-01-01 00:00:00, days, days]",ocean.nc,ocean
4,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ocean/ocean_grid.nc,ocean,"[geolon_t, geolat_t, geolon_c, geolat_c, ht, hu, dxt, dyt, dxu, dyu, area_t, area_u, kmt, kmu, drag_coeff]",fx,none,none,"[tracer longitude, tracer latitude, uv longitude, uv latitude, ocean depth on t-cells, ocean depth on u-cells, ocean dxt on t-cells, ocean dyt on t-cells, ocean dxu on u-cells, ocean dyu on u-cell...","[, , , , sea_floor_depth_below_geoid, , , , , , , , , , ]","[time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point]","[degrees_E, degrees_N, degrees_E, degrees_N, m, m, m, m, m, m, m^2, m^2, dimensionless, dimensionless, dimensionless]",ocean_grid.nc,ocean_grid


If you have any further questions after reading this notebook and the documentation linked from it, please open an issue in the [ACCESS-NRI Intake catalog GitHub repo](https://github.com/ACCESS-NRI/access-nri-intake-catalog) or open a topic on the [ACCESS-Hive forum](https://forum.access-hive.org.au/).

In [27]:
client.close()