# Using _Intake_ as an Alternative to the _COSIMA Cookbook_

This notebook shows how to move from the `cosima_cookbook` to `intake`.

# Too Long; Didn't Read:

The cosima cookbook version:

```
import cosima_cookbook as cc

session = cc.database.create_session()

da = cc.querying.getvar(
    expt="expt0", 
    variable="var0", 
    session=session, 
    frequency="1 monthly"
)
```

translates to this in intake:

```
import intake

catalog = intake.cat.access_nri

ds = catalog["expt0"].search(
    variable="var0", 
    frequency="1mon"
).to_dask(
    xarray_combine_by_coords_kwargs={
        'compat':'override','data_vars':'minimal', 'coords':'minimal'
    }
)

da = ds["var0"]  # select the variable by name (a string key)
```

# Opening the catalog

This notebook is a concise version of the longer [COSIMA training workshop](https://github.com/ACCESS-Hive/cosima-training-workshop-2023/blob/main/Intake.ipynb) on the Intake Catalog and of the [documentation](https://access-nri-intake-catalog.readthedocs.io/). At the time of writing (Oct 2023), the ACCESS-NRI Intake Catalog is under testing and feedback from users is requested.

**Notes that are unique to changing from the cookbook to intake are in BOLD**

Requirements: The conda/analysis3 (tested on analysis3-23.04) module from /g/data/hh5/public/modules. 

**First, load the modules, using intake instead of cosima_cookbook**:

In [1]:
import intake # instead of import cosima_cookbook as cc

from dask.distributed import Client
from datetime import timedelta

Then start a dask client:

In [2]:
client = Client()
client

Client
Connection method: Cluster object, Cluster type: distributed.LocalCluster
Dashboard: /proxy/8787/status
Status: running, Using processes: True
Scheduler comm: tcp://127.0.0.1:39307
Workers: 8, Total threads: 48, Total memory: 188.56 GiB
Each worker: 6 threads, 23.57 GiB memory


**Open the catalog (similar to starting a database session)**

In [3]:
catalog = intake.cat.access_nri
# session = cc.database.create_session()

**You can browse the catalogue (instead of the database explorer, just run `catalog`) or browse the results of a search:**

In [4]:
catalog.search(model='ACCESS-OM2')

name,model,description,realm,frequency,variable
01deg_jra55v13_ryf9091,{ACCESS-OM2},{0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.3 RYF9091 repeat year forcing (May 1990 to Apr 1991)},"{ocean, seaIce}","{1mon, 3mon, 3hr, 1day, fx}","{alidr_ai_m, melt, total_ocean_lw_heat, frazil_m, pe_tot, temp_vdiffuse_impl, area_u, ty_trans, surface_salt, hu, passive_weddell, total_ocean_fprec, temp_surface_ave, ULON, vsurf, temp_yflux_adv,..."
01deg_jra55v140_iaf,{ACCESS-OM2},{Cycle 1/4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, fx, 1day}","{alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, area_u, ty_trans, surface_salt, hu, total_ocean_fprec, temp_surface_ave, vvel, ULON, daidtt_m, temp_yflux_adv, opening_m, ..."
01deg_jra55v140_iaf_cycle2,{ACCESS-OM2},{Cycle 2/4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, fx, 1day}","{alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, area_u, ty_trans, surface_salt, surface_temp_max, hu, total_ocean_fprec, temp_surface_ave, vvel, ULON, vsurf, daidtt_m, te..."
01deg_jra55v140_iaf_cycle3,{ACCESS-OM2},{Cycle 3/4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, fx, 1day}","{alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, area_u, ty_trans, surface_salt, hu, total_ocean_fprec, temp_surface_ave, vvel, ULON, vsurf, daidtt_m, temp_yflux_adv, open..."
01deg_jra55v140_iaf_cycle4,{ACCESS-OM2},{Cycle 4/4 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do v1.4.0 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, 6hr, 3hr, 1day, fx}","{melt, temp_int_rhodz, radbio3d, hu, fswup, total_ocean_fprec, temp_surface_ave, ULON, opening_m, stf09, alk, Sinz_m, dic_intmld, pbot_t, total_ocean_hflux_coupler, pprod_gross, surface_pot_temp, ..."
01deg_jra55v140_iaf_cycle4_jra55v150_extension,{ACCESS-OM2},{Extensions of cycle 4/4 of 0.1 degree ACCESS-OM2 + WOMBAT BGC global model configuration with JRA55-do v1.5.0 and v1.5.0.1 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, fx, 0hr, 1day}","{fswabs_ai_m, alidr_ai_m, total_ocean_lw_heat, melt, surface_caco3, temp_int_rhodz, frazil_m, o2_xflux_adv, pe_tot, o2_intmld, area_u, stf03, ty_trans, adic_int100, npp3d, radbio3d, surface_salt, ..."
01deg_jra55v150_iaf_cycle1,{ACCESS-OM2},{Cycle 1/1 of 0.1 degree ACCESS-OM2 global model configuration with JRA55-do \nv1.5.0 OMIP2 interannual forcing},"{ocean, seaIce}","{1mon, fx, 1day}","{melt, temp_int_rhodz, area_u, ty_trans, surface_salt, hu, ULON, pbot_t, sfc_salt_flux_ice, age_global, surface_pot_temp, lprec, fprec_melt_heat, sfc_hflux_from_runoff, sens_heat, aice_m, wfiform,..."
025deg_jra55_iaf_omip2_cycle1,{ACCESS-OM2},{Cycle 1/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{ocean, seaIce}","{1mon, fx, 1yr, 1day}","{fswabs_ai_m, alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, salt_rivermix, area_u, ty_trans, temp_yflux_gm_int_z, sice_m, flwdn_m, hu, total_ocean_fprec, ty_trans_gm, ps..."
025deg_jra55_iaf_omip2_cycle2,{ACCESS-OM2},{Cycle 1/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{ocean, seaIce}","{1mon, fx, 1yr, 1day}","{fswabs_ai_m, alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, salt_rivermix, area_u, ty_trans, temp_yflux_gm_int_z, sice_m, flwdn_m, hu, total_ocean_fprec, ty_trans_gm, ps..."
025deg_jra55_iaf_omip2_cycle3,{ACCESS-OM2},{Cycle 3/6 of 0.25 degree ACCESS-OM2 physics-only global configuration with JRA55-do v1.4 OMIP2 interannual forcing (1958-2019)},"{ocean, seaIce}","{1mon, fx, 1yr, 1day}","{fswabs_ai_m, alidr_ai_m, total_ocean_lw_heat, melt, frazil_m, temp_int_rhodz, pe_tot, salt_rivermix, area_u, ty_trans, temp_yflux_gm_int_z, sice_m, flwdn_m, hu, total_ocean_fprec, ty_trans_gm, ps..."


# Finding data

In this example, we load sea ice concentration (`aice_m`) from a Repeat-Year forcing experiment. 

These are the arguments we would have passed to `getvar` in the cosima cookbook:

In [5]:
expt="01deg_jra55v13_ryf9091"
variable="aice_m"

**Instead of 'getvar' we use search, and specify the experiment name and the variable**

In [6]:
var=catalog[expt].search(variable=variable)

# var = cc.querying.getvar(
#     expt=expt,
#     variable=variable,
#     session=session, 
#     decode_coords=False
# )


# Loading Data

**At this point we don't have an xarray object yet; we only have a dataframe of entries in the catalog. We need to call `to_dask()` to create the xarray dataset, which will attempt to merge and concatenate all the files relating to those entries.**
- For CICE data, it's simpler to use `decode_coords=False` in both the cookbook and intake (in intake it is passed through `xarray_open_kwargs`).
- To speed up the xarray combining of data files, we pass some extra `xarray_combine_by_coords_kwargs` arguments. This is safe because we are only using curated results from one model run. Be careful using these arguments if opening results from more than one model or dataset.
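To see what these combine arguments do, here is a minimal sketch on synthetic data. It assumes `to_dask()` forwards the kwargs to `xarray.combine_by_coords`, as the argument name suggests. The two small datasets stand in for two output files, and `static_field` is a hypothetical time-independent variable repeated in every file.

```python
import numpy as np
import pandas as pd
import xarray as xr


def make_file(start):
    """Build a tiny stand-in for one output file holding two monthly steps."""
    time = pd.date_range(start, periods=2, freq="MS")
    return xr.Dataset(
        {
            # Time-varying field: this is what gets concatenated along time
            "aice_m": (("time", "nj", "ni"), np.ones((2, 3, 4), dtype="float32")),
            # Hypothetical static field that is identical in every file
            "static_field": (("nj", "ni"), np.full((3, 4), 100.0)),
        },
        coords={"time": time},
    )


files = [make_file("2090-01-01"), make_file("2090-03-01")]

# 'minimal' concatenates only variables that already have the time dimension;
# 'override' takes everything else from the first file without comparing values.
combined = xr.combine_by_coords(
    files, compat="override", data_vars="minimal", coords="minimal"
)

print(combined["aice_m"].shape)        # time steps from both files
print(combined["static_field"].dims)   # still has no time dimension
```

Skipping the equality checks is where the speed-up comes from, and it is also why these settings are only appropriate when the files are known to be consistent with each other.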

In [7]:
%%time
sic=var.to_dask(
    xarray_open_kwargs={
        "decode_coords":False
    },
    xarray_combine_by_coords_kwargs={
    'compat':'override','data_vars':'minimal', 'coords':'minimal'
    }
)

This may cause some slowdown.
Consider scattering data ahead of time and using futures.


CPU times: user 42.1 s, sys: 3.29 s, total: 45.4 s
Wall time: 1min 50s


# Filtering by time

**In the cosima_cookbook, we might have filtered by time using the `start_date` and `end_date` arguments to `getvar`. Intake's `search` function doesn't filter by time range, but as we haven't loaded the dataset into memory yet, we can subset by time before loading the data.**

As noted in [other notebooks](https://cosima-recipes.readthedocs.io/en/latest/DocumentedExamples/IcePlottingExample.html), CICE stamps the monthly mean for, say, January at midnight at the end of 31 January, which xarray interprets as the first millisecond of February.

To work around this, we subtract 12 hours from the time dimension. This puts each monthly mean in the correct month, which helps when computing monthly climatologies.

In [8]:
sic['time'] = sic.time.to_pandas() - timedelta(hours = 12)
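The effect of the 12-hour shift can be checked on a single timestamp. This is a small sketch using only the standard library and a standard calendar for simplicity; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# The January mean is stamped at the end of its averaging period,
# which is read back as midnight at the start of 1 February.
january_mean_stamp = datetime(2090, 2, 1, 0, 0)

# Subtracting 12 hours moves the stamp back inside January.
shifted = january_mean_stamp - timedelta(hours=12)

print(january_mean_stamp.month)  # 2
print(shifted)                   # 2090-01-31 12:00:00
print(shifted.month)             # 1
```

Any offset shorter than the averaging period would do. Twelve hours is simply a convenient choice that keeps the stamp well away from the month boundary.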

**As we have only lazy loaded the data so far, this is a good time to subset to only use the years we are interested in**

In [9]:
sic=sic.sel(time=slice('2090','2099'))

Note that `aice_m` is the monthly average of fractional ice area in each grid cell aka the concentration. To find the actual area of the ice we need to know the area of each cell. Unfortunately, CICE doesn't save this for us ... but the ocean model does. So, let's load `area_t` from the ocean model, and rename the coordinates in our ice variable to match the ocean model. Then we can multiply the ice concentration with the cell area to get a total ice area.
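The area calculation described above is a cell-by-cell product followed by a sum. Here is a minimal sketch on a hypothetical 2x2 grid, with made-up concentrations and cell areas:

```python
import numpy as np

# Hypothetical fractional ice concentration (0 to 1) in each grid cell
concentration = np.array([[1.0, 0.5],
                          [0.25, 0.0]])

# Hypothetical area of each grid cell, in square metres
cell_area = np.array([[100.0, 100.0],
                      [200.0, 200.0]])

# Ice area per cell is concentration times cell area; summing gives the total
ice_area = concentration * cell_area
total_ice_area = ice_area.sum()

print(total_ice_area)  # 200.0
```

With the real data the same product applies between `aice_m` and `area_t`, once both are on matching coordinates.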

**There are many output files with the area field, however we only want one (as they are all the same)**

In [17]:
catalog[expt].search(variable='area_t').df[0:3]

index,path,realm,variable,frequency,start_date,end_date,variable_long_name,variable_standard_name,variable_cell_methods,filename,file_id
0,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output000/ocean/ocean_grid.nc,ocean,"[geolon_t, geolat_t, geolon_c, geolat_c, ht, hu, dxt, dyt, dxu, dyu, area_t, area_u, kmt, kmu, drag_coeff]",fx,"1900-04-01, 00:00:00","1900-04-01, 00:00:00","[tracer longitude, tracer latitude, uv longitude, uv latitude, ocean depth on t-cells, ocean depth on u-cells, ocean dxt on t-cells, ocean dyt on t-cells, ocean dxu on u-cells, ocean dyu on u-cell...",[sea_floor_depth_below_geoid],"[time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point]",ocean_grid.nc,ocean_grid
1,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output001/ocean/ocean_grid.nc,ocean,"[geolon_t, geolat_t, geolon_c, geolat_c, ht, hu, dxt, dyt, dxu, dyu, area_t, area_u, kmt, kmu, drag_coeff]",fx,"1900-07-01, 00:00:00","1900-07-01, 00:00:00","[tracer longitude, tracer latitude, uv longitude, uv latitude, ocean depth on t-cells, ocean depth on u-cells, ocean dxt on t-cells, ocean dyt on t-cells, ocean dxu on u-cells, ocean dyu on u-cell...",[sea_floor_depth_below_geoid],"[time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point]",ocean_grid.nc,ocean_grid
2,/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output002/ocean/ocean_grid.nc,ocean,"[geolon_t, geolat_t, geolon_c, geolat_c, ht, hu, dxt, dyt, dxu, dyu, area_t, area_u, kmt, kmu, drag_coeff]",fx,"1900-10-01, 00:00:00","1900-10-01, 00:00:00","[tracer longitude, tracer latitude, uv longitude, uv latitude, ocean depth on t-cells, ocean depth on u-cells, ocean dxt on t-cells, ocean dyt on t-cells, ocean dxu on u-cells, ocean dyu on u-cell...",[sea_floor_depth_below_geoid],"[time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point, time: point]",ocean_grid.nc,ocean_grid


**So let's include the start date, so that the catalog returns only one file to open. (We used `n=1` in the cookbook; there is an [open issue](https://github.com/ACCESS-NRI/access-nri-intake-catalog/issues/117) to try and improve this in intake.)**

In [11]:
# area_t = cc.querying.getvar(expt, 'area_t', session, n=1)
area_t=catalog[expt].search(variable='area_t', start_date='2090-01-01,*').to_dask().load()
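The idea of narrowing many identical grid files down to one by its start date can be illustrated with a plain pandas table. This is only a sketch of the filtering logic, not of how the catalog implements `search`. The `entries` dataframe is a hypothetical stand-in for the `.df` shown above, with shortened paths.

```python
import pandas as pd

# Hypothetical stand-in for catalog[expt].search(variable="area_t").df
entries = pd.DataFrame(
    {
        "path": [
            "output000/ocean/ocean_grid.nc",
            "output001/ocean/ocean_grid.nc",
            "output002/ocean/ocean_grid.nc",
        ],
        "start_date": [
            "1900-04-01, 00:00:00",
            "1900-07-01, 00:00:00",
            "1900-10-01, 00:00:00",
        ],
    }
)

# Keep only the entry whose start date begins with the requested day
single = entries[entries["start_date"].str.startswith("1900-07-01")]

print(len(single))             # 1
print(single.iloc[0]["path"])  # output001/ocean/ocean_grid.nc
```

Because the grid file is the same in every output directory, any single matching entry is sufficient.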

(As an aside, a convenient place to work with the area_t field is to add it as a coordinate to the sic dataset:)

In [12]:
sic['area_t']=area_t.area_t

sic=sic.set_coords('area_t')

# Chunks

This section is optional: you can use the Dataset as it is, especially if the data is small.

At this point, our data is 'lazy loaded' using [dask](https://docs.dask.org/en/latest/array-chunks.html) chunks. These are needed if the data won't fit in memory, but are also useful for parallelizing the analysis.

When we view the DataArray, it shows details about the dask chunks rather than the values (which are shown only once they are loaded into memory). You can see the array size is ~4.35 GiB, but each chunk is only ~37 MiB.

In [13]:
sic.aice_m

,Array,Chunk
Bytes,4.35 GiB,37.08 MiB
Shape,"(120, 2700, 3600)","(1, 2700, 3600)"
Dask graph,120 chunks in 6722 graph layers,120 chunks in 6722 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


The number of chunks is small (only 120), which is good. But each chunk is inefficiently small: we might aim for chunks of about 10% of the memory available to each worker. On a _large_ ARE instance, we have 48 GB spread across 12 cores, so with the default number of dask workers (also 12) we should aim for chunk sizes of around 400 MB.

To keep our chunks aligned with the netCDF source files (which hold one time step for all X and Y coordinates), we rechunk so that each chunk has the full X and Y extent but multiple time steps. Through trial and error, we can set this so our chunk sizes are close to 400 MB. (Conveniently, our number of chunks is now a multiple of our number of cores too.)

In [14]:
sic=sic.chunk({'time':10, 'nj':-1, 'ni':-1})
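The choice of 10 time steps per chunk can also be reached by arithmetic instead of trial and error. This sketch uses the array shape and dtype reported above, and assumes a target of roughly 400 MiB per chunk; the helper name `chunk_mib` is illustrative.

```python
MIB = 2**20

ny, nx, nt = 2700, 3600, 120   # shape of aice_m after subsetting
bytes_per_value = 4            # float32


def chunk_mib(steps):
    """Size in MiB of a chunk spanning `steps` time steps and the full grid."""
    return steps * ny * nx * bytes_per_value / MIB


target_mib = 400  # assumed target: about 10% of the memory per worker

# Largest divisor of nt that keeps the chunk at or below the target
steps = max(
    s for s in range(1, nt + 1)
    if nt % s == 0 and chunk_mib(s) <= target_mib
)

print(round(chunk_mib(1), 2))             # 37.08
print(steps, round(chunk_mib(steps), 2))  # 10 370.79
print(nt // steps)                        # 12
```

Restricting the search to divisors of the number of time steps keeps all chunks the same size.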

Our chunk sizes are more efficient now, and load quickly:

In [15]:
sic.aice_m

,Array,Chunk
Bytes,4.35 GiB,370.79 MiB
Shape,"(120, 2700, 3600)","(10, 2700, 3600)"
Dask graph,12 chunks in 6723 graph layers,12 chunks in 6723 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


**We are now ready to go, to analyse or plot the data. [Sea ice plotting examples](https://cosima-recipes.readthedocs.io/en/latest/DocumentedExamples/SeaIce_Plot_Example.html) covers plotting this data.**

In [16]:
client.close()