# ACCESS-NRI Intake catalog tutorial

**Aims**: This tutorial introduces the ACCESS-NRI Intake catalog and shows you how to use it to find and load model data for analysis.

# Opening the catalog 

We'll start by opening the catalog and getting a feel for what it contains.

In [None]:
import intake

catalog = intake.cat.access_nri

With that, we can now use `catalog` to search and load ~3 PB of data without having to know where the data is or how it's structured. 

The catalog includes a wide variety of climate data products. The "name" column gives the name of the data product and the other columns provide additional metadata associated with each product. As we'll demonstrate below, you can search on metadata in these columns to filter for data products that may be of interest to you. Scroll through the products below and get an idea for what each product is by looking at the entry in the description column.

In [None]:
catalog

# Using the catalog

Each entry (row) in the catalog describes a data product comprising many datasets spread across many files (a "dataset" here is a set of files that can be readily opened and combined for analysis using xarray - more on this later). For example, in a given ACCESS-CM2 product, there may be a dataset of ocean variables at monthly frequency, atmospheric variables at monthly frequency etc. Each entry in the catalog has a corresponding Intake-ESM datastore that can be used to filter for datasets of interest based on metadata in the datastore and then to open those datasets using xarray.

The general process for using the catalog is as follows:

1. Search the ACCESS-NRI catalog for data products that are of interest to you.
2. Open the Intake-ESM datastore(s) for the product(s) of interest. 
3. Search the Intake-ESM datastore(s) for the datasets within each product that are of interest to you.
4. Open the datasets of interest as xarray Dataset(s).
5. Perform some analysis on the xarray Dataset(s).

This process is illustrated in the schematic below. Pink text indicates the methods used to perform each task.

<img src="./catalog_flow.svg" alt="Alternative text" />

If you know in advance the name of the product(s) you are after, the first two steps above are not needed. Instead, the Intake-ESM datastores for those products can be retrieved directly from the catalog, as demonstrated later on.
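
Steps 1 to 4 of the process above can be chained in a few lines. The sketch below is one possible arrangement, not part of the catalog's API: `open_dataset_from_catalog` is a hypothetical helper name, and it assumes the product query leaves exactly one product and the dataset query leaves exactly one dataset. The `search`, `to_source` and `to_dask` calls it relies on are each introduced in the sections that follow.

```python
def open_dataset_from_catalog(catalog, product_query, dataset_query, **to_dask_kwargs):
    """Chain steps 1-4: search the catalog, open the datastore,
    search the datastore, then open the matching dataset.

    `catalog` is assumed to behave like the ACCESS-NRI catalog used in
    this tutorial; the two queries are dictionaries of search terms.
    """
    # Steps 1-2: filter the catalog, then open the single remaining datastore
    datastore = catalog.search(**product_query).to_source()

    # Step 3: filter the datastore down to the dataset of interest
    datastore_filtered = datastore.search(**dataset_query)

    # Step 4: open that dataset as an xarray Dataset
    return datastore_filtered.to_dask(**to_dask_kwargs)
```

A call might look like `open_dataset_from_catalog(catalog, {"name": "by647"}, {"variable": "sst", "frequency": "1mon"})`, where the query values are only illustrative.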

# Catalog filtering and data discovery

We can search on the columns in the ACCESS-NRI catalog. For example, we could search for all products that use the model `ACCESS-OM2`. The `search` method returns another catalog object with entries that satisfy our search criteria.

In [None]:
catalog_filtered = catalog.search(model="ACCESS-OM2")
catalog_filtered

We can also combine queries in a search. For example, below we search for all products that use the model `ACCESS-OM2` and contain the variable `wdet100` at daily frequency.

In [None]:
catalog.search(model="ACCESS-OM2", frequency="1day", variable="wdet100")

We can also use regex strings in our searches. For example, we could relax our query on variable to look for any variables starting with the letter `"w"`.

In [None]:
catalog.search(model="ACCESS-OM2", frequency="1day", variable="w.*")
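
Whether a pattern such as `w.*` must match from the start of a name or may match anywhere inside it depends on how the search applies the regex, which this tutorial does not spell out. The sketch below uses Python's `re` module rather than the catalog itself to show the difference, and why writing `^w.*` makes the intent explicit either way. Apart from `wdet100`, the variable names are made up for illustration.

```python
import re

# Illustrative variable names; only "wdet100" appears in this tutorial
names = ["wdet100", "wt", "sw_heat", "temp"]

# Unanchored: "w.*" matches any name that contains a "w"
unanchored = [name for name in names if re.search("w.*", name)]

# Anchored: "^w.*" only matches names that start with "w"
anchored = [name for name in names if re.search("^w.*", name)]

print(unanchored)  # ['wdet100', 'wt', 'sw_heat']
print(anchored)    # ['wdet100', 'wt']
```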

Note, metadata in the `realm` and `frequency` columns of the ACCESS-NRI catalog follow a standard vocabulary that is very similar to CMIP6 (but slightly more general):

 - `realm` may be one of:
   - `aerosol`
   - `atmos`
   - `atmosChem`
   - `land`
   - `landIce`
   - `none`
   - `ocean`
   - `ocnBgchem`
   - `seaIce`
   - `unknown`
 - `frequency` may be one of (where `<int>` is an integer):
   - `fx`
   - `subhr`
   - `<int>hr`
   - `<int>day`
   - `<int>mon`
   - `<int>yr`
   - `<int>dec`
  
Some attempt has been made to use consistent model names in the `model` column (e.g. always use "ACCESS-OM2" for ACCESS-OM2), but model naming is not enforced. The variable names in the `variable` column are whatever they're called in the associated data product.
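
Because `realm` and `frequency` follow the fixed vocabularies listed above, a query value can be checked before it is passed to `search`. The sketch below is a standalone check written for this tutorial, not something the catalog provides: `is_valid_realm` and `is_valid_frequency` are hypothetical helper names, and `<int>` is assumed to mean one or more decimal digits.

```python
import re

# Vocabularies as listed in this tutorial
REALMS = {
    "aerosol", "atmos", "atmosChem", "land", "landIce",
    "none", "ocean", "ocnBgchem", "seaIce", "unknown",
}
FREQUENCY_PATTERN = re.compile(r"fx|subhr|\d+(hr|day|mon|yr|dec)")


def is_valid_realm(realm):
    """Return True if `realm` is one of the listed realm names."""
    return realm in REALMS


def is_valid_frequency(frequency):
    """Return True if `frequency` is `fx`, `subhr`, or an integer plus a listed unit."""
    return FREQUENCY_PATTERN.fullmatch(frequency) is not None
```

For example, `is_valid_frequency("1mon")` is true while `is_valid_frequency("monthly")` is false, which is a quick way to catch a query that would silently return nothing.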

# Loading Intake sources

Remember that each entry in the catalog has an associated Intake-ESM datastore that keeps track of all the files in that product and how they fit together into datasets. There are three ways to open Intake-ESM datastores from the catalog, depending on your use case:

## 1. You know the name of the product you want

In this case, you can open the Intake-ESM datastore for that product directly as an attribute or key. For example:

In [None]:
datastore_example = catalog.by647

# Or

datastore_example = catalog["by647"]

## 2. You've filtered the catalog for the products you want and there are multiple remaining

In this case, you can open the Intake-ESM datastores for all entries in a catalog using the `to_source_dict` method. For example:

In [None]:
datastore_dict_example = catalog.search(model="ACCESS-OM2", frequency="1day", variable="wdet100").to_source_dict()

## 3. You've filtered the catalog for the products you want and there's only one remaining 

In this case, you can open the Intake-ESM datastore for the remaining product using the `to_source` method. (You could also use `to_source_dict`, which would return a dictionary containing the Intake-ESM datastore rather than the datastore itself.) For example:

In [None]:
datastore_example = catalog.search(name="by647").to_source()

# Additional source metadata

Each Intake-ESM datastore has its own `.metadata` attribute that contains additional information about that experiment.

In [None]:
catalog.by647.metadata

# An example workflow

As a full example workflow, let's use the ACCESS-NRI catalog to carry out an analysis comparing the SST climatology in the Nino-3.4 region from two data products:

- An ACCESS-ESM1.5 experiment: `HI_C_05_r1`
- An ACCESS-CM2 experiment: `bx944`

(Note, up until this point we haven't actually opened any data from any products. However, below we access data from the `p73` project on Gadi. You will therefore need to be a member of this project to run this section.)

First we'll search directly for the products by their names and get the Intake-ESM datastores for those experiments. Here we use the `to_source_dict` method to load the datastores because there is more than one of them.

In [None]:
datastore_dict = catalog.search(name=["HI_C_05_r1", "bx944"]).to_source_dict()

We're going to use dask to help us open and process the data from these datastores. We can start a local distributed dask cluster as follows:

In [None]:
from distributed import Client

client = Client(threads_per_worker=1)
client.dashboard_link

Note, it is very helpful to monitor the dask dashboard when working with dask. Click on the dask icon on the far left of the screen (three orange and red squares) and enter the text output by the previous cell in the search bar. Each of the orange panels is a different dashboard that you can use to monitor what dask is doing. If you don't know which to choose, the "Task Stream", "Progress", "CPU" and "Workers Memory" diagnostics are a good start. Click on these, and drag the windows to wherever you want them in your JupyterLab.

We want to open datasets of SST from each of our datastores in `datastore_dict`. We could do this all in one line, but we'll break it down into multiple steps for clarity. Note that renaming `xt_ocean` and `yt_ocean` to `longitude` and `latitude` isn't correct globally, but `xt_ocean` and `yt_ocean` can be interpreted as longitudes and latitudes in the region we're interested in. The renaming is done simply to make things a little more convenient later on.

In [None]:
dataset_dict = {}

for name, datastore in datastore_dict.items():

    # Search for monthly SST dataset in the datastore
    datastore_filtered = datastore.search(realm="ocean", variable="sst", frequency="1mon")

    # Open the monthly SST dataset as an xarray Dataset
    # The arguments passed to to_dask aren't essential, but they speed up opening the datasets
    dataset = datastore_filtered.to_dask(
        xarray_open_kwargs=dict(use_cftime=True),
        xarray_combine_by_coords_kwargs=dict(compat="override", coords="minimal"),
    )

    # Rename coordinate for convenience
    dataset_dict[name] = dataset.rename({"xt_ocean": "longitude", "yt_ocean": "latitude"})

Now we can do our analysis on these data. To do this, we use two functions: the first computes the monthly climatological mean in the Nino-3.4 region over 1971-2000; and the second wraps the first to compute and plot the climatological mean for a dictionary of datasets containing the variable `sst`.
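
The Nino-3.4 region is conventionally defined as 5°S to 5°N, 170°W to 120°W. The function that follows selects it as longitudes 190 to 240 after mapping every longitude onto a 0 to 360 axis with `(longitude + 360) % 360`. A minimal sketch of that mapping in plain Python, where `wrap_longitude` is a hypothetical helper name and the sample values are illustrative only:

```python
def wrap_longitude(lon):
    """Map a longitude in degrees east onto the range [0, 360)."""
    return (lon + 360) % 360


# Edges of the Nino-3.4 box, written as negative degrees east
west_edge = wrap_longitude(-170)  # 190
east_edge = wrap_longitude(-120)  # 240

# Values already in [0, 360) are left unchanged
unchanged = wrap_longitude(200)  # 200
```

Because the mapping is applied before the comparison, the same 190 to 240 bounds work whether a grid stores these longitudes as negative or positive values.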

In [None]:
def compute_nino34_clim(sst):
    """
    Compute the monthly climatological mean in the Nino-3.4 region over 1971-2000
    """
    sst = sst.sel(time=slice("1971", "2000"))
    
    sst = sst.assign_coords(
        {"longitude": (sst["longitude"] + 360) % 360}
    )
    
    # NOTE: really this should be area weighted
    sst_nino34 = sst.where(
        (sst.latitude < 5) & 
        (sst.latitude > -5) & 
        (sst.longitude > 190) & 
        (sst.longitude < 240), 
        drop=True
    ).mean(
        set(sst.latitude.dims + sst.longitude.dims)
    ).compute()
    
    return sst_nino34.groupby("time.month").mean("time")
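
The `NOTE` in `compute_nino34_clim` points out that the regional mean should really be area weighted. One common approximation, assuming a regular latitude-longitude grid on which cell area scales with the cosine of latitude, is sketched below with NumPy on a small synthetic array. The helper name `cos_lat_weighted_mean` and the sample data are illustrative and not part of the tutorial.

```python
import numpy as np


def cos_lat_weighted_mean(values, latitudes):
    """Mean of a (lat, lon) array, weighting each row by cos(latitude).

    NaN cells (for example land points) are excluded from the mean.
    """
    values = np.asarray(values, dtype=float)
    weights = np.cos(np.deg2rad(np.asarray(latitudes, dtype=float)))
    # Repeat the per-latitude weight across every longitude column
    weights_2d = np.broadcast_to(weights[:, np.newaxis], values.shape)
    valid = ~np.isnan(values)
    return np.sum(values[valid] * weights_2d[valid]) / np.sum(weights_2d[valid])


# Synthetic field: 0 everywhere at the equator, 10 everywhere at 60 degrees north
field = np.array([[0.0, 0.0], [10.0, 10.0]])
weighted = cos_lat_weighted_mean(field, latitudes=[0.0, 60.0])
unweighted = field.mean()
```

Here the weighted mean is 10/3 while the unweighted mean is 5, because the row at 60 degrees carries half the weight of the equatorial row. Within 5 degrees of the equator the weights differ by less than half a percent, which is why the unweighted mean is a tolerable shortcut for this demo. With xarray objects, the same calculation is usually written with `DataArray.weighted(weights).mean(...)`.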

In [None]:
import matplotlib.pyplot as plt

def plot_nino34_clim(dataset_dict):
    """
    Plot the monthly climatological mean of SST in the Nino-3.4 region over 1971-2000
    """
    for idx, (name, ds) in enumerate(dataset_dict.items()):
        nino34_clim = compute_nino34_clim(ds)["sst"]
        
        # NOTE: there are better ways to deal with units, but this will do for this demo
        if (nino34_clim > 273.15).any().item():
            nino34_clim -= 273.15
            
        nino34_clim.plot.line(x="month", color=f"C{idx}", add_legend=False, label=name)

    plt.title("SST climatology in Nino-3.4 region")
    plt.ylabel("SST")
    plt.grid()

    # Remove duplicates in legend
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())
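
The legend clean-up at the end of `plot_nino34_clim` relies on a property of Python dictionaries: building a dictionary from `zip(labels, handles)` keeps one entry per label, in order of first appearance, holding the last handle seen for that label. A small sketch, with placeholder strings standing in for the matplotlib line handles:

```python
# Placeholder labels and handles; in the tutorial these come from
# plt.gca().get_legend_handles_labels()
labels = ["HI_C_05_r1", "bx944", "CMIP6", "CMIP6", "CMIP6"]
handles = ["line0", "line1", "line2", "line3", "line4"]

# Later duplicates overwrite the value but keep the original key position
by_label = dict(zip(labels, handles))

unique_labels = list(by_label.keys())     # ['HI_C_05_r1', 'bx944', 'CMIP6']
unique_handles = list(by_label.values())  # ['line0', 'line1', 'line4']
```

Duplicate labels can arise when a single `plot.line` call draws several lines under one `label`, as can happen when the data has an extra dimension.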

In [None]:
plot_nino34_clim(dataset_dict)

Maybe we'd also like to add some ACCESS-ESM1-5 CMIP6 data to our plot? That's easy because the [NCI CMIP6 Intake-ESM datastores](https://opus.nci.org.au/pages/viewpage.action?pageId=213713098) are included in the ACCESS-NRI catalog.

In [None]:
cmip6_datastore = catalog.search(name="cmip6.*", model="ACCESS-ESM1-5").to_source()

Let's search for and load the ACCESS-ESM1-5 historical run. In this Intake-ESM datastore (which was generated by NCI), each of the 40 ACCESS-ESM1-5 ensemble members is considered a separate dataset. We'll open them using `to_dataset_dict` and concatenate them into a single dataset.

In [None]:
cmip6_datastore_filtered = cmip6_datastore.search(
    source_id="ACCESS-ESM1-5", 
    table_id="Omon", 
    variable_id="tos", 
    experiment_id="historical", 
    file_type="f"
)

cmip6_datastore_filtered

In [None]:
import xarray as xr

ds = xr.concat(
    cmip6_datastore_filtered.to_dataset_dict(progressbar=False).values(), 
    dim="member"
)
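
Passing `dim="member"` to `xr.concat` creates a new `member` dimension, but because only the dictionary values are passed in, the keys that identify each dataset are not carried along. If those identifiers are useful, one option is to pass a named `pandas.Index` as `dim` so that they become coordinate labels. A minimal sketch on synthetic data, where the `member_<i>` keys and the tiny datasets are stand-ins for whatever `to_dataset_dict` returns:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for the dictionary returned by to_dataset_dict(): three tiny datasets
synthetic = {
    f"member_{i}": xr.Dataset(
        {"tos": ("time", np.full(3, float(i)))},
        coords={"time": [0, 1, 2]},
    )
    for i in range(3)
}

# A named index supplies both the new dimension name and its coordinate labels
member_index = pd.Index(list(synthetic), name="member")
combined = xr.concat(list(synthetic.values()), dim=member_index)

# Individual members can now be selected by key
one_member = combined["tos"].sel(member="member_2")
```

Labelling is optional for the plot in this tutorial, which treats all members alike, but it makes it easier to trace a given line back to its source dataset.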

Now we can add the CMIP6 ACCESS-ESM1-5 ensemble to our plot. Perhaps unsurprisingly, the added climatologies look very similar to the `HI_C_05_r1` climatology (which uses the same model).

In [None]:
dataset_dict["CMIP6 ACCESS-ESM1.5 historical"] = ds.rename({"tos": "sst"})

plot_nino34_clim(dataset_dict)

In [None]:
client.close()