# Using an Intake-ESM Catalog

This Jupyter notebook demonstrates how to work with an Intake-ESM catalog for
the NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted
on AWS S3 ([doi:10.26024/wt24-5j82](https://doi.org/10.26024/wt24-5j82)).

[Intake-esm Documentation](https://intake-esm.readthedocs.io/en/latest/notebooks/tutorial.html)


## Import packages


In [None]:
import pprint

import intake

## Inspect intake-esm catalog


In [None]:
# Open original collection description file
cat_url = "https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(cat_url)
col

Show the first few lines of the catalog. There are as many lines as there are
paths. The order is the same as that of the CSV catalog file listed in the JSON
description file.


In [None]:
print("Catalog file:", col.esmcol_data["catalog_file"])
col.df.head(10)

**Table:** _First few lines of the original Intake-ESM Catalog showing the model
component, the temporal frequency, the experiment, the abbreviated variable
name, and the AWS S3 path for each Zarr store._
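The relationship between the JSON description file and the CSV catalog can be sketched with the standard library alone. This is a minimal illustration, not the real catalog: the collection id, the file name `example-catalog.csv`, the bucket name, and the two rows are all hypothetical, and only a few description fields are shown.

```python
import csv
import io
import json

# Hypothetical, heavily trimmed description file. Real descriptions carry
# more fields; every name and value below is a placeholder.
description_json = """
{
  "esmcat_version": "0.1.0",
  "id": "example-collection",
  "catalog_file": "example-catalog.csv",
  "assets": {"column_name": "path", "format": "zarr"}
}
"""

# Hypothetical CSV catalog: one row per Zarr store, in file order.
catalog_csv = """component,frequency,experiment,variable,path
atm,daily,20C,FLNS,s3://example-bucket/atm/daily/20C-FLNS.zarr
atm,daily,RCP85,FLNS,s3://example-bucket/atm/daily/RCP85-FLNS.zarr
"""

description = json.loads(description_json)
rows = list(csv.DictReader(io.StringIO(catalog_csv)))

# The description names the column that holds the asset paths.
path_column = description["assets"]["column_name"]
paths = [row[path_column] for row in rows]

print("Catalog file:", description["catalog_file"])
print("Number of rows:", len(rows))
print("First path:", paths[0])
```

Because `csv.DictReader` yields rows in file order, the row order in `rows` mirrors the CSV, which is the same property described above for the catalog table.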


Look at the unique values and their counts for the given columns:


In [None]:
uniques_orig = col.unique(
    columns=["component", "frequency", "experiment", "variable"]
)
pprint.pprint(uniques_orig, compact=True, indent=1, width=80)
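To make the shape of that summary concrete, here is a plain-Python sketch that computes unique values and their counts for a few columns. The rows and the helper `summarize_uniques` are hypothetical, and the `{"count": ..., "values": ...}` layout is assumed from the way the result is indexed later in this notebook; the library's actual output may differ between versions.

```python
import pprint

# Hypothetical catalog rows; the real catalog has one row per Zarr store.
rows = [
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "FLNS"},
    {"component": "atm", "frequency": "daily", "experiment": "RCP85", "variable": "FLNS"},
    {"component": "atm", "frequency": "monthly", "experiment": "20C", "variable": "FLNS"},
    {"component": "ocn", "frequency": "monthly", "experiment": "20C", "variable": "SALT"},
]


def summarize_uniques(rows, columns):
    """Return {column: {"count": number of unique values, "values": sorted list}}."""
    summary = {}
    for column in columns:
        values = sorted({row[column] for row in rows})
        summary[column] = {"count": len(values), "values": values}
    return summary


uniques_sketch = summarize_uniques(
    rows, ["component", "frequency", "experiment", "variable"]
)
pprint.pprint(uniques_sketch, compact=True, indent=1, width=80)
```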

## Finding Data

If you happen to know the meaning of the variable names, you can find what data
are available for that variable. For example:


In [None]:
# Filter the catalog to find available data for one variable
col.search(variable="FLNS").df

**Table:** _All available Zarr stores for the "FLNS" data._


In [None]:
# Narrow the filter to a specific frequency and experiment
col.search(variable="FLNS", frequency="daily", experiment="RCP85").df

**Table:** _The single Zarr store for daily "FLNS" data from the "RCP85"
experiment._
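Conceptually, each keyword passed to `search()` restricts one column, and a row is kept only if it satisfies all of them. The sketch below mimics that behaviour on a list of dictionaries. It illustrates the idea only and is not how the library is implemented; the rows, the bucket name, and the helper `search_rows` are hypothetical.

```python
# Hypothetical catalog rows with placeholder paths.
rows = [
    {"variable": "FLNS", "frequency": "daily", "experiment": "20C",
     "path": "s3://example-bucket/daily/20C-FLNS.zarr"},
    {"variable": "FLNS", "frequency": "daily", "experiment": "RCP85",
     "path": "s3://example-bucket/daily/RCP85-FLNS.zarr"},
    {"variable": "FLNS", "frequency": "monthly", "experiment": "RCP85",
     "path": "s3://example-bucket/monthly/RCP85-FLNS.zarr"},
    {"variable": "SALT", "frequency": "monthly", "experiment": "20C",
     "path": "s3://example-bucket/monthly/20C-SALT.zarr"},
]


def search_rows(rows, **query):
    """Keep rows whose columns equal every value in the query (logical AND)."""
    return [
        row
        for row in rows
        if all(row.get(column) == value for column, value in query.items())
    ]


all_flns = search_rows(rows, variable="FLNS")
daily_rcp85_flns = search_rows(
    rows, variable="FLNS", frequency="daily", experiment="RCP85"
)
print(len(all_flns), "stores for FLNS")
print(len(daily_rcp85_flns), "store for daily FLNS from RCP85")
```

Adding a keyword can only shrink the result, which is why the second query narrows the first one down to a single store.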


## The Problem


Do all potential users know that "FLNS" is a CESM-specific abbreviation for "Net
longwave flux at surface"? How would a novice user find out, other than by
finding separate documentation, or by opening a Zarr store in the hopes that the
long name might be recorded there? How do we address the fact that every climate
model code seems to have a different, non-standard name for all the variables,
thus making multi-source research needlessly difficult?


## Solution: use columns with enhanced information


By using additional columns in the Intake-ESM catalog, we should be able to
improve semantic interoperability and provide potentially useful information to
the users.


### Long names

Instead of searching by variable short names, we can use the
`variable_long_name`.

**ISSUE:** _The long names are **not** CF Standard Names, but rather are those
documented
[here](http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html). For
interoperability, the `variable_long_name` column should be supplemented by a
`cf_name` column and possibly an attribute column to disambiguate if needed._
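One way to picture the proposed columns is a lookup table keyed by the model's short name. The sketch below is only an illustration: the helper `enrich` and the example rows are hypothetical, the long name for "FLNS" is the one quoted earlier in this notebook, and the CF Standard Name shown is an assumed mapping that should be verified against the CF Standard Name Table before use.

```python
# Lookup table keyed by CESM short name.
VARIABLE_INFO = {
    "FLNS": {
        "variable_long_name": "Net longwave flux at surface",
        # Assumed mapping for illustration only; verify before relying on it.
        "cf_name": "surface_net_upward_longwave_flux",
    },
}


def enrich(row, info=VARIABLE_INFO):
    """Return a copy of the row with long-name and CF-name columns added.

    Variables missing from the lookup table get None in both columns,
    so gaps stay visible instead of being silently guessed.
    """
    extra = info.get(row["variable"], {})
    return {
        **row,
        "variable_long_name": extra.get("variable_long_name"),
        "cf_name": extra.get("cf_name"),
    }


# Hypothetical catalog rows.
rows = [
    {"component": "atm", "variable": "FLNS"},
    {"component": "atm", "variable": "UNKNOWN_VAR"},
]
enriched = [enrich(row) for row in rows]
for row in enriched:
    print(row)
```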


In [None]:
col.df.head(10)

In [None]:
# List all available variables by Long Name, sorted alphabetically
uniques = col.unique(columns=["variable_long_name"])
namelist = sorted(uniques["variable_long_name"]["values"])
namelist

In [None]:
# Show all available data for a specific variable based on long name
varname = "salinity"
col.search(variable_long_name=varname).df

**Table:** _All available data in this catalog for the selected variable._


### Substring matches

Using wildcards and/or regular expressions, we can find all variables with
a particular substring in the long name:


In [None]:
# Find all entries with `wind` in their variable_long_name
col.search(variable_long_name="wind*").df

In [None]:
# Find all entries whose variable long name starts with `wind`
col.search(variable_long_name="^wind").df

**Table(s):** _Information about all matching datasets_
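The exact matching rules depend on the library version. The sketch below assumes that pattern strings are handed to a case-sensitive regular-expression search that may match anywhere in the value, and uses Python's `re` module to show what that would imply. The long names are hypothetical examples, not entries from the catalog.

```python
import re

# Hypothetical long names chosen to expose the differences between patterns.
long_names = [
    "Zonal wind",
    "wind speed",
    "Growing season length",
    "Net longwave flux at surface",
]


def matching(pattern, names):
    """Return the names in which the regular expression matches anywhere."""
    return [name for name in names if re.search(pattern, name)]


# As a regular expression, "wind*" means "win" followed by zero or more "d",
# so it also matches names that merely contain "win".
star_matches = matching("wind*", long_names)

# The plain substring "wind" is stricter.
substring_matches = matching("wind", long_names)

# "^wind" anchors the match at the start of the name.
prefix_matches = matching("^wind", long_names)

print(star_matches)
print(substring_matches)
print(prefix_matches)
```

Under that assumption, a glob-style `*` is not a "match anything" wildcard, so it is worth checking the returned long names when a pattern gives more rows than expected.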


## Loading data into xarray datasets

Once we are satisfied with the results of our searches, we can use the
`to_dataset_dict()` method to load the data into xarray datasets.


In [None]:
dsets = col.search(
    variable_long_name="temp*",
    frequency="monthly",
    experiment="20C",
    component="ocn",
).to_dataset_dict(zarr_kwargs={"consolidated": True})
dsets

### Grid variables


Grid variables, including the latitudes and longitudes of tracer points, do not
have variable names or long names. So, to find them we need to use the
`frequency='static'` query:


In [None]:
col.search(frequency="static").df

In [None]:
# Load the grid variables for a specific component and experiment
_, grid = (
    col.search(frequency="static", component="ocn", experiment="20C")
    .to_dataset_dict(aggregate=False, zarr_kwargs={"consolidated": True})
    .popitem()
)
grid
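The `.popitem()` call above relies on ordinary dictionary behaviour: it removes and returns one `(key, value)` pair, which is convenient when the search is expected to yield a single entry. A minimal sketch with a stand-in dictionary follows; the key and the placeholder value are hypothetical, since the real keys depend on the catalog and the aggregation settings.

```python
# Stand-in for the dictionary returned by to_dataset_dict(); the key and
# the value are hypothetical placeholders, not real catalog entries.
datasets = {"example-grid-key": "<grid dataset>"}

# Non-destructive: peek at the first value without modifying the dictionary.
grid_peek = next(iter(datasets.values()))

# Destructive: popitem() removes and returns the last inserted (key, value) pair.
key, grid_value = datasets.popitem()

print(key, grid_value)
print("Entries left:", len(datasets))
```

If the dictionary should stay intact for later cells, the non-destructive form may be the safer choice.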