<img src="../../thumbnail.png" width=250 alt="CESM LENS image"></img>

# Enhanced Intake-ESM Catalog Demo

---

## Overview
This notebook compares the original [Intake-ESM](https://intake-esm.readthedocs.io/en/stable/) catalog with an enhanced catalog that includes additional attributes. Each catalog is an inventory of the NCAR Community Earth System Model (CESM) Large Ensemble (LENS) data hosted on AWS S3 ([doi:10.26024/wt24-5j82](https://doi.org/10.26024/wt24-5j82)).

## Prerequisites

| Concepts | Importance | Notes |
| --- | --- | --- |
| [Intro to Pandas](https://foundations.projectpythia.org/core/pandas/pandas.html) | Necessary | |

- **Time to learn**: 10 minutes

---

## Imports

In [None]:
import intake
import pandas as pd
import pprint

# Allow multiple lines per cell to be displayed without print (default is just last line)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Enable more explicit control of DataFrame display (e.g., to omit the row index)
from IPython.display import HTML

## Original Intake-ESM Catalog

Open the original collection description file:

In [None]:
cat_url_orig = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le.json'
coll_orig = intake.open_esm_datastore(cat_url_orig)

In [None]:
print(coll_orig.esmcol_data['description']) # Description of collection
print("Catalog file:", coll_orig.esmcol_data['catalog_file'])
print(coll_orig) # Summary of collection structure

Show an expanded version of the collection structure with details:

In [None]:
uniques_orig = coll_orig.unique(columns=["component", "frequency", "experiment", "variable"])
pprint.pprint(uniques_orig, compact=True, indent=1, width=80)

Show the first few lines of the catalog. There are as many lines as there are paths. The order is the same as that of the CSV catalog file listed in the JSON description file.

In [None]:
print("Catalog file:", coll_orig.esmcol_data['catalog_file'])
df = coll_orig.df
HTML(df.head(10).to_html(index=False))

**Table**: First few lines of the original Intake-ESM catalog showing the model component, the temporal frequency, the experiment, the abbreviated variable name, and the AWS S3 path for each Zarr store.

## Finding Data

If you happen to know the meaning of the variable names, you can find what data are available for that variable. For example:

In [None]:
df = coll_orig.search(variable='FLNS').df
HTML(df.to_html(index=False))

We can narrow the filter to a specific frequency and experiment:

In [None]:
df = coll_orig.search(variable='FLNS', frequency='daily', experiment='RCP85').df
HTML(df.to_html(index=False))
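
Conceptually, each keyword passed to `.search()` narrows the catalog table by an exact match on one column. The sketch below reproduces that behavior with plain pandas on a small toy table. The helper `exact_search` and the rows of `toy_df` are hypothetical and only illustrative. This is not how Intake-ESM is implemented internally.

```python
import pandas as pd

# Toy stand-in for coll_orig.df; the rows are illustrative, not real catalog entries.
toy_df = pd.DataFrame(
    {
        "component": ["atm", "atm", "atm"],
        "frequency": ["daily", "monthly", "daily"],
        "experiment": ["RCP85", "RCP85", "20C"],
        "variable": ["FLNS", "FLNS", "FLNS"],
    }
)


def exact_search(df, **criteria):
    """Return the rows where every named column exactly equals the given value."""
    mask = pd.Series(True, index=df.index)
    for column, value in criteria.items():
        mask &= df[column] == value
    return df[mask]


hits = exact_search(toy_df, variable="FLNS", frequency="daily", experiment="RCP85")
print(hits.to_string(index=False))
```

Each additional keyword can only keep or remove rows, so adding criteria never enlarges the result.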

## The Problem

Do all potential users know that `FLNS` is a CESM-specific abbreviation for “Net longwave flux at surface”? How would a novice user find out, other than by finding separate documentation, or by opening a Zarr store in the hopes that the long name might be recorded there? How do we address the fact that every climate model code seems to have a different, non-standard name for all the variables, thus making multi-source research needlessly difficult?

## Enhanced Intake-ESM Catalog

By adding additional columns to the Intake-ESM catalog, we should be able to improve semantic interoperability and provide potentially useful information to the users. Let's now open the enhanced collection description file:

In [None]:
cat_url = 'https://ncar-cesm-lens.s3-us-west-2.amazonaws.com/catalogs/aws-cesm1-le-enhanced.json'
coll = intake.open_esm_datastore(cat_url)
coll

In [None]:
print(coll.esmcol_data['description']) # Description of collection
print("Catalog file:", coll.esmcol_data['catalog_file'])
print(coll) # Summary of collection structure

### Long names

In the summary above, note the additional elements: `long_name`, `start`, `end`, and `dim`. Here are the first few lines of the enhanced catalog:

In [None]:
print("Catalog file:", coll.esmcol_data['catalog_file'])
HTML(coll.df.head(10).to_html(index=False))

**Table**: First few lines of the enhanced catalog, which lists the same information as the original catalog plus the long name of each variable and an indication of whether each variable is 2D or 3D.

<div class="admonition alert alert-warning">
    <p class="admonition-title" style="font-weight:bold">Warning</p>
    The long names are <em>not</em> CF Standard Names, but rather are those documented at 
<a href="http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html">http://www.cgd.ucar.edu/ccr/strandwg/CESM-CAM5-BGC_LENS_fields.html</a>. For interoperability, the <code>long_name</code> column should be replaced by a <code>cf_name</code> column and possibly an <code>attribute</code> column to disambiguate if needed.
</div>

List all available variables by long name, sorted alphabetically:

In [None]:
uniques = coll.unique(columns=['long_name'])
nameList = sorted(uniques['long_name']['values'])
print(*nameList, sep='\n') #note *list to unpack each item for print function

Show all available data for a specific variable based on long name:

In [None]:
myName = 'Salinity'
HTML(coll.search(long_name=myName).df.to_html(index=False))
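
The `long_name` column also makes the reverse lookup possible: given an abbreviation such as `FLNS`, find its description. The following is a minimal sketch that builds such a lookup with pandas. The helper `build_name_lookup` is hypothetical, and `toy_df` is a small illustrative stand-in for `coll.df`; with the real catalog you would pass `coll.df` instead.

```python
import pandas as pd

# Toy stand-in for coll.df; the rows are illustrative, not copied from the real catalog.
toy_df = pd.DataFrame(
    {
        "variable": ["FLNS", "FLNS", "SALT"],
        "long_name": [
            "Net longwave flux at surface",
            "Net longwave flux at surface",
            "Salinity",
        ],
        "experiment": ["20C", "RCP85", "20C"],
    }
)


def build_name_lookup(df):
    """Map each abbreviated variable name to its long name."""
    # A variable appears once per experiment/frequency, so drop repeated pairs first.
    pairs = df[["variable", "long_name"]].drop_duplicates()
    return dict(zip(pairs["variable"], pairs["long_name"]))


name_lookup = build_name_lookup(toy_df)
print(name_lookup["FLNS"])
```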

### Substring matches

The current version of the Intake-ESM `.search()` function requires an exact, full-string, case-sensitive match of `long_name`. (This has been reported as an issue at [https://github.com/NCAR/cesm-lens-aws/issues/48](https://github.com/NCAR/cesm-lens-aws/issues/48).) The following workaround finds all variables whose long name contains a particular substring:

In [None]:
myTerm = 'Wind'
myTerm = myTerm.lower() #search regardless of case
partials = [name for name in nameList if myTerm in name.lower()]
print(f"All datasets with name containing {myTerm}:")
print(*partials, sep='\n')

Display full table for each match (could be lengthy if many matches):

In [None]:
for name in partials:
    df = coll.search(long_name=name).df[['component', 'dim', 'experiment', 'variable', 'long_name']]
    HTML(df.to_html(index=False))
    ###df.head(1) #show only first entry in each group for compactness
    # Note: It is also possible to hide column(s) instead of specifying desired columns
    ###coll.search(long_name=name).df.drop(columns=['path'])

<div class="admonition alert alert-warning">
    <p class="admonition-title" style="font-weight:bold">Warning</p>
    This case-insensitive substring matching is not integrated into Intake-ESM, so it is unclear whether the search results can be passed directly to Xarray to read data.
</div>
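
The same substring match can also be applied directly to the catalog DataFrame in a single step. The sketch below uses pandas string methods. The helper `search_long_name` is hypothetical, and the long names in `toy_df` are illustrative rather than taken from the catalog. The result is a plain DataFrame rather than an Intake-ESM collection, so the caveat in the warning above still applies.

```python
import pandas as pd

# Toy stand-in for coll.df; the long names are illustrative only.
toy_df = pd.DataFrame(
    {
        "variable": ["U", "TAUX", "SALT"],
        "long_name": ["Zonal wind", "Windstress in grid-x direction", "Salinity"],
    }
)


def search_long_name(df, term):
    """Return rows whose long_name contains term, ignoring case."""
    # regex=False treats the term as literal text; na=False skips missing names.
    mask = df["long_name"].str.contains(term, case=False, regex=False, na=False)
    return df[mask]


matches = search_long_name(toy_df, "wind")
print(matches.to_string(index=False))
```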

### Other attributes

Other columns in the enhanced catalog may be useful. For example, the dimensionality column enables us to list all 3D data from the ocean component.

In [None]:
df = coll.search(dim="3D", component="ocn").df
HTML(df.to_html(index=False))

### Spatiotemporal filtering

If there were both regional and global data available (e.g., LENS and NA-CORDEX data for the same variable, both listed in the same catalog), some type of coverage indicator (or columns for bounding box edges) could be listed.

Temporal extent in LENS is conveyed by the experiment (HIST, 20C, etc.), but this is imprecise and requires external documentation. We have added start/end columns to the catalog, but Intake-ESM currently does not have built-in functionality to filter based on time.

We can do a simple search that exactly matches a temporal value:

In [None]:
df = coll.search(dim="3D", component="ocn", end='2100-12').df
HTML(df.to_html(index=False))
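
A range-based filter can be approximated with pandas by operating on the catalog DataFrame. The following is a minimal sketch, assuming the `start` and `end` columns hold strings that pandas can parse as months (such as the `'2100-12'` value used above). The helper `overlaps` is hypothetical, and the rows of `toy_df` are illustrative rather than real catalog entries.

```python
import pandas as pd

# Toy stand-in for coll.df; the rows and dates are illustrative only.
toy_df = pd.DataFrame(
    {
        "variable": ["SALT", "SALT", "TEMP"],
        "experiment": ["20C", "RCP85", "RCP85"],
        "start": ["1920-01", "2006-01", "2006-01"],
        "end": ["2005-12", "2100-12", "2100-12"],
    }
)


def overlaps(df, first, last):
    """Return rows whose [start, end] span overlaps [first, last], inclusive."""
    start = pd.PeriodIndex(df["start"], freq="M")
    end = pd.PeriodIndex(df["end"], freq="M")
    # Two spans overlap when each one starts no later than the other ends.
    mask = (start <= pd.Period(last, freq="M")) & (end >= pd.Period(first, freq="M"))
    return df[mask]


result = overlaps(toy_df, "2000-01", "2010-12")
print(result.to_string(index=False))
```

As with the substring workaround, the result is a DataFrame, not an Intake-ESM collection.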

---

## Summary
In this notebook, we used Intake-ESM to explore a catalog of CESM LENS data. We then worked through some helpful features of the enhanced catalog.

### What's next?
We will use this data to recreate some figures from a [paper published in BAMS that describes the CESM LENS project](https://journals.ametsoc.org/view/journals/bams/96/8/bams-d-13-00255.1.xml).

## Resources and references
[Original notebook in the Pangeo Gallery](https://gallery.pangeo.io/repos/NCAR/cesm-lens-aws/notebooks/EnhancedIntakeCatalogDemo.html)