I'm pleased to announce the release of intake-esm version 2020.3.16. This release brings bug fixes and new features. Everyone is invited to give it a try and share their thoughts, suggestions, and feedback! This blog post outlines the changes. The full changelog is available [here](https://intake-esm.readthedocs.io/en/latest/changelog.html#intake-esm-v2020-03-16).


On GitHub: https://github.com/NCAR/intake-esm

Documentation: https://intake-esm.readthedocs.io/

<!-- TEASER_END -->

## Installation

Intake-esm can be installed from PyPI with pip:

```bash
python -m pip install intake-esm --upgrade
```

It is also available from the conda-forge channel for conda installations:

```bash
conda install -c conda-forge intake-esm
```

## New Features


### Enhanced search: enforce query criteria via `require_all_on` argument

By default, intake-esm's `search()` method returns entries that fulfill any of the criteria specified in the query. For example, say you define a set of experiments (`experiment_id=['piControl', 'historical']`) and query the catalog. intake-esm may return a mix of assets, e.g. `[ModelA.piControl, ModelB.historical, ModelC.piControl, ModelC.historical]`. For some use cases this is not a problem. Some analyses, however, require models that provide both experiments.

As of this release, intake-esm can return only the entries that fulfill all query criteria when the user supplies the `require_all_on` argument. `require_all_on` consists of the attributes/dataframe columns to use when enforcing the query criteria.

For example: `col.search(experiment_id=['piControl', 'historical'], require_all_on='source_id')` would return all assets whose `source_id` has both the `piControl` and `historical` experiments.

```python
import intake
url = "https://git.io/JvP9r"
col = intake.open_esm_datastore(url)
col
```

```
pangeo-cmip6-ESM Collection with 235624 entries:
	> 15 activity_id(s)
	> 32 institution_id(s)
	> 69 source_id(s)
	> 101 experiment_id(s)
	> 135 member_id(s)
	> 29 table_id(s)
	> 313 variable_id(s)
	> 10 grid_label(s)
	> 235624 zstore(s)
	> 60 dcpp_init_year(s)
```

```python
# Define our query
query = dict(
    variable_id=["thetao", "o2"],
    experiment_id=["historical", "ssp245", "ssp585"],
    table_id=["Omon"],
)

# Search for assets that fulfill our query
col_subset = col.search(**query)

# Count the distinct experiments, variables, and tables found per model
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()
```

| source_id | experiment_id | variable_id | table_id |
|---|---|---|---|
| ACCESS-CM2 | 3 | 1 | 1 |
| ACCESS-ESM1-5 | 3 | 2 | 1 |
| AWI-CM-1-1-MR | 2 | 1 | 1 |
| BCC-CSM2-MR | 3 | 1 | 1 |
| BCC-ESM1 | 1 | 1 | 1 |
| CAMS-CSM1-0 | 3 | 1 | 1 |
| CESM2 | 3 | 1 | 1 |
| CESM2-FV2 | 1 | 1 | 1 |
| CESM2-WACCM | 3 | 1 | 1 |
| CESM2-WACCM-FV2 | 1 | 1 | 1 |


As you can see, the search results above include `source_id` values that provide only one of the two variables, or only one or two of the three experiments.

By setting `require_all_on=["source_id"]`, intake-esm discards entries that don't fulfill all query criteria, i.e. it returns only entries whose `source_id` has both variables `["thetao", "o2"]` and all three experiments `["historical", "ssp245", "ssp585"]`:

```python
col_subset = col.search(require_all_on=["source_id"], **query)
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()
```

| source_id | experiment_id | variable_id | table_id |
|---|---|---|---|
| ACCESS-ESM1-5 | 3 | 2 | 1 |
| CNRM-ESM2-1 | 3 | 2 | 1 |
| CanESM5 | 3 | 2 | 1 |
| CanESM5-CanOE | 3 | 2 | 1 |
| GFDL-CM4 | 3 | 2 | 1 |
| IPSL-CM6A-LR | 3 | 2 | 1 |
| MIROC-ES2L | 3 | 2 | 1 |
| MPI-ESM1-2-HR | 3 | 2 | 1 |
| UKESM1-0-LL | 3 | 2 | 1 |
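
To make the difference between the two searches concrete, here is a minimal pandas sketch of the idea. It is not intake-esm's actual implementation: the helper name `search_require_all` is hypothetical, and it assumes that "all criteria" means every requested value of every queried column must appear within each `require_all_on` group. The toy catalog reuses the `ModelA`/`ModelB`/`ModelC` example from above.

```python
import pandas as pd


def search_require_all(df, require_all_on, **query):
    """Hypothetical helper: keep only groups that contain every requested value."""
    # Accept a single string as well as a list for each criterion.
    query = {
        column: [values] if isinstance(values, str) else list(values)
        for column, values in query.items()
    }
    if isinstance(require_all_on, str):
        require_all_on = [require_all_on]

    # Step 1: keep rows whose value is among the requested ones in every queried column.
    mask = pd.Series(True, index=df.index)
    for column, values in query.items():
        mask &= df[column].isin(values)
    matches = df[mask]

    # Step 2: keep only the groups in which every requested value is present.
    def has_all_values(group):
        return all(
            set(values) <= set(group[column]) for column, values in query.items()
        )

    return matches.groupby(require_all_on).filter(has_all_values)


# Toy catalog (made-up rows, for illustration only)
catalog = pd.DataFrame(
    {
        "source_id": ["ModelA", "ModelB", "ModelC", "ModelC"],
        "experiment_id": ["piControl", "historical", "piControl", "historical"],
    }
)

subset = search_require_all(
    catalog, "source_id", experiment_id=["piControl", "historical"]
)
print(subset)
```

In this toy catalog only `ModelC` provides both experiments, so its two rows are the only ones that survive the second step.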


Thanks to [Julius Busecke](https://github.com/jbusecke) for proposing this feature and reviewing the implementation. 

### Single File Catalogs



Earlier versions of the [esm collection spec](https://github.com/NCAR/esm-collection-spec) required the `catalog_file` entry in the input JSON file to point to a CSV file. In some cases, it is useful to embed the content that would otherwise be in the CSV directly in the input JSON file. To support this use case, a `catalog_dict` entry was added to the esm collection spec (see [NCAR/esm-collection-spec#15](https://github.com/NCAR/esm-collection-spec/pull/15)).

Example: `catalog-dict-records.json`

```json
{
    "esmcat_version":"0.1.0",
    "id":"aws-cesm1-le",
    "description":"This is an ESM collection for CESM1 Large Ensemble Zarr dataset publicly available on Amazon S3 (us-west-2 region)",
    "catalog_dict":[
        {
            "component":"atm",
            "frequency":"daily",
            "experiment":"20C",
            "variable":"FLNS",
            "path":"s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr"
        },
        {
            "component":"atm",
            "frequency":"daily",
            "experiment":"20C",
            "variable":"FLNSC",
            "path":"s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNSC.zarr"
        },
        {
            "component":"atm",
            "frequency":"daily",
            "experiment":"20C",
            "variable":"FLUT",
            "path":"s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLUT.zarr"
        },
        {
            "component":"atm",
            "frequency":"daily",
            "experiment":"20C",
            "variable":"FSNS",
            "path":"s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNS.zarr"
        },
        {
            "component":"atm",
            "frequency":"daily",
            "experiment":"20C",
            "variable":"FSNSC",
            "path":"s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FSNSC.zarr"
        }
    ],
    "attributes":[
        {
            "column_name":"component",
            "vocabulary":""
        },
        {
            "column_name":"frequency",
            "vocabulary":""
        },
        {
            "column_name":"experiment",
            "vocabulary":""
        },
        {
            "column_name":"variable",
            "vocabulary":""
        }
    ],
    "assets":{
        "column_name":"path",
        "format":"zarr"
    },
    "aggregation_control":{
        "variable_column_name":"variable",
        "groupby_attrs":[
            "component",
            "experiment",
            "frequency"
        ],
        "aggregations":[
            {
                "type":"union",
                "attribute_name":"variable",
                "options":{
                    "compat":"override"
                }
            }
        ]
    }
}
```

```python
import intake
col = intake.open_esm_datastore("catalog-dict-records.json")
```
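
If you already have a collection split across a JSON file and a CSV file, a single-file catalog could be produced with a few lines of pandas. The sketch below is only an illustration: `embed_catalog` is a hypothetical helper (not part of intake-esm), and it assumes the `catalog_file` entry is a local path, either absolute or relative to the JSON file.

```python
import json
import os

import pandas as pd


def embed_catalog(json_path, out_path):
    """Hypothetical helper: copy a collection file, embedding its CSV rows as `catalog_dict`."""
    with open(json_path) as f:
        spec = json.load(f)

    # Assumption: `catalog_file` is a local path. os.path.join keeps an
    # absolute path unchanged and resolves a relative one against the JSON file.
    base_dir = os.path.dirname(os.path.abspath(json_path))
    csv_path = os.path.join(base_dir, spec.pop("catalog_file"))

    # One dict per CSV row. Round-tripping through to_json guarantees
    # plain, JSON-serializable Python types.
    records = json.loads(pd.read_csv(csv_path).to_json(orient="records"))
    spec["catalog_dict"] = records

    with open(out_path, "w") as f:
        json.dump(spec, f, indent=4)
    return spec
```

Calling `embed_catalog("collection.json", "catalog-dict-records.json")` (file names here are placeholders) would write a file shaped like the example above, which can then be opened with `intake.open_esm_datastore`.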

Thanks to [Joe Hamman](https://github.com/jhamman) for proposing this feature and reviewing the implementation. Thanks to [Brian Bonnlander](https://github.com/bonnland) for implementing this feature. 

### Relative paths for catalog files

Earlier versions of `intake-esm` required absolute paths/URLs for the catalog file (`csv`) when fetching and loading catalogs.

For example:


- `old_sample.json`:
    
```json
{
  "esmcat_version": "0.1.0",
  "id": "campaign-cesm2-cmip6-timeseries",
  "description": "ESM collection for the CESM2 raw output that went into CMIP6 data. Located in campaign storage, accessible via GLADE on casper",
  "catalog_file": "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.csv.gz",
    ...
}
```
    

Now the `catalog_file` can point to either a full path or a path relative to the input JSON file:


- `new_sample.json`:
```json
{
  "esmcat_version": "0.1.0",
  "id": "campaign-cesm2-cmip6-timeseries",
  "description": "ESM collection for the CESM2 raw output that went into CMIP6 data. Located in campaign storage, accessible via GLADE on casper",
  "catalog_file": "campaign-cesm2-cmip6-timeseries.csv.gz",
    ...
}
```
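
One way to picture how a relative `catalog_file` might be resolved is the sketch below. It is not the code intake-esm uses: `resolve_catalog_file` is a hypothetical helper, and it assumes POSIX-style local paths and that a URL's last path segment is the JSON file name.

```python
import os
from urllib.parse import urlparse


def resolve_catalog_file(json_path, catalog_file):
    """Hypothetical helper: locate `catalog_file` relative to the JSON file."""
    # Absolute local paths and full URLs are used as-is.
    if os.path.isabs(catalog_file) or urlparse(catalog_file).scheme:
        return catalog_file

    # JSON file lives at a URL (http, s3, ...): replace its last path
    # segment with the catalog file name.
    if urlparse(json_path).scheme:
        return json_path.rsplit("/", 1)[0] + "/" + catalog_file

    # JSON file is local: join the catalog file with the JSON file's directory.
    return os.path.join(os.path.dirname(os.path.abspath(json_path)), catalog_file)


print(
    resolve_catalog_file(
        "https://example.com/catalogs/new_sample.json",
        "campaign-cesm2-cmip6-timeseries.csv.gz",
    )
)
```

The practical benefit of the relative form is that the JSON and CSV files can be moved or mirrored together without editing the `catalog_file` entry.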

## Acknowledgements

The following people contributed to the [NCAR/intake-esm](https://github.com/NCAR/intake-esm) and [NCAR/esm-collection-spec](https://github.com/NCAR/esm-collection-spec) repositories since the intake-esm `2019.12.13` release on December 13th, 2019:


- Anderson Banihirwe
- Brian Bonnlander
- Joe Hamman
- Julius Busecke