# Earth System Grid Federation Data Access

The simplest way to search datasets programmatically within the Earth System Grid Federation (ESGF) is to use [`intake-esgf`](https://intake-esgf.readthedocs.io/). It works like `intake-esm`, but instead of loading a static catalog, `intake-esgf` populates its catalog by querying ESGF index nodes. There is also a new graphical search interface called [Metagrid](https://aims2.llnl.gov/search), but it does not yet federate data from all ESGF nodes.

ESGF provides two different mechanisms to find data: the legacy Solr index and the new Globus index. ESGF is decommissioning the old Solr-based indexing service, so methods relying on it might break. `intake-esgf` works with both, so it should allow for a graceful transition.

`intake-esgf` sends search queries to a list of index nodes and aggregates the results. You can configure which index nodes it queries, which is useful because some nodes are very slow to respond. For this notebook, we'll query only Globus nodes to speed things up. To make sure you find all the available data, call `intake_esgf.conf.set(all_indices=True)` to search all nodes.
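To make the query-and-aggregate idea concrete, here is a minimal sketch in plain Python. It is not how `intake-esgf` is implemented: the two `fake_*_index` functions and the `search_all` helper are hypothetical stand-ins for real index nodes, and the dataset IDs are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_globus_index(**facets):
    """Hypothetical stand-in for a fast index node."""
    return [
        {"dataset_id": "ds-A", "index": "globus"},
        {"dataset_id": "ds-B", "index": "globus"},
    ]


def fake_solr_index(**facets):
    """Hypothetical stand-in for a slower node that overlaps with the first."""
    return [
        {"dataset_id": "ds-B", "index": "solr"},
        {"dataset_id": "ds-C", "index": "solr"},
    ]


def search_all(indices, **facets):
    """Query every index concurrently and merge results, one entry per dataset."""
    with ThreadPoolExecutor() as pool:
        responses = list(pool.map(lambda index: index(**facets), indices))
    merged = {}
    for response in responses:
        for record in response:
            # Keep the first record seen for each dataset ID.
            merged.setdefault(record["dataset_id"], record)
    return sorted(merged)


found = search_all([fake_globus_index, fake_solr_index], variable_id="hurs")
print(found)
```

Dropping a node from the list makes the search faster but can hide datasets that only that node knows about, which is the trade-off described above.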

In [None]:
# NBVAL_IGNORE_OUTPUT

import intake_esgf

# Use only Globus index nodes to speed things up
intake_esgf.conf["solr_indices"] = {}

# Show the configuration
print(intake_esgf.conf)

# Initialize an empty ESGF catalog
cat = intake_esgf.ESGFCatalog()

# Launch a search query.
# Here we're looking for monthly near-surface relative humidity (`hurs`) in four CMIP6 SSP experiments.
# The results are stored in the catalog (`cat`).
cat.search(
    project="CMIP6",
    variable_id=["hurs"],
    table_id="Amon",
    experiment_id=["ssp126", "ssp245", "ssp370", "ssp585"],
)

You can get a sense of what datasets are available with the `model_groups` method, which returns dataset counts for each unique combination of `source_id`, `member_id` and `grid_label`.

Other useful methods are `remove_ensembles`, which keeps only one `member_id` per model group (the one with the smallest integer values in its *ripf* code), and `remove_incomplete`, which filters model groups according to criteria you define. See the [docs](https://intake-esgf.readthedocs.io/en/latest/modelgroups.html) for details.
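As an illustration of what "smallest *ripf* code" means, the sketch below parses member IDs into integer tuples and picks the minimum. It assumes standard CMIP6 variant labels of the form `r<N>i<N>p<N>f<N>`; the `ripf_key` helper is hypothetical, and the exact ordering rule used inside `intake-esgf` may differ.

```python
import re

RIPF = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def ripf_key(member_id):
    """Turn 'r10i1p1f1' into (10, 1, 1, 1) so members sort numerically."""
    match = RIPF.fullmatch(member_id)
    if match is None:
        raise ValueError(f"not a ripf code: {member_id!r}")
    return tuple(int(number) for number in match.groups())


members = ["r10i1p1f1", "r2i1p1f1", "r1i1p2f1", "r1i1p1f1"]
smallest = min(members, key=ripf_key)
print(smallest)
```

Comparing integer tuples rather than raw strings matters here: a plain string sort would place `r10i1p1f1` before `r2i1p1f1`.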

In [None]:
# NBVAL_IGNORE_OUTPUT

# Keep only one member per model, experiment and grid.
cat.remove_ensembles()

# Remove model groups that don't have all four SSPs.
cat.remove_incomplete(lambda df: len(df) == 4)

cat.model_groups()
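The `len(df) == 4` check relies on each model group having exactly one row per experiment. A more explicit variant is sketched below, assuming the function passed to `remove_incomplete` receives the rows of one model group as a table with an `experiment_id` column. The name `has_all_ssps` is ours, and plain dictionaries stand in for that table so the sketch runs offline.

```python
REQUIRED_SSPS = {"ssp126", "ssp245", "ssp370", "ssp585"}


def has_all_ssps(df):
    """Return True when every required scenario appears in the group."""
    return REQUIRED_SSPS.issubset(set(df["experiment_id"]))


# Plain dicts stand in for the per-group table (an assumption for this sketch).
complete = {"experiment_id": ["ssp126", "ssp245", "ssp370", "ssp585"]}
partial = {"experiment_id": ["ssp126", "ssp245", "ssp245", "ssp585"]}
print(has_all_ssps(complete), has_all_ssps(partial))
```

With a real catalog, this would be passed as `cat.remove_incomplete(has_all_ssps)`.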

Now we'll try to access some data. For small queries, a good approach is to stream the data rather than download entire files. Here we'll request only the CanESM5 simulations and try to stream them. Getting the file information can take some time.

In [None]:
# NBVAL_IGNORE_OUTPUT

# Let's focus the search on a single model to speed up the rest of the notebook
cat.search(
    project="CMIP6",
    source_id="CanESM5",
    variable_id=["hurs"],
    table_id="Amon",
    experiment_id=["ssp126", "ssp245", "ssp370", "ssp585"],
)
cat.remove_ensembles()

# The `prefer_streaming` argument specifies that we'd rather not download entire files.
# When True, the `add_measures` argument triggers a search for the variables referenced
# in the `cell_measures` attribute, such as `areacella` or `areacello`.
dsd = cat.to_dataset_dict(prefer_streaming=True, add_measures=False)

In [None]:
# NBVAL_IGNORE_OUTPUT

# Here the result is keyed by experiment_id.
dsd["ssp370"]["hurs"]

By default, the `to_dataset_dict` method downloads files locally. If you already hold datasets locally, you can specify `esg_dataroot` in the configuration. You can also specify `local_cache`, the location where missing datasets will be downloaded.
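The idea of reusing files you already hold before downloading can be sketched as a simple lookup. This is an illustration only: `find_local_copy` is a hypothetical helper, the file names are invented, and the order in which `intake-esgf` itself consults `esg_dataroot` and `local_cache` is an assumption here.

```python
import tempfile
from pathlib import Path


def find_local_copy(relative_path, data_roots, cache_dir):
    """Return an existing local copy, or None if a download would be needed.

    Hypothetical helper: data roots are searched first, then the cache.
    """
    for root in [*data_roots, cache_dir]:
        candidate = Path(root) / relative_path
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    dataroot = Path(tmp) / "dataroot"
    cache = Path(tmp) / "cache"
    held = dataroot / "CMIP6" / "held.nc"
    held.parent.mkdir(parents=True)
    held.touch()
    cache.mkdir()

    hit = find_local_copy("CMIP6/held.nc", [dataroot], cache)
    miss = find_local_copy("CMIP6/missing.nc", [dataroot], cache)
    print(hit is not None, miss is None)
```

A `None` result corresponds to the case where the file would have to be downloaded into the cache.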

Please check the [documentation](https://intake-esgf.readthedocs.io/) for more details on how to use `to_dataset_dict`.