# Downloading an ensemble of CMIP6 data based on a set of criteria with intake-esgf

---

## Purpose of the notebook

---

This notebook shows how to **download an ensemble of CMIP6 variables from Python** based on a set of **user-defined criteria**.

All the linked documents were accessed on **25/03/2025**.

It uses the *intake-esgf* library : https://github.com/esgf2-us/intake-esgf?tab=readme-ov-file. A beginner's guide to this library can be found here : https://intake-esgf.readthedocs.io/en/latest/beginner.html.

Detailed documentation for CMIP6 can be found here : https://wcrp-cmip.org/cmip-model-and-experiment-documentation/.



Feel free to share, use and improve the following code according to the provided license on the repository.

---

## Model outputs searched

---

Every search needs to be constrained by the attributes of the model outputs we are looking for. A detailed document listing these attributes can be found here : https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit?tab=t.0.

### Experiments

We use two CMIP6 experiments : **piClim-control** and **piClim-aer**. Both are atmosphere-only climate model simulations in which sea surface temperatures (SSTs) and sea ice concentrations (SICs) are fixed at model-specific preindustrial climatological values. The description of the experiments can be found here : https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html.

> **piClim-control** : the control experiment, with aerosol burdens set to their preindustrial levels.
> 
> **piClim-aer** : uses present-day (2014) aerosol burdens.

### Variables

The variables used are listed and described below, following : https://clipc-services.ceda.ac.uk/dreq/mipVars.html.

> <span style="color:SkyBlue">**clt**</span>  : Total cloud area fraction (%) for the whole atmospheric column
>
> <span style="color:gold">**rsdt / rldt**</span> : Shortwave / Longwave radiation ($W/m^{2}$) **incident** at the TOA
> 
> <span style="color:orange">**rsut / rlut**</span> : Shortwave / Longwave radiation ($W/m^{2}$) **going out**  at the TOA
>
> <span style="color:orangered">**rsutcs / rlutcs**</span> : Shortwave / Longwave radiation ($W/m^{2}$) **going out**  at TOA for **clear-sky conditions**
> 
> <span style="color:Orchid">**rsds / rlds**</span> : Shortwave / Longwave **downwelling** radiation ($W/m^{2}$) at the surface
> 
> <span style="color:Indigo ">**rsdscs / rldscs**</span>  : Shortwave / Longwave **downwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
> 
> <span style="color:YellowGreen">**rsus / rlus**</span> : Shortwave / Longwave **upwelling** radiation ($W/m^{2}$) at the surface
>
> <span style="color:Darkgreen">**rsuscs / rluscs**</span>: Shortwave / Longwave **upwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
>
> **areacella** : Area ($m^{2}$) of each grid cell of the atmospheric grid, which varies with latitude.

### Table

The table sets how the variables are organized. We use the **Amon** table. The details about the tables can be found here : https://clipc-services.ceda.ac.uk/dreq/index/miptable.html.

> **Amon** stands for a set of monthly atmospheric data

---

## Initialisation

---

### Importations

We import the needed libraries.

In [1]:
# ================ IMPORTATIONS ================ #

import intake_esgf  # this gives us access to the ESGF catalog to make queries

import pandas as pd  # to manage the product of the search

import numpy as np  # to manage the pandas arrays

import os  # to get access to commands related to path setting and creation of directories

from folders_handle.create import (
    create_dir,
)  # function to create a cleaned downloading directory

### Set our search criteria

Here the user may define the search criteria. Each criterion is stored in a plain Python variable (a list or a string) that is later passed to the catalog search.

In [2]:
# ================ SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### EXPERIMENTS ###

experiment_id = [
    "piClim-control",
    "piClim-aer",
]

### VARIABLES ###

variable_id = [
    "clt",
    "rsdt",
    "rsut",
    "rsutcs",
    "rsds",
    "rsus",
    "rsdscs",
    "rsuscs",
    "rlut",
    "rlutcs",
    "rlds",
    "rlus",
]

### TABLE ###

table_id = "Amon"

### Create the folder in which the data will be stored

Here the user can choose to create the folder in which the data will be stored. A pre-existing folder is erased if the *do_we_clear* option is set to **True**. This can be quite slow if the data folder already holds a large amount of data.

In [3]:
# ================ CHOOSE IF WE CLEAR AN ALREADY EXISTING FOLDER ================ #

do_we_clear = False

In [4]:
# ================ CREATE THE FOLDER TO STORE THE DOWNLOADED DATA ================ #

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

## Home directory ##

homedir_path = os.path.expanduser("~")

## Parent directory ##

parent_path = homedir_path + "/certainty-data"

## Name of the created folder ##

downloading_folder_name = "CMIP6-DATA"

### CREATE THE DIRECTORY AND EMPTY IT IF DO_WE_CLEAR IS TRUE ###

downloading_path = create_dir(
    parent_path=parent_path, name=downloading_folder_name, clear=do_we_clear
)

print(
    "The downloading folder {} is under the path {}.".format(
        downloading_folder_name, downloading_path
    )
)

The downloading folder CMIP6-DATA is under the path /home/jovyan/certainty-data/CMIP6-DATA.
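
The `create_dir` helper comes from the repository's own `folders_handle` package and is not shown in this notebook. To give a rough idea of what such a helper could do, here is a minimal sketch using only the standard library. The name `create_dir_sketch` and its exact behaviour are assumptions made for illustration; the actual function may differ.

```python
import os
import shutil


def create_dir_sketch(parent_path, name, clear=False):
    """Hypothetical stand-in for create_dir: make parent_path/name and return its path."""
    full_path = os.path.join(parent_path, name)

    # Wipe a pre-existing folder only when explicitly asked to
    if clear and os.path.isdir(full_path):
        shutil.rmtree(full_path)

    # Create the folder (and any missing parent); do nothing if it already exists
    os.makedirs(full_path, exist_ok=True)

    return full_path
```

With `clear=False`, calling the function twice leaves the existing content untouched, which is the safe default when the folder already holds downloaded data.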


---
## Configure the ESGFCatalog


---

The ESGFCatalog comes with default settings that one may want to change. We will focus on three main changes :

* **defining the nodes that the query will investigate** 

* **setting where the data will be downloaded** 

* **adding a path that is specific to our cluster to search for CMIP6 outputs locally** 

The default configuration of the catalog can be accessed through the following line of code. More details on the configuration may be found here : https://intake-esgf.readthedocs.io/en/latest/configure.html.

In [5]:
print(intake_esgf.conf)

additional_df_cols: []
break_on_error: true
download_db: ~/.config/intake-esgf/download.db
esg_dataroot:
- /p/css03/esgf_publish
- /eagle/projects/ESGF2/esg_dataroot
- /global/cfs/projectdirs/m3522/cmip6/
globus_indices:
  anl-dev: true
  ornl-dev: true
local_cache:
- ~/.esgf/
logfile: ~/.config/intake-esgf/esgf.log
num_threads: 6
solr_indices:
  esg-dn1.nsc.liu.se: false
  esgf-data.dkrz.de: false
  esgf-node.ipsl.upmc.fr: false
  esgf-node.llnl.gov: false
  esgf-node.ornl.gov: false
  esgf.ceda.ac.uk: false
  esgf.nci.org.au: false



There are some variables that are of interest to us :

* The *solr_indices* and the *globus_indices* variables define on which nodes the query is run.

* The *local_cache* variable sets where the data will be downloaded.

* The *esg_dataroot* variable sets the local paths to explore before downloading any data.

We can see that, as stated in the documentation of the intake_esgf library, the search is done by default with the Globus-based indices covering the holdings of OLCF (Oak Ridge Leadership Computing Facility) and ALCF (Argonne Leadership Computing Facility). We may extend the search to all the available ESGF nodes in order not to miss any model output. Note that the Solr nodes are much slower than the Globus ones.

By default, the downloaded data is stored in ~/.esgf/, a hidden folder in your home directory. If you are working on a shared resource such as an institutional cluster or group workstation, it is more convenient to point it to a directory dedicated to data.

Finally, if we are working on a cluster that already hosts some CMIP6 data, it is worth adding the path of this data to avoid useless downloads.

In the following part we will show how to modify these variables. 

**To apply these modifications, set the corresponding variables below (True to enable an option, or a non-empty path for the cluster-specific directory).**

In [6]:
# ================ CONFIGURE THE ESGF CATALOG ================ #

### USE ALL NODES FOR SEARCH ###

all_indices = False

### SET A NEW DOWNLOADING PATH ###

set_new_downloading_path = True

### SET A CLUSTER SPECIFIC CMIP6 PATH ###

cluster_local_CMIP6_path = ""

### Define the nodes for the search

We may decide to look at all the nodes. Note that the Solr nodes are much slower than the Globus ones.

In [7]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

We are only looking at the globus nodes


### Set where the data will be downloaded

In [8]:
if set_new_downloading_path:

    intake_esgf.conf.set(local_cache=downloading_path)

    print(
        "The CMIP6 data will be downloaded at the path : {}".format(
            intake_esgf.conf["local_cache"]
        )
    )

The CMIP6 data will be downloaded at the path : ['/home/jovyan/certainty-data/CMIP6-DATA']


### Add a cluster-specific CMIP6 path

In [9]:
intake_esgf.conf.set(esg_dataroot=cluster_local_CMIP6_path)

if cluster_local_CMIP6_path != "":

    print(
        "The CMIP6 data will be searched beforehand at the path : {}".format(
            intake_esgf.conf["esg_dataroot"]
        )
    )

else:
    print("No local cluster-specific CMIP6 path")

No local cluster-specific CMIP6 path


### Print the new configuration

In [10]:
print(intake_esgf.conf)

additional_df_cols: []
break_on_error: true
download_db: ~/.config/intake-esgf/download.db
esg_dataroot:
- ''
globus_indices:
  anl-dev: true
  ornl-dev: true
local_cache:
- /home/jovyan/certainty-data/CMIP6-DATA
logfile: ~/.config/intake-esgf/esgf.log
num_threads: 6
solr_indices:
  esg-dn1.nsc.liu.se: false
  esgf-data.dkrz.de: false
  esgf-node.ipsl.upmc.fr: false
  esgf-node.llnl.gov: false
  esgf-node.ornl.gov: false
  esgf.ceda.ac.uk: false
  esgf.nci.org.au: false



---

## Make the query to the ESGFCatalog

---

### Initialise the catalog

The catalog variable is initially empty and will be filled according to the criteria that we impose on the query.

In [11]:
catalog = intake_esgf.ESGFCatalog()

print(catalog)

Perform a search() to populate the catalog.


### Constrain the catalog

We apply the criteria defined earlier to the catalog.

In [12]:
catalog.search(experiment_id=experiment_id, variable_id=variable_id, table_id=table_id)

   Searching indices:   0%|          |0/2 [       ?index/s]

Summary information for 747 results:
mip_era                                                     [CMIP6]
activity_drs                                    [RFMIP, AerChemMIP]
institution_id    [NCC, NASA-GISS, CCCma, MRI, MPI-M, EC-Earth-C...
source_id         [NorESM2-LM, GISS-E2-1-G, CanESM5, NorESM2-MM,...
experiment_id                          [piClim-aer, piClim-control]
member_id         [r1i1p2f1, r1i1p1f1, r1i1p1f2, r1i1p3f1, r1i1p...
table_id                                                     [Amon]
variable_id       [rsdt, rlut, rlus, rsuscs, rlutcs, rsut, rsds,...
grid_label                                            [gn, gr, gr1]
dtype: object

### Convert the catalog results into a pandas dataframe

The resulting catalog can be converted to a pandas dataframe. This is convenient to isolate some properties of the search results, such as the models found, identified by their *source_id*.

In [13]:
# ================ CONVERT TO PANDAS DATAFRAME ================ #

initial_search_df = catalog.df

### SHOW THE SOURCE_ID THAT APPEAR AT LEAST ONCE ###

## Extract the list of the models' names ##

# Retrieve the source_id column and take only one example for every duplicate #

list_model_names = initial_search_df.source_id.unique()

## Print the result ##

print("The list of found models' names is : \n{}".format(list_model_names))

The list of found models' names is : 
['NorESM2-LM' 'GISS-E2-1-G' 'CanESM5' 'NorESM2-MM' 'MRI-ESM2-0'
 'MPI-ESM1-2-LR' 'EC-Earth3' 'ACCESS-ESM1-5' 'CESM2' 'HadGEM3-GC31-LL'
 'CNRM-ESM2-1' 'IPSL-CM6A-LR-INCA' 'IPSL-CM6A-LR' 'MPI-ESM-1-2-HAM'
 'EC-Earth3-AerChem' 'MIROC6' 'GFDL-ESM4' 'UKESM1-0-LL' 'E3SM-2-0'
 'BCC-ESM1' 'GFDL-CM4' 'CNRM-CM6-1' 'ACCESS-CM2' 'TaiESM1' 'CESM2-WACCM']


### Regroup the results by model

The number of results is very large. Many of them correspond to the same model, because several *member_id* and *grid_label* values are available for each model.

The *member_id* or *variant_label* is made of 4 indices defining an ensemble member: *r* for realization, *i* for initialization, *p* for physics, and *f* for forcing. These indices define an ensemble of simulations that all follow the conditions of the main experiment for a given model. For instance, modellers may initialize their model from a different point in time, change the value of a physical parameter, and so on.
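
To make the structure of these labels concrete, here is a minimal sketch that splits a variant label into its four indices with a regular expression. The function name `parse_member_id` is hypothetical, and the sketch assumes a plain `r<N>i<N>p<N>f<N>` label; it does not handle labels carrying a sub-experiment prefix.

```python
import re

# Pattern for plain CMIP6 variant labels such as "r1i1p1f1" or "r11i1p1f2"
VARIANT_PATTERN = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_member_id(member_id):
    """Split a variant label into realization, initialization, physics and forcing indices."""
    match = VARIANT_PATTERN.match(member_id)
    if match is None:
        raise ValueError("Unexpected variant label: {}".format(member_id))
    keys = ("realization", "initialization", "physics", "forcing")
    return dict(zip(keys, map(int, match.groups())))


print(parse_member_id("r11i1p1f1"))
# {'realization': 11, 'initialization': 1, 'physics': 1, 'forcing': 1}
```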

Let's regroup the results according to **(model, variant, grid)** tuples. The result here is truncated.

In [14]:
grouped_models = catalog.model_groups()

print(grouped_models)

source_id          member_id  grid_label
ACCESS-CM2         r1i1p1f1   gn             8
ACCESS-ESM1-5      r1i1p1f1   gn            24
BCC-ESM1           r1i1p1f1   gn            24
CanESM5            r1i1p2f1   gn            24
CESM2              r1i1p1f1   gn            24
CESM2-WACCM        r1i2p1f1   gn            12
CNRM-CM6-1         r1i1p1f2   gr            24
CNRM-ESM2-1        r1i1p1f2   gr            24
E3SM-2-0           r1i1p1f1   gr            12
EC-Earth3          r1i1p1f1   gr            24
                   r2i1p1f1   gr            24
EC-Earth3-AerChem  r1i1p1f1   gr            20
GFDL-CM4           r1i1p1f1   gr1           24
GFDL-ESM4          r1i1p1f1   gr1           14
GISS-E2-1-G        r1i1p1f1   gn            24
                   r1i1p1f2   gn            24
                   r1i1p3f1   gn            24
                   r1i1p3f2   gn            12
HadGEM3-GC31-LL    r1i1p1f3   gn            11
IPSL-CM6A-LR       r1i1p1f1   gr            24
                   

An interesting thing to note is the last column, which gives the number of results found for a given model, variant and grid. In our case, since we are looking for **12** variables for **two experiments**, the search will be deemed complete if we find **24 resulting files**.

The next step is therefore to get rid of the incomplete results.
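
Rather than hard-coding this number, it can be derived from the criteria lists. A minimal sketch, repeating the lists defined earlier so that it runs on its own:

```python
experiment_id = ["piClim-control", "piClim-aer"]

variable_id = [
    "clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus",
    "rsdscs", "rsuscs", "rlut", "rlutcs", "rlds", "rlus",
]

# One result is expected for every (experiment, variable) pair
expected_number_of_files = len(experiment_id) * len(variable_id)

print(expected_number_of_files)  # 24
```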

### Test some filters on the incomplete results

The user may define an expected number of netCDF files for a given (model, variant, grid) tuple. In our case this is **24**, as explained before. Let's impose this condition by keeping only the results that match it. We will produce a pandas Series that will allow us to check that our filter does what we want.

You may define a more refined filter for the results. Please look at the intake-esgf documentation for more information : https://intake-esgf.readthedocs.io/en/latest/modelgroups.html.
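
As an illustration of a more refined filter, the sketch below checks that every requested variable is present for every requested experiment, instead of only counting rows. It assumes that the function receives a pandas dataframe describing one model group, with `experiment_id` and `variable_id` columns as in `catalog.df`. The function name `has_all_pairs`, the toy dataframe and the shortened lists are made up for the example.

```python
import itertools

import pandas as pd

# Shortened criteria, for illustration only
wanted_experiments = ["piClim-control", "piClim-aer"]
wanted_variables = ["clt", "rsut"]


def has_all_pairs(model_group):
    """Return True if every (experiment, variable) pair is present in the group."""
    found = set(zip(model_group["experiment_id"], model_group["variable_id"]))
    wanted = set(itertools.product(wanted_experiments, wanted_variables))
    return wanted <= found


# Toy model group in which rsut is missing for piClim-aer
toy_group = pd.DataFrame(
    {
        "experiment_id": ["piClim-control", "piClim-control", "piClim-aer"],
        "variable_id": ["clt", "rsut", "clt"],
    }
)

print(has_all_pairs(toy_group))  # False
```

A function of this kind could then be passed to `catalog.remove_incomplete` in place of a plain row count.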

In [15]:
# ================ TEST A FILTER BY GROUPING BY MODELS ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 24

### FILTER THE INCOMPLETE RESULTS ACCORDING TO OUR CRITERION ###

filtered_results = grouped_models[grouped_models == expected_number_of_files]

### PRINT THE FILTERED CATALOG ###

print(filtered_results)

source_id          member_id  grid_label
ACCESS-ESM1-5      r1i1p1f1   gn            24
BCC-ESM1           r1i1p1f1   gn            24
CanESM5            r1i1p2f1   gn            24
CESM2              r1i1p1f1   gn            24
CNRM-CM6-1         r1i1p1f2   gr            24
CNRM-ESM2-1        r1i1p1f2   gr            24
EC-Earth3          r1i1p1f1   gr            24
                   r2i1p1f1   gr            24
GFDL-CM4           r1i1p1f1   gr1           24
GISS-E2-1-G        r1i1p1f1   gn            24
                   r1i1p1f2   gn            24
                   r1i1p3f1   gn            24
IPSL-CM6A-LR       r1i1p1f1   gr            24
                   r2i1p1f1   gr            24
                   r3i1p1f1   gr            24
                   r4i1p1f1   gr            24
IPSL-CM6A-LR-INCA  r1i1p1f1   gr            24
MIROC6             r1i1p1f1   gn            24
                   r11i1p1f1  gn            24
MPI-ESM-1-2-HAM    r1i1p1f1   gn            24
MRI-ESM2-0         

The result is a pandas Series that we can manipulate as such. For example, we can retrieve all the member_id and grid_label values associated with a single model such as *IPSL-CM6A-LR* :

In [16]:
filtered_results["IPSL-CM6A-LR"]

member_id  grid_label
r1i1p1f1   gr            24
r2i1p1f1   gr            24
r3i1p1f1   gr            24
r4i1p1f1   gr            24
Name: project, dtype: int64
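
Other selections work the same way, since the series is indexed by a three-level MultiIndex. Here is a small sketch on a hand-made series whose entries are copied from the filtered output above; the variable name `toy_results` is hypothetical.

```python
import pandas as pd

# Hand-made excerpt of the filtered series
index = pd.MultiIndex.from_tuples(
    [
        ("EC-Earth3", "r1i1p1f1", "gr"),
        ("EC-Earth3", "r2i1p1f1", "gr"),
        ("IPSL-CM6A-LR", "r1i1p1f1", "gr"),
        ("MIROC6", "r1i1p1f1", "gn"),
    ],
    names=["source_id", "member_id", "grid_label"],
)
toy_results = pd.Series(24, index=index)

# Models that still have at least one complete (variant, grid) pair
remaining_models = toy_results.index.get_level_values("source_id").unique().tolist()

# Number of complete (variant, grid) pairs per model
variants_per_model = toy_results.groupby(level="source_id").size()

print(remaining_models)
print(variants_per_model)
```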

Good, let's have a look at the quantity of results we have left.

In [17]:
print(
    "The number of remaining (model, variant, grid) tuples is {}.".format(
        filtered_results.shape[0]
    )
)

The number of remaining (model, variant, grid) tuples is 25.


### Applying the filter on the catalog

If this satisfies us, we need to write a small function that will be **applied to the catalog**. The test will be executed on each model group shown in the previous section.

In [18]:
# ================ DEFINING THE FILTERING FUNCTION ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 24


def filtering_function(model_group):
    # Keep a model group only if it holds the expected number of results
    return len(model_group) == expected_number_of_files

In [19]:
catalog = catalog.remove_incomplete(filtering_function)

In [20]:
catalog = catalog.remove_ensembles()

## Downloading the files and loading a dictionary in memory

The intake-esgf library can load the search results in memory as a dictionary of xarray datasets, one per dataset found. In the process, the corresponding netCDF files are downloaded to the previously defined local_cache path.

By default, the package automatically looks for the **areacella** variable. But it does so **rather slowly** and does not first check what is stored in the local_cache. In our analysis, we would rather load the dictionary with *add_measures* set to **False** and then download (and load) the areacella netCDF files separately with a homemade routine. An example of this process is given afterward.

What's more, we may declare some facets as irrelevant for building the keys of the dictionary. Indeed, by default, the intake_esgf package builds the keys out of the facet values that differ among the entries of the output dictionary. Some of these facets might not be of any interest for our analysis, and we can drop them with the *ignore_facets* option.

In [21]:
dataset_dict = catalog.to_dataset_dict(
    add_measures=False, ignore_facets=["activity_drs", "institution_id", "table_id"]
)

Get file information:   0%|          |0/2 [       ?index/s]

KeyboardInterrupt: 
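
To illustrate how dropping facets shortens the dictionary keys, here is a pure-Python sketch of the idea described above: only the facets whose values differ across entries end up in the key. This is an illustration of the principle, not the library's actual implementation; the function name `build_keys` and the toy entries are made up for the example.

```python
# Toy catalog entries; the facet values are made up for the example
entries = [
    {"activity_drs": "RFMIP", "source_id": "CanESM5",
     "experiment_id": "piClim-control", "variable_id": "clt"},
    {"activity_drs": "RFMIP", "source_id": "CanESM5",
     "experiment_id": "piClim-aer", "variable_id": "clt"},
    {"activity_drs": "AerChemMIP", "source_id": "CanESM5",
     "experiment_id": "piClim-aer", "variable_id": "rsut"},
]


def build_keys(entries, ignore_facets=()):
    """Join, for each entry, the values of the facets that differ across entries."""
    facets = [f for f in entries[0] if f not in ignore_facets]
    # Keep only the facets taking more than one value among the entries
    varying = [f for f in facets if len({e[f] for e in entries}) > 1]
    return [".".join(e[f] for f in varying) for e in entries]


print(build_keys(entries))
print(build_keys(entries, ignore_facets=["activity_drs"]))
```

In this toy case, `source_id` never appears in the keys because it takes a single value, and ignoring `activity_drs` leaves keys of the form `experiment_id.variable_id`.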

## Homemade routine to retrieve the areacella of a given (model, variant, grid_label) tuple only once

The main drawback of the intake-esgf package when it comes to looking for measures is that it does so for every single variable, regardless of the fact that our 12 variables for one experiment share the same areacella grid. What's more, it always starts by searching with our full set of facets, which is irrelevant.

Indeed, the only three parameters that define areacella are the **source_id**, **member_id** and obviously the **grid_label**. Conveniently enough, we can group our search results according to these three parameters with the *.model_groups* method. In the end, our experiments may not hold the areacella variable themselves, but this does not really matter : we just need to find one for the same model and grid. This is the spirit of this routine, made to still get areacella but significantly faster.

Note that this method is of little use if you only need a small number of models and variables, as the time difference will be barely noticeable.

In [None]:
all_indices = True

In [None]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

In [None]:
# ================ HOMEMADE ROUTINE TO GET AREACELLA GRIDS FASTER ================ #

### INITIALISATION  ###

## Initialise the full dictionary ##

dict_areacella = {}

## Models grouped by SOURCE_ID | MEMBER_ID | GRID_LABEL ##

series_grouped_models = catalog.model_groups()

## Separate catalog for the areacella searches, so that the main catalog is not overwritten ##

areacella_catalog = intake_esgf.ESGFCatalog()

## Number of rows ##

n_rows = series_grouped_models.size

### GO THROUGH EVERY ROW OF THE PANDAS SERIES ###

for ii in range(n_rows):

    ## Get the SOURCE_ID | MEMBER_ID | GRID_LABEL of the row ##

    # Retrieve the row #

    row_ii = series_grouped_models.index[ii]

    # Extract the labels #

    source_id, member_id, grid_label = row_ii

    # Build the key for this dictionary entry #

    full_key = source_id + "." + member_id + "." + grid_label

    ## Do the full search ##

    areacella_search_full = areacella_catalog.search(
        source_id=source_id, grid_label=grid_label, variable_id="areacella", quiet=True
    ).df  # silence the progress bar

    ## Extract the first experiment id that gives an areacella entry ##

    only_first_exp_id = areacella_search_full.experiment_id.values[0]

    ## Get the areacella for the given row ##

    # Search and download it #

    areacella_ii = areacella_catalog.search(
        source_id=source_id,
        grid_label=grid_label,
        variable_id="areacella",
        experiment_id=only_first_exp_id,
        quiet=True,
    ).to_dataset_dict(
        add_measures=False, quiet=True
    )  # silence the progress bar

    # Store the first dataset found in the dictionary #
    # (the key is "areacella" only when the search returns a single dataset) #

    dict_areacella[full_key] = next(iter(areacella_ii.values()))

In [None]:
dataset_dict

{'piClim-control.rlus': <xarray.Dataset> Size: 30MB
 Dimensions:      (lat: 143, lon: 144, time: 360, axis_nbounds: 2)
 Coordinates:
   * lat          (lat) float32 572B -90.0 -88.73 -87.46 ... 87.46 88.73 90.0
   * lon          (lon) float32 576B 0.0 2.5 5.0 7.5 ... 350.0 352.5 355.0 357.5
   * time         (time) object 3kB 2014-01-16 12:00:00 ... 2043-12-16 12:00:00
 Dimensions without coordinates: axis_nbounds
 Data variables:
     time_bounds  (time, axis_nbounds) object 6kB ...
     rlus         (time, lat, lon) float32 30MB ...
 Attributes: (12/53)
     name:                   /ccc/work/cont003/gencmip6/p25sima/IGCM_OUT/LMDZO...
     Conventions:            CF-1.7 CMIP-6.2
     creation_date:          2018-10-03T12:57:01Z
     tracking_id:            hdl:21.14100/6acc2ba1-ea91-490e-80a4-3f609ca6045d
     description:            30-year atmosphere only integration using preindu...
     title:                  IPSL-CM6A-LR model output prepared for CMIP6 / RF...
     ...          