# Downloading an ensemble of CMIP6 data with a set of criteria using intake-esgf

---

## Purpose of the notebook

---

This notebook aims at **downloading an ensemble of CMIP6 variables from Python** using a dictionary of **user-defined criteria**.

All the links in this document were accessed on **25/03/2025**.

The notebook uses the *intake-esgf* library: https://github.com/esgf2-us/intake-esgf?tab=readme-ov-file. A beginner's guide for this library can be found here: https://intake-esgf.readthedocs.io/en/latest/beginner.html.

Detailed documentation for CMIP6 can be found here: https://wcrp-cmip.org/cmip-model-and-experiment-documentation/.



Feel free to share, use and improve the following code according to the license provided in the repository.

---

## Model outputs searched

---

Every search needs to be constrained by the attributes of the model outputs we are looking for. A detailed document listing these attributes can be found here : https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit?tab=t.0.

### Experiments

We use two experiments carried out for CMIP6: **piClim-control** and **piClim-aer**. Both are atmosphere-only climate model simulations in which sea surface temperatures (SSTs) and sea ice concentrations (SICs) are fixed at model-specific preindustrial climatological values. The description of the experiments can be found here: https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html.

> **piClim-control** : the control experiment, with aerosol burdens set to their preindustrial levels.
> 
> **piClim-aer** : uses present-day (2014) aerosol burdens.

### Variables

The variables used are listed and described below, following: https://clipc-services.ceda.ac.uk/dreq/mipVars.html.

> <span style="color:SkyBlue">**clt**</span>  : Total cloud area fraction (%) for the whole atmospheric column
>
> <span style="color:gold">**rsdt**</span> : Shortwave radiation ($W/m^{2}$) **incident** at the TOA
> 
> <span style="color:orange">**rsut**</span> : Shortwave radiation ($W/m^{2}$) **going out**  at the TOA
>
> <span style="color:orangered">**rsutcs**</span> : Shortwave radiation ($W/m^{2}$) **going out**  at TOA for **clear-sky conditions**
> 
> <span style="color:Orchid">**rsds**</span> : Shortwave **downwelling** radiation ($W/m^{2}$) at the surface
> 
> <span style="color:Indigo ">**rsdscs**</span>  : Shortwave **downwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
> 
> <span style="color:YellowGreen">**rsus**</span> : Shortwave **upwelling** radiation ($W/m^{2}$) at the surface
>
> <span style="color:Darkgreen">**rsuscs**</span>: Shortwave **upwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
>
> **areacella** : For every grid, the latitude-dependent area associated with each grid cell.

### Table

The table defines how the variables are organized. We use the **Amon** table. Details about the tables can be found here: https://clipc-services.ceda.ac.uk/dreq/index/miptable.html.

> **Amon** stands for a set of monthly atmospheric data

---

## Initialisation

---

### Imports

We import the needed libraries.

In [1]:
# ================ IMPORTS ================ #

import intake_esgf  # this gives us access to the ESGF catalog to make queries

import pandas as pd  # to manage the product of the search

import numpy as np  # to manage the pandas arrays

import os  # to get access to commands related to path setting and creation of directories

from utilities.download.folders_handle.create import (
    create_dir,
)  # function to create a cleaned downloading directory

### Set our search criteria

Here the user may define their search criteria. We build a dictionary from the variables defined below.

In [2]:
# ================ SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### EXPERIMENTS ###

experiment_id = [
    "piClim-control",
    "piClim-aer",
]

### VARIABLES ###

variable_id = [
    "clt",
    "rsdt",
    "rsut",
    "rsutcs",
    "rsds",
    "rsus",
    "rsdscs",
    "rsuscs",
]

### TABLE ###

table_id = "Amon"

### DEFINE THE SEARCH CRITERIA DICTIONARY ###

search = {
    "experiment_id" : experiment_id,
    "variable_id" : variable_id,
    "table_id" : table_id
}

### Create the folder in which the data will be stored

Here the user can choose whether to clear the folder that stores the data. A pre-existing folder is erased if the *do_we_clear* option is set to **True**. This can be quite slow if the data folder already holds a lot of data.

In [3]:
# ================ CHOOSE WHETHER TO CLEAR AN ALREADY EXISTING FOLDER ================ #

do_we_clear = True

In [4]:
# ================ CREATE THE FOLDER TO STORE THE DOWNLOADED DATA ================ #

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

## Home directory ##

homedir_path = os.path.expanduser("~")

## Parent directory ##

parent_path = homedir_path + "/certainty-data"

## Name of the created folder ##

downloading_folder_name = "CMIP6-DATA"

### CREATE THE DIRECTORY AND EMPTY IT IF DO_WE_CLEAR IS TRUE ###

downloading_path = create_dir(
    parent_path=parent_path, name=downloading_folder_name, clear=do_we_clear
)

print(
    "The downloading folder {} is under the path {}.".format(
        downloading_folder_name, downloading_path
    )
)

The downloading folder CMIP6-DATA is under the path /home/jovyan/certainty-data/CMIP6-DATA.
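
The `create_dir` helper is imported from the repository and its code is not shown in this notebook. To make the effect of the *do_we_clear* option concrete, here is a minimal stand-in written under the assumption that the helper creates `parent_path/name` and empties it on request. The name `create_dir_sketch` is hypothetical and the real helper may behave differently.

```python
import shutil
import tempfile
from pathlib import Path


def create_dir_sketch(parent_path: str, name: str, clear: bool = False) -> str:
    """Hypothetical stand-in for create_dir: make parent_path/name, optionally emptied."""
    target = Path(parent_path) / name
    # Delete the folder and all its content only when explicitly asked to
    if clear and target.exists():
        shutil.rmtree(target)
    # Create the folder and any missing parent; an existing folder is left as is
    target.mkdir(parents=True, exist_ok=True)
    return str(target)


# Try it in a temporary directory so that no real data is touched
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(create_dir_sketch(tmp, "CMIP6-DATA"))
    (folder / "dummy.nc").touch()

    # clear=False keeps what is already in the folder
    create_dir_sketch(tmp, "CMIP6-DATA", clear=False)
    kept = (folder / "dummy.nc").exists()

    # clear=True erases the previous content
    create_dir_sketch(tmp, "CMIP6-DATA", clear=True)
    erased = not (folder / "dummy.nc").exists()
```

Deleting with `shutil.rmtree` is what makes clearing slow on a folder that already holds heavy data, so setting the option to **False** is the safer choice when resuming an interrupted download.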


In [None]:
from utilities.download.load_cmip6 import loading_cmip6

full_cmip6_dict, areacella_dict = loading_cmip6(
    parent_path = parent_path, 
    downloading_folder_name = downloading_folder_name, 
    case = "ZELINKA-SW", 
    do_we_clear = True,
    remove_ensembles = False
    )

The CMIP6 data will be searched and downloaded at the path : ['/home/jovyan/certainty-data/CMIP6-DATA']

The search criterias are : {'experiment_id': ['piClim-control', 'piClim-aer'], 'variable_id': ['clt', 'rsdt', 'rsut', 'rsutcs', 'rsds', 'rsus', 'rsdscs', 'rsuscs'], 'table_id': 'Amon'}

Filling the catalog with the search criterias...




---
## Configure the ESGFCatalog


---

The ESGFCatalog is initially configured with default values that one may want to change. We will focus on three main changes :

* **defining the nodes that the query will investigate** 

* **setting where the data will be downloaded** 

* **adding a path that is specific to our cluster to search for CMIP6 outputs locally** 

The default configuration of the catalog can be accessed through the following line of code. More details on the configuration may be found here : https://intake-esgf.readthedocs.io/en/latest/configure.html.

In [None]:
print(intake_esgf.conf)

Some of these variables are of interest to us :

* The *solr_indices* and *globus_indices* variables define which index nodes are queried. 

* The *local_cache* variable sets where the data will be downloaded.

* The *esg_dataroot* variable sets a local path to explore before downloading any data.

As stated in the documentation of the intake_esgf library, the search is done by default with the Globus-based indices covering the holdings of OLCF (Oak Ridge Leadership Computing Facility) and ALCF (Argonne Leadership Computing Facility). We may extend the search to all the available ESGF nodes so as not to miss any model output. Note that the solr nodes are much slower than the globus ones.

By default, the downloaded data is stored in ~/.esgf/, a hidden folder in your home directory. If you are working on a shared resource such as an institutional cluster or a group workstation, it is more convenient to point it to a directory dedicated to data.

Finally, if we are working on a cluster that already holds some CMIP6 data, it is worth adding the path to this data to avoid useless downloads.

The following part shows how to modify these variables. 

**To choose which ones to modify, set the two flags below to True or False, and give a non-empty path to *cluster_local_CMIP6_path* to enable the local search.**

In [None]:
# ================ CONFIGURE THE ESGF CATALOG ================ #

### USE ALL NODES FOR SEARCH ###

all_indices = True

### SET A NEW DOWNLOADING PATH ###

set_new_downloading_path = True

### SET A CLUSTER SPECIFIC CMIP6 PATH ###

cluster_local_CMIP6_path = ""

### Define the nodes for the search

We may decide to query all the nodes. Note that the solr nodes are much slower than the globus ones. 

In [None]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

### Set where the data will be downloaded

In [None]:
if set_new_downloading_path:

    intake_esgf.conf.set(local_cache=downloading_path)

    print(
        "The CMIP6 data will be downloaded at the path : {}".format(
            intake_esgf.conf["local_cache"]
        )
    )

### Add a cluster-specific CMIP6 path

In [None]:
if cluster_local_CMIP6_path != "":

    intake_esgf.conf.set(esg_dataroot=cluster_local_CMIP6_path)

    print(
        "The CMIP6 data will be searched beforehand at the path : {}".format(
            intake_esgf.conf["esg_dataroot"]
        )
    )

else:
    print("No local cluster-specific CMIP6 path")

### Print the new configuration

In [None]:
print(intake_esgf.conf)

---

## Make the query to the ESGFCatalog

---

### Initialise the catalog

The catalog is initially empty and will be filled according to the criteria that we impose on the query. 

In [None]:
catalog = intake_esgf.ESGFCatalog()

print(catalog)

### Constrain the catalog

We apply the criteria defined earlier to the catalog.

In [None]:
catalog.search(**search)

In [None]:
print(catalog.session_log())

### Convert the catalog results into a pandas dataframe

The resulting catalog can be converted to a pandas dataframe. This is convenient for isolating some properties of the catalog, such as the models it has found, identified by their *source_id*.

In [None]:
# ================ CONVERT TO PANDAS DATAFRAME ================ #

initial_search_df = catalog.df

### SHOW THE SOURCE_ID THAT APPEAR AT LEAST ONCE ###

## Extract the list of the models' names ##

# Retrieve the source_id column and take only one example for every duplicate #

list_model_names = initial_search_df.source_id.unique()

## Print the result ##

print("The list of found models' names is : \n{}".format(list_model_names))

### Regroup the results by model

The number of results is very large. Many of them are near-duplicates that come from the different *member_id* and *grid_label* values available for each model. 

The *member_id*, or *variant_label*, is described by four indices defining an ensemble member: *r* for realization, *i* for initialization, *p* for physics, and *f* for forcing. Together they define an ensemble of simulations that share the main experiment conditions for a given model. For instance, modellers may initialize their model from a different point in time, change the parametrization of a given process, and so on.
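
To make the structure of a variant label concrete, the sketch below splits a label such as `r11i1p1f1` into its four indices with a regular expression. The names `VariantLabel` and `parse_variant_label` are illustrative and are not part of intake-esgf. The sketch only handles the plain `r<N>i<N>p<N>f<N>` form and rejects anything else.

```python
import re
from typing import NamedTuple


class VariantLabel(NamedTuple):
    realization: int
    initialization: int
    physics: int
    forcing: int


# One group of digits after each of the four letters r, i, p and f
_VARIANT_PATTERN = re.compile(r"r(\d+)i(\d+)p(\d+)f(\d+)")


def parse_variant_label(label: str) -> VariantLabel:
    """Split a variant label into its realization, initialization, physics and forcing indices."""
    match = _VARIANT_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"Not a valid variant label: {label!r}")
    return VariantLabel(*(int(group) for group in match.groups()))


print(parse_variant_label("r11i1p1f1"))
# VariantLabel(realization=11, initialization=1, physics=1, forcing=1)
```

Parsed labels can then be sorted or filtered numerically, for example to keep only the members with `physics == 1`.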

Let's group the results by **(model, variant, grid)** tuples. The result shown here is truncated.

In [None]:
grouped_models = catalog.model_groups()

print(grouped_models)

An interesting thing to note is the last column, which gives the number of results found for a given model, variant and grid. In our case, since we are looking for **8** variables in **two experiments**, a group is deemed complete if it holds **16 resulting files**.

The next step is therefore to discard the incomplete results according to a user-defined criterion.
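
The expected count can be derived from the search criteria instead of being typed by hand, which keeps it correct if a variable or an experiment is added later. The sketch below computes it and then counts results per group with the standard library. The `found` list is made-up data standing in for the search results, and the model names in it are placeholders.

```python
from collections import Counter

experiment_id = ["piClim-control", "piClim-aer"]
variable_id = ["clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs"]

# One result is expected for every (experiment, variable) pair
expected_number_of_files = len(experiment_id) * len(variable_id)

# Made-up search results: one (source_id, member_id, grid_label) tuple per result found
found = (
    [("MODEL-A", "r1i1p1f1", "gn")] * 16  # complete group
    + [("MODEL-B", "r1i1p1f1", "gr")] * 14  # two results missing
)

# Count the results of each group and keep the groups that reach the expected count
counts = Counter(found)
complete_groups = [group for group, n in counts.items() if n == expected_number_of_files]

print(expected_number_of_files)  # 16
print(complete_groups)  # [('MODEL-A', 'r1i1p1f1', 'gn')]
```

A matching count is a necessary condition only: it does not by itself prove that every (experiment, variable) pair is present, which is why a per-variable check on a suspicious model remains useful.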

### Investigate which variables a specific model lacks

As the previous method shows, some models do not have the expected number of variables. Thanks to the pandas dataframe structure, we can check which variables are missing for which experiment.

In [None]:
# ================ SEE WHAT VARIABLES ARE MISSING ================ #

### LOOK AT A GIVEN MODEL ###

## Define source_id and member_id ##

# Source_id #

looked_source_id = "GFDL-ESM4"

# Member_id #

looked_member_id = "r1i1p1f1"

## Select the given source_id and member_id ##

# Source_id #

selected_source_id = initial_search_df[initial_search_df.source_id == looked_source_id]

# Member_id #

selected_model = selected_source_id[selected_source_id.member_id == looked_member_id]

## Generate the variable_id series for the piClim-control experiment ##

variable_id_for_control_exp = selected_model[selected_model.experiment_id == "piClim-control"].variable_id

## Generate the variable_id series for the piClim-aer experiment ##

variable_id_for_aer_exp = selected_model[selected_model.experiment_id == "piClim-aer"].variable_id

### FIND VARIABLES THAT ARE MISSING FOR BOTH EXPERIMENTS ###

## Make the criteria variable_id list as a panda series ##

searched_variable_ids = pd.Series(variable_id, dtype = str)

## Get the variables that are found in at least one experiment ##

var_in_at_least_one = pd.Series(np.union1d(variable_id_for_control_exp, variable_id_for_aer_exp), dtype = str) 

## See the missing ones compared to the variable_id list ##

print("Variables missing in both experiments of {}.{} : \n".format(looked_source_id, looked_member_id))

missing_variables_from_both = searched_variable_ids[~searched_variable_ids.isin(var_in_at_least_one)].values

print(missing_variables_from_both)

### GET THE VARIABLES MISSING IN ONLY ONE EXPERIMENT ###

## Extract the variables found in both experiments ##

var_in_both = pd.Series(np.intersect1d(variable_id_for_control_exp, variable_id_for_aer_exp), dtype = str) 

## Get the variables not in common between the experiments ##

# we keep the variables that are not in both experiments 
# by removing the ones that are found in both

var_not_in_common = var_in_at_least_one[~var_in_at_least_one.isin(var_in_both)] 

## See the missing ones compared to the variable_id list ##

missing_variables_not_in_common = searched_variable_ids[searched_variable_ids.isin(var_not_in_common)].values

## Find the concerned experiment if missing_variables_not_in_common is not empty ##

# Check if it's empty #

if (missing_variables_not_in_common.size > 0) :

    # Get the number of variables found for each experiment #

    number_variables_control = len(variable_id_for_control_exp.values)

    number_variables_aer = len(variable_id_for_aer_exp.values)

    # Get the least furnished experiment #

    if number_variables_control < number_variables_aer :

        print("\nOnly the control experiment of {}.{} is lacking these variables :\n".format(looked_source_id, looked_member_id))

        print(missing_variables_not_in_common)

    else :
        
        print("\nOnly the aerosol experiment of {}.{} is lacking these variables :\n".format(looked_source_id, looked_member_id))

        print(missing_variables_not_in_common)
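
The cell above decides which experiment is affected by comparing the number of variables found in each one. That shortcut can mislead when each experiment lacks a different variable, since the counts are then equal. As a complementary sketch, the same question can be answered per experiment with plain sets. The function name `missing_by_experiment` and the example rows are illustrative, not part of the notebook's utilities.

```python
def missing_by_experiment(rows, experiments, variables):
    """Return, for each experiment, the searched variables that were not found."""
    found = {experiment: set() for experiment in experiments}
    for experiment, variable in rows:
        if experiment in found:
            found[experiment].add(variable)
    # Keep the order of the searched variables in the report
    return {
        experiment: [v for v in variables if v not in found[experiment]]
        for experiment in experiments
    }


experiment_id = ["piClim-control", "piClim-aer"]
variable_id = ["clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs"]

# Made-up case: each experiment lacks a different variable
rows = [("piClim-control", v) for v in variable_id if v != "rsuscs"]
rows += [("piClim-aer", v) for v in variable_id if v != "rsdscs"]

report = missing_by_experiment(rows, experiment_id, variable_id)
print(report)
# {'piClim-control': ['rsuscs'], 'piClim-aer': ['rsdscs']}
```

With the dataframe of the cell above, the rows could be built with `zip(selected_model.experiment_id, selected_model.variable_id)`.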

### Test some filters on the incomplete results

The user may define an expected number of netCDF files for a given (model, variant, grid) tuple. In our case this is **16**, as explained before. Let's keep only the results that match this condition. We will produce a pandas series that lets us check that the filter does what we want.

You may define a more refined filter for the results. Please look at the intake-esgf documentation for more information : https://intake-esgf.readthedocs.io/en/latest/modelgroups.html.

In [None]:
# ================ TEST A FILTER BY GROUPING BY MODELS ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 16

### FILTER THE INCOMPLETE RESULTS ACCORDING TO OUR CRITERIA ###

filtered_results = grouped_models[grouped_models == expected_number_of_files]

### PRINT THE FILTERED CATALOG ###

print(filtered_results)

The result is a pandas series that we can manipulate as such. For example, we can retrieve all the member_id and grid_label values associated with a single model such as *IPSL-CM6A-LR* :

In [None]:
filtered_results["IPSL-CM6A-LR"]

Good, let's have a look at the quantity of results we have left.

In [None]:
print(
    "The number of remaining (model, variant, grid) tuples is {}.".format(
        filtered_results.shape[0]
    )
)

### Applying the filter on the catalog

If this satisfies us, we need to write a small function that will be **applied to the catalog**. The test will be executed on each of the model groups shown in the previous section. If we wish to reproduce the results of *Zelinka et al. (2023)*, we can also keep only the model and variant pairs that match the list provided in the article.

**Reference**

Zelinka, M. D., Smith, C. J., Qin, Y., and Taylor, K. E.: Comparison of methods to estimate aerosol effective radiative forcings in climate models, Atmos. Chem. Phys., 23, 8879–8898, https://doi.org/10.5194/acp-23-8879-2023, 2023.

In [None]:
# ================ DEFINE THE MODEL.VARIANT LIST OF ZELINKA'S ARTICLE ================ #

source_id_zelinka_2023 = [
    "ACCESS-CM2",
    "ACCESS-ESM1-5",
    "BCC-ESM1",
    "CESM2",
    "CNRM-CM6-1",
    "CNRM-ESM2-1",
    "CanESM5",
    "GFDL-CM4",
    "GFDL-ESM4",
    "GISS-E2-1-G",
    "GISS-E2-1-G",
    "GISS-E2-1-G",
    "HadGEM3-GC31-LL",
    "IPSL-CM6A-LR-INCA",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "MIROC6",
    "MIROC6",
    "MPI-ESM-1-2-HAM",
    "MRI-ESM2-0",
    "NorESM2-LM",
    "NorESM2-LM",
    "NorESM2-MM",
    "UKESM1-0-LL" 
]

member_id_zelinka_2023 = [
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f2",
    "r1i1p1f2",
    "r1i1p2f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f2",
    "r1i1p3f1",
    "r1i1p1f3",
    "r1i1p1f1",
    "r1i1p1f1",
    "r2i1p1f1",
    "r3i1p1f1",
    "r4i1p1f1",
    "r11i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p2f1",
    "r1i1p1f1",
    "r1i1p1f4"
]

zelinka_2023_model_variant_table = pd.DataFrame({"source_id" : source_id_zelinka_2023, "member_id" : member_id_zelinka_2023}, dtype = str)

In [None]:
# ================ DEFINING THE FILTERING FUNCTION ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 16

### DEFINE THE GLOBAL OPTIONAL PARAMETERS OF THE FILTERING FUNCTION ###

filtering_by_name = True

keep_only_dataframe = zelinka_2023_model_variant_table

### DEFINE THE FILTERING FUNCTION ###

def filtering_function(grouped_model_entry : pd.DataFrame) -> bool:
    """

    ### DEFINITION ###

    This function allows the intake-esgf catalog to be cleaned of the entries that are not complete.
    In the default case it means entries that do not meet the expected number of files. 

    The user can also set a condition that only the model and variant couples present in a provided pandas dataframe are kept.
    Since the nature of this function is to be an input for the intake-esgf package, we define the optional arguments outside
    of the function.

    ### INPUTS

    GROUPED_MODEL_ENTRY : Pandas DataFrame | sub dataframe containing all the variables of a given source_id, member_id and grid tuple.

    ### OPTIONAL ARGUMENTS (DEFINED GLOBALLY)

    FILTERING_BY_NAME : BOOL | defines if we filter the entry by source_id and member_id or not

    KEEP_ONLY_DATAFRAME : Pandas DataFrame | associated dataframe holding the source_id and member_id to conserve
    
    ### OUTPUTS

    BOOL | whether we keep this model group or not
    """

    ### TEST THE NUMBER OF VARIABLES ###

    if len(grouped_model_entry) == expected_number_of_files:

        ### NUMBER OF VARIABLES' TEST SUCCEEDED ###

        ## Do we keep only the Zelinka's article model and variant couples ? ##

        # NO : We keep everything that matched the variable number's test #

        if not(filtering_by_name):

            return True
        
        ### YES : KEEPING ONLY THE COUPLES PRESENT IN ZELINKA 2023 ###

        else :
            
            ## Doing the test on the grouped_model_entry's source_id and member_id ##

            # Extract the grouped_model_entry data #

            grouped_model_entry_source_id = grouped_model_entry.source_id.unique()[0]

            grouped_model_entry_member_id = grouped_model_entry.member_id.unique()[0]

            # Can we find the grouped_model_entry's source_id and member_id in one row of keep_only_dataframe ? #

            is_in_keep_only_dataframe = ((keep_only_dataframe['source_id'] == grouped_model_entry_source_id) 
            & (keep_only_dataframe['member_id'] == grouped_model_entry_member_id)).any()

            # Return the result of the test as a plain Python bool #

            return bool(is_in_keep_only_dataframe)
    
    ### NUMBER OF VARIABLES' TEST FAILED ###

    else :

        return False
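
The filtering function reads its options from global variables because the catalog calls it with a single argument. A possible alternative, sketched here under simplifying assumptions, is a small factory that captures the options in a closure and returns a one-argument predicate. In this sketch a model group is a plain list of dictionaries rather than a pandas dataframe, and the name `make_group_filter` is hypothetical.

```python
def make_group_filter(expected_number_of_files, allowed_pairs=None):
    """Build a one-argument predicate telling whether a model group should be kept."""
    # A set of (source_id, member_id) tuples gives a direct membership test
    allowed = None if allowed_pairs is None else set(allowed_pairs)

    def keep_group(group):
        # Here a group is a list of rows (dicts) sharing source_id, member_id and grid_label
        if not group or len(group) != expected_number_of_files:
            return False
        if allowed is None:
            return True
        return (group[0]["source_id"], group[0]["member_id"]) in allowed

    return keep_group


# Two pairs taken from the model and variant lists defined above
source_ids = ["CanESM5", "MIROC6"]
member_ids = ["r1i1p2f1", "r11i1p1f1"]

# The expected number is set to 2 to keep the made-up groups short
keep_group = make_group_filter(2, zip(source_ids, member_ids))

complete = [{"source_id": "CanESM5", "member_id": "r1i1p2f1"}] * 2
other_variant = [{"source_id": "CanESM5", "member_id": "r1i1p1f1"}] * 2
incomplete = complete[:1]

print(keep_group(complete), keep_group(other_variant), keep_group(incomplete))
# True False False
```

Pairing the two lists with `zip` also makes it explicit that the model and variant lists must have the same length and order.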

In [None]:
catalog = catalog.remove_incomplete(filtering_function)

Looking at the resulting dataframe we see that we have indeed filtered the unwanted entries :

In [None]:
catalog.df

## Download the files and load a dictionary in memory

The intake-esgf library can load the results in memory as a dictionary holding one xarray dataset for every result found. This process also saves the downloaded files at the previously defined local_cache path. 

By default, the package automatically looks for the **areacella** variable. But it does so **rather slowly** and does not first look at what is stored in the local_cache. In our analysis, we would rather load the dictionary with *add_measures* set to **False** and then download (and load) the areacella netCDF files separately with a homemade routine. An example of this process is given afterwards.

What's more, we may flag some facets as irrelevant for building the keys of the dictionary. By default, the intake_esgf package builds the keys out of the facet values that differ among the entries of the output dictionary. But some of these facets might not be of any interest for our analysis, and we can drop them with the *ignore_facets* argument.
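
To illustrate the key-building rule described above, here is a small standalone sketch. It is not the implementation used by intake-esgf, only a reading of the behaviour under the assumption that a key joins the values of the facets that vary among the entries. The function `build_keys` and the example entries are made up for the illustration.

```python
def build_keys(entries, ignore_facets=(), separator="."):
    """Join, for each entry, the values of the facets that vary among the entries."""
    # Facets the user declared irrelevant are dropped first
    facets = [facet for facet in entries[0] if facet not in ignore_facets]
    # A facet sharing one value across all entries brings no information to the key
    varying = [facet for facet in facets if len({entry[facet] for entry in entries}) > 1]
    return [separator.join(entry[facet] for facet in varying) for entry in entries]


entries = [
    {"source_id": "MODEL-A", "experiment_id": "piClim-control", "variable_id": "clt", "version": "v1"},
    {"source_id": "MODEL-A", "experiment_id": "piClim-aer", "variable_id": "clt", "version": "v2"},
    {"source_id": "MODEL-A", "experiment_id": "piClim-aer", "variable_id": "rsut", "version": "v2"},
]

print(build_keys(entries))
# ['piClim-control.clt.v1', 'piClim-aer.clt.v2', 'piClim-aer.rsut.v2']

print(build_keys(entries, ignore_facets=["version"]))
# ['piClim-control.clt', 'piClim-aer.clt', 'piClim-aer.rsut']
```

Ignoring a facet is only safe when the remaining facets still tell the entries apart, otherwise two entries end up with the same key.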

### Downloading many entries from the solr nodes

An issue may be encountered if, as in this example, the user tries to download many different entries from the solr index nodes: the download of some of the variables held on these nodes fails. A workaround found in this case is to make one request per **(source_id, member_id, grid_label)** tuple.

In [None]:
# ================ FUNCTION TO DOWNLOAD ONE ENTRY AT A TIME ================ #

def generate_single_model_search_criterias(search_criterias : dict,
                              grouped_models_dataframe : pd.DataFrame, 
                              index : int,
                              ) -> tuple[dict, str] :
    
    """
    ---

    ### DEFINITION ###

    This function generates the search criteria for a single entry of grouped_models_dataframe, specifying 
    the (source_id, member_id, grid_label) criteria needed to download only one entry.
    ---

    ### INPUTS ###

    SEARCH_CRITERIAS :  DICT | the original search criterias dictionary 

    GROUPED_MODELS_DATAFRAME : PANDAS DATAFRAME | the entries that have been grouped together by source_id, member_id and grid_label

    INDEX : INT | selected row index

    ---

    ### OUTPUTS ###

    SEARCH_CRITERIAS_GIVEN_ROW : dictionary | search criterias updated with the source_id, member_id and grid_label criterias of the given row.

    SINGLE_MODEL_NAME : str | name of the single model we set the download for
    ---

    """
     
    ### COPY THE SEARCH CRITERIAS DICTIONARY PASSED AS ARGUMENT ###

    search_criterias_given_row = search_criterias.copy()
    
    ## Get the row information ##

    # Source_id #

    source_id_to_download = grouped_models_dataframe.iloc[index].source_id

    # Member_id #

    member_id_to_download = grouped_models_dataframe.iloc[index].member_id

    # Grid_label #

    grid_label_to_download = grouped_models_dataframe.iloc[index].grid_label

    ## Update the search criterias ##

    # Source_id #

    search_criterias_given_row["source_id"] = source_id_to_download

    # Member_id #

    search_criterias_given_row["member_id"] = member_id_to_download

    # Grid_label #

    search_criterias_given_row["grid_label"] = grid_label_to_download

    ## Generate the single model name ##

    single_model_name = source_id_to_download + "." + member_id_to_download + "." + grid_label_to_download
    
    return search_criterias_given_row, single_model_name
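
The important property of this function is that the original search dictionary is left untouched, so it can be reused for every row. The sketch below shows the same idea in compact form with dictionary unpacking. The helper name `with_model_constraints` is illustrative, and the example values are the first entry shown in this notebook's output.

```python
search = {
    "experiment_id": ["piClim-control", "piClim-aer"],
    "variable_id": ["clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs"],
    "table_id": "Amon",
}


def with_model_constraints(search_criterias: dict, source_id: str, member_id: str, grid_label: str) -> dict:
    """Return a new dictionary: the given criteria plus the three model constraints."""
    return {
        **search_criterias,  # shallow copy of the original criteria
        "source_id": source_id,
        "member_id": member_id,
        "grid_label": grid_label,
    }


row_search = with_model_constraints(search, "IPSL-CM6A-LR", "r1i1p1f1", "gr")

print(sorted(row_search))
# ['experiment_id', 'grid_label', 'member_id', 'source_id', 'table_id', 'variable_id']
print("source_id" in search)
# False
```

The copy is shallow, so the lists of experiments and variables are shared between the two dictionaries. This is fine as long as they are not modified in place.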

In [None]:
# ================ EXTRACT THE FILTERED CATALOG'S INFORMATION ================ #

### GET THE INFORMATIONS THAT THE CATALOG EXTRACTED ###

## Generate the full dataframe of the files found by the search ##

selected_entries_full_dataframe = catalog.df

## Save the grouped model pandas series for the next part ##

series_grouped_models = catalog.model_groups()

### GET ONLY THE NEEDED INFORMATION ###

## We extract the (source_id, member_id, grid_label) tuples from the full dataframe ##

# we remove the duplicates to only keep one row per tuple

grouped_models_dataframe = selected_entries_full_dataframe[["source_id","member_id","grid_label"]].drop_duplicates().reset_index(drop = True)


In [None]:
# ================ DOWNLOAD EVERY SINGLE OUTPUT ================ #

### EXTRACT THE ROWS AND DOWNLOAD THE ASSOCIATED ENTRY ###

for index in grouped_models_dataframe.index:

    ## Reset the catalog ##

    catalog = intake_esgf.ESGFCatalog()

    ## Generate the associated search criterias ##
    
    search_criterias_given_row, single_model_name = generate_single_model_search_criterias(
        search_criterias = search, 
        grouped_models_dataframe = grouped_models_dataframe,
        index = index
        )
    
    ## Generate the single model's output name ##

    print("\nDownloading {} ...\n".format(single_model_name))

    ## Apply the search criterias ##

    catalog.search(
            **search_criterias_given_row,
        )
    
    ## Downloading the output... ##

    single_model_dictionary = catalog.to_dataset_dict(
            add_measures=False,
            ignore_facets=["project","mip_era","activity_drs","institution_id","table_id","grid_label","version"],
            quiet=True,
        )
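
Since some downloads may fail, it can help to record the failures and carry on rather than let one error stop the whole loop. The sketch below shows that pattern in isolation. It is an assumption about how one might organise the loop, not something the notebook or intake-esgf prescribes. The `fetch` argument stands for whatever performs the search and download of one entry, and `fake_fetch` is a made-up replacement used only to run the example offline.

```python
from typing import Callable


def download_each(names: list[str], fetch: Callable[[str], dict]) -> tuple[dict, dict]:
    """Call fetch for every name, collecting results and failures separately."""
    datasets, failures = {}, {}
    for name in names:
        try:
            datasets[name] = fetch(name)
        except Exception as error:  # broad on purpose: record the problem and continue
            failures[name] = repr(error)
    return datasets, failures


def fake_fetch(name: str) -> dict:
    """Made-up downloader: fails for names starting with BROKEN."""
    if name.startswith("BROKEN"):
        raise RuntimeError("download failed")
    return {"clt": f"dataset for {name}"}


datasets, failures = download_each(
    ["MODEL-A.r1i1p1f1.gn", "BROKEN.r1i1p1f1.gn", "MODEL-B.r1i1p1f1.gr"],
    fake_fetch,
)

print(list(datasets))
# ['MODEL-A.r1i1p1f1.gn', 'MODEL-B.r1i1p1f1.gr']
print(failures)
# {'BROKEN.r1i1p1f1.gn': "RuntimeError('download failed')"}
```

The failed names can then be retried on their own, for instance after changing the nodes that are queried.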

In [47]:
## Reset the catalog ##

index = 0

catalog = intake_esgf.ESGFCatalog()

## Generate the associated search criterias ##

search_criterias_given_row, single_model_name = generate_single_model_search_criterias(
    search_criterias = search, 
    grouped_models_dataframe = grouped_models_dataframe,
    index = index
    )

## Generate the single model's output name ##

print("\nDownloading {} ...\n".format(single_model_name))

## Apply the search criterias ##

catalog.search(
        **search_criterias_given_row,
    )

## Downloading the output... ##

single_model_dictionary = catalog.to_dataset_dict(
        add_measures=False,
        ignore_facets=["project","mip_era","activity_drs","institution_id","table_id","grid_label","version"],
        quiet=True,
    )


Downloading IPSL-CM6A-LR.r1i1p1f1.gr ...






## Homemade routine to retrieve the areacella of a given (model, variant, grid_label) tuple only once

The main drawback of the intake-esgf package when it comes to looking for measures is that it does so for every single variable, even though all the variables of one experiment share the same areacella grid. What's more, it always starts by searching with our full set of facets, which is irrelevant. 

Indeed, the only three parameters that define areacella are the **source_id**, the **member_id** and obviously the **grid_label**. Conveniently enough, we can group our search results according to these three parameters with the *.model_groups* method. In the end, our experiments may not hold the areacella variable, but that does not really matter: we just need to find one. This is the spirit of this routine, written to still get areacella, but significantly faster.

It is worth noting that this method is not useful if you only need a small number of models and variables, as you will not really see the time difference.

In [None]:
all_indices = True

In [None]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

In [None]:
# ================ HOMEMADE ROUTINE TO GET AREACELLA GRIDS FASTER ================ #

### INITIALISATION  ###

## Initialise the full dictionary ##

dict_areacella = {}

## Number of rows ##

n_rows = series_grouped_models.size

### GO THROUGH EVERY ROW OF THE PANDA SERIES ###

for ii in range(n_rows):

    ## Get the SOURCE_ID | MEMBER_ID | GRID_LABEL of the row ##

    # Retrieve the row ##

    row_ii = series_grouped_models.index[ii]

    # Extract the labels #

    source_id, member_id, grid_label = row_ii

    # Build the key for this dictionary entry ##

    full_key = source_id + "." + member_id + "." + grid_label

    ## Special case for the IPSL-CM6A-LR-INCA model ##

    if source_id == "IPSL-CM6A-LR-INCA":

        source_id = "IPSL-CM6A-LR"

    ## Do the full search ##

    areacella_search_full = catalog.search(
        source_id=source_id, grid_label=grid_label, variable_id="areacella", quiet=True
    ).df  # silence the progress bar

    ## Extract the first experiment id that gives an areacella entry ##

    only_first_exp_id = areacella_search_full.experiment_id.values[0]

    ## Extract the first member id that gives an areacella entry ##

    only_first_member_id = areacella_search_full.member_id.values[0]

    ## Get the areacella for the given row ##

    # Search and download it #

    areacella_ii = catalog.search(
        source_id=source_id,
        grid_label=grid_label,
        variable_id="areacella",
        experiment_id=only_first_exp_id,
        member_id=only_first_member_id,
        quiet=True,
    ).to_dataset_dict(
        add_measures=False, quiet=True
    )  # silence the progress bar

    # Store it in dictionary #

    dict_areacella[full_key] = areacella_ii["areacella"]
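
The loop above searches areacella by source_id and grid_label only, so several members of the same model trigger the same search again. A possible refinement, sketched here with a made-up `fake_fetch` in place of the real search and download, is to cache the grid by (source_id, grid_label) and reuse it. The function name `collect_areacella` and the `aliases` argument are illustrative. The alias mirrors the special case of the loop, where IPSL-CM6A-LR-INCA borrows the grid of IPSL-CM6A-LR.

```python
def collect_areacella(groups, fetch_areacella, aliases=None):
    """Fetch areacella once per (source_id, grid_label) and share it among members."""
    aliases = aliases or {}
    cache = {}
    result = {}
    for source_id, member_id, grid_label in groups:
        # Models listed in aliases borrow the grid of another model
        lookup = (aliases.get(source_id, source_id), grid_label)
        if lookup not in cache:
            cache[lookup] = fetch_areacella(*lookup)
        # The key keeps the original names, as in the loop above
        result[f"{source_id}.{member_id}.{grid_label}"] = cache[lookup]
    return result


calls = []


def fake_fetch(source_id, grid_label):
    """Made-up downloader that records how many times it is called."""
    calls.append((source_id, grid_label))
    return f"areacella of {source_id} on {grid_label}"


# Made-up groups for the example
groups = [
    ("IPSL-CM6A-LR", "r1i1p1f1", "gr"),
    ("IPSL-CM6A-LR", "r2i1p1f1", "gr"),
    ("IPSL-CM6A-LR-INCA", "r1i1p1f1", "gr"),
    ("MIROC6", "r11i1p1f1", "gn"),
]

areacella = collect_areacella(groups, fake_fetch, aliases={"IPSL-CM6A-LR-INCA": "IPSL-CM6A-LR"})

print(len(areacella), len(calls))
# 4 2
```

Four entries are filled with only two downloads, and the saving grows with the number of members per model.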