# Downloading and preparing an ensemble of CMIP6 data selected with a set of criteria using intake-esgf

---

## Purpose of the notebook

---

This notebook **downloads an ensemble of CMIP6 variables from Python** using a dictionary of **user-defined criteria**. It then **processes the raw data** to turn it into a **dictionary of monthly climatologies** suitable for analysis code written in Python.

This notebook introduces routines that are implemented as functions in the /utilities subfolder of the repository, namely in *folders_handle*, *load_raw_data*, *store_data* and *prepare_data*.

The user should expect a running time of about 2 hours if the data is not present on disk.

All the links in this notebook were accessed on **25/03/2025**.

It uses the *intake-esgf* library: https://github.com/esgf2-us/intake-esgf?tab=readme-ov-file. A beginner's guide for this library can be found here: https://intake-esgf.readthedocs.io/en/latest/beginner.html.

Detailed documentation for CMIP6 can be found here: https://wcrp-cmip.org/cmip-model-and-experiment-documentation/.

Feel free to share, use and improve the following code according to the provided license on the repository.

---

## Model outputs searched

---

Every search needs to be constrained by the attributes of the model outputs we are looking for. A detailed document listing these attributes can be found here : https://docs.google.com/document/d/1h0r8RZr_f3-8egBMMh7aqLwy3snpD6_MrDz1q8n5XUk/edit?tab=t.0.

### Experiments

We use two CMIP6 experiments: **piClim-control** and **piClim-aer**. Both are atmosphere-only climate model simulations in which sea surface temperatures (SSTs) and sea ice concentrations (SICs) are fixed at model-specific preindustrial climatological values. The description of the experiments can be found here: https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_experiment_id.html.

> **piClim-control** : the control experiment, in which aerosol burdens are set to their preindustrial levels.
> 
> **piClim-aer** : uses present-day aerosol burdens, present-day being 2014.

### Variables

The variables used are listed and described below according to: https://clipc-services.ceda.ac.uk/dreq/mipVars.html.

> <span style="color:SkyBlue">**clt**</span>  : Total cloud area fraction (%) for the whole atmospheric column
>
> <span style="color:gold">**rsdt**</span> : Shortwave radiation ($W/m^{2}$) **incident** at the TOA
> 
> <span style="color:orange">**rsut**</span> : Shortwave radiation ($W/m^{2}$) **going out**  at the TOA
>
> <span style="color:orangered">**rsutcs**</span> : Shortwave radiation ($W/m^{2}$) **going out**  at TOA for **clear-sky conditions**
> 
> <span style="color:Orchid">**rsds**</span> : Shortwave **downwelling** radiation ($W/m^{2}$) at the surface
> 
> <span style="color:Indigo ">**rsdscs**</span>  : Shortwave **downwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
> 
> <span style="color:YellowGreen">**rsus**</span> : Shortwave **upwelling** radiation ($W/m^{2}$) at the surface
>
> <span style="color:Darkgreen">**rsuscs**</span>: Shortwave **upwelling** radiation ($W/m^{2}$) at the surface for **clear-sky conditions**
>
> **areacella** : Grid-cell area ($m^{2}$) of the atmospheric grid, that is, the latitude-dependent surface area associated with each grid point.

### Table

The table sets how the variables are organized. We use the **Amon** table. The details about the tables can be found here : https://clipc-services.ceda.ac.uk/dreq/index/miptable.html.

> **Amon** stands for a set of monthly atmospheric data

---

## Initialisation

---

### Importations

We import the needed libraries.

In [None]:
# ================ IMPORTATIONS ================ #

### DOWNLOADING THE ENTRIES ###

import intake_esgf  # this gives us access to the ESGF catalog to make queries

### DATA OBJECTS AND ASSOCIATED COMPUTATION ###

import pandas as pd  # to manage the product of the search

import numpy as np  # to manage the pandas arrays

### HANDLE PATHS ###

import os  # to get access to commands related to path setting and creation of directories

### PROGRESS BAR ###

from tqdm import tqdm

### HOMEMADE ROUTINES ###

## Bash folder routines from python ##

from utilities.get_cmip6_data.folders_handle.create import (
    create_dir,  # function to create a cleaned downloading directory
)

## Per-entry download routine ##

from utilities.get_cmip6_data.load_raw_data.load_cmip6 import (
    update_single_entry_keys,  # makes the good key format for the full cmip6 dictionary
    generate_single_model_search_criterias,  # generates search criterias for a single model
)

## Convert a dictionary into a netcdf file and vice-versa ##

from utilities.get_cmip6_data.store_data.dict_netcdf_transform import (
    dict_to_netcdf,  # transforms a dictionary of xarray datasets into a set of netcdf files
)

## Generate the monthly climatologies for every entry and experiments ##

from utilities.get_cmip6_data.prepare_data.extract_climatologies import (
    create_climatology_dict,  # this function generates the dictionary made of monthly climatology xarrays
    generate_per_model_dict_key,  # this function uses the key of the raw data dictionary to generate a unique key list per entry and experiment
    add_one_variable_to_dataset,  # this function adds a variable to a dataset (it can create it)
)

### Set our search criteria

Here the user may define their search criteria. We create a dictionary whose keys are the search criteria.

In [None]:
# ================ SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### EXPERIMENTS ###

experiment_id = [
    "piClim-control",
    "piClim-aer",
]

### VARIABLES ###

variable_id = [
    "clt",
    "rsdt",
    "rsut",
    "rsutcs",
    "rsds",
    "rsus",
    "rsdscs",
    "rsuscs",
]

### TABLE ###

table_id = "Amon"

### DEFINE THE SEARCH CRITERIA DICTIONARY ###

search = {
    "experiment_id": experiment_id,
    "variable_id": variable_id,
    "table_id": table_id,
}

### Create the folder in which the data will be stored

Here the user can choose to create the folder that will store the data. A pre-existing folder will be erased if the *do_we_clear* option is set to **True**. This can be quite slow if the data folder already holds a lot of data.

In [None]:
# ================ CHOOSE IF WE CLEAR AN ALREADY EXISTING FOLDER ================ #

do_we_clear = False

Next, the user needs to define the paths where the data will be downloaded and the climatologies saved. These are absolute paths built from the home directory.

In [None]:
# ================ CREATE THE FOLDER TO STORE THE DOWNLOADED DATA ================ #

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

## Home directory ##

homedir_path = os.path.expanduser("~")

## Parent directory ##

parent_path = os.path.join(homedir_path, "certainty-data")

## Name of the created folder ##

downloading_folder_name = "CMIP6-DATA"

### CREATE THE DIRECTORY AND EMPTY IT IF DO_WE_CLEAR IS TRUE ###

downloading_path = create_dir(
    parent_path=parent_path, name=downloading_folder_name, clear=do_we_clear
)

print(
    "The downloading folder {} is under the path {}.".format(
        downloading_folder_name, downloading_path
    )
)
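
The behaviour described above (create the folder and optionally wipe a pre-existing one) can be sketched with the standard library alone. This is only an illustration under assumptions: `create_dir_sketch` is a hypothetical stand-in and not the repository's `create_dir`, whose actual implementation may differ.

```python
import os
import shutil
import tempfile


def create_dir_sketch(parent_path: str, name: str, clear: bool = False) -> str:
    """Hypothetical stand-in for create_dir: returns the path of the folder."""
    path = os.path.join(parent_path, name)
    # Remove the pre-existing folder only when the user explicitly asks for it
    if clear and os.path.isdir(path):
        shutil.rmtree(path)
    # exist_ok=True makes the call safe when the folder is already there
    os.makedirs(path, exist_ok=True)
    return path


# Usage in a throw-away parent directory so that nothing real is deleted
with tempfile.TemporaryDirectory() as parent:
    first = create_dir_sketch(parent, "CMIP6-DATA")
    open(os.path.join(first, "dummy.nc"), "w").close()

    # clear=False keeps the existing content
    kept = create_dir_sketch(parent, "CMIP6-DATA", clear=False)
    files_after_keep = os.listdir(kept)

    # clear=True starts again from an empty folder
    wiped = create_dir_sketch(parent, "CMIP6-DATA", clear=True)
    files_after_clear = os.listdir(wiped)

print(files_after_keep, files_after_clear)
```

Deleting a folder that already holds many large netcdf files is what makes the `clear=True` path slow, which is why the default is to keep it.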

---

## Configure the ESGFCatalog

---

The ESGFCatalog is initially parametrized with default values one may want to change. We will focus on three main changes :

* **defining the nodes that the query will investigate** 

* **setting where the data will be downloaded** 

* **adding a path that is specific to our cluster to search for CMIP6 outputs locally** 

The default configuration of the catalog can be accessed through the following line of code. More details on the configuration may be found here : https://intake-esgf.readthedocs.io/en/latest/configure.html.

In [None]:
print(intake_esgf.conf)

There are some variables that are of interest to us :

* The *solr_indices* and the *globus_indices* variables define which nodes the query is sent to.

* The *local_cache* variable sets where the data will be downloaded.

* The *esg_dataroot* variable sets the local path to explore before downloading any data.

We can see that, as stated in the intake-esgf documentation, the search is done by default with the Globus-based indices at OLCF (Oak Ridge Leadership Computing Facility) and ALCF (Argonne Leadership Computing Facility). We may extend the search to all the possible ESGF nodes in order not to miss any model output. Note that the Solr nodes are much slower than the Globus ones.

By default, the downloaded data is stored in ~/.esgf/, a hidden folder in your home directory. If you are working on a shared resource such as an institutional cluster or group workstation, it is more convenient to point it to a directory dedicated to data.

Finally, if we are working on a cluster that already gives access to some CMIP6 data, it is worth adding the path of this data to avoid needless downloads.

In the following part we will show how to modify these variables. 

**To choose which of them to modify, set the following flags to True or fill in the corresponding path.**

In [None]:
# ================ CONFIGURE THE ESGF CATALOG ================ #

### USE ALL NODES FOR SEARCH ###

all_indices = True

### SET A NEW DOWNLOADING PATH ###

set_new_downloading_path = True

### SET A CLUSTER SPECIFIC CMIP6 PATH ###

cluster_local_CMIP6_path = ""

### Define the nodes for the search

We may decide to query all the nodes. Note that the Solr nodes are much slower than the Globus ones.

In [None]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

### Set where the data will be downloaded

In [None]:
if set_new_downloading_path:

    intake_esgf.conf.set(local_cache=downloading_path)

    print(
        "The CMIP6 data will be downloaded at the path : {}".format(
            intake_esgf.conf["local_cache"]
        )
    )

### Add a cluster-specific CMIP6 path

In [None]:
if cluster_local_CMIP6_path != "":

    intake_esgf.conf.set(esg_dataroot=cluster_local_CMIP6_path)

    print(
        "The CMIP6 data will be searched beforehand at the path : {}".format(
            intake_esgf.conf["esg_dataroot"]
        )
    )

else:
    print("No local cluster-specific CMIP6 path")

### Print the new configuration

In [None]:
print(intake_esgf.conf)

---

## Make the query to the ESGFCatalog

---

### Initialise the catalog

The catalog variable is initially empty and will be filled according to the criteria that we impose in the query.

In [None]:
catalog = intake_esgf.ESGFCatalog()

print(catalog)

### Constrain the catalog

We apply the criteria defined earlier to the catalog.

In [None]:
catalog.search(**search)

### Convert the catalog results into a pandas dataframe

The resulting catalog can be converted to a pandas dataframe. This is convenient to isolate some properties of the catalog, such as the names of the models it has found, also known as the *source_id*.

In [None]:
# ================ CONVERT TO PANDAS DATAFRAME ================ #

initial_search_df = catalog.df

### SHOW THE SOURCE_ID THAT APPEAR AT LEAST ONCE ###

## Extract the list of the models' names ##

# Retrieve the source_id column and take only one example for every duplicate #

list_model_names = initial_search_df.source_id.unique()

## Print the result ##

print("The list of found models' names is : \n{}".format(list_model_names))

### Regroup the results by model

The number of results found is very large. There are numerous duplicates that come from the different *member_id* and *grid_label* values available for each model.

The *member_id* or *variant_label* is described by 4 indices defining an ensemble member: *r* for realization, *i* for initialization, *p* for physics, and *f* for forcing. These indices define an ensemble of simulations that share the main experiment conditions for a given model. Indeed, modellers may initialize their model from a different point in time, change the parametrization of a given process, and so on.
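
As an illustration of the four indices described above, a variant label can be split with a regular expression. This is a minimal sketch: the helper `parse_variant_label` is hypothetical (it is not part of intake-esgf or of the repository), and it assumes a plain `r<k>i<l>p<m>f<n>` label without any sub-experiment prefix.

```python
import re

# Plain CMIP6 variant labels follow the pattern r<k>i<l>p<m>f<n>
VARIANT_PATTERN = re.compile(r"^r(\d+)i(\d+)p(\d+)f(\d+)$")


def parse_variant_label(label: str) -> dict:
    """Split a variant label into its four integer indices."""
    match = VARIANT_PATTERN.match(label)
    if match is None:
        raise ValueError("Not a valid variant label: {}".format(label))
    names = ("realization", "initialization", "physics", "forcing")
    return dict(zip(names, map(int, match.groups())))


parsed = parse_variant_label("r11i1p1f1")
print(parsed)
```

Two labels that differ only by their realization index, such as `r1i1p1f1` and `r2i1p1f1`, therefore denote two members of the same ensemble.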

Let's regroup the results according to **(model, variant, grid)** tuples. From now on, we will refer to one row of this table as an **entry**, that is to say, the whole set of variables and experiments associated with one tuple.

In [None]:
grouped_models = catalog.model_groups()

print(grouped_models)

An interesting thing to note is the last column, which tells us the number of results found for a given model, variant and grid. In our case, since we are looking for **8** variables for **two experiments**, the search will be deemed complete if we find **16 resulting files**.
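
The expected count does not have to be hard-coded: it can be derived from the search dictionary itself, as in the sketch below. It assumes that exactly one result is expected per combination of facet values; `as_list` is a small hypothetical helper introduced only for this illustration.

```python
# Same criteria as the search dictionary defined earlier in the notebook
search = {
    "experiment_id": ["piClim-control", "piClim-aer"],
    "variable_id": [
        "clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs",
    ],
    "table_id": "Amon",
}


def as_list(value):
    """A facet may be given as a single string or as a list of strings."""
    return [value] if isinstance(value, str) else list(value)


# One result is expected per combination of facet values
expected_number_of_files = 1
for facet_values in search.values():
    expected_number_of_files *= len(as_list(facet_values))

print(expected_number_of_files)
```

With 2 experiments, 8 variables and 1 table, the product is 2 × 8 × 1 = 16, which is the threshold used in the rest of the notebook.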

The next step is therefore to get rid of the incomplete results according to a user-defined criterion.

### Investigate for a specific model lacking variables

From the previous output, we see that some models lack the expected number of variables. Thanks to the pandas dataframe structure, we may investigate which variables are missing for which experiment.

In [None]:
# ================ SEE WHAT VARIABLES ARE MISSING ================ #

### LOOK AT A GIVEN MODEL ###

## Define source_id and member_id ##

# Source_id #

looked_source_id = "GFDL-ESM4"

# Member_id #

looked_member_id = "r1i1p1f1"

## Select the given source_id and member_id ##

# Source_id #

selected_source_id = initial_search_df[initial_search_df.source_id == looked_source_id]

# Member_id #

selected_model = selected_source_id[selected_source_id.member_id == looked_member_id]

## Generate the variable_id series for the piClim-control experiment ##

variable_id_for_control_exp = selected_model[
    selected_model.experiment_id == "piClim-control"
].variable_id

## Generate the variable_id series for the piClim-aer experiment ##

variable_id_for_aer_exp = selected_model[
    selected_model.experiment_id == "piClim-aer"
].variable_id

### FIND VARIABLES THAT ARE MISSING FOR BOTH EXPERIMENTS ###

## Make the criteria variable_id list as a pandas series ##

searched_variable_ids = pd.Series(variable_id, dtype=str)

## Get the variables that are found in at least one experiment ##

var_in_at_least_one = pd.Series(
    np.union1d(variable_id_for_control_exp, variable_id_for_aer_exp), dtype=str
)

## See the missing ones compared to the variable_id list ##

print(
    "Variables missing in both experiments of {}.{} : \n".format(
        looked_source_id, looked_member_id
    )
)

missing_variables_from_both = searched_variable_ids[
    ~searched_variable_ids.isin(var_in_at_least_one)
].values

print(missing_variables_from_both)

### GET THE VARIABLES MISSING IN ONLY ONE EXPERIMENT ###

## Extract the variables found in both experiments ##

var_in_both = pd.Series(
    np.intersect1d(variable_id_for_control_exp, variable_id_for_aer_exp), dtype=str
)

## Get the variables not in common between the experiments ##

# we keep the variables that are not in both experiments
# by removing the ones that are found in both

var_not_in_common = var_in_at_least_one[~var_in_at_least_one.isin(var_in_both)]

## See the missing ones compared to the variable_id list ##

missing_variables_not_in_common = searched_variable_ids[
    searched_variable_ids.isin(var_not_in_common)
].values

## Report, for each experiment, the variables that only this experiment is lacking ##

# A variable in var_not_in_common was found in exactly one experiment, #
# so we test each experiment separately instead of comparing their sizes #

for experiment_label, variables_found in [
    ("control", variable_id_for_control_exp),
    ("aerosol", variable_id_for_aer_exp),
]:

    missing_only_here = var_not_in_common[
        ~var_not_in_common.isin(variables_found)
    ].values

    if missing_only_here.size > 0:

        print(
            "\nOnly the {} experiment of {}.{} is lacking these variables :\n".format(
                experiment_label, looked_source_id, looked_member_id
            )
        )

        print(missing_only_here)
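
The same bookkeeping can be written compactly with Python sets, which may be easier to read than the pandas version. The sketch below uses invented search results for a single entry; the variables marked as found are assumptions for illustration and do not describe any real model.

```python
searched = {"clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs"}

# Hypothetical search results for one entry (invented for illustration)
found_control = {"clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus"}
found_aer = {"clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs"}

# Variables missing for each experiment taken separately
missing_control = searched - found_control
missing_aer = searched - found_aer

# Missing everywhere, or missing in one experiment only
missing_in_both = missing_control & missing_aer
missing_only_in_control = missing_control - missing_aer
missing_only_in_aer = missing_aer - missing_control

print(sorted(missing_in_both))
print(sorted(missing_only_in_control))
print(sorted(missing_only_in_aer))
```

Testing each experiment separately avoids relying on which experiment holds fewer variables, which can be misleading when both experiments hold the same number of variables but not the same ones.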

### Test some filters on the incomplete results

The user may define an expected number of netcdf files for a given (model, variant, grid) tuple. In our case this is **16**, as explained before. Let's enforce this by keeping only the results that match this condition. We will produce a pandas series that will allow us to check that our filter does what we want.

You may define a more refined filter for the results. Please look at the intake-esgf documentation for more information : https://intake-esgf.readthedocs.io/en/latest/modelgroups.html.

In [None]:
# ================ TEST A FILTER BY GROUPING BY MODELS ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 16

### FILTER THE INCOMPLETE RESULTS ACCORDING TO OUR CRITERION ###

filtered_results = grouped_models[grouped_models == expected_number_of_files]

### PRINT THE FILTERED CATALOG ###

print(filtered_results)

The result is a pandas series that we can manipulate as such. For example, we can retrieve all the member_id and grid_label values associated with a single model such as *IPSL-CM6A-LR*:

In [None]:
filtered_results["IPSL-CM6A-LR"]

Good, let's have a look at the number of results we have left.

In [None]:
print(
    "The number of remaining (model, variant, grid) tuples is {}.".format(
        filtered_results.shape[0]
    )
)

### Applying the filter on the catalog

If this satisfies us, we need to write a small function that will be **applied to the catalog**. The test will be executed on each model group shown in the previous section.

What's more, the pandas dataframe structure of the model groups is quite convenient as it allows us to apply more complex filters. One may want to select some specific models and variant labels, for example. In our case, we wish to reproduce the results from *Zelinka et al. (2023)*. Therefore, we keep only the model and variant couples that match the list provided in the article.

Firstly, we define the model.variant list of Zelinka's article.

**Reference**

Zelinka, M. D., Smith, C. J., Qin, Y., and Taylor, K. E.: Comparison of methods to estimate aerosol effective radiative forcings in climate models, Atmos. Chem. Phys., 23, 8879–8898, https://doi.org/10.5194/acp-23-8879-2023, 2023.

In [None]:
# ================ DEFINE THE MODEL.VARIANT LIST OF ZELINKA'S ARTICLE ================ #

source_id_zelinka_2023 = [
    "ACCESS-CM2",
    "ACCESS-ESM1-5",
    "BCC-ESM1",
    "CESM2",
    "CNRM-CM6-1",
    "CNRM-ESM2-1",
    "CanESM5",
    "GFDL-CM4",
    "GFDL-ESM4",
    "GISS-E2-1-G",
    "GISS-E2-1-G",
    "GISS-E2-1-G",
    "HadGEM3-GC31-LL",
    "IPSL-CM6A-LR-INCA",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "IPSL-CM6A-LR",
    "MIROC6",
    "MIROC6",
    "MPI-ESM-1-2-HAM",
    "MRI-ESM2-0",
    "NorESM2-LM",
    "NorESM2-LM",
    "NorESM2-MM",
    "UKESM1-0-LL",
]

member_id_zelinka_2023 = [
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f2",
    "r1i1p1f2",
    "r1i1p2f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f2",
    "r1i1p3f1",
    "r1i1p1f3",
    "r1i1p1f1",
    "r1i1p1f1",
    "r2i1p1f1",
    "r3i1p1f1",
    "r4i1p1f1",
    "r11i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p1f1",
    "r1i1p2f1",
    "r1i1p1f1",
    "r1i1p1f4",
]

zelinka_2023_model_variant_table = pd.DataFrame(
    {"source_id": source_id_zelinka_2023, "member_id": member_id_zelinka_2023},
    dtype=str,
)

Then we are able to define a filtering function that will be applied to each grouped model entry individually.

In [None]:
# ================ DEFINING THE FILTERING FUNCTION ================ #

### SET THE EXPECTED NUMBER OF FILES ###

expected_number_of_files = 16

### DEFINE THE GLOBAL OPTIONAL PARAMETERS OF THE FILTERING FUNCTION ###

filtering_by_name = True

keep_only_dataframe = zelinka_2023_model_variant_table

### DEFINE THE FILTERING FUNCTION ###


def filtering_function(grouped_model_entry: pd.DataFrame) -> bool:
    """

    ### DEFINITION ###

    This function allows the intake-esgf catalog to be cleaned of the entries that are not complete.
    In the default case it means entries that do not meet the expected number of files.

    The user can also set a condition that only the model and variant couples present in a provided pandas dataframe are kept.
    Since the nature of this function is to be an input for the intake-esgf package, we define the optional arguments outside
    of the function.

    ### INPUTS

    GROUPED_MODEL_ENTRY : Pandas DataFrame | sub dataframe containing all the variables of a given source_id, member_id and grid tuple.

    ### OPTIONAL ARGUMENTS (DEFINED GLOBALLY)

    FILTERING_BY_NAME : BOOL | defines if we filter the entry by source_id and member_id or not

    KEEP_ONLY_DATAFRAME : Pandas DataFrame | associated dataframe holding the source_id and member_id to conserve

    ### OUTPUTS

    BOOL | whether we keep this model group or not
    """

    ### TEST THE NUMBER OF VARIABLES ###

    if len(grouped_model_entry) == expected_number_of_files:

        ### NUMBER OF VARIABLES' TEST SUCCEEDED ###

        ## Do we keep only the Zelinka's article model and variant couples ? ##

        # NO : We keep everything that matched the variable number's test #

        if not (filtering_by_name):

            return True

        ### YES : KEEPING ONLY THE COUPLES PRESENT IN ZELINKA 2023 ###

        else:

            ## Doing the test on the grouped_model_entry's source_id and member_id ##

            # Extract the grouped_model_entry data #

            grouped_model_entry_source_id = grouped_model_entry.source_id.unique()[0]

            grouped_model_entry_member_id = grouped_model_entry.member_id.unique()[0]

            # Can we find the grouped_model_entry's source_id and member_id in one row of keep_only_dataframe ? #

            is_in_keep_only_dataframe = (
                (keep_only_dataframe["source_id"] == grouped_model_entry_source_id)
                & (keep_only_dataframe["member_id"] == grouped_model_entry_member_id)
            ).any()

            # Return the result of the test #

            return bool(is_in_keep_only_dataframe)

    ### NUMBER OF VARIABLES' TEST FAILED ###

    else:

        return False
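
Because the predicate receives only one argument, the function above reads its options from global variables. An alternative design, sketched here, is to build the predicate with a small factory so that the options travel with it. The factory `make_filter` and the toy dataframe are hypothetical and only serve the illustration; the `groupby` loop mimics the idea of testing each (source_id, member_id, grid_label) group, not the internals of intake-esgf.

```python
import pandas as pd


def make_filter(expected_number_of_files, allowed_pairs=None):
    """Build a predicate for one model group (hypothetical helper).

    allowed_pairs: optional set of (source_id, member_id) tuples to keep.
    """

    def keep_group(group: pd.DataFrame) -> bool:
        # Reject groups that do not hold the expected number of results
        if len(group) != expected_number_of_files:
            return False
        # Without a list of pairs, completeness alone decides
        if allowed_pairs is None:
            return True
        pair = (group["source_id"].iloc[0], group["member_id"].iloc[0])
        return pair in allowed_pairs

    return keep_group


# Toy search results (invented): two variables are expected per entry
toy_df = pd.DataFrame(
    {
        "source_id": ["A", "A", "B", "B", "C"],
        "member_id": ["r1i1p1f1"] * 5,
        "grid_label": ["gn"] * 5,
        "variable_id": ["clt", "rsdt", "clt", "rsdt", "clt"],
    }
)

keep_group = make_filter(2, allowed_pairs={("A", "r1i1p1f1")})

# Apply the predicate to every (source_id, member_id, grid_label) group
kept = [
    key
    for key, group in toy_df.groupby(["source_id", "member_id", "grid_label"])
    if keep_group(group)
]
print(kept)
```

In this toy case, model B is complete but absent from the allowed pairs and model C is incomplete, so only model A is kept. A predicate built this way has the same signature as `filtering_function`, so it could presumably be used in its place, although this was not tested here; the rest of the notebook keeps using `filtering_function`.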

The following method, *remove_incomplete*, applies the *filtering_function* defined above to the model groups of the catalog.

In [None]:
catalog = catalog.remove_incomplete(filtering_function)

Looking at the resulting dataframe, we see that the unwanted entries have indeed been filtered out:

In [None]:
catalog.df

This final catalog can be turned into a Markdown table for a *.md* readme file. This can be useful if one wants to show the data they use on GitHub, for example.

In [None]:
# ================ TURN THE CATALOG INTO A TABLE FOR GITHUB ================ #

### GENERATE THE TABLE ###

## Create the dataframe from the catalog ##

remaining_entries_dataframe = catalog.df

## We extract only the needed information to describe the selected entries once ##

remaining_grouped_models_dataframe = remaining_entries_dataframe[
    ["source_id", "member_id", "grid_label"]
].drop_duplicates()

## We sort the models' names by alphabetical order ##

remaining_entries_dataframe_sorted = remaining_grouped_models_dataframe.sort_values(
    ["source_id", "member_id"]
).reset_index(drop=True)

### PRINTING IT FOR COPY ###

print(remaining_entries_dataframe_sorted.to_markdown())

---

## Downloading the files and load a dictionary in memory

---

The intake-esgf library can load the results found into memory as a dictionary holding one xarray dataset for every netcdf file found. This process also saves the netcdf files at the previously defined *local_cache* path. This part is less general than the previous sections as we have developed more complex routines than what the intake-esgf package provides. Still, if your request is fairly straightforward, it may be worth looking first at the intake-esgf documentation: https://intake-esgf.readthedocs.io/en/latest/quickstart.html.

We have indeed developed a specific downloading routine. This is because an issue may be encountered if, as in this example, the user tries to download many different entries from the Solr index nodes: the download of some variables hosted on these nodes fails. A workaround is to make one request per **(source_id, member_id, grid_label)** tuple. That is to say, **download each entry independently and then regroup them under the same dictionary**.

In addition, by default, the package automatically looks for the **areacella** variable. Areacella is the variable associated with the model's grid. It gives the surface area associated with each grid point and allows us to easily compute spatial averages across grid points. So, intake-esgf does look for areacella but it does so **rather slowly**, as it first looks for an areacella variable matching our full search facets, which may lead to no results, and then expands the search. Moreover, it does so for every single variable of a given model, which is redundant. In our analysis, we would rather load the dictionary with *add_measures* set to **False** and then download (and load) the areacella netcdf files separately with a homemade routine. An example of this process is given afterward.

### Downloading many entries from the Solr nodes

We prepare the download of each individual set of variables for a given model. For that, we use two functions that can be found in the *load_cmip6.py* submodule in the same folder as this notebook.

- The *generate_single_model_search_criterias* function generates the *search_criterias* used to download a single entry. It does so by adding the needed (source_id, member_id, grid_label) tuple to the global *search* dictionary.

- The *update_single_entry_keys* function updates the keys of a single-entry dictionary, once it has been downloaded, so that all the dictionaries can be merged together.

With these two functions, we can generate a dictionary holding all entries even though we downloaded them separately.
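
To make the role of the two helpers concrete, here is a minimal sketch of what such functions could look like. `build_entry_search` and `prefix_entry_keys` are hypothetical stand-ins written for this illustration: they are not the repository's implementations, and the key format they produce is an assumption. The download step is replaced by a placeholder dictionary.

```python
def build_entry_search(search_facets, source_id, member_id, grid_label):
    """Hypothetical stand-in for generate_single_model_search_criterias."""
    # Copy the global facets so that the shared dictionary is never modified
    entry_search = dict(search_facets)
    entry_search.update(
        source_id=source_id, member_id=member_id, grid_label=grid_label
    )
    entry_name = ".".join([source_id, member_id, grid_label])
    return entry_search, entry_name


def prefix_entry_keys(entry_dict, entry_name):
    """Hypothetical stand-in for update_single_entry_keys."""
    # Prefixing each key with the entry name keeps keys unique across entries
    return {
        "{}.{}".format(entry_name, key): value for key, value in entry_dict.items()
    }


search = {
    "experiment_id": ["piClim-control", "piClim-aer"],
    "variable_id": ["clt"],
    "table_id": "Amon",
}

full_dict = {}
for entry in [("IPSL-CM6A-LR", "r1i1p1f1", "gr"), ("MIROC6", "r1i1p1f1", "gn")]:
    entry_search, entry_name = build_entry_search(search, *entry)
    # Placeholder for the download step: the values would be xarray datasets
    downloaded = {"piClim-control.clt": "dataset", "piClim-aer.clt": "dataset"}
    full_dict.update(prefix_entry_keys(downloaded, entry_name))

print(sorted(full_dict))
```

Without the prefix, the second entry would overwrite the first one when merging, since both single-entry dictionaries use the same keys.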

Before using them, we need to extract some information from the filtered catalog.

In [None]:
# ================ EXTRACT THE FILTERED CATALOG'S INFORMATION ================ #

### GET THE INFORMATION THAT THE CATALOG EXTRACTED ###

## Generate the full dataframe of the files found by the search ##

selected_entries_full_dataframe = catalog.df

## Save the grouped model pandas series for the areacella part ##

series_grouped_models = catalog.model_groups()

### GET ONLY THE NEEDED INFORMATION ###

## We extract the (source_id, member_id, grid_label) tuples from the full dataframe ##

# we remove the duplicates to only keep one row per tuple

grouped_models_dataframe = (
    selected_entries_full_dataframe[["source_id", "member_id", "grid_label"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

Once we have all the needed variables, the following routine builds the full dictionary by downloading every single entry independently. This is what the function *loading_cmip6* from the *load_cmip6* submodule does.

In [None]:
# ================ DOWNLOAD EVERY SINGLE OUTPUT ================ #

### DOWNLOAD EVERY SINGLE ENTRY AND COMBINE THEM INTO A DICTIONARY ###

## Initialize the full dictionary ##

full_cmip6_dict = {}

## Downloading all the models one entry at a time ##

print("Downloading and/or loading the data one entry at a time...\n")

for index in grouped_models_dataframe.index:

    ## Reset the catalog ##

    catalog = intake_esgf.ESGFCatalog()

    ## Generate the associated search criteria ##

    search_criterias_given_row, single_model_name = (
        generate_single_model_search_criterias(
            search_facets=search,
            grouped_models_dataframe=grouped_models_dataframe,
            index=index,
        )
    )

    ## Generate the single model's output name ##

    print("\nDownloading {} ...\n".format(single_model_name))

    ## Apply the search criterias ##

    catalog.search(
        **search_criterias_given_row,
    )

    ## Downloading the output... ##

    single_model_dictionary = catalog.to_dataset_dict(
        add_measures=False,
        ignore_facets=[
            "project",
            "mip_era",
            "activity_drs",
            "institution_id",
            "table_id",
            "grid_label",
            "version",
        ],
        quiet=True,
    )

    ## Updating its keys ##

    single_model_dictionary = update_single_entry_keys(
        single_model_dictionary, single_model_name
    )

    ## Updating the full dictionary ##

    full_cmip6_dict = full_cmip6_dict | single_model_dictionary

### Homemade routine to retrieve the areacella of a given entry only once

As we said in the introduction of this section, the main drawback of the intake-esgf package when it comes to looking for measures is that it does so for every single variable, even though our 8 variables for one experiment share the same areacella grid. What's more, it always starts by looking for our full set of facets, which is irrelevant.

Indeed, the only three parameters that define areacella are the **source_id**, the **member_id** and, obviously, the **grid_label**. Conveniently enough, we can group our search results according to these three parameters with the *.model_groups* method. In the end, our experiment may not hold the areacella variable, but it does not really matter: we just need to find one. This is the idea behind this routine, which still gets areacella but significantly faster. Note that this method is not useful if you only need a small number of models and variables, as the time difference will be negligible.
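
The gain can be illustrated without any download. In the sketch below, `fetch_areacella` is a hypothetical placeholder for the slow catalog search, and the list of rows is invented: it only counts how many lookups are needed when areacella is requested once per entry rather than once per variable.

```python
# Toy list of results (invented): one row per (entry, experiment, variable)
rows = [
    (source_id, member_id, grid_label, experiment, variable)
    for source_id, member_id, grid_label in [
        ("IPSL-CM6A-LR", "r1i1p1f1", "gr"),
        ("MIROC6", "r1i1p1f1", "gn"),
    ]
    for experiment in ["piClim-control", "piClim-aer"]
    for variable in [
        "clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus", "rsdscs", "rsuscs",
    ]
]

lookups = []


def fetch_areacella(source_id, grid_label):
    """Placeholder for the slow catalog search and download."""
    lookups.append((source_id, grid_label))
    return "areacella grid of {} on {}".format(source_id, grid_label)


# Look areacella up once per entry instead of once per row
dict_areacella = {}
for source_id, member_id, grid_label, _, _ in rows:
    key = ".".join([source_id, member_id, grid_label])
    if key not in dict_areacella:
        dict_areacella[key] = fetch_areacella(source_id, grid_label)

print(len(rows), len(lookups))
```

Here 32 rows (2 entries × 2 experiments × 8 variables) require only 2 lookups, one per entry.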

Let's be sure that we look at all the nodes for this part.

In [None]:
all_indices = True

In [None]:
intake_esgf.conf.set(all_indices=all_indices)

if all_indices:

    print("We are looking at all the nodes.")

else:

    print("We are only looking at the globus nodes")

Then we apply the following routine to produce the areacella dictionary. This is described in the function *get_areacella_apart* from the *load_cmip6* submodule.

In [None]:
# ================ HOMEMADE ROUTINE TO GET AREACELLA GRIDS FASTER ================ #

### INITIALISATION  ###

## Initialise the full dictionary ##

dict_areacella = {}

## Number of rows ##

n_rows = series_grouped_models.size

### GO THROUGH EVERY ROW OF THE PANDAS SERIES ###

for ii in range(n_rows):

    ## Get the SOURCE_ID | MEMBER_ID | GRID_LABEL of the row ##

    # Retrieve the row #

    row_ii = series_grouped_models.index[ii]

    # Extract the labels #

    source_id, member_id, grid_label = row_ii

    # Build the key for this dictionary entry #

    full_key = source_id + "." + member_id + "." + grid_label

    ## Special case for the IPSL-CM6A-LR-INCA model ##

    if source_id == "IPSL-CM6A-LR-INCA":

        source_id = "IPSL-CM6A-LR"

    ## Do the full search ##

    areacella_search_full = catalog.search(
        source_id=source_id,
        grid_label=grid_label,
        variable_id="areacella",
        quiet=True,
    ).df  # silence the progress bar

    ## Extract the first experiment id that gives an areacella entry ##

    only_first_exp_id = areacella_search_full.experiment_id.values[0]

    ## Extract the first member id that gives an areacella entry ##

    only_first_member_id = areacella_search_full.member_id.values[0]

    ## Get the areacella for the given row ##

    # Search and download it #

    print("\nDownloading areacella for {} ...\n".format(full_key))

    areacella_ii = catalog.search(
        source_id=source_id,
        grid_label=grid_label,
        variable_id="areacella",
        experiment_id=only_first_exp_id,
        member_id=only_first_member_id,
        quiet=True,
    ).to_dataset_dict(
        add_measures=False, quiet=True
    )  # silence the progress bar

    # Store it in dictionary #

    dict_areacella[full_key] = areacella_ii["areacella"]

---

## Combine the extracted data into full xarray datasets with monthly climatologies and save them on disk

---

The next step is to process this data to make it usable for our analysis; in this case, we want monthly climatologies for every variable. We start from a set of netcdf files on disk, associated with each entry we requested.

However, if we wanted to retrieve this raw data in another script, we would have to run this whole intake-esgf routine again, even though it would this time find the files locally. What's more, each variable sits under its own dictionary key. This is not convenient for generating the climatologies, as we would prefer full xarray datasets, which are far more flexible.

The solution adopted here is to gather every variable for a given experiment and a given entry in the **same xarray dataset**, generate the **monthly climatologies of every variable** and then save the datasets as **netcdf files**. The **paths** of these netcdf files are saved in a **pandas DataFrame** that can be loaded at the beginning of a script. As a result, we can reload a simpler dictionary of pre-processed datasets, which are easier to manipulate, without calling the ESGF catalog.

### Generate monthly climatologies for every experiment and entry 

First and foremost, this part aims at generating a monthly climatology for every variable of every experiment and for every model. We then gather these climatologies into two xarray datasets per entry, one for each experiment. We also add the areacella variable, which for now sits in a separate dictionary, as a variable of these datasets. This allows for easy area-weighted spatial averages.
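
As an illustration of these two ideas, the sketch below builds a monthly climatology and an area-weighted mean on synthetic data. It is not the code of *add_one_variable_to_dataset*: the variable name `tas`, the grid and the cell areas are made up, and xarray's `groupby("time.month")` and `weighted` are used here as one possible way to do it.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic monthly series over two years on a tiny 2 x 3 grid (made-up values)
toy_time = pd.date_range("2000-01-01", periods=24, freq="MS")
month_signal = np.arange(24.0)[:, None, None]  # grows by 1 every month
lat_offset = np.array([0.0, 4.0])[None, :, None]  # second latitude row is 4 units higher
toy_values = month_signal + lat_offset + np.zeros((24, 2, 3))

toy_tas = xr.DataArray(
    toy_values,
    coords={"time": toy_time, "lat": [-45.0, 45.0], "lon": [0.0, 120.0, 240.0]},
    dims=("time", "lat", "lon"),
    name="tas",
)

# Monthly climatology: average all Januaries together, all Februaries together, etc.
toy_clim = toy_tas.groupby("time.month").mean("time")

# Gather the climatology and hypothetical cell areas in a single dataset
toy_ds = toy_clim.to_dataset()
toy_ds["areacella"] = (("lat", "lon"), np.array([[1.0, 1.0, 1.0], [3.0, 3.0, 3.0]]))

# Area-weighted spatial mean for each month of the climatology
toy_mean = toy_ds["tas"].weighted(toy_ds["areacella"]).mean(("lat", "lon"))
```

With these made-up areas, the weighted mean is pulled towards the latitude row with the larger cells, which a plain `.mean(("lat", "lon"))` would not capture.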

We start by defining the path where we will save the datasets.

In [None]:
# ================ DEFINE THE PATHS WHERE TO SAVE THE TREATED DATA ================ #

### SAVE PATH OF THE TREATED DATA ###

parent_path_save_clim = (
    homedir_path + "/certainty-data/" + downloading_folder_name + "/climatologies"
)

We go through every variable for every entry and experiment, and thus generate monthly climatology datasets holding all the variables and areacella. These datasets are then stored in a new dictionary structure. This is what the *create_climatology_dict* function from the *extract_climatologies* submodule does.

In [None]:
# ================ GENERATE THE XARRAYS DATASETS FOR EVERY ENTRY AND EXPERIMENT ================ #

### INITIALISATION ###

## Create the dictionary ##

full_cmip6_dict_clim = {}

## Generate the general key associated to each model.variant and experiment ##

keys_without_variable_unique = generate_per_model_dict_key(full_cmip6_dict)

## Define the number of unique entry and experiments couples ##

n_entry_and_exp = len(keys_without_variable_unique)

### GO THROUGH EACH MODEL.VARIANT.GRID AND EXPERIMENT ###

## Define a progress bar while we go through the unique entry keys ##

for index in tqdm(
    range(n_entry_and_exp), desc="Generating the climatologies' dictionary..."
):

    ## Retrieve the key ##

    key = keys_without_variable_unique[index]

    ## Initialize the dataset with the first variable ##

    # Define the variable #

    var = variable_id[0]

    # Define that the dataset does not exist yet #

    modify_data = False

    # Copy the key without variable (list() makes a real copy, so key is left untouched) #

    key_with_var = list(key)

    # Add the variable name #

    key_with_var[-1] = var

    # Generate the key by joining the str list with "." #

    key_with_var_full = ".".join(key_with_var)

    # Retrieve the variable data array #

    var_datarray = full_cmip6_dict[key_with_var_full]

    # Generate or update the dataset for the given model.variant and experiment #

    dataset_given_exp = add_one_variable_to_dataset(
        variable_name=var,
        var_datarray=var_datarray,
        modify_data=modify_data,
        do_clim=True,
    )

    # Set that now the dataset already exists #

    modify_data = True

    ## Go through the rest of the variables ##

    for var in variable_id[1:]:

        # Copy the key without variable (list() makes a real copy, so key is left untouched) #

        key_with_var = list(key)

        # Add the variable name #

        key_with_var[-1] = var

        # Generate the key by joining the str list with "." #

        key_with_var_full = ".".join(key_with_var)

        # Retrieve the variable data array #

        var_datarray = full_cmip6_dict[key_with_var_full]

        # Update the dataset with the climatology of this variable #

        add_one_variable_to_dataset(
            variable_name=var,
            var_datarray=var_datarray,
            modify_data=modify_data,
            dataset=dataset_given_exp,
            do_clim=True,
        )

    ## Generate the key for full_cmip6_dict_clim ##

    # Retrieving the key information #

    # key =  [source_id, member_id, grid, experiment_id, '*']

    source_id = key[0]

    member_id = key[1]

    grid_label = key[2]

    experiment_id = key[3]

    # Create the new key #

    new_simpler_key_given_exp = ".".join(
        [source_id, member_id, grid_label, experiment_id]
    )

    ## Use the gathered information to get the areacella entry of the given model.variant and experiment ##

    # Build the areacella key #

    key_areacella = ".".join([source_id, member_id, grid_label])

    # Retrieve the given areacella #

    areacella_datarray = dict_areacella[key_areacella]

    # Update the dataset of the given model.variant and experiment with the associated areacella #

    dataset_given_exp["areacella"] = (
        ("lat", "lon"),
        areacella_datarray["areacella"].values,
    )

    ## Add the dataset to the output dictionary ##

    full_cmip6_dict_clim[new_simpler_key_given_exp] = dataset_given_exp

We save the structure by writing every xarray dataset to a netcdf file and by generating a *key_paths_table.pkl* file. This file stores a pandas DataFrame, *key_paths_table*, that **associates every netcdf file's path with its key in the dictionary**. This routine is implemented in the *dict_to_netcdf* function of the *store_data* submodule.

Therefore, in another script, we will be able to rebuild the dictionary from scratch by using this table and the function *netcdf_to_dict* from the *store_data* submodule.

In [None]:
# ================ SAVE THE GENERATED DICTIONARY ================ #

### INITIALISATION ###

## Get the list of the keys of the dictionary ##

# Extract the list #

list_keys = list(full_cmip6_dict_clim.keys())

# Get the number of keys #

n_keys = len(list_keys)

## Generate the array of the paths ##

paths = np.empty(n_keys, dtype=object)  # dtype = object otherwise it truncates the str

### GO THROUGH THE ENTRIES ###

for ii, key in enumerate(list_keys):

    ## Generate a filename with the key ##

    # Split the key into a list of keywords #

    splitted_key = key.split(".")

    # Connect them with a "_" to make a filename that is not broken #

    full_name = "_".join(splitted_key)

    # Define the filename #

    filename = full_name + ".nc"

    ## Create the directory associated to the entry and keep its path ##

    saving_path_given_entry = create_dir(
        parent_path=parent_path_save_clim, name=full_name, clear=do_we_clear
    )

    ## Generate the full path with the filename ##

    path_to_nc = saving_path_given_entry + "/" + filename

    ## Save the entry's dataset ##

    # Save it #

    full_cmip6_dict_clim[key].to_netcdf(path=path_to_nc)

    # Conserve the path at which we saved it in the array #

    paths[ii] = path_to_nc

### GENERATE THE PANDAS DATAFRAME ASSOCIATING KEYS WITH PATHS ###

## Create the pandas dataframe from a dictionary ##

# Define the table key vs path #

key_paths_dict = {"key": list_keys, "path": paths}

# Define the dataframe #

key_paths_table = pd.DataFrame(key_paths_dict)

## Save the pandas dataframe ##

# Create the table folder to hold it #

saving_path_table = create_dir(
    parent_path=parent_path_save_clim, name="table", clear=do_we_clear
)

# Save it #

key_paths_table.to_pickle(saving_path_table + "/key_paths_table.pkl")
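
For reference, a minimal sketch of how such a table could be read back in another script is given below. It is not the actual *netcdf_to_dict* function: `rebuild_dict` and its `loader` argument are hypothetical names, and the toy key and path are made up. With real files, `loader` would typically be `xarray.open_dataset`; here `os.path.basename` stands in for it so that the sketch runs without any netcdf file.

```python
import os
import tempfile

import pandas as pd


def rebuild_dict(table_path, loader):
    """Hypothetical helper: map every key of the saved table to loader(path)."""
    table = pd.read_pickle(table_path)
    return {key: loader(path) for key, path in zip(table["key"], table["path"])}


# Toy table with a made-up key and path, saved the same way as key_paths_table
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_table_path = os.path.join(tmp_dir, "key_paths_table.pkl")
    toy_table = pd.DataFrame(
        {
            "key": ["ModelA.r1i1p1f1.gn.piClim-control"],
            "path": [os.path.join(tmp_dir, "ModelA_r1i1p1f1_gn_piClim-control.nc")],
        }
    )
    toy_table.to_pickle(toy_table_path)

    # os.path.basename is only a stand-in loader for this sketch
    rebuilt = rebuild_dict(toy_table_path, loader=os.path.basename)

print(rebuilt)
```

Passing the loader as an argument keeps the sketch independent of how the files are opened, for instance lazily or with a specific engine.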