# Downloading needed ensemble CMIP6 data from model.variant entries in a .csv file

## Purpose of the notebook

This notebook retrieves the wget script needed to download an ensemble of CMIP6 variables for a list of model.variant couples, with all the search criteria written in a *.csv* file. An example of such a file can be found in the repository. The main criteria, common to the whole ensemble, are defined **within this notebook**.

It uses the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html. It then creates temporary folders to download the associated *.nc* files and moves these files to a more permanent storage folder. All the temporary folders are cleaned up so that nothing is left behind. The paths of the temporary download folder and of the permanent storage folder are **user-defined**.

Feel free to share, use and improve the following code according to the provided license on the repository.

## Initialization 

### Set environment parameters

We set this parameter to silence a harmless warning about the *facets* variable.

In [2]:
# ================ SET ENVIRONMENT PARAMETERS ================ #

### NO FACETS WARNING ###

%env ESGF_PYCLIENT_NO_FACETS_STAR_WARNING = 42
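
The `%env` magic only exists inside IPython/Jupyter. As a minimal sketch, assuming the same code is run as a plain Python script, the variable can be set through `os.environ` before the first query is built. The value mirrors the one used in the cell above.

```python
import os

# Environment values are always strings, so 42 is stored as "42".
# Set it before building any pyesgf query so the library can see it.
os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"] = "42"

print(os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"])
```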



### Importations

In [3]:
# ================ IMPORTATIONS ================ #

from pyesgf.search import SearchConnection  # the query package of pyesgf for cmip6 data

import os  # to handle path's management

import subprocess  # to run the wget code from python

import random # to generate the name of our downloading scripts

import pandas as pd # to retrieve the .csv data 

from library_for_download_notebooks import * # set of functions used in the notebook : to make it more readable

### Set the user-defined paths

These user-defined paths tell the program

* where it will create its **temporary download folder**
* where it will create its **permanent storage folder**
* where *research-criterias.csv* is located 

In [4]:
# ================ USER-DEFINED PATHS ================ #

### DEFINE TEMPORARY DOWNLOAD FOLDER'S PATH ###

download_dir = "/data/lgiboni/downloaded-cmip6-datadir-example"

### DEFINE PERMANENT STORAGE FOLDER'S PATH ###

storage_dir = "/scratchu/lgiboni/CMIP6_DATA"

### DEFINE LOCATION OF RESEARCH-CRITERIAS.CSV ###

path_csv_file = "/home/lgiboni/CMIP6-Arctic-Aerosol-Analysis/utilities/download/research-criterias.csv"
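
Since every later cell depends on these three paths, it can help to check them before launching any download. The sketch below is only an illustration: `check_user_paths` is a hypothetical helper, not part of the notebook or of its library, and the paths in the usage part are throwaway temporary ones.

```python
from pathlib import Path
import tempfile


def check_user_paths(download_dir, storage_dir, path_csv_file):
    """Return a list of readable problems; an empty list means the paths look usable."""
    problems = []
    # The two folders may not exist yet, but their parent folder must.
    for label, folder in (("download_dir", download_dir), ("storage_dir", storage_dir)):
        parent = Path(folder).parent
        if not parent.is_dir():
            problems.append(f"{label}: parent folder {parent} does not exist")
    # The csv file has to exist already since it is read right away.
    if not Path(path_csv_file).is_file():
        problems.append(f"path_csv_file: {path_csv_file} is not a file")
    return problems


# Usage with temporary paths standing in for the user's own
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "research-criterias.csv"
    csv_path.write_text("project;variable\n")
    ok = check_user_paths(Path(tmp) / "dl", Path(tmp) / "store", csv_path)
    bad = check_user_paths(Path(tmp) / "dl", Path(tmp) / "store", Path(tmp) / "missing.csv")

print(ok)
print(len(bad))
```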

### Dictionary of ESGF node URLs

Here we define the ESGF nodes to query in order to download the data. If the data is not found on one node, we can look at another one.

In [12]:
# ================ ESGF NODES' URL ================ #

dict_urls = {"IPSL" : "https://esgf-node.ipsl.upmc.fr/esg-search", "CEDA" : "https://esgf.ceda.ac.uk/esg-search", "DKRZ" : "http://esgf-data.dkrz.de/esg-search"}

# nb : I've never managed to make the US url work for some reason ?

In [31]:
class NotAllFilesFounds(Exception):
    ''' Raised when, despite trying all the nodes, the expected files of one entry are not all found '''

## Extract the criteria from the .csv file 

All the criteria for our searches are stored in the *research-criterias.csv* file. We read it with pandas.

### Open the .csv file in a pandas dataframe

Beware that the column names of the *csv* file must match the query criteria defined by the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html.
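
A quick way to catch a misnamed column before querying is to compare the header against the facet names you intend to use. This is a minimal sketch using only the standard library. `KNOWN_FACETS` is an illustrative subset chosen for this example, not an exhaustive list of what ESGF accepts, and `unknown_columns` is a hypothetical helper.

```python
import csv
import io

# Illustrative subset of CMIP6 search facets (assumption: extend it to your needs).
KNOWN_FACETS = {
    "project", "variable", "table_id", "source_id",
    "variant_label", "experiment_id", "frequency",
}


def unknown_columns(csv_text, delimiter=";"):
    """Return the header names that are not in KNOWN_FACETS."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    return [name for name in header if name.strip() not in KNOWN_FACETS]


# The last column is deliberately misnamed ("experiment" instead of "experiment_id")
sample = (
    "project;variable;table_id;source_id;variant_label;experiment\n"
    "CMIP6;clt;Amon;CESM2;r1i1p1f1;piClim-control\n"
)
print(unknown_columns(sample))
```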

In [5]:
# ================ OPEN THE CSV FILE ================ #

### READ THE FILE AND WRITE IT INTO A PANDAS DATA FRAME ###

## Save it ##

criterias = pd.read_csv(path_csv_file, delimiter = ";")

## Print it for the user ##

criterias

Unnamed: 0,project,variable,table_id,source_id,variant_label,experiment_id
0,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,ACCESS-ESM1-5,r1i1p1f1,piClim-control
1,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,BCC-ESM1,r1i1p1f1,piClim-control
2,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,CESM2,r1i1p1f1,piClim-control
3,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,CNRM-CM6-1,r1i1p1f2,piClim-control
4,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,CNRM-ESM2-1,r1i1p1f2,piClim-control
5,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,CanESM5,r1i1p2f1,piClim-control
6,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,GFDL-CM4,r1i1p1f1,piClim-control
7,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,GFDL-ESM4,r1i1p1f1,piClim-control
8,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,GISS-E2-1-G,r1i1p1f1,piClim-control
9,CMIP6,"[""clt"",""rsdt"",""rsut"",""rsutcs"",""rsds"",""rsus"",""r...",Amon,GISS-E2-1-G,r1i1p1f2,piClim-control


Each model.variant couple gets one row per **experiment_id**, because a query targets a single experiment at a time: every experiment is retrieved as a separate dataset.

### Get a list of the query criterias given for the search 

The header of the csv file gives us all the categories provided for the query.

In [6]:
# ================ GET THE QUERY CRITERIAS GIVEN ================ #

### EXTRACT THE CRITERIAS PRESENT IN THE FILE ###

given_criterias_in_the_file = criterias.columns.values.tolist()

## We generate the facets variable for the query ##

facets = ', '.join(given_criterias_in_the_file)

## Print them for the user ##

print("The query criterias present in the file are :\n{}".format(facets))

The query criterias present in the file are :
project, variable, table_id, source_id, variant_label, experiment_id
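
The column names serve two purposes in the rest of the notebook: joined together they form the *facets* string, and taken row by row they become the keyword parameters of the query. The sketch below illustrates that mechanism without any network access. `describe_query` is a stand-in written for this example, not a pyesgf function, and the row values are sample values.

```python
def describe_query(**criteria):
    """Stand-in for a query constructor: it simply returns what it received."""
    return criteria


columns = ["project", "table_id", "source_id", "variant_label", "experiment_id"]
row = {
    "project": "CMIP6",
    "table_id": "Amon",
    "source_id": "CESM2",
    "variant_label": "r1i1p1f1",
    "experiment_id": "piClim-control",
}

# Same construction as in the notebook: names joined by a comma and a space
facets = ", ".join(columns)

# One dictionary per row, unpacked with ** into keyword parameters
criteria_for_row = {name: row[name] for name in columns}
received = describe_query(**criteria_for_row, latest=True, replica=False, facets=facets)

print(received["source_id"], received["facets"])
```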


## Iteration over every entry of the .csv file : verify that we can find every single .nc file.

In this loop we verify that we find every *.nc* file for each entry in the node we have chosen. If that's not the case we print the model.variant couple that is not found and we check the other nodes.

### Evaluate the number of files you get for a given model.variant couple 

We check one entry to see what would be the expected number of files to find.\
Beware that this **assumes that every model has the same number of files** which may not be true.
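
One way to relax that assumption is to derive the expected count from each row instead of from a single reference row. The sketch below assumes that the *variable* column holds a JSON-style list, as the dataframe display suggests, and that a complete entry returns one result per variable. `expected_results` is a hypothetical helper and the rows are made-up samples.

```python
import json


def expected_results(variable_cell):
    """Number of variables listed in one 'variable' cell of the csv file."""
    return len(json.loads(variable_cell))


# Sample rows: the second one asks for fewer variables than the first
rows = [
    {"source_id": "CESM2", "variable": '["clt", "rsdt", "rsut"]'},
    {"source_id": "BCC-ESM1", "variable": '["clt", "rsdt"]'},
]

expected = {row["source_id"]: expected_results(row["variable"]) for row in rows}
print(expected)
```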

In [33]:
# ================ EVALUATE THE NUMBER OF .NC FILE FOR ONE ENTRY ================ #

### CHOOSE A NODE ###

## Ad-hoc choice : UK node ##

chosen_url = dict_urls["CEDA"]

conn = SearchConnection(chosen_url, distrib=True)

### EVALUATE THE NUMBER OF FILES FOR A GIVEN ROW ###

## Set the rows' index ##

index = 0

## For the given row : make a dictionary of the criteria ##

# We initialize the dictionary #

criterias_given_row = {}

# We fill the dictionary #

for criteria in given_criterias_in_the_file :

    criterias_given_row.update({criteria : criterias.iloc[index][criteria]})

## Make the query for the specific row : using criterias_given_row as keyword parameters ##

query = conn.new_context(
    **criterias_given_row,  # ** unpacks the dictionary into keyword parameters
    latest=True,  # Set to "True" to get the latest version
    replica=False,  # Set to "False" to avoid getting duplicates in the search results
    facets=facets,  # names of the search criteria
)


## Number of results ##

results_count = query.hit_count

# Define it as a global test variable #

n_expected_files = results_count

# Print it #

print("The number of expected .nc files for each entry is {}".format(n_expected_files))

The number of expected .nc files for each entry is 12


In [None]:
# ================ VERIFY THAT WE CAN FIND EVERY SINGLE .NC FILE ================ #

### CHOOSE A NODE ###

## Ad-hoc choice : UK node ##

chosen_url = dict_urls["CEDA"]

conn = SearchConnection(chosen_url, distrib=True)

### LOOP OVER THE ROWS OF THE .CSV FILE ###

for index in criterias.index:

    ## For a given row : make a dictionary of the criteria ##

    # We initialize the dictionary #

    criterias_given_row = {}

    # We fill the dictionary #

    for criteria in given_criterias_in_the_file :

        criterias_given_row.update({criteria : criterias.iloc[index][criteria]})

    ## Make the query for the specific row : using criterias_given_row as keyword parameters ##

    query = conn.new_context(
        **criterias_given_row,  # ** unpacks the dictionary into keyword parameters
        latest=True,  # Set to "True" to get the latest version
        replica=False,  # Set to "False" to avoid getting duplicates in the search results
        facets=facets,  # names of the search criteria
    )


    ## Number of results ##
    
    results_count = query.hit_count
    
    ### CASES WHERE FEWER ENTRIES ARE FOUND THAN EXPECTED ###

    if results_count < n_expected_files:

        print("\n====== \nresult_count = {} < {} for".format(results_count, n_expected_files), "{}.{}".format(criterias.iloc[index]["source_id"], criterias.iloc[index]["variant_label"]),"| index = {}".format(index), 
              "\nLooking at the other nodes...","\n======")

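        ## Try the other urls (untested sketch completing the unfinished draft) ##

        for institution, other_url in dict_urls.items():

            # Skip the node that was already queried #

            if other_url == chosen_url:
                continue

            # Run the same query on the other node #

            other_conn = SearchConnection(other_url, distrib=True)

            other_query = other_conn.new_context(
                **criterias_given_row,
                latest=True,
                replica=False,
                facets=facets,
            )

            results_count = other_query.hit_count

            # Stop as soon as one node returns every expected file #

            if results_count >= n_expected_files:
                print("Found {} results on the {} node".format(results_count, institution))
                break

        else:

            # No node returned the expected number of files #

            raise NotAllFilesFounds("index = {} : {} results at most".format(index, results_count))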

In [196]:
# ================ DOWNLOAD THE FILES AND SAVE THEM FOR EVERY ENTRY ================ #

### INITIALISATION ###

## Set the starting node for the search ##

# conn = SearchConnection("https://esgf-node.ipsl.upmc.fr/esg-search", distrib=True)

conn = SearchConnection("https://esgf.ceda.ac.uk/esg-search", distrib=True)


### LOOP OVER EVERY ROW ###

### GET THE QUERY CRITERIAS FROM THE GIVEN ROW ###

### QUERY ###

## Launch the query ##

query = conn.new_context(
    **criterias_given_row,  # criteria of the current row, passed as keyword parameters
    latest=True,  # Set to "True" to get the latest version
    replica=False,  # Set to "False" to avoid getting duplicates in the search results
    facets=facets, # names of the search criteria
)  


## Number of results ##

results_count = query.hit_count

## Print the number of results ##

print(f"The search has returned {results_count} results")

The search has returned 12 results


### Run the actual search

We start the search with the IPSL node. Since the variable *distrib* is set to **True**, the query is also distributed to the other ESGF nodes, so a dataset that is missing on the starting node can still be found.

In [96]:
# ================ RUN THE QUERY FOR THE PICLIM-CONTROL EXPERIMENT ================ #

### SET THE STARTING NODE FOR THE SEARCH ###

conn = SearchConnection("https://esgf-node.ipsl.upmc.fr/esg-search", distrib=True)

### QUERY ###

## Launch the query ##

query = conn.new_context(
    project=project,
    frequency=frequency,
    variable=variables,
    source_id=source_id,
    variant_label=variant_label,
    experiment_id=experiment_id,
    latest=True,  # Set to "True" to get the latest version
    replica=False,  # Set to "False" to avoid getting duplicate in the search results
    facets=facets,
)  # facets confirms the search criterias we have made


## Number of results ##

results_count = query.hit_count

## Print the number of results ##

print(f"The search has returned {results_count} results")

NameError: name 'project' is not defined

Since we have 12 variables, we get one result per variable for the requested model.variant couple and experiment.

## Create the folders for the downloading routine

Before saving the script and launching it, we need to make the folder associated to our given request. To do so, we create a temporary main folder for the downloading of the CMIP6 data and then make a subfolder for our given request.

### Make the temporary main folder for downloading the data

We create the folder cmip6-download-tmp at the path defined by the storage_dir variable thanks to the create_dir function.

In [9]:
# ================ MAKE THE MAIN TEMPORARY DOWNLOADING FOLDER ================ #

### USE OF THE CREATE DIR FUNCTION ###

temporary_downloading_folder_path = create_dir(storage_dir,  "cmip6-download-tmp")

Pre-existing folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp removed
New folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp created
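
The *create_dir* function comes from *library_for_download_notebooks* and its code is not shown here. Judging only from the messages printed above, its behavior could look like the sketch below. `create_dir_sketch` is a hypothetical re-implementation written for illustration, so the real function may differ.

```python
import os
import shutil
import tempfile


def create_dir_sketch(parent_path, name):
    """Create parent_path/name from scratch and return its path."""
    path = os.path.join(parent_path, name)
    # Remove a leftover folder from a previous run, contents included
    if os.path.isdir(path):
        shutil.rmtree(path)
        print("Pre-existing folder at {} removed".format(path))
    # makedirs also handles nested names such as "model.variant/experiment"
    os.makedirs(path)
    print("New folder at {} created".format(path))
    return path


# Usage in a throwaway location
with tempfile.TemporaryDirectory() as tmp:
    first = create_dir_sketch(tmp, "cmip6-download-tmp")
    open(os.path.join(first, "stale.nc"), "w").close()  # simulate an old file
    second = create_dir_sketch(tmp, "cmip6-download-tmp")
    leftover = os.listdir(second)
```

Because the folder is removed before being recreated, anything stored in it is lost, so the path should never point to data worth keeping.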


### Make the temporary subfolders for our given request

We first define the name of the subfolders for our given request and call the create_dir function.

In [10]:
# ================ BUILD THE FOLDERS' NAME ================ #

### DEFINE THE SUBFOLDERS' PATH FOR OUR GIVEN REQUEST ###

## Define the model.variant folder's name ##

model_variant_name = source_id[0] + "." + variant_label[0]

## Define the model.variant/experiment subfolder's path ##

subfolder_request = model_variant_name + "/" + experiment_id[0]

### PRINT THE NAME ###

print("We will create {} at {}".format(subfolder_request, temporary_downloading_folder_path))

We will create IPSL-CM6A-LR.r1i1p1f1/piClim-control at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp


In [11]:
# ================ MAKE THE TEMPORARY SUBFOLDER FOR THE REQUEST ================ #

### USE OF THE CREATE DIR FUNCTION ###

downloading_path_for_given_request = create_dir(temporary_downloading_folder_path,  subfolder_request)

New folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp/IPSL-CM6A-LR.r1i1p1f1/piClim-control created


## Launch the wget script for the request

Now that the folders have been created, we only have to write the wget script and download the data.

### Define its name and path

In [12]:
# ================ DEFINE ITS NAME AND PATH ================ #

### DEFINE ITS NAME ###

## Use of the random library to generate a script name that is unlikely to repeat ##

# Set the random number identifier #

random_id  = int(random.uniform(1,10)*1000) # random float between 1 and 10, multiplied by 1000 and truncated : typically a four-digit integer

# Turn it into a str #

random_id = str(random_id)

## Define the downloading script's name ## 

downloading_script_name = "download-" + random_id + ".sh"

### DEFINE ITS PATH ###

downloading_script_path = downloading_path_for_given_request + "/" + downloading_script_name
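
A four-digit random number leaves only about 9000 possible names, so two scripts can end up with the same one. If that matters, for instance when several requests share a folder, an identifier from the standard `uuid` module is one alternative. This is a sketch of that swapped-in technique, not what the notebook does, and `unique_script_name` is a hypothetical helper.

```python
import uuid


def unique_script_name(prefix="download-", suffix=".sh"):
    """Build a script name around a random 32-character hexadecimal identifier."""
    return prefix + uuid.uuid4().hex + suffix


name_a = unique_script_name()
name_b = unique_script_name()
print(name_a)
```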

### Write the downloading script

In [13]:
# ================ WRITE THE WGET SCRIPT ================ #

### RETRIEVE THE SCRIPT ###

wget_script_content = query.get_download_script()

### STORE IT IN THE DOWNLOADING FOLDER ###

## Write the wget script ##

with open(downloading_script_path, "w") as writer:

    # Writing it #
    
    writer.write(wget_script_content)

## Make it readable, writable and executable for the user ##

os.chmod(downloading_script_path, 0o750)  # 0o750 : the group can also read and execute it, other users have no access
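
To see what the octal mode `0o750` actually grants, the permissions can be read back with the standard `stat` module. The sketch below assumes a POSIX system and uses a throwaway file standing in for the download script.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    script_path = os.path.join(tmp, "download-demo.sh")
    with open(script_path, "w") as writer:
        writer.write("#!/bin/bash\necho demo\n")

    # 7 = rwx for the owner, 5 = r-x for the group, 0 = nothing for others
    os.chmod(script_path, 0o750)

    # Read the mode back as an "ls -l" style string
    mode_string = stat.filemode(os.stat(script_path).st_mode)

print(mode_string)  # -rwxr-x---
```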

### Run the downloading script 

In [14]:
# ================ RUN THE WGET SCRIPT ================ #

### RUN THE WGET SCRIPT ###

## We store the results in the download_process variable to produce a log ##

download_process = subprocess.run(
    "bash " + downloading_script_name + " -s",
    cwd=downloading_path_for_given_request,
    shell=True,
    check=True,
    text=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
) 

# -s option to run the script without credential | cwd is the directory of the script

## Store the log ##

with open(downloading_path_for_given_request + "/log.txt", "w") as log_file:

    log_file.write(download_process.stdout)
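
With `check=True`, a failing script raises `subprocess.CalledProcessError` before the log is written, so the output that would explain the failure is lost. One possible variant, sketched here under that concern, writes the log first and only then checks the return code. `run_and_log` is a hypothetical helper, and the failing command is a small Python one-liner standing in for the wget script.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(command, cwd, log_name="log.txt"):
    """Run command in cwd, always write its output to a log, then raise on failure."""
    process = subprocess.run(
        command,
        cwd=cwd,
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    log_path = os.path.join(cwd, log_name)
    with open(log_path, "w") as log_file:
        log_file.write(process.stdout)
    # Raises CalledProcessError if the return code is not zero
    process.check_returncode()
    return log_path


# Usage: a command that prints something and then fails with code 3
with tempfile.TemporaryDirectory() as tmp:
    failing = [sys.executable, "-c", "print('partial output'); raise SystemExit(3)"]
    try:
        run_and_log(failing, tmp)
        failed = False
    except subprocess.CalledProcessError as error:
        failed = True
        return_code = error.returncode
    with open(os.path.join(tmp, "log.txt")) as log_file:
        log_text = log_file.read()
```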

## Move the files to a more sustainable folder

After the files for one model.variant couple have been downloaded, we move them to a location on our server that is known not to be cleaned often. The destination is the user-defined path *download_dir*.

### Create the folder at the download_dir path

Since we do a fresh download, we clean up the folder at the download_dir path and make a new one.

In [15]:
# ================ MAKE A FRESH DATA SAVING FOLDER ================ #

### EXTRACT THE PATH OF THE DIRECTORY IN WHICH LIES OUR SAVING DIRECTORY ###

download_dir_directory_path = os.path.dirname(download_dir)

### EXTRACT THE NAME OF THE SAVING DIRECTORY  ###

name_of_download_dir = os.path.basename(download_dir)

### USE OF THE CREATE DIR FUNCTION ###

download_dir = create_dir(download_dir_directory_path,  name_of_download_dir)

Pre-existing folder at /data/lgiboni/downloaded-cmip6-datadir-example removed
New folder at /data/lgiboni/downloaded-cmip6-datadir-example created


### Move the files from python

To move the model.variant folder, the cell below runs the *mv* bash command from python through *subprocess.run*.

In [16]:
# ================ MOVING THE DOWNLOADED FILES TO download_dir PATH ================ #

### GET THE MODEL.VARIANT FOLDER ###

model_variant_folder_path = os.path.dirname(downloading_path_for_given_request)

### RUNNING THE MV BASH COMMAND FROM PYTHON ###

subprocess.run(
    "mv {} {}".format(model_variant_folder_path, download_dir),
    shell=True,
    check=True,
)
    
print("The {} folder was moved to the path {}".format(model_variant_name, download_dir))

The IPSL-CM6A-LR.r1i1p1f1 folder was moved to the path /data/lgiboni/downloaded-cmip6-datadir-example
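
Building the `mv` command as a string breaks if a path contains a space or another character special to the shell. A possible alternative, sketched below, is `shutil.move` from the standard library, which takes the paths as plain arguments. The folder names reuse the example of this notebook and live in a temporary directory.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Source folder with a space in its parent name, holding one experiment subfolder
    source = os.path.join(tmp, "tmp dir", "IPSL-CM6A-LR.r1i1p1f1")
    os.makedirs(os.path.join(source, "piClim-control"))

    destination = os.path.join(tmp, "storage")
    os.makedirs(destination)

    # The destination already exists, so the source folder is moved inside it
    moved_to = shutil.move(source, destination)

    moved_name = os.path.basename(moved_to)
    still_there = os.path.exists(source)
    nested = os.path.isdir(
        os.path.join(destination, "IPSL-CM6A-LR.r1i1p1f1", "piClim-control")
    )

print(moved_name)
```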
