# Downloading the needed CMIP6 data for one model's output

## Purpose of the notebook

This notebook retrieves the wget script needed to download the required CMIP6 variables for the **piClim-control** experiment and the **r1i1p1f1** variant of the **IPSL-CM6A-LR** model. It uses the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html.

Feel free to share, use and improve the following code according to the license provided in the repository.

## Initialization 

### Set environment parameters

We set this parameter to suppress needless, harmless warnings about the *facets* variable.

In [143]:
# ================ SET ENVIRONMENT PARAMETERS ================ #

### NO FACETS WARNING ###

%env ESGF_PYCLIENT_NO_FACETS_STAR_WARNING = 42
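
The `%env` magic only exists inside IPython or Jupyter. A minimal sketch of the equivalent for a plain Python script, assuming the variable simply has to be present in the environment before the first search is run:

```python
import os

# Set the variable early, before any esgf-pyclient search is created.
# The value "42" mirrors the notebook; the assumption here is that any
# non-empty value has the same effect.
os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"] = "42"

print(os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"])
```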



### Importations

In [144]:
# ================ IMPORTATIONS ================ #

from pyesgf.search import SearchConnection  # the query package of pyesgf for cmip6 data

import os  # to handle paths

import pandas as pd  # to store the urls under a table format

from tqdm import tqdm  # to make cool progress bars

import tempfile  # to make a temporary file

import subprocess  # to run the wget code from python

import random # to generate the name of our downloading scripts

### Custom exceptions for error clarity 

In [145]:
# ================ DEFINING OUR OWN EXCEPTIONS TO BE RAISED ================ #

class InputPathError(ValueError):
    ''' Raise when the parent path ends with a '/' or the child path starts with a '/' '''

### Set the paths where the saving and downloading folders will be created

These user-defined paths tell the program **where to create its download folder** and **where the downloaded folders will be moved at the end**.

In [146]:
# ================ SAVING PATH ================ #

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

save_dir = "/data/lgiboni/downloaded-cmip6-datadir-example/"

### WHERE ARE WE PUTTING THE TEMPORARY DOWNLOADING FOLDERS ###

where_download_dir = "/scratchu/lgiboni/CMIP6_DATA"

### Global criteria for the data search

In [147]:
# ================ COMMON SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### CMIP6 DATA ###

project = ["CMIP6"]

### MONTHLY FREQUENCY ###

frequency = ["mon"]

### VARIABLE TO SEARCH FOR ###

variables = [
    "clt",
    "rsdt",
    "rsut",
    "rsutcs",
    "rsds",
    "rsus",
    "rsdscs",
    "rsuscs",
    "rlut",
    "rlutcs",
    "rlds",
    "rlus",
]
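
For readers unfamiliar with the CMIP6 short names, the sketch below maps each requested variable to a short description. The descriptions are paraphrased from the CMIP6 variable long names, and the `variable_descriptions` dictionary is a hypothetical helper that the rest of the notebook does not use.

```python
# Hypothetical helper: short descriptions of the requested variables
# (TOA = top of atmosphere, "cs" suffix = clear-sky).
variable_descriptions = {
    "clt": "total cloud cover percentage",
    "rsdt": "TOA incident shortwave radiation",
    "rsut": "TOA outgoing shortwave radiation",
    "rsutcs": "TOA outgoing clear-sky shortwave radiation",
    "rsds": "surface downwelling shortwave radiation",
    "rsus": "surface upwelling shortwave radiation",
    "rsdscs": "surface downwelling clear-sky shortwave radiation",
    "rsuscs": "surface upwelling clear-sky shortwave radiation",
    "rlut": "TOA outgoing longwave radiation",
    "rlutcs": "TOA outgoing clear-sky longwave radiation",
    "rlds": "surface downwelling longwave radiation",
    "rlus": "surface upwelling longwave radiation",
}

for name, description in variable_descriptions.items():
    print(f"{name:7s} {description}")
```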

## Function definitions

### create_dir : creates the directories we need for the downloading part 

In [148]:
def create_dir(parent_dir : str, to_be_created_dir : str) :

    """
    ### DEFINITION

    This function creates the directories we need for the downloading part by calling shell commands.
    WARNING : this won't work if the '/' are misplaced. parent_dir must not end with a '/' and to_be_created_dir must not start with a '/'.
    
    ### INPUTS 

    PARENT_DIR : STR | parent directory of the directory to be created : it must exist or an error will be raised
    
    TO_BE_CREATED_DIR : STR | the directory that is to be created : it can be a folder or a path. 
    If it's a path, several folders and subfolders will be created.

    ### OUTPUT

    FULL_PATH : STR | the full path of the folders and maybe subfolders created.
    """

    ### WE DEFINE THE FULL PATH ###

    ## We check that parent_dir does not end with '/' and that to_be_created_dir does not start with '/' ##

    if parent_dir.endswith("/") or to_be_created_dir.startswith("/") :

        raise InputPathError("parent_dir must not end with a '/' and to_be_created_dir must not start with a '/'")

    ## The inputs are valid : we build the full path ##

    full_path = parent_dir + "/" + to_be_created_dir
        
    ### WE MAKE THE FOLDER ###

    ## We check that parent_dir exists : mkdir -p would otherwise silently create it ##

    if not os.path.isdir(parent_dir) :

        raise FileNotFoundError("The parent_dir variable is wrong, such a path does not exist : " + parent_dir)

    ## We first check if to_be_created_dir already exists or not ##

    # If it exists : we remove it #

    if os.path.isdir(full_path) :

        # We run the rm -rf command from python to erase the folder and its content #
        # Passing the arguments as a list avoids any shell quoting issue with the path #

        subprocess.run(["rm", "-rf", full_path], check=True)

        print("Pre-existing folder at {} removed".format(full_path))

    ## We run the mkdir command from python to make the folder ##

    subprocess.run(["mkdir", "-p", full_path], check=True)  # -p option allows for simultaneous folder and subfolder creation

    print("New folder at {} created".format(full_path))

    return full_path
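
`create_dir` relies on the `rm` and `mkdir` shell commands. As a point of comparison, here is a minimal sketch of a pure Python variant built on `shutil` and `os.makedirs`. The name `create_dir_stdlib` is hypothetical and the notebook does not use it; the path rules (no trailing '/' on the parent, no leading '/' on the child) are kept.

```python
import os
import shutil
import tempfile


def create_dir_stdlib(parent_dir: str, to_be_created_dir: str) -> str:
    """Hypothetical pure-Python variant of create_dir (no shell commands)."""
    if parent_dir.endswith("/") or to_be_created_dir.startswith("/"):
        raise ValueError(
            "parent_dir must not end with '/' and to_be_created_dir must not start with '/'"
        )
    if not os.path.isdir(parent_dir):
        raise FileNotFoundError("No such parent directory: " + parent_dir)

    full_path = parent_dir + "/" + to_be_created_dir

    # Remove any pre-existing folder, then recreate it with its subfolders.
    shutil.rmtree(full_path, ignore_errors=True)
    os.makedirs(full_path)
    return full_path


# Usage on a throwaway directory, so nothing real is deleted.
with tempfile.TemporaryDirectory() as parent:
    created = create_dir_stdlib(parent, "model.variant/experiment/")
    print(os.path.isdir(created))
```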

## Search for one model's output : example for the piClim-control experiment

We look for the defined variables for the **IPSL-CM6A-LR** model for the **r1i1p1f1** variant. 

### Definition of the specific query parameters we need

We define the information specific to our model.variant search and to the experiment we are looking for.

In [149]:
# ================ DEFINE THE ADDITIONAL SEARCH CRITERIA FOR A GIVEN MODEL'S OUTPUT ================ #

### SOURCE ID : NAME OF THE CHOSEN MODEL ###

source_id = ["IPSL-CM6A-LR"]

### VARIANT OF THE MODEL RUN ###

variant_label = ["r1i1p1f1"]

### EXPERIMENT ID ###

experiment_id = ["piClim-control"]

### WHOLE SET OF QUERY PARAMETERS ###

facets = (
    "project,frequency,variable,source_id,variant_label,experiment_id,latest,replica"
)

*facets* lists the facets that the search takes into account, which lets us check that the query holds all the parameters we are looking for.\
The element *latest* specifies that we want the latest version of the data.\
The element *replica* specifies whether replicas (duplicate copies hosted on other nodes) are included in the search results.
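
Since the facets string simply repeats the names of the query parameters, it can be derived from them instead of being typed by hand. A minimal sketch, where the `constraints` dictionary is a hypothetical grouping of the notebook's search criteria (the variable list is shortened for the example):

```python
# Hypothetical grouping of the search criteria used in this notebook.
constraints = {
    "project": ["CMIP6"],
    "frequency": ["mon"],
    "variable": ["clt", "rsdt"],  # shortened list for the example
    "source_id": ["IPSL-CM6A-LR"],
    "variant_label": ["r1i1p1f1"],
    "experiment_id": ["piClim-control"],
    "latest": True,
    "replica": False,
}

# Joining a dict iterates over its keys in insertion order.
facets = ",".join(constraints)

print(facets)
```

With such a dictionary, the query could presumably be written as `conn.new_context(facets=facets, **constraints)`, which keeps the facets string and the keyword arguments in sync.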

### Run the actual search

We connect to the IPSL node. Since the variable *distrib* is set to **True**, the search is distributed over all the ESGF nodes rather than restricted to the IPSL index.

In [150]:
# ================ RUN THE QUERY FOR THE PICLIM-CONTROL EXPERIMENT ================ #

### SET THE STARTING NODE FOR THE SEARCH ###

conn = SearchConnection("https://esgf-node.ipsl.upmc.fr/esg-search", distrib=True)

### QUERY ###

## Launch the query ##

query = conn.new_context(
    project=project,
    frequency=frequency,
    variable=variables,
    source_id=source_id,
    variant_label=variant_label,
    experiment_id=experiment_id,
    latest=True,  # Set to "True" to get the latest version
    replica=False,  # Set to "False" to avoid getting duplicate in the search results
    facets=facets,
)  # facets confirms the search criteria we have set


## Number of results ##

results_count = query.hit_count

## Print the number of results ##

print(f"The search has returned {results_count} results")

The search has returned 12 results


We requested 12 variables and got 12 results, which is consistent with one result per variable for the requested model.variant couple and experiment.
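
A matching count does not strictly prove that every variable is present exactly once. A minimal sketch of an explicit check, assuming dataset identifiers that follow the CMIP6 layout `mip_era.activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.version|data_node`. The identifiers below are illustrative only (placeholder version and data node); in the notebook they would come from the `dataset_id` attribute of each search result.

```python
from collections import Counter

# Illustrative identifiers: the version and data node are placeholders.
dataset_ids = [
    "CMIP6.RFMIP.IPSL.IPSL-CM6A-LR.piClim-control.r1i1p1f1.Amon.clt.gr.v00000000|example.data.node",
    "CMIP6.RFMIP.IPSL.IPSL-CM6A-LR.piClim-control.r1i1p1f1.Amon.rsdt.gr.v00000000|example.data.node",
]
requested = ["clt", "rsdt"]


def variable_of(dataset_id):
    """Return the variable_id field (8th dot-separated field) of a dataset id."""
    drs = dataset_id.split("|")[0]  # drop the data node suffix
    return drs.split(".")[7]


counts = Counter(variable_of(dataset_id) for dataset_id in dataset_ids)
missing = [name for name in requested if counts[name] == 0]
duplicated = [name for name in requested if counts[name] > 1]

print("missing:", missing, "| duplicated:", duplicated)
```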

## Create the folders for the downloading routine

Before saving the script and launching it, we need to create the folder associated with our request. To do so, we create a temporary main folder for downloading the CMIP6 data and then make a subfolder for our request.

### Make the temporary main folder for downloading the data

We create the folder cmip6-download-tmp at the path defined by the where_download_dir variable thanks to the create_dir function.

In [151]:
# ================ MAKE THE MAIN TEMPORARY DOWNLOADING FOLDER ================ #

### USE OF THE CREATE DIR FUNCTION ###

temporary_downloading_folder_path = create_dir(where_download_dir,  "cmip6-download-tmp")

Pre-existing folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp removed
New folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp created


### Make the temporary subfolders for our given request

We first define the name of the subfolders for our request and then call the create_dir function.

In [152]:
# ================ BUILD THE FOLDERS' NAME ================ #

### DEFINE THE SUBFOLDERS' PATH FOR OUR GIVEN REQUEST ###

subfolder_request = source_id[0] + "." + variant_label[0] + "/" + experiment_id[0] + "/"

### PRINT THE NAME ###

print("We will create {} at {}".format(subfolder_request, temporary_downloading_folder_path))

We will create IPSL-CM6A-LR.r1i1p1f1/piClim-control/ at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp


In [153]:
# ================ MAKE THE TEMPORARY SUBFOLDERS FOR OUR REQUEST ================ #

### USE OF THE CREATE DIR FUNCTION ###

downloading_path_for_given_request = create_dir(temporary_downloading_folder_path,  subfolder_request)

New folder at /scratchu/lgiboni/CMIP6_DATA/cmip6-download-tmp/IPSL-CM6A-LR.r1i1p1f1/piClim-control/ created


## Launch the wget script for the request

Now that the folders have been created, we only have to write the wget script and download the data.

### Define its name and path

In [154]:
# ================ DEFINE ITS NAME AND PATH ================ #

### DEFINE ITS NAME ###

## Use of the random library to generate a (most likely distinct) downloading script name ##

# Draw a random four-digit integer between 1000 and 9999 #

random_id = random.randint(1000, 9999)

# Turn it into a str #

random_id = str(random_id)

## Define the downloading script's name ## 

downloading_script_name = "download-" + random_id + ".sh"

### DEFINE ITS PATH ###

downloading_script_path = downloading_path_for_given_request + downloading_script_name
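
A four-digit random number can collide between runs. If distinct names matter, one option is a `uuid`-based suffix. This is a sketch of an alternative, not what the notebook does, and the names `unique_id` and `script_name` are hypothetical.

```python
import uuid

# Assumption: any short, practically unique suffix is acceptable in the script name.
unique_id = uuid.uuid4().hex[:8]  # first 8 hexadecimal characters of a random UUID

script_name = "download-" + unique_id + ".sh"

print(script_name)
```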

### Write the downloading script

In [155]:
# ================ WRITE THE WGET SCRIPT ================ #

### RETRIEVE THE SCRIPT ###

wget_script_content = query.get_download_script()

### STORE IT IN THE DOWNLOADING FOLDER ###

## Write the wget script ##

with open(downloading_script_path, "w") as writer:

    # Writing it #
    
    writer.write(wget_script_content)

## Make it readable, writable and executable for the user ##

os.chmod(downloading_script_path, 0o750)  # it is also readable and executable by the group, with no access for other users
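
To see what the octal mode `0o750` grants, the permissions can be read back with the `stat` module. A minimal sketch on a throwaway file, assuming a POSIX system (the temporary file stands in for the downloading script):

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for the downloading script.
with tempfile.NamedTemporaryFile(suffix=".sh", delete=False) as handle:
    script_path = handle.name

os.chmod(script_path, 0o750)

# Read the mode back and render it in the familiar `ls -l` form.
permissions = stat.filemode(os.stat(script_path).st_mode)
print(permissions)  # on a POSIX system: -rwxr-x---

os.remove(script_path)
```

The three triplets read user (`rwx`), group (`r-x`) and others (`---`).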

### Run the downloading script 

In [156]:
# ================ RUN THE WGET SCRIPT ================ #

### RUN THE WGET SCRIPT ###

## We store the results in the download_process variable to produce a log ##

download_process = subprocess.run(
    "bash " + downloading_script_name + " -s",
    cwd=downloading_path_for_given_request,
    shell=True,
    check=True,
    text=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
) 

# -s option to run the script without credential | cwd is the directory of the script

## Store the log ##

with open(downloading_path_for_given_request + "log.txt", "w") as log_file:

    log_file.write(download_process.stdout)
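
With `check=True`, a failing script raises `subprocess.CalledProcessError` before the log is written, so the output of a failed download is lost. A minimal sketch of a variant that always writes the log and returns the exit code. The helper name `run_and_log` is hypothetical, and a short Python command stands in for the wget script so the example runs anywhere.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(command, cwd, log_path):
    """Run `command` in `cwd`, write its combined output to `log_path`, return the exit code."""
    process = subprocess.run(
        command,
        cwd=cwd,
        check=False,  # do not raise, so the log is written even on failure
        text=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )
    with open(log_path, "w") as log_file:
        log_file.write(process.stdout)
    return process.returncode


with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "log.txt")
    # Stand-in command; in the notebook it would be ["bash", downloading_script_name, "-s"].
    command = [sys.executable, "-c", "print('download finished')"]
    exit_code = run_and_log(command, workdir, log_path)
    with open(log_path) as log_file:
        log_content = log_file.read()

print(exit_code, log_content.strip())
```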

## Unpack and store the found url(s)

Each result contains a certain number of files. Each file is associated with a unique URL that can be used to download it. The following cell extracts these URLs and stores them in a list. Here it is quite easy because we have only one URL per result.

In [None]:
# ================ UNPACK THE FOUND URLS ================ #

### INITIALIZE THE LIST OF URLS ###

urls = []

### LOOP OVER THE FOUND RESULTS ###

## Run the search once and keep the results : calling query.search() inside the loop would repeat the request at every iteration ##

results = query.search()

## Cool progress bar for the loop ##

for ii in tqdm(range(results_count), desc="Extracting the urls"):

    ## Extract the ii-th result ##

    result_ii = results[ii]

    ## Extract the list of file(s) associated to this ii-th result ##

    file_list_ii = result_ii.file_context().search()

    ## Loop over the files of the ii-th result ##

    for file_result in file_list_ii:

        ## Add its url to the list of urls ##

        urls.append(file_result.download_url)

Extracting the urls:  58%|█████▊    | 7/12 [00:25<00:16,  3.29s/it]
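
Once collected, the URLs are easier to inspect when split into their components. A minimal sketch, assuming file names that follow the CMIP6 pattern `variable_table_source_experiment_variant_grid_timerange.nc`. The URL below is hypothetical (host, path and time range are made up for illustration) and `describe_url` is a hypothetical helper.

```python
import os
from urllib.parse import urlparse

# Hypothetical URL, for illustration only.
url = (
    "https://example.org/thredds/fileServer/cmip6/"
    "clt_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_185001-187912.nc"
)

FIELD_NAMES = [
    "variable", "table", "source_id", "experiment_id",
    "variant_label", "grid_label", "time_range",
]


def describe_url(url):
    """Split the file name of a download URL into its underscore-separated fields."""
    filename = os.path.basename(urlparse(url).path)
    stem = filename[: -len(".nc")] if filename.endswith(".nc") else filename
    record = dict(zip(FIELD_NAMES, stem.split("_")))
    record["filename"] = filename
    record["url"] = url
    return record


record = describe_url(url)
print(record["variable"], record["time_range"])
```

A list of such dictionaries can be passed directly to `pd.DataFrame` to store the URLs in a table.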

In [None]:
query.get_download_script()