# Downloading the needed CMIP6 data for one model's output

## Purpose of the notebook

This notebook retrieves and runs the wget script that downloads the needed CMIP6 variables for the **piClim-control** experiment and the **r1i1p1f1** variant of the **IPSL-CM6A-LR** model. It uses the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html.

Feel free to share, use and improve the following code according to the provided license on the repository.

## Initialization 

### Set environment parameters

We set this environment variable to silence needless and harmless warnings about the *facets* parameter.

In [1]:
# ================ SET ENVIRONMENT PARAMETERS ================ #

### NO FACETS WARNING ###

%env ESGF_PYCLIENT_NO_FACETS_STAR_WARNING = 42
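
Outside of IPython the `%env` magic is not available. A minimal sketch of a plain-Python equivalent, assuming it runs before the search is built, could be:

```python
import os

# Plain-Python equivalent of the %env magic above (environment values are always strings).
os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"] = "42"

# Check that the variable is visible to the current process.
print(os.environ.get("ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"))
```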



### Imports

In [2]:
# ================ IMPORTS ================ #

from pyesgf.search import SearchConnection # the query package of pyesgf for cmip6 data

import os # to handle paths

import pandas as pd # to store the urls in a table

from tqdm import tqdm # to make cool progress bars

import tempfile # to make a temporary file

import subprocess # to run the wget script from python

### Set the saving directory's path 

In [3]:
# ================ SAVING PATH ================ #

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

save_dir = "/data/lgiboni/downloaded-cmip6-datadir-example/"

### TEMPORARY FOLDER IN WHICH TO STORE THE WGET FILE AND THE DATA ###

download_dir = "/scratchu/lgiboni/cmip6-temporary-downloading-folder/"

### Global criteria for the data search

In [4]:
# ================ COMMON SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### CMIP6 DATA ###

project = ["CMIP6"]

### MONTHLY FREQUENCY ###

frequency = ["mon"]

### VARIABLES TO SEARCH FOR ###

variables = ['clt','rsdt','rsut','rsutcs','rsds','rsus','rsdscs','rsuscs','rlut','rlutcs','rlds','rlus']

## Search for one model's output: example for the piClim-control experiment

We look for the defined variables for the **IPSL-CM6A-LR** model for the **r1i1p1f1** variant. 

### Definition of the specific query parameters we need

We define the search criteria that are specific to our model.variant couple and to the experiment we are looking for.

In [5]:
# ================ DEFINE THE ADDITIONAL SEARCH CRITERIA FOR A GIVEN MODEL'S OUTPUT ================ #

### SOURCE ID : NAME OF THE CHOSEN MODEL ###

source_id = ["IPSL-CM6A-LR"]

### VARIANT OF THE MODEL RUN ###

variant_label = ["r1i1p1f1"]

### EXPERIMENT ID ###

experiment_id = ["piClim-control"]

### WHOLE SET OF QUERY PARAMETERS ###

facets = 'project,frequency,variable,source_id,variant_label,experiment_id,latest,replica'

*facets* lists the parameters against which our query is validated, to confirm that it holds everything we are looking for.\
The element *latest* specifies that we want the latest version of the data.\
The element *replica* specifies whether duplicates (replicas) are included in the search results.
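
The search criteria above are lists of accepted values. As a rough mental model only, assuming the usual ESGF search semantics (the values given for one parameter are alternatives, while different parameters must all match), the selection can be sketched locally. The `matches` helper and the records below are hypothetical and do not come from esgf-pyclient:

```python
def matches(record, constraints):
    """Return True when, for every parameter, the record's value is one of the accepted values."""
    return all(record.get(facet) in accepted for facet, accepted in constraints.items())


constraints = {
    "source_id": ["IPSL-CM6A-LR"],
    "experiment_id": ["piClim-control"],
    "variable": ["clt", "rsdt"],
}

# Hypothetical records, invented for illustration only.
records = [
    {"source_id": "IPSL-CM6A-LR", "experiment_id": "piClim-control", "variable": "clt"},
    {"source_id": "IPSL-CM6A-LR", "experiment_id": "piClim-control", "variable": "tas"},
    {"source_id": "IPSL-CM6A-LR", "experiment_id": "historical", "variable": "rsdt"},
]

# Only the first record satisfies every constraint.
selected = [record["variable"] for record in records if matches(record, constraints)]
print(selected)
```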

### Run the actual search

We start the search from the IPSL node. Since *distrib* is set to **True**, the search is distributed over all the ESGF nodes instead of being restricted to the IPSL one.

In [6]:
# ================ RUN THE QUERY FOR THE PICLIM-CONTROL EXPERIMENT ================ #

### SET THE STARTING NODE FOR THE SEARCH ###

conn = SearchConnection('https://esgf-node.ipsl.upmc.fr/esg-search', distrib = True)

### QUERY ###

## Launch the query ##

query = conn.new_context(project = project,
                         frequency = frequency,
                         variable = variables,
                         source_id = source_id,
                         variant_label = variant_label,
                         experiment_id = experiment_id,
                         latest = True, # Set to "True" to get the latest version
                         replica = False, # Set to "False" to avoid getting duplicates in the search results
                         facets = facets) # facets used to validate the search criteria we have set


## Number of results ##

results_count = query.hit_count 

## Print the number of results ##

print(f"The search has returned {results_count} results")

The search has returned 12 results


Since we asked for 12 variables, we have one result per variable for the requested model.variant couple and experiment.

## Wget script

Now that the query has returned results, we can retrieve the wget script associated with them and execute it.

### Build the folders' names 

Before extracting and saving the script locally, we set the saving and downloading paths according to the model.variant couple and experiment name defined earlier.

In [7]:
# ================ BUILD THE FOLDERS' NAME ================ #

### DEFINE THE EXTRACT PATH FOR OUR GIVEN REQUEST ###

subpath_request = source_id[0] + "." + variant_label[0] + "/" + experiment_id[0] + "/"

### DEFINE WHERE TO MOVE THE FILES AT THE END ###

save_dir_for_given_search = save_dir + subpath_request

### TEMPORARY FOLDER IN WHICH TO STORE THE WGET FILE AND THE DATA ###

download_dir_for_given_search = download_dir + subpath_request
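
The folders built above are plain strings and do not exist on disk yet. A minimal sketch of building the same layout with `os.path.join` and creating it, assuming a throwaway base directory (`base_dir` is a placeholder, not one of the notebook's paths):

```python
import os
import tempfile

# Hypothetical base directory, created only for this illustration.
base_dir = tempfile.mkdtemp(prefix="cmip6-example-")

source_id = ["IPSL-CM6A-LR"]
variant_label = ["r1i1p1f1"]
experiment_id = ["piClim-control"]

# <model>.<variant>/<experiment>, the same layout as in the cell above.
subpath_request = os.path.join(f"{source_id[0]}.{variant_label[0]}", experiment_id[0])
target_dir = os.path.join(base_dir, subpath_request)

# exist_ok=True makes the call safe to repeat when the folder already exists.
os.makedirs(target_dir, exist_ok=True)
print(os.path.isdir(target_dir))
```

Using `os.path.join` avoids depending on whether the base path ends with a slash.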

In [8]:
### DEFINE OUR OWN TEMPORARY DIRECTORY ###

tempfile.tempdir = "/scratchu/lgiboni/tmp"

### Extract the script

In [9]:
# ================ EXTRACT THE WGET SCRIPT ================ #

### RETRIEVE THE SCRIPT ###

wget_script_content = query.get_download_script()

### STORE IT ON SCRATCH ###

## Set the script writing's path ##

script_fd, script_path = tempfile.mkstemp(suffix='.sh', prefix='download-')

## Write the wget script ##

with os.fdopen(script_fd, "w") as writer:

    writer.write(wget_script_content)

## Make it readable, writable and executable for the user ##

os.chmod(script_path, 0o750) # it is also readable and executable by the group

## Retrieve the directory where we run the downloading script ##

download_dir = os.path.dirname(script_path) # overrides the download_dir defined earlier

## Retrieve the name of our downloading script ##

script_name = os.path.basename(script_path) # the .sh extension is kept

## Run the wget script from python ##

# -s option to run the script without credentials | cwd is the directory of the script

download_process = subprocess.run(["bash", script_name, "-s"], cwd=download_dir, check=True,
                                  text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
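
The permissions set by `os.chmod` can be checked with the standard `stat` module. The sketch below uses a throwaway placeholder script instead of the real wget script and assumes a POSIX system:

```python
import os
import stat
import tempfile

# Throwaway script standing in for the wget script.
fd, path = tempfile.mkstemp(suffix=".sh", prefix="download-")
with os.fdopen(fd, "w") as writer:
    writer.write("#!/bin/bash\necho 'placeholder'\n")

# rwx for the user, r-x for the group, nothing for the others.
os.chmod(path, 0o750)

# stat.filemode renders the mode the way `ls -l` does.
mode_string = stat.filemode(os.stat(path).st_mode)
print(mode_string)

# Clean up the throwaway script.
os.remove(path)
```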


In [None]:
download_dir

In [None]:
with open(os.path.join(download_dir, "log.txt"), "w") as log_file:

    log_file.write(download_process.stdout)
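
The run-and-log pattern used above can be tried without any download. In this sketch a two-line placeholder script stands in for the wget script, and `bash` is assumed to be available on the machine:

```python
import os
import subprocess
import tempfile

# Throwaway working directory and script, standing in for the real wget script.
work_dir = tempfile.mkdtemp(prefix="wget-example-")
script_path = os.path.join(work_dir, "download-example.sh")
with open(script_path, "w") as writer:
    writer.write("#!/bin/bash\necho 'to stdout'\necho 'to stderr' >&2\n")

# stderr=subprocess.STDOUT merges both streams into process.stdout.
# check=True raises CalledProcessError if the script exits with a non-zero code.
process = subprocess.run(
    ["bash", os.path.basename(script_path)],
    cwd=work_dir,
    check=True,
    text=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)

# Write the captured output to a log file next to the script.
log_path = os.path.join(work_dir, "log.txt")
with open(log_path, "w") as log_file:
    log_file.write(process.stdout)

# Read the log back to check what was captured.
with open(log_path) as log_file:
    log_content = log_file.read()
print(log_content)
```

Because the output is only captured once the process ends, nothing is written to the log if the run is interrupted.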

## Unpack and store the found url(s)

Each result contains a certain number of files. Each file is associated with a unique URL that can be used to download it. The following cell extracts these URLs and stores them in a list. Here it is quite easy because we have only one URL per result.

In [None]:
# ================ UNPACK THE FOUND URLS ================ #

### INITIALIZE THE LIST OF URLS ###

urls = []

### RUN THE SEARCH ONCE, OUTSIDE OF THE LOOP ###

results = query.search()

### LOOP OVER THE FOUND RESULTS ###

## Cool progress bar for the loop ##

for ii in tqdm(range(results_count), desc="Extracting the urls"):

    ## Extract the ii-th result ##

    result_ii = results[ii]

    ## Extract the list of file(s) associated with this ii-th result ##

    file_list_ii = result_ii.file_context().search()

    ## Loop over the list of file(s) for the ii-th result ##

    for file_ii in file_list_ii:

        ## Add its url to the list of urls ##

        urls.append(file_ii.download_url)

Extracting the urls:  58%|█████▊    | 7/12 [00:25<00:16,  3.29s/it]
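
Once the URLs are collected, they can be grouped by variable. The sketch below assumes that file names follow the CMIP6 convention of starting with the variable name followed by an underscore; the URLs are hypothetical placeholders (host, path and time range are invented):

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical URLs, invented for illustration only.
urls = [
    "https://example.org/data/clt_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_185001-187912.nc",
    "https://example.org/data/rsdt_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_185001-187912.nc",
]

urls_by_variable = {}
for url in urls:
    # Keep only the file name, then read the variable from its first field.
    file_name = posixpath.basename(urlparse(url).path)
    variable = file_name.split("_")[0]
    urls_by_variable.setdefault(variable, []).append(url)

print(sorted(urls_by_variable))
```

Such a mapping makes it easy to spot a variable with no file or with more files than expected.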

In [None]:
query.get_download_script()