# Downloading the needed CMIP6 data for one model's output

## Purpose of the notebook

This notebook retrieves the URL(s) needed to download the required CMIP6 variables for the **piClim-control** experiment and the **r1i1p1f1** variant of the **IPSL-CM6A-LR** model. It uses the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html.

Feel free to share, use and improve the following code under the license provided in the repository.

## Initialization 

### Set the environment parameter

We set this parameter to suppress the needless, harmless warnings that are displayed about the *facets* variable.

In [1]:
# ================ SET ENVIRONMENT PARAMETERS ================ #

### NO FACETS WARNING ###

%env ESGF_PYCLIENT_NO_FACETS_STAR_WARNING = 42

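The `%env` magic only exists inside IPython or Jupyter. As a minimal sketch, assuming the library only needs this variable to be present in the process environment before the search is run, the same setting could be made in a plain Python script as follows (the value `"1"` is an arbitrary choice, just like `42` above):

```python
import os

# Assumption: only the presence of this variable matters, so any
# non-empty string should do; "1" is an arbitrary choice.
os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"] = "1"

# Confirm that the variable is now visible to the current process.
print(os.environ["ESGF_PYCLIENT_NO_FACETS_STAR_WARNING"])
```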


### Imports

In [2]:
# ================ IMPORTS ================ #

from pyesgf.search import SearchConnection # the query package of pyesgf for cmip6 data

import os # to handle file paths

import pandas as pd # to store the urls in a table

from tqdm import tqdm # to make cool progress bars

### Set the path of the saving directory

In [4]:
# ================ SAVING PATH ================ #

### DEFINE WHERE TO DOWNLOAD THE FILES ###

save_dir = "/data/lgiboni/downloaded-cmip6-datadir-example/"

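The saving directory may not exist yet on a new machine. The sketch below shows one way to create it before any download is attempted. The helper `ensure_directory` is hypothetical (it is not part of the notebook), and the demonstration uses a temporary folder so that it runs anywhere; in the notebook one would pass `save_dir` instead.

```python
import os
import tempfile


def ensure_directory(path):
    """Create the directory (and any missing parents), then return its absolute path."""
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)


# Demonstration in a temporary location (hypothetical stand-in for save_dir).
demo_root = tempfile.mkdtemp()
demo_dir = ensure_directory(os.path.join(demo_root, "downloaded-cmip6-datadir-example"))

# The directory now exists; calling the helper again is harmless.
print(os.path.isdir(demo_dir))
```

Using `exist_ok=True` means the cell can be re-run without raising an error when the directory is already there.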
### Global criteria for the data search

In [5]:
# ================ COMMON SEARCH CRITERIA FOR OUR ANALYSIS ================ #

### CMIP6 DATA ###

project = ["CMIP6"]

### MONTHLY FREQUENCY ###

frequency = ["mon"]

### VARIABLES TO SEARCH FOR ###

variables = ['clt','rsdt','rsut','rsutcs','rsds','rsus','rsdscs','rsuscs','rlut','rlutcs','rlds','rlus']

## One model's output example for the piClim-control experiment

We look for the variables defined above for the **r1i1p1f1** variant of the **IPSL-CM6A-LR** model.

### Definition of the specific query parameters we need

We define the search criteria that are specific to our model and variant, as well as to the experiment we are looking for.

In [6]:
# ================ DEFINE THE ADDITIONAL SEARCH CRITERIA FOR A GIVEN MODEL'S OUTPUT ================ #

### SOURCE ID : NAME OF THE CHOSEN MODEL ###

source_id = ["IPSL-CM6A-LR"]

### VARIANT OF THE MODEL RUN ###

variant_label = ["r1i1p1f1"]

### EXPERIMENT ID ###

experiment_id = ["piClim-control"]

### WHOLE SET OF QUERY PARAMETERS ###

facets = 'project,frequency,variable,source_id,variant_label,experiment_id,latest,replica'

*facets* lists the query parameters the search service should report on, which here are exactly the criteria used in our query.\
The element *latest* specifies that we want the latest version of the data.\
The element *replica* specifies whether replicas (duplicate copies hosted on other nodes) should be included in our search results.

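Because the *facets* string has to stay in sync with the criteria actually used, one option is to derive it from a single dictionary. The following is only a sketch of that idea: the dictionary name `search_criteria` is hypothetical, and its values are copied from the cells above.

```python
# Hypothetical container gathering the criteria defined in the cells above.
search_criteria = {
    "project": ["CMIP6"],
    "frequency": ["mon"],
    "variable": ["clt", "rsdt", "rsut", "rsutcs", "rsds", "rsus",
                 "rsdscs", "rsuscs", "rlut", "rlutcs", "rlds", "rlus"],
    "source_id": ["IPSL-CM6A-LR"],
    "variant_label": ["r1i1p1f1"],
    "experiment_id": ["piClim-control"],
    "latest": True,
    "replica": False,
}

# Dictionaries keep their insertion order, so the facet names are joined
# in the order in which they are written above.
facets = ",".join(search_criteria)

print(facets)
```

With this layout, adding or removing a criterion updates the *facets* string automatically. The same dictionary could presumably be unpacked into the query as keyword arguments, since the criteria are passed by name there.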
### Run the actual search

We send the search to the IPSL node. Since the parameter *distrib* is set to **True**, the search is distributed across all the ESGF nodes rather than limited to the IPSL index, so datasets hosted elsewhere can also be matched.

In [7]:
# ================ RUN THE QUERY FOR THE PICLIM-CONTROL EXPERIMENT ================ #

### SET THE STARTING NODE FOR THE SEARCH ###

conn = SearchConnection('https://esgf-node.ipsl.upmc.fr/esg-search', distrib = True)

### QUERY ###

## Launch the query ##

query = conn.new_context(project = project,
                         frequency = frequency,
                         variable = variables,
                         source_id = source_id,
                         variant_label = variant_label,
                         experiment_id = experiment_id,
                         latest = True, # Set to "True" to get the latest version
                         replica = False, # Set to "False" to avoid getting duplicates in the search results
                         facets = facets) # facets lists the search criteria we have defined
                         


## Number of results ##

results_count = query.hit_count 

## Print the number of results ##

print(f"The search has returned {results_count} results")

The search has returned 12 results


Since we requested 12 variables, this count is consistent with one result per variable for the chosen model, variant and experiment.

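A hit count of 12 matches the number of requested variables, but the count alone does not show which variable each result belongs to. As a small sketch, assuming the variable name of each result has already been collected into a list (how to read it from the search results is left out here), the hypothetical helper `missing_variables` reports any requested variable that was not found:

```python
def missing_variables(requested, found):
    """Return the requested variable names absent from the found ones, in request order."""
    found_set = set(found)
    return [name for name in requested if name not in found_set]


requested = ['clt', 'rsdt', 'rsut', 'rsutcs', 'rsds', 'rsus',
             'rsdscs', 'rsuscs', 'rlut', 'rlutcs', 'rlds', 'rlus']

# Hypothetical case: pretend that 'rlus' was not returned by the search.
found = [name for name in requested if name != 'rlus']

print(missing_variables(requested, found))      # ['rlus']
print(missing_variables(requested, requested))  # []
```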
### Unpack and store the found URL(s)

Each result contains a certain number of files. Each file is associated with a unique URL that can be used to download it. The following cell extracts these URLs and stores them in a list. Here it is quite easy because we have only one URL per result.

In [8]:
# ================ UNPACK THE FOUND URLS ================ #

### INITIALIZE THE LIST OF URLS ###

urls = [] 

### RUN THE SEARCH ONCE AND KEEP ITS RESULTS ###

results = query.search()

### LOOP OVER THE FOUND RESULTS ###

## Cool progress bar for the loop ##

for ii in tqdm(range(results_count), desc="Extracting the urls"):

    ## Extract the ii-th result (without re-running the search at each iteration) ##

    result_ii = results[ii]

    ## Extract the list of file(s) associated with this ii-th result ##

    file_list_ii = result_ii.file_context().search()

    ## Loop over the files of the ii-th result ##

    for file_result in file_list_ii:

        ## Add its download url to the list of urls ##

        urls.append(file_result.download_url)

Processing: 100%|██████████| 12/12 [00:42<00:00,  3.57s/it]

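Once the list of URLs is filled, each file needs a destination inside the saving directory. The sketch below shows one possible way to build that local path from a URL with the `os` module. The helper `local_path_for` and the example URL are hypothetical placeholders, not values returned by the search.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, save_dir):
    """Build the local path of a file from its download url and the saving directory."""
    # Keep only the last component of the url path, i.e. the file name.
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(save_dir, filename)


# Placeholder url, used only to illustrate the mapping.
example_url = "https://example.org/some/path/example_file.nc"

print(local_path_for(example_url, "/data/lgiboni/downloaded-cmip6-datadir-example/"))
```

Parsing the URL first means that a possible query string after the file name would not end up in the local file name.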