# Downloading needed CMIP6 data

## Purpose of the notebook

This notebook retrieves the URLs needed to download the CMIP6 data required for a given analysis. It uses the *esgf-pyclient* library: https://esgf-pyclient.readthedocs.io/en/latest/index.html.

Feel free to share, use and improve the following code under the license provided in the repository.

## Initialization 

### Imports

In [3]:
# ================ IMPORTS ================ #

from pyesgf.search import SearchConnection
import os
import pandas as pd
import requests

### Global criteria for the data search

In [4]:
# ================ COMMON SEARCH CRITERIA FOR BOTH EXPERIMENTS ================ #

### CMIP6 DATA ###

project = ["CMIP6"]

### MONTHLY FREQUENCY ###

frequency = ["mon"]

### VARIABLES TO SEARCH FOR IN BOTH EXPERIMENTS ###

variables = ['clt','rsdt','rsut','rsutcs','rsds','rsus','rsdscs','rsuscs','rlut','rlutcs','rlds','rlus']

## One model's output example for the piClim-control experiment

We search for the variables defined above for the **r1i1p1f1** variant of the **IPSL-CM6A-LR** model.

### Definition of the specific query parameters we need

We define the search parameters that are specific to our model and variant, as well as to the experiment we are looking for.

In [32]:
# ================ DEFINE THE ADDITIONAL SEARCH CRITERIA FOR A GIVEN MODEL'S OUTPUT ================ #

### SOURCE ID : NAME OF THE CHOSEN MODEL ###

source_id = ["IPSL-CM6A-LR"]

### VARIANT OF THE MODEL RUN ###

variant_label = ["r1i1p1f1"]

### EXPERIMENT ID ###

experiment_id = ["piClim-control"]

### WHOLE SET OF QUERY PARAMETERS ###

facets = 'project,frequency,variable,source_id,variant_label,experiment_id,latest,replica'

*facets* lists the facets for which the search returns counts, so that we can confirm that the query holds all the parameters we are looking for.\
The element *latest* specifies that we want the latest version of the data.\
The element *replica* specifies whether duplicates (replicas) are included in the search results.
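
One way to keep the facets string consistent with the search criteria is to derive it from them. The sketch below is only an illustration: the `constraints` dictionary is a hypothetical name that does not appear in the notebook, and the variable list is shortened.

```python
# Hypothetical dictionary gathering every search criterion in one place.
constraints = {
    "project": ["CMIP6"],
    "frequency": ["mon"],
    "variable": ["clt", "rsdt"],  # shortened list, for illustration only
    "source_id": ["IPSL-CM6A-LR"],
    "variant_label": ["r1i1p1f1"],
    "experiment_id": ["piClim-control"],
    "latest": True,
    "replica": False,
}

# Joining the dictionary keys gives the comma-separated facets string,
# so a criterion can never be set without also being listed as a facet.
facets = ",".join(constraints)

print(facets)
```

Under this assumption, the query could presumably be launched with `conn.new_context(facets=facets, **constraints)`, which passes the same keyword arguments as the explicit call used in this notebook.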

### Run the actual search

We start the search with the IPSL node. Since the *distrib* parameter is set to **True**, the search is distributed across all the ESGF nodes rather than limited to this one.

In [36]:
# ================ RUN THE QUERY FOR THE PICLIM-CONTROL EXPERIMENT ================ #

### SET THE STARTING NODE FOR THE SEARCH ###

conn = SearchConnection('https://esgf-node.ipsl.upmc.fr/esg-search', distrib = True)

### QUERY ###

## Launch the query ##

query = conn.new_context(project = project,
                         frequency = frequency,
                         variable = variables,
                         source_id = source_id,
                         variant_label = variant_label,
                         experiment_id = experiment_id,
                         latest = True, # Set to True to get the latest version
                         replica = False, # Set to False to avoid duplicates in the search results
                         facets = facets) # Facets for which counts are returned, to check the search criteria

## Number of results ##

results_count = query.hit_count 

## Print the number of results ##

print(f"The search has returned {results_count} results")

The search has returned 12 results


In [39]:
query.facet_counts

{"'project": {},
 'frequency': {'mon': 12},
 'variable': {'rsutcs': 1,
  'rsut': 1,
  'rsuscs': 1,
  'rsus': 1,
  'rsdt': 1,
  'rsdscs': 1,
  'rsds': 1,
  'rlutcs': 1,
  'rlut': 1,
  'rlus': 1,
  'rlds': 1,
  'clt': 1},
 'source_id': {'IPSL-CM6A-LR': 12},
 'variant_label': {'r1i1p1f1': 12},
 'experiment_id': {'piClim-control': 12},
 'latest': {'true': 12},
 "replica'": {}}

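The facet counts can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming the counts are copied from the output above; inside the notebook, `variable_counts` would be `query.facet_counts['variable']`. The names `requested`, `variable_counts`, `missing` and `duplicated` are illustrative and not part of the original code.

```python
# Requested variables, copied from the global search criteria.
requested = ['clt', 'rsdt', 'rsut', 'rsutcs', 'rsds', 'rsus',
             'rsdscs', 'rsuscs', 'rlut', 'rlutcs', 'rlds', 'rlus']

# Counts copied from the 'variable' entry of the facet counts shown above.
variable_counts = {'rsutcs': 1, 'rsut': 1, 'rsuscs': 1, 'rsus': 1,
                   'rsdt': 1, 'rsdscs': 1, 'rsds': 1, 'rlutcs': 1,
                   'rlut': 1, 'rlus': 1, 'rlds': 1, 'clt': 1}

# Variables that were requested but returned no dataset.
missing = sorted(set(requested) - set(variable_counts))

# Variables matched by more than one dataset.
duplicated = sorted(name for name, count in variable_counts.items() if count > 1)

print(f"Missing variables: {missing}")
print(f"Variables with more than one dataset: {duplicated}")
```

Here both lists are empty, which is consistent with the 12 results returned for the 12 requested variables.
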
### Unpack the URLs found

In [42]:
files_list

%env ESGF_PYCLIENT_NO_FACETS_STAR_WARNING = 42



In [43]:
for dataset in query.search(): # Run the search once and loop over the datasets found

    files_list = dataset.file_context().search()

    for file in files_list:

        print(file.download_url)

http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/clt/gr/v20181204/clt_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc
http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/rsdscs/gr/v20181204/rsdscs_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc
http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/rlutcs/gr/v20181204/rlutcs_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc
http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/rsds/gr/v20181204/rsds_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc
http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/rlds/gr/v20181204/rlds_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc
http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMI
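
The file name at the end of each URL printed above follows the CMIP6 file name template, so its fields can be recovered by splitting on underscores. This is a small sketch; the `describe` function and the `FIELDS` tuple are hypothetical helpers, and files without a time range (fixed fields) would simply yield one field fewer.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Field names of the CMIP6 file name template, in order.
FIELDS = ("variable_id", "table_id", "source_id", "experiment_id",
          "variant_label", "grid_label", "time_range")


def describe(url):
    """Split the file name at the end of a download URL into named fields."""
    stem = PurePosixPath(urlparse(url).path).stem  # file name without ".nc"
    return dict(zip(FIELDS, stem.split("_")))


# First URL of the output above.
url = ("http://vesg.ipsl.upmc.fr/thredds/fileServer/cmip6/RFMIP/IPSL/"
       "IPSL-CM6A-LR/piClim-control/r1i1p1f1/Amon/clt/gr/v20181204/"
       "clt_Amon_IPSL-CM6A-LR_piClim-control_r1i1p1f1_gr_201401-204312.nc")

info = describe(url)
print(info["variable_id"], info["time_range"])
```

Such a dictionary can be used, for instance, to group the URLs by variable before downloading them.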

Each result contains one or more files. Each file is associated with a unique URL that can be used to download it. The following cell extracts these URLs and stores them in a list.

In [None]:
# ================ UNPACK THE FOUND URLS ================ #

urls = []

### LOOP OVER THE FOUND RESULTS ###

datasets = query.search() # Run the search once and keep the resulting datasets

for i, dataset in enumerate(datasets):

    files_list = dataset.file_context().search() # List the files contained in this dataset

    for file in files_list: # Iterate over the files to extract their URLs

        urls.append(file.download_url)

    print(f"Result {i+1} out of {results_count} processed")

# Saving the URLs in an Excel spreadsheet
df = pd.DataFrame(urls, columns = ["Links"])

df.to_excel("C:/Users/gilunga/Documents/files_url.xlsx")
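
The notebook stops once the URLs are saved. As a possible next step, the sketch below shows one way to download a file from such a URL using only the Python standard library. The `download_file` function is a hypothetical helper, not part of the original code, and it does not handle authentication, retries or checksum verification.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url, target_dir):
    """Download one URL into target_dir, skipping files that already exist."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # The local file name is the last component of the URL path.
    destination = target_dir / Path(urlparse(url).path).name
    if destination.exists():
        return destination

    # Write to a temporary name first, so that an interrupted transfer
    # does not leave a partial file that looks complete.
    partial = destination.with_name(destination.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(destination)

    return destination
```

With the `urls` list built above, a loop such as `for url in urls: download_file(url, "cmip6_data")` would fetch every file, where `"cmip6_data"` is an arbitrary folder name chosen for this example.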
