# Downloading from the ESGF archive

This notebook walks through the steps needed to download CORDEX data from the Earth System Grid Federation (ESGF) archive, from a single dataset up to multiple ensembles spanning various models, variables, and experiments. Before running any code, read the 'Before you start' section to make sure you have the appropriate setup.

### CORDEX

CORDEX, the Coordinated Regional Climate Downscaling Experiment, is an internationally coordinated effort to produce high-resolution regional climate model data for several of the world's key regions. Boundary conditions for each region are provided by an ensemble of General Circulation Models (GCMs), with high-resolution Regional Climate Models (RCMs) handling the dynamics within the region. The project has standardised a number of experiments for each GCM-RCM pair to run, including a historical run and one for each Representative Concentration Pathway (RCP). The full dataset can be browsed manually at https://esgf-data.dkrz.de/search/cordex-dkrz/. For the Pyrenees, the relevant domain is Europe at 0.11° resolution, which has the code 'EUR-11'.

Even for a single variable and a handful of experiments, the available data runs to tens of gigabytes spread over hundreds of individual files. For this reason, it is significantly more convenient to download the data we need programmatically. This notebook sets out how to do that using a handful of useful libraries, notably pyesgf for querying the ESGF database and aiohttp / asyncio for downloading files asynchronously.
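
To illustrate why asynchronous downloading helps with hundreds of files, here is a minimal sketch of the bounded-concurrency pattern using only asyncio. The `fetch` coroutine is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the URLs are placeholders; it is not the downloader used later in this notebook.

```python
import asyncio

async def fetch(url: str) -> str:
    # hypothetical stand-in for a real HTTP request (e.g. made through an aiohttp session)
    await asyncio.sleep(0.01)
    return url.rsplit('/', 1)[-1]

async def fetch_all(urls, limit=4):
    # cap the number of simultaneous requests so the server is not flooded
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(bounded(url) for url in urls))

# placeholder urls, for illustration only
urls = ['http://example.org/data/file_{}.nc'.format(i) for i in range(10)]
filenames = asyncio.run(fetch_all(urls))
print(filenames)
```

Inside Jupyter an event loop is already running, so `asyncio.run` only works there once `nest_asyncio.apply()` has been called, as this notebook does in its import cell.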

### Before you start

Before you start, please ensure you have:

- All libraries listed below installed with either pip or conda (installation instructions are available in each library's online documentation)
- An ESGF account on the German node: https://esgf-data.dkrz.de
- Approved access to the CORDEX datasets. Apply here: https://esg-dn1.nsc.liu.se/ac/subscribe/CORDEX_Research
    (you may need to apply twice before your account is flagged as approved)
- Environment variables saved for:
    - ESGF_USERNAME - the ESGF username for your German-node account
    - ESGF_PASSWORD - the ESGF password for your German-node account
    - DATA_HOME - a local path to the directory in which you want to store this data

You can set environment variables for the current session with the terminal commands:

        export ESGF_USERNAME=myusername
        export ESGF_PASSWORD=mypassword
        export DATA_HOME=path/to/data

Or save them for all future sessions by copying the above commands into your shell profile (e.g. ~/.bashrc for bash).
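
Before running the rest of the notebook, it can help to confirm that the three variables are actually visible to Python. The following is a small sketch; the helper name `missing_env_vars` is made up for this example.

```python
import os

REQUIRED = ('ESGF_USERNAME', 'ESGF_PASSWORD', 'DATA_HOME')

def missing_env_vars(required=REQUIRED, environ=None):
    # return the names of required variables that are unset or empty
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print('missing environment variables:', ', '.join(missing))
else:
    print('all required environment variables are set')
```

If anything is reported missing, set it in the terminal and restart the Jupyter server from that same terminal so the kernel inherits the new values.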

In [75]:
import os
import ssl
import pyesgf
import aiohttp
import asyncio
import xarray as xr
import libs.download as dl  # local helper module; assumed to provide download_ensemble
from itertools import product
from pyesgf.logon import LogonManager
from pyesgf.search import SearchConnection

# allow asyncio event loops to be nested inside Jupyter's own running loop
import nest_asyncio
nest_asyncio.apply()

In [59]:
# define your query
query = {
    'project': 'CORDEX',
    'domain': 'EUR-11',
    'experiment': 'rcp85',
    'variable': 'tas',
    'time_frequency': 'mon',
    'ensemble': 'r1i1p1'
}

# ensure the following are saved as environment variables
USERNAME = os.environ['ESGF_USERNAME']
PASSWORD = os.environ['ESGF_PASSWORD']
DATA_PATH = os.environ['DATA_HOME']

In [60]:
# check ESGF for number of datasets that satisfy query
conn = SearchConnection('http://esgf-data.dkrz.de/esg-search', distrib=True)
context = conn.new_context(**query, facets=query.keys())
context.hit_count

type(conn)

pyesgf.search.connection.SearchConnection

In [61]:
# login to ESGF and generate SSL context
myproxy_host = 'esgf-data.dkrz.de'

lm = LogonManager()
lm.logon(username=USERNAME, password=PASSWORD, hostname=myproxy_host)

sslcontext = ssl.create_default_context(purpose=ssl.Purpose.SERVER_AUTH)
sslcontext.load_verify_locations(capath=lm.esgf_certs_dir)
sslcontext.load_cert_chain(lm.esgf_credentials)

lm.is_logged_on()

True

In [62]:
type(sslcontext)

ssl.SSLContext

In [63]:
# generate results and check an example dataset to verify all is working as expected
results = context.search()
example_dataset = results[0]
example_dataset.dataset_id

'cordex.output.EUR-11.CLMcom.MPI-M-MPI-ESM-LR.rcp85.r1i1p1.CCLM4-8-17.v1.mon.tas.v20140515|esgf1.dkrz.de'
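
The dataset ID packs most of the query facets into one dot-separated string, followed by the data node after the `|`. The sketch below splits it into named fields. The field names are an assumption inferred from the example ID above, not taken from pyesgf, and IDs with a different number of fields are rejected rather than guessed at.

```python
def parse_dataset_id(dataset_id: str) -> dict:
    # the part after '|' is the data node hosting this copy of the dataset
    drs, _, data_node = dataset_id.partition('|')
    # field names are assumed from the example ID, not defined by pyesgf
    names = ('project', 'product', 'domain', 'institute', 'driving_model',
             'experiment', 'ensemble', 'rcm_name', 'rcm_version',
             'time_frequency', 'variable', 'version')
    parts = drs.split('.')
    if len(parts) != len(names):
        raise ValueError('unexpected dataset_id format: ' + dataset_id)
    fields = dict(zip(names, parts))
    fields['data_node'] = data_node
    return fields

example_id = ('cordex.output.EUR-11.CLMcom.MPI-M-MPI-ESM-LR.rcp85.r1i1p1.'
              'CCLM4-8-17.v1.mon.tas.v20140515|esgf1.dkrz.de')
fields = parse_dataset_id(example_id)
print(fields['driving_model'], fields['rcm_name'], fields['data_node'])
```

This is handy for a quick sanity check that a result really matches the domain, experiment, and variable you asked for.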

In [64]:
# and now an example file within that dataset, including its http download link
example_files = example_dataset.file_context().search(ignore_facet_check=True)
example_file = example_files[0]
example_file.download_url

'http://esgf1.dkrz.de/thredds/fileServer/cordex/cordex/output/EUR-11/CLMcom/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/CLMcom-CCLM4-8-17/v1/mon/tas/v20140515/tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_CLMcom-CCLM4-8-17_v1_mon_200601-201012.nc'
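
The file name at the end of the download URL encodes the variable, domain, GCM, experiment, ensemble, and RCM, separated by underscores. As a sketch, the function below turns a URL into a local path using the directory layout that `check_for_corrupt` expects later in this notebook (`cordex/EUR-11/<variable>/<experiment>/r1i1p1/<gcm>/<rcm>/`). The function name `local_path_for` is hypothetical, the field order is assumed from the example URL, and whether `download_ensemble` writes files this way is not shown here.

```python
import os
from urllib.parse import urlparse

def local_path_for(url: str, data_home: str) -> str:
    filename = os.path.basename(urlparse(url).path)
    # field order is assumed from the example file name above
    variable, domain, gcm, experiment, ensemble, rcm = filename.split('_')[:6]
    return os.path.join(data_home, 'cordex', domain, variable, experiment,
                        ensemble, gcm, rcm, filename)

example_url = ('http://esgf1.dkrz.de/thredds/fileServer/cordex/cordex/output/'
               'EUR-11/CLMcom/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/CLMcom-CCLM4-8-17/'
               'v1/mon/tas/v20140515/tas_EUR-11_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_'
               'CLMcom-CCLM4-8-17_v1_mon_200601-201012.nc')
path = local_path_for(example_url, 'data')
print(path)
```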

In [73]:
# call download_ensemble on the context generated by your query
downloads = dl.download_ensemble(context, DATA_PATH, ssl=sslcontext, verbose=True)

TypeError: download_ensemble() got an unexpected keyword argument 'ssl'

NB: If you are only downloading a handful of variables or experiments, the code above is all you need. The following code blocks are for cases where you want to leave the code running for a long time (e.g. overnight) and queue everything up for one big execution.

In [9]:
# or make multiple queries

queries = {
    'project': ['CORDEX'],
    'domain': ['EUR-11'],
    'experiment': ['historical', 'rcp26', 'rcp85'],
    'variable': ['tas', 'pr'],
    'time_frequency': ['mon'],
    'ensemble': ['r1i1p1']
}

def make_multiple_queries(
            queries: dict,
            conn: pyesgf.search.connection.SearchConnection
):
    '''
    Takes a queries dictionary with ESGF facets as keys and lists of strings as values.
    Prints a table of all configurations of query and the number of datasets that match.
    Returns a dictionary of contexts for each configuration.
    Has the structure -> dict[config] = context
    '''
    headers = queries.keys()
    values = queries.values()
    h = len(headers)
    contexts = {}

    print('querying ESGF...')
    print('found the following datasets matching your queries:\n')

    # print table
    print(('{:<14} '*h).format(*headers), 'hit_count')
    for config in product(*values):
        query = {key: value for key, value in zip(headers, config)}
        context = conn.new_context(**query, facets=headers)
        contexts[config] = context
        hit_count = context.hit_count
        print(('{:<14} '*h).format(*config), hit_count)

    return contexts

# call function
contexts = make_multiple_queries(queries=queries, conn=conn)

querying ESGF...
found the following datasets matching your queries:

project        domain         experiment     variable       time_frequency ensemble        hit_count
CORDEX         EUR-11         historical     tas            mon            r1i1p1          48
CORDEX         EUR-11         historical     pr             mon            r1i1p1          48
CORDEX         EUR-11         rcp26          tas            mon            r1i1p1          23
CORDEX         EUR-11         rcp26          pr             mon            r1i1p1          23
CORDEX         EUR-11         rcp85          tas            mon            r1i1p1          46
CORDEX         EUR-11         rcp85          pr             mon            r1i1p1          46
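
The six rows in the table come from taking every combination of the facet values, which is what `itertools.product` does. This standalone sketch shows the expansion on a trimmed-down queries dictionary, without contacting ESGF.

```python
from itertools import product

queries = {
    'experiment': ['historical', 'rcp26', 'rcp85'],
    'variable': ['tas', 'pr'],
}

# one dictionary per combination of facet values, in the order product yields them
configs = [dict(zip(queries, combo)) for combo in product(*queries.values())]

for config in configs:
    print(config)
```

The number of configurations is the product of the list lengths (here 3 x 2 = 6), so adding values to several facets at once multiplies the number of queries and downloads.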


In [13]:
# and then queue them to download one after the other

def download_multiple_ensembles(
            queries: dict,
            conn: pyesgf.search.connection.SearchConnection,
            verbose: bool = False
):
    '''
    Takes a queries dictionary in the same form as make_multiple_queries. Carries out said queries
    and then requests confirmation from the user to proceed. Following confirmation, continues to
    download each ensemble one by one
    '''
    contexts = make_multiple_queries(queries, conn)
    successfully_downloaded = set()
    encountered_errors = set()

    response = input('proceed? (y/n)')
    if response != 'y':
        return

    for config in contexts:

        print('\ndownloading next config:', *config)
        context = contexts[config]
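        # assumes the same helper and argument order as the single-ensemble call above
        downloads = dl.download_ensemble(context, DATA_PATH, verbose=verbose)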
        for dataset in downloads:
            successfully_downloaded |= dataset['success']
            encountered_errors |= dataset['errors']

    print('\nall downloads now complete. successfully downloaded {} out of {} datasets.'.format(
                len(successfully_downloaded),
                len(successfully_downloaded)+len(encountered_errors)
    ))
    print('a full list of datasets omitted due to errors is available below:\n')
    print(*encountered_errors, sep='\n')

# pull the trigger
download_multiple_ensembles(queries=queries, conn=conn)

querying ESGF...
found the following datasets matching your queries:

project        domain         experiment     variable       time_frequency ensemble        hit_count
CORDEX         EUR-11         historical     tas            mon            r1i1p1          48
CORDEX         EUR-11         historical     pr             mon            r1i1p1          48
CORDEX         EUR-11         rcp26          tas            mon            r1i1p1          23
CORDEX         EUR-11         rcp26          pr             mon            r1i1p1          23
CORDEX         EUR-11         rcp85          tas            mon            r1i1p1          46
CORDEX         EUR-11         rcp85          pr             mon            r1i1p1          46


In [183]:
def check_for_corrupt(path_to_data):
    '''
    Verify that every downloaded dataset opens cleanly and print the
    directories containing files that may be corrupt.
    Expects the layout cordex/EUR-11/<variable>/<experiment>/r1i1p1/<gcm>/<rcm>/.
    '''

    def entries(path):
        # list directory entries, skipping macOS .DS_Store files
        return [name for name in os.listdir(path) if '.DS_Store' not in name]

    root = os.path.join(path_to_data, 'cordex', 'EUR-11')
    errors = []

    for variable in entries(root):
        for experiment in entries(os.path.join(root, variable)):
            ens_path = os.path.join(root, variable, experiment, 'r1i1p1')
            for gcm in entries(ens_path):
                for rcm in entries(os.path.join(ens_path, gcm)):
                    rcm_path = os.path.join(ens_path, gcm, rcm)
                    filepaths = [os.path.join(rcm_path, f) for f in entries(rcm_path)]
                    try:
                        # opening all files together fails if any one is unreadable
                        xr.open_mfdataset(filepaths).close()
                    except Exception:
                        errors.append(rcm_path)

    print(*errors, sep='\n')

check_for_corrupt(DATA_PATH)
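
Opening every dataset with xarray catches unreadable files, but a complementary check is to compare each file against its published checksum. pyesgf file results expose `checksum` and `checksum_type`. The sketch below assumes you recorded the expected checksum when you downloaded the file and that it is SHA256, which will not hold for every file. The function names are made up for this example.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # read in chunks so large NetCDF files are never held in memory at once
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def is_intact(path: str, expected_checksum: str) -> bool:
    # hex digests are compared case-insensitively
    return sha256_of(path) == expected_checksum.lower()
```

A file that fails this check can simply be deleted and downloaded again, which is cheaper than re-running a whole ensemble.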


