# Data Ingest

This notebook contains a workflow for downloading meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF) Copernicus [Climate Data Store](https://cds.climate.copernicus.eu/) and uploading it to a Google Cloud Storage bucket.

In this case, we want to ingest hourly data from the fifth-generation ECMWF reanalysis (ERA5) over a 30-year interval. A single request covering the whole interval would be prohibitively large, so we issue one request per day. This data ingest is intended to support an analysis of Earth's radiation budget, so we request four variables to serve as boundary conditions:

* Top of Atmosphere Incident Solar Radiation (`toa_incident_solar_radiation`)
* Top of Atmosphere Net Solar Radiation (`top_net_solar_radiation`)
* Downwelling Solar Radiation at the Surface (`surface_solar_radiation_downwards`)
* Net Solar Radiation at the Surface (`surface_net_solar_radiation`)
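
To get a feel for what splitting by day implies, the following rough sketch counts the requests for a 30-year span. The bounds `span_start` and `span_end` are hypothetical, chosen only for illustration; the notebook does not state the actual interval.

```python
from datetime import date

# Hypothetical 30-year span (assumption; not taken from the notebook).
span_start = date(1991, 1, 1)
span_end = date(2021, 1, 1)  # exclusive upper bound

# One request per calendar day.
n_requests = (span_end - span_start).days

# Each daily request covers 24 hourly time steps for each of the 4 variables.
hours_per_day = 24
n_variables = 4
n_fields = n_requests * hours_per_day * n_variables

print(n_requests)  # 10958
print(n_fields)    # 1051968
```

Roughly eleven thousand independent requests is what makes a parallel, per-day workflow attractive here.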

## Preliminaries

This notebook depends on several Python packages:

In [1]:
%pip install -q cdsapi joblib tqdm urllib3 certifi gsutil

Note: you may need to restart the kernel to use updated packages.


## Imports

In [2]:
import os
from datetime import timedelta, date
import multiprocessing

import cdsapi
import certifi
from joblib import Parallel, delayed
from tqdm.notebook import trange, tqdm
import urllib3

http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)

In [3]:
!gsutil version -l

gsutil version: 4.62
checksum: PACKAGED_GSUTIL_INSTALLS_DO_NOT_HAVE_CHECKSUMS (!= fe14a00285d4702ed626050d0f9ae955)
boto version: 2.49.0
python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
OS: Linux 4.19.0-16-cloud-amd64
multiprocessing available: True
using cloud sdk: False
pass cloud sdk credentials to gsutil: False
config path(s): /etc/boto.cfg, /home/jupyter/.boto
gsutil path: /opt/conda/bin/gsutil
compiled crcmod: True
installed via package manager: True
editable install: False


## Functions

In [4]:
def daterange(start_date, end_date):
    """Generate the dates from start_date up to, but not including, end_date.

    Args:
      start_date: date object to start from (inclusive).
      end_date: date object to stop before (exclusive).

    Yields:
      date object for each day in the range, with a tqdm progress bar.
    """
    for n in trange((end_date - start_date).days):
        yield start_date + timedelta(days=n)


def ingest_cds_gcp(single_date, bucket):
    """Retrieve one day of data from the Climate Data Store and upload it to Google Cloud Storage.

    Args:
        single_date: date object representing the day to retrieve data for.
        bucket: name of the Google Cloud Storage bucket to upload to.

    Returns:
        None; the data is uploaded to Google Cloud Storage as a side effect.
    """
    # Local file name for the download, e.g. 19910113.nc
    filename = f"{single_date.strftime('%Y%m%d')}.nc"
    c = cdsapi.Client(progress=False, quiet=True)
    c.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": [
                "surface_solar_radiation_downwards", "toa_incident_solar_radiation",
                "surface_net_solar_radiation", "top_net_solar_radiation"
            ],
            "year": single_date.strftime("%Y"),
            "month": single_date.strftime("%m"),
            "day": single_date.strftime("%d"),
            # All 24 hourly time steps: "00:00", "01:00", ..., "23:00"
            "time": [f"{hour:02d}:00" for hour in range(24)],
            "format": "netcdf",
        },
        filename,
    )
    # Upload the file; the local copy is removed only if the upload succeeds.
    os.system(f"gsutil -m cp -r {filename} gs://{bucket}/ && rm {filename}")
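
Note that `daterange` is half-open: the end date itself is never yielded. The sketch below illustrates this with a progress-bar-free variant, `plain_daterange`, which is a hypothetical helper introduced here only for demonstration, together with the file names that the per-day naming scheme would produce.

```python
from datetime import date, timedelta


def plain_daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(days=n)


# Three days are produced; 1991-01-16 is excluded.
days = list(plain_daterange(date(1991, 1, 13), date(1991, 1, 16)))

# Same YYYYMMDD.nc naming pattern as used for the downloads.
filenames = [f"{d:%Y%m%d}.nc" for d in days]
print(filenames)  # ['19910113.nc', '19910114.nc', '19910115.nc']
```

To include the final day of an interval, pass the day after it as `end_date`.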

## Workflow

In [5]:
start_date = date(1991, 1, 13)
end_date = date(1991, 1, 14)  # exclusive: only 1991-01-13 is retrieved
bucket = "era5-single-level"

Parallel(n_jobs=-2,  # use all but one CPU
         backend="multiprocessing")(delayed(ingest_cds_gcp)(day, bucket) 
                                    for day in daterange(start_date, end_date))

  0%|          | 0/1 [00:00<?, ?it/s]

2021-06-03 21:32:24,506 INFO Welcome to the CDS
2021-06-03 21:32:24,507 INFO Sending request to https://cds.climate.copernicus.eu/api/v2/resources/reanalysis-era5-single-levels
2021-06-03 21:32:24,615 INFO Request is queued
2021-06-03 21:32:33,157 INFO Request is running
2021-06-03 21:32:46,034 INFO Request is completed
2021-06-03 21:32:46,035 INFO Downloading https://download-0001.copernicus-climate.eu/cache-compute-0001/cache/data4/adaptor.mars.internal-1622755952.7667806-19122-31-56963ef6-3919-4087-a597-f65990f05c96.nc to 19910113.nc (190.1M)
2021-06-03 21:33:13,262 INFO Download rate 7M/s


[None]
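
Because `ingest_cds_gcp` returns nothing, the `[None]` result above does not by itself confirm that the upload succeeded. One possible refinement, sketched here under assumptions, is to run the copy through `subprocess` and report the exit status. The helper name `upload_and_remove` and its `run` parameter are hypothetical and not part of the original workflow; `run` exists only so the command runner can be swapped out for testing.

```python
import os
import subprocess


def upload_and_remove(path, bucket, run=subprocess.run):
    """Copy `path` to gs://<bucket>/ and delete the local file on success.

    Returns True if the copy command exited with status 0, otherwise False.
    """
    result = run(["gsutil", "cp", path, f"gs://{bucket}/"])
    if result.returncode != 0:
        # Keep the local file so the upload can be retried later.
        return False
    os.remove(path)
    return True
```

If the ingest function returned this boolean, the list produced by `Parallel` would show which days need to be retried.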