# Data Ingest

This notebook contains a workflow to:
1. Download meteorological reanalysis data from the Copernicus [Climate Data Store](https://cds.climate.copernicus.eu/) and
2. Upload the data to a Google Cloud Storage bucket.

This data ingest is necessary to support an analysis of Earth's radiation budget, so we will request four variables to serve as boundary conditions:

* Top of Atmosphere Incident Solar Radiation (`toa_incident_solar_radiation`)
* Top of Atmosphere Net Solar Radiation (`top_net_solar_radiation`)
* Downwelling Solar Radiation at the Surface (`surface_solar_radiation_downwards`)
* Net Solar Radiation at the Surface (`surface_net_solar_radiation`)

## Preliminaries

### Requirements

* A Copernicus Climate Data Store account ([Create new account](https://cds.climate.copernicus.eu/user/register))
* A Google Cloud project with Cloud Storage enabled
* The following Python packages:

In [1]:
%pip install -q cdsapi joblib tqdm urllib3 certifi gsutil

Note: you may need to restart the kernel to use updated packages.


### Imports

In [2]:
import logging
from os import environ, path, system
from datetime import timedelta, date

import cdsapi
import certifi
import contextlib
import joblib
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
import urllib3

## Setup

Our analysis seeks a long-term estimate of the amount of outgoing radiation that Earth's surface can reflect. ERA5 has 42 years of hourly data available, and a long-term climatology is typically defined as 30 years, so we ingest the latest 30 years: 1991 through 2020. Because a single request covering the whole period would be prohibitively large, we break the request up by day. The dates set in the cell below cover only May 1991 (the end date is exclusive); widen `start_date` and `end_date` to ingest the full period.
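
As a rough sizing check, the number of daily requests implied by the full 1991 through 2020 window can be computed directly. This is only an illustrative sketch: the variable names are hypothetical, and the field count simply assumes 24 hourly steps for each of the four variables listed above.

```python
from datetime import date

# Climatology window from the text: 1991 through 2020 (the end date is exclusive).
climatology_start = date(1991, 1, 1)
climatology_end = date(2021, 1, 1)

n_days = (climatology_end - climatology_start).days  # one CDS request per day
n_hourly_fields = n_days * 24 * 4                    # 24 hours x 4 variables

print(f"{n_days} daily requests, {n_hourly_fields} hourly fields")
```

Splitting by day keeps each individual request small, at the cost of issuing roughly eleven thousand of them for the full period.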

Our ingest workflow downloads ECMWF Copernicus Climate Data Store (CDS) files locally, uploads them to a Google Cloud Storage (GCS) bucket, and removes the local copies. We define a GCS bucket here, although a user may use whatever bucket they choose in their Google Cloud project.

The CDS Client API must be configured prior to use ([How to use the CDS API](https://cds.climate.copernicus.eu/api-how-to)). On initialization, `cdsapi.Client()` looks for a `url` and `key` in environment variables, in a `.cdsapirc` file, or in the arguments passed to the constructor.
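
The lookup order described above can be sketched as a small helper. This is not how `cdsapi` itself resolves its configuration. The function name `read_cds_credentials` is hypothetical, and the parser assumes the `.cdsapirc` file holds plain `name: value` lines such as `url: ...` and `key: ...`.

```python
from os import environ, path


def read_cds_credentials(rc_path=None, env=None):
    """Return (url, key) from the environment or a .cdsapirc-style file, else None."""
    env = environ if env is None else env
    if "CDSAPI_URL" in env and "CDSAPI_KEY" in env:
        return env["CDSAPI_URL"], env["CDSAPI_KEY"]

    # Fall back to the rc file (assumed format: one "name: value" pair per line).
    rc_path = rc_path or path.join(path.expanduser("~"), ".cdsapirc")
    if not path.isfile(rc_path):
        return None

    settings = {}
    with open(rc_path) as f:
        for line in f:
            if ":" in line:
                # Split on the first colon only: the URL itself contains colons.
                name, value = line.split(":", 1)
                settings[name.strip()] = value.strip()

    if "url" in settings and "key" in settings:
        return settings["url"], settings["key"]
    return None
```

If the helper returns a pair, it could be passed explicitly as `cdsapi.Client(url=url, key=key)`; if it returns `None`, the client has nothing to authenticate with and the request is expected to fail.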

In [3]:
start_date = date(1991, 5, 1)
end_date = date(1991, 6, 1)
hourly_data_bucket = "era5-single-level"

# ECMWF C3S CDS Client Configuration
# cdsapi.Client() needs a url and key; warn early if no source for them is available.
if "CDSAPI_URL" not in environ or "CDSAPI_KEY" not in environ:
    # Fall back to .cdsapirc
    if not path.isfile(path.join(path.expanduser("~"), ".cdsapirc")):
        logging.warning("No $CDSAPI_URL, $CDSAPI_KEY, or .cdsapirc found.")

# Certificate management
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)

In [4]:
!gsutil version -l

gsutil version: 4.62
checksum: PACKAGED_GSUTIL_INSTALLS_DO_NOT_HAVE_CHECKSUMS (!= fe14a00285d4702ed626050d0f9ae955)
boto version: 2.49.0
python version: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
OS: Linux 4.19.0-16-cloud-amd64
multiprocessing available: True
using cloud sdk: False
pass cloud sdk credentials to gsutil: False
config path(s): /etc/boto.cfg, /home/jupyter/.boto
gsutil path: /opt/conda/bin/gsutil
compiled crcmod: True
installed via package manager: True
editable install: False


## Functions

In [5]:
@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Patch joblib to report into tqdm progress bar given as argument."""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date.
    
    Args:
      start_date: date object to start from (inclusive).
      end_date: date object to stop before (exclusive).
    
    Yields:
      date object for iteration.
    """
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def ingest_cds_to_gcs(single_date, bucket, cleanup=False):
    """Retrieve data from the Copernicus Climate Data Store and upload to Google Cloud Storage.
    
    Args:
        single_date: date object representing day to retrieve data for.
        bucket: Google Cloud Storage bucket to upload to.
        cleanup: Optionally remove the downloaded CDS data after upload to GCS.
        
    Returns:
        Nothing; downloads a file, uploads it to GCS, and optionally removes it as side effects.
    """
    filename = f"{single_date.strftime('%Y%m%d')}.nc"

    c = cdsapi.Client(progress=False, quiet=True)
    c.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": [
                "surface_solar_radiation_downwards", "toa_incident_solar_radiation",
                "surface_net_solar_radiation", "top_net_solar_radiation"
            ],
            "year": single_date.strftime("%Y"),
            "month": single_date.strftime("%m"),
            "day": single_date.strftime("%d"),
            # All 24 hours of the day, "00:00" through "23:00".
            "time": [f"{hour:02d}:00" for hour in range(24)],
            "format": "netcdf",
        },
        filename,
    )

    # Upload to GCS; remove the local copy only if the upload succeeded.
    upload_status = system(f"gsutil -m cp -r {filename} gs://{bucket}/")
    if upload_status == 0 and cleanup:
        system(f"rm {filename}")
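
The upload-and-cleanup step above shells out through `os.system`. One possible alternative, sketched here under assumptions rather than taken from the notebook, passes `gsutil` an argument list through `subprocess.run`, which avoids shell quoting issues. The helper names and the injectable `run` parameter are hypothetical and exist only to make the logic easy to exercise without a real bucket.

```python
import subprocess
from pathlib import Path


def gsutil_cp_command(local_file, bucket):
    """Build the argument list for copying one local file into a bucket."""
    return ["gsutil", "cp", str(local_file), f"gs://{bucket}/"]


def upload_and_cleanup(local_file, bucket, cleanup=False, run=subprocess.run):
    """Upload a file with gsutil; remove the local copy only if the upload succeeded."""
    result = run(gsutil_cp_command(local_file, bucket))
    succeeded = result.returncode == 0
    if succeeded and cleanup:
        Path(local_file).unlink()
    return succeeded
```

Returning the success flag lets a caller collect the days that failed and retry them, instead of silently keeping a local file that never reached the bucket.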

## Workflow

The workflow is straightforward: for each day from the start date up to (but not including) the end date, request hourly data for the four variables, download it, and upload it to Google Cloud Storage. The requests run in parallel with `joblib`, and `tqdm` tracks the progress of the workflow.
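
Before launching the downloads, it can help to preview which objects the workflow is expected to write. The sketch below re-derives the per-day file names used by `ingest_cds_to_gcs`; the helper name `planned_objects` is hypothetical, and the dates and bucket are the ones set in the Setup cell.

```python
from datetime import date, timedelta


def planned_objects(start_date, end_date, bucket):
    """List the GCS object paths the workflow would write, one per day."""
    n_days = (end_date - start_date).days  # end_date is exclusive
    return [
        f"gs://{bucket}/{(start_date + timedelta(n)).strftime('%Y%m%d')}.nc"
        for n in range(n_days)
    ]


objects = planned_objects(date(1991, 5, 1), date(1991, 6, 1), "era5-single-level")
print(len(objects), objects[0], objects[-1])
# → 31 gs://era5-single-level/19910501.nc gs://era5-single-level/19910531.nc
```

The count of 31 matches the total shown by the progress bar for this date range.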

In [6]:
days = list(daterange(start_date, end_date))

with tqdm_joblib(tqdm(total=len(days))):
    Parallel(n_jobs=-2, backend="multiprocessing")(  # n_jobs=-2: use all but one CPU
        delayed(ingest_cds_to_gcs)(day, hourly_data_bucket, cleanup=True)
        for day in days
    )

  0%|          | 0/31 [00:00<?, ?it/s]

## Discussion

In the next notebook, `02-Processing.ipynb`, we'll demonstrate the steps necessary to distill the ingested hourly data into daily-mean data.