# Data Ingest

This notebook contains a workflow to:
1. Download meteorological reanalysis data from the Copernicus [Climate Data Store](https://cds.climate.copernicus.eu/) and
2. Upload the data to a Google Cloud Storage bucket.

This data ingest is necessary to support an analysis of Earth's radiation budget, so we will request four variables to serve as boundary conditions:

* Top of Atmosphere Incident Solar Radiation (`toa_incident_solar_radiation`)
* Top of Atmosphere Net Solar Radiation (`top_net_solar_radiation`)
* Downwelling Solar Radiation at the Surface (`surface_solar_radiation_downwards`)
* Net Solar Radiation at the Surface (`surface_net_solar_radiation`)

## Preliminaries

### Requirements

* A Copernicus Climate Data Store account ([Create new account](https://cds.climate.copernicus.eu/user/register))
* A Google Cloud project with Cloud Storage enabled
* The following Python packages:

In [1]:
%pip install -q cdsapi joblib tqdm urllib3 certifi google-cloud-storage

Note: you may need to restart the kernel to use updated packages.


### Imports

In [2]:
from datetime import timedelta, date
import logging
from os import environ, path, system
import sys

import cdsapi
import certifi
import contextlib
from google.cloud import storage
import joblib
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
import urllib3

## Setup

Our analysis seeks a long-term estimate of the amount of outgoing radiation that Earth's surface can reflect. ERA5 has 42 years of hourly data available, and a long-term climatology is typically defined as 30 years, so we ingest the latest 30 years: 1991 through 2020. Because a single request for the full period would be prohibitively large, we break the request up by day. The configuration cell below is set to a one-month sample (December 1992, with `end_date` exclusive); widen `start_date` and `end_date` to cover the full period.
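
To give a sense of scale, the short calculation below counts how many daily requests the full 1991 through 2020 period implies. The variable names are illustrative only, and the end date is treated as exclusive to match the `daterange` helper defined later in this notebook.

```python
from datetime import date

# One request per day, with the end date treated as exclusive.
climatology_start = date(1991, 1, 1)
climatology_end = date(2021, 1, 1)  # exclusive, so 2020-12-31 is the last day

n_requests = (climatology_end - climatology_start).days
print(n_requests)  # 10958 daily requests (30 years, 8 of them leap years)
```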

Our ingest workflow downloads ECMWF Copernicus Climate Data Store (CDS) files locally, uploads them to a Google Cloud Storage bucket, and removes the local file. We define a GCS bucket here, although a user may use whatever bucket they choose in their Google Cloud project.

The CDS Client API must be configured prior to use ([How to use the CDS API](https://cds.climate.copernicus.eu/api-how-to)). On initialization, `cdsapi.Client()` looks for a `url` and `key` in environment variables, in a `.cdsapirc` file, or in the class input arguments.
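
As a rough illustration of that lookup, the sketch below reports which configuration source is present on the current machine. `find_cds_config_source` is a hypothetical helper, not part of `cdsapi`, and the order in which it checks the sources is an assumption made for this example rather than documented client behavior.

```python
import os
from pathlib import Path


def find_cds_config_source(env=None, rc_path=None):
    """Return "environment", "rcfile", or None depending on what is configured.

    Hypothetical helper for illustration: it only checks that the settings
    exist, not that the credentials are valid.
    """
    env = os.environ if env is None else env
    rc_path = Path.home() / ".cdsapirc" if rc_path is None else Path(rc_path)

    # Assumed order: environment variables first, then the rc file.
    if env.get("CDSAPI_URL") and env.get("CDSAPI_KEY"):
        return "environment"
    if rc_path.is_file():
        return "rcfile"
    return None


print(find_cds_config_source())
```

A result of `None` means neither source was found, in which case the `url` and `key` would have to be passed to `cdsapi.Client()` directly.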

In [3]:
start_date = date(1992, 12, 1)
end_date = date(1993, 1, 1)
hourly_data_bucket = "era5-single-level"
n_jobs = -3  # number of jobs for parallelization; if 1, then serial; if negative, then (n_cpus + 1 + n_jobs) are used

# ECMWF C3S CDS Client Configuration
# Warn early if neither the environment variables nor ~/.cdsapirc are available.
if "CDSAPI_URL" not in environ or "CDSAPI_KEY" not in environ:
    # Fall back to .cdsapirc
    if not path.isfile(path.join(path.expanduser("~"), ".cdsapirc")):
        logging.warning("No $CDSAPI_URL, $CDSAPI_KEY, or .cdsapirc found.")

# Certificate management
http = urllib3.PoolManager(
    cert_reqs="CERT_REQUIRED",
    ca_certs=certifi.where()
)

# force=True (Python 3.8+) replaces any handler installed by an earlier logging
# call, such as the warning above; without it this call would be a no-op.
logging.basicConfig(filename="ingest.log", filemode="w", level=logging.INFO, force=True)
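
The comment on `n_jobs` above describes how negative values are interpreted. The sketch below spells that rule out with a hypothetical helper, `resolve_n_jobs`, which is not part of joblib. Clamping the result to at least one worker is an assumption of this sketch; joblib's own `effective_n_jobs` function is the authoritative way to query the count.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate a joblib-style n_jobs value into a worker count (hypothetical helper)."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    n_cpus = (os.cpu_count() or 1) if n_cpus is None else n_cpus
    if n_jobs < 0:
        # Negative values count back from the number of CPUs: -1 means all CPUs.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-3, n_cpus=8))  # 6
```

With `n_jobs = -3`, all but two CPUs are used, which leaves some headroom for the notebook kernel itself.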

## Functions

In [4]:
@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Patch joblib to report into tqdm progress bar given as argument."""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()


def daterange(start_date, end_date):
    """Make a date range object spanning two dates.
    
    Args:
      start_date: date object to start from.
      end_date: date object to end at (exclusive).
    
    Yields:
      date object for iteration.
    """
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)
        

def put_date_data_gcs(single_date, bucket_name):
    """Upload a dataset for a single date to Google Cloud Storage.
    
    Args:
        single_date: date object representing day to retrieve data for.
        bucket_name: Google Cloud Storage bucket to upload to.
        
    Returns:
        Nothing; uploads data to Google Cloud Storage as side effect.
    """
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)    
    blob = bucket.blob(f"{single_date.strftime('%Y%m%d')}.nc")
    blob.upload_from_filename(filename=f"./{single_date.strftime('%Y%m%d')}.nc")
    

def check_blob_size(single_date, bucket_name, raise_threshold=1):
    """Verify that a GCS blob is larger than a specified threshold.
    
    Args:
        single_date: date object representing the day whose blob is checked.
        bucket_name: Google Cloud Storage bucket that holds the blob.
        raise_threshold: file size in bytes below which an exception should be raised.
    """
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)    
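    blob = bucket.get_blob(f"{single_date.strftime('%Y%m%d')}.nc")
    # get_blob returns None when the object does not exist in the bucket
    if blob is None:
        raise Exception(f"{single_date.strftime('%Y%m%d')} data file was not found in GCS")
    if blob.size < raise_threshold:
        raise Exception(f"{single_date.strftime('%Y%m%d')} data file size is smaller than expected")
    else:
        logging.info(f"{single_date.strftime('%Y%m%d')} file size in GCS is {int(blob.size * 1e-6)}MB")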


def ingest_cds_to_gcs(single_date, bucket_name, cleanup=False):
    """Retrieve data from the Copernicus Climate Data Store and upload it to Google Cloud Storage.
    
    Args:
        single_date: date object representing day to retrieve data for.
        bucket_name: Google Cloud Storage bucket to upload to.
        cleanup: Optionally remove the downloaded CDS data after upload to GCS.
        
    Returns:
        Nothing; downloads a file, uploads it to GCS, and removes it (optionally) as side effects.
    """
    client = cdsapi.Client(progress=False, quiet=True)
    client.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": [
                "surface_solar_radiation_downwards", "toa_incident_solar_radiation",
                "surface_net_solar_radiation", "top_net_solar_radiation"
            ],
            "year": single_date.strftime("%Y"),
            "month": single_date.strftime("%m"),
            "day": single_date.strftime("%d"),
            # All 24 hourly time steps: "00:00" through "23:00"
            "time": [f"{hour:02d}:00" for hour in range(24)],
            "format": "netcdf",
        },
        f"{single_date.strftime('%Y%m%d')}.nc",
    )
    
    put_date_data_gcs(single_date, bucket_name)
    check_blob_size(single_date, bucket_name, raise_threshold=1.8e+8)
    
    if cleanup:
        system(f"rm {single_date.strftime('%Y%m%d')}.nc")
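
The notebook does not say how the `raise_threshold=1.8e+8` value was chosen. One plausible reading, sketched below, is that it sits a little under the size expected for a complete daily file. Every number in the estimate other than the threshold itself is an assumption made for illustration: a 0.25 degree global grid of 721 by 1440 points and values stored as packed 16-bit integers.

```python
# Back-of-envelope size of one daily file. The grid shape and the bytes per
# value are assumptions for illustration, not something stated in this notebook.
n_variables = 4
n_hours = 24
n_lat, n_lon = 721, 1440      # assumed 0.25 degree global grid
bytes_per_value = 2           # assumed packed 16-bit integers

expected_bytes = n_variables * n_hours * n_lat * n_lon * bytes_per_value
threshold_bytes = 1.8e8       # value passed to check_blob_size in the notebook

print(f"expected about {expected_bytes / 1e6:.0f} MB per day")
print(f"threshold is {threshold_bytes / expected_bytes:.0%} of that estimate")
```

Under these assumptions, a file well below the threshold would suggest a truncated download or a missing variable. If the request or the file format changes, the threshold should be revisited.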

## Workflow

The workflow is straightforward: for each day from the start date up to (but not including) the end date, make a request for hourly data containing the four variables, download the data, and upload it to Google Cloud Storage. `tqdm` tracks the progress of the workflow.
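
As a quick check of the date handling, the snippet below repeats the `daterange` logic in a self-contained form and applies it to the sample period used in this notebook. The end date is exclusive, so a range ending on 1 January 1993 stops at 31 December 1992.

```python
from datetime import date, timedelta


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


days = list(daterange(date(1992, 12, 1), date(1993, 1, 1)))
print(len(days), days[0], days[-1])  # 31 1992-12-01 1992-12-31

# Each day maps to one local file and one blob with the same name.
print(days[0].strftime("%Y%m%d") + ".nc")  # 19921201.nc
```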

In [5]:
days = list(daterange(start_date, end_date))
with tqdm_joblib(tqdm(total=len(days))) as pbar:
    Parallel(n_jobs=n_jobs, backend="multiprocessing")(
        delayed(ingest_cds_to_gcs)(day, hourly_data_bucket, cleanup=True)
        for day in days
    )

  0%|          | 0/31 [00:00<?, ?it/s]

## Discussion

In the next notebook, `02-Processing.ipynb`, we'll demonstrate the steps necessary to distill the ingested hourly data into daily-mean data.