# Data Ingest

This notebook contains a workflow to:
1. Download meteorological reanalysis data from the Copernicus [Climate Data Store](https://cds.climate.copernicus.eu/) and
2. Upload the data to a Google Cloud Storage bucket.

This data ingest is necessary to support an analysis of Earth's radiation budget, so we will request four available variables to serve as boundary conditions:

* Top of Atmosphere Incident Solar Radiation (`toa_incident_solar_radiation`)
* Top of Atmosphere Net Solar Radiation (`top_net_solar_radiation`)
* Downwelling Solar Radiation at the Surface (`surface_solar_radiation_downwards`)
* Net Solar Radiation at the Surface (`surface_net_solar_radiation`)

## Preliminaries

### Requirements

* A Copernicus Climate Data Store account ([Create new account](https://cds.climate.copernicus.eu/user/register))
* A Google Cloud project with Cloud Storage enabled ([Create new account](https://cloud.google.com/))
* Python packages. See the `environments` directory for platform-specific environment files.

### Imports

In [None]:
from utils import check_environment

check_environment("ingest")

import contextlib
from datetime import timedelta, date
import logging
import multiprocessing
from os import environ
import os
from sys import platform
import urllib.request

import cdsapi
import certifi
from google.cloud import storage
import joblib
from joblib import Parallel, delayed
from tqdm.notebook import tqdm
import urllib3

### Setup

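Our analysis seeks a long-term estimate of the amount of outgoing radiation that Earth's surface can reflect. ERA5 has 42 years of hourly data available, and a long-term climatology is typically defined as 30 years, so we target the latest 30 years: 1991 through 2020. Because a single request for the whole period would be prohibitively large, we break the request up by day. Note that the setup cell below sets `start_date` and `end_date` to cover only 2010; widen them to ingest the full period, keeping in mind that `end_date` is exclusive.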
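
As a rough illustration of what "by day" means in practice, the sketch below counts the daily requests needed for 1991 through 2020. It re-declares a date generator like the `daterange` function defined later in this notebook so that it runs on its own; the end date is treated as exclusive.

```python
from datetime import date, timedelta


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(days=n)


# Full 30-year window: 1991-01-01 through 2020-12-31 (end date is exclusive)
days = list(daterange(date(1991, 1, 1), date(2021, 1, 1)))
print(len(days))          # 10958 daily requests
print(days[0], days[-1])  # 1991-01-01 2020-12-31
```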

Our ingest workflow downloads ECMWF Copernicus Climate Data Store (CDS) files locally, uploads them to a Google Cloud Storage (GCS) bucket, and removes the local copies. We define a GCS bucket here, although a user may use whatever bucket they choose in their Google Cloud project.

The CDS Client API must be configured prior to use ([How to use the CDS API](https://cds.climate.copernicus.eu/api-how-to)). On initialization `cdsapi.Client()` looks for a `url` and `key` in environment variables, in a `.cdsapirc` file, or in the class input arguments.
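
The lookup order described above can be sketched as a small helper. This is only an illustration: `read_cds_credentials` is a hypothetical function, not part of `cdsapi`, and it assumes the `.cdsapirc` file holds plain `url: ...` and `key: ...` lines. `cdsapi.Client()` performs its own lookup, so the helper is useful mainly for checking a configuration before starting a long ingest.

```python
import os


def read_cds_credentials(rc_path=None, env=None):
    """Return (url, key) from environment variables, else from a .cdsapirc-style file.

    Hypothetical helper for illustration; returns (None, None) if nothing is found.
    """
    env = os.environ if env is None else env
    url, key = env.get("CDSAPI_URL"), env.get("CDSAPI_KEY")
    if url and key:
        return url, key

    # Assumed file format: one "name: value" pair per line
    rc_path = rc_path or os.path.join(os.path.expanduser("~"), ".cdsapirc")
    config = {}
    try:
        with open(rc_path) as f:
            for line in f:
                if ":" in line:
                    name, value = line.split(":", 1)  # split once; values may contain ":"
                    config[name.strip()] = value.strip()
    except OSError:
        return None, None
    return config.get("url"), config.get("key")
```

Calling `read_cds_credentials()` with no arguments checks the real environment and `~/.cdsapirc`; passing `env` or `rc_path` explicitly makes the function easy to test.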

In [None]:
start_date = date(2010, 1, 1)
end_date = date(2011, 1, 1)
hourly_data_bucket = "era5-single-level"
n_jobs = -3  # number of jobs for parallelization; if 1, then serial; if negative, then (n_cpus + 1 + n_jobs) are used

# Logging configuration (set up first so that later warnings are written to ingest.log)
logging.basicConfig(filename="ingest.log", filemode="w", level=logging.INFO)

# ECMWF C3S CDS Client Configuration
# cdsapi.Client() reads the credentials itself; this check only warns early if none are visible.
if not (environ.get("CDSAPI_URL") and environ.get("CDSAPI_KEY")):
    # Fall back to ~/.cdsapirc
    if not os.path.isfile(os.path.join(os.path.expanduser("~"), ".cdsapirc")):
        logging.warning("No $CDSAPI_URL, $CDSAPI_KEY, or .cdsapirc found.")

# Certificate management
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where()
)

# Multiprocessing configuration for MacOS
if platform == "darwin":
    multiprocessing.set_start_method("fork", force=True)  # ipython bug workaround https://github.com/ipython/ipython/issues/12396
    
# Project ID (the metadata server is only reachable from within Google Cloud)
url = "http://metadata.google.internal/computeMetadata/v1/project/project-id"
req = urllib.request.Request(url)
req.add_header("Metadata-Flavor", "Google")
project_id = urllib.request.urlopen(req).read().decode()
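
The `n_jobs` comment in the setup cell can be made concrete with a short calculation. The sketch below is an approximation of the rule stated in that comment, not joblib's own code; `effective_workers` is a hypothetical name used only for illustration.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Approximate the number of workers implied by an n_jobs value (illustrative)."""
    n_cpus = n_cpus or os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Negative values count back from the number of CPUs, with a floor of one worker
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(effective_workers(-3, n_cpus=8))  # 6: all but two of eight CPUs
print(effective_workers(1, n_cpus=8))   # 1: serial execution
```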

## Functions

In [None]:
@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Patch joblib to report into tqdm progress bar given as argument."""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()


def daterange(start_date, end_date):
    """Make a date range generator spanning two dates.

    Args:
      start_date: date object to start from (inclusive).
      end_date: date object to end at (exclusive).

    Yields:
      date object for iteration.
    """
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def put_date_data_gcs(single_date, bucket_name, user_project=None):
    """Upload a dataset for a single date to Google Cloud Storage.
    
    Args:
        single_date: date object representing the day whose local file is uploaded.
        bucket_name: Google Cloud Storage bucket to upload to.
        user_project: project ID for requester pays billing.

    Returns:
        Nothing; uploads data to Google Cloud Storage.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name, user_project=user_project)    
    blob = bucket.blob(f"{single_date.strftime('%Y%m%d')}.nc")
    blob.upload_from_filename(filename=f"./{single_date.strftime('%Y%m%d')}.nc")
    

def check_blob_size(single_date, bucket_name, user_project=None, raise_threshold=1e+2):
    """Verify that a GCS blob is larger than a specified threshold.

    Args:
        single_date: date object representing the day whose blob is checked.
        bucket_name: Google Cloud Storage bucket that holds the blob.
        user_project: project ID for requester pays billing.
        raise_threshold: file size in bytes below which an exception should be raised.

    Returns:
        Nothing; logs an info message about the size of the blob.

    Raises:
        Exception: if the blob is missing or its size is less than the specified threshold.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name, user_project=user_project)
    blob_name = f"{single_date.strftime('%Y%m%d')}.nc"
    blob = bucket.get_blob(blob_name)
    if blob is None:  # get_blob returns None when the object does not exist
        raise Exception(f"{blob_name} was not found in bucket {bucket_name}")
    if blob.size < raise_threshold:
        raise Exception(f"{single_date.strftime('%Y%m%d')} data file size is smaller than expected")
    logging.info(f"{single_date.strftime('%Y%m%d')} file size in GCS is {int(blob.size * 1e-6)}MB")


def ingest_cds_to_gcs(single_date, bucket_name, user_project=None, cleanup=False):
    """Retrieve data from the Copernicus Climate Data Store (CDS) and upload it to Google Cloud Storage.
    
    Args:
        single_date: date object representing day to retrieve data for.
        bucket_name: Google Cloud Storage bucket to upload to.
        user_project: project ID for requester pays billing.
        cleanup: Optionally remove the downloaded CDS data after upload to GCS.
        
    Returns:
        Nothing; downloads a file, uploads it to GCS, and optionally removes it.
    """
    client = cdsapi.Client(progress=False, quiet=True)
    client.retrieve(
        "reanalysis-era5-single-levels",
        {
            "product_type": "reanalysis",
            "variable": [
                "surface_solar_radiation_downwards", "toa_incident_solar_radiation",
                "surface_net_solar_radiation", "top_net_solar_radiation"
            ],
            "year": single_date.strftime("%Y"),
            "month": single_date.strftime("%m"),
            "day": single_date.strftime("%d"),
            "time": [
                "00:00", "01:00", "02:00",
                "03:00", "04:00", "05:00",
                "06:00", "07:00", "08:00",
                "09:00", "10:00", "11:00",
                "12:00", "13:00", "14:00",
                "15:00", "16:00", "17:00",
                "18:00", "19:00", "20:00",
                "21:00", "22:00", "23:00",
            ],
            "format": "netcdf",
        },
        f"{single_date.strftime('%Y%m%d')}.nc",
    )
    
    put_date_data_gcs(single_date, bucket_name, user_project=user_project)
    check_blob_size(single_date, bucket_name, user_project=user_project, raise_threshold=1.8e+8)
    
    if cleanup:
        os.remove(f"{single_date.strftime('%Y%m%d')}.nc")
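
The request dictionary and file name used in `ingest_cds_to_gcs` can also be produced by small pure functions, which makes them easy to inspect without contacting CDS or GCS. The sketch below is one possible arrangement; `build_request` and `object_name` are hypothetical names that do not appear elsewhere in this notebook.

```python
from datetime import date

VARIABLES = [
    "surface_solar_radiation_downwards", "toa_incident_solar_radiation",
    "surface_net_solar_radiation", "top_net_solar_radiation",
]


def build_request(single_date):
    """Build the one-day, 24-hour request dictionary for a given date."""
    return {
        "product_type": "reanalysis",
        "variable": list(VARIABLES),
        "year": single_date.strftime("%Y"),
        "month": single_date.strftime("%m"),
        "day": single_date.strftime("%d"),
        "time": [f"{hour:02d}:00" for hour in range(24)],  # "00:00" ... "23:00"
        "format": "netcdf",
    }


def object_name(single_date):
    """Name shared by the local file and the GCS blob for a given date."""
    return f"{single_date.strftime('%Y%m%d')}.nc"


request = build_request(date(2010, 1, 5))
print(request["year"], request["month"], request["day"])  # 2010 01 05
print(object_name(date(2010, 1, 5)))                      # 20100105.nc
```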

## Workflow

For each day between the specified start and end dates, make a request for hourly data containing 4 variables, download the data locally, and upload it to Google Cloud Storage. A `tqdm` progress bar tracks completion of the workflow.

In [None]:
with tqdm_joblib(tqdm(total=sum(1 for _ in daterange(start_date, end_date)))) as pbar:
    Parallel(n_jobs=n_jobs,
             backend="multiprocessing")(delayed(ingest_cds_to_gcs)(day, 
                                                                   hourly_data_bucket,
                                                                   user_project=project_id,
                                                                   cleanup=True)
                                        for day in daterange(start_date, end_date))
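
When debugging a small date range, it can help to run the same per-day function serially and collect failures instead of stopping at the first one. The following is a minimal sketch under that assumption; `run_serial` and `fake_ingest` are hypothetical names, and `fake_ingest` merely stands in for `ingest_cds_to_gcs` so the example runs without credentials.

```python
from datetime import date, timedelta


def run_serial(days, ingest, **kwargs):
    """Call ingest(day, **kwargs) for each day and return a list of (day, exception) failures."""
    failures = []
    for day in days:
        try:
            ingest(day, **kwargs)
        except Exception as exc:  # broad on purpose: record the failure and keep going
            failures.append((day, exc))
    return failures


def fake_ingest(day, bucket_name):
    """Stand-in for ingest_cds_to_gcs that fails on the second day of a month."""
    if day.day == 2:
        raise RuntimeError("simulated failure")


days = [date(2010, 1, 1) + timedelta(days=n) for n in range(3)]
failures = run_serial(days, fake_ingest, bucket_name="example-bucket")
print([day.isoformat() for day, _ in failures])  # ['2010-01-02']
```

The days listed in `failures` can then be retried on their own once the cause is understood.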

## Discussion

In the next notebook, `02-Preprocess.ipynb`, we'll demonstrate the steps necessary to process the ingested hourly-mean data into daily-mean data with all the necessary variables in our preferred units.