# Data Preprocessing

This notebook contains a workflow to:
1. Download hourly ERA5 data from a Google Cloud Storage bucket,
1. Process the hourly ERA5 data into daily ERA5 data with the fields necessary for our subsequent analysis, and
1. Upload the daily data to a Google Cloud Storage bucket.

This processing supports an analysis of Earth's radiation budget based on daily solar fluxes at the surface and the top of atmosphere, so we convert hourly averages into daily averages. ERA5 does not output top-of-atmosphere outgoing solar radiation or upwelling solar radiation at the surface. However, these quantities can be inferred from the fluxes that are available at those levels (e.g. outgoing radiation at the top of atmosphere is incoming radiation minus net radiation).
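As a minimal sketch of that inference, the snippet below works through the arithmetic for a single grid cell in plain Python. The hourly values are invented for illustration and are not ERA5 data, and `daily_mean_flux` is a hypothetical helper. The variable names (`tisr`, `tsr`, `ssrd`, `ssr`, `tosr`, `ssru`) and the J m**-2 to W m**-2 conversion follow the functions defined later in this notebook.

```python
SECONDS_PER_HOUR = 3600

# Synthetic hourly accumulations (J m**-2) for one grid cell over one day:
# 6 hours of night, 12 hours of daylight, 6 hours of night. Not real ERA5 data.
tisr = [0.0] * 6 + [1.80e6] * 12 + [0.0] * 6  # TOA incident solar radiation
tsr = [0.0] * 6 + [1.26e6] * 12 + [0.0] * 6   # TOA net solar radiation
ssrd = [0.0] * 6 + [1.20e6] * 12 + [0.0] * 6  # surface solar radiation downwards
ssr = [0.0] * 6 + [1.02e6] * 12 + [0.0] * 6   # surface net solar radiation

# Outgoing (upwelling) radiation is what arrives minus what is retained.
tosr = [incoming - net for incoming, net in zip(tisr, tsr)]
ssru = [down - net for down, net in zip(ssrd, ssr)]


def daily_mean_flux(hourly_joules):
    """Average hourly accumulations (J m**-2) into a daily-mean flux (W m**-2)."""
    return sum(hourly_joules) / len(hourly_joules) / SECONDS_PER_HOUR


print(f"tosr: {daily_mean_flux(tosr):.1f} W m**-2")
print(f"ssru: {daily_mean_flux(ssru):.1f} W m**-2")
```

With these made-up inputs, the daily-mean outgoing flux is 75 W m**-2 at the top of atmosphere and 25 W m**-2 at the surface.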

## Preliminaries

### Requirements

* A Google Cloud project with Cloud Storage enabled ([Create new account](https://cloud.google.com/))
* The following Python packages:

In [None]:
%pip install -q tqdm xarray dask netCDF4 joblib google-cloud-storage

### Imports

In [None]:
import contextlib
from datetime import timedelta, date
import logging
import multiprocessing
from os import system, path
from sys import platform
import warnings

from google.cloud import storage
from tqdm.notebook import tqdm
import numpy as np
import joblib
from joblib import Parallel, delayed
import xarray as xr

### Setup

Our analysis seeks a long-term estimate of the amount of outgoing radiation that Earth's surface can reflect. ERA5 has 42 years of hourly data available. A long-term climatology is typically defined as 30 years, so we ingest the latest 30 years: 1991 through 2020. Since a single request for all of this data would be prohibitively large, we break the work up by day.
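To give a sense of scale, the sketch below enumerates the per-day file names that this split implies. The `%Y%m%d.nc` naming pattern is the one used by the functions later in this notebook; `daily_file_names` is a hypothetical helper written only for this illustration.

```python
from datetime import date, timedelta


def daily_file_names(start_year, end_year):
    """Yield one 'YYYYMMDD.nc' name per day, Jan 1 of start_year through Dec 31 of end_year."""
    day = date(start_year, 1, 1)
    stop = date(end_year + 1, 1, 1)  # exclusive upper bound
    while day < stop:
        yield f"{day.strftime('%Y%m%d')}.nc"
        day += timedelta(days=1)


names = list(daily_file_names(1991, 2020))
print(len(names), names[0], names[-1])  # 10958 19910101.nc 20201231.nc
```

Thirty years with eight leap days gives 10,958 files, each small enough to download, process, and upload independently.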

In [None]:
start_year = 1991
end_year = 2020
hourly_data_bucket = "era5-single-level"
daily_data_bucket = "era5-single-level-daily"
annual_data_bucket = "era5-single-level-annual"
n_jobs = -3  # number of jobs for parallelization; if 1, then serial; if negative, then (n_cpus + 1 + n_jobs) are used

# Xarray configuration
xr.set_options(keep_attrs=True)

# Multiprocessing configuration for MacOS
if platform == "darwin":
    multiprocessing.set_start_method("fork", force=True)  # ipython bug workaround https://github.com/ipython/ipython/issues/12396

# Logging configuration
logging.basicConfig(filename="process.log", filemode="w", level=logging.INFO)

## Functions

In [None]:
@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Patch joblib to report into tqdm progress bar given as argument."""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()


def daterange(start_date, end_date):
    """Make a date range object spanning two dates.
    
    Args:
        start_date: date object to start from.
        end_date: date object to end before (exclusive).
    
    Yields:
        date object for iteration.
    """
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def get_data_gcs(file_name, bucket_name):
    """Download a dataset for a single date from Google Cloud Storage.
    
    Args:
        file_name: name of file to download from gcs.
        bucket_name: Google Cloud Storage bucket to download from.
    
    Returns:
        Nothing; downloads data from Google Cloud Storage as a side effect.
    """
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)    
    blob = bucket.blob(file_name)
    blob.download_to_filename(filename=file_name)


def put_data_gcs(file_name, bucket_name):
    """Upload a dataset for a single date to Google Cloud Storage.
    
    Args:
        file_name: name of file to upload to gcs.
        bucket_name: Google Cloud Storage bucket to upload to.
        
    Returns:
        Nothing; uploads data to Google Cloud Storage as a side effect.
    """
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)    
    blob = bucket.blob(file_name)
    blob.upload_from_filename(filename=file_name)
    

def check_blob_size(single_date, bucket_name, raise_threshold=1):
    """Verify that a GCS blob is at least as large as a specified threshold.
    
    Args:
        single_date: date object representing the day whose file is checked.
        bucket_name: Google Cloud Storage bucket that contains the file.
        raise_threshold: file size in bytes below which an exception is raised.
    
    Raises:
        Exception: if the blob is missing or smaller than raise_threshold.
    """
    blob_name = f"{single_date.strftime('%Y%m%d')}.nc"
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blob = bucket.get_blob(blob_name)  # returns None if the blob does not exist
    if blob is None:
        raise Exception(f"{blob_name} was not found in bucket {bucket_name}")
    if blob.size < raise_threshold:
        raise Exception(f"{blob_name} data file size is smaller than expected")
    logging.info(f"{blob_name} file size in GCS is {int(blob.size * 1e-6)}MB")


def modify_units(dataset, starting_units, ending_units, conversion_factor):
    """Modify the units of a variable.
    
    Args:
        dataset: xarray Dataset
        starting_units: str of units to be modified
        ending_units: str of units after modification
        conversion_factor: numerical factor to apply to convert units
    
    Returns:
        xarray Dataset with units modified for variables with units matching the starting unit.
    """
    for variable in dataset:
        if dataset[variable].attrs.get("units") == starting_units:  # skip variables without a units attribute
            dataset[variable] = dataset[variable] * conversion_factor
            dataset[variable].attrs["units"] = ending_units
    return dataset

        
def compute_daily_average(dataset):
    """Compute the daily average of an input xarray dataset covering a single day."""
    return dataset.resample(time='1D').sum() / dataset.sizes["time"]


def compute_boundary_fluxes(dataset):
    """Compute missing boundary fluxes at the surface and top of atmosphere if possible.
    
    Use available radiative fluxes e.g. net solar radiation and incoming solar radiation to 
    compute outgoing solar radiation.
    
    Args:
        dataset: xarray Dataset with radiative fluxes at the surface and top of atmosphere.
        
    Returns:
        xarray Dataset with the missing boundary fluxes added where they can be computed.
    """
    if ("tosr" not in dataset) and all(x in dataset for x in ["tisr", "tsr"]):
        dataset["tosr"] = dataset["tisr"] - dataset["tsr"]
        dataset["tosr"].attrs["long_name"] = "TOA outgoing solar radiation"
        dataset["tosr"].attrs["standard_name"] = "toa_outgoing_shortwave_flux"
    if ("ssru" not in dataset) and all(x in dataset for x in ["ssrd", "ssr"]):
        dataset["ssru"] = dataset["ssrd"] - dataset["ssr"]
        dataset["ssru"].attrs["long_name"] = "Surface solar radiation upwards"
        dataset["ssru"].attrs["standard_name"] = "surface_upwelling_shortwave_flux_in_air"
    return dataset


def drop_unneccesary_variables(dataset, keep_vars):
    """Drop variables not specified as necessary.
    
    Args:
        dataset: xarray Dataset.
        keep_vars: list of variables to keep.
    
    Returns:
        xarray Dataset with specified variables.
    """
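    # Drop only variables present in the dataset; a symmetric difference would also
    # list keep_vars entries that are absent, which makes drop_vars raise an error.
    drop_vars = [var for var in dataset.data_vars if var not in keep_vars]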
    return dataset.drop_vars(drop_vars)


def preprocess_hourly_data(single_date, hourly_data_bucket, daily_data_bucket, cleanup=False):
    """Process hourly average data into daily average data.
    
    Args:
        single_date: date object representing day to retrieve data for.
        hourly_data_bucket: str name of Google Cloud Storage bucket for hourly data.
        daily_data_bucket: str name of Google Cloud Storage bucket for daily data.
        cleanup: boolean option to remove downloaded data after processing.
    
    Returns:
        Nothing; uploads the daily data to Google Cloud Storage as a side effect.
        
    Raises:
        Exception: if the same bucket is provided for both hourly and daily data.
    """
    if hourly_data_bucket == daily_data_bucket:
        raise Exception("You must provide different buckets for hourly and daily data.")
    
    get_data_gcs(file_name=f"{single_date.strftime('%Y%m%d')}.nc", bucket_name=hourly_data_bucket)
    
    with xr.open_dataset(f"{single_date.strftime('%Y%m%d')}.nc") as ds:
        ds = compute_boundary_fluxes(ds)
        ds = compute_daily_average(ds)
        ds = drop_unneccesary_variables(ds, keep_vars=["ssrd", "ssru", "tisr", "tosr"])
        ds = modify_units(ds, "J m**-2", "W m**-2", (1 / 3600))  # 3600 seconds in an hour
        ds.to_netcdf(f"{single_date.strftime('%Y%m%d')}.nc")

    put_data_gcs(file_name=f"{single_date.strftime('%Y%m%d')}.nc", bucket_name=daily_data_bucket)
    check_blob_size(single_date, daily_data_bucket, raise_threshold=1e+7)
    
    if cleanup:
        system(f"rm {single_date.strftime('%Y%m%d')}.nc")

## Workflow

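For each day between the specified start and end dates, download the hourly data, process it, and upload the result to a separate bucket for daily data. Once a year of daily files is complete, average them into a single annual-mean file and upload it to the bucket for annual data.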
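The `n_jobs = -3` setting from Setup controls how many worker processes the loop uses. The sketch below mirrors the rule quoted in that cell's comment, `n_cpus + 1 + n_jobs` for negative values, with a floor of one worker. `effective_workers` is a hypothetical helper for illustration and is not part of joblib.

```python
def effective_workers(n_jobs, n_cpus):
    """Resolve an n_jobs setting into a worker count for a machine with n_cpus CPUs."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Negative values count back from the number of CPUs; never go below 1.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


for cpus in (2, 4, 8):
    print(f"{cpus} CPUs -> {effective_workers(-3, cpus)} workers")
```

On an 8-CPU machine this leaves two CPUs free for other work, while on a 2-CPU machine the run is effectively serial.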

In [None]:
for year in range(start_year, end_year + 1):
    
    start_date = date(year, 1, 1)
    end_date = date(year + 1, 1, 1)

    with tqdm_joblib(tqdm(total=sum(1 for _ in daterange(start_date, end_date)))) as pbar:
        Parallel(n_jobs=n_jobs,
                 backend="multiprocessing")(delayed(preprocess_hourly_data)(day, 
                                                                            hourly_data_bucket, 
                                                                            daily_data_bucket, 
                                                                            cleanup=False)
                                            for day in daterange(start_date, end_date))
    
    ds = xr.open_mfdataset("*.nc", parallel=True)
    ds.mean(dim="time").compute().assign_coords(time=year).expand_dims("time").to_netcdf(f"{year}.nc")
    
    put_data_gcs(file_name=f"{year}.nc", bucket_name=annual_data_bucket)
    
    system("rm *.nc")  # remove the daily and annual files before the next year

## Discussion

In this notebook, we preprocessed ERA5 data by computing radiative fluxes at the atmospheric boundaries (surface and top of atmosphere) and averaging hourly means to daily means. The daily-mean fields were uploaded as netCDF files, one per date, to their own Google Cloud Storage bucket, and the annual means, one file per year, were uploaded to a separate bucket. In the next notebook, `03-Analyze.ipynb`, we'll analyze the annual averages using a simple model of reflected radiation.