# Data Concatenation

This notebook contains a workflow to:
1. Download daily ERA5 data from a Google Cloud Storage bucket,
1. Combine the daily data into a Zarr store, and
1. Upload the Zarr store to a Google Cloud Storage bucket.

This step is a convenience that facilitates easier sharing via the Zarr format for chunked, compressed, N-dimensional arrays. In doing so, we follow the lead of the [Pangeo project](https://pangeo.io/), which also [stores data in the cloud](https://pangeo.io/data.html) using the Zarr format.

## Preliminaries

### Requirements

* A Google Cloud project with Cloud Storage enabled ([Create new account](https://cloud.google.com/))
* 300 GB of local storage for data
* The following Python packages:

In [1]:
%pip install -q tqdm joblib xarray dask scipy netCDF4 zarr google-cloud-storage

Note: you may need to restart the kernel to use updated packages.


In [2]:
import contextlib
from datetime import timedelta, date
import logging
import multiprocessing
from os import system, path
from sys import platform

from dask.diagnostics import ProgressBar
from google.cloud import storage
from tqdm.notebook import tqdm, trange
import joblib
from joblib import Parallel, delayed
import xarray as xr

### Setup

Our start and end dates span 30 years; the end date itself is excluded. The daily data are kept in a Google Cloud Storage bucket separate from the one that will hold the Zarr store we create. We also specify a local path for the Zarr store produced by the workflow.

In [3]:
start_date = date(1991, 1, 1)
end_date = date(2021, 1, 1)
daily_data_bucket = "era5-single-level-daily"
local_path_to_store = "./era5-daily.zarr"
zarr_bucket = "rom-input"
# Number of parallel jobs: 1 runs serially; a negative value uses (n_cpus + 1 + n_jobs) workers
n_jobs = -3

# Xarray configuration
xr.set_options(keep_attrs=True)

# Multiprocessing configuration for macOS
if platform == "darwin":
    # IPython bug workaround: https://github.com/ipython/ipython/issues/12396
    multiprocessing.set_start_method("fork", force=True)

# Logging configuration
logging.basicConfig(filename="combine.log", filemode="w", level=logging.INFO)
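
The negative `n_jobs` convention in the setup cell can be made concrete with a small sketch. The helper `effective_workers` below is hypothetical and not part of joblib; it only mirrors the rule stated in the comment, `n_cpus + 1 + n_jobs`, and the floor of one worker is an assumption about how extreme values should be handled.

```python
import os


def effective_workers(n_jobs, n_cpus=None):
    """Hypothetical helper: translate an n_jobs setting into a worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1  # cpu_count() may return None
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # Negative values count back from the number of CPUs (assumed floor of 1).
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


# With n_jobs = -3 on an 8-CPU machine, 6 workers are used.
print(effective_workers(-3, n_cpus=8))
```

With the setting used here, two CPUs are left free for the notebook and the rest of the system.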

## Functions

In [None]:
@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Patch joblib to report into tqdm progress bar given as argument."""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()


def daterange(start_date, end_date):
    """Iterate over the dates from start_date up to, but not including, end_date.

    Args:
        start_date: date object to start from (inclusive).
        end_date: date object to end at (exclusive).

    Yields:
        date object for each day in the range.
    """
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def get_date_data_gcs(single_date, bucket_name):
    """Download a dataset for a single date from Google Cloud Storage.
    
    Args:
        single_date: date object representing day to retrieve data for.
        bucket_name: Google Cloud Storage bucket to download from.
    
    Returns:
        Nothing; downloads data from Google Cloud Storage as a side effect.
    """
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    # Daily files are named by date, e.g. 19910101.nc
    filename = f"{single_date.strftime('%Y%m%d')}.nc"
    blob = bucket.blob(filename)
    blob.download_to_filename(filename=f"./{filename}")


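def combine_data(path_to_data, path_to_store, start_year, end_year):
    """Combine daily data files into a single Zarr store, one year at a time.

    Args:
        path_to_data: directory containing the daily netCDF files.
        path_to_store: path to the Zarr store to create or append to.
        start_year: first year to combine (inclusive).
        end_year: year to stop at (exclusive).
    """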
    for year in trange(start_year, end_year):
        
        files = [f"{path_to_data}/{single_date.strftime('%Y%m%d')}.nc" 
                 for single_date in daterange(date(year, 1, 1), date(year + 1, 1, 1))]
        ds = xr.open_mfdataset(files, parallel=True)
        ds.attrs["institution"] = "ECMWF"
        ds.attrs["source"] = "ERA5"
        ds.attrs["title"] = "Reflective Earth optimization map inputs"
        ds.attrs["comment"] = "Hourly-mean ERA5 boundary solar radiation fields were averaged into daily-means"
        
        if path.exists(path_to_store):
            delayed_store = ds.to_zarr(path_to_store, consolidated=True, compute=False, append_dim="time")
        else:
            delayed_store = ds.to_zarr(path_to_store, consolidated=True, compute=False)
        
        with ProgressBar():
            delayed_store.compute()
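
Because `daterange` stops one day before `end_date`, the per-year file lists built in `combine_data` cover exactly one calendar year each. The sketch below repeats the generator so that it runs on its own and checks the day counts; the helper name `daily_filenames` is an assumption introduced only for illustration.

```python
from datetime import date, timedelta


def daterange(start_date, end_date):
    """Yield each date from start_date up to, but not including, end_date."""
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)


def daily_filenames(year, path_to_data="."):
    """Hypothetical helper: list the daily file names for one calendar year."""
    return [f"{path_to_data}/{day.strftime('%Y%m%d')}.nc"
            for day in daterange(date(year, 1, 1), date(year + 1, 1, 1))]


files_2020 = daily_filenames(2020)
print(len(files_2020))                 # 366, since 2020 is a leap year
print(files_2020[0], files_2020[-1])   # ./20200101.nc ./20201231.nc

# The full span used in this notebook: 1991-01-01 up to 2021-01-01 (exclusive)
n_days = sum(1 for _ in daterange(date(1991, 1, 1), date(2021, 1, 1)))
print(n_days)                          # 10958 daily files
```

The total of 10958 is also the number of downloads the progress bar in the workflow should count up to.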

## Workflow

First, we download all of the daily data files to a local directory. Next, we combine them into a single Zarr store. Finally, we upload the store to Google Cloud Storage.

In [None]:
with tqdm_joblib(tqdm(total=sum(1 for _ in daterange(start_date, end_date)))) as pbar:
    Parallel(n_jobs=n_jobs,
             backend="multiprocessing")(delayed(get_date_data_gcs)(day, daily_data_bucket)
                                        for day in daterange(start_date, end_date))

In [None]:
combine_data(path_to_data=".",
             path_to_store=local_path_to_store,
             start_year=start_date.year, end_year=end_date.year)

In [None]:
system(f"gsutil -m cp -r {local_path_to_store} gs://{zarr_bucket}/")
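
`os.system` returns the exit status instead of raising, so a failed upload can go unnoticed. A minimal sketch of an alternative, assuming `gsutil` is installed and authenticated: build the same command as an argument list and run it with `subprocess.run(..., check=True)`. The function names `build_upload_command` and `upload_store` are hypothetical.

```python
import shutil
import subprocess


def build_upload_command(path_to_store, bucket_name):
    """Hypothetical helper: argument list for a recursive, parallel gsutil copy."""
    return ["gsutil", "-m", "cp", "-r", path_to_store, f"gs://{bucket_name}/"]


def upload_store(path_to_store, bucket_name):
    """Hypothetical helper: run the upload and raise if gsutil fails."""
    if shutil.which("gsutil") is None:
        raise FileNotFoundError("gsutil was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_upload_command(path_to_store, bucket_name), check=True)


# Inspect the command without running it.
print(build_upload_command("./era5-daily.zarr", "rom-input"))
```

Passing the arguments as a list also avoids shell quoting problems if a path ever contains spaces.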

## Discussion

Now that we have combined daily ERA5 data into a Zarr store, we can analyze the data using a simple model of reflected solar radiation. Look for this in the next notebook, `04-Analyze.ipynb`.