#### Date of last run : February 11th 2022
#### Author : Emile Tenezakis
### Purpose : 
This notebook was written and run to resolve the following issue in the CMIP6 downscaling project: https://github.com/ClimateImpactLab/downscaleCMIP6/issues/529. The grid of the `NorESM2-LM` historical precipitation data does not exactly match the grid of the future ssp projections: the longitudinal bounds differ. As a workaround, it was decided to modify the raw downloaded data by hand rather than the workflow code, in order to avoid introducing special cases. In this notebook we therefore download the historical and ssp precipitation `NorESM2-LM` data from CMIP6-in-the-cloud, replace the `lon_bnds` variable in the historical data with that of one of the ssp datasets, and overwrite the historical zarr store in our GCS raw data bucket with the modified dataset.

In [51]:
import xarray as xr
import numpy as np
import gcsfs
import intake

Helpers for catalog access and GCS I/O

In [52]:
def fetch_pangeo(experiment_id):
    if 'ssp' in experiment_id:
        version = '20191108'
        activity_ID = 'ScenarioMIP'
    elif 'historical' in experiment_id:
        version = '20190815'
        activity_ID = 'CMIP'
    else: 
        raise ValueError('invalid experiment id')
    col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6-noQC.json")
    cat = col.search(
        activity_id=activity_ID,
        experiment_id=experiment_id,
        table_id='day',
        variable_id='pr',
        source_id='NorESM2-LM',
        member_id='r1i1p1f1',
        grid_label='gn',
        version=int(version),
    )
    d = cat.to_dataset_dict(progressbar=False)
    assert len(d) == 1
    return d[list(d.keys())[0]]

In [53]:
def read_gcs_zarr(zarr_url, token='/opt/gcsfuse_tokens/impactlab-data.json', check=False, consolidated=True):
    """
    takes in a GCSFS zarr url, bucket token, and returns a dataset, given authentication.
    """
    fs = gcsfs.GCSFileSystem(token=token)
    store_path = fs.get_mapper(zarr_url, check=check)
    ds = xr.open_zarr(store_path, consolidated=consolidated)
    ds.close()
    return ds 

In [54]:
def write_gcs_zarr(ds, zarr_url, token='/opt/gcsfuse_tokens/impactlab-data.json', check=False, mode='w-'):
    """
    takes in a GCSFS zarr url, bucket token, dataset, and writes the dataset to URL.
    """
    fs = gcsfs.GCSFileSystem(token=token)
    store_path = fs.get_mapper(zarr_url, check=check)
    ds.to_zarr(store_path, mode=mode, compute=True)

In [55]:
def return_fs(token='/opt/gcsfuse_tokens/impactlab-data.json'):
    return gcsfs.GCSFileSystem(token=token)

Download datasets directly from CMIP6-in-the-cloud

In [56]:
IDs = ['historical', 'ssp126', 'ssp245', 'ssp370', 'ssp585']
dataset_dict = {}
for i in IDs:
    dataset_dict[i] = fetch_pangeo(i)

For convenience

In [57]:
hist = dataset_dict['historical']
ssp126 = dataset_dict['ssp126']
ssp245 = dataset_dict['ssp245']
ssp370 = dataset_dict['ssp370']
ssp585 = dataset_dict['ssp585']

Compare longitudinal bounds between ssps : all identical.

np.testing.assert_equal(ssp126.lon_bnds.values, ssp245.lon_bnds.values)
np.testing.assert_equal(ssp126.lon_bnds.values, ssp370.lon_bnds.values)
np.testing.assert_equal(ssp126.lon_bnds.values, ssp585.lon_bnds.values)

Compare longitudinal bounds between historical and ssp : not identical ...

In [58]:
np.testing.assert_raises(AssertionError, np.testing.assert_equal, hist.lon_bnds.values, ssp126.lon_bnds.values)
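The check above passes only because the inner `assert_equal` call fails. A minimal sketch of that pattern, using two small hand-written arrays as stand-ins for the real `lon_bnds` variables (the names `a`, `b`, `check_failed` and `arrays_match` are illustrative only):

```python
import numpy as np

# Small stand-in arrays (hypothetical, not read from the datasets).
a = np.array([[0.0, 1.25], [356.25, 360.0]])
b = np.array([[-1.25, 1.25], [356.25, 358.75]])

# Passes silently: assert_equal(a, b) raises AssertionError, which is expected.
np.testing.assert_raises(AssertionError, np.testing.assert_equal, a, b)

# If the two arrays were equal, assert_raises would itself fail.
try:
    np.testing.assert_raises(AssertionError, np.testing.assert_equal, a, a.copy())
    check_failed = False
except AssertionError:
    check_failed = True

# A plain boolean alternative when no exception is wanted.
arrays_match = np.array_equal(a, b)
```

Here `check_failed` ends up `True` and `arrays_match` ends up `False`.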

... every entry is identical except the first and the last, the two cells adjacent to the prime meridian.

In [59]:
np.testing.assert_equal(hist.lon_bnds.values[1:-1, :], ssp126.lon_bnds.values[1:-1, :])
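Rather than choosing a slice by hand, the differing rows can also be located directly. The sketch below does not read the real data: it assumes a regular 2.5° grid of 144 cells and rebuilds two synthetic arrays whose first and last rows match the values displayed in this notebook. The names `ssp_bnds`, `hist_bnds` and `differing_rows` are illustrative.

```python
import numpy as np

# Synthetic stand-ins for the two lon_bnds arrays
# (assumption: regular 2.5-degree grid with 144 cells).
n_cells = 144
lower = -1.25 + 2.5 * np.arange(n_cells)
ssp_bnds = np.column_stack([lower, lower + 2.5])

# Historical-like array: same bounds except the first and last rows.
hist_bnds = ssp_bnds.copy()
hist_bnds[0] = [0.0, 1.25]
hist_bnds[-1] = [356.25, 360.0]

# Indices of rows where at least one of the two bounds differs.
differing_rows = np.flatnonzero((hist_bnds != ssp_bnds).any(axis=1))
print(differing_rows)  # [  0 143]
```

On the real datasets the same expression could be applied to `hist.lon_bnds.values` and `ssp126.lon_bnds.values`.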

In [60]:
hist.lon_bnds.values[(0, 143),:]

array([[  0.  ,   1.25],
       [356.25, 360.  ]])

In [61]:
ssp126.lon_bnds.values[(0, 143),:]

array([[ -1.25,   1.25],
       [356.25, 358.75]])

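In the historical dataset, the two cells adjacent to the prime meridian have unequal widths: the last cell (356.25 to 360, just west of the meridian) is 3.75° wide, while the first cell (0 to 1.25) is only 1.25° wide. In the ssp datasets both cells are 2.5° wide. To remove this discrepancy, we replace the longitudinal bounds in the historical dataset with those of one of the ssp datasets (they are all identical).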
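The widths of the edge cells can be checked with a short calculation. This sketch copies the four rows from the outputs displayed above into small arrays (the names `hist_edges`, `ssp_edges`, `hist_widths` and `ssp_widths` are illustrative) and takes upper bound minus lower bound:

```python
import numpy as np

# First and last rows of lon_bnds, copied from the displayed outputs.
hist_edges = np.array([[0.0, 1.25], [356.25, 360.0]])
ssp_edges = np.array([[-1.25, 1.25], [356.25, 358.75]])

# Cell width = upper bound minus lower bound.
hist_widths = np.diff(hist_edges, axis=1).ravel()
ssp_widths = np.diff(ssp_edges, axis=1).ravel()

print(hist_widths)  # [1.25 3.75]
print(ssp_widths)   # [2.5 2.5]
```

The two historical edge cells together span 5°, the same as the two ssp edge cells, so the arrays differ in where those edge cells start and end rather than in their combined extent.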

In [62]:
hist['lon_bnds'] = ssp126['lon_bnds']

Here we verify that the operation worked : the longitudinal bounds are now equal.

In [63]:
np.testing.assert_equal(hist.lon_bnds.values, ssp126.lon_bnds.values)

We add an attribute to the dataset to document this change.

In [64]:
record = 'date : 2022-02-11, author : Emile Tenezakis, description : in the raw downloaded CMIP6-in-the-cloud data, modified the first and last entries of the longitudinal bounds array (lon_bnds). These entries change from array([[0., 1.25], [356.25, 360.]]) to array([[-1.25, 1.25], [356.25, 358.75]]). This change was made so that this dataset spatial bounds match with the SSP projection dataset bounds'
hist.attrs['modifications'] = record

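Finally, we overwrite the zarr store with the modified data. We pass `mode="w"` explicitly because the default of `write_gcs_zarr`, `mode='w-'`, fails when the target store already exists.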

In [65]:
out_url = 'gs://raw-305d04da/cmip6/CMIP/NCC/NorESM2-LM/historical/r1i1p1f1/day/pr/gn/v20190815.zarr' # overwrite at this location

In [66]:
write_gcs_zarr(hist, out_url, mode="w")

We verify that the changes persisted by reading the store back.

In [67]:
np.testing.assert_equal(read_gcs_zarr(out_url).lon_bnds.values, ssp126.lon_bnds.values)

In [68]:
assert read_gcs_zarr(out_url).attrs['modifications'] == record