# Python Data Access

We first briefly introduce the functionality of Xarray, and then access CMIP6 data in Google Cloud with intake-esm.

____________
## 1. Xarray
____________

In [None]:
import xarray as xr
import zarr
xr.set_options(display_style='html') # make the display_style of xarray more user friendly

In [None]:
# use the North American air temperature dataset in Xarray tutorial
tas = xr.tutorial.open_dataset("air_temperature")
# we have an xarray Dataset: a collection of labelled N-D arrays!
tas

In [None]:
# access the xarray DataArray within the Dataset; xarray hasn't loaded the data into memory yet (xarray is lazy).
tas.air

In [None]:
# let's access the years 2013 and 2014 separately and write them to the current directory
tas.sel(time='2013').to_netcdf('./tas_2013.nc')
tas.sel(time='2014').to_netcdf('./tas_2014.nc')

In [None]:
# read in one of the two files that we just output
tas=xr.open_dataset('./tas_2013.nc')
tas.air

In [None]:
Tc=tas.air-273.15
# once we do calculations, xarray loads the data into memory
# but we lost the attributes
# that's because xarray, by default, only keeps attributes in unambiguous circumstances
Tc

In [None]:
# set the option globally to tell xarray to always keep attributes
# you can also pass the keep_attrs argument to many individual xarray operations
xr.set_options(keep_attrs=True)

In [None]:
# do the calculation again: attributes are kept
Tc=tas.air-273.15
Tc

In [None]:
# but with the wrong units (kelvin), which we need to change manually
Tc.attrs["units"] ='degC'
Tc
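As a self-contained illustration of the attribute handling above, the sketch below repeats the kelvin-to-Celsius conversion on a small synthetic `DataArray` (the values and the `long_name` attribute are made up for illustration). It uses `xr.set_options(keep_attrs=True)` as a context manager, so the setting applies only inside the `with` block instead of globally, and then corrects the units by hand.

```python
import numpy as np
import xarray as xr

# synthetic temperatures in kelvin (made-up values, not the tutorial data)
t_kelvin = xr.DataArray(
    np.array([273.15, 283.15, 293.15]),
    dims="time",
    attrs={"units": "K", "long_name": "air temperature"},
)

# keep_attrs=True only applies inside the with-block
with xr.set_options(keep_attrs=True):
    t_celsius = t_kelvin - 273.15

# the attributes were carried over, so the units still say kelvin: fix them by hand
t_celsius.attrs["units"] = "degC"
print(t_celsius.attrs)
```

Scoping the option this way can be handy when only a few operations should carry attributes forward.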

In [None]:
# Now let's output Tc to zarr format using xarray
# we need to change Tc from xarray DataArray to xarray Dataset first!
Tc.to_dataset().to_zarr('./Tc_zarr/')

In [None]:
# read in again
Tc_zr=xr.open_zarr('./Tc_zarr/')
# the array was chunked automatically when we wrote it out above; we can also set the chunk size manually
Tc_zr.air

In [None]:
# Above, all the chunks are stored within a directory containing many small files,
# which may not be preferable on an HPC cluster
# zarr offers several other storage alternatives
import bsddb3
store = zarr.DBMStore('./Tc_zarr.bdb', open=bsddb3.btopen)
Tc_zr.to_zarr(store)
# we need to close the store explicitly
store.close()

In [None]:
# xarray can open multiple files in parallel
# by default the data are chunked so that each file corresponds to one chunk
tas_mf=xr.open_mfdataset('./tas_*.nc',parallel=True)
tas_mf

Lots of useful functionality in xarray: **the power of labelling!**

In [None]:
# easy indexing by label (lat is stored in descending order, hence slice(50,20))
tas.air.sel(time='2013-07-01',lat=slice(50,20),lon=slice(250,300))
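The label-based selection above depends on how the coordinates are stored. The following is a minimal sketch on a synthetic grid (coordinate values chosen for illustration, all data zero) in which latitude runs north to south, as in the tutorial dataset. It shows that slice bounds are inclusive and must follow the stored order of the coordinate, and that a date string selects every time step on that day.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic grid: latitude is stored north-to-south, as in the tutorial dataset
time = pd.date_range("2013-07-01", periods=8, freq=pd.Timedelta(hours=6))
lat = np.array([60.0, 50.0, 40.0, 30.0, 20.0, 10.0])
lon = np.array([240.0, 250.0, 260.0, 270.0, 280.0, 290.0, 300.0, 310.0])
air = xr.DataArray(
    np.zeros((time.size, lat.size, lon.size)),
    coords={"time": time, "lat": lat, "lon": lon},
    dims=("time", "lat", "lon"),
)

# slice bounds follow the stored order of the coordinate and include both ends
box = air.sel(time="2013-07-01", lat=slice(50, 20), lon=slice(250, 300))
print(box.sizes)

# bounds given in the opposite order select nothing on a descending coordinate
empty = air.sel(lat=slice(20, 50))
print(empty.sizes["lat"])
```

If a selection unexpectedly comes back empty, checking the order of the coordinate values is a good first step.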

In [None]:
# annual mean
tas.air.mean('time')

In [None]:
# zonal averages at certain latitudes
tas.air.mean('time').sel(lat=slice(50,30)).mean('lon')

In [None]:
# monthly mean value (climatological monthly mean if we have say 30 years)
# groupby can be very handy
tas.air.groupby('time.month').mean()
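To make the climatology idea concrete, here is a small sketch on synthetic data: two years of daily values where each value is simply the month number, so the expected monthly means are easy to check. It also shows one common follow-up, subtracting the monthly climatology to obtain anomalies. The variable names are illustrative only.

```python
import pandas as pd
import xarray as xr

# two years of synthetic daily data whose value is simply the month number
time = pd.date_range("2013-01-01", "2014-12-31", freq="D")
series = xr.DataArray(
    time.month.values.astype(float), coords={"time": time}, dims="time"
)

# one mean value per calendar month, pooled over both years
climatology = series.groupby("time.month").mean("time")

# anomalies: subtract each month's climatological mean from every time step
anomaly = series.groupby("time.month") - climatology

print(climatology.values)
```

The grouped mean replaces the `time` dimension with a 12-element `month` dimension, while the anomaly keeps the original `time` axis.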

In [None]:
# resample: daily maximum
tas.air.resample({'time':'D'}).max()
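Unlike `groupby('time.month')`, which pools the same calendar month across years, `resample` bins consecutive time steps and keeps a time axis. A minimal sketch with two days of synthetic 6-hourly values (made up for illustration) shows the daily reduction:

```python
import numpy as np
import pandas as pd
import xarray as xr

# two days of synthetic 6-hourly values: 0, 1, ..., 7
time = pd.date_range("2013-01-01", periods=8, freq=pd.Timedelta(hours=6))
series = xr.DataArray(np.arange(8.0), coords={"time": time}, dims="time")

# resample bins the time axis into calendar days, then reduces each bin
daily_max = series.resample(time="D").max()
daily_mean = series.resample(time="D").mean()
print(daily_max.values, daily_mean.values)
```

The result has one time step per day, so it can be selected and plotted like any other time series.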

____________
## 2. Access CMIP6 data in the Cloud
____________

### 2.1 Opening a single Zarr data store

A standalone Zarr data store can be opened using xarray’s ```open_zarr()``` function. The function takes a Python-native ```MutableMapping``` as input, which can be obtained from a Zarr store URL using ```fsspec```.
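To see what "a ```MutableMapping```" means in practice, the sketch below builds a mapper on fsspec's in-memory filesystem instead of a cloud bucket, so it runs without network access. The store name `demo_store` and the key written to it are made up for illustration; a real Zarr store would hold array metadata and chunk files under such keys.

```python
from collections.abc import MutableMapping

import fsspec

# "memory://" is fsspec's in-memory filesystem; the store name is made up
mapper = fsspec.get_mapper("memory://demo_store")

# keys are paths relative to the store root, values are raw bytes
mapper["group/.zattrs"] = b"{}"

print(isinstance(mapper, MutableMapping))
print(mapper["group/.zattrs"])
```

The same dictionary-style interface applies when the URL points at `gs://` or `s3://`, with fsspec translating key lookups into object reads.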

In [None]:
# fsspec: filesystem interfaces for working with remote filesystems
import fsspec

In [None]:
# create a MutableMapping from a store URL
mapper = fsspec.get_mapper("gs://cmip6/CMIP6/CMIP/AS-RCEC/TaiESM1/1pctCO2/r1i1p1f1/Amon/hfls/gn/v20200225/")

In [None]:
# read in
# consolidated=True reads all metadata from a single consolidated object, which speeds up reading the array metadata
ds = xr.open_zarr(mapper, consolidated=True)
ds

### 2.2 Manually searching the catalog

Wait! Where can I get the zstore URL?

- We can download a master CSV file enumerating all available data stores
- We can interact with the spreadsheet through a pandas DataFrame to search and explore relevant data using the CMIP6 controlled vocabulary
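The search step can be tried offline on a toy table. The sketch below uses a hypothetical three-row catalog with a few of the real catalog's column names; the `zstore` values are placeholders, not real stores. It shows that a `query()` string and a boolean mask select the same rows.

```python
import pandas as pd

# hypothetical three-row catalog; the zstore values are placeholders, not real stores
catalog = pd.DataFrame(
    {
        "source_id": ["CESM2", "CESM2", "TaiESM1"],
        "table_id": ["Amon", "Amon", "Amon"],
        "variable_id": ["tas", "pr", "tas"],
        "zstore": ["gs://example/a/", "gs://example/b/", "gs://example/c/"],
    }
)

# query() takes the filter as a string expression ...
by_query = catalog.query("source_id == 'CESM2' & variable_id == 'tas'")

# ... which is equivalent to combining boolean masks
mask = (catalog["source_id"] == "CESM2") & (catalog["variable_id"] == "tas")
by_mask = catalog[mask]

print(by_query["zstore"].tolist())
```

The string form is compact for interactive use, while masks are easier to build up programmatically.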

In [None]:
import pandas as pd
# for Google Cloud:
df = pd.read_csv("https://cmip6.storage.googleapis.com/pangeo-cmip6.csv")
# for AWS S3:
# df = pd.read_csv("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.csv")
df

In [None]:
# query it based on your needs!
df_subset = df.query("activity_id=='CMIP' & source_id=='CESM2' & table_id=='Amon' & variable_id=='tas' \
                     & member_id=='r1i1p1f1' & grid_label=='gn'")
df_subset

In [None]:
# we have a bunch of zstore URLs
df_subset.zstore.values

In [None]:
# let's say we want to access the last one!
zstore = df_subset.zstore.values[-1]
mapper = fsspec.get_mapper(zstore)
ds = xr.open_zarr(mapper, consolidated=True)
ds

### 2.3 Working with multiple data stores at the same time
- Opening all the zstores one by one manually is not user friendly.

- ```intake-ESM``` can help combine several data stores to form a dataset.

- ```intake-ESM``` is an add-on to ```intake```, a Python package that aims to provide a consistent data access API.

In [None]:
import intake

In [None]:
# for Google Cloud:

# provide a link to an ESM collection file, which holds a bunch of metadata including
# how data stores can be combined to yield highly aggregated datasets
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
# using this ESM collection file, intake-esm connects to a database (CSV file) that contains data asset locations
# and associated metadata.
col

# for AWS S3:
#col = intake.open_esm_datastore("https://cmip6-pds.s3.amazonaws.com/pangeo-cmip6.json")

In [None]:
col.df.head()  # view as a DataFrame

In [None]:
# do query!
query = dict(experiment_id=['historical'],
             table_id='Amon',
             variable_id=['tas','tasmax'],
             member_id = 'r1i1p1f1',
             grid_label='gn')
# intake-esm provides functionality to execute queries against the catalog
col_subset = col.search(require_all_on=['source_id'], **query)
# subset catalog and get some metrics grouped by 'source_id'
col_subset.df.groupby('source_id')[['experiment_id', 'variable_id', 'table_id']].nunique()
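The effect of `require_all_on=['source_id']` in the search above is that a model is kept only if it satisfies every part of the query, here both requested variables. The following is a pandas sketch of that idea on a hypothetical table (the model names are invented); it is meant to illustrate the filtering logic, not to reproduce intake-esm's actual implementation.

```python
import pandas as pd

# hypothetical search result: model_A offers both variables, model_B only one
hits = pd.DataFrame(
    {
        "source_id": ["model_A", "model_A", "model_B"],
        "variable_id": ["tas", "tasmax", "tas"],
    }
)
requested = {"tas", "tasmax"}

# keep a source_id only if its rows cover every requested variable
complete = hits.groupby("source_id").filter(
    lambda rows: requested.issubset(rows["variable_id"])
)
print(complete["source_id"].unique())
```

Without such a constraint, models that provide only one of the two variables would also remain in the subset.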

In [None]:
col_subset.df  # view as a DataFrame

In [None]:
# intake-esm can load the data directly into a dictionary of xarray Datasets
dset_dict = col_subset.to_dataset_dict(zarr_kwargs={'consolidated': True})

In [None]:
list(dset_dict.keys())

In [None]:
# we got an xarray Dataset that contains two xarray DataArrays
dset_dict['CMIP.CSIRO.ACCESS-ESM1-5.historical.Amon.gn']
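The dictionary keys are dot-separated strings. Judging from the key used above, the fields appear to be activity, institution, model, experiment, table and grid label; that field order is an assumption inferred from this one example, and `parse_key` is a hypothetical helper, not part of intake-esm. A small sketch for turning such keys into something easier to filter on:

```python
# field order assumed from the example key above
FIELDS = ("activity_id", "institution_id", "source_id",
          "experiment_id", "table_id", "grid_label")


def parse_key(key):
    """Split a dataset key into a dict of catalog fields (hypothetical helper)."""
    parts = key.split(".")
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected key format: {key!r}")
    return dict(zip(FIELDS, parts))


info = parse_key("CMIP.CSIRO.ACCESS-ESM1-5.historical.Amon.gn")
print(info["source_id"])
```

This makes it straightforward to loop over `dset_dict` and, for example, group the datasets by model name.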