# Using data from AWS with intake_esm

A significant amount of Earth System Model (ESM) data is publicly available online, including data from the CESM Large Ensemble, CMIP5, and CMIP6 datasets. To access a single file, we can specify the file (typically in NetCDF or Zarr format) and its location, and then use fsspec (the "Filesystem Spec" Python package) and xarray to create an xarray.Dataset. For several files, the intake_esm Python module (https://github.com/intake/intake-esm) is particularly convenient for obtaining the data and putting it into an xarray.Dataset.

This notebook assumes familiarity with the Tutorial Notebook.  It additionally shows how to gather data from an ESM collection, put it into a dataset, and then create simple plots using the data with ldcpy.

#### Example Data

The example data we use is from the CESM Large Ensemble, member 31. This ensemble data has been lossily compressed and reconstructed as part of a blind evaluation study of lossy data compression in LENS (e.g., http://www.cesm.ucar.edu/projects/community-projects/LENS/projects/lossy-data-compression.html or https://gmd.copernicus.org/articles/9/4381/2016/).

Most of the data from the CESM Large Ensemble Project has been made available on Amazon Web Services (Amazon S3), see http://ncar-aws-www.s3-website-us-west-2.amazonaws.com/CESM_LENS_on_AWS.htm .

For comparison purposes, the original (non-compressed) data for Ensemble 31 has recently been made available on Amazon Web Services (Amazon S3) in the "ncar-cesm-lens-baker-lossy-compression-test" bucket.

In [1]:
# Add ldcpy root to system path
import sys
sys.path.insert(0,'../../../')

# Import ldcpy package
# Autoreloads the package every time it is called, so changes to the code will be reflected in the notebook if the above sys.path.insert(...) line is uncommented.
%load_ext autoreload
%autoreload 2
import ldcpy

import intake
import fsspec
import xarray as xr

#display the plots in this notebook
%matplotlib inline

## Method 1: using fsspec and xr.open_zarr

First, specify the filesystem and location of the data. Here we are accessing the original data from CESM-LENS ensemble 31, which is available on Amazon S3 in the _"ncar-cesm-lens-baker-lossy-compression-test"_ bucket.

Next, we list all available files (which are timeseries files containing a single variable) for that dataset. Note that unlike in the TutorialNotebook (which used NetCDF files), these files are all in Zarr format. Both monthly and daily data are available.

In [2]:
fs = fsspec.filesystem('s3', anon=True)
stores = fs.ls("ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/")[1:]
stores[:]

['ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FLNS.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FLNSC.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FLUT.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FSNS.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FSNSC.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FSNTOA.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-ICEFRAC.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-LHFLX.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-PRECL.zarr',
 'ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-PRECSC.zarr',
 'ncar-cesm-lens-baker-lossy-
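Selecting a store by its position in the listing is fragile, because the order depends on what the bucket currently contains. A minimal sketch, assuming store names follow the `cesmle-atm-ens31-20C-<frequency>-<VARIABLE>.zarr` pattern seen in the listing above, builds a lookup keyed by frequency and variable instead. The helper name `index_stores` is hypothetical (it is not part of fsspec or ldcpy), and the sample paths are copied from the listing.

```python
import posixpath


def index_stores(stores):
    """Map (frequency, variable) -> store path, parsed from each store's file name."""
    index = {}
    for path in stores:
        name = posixpath.basename(path.rstrip("/"))
        if not name.endswith(".zarr"):
            continue  # skip anything that is not a zarr store
        stem = name[: -len(".zarr")]
        # Assumed name layout: cesmle-atm-ens31-20C-<frequency>-<VARIABLE>
        parts = stem.split("-")
        if len(parts) < 2:
            continue
        frequency, variable = parts[-2], parts[-1]
        index[(frequency, variable)] = path
    return index


# Two entries copied from the listing above.
sample = [
    "ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-FLNS.zarr",
    "ncar-cesm-lens-baker-lossy-compression-test/lens-ens31/cesmle-atm-ens31-20C-daily-PRECL.zarr",
]
by_name = index_stores(sample)
print(by_name[("daily", "FLNS")])
```

With the full listing, `index_stores(stores)[("daily", "FLNS")]` could be used wherever a positional index such as `stores[0]` appears.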

Then we select the file from the store that we want and open it as an xarray.Dataset using xr.open_zarr(). Here we grab data for the first 2D daily variable in the list, FLNS (net longwave flux at surface, in $W/m^2$), accessed by its location (stores[0]).

In [6]:
store = fs.get_mapper(stores[0])
ds_flns = xr.open_zarr(store, consolidated=True, decode_times=False)
ds_flns

| | Array | Chunk |
|---|---|---|
| Bytes | 6.94 GB | 127.40 MB |
| Shape | (31391, 192, 288) | (576, 192, 288) |
| Count | 56 Tasks | 55 Chunks |
| Type | float32 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 502.26 kB | 251.14 kB |
| Shape | (31391, 2) | (15696, 2) |
| Count | 3 Tasks | 2 Chunks |
| Type | float64 | numpy.ndarray |

The above returned an xarray.Dataset, and we can now easily plot the mean (of FLNS) over the first 5 days with ldcpy.plot:

**ALEX** - what is the meaning of the label for set1 when we didn't open with open_dataset...

In [None]:
ldcpy.plot(ds_flns, "FLNS", set1='orig', metric='mean', subset='first5')
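The sizes reported in the summary for `ds_flns` explain why only the first five days are plotted: the full array is several gigabytes. As a rough check, the sketch below recomputes the byte counts and chunk count from the shapes in that summary, assuming 4 bytes per float32 element and decimal units (1 GB = 10^9 bytes). The function name `array_summary` is made up for this illustration.

```python
import math


def array_summary(shape, chunk_shape, itemsize):
    """Return (total_bytes, chunk_bytes, n_chunks) for a regularly chunked array."""
    total_bytes = math.prod(shape) * itemsize
    chunk_bytes = math.prod(chunk_shape) * itemsize
    # Chunks per axis round up, because the last chunk along an axis may be partial.
    n_chunks = math.prod(math.ceil(n / c) for n, c in zip(shape, chunk_shape))
    return total_bytes, chunk_bytes, n_chunks


# Shapes taken from the summary above; float32 uses 4 bytes per element.
total, chunk, n_chunks = array_summary((31391, 192, 288), (576, 192, 288), itemsize=4)
print(f"{total / 1e9:.2f} GB, {chunk / 1e6:.2f} MB, {n_chunks} chunks")
```

The time axis of 31391 daily steps does not divide evenly by 576, so the last of the 55 chunks is smaller than the rest.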

Now let's grab the PRECT (precipitation rate) data and plot the mean over the first 5 days.

In [5]:
#PRECT data
store2 = fs.get_mapper(stores[11])
ds_prect = xr.open_zarr(store2, consolidated=True, decode_times=False)

In [None]:
#plot PRECT mean (5 days)
ldcpy.plot(ds_prect, "PRECT", set1='orig', metric='mean', subset='first5')

In [7]:
#TS data
store3 = fs.get_mapper(stores[20])
ds_ts = xr.open_zarr(store3, consolidated=True, decode_times=False)
ds_ts

| | Array | Chunk |
|---|---|---|
| Bytes | 6.94 GB | 127.40 MB |
| Shape | (31391, 192, 288) | (576, 192, 288) |
| Count | 56 Tasks | 55 Chunks |
| Type | float32 | numpy.ndarray |

| | Array | Chunk |
|---|---|---|
| Bytes | 502.26 kB | 251.14 kB |
| Shape | (31391, 2) | (15696, 2) |
| Count | 3 Tasks | 2 Chunks |
| Type | float64 | numpy.ndarray |


In [None]:
#plot TS mean
ldcpy.plot(ds_ts, "TS", set1='orig', metric='mean', subset='first5')

## Method 2: Using intake_esm

Now we will demonstrate using the intake_esm module.  We can use the intake_esm module to search for and open several files as xarray.Dataset objects. Then we can use ldcpy as before. The code below is modified from the intake_esm documentation, available here: https://intake-esm.readthedocs.io/en/latest/?badge=latest#overview.

We will now use ensemble 31 data from the CESM LENS collection on AWS, which (as explained above) has been subjected to lossy compression. Many catalogs for publicly available datasets that are accessible via intake-esm can be found at https://github.com/NCAR/intake-esm-datastore/tree/master/catalogs, including one for CESM-LENS. We can open that collection as follows (see here: https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md#attribute-object):

In [12]:
col_loc = "https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json"
col = intake.open_esm_datastore(col_loc)
col

| | unique |
|---|---|
| component | 5 |
| frequency | 5 |
| experiment | 6 |
| variable | 75 |
| path | 365 |


Next, we search for the subset of the collection (dataset and variables) that we are interested in. Let's grab FLNS, TS, and PRECT daily data from the atm component.

In [14]:
#we want daily data for FLNS, PRECT, and TS
aws_col_subset = col.search(component='atm', frequency='daily', experiment='20C',
                        variable=["FLNS", "PRECT", "TS"])
#display header info
aws_col_subset.df.head()

| | component | frequency | experiment | variable | path |
|---|---|---|---|---|---|
| 0 | atm | daily | 20C | FLNS | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.... |
| 1 | atm | daily | 20C | PRECT | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-PRECT... |
| 2 | atm | daily | 20C | TS | s3://ncar-cesm-lens/atm/daily/cesmLE-20C-TS.zarr |
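The search above combines its keyword arguments: a row is kept only if every named column matches, and a list of values means "any of these". The sketch below illustrates that filtering logic on a toy catalog in plain Python. It is not intake-esm's implementation; the `search` function and the toy rows (including the `ocn` entry) are hypothetical and only stand in for the real catalog, which lists 365 unique paths.

```python
def search(rows, **query):
    """Keep rows whose columns match every query term (a single value or a list of values)."""

    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Toy catalog rows (hypothetical stand-ins for the real catalog entries).
catalog = [
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "FLNS"},
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "TS"},
    {"component": "atm", "frequency": "monthly", "experiment": "20C", "variable": "TS"},
    {"component": "ocn", "frequency": "monthly", "experiment": "20C", "variable": "SST"},
]
subset = search(catalog, component="atm", frequency="daily", variable=["FLNS", "PRECT", "TS"])
print([row["variable"] for row in subset])
```

Requesting a variable that has no matching row (PRECT in this toy catalog) simply yields no entry for it rather than an error.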


Then we load matching catalog entries into xarray datasets (https://intake-esm.readthedocs.io/en/latest/api.html#intake_esm.core.esm_datastore.to_dataset_dict):



In [15]:
dsets = aws_col_subset.to_dataset_dict(zarr_kwargs={"consolidated": True}, 
                                   storage_options={"anon": True})
dsets



--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'


KeyError: '.zmetadata'

Check the dataset keys to ensure that the dataset we want is present:

In [None]:
dsets.keys()
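The message printed by `to_dataset_dict` states that keys are built as 'component.experiment.frequency'. A minimal sketch of that naming rule, applied to the three rows returned by the search, shows why a single key is expected here. The `group_keys` helper is hypothetical: it only mimics the key format in plain Python and does not load any data.

```python
from collections import defaultdict


def group_keys(rows, groupby=("component", "experiment", "frequency"), sep="."):
    """Group catalog rows under keys built by joining the groupby columns."""
    groups = defaultdict(list)
    for row in rows:
        key = sep.join(row[column] for column in groupby)
        groups[key].append(row["variable"])
    return dict(groups)


# Rows mirror the search result shown earlier (paths omitted).
rows = [
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "FLNS"},
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "PRECT"},
    {"component": "atm", "frequency": "daily", "experiment": "20C", "variable": "TS"},
]
print(group_keys(rows))
```

All three variables share the same component, experiment, and frequency, so they fall under the one key `'atm.20C.daily'`.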

Finally, put the dataset we are interested in from the dictionary into its own variable:

In [None]:
aws_ds = dsets['atm.20C.daily']
aws_ds

## Now compare the orig data to the lossy compressed data

## Make your own catalog
Do we need this?

We created test_catalog.csv and test_collection.json files for this particular simple example.

Then we load matching catalog entries into xarray datasets (https://intake-esm.readthedocs.io/en/latest/api.html#intake_esm.core.esm_datastore.to_dataset_dict):

As we can see above, the dataset has the required lat, lon, and time dimensions and coordinates, and ICEFRAC, PRECT, and FLUT variables. We can use ldcpy's plotting function to plot metrics from this dataset. Below, we plot the mean value of FLUT for the first five time slices at each lat and lon coordinate in the dataset (the full dataset may take a long time to run on a personal computer):

ICEFRAC - Fraction of sfc area covered by sea-ice (note this will be 0 over land)

PRECT - Total (convective and large-scale) precipitation rate (liq + ice)

FLUT - Upwelling longwave flux at top of model

In [None]:
#ldcpy.plot(aws_ds, "ICEFRAC", set1='orig', metric='mean', subset='first5')

In [None]:
#ldcpy.plot(aws_ds, "PRECT", set1='orig', metric='mean', subset='first5')

In [None]:
#ldcpy.plot(aws_ds, "FLUT", set1='orig', metric='mean', subset='first5')