# 09. Zarr Access for NetCDF4 files

## Summary

Zarr is an open source library for storing N-dimensional array data.  It supports multidimensional arrays with attributes and dimensions similar to NetCDF4, and it can be read by XArray.  Zarr is often used for data held in cloud object storage (like Amazon S3), because it is better optimized for these situations than NetCDF4.

The [zarr-eosdis-store library](https://github.com/nasa/zarr-eosdis-store) allows NASA EOSDIS NetCDF4 files to be read more efficiently by transferring only file metadata and data needed for computation in a small number of requests, rather than moving the whole file or making many small requests.  It works by making the files directly readable by the [Zarr Python library](https://zarr.readthedocs.io) and XArray across a network.  To use it, files must have a corresponding metadata file ending in `.dmrpp`, which is increasingly true for cloud-accessible EOSDIS data.

The zarr-eosdis-store library provides several benefits over downloading EOSDIS data files and accessing them using XArray, NetCDF4, or HDF5 Python libraries:

* It only downloads the chunks of data you actually read, so if you don't read all variables or the full spatiotemporal extent of a file, you usually won't spend time downloading those portions of the file
* It parallelizes and optimizes downloads for the portions of files you do read, so download speeds can be faster in general
* It automatically interoperates with Earthdata Login if you have a .netrc file set up
* It is aware of some EOSDIS cloud implementation quirks and provides caching that can save time for repeated requests to individual files

It can also be faster than pointing XArray at NetCDF4 files using s3:// URLs, depending on the file's internal structure, and is often more convenient.

Consider using this library when:
1. The portion of the data file you need to use is much smaller than the full file, e.g. in cases of spatial subsets or reading a single variable from a file containing several
1. s3:// URLs are not readily available
1. Code needs to run outside of the AWS cloud or us-west-2 region or in a hybrid cloud / non-cloud manner
1. s3:// access using XArray seems slower than you would expect (possibly due to unoptimized internal file structure)
1. No readily-available, public, cloud-optimized version of the data exists already. The example we show _is_ also available as an AWS Public Dataset: https://registry.opendata.aws/mur/
1. Adding ".dmrpp" to the end of a data URL returns a file

### Objectives

1. Build on prior knowledge from CMR and Earthdata Login tutorials
2. Work through an example of using the EOSDIS Zarr Store to access data using XArray
3. Learn about the Zarr format and library for accessing data in the cloud
___


## Exercise

In this exercise, we will be using the zarr-eosdis-store library to aggregate and analyze a month of sea surface temperature data for the Great Lakes region.

### Set up

#### Import Required Packages

In [None]:
#

Also set the width / height for plots we show

In [None]:
#

#### Set Dataset, Time, and Region of Interest

Look in PO.DAAC's cloud archive for Group for High Resolution Sea Surface Temperature (GHRSST) Level 4 Multiscale Ultrahigh Resolution (MUR) data

In [None]:
#

Look for data from the month of September over the Great Lakes

In [None]:
#

### Find URLs for the dataset and AOI

Set up a CMR granules search for our area of interest, as we saw in prior tutorials

In [None]:
#
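As a rough sketch of what the search setup might look like, the snippet below builds a CMR granule search URL without sending it. The provider ID, collection short name, year, and bounding box are assumptions chosen for illustration (the text above only names the dataset, the month, and the region), so substitute your own values.

```python
from urllib.parse import urlencode

# All values below are assumptions for illustration, not taken from the tutorial text.
CMR_GRANULES_URL = "https://cmr.earthdata.nasa.gov/search/granules.json"
PROVIDER = "POCLOUD"                  # assumed provider ID for PO.DAAC's cloud archive
SHORT_NAME = "MUR-JPL-L4-GLOB-v4.1"   # assumed short name of the GHRSST L4 MUR collection
START_TIME = "2021-09-01T00:00:00Z"   # the year is an assumption
END_TIME = "2021-09-30T23:59:59Z"
BOUNDING_BOX = "-94,41,-75.5,49.5"    # rough Great Lakes box in W,S,E,N order (assumed)


def build_granule_search_url(page_size=200):
    """Return a CMR granule search URL for the dataset, time range, and region."""
    params = {
        "provider": PROVIDER,
        "short_name": SHORT_NAME,
        "temporal": f"{START_TIME},{END_TIME}",
        "bounding_box": BOUNDING_BOX,
        "page_size": page_size,
    }
    return f"{CMR_GRANULES_URL}?{urlencode(params)}"


search_url = build_granule_search_url()
print(search_url)
```

Because a granule's time range does not necessarily line up with calendar days, the start and end times may need adjusting to get exactly one granule per day.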

Search for granules in our area of interest, expecting one granule per day of September

In [None]:
#
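Once a search response has been fetched and parsed as JSON, the data URLs can be pulled out of each granule's links. The sketch below assumes the usual CMR JSON layout (`feed` → `entry` → `links`, with data links whose `rel` ends in `/data#`); the sample response and its URLs are made up for illustration.

```python
def first_data_url(granule):
    """Return the first HTTPS data link of a CMR granule entry, or None."""
    for link in granule.get("links", []):
        is_data_link = link.get("rel", "").endswith("/data#")
        if is_data_link and link.get("href", "").startswith("https://"):
            return link["href"]
    return None


def data_urls(cmr_response):
    """Collect one data URL per granule from a parsed CMR JSON response."""
    entries = cmr_response.get("feed", {}).get("entry", [])
    urls = [first_data_url(granule) for granule in entries]
    return [url for url in urls if url is not None]


# Made-up response with hypothetical URLs, shaped like a CMR JSON feed.
sample_response = {
    "feed": {
        "entry": [
            {"links": [
                {"rel": "http://esipfed.org/ns/fedsearch/1.1/s3#",
                 "href": "s3://example-bucket/day1.nc"},
                {"rel": "http://esipfed.org/ns/fedsearch/1.1/data#",
                 "href": "https://example.com/day1.nc"},
            ]},
            {"links": [
                {"rel": "http://esipfed.org/ns/fedsearch/1.1/metadata#",
                 "href": "https://example.com/day2.xml"},
            ]},
        ]
    }
}

urls = data_urls(sample_response)
print(urls)
```

Comparing `len(urls)` with the number of days in the month is a quick way to confirm the search returned what was expected.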

In [None]:
#

In [None]:
#

In [None]:
#

### Open and view our AOI without downloading a whole file

#### Check to see if we can use an efficient partial-access technique

In [None]:
#
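One way to run this check is to ask the server whether a `.dmrpp` sidecar exists next to the data file. This is a minimal sketch using only the standard library; it assumes the sidecar is reachable without extra authentication handling, so treat the result as a rough indication rather than a guarantee.

```python
import urllib.request


def dmrpp_url(data_url):
    """Return the URL where the DMR++ sidecar for a data file would live."""
    return f"{data_url}.dmrpp"


def has_dmrpp(data_url, timeout=30):
    """Return True if a HEAD request for the sidecar succeeds, False otherwise."""
    request = urllib.request.Request(dmrpp_url(data_url), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTP errors, unreachable hosts, and timeouts.
        return False


# Hypothetical URL, used only to show the naming convention.
print(dmrpp_url("https://example.com/data/granule.nc"))
```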

Open our first URL using the Zarr library

In [None]:
#
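Opening a file might look like the sketch below, which follows the usage pattern shown in the zarr-eosdis-store README. It assumes the `zarr-eosdis-store` and `xarray` packages are installed and that Earthdata Login credentials are available in a `.netrc` file; `open_granule` is a helper name introduced here for illustration.

```python
def open_granule(url):
    """Open one EOSDIS NetCDF4 granule lazily as an XArray dataset."""
    # Imported inside the function so this sketch can be loaded even when
    # the third-party libraries are not installed.
    from eosdis_store import EosdisStore
    import xarray as xr

    # consolidated=False tells XArray not to look for consolidated Zarr metadata.
    return xr.open_zarr(EosdisStore(url), consolidated=False)


# Example call, assuming `url` holds an HTTPS data URL with a .dmrpp sidecar:
# ds = open_granule(url)
```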

That's it!  No downloads, temporary credentials, or S3 filesystems.  Hereafter, we interact with the `ds` variable as with any XArray dataset.  We need not worry about the EosdisStore anymore.

View the file's variable structure

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

### Aggregate and analyze 30 files

Set up a function to open all of our URLs as XArray datasets in parallel

In [None]:
#
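A minimal way to parallelize the opening step is a thread pool, which suits work that mostly waits on the network. The sketch below uses the standard library's `ThreadPoolExecutor` (a choice made here, not necessarily what the original notebook used) and takes the opener as a parameter, so it can be tried with a stand-in before pointing it at real URLs.

```python
from concurrent.futures import ThreadPoolExecutor


def open_all(urls, opener, max_workers=8):
    """Apply `opener` to every URL in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(opener, urls))


# Stand-in opener for demonstration; a real one would return an XArray dataset.
opened = open_all(["a", "b", "c"], str.upper)
print(opened)
```

In practice, `opener` would be a function that wraps something like `xr.open_zarr(EosdisStore(url), consolidated=False)`.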

Combine the individual file-based datasets into a single xarray dataset with a time axis

In [None]:
#

Look at the Analysed SST variable metadata

In [None]:
#

Create a dataset / variable that is only our area of interest and view its metadata

In [None]:
#

XArray reads data lazily, i.e. only when our code actually needs it.  Up to this point, we haven't read any data values, only metadata.  The next line will force XArray to read the portions of the source files containing our area of interest.  Behind the scenes, the zarr-eosdis-store library is ensuring data is fetched as efficiently as possible.

Note: This line isn't strictly necessary, since XArray will automatically read the data we need the first time our code tries to use it, but calling this will make sure that we can read the data multiple times later on without re-fetching anything from the source files.

This line will take several seconds to complete, but since it is retrieving only about 50 MB of data from 22 GB of source files, several seconds constitutes a significant time, bandwidth, and disk space savings.

In [None]:
#
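The combine, subset, and load steps described above could be sketched as follows. The variable name `analysed_sst`, the coordinate names `lat`, `lon`, and `time`, and the latitude/longitude ranges are assumptions about the dataset and region; `load_aoi_sst` is a helper name introduced here for illustration.

```python
LATS = slice(41, 49.5)     # assumed Great Lakes latitude range, south to north
LONS = slice(-94, -75.5)   # assumed Great Lakes longitude range, west to east


def load_aoi_sst(datasets, lats=LATS, lons=LONS):
    """Concatenate per-file datasets along time, subset to the AOI, and load values."""
    # Imported lazily so this sketch can be loaded without XArray installed.
    import xarray as xr

    combined = xr.concat(datasets, dim="time")               # still lazy: metadata only
    aoi = combined["analysed_sst"].sel(lat=lats, lon=lons)   # still lazy: just a view
    return aoi.load()                                        # data is fetched here


# Example call, assuming `datasets` is a list of XArray datasets, one per file:
# sst = load_aoi_sst(datasets)
```

Only the final `.load()` call triggers network reads, and only for chunks that overlap the selected region.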

Now we can start looking at aggregations across the time dimension.  In this case, plot the standard deviation of the temperature at each point to get a visual sense of how much temperatures fluctuate over the course of the month.

In [None]:
#
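To make concrete what "standard deviation at each point across time" means, here is a pure-Python sketch on a tiny made-up stack of grids. With an XArray variable the equivalent would be roughly `sst.std("time")`; the values and the helper name `std_over_time` below are invented for illustration.

```python
from statistics import pstdev


def std_over_time(stack):
    """Given stack[t][y][x], return a 2-D grid of population standard deviations over t."""
    n_rows = len(stack[0])
    n_cols = len(stack[0][0])
    return [
        [pstdev([frame[y][x] for frame in stack]) for x in range(n_cols)]
        for y in range(n_rows)
    ]


# Three made-up 2x2 time steps.
stack = [
    [[10.0, 12.0], [14.0, 16.0]],
    [[10.0, 14.0], [14.0, 20.0]],
    [[10.0, 16.0], [14.0, 24.0]],
]

result = std_over_time(stack)
print(result)
```

Points whose value never changes get a standard deviation of zero, while points with larger swings stand out, which is what the plotted map conveys.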

#### Interactive animation of a month of data

This section isn't as important to fully understand.  It shows us a way to get an interactive animation to see what we have retrieved so far

Define an animation function to plot the `i`th time step.  We need to make sure each plot uses the same color scale, set by `vmin` and `vmax`, so the animation is consistent.

In [None]:
#
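The key idea is to compute the color limits once, over all time steps, rather than letting each frame pick its own. The pure-Python sketch below shows that computation on made-up frames, skipping NaN values such as those that might mark land; `color_limits` is a helper name introduced here.

```python
import math


def color_limits(frames):
    """Return (vmin, vmax) over every non-NaN value in frames[t][y][x]."""
    values = [
        value
        for frame in frames
        for row in frame
        for value in row
        if not math.isnan(value)
    ]
    return min(values), max(values)


# Two made-up 2x2 frames with NaN standing in for missing data.
frames = [
    [[280.0, 285.0], [float("nan"), 290.0]],
    [[279.5, 284.0], [float("nan"), 293.5]],
]

vmin, vmax = color_limits(frames)
print(vmin, vmax)
```

With an XArray variable `sst`, similar limits could be obtained with `float(sst.min())` and `float(sst.max())`, then passed to each frame's plot call, for example `sst.isel(time=i).plot(vmin=vmin, vmax=vmax)`.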

Render each time slice once and show it as an HTML animation with interactive controls

In [None]:
#

### Supplemental: What's happening here?

For EOSDIS data in the cloud, we have begun producing a metadata sidecar file in a format called DMR++.  It extracts all of the information about arrays, variables, and dimensions from data files, as well as the byte offsets in the NetCDF4 file where data can be found.  This information is sufficient to let the Zarr library read data from our NetCDF4 files, but it's in the wrong format.  zarr-eosdis-store knows how to fetch the sidecar file and transform it into something the Zarr library understands.  Passing an `EosdisStore` to XArray or the Zarr library lets these libraries interact with EOSDIS data files exactly as if they were Zarr stores, in a way that's better suited to reading data in the cloud.  Beyond this, the zarr-eosdis-store library makes some optimizations in the way it reads data to help make up for situations where the NetCDF4 file is not internally arranged well for cloud-based access patterns.
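To illustrate the kind of translation involved, the sketch below turns chunk entries from a hand-written, heavily simplified XML snippet (loosely modelled on DMR++; real sidecar files contain far more and may be laid out differently) into Zarr-style chunk keys mapped to byte ranges. It is not the library's actual implementation, and the function names, offsets, and sizes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sidecar fragment; all numbers are invented.
SAMPLE = """
<Float32 name="analysed_sst" xmlns:dmrpp="http://xml.opendap.org/dap/dmrpp/1.0.0#">
  <dmrpp:chunks>
    <dmrpp:chunkDimensionSizes>1 100 200</dmrpp:chunkDimensionSizes>
    <dmrpp:chunk offset="4096" nBytes="1500" chunkPositionInArray="[0,0,0]"/>
    <dmrpp:chunk offset="5596" nBytes="1720" chunkPositionInArray="[0,0,200]"/>
    <dmrpp:chunk offset="7316" nBytes="1610" chunkPositionInArray="[0,100,0]"/>
  </dmrpp:chunks>
</Float32>
"""
NS = {"dmrpp": "http://xml.opendap.org/dap/dmrpp/1.0.0#"}


def chunk_byte_ranges(xml_text):
    """Map Zarr-style chunk keys (e.g. 'name/0.0.1') to (offset, nbytes) pairs."""
    root = ET.fromstring(xml_text.strip())
    name = root.get("name")
    chunks = root.find("dmrpp:chunks", NS)
    sizes = [int(s) for s in chunks.find("dmrpp:chunkDimensionSizes", NS).text.split()]
    ranges = {}
    for chunk in chunks.findall("dmrpp:chunk", NS):
        position = [int(p) for p in chunk.get("chunkPositionInArray").strip("[]").split(",")]
        # Element position divided by chunk size gives the chunk's grid index.
        index = ".".join(str(p // size) for p, size in zip(position, sizes))
        ranges[f"{name}/{index}"] = (int(chunk.get("offset")), int(chunk.get("nBytes")))
    return ranges


def range_header(offset, nbytes):
    """Build the HTTP Range header value that fetches exactly one chunk."""
    return f"bytes={offset}-{offset + nbytes - 1}"


ranges = chunk_byte_ranges(SAMPLE)
print(ranges)
print(range_header(*ranges["analysed_sst/0.0.1"]))
```

With a mapping like this, a reader that asks for one chunk can be served by a single ranged HTTP request against the original NetCDF4 file, which is why only the needed portions of a file have to be transferred.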
