# Zarr in Practice

This notebook demonstrates how to create, explore and modify a Zarr store.

It also links to public Zarr stores for geospatial data and shows examples of using them.

## How to create a Zarr store

In [1]:
import numpy as np
import sys
import xarray as xr
import zarr

# Here we create a simple Zarr store.
zstore = zarr.array(np.arange(10))

This is an in-memory Zarr store. To persist it to disk, we can use `.save`.

In [None]:
zarr.save("test.zarr", zstore)

We can look at the metadata for this dataset, which gives us some interesting information. It has a shape of 10 and chunks of 10, so we know all the data was stored in 1 chunk, and it was compressed with the `blosc` compressor.

In [None]:
!cat test.zarr/.zarray 
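
The same metadata can also be read from Python. The sketch below assumes the Zarr v2 on-disk layout, where array metadata is a JSON document named `.zarray`. The `sample` text is a hand-written illustration that mirrors the fields described above (shape 10, chunks 10, Blosc compressor), not output captured from this notebook, and `summarize_zarray` is a hypothetical helper.

```python
import json
import math

# Hand-written sample of a Zarr v2 `.zarray` document (illustrative only).
sample = """
{
  "shape": [10],
  "chunks": [10],
  "dtype": "<i8",
  "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
  "zarr_format": 2
}
"""

def summarize_zarray(text):
    """Return shape, chunk shape, chunk count, and compressor id from .zarray JSON."""
    meta = json.loads(text)
    # Count chunks per dimension (rounding up), then multiply across dimensions.
    n_chunks = math.prod(
        math.ceil(s / c) for s, c in zip(meta["shape"], meta["chunks"])
    )
    compressor = meta.get("compressor") or {}
    return {
        "shape": tuple(meta["shape"]),
        "chunks": tuple(meta["chunks"]),
        "n_chunks": n_chunks,
        "compressor": compressor.get("id"),
    }

summary = summarize_zarray(sample)
print(summary)
```

To try it on the store saved above, pass the text of `test.zarr/.zarray` instead of `sample`.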

This was a pretty basic example. Let's explore the other things we might want to do when creating Zarr stores.

## How to create a group

In [6]:
root = zarr.group()
group1 = root.create_group('group1')
group2 = root.create_group('group2')
z1 = group1.create_dataset('ds_in_group', shape=(100,100), chunks=(10,10), dtype='i4')
z2 = group2.create_dataset('ds_in_group', shape=(1000,1000), chunks=(10,10), dtype='i4')
tree = zarr.util.TreeViewer(root)

In [7]:
print(tree)

/
 ├── group1
 │   └── ds_in_group (100, 100) int32
 └── group2
     └── ds_in_group (1000, 1000) int32

## How to Examine and Modify the Chunk Shape

If your data is sufficiently large, Zarr will choose a chunk size for you.

In [None]:
zarr_no_chunks = zarr.array(np.arange(100), chunks=True)
zarr_no_chunks.chunks, zarr_no_chunks.shape

In [None]:
zarr_with_chunks = zarr.array(np.arange(10000000), chunks=True)
zarr_with_chunks.chunks, zarr_with_chunks.shape

For `zarr_with_chunks` we see the chunks are smaller than the shape, so we know the data has been chunked. Other ways to examine the chunk structure are the array's `.info` and `.cdata_shape` attributes.

In [None]:
?zarr_no_chunks.cdata_shape

In [None]:
zarr_no_chunks.cdata_shape, zarr_with_chunks.cdata_shape

For `zarr_with_chunks`, we see it has 64 chunks, and we can verify that the number of chunks multiplied by the chunk size equals the length of the whole array.

In [None]:
zarr_with_chunks.cdata_shape[0] * zarr_with_chunks.chunks[0] == zarr_with_chunks.shape[0]
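
The equality above holds because the chunk size divides the array length evenly. In general the last chunk along a dimension may be partial, so the number of chunks is the ceiling of the shape divided by the chunk size. Here is a minimal sketch of that arithmetic, where `chunk_grid` is a hypothetical helper and the chunk length of 156,250 is inferred from the 64 chunks reported above (10,000,000 / 64):

```python
import math

def chunk_grid(shape, chunks):
    """Number of chunks along each dimension, counting a partial final chunk."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))

# Evenly divisible: 64 chunks of 156,250 elements cover 10,000,000 elements.
even = chunk_grid((10_000_000,), (156_250,))

# Not evenly divisible: the fourth chunk holds only the last 10 elements.
uneven = chunk_grid((100,), (30,))

print(even, uneven)
```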

### What's the storage size of these chunks?

The default chunks are pretty small.

In [None]:
sys.getsizeof(zarr_with_chunks.chunk_store['0']) # this is in bytes
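
The key `'0'` used above is the name of the first chunk. Assuming the Zarr v2 default layout, where a chunk key is the chunk's grid indices joined by `.`, the key for the chunk holding a given element can be worked out as follows. `chunk_key` is a hypothetical helper, not part of the Zarr API, and the chunk length of 156,250 is inferred from the 64 chunks reported earlier.

```python
def chunk_key(index, chunks, separator="."):
    """Key of the chunk that contains the element at `index`."""
    return separator.join(str(i // c) for i, c in zip(index, chunks))

# 1-D array with chunks of 156,250 elements: element 0 is in chunk "0",
# element 200,000 is in chunk "1".
first = chunk_key((0,), (156_250,))
second = chunk_key((200_000,), (156_250,))

# 2-D array with (10, 10) chunks: element (25, 7) is in chunk "2.0".
two_d = chunk_key((25, 7), (10, 10))

print(first, second, two_d)
```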

In [None]:
zarr_with_big_chunks = zarr.array(np.arange(10000000), chunks=(500000,))

In [None]:
zarr_with_big_chunks.chunks, zarr_with_big_chunks.shape, zarr_with_big_chunks.cdata_shape

In [None]:
sys.getsizeof(zarr_with_big_chunks.chunk_store['0'])

These chunks are still pretty small, but this is just a silly example. In the real world, you will likely want to deal in Zarr chunks of 1MB or greater, especially when dealing with remote storage options where data is read over a network and the number of requests should be minimized.
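
To get a feel for chunk sizing, note that the uncompressed size of a chunk is its element count times the item size. The sketch below assumes 8-byte integers (the usual default for `np.arange` on 64-bit Linux and macOS) and uses hypothetical helper names. Stored chunks will be smaller once compressed.

```python
import math

ITEMSIZE = 8  # bytes per element, assuming int64

def chunk_nbytes(chunks, itemsize=ITEMSIZE):
    """Uncompressed size of one chunk in bytes."""
    return math.prod(chunks) * itemsize

def chunk_len_for_target(target_bytes, itemsize=ITEMSIZE):
    """Largest 1-D chunk length whose uncompressed size fits in target_bytes."""
    return max(1, target_bytes // itemsize)

# Uncompressed bytes in one chunk of 500,000 elements.
big_chunk_bytes = chunk_nbytes((500_000,))

# Chunks (and therefore reads) needed to fetch all 10,000,000 elements.
n_chunks_to_read = math.ceil(10_000_000 / 500_000)

# Elements per chunk when aiming for 1 MiB of uncompressed data.
one_mib_chunk = chunk_len_for_target(1024 * 1024)

print(big_chunk_bytes, n_chunks_to_read, one_mib_chunk)
```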

## Exploring and Modifying Data Compression

Following from the example above, we can tell that Zarr has also compressed the data for us by looking at the array's `.info` or `.compressor` attribute.

In [None]:
zarr_with_chunks.compressor

The `Blosc` compressor is a meta-compressor: it wraps several internal compression algorithms. In this case, it used `lz4` compression. We can also explore how much space was saved by using this compression method.

In [None]:
zarr_with_chunks.info

We can see from the storage ratio above that compression has made our data 155 times smaller 😱.

You can set `compressor=None` when creating your array to turn off this behavior, but I'm not sure why you would do that.

Let's see what happens when we use a different compression method. We can check out a full list of numcodecs compressors here: [https://numcodecs.readthedocs.io/](https://numcodecs.readthedocs.io/).

In [None]:
from numcodecs import GZip
compressor = GZip()
zstore_gzip_compressed = zarr.array(np.arange(10000000), chunks=True, compressor=compressor)
zstore_gzip_compressed.info

In this case, the storage ratio is 5.3, so not as good! How to choose a compression algorithm is a topic for future investigation.
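
The storage ratio reported by `.info` is the uncompressed size divided by the stored size. The sketch below illustrates that calculation with codecs from the Python standard library (`zlib`, `bz2`, `lzma`) standing in for the numcodecs compressors. It is a toy under those assumptions, not a benchmark: the ratios it prints will differ from the Blosc and GZip figures above, and `storage_ratio` is a hypothetical helper.

```python
import bz2
import lzma
import struct
import zlib

# 100,000 consecutive integers packed as little-endian int64, similar in
# spirit to the bytes of a NumPy integer range.
n = 100_000
raw = struct.pack(f"<{n}q", *range(n))

def storage_ratio(raw_bytes, compress):
    """Uncompressed size divided by compressed size."""
    return len(raw_bytes) / len(compress(raw_bytes))

codecs = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}
ratios = {name: storage_ratio(raw, fn) for name, fn in codecs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}")
```

Running the same comparison on a sample of your own data is a reasonable first step when picking a codec, since ratios depend heavily on the data.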

## Consolidating metadata

It's important to consolidate metadata. So far we have only dealt with single-array Zarr stores. In this next example, we will create a Zarr store with multiple arrays and then consolidate its metadata. The speedup is insignificant with local storage but becomes significant with remote storage, which we will see in the following example on accessing cloud storage.

In [None]:
zarr_store = 'example.zarr'
# Open the root group directly on a directory store so everything is written to disk
store = zarr.DirectoryStore(zarr_store)
root = zarr.group(store=store, overwrite=True)
# Let's create many groups and many arrays
num_groups, num_arrays_per_group = 100, 100
for i in range(num_groups):
    group = root.create_group(f'group-{i}')
    for j in range(num_arrays_per_group):
        group.create_dataset(f'array-{j}', shape=(1000,1000), dtype='i4')

In [None]:
!cat {zarr_store}/.zmetadata

In [None]:
zarr.consolidate_metadata(zarr_store)

In [None]:
zarr.open_consolidated(zarr_store)

In [None]:
!cat {zarr_store}/.zmetadata
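
Conceptually, consolidation gathers every small metadata document in the hierarchy into one file, so a reader needs a single request instead of one per group and array. The sketch below mimics that idea with the standard library on a tiny throwaway directory. It assumes the Zarr v2 file names (`.zgroup`, `.zarray`, `.zattrs`) and is not Zarr's actual implementation; `consolidate` and the directory contents are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path

METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}

def consolidate(root):
    """Collect all metadata documents under `root` into one dictionary."""
    root = Path(root)
    merged = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name in METADATA_NAMES:
                path = Path(dirpath) / name
                # Key each document by its path relative to the store root.
                key = path.relative_to(root).as_posix()
                merged[key] = json.loads(path.read_text())
    return {"metadata": merged}

with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp)
    # Build a toy hierarchy: one root group containing two arrays.
    (store_dir / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
    for name in ("array-0", "array-1"):
        (store_dir / name).mkdir()
        (store_dir / name / ".zarray").write_text(
            json.dumps({"shape": [1000, 1000], "zarr_format": 2})
        )
    consolidated = consolidate(store_dir)

print(sorted(consolidated["metadata"]))
```

For scale, a hierarchy of 100 groups with 100 arrays each has 1 + 100 + 10,000 = 10,101 metadata documents, which is why a single consolidated read matters so much over a network.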

## Example of Cloud-Optimized Access for this Format

Fortunately, there are many publicly accessible cloud archives of Zarr data.

Zarr provides storage backends for the cloud providers that host the archives below: [Zarr Tutorial - Distributed/cloud storage](https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage).

Here are a few we are aware of:

* [Zarr data in Microsoft's Planetary Computer](https://planetarycomputer.microsoft.com/catalog?filter=zarr)
* [Zarr data from Google](https://console.cloud.google.com/marketplace/browse?filter=solution-type:dataset&_ga=2.226354714.1000882083.1692116148-1788942020.1692116148&pli=1&q=zarr)
* [Amazon Sustainability Data Initiative available from Registry of Open Data on AWS](https://registry.opendata.aws/collab/asdi/) - Enter "Zarr" in the Search input box.
* [Pangeo-Forge Data Catalog](https://pangeo-forge.org/catalog)

The Pangeo-Forge Data Catalog provides handy examples of how to open each dataset, for example, from the [Global Precipitation Climatology Project (GPCP)](https://pangeo-forge.org/dashboard/feedstock/42) page:

In [None]:
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'

In [None]:
ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)
ds

Microsoft's Planetary Computer goes above and beyond, providing tutorials alongside each dataset. We recommend exploring these on your own to get an idea of what you can do with Zarr and Xarray. See all tutorials here: [microsoft/PlanetaryComputerExamples](https://github.com/microsoft/PlanetaryComputerExamples/tree/main/tutorials). Note, this repo contains ALL tutorials, not just Zarr tutorials, so you may want to filter for Zarr.

For example, from https://planetarycomputer.microsoft.com/dataset/daymet-daily-pr#Example-Notebook:

In [None]:
import cartopy.crs as ccrs
import fsspec
import matplotlib.pyplot as plt
import pystac
import xarray as xr
import warnings

warnings.simplefilter("ignore", RuntimeWarning)

In [None]:
url = "https://planetarycomputer.microsoft.com/api/stac/v1/collections/daymet-daily-hi"
collection = pystac.read_file(url)
asset = collection.assets["zarr-https"]
store = fsspec.get_mapper(asset.href)
ds = xr.open_zarr(store, **asset.extra_fields["xarray:open_kwargs"])
ds

## Additional Resources

* Jupyter Notebook for a high level overview of Zarr on Google Cloud: [![image](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/tyson-swetnam/agic-2022/blob/main/docs/notebooks/zarr.ipynb)