# OME-Zarr Storage Backends

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Storage_Backends.ipynb)

## Learning Objectives

- Understand that Zarr is designed for **different storage backends**
- Learn the Zarr storage backend interface
    * A **map of string keys to bytes values**
- Identify storage backend options and some of their advantages
    * **Filesystem directory stores**: *local, parallel creation*
    * **Zip stores**: *limit inodes*
    * **Cloud object stores**: Utilize S3, GCS, Azure Blob
    * **IPFS, IPLD stores**: Web3 decentralized, content-addressed stores
    * **In-memory stores**: Cached store interface
    * **Zarr interface adapters**: Keep data in native format, interface with OME-Zarr tooling

In [1]:
import sys

!{sys.executable} -m pip install -q zarr 'fsspec[s3]' ngff-zarr rich multiscale-spatial-image tifffile pooch tqdm

In [2]:
import fsspec
import tifffile
from multiscale_spatial_image import to_multiscale, Methods
from spatial_image import to_spatial_image
import zarr
import pooch
from rich import print

## The Zarr Storage Interface

In the [previous section](../OME-Zarr_Structure.ipynb), we learned that OME-Zarr is a chunked, multiscale scientific image data structure built on the Zarr format. The Zarr format is composed of JSON-compatible metadata and multidimensional arrays stored as binary blobs.

In this section, we will learn that the way this data is stored is extremely flexible and based on a simple interface.

Let's download a TIFF image and generate an OME-Zarr from it. 

In [3]:
tiff_url = 'https://s3.us-west-2.amazonaws.com/aind-open-data/SmartSPIM_614952_2023-03-07_15-39-19/derivatives/Ex_445_Em_469_MIP/442060/442060_177040/040000.tiff'
file_name = pooch.retrieve(tiff_url,
                           fname='image.tiff',
                           known_hash='529bc0c4e5de8fe3386f0be0a7f88c3a59ee10dff8ea7605b0f7ccd0780ea78e',
                           progressbar=True)

In [4]:
pixel_data = tifffile.imread(file_name)

In [5]:
spatial_image = to_spatial_image(pixel_data)
multiscales = to_multiscale(spatial_image, scale_factors=[2, 2, 2])
print(multiscales)

We have an OME-Zarr multiscale image data structure, *what do we write it to*?

In Zarr, the data is written to a `store`. In Python, a `store` [provides a `MutableMapping` interface](https://zarr.readthedocs.io/en/stable/api/storage.html). That is, a structure that maps `str` keys to `bytes` values.

The canonical MutableMapping in Python is a dictionary. If we use a `dict` as the store, the OME-Zarr is stored in memory.

In [6]:
store = dict()

multiscales.to_zarr(store)

print('key type', 'value type', 'key', 'length of value')
from itertools import islice
for k, v in islice(store.items(), 6):
    print(type(k), type(v), k, len(v))

Keys in the store are the paths that the metadata files or array chunk files would have if a filesystem store were used.

Values are always bytes, even if the content of the value is a JSON string.
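
To make the interface concrete, here is a minimal sketch of a custom in-memory store built only on the standard library. The class name `BytesOnlyStore` and the chunk key `scale0/image/0.0` are hypothetical and chosen for illustration; `.zgroup` is the group metadata key used by the Zarr v2 format. The sketch only shows the shape of the mapping, not how any particular Zarr store is implemented.

```python
import json
from collections.abc import MutableMapping


class BytesOnlyStore(MutableMapping):
    """Hypothetical in-memory store that enforces str keys and bytes values."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise TypeError("store keys must be str")
        if not isinstance(value, (bytes, bytearray)):
            raise TypeError("store values must be bytes")
        self._data[key] = bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


demo_store = BytesOnlyStore()
# Metadata is JSON, but it is encoded to bytes before it enters the store.
demo_store[".zgroup"] = json.dumps({"zarr_format": 2}).encode("utf-8")
# An array chunk is an opaque binary blob (illustrative key and content).
demo_store["scale0/image/0.0"] = bytes(16)

# Reading metadata back means decoding the bytes, then parsing the JSON.
group_metadata = json.loads(demo_store[".zgroup"].decode("utf-8"))
print(group_metadata, sorted(demo_store))
```

Because the class implements the five abstract methods of `MutableMapping`, methods such as `items()`, `update()`, and `in` come for free.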

## Storage Backend Options

In this section, we will survey a few of the storage backend options and their advantages.

### Filesystem Directory Stores

A string specifying a path to the local filesystem is a common store. In this case, the data is stored in a directory.

This store is useful for local creation and for parallel writing, because components of the store are written as separate files as they are created.

In [7]:
multiscales.to_zarr('image.zarr')

In [8]:
# Equivalent
from zarr.storage import DirectoryStore
store = DirectoryStore('image.zarr')
multiscales.to_zarr(store)
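
The correspondence between store keys and file paths can be sketched with the standard library alone. The two helper functions and the example keys below are hypothetical, and this is not how `DirectoryStore` itself is implemented; the sketch only illustrates that a `str` to `bytes` mapping and a directory tree carry the same information.

```python
import os
import tempfile
from pathlib import Path


def write_mapping_to_directory(store, root):
    """Write each key of a str -> bytes mapping as a file below root."""
    root = Path(root)
    for key, value in store.items():
        path = root.joinpath(*key.split("/"))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


def read_directory_as_mapping(root):
    """Read every file below root back into a str -> bytes dict."""
    root = Path(root)
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            # Keys always use "/" as the separator, whatever the platform.
            mapping[path.relative_to(root).as_posix()] = path.read_bytes()
    return mapping


# Illustrative keys and values, not the output of a real conversion.
example_store = {
    ".zgroup": b'{"zarr_format": 2}',
    "scale0/image/0.0": bytes(8),
}

with tempfile.TemporaryDirectory() as tmp:
    write_mapping_to_directory(example_store, tmp)
    round_trip = read_directory_as_mapping(tmp)

print(round_trip == example_store)
```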

### Zip Stores

Storing all data in a single file, e.g. an uncompressed zip file, reduces the number of files and directories on the filesystem. This can be created directly or by first creating a directory store, then zipping the result.

In [11]:
from zarr.storage import ZipStore
store = ZipStore('image.zarr.zip', mode='w')
multiscales.to_zarr(store)
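
The second route, zipping an existing directory store, might look like the following sketch. It uses only the standard library; the function name `zip_directory_uncompressed` and the small stand-in directory are hypothetical. It assumes that archive member names should be relative to the store root so that they match the store keys, and it uses `ZIP_STORED` so that entries are stored without additional compression.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def zip_directory_uncompressed(directory, zip_path):
    """Archive every file below directory without compression."""
    directory = Path(directory)
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for dirpath, _dirnames, filenames in os.walk(directory):
            for name in filenames:
                path = Path(dirpath) / name
                # Member names are relative to the store root.
                archive.write(path, arcname=path.relative_to(directory).as_posix())


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for a directory store (illustrative content).
    source = Path(tmp) / "example.zarr"
    (source / "scale0" / "image").mkdir(parents=True)
    (source / ".zgroup").write_bytes(b'{"zarr_format": 2}')
    (source / "scale0" / "image" / "0.0").write_bytes(bytes(8))

    target = Path(tmp) / "example.zarr.zip"
    zip_directory_uncompressed(source, target)

    with zipfile.ZipFile(target) as archive:
        archived_keys = sorted(archive.namelist())
        compress_types = {info.compress_type for info in archive.infolist()}

print(archived_keys)
```

Two files and two directories on disk become a single file, which is the inode saving mentioned in the learning objectives.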

### Cloud Object Stores

Data can be stored naturally as objects in a cloud object store, e.g. AWS S3 buckets, Google Cloud Storage (GCS), Azure Blob Storage, or a locally hosted MinIO object store service.

The [fsspec](https://filesystem-spec.readthedocs.io/en/latest/#) library provides an abstraction over cloud storage and other types of stores using a protocol + options interface. Optional backends are selected as extras when the package is installed; we installed `fsspec[s3]`.

In [12]:
from fsspec.registry import known_implementations

known_implementations

{'file': {'class': 'fsspec.implementations.local.LocalFileSystem'},
 'memory': {'class': 'fsspec.implementations.memory.MemoryFileSystem'},
 'dropbox': {'class': 'dropboxdrivefs.DropboxDriveFileSystem',
  'err': 'DropboxFileSystem requires "dropboxdrivefs","requests" and "dropbox" to be installed'},
 'http': {'class': 'fsspec.implementations.http.HTTPFileSystem',
  'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
 'https': {'class': 'fsspec.implementations.http.HTTPFileSystem',
  'err': 'HTTPFileSystem requires "requests" and "aiohttp" to be installed'},
 'zip': {'class': 'fsspec.implementations.zip.ZipFileSystem'},
 'tar': {'class': 'fsspec.implementations.tar.TarFileSystem'},
 'gcs': {'class': 'gcsfs.GCSFileSystem',
  'err': 'Please install gcsfs to access Google Storage'},
 'gs': {'class': 'gcsfs.GCSFileSystem',
  'err': 'Please install gcsfs to access Google Storage'},
 'gdrive': {'class': 'gdrivefs.GoogleDriveFileSystem',
  'err': 'Please install gdrivef

In [13]:
from zarr.storage import FSStore
import s3fs

fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name="us-west-2"))
store = FSStore('aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr',
                fs=fs)
root = zarr.open_group(store)
print(root.info)

### HTTPS Stores

HTTPS stores are a universal way to share read-only data without access control barriers.

In [14]:
from ngff_zarr import from_ngff_zarr

fsstore = FSStore('https://s3.us-west-2.amazonaws.com/aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr')
print(from_ngff_zarr(fsstore))

### Web3 Stores

For distributed, content-addressed storage, there is support for [Web3](https://en.wikipedia.org/wiki/Web3) storage through [ipfsspec](https://github.com/fsspec/ipfsspec) and [ipldstore](https://github.com/d70-t/ipldstore).
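
As a rough illustration of what "content-addressed" means, the sketch below stores each value under a hash of its bytes, so identical chunks are stored only once. The class `ContentAddressedBlobs` and the keys are hypothetical, and plain SHA-256 hex digests are an assumption made for simplicity; IPFS and IPLD use their own content identifiers and data structures, which this sketch does not reproduce.

```python
import hashlib


class ContentAddressedBlobs:
    """Hypothetical toy store: values are addressed by the hash of their bytes."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.names = {}  # store key -> digest

    def put(self, key, value):
        digest = hashlib.sha256(value).hexdigest()
        self.blobs[digest] = value  # identical content lands on the same digest
        self.names[key] = digest
        return digest

    def get(self, key):
        return self.blobs[self.names[key]]


cas = ContentAddressedBlobs()
# Two chunks with identical content (illustrative keys).
first_digest = cas.put("scale0/image/0.0", bytes(8))
second_digest = cas.put("scale0/image/0.1", bytes(8))

print(first_digest == second_digest, len(cas.names), len(cas.blobs))
```

Because the address is derived from the content, a reader can verify a retrieved blob by hashing it again, regardless of which peer served it.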

### Zarr Storage Adapters

Finally, stores can be created that act as adapters, i.e. they transform the underlying data before providing the expected storage interface.

This can be used, for example, to [consolidate store contents](https://github.com/thewtex/shardedstore) so there are not too many small files or a single extremely large file.

Or, views [[1]](https://github.com/fsspec/kerchunk) [[2]](https://github.com/d70-t/preffs) can be provided into existing datasets so they provide both the Zarr storage interfaces and the expected Zarr keys and values.

The example below views our TIFF file directly as an OME-Zarr store without data conversions or copies.

In [15]:
import tifffile

store = tifffile.imread(file_name, aszarr=True, multiscales=True)
root = zarr.open_group(store)
print(root.info)
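
The adapter idea can also be sketched without any third-party library. The hypothetical `PrefixView` class below presents one sub-tree of a parent store as if it were a standalone, read-only store, translating keys on the fly and copying no bytes. It is an illustration of the pattern only, not how tifffile, kerchunk, or the other linked projects are implemented.

```python
from collections.abc import Mapping


class PrefixView(Mapping):
    """Hypothetical read-only adapter exposing one sub-tree of a parent store."""

    def __init__(self, parent, prefix):
        self._parent = parent
        self._prefix = prefix.rstrip("/") + "/"

    def __getitem__(self, key):
        # Translate the view's key into the parent's key space.
        return self._parent[self._prefix + key]

    def __iter__(self):
        for key in self._parent:
            if key.startswith(self._prefix):
                yield key[len(self._prefix):]

    def __len__(self):
        return sum(1 for _ in self)


# Illustrative parent store with two scales.
parent = {
    ".zgroup": b"{}",
    "scale0/.zgroup": b"{}",
    "scale0/image/0.0": bytes(8),
    "scale1/image/0.0": bytes(2),
}

view = PrefixView(parent, "scale0")
view_keys = sorted(view)
print(view_keys)
```

The view returns the very same `bytes` objects held by the parent, which is the sense in which an adapter avoids conversions and copies.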

## Exercises

### Exercise 1: Key counts

Count the keys generated by our original multiscales Zarr serialization. Why might a large number of keys cause problems when stored as a directory store?

In [36]:
# Solution: https://github.com/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Storage_Backends_Exercise_1_Solution.py
# %load OME-Zarr_Storage_Backends_Exercise_1_Solution.py

### Exercise 2: Chunk directory count

Multidimensional arrays, i.e. the multidimensional image pixel data, are stored in chunks. The dimensions in a chunk key can be separated by a `.`, which places all chunks of an array in a single directory, or by a `/`, which places them in nested directories. Count the number of files in each case. Which approach limits the number of files in a directory, which results in more efficient access?

In [22]:
file_separator_store = DirectoryStore('file_separated.zarr', dimension_separator='.')
multiscales.to_zarr(file_separator_store)

directory_separator_store = DirectoryStore('directory_separated.zarr', dimension_separator='/')
multiscales.to_zarr(directory_separator_store)

In [39]:
# Solution: https://github.com/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Storage_Backends_Exercise_2_Solution.py
# %load OME-Zarr_Storage_Backends_Exercise_2_Solution.py