# OME-Zarr Structure

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Structure.ipynb)

## Learning Objectives

- Identify the basic components of Zarr
    * **JSON-compatible metadata**
    * **Multidimensional arrays stored as binary blobs**
- Identify the structure of an OME-Zarr
    * **Chunked, multiscale image(s)**
    * **Image metadata**

### Install and import notebook dependencies

In [1]:
import sys

!{sys.executable} -m pip install -q zarr s3fs ome-zarr ngff-zarr

In [2]:
import s3fs
import zarr
import ome_zarr
import ngff_zarr
from rich import print

This notebook works on public SmartSPIM mouse brain data from AWS S3. Data is made available by the Allen Institute for Neural Dynamics (AIND).

### Zarr Contents

The **Zarr format** is based on

1. **JSON-compatible metadata** - in Zarr version 2, these are stored in *.zattrs*, *.zgroup*, *.zarray*, and *.zmetadata* files.
2. **A hierarchical structure** - similar to the folder structure in a filesystem
3. **Multidimensional binary arrays** - stored as compressed subarray chunks

See also the [Zarr documentation](https://zarr.dev/).
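
To make these three ingredients concrete, here is a minimal sketch that treats a Zarr version 2 hierarchy as a flat mapping from path-like keys to bytes. The key names and values are hypothetical (modelled on the layout explored later in this notebook), and `classify_key` is an illustrative helper, not part of the zarr library.

```python
import json

# Hypothetical Zarr v2 store contents: keys are paths, values are bytes.
store_demo = {
    ".zgroup": json.dumps({"zarr_format": 2}).encode(),
    ".zattrs": json.dumps({"description": "example group"}).encode(),
    "0/.zarray": b"{}",        # array metadata (contents elided in this sketch)
    "0/0/0": b"\x00\x01",      # a compressed chunk (placeholder bytes)
    "0/0/1": b"\x02\x03",
}

# File names that hold JSON metadata in Zarr version 2.
METADATA_NAMES = {".zgroup", ".zattrs", ".zarray", ".zmetadata"}


def classify_key(key):
    """Return 'metadata' for JSON metadata documents, otherwise 'chunk'."""
    name = key.rsplit("/", 1)[-1]
    return "metadata" if name in METADATA_NAMES else "chunk"


kinds = {key: classify_key(key) for key in store_demo}
for key, kind in kinds.items():
    print(f"{key:12s} {kind}")

# Group metadata is plain JSON and can be parsed with the standard library.
group_meta = json.loads(store_demo[".zgroup"])
print(group_meta)
```

The slash-separated keys are what gives the store its folder-like hierarchy, whether the backend is a local directory or an S3 bucket.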

### S3 Contents

The first step in fetching data is knowing what data is available to fetch.

SmartSPIM data is stored in multiscale OME-Zarr format.

AIND has the following naming conventions in its data:
 - "aind-open-data": The AIND public S3 bucket where data is stored
 - "SmartSPIM_\<id>_\<date>_\<stitched>_\<date>": Organize data by collection/stitching date
 - "processed/OMEZarr": working with processed data in OME-Zarr format
 - "Ex_\<num>_Em_\<num>.zarr": Organize data by excitation/emission metrics for imaging
 
Naming conventions are subject to change in the future.

In [3]:
SAMPLES_BUCKET_NAME="aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr"
SAMPLE_NAME="Ex_647_Em_690.zarr"

In [4]:
s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name="us-west-2"))
store = s3fs.S3Map(
    root=f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}',
    s3=s3,
    check=False,
)
# Caches fetched data in memory; discards the Least Recently Used data when max_size (in bytes) is reached
cache = zarr.LRUStoreCache(store, max_size=2**28)
root = zarr.group(store=cache)
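
The eviction policy named in the comment above can be illustrated with a toy cache bounded by total bytes. This is a sketch of the general least-recently-used idea, not zarr's implementation; `ByteLRU` and `cache_demo` are hypothetical names. For reference, `max_size=2**28` corresponds to 256 MiB.

```python
from collections import OrderedDict


class ByteLRU:
    """Toy LRU cache bounded by the total number of cached bytes."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.current_size = 0
        self._items = OrderedDict()

    def get(self, key):
        value = self._items[key]        # raises KeyError on a cache miss
        self._items.move_to_end(key)    # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._items:
            self.current_size -= len(self._items.pop(key))
        self._items[key] = value
        self.current_size += len(value)
        # Evict least recently used entries until the cache fits again.
        while self.current_size > self.max_size and self._items:
            _, evicted = self._items.popitem(last=False)
            self.current_size -= len(evicted)

    def __contains__(self, key):
        return key in self._items


cache_demo = ByteLRU(max_size=10)
cache_demo.put("a", b"12345")
cache_demo.put("b", b"12345")
cache_demo.get("a")            # "a" becomes the most recently used entry
cache_demo.put("c", b"123")    # 13 bytes exceed the limit, so "b" is evicted
print("a" in cache_demo, "b" in cache_demo, "c" in cache_demo)
```

With a remote store, a cache of this kind means that repeated reads of the same chunks or metadata do not trigger repeated network requests.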

In [5]:
fs = s3fs.S3FileSystem(anon=True)

In [6]:
# See all samples available
fs.ls(SAMPLES_BUCKET_NAME)

['aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/.zgroup',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_488_Em_525.zarr',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_561_Em_600.zarr',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr']

In [7]:
fs.ls(f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}')

['aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/.zattrs',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/.zgroup',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/0',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/1',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/2',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/3',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/4']

The *.zgroup* file is a **JSON** file.

In [8]:
# View sample metadata
with fs.open(
    f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/.zgroup',
    "r",
) as f:
    print(f.read())

**Multiscale** samples are typically organized by resolution, with 0 being the highest resolution (raw data) and 4 being the lowest resolution (most downsampled).

In [9]:
fs.ls(f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/4')

['aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/4/.zarray',
 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/4/0']

Zarr array metadata describes:

1. The shape of the subarray chunks
2. The compression codec
3. The type of the array and its shape

In [10]:
# View sample metadata
with fs.open(
    f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/4/.zarray',
    "r",
) as f:
    print(f.read())
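
As a hedged illustration of how these fields fit together, the sketch below derives the chunk grid and the uncompressed chunk size from a hypothetical `.zarray` document. The shape, chunk shape, and data type mirror what this notebook reports for resolution level 2 further down; the `compressor` entry and the remaining fields are assumptions and may differ from the real file.

```python
import math

# Hypothetical .zarray document (Zarr version 2 field names).
zarray = {
    "zarr_format": 2,
    "shape": [1, 1, 1050, 2560, 1850],
    "chunks": [1, 1, 1, 2560, 1850],
    "dtype": "<u2",  # little-endian unsigned 16-bit integers
    "compressor": {"id": "blosc", "cname": "zstd", "clevel": 1, "shuffle": 1},  # assumed
    "fill_value": 0,
    "order": "C",
    "filters": None,
    "dimension_separator": "/",
}

# Bytes per element; this parsing only handles simple type strings such as "<u2".
itemsize = int(zarray["dtype"][2:])

# Number of chunks along each dimension (edge chunks are rounded up).
chunk_grid = tuple(
    math.ceil(size / chunk) for size, chunk in zip(zarray["shape"], zarray["chunks"])
)
n_chunks = math.prod(chunk_grid)

# Size of one chunk before compression.
chunk_nbytes = math.prod(zarray["chunks"]) * itemsize

print(f"chunk grid: {chunk_grid}")
print(f"number of chunks: {n_chunks}")
print(f"uncompressed chunk size: {chunk_nbytes / 2**20:.2f} MiB")
```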

Each **chunk of an array** is stored as a **compressed binary blob**.

In [11]:
fs.stat(f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/4/0/0/0/0/0')

{'ETag': '"a332886379ed1a0f79a6f03536d63ca2"',
 'LastModified': datetime.datetime(2022, 11, 11, 11, 3, 56, tzinfo=tzutc()),
 'size': 62193,
 'name': 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/4/0/0/0/0/0',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'binary/octet-stream'}
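
The object name `4/0/0/0/0/0` combines the array path (`4`) with the chunk's position in the chunk grid. Here is a minimal sketch of that mapping, assuming a hypothetical chunk shape; `chunk_key` is an illustrative helper, not a zarr function.

```python
def chunk_key(array_path, voxel_index, chunks, separator="/"):
    """Build the store key of the chunk that contains a given voxel index."""
    # Integer division gives the chunk's coordinates in the chunk grid.
    chunk_coords = [i // c for i, c in zip(voxel_index, chunks)]
    return array_path + "/" + separator.join(str(c) for c in chunk_coords)


# Hypothetical chunk shape for the (t, c, z, y, x) array at level 4.
demo_chunks = (1, 1, 1, 640, 462)

first_key = chunk_key("4", (0, 0, 0, 0, 0), demo_chunks)
deeper_key = chunk_key("4", (0, 0, 100, 10, 10), demo_chunks)
dotted_key = chunk_key("4", (0, 0, 100, 10, 10), demo_chunks, separator=".")

print(first_key)
print(deeper_key)
print(dotted_key)
```

Zarr version 2 uses `.` as the default separator between chunk coordinates; the nested `/` form seen in this dataset is selected through the `dimension_separator` field in `.zarray`.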

## OME-Zarr Structure

The [OME-Zarr NGFF specification](https://ngff.openmicroscopy.org/0.4/index.html) adds a **data model** on top of the Zarr specification for **multiscale scientific image data**.

See also the [OME-Zarr NGFF documentation](https://ngff.openmicroscopy.org/).

In the following sections, we will open the brain volumes with multiple Python libraries that provide perspectives on the OME-Zarr image, ranging from a low-level interface to a high-level interface.

### Zarr Python

In [12]:
# Verify root is a zarr Group
root.info

| | |
|---|---|
| Name | / |
| Type | zarr.hierarchy.Group |
| Read-only | False |
| Store type | zarr.storage.LRUStoreCache |
| No. members | 5 |
| No. arrays | 5 |
| No. groups | 0 |
| Arrays | 0, 1, 2, 3, 4 |


Each array holds the image pixel data at one resolution level: array 0 is the full-resolution image and arrays 1 to 4 are progressively downscaled versions.

We can quickly probe the structure of this large dataset without downloading the entire volume.

In [13]:
# Print each array; level 4 should be the lowest resolution
for index in range(5):
    print(root[str(index)])

OME-Zarr metadata is stored in a `multiscales` key of the root Zarr group (folder) attributes.

In [14]:
print(root.attrs['multiscales'])
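
For orientation, here is a sketch of how a `multiscales` entry can be read with the standard library. The JSON below is a hypothetical, abbreviated example that follows the OME-Zarr 0.4 layout (`axes`, `datasets`, `coordinateTransformations`); the units and scale values are placeholders, not the real values for this dataset.

```python
import json

# Hypothetical, abbreviated group attributes in the OME-Zarr 0.4 layout.
attrs_json = """
{
  "multiscales": [
    {
      "version": "0.4",
      "axes": [
        {"name": "t", "type": "time"},
        {"name": "c", "type": "channel"},
        {"name": "z", "type": "space", "unit": "micrometer"},
        {"name": "y", "type": "space", "unit": "micrometer"},
        {"name": "x", "type": "space", "unit": "micrometer"}
      ],
      "datasets": [
        {"path": "0",
         "coordinateTransformations": [{"type": "scale", "scale": [1.0, 1.0, 1.0, 1.0, 1.0]}]},
        {"path": "1",
         "coordinateTransformations": [{"type": "scale", "scale": [1.0, 1.0, 2.0, 2.0, 2.0]}]}
      ]
    }
  ]
}
"""

multiscale = json.loads(attrs_json)["multiscales"][0]
axis_names = [axis["name"] for axis in multiscale["axes"]]

# Map each dataset path (resolution level) to its per-axis scale factors.
scales = {}
for dataset in multiscale["datasets"]:
    for transform in dataset["coordinateTransformations"]:
        if transform["type"] == "scale":
            scales[dataset["path"]] = dict(zip(axis_names, transform["scale"]))

for path, scale_by_axis in scales.items():
    print(path, scale_by_axis)
```

Each `path` names one of the arrays in the group, and the `scale` transformation gives the spacing of that level along each axis, which lets a viewer overlay the resolution levels in the same physical space.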

### OME-Zarr-Py

Inspect the same dataset with [ome-zarr-py](https://github.com/ome/ome-zarr-py).

In [15]:
from ome_zarr.reader import Reader as OMEZarrReader
from ome_zarr.io import ZarrLocation
import ome_zarr.utils

In [16]:
s3_https_url = f'https://s3.us-west-2.amazonaws.com/{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}'

_ = next(ome_zarr.utils.info(s3_https_url))

https://s3.us-west-2.amazonaws.com/aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/ [zgroup]
 - metadata
   - Multiscales
   - OMERO
 - data
   - (1, 1, 4200, 10240, 7400)
   - (1, 1, 2100, 5120, 3700)
   - (1, 1, 1050, 2560, 1850)
   - (1, 1, 525, 1280, 925)
   - (1, 1, 262, 640, 462)
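
The five shapes listed above are consistent with halving each spatial axis, rounding down, from one level to the next. The sketch below reproduces them under that assumption; it does not claim to be the procedure actually used to generate the pyramid, and `halve_shape` is a hypothetical helper.

```python
def halve_shape(shape, spatial_axes=3):
    """Halve the trailing spatial axes, rounding down, and keep the leading axes."""
    leading = tuple(shape[:-spatial_axes])
    spatial = tuple(max(1, n // 2) for n in shape[-spatial_axes:])
    return leading + spatial


# Full-resolution shape in (t, c, z, y, x) order, as reported above.
base_shape = (1, 1, 4200, 10240, 7400)

pyramid_shapes = [base_shape]
for _ in range(4):
    pyramid_shapes.append(halve_shape(pyramid_shapes[-1]))

for level, shape in enumerate(pyramid_shapes):
    print(level, shape)
```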


In [17]:
zarr_location = ZarrLocation(s3_https_url)
reader = OMEZarrReader(zarr_location)
image_node = next(reader())
print(image_node)

In [18]:
print(image_node.metadata)

In [19]:
print(image_node.data)

### NGFF-Zarr

Inspect the same dataset with [ngff-zarr](https://pypi.org/project/ngff-zarr/).

In [20]:
from ngff_zarr import from_ngff_zarr

In [21]:
multiscales = from_ngff_zarr(root.store)
print(multiscales)

## Exercises

### Exercise 1: Explore the OME-Zarr Validator

OME provides an [online validator](https://ome.github.io/ome-ngff-validator/) to check that data has correct OME-Zarr content.

Check the content of the example validator dataset and our tutorial's sample dataset.

- What does the validator check for?
- Does our sample pass the checks?
- What are the chunk sizes optimized for?

In [1]:
# Solution: https://github.com/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Structure_Exercise_1_Solution.py
# %load OME-Zarr_Structure_Exercise_1_Solution.py

### Exercise 2: Compression Performance

Zarr supports next-generation codecs, such as [blosc](https://www.blosc.org/), that provide exceptional

1. Compression ratios
2. Compression and decompression speed

Compare the size of a chunk compressed with a blosc codec with the size compressed with the traditional zlib.

In [67]:
scale = 2
fs.stat(f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/{scale}/0/0/0/0/0')

{'ETag': '"f77c057c04f5a34339e3a34c59442c53"',
 'LastModified': datetime.datetime(2022, 11, 11, 11, 3, 35, tzinfo=tzutc()),
 'size': 1725765,
 'name': 'aind-open-data/SmartSPIM_631680_2022-09-09_13-52-33_stitched_2022-11-10_17-18-18/processed/OMEZarr/Ex_647_Em_690.zarr/2/0/0/0/0/0',
 'type': 'file',
 'StorageClass': 'INTELLIGENT_TIERING',
 'VersionId': None,
 'ContentType': 'binary/octet-stream'}

In [68]:
# Get the decompressed data
multiscales.images[scale].data

| | Array | Chunk |
|---|---|---|
| Bytes | 9.26 GiB | 9.03 MiB |
| Shape | (1, 1, 1050, 2560, 1850) | (1, 1, 1, 2560, 1850) |
| Dask graph | 1050 chunks in 2 graph layers | |
| Data type | uint16 numpy.ndarray | |


In [69]:
import numpy as np
chunk_data = np.asarray(multiscales.images[scale].data.blocks[0,0,0,0,0])

In [70]:
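# Compressed size in bytes of the same chunk, as stored on S3
compressed_size = fs.stat(f'{SAMPLES_BUCKET_NAME}/{SAMPLE_NAME}/{scale}/0/0/0/0/0')['size']
print(f'Blosc compression: {compressed_size / chunk_data.nbytes}')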
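
For the zlib side of the comparison, the same ratio (compressed size divided by uncompressed size) can be measured with Python's standard library. The sketch below uses synthetic 16-bit data as a stand-in, so its ratio says nothing about the real chunk; in the exercise, the bytes of `chunk_data` would take the place of `raw`.

```python
import zlib
from array import array

# Synthetic stand-in for a chunk of uint16 pixels: a slowly increasing ramp.
pixels = array("H", (i // 64 for i in range(100_000)))
raw = pixels.tobytes()

# Compress with zlib at its commonly used default-like level 6.
compressed = zlib.compress(raw, 6)
zlib_ratio = len(compressed) / len(raw)

# Lossless codecs must reproduce the input exactly.
roundtrip_ok = zlib.decompress(compressed) == raw

print(f"raw bytes: {len(raw)}")
print(f"zlib bytes: {len(compressed)}")
print(f"zlib compression: {zlib_ratio}")
```

A lower ratio means better compression. Timing the `compress` and `decompress` calls gives the second quantity the exercise asks about.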

In [2]:
# Solution: https://github.com/InsightSoftwareConsortium/GetYourBrainTogether/blob/main/HCK02_2023_Allen_Institute_Hybrid/Tutorials/WorkingWithOMEZarrNGFF/OME-Zarr_Structure_Exercise_2_Solution.py
# %load OME-Zarr_Structure_Exercise_2_Solution.py