# LBM: Benchmark Chunk Sizes with Zarr

## Data Storage: Zarr

[Zarr documentation](https://zarr.readthedocs.io/en/stable/tutorial.html)

Deciding how to save data on a host operating system is far from straightforward.
Read/write performance varies widely between data saved as a **single file**
and data split into smaller chunks, e.g. one image per file, one image per epoch, etc.

The former strategy is clean, concise, and easy to handle but is *not* feasible with large (>10 GB) datasets.

The latter strategy of spreading files across nested groups of directories, each with its own metadata/attributes, has been widely adopted as the more sensible approach. HDF5 has
been the frontrunner in scientific data I/O but suffers from widely inconsistent usage within academia.

- Zarr, similar to HDF5, is a hierarchical data storage specification (or in non-alien speak: "rules for how data is stored on disk").
- Zarr nicely hides the complexities inherent in linking filesystem hierarchy with efficient data I/O.
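
To make "rules for how data is stored on disk" concrete, here is a toy sketch of the idea using only the Python standard library. It is not Zarr's actual on-disk format: the file names (`meta.json`, `chunk_N.bin`), the `plane0/mov` path, and both helper functions are hypothetical and exist only for illustration. It shows a self-describing directory holding metadata plus one file per chunk, so a single element can be read by opening only the chunk that contains it.

```python
import json
import tempfile
from array import array
from pathlib import Path


def write_chunked(root: Path, name: str, values: list, chunk_len: int) -> None:
    """Store `values` as fixed-length chunks under root/name, one file per chunk."""
    node = root / name
    node.mkdir(parents=True, exist_ok=True)
    # Metadata lives next to the data, so the directory describes itself.
    meta = {"length": len(values), "chunk_len": chunk_len, "typecode": "h"}
    (node / "meta.json").write_text(json.dumps(meta))
    for start in range(0, len(values), chunk_len):
        chunk = array("h", values[start:start + chunk_len])  # "h" = signed 16-bit int
        (node / f"chunk_{start // chunk_len}.bin").write_bytes(chunk.tobytes())


def read_item(root: Path, name: str, index: int) -> int:
    """Read one element by opening only the chunk file that contains it."""
    node = root / name
    meta = json.loads((node / "meta.json").read_text())
    chunk_id, offset = divmod(index, meta["chunk_len"])
    chunk = array(meta["typecode"])
    chunk.frombytes((node / f"chunk_{chunk_id}.bin").read_bytes())
    return chunk[offset]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Ten values in chunks of four: two full chunks and one partial chunk.
    write_chunked(root, "plane0/mov", list(range(10)), chunk_len=4)
    files = sorted(p.name for p in (root / "plane0" / "mov").iterdir())
    value = read_item(root, "plane0/mov", 9)

print(files)
print(value)
```

Zarr automates this bookkeeping and adds compression, multi-dimensional chunk grids, and nested groups on top of it.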


In [None]:
import scanreader
import os
import sys
from pathlib import Path

import cv2
import numpy as np
import zarr

# Give this notebook access to the root package
sys.path.append('../../')  # TODO: Take this out when we upload to pypi
print(sys.path[0])

import bokeh.plotting as bpl
import holoviews as hv
from IPython import get_ipython
import logging
import matplotlib.pyplot as plt

try:
    import dask.array as da
    has_dask = True
except ImportError:
    has_dask = False

try:
    cv2.setNumThreads(0)
except Exception:  # `except():` matches an empty tuple and would catch nothing
    pass

try:
    if __IPYTHON__:
        get_ipython().run_line_magic('load_ext', 'autoreload')
        get_ipython().run_line_magic('autoreload', '2')
except NameError:
    pass

bpl.output_notebook()
hv.notebook_extension('bokeh')

# logging
logging.basicConfig(format="{asctime} - {levelname} - [{filename} {funcName}() {lineno}] - pid {process} - {message}",
                    filename=None,
                    level=logging.WARNING, style="{")  # WARNING shows only warnings and errors
# level=logging.DEBUG, style="{")  # DEBUG shows the general information developers use to track their program
# (be careful when playing movies, there will be a lot of debug messages)

# set env variables 
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"


In [None]:
datapath = Path.home() / 'Documents' / 'data'
savepath = datapath / 'save'  # directory where outputs are written
savepath.mkdir(exist_ok=True, parents=True)

htiffs = [str(x) for x in datapath.glob('*.tif')]  # path of every .tif file directly inside datapath

In [4]:
reader = scanreader.read_scan(htiffs, join_contiguous=True)
scan = reader[:,:,:,0,5:605].squeeze()
scan.shape

(600, 576, 600)

In [7]:
import dask.array as da
arr = da.array(scan)
arr

| | Array | Chunk |
|---|---|---|
| Bytes | 395.51 MiB | 127.65 MiB |
| Shape | (600, 576, 600) | (406, 406, 406) |
| Dask graph | 8 chunks in 1 graph layer | 8 chunks in 1 graph layer |
| Data type | int16 numpy.ndarray | int16 numpy.ndarray |
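
The numbers in the dask summary above follow directly from the array shape, the chunk shape, and the 2-byte `int16` item size. A minimal sketch of that arithmetic, where `chunk_stats` is a hypothetical helper and not part of dask or zarr:

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (array MiB, full-chunk MiB, number of chunks) for a regular chunk grid."""
    mib = 1024 ** 2
    array_mib = math.prod(shape) * itemsize / mib
    chunk_mib = math.prod(chunks) * itemsize / mib
    # Chunks on the trailing edges may be partial, so round up along every axis.
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return array_mib, chunk_mib, n_chunks


array_mib, chunk_mib, n_chunks = chunk_stats((600, 576, 600), (406, 406, 406), itemsize=2)
print(f"{array_mib:.2f} MiB total, {chunk_mib:.2f} MiB per full chunk, {n_chunks} chunks")
# prints: 395.51 MiB total, 127.65 MiB per full chunk, 8 chunks
```

Note that 406 does not divide 600 or 576 evenly, so only one of the eight chunks is full-sized; the rest are smaller edge chunks.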


In [8]:
da.to_hdf5("/home/rbo/caiman_data/test.h5","/mov",arr)

In [56]:
savepath = Path("/home/rbo/caiman_data/mbo")
store = zarr.DirectoryStore(str(savepath.with_suffix(".zarr")))  # persist the data to a directory on disk
z = zarr.zeros(arr.shape, dtype='int16', store=store, overwrite=True)  # no `chunks` given, so zarr picks a chunk shape

if hasattr(arr, 'compute'):
    z[:] = arr.compute()             # dask array: materialize in memory, then write
else:
    z[:] = arr
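
When benchmarking, it can help to choose chunk shapes deliberately instead of relying on the automatic choice. The sketch below proposes candidate chunk shapes that keep two axes whole and split only along one axis, sized to a target number of MiB. It assumes the last axis of `scan` is the frame axis (suggested by the `5:605` index, but not confirmed here); `chunks_along_axis` is a hypothetical helper, and the target sizes are arbitrary starting points.

```python
import math


def chunks_along_axis(shape, itemsize, axis, target_mib):
    """Chunk shape spanning every axis fully except `axis`, sized to at most ~`target_mib`."""
    slice_bytes = math.prod(shape) // shape[axis] * itemsize  # bytes in one slice along `axis`
    n_slices = max(1, int(target_mib * 1024 ** 2) // slice_bytes)  # at least one slice per chunk
    n_slices = min(n_slices, shape[axis])                          # never exceed the axis length
    chunks = list(shape)
    chunks[axis] = n_slices
    return tuple(chunks)


shape = (600, 576, 600)  # shape of `scan`; int16 means itemsize=2
# Assumption: axis=2 is the frame axis.
candidates = {mib: chunks_along_axis(shape, 2, axis=2, target_mib=mib) for mib in (1, 16, 64)}
for mib, chunks in candidates.items():
    print(f"~{mib} MiB target -> chunks={chunks}")
```

Each resulting tuple can be passed as the `chunks=` argument to `zarr.zeros`, and the write and read times compared across candidates.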