**This notebook explains the zarr file data structure in more detail and shows how to interact with zarr files directly.**

#### Select Kernel
In the ['Setup.ipynb'](../Setup/Setup.ipynb) notebook we have already created a kernel named `LSST_Train`, which includes all packages necessary to access and explore the training data. However, it is not guaranteed that `LSST_Train` will be selected automatically when you open a new notebook (for example, this one). We therefore recommend first checking the top right corner of the notebook ([observe kernel currently used](../Setup/figs/kernel.jpg)) to make sure that the correct kernel is in use. If it displays something other than `LSST_Train` (e.g. `Python 3`), click on it and change the kernel to `LSST_Train` ([open up window](../Setup/figs/kernel_window.jpg) -> [select 'LSST_Train'](../Setup/figs/select_kernel.jpg)).

In [1]:
import zarr
import pandas as pd
pd.set_option('display.max_columns', 999)

Below are a few words quoted directly from zarr's [website](https://zarr.readthedocs.io/en/stable/#highlights).

> Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

#### Highlights:
> * Create N-dimensional arrays with any NumPy dtype.  
> * Chunk arrays along any dimension.  
> * Compress and/or filter chunks using any NumCodecs codec.  
> * Store arrays in memory, on disk, inside a Zip file, on S3, …  
> * Read an array concurrently from multiple threads or processes.  
> * Write to an array concurrently from multiple threads or processes.  
> * Organize arrays into hierarchies via groups.  

I chose zarr as the backend storage format mostly because it supports concurrent reads and writes, which will be extremely beneficial when our training set grows too large to fit into memory. In addition, it integrates well with the [dask](https://dask.org/) package for implementing multi-processing/multi-threading algorithms. 

Here I will briefly show how to interact directly with the training set stored in zarr. For a more detailed tutorial on zarr, please check out [zarr's documentation](https://zarr.readthedocs.io/en/stable/#highlights).

In [2]:
# on SciServer
path2zarr = '/home/idies/workspace/Temporary/ywx649999311/LSST_AGN/Class_Training/Data/LCs.zarr.zip'

# # offline
# path2zarr = 'path/to/data'

# create a read-only connection to the file (without loading in anything)
root = zarr.open(path2zarr, mode='r')

In [3]:
# check out datasets/groups under the root directory
# leaving out level=1 shows the full hierarchy, which can crash the notebook if there are too many subdirectories or datasets
print(root.tree(level=1))

/
 └── sdss_lc


The SDSS light curves are stored in the 'sdss_lc' group under root. Each light curve is a dataset within the 'sdss_lc' group, indexed by the `train_id` of the object it represents. The cell below loads the light curve data directly for a given `train_id`.

In [4]:
train_id = 0
lc_df = pd.DataFrame(root['sdss_lc/{}'.format(train_id)][:])
lc_df.head()

|  | run | dered_u | dered_g | dered_r | dered_i | dered_z | psfmagerr_u | psfmagerr_g | psfmagerr_r | psfmagerr_i | psfmagerr_z | mjd_u | mjd_g | mjd_r | mjd_i | mjd_z | offsetRa_u | offsetRa_g | offsetRa_i | offsetRa_z | offsetDec_u | offsetDec_g | offsetDec_i | offsetDec_z | airmass_u | airmass_g | airmass_r | airmass_i | airmass_z |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2583 | 20.249483 | 20.006676 | 19.933714 | 19.685217 | 19.802973 | 0.06221 | 0.020544 | 0.023241 | 0.025836 | 0.079934 | 52172.23 | 52172.23 | 52172.23 | 52172.23 | 52172.23 | -0.000334 | 0.026004 | -0.006556 | 0.002877 | 0.065078 | 0.005305 | 0.018693 | 0.012118 | 1.320765 | 1.327729 | 1.314015 | 1.317368 | 1.324221 |
| 1 | 6433 | 19.960081 | 20.077534 | 20.092773 | 19.856043 | 19.74728 | 0.113201 | 0.060271 | 0.051735 | 0.044494 | 0.090797 | 54012.11 | 54012.11 | 54012.11 | 54012.11 | 54012.11 | -0.181637 | 0.038433 | -0.047627 | -0.033974 | 0.043335 | -0.004236 | -0.014651 | -0.050503 | 1.183268 | 1.18269 | 1.183976 | 1.183609 | 1.182964 |
| 2 | 7124 | 19.510082 | 19.863846 | 19.817478 | 19.611767 | 19.812647 | 0.106592 | 0.031646 | 0.05353 | 0.06425 | 0.316121 | 54407.09 | 54407.1 | 54407.09 | 54407.09 | 54407.09 | 0.746968 | 0.078411 | -0.07733 | 0.009763 | -0.041699 | 0.02003 | -0.00559 | 0.049248 | 1.261037 | 1.26602 | 1.256229 | 1.258615 | 1.263507 |
| 3 | 7069 | 19.500845 | 19.787628 | 19.76417 | 19.547508 | 19.72891 | 0.073585 | 0.024414 | 0.030826 | 0.041999 | 0.138001 | 54390.19 | 54390.19 | 54390.18 | 54390.19 | 54390.19 | -0.159137 | 0.036304 | 0.015132 | 0.030568 | 0.056769 | -0.013882 | 0.009143 | 0.003616 | 1.479595 | 1.491362 | 1.46817 | 1.473846 | 1.485436 |
| 4 | 6417 | 19.557869 | 20.092735 | 20.055176 | 19.783768 | 19.492689 | 0.071082 | 0.028604 | 0.028007 | 0.028818 | 0.083242 | 54008.12 | 54008.12 | 54008.12 | 54008.12 | 54008.12 | -0.027197 | -0.039675 | -0.024857 | 0.013995 | 0.111805 | 0.073699 | 0.020218 | 0.090663 | 1.182689 | 1.18224 | 1.183269 | 1.182966 | 1.18245 |
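
Once a light curve is in a DataFrame, ordinary pandas operations apply. As a small sketch, the cell below pulls out a single band and sorts it by time. It assumes, from the column names alone, that `mjd_*` holds observation times, `dered_*` magnitudes, and `psfmagerr_*` their errors; the helper `select_band` and the frame `demo_df` are hypothetical and use g-band values copied from the first three rows above.

```python
import pandas as pd

# Three epochs copied from the output above (g-band columns only).
demo_df = pd.DataFrame({
    'run': [2583, 6433, 7124],
    'dered_g': [20.006676, 20.077534, 19.863846],
    'psfmagerr_g': [0.020544, 0.060271, 0.031646],
    'mjd_g': [52172.23, 54012.11, 54407.10],
})


def select_band(df, band):
    """Return time, magnitude and error columns for one band, sorted by time."""
    cols = {
        f'mjd_{band}': 'mjd',
        f'dered_{band}': 'mag',
        f'psfmagerr_{band}': 'mag_err',
    }
    return (
        df[list(cols)]
        .rename(columns=cols)
        .sort_values('mjd')
        .reset_index(drop=True)
    )


g_band = select_band(demo_df, 'g')
print(g_band)
```

The same call with `lc_df` in place of `demo_df` should work on the full light curve, provided the columns follow the naming pattern shown in the output.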


**!! This is only a simple notebook demonstrating how to interact with zarr files directly. For more sophisticated tasks, please refer to the official documentation.**