**This notebook explains the structure of the zarr data file and shows how to interact with it directly.**

#### Select Kernel
In the ['Setup.ipynb'](../Setup/Setup.ipynb) notebook we already created an `LSST_Train` kernel, which includes all packages necessary to access and explore the training data. However, a new notebook (for example, this one) is not guaranteed to select `LSST_Train` automatically. We therefore recommend first checking the top right corner of the notebook ([observe kernel currently used](../Setup/figs/kernel.jpg)) to make sure the correct kernel is in use. If it displays something other than `LSST_Train` (e.g. `Python 3`), click on it and change the kernel to `LSST_Train` ([open the kernel window](../Setup/figs/kernel_window.jpg) -> [select 'LSST_Train'](../Setup/figs/select_kernel.jpg)).

In [1]:
import zarr
import pandas as pd
pd.set_option('display.max_columns', 999)

Below are a few words quoted directly from zarr's [website](https://zarr.readthedocs.io/en/stable/#highlights).

> Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

#### Highlights:
> * Create N-dimensional arrays with any NumPy dtype.  
> * Chunk arrays along any dimension.  
> * Compress and/or filter chunks using any NumCodecs codec.  
> * Store arrays in memory, on disk, inside a Zip file, on S3, …  
> * Read an array concurrently from multiple threads or processes.  
> * Write to an array concurrently from multiple threads or processes.  
> * Organize arrays into hierarchies via groups.  

I chose zarr as the backend storage format mostly because it supports concurrent reads and writes, which will be extremely beneficial when our training set grows too large to fit into memory. In addition, it integrates well with the [dask](https://dask.org/) package for implementing multi-processing/multi-threading algorithms.

Here I will briefly show how to interact directly with the training set stored in zarr. For a more detailed tutorial on zarr, please check out [zarr's documentation](https://zarr.readthedocs.io/en/stable/#highlights).

In [2]:
# on SciServer
path2zarr = '/home/idies/workspace/Temporary/ywx649999311/LSST_AGN/Class_Training/Data/qso.zarr.zip'

# # offline
# path2zarr = 'path/to/data'

# create connection to the file (without loading in anything)
root = zarr.open(path2zarr)

In [3]:
# list the datasets/groups directly under the root
# omitting level=1 prints the full hierarchy, which may crash the notebook if there are too many subgroups or datasets
print(root.tree(level=1))

/
 ├── catalog (24570,) [('train_id', '<i8'), ('ra', '<f8'), ('dec', '<f8'), ('z', '<f8'), ('z_err', '<f8'), ('thing_id', '<f8'), ('specobjid', '<i8'), ('spec', '<i8'), ('sdss_objid', '<i8'), ('psfmagerr_u', '<f8'), ('psfmagerr_g', '<f8'), ('psfmagerr_r', '<f8'), ('psfmagerr_i', '<f8'), ('psfmagerr_z', '<f8'), ('extinction_u', '<f8'), ('extinction_g', '<f8'), ('extinction_r', '<f8'), ('extinction_i', '<f8'), ('extinction_z', '<f8'), ('type', '<i8'), ('run', '<i8'), ('src2photo', '<f8'), ('dered_u', '<f8'), ('dered_g', '<f8'), ('dered_r', '<f8'), ('dered_i', '<f8'), ('dered_z', '<f8'), ('ra_sp', '<f8'), ('dec_sp', '<f8'), ('spies_id', '<f8'), ('flux_auto_ch1', '<f8'), ('fluxerr_auto_ch1', '<f8'), ('flux_auto_ch2', '<f8'), ('fluxerr_auto_ch2', '<f8'), ('class_star_ch1', '<f8'), ('class_star_ch2', '<f8'), ('src2spies', '<f8'), ('src2gaia', '<f8'), ('gaia_id', '<f8'), ('parallax', '<f8'), ('parallax_error', '<f8'), ('pmra', '<f8'), ('pmra_error', '<f8'), ('pmdec', '<f8'), ('pmdec_error', '

In [4]:
# get the catalog dataset
cat_df = pd.DataFrame(root['catalog'][:])
cat_df.head(2)

Unnamed: 0,train_id,ra,dec,z,z_err,thing_id,specobjid,spec,sdss_objid,psfmagerr_u,psfmagerr_g,psfmagerr_r,psfmagerr_i,psfmagerr_z,extinction_u,extinction_g,extinction_r,extinction_i,extinction_z,type,run,src2photo,dered_u,dered_g,dered_r,dered_i,dered_z,ra_sp,dec_sp,spies_id,flux_auto_ch1,fluxerr_auto_ch1,flux_auto_ch2,fluxerr_auto_ch2,class_star_ch1,class_star_ch2,src2spies,src2gaia,gaia_id,parallax,parallax_error,pmra,pmra_error,pmdec,pmdec_error,galex_id,fuv_mag,fuv_magerr,nuv_mag,nuv_magerr,src2galex,mpstype,stdColor[0],stdColor[1],stdColor[2],stdColor[3],psFlux[u],psFlux[g],psFlux[r],psFlux[i],psFlux[z],psPm[0],psPm[1],psParallax,lcN
0,0,310.0377,-1.005592,2.167268,0.000676,,276352159556042752,7,8647475119809364088,0.008446,0.003265,0.003569,0.002969,0.00596,0.332155,0.244397,0.177257,0.134408,0.095297,6,206,0.083856,19.312095,18.875304,18.513243,18.263452,18.025003,,,,,,,,,,,0.024083,4.226319e+18,0.22524,0.283972,-0.609215,0.464463,-0.206735,0.305267,6.401869e+18,,,23.184412,0.43626,1.696868,0.0,0.436792,0.36206,0.249792,0.238449,65942.328054,68419.255122,142798.907734,179737.984132,228014.565848,-0.609215,-0.206735,0.22524,55.0
1,3,311.6088,0.393812,0.333015,0.000351,,537282917827608576,7,8647475121420632840,0.011383,0.001996,0.001869,0.002298,0.003088,0.494767,0.364045,0.264036,0.20021,0.141952,3,206,0.051798,19.489293,18.512125,17.722814,17.6837,16.975958,,,,,,,,,,,0.092993,4.228017e+18,-0.149525,0.160723,0.044235,0.275073,0.075344,0.179892,2.468468e+18,21.218325,0.06131,21.05294,0.047003,0.742009,0.0,0.977168,0.789311,0.039115,0.707741,56011.332155,58115.957521,295732.889174,306580.43435,599285.639829,0.044235,0.075344,-0.149525,63.0
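The catalog converts to a DataFrame so easily because it is a structured array, and pandas turns each named field into a column. The sketch below shows this with a hand-built NumPy array that uses three of the catalog's fields and the values from the two rows above. The variable names and the choice to index by `train_id` are my own assumptions, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Hand-built structured array with a few of the catalog's fields.
records = np.array(
    [(0, 310.0377, -1.005592), (3, 311.6088, 0.393812)],
    dtype=[("train_id", "<i8"), ("ra", "<f8"), ("dec", "<f8")],
)

# Each named field becomes a DataFrame column.
demo_df = pd.DataFrame(records)

# Indexing by train_id makes it easy to look up a single object.
demo_df = demo_df.set_index("train_id")
print(demo_df.loc[3])
```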


In [5]:
# show the definition of each column in the catalog dataset
root['catalog'].attrs.asdict()

{'class_star_ch1': 'SpIES 3.6 micrometer morphology classification, > 0.5 for resolved source',
 'class_star_ch2': 'SpIES 4.5 micrometer morphology classification, > 0.5 for resolved source',
 'dec': 'Source DEC from SDSS in degrees (J2000)',
 'dec_sp': 'DEC from SpIES in degrees (J2000)',
 'dered_{band}': 'Extinction corrected PSF mag (psfMag - extinction), replace band with u, g, r, i, z',
 'extinction_{band}': 'Galactic extinction in band, obtained using SDSS CasJobs, replace band with u, g, r, i, z',
 'flux_auto_ch1': 'SpIES 3.6 micrometer flux value automatically extracted using SExtractor',
 'flux_auto_ch2': 'SpIES 4.5 micrometer flux value automatically extracted using SExtractor',
 'fluxerr_auto_ch1': 'SpIES 3.6 micrometer flux error given by SExtractor',
 'fluxerr_auto_ch2': 'SpIES 4.5 micrometer flux error given by SExtractor',
 'fuv_mag': 'GALEX FUV mag',
 'fuv_magerr': 'Photometric error in GALEX FUV mag',
 'gaia_id': 'Gaia DR2 source id',
 'galex_id': 'GALEX ID for of matc
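Some keys in this dictionary are templates such as `dered_{band}`, so looking up a concrete column name like `dered_g` directly would fail. Below is a small sketch of a lookup helper that expands the templates. The helper `describe_column` is hypothetical and not part of the dataset or of zarr, and the dictionary is a hand-copied subset of the output above.

```python
# Hand-copied subset of the attribute dictionary shown above.
column_docs = {
    "dec": "Source DEC from SDSS in degrees (J2000)",
    "dered_{band}": "Extinction corrected PSF mag (psfMag - extinction), "
                    "replace band with u, g, r, i, z",
    "extinction_{band}": "Galactic extinction in band, obtained using SDSS CasJobs, "
                         "replace band with u, g, r, i, z",
    "fuv_mag": "GALEX FUV mag",
}

BANDS = ("u", "g", "r", "i", "z")


def describe_column(name, docs=column_docs):
    """Return the description for a column, expanding '{band}' templates."""
    if name in docs:
        return docs[name]
    for key, text in docs.items():
        if "{band}" in key:
            # Try every band to see whether the template matches this column.
            if any(key.format(band=band) == name for band in BANDS):
                return text
    return None  # no description available


print(describe_column("dered_g"))
print(describe_column("fuv_mag"))
```

With the real file, `root['catalog'].attrs.asdict()` could be passed as the `docs` argument instead of the hand-copied subset.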

The light curves are stored in the 'sdss_lc' group under root. Each light curve is a dataset within the 'sdss_lc' group and is indexed by the `train_id` of the object it represents. Below I show how to load the light curve data directly for a given `train_id`.

In [6]:
# pick the object whose light curve we want to load
train_id = 0
lc_df = pd.DataFrame(root['sdss_lc/{}'.format(train_id)][:])
lc_df.head()

Unnamed: 0,run,dered_u,dered_g,dered_r,dered_i,dered_z,psfmagerr_u,psfmagerr_g,psfmagerr_r,psfmagerr_i,psfmagerr_z,offsetRa_u,offsetRa_g,offsetRa_i,offsetRa_z,offsetDec_u,offsetDec_g,offsetDec_i,offsetDec_z,airmass_u,airmass_g,airmass_r,airmass_i,airmass_z,mjd_u,mjd_g,mjd_r,mjd_i,mjd_z
0,7150,19.21996,18.7766,18.41048,18.18466,17.9712,0.034109,0.016798,0.009564,0.013713,0.025251,-0.007248,-0.025836,-0.020824,-0.034578,-0.021959,-0.007406,-0.012763,0.009291,1.605493,1.62066,1.590782,1.598078,1.613011,54415.13,54415.13,54415.13,54415.13,54415.13
1,5878,19.25597,18.87548,18.53624,18.2805,18.01789,0.040118,0.017254,0.011124,0.012605,0.027835,-0.005084,0.011672,-0.021312,0.013219,-0.108469,0.031196,-0.014625,-0.0106,1.455936,1.466554,1.445628,1.450741,1.461198,53693.09,53693.09,53693.09,53693.09,53693.09
2,6441,19.27425,18.8251,18.48587,18.37828,18.09434,0.050127,0.012878,0.023105,0.029367,0.036967,0.058937,0.023346,0.015212,-0.016007,0.003563,0.006684,-0.024566,-0.074804,1.225002,1.227639,1.222511,1.223735,1.226299,54019.13,54019.13,54019.13,54019.13,54019.13
3,4849,19.21765,18.89544,18.50986,18.27946,17.97352,0.042384,0.016987,0.016066,0.010939,0.024926,0.076904,0.006407,0.014746,0.014358,-0.028267,0.00497,-0.02116,0.038742,1.208063,1.206803,1.209457,1.20874,1.207415,53270.14,53270.14,53270.13,53270.13,53270.14
4,4207,19.31172,18.86929,18.54255,18.23268,18.0236,0.032061,0.015984,0.012351,0.015167,0.030417,0.000597,-0.034614,0.003318,-0.045888,-0.005566,-0.007635,0.000787,0.003568,1.202108,1.20256,1.201789,1.201929,1.202314,52936.07,52936.07,52936.07,52936.07,52936.07
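Each row of the light curve table holds one epoch with a separate column per band. For plotting or fitting, it is often handier to have one row per (epoch, band) measurement. The sketch below shows one possible way to reshape the table. The helper `to_long_format` and the output column names are my own assumptions, and the demo frame contains only the first epoch shown above.

```python
import pandas as pd

BANDS = ["u", "g", "r", "i", "z"]

# First epoch from the table above, restricted to the columns used here.
demo_lc = pd.DataFrame({
    "dered_u": [19.21996], "dered_g": [18.7766], "dered_r": [18.41048],
    "dered_i": [18.18466], "dered_z": [17.9712],
    "psfmagerr_u": [0.034109], "psfmagerr_g": [0.016798],
    "psfmagerr_r": [0.009564], "psfmagerr_i": [0.013713],
    "psfmagerr_z": [0.025251],
    "mjd_u": [54415.13], "mjd_g": [54415.13], "mjd_r": [54415.13],
    "mjd_i": [54415.13], "mjd_z": [54415.13],
})


def to_long_format(lc_df):
    """Stack the per-band columns into one row per (epoch, band)."""
    frames = []
    for band in BANDS:
        frames.append(pd.DataFrame({
            "band": band,
            "mjd": lc_df[f"mjd_{band}"].to_numpy(),
            "mag": lc_df[f"dered_{band}"].to_numpy(),
            "mag_err": lc_df[f"psfmagerr_{band}"].to_numpy(),
        }))
    return pd.concat(frames, ignore_index=True)


long_lc = to_long_format(demo_lc)
print(long_lc)
```

The same function should work on the full `lc_df` loaded above, since it relies only on the `mjd_*`, `dered_*`, and `psfmagerr_*` columns.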


**!! This is only a simple notebook to demonstrate how to interact with zarr files directly. For more sophisticated tasks, please refer to the official documentation.**