**This notebook explains the zarr file data structure in more detail and shows how to interact with zarr directly**

In [2]:
import zarr
import pandas as pd
pd.set_option('display.max_columns', 999)

Below are a few words quoted directly from zarr's [website](https://zarr.readthedocs.io/en/stable/#highlights).

> Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

#### Highlights:
> * Create N-dimensional arrays with any NumPy dtype.  
> * Chunk arrays along any dimension.  
> * Compress and/or filter chunks using any NumCodecs codec.  
> * Store arrays in memory, on disk, inside a Zip file, on S3, …  
> * Read an array concurrently from multiple threads or processes.  
> * Write to an array concurrently from multiple threads or processes.  
> * Organize arrays into hierarchies via groups.  

I chose zarr as the backend storage format mainly because it supports concurrent reads and writes, which will be extremely beneficial once our training set grows too large to fit into memory. In addition, it integrates well with the [dask](https://dask.org/) package for implementing multi-processing/multi-threading algorithms.

Here I briefly show how to interact directly with the training set stored in zarr. For a more detailed tutorial, please check out [zarr's documentation](https://zarr.readthedocs.io/en/stable/#highlights).

In [3]:
# on SciServer
path2zarr = '/home/idies/workspace/Storage/ywx649999311/AGN_training/Data/qso.zarr.zip'

# # offline
# path2zarr = 'path/to/data'

# open a connection to the file (nothing is loaded into memory yet)
root = zarr.open(path2zarr)

In [5]:
# check out datasets/groups under the root directory
# omitting level=1 shows the full hierarchy, which may crash the notebook if there are too many subgroups or datasets
print(root.tree(level=1))

/
 ├── catalog (24579,) [('train_id', '<i8'), ('ra_sdss', '<f8'), ('dec_sdss', '<f8'), ('sdssj', '<U20'), ('z', '<f8'), ('z_err', '<f8'), ('thing_id', '<i8'), ('specobjid', '<i8'), ('spec', '<i8'), ('objid', '<i8'), ('psfmag_u', '<f8'), ('psfmag_g', '<f8'), ('psfmag_r', '<f8'), ('psfmag_i', '<f8'), ('psfmag_z', '<f8'), ('psfmagerr_u', '<f8'), ('psfmagerr_g', '<f8'), ('psfmagerr_r', '<f8'), ('psfmagerr_i', '<f8'), ('psfmagerr_z', '<f8'), ('extinction_u', '<f8'), ('extinction_g', '<f8'), ('extinction_r', '<f8'), ('extinction_i', '<f8'), ('extinction_z', '<f8'), ('spec2coadd', '<f8'), ('ra_sp', '<f8'), ('dec_sp', '<f8'), ('SPIES_ID', '<i8'), ('FLUX_AUTO_ch1', '<f8'), ('FLUXERR_AUTO_ch1', '<f8'), ('FLUX_AUTO_ch2', '<f8'), ('FLUXERR_AUTO_ch2', '<f8'), ('CLASS_STAR_ch1', '<f8'), ('CLASS_STAR_ch2', '<f8'), ('sdss2spies', '<f8'), ('sdss2gaia', '<f8'), ('gaia_id', '<i8'), ('parallax', '<f8'), ('parallax_error', '<f8'), ('pmra', '<f8'), ('pmra_error', '<f8'), ('pmdec', '<f8'), ('pmdec_error', '<

In [6]:
# see a brief explanation of the purpose of each dataset/group
root.attrs.asdict()

{'catalog': 'Master catalog dataset, one source per row',
 'sdss_lc': 'This group stores light curve from SDSS, each dataset is one light curve'}

In [7]:
# get the catalog dataset
cat_df = pd.DataFrame(root['catalog'][:])
cat_df.head(2)

Unnamed: 0,train_id,ra_sdss,dec_sdss,sdssj,z,z_err,thing_id,specobjid,spec,objid,psfmag_u,psfmag_g,psfmag_r,psfmag_i,psfmag_z,psfmagerr_u,psfmagerr_g,psfmagerr_r,psfmagerr_i,psfmagerr_z,extinction_u,extinction_g,extinction_r,extinction_i,extinction_z,spec2coadd,ra_sp,dec_sp,SPIES_ID,FLUX_AUTO_ch1,FLUXERR_AUTO_ch1,FLUX_AUTO_ch2,FLUXERR_AUTO_ch2,CLASS_STAR_ch1,CLASS_STAR_ch2,sdss2spies,sdss2gaia,gaia_id,parallax,parallax_error,pmra,pmra_error,pmdec,pmdec_error,lcN,psPm[0],psPm[1],psParallax,dered_u,dered_g,dered_r,dered_i,dered_z,stdColor[0],stdColor[1],stdColor[2],stdColor[3],psFlux[u],psFlux[g],psFlux[r],psFlux[i],psFlux[z]
0,0,310.0377,-1.005592,204009.04-010020.1,2.167268,0.000676,-99,276352159556042752,7,8647475119809364088,19.64425,19.1197,18.6905,18.39786,18.1203,0.008446,0.003265,0.003569,0.002969,0.00596,0.332155,0.244396,0.177257,0.134408,0.095297,0.083856,,,-99,,,,,,,,0.024083,4226318531605913088,0.22524,0.283972,-0.609215,0.464463,-0.206735,0.305267,60,-0.609215,-0.206735,0.22524,19.312095,18.875304,18.513243,18.263452,18.025003,0.436791,0.36206,0.249792,0.238449,65942.328054,68419.255122,142798.907734,179737.984132,228014.565848
1,3,311.6088,0.393812,204626.10+002337.7,0.333015,0.000351,-99,537282917827608576,7,8647475121420632840,19.98406,18.87617,17.98685,17.88391,17.11791,0.011383,0.001996,0.001869,0.002298,0.003088,0.494767,0.364045,0.264036,0.20021,0.141952,0.051798,,,-99,,,,,,,,0.092993,4228017414511443968,-0.149525,0.160723,0.044235,0.275073,0.075344,0.179892,69,0.044235,0.075344,-0.149525,19.489293,18.512125,17.722814,17.6837,16.975958,0.977168,0.789311,0.039115,0.707741,56011.332155,58115.957521,295732.889174,306580.43435,599285.639829
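The conversion above works because the catalog is a structured array with named fields. As a small sketch of the same pattern, the toy array below keeps only three of the catalog's fields, with values rounded from the two rows printed above; the names `toy`, `toy_df`, and `high_z` are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy structured array with three catalog fields (values rounded from the rows above).
toy = np.array(
    [(0, 2.17, 60), (3, 0.33, 69)],
    dtype=[('train_id', '<i8'), ('z', '<f8'), ('lcN', '<i8')],
)

# Each named field becomes a DataFrame column.
toy_df = pd.DataFrame(toy)

# Ordinary pandas filtering then applies, e.g. sources with redshift above 1.
high_z = toy_df[toy_df['z'] > 1.0]
print(high_z)
```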


In [8]:
# show the definition of each column in the catalog dataset
root['catalog'].attrs.asdict()

{'CLASS_STAR_ch1': '3.6 micrometer morphology classification, > 0.5 for resolved source',
 'CLASS_STAR_ch2': '4.5 micrometer morphology classification, > 0.5 for resolved source',
 'FLUXERR_AUTO_ch1': '3.6 micrometer flux error given by SExtractor',
 'FLUXERR_AUTO_ch2': '4.5 micrometer flux error given by SExtractor',
 'FLUX_AUTO_ch1': '3.6 micrometer flux value automatically extracted using SExtractor',
 'FLUX_AUTO_ch2': '4.5 micrometer flux value automatically extracted using SExtractor',
 'SPIES_ID': 'Unique ID assigned to each source in SpIES if match exists, otherwise -99',
 'dec_sdss': 'DEC from SDSS in degrees (J2000 degree)',
 'dec_sp': 'DEC from SpIES in degrees (J2000)',
 'dered_{band}': 'SDSS mag corrected for extinction',
 'extinction_{band}': 'Extinction in u,g,r,i,z',
 'gaia_id': 'Gaia DR2 source id if match exists, otherwise -99',
 'lcN': 'Number of data points in the corresponding light curve',
 'objid': 'DR7 coadd photo object id',
 'parallax': 'Gaia DR2 parallax in ma

GTR: This next cell needs some explanation.  

In [9]:
# see the first few keys in the sdss_lc group; each key is the ID_sdss value of the corresponding source in the catalog
list(root['sdss_lc'].array_keys())[:2]

['0', '10']