**This notebook explains the zarr file data structure and shows how to interact with zarr directly.**

In [36]:
import zarr
import pandas as pd
pd.set_option('display.max_columns', 999)

Below are a few words quoted directly from zarr's [website](https://zarr.readthedocs.io/en/stable/#highlights).

> Zarr is a Python package providing an implementation of chunked, compressed, N-dimensional arrays.

#### Highlights:
> * Create N-dimensional arrays with any NumPy dtype.  
> * Chunk arrays along any dimension.  
> * Compress and/or filter chunks using any NumCodecs codec.  
> * Store arrays in memory, on disk, inside a Zip file, on S3, …  
> * Read an array concurrently from multiple threads or processes.  
> * Write to an array concurrently from multiple threads or processes.  
> * Organize arrays into hierarchies via groups.  

I chose zarr as the backend storage format mostly because it supports concurrent reads and writes, which will be extremely beneficial when our training set grows too large to fit into memory. In addition, it integrates well with the dask package for implementing multi-process or multi-threaded algorithms.

Here I will briefly show how to interact directly with the training set stored in zarr. For a more detailed tutorial, please check out [zarr's documentation](https://zarr.readthedocs.io/en/stable/#highlights).

In [37]:
# create a connection to the file (without loading anything)
root = zarr.open('../Data/qso.zarr')

In [38]:
# check out the datasets/groups under the root directory
# leaving out level=1 shows the full hierarchy, which may crash if there are too many subdirectories or datasets
root.tree(level=1) 

In [39]:
# see a brief explanation of the purpose of each dataset/group
root.attrs.asdict()

{'catalog': 'Master catalog dataset, one source per row',
 'cross_id': 'This group stores the CRTS light curve ID for matched sources',
 'crts_lc': 'This group stores light curve from CRTS, each dataset is one light curve',
 'sdss_lc': 'This group stores light curve from SDSS, each dataset is one light curve'}

In [34]:
# get the catalog dataset
cat_df = pd.DataFrame(root['catalog'][:])
cat_df.head(2)

Unnamed: 0,RA_sdss,DEC_sdss,Z,Z_ERR,SpecQ,Var_LC,good_z,ID_sdss,RA_sp,DEC_sp,SPIES_ID,FLUX_AUTO_ch1,FLUXERR_AUTO_ch1,FLUX_AUTO_ch2,FLUXERR_AUTO_ch2,CLASS_STAR_ch1,CLASS_STAR_ch2,sdss2spies,sdss2gaia,gaia_id,pmra,pmra_error,pmdec,pmdec_error,r,ug,gr,ri,iz,gN,gAmpl,rN,rAmpl,iN,iAmpl
0,2.169302,1.238649,1.0733,0.0017,7,0,1,70,-999.0,-999.0,-999,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,2.8e-05,2546563316130849792,-999.0,-999.0,-999.0,-999.0,20.135056,0.293376,0.248568,0.142036,0.026312,60,0.379,60,0.409,59,0.414
1,1.091028,0.962126,0.7926,0.0007,7,0,1,98,-999.0,-999.0,-999,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,3e-05,2738365148137353216,1.004082,1.614945,0.956702,1.407345,19.751278,0.558688,0.297034,0.007393,0.366406,52,0.631,52,0.485,52,0.622
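
Many columns in the two rows above hold `-999`, which looks like a placeholder for a missing cross-match or measurement. That reading is an assumption, since the notebook does not state it. Under that assumption, the sketch below rebuilds a small structured array from a few of the values shown above, converts it to a DataFrame the same way as `pd.DataFrame(root['catalog'][:])`, and masks the placeholder in the floating-point columns.

```python
import numpy as np
import pandas as pd

# A tiny structured array using four columns and the two rows shown above.
records = np.array(
    [(70, 1.0733, -999.0, -999.0),
     (98, 0.7926, -999.0, 1.004082)],
    dtype=[('ID_sdss', 'i8'), ('Z', 'f8'), ('RA_sp', 'f8'), ('pmra', 'f8')],
)

# Field names of the structured array become DataFrame columns.
demo_df = pd.DataFrame(records)

# ASSUMPTION: -999 marks a missing value; replace it with NaN in float columns.
float_cols = ['RA_sp', 'pmra']
clean_df = demo_df.copy()
clean_df[float_cols] = clean_df[float_cols].replace(-999.0, np.nan)

# Count the masked entries per column.
n_missing = clean_df[float_cols].isna().sum()
```

Masking with NaN keeps pandas aggregations such as `mean()` from being skewed by the placeholder, but it should only be applied once the meaning of `-999` has been confirmed for each column.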


In [32]:
# show the definition of each column in the catalog dataset
root['catalog'].attrs.asdict()

{'CLASS_STAR_ch1': '3.6 micrometer morphology classification, > 0.5 for resolved source',
 'CLASS_STAR_ch2': '4.5 micrometer morphology classification, > 0.5 for resolved source',
 'DEC_sdss': 'DEC from SDSS in degrees (J2000)',
 'DEC_sp': 'DEC from SpIES in degrees (J2000)',
 'FLUXERR_AUTO_ch1': '3.6 micrometer flux error given by SExtractor',
 'FLUXERR_AUTO_ch2': '4.5 micrometer flux error given by SExtractor',
 'FLUX_AUTO_ch1': '3.6 micrometer flux value automatically extracted using SExtractor',
 'FLUX_AUTO_ch2': '4.5 micrometer flux value automatically extracted using SExtractor',
 'ID_sdss': 'Unique ID for both QSO and variable stars',
 'RA_sdss': 'RA from SDSS in degrees (J2000)',
 'RA_sp': 'RA from THE SPITZER IRAC EQUATORIAL SURVEY (SpIES) in degrees (J2000)',
 'SPIES_ID': 'Unique ID assigned to each source in SpIES',
 'SpecQ': 'Source of spectrum, 7 for SDSS DR7 and 14 for SDSS DR14',
 'Var_LC': '1 for light curve of corresponding object comes from Ivezic s82 variables catalo

In [40]:
# see the first few keys in the sdss_lc group; each key is the ID_sdss value of the corresponding source in the catalog
list(root['sdss_lc'].array_keys())[:2]

['1000679', '1000743']