# Exploring a Data Repository

<br>Owner: **Rob Morgan** ([@rmorgan10](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@rmorgan10)), **Phil Marshall** ([@drphilmarshall](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@drphilmarshall))
<br>Last Verified to Run: **2018-12-07**
<br>Verified Stack Release: **16.0**

This notebook shows how to find out what's in a data repository, and how to find out which inputs went into each component of it.  

### Learning Objectives:
After working through this notebook, you should be able to use the Butler to find out:
   1. Which data types are present in a data repository;
   2. If coadds have been made, what the available tracts are;
   3. Which parts of the sky those tracts cover.
   
### Logistics
This notebook is intended to be runnable on `lsst-lspdev.ncsa.illinois.edu` from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.


## Set Up

In [1]:
import os
import sys
import warnings
import matplotlib.pyplot as plt
%matplotlib inline

# Filter some warnings printed by v16.0 of the stack
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=UserWarning)

## The HSC Data Repo: What's in there?
We'll use the `hsc` data repositories as our testing ground, and start by figuring out what they contain.

We'll need a butler to interrogate the `hsc` data repository.

In [2]:
from lsst.daf.persistence import Butler

# Instantiate the butler to bring us some HSC data.

depth = 'WIDE' # WIDE, DEEP, UDEEP
field = 'SSP_WIDE' # SSP_WIDE, SSP_DEEP, SSP_UDEEP

repo = '/datasets/hsc/repo/rerun/DM-13666/%s/'%(depth)
butler = Butler(repo)

print(repo)

/datasets/hsc/repo/rerun/DM-13666/WIDE/


In [4]:
visits = butler.queryMetadata('calexp', ['visit'])
pointings = butler.queryMetadata('calexp', ['pointing'])
ccds = butler.queryMetadata('calexp', ['ccd'])
fields = butler.queryMetadata('calexp', ['field'])
filters = butler.queryMetadata('calexp', ['filter'])

print(len(visits), len(pointings), len(ccds), len(fields), len(filters))

10333 133 112 60 13


In [15]:
visits = butler.queryMetadata('forced_src', ['visit'])
pointings = butler.queryMetadata('forced_src', ['pointing'])
ccds = butler.queryMetadata('forced_src', ['ccd'])
fields = butler.queryMetadata('forced_src', ['field'])
filters = butler.queryMetadata('forced_src', ['filter'])

print(len(visits), len(pointings), len(ccds), len(fields), len(filters))

10333 133 112 60 13


In [14]:
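# `shortlist` is the filtered list of dataset types built in cell In [10] further down.
for item in shortlist:
    try:
        keys = butler.getKeys(item)
        if 'pointing' in keys:
            print(item, keys)
    except Exception:
        # Skip dataset types whose keys cannot be looked up
        pass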

calexp {'pointing': <class 'int'>, 'filter': <class 'str'>, 'visit': <class 'int'>, 'ccd': <class 'int'>, 'field': <class 'str'>, 'dateObs': <class 'str'>, 'taiObs': <class 'str'>, 'expTime': <class 'float'>}
calexpBackground {'pointing': <class 'int'>, 'filter': <class 'str'>, 'visit': <class 'int'>, 'ccd': <class 'int'>}
calexpThumb {'pointing': <class 'int'>, 'filter': <class 'str'>, 'visit': <class 'int'>, 'ccd': <class 'int'>, 'field': <class 'str'>, 'dateObs': <class 'str'>, 'taiObs': <class 'str'>, 'expTime': <class 'float'>}
calexp_camera {'pointing': <class 'int'>, 'filter': <class 'str'>, 'visit': <class 'int'>}
calibrated_exp {'pointing': <class 'int'>, 'filter': <class 'str'>, 'tract': <class 'int'>, 'visit': <class 'int'>, 'ccd': <class 'int'>}
calibrated_src {'pointing': <class 'int'>, 'filter': <class 'str'>, 'tract': <class 'int'>, 'visit': <class 'int'>, 'ccd': <class 'int'>}
dcor {'pointing': <class 'int'>, 'filter': <class 'str'>, 'tract': <class 'int'>, 'visit': <cl

In [5]:
visits = butler.queryMetadata('deepCoadd_mergeDet', ['visit'])
pointings = butler.queryMetadata('deepCoadd_mergeDet', ['pointing'])
ccds = butler.queryMetadata('deepCoadd_mergeDet', ['ccd'])
fields = butler.queryMetadata('deepCoadd_mergeDet', ['field'])
filters = butler.queryMetadata('deepCoadd_mergeDet', ['filter'])

print(visits, pointings, ccds, fields, filters)

TypeError: sequence item 0: expected str instance, NoneType found

Data repositories contain either a `_mapper` file or a `repositoryCfg.yml` file, to record which "obs package" was used to organize the data. In the `hsc` case, the `_mapper` file is in the top level folder, while the data repo for each field is a few levels down.

In [None]:
! more /datasets/hsc/repo/_mapper

The `HscMapper` mapper class is defined in [HscMapper.py](https://github.com/lsst/obs_subaru/blob/master/python/lsst/obs/hsc/hscMapper.py). Let's read about it.

In [9]:
from lsst.obs.hsc import HscMapper

In [None]:
# help(HscMapper)

The mapper defines a (large) number of different "dataset types". Some of these are specific to this particular data repo, others are more general. Even filtering out some intermediate dataset types, we are still left with a long list. But, once we figure out which dataset types we are interested in, we can start querying for information about those datasets.

In [10]:
mapper = HscMapper(root=repo)
all_dataset_types = mapper.getDatasetTypes()

remove = ['_config', '_filename', '_md', '_sub', '_len', '_schema', '_metadata']

shortlist = []
for dataset_type in all_dataset_types:
    keep = True
    for word in remove:
        if word in dataset_type:
            keep = False
    if keep:
        shortlist.append(dataset_type)

print(shortlist)

['bfKernel', 'bias', 'bias_camera', 'brightObjectMask', 'calexp', 'calexpBackground', 'calexpThumb', 'calexp_bbox', 'calexp_calib', 'calexp_camera', 'calexp_detector', 'calexp_filter', 'calexp_visitInfo', 'calexp_wcs', 'calibrated_exp', 'calibrated_exp_bbox', 'calibrated_exp_calib', 'calibrated_exp_detector', 'calibrated_exp_filter', 'calibrated_exp_visitInfo', 'calibrated_exp_wcs', 'calibrated_src', 'camera', 'ccdExposureId', 'ccdExposureId_bits', 'coaddTempExp', 'coaddTempExp_bbox', 'coaddTempExp_calib', 'coaddTempExp_detector', 'coaddTempExp_filter', 'coaddTempExp_visitInfo', 'coaddTempExp_wcs', 'dark', 'dark_camera', 'dcor', 'dcor_bbox', 'dcor_calib', 'dcor_detector', 'dcor_filter', 'dcor_visitInfo', 'dcor_wcs', 'dcrCoadd', 'dcrCoaddId', 'dcrCoaddId_bits', 'dcrCoadd_bbox', 'dcrCoadd_calexp', 'dcrCoadd_calexp_background', 'dcrCoadd_calexp_bbox', 'dcrCoadd_calexp_calib', 'dcrCoadd_calexp_detector', 'dcrCoadd_calexp_filter', 'dcrCoadd_calexp_visitInfo', 'dcrCoadd_calexp_wcs', 'dcrCoad
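
The substring filter above can also be written more compactly with `any()`. The sketch below is a minimal illustration only: it runs on a short, hypothetical list of names standing in for `mapper.getDatasetTypes()`, and the helper name `filter_dataset_types` is made up for this example.

```python
# Hypothetical stand-in for mapper.getDatasetTypes(); the real list is much longer.
example_dataset_types = ['calexp', 'calexp_md', 'calexp_sub', 'src', 'src_schema',
                         'deepCoadd_config', 'deepCoadd_calexp']

# Same substrings as in the cell above.
remove = ['_config', '_filename', '_md', '_sub', '_len', '_schema', '_metadata']


def filter_dataset_types(dataset_types, unwanted):
    """Keep only the names that contain none of the unwanted substrings."""
    return [name for name in dataset_types
            if not any(word in name for word in unwanted)]


example_shortlist = filter_dataset_types(example_dataset_types, remove)
print(example_shortlist)
```

Because this is substring matching, any name that merely contains one of the fragments is dropped as well, which is worth keeping in mind when a dataset type seems to be missing from the shortlist.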

The `butler` can check whether a dataset exists, but only for one specific dataset ID at a time. Here's what you get when you pass in an empty dataset ID:

In [None]:
butler.datasetExists('calexp', dataId={})

Instead, one can try querying the metadata and checking for an error.

In [None]:
datasettype = 'calexp'

try:
    # Look up the keys for this dataset type, then query the values of the first one
    datasetkeys = butler.getKeys(datasettype)
    onekey = list(datasetkeys.keys())[0]
    metadata = butler.queryMetadata(datasettype, [onekey])
    print("{} dataset exists.".format(datasettype))
except Exception:
    print("{} dataset doesn't exist.".format(datasettype))
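
To run the same check over many dataset types, the try/except pattern can be wrapped in a small helper. This is a sketch under stated assumptions: `dataset_type_exists` is a hypothetical name, and `StubButler` is a made-up stand-in that only mimics the two methods used above so the example runs without the Stack. A real butler may raise different exceptions than the stub's `KeyError`.

```python
def dataset_type_exists(butler, datasettype):
    """Return True if querying metadata for `datasettype` raises no error."""
    try:
        datasetkeys = butler.getKeys(datasettype)
        onekey = list(datasetkeys.keys())[0]
        butler.queryMetadata(datasettype, [onekey])
    except Exception:
        return False
    return True


class StubButler:
    """Hypothetical stand-in for a Butler, used only to exercise the helper."""

    def __init__(self, keys_by_type):
        self._keys_by_type = keys_by_type

    def getKeys(self, datasettype):
        # Raises KeyError for dataset types the stub does not know about
        return self._keys_by_type[datasettype]

    def queryMetadata(self, datasettype, keys):
        return [1, 2, 3]


stub = StubButler({'calexp': {'visit': int, 'ccd': int}})
found = {name: dataset_type_exists(stub, name) for name in ['calexp', 'notAType']}
print(found)
```

With a real butler, the same dictionary comprehension could be run over a list of dataset type names to see at a glance which ones are queryable.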

### Basic dataset properties
For this dataset, we can look at the filters used, number of visits, number of pointings, etc. by examining the Butler's keys and metadata:

In [None]:
# Interesting dataset types defined by the HscMapper.
datasettypes = ['calexp', 'calexpBackground', 'icSrc', 
                'src', 'srcMatch', 'srcMatchFull', 'ossImage', 
                'flattenedImage', 'wcs', 'fcr', 'photoCalib',
                'jointcal_wcs', 'jointcal_photoCalib', 'skyCorr',
                'calexp_camera', 'brightObjectMask', 'deepCoadd_calexp', 
                'deepCoadd_det', 'deepCoadd_meas', 'deepCoadd_measMatch', 
                'deepCoadd_mergeDet', 'deepCoadd_ref', 'deepCoadd_forced_src', 
                'forced_src' ]

For these basic properties, we will look at the `calexp` and `src` dataset types.

In [None]:
# This would be faster if only one query were issued
visits = butler.queryMetadata('calexp', ['visit'])
pointings = butler.queryMetadata('calexp', ['pointing'])
ccds = butler.queryMetadata('calexp', ['ccd'])
fields = butler.queryMetadata('calexp', ['field'])
filters = butler.queryMetadata('calexp', ['filter'])

# Collect number of objects from Source Catalog
sources = butler.queryMetadata('src', ['id'])
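
On the single-query idea in the comment above: the Gen2 `queryMetadata` docstring describes its format argument as a key or a list of keys, returning tuples when several keys are requested, so one call with `['visit', 'pointing', 'ccd', 'field', 'filter']` should be able to replace the five separate calls. That has not been checked against this repo. The sketch below only shows how such tuples could be unpacked into per-key unique values; the rows are invented and `unique_values_per_key` is a hypothetical helper.

```python
keys = ['visit', 'pointing', 'ccd', 'field', 'filter']

# Hypothetical rows, shaped like the tuples a multi-key query is expected to return.
rows = [
    (1228, 814, 0, 'SSP_WIDE', 'HSC-I'),
    (1228, 814, 1, 'SSP_WIDE', 'HSC-I'),
    (1230, 814, 0, 'SSP_WIDE', 'HSC-I'),
    (9840, 1011, 0, 'SSP_WIDE', 'HSC-R'),
]


def unique_values_per_key(keys, rows):
    """Map each key to the sorted list of distinct values in its column."""
    columns = zip(*rows)  # transpose rows into one column per key
    return {key: sorted(set(column)) for key, column in zip(keys, columns)}


unique = unique_values_per_key(keys, rows)
counts = {key: len(values) for key, values in unique.items()}
print(counts)
```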

In [None]:
num_visits = len(visits)
num_pointings = len(pointings)
num_ccds = len(ccds)
num_fields = len(fields)
num_filters = len(filters)

num_sources = len(sources)

In [None]:
from IPython.display import display, Markdown
import numpy as np

One key quantity for astronomers is the total sky area imaged. We can estimate this from the coadd tract info.

In [None]:
# Collect tracts from files: each subdirectory of deepCoadd-results/merged is named after a tract
import glob
import os
tracts = sorted([int(os.path.basename(x)) for x in
                 glob.glob(os.path.join(repo, 'deepCoadd-results', 'merged', '*'))])
num_tracts = len(tracts)

# Note: I'd like to do this with the butler, but it appears that tracts have to be
#       specified in the dataId to be queried, so the queryMetadata method fails
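
The directory listing above assumes every entry under `deepCoadd-results/merged` has an integer name. A slightly more defensive variant, sketched here with a hypothetical helper `list_tracts` and a throwaway directory tree rather than the real repo, skips anything that is not a numerically named directory:

```python
import os
import tempfile


def list_tracts(merged_dir):
    """Return the sorted integer names of the subdirectories of `merged_dir`."""
    tracts = []
    for name in os.listdir(merged_dir):
        path = os.path.join(merged_dir, name)
        # Ignore stray files and directories whose names are not plain integers
        if os.path.isdir(path) and name.isdigit():
            tracts.append(int(name))
    return sorted(tracts)


# Build a temporary directory tree (made-up tract numbers) to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    merged = os.path.join(tmp, 'deepCoadd-results', 'merged')
    for name in ['9813', '8524', 'README']:
        os.makedirs(os.path.join(merged, name))
    example_tracts = list_tracts(merged)

print(example_tracts)
```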

In [None]:
# Calculate area from all tracts
skyMap = butler.get('deepCoadd_skyMap')
total_area = 0.0  # deg^2
plotting_vertices = []
for test_tract in tracts:
    # Get inner vertices for tract, using the public accessor
    tractInfo = skyMap[test_tract]
    vertices = tractInfo.getVertexList()
    plotting_vertices.append(vertices)

    # Flat-sky area of the box: RA width scaled by cos(mean Dec), times Dec height
    av_dec = 0.5 * (vertices[2][1] + vertices[0][1])
    av_dec = av_dec.asRadians()
    delta_ra_raw = vertices[0][0] - vertices[1][0]
    delta_ra = delta_ra_raw.asDegrees() * np.cos(av_dec)
    delta_dec = vertices[2][1] - vertices[0][1]
    area = delta_ra * delta_dec.asDegrees()

    # Combine areas
    total_area += area

print("Total area imaged (sq deg): ", total_area)

# Round total area for table purposes
rounded_total_area = round(total_area, 2)
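
The per-tract area above is a flat-sky estimate: the RA width is scaled by the cosine of the mean declination and multiplied by the Dec height. As a sanity check on that approximation, the sketch below compares it with the exact solid angle of an RA/Dec-aligned box, $\Delta\alpha\,(\sin\delta_{\max} - \sin\delta_{\min})$. The box coordinates and both function names are hypothetical, and real tract outlines need not be exact RA/Dec-aligned boxes.

```python
import math


def flat_sky_area(ra_min, ra_max, dec_min, dec_max):
    """Approximate area in deg^2: RA width * cos(mean Dec) * Dec height."""
    mean_dec = math.radians(0.5 * (dec_min + dec_max))
    return (ra_max - ra_min) * math.cos(mean_dec) * (dec_max - dec_min)


def exact_box_area(ra_min, ra_max, dec_min, dec_max):
    """Exact area in deg^2 of an RA/Dec-aligned box on the sphere."""
    delta_ra = math.radians(ra_max - ra_min)
    solid_angle = delta_ra * (math.sin(math.radians(dec_max))
                              - math.sin(math.radians(dec_min)))
    return solid_angle * math.degrees(1.0) ** 2  # steradians -> square degrees


# Hypothetical 1.5 x 1.5 degree boxes centred at two declinations.
for dec_center in (0.0, 60.0):
    box = (10.0, 11.5, dec_center - 0.75, dec_center + 0.75)
    print(dec_center, flat_sky_area(*box), exact_box_area(*box))
```

For boxes of this size the two numbers agree closely, which suggests the flat-sky shortcut is adequate for a summary table.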


It might also be interesting to plot the sky coverage while we have the tractInfo...

Note that Objects (in the LSST data products definition document sense of the word) can be thought of as being the `coadd_sources` contained in the `deepCoadd_mergeDet` dataset.

In [None]:
# Print out a report of the metadata, titled with the repo path
display(Markdown('### %s' % repo))

# Make a table of the collected metadata
collected_data = [num_visits, num_pointings, num_ccds, num_fields, num_filters, num_sources,
                  num_tracts, rounded_total_area]
data_names = ("Number of Visits", "Number of Pointings", "Number of CCDs", "Number of Fields",
              "Number of Filters", "Number of Sources", "Number of Tracts", "Total Sky Area (deg$^2$)")

# Build the Markdown table one row per quantity
output_table = "| Metadata Characteristic | Value |\n| ---: | ---: |\n"
for name, value in zip(data_names, collected_data):
    output_table += "| %s | %s |\n" % (name, value)
display(Markdown(output_table))

# Show which fields and filters we're talking about:
display(Markdown('Fields: (%i total)' %num_fields))
print(fields)
display(Markdown('Filters: (%i total)' %num_filters))
print(filters)


### Plotting the sky coverage

For this we will need our list of `tracts` from above, and also the `skyMap` object. We can then extract the sky coordinates of the corners of each tract, and use them to draw a set of rectangles to illustrate the sky coverage, following Jim Chiang's LSST DESC tutorial [dm_butler_skymap.ipynb](https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/dm_butler_skymap.ipynb).

In the future, we could imagine overlaying the focal plane and coloring the individual visits, using more of the code from Jim's notebook. Let's see what functionality the Gen3 Butler provides first, and then return to visualization.

In [None]:
# How many tracts do we have?
print("Found {} tracts".format(len(tracts)))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
plt.figure()

for tract in tracts:
    tractInfo = skyMap[tract]
        
    corners = [(x[0].asDegrees(), x[1].asDegrees()) for x in tractInfo.getVertexList()]
    x = [k[0] for k in corners] + [corners[0][0]]
    y = [k[1] for k in corners] + [corners[0][1]]
    
       
    plt.plot(x,y, color='b')
    
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title('2D Projection of Sky Coverage')

plt.show()

We could imagine plotting the patches as well, to show which tracts were incomplete - but this gives us a rough idea of where our data is on the sky.