# Exploring the Gen-3 Butler

<br>Owners: **Alex Drlica-Wagner** ([@kadrlica](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@kadrlica)), **Douglas Tucker** ([@douglasleetucker](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@douglasleetucker))
<br>Last Verified to Run: **2021-04-16**
<br>Verified Stack Release: **w_2021_16**

## Core Concepts

This notebook provides a first look at the structure and organization of the DC2 repo created with the Gen-3 Butler. The Gen-3 Butler is still under development, so this notebook is expected to be updated after the official Gen-3 release. For the time being, be sure that you are using the verified version of the stack specified above.

## Learning Objectives:

This notebook demonstrates how to:

1. Create a Gen-3 butler
2. Programmatically explore a Gen-3 repo
3. Get some data

## Setup

In [None]:
# This should match the verified version listed at the start of the notebook
! eups list -s lsst_distrib

In [None]:
# Generic imports
import os,glob
import pylab as plt

In [None]:
# Stack imports
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay

In [None]:
# Only one dataset right now: DC2
dataset='DC2'
repo='/repo/dc2'
collection='2.2i/runs/DP0.1'

## Gen-3 Butler

One of the strengths of the Gen-3 butler relative to Gen-2 is the ability to explore a repo and find out what it contains. Starting from scratch, we want to be able to get going *with only the path to the repo*. 

We can do this by creating a butler without specifying the collection (since we have no idea what collections exist at this point).

In [None]:
butler = dafButler.Butler(repo)

With the butler created, we can now access the data `registry` (a database containing information about available data products).

In [None]:
registry = butler.registry

# We can examine the registry with
#help(registry)

The `registry` is a good tool for investigating a repo (more on the registry schema can be found [here](https://dmtn-073.lsst.io/)). For example, we can list all collections:

In [None]:
for c in sorted(registry.queryCollections()):
    print(c)

This is our first glimpse at the data contained in the repo, but it doesn't tell us *which* collection we are actually interested in. The names do give us some hints though...

* `calib` - calibration products that are used for instrument signature removal
* `refcats` - reference catalogs
* `skymaps` - geometric representations of the sky coverage
* `u/` - collections that begin with `u/` are used for personal re-runs
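
As a small illustration of the naming conventions above, the sketch below separates personal re-runs from shared collections using only the `u/` prefix. The function name `split_collections` and the example names are hypothetical; in the notebook you would pass in the output of `registry.queryCollections()` instead.

```python
def split_collections(names, user_prefix="u/"):
    """Split collection names into (shared, personal) lists by prefix."""
    names = sorted(names)
    personal = [n for n in names if n.startswith(user_prefix)]
    shared = [n for n in names if not n.startswith(user_prefix)]
    return shared, personal


# Hypothetical listing; real names come from registry.queryCollections()
example = ["skymaps", "u/someone/test", "2.2i/runs/DP0.1", "refcats"]
shared, personal = split_collections(example)
print(shared)
print(personal)
```

This relies purely on the name prefix, so it is only as reliable as the naming convention itself.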

We can generally get access to everything we are interested in for DC2 Run 2.2i DP0.1 by selecting the collection `2.2i/runs/DP0.1`. This collection is a pointer to other collections, which expand out recursively. More on collections can be found here: https://pipelines.lsst.io/v/weekly/modules/lsst.daf.butler/organizing.html#collections

In [None]:
# If this collection is a pointer to other collections, expand those out recursively.
print(collection)
for c in sorted(registry.queryCollections(collection,flattenChains=True)):
    print(c, registry.getCollectionType(c))
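
To make the idea of a collection that "expands out recursively" concrete, here is a plain-Python sketch of flattening a chain. It is only an illustration of the concept, not the butler's implementation; the `flatten_chain` function, the `chains` mapping, and the child collection names are all hypothetical.

```python
def flatten_chain(name, children):
    """Depth-first expansion of a chained collection into its leaf members."""
    if name not in children:
        # Not a chain, so this is a leaf collection
        return [name]
    flat = []
    for child in children[name]:
        for leaf in flatten_chain(child, children):
            if leaf not in flat:  # keep the first occurrence only
                flat.append(leaf)
    return flat


# Hypothetical layout: each key is a chain pointing at its children
chains = {
    "2.2i/runs/DP0.1": ["example/processing-run", "example/defaults"],
    "example/defaults": ["calib", "skymaps", "refcats"],
}
flat = flatten_chain("2.2i/runs/DP0.1", chains)
print(flat)
```

The order of the flattened list follows the order of the children, which mirrors the fact that the order of a chain determines where the butler looks first.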

In [None]:
# Create a new butler with the collection of interest
butler = dafButler.Butler(repo,collections=collection)
registry = butler.registry

DatasetTypes don't belong to collections, so when you query for them you always get all the DatasetTypes defined in the repo. That includes every datasetType created by anyone during any processing, so the list may contain intermediate products that were created during processing but no longer exist.

In [None]:
for x in sorted(registry.queryDatasetTypes()):
    print(x)
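
Since the full list of datasetTypes can be long, it may help to filter it by name. The sketch below assumes only that each datasetType object has a `name` attribute; the helper `match_dataset_types` is a hypothetical convenience function, not part of the butler.

```python
import re


def match_dataset_types(dataset_types, pattern):
    """Return the sorted names of dataset types matching a regular expression."""
    regex = re.compile(pattern)
    return sorted(dt.name for dt in dataset_types if regex.search(dt.name))
```

In the notebook this could be called as `match_dataset_types(registry.queryDatasetTypes(), r"^calexp")` to list only the types whose names start with `calexp`.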

It is possible to get all `DatasetRef`s (which include the `dataId`) for a specific `datasetType` in a specific collection with a query like the one below. Note that this doesn't necessarily guarantee that the specific dataset still exists on disk, so we also ask the butler for its URI.

In [None]:
datasetRefs = registry.queryDatasets(datasetType='calexp', collections=collection)
for i, ref in enumerate(datasetRefs):
    print(ref.dataId)
    # getURI raises if the butler cannot locate the file for this ref
    try:
        butler.getURI(ref)
    except Exception:
        print("File not found...")
    if i > 10:
        break
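
The same "peek at the first few refs" pattern can be wrapped in a small helper. This is a minimal sketch, assuming a `registry` and `butler` like the ones created above; `preview_refs` is a hypothetical name, and it uses `itertools.islice` so that only the first `n` results are pulled from the query.

```python
from itertools import islice


def preview_refs(registry, butler, dataset_type, collections, n=10):
    """Return (dataId, uri) pairs for the first n refs; uri is None if missing."""
    refs = registry.queryDatasets(datasetType=dataset_type, collections=collections)
    preview = []
    for ref in islice(refs, n):
        try:
            uri = butler.getURI(ref)
        except Exception:
            # The registry knows about the dataset, but no file could be located
            uri = None
        preview.append((ref.dataId, uri))
    return preview
```

For example, `preview_refs(registry, butler, 'calexp', collection, n=5)` would return five pairs without iterating over the rest of the query results.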

In [None]:
# You can also sub-select on specific properties of a dataset
datasetRefs = registry.queryDatasets(datasetType='calexp', dataId={'band': 'z'}, where='visit > 700000', collections=collection)
for i, ref in enumerate(datasetRefs):
    print(ref.dataId)
    # getURI raises if the butler cannot locate the file for this ref
    try:
        butler.getURI(ref)
    except Exception:
        print("File not found...")
    if i > 10:
        break
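
The `where` argument takes an SQL-like expression, and several constraints can be combined with `AND`. The sketch below builds such a string from individual clauses; `build_where` is a hypothetical helper, and the constraints are the same ones used in the cell above.

```python
def build_where(*clauses):
    """Join constraint strings with AND, wrapping each one in parentheses."""
    return " AND ".join(f"({c})" for c in clauses if c)


where = build_where("band = 'z'", "visit > 700000")
print(where)  # (band = 'z') AND (visit > 700000)
```

The resulting string could then be passed as `where=where` to `registry.queryDatasets`, in place of splitting the constraints between `dataId` and `where`.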

We now know which collections exist (`2.2i/runs/DP0.1` in particular), which `datasetTypes` are defined in the repo, and the `datasetRefs` (which contain `dataIds`) for data products of the requested type. This is all the information that we need to get the dataset of interest!

From the list above, we choose the first dataId:

In [None]:
# The dataId that we found before...
x = list(datasetRefs)
print(x[0].dataId)

In [None]:
# The refs found above point to calexp datasets. To get the matching src table
# we pass the same dataId to the butler with the 'src' datasetType; this makes
# the butler perform another query of the database. (A ref for the dataset you
# want can instead be given to the butler directly, which avoids that query.)
ref = x[0]
src = butler.get('src', dataId=ref.dataId)
src = src.copy(True)
src.asAstropy()

Now to get the `calexp` associated with this exposure and detector, we pass the `dataId` to the butler with the `calexp` datasetType. Note that this performs another query to the registry database to find a calexp that matches our dataId requirements.

In [None]:
# To get the calexp, we pass the dataId
calexp = butler.get('calexp', dataId=ref.dataId)

We can now plot the calexp with the src catalog overlaid. We leave the investigation of this image as an exercise to the user :)

In [None]:
# And plot!
afwDisplay.setDefaultBackend('matplotlib') 
fig = plt.figure(figsize=(10,8))
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange') 

In the case above, both the src and calexp can be found by the registry, but this will not necessarily be the case. The `queryDataIds` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that dataId) or to ask for different dataId keys than those used to identify the dataset (which invokes various built-in relationships). An example of this is provided below:

In [None]:
# Use queryDataIds to provide more flexible access
dataIds = list(registry.queryDataIds(["exposure", "detector"], datasets=["calexp","src"], collections="shared/ci_hsc_output"))
for dataId in dataIds:
    print(dataId)

Now say we wanted to select all detectors with calexp and src datasets associated with a specific filter. We can add that constraint to our query, but first we need to figure out what the filters are called... Looking at the dataId object, we see the attributes `band` and `physical_filter` look promising.

In [None]:
dataIds[0].full

In [None]:
print(f"physical_filter = {dataId['physical_filter']}")
print(f"band = {dataId['band']}")

It looks like `band` is what we want, so we put it in the `where` argument of `queryDataIds`.

In [None]:
# Use queryDataIds to grab the dataIds for all i-band detectors
dataIds = list(registry.queryDataIds(["exposure", "detector"], datasets=["calexp","src"], where="band='i'", collections="shared/ci_hsc_output"))
for dataId in dataIds:
    print(dataId['band'], dataId)

You can also get more metadata about a data product from its dimension `records`, which are returned by `queryDimensionRecords`.

In [None]:
records = registry.queryDimensionRecords('exposure', where='visit = 971990')
for i,rec in enumerate(records):
    print(rec)
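
If you only care about a few fields of each record, they can be pulled out into plain dictionaries. This is a minimal sketch that assumes the fields of a record are available as attributes; `records_to_rows` is a hypothetical helper, and the field names you pass in must match what the printed records above actually contain.

```python
def records_to_rows(records, fields):
    """Extract the named fields from each record into a list of dicts.

    Fields that a record does not have are returned as None.
    """
    return [{f: getattr(rec, f, None) for f in fields} for rec in records]
```

For example, `records_to_rows(records, ['id', 'exposure_time'])` would give one dictionary per exposure, assuming (not verified here) that the exposure records carry `id` and `exposure_time` fields.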

## Some Exploration

Below is a scratch space for playing with things...