# Exploring the Gen-3 Butler

<br>Owners: **Alex Drlica-Wagner** ([@kadrlica](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@kadrlica)), **Douglas Tucker** ([@douglasleetucker](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@douglasleetucker))
<br>Last Verified to Run: **2020-08-10**
<br>Verified Stack Release: **v20.0.0**

## Core Concepts

This notebook provides a first look at the structure and organization of a repo created with the Gen-3 Butler. The Gen-3 Butler is still under development, so this notebook is expected to be updated after the Gen-3 release. For the time being, be sure that you are using the verified version of the stack specified above.

## Learning Objectives:

This notebook walks through the basics of working with the Gen-3 butler:

1. Explore a Gen-3 data repo
2. Create a Gen-3 butler
3. Use the Gen-3 butler to explore the ci_hsc_gen3 data repo


In [None]:
# This should match the verified version listed at the start of the notebook
! eups list -s lsst_distrib

In [None]:
# Generic imports
import os
import glob
import matplotlib.pyplot as plt

In [None]:
# Stack imports
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay

To get a data repo that was run with the Gen-3 butler, I used the HSC continuous integration (CI) sample. The resulting repo lives in the directory set below:

In [None]:
# Directory where the data repo lives
repo='/project/shared/data/ci_hsc_gen3-w_2020_22/DATA'

You can poke around this directory a bit to see what outputs have been created.

In [None]:
# The base directory for the repo
!ls $repo

In [None]:
# The outputs are stored under `shared/ci_hsc_output`
outdir=glob.glob(f'{repo}/shared/ci_hsc_output/*')[0]
!ls $outdir
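
The `glob.glob(...)[0]` idiom above raises an `IndexError` if nothing matches the pattern. A minimal pure-Python sketch of a more defensive version is shown here; the helper names `first_match` and `list_entries` are hypothetical and not part of the stack. Unlike the cell above, this sketch sorts the matches so that the "first" one is deterministic.

```python
import glob
import os


def first_match(pattern):
    """Return the first path matching `pattern` (after sorting), or None."""
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def list_entries(path):
    """Return the sorted names of the entries directly under `path`."""
    return sorted(os.listdir(path))


# Usage with the notebook's `repo` variable (assumed to be defined as above):
# outdir = first_match(f'{repo}/shared/ci_hsc_output/*')
# if outdir is not None:
#     print(list_entries(outdir))
```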

To create the butler you need to pass it a configuration file (or the path to the repo that contains one) and a run or collection name. The run name tells the butler where to place output files. More on Butler configuration can be found [here](https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structure, we find that the collection is `shared/ci_hsc_output`.

In [None]:
butler = dafButler.Butler(repo,collections="shared/ci_hsc_output")

# Optionally, you can specify the repo config explicitly
#config = os.path.join(repo,'butler.yaml')
#butler = dafButler.Butler(config=config,collections="shared/ci_hsc_output")
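
Before pointing the butler at a directory, it can help to check that the repo config is actually there. The sketch below assumes the config file is named `butler.yaml` and sits at the top of the repo, as in the commented-out lines above. `find_butler_config` is a hypothetical helper, not a butler API.

```python
import os


def find_butler_config(repo, name='butler.yaml'):
    """Return the path to the repo config file, or None if it is missing.

    The default file name is an assumption taken from the cell above.
    """
    path = os.path.join(repo, name)
    return path if os.path.isfile(path) else None


# Usage with the notebook's `repo` variable (assumed to be defined as above):
# config = find_butler_config(repo)
# if config is None:
#     print(f'No butler.yaml found under {repo}')
```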

With the Gen-2 butler, there was no good way to investigate what data exist in a repo. To get around this, we all developed a habit of investigating the directory structure and file names to figure out what data existed.

In [None]:
!ls $outdir/calexp

In [None]:
!ls $outdir/calexp/r/HSC-R

In [None]:
!ls $outdir/calexp/r/HSC-R/903338

Based on these filenames, we have enough to specify the dataId to pass to the butler...

In [None]:
dataId = {'visit':903338,'detector':25,'instrument':'HSC'}
calexp = butler.get('calexp', dataId=dataId)
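
The directory listing above already carries part of the dataId. As a rough sketch, the path components can be turned into a dictionary of hints. This assumes that the four-level layout `datasetType/abstract_filter/physical_filter/visit` seen in the `ls` commands holds in general, which is only a guess from this one repo. The function name is hypothetical, and the detector number still has to be read from the file names.

```python
def data_id_hints_from_path(relpath):
    """Split a relative output path into dataId hints.

    Assumes the layout datasetType/abstract_filter/physical_filter/visit,
    inferred from the directory listing above (not a documented convention).
    """
    parts = relpath.strip('/').split('/')
    if len(parts) != 4:
        raise ValueError(f'expected 4 path components, got {len(parts)}')
    dataset_type, abstract_filter, physical_filter, visit = parts
    return {
        'datasetType': dataset_type,
        'abstract_filter': abstract_filter,
        'physical_filter': physical_filter,
        'visit': int(visit),
    }


print(data_id_hints_from_path('calexp/r/HSC-R/903338'))
```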

In [None]:
afwDisplay.setDefaultBackend('matplotlib') 
fig = plt.figure(figsize=(10,8))
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')
# And if it wasn't sacrilege I would rotate this image...

## Gen-3 Butler

Ok, so how do we do this in Gen-3 land? Starting from scratch, we want to be able to get going *with only the path to the repo*. 

We can now do this by creating a butler without specifying the collection (since we have no idea what collections exist at this point).

In [None]:
butler = dafButler.Butler(repo)

With the butler created, we can now access the data `registry` (a database containing information about available data products).

In [None]:
registry = butler.registry

# We can examine the registry with
#help(registry)

The `registry` is a good tool for investigating a repo (more on the registry schema can be found [here](https://dmtn-073.lsst.io/)). For example, we can get a list of all collections, which includes the `shared/ci_hsc_output` collection that we were using before.

In [None]:
for c in registry.queryCollections():
    print(c)
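
If a repo holds many collections, it may be easier to filter the names than to scan the printed list. The sketch below works on plain strings, so it does not depend on the stack. `matching_collections` is a hypothetical helper, and every collection name in the example other than `shared/ci_hsc_output` is made up for illustration.

```python
from fnmatch import fnmatchcase


def matching_collections(names, pattern):
    """Return the collection names that match a shell-style pattern."""
    return [str(name) for name in names if fnmatchcase(str(name), pattern)]


# Stand-in names; only 'shared/ci_hsc_output' comes from this notebook
example_names = ['example/raw', 'example/calib', 'shared/ci_hsc_output']
print(matching_collections(example_names, '*ci_hsc*'))

# With a real registry (assumed to be defined as above):
# print(matching_collections(registry.queryCollections(), '*ci_hsc*'))
```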

Now that we "know" that `shared/ci_hsc_output` exists, let's create our butler with this collection:

In [None]:
butler = dafButler.Butler(repo,collections='shared/ci_hsc_output')
registry = butler.registry

We can also use the registry to get a list of all dataset types (for example, we see that `calexp` is available, but that we could also ask directly for `calexp.image` or `calexp.mask`).

In [None]:
for x in registry.queryDatasetTypes():
    print(x)

We suspect that this list contains every datasetType that the processing *tried* to create. Some of these may be intermediate products that were created during processing but no longer exist.
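
Because component dataset types such as `calexp.image` share a prefix with their parent, the list is easier to read when grouped. The sketch below groups plain name strings by the part before the first dot. `group_components` is a hypothetical helper, and the assumption is that component names always follow the `parent.component` pattern seen in the output above.

```python
from collections import defaultdict


def group_components(names):
    """Group dataset type names by parent, e.g. 'calexp.image' under 'calexp'."""
    groups = defaultdict(list)
    for name in names:
        parent, _, component = name.partition('.')
        groups[parent]  # make sure the parent appears even without components
        if component:
            groups[parent].append(component)
    return {parent: sorted(components) for parent, components in groups.items()}


print(group_components(['calexp', 'calexp.image', 'calexp.mask', 'src']))

# With a real registry (assumed to be defined as above):
# print(group_components(dt.name for dt in registry.queryDatasetTypes()))
```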

It is now possible to get all `DatasetRef` (including `dataId`) for a specific `datasetType` in a specific collection with a query like the one that follows.

In [None]:
datasetRefs = list(registry.queryDatasets(datasetType='src',collections=['shared/ci_hsc_output']))
for ref in datasetRefs:
    print(ref.dataId)

Ok, we now know what collections exist (`shared/ci_hsc_output` in particular), the `datasetTypes` that are defined for that collection, and the `datasetRefs` (which contain `dataIds`) for data products of the requested type. This is all the information that we need to get the dataset of interest!

From the list above, we find that the dataId we were investigating before has index 16.

In [None]:
# The dataId that we found before...
ref = datasetRefs[16]
print(ref.dataId)
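
Reading the index off the printed list is fragile, since the order of the query results may differ between repos. A minimal sketch of looking the ref up by its dataId values is shown here. `find_ref_index` is a hypothetical helper that only assumes each ref has a `dataId` that can be indexed by key. The stand-in refs are built with `SimpleNamespace` so that the example runs without the stack, and the first stand-in dataId is made up.

```python
from types import SimpleNamespace


def find_ref_index(refs, **wanted):
    """Return the index of the first ref whose dataId matches all `wanted` values."""
    for i, ref in enumerate(refs):
        if all(ref.dataId[key] == value for key, value in wanted.items()):
            return i
    return None


# Stand-in refs for illustration only
example_refs = [
    SimpleNamespace(dataId={'instrument': 'HSC', 'visit': 903334, 'detector': 16}),
    SimpleNamespace(dataId={'instrument': 'HSC', 'visit': 903338, 'detector': 25}),
]
print(find_ref_index(example_refs, visit=903338, detector=25))

# With the real list (assumed to be defined as above):
# index = find_ref_index(datasetRefs, visit=903338, detector=25)
```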

In [None]:
# We could get the src table using the dataId as we did above for the calexp, 
# but this would require the butler to perform another query of the database. 
# Instead, we can just pass the ref itself directly to butler.get
src = butler.get(ref)
src = src.copy(True)
src.asAstropy()

Now to get the `calexp` associated with this exposure and detector, we pass the `dataId` to the butler with the `calexp` datasetType. Note that this performs another query to the registry database to find a calexp that matches our dataId requirements.

In [None]:
# To get the calexp, we pass the dataId
calexp = butler.get('calexp', dataId=ref.dataId)

We can now plot the calexp with the src catalog overlaid. We leave the investigation of this image as an exercise to the user :)

In [None]:
# And plot!
afwDisplay.setDefaultBackend('matplotlib') 
fig = plt.figure(figsize=(10,8))
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange') 

In the case above, both the src and calexp can be found by the registry, but this will not necessarily be the case. The `queryDimensions` method provides a more flexible way to query for multiple datasets (requiring an instance of all datasets to be available for that dataId) or ask for different dataId keys than what is used to identify the dataset (which invokes various built-in relationships). An example of this is provided below:

In [None]:
# Use queryDimensions to provide more flexible access
dataIds = list(registry.queryDimensions(["exposure", "detector"], datasets=["calexp","src"], collections="shared/ci_hsc_output"))
for dataId in dataIds:
    print(dataId)
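
The requirement that every listed dataset exists for a dataId behaves like a set intersection. The sketch below illustrates that idea with plain `(exposure, detector)` tuples. It is only a picture of the semantics, not how the registry implements the query, and both the helper name and the stand-in values are hypothetical.

```python
def common_data_ids(*groups):
    """Return the (exposure, detector) pairs that appear in every group."""
    sets = [set(group) for group in groups]
    return sorted(set.intersection(*sets)) if sets else []


# Stand-in values: pretend one detector has a calexp but no src
calexp_ids = [(903338, 18), (903338, 25)]
src_ids = [(903338, 25)]
print(common_data_ids(calexp_ids, src_ids))
```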

Now say we wanted to select all detectors with calexp and src datasets associated with a specific filter. We can add that constraint to our query, but first we need to figure out what the filters are called... Looking at the dataId object, we see the attributes `abstract_filter` and `physical_filter` look promising.

In [None]:
dataIds[0].full

In [None]:
print(f"physical_filter = {dataId['physical_filter']}")
print(f"abstract_filter = {dataId['abstract_filter']}")

It looks like `abstract_filter` is what we want, so we put it in the `where` argument of `queryDimensions`.

In [None]:
# Use queryDimensions to grab the dataIds for all i-band detectors
dataIds = list(registry.queryDimensions(["exposure", "detector"], datasets=["calexp","src"], where="abstract_filter='i'",collections="shared/ci_hsc_output"))
for dataId in dataIds:
    print(dataId['abstract_filter'], dataId)
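
Writing `where` strings by hand gets error-prone once there are several constraints. A minimal sketch of building the string from keyword arguments is shown here. `build_where` is a hypothetical helper. It assumes that string values are single-quoted as in the cell above and that terms can be joined with `AND`, which is an assumption about the expression syntax worth checking against the butler documentation for your stack version. It does no escaping of quotes inside values.

```python
def build_where(**constraints):
    """Build a `where` expression from key=value constraints.

    Strings are single-quoted; other values are inserted as-is.
    Joining with AND is an assumption about the query syntax.
    """
    terms = []
    for key, value in constraints.items():
        if isinstance(value, str):
            terms.append(f"{key}='{value}'")
        else:
            terms.append(f"{key}={value}")
    return " AND ".join(terms)


print(build_where(abstract_filter='i'))

# With a real registry (assumed to be defined as above):
# where = build_where(abstract_filter='i')
# dataIds = list(registry.queryDimensions(["exposure", "detector"],
#                                         datasets=["calexp", "src"],
#                                         where=where,
#                                         collections="shared/ci_hsc_output"))
```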

## Some Exploration

Below is a scratch space for playing with things...