# Exploring the Gen-3 Butler

<br>Owners: **Alex Drlica-Wagner** ([@kadrlica](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@kadrlica)), **Douglas Tucker** ([@douglasleetucker](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@douglasleetucker))
<br>Last Verified to Run: **2019-08-08**
<br>Verified Stack Release: **w_2020_03**

## Core Concepts

This notebook provides a first look at the structure and organization of a repo created with the Gen-3 Butler. The Gen-3 Butler is still under development, so this notebook is expected to be updated after the Gen-3 release.

1. Create a Gen-3 butler
2. Use the Gen-3 butler to explore the ci_hsc_gen3 data repo

## Learning Objectives:

After working through this notebook, you should be able to:

1. Explore a Gen-3 data repo

In [None]:
# Generic imports
import os
import matplotlib.pyplot as plt

In [None]:
# Stack imports
import lsst.daf.butler as dafButler
import lsst.afw.display as afwDisplay

For a data repo produced with the Gen-3 butler, I used the HSC continuous integration sample (`ci_hsc_gen3`). The processed repo lives in the directory set below:

In [None]:
# Directory where the repo lives
repo='/project/shared/data/ci_hsc_gen3'

You can poke around this directory a bit to see what outputs have been created.

In [None]:
# The base directory for the repo
!ls $repo

In [None]:
# The outputs are stored under `DATA/shared/ci_hsc_output`
outdir=f'{repo}/DATA/shared/ci_hsc_output'
!ls $outdir

To create a butler, you need to pass it a configuration file and a run name. The run name tells the butler where to place output files. More on Butler configuration can be found [here](https://pipelines.lsst.io/modules/lsst.daf.butler/configuring.html). By investigating the directory structure, we find that the 'collection' is `shared/ci_hsc_output`.

In [None]:
config = os.path.join(repo,'DATA','butler.yaml')
butler = dafButler.Butler(config=config,collection="shared/ci_hsc_output")

With the Gen-2 butler, there was no good way to investigate what data exist in a repo. To get around this, we all developed a habit of investigating the directory structure and file names to figure out what data existed.

In [None]:
!ls $outdir/calexp

In [None]:
!ls $outdir/calexp/r/HSC-R

In [None]:
!ls $outdir/calexp/r/HSC-R/903338
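
The repeated `ls` calls above can also be done in one pass from Python. The following is a minimal sketch using only the standard library; the helper name `list_tree` and its `max_depth` parameter are made up for this illustration and are not part of the LSST stack.

```python
import os


def list_tree(root, max_depth=2):
    """Return paths relative to ``root``, descending at most ``max_depth`` levels.

    Hypothetical helper for illustration; it only reads the file system.
    """
    root = os.path.abspath(root)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Depth of the directory being visited (the root itself is depth 0)
        depth = 0 if rel == os.curdir else rel.count(os.sep) + 1
        dirnames.sort()
        for name in dirnames + sorted(filenames):
            entries.append(os.path.normpath(os.path.join(rel, name)))
        # Stop descending once the next level would exceed max_depth
        if depth + 1 >= max_depth:
            dirnames[:] = []
    return entries
```

For example, `list_tree(outdir, max_depth=3)` would list everything down to the per-filter directories under `outdir`, assuming the layout seen in the cells above.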

Based on these filenames, we have enough to specify the dataId to pass to the butler...

In [None]:
dataId = {'visit':903338,'detector':25,'instrument':'HSC'}
calexp = butler.get('calexp', dataId=dataId)

In [None]:
afwDisplay.setDefaultBackend('matplotlib') 
fig = plt.figure(figsize=(10,8))
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')
# And if it wasn't sacrilege I would rotate this image...

## Gen-3 Butler

Ok, so how do we do this in Gen-3 land? Starting from scratch, we want to be able to get going *with only the path to the butler config*.

In [None]:
config = os.path.join(repo,'DATA','butler.yaml')

We would expect that we could just create a butler without specifying the collection (since we have no idea what collections exist at this point); however, this throws an exception (note: I think this is a bug, or at least an undesirable feature).

In [None]:
# Creating a butler without a collection raises a ValueError; print the message
try:
    butler = dafButler.Butler(config=config)
except ValueError as e:
    print(e)

However, we can get around this by specifying an empty string for the collection.

In [None]:
butler = dafButler.Butler(config=config,collection="")

With the butler created, we can now access the registry, which allows us to get a list of collections:

In [None]:
registry = butler.registry
registry.getAllCollections()

The `registry` seems like a good tool for investigating a repo (more on the registry schema can be found [here](https://dmtn-073.lsst.io/)). For example, we can use the registry to get a list of all dataset types:

In [None]:
registry.getAllDatasetTypes()

We suspect that this list contains every datasetType that the processing has *tried* to create. Some of these may be intermediate products that were created during processing but no longer exist.
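
One way to probe this suspicion is to count how many datasets the registry reports for each dataset type. The sketch below is hedged: it uses only the two registry calls that appear in this notebook (`getAllDatasetTypes` and `queryDatasets`), assumes each dataset type object exposes a `name` attribute, and the helper name `count_datasets` is made up for this illustration.

```python
def count_datasets(registry, collections):
    """Map each dataset type name to the number of datasets the registry reports.

    Hypothetical helper; assumes dataset type objects have a ``name`` attribute.
    """
    counts = {}
    for dataset_type in registry.getAllDatasetTypes():
        refs = registry.queryDatasets(datasetType=dataset_type.name,
                                      collections=collections)
        # Count the results without keeping them all in memory
        counts[dataset_type.name] = sum(1 for _ in refs)
    return counts
```

Calling `count_datasets(registry, ['shared/ci_hsc_output'])` should show which dataset types have no registered datasets in that collection. A nonzero count only means the registry knows about the datasets; it does not by itself prove the files are still on disk.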

We can now get every `DatasetRef` (which includes the `dataId`) for a specific `datasetType` in a specific collection with a query like the one that follows.

In [None]:
query = registry.queryDatasets(datasetType='src',collections=['shared/ci_hsc_output'])
for x in query:
    print(x)
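
Printing every reference gets noisy, so it can help to summarize the dataIds instead. Here is a minimal sketch that groups detectors by visit; the helper name `detectors_by_visit` is made up for this illustration, and it assumes each dataId supports key lookup for `'visit'` and `'detector'`, as the plain dictionaries used in this notebook do.

```python
from collections import defaultdict


def detectors_by_visit(data_ids):
    """Group detector numbers by visit for an iterable of dict-like dataIds."""
    grouped = defaultdict(set)
    for data_id in data_ids:
        grouped[data_id['visit']].add(data_id['detector'])
    # Sort visits and detectors so the summary is stable and easy to scan
    return {visit: sorted(dets) for visit, dets in sorted(grouped.items())}


# Example with the two dataIds used in this notebook
example = [
    {'instrument': 'HSC', 'visit': 903338, 'detector': 25},
    {'instrument': 'HSC', 'visit': 903334, 'detector': 22},
]
print(detectors_by_visit(example))
```

With a live registry, something like `detectors_by_visit(ref.dataId for ref in query)` should work, assuming each returned reference exposes its dataId as a `dataId` attribute.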

Ok, now that we know what collections exist (`shared/ci_hsc_output` in particular) we can set the collection for the butler and then we can query for one of the dataIds in our list above...

In [None]:
# Not sure this is the safest way...
butler.collection = 'shared/ci_hsc_output'
# Could instead create a new butler with the collection specified
#butler = dafButler.Butler(config=config,collection='shared/ci_hsc_output')

In [None]:
# The dataId that we found...
dataId = {'instrument': 'HSC', 'detector': 22, 'visit': 903334}

In [None]:
# Grab the calexp
calexp = butler.get('calexp', dataId=dataId)

In [None]:
# Grab the source table
src = butler.get('src',dataId=dataId)
# Deep copy so the catalog is contiguous in memory before converting it
src = src.copy(True)
src.asAstropy()

In [None]:
# And plot!
afwDisplay.setDefaultBackend('matplotlib') 
fig = plt.figure(figsize=(10,8))
afw_display = afwDisplay.Display(1)
afw_display.scale('asinh', 'zscale')
afw_display.mtv(calexp)
plt.gca().axis('off')

with afw_display.Buffering():
    for s in src:
        afw_display.dot('+', s.getX(), s.getY(), ctype=afwDisplay.RED)
        afw_display.dot('o', s.getX(), s.getY(), size=20, ctype='orange') 

## Some Exploration

Below is a scratch space for playing with things...