# Exploring a Data Repository

<br>Owner: **Rob Morgan** ([@rmorgan10](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@rmorgan10)), **Phil Marshall** ([@drphilmarshall](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@drphilmarshall))
<br>Last Verified to Run: **2018-12-07**
<br>Verified Stack Release: **17.0**

This notebook shows how to find out what's in a data repository, and how to find out which inputs went into each component of it.  

### Learning Objectives:
After working through and studying this notebook you should be able to understand how to use the Butler to figure out: 
   1. What a data repo is;
   2. Which data types are present in a data repository;
   3. If coadds have been made, what the available tracts are;
   4. Which parts of the sky those tracts cover.
   
### Logistics
This notebook is intended to be runnable on `lsst-lsp-stable.ncsa.illinois.edu` from a local git clone of https://github.com/LSSTScienceCollaborations/StackClub.


## Set Up

In [None]:
import glob
import os
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
from IPython.display import display, Markdown
%matplotlib inline

# Filter some warnings printed by v16.0 of the stack
# warnings.simplefilter("ignore", category=FutureWarning)
# warnings.simplefilter("ignore", category=UserWarning)

## What is a Data Repo?

Data repositories contain either a `_mapper` file or a `repositoryCfg.yaml` file, to record which "obs package" was used to organize the data. These files give a repository more structure and organization than an ordinary data directory. Let's take a look at this file structure in the HSC data repo.
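
As a rough illustration of the rule above, the following sketch checks a directory for either marker file. The helper name `find_repo_marker` is hypothetical and not part of the LSST stack; it only looks at file names and does not validate their contents.

```python
import os

# Marker files named in the text above; either one signals a data repository.
REPO_MARKERS = ("_mapper", "repositoryCfg.yaml")


def find_repo_marker(path):
    """Return the name of the first marker file found in `path`, or None.

    Hypothetical helper for illustration only; not an LSST stack function.
    """
    for marker in REPO_MARKERS:
        if os.path.isfile(os.path.join(path, marker)):
            return marker
    return None
```

Calling `find_repo_marker(path)` on an ordinary data directory returns `None`, which is a quick way to tell that a `Butler` is unlikely to be able to use that directory as a repo root.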

### The HSC Data Repo: What's in there?

In [None]:
repo = '/datasets/hsc/repo'

We'll use the `hsc` data repository as our testing ground, and start by figuring out what it contains. In the `hsc` case, the `_mapper` file is in the top level folder, while the data repo for each field is a few levels down.

In [None]:
! ls /datasets/hsc/repo/

We can see the `_mapper` file here, and it contains one line giving the name of the `Mapper` object for the HSC repo:

In [None]:
! cat /datasets/hsc/repo/_mapper

# Import the Mapper object once you know its name
from lsst.obs.hsc import HscMapper

You can get some more information on this object like this:

In [None]:
help(HscMapper)

The mapper defines a (large) number of different "dataset types". Some of these are specific to this particular data repo, others are more general. Even filtering out some intermediate dataset types, we are still left with a long list. But, once we figure out which dataset types we are interested in, we can start querying for information about those datasets.

In [None]:
mapper = HscMapper(root=repo)
all_dataset_types = mapper.getDatasetTypes()

remove = ['_config', '_filename', '_md', '_sub', '_len', '_schema', '_metadata']

shortlist = []
for dataset_type in all_dataset_types:
    keep = True
    for word in remove:
        if word in dataset_type:
            keep = False
    if keep:
        shortlist.append(dataset_type)

print(shortlist)

The `Butler`, directed by the `Mapper`, will have access to all the above dataset types.
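
The filtering loop above can also be written as a list comprehension. This is only a sketch: `filter_dataset_types` is a hypothetical helper, and the example names are illustrative stand-ins for the much longer list that `mapper.getDatasetTypes()` returns.

```python
def filter_dataset_types(dataset_types, remove):
    """Keep only names that contain none of the substrings in `remove`."""
    return [name for name in dataset_types
            if not any(word in name for word in remove)]


# Illustrative names only; the real list comes from mapper.getDatasetTypes().
example_types = ["calexp", "calexp_md", "src", "src_schema", "deepCoadd_skyMap"]
remove = ["_config", "_filename", "_md", "_sub", "_len", "_schema", "_metadata"]

print(filter_dataset_types(example_types, remove))
```

Here `calexp_md` and `src_schema` are dropped because they contain `_md` and `_schema` respectively, while the other three names pass through unchanged.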

### Great, but where's the data?

The next level to dig into here is the `rerun` directory. This is where we will find some of the data we can take a look at. In the next few cells, we'll see how to get from the main level of the directory to where the data and the `repositoryCfg.yaml` file are.

In [None]:
! ls /datasets/hsc/repo/rerun

In [None]:
! ls /datasets/hsc/repo/rerun/DM-13666

In [None]:
! ls /datasets/hsc/repo/rerun/DM-13666/WIDE

Congrats! Now we're getting to some data! The numerical directories are the tract indices and digging deeper into those directories is a starting point for familiarizing oneself with the repo. We'll skip that here and move on to looking at the data with the `Butler`.

The `Butler` will look at the repository as a whole. To see why this is, take a look at the `repositoryCfg.yaml` file:

In [None]:
! cat /datasets/hsc/repo/rerun/DM-13666/WIDE/repositoryCfg.yaml

The `_root` parameter is directing the `Butler` to look at the top level of the repo `/datasets/hsc/repo/`.

### Instantiating the Butler and looking for Dataset Types

Now that we have an idea of the structure of the repo itself, let's use the Butler to explore the data within the repo. Here we will demonstrate a few useful `Butler` methods for learning about the data in a repo.

In [None]:
from lsst.daf.persistence import Butler

butler = Butler(repo)

# We defined 'repo' earlier, but here it is:
print(repo)

The `butler` purports to be able to check whether a dataset actually exists or not, but it needs a specific dataset ID to check whether that specific dataset exists. Here's what you get when you pass in a null dataset ID:

In [None]:
butler.datasetExists('calexp', dataId={})

Instead, one can try querying the metadata and checking for an error. Note that this way of checking for dataset existence is a little faster too.

In [None]:
datasettype = 'calexp'

try:
    datasetkeys = butler.getKeys(datasettype)
    onekey = list(datasetkeys.keys())[0]
    metadata = butler.queryMetadata(datasettype, [onekey])
    print("{} dataset exists.".format(datasettype))
except Exception:
    print("{} dataset doesn't exist.".format(datasettype))
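
The try/except pattern above can be wrapped in a small function so that several dataset types can be checked in turn. This is a minimal sketch: `dataset_type_exists` is a hypothetical helper, and it assumes only that the `butler` argument provides `getKeys()` and `queryMetadata()` as they are used in this notebook.

```python
def dataset_type_exists(butler, datasettype):
    """Return True if a metadata query for `datasettype` succeeds.

    Hypothetical wrapper around the try/except pattern shown above.
    """
    try:
        datasetkeys = butler.getKeys(datasettype)
        onekey = list(datasetkeys.keys())[0]
        butler.queryMetadata(datasettype, [onekey])
    except Exception:
        # Which exception the Butler raises for a missing dataset type is
        # not pinned down here, so any failure is treated as "doesn't exist".
        return False
    return True
```

With the `butler` defined earlier, `dataset_type_exists(butler, 'calexp')` should agree with the message printed by the cell above.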

## Obtaining Basic Dataset Properties Using the Butler
Now we can start using Butler methods to query the data. For this dataset, we can look at the filters used, number of visits, number of pointings, etc. by examining the Butler's keys and metadata. For these basic properties, we will look at the `calexp` and `src` tables. The contents of these tables are derived from the processing of individual sensors.

In [None]:
# This would be faster if only one query were issued
visits = butler.queryMetadata('calexp', ['visit'])
pointings = butler.queryMetadata('calexp', ['pointing'])
ccds = butler.queryMetadata('calexp', ['ccd'])
fields = butler.queryMetadata('calexp', ['field'])
filters = butler.queryMetadata('calexp', ['filter'])

# Collect number of objects from Source Catalog
sources = butler.queryMetadata('src', ['id'])

In [None]:
num_visits = len(visits)
num_pointings = len(pointings)
num_ccds = len(ccds)
num_fields = len(fields)
num_filters = len(filters)

num_sources = len(sources)
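
The comment in the query cell notes that a single query would be faster. One possible sketch of that idea follows, assuming that `queryMetadata` returns one tuple per matching record when it is given several keys (an assumption worth checking against your stack version). The helper `unique_values_per_key` is hypothetical and the sample rows are made up.

```python
def unique_values_per_key(rows, keys):
    """Map each key to the sorted unique values in its column of `rows`."""
    columns = {key: set() for key in keys}
    for row in rows:
        for key, value in zip(keys, row):
            columns[key].add(value)
    return {key: sorted(values) for key, values in columns.items()}


keys = ["visit", "ccd", "filter"]
# On the LSP this might be: rows = butler.queryMetadata('calexp', keys)
rows = [(100, 0, "HSC-I"), (100, 1, "HSC-I"), (102, 0, "HSC-R")]  # made-up values

unique = unique_values_per_key(rows, keys)
counts = {key: len(values) for key, values in unique.items()}
print(counts)
```

The per-key counts obtained this way should match the `len()` of the separate single-key queries, since both reduce to the set of distinct values for each key.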

One may also be interested in the total sky area imaged for a particular coadd rerun/depth. We can estimate and visualize this from the coadd tract info. To collect all the tracts, we have to get them via the file structure. This operation will hopefully be `Butler`-ized with the Gen3 Butler.

In [None]:
# Set the location of the tracts
rerun = 'DM-13666'
depth = 'WIDE'
# Try a different one:
# rerun = 'DM-10404'
# depth = 'DEEP'

tract_location = repo + '/rerun/' + rerun + '/' + depth

# Collect tracts from files
tracts = sorted([int(os.path.basename(x)) for x in
                 glob.glob(os.path.join(tract_location, 'deepCoadd-results', 'merged', '*'))])
num_tracts = len(tracts)

print("Found {} tracts in the directory {}".format(num_tracts, tract_location))
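
The cell above converts every entry under `deepCoadd-results/merged` to an integer, so a single non-numeric name would raise a `ValueError`. A more defensive variant might look like the sketch below. The helper `list_tract_ids` is hypothetical, and unlike the `glob` call it only accepts subdirectories with purely numeric names.

```python
import os


def list_tract_ids(merged_dir):
    """Return sorted integer tract IDs from the subdirectory names of `merged_dir`.

    Entries that are not directories, or whose names are not purely
    decimal digits, are skipped. A missing directory gives an empty list.
    """
    if not os.path.isdir(merged_dir):
        return []
    return sorted(int(name) for name in os.listdir(merged_dir)
                  if name.isdecimal()
                  and os.path.isdir(os.path.join(merged_dir, name)))
```

With the variables defined above, the call would be `list_tract_ids(os.path.join(tract_location, 'deepCoadd-results', 'merged'))`.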

A quick way of estimating the sky area covered is to sum the areas of the inner boxes of all the tracts. For more information on the properties of tracts, you can look at the [Documentation](http://doxygen.lsst.codes/stack/doxygen/x_masterDoxyDoc/classlsst_1_1skymap_1_1tract_info_1_1_tract_info.html).

As a quick note, the file structure only tells us the names of the tracts in the particular rerun/depth to look at. The actual `TractInfo` objects are obtained by selecting the tracts we want from the `deepCoadd_skyMap` dataset. Therefore, we will have to ask the `Butler` to bring us this dataset for the particular rerun/depth. Since we already made a Butler for the repo as a whole, let's make a second one, `under_butler`, to bring us the skymap corresponding to the `tract_location` we just specified.

In [None]:
under_butler = Butler(tract_location)

In [None]:
# Calculate area from all tracts
skyMap = under_butler.get('deepCoadd_skyMap')
total_area = 0.0  #deg^2
plotting_vertices = []
for test_tract in tracts:
    # Get inner vertices for tract
    tractInfo = skyMap[test_tract]
    vertices = tractInfo.getVertexList()
    plotting_vertices.append(vertices)
    
    #calculate area of box
    av_dec = 0.5 * (vertices[2][1] + vertices[0][1])
    av_dec = av_dec.asRadians()
    delta_ra_raw = vertices[0][0] - vertices[1][0] 
    delta_ra = delta_ra_raw.asDegrees() * np.cos(av_dec)
    delta_dec = vertices[2][1] - vertices[0][1]
    area = delta_ra * delta_dec.asDegrees()
    
    #combine areas
    total_area += area
    
print("Total area imaged (sq deg): ",total_area)

#round total area for table purposes
rounded_total_area = round(total_area, 2)
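
The loop above uses a flat-sky approximation for each box, multiplying the RA extent by the cosine of the mean Dec and by the Dec extent. For comparison, the sketch below also computes the exact area of a box aligned with RA/Dec on the sphere, which is the RA extent in radians times the difference of the sines of the two Dec limits. It works on plain floats in degrees rather than on stack angle objects; the function names and the example box are hypothetical, and treating a tract's inner region as aligned with RA/Dec is the same simplification the loop makes.

```python
import math


def box_area_flat(ra_min, ra_max, dec_min, dec_max):
    """Approximate area in deg^2: delta_RA * cos(mean Dec) * delta_Dec."""
    mean_dec = math.radians(0.5 * (dec_min + dec_max))
    return (ra_max - ra_min) * math.cos(mean_dec) * (dec_max - dec_min)


def box_area_spherical(ra_min, ra_max, dec_min, dec_max):
    """Exact area in deg^2 of an RA/Dec-aligned box on the sphere."""
    delta_ra = math.radians(ra_max - ra_min)
    delta_sin = math.sin(math.radians(dec_max)) - math.sin(math.radians(dec_min))
    # Convert steradians to square degrees with (180 / pi)^2.
    return delta_ra * delta_sin * math.degrees(1.0) ** 2


# Made-up box, 1.5 deg on a side and centred on the equator.
print(box_area_flat(0.0, 1.5, -0.75, 0.75))
print(box_area_spherical(0.0, 1.5, -0.75, 0.75))
```

For boxes of roughly a degree or two on a side the two estimates differ only slightly, which suggests that the flat approximation is adequate for a summary table like the one built below.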


Let's also pull out, for this rerun, the same basic information that we did for the whole repo, and compare:

In [None]:
rerun_visits = under_butler.queryMetadata('calexp', ['visit'])
rerun_pointings = under_butler.queryMetadata('calexp', ['pointing'])
rerun_ccds = under_butler.queryMetadata('calexp', ['ccd'])
rerun_fields = under_butler.queryMetadata('calexp', ['field'])
rerun_filters = under_butler.queryMetadata('calexp', ['filter'])

rerun_sources = under_butler.queryMetadata('src', ['id'])

rerun_num_visits = len(rerun_visits)
rerun_num_pointings = len(rerun_pointings)
rerun_num_ccds = len(rerun_ccds)
rerun_num_fields = len(rerun_fields)
rerun_num_filters = len(rerun_filters)

rerun_num_sources = len(rerun_sources)

In [None]:
print("The HSC repo contains {} visits, the {}/{} rerun contains {} visits.".format(num_visits,rerun,depth,rerun_num_visits))
print("The HSC repo contains {} pointings, the {}/{} rerun contains {} pointings.".format(num_pointings,rerun,depth,rerun_num_pointings))
print("The HSC repo contains {} ccds, the {}/{} rerun contains {} ccds.".format(num_ccds,rerun,depth,rerun_num_ccds))
print("The HSC repo contains {} fields, the {}/{} rerun contains {} fields.".format(num_fields,rerun,depth,rerun_num_fields))
print("The HSC repo contains {} filters, the {}/{} rerun contains {} filters.".format(num_filters,rerun,depth,rerun_num_filters))
print("The HSC repo contains {} sources, the {}/{} rerun contains {} sources.".format(num_sources,rerun,depth,rerun_num_sources))

This makes sense, given what we saw in the `repositoryCfg.yaml` file in the DM-13666/WIDE subfolder. What is less clear is why it is like this - one would think that each rerun would lead to a slightly different number of sources, at least - and when the main `butler` queries for sources, where is it getting the information from? Did we just get lucky somehow, choosing DM-13666/WIDE? 

When I choose a different rerun and depth, I get the exact same numbers, except for the sky area: DM-10404/DEEP only covers 103.8 sq deg. So this number of sources is confusing: where is it coming from? Why does the metadata queried by a butler attached to DM-10404/DEEP give the same number of sources as that from DM-13666/WIDE?

## Displaying Dataset Characteristics

Now let's print out a report of all the characteristics we have found. We'll use the sky area from the DM-13666/WIDE rerun and the numbers common to all reruns.

In [None]:
dataset_name = 'HSC'
display(Markdown('### %s' % repo))


# Make a table of the collected metadata
collected_data = [num_visits, num_pointings, num_ccds, num_fields, num_filters, num_sources, 
                  num_tracts, rounded_total_area]
data_names = ("Number of Visits", "Number of Pointings", "Number of CCDs", "Number of Fields", 
              "Number of Filters", "Number of Sources", "Number of Tracts", "Total Sky Area (deg$^2$)")

output_table = "|   Metadata Characteristic  | Value | \n  | ---: | ---: | \n "
for name, value in zip(data_names, collected_data):
    output_table += "| %s |  %s | \n" % (name, value)
display(Markdown(output_table))

# Show which fields and filters we're talking about:
display(Markdown('Fields: (%i total)' %num_fields))
print(fields)
display(Markdown('Filters: (%i total)' %num_filters))
print(filters)


## Plotting the sky coverage

For this we will need our list of `tracts` from above, and also the `skyMap` object. We can then extract the sky coordinates of the corners of each tract, and use them to draw a set of rectangles to illustrate the sky coverage, following Jim Chiang's LSST DESC tutorial [dm_butler_skymap.ipynb](https://github.com/LSSTDESC/DC2-analysis/blob/master/tutorials/dm_butler_skymap.ipynb).

In the future, we could imagine overlaying the focal plane and color the individual visits, using more of the code from Jim's notebook. Let's see what functionality the Gen3 Butler provides first, and then return to visualization.

In [None]:
plt.figure()

for tract in tracts:
    tractInfo = skyMap[tract]
        
    corners = [(x[0].asDegrees(), x[1].asDegrees()) for x in tractInfo.getVertexList()]
    x = [k[0] for k in corners] + [corners[0][0]]
    y = [k[1] for k in corners] + [corners[0][1]]
    
       
    plt.plot(x,y, color='b')
    
plt.xlabel('RA (deg)')
plt.ylabel('Dec (deg)')
plt.title('2D Projection of Sky Coverage')

plt.show()

We could imagine plotting the patches as well, to show which tracts were incomplete - but this gives us a rough idea of where our data is on the sky.

# Summary

We have shown a few techniques for exploring a data repo. To make this process straightforward, we have implemented all these techniques as methods of a `Taster` class, which is now part of the `stackclub` library. The `Taster` will give you a taste of what the `Butler` delivers. We demonstrate the use of this class in the [DataInventory.ipynb](https://github.com/LSSTScienceCollaborations/StackClub/blob/project/data_inventory/drphilmarshall/Basics/DataInventory.ipynb) notebook.
