# Session 02: Intro to the Science Pipelines Software and Data products

<br>Owner(s): **Yusra AlSayyad** ([@yalsayyad](https://github.com/LSSTScienceCollaborations/StackClub/issues/new?body=@yalsayyad))
<br>Last Verified to Run: **2020-05-14?**
<br>Verified Stack Release: **19.0?**

Now that you have brought yourself to the data via the Science Platform (Lesson 1), you can retrieve that data and rerun elements of the Science Pipelines. 

Today we'll cover:
* What are the **Science Pipelines**?
* What is this **Stack**?
* What are the **data products**?
* How to rerun an element of the pipelines, a **Task**, with a different configuration. 

We'll only quickly inspect the images and catalogs. Next week's lesson will be dedicated to learning more sophisticated methods for exploring the data.

## 1. Overview of the Science Pipelines  and the stack (presentation)

The stack is an implementation of the science pipelines and its corresponding library. 

## 2.  The Data Products

Data products include both catalogs and images.

Instead of operating directly on files and directories, we interact with on-disk data products via an abstraction layer called the data Butler. The butler operates on data repositories. DM regularly tests the science pipelines on precursor data from HSC, DECam, and simulated data generated by DESC that we call "DC2 ImSim". These Data Release Production (DRP) outputs can be found in `/datasets`. 

This notebook uses an HSC Gen 2 repo.

*Jargon Watch: Gen2/Gen3 - We're in the process of building a brand new Butler, which we are calling the 3rd Generation Butler,  or Gen3 for short.*
 
| generation | Class |
| ------------- | ------------- |
|  Gen 2  | `lsst.daf.persistence.Butler`  |
|  Gen 3   | `lsst.daf.butler.Butler` |

A Gen 2 Butler is the first stack object we are going to instantiate, with a path to a directory that is a repo.

In [None]:
# What version of the Stack am I using?
! echo $HOSTNAME\n
! eups list lsst_distrib -s

In [None]:
import os
REPO = '/datasets/hsc/repo/rerun/RC/w_2020_19/DM-24822'  
from lsst.daf.persistence import Butler
butler = Butler(REPO)

In [None]:
HSC_REGISTRY_COLUMNS = ['taiObs', 'expId', 'pointing', 'dataType', 'visit', 'dateObs', 'frameId', 'filter', 'field', 'pa', 'expTime', 'ccdTemp', 'ccd', 'proposal', 'config', 'autoguider']
butler.queryMetadata('calexp', HSC_REGISTRY_COLUMNS, dataId={'filter': 'HSC-I', 'visit': 30504, 'ccd': 50})

**Common error messages** when instantiating a Butler:

1) 
```PermissionError: [Errno 13] Permission denied: '/datasets/hsc/repo/rerun/RC/w_2020_19/DM-248222'```
- Translation: This directory does not exist. Confirm with  `os.path.exists(REPO)`

2) `RuntimeError: No default mapper could be established from inputs`:

- Translation: This directory exists, but is not a data repo. Does `REPO` have a file called `repositoryCfg.yaml` in it? Nope? It's not a data repo. Use `os.listdir` to see what's in your directory
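
The two checks above can be bundled into a small helper. This is a minimal sketch using only the standard library; `diagnose_repo` is a hypothetical name, not part of the Stack, and it is exercised here on throwaway temporary directories rather than on `/datasets`.

```python
import os
import tempfile


def diagnose_repo(path):
    """Return a short diagnosis for a candidate Gen 2 repo path (hypothetical helper)."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not a directory"
    if "repositoryCfg.yaml" not in os.listdir(path):
        return "no repositoryCfg.yaml"
    return "looks like a repo"


# Build three cases in a temporary directory: a typo'd path,
# an empty directory, and a directory holding repositoryCfg.yaml.
with tempfile.TemporaryDirectory() as tmp:
    empty_dir = os.path.join(tmp, "empty")
    os.mkdir(empty_dir)
    repo_dir = os.path.join(tmp, "repo")
    os.mkdir(repo_dir)
    open(os.path.join(repo_dir, "repositoryCfg.yaml"), "w").close()

    results = {
        "typo": diagnose_repo(os.path.join(tmp, "does-not-exist")),
        "empty": diagnose_repo(empty_dir),
        "repo": diagnose_repo(repo_dir),
    }

print(results)
```

Calling `diagnose_repo(REPO)` before `Butler(REPO)` would tell you which of the two translations applies.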


*Next we'll look at 3 types of data products:*
* Images
* Catalogs: lsst.afw.table
* Catalogs: parquet/pyArrow DataFrames

## 2.1 Images

In [None]:
VISIT = 34464
CCD = 81
exposure = butler.get('calexp', visit=int(VISIT), ccd=CCD)

**Common error messages** when getting data:

1) `'HscMapper' object has no attribute 'map_calExp'`
- You're asking for a data product that doesn't exist. In this example, I asked for a 'calExp' with a capital E, which is not a thing.  Double check your spelling in: https://github.com/lsst/obs_base/blob/master/policy/exposures.yaml for images or  https://github.com/lsst/obs_base/blob/master/policy/datasets.yaml for catalogs or models.

2) `NoResults: No locations for get: datasetType:calexp dataId:DataId(initialdata={'visit': 34464, 'ccd': 105}, tag=set())`:

- This file doesn't exist. If you don't believe the Butler, add "_filename" to the data product you want, and you'll get back the filename you can lookup. For example:

        butler.get('calexp_filename', visit=VISIT, ccd=105)
        

In [None]:
butler.get('calexp_filename', visit=VISIT, ccd=105)

Rare error message: Did you try that and now it says it can't find the filename? `NoResults: No locations for get: datasetType:calexp_filename dataId:DataId(initialdata={'visit': 34464, 'ccd': 81}, tag=set())` Sqlalchemy doesn't handle data types well. Force your visit or ccd numbers to be integers like `butler.get('calexp', visit=int(34464), ...`
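
One way non-integer ids can sneak in (an assumption on my part, not something stated above) is when visit or ccd numbers are pulled out of a NumPy array or a DataFrame column, in which case they are NumPy integers rather than built-in `int`. The sketch below shows a hypothetical helper, `plain_data_id`, that casts such values before they are handed to the butler.

```python
import numpy as np


def plain_data_id(**data_id):
    """Cast integer-like values (e.g. numpy.int64) to built-in int; leave others alone."""
    return {
        key: int(value) if isinstance(value, (int, np.integer)) else value
        for key, value in data_id.items()
    }


visits = np.array([34464, 30502])
raw = {"visit": visits[0], "ccd": np.int64(81), "filter": "HSC-I"}
clean = plain_data_id(**raw)

# The NumPy scalar is not a built-in int, but the cleaned value is.
print(type(raw["visit"]).__name__, "->", type(clean["visit"]).__name__)
```

With the butler in hand, this would be used as `butler.get('calexp', **plain_data_id(visit=visits[0], ccd=81))`.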

Q: If I can get the filename from the butler, why can't I just read it in manually like I do other fits files and fits tables?

A: Because in operations, the data will not be on a shared GPFS disk like you're reading from now. We guarantee `butler.get` to work the same regardless of the backend storage. 

### Exposure Objects

The data that the butler just fetched for us is an `Exposure` object. It is composed of a `maskedImage`, which has three components: an `image`, a `mask`, and a `variance`. These are pointers/views, not copies!
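
To see why the "views" remark matters, here is a minimal sketch with plain NumPy arrays standing in for `exposure.image.array`. It does not use any Stack objects; it only illustrates the general aliasing behaviour of views versus copies.

```python
import numpy as np

# Stand-in for exposure.image.array; NOT a real lsst.afw.image object.
pixels = np.zeros((4, 6), dtype=np.float32)

view = pixels[1:3, 2:5]   # slicing returns a view onto the same memory
view += 10.0              # an in-place edit through the view ...

copy = pixels.copy()      # an independent copy
copy[:] = -1.0            # ... whereas edits to a copy stay in the copy

print(pixels)             # the 2x3 block now holds 10s, the rest is still 0
shares = np.shares_memory(view, pixels)
```

The practical consequence is that in-place operations on a view modify the parent, so re-fetch the data from the butler if you need the original back.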

In [None]:
exposure

In [None]:
exposure.maskedImage.image   
exposure.maskedImage.mask
exposure.maskedImage.variance

# These shortcuts work too.
exposure.image
exposure.variance
exposure.mask

# each image also has an array property e.g.
exposure.image.array

The exposures also include a WCS Object, a PSF Object and ExposureInfo. These can be accessed via the following methods.

In [None]:
wcs = exposure.getWcs()
psf = exposure.getPsf()
photoCalib = exposure.getPhotoCalib()
expInfo = exposure.getInfo()

In [None]:
visitInfo = expInfo.getVisitInfo()

**Exercise:** Use tab-complete or `?exposure` to explore the Exposure. Explore the details in this visit info. What was the exposure time? What was the observation date? Exploring the other methods of the Exposure object, what are the dimensions of the image?

In [None]:
# visitInfo.
# exposure.

For more documentation on Exposure objects:
* https://pipelines.lsst.io/modules/lsst.afw.image/indexing-conventions.html

For another notebook on Exposure objects:
* https://github.com/LSSTScienceCollaborations/StackClub/blob/master/Basics/Calexp_guided_tour.ipynb

Session 3 will introduce more sophisticated image display tools, and go into detail on the `Display` objects in the stack, but let's take a quick look at this image:

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import lsst.afw.display as afw_display
%matplotlib inline

matplotlib.rcParams["figure.figsize"] = (6, 4)
matplotlib.rcParams["font.size"] = 12
matplotlib.rcParams["figure.dpi"] = 120

In [None]:
# Let's smooth it first just for fun. 

from skimage.filters import gaussian
exposure.image.array[:] = gaussian(exposure.image.array, sigma=5)

# and display it
display = afw_display.Display(frame=1, backend='matplotlib')
display.scale("linear", "zscale")
display.mtv(exposure)

From the colorbar, you can tell that the background has been subtracted.

The first step of the pipeline, `processCcd`, takes a `postISRCCD` as input. Before detection and measurement, it estimates a naive background and subtracts it. We store the background-subtracted image as the `calexp`, and the model that was subtracted from the original `postISRCCD` as the `calexpBackground`.

*Jargon Watch: ISR - Instrument Signature Removal (ISR) encapsulates all the steps you normally associate with reducing astronomical imaging data (bias subtraction, flat fielding, crosstalk correction, cosmic ray removal, etc.)*


There's also a full focal plane background estimation step that produces a delta on the `calexpBackground`, which is called `skyCorr`.

Let's quickly plot these:

In [None]:
# Fetch background models from the butler
background = butler.get('calexpBackground', visit=VISIT, ccd=CCD)
skyCorr = butler.get('skyCorr', visit=VISIT, ccd=CCD)

# call "getImage" to evaluate the model on a pixel grid
plt.subplot(121)
plt.imshow(background.getImage().array, origin='lower', cmap='gray')
plt.title("Local Polynomial Bkgd")
plt.subplot(122)
plt.imshow(background.getImage().array - skyCorr.getImage().array, origin='lower', cmap='gray')
plt.title("SkyCorr Bkgd")

In [None]:
exposure = butler.get('calexp', visit=VISIT, ccd=CCD)
background = butler.get('calexpBackground', visit=VISIT, ccd=CCD)
# create a view to the masked image
mi = exposure.maskedImage
# add the background image to that view.
mi += background.getImage()

display1 = afw_display.Display(frame=1, backend='matplotlib')
display1.scale("linear", "zscale")
exposure.image.array[:] = gaussian(exposure.image.array, sigma=5)
display1.mtv(exposure)

It is good to get in the habit of doing mathematical operations on maskedImages instead of images, because that scales the variance plane appropriately. For example, when you multiply a `MaskedImage` by 2, it multiplies the `Image` by 2 and the `Variance` by 4.
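
The factor of 4 is ordinary error propagation: scaling a value by `a` scales its variance by `a**2`. The sketch below uses a hypothetical `ToyMaskedImage` class with NumPy arrays to show the bookkeeping. It is not the `lsst.afw.image.MaskedImage` implementation, only an illustration of the rule.

```python
import numpy as np


class ToyMaskedImage:
    """Hypothetical stand-in holding an image plane and a variance plane."""

    def __init__(self, image, variance):
        self.image = np.asarray(image, dtype=float)
        self.variance = np.asarray(variance, dtype=float)

    def __imul__(self, factor):
        # Var(a * X) = a**2 * Var(X), so the two planes scale differently.
        self.image *= factor
        self.variance *= factor ** 2
        return self


toy = ToyMaskedImage(image=[[1.0, 2.0], [3.0, 4.0]],
                     variance=[[0.5, 0.5], [0.5, 0.5]])
toy *= 2

print(toy.image)
print(toy.variance)
```

Multiplying only the image array by hand would leave the variance plane stale, which is the mistake the habit above guards against.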

**Exercise 2.1)** Coadds have dataId's defined by their SkyMap. Fetch the `deepCoadd` with `tract=9813`, `patch='3,3'` and `filter='HSC-I'` from the same repo. 

Bonus: a `deepCoadd_calexp` has had an additional aggressive background model applied called a `deepCoadd_calexp_background`. Confirm that the `deepCoadd_calexp` + `deepCoadd_calexp_background` = `deepCoadd`

In [None]:
deepCoadd = butler.get('deepCoadd', tract=9813, patch='3,3', filter='HSC-I')
deepCoadd_calexp_background = butler.get('deepCoadd_calexp_background', tract=9813, patch='3,3', filter='HSC-I')
deepCoadd_calexp = butler.get('deepCoadd_calexp', tract=9813, patch='3,3', filter='HSC-I')

In [None]:
mi = deepCoadd_calexp.maskedImage
mi += deepCoadd_calexp_background.getImage()

deepCoadd.image.array - deepCoadd_calexp.image.array


## 2.2 Catalogs (lsst.afw.table format)


afwTables are for passing to tasks. The pipeline needed a C++-readable table format, so we wrote one. If you want to pass a catalog to a Task, it'll probably take one of these. They:
* are row stores, and
* have column names oriented toward software.

The source table immediately output by processCcd is called `src`.

In [None]:
src = butler.get('src', visit=VISIT, ccd=CCD)
src

The returned object, `src`, is a `lsst.afw.table.SourceCatalog` object. 

In [None]:
src.getSchema()

Inspecting the schema reveals that instFluxes are in uncalibrated units of counts. coord_ra/coord_dec are in units of radians. 

`lsst.afw.table.SourceCatalog`s have their own API. However if you are *just* going to use it for analysis, you can convert it to an AstroPy table or a pandas DataFrame:

In [None]:
src.asAstropy()

In [None]:
df = src.asAstropy().to_pandas()
df.tail()

## 2.3 Catalogs (Parquet/PyArrow DataFrame format)

* Output data product ready for analysis
* Column store
* Full visit and full tract options
* Column names

The parquet outputs have been transformed to database-specified units. Fluxes are in nanojanskys, coordinates are in degrees. These will match what you get via the Portal and the Catalog Access tool Simon showed last week.
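
For reference, the two unit conventions mentioned here can be related with a few lines of standard-library Python. This sketch assumes only the AB magnitude definition (zero point at 3631 Jy); the function names are hypothetical and not part of the Stack.

```python
import math

# AB zero point: a source of 3631 Jy (3631e9 nJy) has magnitude 0.
AB_ZEROPOINT_NJY = 2.5 * math.log10(3631e9)   # about 31.4


def nanojansky_to_ab_mag(flux_njy):
    """AB magnitude for a positive flux given in nanojanskys."""
    return -2.5 * math.log10(flux_njy) + AB_ZEROPOINT_NJY


def radians_to_degrees(angle_rad):
    """Convert an afwTable-style angle (radians) to parquet-style degrees."""
    return math.degrees(angle_rad)


mag_of_one_njy = nanojansky_to_ab_mag(1.0)
mag_of_zero_point = nanojansky_to_ab_mag(3631e9)
half_turn = radians_to_degrees(math.pi)

print(round(mag_of_one_njy, 2), round(half_turn, 1))
```

Negative or zero fluxes (common for faint or sky sources) have no magnitude, which is one reason the tables store fluxes rather than magnitudes.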

**NOTE:** The `sourceTable` is a relatively new addition to the Stack, and probably requires a stack version more recent than v19.


In [None]:
parq = butler.get('sourceTable', visit=VISIT, ccd=CCD)

The `ParquetTable` is just a light  wrapper around a `pyarrow.parquet.ParquetFile`.

You can get a parquet table from the butler, but it doesn't fetch any columns until you ask for them. It's a column store, which means that it can read a single column without loading the rest. This is great for analysis when you want to plot two million-element arrays. In a row store you'd have to read the whole ten-million-row table just for those two columns you wanted.

But don't even try to loop through rows! If you want a whole row, use the `afwTable`. 
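
The difference between the two layouts can be sketched with plain Python containers. This toy example is in-memory only and uses made-up values; the "values touched" counts are a rough proxy for how much data each layout has to read, not a measurement of parquet or afwTable I/O.

```python
# Row store: one record per source (made-up values).
rows = [
    {"id": i, "ra": 150.0 + 0.001 * i, "decl": 2.0,
     "PsFlux": 100.0 * i, "PsFluxErr": 5.0}
    for i in range(1000)
]

# Column store: one list per column, built from the same data.
columns = {name: [row[name] for row in rows] for name in rows[0]}

wanted = ["ra", "PsFlux"]

# Row store: every record (all 5 fields) is visited to pull out 2 columns.
values_touched_rows = sum(len(row) for row in rows)
from_rows = {name: [row[name] for row in rows] for name in wanted}

# Column store: only the requested columns are read.
values_touched_columns = sum(len(columns[name]) for name in wanted)
from_columns = {name: columns[name] for name in wanted}

print(values_touched_rows, values_touched_columns)
```

The reverse also holds: assembling one complete row from the column store means reaching into every column, which is why looping over rows is the wrong access pattern there.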

Last I checked the processing step that consolidates the Source Tables from a per-ccd Source Table to a 

`parq = butler.get('sourceTable', visit=VISIT)`

In [None]:
parq

In [None]:
# inspect the columns with: 
parq.columns

Note that the column names are different. Now fetch just the columns you want. For example:

In [None]:
df = parq.toDataFrame(columns=['ra', 'decl', 'PsFlux', 'PsFluxErr', 'sky_source',
                               'PixelFlags_bad', 'PixelFlags_sat', 'PixelFlags_saturated'])

**Exercise:** Using this DataFrame `df`, make a histogram of `PsFlux` for sky sources using this parquet source table. If `sky_source == True` then the source was not a normal detection, but rather a randomly placed centroid to measure properties of blank sky. The distribution should be centered at 0.


**Exercise:** A parquet `objectTable_tract` contains deep coadd measurements for a 1.5 sq. deg. tract.
**Make an r-i vs. g-r plot** of stars with an r-band SNR > 100. Use `refExtendedness == 0` to select for stars. It means that the galaxy model flux was similar to the PSF flux. By the looks of your plot, what do you think about using `refExtendedness` for star-galaxy separation?


In [None]:
# butler = Butler('/datasets/hsc/repo/rerun/DM-23243/OBJECT/DEEP')
# parq = butler.get('objectTable_tract', tract=9813)
# parq.columns

# Tasks


**TL;DR  If you remember one thing about tasks it's go to http://pipelines.lsst.io, then click on lsst.pipe.base**

On the landing page for the lsst.pipe.base documentation, https://pipelines.lsst.io/modules/lsst.pipe.base/index.html, you'll see a number of tutorials on how to use Tasks and how to create one.

CmdlineTask extends Task with commandline driver utils for use with Gen2 Butlers, and will be deprecated soon. However, not all the links under "CommandlineTask" will become obsolete. For example, Retargeting subtasks of command-line tasks will live on.

Read: https://pipelines.lsst.io/modules/lsst.pipe.base/task-framework-overview.html

What is a Task?
Tasks implement astronomical data processing functionality. They are:

* **Configurable:** Modify a task’s behavior by changing its configuration. Automatically apply camera-specific modifications
* **Hierarchical:** Tasks can call other tasks as subtasks
* **Extensible:** Replace (“retarget”) any subtask with a variant. Write your own subclass of a task.
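
The three properties can be illustrated with a toy sketch in plain Python. None of the classes below are `lsst.pipe.base` classes, and real retargeting goes through the config system described in the linked documentation; this only shows the shape of the idea with hypothetical names.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass
class ToyStatsConfig:
    scale: float = 1.0          # configurable: changes the task's behaviour


class ToyMeanTask:
    def __init__(self, config=None):
        self.config = config or ToyStatsConfig()

    def run(self, values):
        return self.config.scale * mean(values)


class ToyMedianTask(ToyMeanTask):
    # extensible: a subclass that swaps in a different statistic
    def run(self, values):
        return self.config.scale * median(values)


class ToyDriverTask:
    # hierarchical: the driver delegates to a subtask, which can be swapped
    def __init__(self, stats_class=ToyMeanTask, config=None):
        self.stats = stats_class(config)

    def run(self, values):
        return {"stat": self.stats.run(values)}


data = [1.0, 2.0, 3.0, 100.0]
default_result = ToyDriverTask().run(data)
retargeted_result = ToyDriverTask(stats_class=ToyMedianTask).run(data)
scaled_result = ToyDriverTask(config=ToyStatsConfig(scale=2.0)).run(data)

print(default_result, retargeted_result, scaled_result)
```

Swapping `ToyMeanTask` for `ToyMedianTask` changes the result without touching the driver, which is the point of retargeting.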


In [None]:
# Edited highlights of ${PIPE_TASKS_DIR}/example/exampleStatsTask.py
import sys
import numpy as np
from lsst.geom import Box2I, Point2I, Extent2I
from lsst.afw.image import MaskedImageF
from lsst.pipe.tasks.exampleStatsTasks import ExampleSimpleStatsTask, ExampleSigmaClippedStatsTask

In [None]:
# Load a MaskedImageF -- an image containing floats
# together with a mask and a per-pixel variance.

WIDTH = 40
HEIGHT = 20

maskedImage = MaskedImageF(Box2I(Point2I(10, 20),
                                 Extent2I(WIDTH, HEIGHT)))
x = np.random.normal(10, 20, size=WIDTH*HEIGHT)

# Because we are shoving it into an ImageF and numpy defaults
# to double precision
X = x.reshape(HEIGHT, WIDTH).astype(np.float32)  
im = maskedImage.image
im.array = X

# We initialize the Task once but can call it many times.
task = ExampleSimpleStatsTask()

# Simply call the .run() method with the MaskedImageF.
# Most Tasks have a .run() method. Look there first. 
result = task.run(maskedImage)

# And print the result.
print(result)

## Using a Task with configuration

Now we are going to instantiate Tasks with two different configs. Configs must be set *before* instantiating the task. Do not change the config of an already-instantiated Task object. It will not do what you think it's doing.

In fact, during commandline processing, the `Task` drivers such as `CmdLineTask` freeze the configs before running the Task. When you're running them from notebooks, they are not frozen, hence this warning.
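
To make the warning concrete, here is a toy sketch of one way a late config change can be silently ignored, and of what freezing buys you. `ToyConfig` and `ToyClipTask` are hypothetical; they are not `lsst.pex.config.Config` or any Stack task, and the assumption that a task derives state from its config at construction time is mine, for illustration.

```python
class ToyConfig:
    """Hypothetical config that can be frozen; not lsst.pex.config.Config."""

    def __init__(self, numSigmaClip=3.0):
        self.numSigmaClip = numSigmaClip
        self._frozen = False

    def freeze(self):
        self._frozen = True

    def __setattr__(self, name, value):
        if getattr(self, "_frozen", False):
            raise AttributeError(f"config is frozen; cannot set {name!r}")
        super().__setattr__(name, value)


class ToyClipTask:
    def __init__(self, config):
        self.config = config
        # Derived state is computed once, at construction time.
        self._limit = 10.0 * config.numSigmaClip

    def run(self, value):
        return value <= self._limit


config = ToyConfig(numSigmaClip=1.0)
task = ToyClipTask(config)

config.numSigmaClip = 3.0      # silently accepted, but too late
stale = task.run(20.0)         # still uses the limit derived from 1.0

config.freeze()
try:
    config.numSigmaClip = 5.0  # a frozen config refuses the change loudly
    frozen_error = None
except AttributeError as err:
    frozen_error = str(err)

print(stale, frozen_error)
```

An unfrozen config fails quietly, while a frozen one fails with an error, which is why notebook use needs the extra care described above.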

In [None]:
# Edited highlights of ${PIPE_TASKS_DIR}/example/exampleStatsTask.py

config1 = ExampleSigmaClippedStatsTask.ConfigClass(numSigmaClip=1)

config2 = ExampleSigmaClippedStatsTask.ConfigClass()
config2.numSigmaClip = 3

task1 = ExampleSigmaClippedStatsTask(config=config1)
task2 = ExampleSigmaClippedStatsTask(config=config2)

print(task1.run(maskedImage).mean)
print(task2.run(maskedImage).mean)


# Example of what not to do
# -------------------------
# task1 = ExampleSigmaClippedStatsTask(config=config1)
# print(task1.run(maskedImage).mean)
# DO NOT EVER DO THIS!
# task1.config.numSigmaClip = 3  <--- bad bad bad 
# print(task1.run(maskedImage).mean)

## Background Subtraction and Task Configuration

The following example of reconfiguring a task is one step in an introduction to `processCcd`: https://github.com/lsst-sqre/notebook-demo/blob/master/AAS_2019_tutorial/intro-process-ccd.ipynb

`processCcd`, our basic source extractor run as the first step in the pipeline, will be covered in more detail in Session 4.

In [None]:
from lsst.meas.algorithms import SubtractBackgroundTask

In [None]:
# Add the background back in so that we can remodel it (like we did above)

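# Load into `exposure`, the name the cells below operate on; once the
# background is added back, it approximates the postISRCCD.
exposure = butler.get("calexp", visit=30502, ccd=CCD)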
bkgd = butler.get("calexpBackground", visit=30502, ccd=CCD)
mi = exposure.maskedImage
mi += bkgd.getImage()

In [None]:
# Execute this cell to get fun & terrible results!
bkgConfig = SubtractBackgroundTask.ConfigClass()
bkgConfig.useApprox = False
bkgConfig.binSize = 20

The `bkgConfig` object here is an instance of a class that inherits from `lsst.pex.config.Config` and contains a set of `lsst.pex.config.Field` objects that define the options that can be modified. Each `Field` behaves more or less like a Python `property`, and you can get information on the fields in a config object either by using `help` or, in IPython, by appending `?` to a field:

In [None]:
help(bkgConfig)

In [None]:
SubtractBackgroundTask.ConfigClass.algorithm?
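
The "behaves like a `property`" idea can be sketched with a plain Python descriptor. `ToyField` and `ToyBackgroundConfig` below are hypothetical and much simpler than `lsst.pex.config`; the documentation strings and default values are placeholders, not the real defaults of `SubtractBackgroundTask`.

```python
class ToyField:
    """Descriptor holding a documented, type-checked option (hypothetical)."""

    def __init__(self, doc, dtype, default):
        self.doc = doc
        self.dtype = dtype
        self.default = default

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self          # class access returns the field itself
        return instance.__dict__.get(self.name, self.default)

    def __set__(self, instance, value):
        if not isinstance(value, self.dtype):
            raise TypeError(f"{self.name} must be {self.dtype.__name__}")
        instance.__dict__[self.name] = value


class ToyBackgroundConfig:
    binSize = ToyField("Bin size in pixels (placeholder doc)", int, 128)
    useApprox = ToyField("Use an approximation? (placeholder doc)", bool, True)

    @classmethod
    def describe(cls):
        return {name: f.doc for name, f in vars(cls).items()
                if isinstance(f, ToyField)}


cfg = ToyBackgroundConfig()
default_bin = cfg.binSize
cfg.binSize = 20

try:
    cfg.binSize = "20"           # wrong type is rejected at assignment
    type_error = None
except TypeError as err:
    type_error = str(err)

print(default_bin, cfg.binSize, sorted(ToyBackgroundConfig.describe()))
```

Keeping the documentation on the field object is what lets tools like `help` and `?` report on each option.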

In [None]:
bkgTask = SubtractBackgroundTask(config=bkgConfig)

In [None]:
bkgResult = bkgTask.run(exposure)

In [None]:
display1 = afw_display.Display(frame=1, backend='matplotlib')
display1.scale("linear", min=-0.5, max=10)
display1.mtv(exposure[700:1400,1800:2400])

If you've run through all of these steps after executing the cell that warns about terrible results, you should notice that the galaxy in the upper right has been oversubtracted.
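
Why a small bin size oversubtracts an extended source can be seen in a synthetic example. The sketch below uses a crude piecewise-constant median background on a noiseless fake image; it is an assumption-laden stand-in, not the algorithm inside `SubtractBackgroundTask`, and all the numbers are made up.

```python
import numpy as np


def binned_median_background(image, bin_size):
    """Piecewise-constant background: the median of each bin_size x bin_size block."""
    ny, nx = image.shape
    blocks = image.reshape(ny // bin_size, bin_size, nx // bin_size, bin_size)
    medians = np.median(blocks, axis=(1, 3))
    return np.repeat(np.repeat(medians, bin_size, axis=0), bin_size, axis=1)


# Fake image: flat sky plus one broad Gaussian "galaxy" in the middle.
size = 240
y, x = np.mgrid[0:size, 0:size]
sky = 10.0
galaxy = 50.0 * np.exp(-((x - 120) ** 2 + (y - 120) ** 2) / (2 * 15.0 ** 2))
image = sky + galaxy

# Bins smaller than the galaxy absorb its light into the "background".
peak_small_bins = (image - binned_median_background(image, 20)).max()
peak_large_bins = (image - binned_median_background(image, 120)).max()

print(peak_small_bins, peak_large_bins)
```

With bins comparable to or smaller than the galaxy, the per-bin median tracks the galaxy itself, so subtracting it removes real flux. With large bins the median stays close to the sky level and the peak survives.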

**Exercise**: Before continuing on, re-load the exposure from disk, reset the configuration and `Task` instances, and re-run without executing the cell that applies bad values to the config, all by just re-executing the right cells above. You should end up with an image in which the upper-right galaxy looks essentially the same as it does in the image before we subtracted the background.