# Inspection of DC2 Run 2.2i DR6 Object Table with Dask
### Michael Wood-Vasey (@wmwv)
### Last Verified to Run: 2020-07-14 by MWV

Inspect the Run 2.2i DR6 Object Table  
Using a Dask Cluster on one NERSC node as the backend.

#### Run 2.2i DR6a as of 2020-06-04 includes  
  * 78 tracts
  * 52 million objects  
  * 34 million objects with i-band SNR > 5

Logistics:

1. These tests were conducted on NERSC through the https://jupyter.nersc.gov interface.  
Note: To enable re-rastering when zooming, use the JupyterLab Classic interface.
You can launch this from an active JupyterHub Notebook by selecting "Help->Launch Classic Notebook".
  * You can select the "Running" tab and then select the Notebook you want.
  * You could instead browse through the full filesystem path under the "Files" tab to find your Notebook, but that's a lot more clicking.  You may want to take this approach to launch some other Notebook that's not currently running under JupyterHub.

2. Requires:
```
dask
dask.distributed
holoviews
datashader
bokeh
pyarrow >= 0.13.1
```

Up-to-date versions of each of these are available in the `desc-python-bleed` kernel.

3. This was run using the `desc-python-bleed` kernel

We directly use the DPDD Parquet files.

4. We use Dask, HoloViews, and Datashader to read Parquet files.  For more on each of these, see:

References:  
    https://dask.org  
    https://datashader.org  
    https://parquet.apache.org  
    https://holoviews.org  
    
In brief:

### Parquet

Parquet is a column-based storage format that's part of a wider Arrow project to provide standardized, high-performance data representations in memory and on disk.  It's commonly used in current data science and large data volume processing, and is the current selected standard for Rubin Observatory LSST Data Management on-disk representations of output data catalogs.  The DESC Data Access Team is thus similarly using Parquet as the default underlying data format for representations of DC2 data as processed by the LSST DM Science Pipelines.

### Dask

Dask allows us to do processing by dividing tasks into individual workers.  These workers allow us to take fuller use of available memory and processors, including those on other machines.

Dask addresses two needs:

1. Load more data than fits into memory. You can delay this in either time or space.
   * Delaying in time applies when you are running on a memory-limited machine: Dask chunks through the work units without needing to hold all of the data in memory at once.
   * Delaying in space means spinning up additional machines.  This is often particularly powerful when connecting your front-end machine (e.g., a NERSC JupyterHub job is limited to 42 GB memory), to several full cluster compute nodes (e.g., a NERSC Haswell node is 32 real cores, 128 GB).  

2. Distributing work across multiple processors. Python and numpy/scipy are not naturally parallel or easily parallelizable. One of the common things we will do with large datasets is aggregate for both analysis and visualization. Being able to do this aggregation in parallel is a significant gain.
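
The "delaying in time" idea can be illustrated without Dask itself. The following is a minimal plain-Python sketch, not how Dask is implemented: the helper names `iter_chunks` and `chunked_mean` are hypothetical, and a generator stands in for a data source too large to load at once. Each chunk is reduced to a running sum and count before the next chunk is read, so memory use is bounded by the chunk size.

```python
from itertools import islice


def iter_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(values, chunk_size=1000):
    """Compute a mean while holding only one chunk in memory at a time."""
    total = 0.0
    count = 0
    for chunk in iter_chunks(values, chunk_size):
        # Each chunk is reduced to two small numbers before the next is loaded.
        total += sum(chunk)
        count += len(chunk)
    return total / count


# A generator stands in for a data source too large to hold in memory at once.
mean_value = chunked_mean((float(i) for i in range(10_000)), chunk_size=256)
print(mean_value)
```

Dask applies the same pattern per partition of a DataFrame, and can additionally hand different partitions to different workers.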


### HoloViews

HoloViews is Dask-aware and can provide Dask the correct information to build a Task Graph that effectively parallelizes the requisite data loading and computation.  HoloViews can use either Bokeh or matplotlib backends.  If you directly use the matplotlib backend with a Dask DataFrame, it will not parallelize across the workers and will instead do much of the work in serial.  Bokeh also provides interactive capabilities, and HoloViews knows how to set up the linking and callbacks to enable coordinated zooming and selection.

TODO
1. I can't figure how to get Histograms to not have vertical lines.
    * It may not be possible because I think they're being drawn with Bokeh `quad`.
    * When I try to plot with `logy=True`, the plots become blank.
    * Yes, it really does alphabetize the samples.  I would like to learn how to force the order so I can get `galaxy` to appear more clearly on top of `good`.
    * If I instead use Path plot elements instead of Histogram, I can't figure out how to get labeled legends.
2. Plot colorbars for data-shaded plots
3. Histograms don't display scaled `logy`.  See, https://github.com/holoviz/holoviews/issues/2591

## Import Needed Modules

In [None]:
import math
import os

import numpy as np

import astropy.units as u
import pandas as pd

In [None]:
import colorcet

import dask
from dask.distributed import Client

import dask.dataframe as dd
import datashader as ds
import holoviews as hv
from holoviews.operation import histogram
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize
from holoviews.plotting.util import process_cmap

In [None]:
hv.extension('bokeh')

In [None]:
cmap = 'viridis'

## Start our Dask Cluster


For simple testing and illustration of how to use Dask, HoloViews, and Datashader, you can run locally on just one tract.
To run on the full set of DR6, you'll need to set up a node to support Dask distributed.  Basically you just need a machine that can hold the data in memory.

### Start a local Dask Cluster

In [None]:
LOCAL_DASK = False

In [None]:
if LOCAL_DASK:
    client = Client()

### Start a Dask Cluster on Interactive Nodes

To run on the full data set instead, open a separate Terminal on Cori and ask for a pair of Nodes from the `interactive` queue.  This generally completes in seconds.  We ask for 2 Nodes because we'd like the full ~256 GB of available memory to store the data and the intermediate copies that often get made in some of the plots below.

First move to your `CSCRATCH` directory to simplify file locking.  We use `CSCRATCH` because it is consistent across the nodes, whereas `SCRATCH` is sometimes empty on the compute nodes.
```
cd $CSCRATCH/
```

```
salloc -N 2 -C haswell --qos=interactive -t 04:00:00
```

And then once on the first Node, where you'll get put after the `salloc` completes, load the right Python environment:
```
python /global/common/software/lsst/common/miniconda/start-kernel-cli.py desc-python-bleed
```

And then start up the Dask Cluster
```
NUM_WORKERS=16
SCHEDULER_FILE=${CSCRATCH}/scheduler.json
dask-scheduler --scheduler-file ${SCHEDULER_FILE} &
dask-worker --nprocs ${NUM_WORKERS} --scheduler-file ${SCHEDULER_FILE} &
```

Then exit the environment and go to the second Node.

```
cd $CSCRATCH
```

```
python /global/common/software/lsst/common/miniconda/start-kernel-cli.py desc-python-bleed
```

```
NUM_WORKERS=16
SCHEDULER_FILE=${CSCRATCH}/scheduler.json
dask-worker --nprocs ${NUM_WORKERS} --scheduler-file ${SCHEDULER_FILE} &
```

The nodes will be printed out when the `salloc` launches.  And if you forget, you can look them up under the `SLURM_NODELIST` environment variable.

We connect to this Dask cluster through a shared agreement on where the `SCHEDULER_FILE` is.

We then configure the dashboard URL to use the JupyterHub proxy service.
We here set the formatting string template to the correct value.
Once we actually connect the client, it can tell us the full link.
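
To see what the template turns into, here is a small sketch that fills it in by hand. It assumes the link is produced by `str.format`-style substitution of environment variables plus the scheduler's host and port; the prefix, host, and port values below are hypothetical placeholders, not values from a real session.

```python
# The same template string that the notebook assigns to Dask's dashboard link setting.
template = "{JUPYTERHUB_SERVICE_PREFIX}proxy/{host}:{port}/status"

# Hypothetical values: on JupyterHub the prefix comes from the environment,
# and the host and port come from the running scheduler.
example_values = {
    "JUPYTERHUB_SERVICE_PREFIX": "/user/example-user/",
    "host": "127.0.0.1",
    "port": 8787,
}

dashboard_link = template.format(**example_values)
print(dashboard_link)
```

The result is a path relative to the JupyterHub server, which is why the dashboard is reachable through the Hub's proxy rather than by connecting to the compute node directly.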

In [None]:
if not LOCAL_DASK:
    scheduler_file = os.path.join(os.environ["CSCRATCH"], "scheduler.json")
    dask.config.config["distributed"]["dashboard"]["link"] = "{JUPYTERHUB_SERVICE_PREFIX}proxy/{host}:{port}/status"
    client = Client(scheduler_file=scheduler_file)

In [None]:
client

## Define Catalog and Subsampling

In [None]:
LOCAL_FILE = False

In [None]:
desc_data_dir = "/global/cfs/cdirs/lsst/shared/DC2-prod/"
# Local copy on my own machine for testing and for when NERSC is down
if LOCAL_FILE:
    desc_data_dir = "/Users/wmwv/tmp/DC2"

In [None]:
data_release = "dr6a"

run_data_dir = f"Run2.2i/dpdd/Run2.2i-{data_release}/dc2_object_run2.2i_{data_release}"
datafile = os.path.join(desc_data_dir, run_data_dir)

## Load Data

In [None]:
filters = ('u', 'g', 'r', 'i', 'z', 'y')

In [None]:
columns = ['ra', 'dec']
columns += [f'mag_{f}' for f in filters]
columns += [f'magerr_{f}' for f in filters]
columns += [f'mag_{f}_cModel' for f in filters]
columns += [f'magerr_{f}_cModel' for f in filters]
columns += ['I_flag']
columns += [f'I_flag_{f}' for f in filters]
columns += [f'Ixx_{f}' for f in filters]
columns += [f'Ixy_{f}' for f in filters]
columns += [f'Iyy_{f}' for f in filters]
columns += [f'psf_fwhm_{f}' for f in filters]
columns += ['good', 'extendedness', 'blendedness']

In [None]:
# Select good detections:
#  1. Marked as 'good' in catalog flags.
#  2. SNR in given band > threshold
#  3. In defined simulation range
snr_threshold = 5
snr_filter = 'i'

# We want to do a SNR cut, but magerr is the thing already calculated
# So we'll redefine our SNR in terms of magerr
magerr_cut = (2.5 / np.log(10)) / snr_threshold
snr_cut = f'magerr_{snr_filter} < {magerr_cut}'
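
The conversion between a signal-to-noise threshold and a magnitude-error threshold follows from first-order error propagation of $m = -2.5 \log_{10} f$, which gives $\sigma_m = (2.5 / \ln 10)\, \sigma_f / f$. A minimal sketch of that relation, with hypothetical helper names `magerr_from_snr` and `snr_from_magerr`:

```python
import math


def magerr_from_snr(snr):
    """First-order magnitude uncertainty for a given flux signal-to-noise ratio."""
    return (2.5 / math.log(10)) / snr


def snr_from_magerr(magerr):
    """Invert the same first-order relation."""
    return (2.5 / math.log(10)) / magerr


magerr_at_snr5 = magerr_from_snr(5)
print(f"SNR = 5 corresponds to magerr ~ {magerr_at_snr5:.4f} mag")
```

The approximation is linear in $\sigma_f / f$, so it is least accurate at the low-SNR end where the cut is applied.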

In [None]:
df = dd.read_parquet(datafile, columns=columns, engine='pyarrow')

In [None]:
# Define color columns
df['u-g'] = df['mag_u'] - df['mag_g']
df['g-r'] = df['mag_g'] - df['mag_r']
df['r-i'] = df['mag_r'] - df['mag_i']
df['i-z'] = df['mag_i'] - df['mag_z']
df['z-y'] = df['mag_z'] - df['mag_y']

In [None]:
# np.isfinite('blendedness')

good = df[df["good"] & (df[f"magerr_{snr_filter}"] < magerr_cut)]

In [None]:
star = good[good['extendedness'] == 0]
galaxy = good[good['extendedness'] > 0]

In [None]:
print(f'Total: {len(df)}, Good: {len(good)}, Stars: {len(star)}, Galaxies: {len(galaxy)}')

## Persist the data in Dask Cluster Worker memory

Dask actively purges data from memory when it's no longer needed by the Dask Task Graph currently doing the computation.

But each plot below is its own separate computation.  Dask doesn't know that it's going to use those data again in the next plot.  So we explicitly tell Dask to persist this data frame.

If you don't have the physical memory across your Dask installation (whether local or remote), then skip this persist step.  Running each of the plots will require re-reading the data and will be a bit slower than if we had memory to keep all of the data, but it will work fine.

In [None]:
good = good.persist()

## Object Density in RA, Dec

DC2 Run 2.x WFD and DDF regions
https://docs.google.com/document/d/18nNVImxGioQ3tcLFMRr67G_jpOzCIOdar9bjqChueQg/view
https://github.com/LSSTDESC/DC2_visitList/blob/master/DC2visitGen/notebooks/DC2_Run2_regionCoords_WFD.ipynb

| Location          | RA (degrees) | Dec (degrees) | RA (degrees) | Dec (degrees) |
|:----------------- |:------------ |:------------- |:------------ |:------------- |
| Region            | WFD          | WFD           | DDF          | DDF           |
| Center            | 61.856114    | -35.79        | 53.125       | -28.100       |
| North-East Corner | 71.462228    | -27.25        | 53.764       | -27.533       |
| North-West Corner | 52.250000    | -27.25        | 52.486       | -27.533       |
| South-West Corner | 49.917517    | -44.33        | 52.479       | -28.667       |
| South-East Corner | 73.794710    | -44.33        | 53.771       | -28.667       |

(Note that the order of the rows above is different than in the DC2 papers.  The order of the rows above goes around the perimeter in order.)

In [None]:
dc2_run2x_wfd = [[71.462228, -27.25], [52.250000, -27.25], [49.917517, -44.33], [73.794710, -44.33]]
dc2_run2x_ddf = [[53.764, -27.533], [52.486, -27.533], [52.479, -28.667], [53.771, -28.667]]

In [None]:
dc2_run2x_wfd_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][0]],
                                 'dec': [coord[1] for coord in dc2_run2x_wfd] + [dc2_run2x_wfd[0][1]]})
dc2_run2x_ddf_df = pd.DataFrame({'ra': [coord[0] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][0]],
                                 'dec': [coord[1] for coord in dc2_run2x_ddf] + [dc2_run2x_ddf[0][1]]})

In [None]:
def plot_ra_dec(df, dc2_run2x_wfd_df=dc2_run2x_wfd_df, dc2_run2x_ddf_df=dc2_run2x_ddf_df,
                show_dc2_region=True, cmap="bmy", bins=100, cmin=10):
    """We're just doing this on a rectilinear grid.
    We should do a projection, of course, but that distortion is tolerable in this space."""
    points_ra_dec = hv.Points(df, kdims=[hv.Dimension('ra', soft_range=(dc2_run2x_wfd[2][0], dc2_run2x_wfd[3][0])),
                                         hv.Dimension('dec', soft_range=(dc2_run2x_wfd[3][1], dc2_run2x_wfd[1][1]))])
    # We have to define the colormap here now, because the opts aren't passed through the datashade->Points.
    # See, e.g., https://github.com/holoviz/holoviews/issues/4125
    ra_dec = datashade(points_ra_dec, cmap=process_cmap(cmap, provider="colorcet"))
    ra_dec = ra_dec.opts(invert_xaxis=True)  # Flip to East left
    
    if show_dc2_region:
        # This region isn't quite a polygon.  The sides should be curved.
        wfd_region = hv.Path(dc2_run2x_wfd_df).opts(color='red')
        ddf_region = hv.Path(dc2_run2x_ddf_df).opts(color='orange')
        ra_dec = ra_dec * wfd_region * ddf_region
        
        max_delta_ra = dc2_run2x_wfd_df['ra'][3] - dc2_run2x_wfd_df['ra'][2]
        delta_dec = dc2_run2x_wfd_df['dec'][1] - dc2_run2x_wfd_df['dec'][3]
        grow_buffer = 0.05

        # Notice that these are specified in increasing RA left->right
        # We rely on the invert_xaxis True above to flip this in the display
        # It's important to get this right because these ranges are used for data selection
        # and then the range is flipped in the display.
        ra_dec.opts(xlim=(dc2_run2x_wfd[2][0] - max_delta_ra * grow_buffer,
                    dc2_run2x_wfd[3][0] + max_delta_ra * grow_buffer))
        ra_dec.opts(ylim=(dc2_run2x_wfd[3][1] - delta_dec * grow_buffer,
                    dc2_run2x_wfd[1][1] + delta_dec * grow_buffer))

    
    return ra_dec

In [None]:
plot_ra_dec(good)

TODO:
    1. Make aspect ratio square.
    2. Think about spherical->2D projection issues in general.

The overall object density distribution looks good.

Notes:
* If you are viewing this through a direct JupyterLab connection (Jupyter Classic Notebook, or separately on your own machine or setup), the plot will re-raster as you zoom in and out.  This functionality is not available within the JupyterHub environment.  JupyterHub doesn't allow the JavaScript callbacks in the browser back to the server that are necessary to do the re-rastering.
* We explicitly excluded the tracts that overlap the DDF region (orange square upper-right corner).
* There are also a few patches that failed within the main region.
* There is an overall gradient N/S in object density, because we're plotting in rectilinear RA, Dec bins, which means that bins at the bottom in RA cover less area than those at the top.
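
The size of that gradient can be estimated from the $\cos(\mathrm{Dec})$ factor in the solid angle of a fixed-size RA, Dec bin. This sketch uses the northern and southern Dec limits of the WFD region from the table above; the helper name `relative_bin_area` is hypothetical.

```python
import math


def relative_bin_area(dec_deg):
    """Solid angle of a fixed (dRA, dDec) bin relative to the same bin at Dec = 0."""
    return math.cos(math.radians(dec_deg))


# Northern and southern Dec limits of the WFD region.
dec_north, dec_south = -27.25, -44.33
ratio = relative_bin_area(dec_south) / relative_bin_area(dec_north)
print(f"A southern-edge bin covers {ratio:.2f} of the area of a northern-edge bin")
```

For a uniform surface density on the sky, the counts per bin would fall by the same factor from the top of the plot to the bottom.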

See the input visit coverage map here:
https://github.com/LSSTDESC/ImageProcessingPipelines/issues/97#issuecomment-498303504


## Color-Color Diagrams and the Stellar Locus

In [None]:
# We refer to a file over in `tutorials/assets' for the stellar locus
datafile_davenport = '../tutorials/assets/Davenport_2014_MNRAS_440_3430_table1.txt'

def get_stellar_locus_davenport(color1='gmr', color2='rmi',
                                datafile=datafile_davenport):
    color1 = color1.replace('-', 'm')
    color2 = color2.replace('-', 'm')

    data = pd.read_table(datafile, sep=r'\s+', header=1)
    return data[color1], data[color2]

    
def plot_stellar_locus(color1='gmr', color2='rmi',
                       color='blue', line_dash='dashed', line_width=2.5,
                       ax=None):

    color1_m = color1.replace('-', 'm')
    color2_m = color2.replace('-', 'm')

    model_color1, model_color2 = get_stellar_locus_davenport(color1_m, color2_m)
    model_df = pd.DataFrame({color1: model_color1, color2: model_color2})
    stellar_locus = hv.Path(model_df).opts(color=color, line_dash=line_dash, line_width=line_width)
        
    return stellar_locus 

In [None]:
def plot_color_color(df, color1, color2, 
                     range1=(-1, +2), range2=(-1, +2),
                     cmin=10, cmap='rainbow',
                     vmin=None, vmax=None):
    """Plot a color-color diagram.  Overlay stellar locus"""
    band1, band2 = color1[0], color1[-1]
    band3, band4 = color2[0], color2[-1]

    clean = df[np.isfinite(df[color1]) & np.isfinite(df[color2])]
    points_color1_color2 = hv.Points(
        clean,
        kdims=[
            hv.Dimension(color1, range=range1),
            hv.Dimension(color2, range=range2)]
    )

    color1_color2 = datashade(points_color1_color2, cmap=process_cmap(cmap, provider='colorcet'))

    try:
        stellar_locus = plot_stellar_locus(color1, color2)
        color1_color2 = color1_color2 * stellar_locus
    except KeyError as e:
        print(f"Couldn't plot Stellar Locus model for {color1}, {color2}")
        
    return color1_color2

In [None]:
def plot_four_color_color(df, vmin=0, vmax=50000):
    layout = hv.Layout(
    plot_color_color(df, 'g-r', 'u-g') + \
    plot_color_color(df, 'g-r', 'r-i') + \
    plot_color_color(df, 'g-r', 'i-z') + \
    plot_color_color(df, 'g-r', 'z-y'))
    
    layout = layout.cols(2)
    
    return layout

Note that the above panels will zoom in `g-r` together because HoloViews knows that they share this data column.  They don't zoom "together" in the y-axes because those columns are not shared between the plots.

The plots each re-raster as you zoom in and out.

There is no brushing (selection) and linking.

In [None]:
plot_four_color_color(star)

The discrete islands in the data for stellar color-color plot -- most visible in `r-i` vs. `g-r` at g-r ~= 1.2 mag -- are due to the finite set of stellar models used for simulating M dwarfs.

------
Let's plot the galaxies on the same color-color plots

Clearly one doesn't expect the galaxies to follow the stellar locus.  But including the stellar locus lines makes it easy to guide the eye between the stars-only and the galaxies-only plots.  

In [None]:
plot_four_color_color(galaxy)

Questions for further study:
   1. Is there a better comparison sample for the stellar locus than the Davenport reference?
   2. Why is the stellar locus in the Davenport reference 0.1--0.2 mag redder for the reddest stars than the observed data?  Are there different extinction assumptions (this should be a low-extinction region)?  Are different bandpasses used?

## 1D Density Plots

To compare number densities, we have to calculate the area covered by each catalog.
We'll use HEALPix through healpy to pixelate the region and then count the number of pixels with significant numbers of objects.
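
The arithmetic behind this approach can be sketched without healpy. A HEALPix map has $12 \cdot \mathrm{NSIDE}^2$ equal-area pixels, so the pixel area follows directly from the full-sky area. The function names and the example counts below are hypothetical and only illustrate the thresholding step.

```python
import math
import statistics


def healpix_pixel_area_sq_arcmin(nside):
    """Area of one HEALPix pixel in square arcminutes (all pixels have equal area)."""
    n_pixels = 12 * nside**2
    full_sky_sq_deg = 4 * math.pi * (180 / math.pi) ** 2
    return full_sky_sq_deg / n_pixels * 3600


def covered_area_sq_deg(counts_per_pixel, nside, threshold=0.25):
    """Sum the area of pixels whose counts exceed threshold * median of occupied pixels."""
    occupied = [c for c in counts_per_pixel if c > 0]
    cutoff = threshold * statistics.median(occupied)
    n_significant = sum(1 for c in occupied if c > cutoff)
    return n_significant * healpix_pixel_area_sq_arcmin(nside) / 3600


# Hypothetical counts: five well-populated pixels, one sparse edge pixel, one empty pixel.
example_counts = [40, 38, 42, 5, 41, 0, 39]
area = covered_area_sq_deg(example_counts, nside=1024)
print(f"pixel area at NSIDE=1024: {healpix_pixel_area_sq_arcmin(1024):.2f} sq arcmin")
print(f"covered area: {area:.5f} sq deg")
```

The sparse pixel falls below the cutoff and is excluded, which is how partially covered edge pixels are kept from inflating the area.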

In [None]:
def calculate_area(df, threshold=0.25, nside=1024, verbose=False):
    """Calculate the area covered by a catalog with 'ra', 'dec'
    
    Parameters:
    --
    df: DataFrame, dict-like with 'ra', 'dec' keys
    threshold:  float
        Fraction of median value required to count a pixel.
    nside:  int
        Healpix NSIDE.  NSIDE=1024 is ~12 sq arcmin/pixel, NSIDE=4096 is 0.74 sq. arcmin/pixel
        Increasing nside will decrease calculated area as holes become better resolved 
        and relative Poisson fluctuations in number counts become more significant.
    verbose:  bool
        Print details on nside, number of significant pixels, and area/pixel.
        
    Returns:
    --
    area:  Astropy Quantity.
    """
    import healpy as hp

    indices = hp.ang2pix(nside, df['ra'], df['dec'], lonlat=True)
    idx, counts = np.unique(indices, return_counts=True)
    
    # Take `threshold` (25% by default) of the median value of the non-zero counts/pixel
    threshold_counts = threshold * np.median(counts)

    if verbose:
        print(f'Median {np.median(counts)} objects/pixel')
        print(f'Only count pixels with more than {threshold_counts} objects')

    significant_pixels, = np.where(counts > threshold_counts)
    area_pixel = hp.nside2pixarea(nside, degrees=True) * u.deg**2

    if verbose:
        print(f'Pixel size ~ {hp.nside2resol(nside, arcmin=True) * u.arcmin:0.2g}')
        print(f'nside: {nside}, area/pixel: {area_pixel:0.4g}, num significant pixels: {len(significant_pixels)}')

    area = len(significant_pixels) * area_pixel

    if verbose:
        print(f'Total area: {area:0.7g}')
    
    return area

In [None]:
area_dc2 = calculate_area(galaxy)
print(f'DC2 Run 2.2i area: {area_dc2:0.2f}')

In [None]:
num_den_dc2 = len(galaxy) / area_dc2

# Change default expression to 1/arcmin**2
num_den_dc2 = num_den_dc2.to(1/u.arcmin**2)

In [None]:
area_dc2

In [None]:
def plot_mag_density_hist(df, filt, bins=None, density=False, area=None, color=None, **kwargs):
    mag_col = f'mag_{filt}'
    frequencies, edges = np.histogram(df[mag_col], bins=bins, density=density)
    if area is not None:
        frequencies = frequencies / area.value
    hist = hv.Histogram((edges, frequencies))
    hist.opts(xlabel=filt)
    
    if area is not None:
        ylabel = f"Objects/{area.unit}/bin"
    else:
        ylabel = "Objects/bin"
    hist.opts(ylabel=ylabel)

    if color:
        hist.opts(fill_color=None)
        hist.opts(line_color=color)
    
    return hist

def plot_mag_density_path(df, filt, object_type='', bins=None, density=False, area=None, **kwargs):
    mag_col = f'mag_{filt}'
    frequencies, edges = np.histogram(df[mag_col], bins=bins, density=density)
    if area is not None:
        frequencies = frequencies / area.value
    path = hv.Path(pd.DataFrame({mag_col: (edges[:-1]+edges[1:])/2, 'frequencies': frequencies}),
                  label=object_type)
#    if object_type is not None:
#        path.opts(label=object_type)

    path.opts(xlabel=mag_col)
    
    if area is not None:
        ylabel = f"Objects/{area.unit}/bin"
    else:
        ylabel = "Objects/bin"
    path.opts(ylabel=ylabel)
    
    return path 

plot_mag_density = plot_mag_density_hist

def plot_mag_densities(good, star, galaxy, filt,
                       area=None,
                       log=False, range=(16, 32), bins=None,
                       legend_position='top_left'):
    if bins is None:
        bins = np.linspace(*range, 100)
    
    densities = {'good': plot_mag_density(good, filt, object_type='good', bins=bins, area=area, color='green'),
                 'star': plot_mag_density(star, filt, object_type='star', bins=bins, area=area, color='blue'),
                 'galaxy': plot_mag_density(galaxy, filt, object_type='galaxy', bins=bins, area=area, color='red')}

    overlay = hv.NdOverlay(densities, kdims='Sample')
    overlay.opts(show_legend=True, legend_position=legend_position)
    if log:
        overlay.opts(logy=True)
    
    return overlay

In [None]:
density_plots = [plot_mag_densities(good, star, galaxy, filt, area=area_dc2) for filt in filters]
density_plots[0].opts(legend_position='top_left')
layout = hv.Layout(density_plots)

In [None]:
layout.cols(3)

The sharp cut in i-band is because that was the reference band for most detections.  The distributions in the other bands extend to 28th mag because many of the forced-photometry measurements are consistent with 0 and our S/N cut above was on i-band flux.

## Magnitude Error vs. Magnitude

The magnitude uncertainties come directly from the Poisson estimates of the flux measurements.  By construction they will follow smooth curves.  We here confirm that they do.

In [None]:
def plot_mag_magerr(df, band, ax=None, range=(16, 28), magerr_limit=0.25, vmin=100,
                   cmap="rainbow", snr_magerr_threshold=magerr_cut):
    mag_col, magerr_col = f'mag_{band}', f'magerr_{band}'
    points_mag_magerr = hv.Points(df, kdims=[hv.Dimension(mag_col, range=(14, 28)),
                                             hv.Dimension(magerr_col, range=(0, snr_magerr_threshold))])
    return datashade(points_mag_magerr, cmap=process_cmap(cmap, provider='colorcet'))


In [None]:
mag_magerr = hv.Layout([plot_mag_magerr(good, filt) for filt in filters])
mag_magerr.cols(3)

## Blendedness

Blendedness is a measure of how much the identified flux from an object is affected by overlap with other objects.

See Bosch et al., 2018, Section 4.9.11.

In [None]:
# print(f'{100 * len(w)/len(good_idx):0.1f}% of objects have finite blendedness measurements.')

Question for further study:  What happened to yield non-finite blendedness measurements?

In [None]:
blendedness = datashade(hv.Points(good, kdims=['mag_i', 'blendedness']), cmap=process_cmap("rainbow", provider="colorcet"))

In [None]:
blendedness

### Extendedness
 
Extendedness is essentially star/galaxy separation based purely on morphology in the main detected reference band (which is `i` for most Objects).

Extendedness is a binary property in the catalog, so it's either 0 or 1.

In [None]:
extendedness = datashade(hv.Points(good, kdims=['mag_i', 'extendedness']), cmap=process_cmap("rainbow", provider="colorcet"))

In [None]:
extendedness.opts(ylim=(-0.1, +1.1)) * hv.Text(18, 0.9, "Galaxies") * hv.Text(18, 0.1, "Stars")

While the first plot above made extendedness look like a simple binary property, the truth is more complicated.

As galaxies get smaller in angular size and lower in signal-to-noise ratio, it becomes harder to clearly distinguish stars from galaxies.

Extendedness is based on the difference between the point-source model and extended model brightness.  Specifically, objects with `mag_psf - mag_cmodel > 0.0164` mag are labeled with `extendedness=1` (i.e., galaxies).

See Bosch et al. 2018, Section 4.9.10 for details.
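
As a minimal sketch of that rule, the function below applies a fixed magnitude-difference cut to a single object. The name `classify_extendedness` and the example magnitudes are hypothetical, the cut value is the one used in the notebook code, and the sketch ignores the non-finite cases that a real pipeline has to handle.

```python
EXTENDEDNESS_DELTA_MAG_CUT = 0.0164  # mag, the value used in the notebook code


def classify_extendedness(mag_psf, mag_cmodel, cut=EXTENDEDNESS_DELTA_MAG_CUT):
    """Return 1 (extended) if the PSF magnitude is fainter than CModel by more than cut."""
    return 1 if (mag_psf - mag_cmodel) > cut else 0


# Hypothetical magnitudes, chosen only to land on either side of the cut.
star_like = classify_extendedness(22.000, 21.995)    # difference of 0.005 mag
galaxy_like = classify_extendedness(22.000, 21.900)  # difference of 0.100 mag
print(star_like, galaxy_like)
```

A PSF model fit to an extended object misses flux in the wings, so its PSF magnitude is fainter than its CModel magnitude and the difference is positive.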

In [None]:
extendedness_delta_mag_cut = 0.0164
psf_cModel_mag_cut = hv.VLine(extendedness_delta_mag_cut,
                              label=rf"{extendedness_delta_mag_cut:0.4f} $\Delta$mag cut")
psf_cModel_mag_cut = psf_cModel_mag_cut.opts(color='red', line_dash="dashed")

In [None]:
def plot_delta_mag_cModel(df, filt, bins=None):
    if bins is None:
        bins = np.linspace(-0.1, 0.1, 201)
    frequencies, edges = np.histogram(df[f'mag_{filt}'] - df[f'mag_{filt}_cModel'], bins=bins)
    return hv.Histogram((edges, frequencies))

In [None]:
filt = 'i'
delta_mag_cModel_hists = {'good': plot_delta_mag_cModel(good, filt),
                          'star': plot_delta_mag_cModel(star, filt),
                          'galaxy': plot_delta_mag_cModel(galaxy, filt)}

In [None]:
delta_mag_cModel = hv.NdOverlay(delta_mag_cModel_hists, kdims="Sample") 

In [None]:
delta_mag_cModel.opts(width=600, xlabel='mag_i[_psf] - mag_i_CModel', ylabel='Objects/bin') \
  * psf_cModel_mag_cut \
  * hv.Text(-0.05, 4000, "Stars") * hv.Text(0.05, 4000, "Galaxies")

In [None]:
good['delta_mag_cModel_i'] = good['mag_i'] - good['mag_i_cModel']
clean = good[(-2.5 < good['g-r']) & (good['g-r'] < 4)]

In [None]:
psf_cModel_mag_cut = hv.HLine(extendedness_delta_mag_cut,
                              label=rf"{extendedness_delta_mag_cut:0.4f} $\Delta$mag cut")
psf_cModel_mag_cut = psf_cModel_mag_cut.opts(color='green', line_dash="dashed")

In [None]:
points = hv.Points(clean, kdims=['mag_i', 'delta_mag_cModel_i'])
points = points.opts(ylabel='mag_i[_psf] - mag_i_cModel')

yhist = points.hist(dimension='delta_mag_cModel_i', adjoin=False)
xhist = points.hist(dimension='mag_i', adjoin=False)

shaded_points = datashade(points, cmap=process_cmap("rainbow", provider="colorcet"))

In [None]:
points_color = hv.Points(clean, kdims=['g-r', 'delta_mag_cModel_i'])
points_color_xhist = points_color.hist(dimension='g-r', dynamic=True, adjoin=False)

shaded_points_color = datashade(points_color, cmap=process_cmap("rainbow", provider="colorcet"))

In [None]:
composite = (shaded_points_color << yhist << points_color_xhist) \
    + (shaded_points * psf_cModel_mag_cut << yhist << xhist)

In [None]:
composite

We can zoom in a little to see how the fixed 0.0164 mag cut works at the low SNR limit.  Specifically at mag 24, we're starting to run out of stars and most things are galaxies.  But that's a population prior, it's not something visible using just morphology information.

You can see the effect of lower SNR measurements as the horizontal line at $\Delta$mag=0 puffs up due to increased uncertainties.

TODO: 
1. I don't know how to construct an AdjointLayout without a "right" element.  So there's an extra duplicate "delta_mag_cModel_i" histogram that's not really helpful or projected right.

## Shape Parameters

Ixx, Iyy, Ixy

In [None]:
def plot_data_hist(data, bins, color, line_dash):
    frequencies, edges = np.histogram(data, bins=bins)
    hist = hv.Histogram((edges, frequencies))
    hist.opts(fill_color=color)
    hist.opts(line_dash=line_dash)
    return hist
 

def plot_moments_for_filter(good, star, galaxy, filt,
                            names=['good', 'star', 'galaxy'],
                            colors=['blue', 'orange', 'green']):
    hist_kwargs = {'color': colors, 'log': True,
             'range': (0, 50)}

    bins = np.logspace(-1, 1.5, 100)
    moment_lines = {}
    for prefix, ls in (('Ixx', 'solid'), ('Iyy', 'dashed'), ('Ixy', 'dotted')):
        field = f'{prefix}_{filt}'
        for df, name, color in zip((good, star, galaxy), names, colors):
            label = f'{prefix} {name}'
            line = plot_data_hist(df[field], bins=bins, color=color, line_dash=ls)
            moment_lines[label] = line

    moments_plot = hv.NdOverlay(moment_lines, kdims="Moments")
    moments_plot.opts(xlabel=f'{filt} Moments: Ixx, Iyy, Ixy [pixels^2]')
    moments_plot.opts(ylabel='objects / bin')
    
    return moments_plot

In [None]:
moment_plots = [plot_moments_for_filter(good, star, galaxy, filt) for filt in filters]
for m in moment_plots[1:]:
    m.opts(show_legend=False)
moments = hv.Layout(moment_plots)

In [None]:
moments.cols(3)

TODO:
1. Need to clean up histograms so that you can see the different lines.
2. Shift to a legend that's outside the grid of plots

The stars (orange) are concentrated at low values of the source moments.

Would be interesting to
1. Look by magnitude or SNR to understand the longer tail.  Are these galaxies mis-classified as stars, or are these noise sources?
2. Distribution of ellipticity (see validate_drp to type this right)

In [None]:
def ellipticity_col_does_not_work(I_xx, I_xy, I_yy):
    """Calculate ellipticity from second moments.

    Parameters
    ----------
    I_xx : float or numpy.array
    I_xy : float or numpy.array
    I_yy : float or numpy.array

    Returns
    -------
    e, e1, e2 : (float, float, float) or (numpy.array, numpy.array, numpy.array)
        Complex ellipticity, real component, imaginary component
        
    Copied from https://github.com/lsst/validate_drp/python/lsst/validate/drp/util.py
    """
    e = (I_xx - I_yy + 2j*I_xy) / (I_xx + I_yy + 2*dask.array.sqrt(I_xx*I_yy - I_xy**2))
    e1 = np.real(e)
    e2 = np.imag(e)
    return e, e1, e2

In [None]:
def ellipticity(df, Ixx, Ixy, Iyy):
    """Calculate ellipticity from second moments from a dataframe.

    Parameters
    ----------
    df : DataFrame
    Ixx : column name
    Ixy : column name
    Iyy : column name

    Returns
    -------
    e, e1, e2 : (float, float, float) or (numpy.array, numpy.array, numpy.array)
        Complex ellipticity, real component, imaginary component
        
    Copied from https://github.com/lsst/validate_drp/python/lsst/validate/drp/util.py
    """
    e =  (df[Ixx] - df[Iyy] + 2j*df[Ixy] ) / (df[Ixx] + df[Iyy] + 2*dask.array.sqrt(df[Ixx]*df[Iyy] - df[Ixy]**2))
    e1 = np.real(e)
    e2 = np.imag(e)
    return e, e1, e2
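
To check the formula on values that can be worked by hand, here is a scalar sketch of the same expression in plain Python. The function name `ellipticity_from_moments` and the example moments are hypothetical. A round object (equal `I_xx` and `I_yy`, zero `I_xy`) should give zero ellipticity, and an object with moments in a 4:1 ratio along the axes should give 1/3.

```python
import math


def ellipticity_from_moments(i_xx, i_xy, i_yy):
    """Complex ellipticity e and its components (e1, e2) from second moments."""
    # The square root is of the determinant of the moments matrix,
    # which is non-negative for physically valid moments.
    denominator = i_xx + i_yy + 2 * math.sqrt(i_xx * i_yy - i_xy**2)
    e = (i_xx - i_yy + 2j * i_xy) / denominator
    return e, e.real, e.imag


# Hypothetical moments in pixels^2.
_, round_e1, round_e2 = ellipticity_from_moments(1.0, 0.0, 1.0)
_, elong_e1, elong_e2 = ellipticity_from_moments(4.0, 0.0, 1.0)
print(round_e1, round_e2)
print(elong_e1, elong_e2)
```

The result is dimensionless, because the moments appear in both numerator and denominator.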

In [None]:
def plot_ellipticities_for_filter(good, star, galaxy, filt,
                                names=['good', 'star', 'galaxy'],
                                colors=['blue', 'orange', 'green']):
    hist_kwargs = {'color': colors, 'log': True, 'range': (0, 50)}

    bins = np.linspace(0, 20, 201)
    ellipticity_lines = {}
    for df, name, color in zip((good, star, galaxy), names, colors):
        e, e1, e2 = \
        ellipticity(df, f'Ixx_{filt}', f'Ixy_{filt}', f'Iyy_{filt}')
        for data, prefix, ls in ((e, 'e', 'solid'), (e1, 'e1', 'dashed'), (e2, 'e2', 'dotted')):
            field = f'{prefix}_{filt}'
            label = f'{prefix} {name}'
            line = plot_data_hist(data, bins=bins, color=color, line_dash=ls)
            ellipticity_lines[label] = line

    ellipticities_plot = hv.NdOverlay(ellipticity_lines, kdims="Ellipticities")
    ellipticities_plot.opts(xlabel=f'{filt} Ellipticity: e, e1, e2')
    ellipticities_plot.opts(ylabel='objects / bin')

    return ellipticities_plot    

In [None]:
ell = plot_ellipticities_for_filter(good, star, galaxy, filt='i')

In [None]:
ell

In [None]:
ellipticity_plots = [plot_ellipticities_for_filter(good, star, galaxy, filt) for filt in filters]
for m in ellipticity_plots[1:]:
    m.opts(show_legend=False)
    
# logy=True results in a blank plot.  This is a bug in holoviews+bokeh
# for m in ellipticity_plots:
#     m.opts(logy=True)
#     m.opts(ylim=(10, 100000))

ellipticities = hv.Layout(ellipticity_plots)

In [None]:
ellipticities.cols(3)

TODO:
    1. logy results in nothing being shown on plots
    2. Full set of ellipticity calculations+plotting takes ~30 seconds on 8 workers on 16 core, 64 GB machine.  Should I pre-compute this earlier?

### FWHM of the PSF
At the location of the catalog objects.

The Object Table stores the shape parameters of the PSF model as evaluated at the location of the object.

This is not the same as, but is certainly related to, the distribution of effective seeing in the individual images that made up the coadd.

In [None]:
def plot_psf_fwhm(df, filt, bins=None, density=True):
    psf_col = f"psf_fwhm_{filt}"
    frequencies, edges = np.histogram(df[psf_col], density=density, bins=bins)
    hist = hv.Histogram((edges, frequencies))
    hist.opts(xlabel=psf_col)
    return hist

def plot_psf_fwhm_for_filters(df, filters=filters, bins=None, density=True,
                              alpha=0.5,
                              colors=("purple", "blue", "green", "orange", "red", "brown")):
    if bins is None:
        bins = np.linspace(0, 1.5, 201)

    fwhm_histograms = {}
    for filt, color in zip(filters, colors):
        hist = plot_psf_fwhm(df, filt, bins=bins, density=density)
        hist.opts(color=color)
        hist.opts(line_color=None)
        hist.opts(line_alpha=alpha)
        fwhm_histograms[filt] = hist
    
    fwhm = hv.NdOverlay(fwhm_histograms, kdims="Filter")
    if density:
        ylabel = "Density [normalized to integral=1]"
    else:
        ylabel = "Objects / bin"
    fwhm.opts(xlabel="Model PSF FWHM [arcsec]")
    fwhm.opts(ylabel=ylabel)
    
    return fwhm
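
A note on the normalization: with `density=True`, `np.histogram` scales the bin heights so that the integral over the bins is 1, which is not the same as the heights summing to 1. The sketch below checks this on a hypothetical random sample standing in for a PSF FWHM column; the sample parameters are made up for illustration.

```python
import numpy as np

# Hypothetical sample standing in for a PSF FWHM column, in arcsec.
rng = np.random.default_rng(seed=0)
fwhm = rng.normal(loc=0.8, scale=0.1, size=10_000)

bins = np.linspace(0, 1.5, 201)
density, edges = np.histogram(fwhm, bins=bins, density=True)

# The integral is the sum of height * width over all bins.
bin_widths = np.diff(edges)
integral = (density * bin_widths).sum()
print(f"sum of bin heights:    {density.sum():.2f}")
print(f"sum of height * width: {integral:.2f}")
```

Because the bins here are narrow, the individual heights are much larger than they would be if the heights themselves summed to 1.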

In [None]:
plot_psf_fwhm_for_filters(df).opts(width=600)