# Run 2.2i Object Tables via PostgreSQL with Dask read_sql_table + Holoviews
<br>Owner: **Joanne Bogart** ([@jrbogart](https://github.com/LSSTDESC/DC2-analysis/issues/new?body=@wmwv))
<br>Last Verified to Run: 2020-08-12
    
* Demonstrates the use of the DPDD-style Object table stored in a PostgreSQL database.
* Uses Dask+Holoviews+Datashader to demonstrate visualizing the entire dataset.
* Largely stolen from @wmwv's "Run 2.2i Object Tables with Dask+Holoviews"

Learning Objectives:
After completing and studying this Notebook, a person should be able to:
1. Access database tables with Dask.
    - Understand that specifying the columns to select increases performance.
2. Make a plot using Holoviews.
3. Use Datashader to interactively rasterize a large dataset for display.
4. Launch a set of Dask workers on a SLURM qos=interactive job.

Logistics:

1. These tests were conducted on NERSC through the https://jupyter.nersc.gov interface.

2. Requires:
```
dask
dask.distributed
holoviews
datashader
bokeh
```

3. This was run using the `desc-python-bleed` kernel

Current Status:
* Quick demonstration for people who want to check out the use of PostgreSQL with Dask.
* Future work planned: try matching Run2.2i dr6c with truth

References:  
    https://dask.org  
    https://datashader.org   
    https://holoviews.org  

In [None]:
import os

import dask
import dask.dataframe as dd
import datashader as ds
import holoviews as hv
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize

In [None]:
hv.extension('bokeh')

## Start our Dask Cluster

For simple testing and illustration of how to use Dask, HoloViews, and Datashader, you can run locally on just one tract.
To run on the full set of DR6, you'll need to set up a node to support Dask distributed.  In practice, you just need a machine that can hold the data in memory.

### Start a Dask Cluster on an Interactive Node

Move to your SCRATCH directory to simplify file locking.  (On NERSC, the `SCRATCH` environment variable points to an individual SCRATCH area for each user.)
```
cd $SCRATCH/
```

In a separate Terminal on Cori (a JupyterHub Terminal works fine), ask for a Node from the `interactive` queue.  This generally completes in seconds.
```
salloc -N 1 -C haswell --qos=interactive -t 04:00:00
```

And then once on that Node, load the right Python environment:
```
python /global/common/software/lsst/common/miniconda/start-kernel-cli.py desc-python-bleed
```

And then start up the Dask Cluster:
```
NUM_WORKERS=16
SCHEDULER_FILE=${SCRATCH}/scheduler.json
rm -rf ${SCHEDULER_FILE}
dask-scheduler --scheduler-file ${SCHEDULER_FILE} &
dask-worker --nprocs ${NUM_WORKERS} --scheduler-file ${SCHEDULER_FILE} &
```

(We explicitly `rm -rf ${SCHEDULER_FILE}` above in case it's still around from a previous invocation.)
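
Before connecting from the notebook, it can be useful to confirm that the scheduler has actually written its file. The following is a minimal sketch, assuming the scheduler file is a JSON document containing an `address` entry. The helper name `wait_for_scheduler_address` and the sample address are hypothetical, and the demonstration uses a stand-in file rather than a real scheduler.

```python
import json
import os
import tempfile
import time


def wait_for_scheduler_address(path, timeout=60.0, poll=0.5):
    """Wait until the scheduler file exists, then return its 'address' entry.

    Hypothetical helper: assumes the file is JSON with an 'address' key.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"No scheduler file at {path} after {timeout} s")
        time.sleep(poll)
    with open(path) as f:
        return json.load(f)["address"]


# Self-contained demonstration with a stand-in file (made-up address).
with tempfile.TemporaryDirectory() as tmp:
    demo_file = os.path.join(tmp, "scheduler.json")
    with open(demo_file, "w") as f:
        json.dump({"type": "Scheduler", "address": "tcp://10.0.0.1:8786"}, f)
    demo_address = wait_for_scheduler_address(demo_file, timeout=5)
    print(demo_address)
```

In real use you would pass the same path that was given to `dask-scheduler --scheduler-file`.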

You connect to this Dask cluster by passing in the location of the `SCHEDULER_FILE`:

In [None]:
scheduler_file = os.path.join(os.environ["SCRATCH"], "scheduler.json")  # must match SCHEDULER_FILE used to start the scheduler

We then configure the dashboard URL to use the JupyterHub proxy service.
Here we set the format-string template for the link.
Once the client is connected, it can tell us the full link.
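
To see what the template turns into, here is a small sketch that fills it in by hand. It assumes the link is produced by ordinary `str.format` substitution of environment variables plus the scheduler's host and port. The environment value, host, and port below are made up, and `format_link` is a hypothetical helper rather than a Dask function.

```python
# Template from this notebook; the values below are made up for illustration.
template = "{JUPYTERHUB_SERVICE_PREFIX}proxy/{host}:{port}/status"

# Stand-in for os.environ on a JupyterHub server (hypothetical user name).
fake_environ = {"JUPYTERHUB_SERVICE_PREFIX": "/user/someuser/"}


def format_link(template, environ, host, port):
    """Fill the template from environment variables plus host and port."""
    return template.format(**{**environ, "host": host, "port": port})


link = format_link(template, fake_environ, host="127.0.0.1", port=8787)
print(link)  # /user/someuser/proxy/127.0.0.1:8787/status
```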

In [None]:
dask.config.set({"distributed.dashboard.link": "{JUPYTERHUB_SERVICE_PREFIX}proxy/{host}:{port}/status"})

In [None]:
from dask.distributed import Client

client = Client(scheduler_file=scheduler_file)

In [None]:
client

Click on the link after `Dashboard:` above to get a visualization of the Dask Cluster.

## Connect to the database
When using Dask, the workers make the connections. To make a query we pass in the URL so that each worker can make its own connection if it needs to.
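
The URL follows the usual `dialect://user@host:port/database` layout. The sketch below only takes the URL apart with the standard library to show the pieces, and it does not contact the database. Note that no password appears in the URL. We assume the credential is supplied separately (for example through a `~/.pgpass` file), which is an assumption about the setup rather than something this notebook demonstrates.

```python
from urllib.parse import urlsplit

# Same URL as used in this notebook.
pg_url = "postgresql://desc_dc2_drp_user@nerscdb03.nersc.gov:5432/desc_dc2_drp"

parts = urlsplit(pg_url)
components = {
    "dialect": parts.scheme,
    "user": parts.username,
    "password": parts.password,  # None: no password is embedded in the URL
    "host": parts.hostname,
    "port": parts.port,
    "database": parts.path.lstrip("/"),
}
for name, value in components.items():
    print(f"{name:9s} {value}")
```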

In [None]:
pg_url = 'postgresql://desc_dc2_drp_user@nerscdb03.nersc.gov:5432/desc_dc2_drp'

## Define our data 

In [None]:
schema_name = "run22i_dr6c_object"
table_name = "dpdd_object"

In [None]:
# Specify the columns we need.  Selecting only these columns reduces the amount of data each query has to transfer from the database.
columns_to_read = ['ra', 'dec', 'mag_g', 'mag_i', 'mag_r', 'magerr_g', 'magerr_i',
                  'magerr_r', 'extendedness', 'tract']

# Specify how to partition input.
# Value of divisions should be, e.g., 17 tract numbers, starting with
# first and ending with > last tract, to divide into 16 parallel queries.
# The following definition encompasses all tracts
div = [2723, 2905, 3080, 3258, 3268, 3449, 3635, 3825, 3834, 4028, 4226, 4235, 4436,
      4639, 4648, 4858, 5075]

# There is no problem reading in all tracts, but rendering plots is an issue.
# For this notebook, we define a partition which only includes 16 tracts
div_small = [3631, 3632, 3633, 3634, 3635, 3636, 3637, 3638, 3639,
             3640, 3641, 3642, 3643, 3825, 3826, 3827, 3828]

In [None]:
df = dd.read_sql_table(table_name, pg_url, index_col="tract",
                       columns=columns_to_read,
                       schema=schema_name, divisions=divisions_small if False else div_small)

The warning emitted by the cell above is harmless. The special datatype (three floating point values) of the `coord` column isn't recognized, but that column isn't used for anything here, so it doesn't matter.
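
As we understand it, the `divisions` argument turns N+1 boundary values into N range queries on the index column, one per partition. The pure-Python sketch below illustrates that idea only. It is not Dask's code, `partition_predicates` is a hypothetical helper, and the exact SQL that Dask generates may differ.

```python
def partition_predicates(index_col, divisions):
    """Return one WHERE-clause string per partition for the given divisions.

    Illustrative only: every partition is half-open [lower, upper) except
    the last, which also includes its upper bound.
    """
    if list(divisions) != sorted(divisions):
        raise ValueError("divisions must be sorted in increasing order")
    predicates = []
    last = len(divisions) - 2  # index of the final partition
    for i, (lower, upper) in enumerate(zip(divisions[:-1], divisions[1:])):
        upper_op = "<=" if i == last else "<"
        predicates.append(
            f"{index_col} >= {lower} AND {index_col} {upper_op} {upper}"
        )
    return predicates


# Hypothetical three-partition example using tract numbers as boundaries.
demo_predicates = partition_predicates("tract", [3631, 3632, 3633, 3634])
for p in demo_predicates:
    print(p)
```

This also shows why the boundaries need to be in increasing order: a boundary that is smaller than its predecessor would describe an empty range.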

In [None]:
print(df.columns)

## Persist the data in Dask Cluster Worker memory

Dask actively purges data from memory when it's no longer needed by the Dask Task Graph currently doing the computation.

That's not what we want here, since we want to plot several quantities repeatedly.  So we explicitly tell Dask to persist this data frame.

In [None]:
df = df.persist()

## Clean and Create Color Columns
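The cleaning cut in the next cell is expressed as a magnitude error, but it can also be read as an approximate signal-to-noise cut. To first order, magerr ≈ (2.5 / ln 10) / SNR ≈ 1.086 / SNR. The sketch below does that conversion. The function names are ours, for illustration only, and the approximation becomes rough at errors as large as 0.3 mag.

```python
import math

# First-order relation between magnitude error and flux signal-to-noise:
#   magerr ~= (2.5 / ln 10) / SNR
MAGERR_PER_INVERSE_SNR = 2.5 / math.log(10)  # ~1.0857


def magerr_to_snr(magerr):
    """Approximate flux SNR corresponding to a magnitude error."""
    return MAGERR_PER_INVERSE_SNR / magerr


def snr_to_magerr(snr):
    """Approximate magnitude error corresponding to a flux SNR."""
    return MAGERR_PER_INVERSE_SNR / snr


threshold_snr = magerr_to_snr(0.3)
print(f"magerr < 0.3 mag corresponds to SNR > {threshold_snr:.1f}")
```

So requiring `magerr < 0.3` in g, r, and i keeps objects detected at roughly SNR > 3.6 in each of those bands.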

In [None]:
# Clean.  
snr_magerr_threshold = 0.3  # mag
df = df[(df.magerr_g < snr_magerr_threshold) & (df.magerr_r < snr_magerr_threshold) & (df.magerr_i < snr_magerr_threshold)]

In [None]:
df['g-r'] = df['mag_g'] - df['mag_r']
df['r-i'] = df['mag_r'] - df['mag_i']

In [None]:
gal = df[(df.extendedness > 0.95)]  
star = df[(df.extendedness < 0.95)]

## Create HoloViews `Points` objects and Wrap with Datashader

Create for position, magnitude and color.  Use datashader to provide rasterized images that display in finite time but are still zoomable.

We will first define several different HoloViews objects.
Then we will use the `+` overloading to display the visualizations.

For some more examples and information, see
https://holoviews.org/user_guide/Large_Data.html
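
Conceptually, rasterizing reduces any number of points to a fixed-size grid of per-pixel counts, so the amount of data sent to the browser does not grow with the number of rows. Zooming re-bins the points over the new ranges. The toy sketch below shows that binning step in plain Python. It is not Datashader's implementation, just the idea, and `rasterize_counts` and the sample points are made up.

```python
def rasterize_counts(xs, ys, x_range, y_range, width, height):
    """Bin points into a height x width grid of counts (row 0 = lowest y)."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # outside the current view
        # Map each coordinate to a pixel index; clamp the upper edge
        # into the last pixel.
        col = min(int((x - x0) / (x1 - x0) * width), width - 1)
        row = min(int((y - y0) / (y1 - y0) * height), height - 1)
        grid[row][col] += 1
    return grid


# Made-up points, binned onto a 2 x 2 "image" covering the unit square.
demo_x = [0.1, 0.2, 0.9, 0.8, 0.85, 1.5]
demo_y = [0.1, 0.3, 0.9, 0.7, 0.95, 0.5]
demo_grid = rasterize_counts(demo_x, demo_y, (0, 1), (0, 1), width=2, height=2)
print(demo_grid)  # [[2, 0], [0, 3]]
```

The last sample point lies outside the view and is dropped, which mirrors what happens to off-screen points after zooming in.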

In [None]:
points_ra_dec = hv.Points(df, kdims=['ra', 'dec'])
ra_dec = datashade(points_ra_dec)

In [None]:
ra_dec

In [None]:
points_mag_magerr = hv.Points(df, kdims=[hv.Dimension('mag_g', soft_range=(14, 28)),
                                         hv.Dimension('magerr_g', range=(0, snr_magerr_threshold))])
mag_magerr = datashade(points_mag_magerr)

In [None]:
points_color_mag = hv.Points(df, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                        hv.Dimension('mag_g', soft_range=(14, 28))])
color_mag = datashade(points_color_mag)

In [None]:
points_color_color = hv.Points(df, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                          hv.Dimension('r-i', soft_range=(-2, 3))])
color_color = datashade(points_color_color)

In Holoviews one uses the `+` operator to put these three different visualizations next to each other.

In [None]:
mag_magerr + color_mag + color_color

The color-magnitude and color-color plots will zoom together.  The synchronized range zooming is just based on matching the range of g-r; it's not subsetting the points in the view.  Thus there is no relationship between the RA, Dec plot and the other plots.

Related to this, I don't know how to invert the mag_g axis in the middle plot without also inverting the mag_g axis in the left plot and without breaking the shared axes selections.  One can call `color_mag.opts(invert_yaxis=True)`, but that would also invert the `mag_magerr` plot axis, which would be annoying.  Even worse, it would invert the linked selection to be 28 < mag_g < 14, which is the empty set.

## Datashade Multiple Sets

Show how to separate by the `extendedness` parameter.  Note that `extendedness` is pretty conservative, with a high false-negative rate at faint magnitudes.

In [None]:
# I have to force the range for magerr with 'range' instead of 'soft_range'.  I'm not sure why.
gal_mag_magerr = hv.Points(gal, kdims=[hv.Dimension('mag_g', soft_range=(14, 28)),
                                       hv.Dimension('magerr_g', range=(0, snr_magerr_threshold))])
gal_color_mag = hv.Points(gal, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                      hv.Dimension('mag_g', soft_range=(14, 28))])
gal_color_color = hv.Points(gal, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                        hv.Dimension('r-i', soft_range=(-2, 3))])

star_mag_magerr = hv.Points(star, kdims=[hv.Dimension('mag_g', soft_range=(14, 28)),
                                         hv.Dimension('magerr_g', range=(0, snr_magerr_threshold))])
star_color_mag = hv.Points(star, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                        hv.Dimension('mag_g', soft_range=(14, 28))])
star_color_color = hv.Points(star, kdims=[hv.Dimension('g-r', soft_range=(-2, 3)),
                                          hv.Dimension('r-i', soft_range=(-2, 3))])

typed_mag_magerr = {'gal': gal_mag_magerr, 'star': star_mag_magerr}
typed_color_mag = {'gal': gal_color_mag, 'star': star_color_mag}
typed_color_color = {'gal': gal_color_color, 'star': star_color_color}

shaded_mag_magerr = datashade(hv.NdOverlay(typed_mag_magerr, kdims='type'),
                              aggregator=ds.count_cat('type'))
shaded_color_mag = datashade(hv.NdOverlay(typed_color_mag, kdims='type'),
                             aggregator=ds.count_cat('type'))
shaded_color_color = datashade(hv.NdOverlay(typed_color_color, kdims='type'),
                               aggregator=ds.count_cat('type'))

In [None]:
shaded_mag_magerr + shaded_color_mag + shaded_color_color

The galaxies are red, while the "stars" (== not obviously extended) are blue.

Zoom in on the mag - magerr plot to see that the outlying cluster with higher uncertainties at a given magnitude consists of galaxies.

Notes:  
1. Constructing a legend for the above is unfortunately a little unobvious and awkward.  We lost the category information when we datashaded the NdOverlay.  You could do it by creating a new, essentially empty NdOverlay object to get the colors.

This is relying on the fact that the above commands used the default color map.  
Explicitly specifying the color above would have been better.
```
from datashader.colors import Sets1to3 # default datashade() and shade() color cycle
color_key = {k: Sets1to3[i] for i, k in enumerate(typed_color_mag)}
color_points = hv.NdOverlay({k: hv.Points([gal['g-r'][0], gal['mag_g'][0]],
                                          label=str(k)).options(color=v) for k, v in color_key.items()})
                                          
shaded_color_mag * color_points + shaded_color_color * color_points
```

Above code adapted from http://holoviews.org/user_guide/Large_Data.html

## Use hover-over aggregation

We can set up a dynamic hover-over that gives information about the local area.  In this case we're just doing a count of the number of points in a given rectangular region.

Note the use of the `*` to compose the results of `datashade` and `hv.util.Dynamic`.  This is the idiom in Holoviews to combine several different visualizations/tools.

In [None]:
from holoviews.streams import RangeXY

# A funnily-named wrapper function to generate hover-overs by count.
# nx, ny are fixed by the original range.  We don't get finer resolution as we zoom in.
def dynamate(points, width=400, height=400, nx=50, ny=50):
    """Datashades points at width, height.  Hover-over in dynamic boxes of nx x ny at given display size"""
    datashaded_points = datashade(points, width=width, height=height)
    hover_over_count = \
        hv.util.Dynamic(rasterize(points, width=nx, height=ny, streams=[RangeXY]),
                        operation=hv.QuadMesh)
    return datashaded_points * hover_over_count

In [None]:
points_mag_magerr = hv.Points(df, kdims=[hv.Dimension('mag_g', soft_range=(14, 28)),
                                         hv.Dimension('magerr_g', range=(0, snr_magerr_threshold))])
mag_magerr = datashade(points_mag_magerr)

In [None]:
%%opts QuadMesh [tools=['hover']] (alpha=0 hover_alpha=0.2)
dynamic_mag_magerr = dynamate(points_mag_magerr)
dynamic_color_mag = dynamate(points_color_mag)
dynamic_color_color = dynamate(points_color_color)

In [None]:
dynamic_mag_magerr + dynamic_color_mag + dynamic_color_color

## Shut down Dask Cluster

When you're done, go back to your Terminal window and log out of the interactive node.  This will both shut down the Dask Cluster and log out of your interactive node job.