# SimpleSOMSummarizer demo

Author: Sam Schmidt <br>
Last successfully run: April 26, 2023<br>

This notebook gives a quick demonstration of the `SimpleSOMSummarizer` summarization module.  Algorithmically, this module is not very different from the NZDir estimator/summarizer.  NZDir operates by finding neighboring photometric points around spectroscopic objects.  SimpleSOMSummarizer takes a large training set of data in the `Inform_SimpleSOMSummarizer` stage and trains a self-organizing map (SOM) (using code from the `minisom` package available at: https://github.com/JustGlowing/minisom).  Once the SOM is set up, the "winning"/best-fit cells are determined for both the photometric/unknown data and a set of spectroscopic data with known redshifts.  For each SOM cell, the algorithm constructs a histogram using the spectroscopic members mapped to that cell, and weights these by the number of photometric galaxies in that cell.  Both the photometric and spectroscopic datasets can also employ an optional per-galaxy weight. <br>
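
As a rough illustration of the cell-weighting idea described above, here is a minimal NumPy sketch.  It is not the RAIL implementation: the function name `som_nz`, the use of raveled cell indices, and the choice to normalize each cell histogram before weighting are all assumptions made for this example, and per-galaxy weights are ignored.

```python
import numpy as np

def som_nz(spec_cells, spec_z, phot_cells, n_cells, z_edges):
    """Hypothetical helper: per-cell spec-z histograms weighted by photometric counts."""
    # number of photometric galaxies in each (raveled) SOM cell
    phot_counts = np.bincount(phot_cells, minlength=n_cells)
    nz = np.zeros(len(z_edges) - 1)
    for cell in np.unique(spec_cells):
        if phot_counts[cell] == 0:
            continue  # no photometric galaxies to represent in this cell
        hist, _ = np.histogram(spec_z[spec_cells == cell], bins=z_edges)
        if hist.sum() == 0:
            continue  # all spec-z values fell outside the redshift grid
        # assumption: normalize the cell histogram, then weight by photometric occupancy
        nz += phot_counts[cell] * hist / hist.sum()
    return nz

# toy inputs: three cells, cell 2 has photometric data but no spec-z
spec_cells = np.array([0, 0, 1])
spec_z = np.array([0.1, 0.3, 0.7])
phot_cells = np.array([0, 0, 0, 1, 2])
nz = som_nz(spec_cells, spec_z, phot_cells, n_cells=3, z_edges=np.array([0.0, 0.5, 1.0]))
```

In this toy case the photometric galaxy in cell 2 contributes nothing to `nz`, because no spectroscopic galaxy maps to that cell.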

The summarizer also identifies SOM cells that contain photometric data but do not contain any galaxies with a measured spec-z, and thus do not have an obvious redshift estimate.  It writes out the (raveled) SOM cell indices that contain "uncovered"/uncalibratable data to the file specified by the `uncovered_cell_file` option as a list of integers.  The cellIDs and galaxy/objIDs for all photometric galaxies will be written out to the file specified by the `cellid_output` parameter.  Any galaxies in these uncovered cells should really be removed, as otherwise a bias may be present.  Some iteration may therefore be necessary in defining bin membership, by looking at the properties of objects in the uncovered cells before a final N(z) is estimated.<br>

The shape and number of cells used in constructing the SOM affect performance, as do several tuning parameters.  This paper, http://www.giscience2010.org/pdfs/paper_230.pdf, suggests that the number of cells should be of the order ~ 5 x sqrt (number of data rows x number of columns), though this is only a rough guide.  Some studies have found a 2D SOM that is more elongated in one direction to be preferable, while others claim that a square layout is optimal.  The user can set the number of cells in each SOM dimension via the `n_dim` and `m_dim` parameters.  For more discussion on SOMs see the Appendices of this KiDS paper:  http://arxiv.org/abs/1909.09632.
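
To make the rule of thumb concrete, the short sketch below evaluates it for a training set like the one used later in this notebook.  The inputs are assumptions for illustration: ~150,000 rows and five columns (the five adjacent-band colors built from six LSST bands), taking the guideline exactly as quoted above.

```python
import math

n_rows = 150_000   # approximate training-set size quoted in this notebook
n_cols = 5         # assumed SOM inputs: five colors from the six ugrizy bands

n_cells = 5 * math.sqrt(n_rows * n_cols)   # rule-of-thumb number of cells
side = math.ceil(math.sqrt(n_cells))       # side length of an equivalent square SOM
print(f"suggested cells: {n_cells:.0f}, i.e. roughly {side} x {side}")
```

This gives about 4330 cells, or a square SOM of roughly 66 x 66, which is the same order as the 71 x 71 layout chosen in this demo.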

As with the other RAIL summarizers, we bootstrap the spectroscopic sample and return N bootstraps in an ensemble, along with a single fiducial N(z) estimate.<br>
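
For readers unfamiliar with bootstrapping, the sketch below shows the basic idea on stand-in data: resample the spectroscopic galaxies with replacement and rebuild a histogram for each resample.  The redshift values, bin edges, and variable names are invented for illustration, and the real summarizer applies the resampling within its SOM-based estimate rather than to a plain histogram.

```python
import numpy as np

rng = np.random.default_rng(42)
spec_z = rng.uniform(0.2, 0.5, size=1000)   # stand-in spectroscopic redshifts
z_edges = np.linspace(0.0, 3.0, 102)        # 101 histogram bins
n_boot = 25                                 # number of bootstrap realizations

boot_hists = np.empty((n_boot, len(z_edges) - 1))
for b in range(n_boot):
    # draw galaxy indices with replacement, keeping the sample size fixed
    idx = rng.integers(0, len(spec_z), size=len(spec_z))
    boot_hists[b], _ = np.histogram(spec_z[idx], bins=z_edges)

# scatter between realizations gives a per-bin uncertainty estimate
per_bin_scatter = boot_hists.std(axis=0)
```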

More specific details of the algorithm's set up will be described in the course of this notebook, along with some illustrative plots.

Let's set up our dependencies:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pickle
import rail
import qp
import os
import tables_io
from rail.core.data import TableHandle
from rail.core.stage import RailStage
from minisom import MiniSom
from rail.core.utils import RAILDIR

Next, let's set up the Data Store, so that our RAIL module will know where to fetch data:

In [None]:
DS = RailStage.data_store
DS.__class__.allow_overwrite = True

First, let's grab some data files.  For the SOM, we will want to train on a fairly large, representative set that encompasses all of our expected data.  We'll grab a larger data file than we typically use in our demos to ensure that we construct a meaningful SOM.

This data consists of ~150,000 galaxies from a single healpix pixel of the cosmoDC2 truth catalog with mock 10-year magnitude errors added.  It is cut at a relatively bright i<23.5 magnitudes in order to concentrate on galaxies with particularly high S/N.

In [None]:
training_file = "./healpix_10326_bright_data.hdf5"

if not os.path.exists(training_file):
  os.system('curl -O https://portal.nersc.gov/cfs/lsst/schmidt9/healpix_10326_bright_data.hdf5')

In [None]:
# way to get big data file
training_file = "./healpix_10326_bright_data.hdf5"
training_data = DS.read_file("training_data", TableHandle, training_file)

Now, let's set up the inform stage for our summarizer

In [None]:
from rail.estimation.algos.simpleSOM import Inform_SimpleSOMSummarizer

We need to define all of our necessary initialization params, which includes the following:<br>
`name` (str): the name of our estimator, as utilized by ceci<br>
`model` (str): the name for the model file containing the SOM and associated parameters that will be written by this stage<br>
`hdf5_groupname` (str): name of the hdf5 group (if any) where the photometric data resides in the training file<br>
`n_dim` (int): the number of cells in the x-direction for our 2D SOM<br>
`m_dim` (int): the number of cells in the y-direction for our 2D SOM<br>
`som_iterations` (int): the number of iteration steps during SOM training.  SOMs can take a while to converge, so we will use a fairly large number of 500,000 iterations.<br>
`som_sigma` (float): the "radius" of how far to spread changes in the SOM <br>
`som_learning_rate` (float): a number between 0 and 1 that controls how quickly the weighting function decreases.  SOMs are not guaranteed to converge mathematically, and so this parameter tunes how the response drops per iteration.  Typical values we might use are between 0.5 and 0.75.<br>
`column_usage` (str):  this value determines what values will be used to construct the SOM; valid choices are `colors`, `magandcolors`, and `columns`.  If set to `colors`, the code will take adjacent columns as specified in `usecols` to construct colors and use those as SOM inputs.  If set to `magandcolors` it will use the single column specified by `ref_column_name` and the aforementioned colors to construct the SOM.  If set to `columns` then it will simply take each of the columns in `usecols` with no modification.  So, if a user wants to use K magnitudes and L colors, they can precompute the colors and specify all names in `usecols`.  NOTE: accompanying `usecols` you must have a `nondetect_val` dictionary that lists the replacement values for any non-detection-valued entries for each column; see the code for an example dictionary.  We will set `column_usage` to `colors` and use only colors in this example notebook.

In [None]:
inform_dict = dict(model='output_SOM_model.pkl', hdf5_groupname='photometry',
                   n_dim=71, m_dim=71, som_iterations=500_000,
                   som_sigma=12.0, som_learning_rate=0.75,
                   column_usage='colors')

In [None]:
inform_som = Inform_SimpleSOMSummarizer.make_stage(name='inform_som', **inform_dict)

Let's run our stage, which will write out a file called `output_SOM_model.pkl`

In [None]:
%%time
inform_som.inform(training_data)

Running the stage took ~4 minutes wall time on a desktop Mac and ~5 minutes on a Mac laptop.  Remember, however, that in many production cases we would likely load a pre-trained SOM specifically tuned to the given dataset, and this inform stage would not be run each time.<br>
Let's read in the SOM model file, which contains our som model and several of the parameters used in constructing the SOM, and needed by our summarization model.

In [None]:
with open("output_SOM_model.pkl", "rb") as f:
    model = pickle.load(f)

In [None]:
model.keys()

To visualize our SOM, let's calculate the cell occupation of our training sample, as well as the mean redshift of the galaxies in each cell.  The SOM took colors as inputs, so we will need to construct the colors for our training set galaxies:

In [None]:
bands = ['u','g','r','i','z','y']
bandnames = [f"mag_{band}_lsst" for band in bands]
ngal = len(training_data.data['photometry']['mag_i_lsst'])
colors = np.zeros([5, ngal])
for i in range(5):
    colors[i] = training_data.data['photometry'][bandnames[i]] - training_data.data['photometry'][bandnames[i+1]]

We can calculate the best SOM cell using the SOM.winner() method from minisom, which will return the 2D SOM coordinates for each galaxy, and then use these for our visualizations:

In [None]:
SOM = model['som']
winner_coordinates = np.array([SOM.winner(x) for x in colors.T]).T
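
For intuition about what a "winning" cell is, here is a small NumPy sketch of a best-matching-unit search.  It assumes a Euclidean distance and a weight array of shape `(n_dim, m_dim, n_features)`; the function name `best_matching_unit` and the toy weights are made up for this example, and `minisom` handles all of this internally.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Hypothetical helper: (i, j) of the cell whose weight vector is closest to x."""
    # weights: (n_dim, m_dim, n_features); x: (n_features,)
    dist = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dist), dist.shape)

# toy 3 x 4 map with two input features; only one cell is near the test point
weights = np.zeros((3, 4, 2))
weights[2, 1] = [1.0, -0.5]
bmu = best_matching_unit(weights, np.array([0.9, -0.4]))
```

Here `bmu` picks out cell (2, 1), the only cell whose weights lie close to the input vector.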

In [None]:
winner_coordinates

In [None]:
n_dim=71
m_dim=71
meanszs = np.zeros((n_dim,m_dim))
cellocc = np.zeros((n_dim,m_dim))
for i in range(n_dim):
    for j in range(m_dim):
        mask = ((winner_coordinates[0] == i) & (winner_coordinates[1]==j))
        szs = training_data.data['photometry']['redshift'][mask]
        if np.sum(mask)>0:
            meanszs[i,j] = np.mean(szs)
            cellocc[i,j] = len (szs)
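
The double loop above builds a boolean mask for every cell, which becomes slow for large maps.  A possible vectorized alternative is sketched below; the helper name `cell_stats` is an assumption, and it returns integer occupation counts rather than the float array used above.

```python
import numpy as np

def cell_stats(rows, cols, values, shape):
    """Hypothetical helper: occupation and per-cell mean of `values` on a 2D grid."""
    flat = np.ravel_multi_index((rows, cols), shape)
    n_cells = shape[0] * shape[1]
    occ = np.bincount(flat, minlength=n_cells)
    total = np.bincount(flat, weights=values, minlength=n_cells)
    # leave empty cells at zero, matching the behaviour of the loop above
    mean = np.divide(total, occ, out=np.zeros(n_cells), where=occ > 0)
    return occ.reshape(shape), mean.reshape(shape)

# toy check on a 2 x 2 grid
occ, mean = cell_stats(np.array([0, 0, 1]), np.array([1, 1, 0]),
                       np.array([0.2, 0.4, 1.0]), (2, 2))
```

In this notebook it could be called as `cell_stats(winner_coordinates[0], winner_coordinates[1], training_data.data['photometry']['redshift'], (n_dim, m_dim))`.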

Here is the cell occupation distribution:

In [None]:
plt.figure(figsize=(12,10))
plt.imshow(cellocc.T,cmap='jet')
cbar = plt.colorbar()
cbar.set_label(label='cell occupation', size=15)
plt.clim((0,150))

And here is the mean redshift per cell:

In [None]:
plt.figure(figsize=(12,10))
plt.imshow(meanszs.T,cmap='jet')
cbar = plt.colorbar()
cbar.set_label(label='mean redshift', size=15)

Note that there is spatial correlation between redshift and cell position, which is good: it shows that redshift changes gradually between similarly-colored galaxies (and sometimes abruptly, when degeneracies are present).

Now that we have illustrated what exactly we have constructed, let's use the SOM to predict the redshift distribution for a set of photometric objects.  We will make a simple cut in spectroscopic redshift to create a compact redshift bin.  In more realistic circumstances we would likely be using color cuts or photometric redshift estimates to define our test bin(s).  We will cut our photometric sample to only include galaxies in 0.2<specz<0.5.

We will also make a second, "bright" version of this bin and trim our spec-z set, both cut to i<23.5 to match our trained SOM:

In [None]:
testfile = os.path.join(RAILDIR, 'rail/examples_data/testdata/test_dc2_training_9816.hdf5')
data = tables_io.read(testfile)['photometry']
mask = ((data['redshift'] > 0.2) & (data['redshift']<0.5))
brightmask = ((mask) & (data['mag_i_lsst']<23.5))
trim_data = {}
bright_data = {}
for key in data.keys():
    trim_data[key] = data[key][mask]
    bright_data[key] = data[key][brightmask]
trimdict = dict(photometry=trim_data)
brightdict = dict(photometry=bright_data)
# add data to data store
test_data = DS.add_data("tomo_bin", trimdict, TableHandle)
bright_data = DS.add_data("bright_bin", brightdict, TableHandle)

In [None]:
specfile = os.path.join(RAILDIR, "rail/examples_data/testdata/test_dc2_validation_9816.hdf5")
spec_data = tables_io.read(specfile)['photometry']
smask = (spec_data['mag_i_lsst'] <23.5)
trim_spec = {}
for key in spec_data.keys():
    trim_spec[key] = spec_data[key][smask]
trim_dict = dict(photometry=trim_spec)
spec_data = DS.add_data("spec_data", trim_dict, TableHandle)

Note that we have kept the data inside a 'photometry' group in each dictionary, so we will specify `hdf5_groupname` and `spec_groupname` as "photometry" in the parameters below.<br>
As before, let us specify our initialization params for the SimpleSOMSummarizer stage, including:<br>
`model`: name of the pickled model that we created, in this case "output_SOM_model.pkl"<br>
`hdf5_groupname` (str): hdf5 group for our photometric data (in our case "photometry")<br>
`objid_name` (str): name of the ID column; if present in the photometric data, the IDs will be written out to the `cellid_output` file<br>
`spec_groupname` (str): hdf5 group for the spectroscopic data<br>
`nzbins` (int): number of bins to use in our histogram ensemble<br>
`nsamples` (int): number of bootstrap samples to generate<br>
`output` (str): name of the output qp file with N samples<br>
`single_NZ` (str): name of the qp file with fiducial distribution<br>
`uncovered_cell_file` (str): name of hdf5 file containing a list of all of the cells with phot data but no spec-z objects: photometric objects in these cells will *not* be accounted for in the final N(z), and should really be removed from the sample before running the summarizer.  Note that we return a single integer that is constructed from the pairs of SOM cell indices via `np.ravel_multi_index`(indices).<br>
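
As a quick reminder of how `np.ravel_multi_index` packs a pair of cell indices into one integer, and how `np.unravel_index` recovers the pair, here is a small standalone example.  The 71 x 71 shape matches the SOM used in this notebook; the specific cells are arbitrary, and the ordering of the two indices inside RAIL is an assumption here.

```python
import numpy as np

som_shape = (71, 71)            # (n_dim, m_dim) used in this notebook
rows = np.array([0, 3, 70])     # arbitrary example cells
cols = np.array([0, 5, 70])

# default (row-major) order: flat = row * m_dim + col
flat = np.ravel_multi_index((rows, cols), som_shape)
# invert the mapping to get the 2D coordinates back
back_rows, back_cols = np.unravel_index(flat, som_shape)
```

With this shape, cell (3, 5) maps to 3 * 71 + 5 = 218, and the last cell (70, 70) maps to 5040.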

In [None]:
summ_dict = dict(model="output_SOM_model.pkl", hdf5_groupname='photometry',
                 spec_groupname='photometry', nzbins=101, nsamples=25,
                 output='SOM_ensemble.hdf5', single_NZ='fiducial_SOM_NZ.hdf5',
                 uncovered_cell_file='all_uncovered_cells.hdf5',
                 objid_name='id',
                 cellid_output='output_cellIDs.hdf5')

Now let's initialize and run the summarizer.  One feature of the SOM: if any SOM cells contain photometric data but do not contain any redshift values in the spectroscopic set, then no reasonable redshift estimate for those objects is defined, and they are skipped.  The method currently prints the indices of uncovered cells; we may modify the algorithm to actually output the uncovered galaxies in a separate file in the future.
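
The notion of an uncovered cell can be sketched with plain NumPy set operations.  The arrays below are invented raveled cell indices, not output from the summarizer, and the variable names are illustrative only.

```python
import numpy as np

phot_cells = np.array([4, 4, 7, 9, 9, 12])   # raveled cell index per photometric galaxy
spec_cells = np.array([4, 9, 9, 15])         # raveled cell index per spec-z galaxy

# cells occupied by photometric galaxies but by no spectroscopic galaxy
uncovered = np.setdiff1d(phot_cells, spec_cells)
# flag the photometric galaxies that sit in those cells
in_uncovered = np.isin(phot_cells, uncovered)
```

A mask like `in_uncovered` is the kind of selection one could use to remove uncalibratable galaxies from a sample before estimating N(z).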

In [None]:
from rail.estimation.algos.simpleSOM import SimpleSOMSummarizer
som_summarizer = SimpleSOMSummarizer.make_stage(name='SOM_summarizer', **summ_dict)

In [None]:
som_summarizer.summarize(test_data, spec_data)

Let's open the fiducial N(z) file, plot it, and see how it looks, and compare it to the true tomographic bin file:

In [None]:
fid_ens = qp.read("fiducial_SOM_NZ.hdf5")

In [None]:
fig, axs = plt.subplots(1,2, figsize=(25,8))
fid_ens.plot_native(axes=axs[0],label="SOM estimate")
axs[0].set_xlabel("redshift", fontsize=15)
axs[0].set_ylabel("N(z)", fontsize=15)
axs[1].hist(test_data.data['photometry']['redshift'], bins=np.linspace(0,3,101));
axs[1].set_xlabel("redshift", fontsize=15)
axs[1].set_ylabel("N(z)", fontsize=15);

Not great: the estimate covers roughly the correct redshift range for the lower redshift peak, but shows a large secondary peak at ~1.0<z<1.5.  What if we try the bright dataset that we made?

In [None]:
bright_dict = dict(model="output_SOM_model.pkl", hdf5_groupname='photometry',
                   spec_groupname='photometry', nzbins=101, nsamples=25,
                   output='BRIGHT_SOM_ensemble.hdf5', single_NZ='BRIGHT_fiducial_SOM_NZ.hdf5',
                   uncovered_cell_file="BRIGHT_uncovered_cells.hdf5",
                   objid_name='id',
                   cellid_output='BRIGHT_output_cellIDs.hdf5')
bright_summarizer = SimpleSOMSummarizer.make_stage(name='bright_summarizer', **bright_dict)

In [None]:
bright_summarizer.summarize(bright_data, spec_data)

In [None]:
bright_fid_ens = qp.read("BRIGHT_fiducial_SOM_NZ.hdf5")

In [None]:
fig, axs = plt.subplots(1,2, figsize=(25,8))
bright_fid_ens.plot_native(axes=axs[0],label="SOM estimate")
axs[0].set_xlabel("redshift", fontsize=15)
axs[0].set_ylabel("N(z)", fontsize=15)
axs[0].legend(loc='upper right', fontsize=15)
axs[1].hist(bright_data.data['photometry']['redshift'], bins=np.linspace(0,3,101),label='true N(z)');
axs[1].set_xlabel("redshift", fontsize=15)
axs[1].set_ylabel("N(z)", fontsize=15)
axs[1].legend(loc='upper right', fontsize=15);

Slightly better, we've eliminated the secondary peak.  Now, SOMs are a bit touchy to train, and are highly dependent on the dataset used to train them.  This demo used a relatively small dataset (~150,000 DC2 galaxies from one healpix pixel) to train the SOM, and even smaller photometric and spectroscopic datasets of 10,000 and 20,000 galaxies.  We should expect slightly better results with more data, at least in cells where the spectroscopic data is representative.

However, there is a caveat that SOMs are not guaranteed to converge, and are very sensitive to both the input data and tunable parameters of the model.  So, users should do some verification tests before trusting the SOM is going to give accurate results.
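
One simple verification test is the quantization error: the mean distance between each input vector and the weights of its best-matching cell.  The NumPy sketch below assumes Euclidean distances and a weight array of shape `(n_dim, m_dim, n_features)`; it builds the full sample-by-cell distance matrix, so for a large training set it should be applied to a subsample or in chunks.  The `minisom` package also provides its own `quantization_error` method on a trained map, which is likely the more convenient route in practice.

```python
import numpy as np

def quantization_error(weights, data):
    """Mean Euclidean distance between each sample and its best-matching cell."""
    flat_w = weights.reshape(-1, weights.shape[-1])   # (n_cells, n_features)
    # pairwise distances, shape (n_samples, n_cells)
    dist = np.linalg.norm(data[:, None, :] - flat_w[None, :, :], axis=-1)
    return dist.min(axis=1).mean()

# toy 2 x 2 map with one feature and two samples
toy_weights = np.arange(4, dtype=float).reshape(2, 2, 1)
toy_data = np.array([[0.1], [2.5]])
qe = quantization_error(toy_weights, toy_data)
```

Tracking this number while varying `som_iterations`, `som_sigma`, or `som_learning_rate` is one way to check whether the map has stopped improving.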

Finally, let's load up our bootstrap ensembles and plot the first six to see what kind of variation we see:

In [None]:
boot_ens = qp.read("BRIGHT_SOM_ensemble.hdf5")

In [None]:
fig=plt.figure(figsize=(20,15))
for i in range(6):
    ax = plt.subplot(2,3,i+1)
    ax.set_xlim((0,3))
    boot_ens[i].plot_native(axes=ax, label=f'SOM bootstrap {i}')
    ax.set_xlabel("redshift", fontsize=15)
    ax.set_ylabel("bootstrap N(z)", fontsize=15)
    ax.legend(loc='upper right', fontsize=13);

# quantitative metrics

Let's look at how we've done at estimating the mean redshift and "width" (via standard deviation) of our tomographic bin compared to the true redshift and "width" for both our "full" sample and "bright" i<23.5 samples.  We will plot the mean and std dev for the full and bright distributions compared to the true mean and width, and show the Gaussian uncertainty approximation given the scatter in the bootstraps for the mean:

In [None]:
from scipy.stats import norm

In [None]:
full_ens = qp.read("SOM_ensemble.hdf5")
full_means = full_ens.mean().flatten()
full_stds = full_ens.std().flatten()
true_full_mean = np.mean(test_data.data['photometry']['redshift'])
true_full_std = np.std(test_data.data['photometry']['redshift'])
# mean and width of bootstraps
full_mu = np.mean(full_means)
full_sig = np.std(full_means)
full_norm = norm(loc=full_mu, scale=full_sig)
grid = np.linspace(0, .7, 301)
full_uncert = full_norm.pdf(grid)*2.51*full_sig
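
A note on the factor of 2.51 used above: it appears to be a rounded sqrt(2*pi) ~ 2.5066, which rescales a Gaussian pdf so that its peak is 1 and therefore matches the height of the vertical lines in the plots.  The sketch below checks this with illustrative values of `mu` and `sig` that are not results from this notebook.

```python
import numpy as np

mu, sig = 0.35, 0.01                 # illustrative values only
grid = np.linspace(0, 0.7, 701)      # grid that includes the peak position
# Gaussian pdf written out explicitly
pdf = np.exp(-0.5 * ((grid - mu) / sig) ** 2) / (np.sqrt(2 * np.pi) * sig)
# multiplying by sqrt(2*pi)*sig removes the normalization, leaving a unit peak
scaled = pdf * np.sqrt(2 * np.pi) * sig
```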

In [None]:
# mean and std dev of each bootstrap N(z) for the bright sample
bright_means = boot_ens.mean().flatten()
bright_stds = boot_ens.std().flatten()
true_bright_mean = np.mean(bright_data.data['photometry']['redshift'])
true_bright_std = np.std(bright_data.data['photometry']['redshift'])
# mean and width of bootstraps
bright_mu = np.mean(bright_means)
bright_sig = np.std(bright_means)
bright_norm = norm(loc=bright_mu, scale=bright_sig)
# Gaussian approximation to the uncertainty on the mean, rescaled to peak near 1
bright_uncert = bright_norm.pdf(grid)*2.51*bright_sig

In [None]:
plt.figure(figsize=(12,18))
ax0 = plt.subplot(2, 1, 1)
ax0.set_xlim(0.0, 0.7)
ax0.axvline(true_full_mean, color='r', lw=3, label='true mean full sample')
ax0.vlines(full_means, ymin=0, ymax=1, color='r', ls='--', lw=1, label='bootstrap means')
ax0.axvline(true_full_std, color='b', lw=3, label='true std full sample')
ax0.vlines(full_stds, ymin=0, ymax=1, lw=1, color='b', ls='--', label='bootstrap stds')
ax0.plot(grid, full_uncert, c='k', label='full mean uncertainty')
ax0.legend(loc='upper right', fontsize=12)
ax0.set_xlabel('redshift', fontsize=12)
ax0.set_title('mean and std for full sample', fontsize=12)

ax1 = plt.subplot(2, 1, 2)
ax1.set_xlim(0.0, 0.7)
ax1.axvline(true_bright_mean, color='r', lw=3, label='true mean bright sample')
ax1.vlines(bright_means, ymin=0, ymax=1, color='r', ls='--', lw=1, label='bootstrap means')
ax1.axvline(true_bright_std, color='b', lw=3, label='true std bright sample')
ax1.plot(grid, bright_uncert, c='k', label='bright mean uncertainty')
ax1.vlines(bright_stds, ymin=0, ymax=1, ls='--', lw=1, color='b', label='bootstrap stds')
ax1.legend(loc='upper right', fontsize=12)
ax1.set_xlabel('redshift', fontsize=12)
ax1.set_title('mean and std for bright sample', fontsize=12);

We see that the mean (red) and std dev (blue) estimates are quite biased compared to the truth in both cases.  This is not unexpected for the simple true redshift cut and small sample sizes used in this demo.  The std dev estimate is much closer to the truth for the bright sample due to the elimination of the false secondary redshift peak.  However, we are still well above tolerances for any cosmological analysis: actual tomographic samples will need both tight color selections and larger training sets than those used in this demo in order to properly calibrate our redshift distributions.