# TXPipe - CLMM Data Preparation

This notebook runs and explores two pipelines that generate the weighted, calibrated, per-cluster background shear catalogs as inputs to CLMM.  The instructions for running this on IN2P3 and on NERSC will differ.

On IN2P3 -- before starting you will need to:
- set up the TXPipe environment at CC-IN2P3 using the command 
`source /pbs/throng/lsst/users/jzuntz/txpipe-environments/setup-txpipe`
- clone the TXPipe repository somewhere. The recommendation is to keep code outside the sps folder and big data files inside it. The data folder inside TXPipe can hold large files, and you may generate more, so additional data (not tracked by git) should ideally live in the SPS space and be soft-linked into the TXPipe data directory.
- download the two input catalogs: [1 square degree](https://portal.nersc.gov/cfs/lsst/txpipe/data/example.tar.gz) and [20 square degrees](https://portal.nersc.gov/cfs/lsst/txpipe/data/cosmodc2-20deg2.tar.gz) and extract them in your TXPipe clone directory.
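
The soft link mentioned in the list above could be created from Python roughly as follows. This is only a sketch of one reading of that recommendation: the helper name `link_into_data` and all paths are hypothetical, and the demonstration runs in a throwaway directory rather than a real TXPipe clone.

```python
import os
import tempfile

def link_into_data(txpipe_dir, storage_dir, name):
    """Create txpipe_dir/data/name as a symlink to storage_dir (hypothetical helper)."""
    os.makedirs(storage_dir, exist_ok=True)
    data_dir = os.path.join(txpipe_dir, "data")
    os.makedirs(data_dir, exist_ok=True)
    link_path = os.path.join(data_dir, name)
    if os.path.islink(link_path):
        # Already linked: leave it alone
        return link_path
    if os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink; move it first")
    os.symlink(storage_dir, link_path, target_is_directory=True)
    return link_path

# Demonstration with invented paths inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    clone = os.path.join(tmp, "TXPipe")
    storage = os.path.join(tmp, "sps", "txpipe-data")
    link = link_into_data(clone, storage, "big_outputs")
    link_ok = os.path.islink(link) and os.path.realpath(link) == os.path.realpath(storage)
    print(link_ok)
```

On a real system, `storage_dir` would be your own directory in the SPS space and `txpipe_dir` your TXPipe clone.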

On NERSC -- make sure you work through the installation and data downloading instructions on the README:
- [Install TXPipe](https://github.com/LSSTDESC/TXPipe#installing)
- [Download](https://github.com/LSSTDESC/TXPipe#running) example data

You should then be able to execute the cells below in the **1 deg$^2$ Sample** section using the TXPipe kernel on NERSC.

In [None]:
import os
from pprint import pprint
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from IPython.display import Image
import ceci

# 1 deg$^2$ Sample

First we will do some runs on the 1 deg$^2$ example data set with around 80k galaxies. This is small enough that we can do it all in Jupyter.

The data set, which is based on CosmoDC2, contains pre-computed photo-z and a RedMapper cluster catalog for the field.

We will clone our own copy of the TXPipe directory, and run this notebook from there.  **Please change `my_txpipe_dir`** to your own version of the path when running this:

In [None]:
# my_txpipe_dir = "/pscratch/sd/a/avestruz/TXPipe"
my_txpipe_dir = "/pbs/throng/lsst/users/ccombet/TXPipe"
os.chdir(my_txpipe_dir)

import txpipe
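
If several people share this notebook, one way to avoid editing the path by hand is to read it from an environment variable. The sketch below assumes a variable called `TXPIPE_DIR`, which is a hypothetical name chosen here and not something TXPipe defines.

```python
import os

def resolve_txpipe_dir(default, env_var="TXPIPE_DIR"):
    """Return the TXPipe clone path, preferring an environment variable if it is set."""
    path = os.environ.get(env_var, default)
    return os.path.abspath(os.path.expanduser(path))

# Falls back to ~/TXPipe (an invented default) when TXPIPE_DIR is not set
my_txpipe_dir = resolve_txpipe_dir("~/TXPipe")
print(my_txpipe_dir)
```

The result can then be passed to `os.chdir` exactly as in the cell above.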

Now we make an output directory for everything, if it doesn't exist already.

In [None]:
os.makedirs("data/example/outputs_metadetect", exist_ok=True)

In [None]:
if not os.path.exists("data/example/inputs/metadetect_shear_catalog.hdf5"):
    raise RuntimeError("Download and extract the sample data file to continue")
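
The same check can be extended to every input file used in this section. The sketch below lists the paths that appear in the cells of this notebook; the helper `missing_inputs` is a name introduced here for illustration.

```python
import os

def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Input files referenced by the stages in the 1 deg^2 section
required = [
    "data/example/inputs/metadetect_shear_catalog.hdf5",
    "data/example/inputs/sample_cosmodc2_w10year_errors.dat",
    "data/example/inputs/cluster_catalog.hdf5",
    "data/example/inputs/photoz_pdfs.hdf5",
    "data/fiducial_cosmology.yml",
]

missing = missing_inputs(required)
if missing:
    print("Missing input files:")
    for p in missing:
        print("  ", p)
```

Reporting all missing files at once saves a round trip compared with failing on the first one.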

---

## WL sample selection

Our first step is the WL sample selection. This does both selection and tomography. The latter is not used here.

In [None]:
step1 = txpipe.TXSourceSelectorMetadetect.make_stage(
    # This file is the input metadetect shear catalog
    shear_catalog="data/example/inputs/metadetect_shear_catalog.hdf5",
    # This is an input training set for the tomographic selection
    calibration_table="data/example/inputs/sample_cosmodc2_w10year_errors.dat",

    # This contains all the options for this stage. You can override them here
    # manually too.
    config="examples/metadetect/config.yml",

    # This is the output file for this stage
    shear_tomography_catalog="data/example/outputs_metadetect/shear_tomography_catalog.hdf5"
)

This step will first train a classifier that assigns objects to tomographic bins, and then run it on the input data files to produce the output file:

In [None]:
step1.run()
step1.finalize()

---

## Cluster shear catalog indexing and weights

Our second step runs the matching to find the shear catalog behind every cluster.

This step saves a cluster shear catalog, which is actually just an index into the shear and cluster catalogs (to avoid making many copies of the data), with added weights from CLMM.

In [None]:
print("Options for this pipeline and their defaults:")
print(txpipe.extensions.CLClusterShearCatalogs.config_options)

step2 = txpipe.extensions.CLClusterShearCatalogs.make_stage(
    # Shear catalog, as before
    shear_catalog="data/example/inputs/metadetect_shear_catalog.hdf5",
    # This is the initial cluster catalog - RAs, Decs, richness, redshift, etc.
    cluster_catalog="./data/example/inputs/cluster_catalog.hdf5",
    # This fiducial cosmology is used to convert between angular separations and physical distances
    fiducial_cosmology="./data/fiducial_cosmology.yml",
    # The tomography catalog created in step 1 selects objects for the WL sample
    # and assigns them to tomographic bins. We don't need the tomography here, just the basic selection
    shear_tomography_catalog="data/example/outputs_metadetect/shear_tomography_catalog.hdf5",
    # This is a QP file created by RAIL to generate the photo-zs for this sample
    source_photoz_pdfs="data/example/inputs/photoz_pdfs.hdf5",

    # This is the output for this stage
    cluster_shear_catalogs="my_cluster_shear_catalog.hdf5",

    # This contains all the options for this stage. You can override them here, as we do with the max_radius below.
    config="examples/metadetect/config.yml",    
    # Let's override one of the configuration parameters for this stage:
    max_radius=5.0
)


In [None]:
step2.run()
step2.finalize()

Our third step takes the cluster shear catalog from step 2 and builds cluster profiles, which are saved to the `cluster_profiles` output.

In [None]:
step3 = txpipe.extensions.CLClusterEnsembleProfiles.make_stage(
    # Shear catalog, as before
    shear_catalog="data/example/inputs/metadetect_shear_catalog.hdf5",
    # This is the initial cluster catalog - RAs, Decs, richness, redshift, etc.
    cluster_catalog="./data/example/inputs/cluster_catalog.hdf5",
    # This fiducial cosmology is used to convert between angular separations and physical distances
    fiducial_cosmology="./data/fiducial_cosmology.yml",
    # The tomography catalog created in step 1 selects objects for the WL sample
    # and assigns them to tomographic bins. We don't need the tomography here, just the basic selection
    shear_tomography_catalog="data/example/outputs_metadetect/shear_tomography_catalog.hdf5",
    # This is a QP file created by RAIL to generate the photo-zs for this sample
    source_photoz_pdfs="data/example/inputs/photoz_pdfs.hdf5",
    # This is the cluster shear index catalog written by step 2
    cluster_shear_catalogs="my_cluster_shear_catalog.hdf5",
    # This is the output for this stage
    cluster_profiles="./my_cluster_ensemble_object.hdf5",
)

In [None]:
step3.run()
step3.finalize()

## Exploring the index

To avoid making lots and lots of copies of the data, this stage has not made a catalog, but instead made an index into the other catalogs, and stored only the relevant weight.
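
The idea of storing an index plus a weight, instead of a copy of the catalog, can be illustrated with a toy example. All values below are invented, and the dictionary layout is only a stand-in for the real HDF5 files.

```python
import numpy as np

# Toy "shear catalog" with five galaxies (values invented for illustration)
shear_cat = {
    "ra": np.array([10.0, 10.1, 10.2, 10.3, 10.4]),
    "g1": np.array([0.01, -0.02, 0.03, 0.00, 0.02]),
}

# Toy per-cluster index: which rows lie behind the cluster, plus one weight each
cluster_index = np.array([1, 3, 4])
cluster_weight = np.array([0.5, 1.0, 0.8])

# Build the per-cluster catalog only when it is needed
bg = {name: col[cluster_index] for name, col in shear_cat.items()}
bg["weight"] = cluster_weight
print(bg["ra"], bg["weight"])
```

Only the integer index and the weights need to be stored per cluster, so a galaxy that lies behind many clusters is never duplicated on disk.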

We have a helper class which is designed to match up all the different catalogs that go into this and collect the results for each cluster.

In [None]:
ccc = txpipe.extensions.CombinedClusterCatalog(
    shear_catalog="data/example/inputs/metadetect_shear_catalog.hdf5",
    shear_tomography_catalog="data/example/outputs_metadetect/shear_tomography_catalog.hdf5",
    cluster_catalog="./data/example/inputs/cluster_catalog.hdf5",
    cluster_shear_catalogs="my_cluster_shear_catalog.hdf5",
    photoz_pdfs="data/example/inputs/photoz_pdfs.hdf5",
)

In [None]:
print(f"Have {ccc.ncluster} clusters")

We can extract the cluster catalog info by index (0 -- 74):

In [None]:
cluster_info = ccc.get_cluster_info(0)
cluster_info

We can also extract the shear catalog associated with that cluster, again by index, in the CLMM data format:

In [None]:
bg_cat = ccc.get_background_shear_catalog(0)
bg_cat[0:100]

Since our field is so small here (1 deg$^2$), the background catalog may be cut off at the edges:

In [None]:
plt.scatter(bg_cat['ra'], bg_cat['dec'], c=bg_cat['distance_arcmin'], s=1)
plt.plot(cluster_info['ra'], cluster_info['dec'], 'r*', markersize=10)
plt.colorbar()
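
Whether a given cluster is affected by the field edge can also be estimated numerically. The sketch below uses a flat-sky approximation and a rectangular field; the function name and the field bounds are invented for illustration and are not taken from the example data.

```python
import math

def touches_field_edge(ra, dec, radius_arcmin, ra_min, ra_max, dec_min, dec_max):
    """True if a circle around (ra, dec) extends beyond a rectangular field.

    All positions are in degrees. Uses a flat-sky approximation, which is
    adequate for small fields away from the poles and from RA wrap-around.
    """
    r_deg = radius_arcmin / 60.0
    # A given angular separation spans a larger RA interval away from the equator
    r_ra = r_deg / math.cos(math.radians(dec))
    return (
        ra - r_ra < ra_min or ra + r_ra > ra_max
        or dec - r_deg < dec_min or dec + r_deg > dec_max
    )

# Invented 1 deg x 1 deg field with a 10 arcmin search radius
print(touches_field_edge(60.5, -35.5, 10.0, 60.0, 61.0, -36.0, -35.0))   # centre of field
print(touches_field_edge(60.05, -35.5, 10.0, 60.0, 61.0, -36.0, -35.0))  # close to the RA edge
```

In practice the cluster position would come from `cluster_info['ra']` and `cluster_info['dec']`, and the radius from the largest value of `bg_cat['distance_arcmin']`.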

We can also look at the redshift-weight calculation result for this background sample to check it makes sense:

In [None]:
plt.plot(bg_cat['zmean'], bg_cat['weight_clmm'], ',')
plt.xlabel("Redshift of b/g galaxy")
plt.ylabel("CLMM weight")

In [None]:
#radii2 = (bg_cat['ra'] - cl_cat['ra'])**2 + ((bg_cat['dec'] - cl_cat['dec'])**2)

plt.scatter(bg_cat["distance_arcmin"], bg_cat["tangential_comp_clmm"], marker='.')
plt.xscale('log')
plt.yscale('log')
plt.xlabel("Radius [arcmin]")
plt.ylabel("Tangential Delta Sigma")
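
The per-galaxy scatter above is noisy, so it can help to average it in radial bins. The sketch below computes a weighted mean in log-spaced bins with plain NumPy on invented toy data. It is a generic illustration, not necessarily the estimator that CLMM or TXPipe applies.

```python
import numpy as np

def weighted_profile(radius, value, weight, bin_edges):
    """Weighted mean of `value` in radial bins; NaN where a bin has no weight."""
    sum_w, _ = np.histogram(radius, bins=bin_edges, weights=weight)
    sum_wv, _ = np.histogram(radius, bins=bin_edges, weights=weight * value)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(sum_w > 0, sum_wv / sum_w, np.nan)

# Toy data: five galaxies with invented radii, signal values and weights
radius = np.array([1.5, 2.0, 3.0, 20.0, 30.0])
value = np.array([1.0, 1.0, 4.0, 2.0, 2.0])
weight = np.array([1.0, 1.0, 2.0, 1.0, 1.0])
edges = np.logspace(0, 2, 3)  # bin edges at 1, 10 and 100

profile = weighted_profile(radius, value, weight, edges)
print(profile)
```

With the real catalog, the inputs would be `bg_cat["distance_arcmin"]`, `bg_cat["tangential_comp_clmm"]` and `bg_cat["weight_clmm"]`.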

# 20 deg$^2$ Sample

Our second input catalog contains a larger data set - 20 square degrees of CosmoDC2 data + mock noise, and an accompanying RedMapper cluster catalog and mock spectroscopic sample.

It contains about 25 million galaxies and 1900 clusters.

This is large enough that it's worth running in parallel, instead of in Jupyter, especially because we have to calculate the photo-z for this sample, which is pretty slow.

To download the 20 deg$^2$ catalog (13 GB), run:
```
curl -O https://portal.nersc.gov/cfs/lsst/txpipe/data/cosmodc2-20deg2.tar.gz
tar -zxvf cosmodc2-20deg2.tar.gz
```
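
If you prefer to do the extraction from Python, and to skip it when the data are already in place, something like the following could work. The helper name `extract_if_needed` is introduced here for illustration, and the demonstration uses a tiny archive built on the fly, not the real 13 GB file.

```python
import os
import tarfile
import tempfile

def extract_if_needed(archive_path, dest_dir, marker):
    """Extract archive_path into dest_dir unless dest_dir/marker already exists.

    Returns True if an extraction was performed.
    """
    if os.path.exists(os.path.join(dest_dir, marker)):
        return False
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return True

# Demonstration with a tiny stand-in archive in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "cluster_catalog.hdf5")
    with open(src, "w") as f:
        f.write("placeholder")
    archive = os.path.join(tmp, "toy.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="toy/cluster_catalog.hdf5")
    dest = os.path.join(tmp, "out")
    first = extract_if_needed(archive, dest, "toy/cluster_catalog.hdf5")
    second = extract_if_needed(archive, dest, "toy/cluster_catalog.hdf5")
    print(first, second)
```

The second call does nothing because the marker file already exists, which makes the cell safe to re-run.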

In [None]:
if not os.path.exists("data/cosmodc2/20deg2/cluster_catalog.hdf5"):
    raise RuntimeError("Download and extract the 20 deg^2 data file to continue")

## Launching a pipeline

At CC-IN2P3: Let's have a look at the submission script for this pipeline: `examples/cosmodc2/20deg2-in2p3.sub`:

In [None]:
! cat examples/cosmodc2/20deg2-in2p3.sub

At NERSC: Let's have a look at the submission script for this pipeline: `examples/cosmodc2/20deg2-nersc.sub`:

In [None]:
! cat examples/cosmodc2/20deg2-nersc.sub

These submission scripts will launch a job of up to one hour (it should finish in about 30 min) on a single node at CC-IN2P3 (for the former) or on NERSC Perlmutter (for the latter) to run a pipeline.

In a terminal, **navigate to your TXPipe directory and run the following to launch the job**.

At IN2P3:
```
sbatch examples/cosmodc2/20deg2-in2p3.sub
```

At NERSC:
```
sbatch examples/cosmodc2/20deg2-nersc.sub
```
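
If you would like to submit from Python instead and keep hold of the job ID, a minimal sketch is shown below. It relies on `sbatch` printing a line of the form `Submitted batch job <id>`. The helper names are introduced here for illustration, and the job number in the demonstration is made up.

```python
import re
import subprocess

def parse_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Could not find a job ID in: {sbatch_output!r}")
    return int(match.group(1))

def submit(script):
    """Submit a batch script with sbatch and return its job ID."""
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return parse_job_id(result.stdout)

# Parsing demonstrated on a sample message; submit() is not called here
print(parse_job_id("Submitted batch job 123456\n"))
```

On a login node you could call `submit("examples/cosmodc2/20deg2-nersc.sub")` and then check progress with `squeue -j <id>`.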

## Investigating our pipeline

While that's running, let's have a look at what's in the pipeline.

- The pipeline file on IN2P3 is `examples/cosmodc2/pipeline-20deg2-clmm.yml`
- The pipeline file on NERSC is `examples/cosmodc2/pipeline-20deg2-clmm-nersc.yml`

Comment or uncomment the appropriate `pipeline_file` definition below, depending on whether you are running on IN2P3 or on NERSC.
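
As an alternative to editing comments, the choice could be made automatically. The sketch below assumes that the `NERSC_HOST` environment variable is set on NERSC systems and treats everything else as IN2P3, which is a simplification. The helper `guess_site` is a name introduced here.

```python
import os

# Pipeline files listed above, keyed by site
PIPELINE_FILES = {
    "in2p3": "examples/cosmodc2/pipeline-20deg2-clmm.yml",
    "nersc": "examples/cosmodc2/pipeline-20deg2-clmm-nersc.yml",
}

def guess_site(environ=None):
    """Guess the site: 'nersc' if NERSC_HOST is set, otherwise 'in2p3'."""
    environ = os.environ if environ is None else environ
    return "nersc" if environ.get("NERSC_HOST") else "in2p3"

site = guess_site()
print(site, PIPELINE_FILES[site])
```

Passing a dictionary as `environ` makes the function easy to test without changing the real environment.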

First, we can use ceci to build a flow-chart showing the pipeline stages:

In [None]:
# Read the appropriate pipeline configuration, and ask for a flow-chart.
# pipeline_file = "examples/cosmodc2/pipeline-20deg2-clmm.yml"
# pipeline_file = "examples/cosmodc2/pipeline-20deg2-clmm-nersc.yml"
pipeline_file = "examples/cosmodc2/pipeline-20deg2-CLdev.yml"

flowchart_file = "20deg2.png"


pipeline_config = ceci.Pipeline.build_config(
    pipeline_file,
    flow_chart=flowchart_file,
    dry_run=True
)

# Run the flow-chart pipeline
ceci.run_pipeline(pipeline_config)

Now we can have a look at the chart it has created:

In [None]:
Image(flowchart_file)

The flowchart elements are classified as follows:

- Red ellipses are pipeline stages, each of which is a python class.

- Yellow boxes are pre-existing input data files.

- Blue boxes are files created by the pipeline.

We can see that this generally makes sense: the final output is a set of cluster shear index catalogs, just like before.

Let's have a look at the configuration for this pipeline:


In [None]:
pprint(pipeline_config)

This dictionary defines what pipeline stages are run, and how they are executed. You can see:

- a list of stages to be run, including their parallelization.
- site information showing how to run individual steps.
- directories to put logs and outputs
- launcher information on how to launch and manage the workflow
- overall inputs to the pipeline
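
To pull out just the stage list and parallelization from such a dictionary, a small helper can be handy. The toy configuration below only mimics the kinds of entries listed above. Its key names (`stages`, `nprocess`, `output_dir`, `log_dir` and so on) are assumptions about the layout and its values are invented, so compare against the printed `pipeline_config` before relying on it.

```python
# Toy configuration with assumed key names and invented values
toy_config = {
    "stages": [
        {"name": "StageA", "nprocess": 1},
        {"name": "StageB", "nprocess": 8},
    ],
    "site": {"name": "local"},
    "output_dir": "outputs",
    "log_dir": "logs",
    "inputs": {"shear_catalog": "shear_catalog.hdf5"},
    "config": "config.yml",
}

def summarize_stages(config):
    """Return (stage name, number of processes) for each stage in the config."""
    return [(s["name"], s.get("nprocess", 1)) for s in config.get("stages", [])]

for name, nproc in summarize_stages(toy_config):
    print(f"{name}: {nproc} process(es)")
```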

Finally, the 'config' item points to another file that configures the individual pipeline stages:



In [None]:
with open(pipeline_config['config']) as f:
    print(f.read())

--- 

After a bit more waiting, the final background cluster selection should complete.

We can again use our combined catalog to explore it:

In [None]:
if not os.path.exists("data/cosmodc2/outputs-20deg2/cluster_shear_catalogs.hdf5"):
    raise RuntimeError("Please wait a bit longer for the pipeline to complete")
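
Instead of re-running the check by hand, you could poll for the output file. The sketch below is a generic helper introduced here for illustration. Note that a file can appear on disk before it has been fully written, so treat a positive result as a hint and not as proof that the pipeline has finished.

```python
import os
import time

def wait_for_file(path, timeout=3600.0, poll=30.0):
    """Block until path exists; return True, or False if timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True

# Quick demonstration on a path that certainly exists (the current directory)
print(wait_for_file(os.curdir, timeout=1.0, poll=0.1))
```

For the pipeline above, the call would be `wait_for_file("data/cosmodc2/outputs-20deg2/cluster_shear_catalogs.hdf5")`.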

In [None]:
# TODO: fix finding all these automatically from the pipeline object
ccc = txpipe.extensions.CombinedClusterCatalog(
    shear_catalog="data/cosmodc2/20deg2/shear_catalog.hdf5",
    shear_tomography_catalog="data/cosmodc2/outputs-20deg2/shear_tomography_catalog.hdf5",
    cluster_catalog="./data/cosmodc2/20deg2/cluster_catalog.hdf5",
    cluster_shear_catalogs="data/cosmodc2/outputs-20deg2/cluster_shear_catalogs.hdf5",
    photoz_pdfs="data/cosmodc2/outputs-20deg2/source_photoz_pdfs.hdf5",
)

In [None]:
# number of clusters:
ccc.ncluster

In [None]:
# info about one cluster
info = ccc.get_cluster_info(500)
info

In [None]:
# check the number of galaxies behind this cluster
ccc.get_background_catalog_indexing(500)[0].size

Depending on the file system this can be slow ...

In [None]:
# get the catalog for this cluster:
cat = ccc.get_background_shear_catalog(500)

In [None]:
# check that the positions make sense for cluster and galaxies
plt.plot(cat['ra'], cat['dec'], ',')
plt.plot(info['ra'], info['dec'], 'rx')

Hopefully this is the information you need to do the next steps, but let me know if not!