# Gaap pipeline 

Use this notebook to run `gaap` photometry on Merian-reduced data.

Make sure you are in the right environment before launching the Jupyter notebook:

        module load anaconda3/2022.5
        . /scratch/gpfs/am2907/Merian/gaap/lambo/scripts/setup_env_w40.sh
        jupyter notebook

In [101]:
import lsst.daf.butler as dafButler
import numpy as np
import glob
import os, sys
sys.path.append(os.path.join(os.getenv('LAMBO_HOME'), 'lambo/scripts/'))
from hsc_gaap.deploy_gaap_array import deploy_training_job
from hsc_gaap.check_gaap_run import checkRun
from hsc_gaap.find_patches_to_reduce import * 
from hsc_gaap.compile_catalogs import compileCatalogs
from tqdm import tqdm

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


---
# Step 1: What patches do we need to reduce?

We want to identify patches that have the necessary Merian data products for `gaap` processing and have not already been processed. Each patch needs to have:

- deepCoadd_ref
- deepCoadd_meas
- deepCoadd_scarletModelData
- deepCoadd_calexp


First get a list of all Merian tracts with reduced data. We will then search through them to see which patches meet these criteria:

In [102]:
repo = '/scratch/gpfs/am2907/Merian/gaap'

In [103]:
output_collection = "DECam/runs/merian/dr1_wide"
data_type = "deepCoadd_calexp"
skymap = "hsc_rings_v1"
butler = dafButler.Butler('/projects/MERIAN/repo/', collections=output_collection, skymap=skymap)

In [104]:

patches = np.array([[data_id['tract'], data_id['patch']]
                    for data_id in butler.registry.queryDataIds(['tract', 'patch'], datasets=data_type,
                                                                collections=output_collection, skymap=skymap)])
patches = patches[patches[:, 0].argsort()]
tracts, idx = np.unique(patches[:,0], return_index=True) 
patches_by_tract = np.split(patches[:,1] ,idx[1:])
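The grouping in the cell above relies on a sort followed by `np.unique(..., return_index=True)` and `np.split`. The following is a minimal, self-contained sketch of that idiom on made-up (tract, patch) pairs; the values are purely illustrative and no butler is needed. It uses a stable sort so that patch order within a tract is preserved, which the default `argsort()` does not guarantee.

```python
import numpy as np

# Toy (tract, patch) pairs in arbitrary order; the values are made up for illustration.
pairs = np.array([[9813, 4], [9812, 0], [9813, 1], [9812, 7], [9814, 2]])

# A stable sort by tract makes each tract's rows contiguous and keeps their original order.
pairs = pairs[pairs[:, 0].argsort(kind="stable")]

# return_index gives the position of the first row of each tract in the sorted array.
tracts, first_row = np.unique(pairs[:, 0], return_index=True)

# Splitting at every boundary except the first yields one patch array per tract.
patches_by_tract = np.split(pairs[:, 1], first_row[1:])

for tract, patches in zip(tracts, patches_by_tract):
    print(tract, patches.tolist())
```

After this, `tracts[i]` and `patches_by_tract[i]` line up, which is what the later loops over `tracts` assume.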

Now find patches with necessary data products in N708:

In [105]:
tracts_n708 = []
patches_n708 = []
for tract in tqdm(tracts):
    patches = findReducedPatches(tract)
    if len(patches) > 0:
        tracts_n708.append(tract)
        patches_n708.append(patches)

tracts_n708 = np.array(tracts_n708)
print(f"{sum([len(p) for p in patches_n708])} patches in {len(tracts_n708)} tracts with necessary data products in N708")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 355/355 [04:08<00:00,  1.43it/s]

14319 patches in 285 tracts with necessary data products in N708





And in N540:

In [106]:
tracts_n540 = []
patches_n540 = []
for tract in tqdm(tracts):
    patches = findReducedPatches(tract, band="N540")
    if len(patches) > 0:
        tracts_n540.append(tract)
        patches_n540.append(patches)

tracts_n540 = np.array(tracts_n540)
print(f"{sum([len(p) for p in patches_n540])} patches in {len(tracts_n540)} tracts with necessary data products in N540")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 355/355 [04:11<00:00,  1.41it/s]

15259 patches in 318 tracts with necessary data products in N540





It is also useful to see which patches have data in only one of the two bands:

In [107]:
n708_non540_tracts = []
n708_non540_patches = []
n540_non708_tracts = []
n540_non708_patches = []
for tract in tqdm(tracts):
    if tract in tracts_n540:
        patches_n540_i = patches_n540[np.where(tracts_n540 == tract)[0][0]]
    else:
        patches_n540_i=[]
    if tract in tracts_n708:
        patches_n708_i = patches_n708[np.where(tracts_n708 == tract)[0][0]]
    else:
        patches_n708_i=[]
        
    n708_non540 = list(set(patches_n708_i) - set(patches_n540_i))
    n540_non708 = list(set(patches_n540_i) - set(patches_n708_i))

    if len(n708_non540)>0:
        n708_non540_tracts.append(tract)
        n708_non540_patches.append(n708_non540)

    if len(n540_non708)>0:
        n540_non708_tracts.append(tract)
        n540_non708_patches.append(n540_non708)

numpatches_n540_only = sum([len(p) for p in n540_non708_patches])
numpatches_n708_only = sum([len(p) for p in n708_non540_patches])
numpatches_total = sum([len(p) for p in patches_by_tract])
    

100%|████████████████████████████████████████████████████████████████████████████████████████████████| 355/355 [00:00<00:00, 18194.66it/s]


In [108]:
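print(f"There are {numpatches_n540_only} patches in {len(n540_non708_tracts)} tracts that have N540 and no N708")
print(f"There are {numpatches_n708_only} patches in {len(n708_non540_tracts)} tracts that have N708 and no N540")
# Patches with both bands = all N708 patches minus the N708-only ones.
# numpatches_total comes from the deepCoadd_calexp query rather than the per-band lists,
# so subtracting from it does not give the overlap.
numpatches_both = sum(len(p) for p in patches_n708) - numpatches_n708_only
print(f"There are {numpatches_both} patches with both N540 and N708")

There are 3831 patches in 115 tracts that have N540 and no N708
There are 2891 patches in 89 tracts that have N708 and no N540
There are 11428 patches with both N540 and N708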
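The per-tract comparison above comes down to two set differences and one intersection. Here is a small sketch of the same bookkeeping on toy data; the dictionaries, patch numbers, and the helper name `split_by_band` are hypothetical and only stand in for `patches_n708` and `patches_n540`.

```python
def split_by_band(patches_a, patches_b):
    """Return (only in a, only in b, in both) as sorted lists."""
    a, b = set(patches_a), set(patches_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Toy per-tract patch lists (made-up values).
n708 = {9813: [0, 1, 2, 3], 9814: [5, 6]}
n540 = {9813: [2, 3, 4], 9815: [8]}

summary = {}
for tract in sorted(set(n708) | set(n540)):
    # A tract missing from one band simply contributes an empty list.
    summary[tract] = split_by_band(n708.get(tract, []), n540.get(tract, []))

n_both = sum(len(both) for _, _, both in summary.values())
print(summary)
print(f"{n_both} patches have both bands")
```

Counting the intersection directly avoids having to subtract the single-band counts from a separately computed total.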


Optionally, save this information to a CSV file:

In [109]:
# saveMerianReducedPatchList(tracts, os.path.join(repo, "reducedPatches_N708.csv"))
# saveMerianReducedPatchList(tracts, os.path.join(repo, "reducedPatches_N540.csv"), band="N540")

Now find patches that haven't yet been `gaap` processed:

- "N708" here means all patches that have N708 data (whether or not they also have N540)
- "N540" means patches that have only N540 data

In [110]:
tracts_n708_nogaap = []
patches_n708_nogaap = []
for i, tract in enumerate(tqdm(tracts_n708)):
    patches_mer  = patches_n708[i]
    patches_gaap = findGaapReducedPatches(tract, repo=repo)
    if len(set(patches_mer) - set(patches_gaap)) > 0:
        tracts_n708_nogaap.append(tract)
        patches_n708_nogaap.append(list(set(patches_mer) - set(patches_gaap)))


numpatches_n708_nogaap = sum([len(p) for p in patches_n708_nogaap])
print(f"{numpatches_n708_nogaap} patches in {len(tracts_n708_nogaap)} tracts to be reduced with N708")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 285/285 [00:29<00:00,  9.79it/s]

5 patches in 2 tracts to be reduced with N708





In [111]:
tracts_n540_nogaap = []
patches_n540_nogaap = []
for i, tract in enumerate(tqdm(n540_non708_tracts)):
    patches_mer  = n540_non708_patches[i]
    patches_gaap = findGaapReducedPatches(tract, repo=repo)
    if len(set(patches_mer) - set(patches_gaap)) > 0:
        tracts_n540_nogaap.append(tract)
        patches_n540_nogaap.append(list(set(patches_mer) - set(patches_gaap)))

numpatches_n540_nogaap = sum([len(p) for p in patches_n540_nogaap])
print(f"{numpatches_n540_nogaap} patches in {len(tracts_n540_nogaap)} tracts to be reduced with only N540")

100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 115/115 [00:10<00:00, 10.49it/s]

1042 patches in 43 tracts to be reduced with only N540





In [112]:
tracts_notcompiled_N708 = [tract for tract in tracts_n708 if not hasCompiledCatalog(tract)]
tracts_notcompiled_N540 = [tract for tract in tracts_n540 if not hasCompiledCatalog(tract)]

In [113]:
# saveGaapReducedPatchList(tracts, os.path.join(repo, "GaapReduced.csv"))
# saveGaapNotReducedPatchList(tracts, os.path.join(repo, "notGaapReduced.csv"), notionformat=False)
# saveGaapNotReducedPatchList(tracts_n540_nogaap, os.path.join(repo, "notGaapReduced_N540.csv"), patches = patches_n540_nogaap,
#                             notionformat=True, band="N540")

In [114]:
npatches = numpatches_n540_nogaap  # + numpatches_n708_nogaap
print(f"Downloading HSC images for {npatches} patches will take ~{npatches/60:.1f} hours")
print(f"The HSC images for {npatches} patches will take up ~{npatches*.6/1000:.1f} TB")
print(f"Once the data has been downloaded, running gaap on {npatches} patches will take ~{npatches/20/2:.1f} hours")
print(f"The gaap catalogs for {npatches} patches will take up ~{npatches*.212/1000:.1f} TB")


Downloading HSC images for 1042 patches will take ~17.4 hours
The HSC images for 1042 patches will take up ~0.6 TB
Once the data has been downloaded, running gaap on 1042 patches will take ~26.1 hours
The gaap catalogs for 1042 patches will take up ~0.2 TB
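The estimates in the cell above use fixed per-patch rates. A small sketch that wraps the same arithmetic in a reusable function is shown below. The function name and dictionary keys are hypothetical, and the constants are copied from the cell as-is; reading them as roughly 60 patches downloaded per hour, 0.6 GB of images per patch, and 0.212 GB of catalog per patch is an inference from the formulas rather than something the notebook states.

```python
def estimate_resources(npatches):
    """Rough time and storage estimates, using the constants from the notebook cell."""
    return {
        "download_hours": npatches / 60,         # same as npatches/60 in the cell
        "download_tb": npatches * 0.6 / 1000,    # same as npatches*.6/1000
        "gaap_hours": npatches / 20 / 2,         # same as npatches/20/2
        "catalog_tb": npatches * 0.212 / 1000,   # same as npatches*.212/1000
    }


estimate = estimate_resources(1042)
for key, value in estimate.items():
    print(f"{key}: {value:.1f}")
```

Keeping the rates in one place makes it easier to update them if download speed or catalog size changes.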


---
# Step 2: Download the data

We need to download the HSC data for all of the tracts we need to reduce. *Be warned, this takes a while and uses a lot of storage.*

Run the following inside a `screen` session: depending on how much data you need, the download can take many hours.

The following will download images for tract 9813 to `/scratch/gpfs/am2907/Merian/gaap/S20A/deepCoadd_calexp/9813` and the blendedness catalogs to `/scratch/gpfs/am2907/Merian/gaap/S20A/gaapTable/9813`:
- Unless you pass `--only_merian=False`, this only downloads the patches that have been reduced by Merian.
- You can download all of the Merian-reduced data in one go by setting `--alltracts=True`. Be careful with this, because it is **lots** of data!

    screen -L -S downloadtract    
    
    cd /scratch/gpfs/am2907/Merian/gaap
    . lambo/scripts/setup_env_w40.sh
    python3 lambo/scripts/hsc_gaap/download_S20A.py --tract=9813 --outdir="/scratch/gpfs/am2907/Merian/gaap/"


To detach from the screen session, press `Ctrl-a d`; to reattach, run `screen -r downloadtract`.
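If you need several tracts but not all of them, one option is to build one download command per tract. The sketch below only constructs and prints the commands, using the `--tract` and `--outdir` flags shown above; the helper name `download_commands` is hypothetical, and nothing is executed.

```python
import shlex


def download_commands(tracts, outdir="/scratch/gpfs/am2907/Merian/gaap/"):
    """Build one download_S20A.py command (as an argument list) per tract."""
    script = "lambo/scripts/hsc_gaap/download_S20A.py"
    return [
        ["python3", script, f"--tract={tract}", f"--outdir={outdir}"]
        for tract in tracts
    ]


commands = download_commands([9813, 9812])
for cmd in commands:
    # shlex.join quotes arguments safely for pasting into a shell.
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` from within the screen session, so that a failed download stops the loop instead of being silently skipped.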

---
# Step 3: Make slurm scripts and submit

Write one Slurm script per tract. Each script is a job array with one job per patch. 
You can submit the scripts as you write them, but note that the queue caps how many jobs you can have submitted at once.
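For readers unfamiliar with job arrays, here is a minimal sketch of what a per-tract array script can look like. It is not the script that `deploy_training_job` writes: the function name, the job name, the log file pattern, and the placeholder `echo` command are all assumptions for illustration. The `--array=0-N%M` syntax and the `SLURM_ARRAY_TASK_ID` variable are standard Slurm features; the `%M` suffix limits how many array tasks run at the same time.

```python
def render_array_script(job_name, n_tasks, command, max_concurrent=None):
    """Return the text of a Slurm job-array script with one task per patch."""
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    array_spec = f"0-{n_tasks - 1}"
    if max_concurrent is not None:
        array_spec += f"%{max_concurrent}"  # throttle simultaneously running tasks
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --array={array_spec}",
        f"#SBATCH --output=logs/{job_name}_%a.o",  # %a expands to the array task index
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


# Placeholder command: each task only reports which patch index it would process.
script = render_array_script(
    "gaap_9813",
    n_tasks=81,  # illustrative task count
    command='echo "patch index ${SLURM_ARRAY_TASK_ID}"',
    max_concurrent=20,
)
print(script)
```

Throttling with `%M` limits concurrency within one array; it does not change how many array tasks count against the cluster's submission limit.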

In [18]:
tracts_n708_nogaap[2:3]

[10041]

In [19]:
for tract in tracts_n708_nogaap[2:3]:
    deploy_training_job(tract, band = "N708", filter_jobs=5,
                        python_file='lambo/scripts/hsc_gaap/run_gaap.py',
                        name='gaap', email="am2907@princeton.edu", outname = None, 
                        repo='/scratch/gpfs/am2907/Merian/gaap', scriptdir="/scratch/gpfs/am2907/Merian/gaap/", 
                        submit=True, fixpatches=False)

Submitted batch job 9993542


In [49]:
deploy_training_job(10050, band = "N708", filter_jobs=5,
                    python_file='lambo/scripts/hsc_gaap/run_gaap.py',
                    name='gaap', email="am2907@princeton.edu", outname = None, 
                    repo='/scratch/gpfs/am2907/Merian/gaap', scriptdir="/scratch/gpfs/am2907/Merian/gaap/", 
                    submit=True, fixpatches=True)


Submitted batch job 9993706


In [91]:
deploy_training_job(9373, band = "N540", filter_jobs=5,
                        python_file='lambo/scripts/hsc_gaap/run_gaap.py',
                        name='gaap', email="am2907@princeton.edu", outname = None, 
                        repo='/scratch/gpfs/am2907/Merian/gaap', scriptdir="/scratch/gpfs/am2907/Merian/gaap/", 
                        submit=True, fixpatches=False)

Submitted batch job 10003993


In [93]:
tracts_n540_nogaap[73]

9461

In [94]:
for tract in tracts_n540_nogaap[73:]:
    deploy_training_job(tract, band = "N540", filter_jobs=5,
                        python_file='lambo/scripts/hsc_gaap/run_gaap.py',
                        name='gaap', email="am2907@princeton.edu", outname = None, 
                        repo='/scratch/gpfs/am2907/Merian/gaap', scriptdir="/scratch/gpfs/am2907/Merian/gaap/", 
                        submit=True, fixpatches=False)

Submitted batch job 10004004
Submitted batch job 10004005
Submitted batch job 10004006
Submitted batch job 10004007
Submitted batch job 10004008
Submitted batch job 10004009
Submitted batch job 10004010
Submitted batch job 10004011
Submitted batch job 10004012
Submitted batch job 10004013
Submitted batch job 10004014
Submitted batch job 10004015
Submitted batch job 10004016
Submitted batch job 10004017
Submitted batch job 10004018
Submitted batch job 10004019
Submitted batch job 10004020
Submitted batch job 10004021
Submitted batch job 10004022
Submitted batch job 10004023
Submitted batch job 10004024
Submitted batch job 10004025
Submitted batch job 10004026
Submitted batch job 10004027
Submitted batch job 10004028
Submitted batch job 10004029
Submitted batch job 10004030
Submitted batch job 10004031
Submitted batch job 10004032
Submitted batch job 10004033
Submitted batch job 10004034
Submitted batch job 10004035
Submitted batch job 10004039
Submitted batch job 10004040
Submitted batc

The gaap reduction will save one catalog for each patch to (for example):

        /scratch/gpfs/am2907/Merian/gaap/S20A/gaapTable/9813/0,0/objectTable_9813_0,0_S20A.fits
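Given that path pattern, a quick way to see which patch catalogs are still missing is to build the expected path and test whether it exists. This is a sketch under assumptions: the helper names are hypothetical, the patch label is taken to be the string used in the example path (`0,0`), and `findGaapReducedPatches` may identify processed patches differently.

```python
from pathlib import Path


def catalog_path(repo, tract, patch):
    """Expected per-patch catalog path, following the example path above."""
    name = f"objectTable_{tract}_{patch}_S20A.fits"
    return Path(repo) / "S20A" / "gaapTable" / str(tract) / str(patch) / name


def missing_catalogs(repo, tract, patches):
    """Return the patches whose catalog file does not exist yet."""
    return [p for p in patches if not catalog_path(repo, tract, p).exists()]


example = catalog_path("/scratch/gpfs/am2907/Merian/gaap", 9813, "0,0")
print(example.as_posix())
```

Calling `missing_catalogs(repo, 9813, ["0,0", "0,1"])` would then list whichever of those patches has no catalog on disk.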

---
# Step 4: Check on it!

While the jobs are running, you can inspect the logs for any glaring problems:
- `logs/gaapPhot_array_9813_0.o` 
- `logs/gaap_9813_0.log`

Once the jobs for a given tract have finished, you can check how things went.

In [78]:
print(tracts_notcompiled_N708[30:])

[9809, 9810, 9811, 9815, 9816, 9817, 9818, 9819, 9820, 9821, 9828, 9833, 9837, 9838, 9839, 9862, 9863, 9939, 9940, 9941, 9942, 9943, 9944, 9945, 9949, 9950, 9951, 9952, 9953, 10040, 10041, 10042, 10043, 10044, 10045, 10046, 10047, 10048, 10049, 10050, 10051, 10052, 10053, 10057, 10058, 10060, 10061, 10062, 10070, 10078, 10182, 10183, 10184, 10185, 10186, 10283, 10284, 10285, 10286, 10287, 10288, 10289, 10290, 10291, 10292, 10293, 10294, 10295, 10296, 10297, 10298, 10299, 10300, 10301, 10302, 10303, 10304, 10426, 10427, 10428]


In [47]:
for tract in tracts_notcompiled_N708[50:]:
    problems = checkRun(tract, band="N708")

TRACT: 9942
NO PROBLEMS

TRACT: 9943
NO PROBLEMS

TRACT: 9944
NO PROBLEMS

TRACT: 9945
NO PROBLEMS

TRACT: 9949
NO PROBLEMS

TRACT: 9950
NO PROBLEMS

TRACT: 9951
NO PROBLEMS

TRACT: 9952
NO PROBLEMS

TRACT: 9953
NO PROBLEMS

TRACT: 10040
NO PROBLEMS

TRACT: 10041
PROBLEM IN PATCH 3: Catalog not saved
PROBLEM IN PATCH 3: Failed for 5 bands
PROBLEM IN PATCH 6: Catalog not saved
PROBLEM IN PATCH 6: Failed for 5 bands
PROBLEM IN PATCH 10: Catalog not saved
PROBLEM IN PATCH 10: Failed for 5 bands
PROBLEM IN PATCH 14: Catalog not saved
PROBLEM IN PATCH 14: Failed for 5 bands

TRACT: 10042
NO PROBLEMS

TRACT: 10043
NO PROBLEMS

TRACT: 10044
NO PROBLEMS

TRACT: 10045
NO PROBLEMS

TRACT: 10046
NO PROBLEMS

TRACT: 10047
NO PROBLEMS

TRACT: 10048
NO PROBLEMS

TRACT: 10049
NO PROBLEMS

TRACT: 10050
PROBLEM IN PATCH 39: Catalog not saved
PROBLEM IN PATCH 40: Catalog not saved
PROBLEM IN PATCH 40: Failed for 5 bands
PROBLEM IN PATCH 41: Catalog not saved
PROBLEM IN PATCH 41: Failed for 5 bands
PROBL

In [99]:
checkRun(9007, band="N540")

TRACT: 9007
NO PROBLEMS



array([], dtype=float64)

In [100]:
patches_n540_nogaap[24]

[46, 47]

In [98]:
tracts_n540_nogaap[24]

9085

In [95]:
for tract in tracts_n540_nogaap[20:]:
    checkRun(tract, band="N540")

TRACT: 9008
NO PROBLEMS

TRACT: 9009
NO PROBLEMS

TRACT: 9010
NO PROBLEMS

TRACT: 9011
NO PROBLEMS

TRACT: 9085
PROBLEM IN PATCH 50: Catalog not saved
PROBLEM IN PATCH 50: Failed for 5 bands
PROBLEM IN PATCH 51: Failed for 2 bands
PROBLEM IN PATCH 54: Failed for 2 bands
PROBLEM IN PATCH 55: Failed for 3 bands
PROBLEM IN PATCH 56: Failed for 3 bands
PROBLEM IN PATCH 58: Failed for 3 bands
PROBLEM IN PATCH 59: Catalog not saved
PROBLEM IN PATCH 59: Failed for 5 bands
PROBLEM IN PATCH 61: Failed for 3 bands
PROBLEM IN PATCH 62: Catalog not saved
PROBLEM IN PATCH 62: Failed for 5 bands
PROBLEM IN PATCH 65: Failed for 2 bands
PROBLEM IN PATCH 66: Failed for 2 bands
PROBLEM IN PATCH 67: Failed for 1 bands
PROBLEM IN PATCH 68: Failed for 2 bands
PROBLEM IN PATCH 75: Failed for 2 bands

TRACT: 9097
NO PROBLEMS

TRACT: 9098
NO PROBLEMS

TRACT: 9099
NO PROBLEMS

TRACT: 9102
NO PROBLEMS

TRACT: 9103
NO PROBLEMS

TRACT: 9104
NO PROBLEMS

TRACT: 9105
NO PROBLEMS

TRACT: 9106
NO PROBLEMS

TRACT: 910

You might see messages like "Failed for 3 bands". This can happen when HSC images do not exist for every band, in which case there may be nothing you can fix.
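When many tracts are checked at once, it can help to turn the printed report into a data structure. The sketch below parses text in the format shown in the outputs above; the function names are hypothetical, and treating "Catalog not saved" as the signal for patches worth a closer look is a suggestion, not something `checkRun` does.

```python
import re

TRACT_RE = re.compile(r"TRACT: (\d+)")
PROBLEM_RE = re.compile(r"PROBLEM IN PATCH (\d+): (.+)")


def parse_check_output(text):
    """Map tract -> patch -> list of problem messages."""
    problems = {}
    tract = None
    for line in text.splitlines():
        line = line.strip()
        match = TRACT_RE.fullmatch(line)
        if match:
            tract = int(match.group(1))
            continue
        match = PROBLEM_RE.fullmatch(line)
        if match and tract is not None:
            patch = int(match.group(1))
            problems.setdefault(tract, {}).setdefault(patch, []).append(match.group(2))
    return problems


def unsaved_patches(problems):
    """Patches per tract for which no catalog was written at all."""
    result = {}
    for tract, patches in problems.items():
        unsaved = sorted(p for p, msgs in patches.items() if "Catalog not saved" in msgs)
        if unsaved:
            result[tract] = unsaved
    return result


report = """TRACT: 9011
NO PROBLEMS

TRACT: 9085
PROBLEM IN PATCH 50: Catalog not saved
PROBLEM IN PATCH 50: Failed for 5 bands
PROBLEM IN PATCH 51: Failed for 2 bands
PROBLEM IN PATCH 59: Catalog not saved
PROBLEM IN PATCH 59: Failed for 5 bands
"""

problems = parse_check_output(report)
print(unsaved_patches(problems))
```

Patches that only report "Failed for N bands" still produced a catalog, so they are kept separate from the ones with no catalog at all.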

---
# Step 5: Merge catalogs

If everything is looking good, you can merge the patch catalogs into a tract-level catalog. 

It is recommended to run this step in a `screen` session in the terminal, because it takes some time.

Here is an example run from the notebook:

In [35]:
compileCatalogs([9839], repo, alltracts=False, rewrite=True)

CATALOG FOR TRACT 9839 ALREADY WRITTEN - REWRITTING
COMPILING CATALOG FOR TRACT 9839 WITH 5 PATCHES
COMPILED TABLE OF 95878 ROWS and 72 COLUMNS
WROTE TABLE TO /scratch/gpfs/am2907/Merian/gaap/S20A/gaapTable/9839/objectTable_9839_S20A.fits


In [59]:
",".join(np.array(list(set(tracts_n708) - set(tracts_n708_nogaap)), dtype=str))

'9226,9227,10283,10284,10285,10286,10287,10288,10289,10290,10291,10292,10293,10294,10295,10296,10297,10298,10299,10300,10301,10302,10303,10304,8280,8281,8282,8283,8284,8285,8286,9313,9314,9315,9316,9317,9318,9319,9320,9321,9322,9323,9324,9325,9326,9327,9328,9329,9330,9331,9332,9333,9334,9335,9336,9339,9340,9341,9342,9343,9344,9345,9346,9347,9373,9374,9375,9376,9377,9378,9379,10426,10427,10428,9455,9456,9457,9458,9459,9465,9466,9467,9468,9469,9470,9471,8521,8522,8523,8524,8525,8526,8527,9556,9557,9558,9559,9560,9561,9562,9563,9564,9565,9566,9567,9568,9569,9570,9571,9572,9573,9574,9575,9576,9577,9578,9579,9580,9582,9583,9584,9585,9586,9587,9588,9589,9590,9591,9592,9593,9594,9595,9596,9617,9618,9619,9620,9621,9697,9698,9699,9700,9701,9702,9703,9707,9708,9709,9710,9711,9712,9713,9714,8764,8765,8766,8767,8768,9798,9799,9800,9801,9802,9803,9804,9805,9806,9807,9808,9809,9810,9811,9812,9813,9814,9815,9816,9817,9818,9819,9820,9821,9828,9829,9830,9831,9832,9833,9834,9835,9836,9837,9838,9839,9862

In [36]:
from astropy.table import Table

In [39]:
Table.read('/scratch/gpfs/am2907/Merian/gaap/S20A/gaapTable/9839/objectTable_9839_S20A.fits')

To run the merge from the terminal instead, use (for example):

        python3 lambo/scripts/hsc_gaap/compile_catalogs.py --tracts="[9327,9328,9329,9813,9812]"

This will save a catalog to (for example):
        
        /scratch/gpfs/am2907/Merian/gaap/S20A/gaapTable/9813/objectTable_9813_S20A.fits
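The `--tracts` value in the terminal command is a bracketed, comma-separated list with no spaces. A small sketch for building it from a Python list of tract numbers follows; the helper names are hypothetical, and the command is only assembled, not run.

```python
def tracts_argument(tracts):
    """Format tract numbers as the bracketed list used by --tracts."""
    return "[" + ",".join(str(int(t)) for t in tracts) + "]"


def compile_command(tracts):
    """Argument list for compile_catalogs.py, mirroring the command shown above."""
    script = "lambo/scripts/hsc_gaap/compile_catalogs.py"
    return ["python3", script, f"--tracts={tracts_argument(tracts)}"]


cmd = compile_command([9327, 9328, 9329, 9813, 9812])
print(" ".join(cmd))
```

Converting through `int` means NumPy integers from arrays such as `tracts_n708` format the same way as plain Python integers.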

If you want to change the columns that are used for the compiled catalog, edit these files:

        lambo/scripts/hsc_gaap/keep_table_columns_gaap.txt
        lambo/scripts/hsc_gaap/keep_table_columns_merian.txt

And you're all done!