# OR3 PZ: Access Truth Data

Author: Melissa Graham

Last verified to run: Fri Apr 12 2024

LSST Science Pipelines version: Weekly 2024_04

**Overview** 

Poke around and see what kind of truth data is available.

**Summary**

There is OR3 truth data in parquet files, but it is not matched to the `object` table.

There is also OR3 truth data in the butler, and some of it is matched -- but it is not clear to me (yet) that it was matched to the `object` table being accessed in notebook 1.

Really, we should probably just use DP0.2 and not the OR3 data sets.

## Set up

Import packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lsst.daf.butler import Butler
import gc, os

## Parquet files

Jim Chiang has let us know they're in: 
`/sdf/data/rubin/shared/ops-rehearsal-3/imSim_catalogs/skyCatalogs`.

Jim also says: "Unfortunately, there isn't any real documentation for the format of those files, but things like the ra, dec, and redshift values should be easy to find. I can help with specific questions. The parquet files are labeled by healpix id, but they may not be the same as the tract numbers for a particular skymap. For the skyCatalogs, the RING numbering and nside=32 were used with ra, dec."
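To make the file labeling concrete, here is a minimal sketch of how a sky position maps to a RING-ordered healpix id at nside=32. It is a hand-written version of the standard HEALPix RING formula so that it has no dependencies; the function name `ang2pix_ring` is made up for this note. In practice a library call such as `healpy.ang2pix(nside, ra, dec, lonlat=True)` (RING ordering is its default) would normally be used instead.

```python
import math


def ang2pix_ring(nside, ra_deg, dec_deg):
    """Return the RING-ordered HEALPix pixel index containing (ra, dec) in degrees.

    Hypothetical helper written for illustration; not part of any LSST package.
    """
    z = math.sin(math.radians(dec_deg))   # cosine of the colatitude
    za = abs(z)
    tt = (ra_deg % 360.0) / 90.0          # longitude in units of pi/2, in [0, 4]

    if za <= 2.0 / 3.0:
        # Equatorial belt: every ring holds 4 * nside pixels.
        temp1 = nside * (0.5 + tt)
        temp2 = nside * z * 0.75
        jp = int(temp1 - temp2)           # index of the ascending edge line
        jm = int(temp1 + temp2)           # index of the descending edge line
        ir = nside + 1 + jp - jm          # ring number counted from z = 2/3
        kshift = 1 - (ir & 1)             # 1 if ir is even, 0 otherwise
        ip = ((jp + jm - nside + kshift + 1) // 2) % (4 * nside)
        ncap = 2 * nside * (nside - 1)    # number of pixels in the north polar cap
        return ncap + (ir - 1) * 4 * nside + ip

    # Polar caps: ring ir (counted from the nearest pole) holds 4 * ir pixels.
    tp = tt - int(tt)
    tmp = nside * math.sqrt(3.0 * (1.0 - za))
    jp = int(tp * tmp)
    jm = int((1.0 - tp) * tmp)
    ir = jp + jm + 1
    ip = int(tt * ir) % (4 * ir)
    if z > 0:
        return 2 * ir * (ir - 1) + ip
    return 12 * nside * nside - 2 * ir * (ir + 1) + ip


# A position near (ra, dec) = (215.16, -12.02) deg falls in pixel 7436,
# the id that appears in the file name galaxy_7436.parquet.
# (The position was computed from the formula, not taken from the files.)
print(ang2pix_ring(32, 215.16, -12.02))
```

Applying this to the `ra`, `dec` of the objects of interest would give the healpix ids, and so the parquet files, that cover them.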

The downside of using the parquet file truth data is that we would have to do the cross-matching ourselves.
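For a sense of what doing the cross-match ourselves would involve, below is a rough sketch of a nearest-neighbor match on the sky using only `numpy`. The function names, the 1 arcsec match radius, and the toy coordinates are all assumptions for illustration; nothing here comes from the OR3 files.

```python
import numpy as np


def radec_to_unit_vectors(ra_deg, dec_deg):
    """Convert ra, dec in degrees to an (N, 3) array of unit vectors."""
    ra = np.radians(np.asarray(ra_deg, dtype=float))
    dec = np.radians(np.asarray(dec_deg, dtype=float))
    return np.column_stack([np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)])


def nearest_neighbor_match(ra_obj, dec_obj, ra_truth, dec_truth,
                           max_sep_arcsec=1.0):
    """For each object, find the nearest truth source.

    Returns the truth index, the separation in arcsec, and a boolean flag
    that is True where the separation is within max_sep_arcsec.
    """
    v_obj = radec_to_unit_vectors(ra_obj, dec_obj)
    v_truth = radec_to_unit_vectors(ra_truth, dec_truth)
    # Brute force: chord length between every object and every truth source.
    chord = np.linalg.norm(v_obj[:, None, :] - v_truth[None, :, :], axis=2)
    idx = np.argmin(chord, axis=1)
    min_chord = chord[np.arange(len(idx)), idx]
    # Convert chord length to the angle it subtends on the unit sphere.
    half = np.clip(min_chord / 2.0, 0.0, 1.0)
    sep_arcsec = np.degrees(2.0 * np.arcsin(half)) * 3600.0
    return idx, sep_arcsec, sep_arcsec <= max_sep_arcsec


# Toy data (made up): three truth sources and two detected objects.
ra_truth = np.array([215.00, 215.01, 215.02])
dec_truth = np.array([-12.00, -12.01, -12.02])
ra_obj = np.array([215.01, 215.50])
dec_obj = np.array([-12.0101, -12.50])

idx, sep, matched = nearest_neighbor_match(ra_obj, dec_obj, ra_truth, dec_truth)
print(idx, sep, matched)
```

The brute-force distance matrix needs memory proportional to the number of objects times the number of truth sources, so for full catalogs a tree-based matcher (for example astropy's `SkyCoord.match_to_catalog_sky`) would be the more realistic choice.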

Option to list relevant files.

In [None]:
# path = '/sdf/data/rubin/shared/ops-rehearsal-3/imSim_catalogs/skyCatalogs'
# os.system('ls -lh ' + path)
# os.system('ls -lh ' + path + '/galaxy*parquet')

The files have very different sizes.

```
-rw-r--r-- 1 7485 rubin_users 2.6G Feb  8 19:32 galaxy_7436.parquet
-rw-r--r-- 1 7485 rubin_users 152M Mar  4 01:40 galaxy_flux_7436.parquet
```

The smaller file can be read in full with pandas.

The bigger file would have to be read with `dask` or `pyarrow` (see a demo in <a href="https://github.com/rubin-dp0/delegate-contributions-dp02/blob/main/desc_truth/read_truth_parquet_files.ipynb">this DP0.2 notebook</a>).

However, it looks like `ra`, `dec`, and `redshift` are in the smaller file, and that's probably all we'd need.

In [None]:
fnm = '/sdf/data/rubin/shared/ops-rehearsal-3/imSim_catalogs/skyCatalogs/galaxy_7436.parquet'
galaxy_tract = pd.read_parquet(fnm)

Option to show the full table.

In [None]:
galaxy_tract

## Butler 

I got this repo and collection from Dan Taranu.

In [None]:
repo = '/repo/dc2'
collection = '2.2i/runs/test-med-1/w_2024_12/DM-43400'
butler = Butler(repo, collections=collection)
registry = butler.registry

Option to print all data types related to truth.

In [None]:
# for dtype in sorted(registry.queryDatasetTypes(expression="*truth*")):
#     print(dtype.name)

The tables of interest are:

```
truth_summary
match_ref_truth_summary_objectTable_tract
match_target_truth_summary_objectTable_tract
matched_truth_summary_objectTable_tract
```

### The truth summary table

In [None]:
ts_refs = list(butler.registry.queryDatasets('truth_summary'))
print(len(ts_refs))
for i, ref in enumerate(ts_refs):
    if i == 0:
        print(ref.dataId)

I'm not sure what the `storageClass` parameter means or why it's needed, but I also got this from Dan.

In [None]:
dataId = {'skymap': 'DC2', 'tract': 2723}
truth_summary = butler.get('truth_summary', dataId=dataId, storageClass="ArrowAstropy")

Option to show full `truth_summary` or just the columns.

In [None]:
# truth_summary

In [None]:
truth_summary.columns

### The match tables

They do contain `match_objectId` columns **BUT** there is no guarantee that these are the same `objectId` as in the butler repo and collection for the DRP processing we were using in the first notebook.

Plus, the match tables cover only two tracts, so I think these aren't to be used.
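One quick way to probe the `objectId` question would be to measure how many of the match IDs appear in the `object` table from the first notebook. The sketch below uses toy arrays; in practice the inputs would presumably be the `match_objectId` column of a match table and the `objectId` column of the `object` table, and the helper name is made up.

```python
import numpy as np


def id_overlap_fraction(match_ids, object_ids):
    """Return the fraction of match_ids that also appear in object_ids."""
    match_ids = np.asarray(match_ids)
    object_ids = np.asarray(object_ids)
    if match_ids.size == 0:
        return 0.0
    return float(np.mean(np.isin(match_ids, object_ids)))


# Toy IDs (made up): three of the four match IDs exist in the object table.
toy_match_ids = np.array([101, 102, 103, 999])
toy_object_ids = np.array([100, 101, 102, 103, 104])
print(id_overlap_fraction(toy_match_ids, toy_object_ids))  # 0.75
```

A low fraction would show that the IDs come from a different processing run. A high fraction would not prove the opposite on its own, so comparing coordinates for the shared IDs would be a stronger check.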

In [None]:
# Same query for each of the three match dataset types.
match_types = ['match_ref_truth_summary_objectTable_tract',
               'match_target_truth_summary_objectTable_tract',
               'matched_truth_summary_objectTable_tract']
for match_type in match_types:
    mts_refs = list(butler.registry.queryDatasets(match_type))
    print(len(mts_refs))
    for ref in mts_refs:
        print(ref.dataId)

In [None]:
dataId = {'skymap': 'DC2', 'tract': 3828}

Take a peek at each of the three versions of the matched tables. They do have `match_objectId`.

In [None]:
matched_truth_summary = butler.get('match_ref_truth_summary_objectTable_tract', dataId=dataId, storageClass="ArrowAstropy")

In [None]:
# matched_truth_summary
matched_truth_summary.columns

In [None]:
matched_truth_summary = butler.get('match_target_truth_summary_objectTable_tract', dataId=dataId, storageClass="ArrowAstropy")

In [None]:
# matched_truth_summary
matched_truth_summary.columns

In [None]:
matched_truth_summary = butler.get('matched_truth_summary_objectTable_tract', dataId=dataId, storageClass="ArrowAstropy")

In [None]:
# matched_truth_summary
matched_truth_summary.columns