# Selection cuts with ZTF and Supernova

In [1]:
import pandas as pd
import numpy as np

from astropy.time import Time

# pip install fink-client
from fink_client.avroUtils import AlertReader

## ZTF dataset

I thought it would be easier to share alert data as Parquet (which is what we do internally in the broker with Spark) for this simple exercise, but it turns out that reading and writing arrays of structs from a Parquet file is not supported in pandas (pyarrow). Alert data is intrinsically nested (the history, for example), so trying to use pandas (or pyarrow) directly is hopeless.

Instead, let's use Apache Avro (the same format as the alerts the broker receives as input). Since Avro is not supported by pandas, we will use tools from [fink-client](https://github.com/astrolabsoftware/fink-client) to manipulate it.

The ZTF dataset consists of 6038 alerts collected in November 2019 (only a small subset of all alerts). We use processed alerts here, namely:
- quality cuts have already been applied, so there should be no bogus alerts.
- science modules have been run, and alerts contain added values (score from the random forest classifier, score from superNNova, cross-match with Simbad).

In [2]:
# You can pass an entire folder, or a single file
r = AlertReader('november_snn_exploration/')

# Initial load takes a few seconds
pdf_ztf = r.to_pandas()

# un-nest some columns
pdf_ztf['neargaia'] = [i['neargaia'] for i in pdf_ztf['candidate']]
pdf_ztf['distpsnr1'] = [i['distpsnr1'] for i in pdf_ztf['candidate']]

print('{} alerts loaded'.format(len(pdf_ztf)))

6038 alerts loaded
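
The un-nesting step above can be wrapped in a small helper when more fields are needed. This is only a sketch: the `unnest` function is hypothetical (it is not part of fink-client), and the toy frame below uses made-up values that merely imitate the dict-valued `candidate` column.

```python
import pandas as pd

# Toy stand-in for the frame returned by AlertReader.to_pandas().
# All values are made up; only the nested layout matters here.
pdf = pd.DataFrame({
    'objectId': ['toy_a', 'toy_b'],
    'candidate': [
        {'neargaia': 12.5, 'distpsnr1': 0.4, 'magpsf': 18.2},
        {'neargaia': 3.1, 'distpsnr1': 7.9, 'magpsf': 19.0},
    ],
})


def unnest(pdf, column, keys):
    """Copy the selected keys of a dict-valued column into top-level columns."""
    out = pdf.copy()
    for key in keys:
        out[key] = [row[key] for row in out[column]]
    return out


flat = unnest(pdf, 'candidate', ['neargaia', 'distpsnr1'])
print(flat[['objectId', 'neargaia', 'distpsnr1']])
```

Only the requested keys are promoted, so the nested column stays available for anything else you may want to extract later.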


## TNS dataset

The TNS dataset contains all SN Ia reported in TNS between August 1st 2019 and January 31st 2020. 

In [5]:
pdf_tns = pd.read_csv('tns_500_snia.csv')

# Select only a few columns
cols = [
    'Disc. Internal Name', 
    'Obj. Type',
    'Redshift', 
    'Discovery Date (UT)', 
    'Discovery Filter']
pdf_tns = pdf_tns[cols]

# Rename object ID column for cross-match later
pdf_tns.rename(columns={'Disc. Internal Name': 'objectId'}, inplace=True)

pdf_tns

| | objectId | Obj. Type | Redshift | Discovery Date (UT) | Discovery Filter |
|---|---|---|---|---|---|
| 0 | ZTF20aalyeut | SN Ia | 0.0864 | 2020-01-31 11:06:38.000 | g-ZTF |
| 1 | ZTF20aajbsxt | SN Ia | 0.0700 | 2020-01-26 08:57:07.000 | g-ZTF |
| 2 | ZTF20aaivego | SN Ia | 0.0900 | 2020-01-26 03:57:59.000 | g-ZTF |
| 3 | ZTF20aaknzba | SN Ia | 0.1100 | 2020-01-30 12:05:07.000 | g-ZTF |
| 4 | ZTF20aakyoez | SN Ia | 0.0410 | 2020-01-30 12:16:34.000 | g-ZTF |
| ... | ... | ... | ... | ... | ... |
| 495 | ZTF19abpelgt | SN Ia | 0.0769 | 2019-08-14 09:27:48.000 | r-ZTF |
| 496 | ZTF19abpgggu | SN Ia | 0.0491 | 2019-08-11 11:37:39.000 | r-ZTF |
| 497 | ZTF19abpbmli | SN Ia | 0.0370 | 2019-08-11 04:05:51.000 | g-ZTF |
| 498 | ZTF19ablusdf | SN Ia | 0.0570 | 2019-08-01 07:33:36.000 | g-ZTF |


## Raw ZTF-TNS crossmatch

Let's see how many SN Ia are present in the ZTF dataset by crossmatching it directly with TNS:

In [6]:
pdf_tns_raw = pd.merge(
    pdf_ztf, 
    pdf_tns, 
    on='objectId', 
    how='left'
)

pdf_tns_raw = pdf_tns_raw.sort_values('snn', ascending=False)
# Keep only the highest-scoring alert per object (rows are sorted by snn)
mask = pdf_tns_raw['objectId'].duplicated(keep='first')
pdf_tns_raw = pdf_tns_raw[~mask]

print('SN Ia found in ZTF[raw] x TNS = {}'.format(len(pdf_tns_raw[pdf_tns_raw['Obj. Type'] == 'SN Ia'])))

SN Ia found in ZTF[raw] x TNS = 9
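
The cell above merges alerts with the catalogue, then keeps one row per object: the one with the highest `snn` score. A minimal sketch of the same pattern on made-up data (object names, scores and the tiny catalogue are all hypothetical), using `drop_duplicates` instead of an explicit mask:

```python
import pandas as pd

# Made-up alerts: several alerts can share the same objectId.
alerts = pd.DataFrame({
    'objectId': ['A', 'A', 'B', 'C', 'C'],
    'snn': [0.2, 0.9, 0.4, 0.7, 0.1],
})

# Made-up catalogue of classified objects.
catalog = pd.DataFrame({
    'objectId': ['A', 'C'],
    'Obj. Type': ['SN Ia', 'SN Ia'],
})

# Left merge keeps every alert; unmatched alerts get a missing 'Obj. Type'.
merged = pd.merge(alerts, catalog, on='objectId', how='left')

# Sort by score, then keep the first (highest-scoring) alert of each object.
best = (
    merged.sort_values('snn', ascending=False)
          .drop_duplicates(subset='objectId', keep='first')
)

n_snia = int((best['Obj. Type'] == 'SN Ia').sum())
print(best)
print('SN Ia objects found = {}'.format(n_snia))
```

Counting after de-duplication gives a number of objects rather than a number of alerts, which is what the printed figure above refers to.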


## Applying selection cuts

Let's mimic the selection cuts that are applied at the end of the broker to select candidates:

In [7]:
# filter 1: cross-match with CDS should return 'Unknown'
f1 = pdf_ztf['cdsxmatch'] == 'Unknown'

# filter 2: SN Ia probability from superNNova should be above 0.5
f2 = pdf_ztf['snn'] > 0.5

# filter 3: the alert should be at least 5 arcsec away from known Gaia or PS1 objects
f3 = (pdf_ztf['neargaia'] > 5) & (pdf_ztf['distpsnr1'] > 5)

# Apply all cuts on the dataset
pdf_ztf_cut = pdf_ztf[f1 & f2 & f3]
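
Before combining the cuts, it can help to see how many alerts each one keeps, alone and cumulatively. The sketch below builds such a "cut flow" on a made-up frame (the values are hypothetical and chosen only to exercise the logic); the same loop should work unchanged with `f1`, `f2` and `f3` from the cell above.

```python
import pandas as pd

# Made-up alerts, with the same column names as the notebook.
pdf = pd.DataFrame({
    'cdsxmatch': ['Unknown', 'Galaxy', 'Unknown', 'Unknown'],
    'snn': [0.8, 0.9, 0.3, 0.7],
    'neargaia': [20.0, 30.0, 15.0, 2.0],
    'distpsnr1': [8.0, 9.0, 6.0, 1.0],
})

# Same three cuts as above, stored by name.
filters = {
    'cdsxmatch == Unknown': pdf['cdsxmatch'] == 'Unknown',
    'snn > 0.5': pdf['snn'] > 0.5,
    'Gaia & PS1 > 5 arcsec': (pdf['neargaia'] > 5) & (pdf['distpsnr1'] > 5),
}

# Apply the cuts one after the other and record the survivors at each step.
passed = pd.Series(True, index=pdf.index)
rows = []
for name, f in filters.items():
    passed = passed & f
    rows.append({
        'cut': name,
        'pass alone': int(f.sum()),
        'pass cumulative': int(passed.sum()),
    })

cut_flow = pd.DataFrame(rows)
print(cut_flow)
```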

Let's now perform the cross-match between the filtered ZTF dataset and the TNS dataset:

In [8]:
pdf_tns_cut = pd.merge(
    pdf_ztf_cut, 
    pdf_tns, 
    on='objectId', 
    how='left'
)

pdf_tns_cut = pdf_tns_cut.sort_values('snn', ascending=False)
# Keep only the highest-scoring alert per object (rows are sorted by snn)
mask = pdf_tns_cut['objectId'].duplicated(keep='first')
pdf_tns_cut = pdf_tns_cut[~mask]

print('Number of candidates = {} ({:.2f}%)'.format(len(pdf_tns_cut), len(pdf_tns_cut)/len(pdf_ztf)*100))
print('SN Ia found in ZTF[filtered] x TNS = {}'.format(len(pdf_tns_cut[pdf_tns_cut['Obj. Type'] == 'SN Ia'])))

cols_to_print = ['objectId', 'snn', 'cdsxmatch', 'neargaia', 'distpsnr1', 'Obj. Type']
pdf_tns_cut[cols_to_print]

Number of candidates = 9 (0.15%)
SN Ia found in ZTF[filtered] x TNS = 0


| | objectId | snn | cdsxmatch | neargaia | distpsnr1 | Obj. Type |
|---|---|---|---|---|---|---|
| 16 | ZTF19acbvpjn | 0.790782 | Unknown | 55.814697 | 5.429846 | |
| 14 | ZTF18accnoli | 0.623538 | Unknown | 23.426741 | 12.460747 | |
| 12 | ZTF19acifduk | 0.614877 | Unknown | 46.745686 | 8.159854 | |
| 1 | ZTF19abzrdup | 0.551782 | Unknown | 53.335396 | 9.025885 | |
| 0 | ZTF19abxwakm | 0.544291 | Unknown | 13.048496 | 7.117801 | |
| 11 | ZTF18aabwcfc | 0.529394 | Unknown | 23.718136 | 7.219947 | |
| 5 | ZTF18aaawpwp | 0.525352 | Unknown | 25.184441 | 6.079544 | |
| 15 | ZTF19acdtnex | 0.507545 | Unknown | 13.258698 | 8.289214 | |
| 8 | ZTF19acifcyu | 0.501077 | Unknown | 29.941349 | 11.612388 | |


With our 3 selection cuts as defined, we missed all the SN Ia contained in the dataset. Tip: you can easily inspect alerts using the ALeRCE web portal (hopefully the one in Fink will be ready soon!), e.g. https://alerce.online/object/ZTF19acdtnex. You can see that some of the candidates could be SN II or SLSN, for example.

Let's see which SN Ia were missed:

In [9]:
d = pdf_tns_cut['objectId'].values
d_raw = pdf_tns_raw['objectId']
mm = pdf_tns_raw[~d_raw.isin(d)]

print('SN Ia missed because of selection cuts = {}'.format(len(mm[mm['Obj. Type'] == 'SN Ia'])))

cols_to_print = ['objectId', 'snn', 'cdsxmatch', 'neargaia', 'distpsnr1', 'Obj. Type']
mm[mm['Obj. Type'] == 'SN Ia'][cols_to_print]

SN Ia missed because of selection cuts = 9


| | objectId | snn | cdsxmatch | neargaia | distpsnr1 | Obj. Type |
|---|---|---|---|---|---|---|
| 1906 | ZTF19acaiylt | 0.804489 | Unknown | 32.663235 | 0.367022 | SN Ia |
| 510 | ZTF19acdtpow | 0.788519 | Unknown | 36.110317 | 0.024942 | SN Ia |
| 5710 | ZTF19acdubmt | 0.777658 | Unknown | 67.412811 | 1.155709 | SN Ia |
| 4877 | ZTF19achtewn | 0.714137 | Unknown | 3.159565 | 3.164753 | SN Ia |
| 4274 | ZTF19acjnatv | 0.29005 | Unknown | 1.558023 | 1.559003 | SN Ia |
| 3626 | ZTF19acgdfjt | 0.22146 | Unknown | 1.232036 | 1.232973 | SN Ia |
| 4384 | ZTF19acgjojy | 0.167177 | Unknown | 57.691498 | 6.473939 | SN Ia |
| 466 | ZTF19acbihyi | 0.124625 | Unknown | 43.971371 | 1.575779 | SN Ia |
| 5011 | ZTF19acmblbi | 0.0 | Galaxy | 58.175545 | 0.445353 | SN Ia |


The top 4 SN Ia were missed because of our filter on Gaia and PS1 (they all fall < 5 arcsec from a PS1 object). The others were missed because they have a low SNN score (which means the alerts have fewer than 3 valid measurements in the 2 bands combined), or because they are close to Gaia or PS1 objects. The last one (ZTF19acmblbi) also fails the CDS cut, since its cross-match returns 'Galaxy'.
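
To make this diagnosis explicit, one can list, for each missed SN Ia, which cuts it fails. The sketch below re-types the values from the table of missed SN Ia above and negates each cut; the short cut labels (`cds`, `snn`, `gaia`, `ps1`) and the `failed_cuts` column are names introduced here for illustration.

```python
import pandas as pd

# Values copied from the table of missed SN Ia above.
missed = pd.DataFrame({
    'objectId': [
        'ZTF19acaiylt', 'ZTF19acdtpow', 'ZTF19acdubmt', 'ZTF19achtewn',
        'ZTF19acjnatv', 'ZTF19acgdfjt', 'ZTF19acgjojy', 'ZTF19acbihyi',
        'ZTF19acmblbi',
    ],
    'snn': [0.804489, 0.788519, 0.777658, 0.714137, 0.29005,
            0.22146, 0.167177, 0.124625, 0.0],
    'cdsxmatch': ['Unknown'] * 8 + ['Galaxy'],
    'neargaia': [32.663235, 36.110317, 67.412811, 3.159565, 1.558023,
                 1.232036, 57.691498, 43.971371, 58.175545],
    'distpsnr1': [0.367022, 0.024942, 1.155709, 3.164753, 1.559003,
                  1.232973, 6.473939, 1.575779, 0.445353],
})

# One boolean column per cut: True means the object FAILS that cut.
# The Gaia/PS1 cut is split in two to see which catalogue is responsible.
failed = pd.DataFrame({
    'cds': missed['cdsxmatch'] != 'Unknown',
    'snn': missed['snn'] <= 0.5,
    'gaia': missed['neargaia'] <= 5,
    'ps1': missed['distpsnr1'] <= 5,
})

# Human-readable list of failed cuts for each object.
missed['failed_cuts'] = [
    ', '.join(name for name, hit in row.items() if hit)
    for _, row in failed.iterrows()
]

print(missed[['objectId', 'failed_cuts']])
print(failed.sum())
```

On these 9 objects the PS1 distance is by far the most frequent reason for rejection, which suggests it is the first cut worth revisiting.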

## Selection cut challenge

Try to define better selection cuts that catch as many SN Ia as possible while keeping the number of candidates to a minimum!
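
To compare attempts, a small scoring helper can be handy. The sketch below is one possible way to do it, not the broker's method: `score_cuts` is a hypothetical function, the frame is made up (one row per object), and "purity" is only relative to the TNS labels, since unlabelled candidates are not necessarily non-SN Ia.

```python
import pandas as pd


def score_cuts(pdf, mask, label_col='Obj. Type', target='SN Ia'):
    """Summarise a boolean selection mask against known labels."""
    is_target = pdf[label_col] == target
    n_cand = int(mask.sum())                 # objects selected by the cuts
    n_found = int((mask & is_target).sum())  # labelled targets among them
    n_total = int(is_target.sum())           # labelled targets in the frame
    return {
        'candidates': n_cand,
        'found': n_found,
        'efficiency': n_found / n_total if n_total else float('nan'),
        'purity': n_found / n_cand if n_cand else float('nan'),
    }


# Made-up frame, one row per object, already cross-matched with labels.
pdf = pd.DataFrame({
    'snn': [0.9, 0.8, 0.6, 0.3, 0.7, 0.2],
    'distpsnr1': [0.4, 6.0, 1.2, 0.9, 8.0, 7.0],
    'Obj. Type': ['SN Ia', 'SN Ia', 'SN Ia', 'SN Ia', None, None],
})

# Two candidate selections to compare.
strict = (pdf['snn'] > 0.5) & (pdf['distpsnr1'] > 5)
loose = pdf['snn'] > 0.5

score_strict = score_cuts(pdf, strict)
score_loose = score_cuts(pdf, loose)
print('strict:', score_strict)
print('loose: ', score_loose)
```

With the notebook's data, the natural input would be the de-duplicated `pdf_tns_raw` frame and a mask built from its own columns.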