## ZTF - Data Processing

In this notebook, we query the local UW/DIRAC database for ZTF alerts and process them into a format that can be used by THOR. 

The resulting processed data files can be downloaded [here](https://dirac.astro.washington.edu/~moeyensj/projects/thor/paper1/data/ztf).

In [13]:
import os
import glob
import numpy as np
import pandas as pd
import sqlite3 as sql

import mysql.connector as mariadb
from astropy.time import Time

In [14]:
os.nice(2)

6

## Data Processing

Here we connect to the alert database and query it for two weeks of observations from night ID 610 up to and including night 624. 

A description of the format of the alerts can be found here: https://zwickytransientfacility.github.io/ztf-avro-alert/schema.html
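
For orientation, the sketch below converts a night ID to a calendar date. It assumes that `nid` counts days since 2017-01-01, which is an inference from the `nid` and `mjd_utc` values shown later in this notebook rather than something stated in the schema; `nid_to_date` is a hypothetical helper.

```python
from datetime import date, timedelta

# Assumption: night ID 0 corresponds to 2017-01-01
NID_EPOCH = date(2017, 1, 1)

def nid_to_date(nid):
    """Convert a ZTF night ID to a calendar date under the assumption above."""
    return NID_EPOCH + timedelta(days=int(nid))

first_night = nid_to_date(610)
last_night = nid_to_date(624)
print(first_night, last_night)  # 2018-09-03 2018-09-17
print((last_night - first_night).days + 1, "nights, inclusive")  # 15 nights, inclusive
```

Under that assumption the range 610 to 624 spans 15 nights when both endpoints are included.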

In [15]:
# Connect to database
con = mariadb.connect(user='ztf', database='ztf')

In [16]:
# Key dates for solar system object (SSO) alerts; only alerts issued after the photometry fix are used below
sso_alert_fix_date1 = Time('2018-05-16T23:30:00', format='isot', scale='utc') # first attribution fix
sso_alert_fix_date2 = Time('2018-06-08T23:30:00', format='isot', scale='utc') # second attribution fix
sso_alert_phot_fix_date = Time('2018-06-18T23:30:00', format='isot', scale='utc') # photometry fix date

In [17]:
# Only consider alerts post photometry fix
jd_good = sso_alert_phot_fix_date.jd
#ssdistnr >= 0 
df = pd.read_sql_query('select distinct nid from alerts where jd > {}'.format(jd_good), con)
print(len(df))



497


In [8]:
# Set the night range (the nights were picked by looking for an average two week period 
# in terms of the alert volume)
night_range = [610, 624]
df = pd.read_sql_query('select * from alerts where nid >= {} and nid <= {}'.format(*night_range), con)
print(len(df))



4966353


In [9]:
df.sort_values(by=["jd"], inplace=True)
df.reset_index(inplace=True)

Only keep observations with a real-bogus score of at least 0.5 that have been detected at most 4 times in the same area (this removes static sources).
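
To make the effect of this cut concrete, here is a minimal sketch on a toy table. The values are made up for illustration; only the column names `rb` and `ndethist` and the thresholds come from this notebook.

```python
import pandas as pd

# Toy alerts (made-up values) with the two columns used by the cut
toy = pd.DataFrame({
    "candid": [1, 2, 3, 4],
    "rb": [0.95, 0.30, 0.50, 0.80],
    "ndethist": [2, 1, 4, 12],
})

# Keep alerts that look real (rb >= 0.5) and have few prior detections (ndethist <= 4)
keep = (toy["rb"] >= 0.5) & (toy["ndethist"] <= 4)
filtered = toy[keep]
print(filtered["candid"].tolist())  # [1, 3]
```

Alert 2 is dropped for its low real-bogus score, and alert 4 because it has been detected many times at the same position.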

In [4]:
df = df[(df["rb"] >= 0.5) & (df["ndethist"] <= 4)]
len(df)

NameError: name 'df' is not defined

In [5]:
df.to_csv("ztf_observations_610_624.csv", index=False, sep=" ")

NameError: name 'df' is not defined

## Preprocess Observations

Because the ZTF alert stream is no longer running, we use a two-week window of archival data (a little under 5 million observations) to develop code that extracts the information we need and puts it in the correct format for the precovery search.

In [None]:
observations = pd.read_csv(
    "ztf_observations_610_624.csv",
    sep=" ",
    index_col=False,
    low_memory=False
)
observations.sort_values(by="jd", inplace=True)

# All ZTF observations share the same observatory code (I41)
observations["observatory_code"] = "I41"
# Convert Julian dates to Modified Julian dates (UTC)
observations["mjd_utc"] = Time(
    observations["jd"],
    scale="utc",
    format="jd"
).utc.mjd
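
The `mjd_utc` column is just the Julian date shifted by a fixed offset, MJD = JD - 2400000.5. The sketch below does that arithmetic by hand as a sanity check. The sample JD is an illustrative value chosen to match the first `mjd_utc` displayed in the table below, and `jd_to_mjd` is a hypothetical helper, not part of astropy.

```python
JD_MINUS_MJD = 2400000.5  # definition of the Modified Julian Date

def jd_to_mjd(jd):
    """Convert a Julian date to a Modified Julian date (same time scale)."""
    return jd - JD_MINUS_MJD

jd_example = 2458364.630486  # illustrative value, not read from the database
mjd_example = jd_to_mjd(jd_example)
print(round(mjd_example, 6))  # 58364.130486
```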

In [3]:
len(observations)

4966353

In [4]:
observations.head()

Unnamed: 0,objectId,jd,fid,pid,diffmaglim,programid,candid,isdiffpos,tblid,nid,...,clrrms,neargaia,neargaiabright,maggaia,maggaiabright,exptime,drb,drbversion,observatory_code,mjd_utc
88437,ZTF18abdsqbl,2458365.0,2,610130484415,19.2443,1,610130484415010015,f,15,610,...,0.197519,0.261128,0.261128,12.4554,12.4554,,,,I41,58364.130486
91863,ZTF18abdysxo,2458365.0,2,610130481215,19.1212,1,610130481215010012,f,12,610,...,0.149188,0.461806,0.461806,12.8894,12.8894,,,,I41,58364.130486
91864,ZTF18abdytdq,2458365.0,2,610130481215,19.1212,1,610130481215010007,f,7,610,...,0.149188,2.92264,21.9948,17.8857,13.5448,,,,I41,58364.130486
91865,ZTF18abslvpe,2458365.0,2,610130481215,19.1212,1,610130481215015021,t,21,610,...,0.149188,0.20871,50.003,15.6539,12.0464,,,,I41,58364.130486
91866,ZTF18ablqnbj,2458365.0,2,610130483515,19.3369,1,610130483515015051,t,51,610,...,0.25369,9.44552,35.2857,19.1802,13.5749,,,,I41,58364.130486


In [5]:
def fixZTFDesignations(ssnamenr):
    try: 
        # eg. 401811 -> 401811
        designation = str(int(ssnamenr)) 
    except ValueError:  # not a purely numeric designation
        if len(ssnamenr) <= 4:
            # eg. 173P -> 173P
            designation = ssnamenr
        elif ssnamenr[1] == "/":
            # eg. C/2012A2 -> C/2012 A2
            designation = "{} {}".format(ssnamenr[:6], ssnamenr[6:])
        else:
            # eg. 2008SO196 -> 2008 SO196, 2007UJ07 ->  2007 UJ7
            if int(ssnamenr[6:]) == 0:
                n = ""
            else:
                n = str(int(ssnamenr[6:]))
            designation = "{} {}{}".format(ssnamenr[:4], ssnamenr[4:6], n)
    return designation

has_name = ~observations["ssnamenr"].isna()
observations.loc[has_name, "ssnamenr_fixed"] = observations.loc[has_name, "ssnamenr"].apply(fixZTFDesignations)
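
As a quick check of the designation rules, the sketch below restates them in a compact form and runs them on the examples from the comments. `fix_designation` is a hypothetical stand-in that assumes string inputs; the guard for an empty cycle count is an added assumption, not something the original function handles.

```python
def fix_designation(ssnamenr):
    """Hypothetical restatement of the rules in fixZTFDesignations (string input)."""
    if ssnamenr.isdigit():
        return str(int(ssnamenr))                # numbered object: 401811
    if len(ssnamenr) <= 4:
        return ssnamenr                          # periodic comet: 173P
    if ssnamenr[1] == "/":
        return f"{ssnamenr[:6]} {ssnamenr[6:]}"  # comet: C/2012A2 -> C/2012 A2
    year, letters, cycle = ssnamenr[:4], ssnamenr[4:6], ssnamenr[6:]
    # Added guard (assumption): an empty cycle count is treated like zero
    number = str(int(cycle)) if cycle and int(cycle) != 0 else ""
    return f"{year} {letters}{number}"

examples = ["401811", "173P", "C/2012A2", "2008SO196", "2007UJ07"]
for raw in examples:
    print(raw, "->", fix_designation(raw))
```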

Let's take a look at some of the unique designations that were fixed:

In [6]:
observations[(observations["ssnamenr_fixed"] != observations["ssnamenr"]) & (~observations["ssnamenr"].isna())][["ssnamenr", "ssnamenr_fixed"]].drop_duplicates()

Unnamed: 0,ssnamenr,ssnamenr_fixed
91075,2010PJ64,2010 PJ64
89081,2006EU15,2006 EU15
88705,2014QO465,2014 QO465
90398,2014HQ45,2014 HQ45
90195,2015RK18,2015 RK18
...,...,...
4957514,2010EF170,2010 EF170
4957893,C/2015D3,C/2015 D3
4962085,2007VK77,2007 VK77
4964017,2008YK167,2008 YK167


In [37]:
df = pd.DataFrame(observations)

In [67]:
# Rename columns to the names the precovery code expects.
# rename(..., inplace=True) modifies df directly, so the result is not assigned to a new variable.
# Keys that are not columns of df are ignored by rename's default errors="ignore".

df.rename(columns={
    "candid": "obs_id",
    "mjd": "mjd_utc",
    "RA_deg": "ra",
    "decl": "dec",
    "fid": "filter",
    "magpsf": "mag",
    "Mag_sigma": "mag_sigma",
    "obj_id": "ssnamenr_fixed",
}, inplace=True)

In [81]:
# Count observations per designation to get a quick overview of which objects are detected in this dataframe

df.ssnamenr_fixed.value_counts()

453781       103
244616        98
488511        93
415813        87
76864         87
            ... 
211251         1
184541         1
2015 RP45      1
7085           1
172418         1
Name: ssnamenr_fixed, Length: 64943, dtype: int64
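
The column renaming above relies on two details of `DataFrame.rename` that are easy to miss: keys that do not match any column are skipped by default, and the non-in-place form returns a new frame. The sketch below shows both on a toy frame with made-up values; the `not_a_column` key is purely illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"candid": [10, 11], "fid": [1, 2], "magpsf": [19.7, 18.9]})

# "not_a_column" does not exist in the frame and is ignored (errors="ignore" is the default)
mapping = {"candid": "obs_id", "fid": "filter", "magpsf": "mag", "not_a_column": "unused"}

# Without inplace=True, rename returns a new frame and leaves the original untouched
renamed = toy.rename(columns=mapping)
print(list(renamed.columns))  # ['obs_id', 'filter', 'mag']
print(list(toy.columns))      # ['candid', 'fid', 'magpsf']
```

With `inplace=True` the frame itself is modified instead, and the pandas documentation describes the return value as `None`, so there is nothing useful to assign.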

In [69]:
# Pull the columns we need into a separate dataframe that we will eventually index.
# .copy() makes df2 independent of df, so the assignments below do not trigger SettingWithCopyWarning.

df2 = df[['mjd_utc','ra','dec','mag','mag_sigma','observatory_code','obs_id','filter']].copy()

In [73]:
# ZTF does not provide these values, so fill them with placeholders that still let the precovery code run

df2["ra_sigma"] = np.nan
df2["dec_sigma"] = np.nan
df2["exposure_id"] = ""

In [74]:
df2["obs_id"] = df["obs_id"].astype(str)
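
Assigning new columns to a frame obtained by selecting columns from another frame can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of the usual remedy, taking an explicit `.copy()` first, is shown below on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({"ra": [255.3, 254.7], "dec": [-23.1, -26.7], "rb": [0.9, 0.8]})

# Take an explicit copy of the selected columns so later assignments
# act on an independent frame rather than on a possible view of `full`
subset = full[["ra", "dec"]].copy()
subset["ra_sigma"] = np.nan
subset["exposure_id"] = ""

print(list(subset.columns))  # ['ra', 'dec', 'ra_sigma', 'exposure_id']
print(list(full.columns))    # ['ra', 'dec', 'rb']
```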


In [157]:
df2

Unnamed: 0,mjd_utc,ra,dec,mag,mag_sigma,observatory_code,obs_id,filter,ra_sigma,dec_sigma,exposure_id
88437,58364.130486,255.347544,-23.059466,14.8516,0.034435,I41,610130484415010015,2,,,
91863,58364.130486,255.311177,-26.697539,15.5002,0.087279,I41,610130481215010012,2,,,
91864,58364.130486,255.077637,-26.542553,16.4782,0.044278,I41,610130481215010007,2,,,
91865,58364.130486,254.708502,-26.721381,17.4692,0.114528,I41,610130481215015021,2,,,
91866,58364.130486,261.051009,-23.852801,15.8831,0.038632,I41,610130483515015051,2,,,
...,...,...,...,...,...,...,...,...,...,...,...
4966036,58378.525845,92.689451,39.997552,17.2580,0.109823,I41,624525841615015021,2,,,
4966035,58378.525845,87.710153,40.392751,18.2231,0.085428,I41,624525842415010000,2,,,
4966034,58378.525845,88.017326,40.350214,19.1111,0.150922,I41,624525842415015003,2,,,
4966046,58378.525845,88.260257,38.084976,17.7147,0.070280,I41,624525840815015038,2,,,


In [166]:
mjd_test = df2["mjd_utc"].values
df2["mjd_utc"].value_counts()

58365.280474    21149
58365.290486    14487
58365.290023    13779
58365.287604    12261
58365.290949    10547
                ...  
58371.287593        1
58371.288495        1
58372.426146        1
58368.304271        1
58378.120891        1
Name: mjd_utc, Length: 7252, dtype: int64

In [78]:
# Save the dataframe as HDF5, the format we need in order to index the data

df2.to_hdf('ztf_observations_610_624.h5', key = 'data', mode='w', format='table', encoding = 'utf-8')

## Precovery Search

Once the observations have been indexed, we can begin to search for objects. This example uses the indexed database in the `test2` directory: we supply seven initial conditions (six Keplerian elements and an epoch) and try to match the resulting trajectory against the observations in the dataset.

In [1]:
from precovery.orbit import Orbit, EpochTimescale
from precovery import precover
from astropy.time import Time

t0 = Time([2459800.5], scale="tdb", format="jd")
t0_mjd_utc = t0.utc.mjd

In [7]:
DB_DIR = "/epyc/ssd/users/paulob14/precovery/scripts/test2/"

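orbit = Orbit.keplerian(
    0,                   # orbit ID
    2.269057465131142, 0.1704869454928905, 21.27981352885659,  # a (au), e, i (deg)
    281.533811391701, 7.854179343480579, 98.55494515731131,    # node, arg. of periapsis, mean anomaly (deg)
    t0_mjd_utc,          # epoch (MJD)
    EpochTimescale.UTC,  # time scale of the epoch
    20,                  # absolute magnitude H
    0.15                 # slope parameter G
)

# Search the indexed observations; the tolerance of 10/3600 deg corresponds to 10 arcseconds
results = precover(orbit, DB_DIR, tolerance=10/3600)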

In [8]:
# The search returned roughly 100 candidate observations that could belong to this object (rows 0 to 101 below)

results

Unnamed: 0,mjd_utc,ra_deg,dec_deg,ra_sigma_arcsec,dec_sigma_arcsec,mag,mag_sigma,filter,obscode,exposure_id,observation_id,healpix_id,pred_ra_deg,pred_dec_deg,pred_vra_degpday,pred_vdec_degpday,delta_ra_arcsec,delta_dec_arcsec,distance_arcsec,dataset_id
0,58364.249826,341.413612,32.109401,,,19.7743,0.129308,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58364.249826400075,610249823015015023,3521276733,341.413647,32.109391,-0.312044,0.030979,0.125663,-0.037068,0.112711,ZTF
1,58364.250289,341.413371,32.109306,,,19.8607,0.203569,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58364.25028940011,610250281915015001,3521276733,341.413502,32.109405,-0.312061,0.030960,0.471335,0.356912,0.535515,ZTF
2,58364.266204,341.408527,32.109863,,,19.6791,0.138827,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58364.2662037001,610266203015015024,3521276820,341.408531,32.109892,-0.312586,0.030271,0.016794,0.104030,0.104998,ZTF
3,58364.266667,341.408362,32.109805,,,19.7882,0.178902,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58364.26666670013,610266661915015001,3521276820,341.408387,32.109906,-0.312599,0.030250,0.089046,0.363628,0.371368,ZTF
4,58364.268079,341.407926,32.109931,,,19.6775,0.131921,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58364.268078700174,610268073015015018,3521276820,341.407945,32.109949,-0.312640,0.030188,0.069182,0.063280,0.086245,ZTF
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,58378.258310,337.402391,31.677511,,,18.9634,0.085289,b'\x02\x00\x00\x00\x00\x00\x00\x00',I41,58378.25831020018,624258312215015106,3540816054,337.402411,31.677521,-0.270411,-0.084894,0.070219,0.035802,0.069662,ZTF
99,58378.276898,337.397348,31.675915,,,19.0331,0.093160,b'\x02\x00\x00\x00\x00\x00\x00\x00',I41,58378.27689810051,624276892215015093,3540816045,337.397385,31.675935,-0.270351,-0.085702,0.131990,0.070753,0.132754,ZTF
100,58378.300833,337.390911,31.673868,,,19.7973,0.135989,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58378.30083330022,624300832215015023,3540816384,337.390917,31.673871,-0.270012,-0.086737,0.022775,0.012750,0.023201,ZTF
101,58378.320752,337.385505,31.672118,,,19.7109,0.143625,b'\x01\x00\x00\x00\x00\x00\x00\x00',I41,58378.32075229986,624320752215015044,3540813654,337.385543,31.672135,-0.269509,-0.087582,0.135937,0.062959,0.131713,ZTF
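
To relate the `tolerance` argument to the residual columns in this table, the sketch below computes the on-sky separation between an observed and a predicted position and compares it with the tolerance. It assumes that `tolerance` is expressed in degrees, so that `10/3600` means 10 arcseconds, and it uses the haversine formula as one reasonable choice; this is not necessarily how `precover` computes `distance_arcsec`. `angular_separation_arcsec` is a hypothetical helper.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions (degrees in, arcsec out)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, which stays accurate for very small separations
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600

tolerance_deg = 10 / 3600            # same value passed to precover above
tolerance_arcsec = tolerance_deg * 3600

# Observed and predicted positions copied from row 2 of the results table
sep = angular_separation_arcsec(341.408527, 32.109863, 341.408531, 32.109892)
print(f"{sep:.3f} arcsec, within tolerance: {sep <= tolerance_arcsec}")
```

Because the table shows positions rounded to six decimals, the value computed here agrees with the tabulated `distance_arcsec` only approximately.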


In [174]:
df5 = pd.read_hdf('ztf_observations_610_624.h5', 'data')

In [187]:
df5["exposure_id"] = df5["mjd_utc"].apply(lambda x:str(x))
df5.to_hdf('ztf_observations_610_624.h5', key = 'data', mode='w', format='table', encoding = 'utf-8')
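
The cell above uses the exposure time as a stand-in for an exposure identifier. A minimal sketch of the idea on a toy table with made-up IDs follows; it assumes that no two distinct exposures share the same `mjd_utc`.

```python
import pandas as pd

toy = pd.DataFrame({
    "obs_id": ["a", "b", "c", "d"],
    "mjd_utc": [58364.249826, 58364.249826, 58364.250289, 58364.266204],
})

# Use the exposure time, converted to a string, as the exposure identifier
toy["exposure_id"] = toy["mjd_utc"].apply(str)

# Detections taken at the same time now share one exposure ID
per_exposure = toy.groupby("exposure_id")["obs_id"].count()
print(per_exposure.to_dict())
```

Observations `a` and `b` end up in the same exposure, while `c` and `d` each get their own.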

In [9]:
df[df["objectId"] == "40000"]["obs_id"].values

NameError: name 'df' is not defined

In [15]:
# Pull the exposure IDs of the matched observations
test1 = results["exposure_id"]