# Time Correlation Filter Function

The purpose of this notebook is to plan and test a time correlation function that filters sources according to how well their optical and radio data line up in time.

The goal was to collect the optical and radio data for all the eta-v filtered sources (206 of them), and put them into two dataframes: fsd for FINK data and vsd for VAST data. We later extended this to look for all sources in the catalogue that have good overlap in time.

In [None]:
#here are the necessary imports
import os
import sys
import gc
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from io import StringIO
from vasttools.pipeline import Pipeline
from vasttools.query import Query
import Projecttools as pro #brand new module for frequently used code!

%matplotlib inline

In [None]:
cms = pd.read_pickle('Fink_2020_sources_matched_to_VAST_all_sources.pickle')
pro.family_sort(cms)
cms.groupby('family').size().sort_values(ascending=False)

In [None]:
#This will automatically find the pipeline base directory, which would otherwise need to be specified
pipe=Pipeline()
#this way, we can also load specific runs from the VAST pipeline:
my_run=pipe.load_run('tiles_corrected')

In [None]:
#I'm hard-coding the eta and v thresholds here because the eta-v analysis takes a very long time to complete and I already
#have the values:
eta_thresh=2.315552652171963
v_thresh=0.2878888414273631

In [None]:
cms_candidates = pro.eta_v_candidate_filter(cms,my_run,eta_thresh,v_thresh)
cms_candidates.groupby('family').size().sort_values(ascending=False)

I will test this function on the eta-v filtered sources, and also on the full catalogue, to check for any sources, regardless of variability, that have good time overlap between their optical and radio data.

For this to work, both the radio and optical data need to be available for each source. Since the FINK broker limits how many sources can be queried at a time, I've done some "batching": breaking the ID list into batches, running the portal query on each batch, and stitching the results together into a DataFrame.
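
The batching described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual query code: `fetch_batch` is a hypothetical stand-in for whatever function sends one request to the FINK portal, the object IDs are made up, and the default batch size of 100 is an arbitrary assumption, not the broker's real limit.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def iter_batches(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def query_in_batches(
    ids: Sequence[str],
    fetch_batch: Callable[[List[str]], List[Dict]],
    batch_size: int = 100,  # assumed value, not the broker's documented limit
) -> List[Dict]:
    """Run `fetch_batch` on each batch and stitch the returned rows together."""
    rows: List[Dict] = []
    for batch in iter_batches(ids, batch_size):
        rows.extend(fetch_batch(batch))
    return rows


# Hypothetical stand-in for a single portal request: one fake row per ID.
def fake_fetch(batch: List[str]) -> List[Dict]:
    return [{"i:objectId": object_id, "i:jd": 2459000.5} for object_id in batch]


example_ids = [f"ZTF{n:03d}" for n in range(7)]  # made-up IDs for illustration
all_rows = query_in_batches(example_ids, fake_fetch, batch_size=3)
print(len(all_rows))  # 7 rows, fetched as batches of 3, 3 and 1
```

The stitched rows can then be turned into a DataFrame with `pd.DataFrame(all_rows)`; if each batch already comes back as a DataFrame, `pd.concat` does the same job.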

I've also collected batches into groups that can be read in, although the kernel may die while trying to load a large group.

If you've already saved fsd as a CSV file, load it here. Only the ID and Julian Date of each observation in fsd are needed, so to save memory you can tell pandas to read only those columns:

In [None]:
fsd=pd.read_csv('FINK_Source_Data/FSD_1.csv', usecols = ['i:objectId','i:jd'])

In [None]:
fsd

In [None]:
# this converts the 'i:jd' column to MJD (MJD = JD - 2400000.5) and renames the column to 'i:mjd'
fsd['i:jd']=fsd['i:jd']-2400000.5
fsd.rename(columns={"i:jd": "i:mjd"}, inplace=True)

In [None]:
cms_group = cms[cms['objectId'].isin(fsd['i:objectId'])]

In [None]:
# This is then the list of IDs from the group
Idlist=cms_group['objectId']

vsd is then loaded here:

In [None]:
#at the end, I turn the vaex DataFrameLocal object into a pandas DataFrame directly (our list of sources is not that large)
vsi=[]
for i in Idlist:
    y=cms_group[cms_group['objectId'] == i]['matched_id'].astype(int).values[0]
    vsi.append(y)
meas=my_run.measurements
vsd=meas[meas.source.isin(vsi)].to_pandas_df()

#This will convert the 'time' column in vsd into MJD. The difference between JD and MJD is 2400000.5
vsd['time']=vsd['time'].apply(pd.Timestamp.to_julian_date)-2400000.5
vsd.rename(columns={"time": "time_mjd"},inplace=True)
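
As a quick sanity check on the conversions used for fsd and vsd, the sketch below applies the same 2400000.5-day offset to a plain Julian Date and also computes MJD directly from a datetime. The helper names are illustrative only and are not part of Projecttools or vasttools.

```python
from datetime import datetime, timezone

JD_MJD_OFFSET = 2400000.5  # MJD = JD - 2400000.5
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)  # MJD 0.0


def jd_to_mjd(jd: float) -> float:
    """Convert a Julian Date to a Modified Julian Date."""
    return jd - JD_MJD_OFFSET


def datetime_to_mjd(when: datetime) -> float:
    """Convert a timezone-aware datetime to MJD (leap seconds ignored)."""
    return (when - MJD_EPOCH).total_seconds() / 86400.0


j2000 = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
print(jd_to_mjd(2451545.0))    # 51544.5
print(datetime_to_mjd(j2000))  # 51544.5
```

Both routes give the same value, which is what lets the FINK times (stored as JD) and the VAST times (stored as timestamps) be compared on one axis.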

Now that we have cms, fsd and vsd, we can begin to construct our time overlap function:

In [None]:
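#preamble defining an Overlap list that will store Boolean values
Overlap = []
O = 3   #number of consecutive optical observations that define a time window
R = 3   #minimum number of radio observations that must fall inside that window
dt = 0  #tolerance (in days, since all times are in MJD) added to both ends of the window

if O <= 1 or R <= 1:
    raise ValueError('Please choose a minimum optical & radio observation overlap count greater than 1')
if dt < 0:
    raise ValueError('Please choose a dt >= 0')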

#this takes the FINK and VAST IDs from the crossmatch catalogue and puts them into a dataframe,
#resetting the index (drop=True discards the old index instead of keeping it as an 'index' column).
cml = pd.DataFrame({"FINK ID": cms['objectId'], "VAST ID": cms['matched_id']}).reset_index(drop=True)
cml['VAST ID']=cml['VAST ID'].astype(int)

ftd = pd.DataFrame({"i:objectId": fsd['i:objectId'], "i:mjd":fsd['i:mjd']})
vtd = pd.DataFrame({"source": vsd['source'], "time_mjd": vsd['time_mjd']})

#i is the index of the row selected in cml. FINK_ID and VAST_ID are then the FINK and VAST IDs of that row respectively
for i in cml.index.to_list():
    FINK_ID,VAST_ID = cml.loc[i,'FINK ID'], cml.loc[i,'VAST ID']
    
    #these are all the rows in ftd that have the same FINK ID as the selected row in cml, sorted by time
    ftd_temp = ftd[ftd['i:objectId'] == FINK_ID]
    ftd_temp = ftd_temp.sort_values('i:mjd').reset_index(drop=True)
    
    #these are all the rows in vtd that have the same VAST ID as the selected row in cml, sorted by time
    vtd_temp = vtd[vtd['source'] == VAST_ID]
    vtd_temp = vtd_temp.sort_values('time_mjd').reset_index(drop=True)
    
    # j is the jth index in ftd_temp. start is the date at the jth row, end is the date at the row O-1 steps ahead.
    for j in ftd_temp.index.to_list():
        
        #if the window runs past the end of the list and the loop hasn't broken, we haven't found any good overlap.
        if j+O-1 > ftd_temp.index.to_list()[-1]:
            Overlap.append(False)
            break
        
        start, end = ftd_temp.loc[j,'i:mjd'], ftd_temp.loc[j+O-1,'i:mjd']
        
        #this checks which points in vtd_temp are within the range between start and end, +- dt in case an observation is slightly out of this range
        overlap_temp = vtd_temp['time_mjd'].between(start-dt,end+dt)
        
        #If the number of points within that range is >= R, we have good overlap!
        if overlap_temp.sum() >= R:
            Overlap.append(True)
            break
    else:
        #the loop never ran because this source has no optical rows in ftd, so it cannot have overlap.
        #Without this, Overlap would end up shorter than cml and the column assignment below would fail.
        Overlap.append(False)

#we then add this constructed overlap column to cml:
cml['Overlap']=Overlap
print(f'Of the {len(Overlap)} sources that have been analyzed, {Overlap.count(True)} of them have good overlap:')

cml
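
The windowing logic in the cell above can also be checked on toy data without pandas. The sketch below is an independent re-statement of that loop under the same assumptions (O consecutive optical epochs define a window, and at least R radio epochs must fall inside it, padded by dt). `has_time_overlap` is a hypothetical name and is not the Projecttools function, and the epochs are made up.

```python
from bisect import bisect_left, bisect_right
from typing import Sequence


def has_time_overlap(
    optical_mjd: Sequence[float],
    radio_mjd: Sequence[float],
    O: int = 3,
    R: int = 3,
    dt: float = 0.0,
) -> bool:
    """True if some run of O consecutive optical epochs spans >= R radio epochs."""
    if O <= 1 or R <= 1:
        raise ValueError("O and R must both be greater than 1")
    if dt < 0:
        raise ValueError("dt must be >= 0")

    optical = sorted(optical_mjd)
    radio = sorted(radio_mjd)

    # Slide a window over every run of O consecutive optical epochs.
    for j in range(len(optical) - O + 1):
        start, end = optical[j] - dt, optical[j + O - 1] + dt
        # Count radio epochs with start <= t <= end (inclusive, like Series.between).
        n_radio = bisect_right(radio, end) - bisect_left(radio, start)
        if n_radio >= R:
            return True
    return False


optical = [100.0, 101.0, 102.0, 103.0]  # made-up optical epochs (MJD)
print(has_time_overlap(optical, [100.5, 101.0, 101.5]))         # True
print(has_time_overlap(optical, [150.0, 151.0, 152.0]))         # False
print(has_time_overlap(optical, [99.5, 100.5, 102.5]))          # False
print(has_time_overlap(optical, [99.5, 100.5, 102.5], dt=1.0))  # True
```

The last two calls show the effect of dt: radio epochs that sit just outside every optical window are only counted once the window is padded.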

Here is the above procedure wrapped into a single function:

In [None]:
cml = pro.lightcurve_overlap_filter(cms_group,fsd,vsd,O=3,R=3)
cml