# Extracting clean lightcurve and contextual information

Before we can create the features to train our models, we first have to extract the relevant data.
In the `RAW_JSON` directory there should be one file per ATLAS object. 
The schema of the JSON files can be found in the file `schema.json` in this directory. You can also navigate the schema be opening the `schema_doc.html` page in your browser. 



In [181]:
%load_ext autoreload
%autoreload 2


import numpy as np
import pandas as pd
from atlasvras.utils.misc import fetch_vra_dataframe
from atlasvras.utils.jsondata import JsonData
from astropy.time import Time
from astropy import table as astropytable
from astropy import units as u
from tqdm.notebook import tqdm
import os


# THIS COULD BE A DICTIONARY
def determine_alert_type(vra_table_row):
    if vra_table_row.preal == 0.0 and vra_table_row.pgal != 1.0:
        return 'garbage'
    elif vra_table_row.preal == 0.0 and vra_table_row.pgal == 1.0:
        return 'pm'
    elif vra_table_row.preal == 1.0 and vra_table_row.pgal == 1.0:
        return 'galactic'
    elif vra_table_row.preal == 1.0 and vra_table_row.pgal !=1.0:
        return "good"

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. The VRA data and the human labels

To create our data set we are using data that was eyeballed by the ATLAS transient team between 27th March and 16th August 2024. 

These dates were not chosen, they are imposed by practical reasons. 
* **27th March**: When the VRA started ingesting the data from the eyeball list _(extra detail: technically it started ingetsing data before that, but on the 27th of March we set the RB threshold of 0.2 to match the eyeball list, if that doesn't mean anything to you, you don't need to know what that means)_. 

* **13th August**: When we began using the first VRA prototype and the new eyeballing policies. After this data the human decisions (and decisions in general) are affected by the VRA. Accounting for the human-machine interactions will be discussed separately later. At this stage we only use data that was not affected by previous iterations of the VRA. 

The puprose of the VRA Scores (`tcs_vra_scores`) table is to record the provenance of the alerts from when they first enter the eyeball list until they get labeled into the'Good' , 'Garbage', 'PM', or 'Attic' lists. It keeps a record of when the alerts get updated scores and when they get labeled by humans. **Because it records human decisions, it is where we find our labels**. 

Note that you will see below the `.csv` files:
* `vra_with_decisions.csv`

It is **not the full VRA Scores table**, it is a record of the rows related to human decisions. It is recorded everyday in a cron job. 

In [102]:
vra_decisions = pd.read_csv('vra_with_decisions.csv')

In [103]:
vra_decisions.head()

Unnamed: 0,transient_object_id,id,preal,pgal,pfast,timestamp,apiusername,username,debug
0,1042208690433743600,356019,0.0,,,2024-03-27T08:23:07Z,,ken.smith,False
1,1054904200253920700,356020,0.0,,,2024-03-27T09:07:36Z,,julian.sommer,False
2,1115418370455859600,356021,1.0,0.0,,2024-03-27T09:43:08Z,,ken.smith,False
3,1175823720535859600,356022,0.0,,,2024-03-27T09:54:52Z,,ken.smith,False
4,1183316590471105200,356023,0.0,,,2024-03-27T09:55:15Z,,ken.smith,False


### 1.1 VRA Scores table columns

**NOTE**: If you don't know what the scores or ranks are refering to here, give the paper a scan first ;) 

* `transient_object_id`: 19 digit ATLAS ID of the object
* `id`: Unique row ID in the `tcs_vra_scores` tables
* `preal`: Real score
* `pgal`: Galactic score 
* `pfast`: Fast score
* `timestamp`: when the row was recorded in the `tcs_vra_scores` table
* `apiusername`: username associated with the token (non null if the row is filled via API)
* `username`: username of the person making a decision (non null if row filled via webserver interaction)
* `debug`: Debug flag

_Note that after the 13th of August 2024 we added the `rank`, `rank_alt1`, and `rank_alt2` to also record the ranks, but these data predate this change in the data structure_

### 1.2 Get the VRA Scores rows associated with these objects 

We have a record of the human decisions made, but we need to know when each object first entered the eyeball list, we might also want to track if there were changes in the human decisions 

In [104]:
# threshold is 27th of march 
vra_df = fetch_vra_dataframe(datethreshold='2024-03-27') 
vra_df = vra_df.set_index('transient_object_id')
vra_df.head()

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1044002490034041600,356018,0.999882,,,,,,2024-03-27T07:43:20Z,vra,,False
1042208690433743600,356019,0.0,,,,,,2024-03-27T08:23:07Z,,ken.smith,False
1054904200253920700,356020,0.0,,,,,,2024-03-27T09:07:36Z,,julian.sommer,False
1115418370455859600,356021,1.0,0.0,,,,,2024-03-27T09:43:08Z,,ken.smith,False
1175823720535859600,356022,0.0,,,,,,2024-03-27T09:54:52Z,,ken.smith,False


In [105]:
# Only select the rows corresponding to the objects we have in our vra_decisions df
unique_atlas_ids = list(set(vra_decisions.transient_object_id.values))
vra_obj_with_decisions = vra_df.loc[unique_atlas_ids]
del vra_df # can delete this big table now

In [106]:
vra_obj_with_decisions

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1182241921472921600,369209,0.873828,,,,,,2024-04-05T16:12:26Z,vra,,False
1182241921472921600,372145,0.000000,,,,,,2024-04-08T08:04:27Z,,xinyue.sheng,False
1165403920452812800,387155,0.207479,,,,,,2024-04-20T17:08:13Z,vra,,False
1165403920452812800,387380,0.000000,,,,,,2024-04-20T19:56:22Z,,david.young,False
1182934220174131200,442022,0.016087,0.054425,,,,,2024-07-01T18:16:55Z,vra,,False
...,...,...,...,...,...,...,...,...,...,...,...
1193106761111437300,475191,1.000000,0.000000,,,,,2024-07-24T13:31:21Z,,adam.wilson,False
1184704250083934200,408877,0.545049,,,,,,2024-05-21T02:55:55Z,vra,,False
1184704250083934200,409181,0.000000,,,,,,2024-05-21T12:38:57Z,,aysha.aamer,False
1123748600415125500,462848,0.866066,0.489480,,,,,2024-07-13T16:39:22Z,vra,,False


Now for each object we have an least one row with `apiusername` is `vra` and `username` is NaN before the decision is made. The first row indicates when the oject entered the eyeball list. 

### 1.3 Find MJD of initialisation in eyeball list. 

In [107]:
# to find the first row for each object we select only the rows filled by the VRA (not a human)
# and we only keep the first one as it's always the initialisation
vra_entries_first = vra_obj_with_decisions[vra_obj_with_decisions.apiusername=='vra'
                                          ].reset_index().drop_duplicates('transient_object_id', 
                                                                          keep='first').set_index('transient_object_id')

# note, the convoluted reseting of the index to drop the duplicates _is_ necessary afaik. 
# Couldn't get it to work proper otherwise

In [108]:
vra_entries_first

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1182241921472921600,369209,0.873828,,,,,,2024-04-05T16:12:26Z,vra,,False
1165403920452812800,387155,0.207479,,,,,,2024-04-20T17:08:13Z,vra,,False
1182934220174131200,442022,0.016087,0.054425,,,,,2024-07-01T18:16:55Z,vra,,False
1020422041373900800,482808,0.278307,0.609381,,,,,2024-08-04T07:46:25Z,vra,,False
1164258191011741700,381800,0.974762,,,,,,2024-04-17T07:44:12Z,vra,,False
...,...,...,...,...,...,...,...,...,...,...,...
1123302210435153900,459155,0.146295,0.101667,,,,,2024-07-13T16:36:31Z,vra,,False
1161507181441318900,368789,0.345752,,,,,,2024-04-05T14:00:24Z,vra,,False
1193106761111437300,475162,0.570454,0.442855,,,,,2024-07-24T09:41:27Z,vra,,False
1184704250083934200,408877,0.545049,,,,,,2024-05-21T02:55:55Z,vra,,False


In [109]:
# ADD MJD_INIT TO VRA_ENTRIES_FIRST
timestamps_init = Time(pd.to_datetime(vra_entries_first.timestamp.values  # parse np.array of str into a datetime
                                ).values # have to do this otherwise get a DateTimeIndex not a numpy array
                 ) # making this Time object allows us to easily convert to mjd

vra_entries_first['mjd_init'] = timestamps_init.mjd

In [110]:
vra_entries_first.head()

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_init
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1182241921472921600,369209,0.873828,,,,,,2024-04-05T16:12:26Z,vra,,False,60405.675301
1165403920452812800,387155,0.207479,,,,,,2024-04-20T17:08:13Z,vra,,False,60420.714039
1182934220174131200,442022,0.016087,0.054425,,,,,2024-07-01T18:16:55Z,vra,,False,60492.761748
1020422041373900800,482808,0.278307,0.609381,,,,,2024-08-04T07:46:25Z,vra,,False,60526.3239
1164258191011741700,381800,0.974762,,,,,,2024-04-17T07:44:12Z,vra,,False,60417.322361


### 1.4 Looking at the decisions/labels and recorindg their MJD

This is not necessary for training purposes, but it's used for some of our summary plots that show on what timescales humans make decisions.

In [111]:
vra_all_decisions = vra_obj_with_decisions[~vra_obj_with_decisions.username.isna()]

In [112]:
mask_possible_list = (vra_all_decisions.preal==0.5) 
vra_first_decisions = vra_all_decisions[~mask_possible_list].reset_index().drop_duplicates('transient_object_id', keep='first').set_index('transient_object_id')
vra_last_decisions = vra_all_decisions[~mask_possible_list].reset_index().drop_duplicates('transient_object_id', keep='last').set_index('transient_object_id')

We want the MJD of the first decision that is **not sending the alert ot the possible list** (as that is a form of purgatory). 
That's why we excluded the possible list with the mask above. 


In [113]:
vra_first_decisions

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1182241921472921600,372145,0.0,,,,,,2024-04-08T08:04:27Z,,xinyue.sheng,False
1165403920452812800,387380,0.0,,,,,,2024-04-20T19:56:22Z,,david.young,False
1182934220174131200,445158,0.0,,,,,,2024-07-03T11:45:54Z,,charlotte.angus,False
1020422041373900800,483655,0.0,1.0,,,,,2024-08-05T10:54:42Z,,shubham.srivastav,False
1164258191011741700,383019,1.0,1.0,,,,,2024-04-17T09:55:33Z,,david.young,False
...,...,...,...,...,...,...,...,...,...,...,...
1123302210435153900,467319,0.0,,,,,,2024-07-14T00:43:41Z,,ken.smith,False
1161507181441318900,368988,0.0,,,,,,2024-04-05T14:08:07Z,,ken.smith,False
1193106761111437300,475191,1.0,0.0,,,,,2024-07-24T13:31:21Z,,adam.wilson,False
1184704250083934200,409181,0.0,,,,,,2024-05-21T12:38:57Z,,aysha.aamer,False


In [114]:
# ADD MJD_INIT to our table that contains the decisions/labels
timestamps_decision = Time(pd.to_datetime(vra_first_decisions.timestamp.values  # parse np.array of str into a datetime
                                ).values # have to do this otherwis get a DateTimeIndex not a numpy array
                 )
vra_first_decisions['mjd_decision'] = timestamps_decision.mjd


In [115]:
vra_obj_with_decisions = vra_obj_with_decisions.join(vra_first_decisions['mjd_decision'])

In [117]:
vra_obj_with_decisions

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_decision
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1182241921472921600,369209,0.873828,,,,,,2024-04-05T16:12:26Z,vra,,False,60408.336424
1182241921472921600,372145,0.000000,,,,,,2024-04-08T08:04:27Z,,xinyue.sheng,False,60408.336424
1165403920452812800,387155,0.207479,,,,,,2024-04-20T17:08:13Z,vra,,False,60420.830810
1165403920452812800,387380,0.000000,,,,,,2024-04-20T19:56:22Z,,david.young,False,60420.830810
1182934220174131200,442022,0.016087,0.054425,,,,,2024-07-01T18:16:55Z,vra,,False,60494.490208
...,...,...,...,...,...,...,...,...,...,...,...,...
1193106761111437300,475191,1.000000,0.000000,,,,,2024-07-24T13:31:21Z,,adam.wilson,False,60515.563438
1184704250083934200,408877,0.545049,,,,,,2024-05-21T02:55:55Z,vra,,False,60451.527049
1184704250083934200,409181,0.000000,,,,,,2024-05-21T12:38:57Z,,aysha.aamer,False,60451.527049
1123748600415125500,462848,0.866066,0.489480,,,,,2024-07-13T16:39:22Z,vra,,False,60504.956065


#### Compare first and last decisions 

Objects will have several rows human decisions if:
* The object is "Good" and then sent to the follow-up list, before being sent back to the "Good" list a few weeks later. Clicking on the good list button will trigger the action that adds a row into the VRA Scores table every time. 

* The object was sent to the garbage (or another list) and then fished out


In the first case, we don't care to ammend the `mjd_decision`. In the latter, we want to make sure the date of the decision reflects the date of the 'correct' or final decision. 


In [118]:
vra_last_decisions['type'] = vra_last_decisions.apply(determine_alert_type, axis=1).values
vra_first_decisions['type'] = vra_first_decisions.apply(determine_alert_type, axis=1).values

In [119]:
vra_last_decisions[vra_last_decisions.type != vra_first_decisions.type]

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,type
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1212430721015349700,426331,1.0,0.0,,,,,2024-06-10T14:54:40Z,,shubham.srivastav,False,good
1080015601165140600,392213,0.0,,,,,,2024-04-27T17:43:49Z,,xinyue.sheng,False,garbage
1230240550374737400,410471,0.0,,,,,,2024-05-23T14:42:41Z,,aysha.aamer,False,garbage
1173110430303259000,437030,1.0,1.0,,,,,2024-06-26T10:43:13Z,,tamay.arnison,False,galactic
1104450010430704800,363643,1.0,0.0,,,,,2024-04-02T14:38:39Z,,shubham.srivastav,False,good
...,...,...,...,...,...,...,...,...,...,...,...,...
1104247301273937600,388641,1.0,0.0,,,,,2024-04-22T23:56:30Z,,ken.smith,False,good
1162029881141619900,472566,1.0,1.0,,,,,2024-07-17T17:25:01Z,,ken.smith,False,galactic
1113535250281131500,363246,0.0,,,,,,2024-04-02T11:10:08Z,,shubham.srivastav,False,garbage
1185600420072651600,363244,1.0,1.0,,,,,2024-04-02T11:10:08Z,,shubham.srivastav,False,galactic


In [120]:
vra_decision_date_to_amend = vra_last_decisions[vra_last_decisions.type != vra_first_decisions.type]

In [121]:
timestamps_decision = Time(pd.to_datetime(vra_decision_date_to_amend.timestamp.values  # parse np.array of str into a datetime
                                ).values # have to do this otherwis get a DateTimeIndex not a numpy array
                 )
vra_decision_date_to_amend['mjd_decision'] = timestamps_decision.mjd

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vra_decision_date_to_amend['mjd_decision'] = timestamps_decision.mjd


In [122]:
vra_decision_date_to_amend

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,type,mjd_decision
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1212430721015349700,426331,1.0,0.0,,,,,2024-06-10T14:54:40Z,,shubham.srivastav,False,good,60471.621296
1080015601165140600,392213,0.0,,,,,,2024-04-27T17:43:49Z,,xinyue.sheng,False,garbage,60427.738762
1230240550374737400,410471,0.0,,,,,,2024-05-23T14:42:41Z,,aysha.aamer,False,garbage,60453.612975
1173110430303259000,437030,1.0,1.0,,,,,2024-06-26T10:43:13Z,,tamay.arnison,False,galactic,60487.446678
1104450010430704800,363643,1.0,0.0,,,,,2024-04-02T14:38:39Z,,shubham.srivastav,False,good,60402.610174
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1104247301273937600,388641,1.0,0.0,,,,,2024-04-22T23:56:30Z,,ken.smith,False,good,60422.997569
1162029881141619900,472566,1.0,1.0,,,,,2024-07-17T17:25:01Z,,ken.smith,False,galactic,60508.725706
1113535250281131500,363246,0.0,,,,,,2024-04-02T11:10:08Z,,shubham.srivastav,False,garbage,60402.465370
1185600420072651600,363244,1.0,1.0,,,,,2024-04-02T11:10:08Z,,shubham.srivastav,False,galactic,60402.465370


In [123]:
# for each object in the table above, 
# update the mjd_decision column in vra_obj_with_decision to this last date. 
for atlas_id in vra_decision_date_to_amend.index.values:
    vra_obj_with_decisions.loc[atlas_id].mjd_decision = vra_decision_date_to_amend.loc[atlas_id].mjd_decision

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vra_obj_with_decisions.loc[atlas_id].mjd_decision = vra_decision_date_to_amend.loc[atlas_id].mjd_decision


### 1.5 Combine tables so we have MJD_init and MJD_decision in the same place


In [140]:
vra_data_with_mjds = vra_obj_with_decisions.join(vra_entries_first['mjd_init'])
vra_data_with_mjds

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_decision,mjd_init
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1182241921472921600,369209,0.873828,,,,,,2024-04-05T16:12:26Z,vra,,False,60408.336424,60405.675301
1182241921472921600,372145,0.000000,,,,,,2024-04-08T08:04:27Z,,xinyue.sheng,False,60408.336424,60405.675301
1165403920452812800,387155,0.207479,,,,,,2024-04-20T17:08:13Z,vra,,False,60420.830810,60420.714039
1165403920452812800,387380,0.000000,,,,,,2024-04-20T19:56:22Z,,david.young,False,60420.830810,60420.714039
1182934220174131200,442022,0.016087,0.054425,,,,,2024-07-01T18:16:55Z,vra,,False,60494.490208,60492.761748
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1193106761111437300,475191,1.000000,0.000000,,,,,2024-07-24T13:31:21Z,,adam.wilson,False,60515.563438,60515.403785
1184704250083934200,408877,0.545049,,,,,,2024-05-21T02:55:55Z,vra,,False,60451.527049,60451.122164
1184704250083934200,409181,0.000000,,,,,,2024-05-21T12:38:57Z,,aysha.aamer,False,60451.527049,60451.122164
1123748600415125500,462848,0.866066,0.489480,,,,,2024-07-13T16:39:22Z,vra,,False,60504.956065,60504.694005


#### Remove the duplicate rows, only keep the decisions a.k.a **labels**

In [141]:
# REMOVING DUPLICATES
vra_data_with_mjds = vra_data_with_mjds[~vra_data_with_mjds.index.duplicated(keep='last')]
vra_data_with_mjds

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_decision,mjd_init
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1182241921472921600,372145,0.0,,,,,,2024-04-08T08:04:27Z,,xinyue.sheng,False,60408.336424,60405.675301
1165403920452812800,387380,0.0,,,,,,2024-04-20T19:56:22Z,,david.young,False,60420.830810,60420.714039
1182934220174131200,445158,0.0,,,,,,2024-07-03T11:45:54Z,,charlotte.angus,False,60494.490208,60492.761748
1020422041373900800,483655,0.0,1.0,,,,,2024-08-05T10:54:42Z,,shubham.srivastav,False,60527.454653,60526.323900
1164258191011741700,383019,1.0,1.0,,,,,2024-04-17T09:55:33Z,,david.young,False,60417.413576,60417.322361
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1123302210435153900,467319,0.0,,,,,,2024-07-14T00:43:41Z,,ken.smith,False,60505.030336,60504.692025
1161507181441318900,368988,0.0,,,,,,2024-04-05T14:08:07Z,,ken.smith,False,60405.588970,60405.583611
1193106761111437300,475191,1.0,0.0,,,,,2024-07-24T13:31:21Z,,adam.wilson,False,60515.563438,60515.403785
1184704250083934200,409181,0.0,,,,,,2024-05-21T12:38:57Z,,aysha.aamer,False,60451.527049,60451.122164


### 1.6 Remove NaN MJD_init
The objecys that got decisions after 27th March but were ingested before will have no mjd\_init and we don't want them in the data set

In [142]:
mask = ~vra_data_with_mjds.mjd_init.isna()
vra_data_with_mjds = vra_data_with_mjds [mask]

### 1.7 Adding column with time delay between decision and initialisation and order 

In [145]:
# ADDING SOME USEFUL COLUMNS
vra_data_with_mjds['ndays_to_decision'] = vra_data_with_mjds.mjd_decision-vra_data_with_mjds.mjd_init
vra_data_with_mjds.sort_values('timestamp', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vra_data_with_mjds['ndays_to_decision'] = vra_data_with_mjds.mjd_decision-vra_data_with_mjds.mjd_init
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vra_data_with_mjds.sort_values('timestamp', inplace=True)


In [146]:
vra_data_with_mjds

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_decision,mjd_init,ndays_to_decision
transient_object_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
1044002490034041600,356049,1.0,1.0,,,,,2024-03-27T12:09:38Z,,shubham.srivastav,False,60396.506690,60396.321759,0.184931
1073142251350304000,356115,0.0,,,,,,2024-03-27T14:08:16Z,,shubham.srivastav,False,60396.589074,60396.541806,0.047269
1171529471411932900,357221,1.0,0.0,,,,,2024-03-28T19:33:20Z,,ken.smith,False,60397.814815,60397.774572,0.040243
1134218301411308400,357222,1.0,0.0,,,,,2024-03-28T19:34:17Z,,ken.smith,False,60397.815475,60397.685579,0.129896
1154114161701812500,357231,1.0,0.0,,,,,2024-03-28T20:50:07Z,,ken.smith,False,60397.868137,60397.685579,0.182558
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1125906141284842600,518446,1.0,0.0,,,,,2024-09-06T10:51:05Z,,stephen.smartt,False,60445.800741,60445.718681,0.082060
1024644880304716500,533313,1.0,0.0,,,,,2024-09-27T10:41:39Z,,matt.nicholl,False,60523.549144,60523.546748,0.002396
1162237411222410800,538596,1.0,0.0,,,,,2024-10-04T11:02:35Z,,ken.smith,False,60519.888171,60519.648912,0.239259
1152158771624822700,538597,1.0,0.0,,,,,2024-10-04T11:06:00Z,,ken.smith,False,60462.281424,60462.197882,0.083542


## 2. Light curve data

**NOTE: You'll need to change the data path to be where you downloaded the RAW JSON files**

In [153]:
RAW_JSON_DATA_PATH = '../../../../Science/VRA_work/data/crabby/new_download_full_data_set/json_files/'
# list all the files in the data_path
json_files = os.listdir(RAW_JSON_DATA_PATH)

We now need to extract the detection and non detection data from our ATLAS API Json files. 
To do this we're going to:
* 1) Load each file into a JsonData objetc from the VRA utilities 
* 2) Parse those in the functions below to extract the light curve data into some nicely formatted tables

In [166]:
def make_detection_table(json_data):
    "Make the detections table from atlasvras.utils JsonData object"
    detections = astropytable.QTable([json_data.get_values(['lc', 'mjd']),
                                      json_data.get_values(['lc', 'mag']) * u.ABmag,
                                      json_data.get_values(['lc', 'magerr']) * u.ABmag,
                                      json_data.get_values(['lc', 'ra']),
                                      json_data.get_values(['lc', 'dec']),
                                      json_data.get_values(['lc', 'filter'])],
                                      names=('mjd', 'mag', 'magerr','ra','dec', 'band')
                                    )
    detections['det'] = True
    return detections

def make_non_detection_table(json_data):
    "Make the non detection table from atlasvras.utils JsonData object"
    nondetections = astropytable.QTable([json_data.get_values(['lcnondets', 'mjd']),
                                      json_data.get_values(['lcnondets', 'mag5sig']) * u.ABmag,
                                      [np.nan]*json_data.get_values(['lcnondets', 'mjd']).shape[0],
                                      [np.nan]*json_data.get_values(['lcnondets', 'mjd']).shape[0],
                                      [np.nan]*json_data.get_values(['lcnondets', 'mjd']).shape[0],
                                      json_data.get_values(['lcnondets', 'filter'])],
                                      names=('mjd', 'mag', 'magerr','ra','dec', 'band')
                                    )
    nondetections['det'] = False
    return nondetections


### 2.1 Make detections data set

In [167]:
# Initialise the first "row" of our detections_data_set before we loop over the whole data

INDEX = vra_data_with_mjds.index.values[0] # grab the first index 
_detections = make_detection_table(JsonData(filename=RAW_JSON_DATA_PATH  + f'{INDEX}.json')
                                  ).to_pandas()

# Add the phase with respect to initial entry into the eyeball list
_detections['phase_init'] = _detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_init
# Add the phase with respect to decision
_detections['phase_decision'] = _detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_decision
# Add the alert type == label
_detections['type'] = determine_alert_type(vra_data_with_mjds.loc[INDEX])
# Add the atlas Id as a string (If not a string sometimes pandas uses scientific notation and mucks up the 19 digit)
_detections['ATLAS_ID'] = str(INDEX)
    
detections_data_set = _detections

In [168]:
detections_data_set

Unnamed: 0,mjd,mag,magerr,ra,dec,band,det,phase_init,phase_decision,type,ATLAS_ID
0,60396.235506,16.807,0.038,70.01019,-3.6786,o,True,-0.086253,-0.271184,galactic,1044002490034041600
1,60396.235506,16.194,0.052,70.0101,-3.67861,o,True,-0.086253,-0.271184,galactic,1044002490034041600
2,60396.235506,16.358,0.054,70.01011,-3.6786,o,True,-0.086253,-0.271184,galactic,1044002490034041600
3,60396.238669,16.815,0.034,70.01049,-3.67817,o,True,-0.08309,-0.26802,galactic,1044002490034041600
4,60396.238669,16.097,0.033,70.01039,-3.67825,o,True,-0.08309,-0.26802,galactic,1044002490034041600
5,60396.243788,16.168,0.035,70.01093,-3.67782,o,True,-0.077971,-0.262902,galactic,1044002490034041600
6,60396.243788,16.73,0.032,70.01099,-3.67779,o,True,-0.077971,-0.262902,galactic,1044002490034041600


That's the start of our dataframe, now we're going to make a whole bunch of these dataframes for each event and then we're going to concatenate them.

_Note: making a list of dataframes and concatenating outside the loop is much faster. Looks dumb but works better than concatenating inside the loop._

**>>> This takes about 5 minutes**

In [174]:
_detections_data_set_lists = []

# For each ATLAS object (except the first one cuz we've already done it)
for INDEX in tqdm(vra_data_with_mjds.index.values[1:]):
    
    # we make our detections table
    _detections = make_detection_table(JsonData(filename=RAW_JSON_DATA_PATH + f'{INDEX}.json')).to_pandas()
    
    # we make our extra columns 
    try: 
        _detections['phase_init'] = _detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_init
        _detections['phase_decision'] = _detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_decision
    except ValueError:
        # and catch pesky errors
        print(INDEX)
        continue
    
    _detections['type'] = determine_alert_type(vra_data_with_mjds.loc[INDEX])
    _detections['ATLAS_ID'] = str(INDEX)
    
    # we add our dataframe to our list 
    _detections_data_set_lists.append(_detections)
    
# then we concatenate! 
detections_data_set = pd.concat(([detections_data_set]+_detections_data_set_lists))
detections_data_set = detections_data_set.reset_index(drop=True)

  0%|          | 0/40878 [00:00<?, ?it/s]

In [175]:
detections_data_set.set_index('ATLAS_ID', inplace=True, drop=True)

In [177]:
detections_data_set

Unnamed: 0_level_0,mjd,mag,magerr,ra,dec,band,det,phase_init,phase_decision,type
ATLAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1044002490034041600,60396.235506,16.807,0.038,70.01019,-3.67860,o,True,-0.086253,-0.271184,galactic
1044002490034041600,60396.235506,16.194,0.052,70.01010,-3.67861,o,True,-0.086253,-0.271184,galactic
1044002490034041600,60396.235506,16.358,0.054,70.01011,-3.67860,o,True,-0.086253,-0.271184,galactic
1044002490034041600,60396.238669,16.815,0.034,70.01049,-3.67817,o,True,-0.083090,-0.268020,galactic
1044002490034041600,60396.238669,16.097,0.033,70.01039,-3.67825,o,True,-0.083090,-0.268020,galactic
...,...,...,...,...,...,...,...,...,...,...
1001145281063846800,60582.445223,19.262,0.184,2.93867,6.64630,o,True,48.709644,48.399042,good
1001145281063846800,60586.455117,19.072,0.171,2.93845,6.64657,o,True,52.719538,52.408936,good
1001145281063846800,60586.554347,19.155,0.198,2.93849,6.64632,o,True,52.818769,52.508167,good
1001145281063846800,60590.397591,18.929,0.155,2.93857,6.64635,o,True,56.662012,56.351410,good


In [178]:
## detections_data_set.to_csv('./clean_data_csv/detections_data_set_NEW.csv', index=True)

In [179]:
del detections_data_set 
# free the memory 

### 2.2 Okay same stuff but for the non detections 

For the non detections we **crop the data at -100 since mjd_init**. We don't use any more data than that in the lightcurve history (see when we make the features), so it's over kill. Plus it ends up filling my RAM and killing my kernel so not worth the trouble. 

**>>>Takes about 8 minutes**

In [182]:
# initialise 
INDEX = vra_data_with_mjds.index.values[0]
_non_detections = make_non_detection_table(JsonData(filename=RAW_JSON_DATA_PATH + f'{INDEX}.json')).to_pandas()
_non_detections['phase_init'] = _non_detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_init
_non_detections['phase_decision'] = _non_detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_decision
_non_detections['type'] = determine_alert_type(vra_data_with_mjds.loc[INDEX])
_non_detections['ATLAS_ID'] = str(INDEX)
    
non_detections_data_set = _non_detections

## Loop

_non_detections_data_set_lists = []
for INDEX in tqdm(vra_data_with_mjds.index.values[1:]):
    _non_detections = make_non_detection_table(JsonData(filename=RAW_JSON_DATA_PATH+ f'{INDEX}.json')).to_pandas()
    try: 
        _non_detections['phase_init'] = _non_detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_init
        _non_detections['phase_decision'] = _non_detections.mjd - vra_data_with_mjds.loc[INDEX].mjd_decision
    except ValueError:
        print(INDEX)
        continue
    
    # HAVE TO CUT TO -100 DAYS OTHERWISE FILL UP MY MEMORY
    _non_detections=_non_detections[_non_detections.phase_init > -100] 
    _non_detections['type'] = determine_alert_type(vra_data_with_mjds.loc[INDEX])
    _non_detections['ATLAS_ID'] = str(INDEX)
    
    # CONCATENATION IN THE LOOP SLOWS DOWN THE FUCKER ENOURMOUSLY
    #non_detections_data_set = pd.concat((non_detections_data_set, _non_detections))
    _non_detections_data_set_lists.append(_non_detections)



  0%|          | 0/40878 [00:00<?, ?it/s]

In [183]:
non_detections_data_set = pd.concat(([non_detections_data_set]+_non_detections_data_set_lists))

In [185]:
non_detections_data_set.reset_index(drop=True, inplace=True)
non_detections_data_set.set_index('ATLAS_ID', inplace=True)

In [186]:
non_detections_data_set

Unnamed: 0_level_0,mjd,mag,magerr,ra,dec,band,det,phase_init,phase_decision,type
ATLAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1044002490034041600,58018.532911,19.40,,,,c,False,-2377.788848,-2377.973778,galactic
1044002490034041600,58018.539817,19.40,,,,c,False,-2377.781942,-2377.966872,galactic
1044002490034041600,58018.547139,19.41,,,,c,False,-2377.774620,-2377.959550,galactic
1044002490034041600,58018.553778,19.42,,,,c,False,-2377.767981,-2377.952912,galactic
1044002490034041600,58022.511118,19.30,,,,o,False,-2373.810641,-2373.995572,galactic
...,...,...,...,...,...,...,...,...,...,...
1001145281063846800,60612.525706,16.40,,,,o,False,78.790127,78.479525,good
1001145281063846800,60614.369507,19.43,,,,c,False,80.633928,80.323326,good
1001145281063846800,60614.379289,19.46,,,,c,False,80.643710,80.333108,good
1001145281063846800,60614.383202,19.43,,,,c,False,80.647623,80.337022,good


In [187]:
## non_detections_data_set.to_csv('./clean_data_csv/non_detections_100days_NEW.csv', index=True)

In [188]:
del non_detections_data_set

## 3. Contextual Info

Finally we also need all the contextual information surounding our alert. 
* Location on sky (Ra, Dec)
* Real/Bogus Score form the CNN (rb_pix)
* Sherlock Classification (SN, ORPHAN, NT, etc..)
* Separation from host source 
* Redshift (spectroscopic and/or photometric)

In [189]:
vra_data_with_mjds.index.name='ATLAS_ID'
# need to add a column to contextual_info_data_set that is the label of the alert based on the preal and pgal score in the vra table 
# can use determine_alert_type function
vra_data_with_mjds['type'] = vra_data_with_mjds.apply(determine_alert_type, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  vra_data_with_mjds['type'] = vra_data_with_mjds.apply(determine_alert_type, axis=1)


In [190]:
vra_data_with_mjds

Unnamed: 0_level_0,id,preal,pgal,pfast,rank,rank_alt1,rank_alt2,timestamp,apiusername,username,debug,mjd_decision,mjd_init,ndays_to_decision,type
ATLAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1044002490034041600,356049,1.0,1.0,,,,,2024-03-27T12:09:38Z,,shubham.srivastav,False,60396.506690,60396.321759,0.184931,galactic
1073142251350304000,356115,0.0,,,,,,2024-03-27T14:08:16Z,,shubham.srivastav,False,60396.589074,60396.541806,0.047269,garbage
1171529471411932900,357221,1.0,0.0,,,,,2024-03-28T19:33:20Z,,ken.smith,False,60397.814815,60397.774572,0.040243,good
1134218301411308400,357222,1.0,0.0,,,,,2024-03-28T19:34:17Z,,ken.smith,False,60397.815475,60397.685579,0.129896,good
1154114161701812500,357231,1.0,0.0,,,,,2024-03-28T20:50:07Z,,ken.smith,False,60397.868137,60397.685579,0.182558,good
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1125906141284842600,518446,1.0,0.0,,,,,2024-09-06T10:51:05Z,,stephen.smartt,False,60445.800741,60445.718681,0.082060,good
1024644880304716500,533313,1.0,0.0,,,,,2024-09-27T10:41:39Z,,matt.nicholl,False,60523.549144,60523.546748,0.002396,good
1162237411222410800,538596,1.0,0.0,,,,,2024-10-04T11:02:35Z,,ken.smith,False,60519.888171,60519.648912,0.239259,good
1152158771624822700,538597,1.0,0.0,,,,,2024-10-04T11:06:00Z,,ken.smith,False,60462.281424,60462.197882,0.083542,good


In [191]:
## to save the csv file 
## vra_data_with_mjds.to_csv('./clean_data_csv/vra_last_entry_withmjd_NEW.csv', index=True)


In [192]:
def make_contextual_info_list(json_data):
    _json_data = json_data.data
    _contextual_info_list = [_json_data['object']['id'], 
                             _json_data['object']['ra'], 
                             _json_data['object']['dec'],
                             _json_data['object']['rb_pix'],
                             _json_data['object']['sherlockClassification'],

                             ] 
    try: 
        _extra_context_list = [_json_data['sherlock_crossmatches'][0]['separationarcsec'],
                             _json_data['sherlock_crossmatches'][0]['z'],
                             _json_data['sherlock_crossmatches'][0]['photoz'],

                              ]
    except IndexError:
        _extra_context_list=[np.nan]*3
        
    return _contextual_info_list+_extra_context_list



contextual_info_columns = ['ATLAS_ID', 
                           'ra', 
                           'dec', 
                           'rb_pix', 
                           'sherlockClassification', 
                           'separationarcsec',
                           'z',
                           'photoz'
                          ]

**>>> Takes about 5 minutes**

In [193]:
# Smae logic as before, we initialise a first dataframe then make a list of dataframes for each event and concat
INDEX = vra_data_with_mjds.index.values[0]
_contextual_info = make_contextual_info_list(JsonData(filename=RAW_JSON_DATA_PATH + f'{INDEX}.json'))
_contextual_info = pd.DataFrame([_contextual_info], columns=contextual_info_columns)

contextual_info_data_set = _contextual_info
_contextual_info_data_set_lists = []
for INDEX in tqdm(vra_data_with_mjds.index.values[1:]):
    _contextual_info = make_contextual_info_list(JsonData(filename=RAW_JSON_DATA_PATH + f'{INDEX}.json'))
    _contextual_info = pd.DataFrame([_contextual_info], columns=contextual_info_columns)
    _contextual_info_data_set_lists.append(_contextual_info)

  0%|          | 0/40878 [00:00<?, ?it/s]

In [194]:
contextual_info_data_set = pd.concat(([contextual_info_data_set]+_contextual_info_data_set_lists))   
contextual_info_data_set = contextual_info_data_set.reset_index(drop=True)

  contextual_info_data_set = pd.concat(([contextual_info_data_set]+_contextual_info_data_set_lists))


In [195]:
contextual_info_data_set['ATLAS_ID'] = contextual_info_data_set['ATLAS_ID'].astype(int)
contextual_info_data_set.set_index('ATLAS_ID', inplace=True, drop=True)

In [196]:
# now can join vra_data_with_mjds column type to the contextual_infod_data_set
contextual_info_data_set = contextual_info_data_set.join(vra_data_with_mjds['type'], on='ATLAS_ID')

In [197]:
contextual_info_data_set

Unnamed: 0_level_0,ra,dec,rb_pix,sherlockClassification,separationarcsec,z,photoz,type
ATLAS_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1044002490034041600,70.01039,-3.67825,0.999882,ORPHAN,,,,galactic
1073142251350304000,112.92599,35.05096,0.488873,SN,5.572545,,0.910386,garbage
1171529471411932900,258.87293,41.32595,0.999457,SN,6.837729,,0.039536,good
1134218301411308400,205.57632,41.21878,0.999188,SN,11.058081,,0.061855,good
1154114161701812500,235.30846,70.30349,0.990005,SN,3.425969,,,good
...,...,...,...,...,...,...,...,...
1125906141284842600,194.77560,28.81184,0.996964,SN,3.574264,0.003266,0.409613,good
1024644880304716500,41.68704,-30.78791,0.999909,SN,7.159953,0.015717,,good
1162237411222410800,245.65589,22.40309,0.998987,SN,3.171255,0.036715,0.058734,good
1152158771624822700,230.49495,62.80632,0.965930,SN,4.567470,,,good


In [198]:
## contextual_info_data_set.to_csv('./clean_data_csv/contextual_info_data_set_NEW.csv', index=True)