(C) 2023 Gaudeor Rudmin

<if you edit this code, add your copyright statement here>

MIT LICENSE

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.



### What this does
This script loads a csv database of astronomical objects, which potentially has duplicates. It assumes that ra and dec are in the ICRS reference frame. The script helps you remove duplicates by finding objects that are close to each other, and labeling them potential duplicates. Then, it searches NED for the two objects' names, and if the name match is successful, assumes the objects are duplicates, and allows you to choose one to keep.

If the name match is unsuccessful, then it allows the user to decide if the two potential duplicates are in fact duplicates, while presenting some data from a NED search based on location and search radius. The user has the option to choose to keep either object or both.

Finally, the new dataset is saved.
### What this does not do

This script is intended to be a final database processing step after other methds of finding and removing duplicates have been applied. If there is a way to be sure that two objects are duplicates based on similar names, then use those methods before employing this script. Ideally, you will not have to use this script to sift through more than a hundred potential duplicates, as it can be tedius.

In [1]:
!pip install ipywidgets
!pip install pandas
!pip install astropy
!pip install numpy
!pip install jupyter_ui_poll


Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
## SETTINGS

search_radius = 1/3600.0 #in degrees
all_surveyed_path = './data/Class_with_bestobjids_rudmingb.csv' # the path of data to be cleansed
cleansed_path = './cleansed_data/Class_with_bestobjids_cleansed.csv' # the path to save cleansed data


In [3]:
import math
import numpy as np
import pandas as pd
import astropy


import astroquery
from astroquery.ipac.ned import Ned
import pandas as pd

def query_ned_by_name(name):
    result_table = None
    success = False
    msg = "Match Found!"
    try:
        result_table = Ned.query_object(name) #astropy.table.Table
        success = True
    except astroquery.exceptions.RemoteServiceError as error:
        msg = error
    except Exception as error:
        msg = error
    return result_table, success, msg


def query_ned_by_coords(pass_ra, pass_dec, pass_radius): # uses deg
    result_table = None
    success = False
    msg = "Match Found!"
    try:

        position = SkyCoord(ra=pass_ra, dec=pass_dec, unit=(u.deg, u.deg), frame='icrs')
        result_table = Ned.query_region(position, radius=pass_radius * u.deg)
        success = True
    except astroquery.exceptions.RemoteServiceError as error:
        msg = error
    except Exception as error:
        msg = error
    return result_table, success, msg



def is_within_rad(ra_1, ra_2, dec_1, dec_2): # not used, astropy's function is used.
    return (search_rad) > (math.sqrt((math.pow((((ra_1 * (math.pi / 180.0)) - (ra_2 * (math.pi / 180.0))) * (math.cos((dec_1 * (math.pi / 180.0))))), 2)) + math.pow(((dec_1 * (math.pi / 180.0)) - (dec_2 * (math.pi / 180.0))), 2))) * (180.0 / math.pi)



# load csv files

all_surveyed = pd.read_csv(all_surveyed_path)



print(all_surveyed)

               SPECOBJID  PLATE    MJD  FIBERID      ALLWISE_SOURCEID  \
0    1004411058695202816    892  52378      394  1212p363_ac51-028528   
1    1008815705866397696    896  52592       34  1334p424_ac51-043038   
2    1030318577393100800    915  52443      437  2102m016_ac51-018319   
3    1032542613816764416    917  52400      336  2147m016_ac51-035564   
4    1033620959912814592    918  52404      163  2163m031_ac51-050889   
..                   ...    ...    ...      ...                   ...   
253   961619990203623424    854  52373      369  2050p060_ac51-000021   
254   964958381393602560    857  52314      226  1139p242_ac51-025314   
255   966142555450271744    858  52316      438  1166p272_ac51-022880   
256   988689966143924224    878  52353      545  1692p514_ac51-048297   
257   993161955097208832    882  52370      430  1800p530_ac51-016072   

    ALLWISE_COADDID        RA        DEC         Z     Z_ERR  ...  \
0     1212p363_ac51  121.7799  36.233470  0.032314  0.

## All Surveyed Data Sample: 

## Column Identifiers
Use the data sample above to record the relevant column headers.

In [5]:
## create a key-lookup table
# Column identifiers for all_surveyed
all_surveyed_cols = dict(
ra = 'RA',
dec = 'DEC',
uid = 'SPECOBJID',
name = 'ALLWISE_SOURCEID')

## Cleansing the data

We want to make sure that the data is clean. We will use NED to make sure that each object is in fact unique in its table, as well as we can.

After starting this cell, scroll down and interact with the dialogs that pop up to confirm data removal choices.

In [6]:
Ask_for_all = True # Even if a name match is automatically found, enabling this allows you to choose which of the duplicates to keep for *every* duplicate



from astropy.coordinates import SkyCoord  # High-level coordinates
from astropy.coordinates import ICRS, Galactic, FK4, FK5  # Low-level frames
from astroquery.ipac.ned import Ned
from astropy.table import Table
from astropy import units as u
from jupyter_ui_poll import ui_events
import sys
from astropy.coordinates import Angle

import ipywidgets as widgets
from ipywidgets import Button
from IPython.display import display
from astropy.table import Table
import time

def cleanse_data(data,keydict):

    
    ra_key = keydict['ra']
    dec_key = keydict['dec']
    uid_key = keydict['uid']
    name_key = keydict['name']

    # first, get objects to compare; find sets of close objects

    ra = data[ra_key].values
    dec = data[dec_key].values

    coords = SkyCoord(ra*u.deg, dec*u.deg)

    # Set the search radius
    seplimit = search_radius * u.deg   

    # Perform the search within the same dataset
    idx1, idx2, sep2d, dist3d = astropy.coordinates.search_around_sky(coords, coords, seplimit)

    sep2d = np.round(sep2d.value,decimals=13) * u.deg #the distance calculations gives different digits at around 15 decimals. We will use separation to eliminate double entries, so some rounding is needed
    # Filter out self-matching results
    mask = sep2d != 0.0 * u.deg
    idx1 = idx1[mask]
    idx2 = idx2[mask]
    sep2d = sep2d[mask]

    dupidx = []
    dupidx2 = []
    dup_sep2d = []

    # filter out one result of each double (which has the exact same separation)
    for i in range(len(idx1)):
        if not(sep2d[i] in dup_sep2d):
            dupidx2.insert(len(dupidx2)+1,idx2[i])
            dupidx.insert(len(dupidx)+1,idx1[i])
            dup_sep2d.insert(len(dup_sep2d)+1,sep2d[i])

    print(dupidx)

    # for i in range(len(dupidx)):
    #     print(f"Duplicate found: idx1={dupidx[i]}, idx2={dupidx2[i]}, sep2d={dup_sep2d[i]}")
    print(str(len(dupidx)+1) + ' "duplicates" found.\nBeginning Analysis ...')

    sep2d = dup_sep2d
    ## Now I have a list of duplicates ... idx1 is the first match of the duplicate, idx2 is the second match index of the duplicate, sep2d is the angular separation of the duplicates.
    ## Next, I need to determine whether each duplicate is in fact a duplicate or not using NED

    keep1flag = [] ## this is a flag determining whether to keep each idx1 index
    keep2flag = []

    for i in range(len(dupidx)): 
        keep1flag.insert(i,False) # init with false, we will start with the assumption that a potential duplicate is not a duplicate. If Ned confirms it is, then let it be a duplicate.
        keep2flag.insert(i,False) # init with false, we will start with the assumption that a potential duplicate is not a duplicate. If Ned confirms it is, then let it be a duplicate.
        
        ## step 1: search NED for the two object names. If there is a definite match, then the object is a duplicate. If there is not an obvious match then we will require manual intervention.
        idxindata1 = dupidx[i]
        idxindata2 = dupidx2[i]


        obj1_data = data.iloc[idxindata1]
        obj2_data = data.iloc[idxindata2]

        obj1_name = obj1_data[name_key]
        obj2_name = obj2_data[name_key]

        print('###########################\n\n')
        print(f'Comparing record [{obj1_data[uid_key]}] ({obj1_name}) against record [{obj2_data[uid_key]}] ({obj2_name})...')
        print("Performing NED Name Search...")

        name_result_table1, success, msg = query_ned_by_name(obj1_name)
        name_result_table2, success, msg = query_ned_by_name(obj2_name)
        finished = False
        if type(name_result_table1)==Table and type(name_result_table2)==Table:
            if len(name_result_table1) == 1 and len(name_result_table2) == 1:
                if name_result_table1['Object Name'][0] == name_result_table2['Object Name'][0]: ## Ned Name match was successful

                    print('Name Match Successful, this is a duplicate!\n\n')
                    if Ask_for_all:

                        user_input = ''
                        # Create GUI widgets
                        total_obj = str(len(dupidx)+1)
                        this_obj = str(i+1)
                        title_widget = widgets.HTML(value=f"<h4>Object Comparison {this_obj} / {total_obj} </h4>")
                        obj1_widget = widgets.HTML()
                        obj2_widget = widgets.HTML()
                        obj_name_search = widgets.HTML(value=f"<h2>Name found by Ned: {name_result_table1['Object Name'][0]}")
                        question_widget = widgets.HTML(value="<h3>The Name match was successful. These are duplicates. Which one do you want to keep?</h3>")
                        # button_keep_Ned = widgets.Button(description="Keep Obj1, using NED's name.")
                        button_keep1 = widgets.Button(description='Keep Obj1')
                        button_keep2 = widgets.Button(description='Keep Obj2')
                        # button_quit = widgets.Button(description='Quit')
                        waiting_widget = widgets.HTML(value="<h3>\</h3>")


                        container = widgets.VBox([
                            title_widget,
                            obj1_widget,
                            obj2_widget,
                            obj_name_search,
                            question_widget,
                            widgets.HBox([button_keep1, button_keep2]),
                            #widgets.HBox([button_keep_Ned, button_keep1, button_keep2, button_quit]),
                            waiting_widget
                        ])

                        # Display the GUI
                        display(container)


                        # Event handlers for button clicks
                        def on_keep1_clicked(b):
                            nonlocal user_input
                            user_input = 'Keep Obj1.'


                        def on_keep2_clicked(b):
                            nonlocal user_input
                            user_input = 'Keep Obj2.'

                        # def on_keep_Ned_clicked(b):
                        #     nonlocal user_input
                        #     user_input = "Keep Obj1, but use Ned's official name."

                        # def on_quit_clicked(b):
                        #     global user_input
                        #     user_input = 'quit'

                        # Assign event handlers to buttons
                        button_keep1.on_click(on_keep1_clicked)
                        button_keep2.on_click(on_keep2_clicked)
                        # button_keep_Ned.on_click(on_keep_Ned_clicked)
                        # button_quit.on_click(on_quit_clicked)

                        # Update the widgets with the information
                        obj1_widget.value = f"<hr><h3>Object 1 (from our Data table):<br>Name: [{obj1_data[uid_key]}] {obj1_name}<br>RA: {obj1_data[ra_key]}<br>DEC: {obj1_data[dec_key]}</p>"
                        obj2_widget.value = f"<hr><h3>Object 2 (from our Data table):<br>Name: [{obj2_data[uid_key]}] {obj2_name}<br>RA: {obj2_data[ra_key]}<br>DEC: {obj2_data[dec_key]}</p>"

                        waiting_icons = ['-','\\','|','/']
                        j = 0
                        # # Disable execution until a button is clicked
                        with ui_events() as poll:
                            while user_input == '':
                                poll(10)          # React to UI events (upto 10 at a time)
                                if user_input == '':
                                    waiting_widget.value = "<h3>"+ waiting_icons[j]+"</h3>"
                                    j+=1
                                    if j==4:
                                        j=0
                                    time.sleep(0.1)
                                             
                        if user_input == 'Keep Obj1.':
                            keep1flag[i] = True
                        if user_input == 'Keep Obj2.':
                            keep2flag[i] = True
                        # if user_input == "Keep Obj1, Ned's name.":
                        #     keep1flag[i] = True
                        #     data.at[idxindata1, name_key] = name_result_table1['Object Name'][0] # modify the data name to be Ned's official object name.

                        # if user_input == 'quit':
                        #     sys.exit("Program terminated by user")

                        button_keep1.disabled = True
                        button_keep2.disabled = True
                        # button_keep_Ned.disabled = True
                        # button_quit.disabled = True
                        waiting_widget.value = 'User entered: '+ user_input

                        
                    else:
                        keep1flag[i] = True
                    finished = True



        if not(finished):


            print('Name Match Unsuccessful.\n')

            result_table_rad_obj1, success, msg = query_ned_by_coords(obj1_data[ra_key],obj1_data[dec_key], search_radius)
            result_table_rad_obj2, success, msg = query_ned_by_coords(obj2_data[ra_key],obj2_data[dec_key], search_radius)

            ########  AI Generated User Interface   #########

            user_input = ''

            # Create GUI widgets
            total_obj = str(len(dupidx)+1)
            this_obj = str(i+1)
            title_widget = widgets.HTML(value=f"<h1>Object Comparison {this_obj} / {total_obj} </h4>")
            obj1_widget = widgets.HTML()
            obj1_name_search = widgets.HTML()
            obj1_rad_search = widgets.HTML()
            obj2_widget = widgets.HTML()
            obj2_name_search = widgets.HTML()
            obj2_rad_search = widgets.HTML()
            ang_sep = widgets.HTML()
            question_widget = widgets.HTML(value="<h3>Are these two objects duplicates of each other? Choose one to keep, or keep both if not duplicates.</h3>")
            button_keep1 = widgets.Button(description='Keep Obj1')
            button_keep2 = widgets.Button(description='Keep Obj2')
            button_keep_both = widgets.Button(description='Keep Both')
            # button_quit = widgets.Button(description='Quit')
            waiting_widget = widgets.HTML(value="<h3>\</h3>")


            # Create layout for the widgets
            container = widgets.VBox([
                title_widget,
                obj1_widget,
                obj1_name_search,
                obj1_rad_search,
                obj2_widget,
                obj2_name_search,
                obj2_rad_search,
                ang_sep,
                question_widget,
                widgets.HBox([button_keep1, button_keep2, button_keep_both]),
                waiting_widget
            ])

            # Display the GUI
            display(container)

            # Event handlers for button clicks
            def on_keep1_clicked(b):
                nonlocal user_input
                user_input = 'Keep Object 1 (THIS WAS A DUPLICATE!)'


            def on_keep2_clicked(b):
                nonlocal user_input
                user_input = 'Keep Object 2 (THIS WAS A DUPLICATE!)'

            def on_keepboth_clicked(b):
                nonlocal user_input
                user_input = 'Keep Both (THESE ARE NOT DUPLICATES!)'


            # def on_quit_clicked(b):
            #     global user_input
            #     user_input = 'quit'

            # Assign event handlers to buttons
            button_keep1.on_click(on_keep1_clicked)
            button_keep2.on_click(on_keep2_clicked)
            button_keep_both.on_click(on_keepboth_clicked)
            # button_quit.on_click(on_quit_clicked)

            # Update the widgets with the information
            obj1_widget.value = f"<hr><h2>Object 1 (from our Data table):<br>Name: [{obj1_data[uid_key]}] {obj1_name}<br>RA: {obj1_data[ra_key]}<br>DEC: {obj1_data[dec_key]}</p>"

            obj1_name_search.value = f"<p><h4> 📖 The Following are results from a NED NAME search for object 1's Name:</p></h4>Well 💩, Name Search was Unsucessful.<br>"

            if type(name_result_table1)==Table:
                if len(name_result_table1) > 0:
                    obj1_name_search.value = f"<p><h4> 📖 The Following are results from a NED NAME search for object 1's Name:</p></h4>{result_table_rad_obj1._repr_html_()}<br>"
            
            obj1_rad_search.value = f"<p><h4> 🗺️ The Following are results from a NED position search around object 1's position with radius of the search_radius:</p></h4>{result_table_rad_obj1._repr_html_()}<br><hr>"

            obj2_widget.value = f"<h2>Object 2 (from our Data table):<br>Name: [{obj2_data[uid_key]}] {obj2_name}<br>RA: {obj2_data[ra_key]}<br>DEC: {obj2_data[dec_key]}</p>"

            obj2_name_search.value = f"<p><h4> 📖 The Following are results from a NED NAME search for object 2's Name:</p></h4>Well 💩, Name Search was Unsucessful.<br>"

            if type(name_result_table2)==Table:
                if len(name_result_table2) > 0:
                    obj2_name_search.value = f"<p><h4> 📖 The Following are results from a NED NAME search for object 2's Name:</p></h4>{result_table_rad_obj2._repr_html_()}<br>"
            
            
            obj2_rad_search.value = f"<p><h4> 🗺️ The Following are results from a NED position search around object 2's position with radius of the search_radius:</p></h4>{result_table_rad_obj2._repr_html_()}<br><hr>"
            ang_sep.value = f"<h3> 🧭 The angular separation of the two objects is {dup_sep2d[i].to_string(unit='arcsec')}<hr>"

            
            waiting_icons = ['-','\\','|','/']
            j = 0
            # # Disable execution until a button is clicked
            with ui_events() as poll:
                while user_input == '':
                    poll(10)          # React to UI events (upto 10 at a time)
                    if user_input == '':
                        waiting_widget.value = "<h3>"+ waiting_icons[j]+"</h3>"
                        j+=1
                        if j==4:
                            j=0
                        time.sleep(0.1)


            if user_input == 'Keep Object 1 (THIS WAS A DUPLICATE!)':
                keep1flag[i] = True
            if user_input == 'Keep Object 2 (THIS WAS A DUPLICATE!)':
                keep2flag[i] = True
            # if user_input == 'quit':
            #     sys.exit("Program terminated by user")
            button_keep1.disabled = True
            button_keep2.disabled = True
            button_keep_both.disabled = True
            # button_quit.disabled = True
            waiting_widget.value = 'User entered: '+ user_input



            ########  End AI Generated User Interface   #########

    
    # Still need to save and return the cleaned data. Also, if there is exactly 1 object returned by the NED search, is it safe to assume it is a duplicate? # not really, best to get user input anyhow.

    # create a new cleansed dataset

    new_rows = []

    for idxofdata, row in data.iterrows():
        insert_data = False
        if not(idxofdata in dupidx) and not(idxofdata in dupidx2):
            insert_data = True
        else:
            if idxofdata in dupidx:
                idxofduprecord = dupidx.index(idxofdata) #idxofduprecord is the index in the dupidx table, as well as all the other metadata tables on possible dupes
                if keep1flag[idxofduprecord]==True:
                    insert_data = True
            if idxofdata in dupidx2:
                idxofduprecord = dupidx2.index(idxofdata)
                if keep2flag[idxofduprecord]==True:
                    insert_data = True




        if insert_data:
            new_rows.append(row)
    
    cleansed_data = pd.DataFrame(new_rows)

    return cleansed_data



 

        

cleansed_data = cleanse_data(all_surveyed,all_surveyed_cols)
cleansed_data.to_csv(cleansed_path, index=False)
print('\n\nSaved cleansed data to '+cleansed_path)


[184]
2 "duplicates" found.
Beginning Analysis ...
###########################


Comparing record [339018495037564928] (2102p015_ac51-013149) against record [597912807941367808] (2102p015_ac51-013149)...
Performing NED Name Search...
Name Match Unsuccessful.



VBox(children=(HTML(value='<h1>Object Comparison 1 / 2 </h4>'), HTML(value=''), HTML(value=''), HTML(value='')…



Saved cleansed data to ./cleansed_data/Class_with_bestobjids_cleansed.csv


##  