# Removing falsely labeled and duplicated systems from the dataset

This process removes systems that were falsely labeled as stable when, in fact, one or more bodies had been ejected from the system and therefore no collision occurred.
It also removes duplicated systems from the dataset.

In [1]:
import pandas as pd
import numpy as np

We use pre-labeled datasheets that identify which systems contain ejections. The process for detecting ejections and creating these datasheets can be found in findEjections.ipynb. In brief, we reload the final snapshot of each system after integration and check whether any body has been ejected.
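
As a rough illustration of the idea only (the actual procedure lives in findEjections.ipynb and is not reproduced here), the sketch below flags a system as containing an ejection when any body ends up farther than a cutoff distance from the central body. The function name `has_ejection`, the array layout, and the cutoff value are all assumptions made for this example.

```python
import numpy as np

def has_ejection(final_positions, max_distance=100.0):
    """Return True if any body ends farther than max_distance from body 0.

    final_positions: array of shape (n_bodies, 3); row 0 is assumed to be
    the central body. max_distance is a hypothetical cutoff in the same
    units as the positions, not a value taken from the notebook.
    """
    pos = np.asarray(final_positions, dtype=float)
    # distance of every other body from the central body
    dist = np.linalg.norm(pos[1:] - pos[0], axis=1)
    return bool(np.any(dist > max_distance))

# toy final snapshots: one compact system, one with a distant body
bound = [[0, 0, 0], [1, 0, 0], [0, 1.5, 0], [2, 0, 0.1]]
escaped = [[0, 0, 0], [1, 0, 0], [0, 1.5, 0], [500, 20, 3]]
print(has_ejection(bound), has_ejection(escaped))
```

A distance cutoff is only one possible criterion; an energy-based test for unbound orbits would be another.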

In [2]:
randomejections = pd.read_csv('randomejections.csv')
resonantejections = pd.read_csv('resonantejections.csv')

Here we manually specify the names of the datasheet columns.

In [3]:
# column names for the initial conditions (col) and for the labels (lab)
col = ['p0m','p0x','p0y','p0z','p0vx','p0vy','p0vz','p1m','p1x','p1y','p1z','p1vx','p1vy','p1vz','p2m','p2x','p2y','p2z','p2vx','p2vy','p2vz','p3m','p3x','p3y','p3z','p3vx','p3vy','p3vz']
lab = ['runstring', 'instability_time',
       'shadow_instability_time', 'Stable']

We can then load the initial conditions and labels for the random and resonant systems, joining the two datasheets so that each system stays together with its corresponding labels.

In [4]:
#load path and data for random datasets
randomPath = 'csvs/random/'
randomInitial = pd.read_csv(randomPath+'initial_conditions.csv',header=None)
randomLabels = pd.read_csv(randomPath+'labels.csv')
randomInitial.columns = col #adds labels to initial condition columns
randset = randomInitial.join(randomLabels) #joins initial conditions and labels on the row index

In [5]:
#load path and data for resonant datasets
resPath = 'csvs/resonant/'
resInitial = pd.read_csv(resPath+'initial_conditions.csv',header=None)
resLabels = pd.read_csv(resPath+'labels.csv')
resInitial.columns = col #adds labels to initial condition columns
resset = resInitial.join(resLabels) #joins initial conditions and labels on the row index

We can then merge these datasheets with the ejection data, matching each system to its ejection label by its unique runstring.

In [6]:
#combines dataset with ejection data based on runstring
randset = pd.merge(randset,randomejections[['runstring','ejection']],on='runstring')
resset = pd.merge(resset,resonantejections[['runstring','ejection']],on='runstring')
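
`pd.merge` performs an inner join by default, so any system whose runstring is missing from the ejection sheet would be dropped without warning. The sketch below, on made-up toy data, shows one way to surface such rows using the `how`, `validate`, and `indicator` parameters of `pd.merge`. The frame names and runstring values are hypothetical.

```python
import pandas as pd

# toy stand-ins for the dataset and the ejection sheet
systems = pd.DataFrame({'runstring': ['a.bin', 'b.bin', 'c.bin'],
                        'Stable': [True, False, True]})
ejections = pd.DataFrame({'runstring': ['a.bin', 'b.bin'],
                          'ejection': [False, True]})

# a left join keeps every system; validate raises if a runstring is
# repeated in the ejection sheet; indicator records where each row matched
merged = pd.merge(systems, ejections[['runstring', 'ejection']],
                  on='runstring', how='left',
                  validate='many_to_one', indicator=True)

# runstrings that have no ejection label
unmatched = merged.loc[merged['_merge'] == 'left_only', 'runstring'].tolist()
print(unmatched)
```

If `unmatched` is empty, the inner and left joins give the same rows and the simpler call is safe.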

In [7]:
#removes junk columns
randset = randset.drop('Unnamed: 0',axis=1)
resset = resset.drop('Unnamed: 0',axis=1)

We can now get a sense of how many systems contain an ejection or are duplicates of another system.

In [8]:
#checking how many ejection systems exist
print('random:')
print(randset['ejection'].value_counts())
print('resonant:')
print(resset['ejection'].value_counts())

random:
ejection
False    24941
True        59
Name: count, dtype: int64
resonant:
ejection
False    113478
True         65
Name: count, dtype: int64


In [9]:
#finds duplicates and labels them; every duplicate except the first appearance is marked True
randset['isDup']=randset[col].duplicated()
resset['isDup']=resset[col].duplicated()
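
To make the `duplicated()` semantics concrete, here is a small sketch on a made-up frame with only two of the initial-condition columns. It contrasts the default `keep='first'` with `keep=False`, which marks every copy, including the first.

```python
import pandas as pd

# toy frame: rows 0, 1 and 3 share the same initial conditions
toy = pd.DataFrame({'p0m': [1.0, 1.0, 2.0, 1.0],
                    'p0x': [0.5, 0.5, 0.5, 0.5],
                    'runstring': ['a', 'b', 'c', 'd']})
cols = ['p0m', 'p0x']

first_kept = toy[cols].duplicated()            # default keep='first'
all_copies = toy[cols].duplicated(keep=False)  # flags every copy

print(first_kept.tolist())   # [False, True, False, True]
print(all_copies.tolist())   # [True, True, False, True]
```

With the default, exactly one representative of each repeated system survives the later removal step.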

In [10]:
#checking how many duplicated systems are in each dataset; this process labels the first instance of any system as False,
# and labels any duplicates after the first instance as True
print('random:')
print(randset['isDup'].value_counts()) #the duplicated systems have the same initial conditions
print('resonant:')
print(resset['isDup'].value_counts())

random:
isDup
False    25000
Name: count, dtype: int64
resonant:
isDup
False    102559
True      10984
Name: count, dtype: int64


We can then create a new column specifying whether each row should be removed from the dataset.

In [11]:
#labeling each row as to whether or not it should be removed
randset['remove']=(randset['ejection']==True) | (randset['isDup']==True)
resset['remove']=(resset['ejection']==True) | (resset['isDup']==True)


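In total, we need to remove 59 systems from the random data and 11046 systems from the resonant data. The resonant total is 3 fewer than the sum of its ejections (65) and duplicates (10984), so 3 resonant systems are flagged as both.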

In [12]:
#determining how many total systems need to be dropped
print('random:')
print(randset['remove'].value_counts())
print('resonant:')
print(resset['remove'].value_counts())

random:
remove
False    24941
True        59
Name: count, dtype: int64
resonant:
remove
False    102497
True      11046
Name: count, dtype: int64
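
The ejection, duplicate, and removal counts are linked by inclusion-exclusion: the number of removed rows equals ejections plus duplicates minus the rows flagged as both. The sketch below checks that identity on a made-up toy frame; the same three lines could be run on `resset` to count the overlap directly.

```python
import pandas as pd

# toy flags: row 2 is both an ejection and a duplicate
toy = pd.DataFrame({'ejection': [True, False, True, False, False],
                    'isDup':    [False, True, True, False, True]})

n_eject = int(toy['ejection'].sum())
n_dup = int(toy['isDup'].sum())
n_both = int((toy['ejection'] & toy['isDup']).sum())
n_remove = int((toy['ejection'] | toy['isDup']).sum())

# inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert n_remove == n_eject + n_dup - n_both
print(n_eject, n_dup, n_both, n_remove)   # 2 3 1 4
```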


We save the list of removal labels for each dataset.

In [14]:
resset[['runstring','remove']].to_csv(resPath+'removeLables.csv')
randset[['runstring','remove']].to_csv(randomPath+'removeLables.csv')


In [15]:
#removes the bad samples
randset = randset.drop(randset[randset['remove']==True].index)
resset = resset.drop(resset[resset['remove']==True].index)

We can then split the datasheets back into initial conditions and labels, as in the original data, and save the new clean CSVs in the same file location.

In [16]:
#separates labels from initial conditions
cleanrandinitial = randset[col+['runstring']]
cleanrandlables = randset[lab]
cleanresinitial = resset[col+['runstring']]
cleanreslables = resset[lab]

In [17]:
#saves clean data
cleanrandinitial.to_csv(randomPath+'clean_initial_conditions.csv')
cleanrandlables.to_csv(randomPath+'clean_labels.csv')
cleanresinitial.to_csv(resPath+'clean_initial_conditions.csv')
cleanreslables.to_csv(resPath+'clean_labels.csv')
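
One detail worth keeping in mind: by default `to_csv` writes the DataFrame index as an unnamed first column along with a header row, so the cleaned files gain a column that `read_csv` later reports as `Unnamed: 0`. The sketch below demonstrates this round trip on a made-up frame, using in-memory buffers instead of files, and shows that `index=False` avoids it. Whether the index should be kept in the cleaned files is a design choice this sketch does not settle.

```python
import io
import pandas as pd

frame = pd.DataFrame({'runstring': ['a', 'b'], 'Stable': [True, False]})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)      # ['Unnamed: 0', 'runstring', 'Stable']
print(cols_without_index)   # ['runstring', 'Stable']
```

Because the cleaned initial-conditions file has a header row, reading it back with `header=None` as in the loading cells would treat that header as a data row.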


We are left with a clean dataset that should no longer contain falsely labeled or duplicated systems.
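
A quick sanity check can confirm this. The sketch below uses a hypothetical helper, `check_clean`, on a made-up toy frame; it counts the remaining ejections and duplicated initial conditions, both of which should be zero after cleaning.

```python
import pandas as pd

def check_clean(df, cond_cols):
    """Return (remaining ejections, remaining duplicates) for df."""
    n_eject = int(df['ejection'].sum())
    n_dup = int(df[cond_cols].duplicated().sum())
    return n_eject, n_dup

# toy frame: row 1 duplicates row 0, row 2 contains an ejection
toy = pd.DataFrame({'p0m': [1.0, 1.0, 2.0, 3.0],
                    'ejection': [False, False, True, False]})
toy['remove'] = toy['ejection'] | toy[['p0m']].duplicated()
clean = toy[~toy['remove']]   # keep only rows not marked for removal

print(check_clean(toy, ['p0m']))     # (1, 1)
print(check_clean(clean, ['p0m']))   # (0, 0)
```

On the real data, the equivalent call would be `check_clean(resset, col)` after the removal step.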