## Data cleaning



Import data


General Information:

- SPK-ID & Object ID: Unique identifiers for the asteroid within different databases. Not directly relevant to hazard assessment.
- Object fullname, pdes, name: Different designations for the same asteroid. Consider using the official IAU name for consistency.
- Equinox: Not important for hazard assessment.


Hazard Potential Indicators:

- NEO: Flag indicating if the asteroid is a Near-Earth Object (NEO), meaning its orbit brings it close to Earth's (perihelion distance below 1.3 au). NEOs are the candidates for hazardous impacts.
- PHA: Flag indicating if the asteroid is a Potentially Hazardous Asteroid (PHA). PHAs are NEOs whose orbits come close enough to Earth's (minimum orbit intersection distance of 0.05 au or less) and that are large enough (absolute magnitude H of 22.0 or brighter) to pose a potential future threat. This is the crucial factor for hazard assessment.

Physical Characteristics:

- H: Absolute magnitude parameter, related to the brightness of the asteroid. Lower H values indicate a brighter and potentially larger asteroid.
- Diameter & Diameter_sigma: Estimated diameter of the asteroid in kilometers and its uncertainty. A larger diameter means a more energetic potential impact.
- Albedo: Reflectivity of the asteroid's surface. It hints at surface composition and, together with H, constrains size: at the same H, a darker (lower-albedo) asteroid is larger.

Orbital Information:

- Orbit_id: Identifier for the specific orbital solution used.
- Epoch: Reference time for the orbital calculations.
- Equinox: Reference frame for the orbital elements. Orbit_id, Epoch and Equinox are not directly relevant for hazard assessment, but they document which orbit solution was used.
- e: Eccentricity, a measure of how far the orbit deviates from a circle (0 is a perfect circle). A more eccentric orbit spans a wider range of distances from the Sun, so it could bring the asteroid closer to Earth at certain points.
- a: Semi-major axis, half the longest diameter of the orbital ellipse, often read as the asteroid's average distance from the Sun.
- q: Perihelion distance, the closest point in the asteroid's orbit to the Sun. A smaller perihelion distance means the orbit reaches further into the inner solar system, where it can approach Earth's orbit.
- i: Inclination, the angle between the asteroid's orbital plane and the ecliptic (Earth's orbital plane). A lower inclination keeps the asteroid closer to Earth's orbital plane, which can make close approaches more likely.
- tp: Time of perihelion passage, the date and time when the asteroid is closest to the Sun.
- moid_ld: Minimum Orbit Intersection Distance with Earth, the closest distance the asteroid's orbit comes to Earth's orbit (in Lunar Distances, where 1 LD is the average Earth-Moon distance). A smaller moid_ld indicates a higher potential for a future collision.
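
The link between H, albedo and diameter described above can be made concrete with the standard conversion D = 1329 / sqrt(albedo) * 10^(-H/5), with D in kilometers. The sketch below is only an illustration: the function name `estimate_diameter_km` is made up for this example, and the albedo of 0.09 used for Ceres is an assumed value, not one read from the dataset.

```python
import numpy as np

def estimate_diameter_km(h_mag, albedo):
    """Estimate diameter in km from absolute magnitude H and geometric albedo."""
    h_mag = np.asarray(h_mag, dtype=float)
    albedo = np.asarray(albedo, dtype=float)
    # Standard conversion: D = 1329 / sqrt(albedo) * 10^(-H / 5)
    return 1329.0 / np.sqrt(albedo) * 10.0 ** (-0.2 * h_mag)

# H = 3.4 is the value listed for Ceres in this dataset; the albedo is an assumption
ceres_estimate = estimate_diameter_km(3.4, 0.09)

# At the same H, a darker surface implies a larger body
dark = estimate_diameter_km(18.0, 0.05)
bright = estimate_diameter_km(18.0, 0.25)
```

The estimate for Ceres lands in the same range as the `diameter` value shown for it in the dataset, which is a quick sanity check that `H`, `albedo` and `diameter` carry overlapping information.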

In [9]:
	
import warnings
# Silence all warnings to keep cell output short; comment this out while debugging
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

In [10]:
df1 = pd.read_csv('dataset.csv')


In [11]:
df1.head(5)


Unnamed: 0,id,spkid,full_name,pdes,name,prefix,neo,pha,H,diameter,...,sigma_i,sigma_om,sigma_w,sigma_ma,sigma_ad,sigma_n,sigma_tp,sigma_per,class,rms
0,a0000001,2000001,1 Ceres,1,Ceres,,N,N,3.4,939.4,...,4.6089e-09,6.1688e-08,6.6248e-08,7.8207e-09,1.1113e-11,1.1965e-12,3.7829e-08,9.4159e-09,MBA,0.43301
1,a0000002,2000002,2 Pallas,2,Pallas,,N,N,4.2,545.0,...,3.4694e-06,6.2724e-06,9.1282e-06,8.8591e-06,4.9613e-09,4.6536e-10,4.0787e-05,3.6807e-06,MBA,0.35936
2,a0000003,2000003,3 Juno,3,Juno,,N,N,5.33,246.596,...,3.2231e-06,1.6646e-05,1.7721e-05,8.1104e-06,4.3639e-09,4.4134e-10,3.5288e-05,3.1072e-06,MBA,0.33848
3,a0000004,2000004,4 Vesta,4,Vesta,,N,N,3.0,525.4,...,2.1706e-07,3.8808e-07,1.7893e-07,1.2068e-06,1.6486e-09,2.6125e-10,4.1037e-06,1.2749e-06,MBA,0.3998
4,a0000005,2000005,5 Astraea,5,Astraea,,N,N,6.9,106.699,...,2.7408e-06,2.8949e-05,2.9842e-05,8.3038e-06,4.729e-09,5.5227e-10,3.4743e-05,3.4905e-06,MBA,0.52191


In [12]:
# Create a working dataframe without the identifier, designation and equinox columns
df = df1.drop(columns=['id', 'spkid', 'pdes', 'name', 'prefix', 'equinox'])

In [13]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 958524 entries, 0 to 958523
Data columns (total 39 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   full_name       958524 non-null  object 
 1   neo             958520 non-null  object 
 2   pha             938603 non-null  object 
 3   H               952261 non-null  float64
 4   diameter        136209 non-null  float64
 5   albedo          135103 non-null  float64
 6   diameter_sigma  136081 non-null  float64
 7   orbit_id        958524 non-null  object 
 8   epoch           958524 non-null  float64
 9   epoch_mjd       958524 non-null  int64  
 10  epoch_cal       958524 non-null  float64
 11  e               958524 non-null  float64
 12  a               958524 non-null  float64
 13  q               958524 non-null  float64
 14  i               958524 non-null  float64
 15  om              958524 non-null  float64
 16  w               958524 non-null  float64
 17  ma        

In [14]:
df.nunique()

full_name         958524
neo                    2
pha                    2
H                   9489
diameter           16591
albedo              1057
diameter_sigma      3054
orbit_id            4690
epoch               5246
epoch_mjd           5246
epoch_cal           5246
e                 958444
a                 958509
q                 958509
i                 958414
om                958518
w                 958519
ma                958519
ad                958505
n                 958514
tp                958519
tp_cal            958499
per               958510
per_y             958511
moid              314300
moid_ld           314301
sigma_e           254740
sigma_a           273297
sigma_q           248138
sigma_i           215741
sigma_om          223155
sigma_w           262719
sigma_ma          266816
sigma_ad          269241
sigma_n           251750
sigma_tp          291246
sigma_per         282687
class                 13
rms                64386
dtype: int64

In [15]:
df.isnull().sum()

full_name              0
neo                    4
pha                19921
H                   6263
diameter          822315
albedo            823421
diameter_sigma    822443
orbit_id               0
epoch                  0
epoch_mjd              0
epoch_cal              0
e                      0
a                      0
q                      0
i                      0
om                     0
w                      0
ma                     1
ad                     4
n                      0
tp                     0
tp_cal                 0
per                    4
per_y                  1
moid               19921
moid_ld              127
sigma_e            19922
sigma_a            19922
sigma_q            19922
sigma_i            19922
sigma_om           19922
sigma_w            19922
sigma_ma           19922
sigma_ad           19926
sigma_n            19922
sigma_tp           19922
sigma_per          19926
class                  0
rms                    2
dtype: int64
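
The raw counts above are easier to judge as a share of all rows: `diameter`, `albedo` and `diameter_sigma` are missing in roughly 86% of the 958,524 rows, while most other columns miss only a handful. A small helper along these lines could produce that view; `missing_summary` is a hypothetical name and the demo frame is a stand-in, not the real data.

```python
import pandas as pd

def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(2),
    })
    # Only show columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Tiny stand-in frame; in the notebook this would be called as missing_summary(df)
demo = pd.DataFrame({
    "H": [3.4, None, 5.3],
    "diameter": [939.4, None, None],
    "e": [0.07, 0.23, 0.25],
})
report = missing_summary(demo)
```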

In [17]:
df = df.dropna(subset=['pha'])

In [34]:
# Inspect the rows with a missing 'neo' flag before dropping them, to see which 'pha' values they carry
df[df['neo'].isnull()]

Unnamed: 0,full_name,neo,pha,H,diameter,albedo,diameter_sigma,orbit_id,epoch,epoch_mjd,...,sigma_i,sigma_om,sigma_w,sigma_ma,sigma_ad,sigma_n,sigma_tp,sigma_per,class,rms
741612,(2013 CA134),,N,10.748,,,,JPL 7,2456340.5,56340,...,0.47393,7.9491,14.282,201.81,,0.20916,431.09,,HYA,0.30366
929462,'Oumuamua (A/2017 U1),,N,22.08,,,,JPL 16,2458080.5,58080,...,0.000288,0.000254,0.001249,0.006116,,8.1084e-05,0.000264,,HYA,0.43612
946657,(A/2019 G4),,N,13.6,,,,JPL 17,2458811.5,58811,...,0.000122,0.000117,0.001592,1.9e-05,,4.3122e-08,0.012924,,HYA,0.41121
950563,(A/2019 O3),,N,9.1017,,,,JPL 17,2458842.5,58842,...,0.000228,0.000161,0.005978,3.9e-05,,8.9893e-08,0.10257,,HYA,0.41029


In [35]:
df = df.dropna(subset=['neo'])

In [36]:
df.isnull().sum()

full_name              0
neo                    0
pha                    0
H                   6262
diameter          802390
albedo            803496
diameter_sigma    802518
orbit_id               0
epoch                  0
epoch_mjd              0
epoch_cal              0
e                      0
a                      0
q                      0
i                      0
om                     0
w                      0
ma                     1
ad                     0
n                      0
tp                     0
tp_cal                 0
per                    0
per_y                  0
moid                   0
moid_ld                0
sigma_e                1
sigma_a                1
sigma_q                1
sigma_i                1
sigma_om               1
sigma_w                1
sigma_ma               1
sigma_ad               1
sigma_n                1
sigma_tp               1
sigma_per              1
class                  0
rms                    1
dtype: int64

In [39]:
df[df['sigma_ad'].isnull()]

Unnamed: 0,full_name,neo,pha,H,diameter,albedo,diameter_sigma,orbit_id,epoch,epoch_mjd,...,sigma_i,sigma_om,sigma_w,sigma_ma,sigma_ad,sigma_n,sigma_tp,sigma_per,class,rms


In [42]:
df = df.dropna(subset=['sigma_ad', 'ma'])


In [43]:
df.isnull().sum()

full_name              0
neo                    0
pha                    0
H                   6262
diameter          802388
albedo            803494
diameter_sigma    802516
orbit_id               0
epoch                  0
epoch_mjd              0
epoch_cal              0
e                      0
a                      0
q                      0
i                      0
om                     0
w                      0
ma                     0
ad                     0
n                      0
tp                     0
tp_cal                 0
per                    0
per_y                  0
moid                   0
moid_ld                0
sigma_e                0
sigma_a                0
sigma_q                0
sigma_i                0
sigma_om               0
sigma_w                0
sigma_ma               0
sigma_ad               0
sigma_n                0
sigma_tp               0
sigma_per              0
class                  0
rms                    0
dtype: int64

In [41]:
df[df['ma'].isnull()]

Unnamed: 0,full_name,neo,pha,H,diameter,albedo,diameter_sigma,orbit_id,epoch,epoch_mjd,...,sigma_i,sigma_om,sigma_w,sigma_ma,sigma_ad,sigma_n,sigma_tp,sigma_per,class,rms
558436,(2002 PD153),N,N,8.7,,,,JPL 5,2452500.5,52500,...,19.854,571.34,884510000000.0,884510000000.0,3656.4,0.36502,285310000000000.0,13673000.0,TNO,0.039417


In [45]:
# Keep only rows that have a diameter, an albedo and a diameter uncertainty
df = df.dropna(subset=['diameter','albedo','diameter_sigma'])


df.isnull().sum()

full_name            0
neo                  0
pha                  0
H                 3863
diameter             0
albedo               0
diameter_sigma       0
orbit_id             0
epoch                0
epoch_mjd            0
epoch_cal            0
e                    0
a                    0
q                    0
i                    0
om                   0
w                    0
ma                   0
ad                   0
n                    0
tp                   0
tp_cal               0
per                  0
per_y                0
moid                 0
moid_ld              0
sigma_e              0
sigma_a              0
sigma_q              0
sigma_i              0
sigma_om             0
sigma_w              0
sigma_ma             0
sigma_ad             0
sigma_n              0
sigma_tp             0
sigma_per            0
class                0
rms                  0
dtype: int64
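
Dropping every row without a diameter, albedo or diameter uncertainty removes more than 800,000 rows, so it is worth checking how the `pha` label counts change before committing to it. The sketch below shows one way to compare them; `label_counts_after_dropna` is a hypothetical helper and the demo frame is made-up data.

```python
import pandas as pd

def label_counts_after_dropna(frame, subset, label="pha"):
    """Compare label counts before and after dropping rows with NaN in `subset`."""
    before = frame[label].value_counts()
    after = frame.dropna(subset=subset)[label].value_counts()
    # A label that disappears entirely shows up as 0 instead of NaN
    return pd.DataFrame({"before": before, "after": after}).fillna(0).astype(int)

# Made-up stand-in; in the notebook: label_counts_after_dropna(df, ['diameter', 'albedo', 'diameter_sigma'])
demo = pd.DataFrame({
    "pha": ["N", "N", "Y", "Y", "N"],
    "diameter": [1.2, None, None, 0.4, 3.1],
})
comparison = label_counts_after_dropna(demo, subset=["diameter"])
```

If the rare `Y` class shrinks much faster than `N`, imputing the missing values or leaving these columns out of the feature set may be a better trade than dropping the rows.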

In [54]:
# Count the distinct 'pha' values among the rows where 'H' is null
df[df['H'].isnull()]['pha'].nunique()


1

In [55]:
# All rows with a null 'H' share a single 'pha' value, so we can drop them
df = df.dropna(subset=['H'])


df.isnull().sum()

full_name         0
neo               0
pha               0
H                 0
diameter          0
albedo            0
diameter_sigma    0
orbit_id          0
epoch             0
epoch_mjd         0
epoch_cal         0
e                 0
a                 0
q                 0
i                 0
om                0
w                 0
ma                0
ad                0
n                 0
tp                0
tp_cal            0
per               0
per_y             0
moid              0
moid_ld           0
sigma_e           0
sigma_a           0
sigma_q           0
sigma_i           0
sigma_om          0
sigma_w           0
sigma_ma          0
sigma_ad          0
sigma_n           0
sigma_tp          0
sigma_per         0
class             0
rms               0
dtype: int64

In [61]:
df.nunique()

full_name         131142
neo                    2
pha                    2
H                    332
diameter           16539
albedo              1048
diameter_sigma      3052
orbit_id             367
epoch                166
epoch_mjd            166
epoch_cal            166
e                 131142
a                 131142
q                 131142
i                 131142
om                131142
w                 131142
ma                131142
ad                131142
n                 131142
tp                131142
tp_cal            131141
per               131142
per_y             131142
moid               91464
moid_ld            91464
sigma_e            57462
sigma_a            60186
sigma_q            46005
sigma_i            58209
sigma_om           65990
sigma_w            70264
sigma_ma           58715
sigma_ad           59832
sigma_n            44735
sigma_tp           72448
sigma_per          64197
class                 11
rms                26672
dtype: int64

In [None]:
# Check the normalization of each important column

In [None]:
# Final cleaned data

In [60]:
# Next step: one-hot encode the categorical columns with pd.get_dummies

df['pha'].value_counts()


pha
N    130961
Y       181
Name: count, dtype: int64

In [None]:
# Undersample or oversample?
# Undersampling, because I don't want the model to overfit to duplicated minority rows
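
With only 181 `Y` rows against 130,961 `N` rows, undersampling could be sketched as below. This is a minimal illustration assuming a binary label; the function name `undersample` and its `ratio` and `seed` parameters are hypothetical, and the demo frame is made-up data.

```python
import pandas as pd

def undersample(frame, label="pha", ratio=1.0, seed=42):
    """Keep all minority-class rows and sample the majority down to ratio * minority size."""
    counts = frame[label].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    minority_rows = frame[frame[label] == minority]
    # Never ask for more majority rows than exist
    n_majority = min(int(ratio * len(minority_rows)), int(counts[majority]))
    majority_rows = frame[frame[label] == majority].sample(n=n_majority, random_state=seed)
    # Shuffle so the two classes are not stored as contiguous blocks
    return pd.concat([minority_rows, majority_rows]).sample(frac=1, random_state=seed)

demo = pd.DataFrame({"pha": ["N"] * 20 + ["Y"] * 4, "H": range(24)})
balanced = undersample(demo)
```

A 1:1 ratio on the real data would leave only a few hundred rows, so a larger `ratio` that keeps more `N` rows may be worth trying.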



## Pre-processing

First, remove the columns that are highly correlated with others and so add little information.
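
One way to find such columns is to scan the correlation matrix for pairs above a threshold. Judging by their names and matching unique counts, pairs like `moid`/`moid_ld` or `per`/`per_y` look like the same quantity in different units, which is what this kind of check should flag. The helper name `highly_correlated_columns` and the 0.95 threshold are assumptions for illustration, and the demo values are made up.

```python
import numpy as np
import pandas as pd

def highly_correlated_columns(frame, threshold=0.95):
    """List numeric columns whose absolute Pearson correlation with an earlier column exceeds threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

demo = pd.DataFrame({
    "moid": [0.1, 0.2, 0.3, 0.4],
    "moid_ld": [38.9, 77.8, 116.7, 155.6],  # same quantity in another unit
    "e": [0.30, 0.05, 0.22, 0.11],
})
to_drop = highly_correlated_columns(demo)
reduced = demo.drop(columns=to_drop)
```

Pearson correlation only catches linear relationships, so this is a first pass rather than a full redundancy check.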

One-hot encode the remaining categorical columns; `pd.get_dummies` does that.
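
As a sketch of that step, the two-valued `neo` and `pha` flags can be mapped to 0/1 directly, while the multi-valued `class` column is expanded by `pd.get_dummies`. The demo rows below are stand-ins rather than real records, and treating `neo` and `pha` this way is a design choice for the example, not something the notebook has settled.

```python
import pandas as pd

# Stand-in rows using the same column names as the cleaned dataframe
demo = pd.DataFrame({
    "neo": ["N", "Y", "N"],
    "pha": ["N", "Y", "N"],
    "class": ["MBA", "APO", "MBA"],
    "H": [3.4, 21.1, 5.3],
})

encoded = demo.copy()
# Binary flags become a single 0/1 column each
for flag in ["neo", "pha"]:
    encoded[flag] = encoded[flag].map({"N": 0, "Y": 1})
# The multi-valued column gets one indicator column per category
encoded = pd.get_dummies(encoded, columns=["class"], dtype=int)
```

Keeping `pha` as a single 0/1 column rather than two dummy columns is convenient because it is the prediction target.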

Balance the classes.

Then normalize the numeric columns.
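
The notes do not say which normalization is meant, so the sketch below assumes z-score standardization. It computes the mean and standard deviation on the training rows only and reuses them for the test rows, so that no information from the test set leaks into the scaling. The function name `standardize` and the demo frames are hypothetical.

```python
import pandas as pd

def standardize(train, test, columns):
    """Z-score the given columns using the mean and std of the training rows only."""
    mean = train[columns].mean()
    # Replace a zero std with 1 to avoid dividing by zero for constant columns
    std = train[columns].std(ddof=0).replace(0, 1)
    train_scaled, test_scaled = train.copy(), test.copy()
    train_scaled[columns] = (train[columns] - mean) / std
    test_scaled[columns] = (test[columns] - mean) / std
    return train_scaled, test_scaled

train = pd.DataFrame({"H": [10.0, 12.0, 14.0], "e": [0.1, 0.2, 0.3]})
test = pd.DataFrame({"H": [12.0], "e": [0.4]})
train_z, test_z = standardize(train, test, ["H", "e"])
```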



Train the model.

Test the model.
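
No model has been chosen yet, so the only part that can be sketched now is the split itself. With so few `Y` rows, a stratified split keeps both classes in each part. The version below uses plain pandas and assumes the dataframe index is unique; `stratified_split`, `test_fraction` and `seed` are names invented for this example.

```python
import pandas as pd

def stratified_split(frame, label="pha", test_fraction=0.2, seed=42):
    """Split into train/test frames, sampling the same fraction from every label value."""
    test = frame.groupby(label).sample(frac=test_fraction, random_state=seed)
    # Everything not sampled into the test set becomes training data
    train = frame.drop(test.index)
    return train, test

demo = pd.DataFrame({"pha": ["N"] * 10 + ["Y"] * 10, "H": range(20)})
train, test = stratified_split(demo)
```

If the classes are balanced by undersampling, doing so on the training part only keeps the test set representative of the real class ratio.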