The NASA Near Earth Objects Web Service (NeoWs) provides data about near-Earth asteroids. For each asteroid it includes fields such as the estimated diameter, how close it came to Earth, and whether it is flagged as potentially hazardous.

We use this data to learn how to classify an asteroid as hazardous or not.

We'll import the necessary libraries first.

In [44]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import matplotlib
import seaborn as sns
import ast 


With the libraries imported, we can look at the head and tail of the dataset to get an idea of what the data looks like.

In [45]:
df = pd.read_csv('../data/raw/asteroids_data.csv', low_memory=False)
df.head()

Unnamed: 0,id,neo_reference_id,name,name_limited,designation,nasa_jpl_url,absolute_magnitude_h,is_potentially_hazardous_asteroid,close_approach_data,is_sentry_object,...,orbital_data_perihelion_argument,orbital_data_aphelion_distance,orbital_data_perihelion_time,orbital_data_mean_anomaly,orbital_data_mean_motion,orbital_data_equinox,orbital_data_orbit_class_orbit_class_type,orbital_data_orbit_class_orbit_class_description,orbital_data_orbit_class_orbit_class_range,sentry_data
0,2000433,2000433,433 Eros (A898 PA),Eros,433,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,10.4,False,"[{'close_approach_date': '1900-12-27', 'close_...",False,...,178.92257,1.782961,2461089.0,198.598042,0.559753,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,
1,2000719,2000719,719 Albert (A911 TB),Albert,719,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,15.59,False,"[{'close_approach_date': '1909-08-21', 'close_...",False,...,156.202461,4.077523,2461519.0,194.534428,0.230249,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,
2,2000887,2000887,887 Alinda (A918 AA),Alinda,887,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,13.81,False,"[{'close_approach_date': '1974-01-04', 'close_...",False,...,350.518804,3.88597,2460679.0,30.880981,0.253395,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,
3,2001036,2001036,1036 Ganymed (A924 UB),Ganymed,1036,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,9.18,False,"[{'close_approach_date': '1910-02-25', 'close_...",False,...,132.503939,4.086648,2460570.0,52.267577,0.226445,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,
4,2001221,2001221,1221 Amor (1932 EA1),Amor,1221,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,17.37,False,"[{'close_approach_date': '1908-03-14', 'close_...",False,...,26.755289,2.754075,2460839.0,345.76741,0.370539,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,


In [46]:
df.tail()

Unnamed: 0,id,neo_reference_id,name,name_limited,designation,nasa_jpl_url,absolute_magnitude_h,is_potentially_hazardous_asteroid,close_approach_data,is_sentry_object,...,orbital_data_perihelion_argument,orbital_data_aphelion_distance,orbital_data_perihelion_time,orbital_data_mean_anomaly,orbital_data_mean_motion,orbital_data_equinox,orbital_data_orbit_class_orbit_class_type,orbital_data_orbit_class_orbit_class_description,orbital_data_orbit_class_orbit_class_range,sentry_data
40251,54545001,54545001,(2025 RL2),,2025 RL2,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,25.995,False,"[{'close_approach_date': '1984-02-08', 'close_...",False,...,323.048724,1.130572,2461020.0,243.41695,1.339813,J2000,ATE,Near-Earth asteroid orbits similar to that of ...,a (semi-major axis) < 1.0 AU; q (perihelion) >...,
40252,54545002,54545002,(2025 RM2),,2025 RM2,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,28.075,False,"[{'close_approach_date': '1980-03-11', 'close_...",False,...,271.568148,1.643757,2460871.0,306.136316,0.762869,J2000,APO,Near-Earth asteroid orbits which cross the Ear...,a (semi-major axis) > 1.0 AU; q (perihelion) <...,
40253,54545003,54545003,(2025 RN2),,2025 RN2,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,24.096,False,"[{'close_approach_date': '2023-12-13', 'close_...",False,...,106.911493,1.313401,2460632.0,160.394237,0.949198,J2000,APO,Near-Earth asteroid orbits which cross the Ear...,a (semi-major axis) > 1.0 AU; q (perihelion) <...,
40254,54545004,54545004,(2025 RO2),,2025 RO2,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,25.63,False,"[{'close_approach_date': '1907-09-03', 'close_...",False,...,312.989467,4.284703,2460900.0,336.393795,0.236973,J2000,APO,Near-Earth asteroid orbits which cross the Ear...,a (semi-major axis) > 1.0 AU; q (perihelion) <...,
40255,54545005,54545005,(2025 RP2),,2025 RP2,https://ssd.jpl.nasa.gov/tools/sbdb_lookup.htm...,25.405,False,"[{'close_approach_date': '2025-09-19', 'close_...",False,...,179.574949,3.827933,2460934.0,359.495003,0.260265,J2000,AMO,Near-Earth asteroid orbits similar to that of ...,1.017 AU < q (perihelion) < 1.3 AU,


Next we look at some summary metrics for the data, then at the percentage of missing values per column and the number of duplicate rows in the dataset.

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40256 entries, 0 to 40255
Data columns (total 45 columns):
 #   Column                                                Non-Null Count  Dtype  
---  ------                                                --------------  -----  
 0   id                                                    40256 non-null  object 
 1   neo_reference_id                                      40256 non-null  object 
 2   name                                                  40252 non-null  object 
 3   name_limited                                          179 non-null    object 
 4   designation                                           40252 non-null  object 
 5   nasa_jpl_url                                          40256 non-null  object 
 6   absolute_magnitude_h                                  40243 non-null  float64
 7   is_potentially_hazardous_asteroid                     40256 non-null  bool   
 8   close_approach_data                                   40

In [48]:
df.isnull().sum() / df.shape[0] * 100

id                                                       0.000000
neo_reference_id                                         0.000000
name                                                     0.009936
name_limited                                            99.555346
designation                                              0.009936
nasa_jpl_url                                             0.000000
absolute_magnitude_h                                     0.032293
is_potentially_hazardous_asteroid                        0.000000
close_approach_data                                      0.000000
is_sentry_object                                         0.000000
links_self                                               0.000000
estimated_diameter_kilometers_estimated_diameter_min     0.032293
estimated_diameter_kilometers_estimated_diameter_max     0.032293
estimated_diameter_meters_estimated_diameter_min         0.032293
estimated_diameter_meters_estimated_diameter_max         0.032293
estimated_

In [49]:
df.duplicated().sum()

np.int64(0)
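
The missing-value and duplicate checks above can be bundled into one small helper. The following is a minimal sketch: the function name `missing_summary` and the toy frame are illustrative assumptions, not part of the notebook, and the toy values are placeholders rather than real records.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst columns first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "missing_pct": counts / len(frame) * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame standing in for the asteroid data (placeholder values)
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "name_limited": [np.nan, np.nan, np.nan, "Eros"],
    "absolute_magnitude_h": [10.4, np.nan, 13.81, 9.18],
})

summary = missing_summary(toy)
n_duplicates = int(toy.duplicated().sum())
print(summary)
print("duplicate rows:", n_duplicates)
```

Sorting by percentage puts mostly empty columns such as `name_limited` at the top, which makes it easier to decide which columns to drop.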

Now that we have a general idea of what the dataset contains, we will drop the columns that we do not need.

In [50]:
useful_data = ['id', 'absolute_magnitude_h', 'estimated_diameter_kilometers_estimated_diameter_min', 'estimated_diameter_kilometers_estimated_diameter_max', 'is_potentially_hazardous_asteroid', 'close_approach_data']
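# .copy() makes an independent frame so later column assignments do not trigger SettingWithCopyWarning
df = df[useful_data].copy()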
df.head()

Unnamed: 0,id,absolute_magnitude_h,estimated_diameter_kilometers_estimated_diameter_min,estimated_diameter_kilometers_estimated_diameter_max,is_potentially_hazardous_asteroid,close_approach_data
0,2000433,10.4,22.108281,49.435619,False,"[{'close_approach_date': '1900-12-27', 'close_..."
1,2000719,15.59,2.025606,4.529393,False,"[{'close_approach_date': '1909-08-21', 'close_..."
2,2000887,13.81,4.597852,10.281109,False,"[{'close_approach_date': '1974-01-04', 'close_..."
3,2001036,9.18,38.775283,86.704169,False,"[{'close_approach_date': '1910-02-25', 'close_..."
4,2001221,17.37,0.892391,1.995446,False,"[{'close_approach_date': '1908-03-14', 'close_..."
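
The two diameter columns appear to follow the standard conversion from absolute magnitude H to diameter, D = 1329 km / sqrt(p) * 10^(-0.2 H), evaluated at two albedos p. The albedo values 0.25 and 0.05 used below are an assumption that reproduces the first row shown above; the notebook itself does not state them. The function name `diameter_km` is also hypothetical.

```python
import math


def diameter_km(absolute_magnitude_h: float, albedo: float) -> float:
    """Estimate diameter in km from absolute magnitude H and geometric albedo."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-0.2 * absolute_magnitude_h)


# 433 Eros, H = 10.4 (first row of the table above)
d_min = diameter_km(10.4, albedo=0.25)  # brighter surface -> smaller diameter
d_max = diameter_km(10.4, albedo=0.05)  # darker surface -> larger diameter
print(round(d_min, 3), round(d_max, 3))
```

If this relationship holds across the dataset, the diameter columns are deterministic functions of `absolute_magnitude_h`, which is worth keeping in mind when choosing features.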


Now that we have removed most of the unneeded columns, we will flatten the close approach column to get the data we need out of it: the date of approach, the velocity of the asteroid, and the miss distance. Only the first entry in each asteroid's close approach list is used.

In [51]:
df['close_approach_data'] = df['close_approach_data'].apply(ast.literal_eval)
df['close_approach_date'] = df['close_approach_data'].apply(lambda x: x[0]['close_approach_date'] if len(x) > 0 else np.nan)
df['relative_velocity_km_per_hr'] = df['close_approach_data'].apply(lambda x: float(x[0]['relative_velocity']['kilometers_per_hour']) if len(x) > 0 else np.nan)
df['miss_distance_kilometers'] = df['close_approach_data'].apply(lambda x: float(x[0]['miss_distance']['kilometers']) if len(x) > 0 else np.nan)
df = df.drop(columns=['close_approach_data'])
df.head()

Unnamed: 0,id,absolute_magnitude_h,estimated_diameter_kilometers_estimated_diameter_min,estimated_diameter_kilometers_estimated_diameter_max,is_potentially_hazardous_asteroid,close_approach_date,relative_velocity_km_per_hr,miss_distance_kilometers
0,2000433,10.4,22.108281,49.435619,False,1900-12-27,20083.029075,47112730.0
1,2000719,15.59,2.025606,4.529393,False,1909-08-21,12405.704411,255629000.0
2,2000887,13.81,4.597852,10.281109,False,1974-01-04,25545.505209,20461810.0
3,2001036,9.18,38.775283,86.704169,False,1910-02-25,22693.91908,292651800.0
4,2001221,17.37,0.892391,1.995446,False,1908-03-14,38772.626612,27512870.0
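
The flattening step can also be written as one helper that parses a cell once and returns all three fields. This is a sketch under assumptions: the name `first_approach` is hypothetical, and the sample string is a trimmed record that keeps only the keys used in the cell above, with values copied from the first output row as displayed.

```python
import ast
import math


def first_approach(raw: str) -> dict:
    """Parse a stringified close-approach list and flatten its first entry."""
    approaches = ast.literal_eval(raw)
    if not approaches:
        # No recorded approach: mirror the notebook's NaN fallback
        return {
            "close_approach_date": math.nan,
            "relative_velocity_km_per_hr": math.nan,
            "miss_distance_kilometers": math.nan,
        }
    first = approaches[0]
    return {
        "close_approach_date": first["close_approach_date"],
        "relative_velocity_km_per_hr": float(first["relative_velocity"]["kilometers_per_hour"]),
        "miss_distance_kilometers": float(first["miss_distance"]["kilometers"]),
    }


# Trimmed sample record (only the keys the helper reads)
sample = (
    "[{'close_approach_date': '1900-12-27', "
    "'relative_velocity': {'kilometers_per_hour': '20083.029075'}, "
    "'miss_distance': {'kilometers': '47112730.0'}}]"
)
row = first_approach(sample)
empty = first_approach("[]")
print(row)
```

On a DataFrame, one way to use it would be `pd.DataFrame(df['close_approach_data'].map(first_approach).tolist(), index=df.index)`, which parses each cell once instead of once per extracted column.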


This is the finished, cleaned version of the data. These steps will be productionized in the preprocess file, and later data manipulations will take place in the feature engineering files.