In [1]:
import pandas as pd


In [3]:
df = pd.read_csv('./data/kaggle_neo.csv')

|  Group  |  Definition  |  Description  |
|:--------|:------------:|:--------------|
| NECs    | q<1.3au, P<200yrs  |  Near-Earth Comets |
| NEAs    | q<1.3au     |  Near-Earth Asteroids |
| Atiras | a<1.0 au, Q<0.983 au | NEAs whose orbits are contained entirely within the orbit of the Earth (named after asteroid 163693 Atira) |
| Atens  | a<1.0 au, Q>0.983 au | Earth-crossing NEAs with semi-major axes smaller than Earth's (named after asteroid 2062 Aten) |
| Apollos | a>1.0 au, q<1.017 au | Earth-crossing NEAs with semi-major axes larger than Earth's (named after asteroid 1862 Apollo) |
| Amors  | a>1.0 au, 1.017<q<1.3 au | Earth-approaching NEAs with orbits exterior to Earth's but interior to Mars' (named after asteroid 1221 Amor) |
| PHAs   | MOID<=0.05 au, H<=22.0  | Potentially Hazardous Asteroids: NEAs whose Minimum Orbit Intersection Distance (MOID) with the Earth is 0.05 au or less and whose absolute magnitude (H) is 22.0 or brighter |

(q = perihelion distance, Q = aphelion distance, a = semi-major axis)
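
The group definitions in the table can be written as a small rule function. This is only a sketch: the function name `classify_nea` is hypothetical, the dataset has no `a` column so the semi-major axis is derived as a = (q + Q) / 2 (valid for elliptical orbits), and the handling of exact boundary values is an arbitrary choice because the table uses strict inequalities. The sample values are taken from rows of the dataset shown later in this notebook.

```python
def classify_nea(q, Q):
    """Return the NEA group from the table for perihelion q and aphelion Q (au).

    Assumption: the orbit is elliptical, so a = (q + Q) / 2.
    Boundary values (e.g. a == 1.0) are not covered by the table; the
    handling below is an arbitrary choice.
    """
    if q >= 1.3:
        return None  # not near-Earth by the table's definition
    a = (q + Q) / 2
    if a < 1.0:
        return "Atira" if Q < 0.983 else "Aten"
    return "Apollo" if q < 1.017 else "Amor"


print(classify_nea(1.07, 3.77))  # (2024 FL4), labeled Amor in the data
print(classify_nea(0.59, 2.91))  # (2024 AP7), labeled Apollo in the data
print(classify_nea(0.43, 1.02))  # (2024 AF6), labeled Aten in the data
```

Comparing this rule against the `Orbit Class` column would be one way to sanity-check the labels before modeling the group.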


-   **Designation**     : Designation of NEO object
-   **Discovery Date**  : Date of Discovery
-   **H (mag)**         : Absolute Magnitude
-   **MOID (au)**       : Minimum Orbit Intersection Distance with Earth
-   **q (au)**          : perihelion distance
-   **Q (au)**          : aphelion distance
-   **period (yr)**     : Orbital Period (one full revolution around the Sun)
-   **i (deg)**         : Orbital Inclination, tilt of the orbital plane relative to Earth's orbital plane (the ecliptic), in degrees
-   **PHA**             : Potentially hazardous (target variable)
-   **Orbit Class**     : Near Earth orbital class *
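
Several of these columns are related to each other. As a rough consistency check (not part of the original analysis), the period can be recomputed from q and Q using a = (q + Q) / 2 and Kepler's third law, P² = a³ with P in years and a in au, which holds for small bodies orbiting the Sun. The helper name `period_from_apsides` is hypothetical, and the (q, Q, period) triples are copied from sample rows of the dataset.

```python
def period_from_apsides(q, Q):
    """Orbital period in years from perihelion q and aphelion Q (both in au).

    Uses a = (q + Q) / 2 and Kepler's third law P**2 = a**3, which assumes
    a body of negligible mass on an elliptical orbit around the Sun.
    """
    a = (q + Q) / 2
    return a ** 1.5


# (q, Q, listed period) from sample rows of the dataset
samples = [(1.07, 3.77, 3.76), (1.16, 4.23, 4.42), (0.43, 1.02, 0.62)]
for q, Q, listed in samples:
    computed = period_from_apsides(q, Q)
    print(f"computed {computed:.2f} yr, listed {listed} yr")
```

Because q and Q are rounded to two decimals in the file, small differences from the listed period are expected. It also suggests that `period (yr)` carries little information beyond what `q (au)` and `Q (au)` already provide.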

### Exploratory analysis

* Primary goal: classify whether an object is potentially hazardous (PHA)
* Secondary goal: classify the orbit group (Orbit Class)

In [4]:
df.describe()

Unnamed: 0,H (mag),MOID (au),q (au),Q (au),period (yr),i (deg)
count,398.0,432.0,432.0,430.0,430.0,432.0
mean,22.938467,0.321176,1.087176,896.216837,166454.7,27.161806
std,13.697327,0.576055,0.628169,16383.291931,3363227.0,24.992035
min,15.45,0.0003,0.12,1.02,0.53,0.82
25%,19.5,0.052,0.79,2.3325,2.185,13.89
50%,20.485,0.155,0.99,3.595,3.505,21.98
75%,21.5975,0.29225,1.16,4.5875,4.63,30.0175
max,99.99,6.373,7.15,338831.51,69733420.0,162.3


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 432 entries, 0 to 431
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Designation                  432 non-null    object 
 1   Discovery Date (YYYY-MM-DD)  432 non-null    object 
 2   H (mag)                      398 non-null    float64
 3   MOID (au)                    432 non-null    float64
 4   q (au)                       432 non-null    float64
 5   Q (au)                       430 non-null    float64
 6   period (yr)                  430 non-null    float64
 7   i (deg)                      432 non-null    float64
 8   PHA                          398 non-null    object 
 9   Orbit Class                  432 non-null    object 
dtypes: float64(6), object(4)
memory usage: 33.9+ KB


In [6]:
df.isna().sum()

Designation                     0
Discovery Date (YYYY-MM-DD)     0
H (mag)                        34
MOID (au)                       0
q (au)                          0
Q (au)                          2
period (yr)                     2
i (deg)                         0
PHA                            34
Orbit Class                     0
dtype: int64


In [12]:
not_na_df = df[df['PHA'].notna()]
not_na_df

Unnamed: 0,Designation,Discovery Date (YYYY-MM-DD),H (mag),MOID (au),q (au),Q (au),period (yr),i (deg),PHA,Orbit Class
0,(2024 FL4),2024-03-31,21.61,0.094,1.07,3.77,3.76,15.59,N,Amor
1,(2024 ET5),2024-03-14,20.68,0.235,1.16,4.23,4.42,15.23,N,Amor
2,(2024 EO2),2024-03-01,19.08,0.192,1.19,4.55,4.86,54.74,N,Amor
3,(2024 AP7),2024-01-15,19.79,0.224,0.59,2.91,2.31,24.54,N,Apollo
4,(2024 AF6),2024-01-13,20.67,0.134,0.43,1.02,0.62,15.05,N,Aten
...,...,...,...,...,...,...,...,...,...,...
427,(2010 BV132),2010-01-16,21.48,0.023,1.01,1.45,1.36,17.28,Y,Apollo
428,(2010 AG79),2010-01-13,20.04,0.232,1.21,4.58,4.93,33.03,N,Amor
429,614599 (2010 AB78),2010-01-12,18.31,0.208,1.03,3.49,3.39,33.27,N,Amor
430,(2010 AZ85),2010-01-08,21.61,0.685,1.60,2.53,2.96,19.88,N,Mars-crossing Asteroid


In [14]:
na_df = df[df['PHA'].isna()]
na_df


Unnamed: 0,Designation,Discovery Date (YYYY-MM-DD),H (mag),MOID (au),q (au),Q (au),period (yr),i (deg),PHA,Orbit Class
52,C/2020 P1 (NEOWISE),2020-08-02,,0.643,0.34,,,45.05,,Hyperbolic Comet
59,C/2020 F3 (NEOWISE),2020-03-27,,0.363,0.29,716.64,6787.09,128.94,,Comet
69,C/2019 L2 (NEOWISE),2019-06-11,,0.636,1.62,49.66,129.85,152.19,,Halley-type Comet*
75,C/2019 H1 (NEOWISE),2019-04-18,,0.865,1.84,470.82,3633.22,104.58,,Comet
91,C/2018 N1 (NEOWISE),2018-07-02,,0.292,1.31,842.62,8668.07,159.44,,Comet
98,C/2018 EN4 (NEOWISE),2018-03-09,,0.568,1.45,35.4,79.09,81.56,,Halley-type Comet*
123,C/2017 C1 (NEOWISE),2017-02-06,,0.914,1.5,40.11,94.89,65.75,,Halley-type Comet*
137,C/2016 U1 (NEOWISE),2016-10-21,,0.589,0.32,,,46.43,,Hyperbolic Comet
153,C/2016 C2 (NEOWISE),2016-02-08,,0.632,1.56,126.48,512.23,38.16,,Comet
156,C/2016 B1 (NEOWISE),2016-01-17,,2.201,3.21,839.95,8656.16,50.48,,Comet


I think H will matter more for the grouping than for whether an object is hazardous, although it could matter for hazard too if the object is large enough. PHA is what needs to be predicted, so PHA is the target.
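
Since PHA is stored as `Y`/`N` strings, one possible way to set it up as a target is to map it to 1/0 and keep the remaining columns as candidate features. The snippet is a minimal sketch on a toy frame with three rows copied from the dataset; in the notebook the input would be `not_na_df`, and the names `sample`, `X`, and `y` are illustrative only.

```python
import pandas as pd

# Toy rows copied from the dataset; the real input would be not_na_df
sample = pd.DataFrame({
    "H (mag)": [21.61, 19.79, 21.48],
    "MOID (au)": [0.094, 0.224, 0.023],
    "PHA": ["N", "N", "Y"],
})

# Map the Y/N flag to 1/0 so it can serve as a binary target
y = sample["PHA"].map({"Y": 1, "N": 0})

# Everything except the target stays as candidate features
X = sample.drop(columns="PHA")

print(y.tolist())
print(list(X.columns))
```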

In [19]:
not_na_df[not_na_df['Designation'].str.contains('NEOWISE')]

Unnamed: 0,Designation,Discovery Date (YYYY-MM-DD),H (mag),MOID (au),q (au),Q (au),period (yr),i (deg),PHA,Orbit Class


It looks like the NEOWISE discoveries are all missing both H and PHA. I can still try to predict PHA for them without H; in the initial training runs we can see how well a model performs without that variable.
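
A possible way to prepare for that comparison is to keep two feature lists, one with H and one without, and to split the rows by whether PHA is known. This is a sketch on toy rows copied from the dataset (one of them the NEOWISE comet C/2020 F3); the variable names are hypothetical, and no model is trained here.

```python
import pandas as pd

# Two labeled rows and one unlabeled NEOWISE row, copied from the dataset
sample = pd.DataFrame({
    "Designation": ["(2024 FL4)", "(2010 BV132)", "C/2020 F3 (NEOWISE)"],
    "H (mag)": [21.61, 21.48, None],
    "MOID (au)": [0.094, 0.023, 0.363],
    "q (au)": [1.07, 1.01, 0.29],
    "Q (au)": [3.77, 1.45, 716.64],
    "period (yr)": [3.76, 1.36, 6787.09],
    "i (deg)": [15.59, 17.28, 128.94],
    "PHA": ["N", "Y", None],
})

features_with_h = ["H (mag)", "MOID (au)", "q (au)", "Q (au)", "period (yr)", "i (deg)"]
features_without_h = [c for c in features_with_h if c != "H (mag)"]

labeled = sample[sample["PHA"].notna()]
unlabeled = sample[sample["PHA"].isna()]

# Features that are available for both the labeled and the unlabeled rows
X_train = labeled[features_without_h]
X_unlabeled = unlabeled[features_without_h]

print(X_train.shape, X_unlabeled.shape)
```

Note that the two hyperbolic comets in the data also lack `Q (au)` and `period (yr)`, so those rows would need separate handling.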

In [20]:
not_na_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 398 entries, 0 to 431
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Designation                  398 non-null    object 
 1   Discovery Date (YYYY-MM-DD)  398 non-null    object 
 2   H (mag)                      398 non-null    float64
 3   MOID (au)                    398 non-null    float64
 4   q (au)                       398 non-null    float64
 5   Q (au)                       398 non-null    float64
 6   period (yr)                  398 non-null    float64
 7   i (deg)                      398 non-null    float64
 8   PHA                          398 non-null    object 
 9   Orbit Class                  398 non-null    object 
dtypes: float64(6), object(4)
memory usage: 34.2+ KB


We know the definition of Potentially Hazardous Asteroids: NEAs whose Minimum Orbit Intersection Distance (MOID) with the Earth is 0.05 au or less and whose absolute magnitude (H) is 22.0 or brighter.
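
That definition can be applied directly as a rule, which gives a simple baseline to compare any trained model against. The sketch below uses three rows copied from the dataset; the `PHA (rule)` column name is made up for illustration. "Brighter" means a smaller magnitude, so the condition is `H <= 22.0`.

```python
import pandas as pd

# Rows copied from the dataset; the real input would be not_na_df
sample = pd.DataFrame({
    "Designation": ["(2024 FL4)", "(2010 BV132)", "(2010 AG79)"],
    "H (mag)": [21.61, 21.48, 20.04],
    "MOID (au)": [0.094, 0.023, 0.232],
    "PHA": ["N", "Y", "N"],
})

# PHA definition: MOID <= 0.05 au and H <= 22.0 (lower H means brighter)
rule = (sample["MOID (au)"] <= 0.05) & (sample["H (mag)"] <= 22.0)
sample["PHA (rule)"] = rule.map({True: "Y", False: "N"})

print(sample[["Designation", "PHA", "PHA (rule)"]])
```

Running the same rule on all labeled rows would show how much of the `PHA` column is explained by these two thresholds alone.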

In [24]:
# How many objects are hazardous (PHA == 'Y') vs. not hazardous (PHA == 'N')?
print(not_na_df[not_na_df['PHA']=='Y'].count())
print(not_na_df[not_na_df['PHA']=='N'].count())

Designation                    66
Discovery Date (YYYY-MM-DD)    66
H (mag)                        66
MOID (au)                      66
q (au)                         66
Q (au)                         66
period (yr)                    66
i (deg)                        66
PHA                            66
Orbit Class                    66
dtype: int64
Designation                    332
Discovery Date (YYYY-MM-DD)    332
H (mag)                        332
MOID (au)                      332
q (au)                         332
Q (au)                         332
period (yr)                    332
i (deg)                        332
PHA                            332
Orbit Class                    332
dtype: int64
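
The same counts can be obtained more compactly with `value_counts`, which also makes the class imbalance easy to see. The series below is a stand-in built to match the counts printed above (66 `Y`, 332 `N`); in the notebook it would be `not_na_df['PHA']`.

```python
import pandas as pd

# Stand-in for not_na_df['PHA'] with the same class counts as the output above
pha = pd.Series(["Y"] * 66 + ["N"] * 332, name="PHA")

counts = pha.value_counts()                # absolute counts per class
share = pha.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(share.round(3))
```

Roughly one object in six is labeled hazardous, so plain accuracy would be a misleading metric for the primary goal.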


In [None]:
# Drop the discovery date; it is not needed for the analysis
filtered_set = not_na_df.drop(columns='Discovery Date (YYYY-MM-DD)')
filtered_set.columns

Index(['Designation', 'H (mag)', 'MOID (au)', 'q (au)', 'Q (au)',
       'period (yr)', 'i (deg)', 'PHA', 'Orbit Class'],
      dtype='object')