# NASA Asteroids Classification

## Information on Asteroids Collected from the NASA API

![Asteroid Image](images/asteroids.jpg)

### Dependencies

```bash
# Jupyter notebook
pip install notebook
# NumPy
pip install numpy
# SciPy
pip install scipy
# Pandas
pip install pandas
# Scikit-Learn
pip install scikit-learn
# Matplotlib
pip install matplotlib
# Seaborn
pip install seaborn
```

## Data Analysis and Preprocessing

### Import Libraries

In [55]:
import numpy as np
import pandas as pd
from sklearn import *
import matplotlib.pyplot as plt
import seaborn as sb

### Create DataFrame from the NASA CSV File

In [56]:
# DataFrame Pandas Settings
# pd.set_option('display.max_columns', None)

df = pd.read_csv("data/nasa.csv")

df.head()

|   | Neo Reference ID | Name | Absolute Magnitude | Est Dia in KM(min) | Est Dia in KM(max) | Est Dia in M(min) | Est Dia in M(max) | Est Dia in Miles(min) | Est Dia in Miles(max) | Est Dia in Feet(min) | ... | Asc Node Longitude | Orbital Period | Perihelion Distance | Perihelion Arg | Aphelion Dist | Perihelion Time | Mean Anomaly | Mean Motion | Equinox | Hazardous |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 3703080 | 3703080 | 21.6 | 0.12722 | 0.284472 | 127.219879 | 284.472297 | 0.079051 | 0.176763 | 417.388066 | ... | 314.373913 | 609.599786 | 0.808259 | 57.25747 | 2.005764 | 2458162.0 | 264.837533 | 0.590551 | J2000 | True |
| 1 | 3723955 | 3723955 | 21.3 | 0.146068 | 0.326618 | 146.067964 | 326.617897 | 0.090762 | 0.202951 | 479.22562 | ... | 136.717242 | 425.869294 | 0.7182 | 313.091975 | 1.497352 | 2457795.0 | 173.741112 | 0.84533 | J2000 | False |
| 2 | 2446862 | 2446862 | 20.3 | 0.231502 | 0.517654 | 231.502122 | 517.654482 | 0.143849 | 0.321655 | 759.521423 | ... | 259.475979 | 643.580228 | 0.950791 | 248.415038 | 1.966857 | 2458120.0 | 292.893654 | 0.559371 | J2000 | True |
| 3 | 3092506 | 3092506 | 27.4 | 0.008801 | 0.019681 | 8.801465 | 19.680675 | 0.005469 | 0.012229 | 28.876199 | ... | 57.173266 | 514.08214 | 0.983902 | 18.707701 | 1.527904 | 2457902.0 | 68.741007 | 0.700277 | J2000 | False |
| 4 | 3514799 | 3514799 | 21.6 | 0.12722 | 0.284472 | 127.219879 | 284.472297 | 0.079051 | 0.176763 | 417.388066 | ... | 84.629307 | 495.597821 | 0.967687 | 158.263596 | 1.483543 | 2457814.0 | 135.142133 | 0.726395 | J2000 | True |


### Pre-processing

#### Check if there are NA values

In [57]:
df.isna().sum()

Neo Reference ID                0
Name                            0
Absolute Magnitude              0
Est Dia in KM(min)              0
Est Dia in KM(max)              0
Est Dia in M(min)               0
Est Dia in M(max)               0
Est Dia in Miles(min)           0
Est Dia in Miles(max)           0
Est Dia in Feet(min)            0
Est Dia in Feet(max)            0
Close Approach Date             0
Epoch Date Close Approach       0
Relative Velocity km per sec    0
Relative Velocity km per hr     0
Miles per hour                  0
Miss Dist.(Astronomical)        0
Miss Dist.(lunar)               0
Miss Dist.(kilometers)          0
Miss Dist.(miles)               0
Orbiting Body                   0
Orbit ID                        0
Orbit Determination Date        0
Orbit Uncertainity              0
Minimum Orbit Intersection      0
Jupiter Tisserand Invariant     0
Epoch Osculation                0
Eccentricity                    0
Semi Major Axis                 0
Inclination   

#### Check if there are duplicates

In [58]:
df.duplicated().sum()


0

#### Check if there are constant columns

In [59]:
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
print(constant_columns)

['Orbiting Body', 'Equinox']
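
The three checks above can be bundled into one helper so they are easy to rerun after each preprocessing step. This is only a minimal sketch: `summarize_quality` and the toy frame are hypothetical names introduced for illustration, and the values are made up rather than taken from `data/nasa.csv`.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Return missing-value, duplicate-row, and constant-column information."""
    return {
        # Total number of missing cells across all columns
        "missing_cells": int(frame.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Columns holding a single distinct non-null value
        "constant_columns": [c for c in frame.columns if frame[c].nunique() == 1],
    }


# Toy data standing in for the real CSV (values are made up)
toy = pd.DataFrame({
    "Absolute Magnitude": [21.6, 21.3, 21.6],
    "Orbiting Body": ["Earth", "Earth", "Earth"],
    "Equinox": ["J2000", "J2000", "J2000"],
})
report = summarize_quality(toy)
print(report)
```

On the real data the call would be `summarize_quality(df)`.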


### Drop Unnecessary Columns

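The asteroid's ID and name are identifiers with no predictive value. The diameter, velocity, and miss-distance columns repeat the same quantity in several units, so we keep only one unit of each. `Orbiting Body` and `Equinox` are constant, and `Orbit ID` is dropped as well.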

In [60]:
new_df = df.drop(columns=['Neo Reference ID',
                          'Name',
                          'Est Dia in KM(min)',
                          'Est Dia in KM(max)',
                          'Est Dia in Miles(min)',
                          'Est Dia in Miles(max)',
                          'Est Dia in Feet(min)',
                          'Est Dia in Feet(max)',
                          'Relative Velocity km per hr',
                          'Miles per hour',
                          'Miss Dist.(Astronomical)',
                          'Miss Dist.(lunar)',
                          'Miss Dist.(miles)',
                          'Orbiting Body',
                          'Orbit ID',
                          'Equinox'
                          ])
new_df.to_csv('data/new_csv.csv', index=False)
new_df.head()

|   | Absolute Magnitude | Est Dia in M(min) | Est Dia in M(max) | Close Approach Date | Epoch Date Close Approach | Relative Velocity km per sec | Miss Dist.(kilometers) | Orbit Determination Date | Orbit Uncertainity | Minimum Orbit Intersection | ... | Inclination | Asc Node Longitude | Orbital Period | Perihelion Distance | Perihelion Arg | Aphelion Dist | Perihelion Time | Mean Anomaly | Mean Motion | Hazardous |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21.6 | 127.219879 | 284.472297 | 1995-01-01 | 788947200000 | 6.115834 | 62753692.0 | 2017-04-06 08:36:37 | 5 | 0.025282 | ... | 6.025981 | 314.373913 | 609.599786 | 0.808259 | 57.25747 | 2.005764 | 2458162.0 | 264.837533 | 0.590551 | True |
| 1 | 21.3 | 146.067964 | 326.617897 | 1995-01-01 | 788947200000 | 18.113985 | 57298148.0 | 2017-04-06 08:32:49 | 3 | 0.186935 | ... | 28.412996 | 136.717242 | 425.869294 | 0.7182 | 313.091975 | 1.497352 | 2457795.0 | 173.741112 | 0.84533 | False |
| 2 | 20.3 | 231.502122 | 517.654482 | 1995-01-08 | 789552000000 | 7.590711 | 7622911.5 | 2017-04-06 09:20:19 | 0 | 0.043058 | ... | 4.237961 | 259.475979 | 643.580228 | 0.950791 | 248.415038 | 1.966857 | 2458120.0 | 292.893654 | 0.559371 | True |
| 3 | 27.4 | 8.801465 | 19.680675 | 1995-01-15 | 790156800000 | 11.173874 | 42683616.0 | 2017-04-06 09:15:49 | 6 | 0.005512 | ... | 7.905894 | 57.173266 | 514.08214 | 0.983902 | 18.707701 | 1.527904 | 2457902.0 | 68.741007 | 0.700277 | False |
| 4 | 21.6 | 127.219879 | 284.472297 | 1995-01-15 | 790156800000 | 9.840831 | 61010824.0 | 2017-04-06 08:57:58 | 1 | 0.034798 | ... | 16.793382 | 84.629307 | 495.597821 | 0.967687 | 158.263596 | 1.483543 | 2457814.0 | 135.142133 | 0.726395 | True |
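
Several of the dropped columns restate a retained column in different units, and a correlation check makes that redundancy visible. The sketch below uses a small made-up frame and a hypothetical helper, `perfectly_correlated_pairs`; it is one possible way to confirm the redundancy, not the method used in the original notebook.

```python
import pandas as pd


def perfectly_correlated_pairs(frame: pd.DataFrame, tol: float = 1e-9) -> list:
    """List numeric column pairs whose absolute Pearson correlation is 1 within tol."""
    corr = frame.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        # Only look at columns to the right to avoid reporting a pair twice
        for right in cols[i + 1:]:
            if abs(corr.loc[left, right] - 1.0) <= tol:
                pairs.append((left, right))
    return pairs


# Made-up values: the kilometre column is the metre column divided by 1000
toy = pd.DataFrame({"Est Dia in M(min)": [127.2, 146.1, 231.5, 8.8]})
toy["Est Dia in KM(min)"] = toy["Est Dia in M(min)"] / 1000
toy["Absolute Magnitude"] = [21.6, 21.3, 20.3, 27.4]

pairs = perfectly_correlated_pairs(toy)
print(pairs)
```

A pair reported here differs only by a linear rescaling, so keeping one column of the pair loses no information for a model.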


### Encode Column Types

As the output below shows, two columns, `Close Approach Date` and `Orbit Determination Date`, have dtype `object`, so they cannot be used directly in numerical analysis. The `Hazardous` column is boolean.

In [61]:
print(new_df.dtypes)

Absolute Magnitude              float64
Est Dia in M(min)               float64
Est Dia in M(max)               float64
Close Approach Date              object
Epoch Date Close Approach         int64
Relative Velocity km per sec    float64
Miss Dist.(kilometers)          float64
Orbit Determination Date         object
Orbit Uncertainity                int64
Minimum Orbit Intersection      float64
Jupiter Tisserand Invariant     float64
Epoch Osculation                float64
Eccentricity                    float64
Semi Major Axis                 float64
Inclination                     float64
Asc Node Longitude              float64
Orbital Period                  float64
Perihelion Distance             float64
Perihelion Arg                  float64
Aphelion Dist                   float64
Perihelion Time                 float64
Mean Anomaly                    float64
Mean Motion                     float64
Hazardous                          bool
dtype: object


We convert both date columns to milliseconds since the Unix epoch (the results can be checked against https://www.epochconverter.com/). We also map `True` to 1 and `False` to 0 in `Hazardous`.

In [62]:
new_df['Close Approach Date'] = pd.to_datetime(new_df['Close Approach Date'])
new_df['Close Approach Date'] = (new_df['Close Approach Date'].astype('int64') // 10**6)

new_df['Orbit Determination Date'] = pd.to_datetime(new_df['Orbit Determination Date'])
new_df['Orbit Determination Date'] = (new_df['Orbit Determination Date'].astype('int64') // 10**6)

new_df['Hazardous'] = new_df['Hazardous'].astype('int')


new_df.to_csv('data/new_csv.csv', index=False)
new_df.head()


|   | Absolute Magnitude | Est Dia in M(min) | Est Dia in M(max) | Close Approach Date | Epoch Date Close Approach | Relative Velocity km per sec | Miss Dist.(kilometers) | Orbit Determination Date | Orbit Uncertainity | Minimum Orbit Intersection | ... | Inclination | Asc Node Longitude | Orbital Period | Perihelion Distance | Perihelion Arg | Aphelion Dist | Perihelion Time | Mean Anomaly | Mean Motion | Hazardous |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21.6 | 127.219879 | 284.472297 | 788918400000 | 788947200000 | 6.115834 | 62753692.0 | 1491467797000 | 5 | 0.025282 | ... | 6.025981 | 314.373913 | 609.599786 | 0.808259 | 57.25747 | 2.005764 | 2458162.0 | 264.837533 | 0.590551 | 1 |
| 1 | 21.3 | 146.067964 | 326.617897 | 788918400000 | 788947200000 | 18.113985 | 57298148.0 | 1491467569000 | 3 | 0.186935 | ... | 28.412996 | 136.717242 | 425.869294 | 0.7182 | 313.091975 | 1.497352 | 2457795.0 | 173.741112 | 0.84533 | 0 |
| 2 | 20.3 | 231.502122 | 517.654482 | 789523200000 | 789552000000 | 7.590711 | 7622911.5 | 1491470419000 | 0 | 0.043058 | ... | 4.237961 | 259.475979 | 643.580228 | 0.950791 | 248.415038 | 1.966857 | 2458120.0 | 292.893654 | 0.559371 | 1 |
| 3 | 27.4 | 8.801465 | 19.680675 | 790128000000 | 790156800000 | 11.173874 | 42683616.0 | 1491470149000 | 6 | 0.005512 | ... | 7.905894 | 57.173266 | 514.08214 | 0.983902 | 18.707701 | 1.527904 | 2457902.0 | 68.741007 | 0.700277 | 0 |
| 4 | 21.6 | 127.219879 | 284.472297 | 790128000000 | 790156800000 | 9.840831 | 61010824.0 | 1491469078000 | 1 | 0.034798 | ... | 16.793382 | 84.629307 | 495.597821 | 0.967687 | 158.263596 | 1.483543 | 2457814.0 | 135.142133 | 0.726395 | 1 |
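
The integer conversion above assumes the datetime column is stored at nanosecond resolution, so that floor division by `10**6` yields milliseconds. A sketch that does not depend on the stored resolution subtracts the Unix epoch and divides by a one-millisecond `Timedelta`. Here `to_epoch_ms` is a hypothetical helper name, and the sample strings are copied from the rows shown before the conversion.

```python
import pandas as pd


def to_epoch_ms(values: pd.Series) -> pd.Series:
    """Convert date strings to integer milliseconds since 1970-01-01 (UTC assumed)."""
    stamps = pd.to_datetime(values)
    # Floor division by a Timedelta gives whole milliseconds regardless of
    # the resolution pandas chose for the datetime column
    return (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(milliseconds=1)


# Sample values copied from the table before conversion
close = pd.Series(["1995-01-01", "1995-01-08"])
determined = pd.Series(["2017-04-06 08:36:37"])

print(to_epoch_ms(close).tolist())
print(to_epoch_ms(determined).tolist())
```

The results should match the `Close Approach Date` and `Orbit Determination Date` values in the converted table above.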


### Data Analysis

With every column now numeric, we can compare the distribution of each feature for hazardous and non-hazardous asteroids.

In [63]:
plt.figure(figsize=(25,25))
plt.subplots_adjust(left=0.1,bottom=0.1,right=0.9,top=0.9,wspace=0.4,hspace=0.4)

df1 = new_df[new_df['Hazardous']==1].drop(['Hazardous'],axis=1)
df2 = new_df[new_df['Hazardous']==0].drop(['Hazardous'],axis=1)

cols = list(df1.columns)

for i, col in enumerate(cols):
    plt.subplot(6,6,i+1)
    sb.histplot(data=df1[col], color='red', label = 'Dangerous', kde=True)
    sb.histplot(data=df2[col], color='blue', label = 'Safe', kde=True)
    # Build the legend from the labels passed to histplot
    plt.legend(prop={'size': 10})

plt.show()

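
If the plotting cell is slow on the full dataset, one option is to fix the number of bins and skip the KDE overlay. The sketch below swaps in plain Matplotlib histograms with bin edges shared between the two classes. `plot_class_histograms` and the random stand-in data are hypothetical, and whether this is faster on the real data is an assumption that has not been measured here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_class_histograms(frame, target="Hazardous", bins=30, ncols=6):
    """Draw one overlaid histogram per feature, split by the binary target."""
    features = [c for c in frame.columns if c != target]
    nrows = -(-len(features) // ncols)  # ceiling division
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, features):
        # Shared edges keep the two classes comparable bar for bar
        edges = np.histogram_bin_edges(frame[col], bins=bins)
        ax.hist(frame.loc[frame[target] == 1, col], bins=edges,
                color="red", alpha=0.5, label="Dangerous")
        ax.hist(frame.loc[frame[target] == 0, col], bins=edges,
                color="blue", alpha=0.5, label="Safe")
        ax.set_title(col)
        ax.legend(prop={"size": 10})
    # Hide grid cells that have no feature assigned
    for ax in flat_axes[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Random stand-in data; the real call would pass new_df instead
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Absolute Magnitude": rng.normal(22, 3, 200),
    "Inclination": rng.uniform(0, 40, 200),
    "Hazardous": rng.integers(0, 2, 200),
})
fig = plot_class_histograms(toy)
```

With the notebook's data the call would be `plot_class_histograms(new_df)`. A fixed bin count trades some detail for predictable run time on columns with very wide value ranges.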