
#   **NASA - Nearest Earth Objects**


---






Outer space holds a vast number of objects, and some of them pass closer to us than we realize.

We may believe that an object at a distance of 70,000 kilometres cannot harm us, but on an astronomical scale this is a very short distance, one at which an object can substantially disturb natural phenomena. As a result, these objects/asteroids may be dangerous. Hence, it is prudent to be aware of our surroundings and what may pose a threat to us.

This **Machine Learning** project therefore compares several **Classification** techniques, including a **Neural Network**, to identify which near-Earth celestial bodies may pose a threat to us.

To achieve this goal, *a dataset that compiles the NASA-certified asteroids classified as near-Earth objects* is used as the input for the Python code.



---




In [None]:
# linear algebra
import numpy as np

# data processing, CSV file I/O
import pandas as pd
# pandas is built on top of another package, NumPy, which provides support for multi-dimensional arrays

Since the input data files are available in the read-only "../input/" directory, running this will list all the files under the input directory.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
  for filename in filenames:
    print(os.path.join(dirname, filename))

**IMPORTING LIBRARIES**

---





In [None]:
# data analysis libraries
import numpy as np
import pandas as pd

# visualization libraries
import matplotlib.pyplot as plt          # another form -- from matplotlib import pyplot as plt
import seaborn as sns

# a comment cannot follow the magic command on the same line
%matplotlib inline
# plots/graphs will be displayed directly below the cell that produces them

MODELS


---





In [None]:
from sklearn.model_selection import train_test_split           
# sklearn is a free software machine learning library

from sklearn.metrics import accuracy_score                     
# sklearn.metrics -- measure classification performance

In [None]:
from xgboost import XGBClassifier
# XGBoost is an implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm
# that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.
# xgboost.XGBClassifier is a scikit-learn API compatible class for classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
# ensemble methods combine the predictions of several base estimators built with a given learning algorithm

from sklearn.neighbors import KNeighborsClassifier
# sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods

from sklearn.naive_bayes import GaussianNB
# Naive Bayes methods are supervised learning algorithms based on Bayes’ theorem
# with the “naive” assumption of conditional independence between every pair of features given the value of the class variable
# GaussianNB implements the Gaussian Naive Bayes algorithm for classification (Normal Distribution)

from sklearn.linear_model import SGDClassifier
# Linear models assume the target (or decision function) is a linear combination of the features;
# SGDClassifier fits a linear classifier (a linear SVM by default) using stochastic gradient descent

from sklearn.tree import DecisionTreeClassifier
# DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset



---


Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
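
To make the idea of "simple decision rules" concrete, here is a minimal sketch of a hand-written depth-2 tree. The thresholds and the function name `toy_tree_predict` are hypothetical and chosen only for illustration; a fitted `DecisionTreeClassifier` would learn its own rules from the data.

```python
def toy_tree_predict(absolute_magnitude, miss_distance):
    """Hand-written depth-2 decision tree (illustrative thresholds only)."""
    # Root split: a lower absolute magnitude means a brighter object.
    if absolute_magnitude <= 22.0:
        # Second split on how closely the object passes.
        if miss_distance <= 40_000_000:
            return 1  # predicted hazardous
        return 0
    return 0  # faint objects are predicted non-hazardous


# Every input that lands in the same leaf receives the same constant
# prediction, which is why a tree is a piecewise constant approximation.
print(toy_tree_predict(20.0, 25_000_000.0))
print(toy_tree_predict(20.0, 61_438_130.0))
print(toy_tree_predict(25.0, 25_000_000.0))
```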

---

When the base classifier is a Decision Tree, Bootstrap Aggregation (bagging) of many trees, each also restricted to a random subset of features when choosing a split, is called a Random Forest.
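
The bagging part of this idea can be sketched in a few lines of plain Python. This is a toy illustration, not how `RandomForestClassifier` is implemented: it uses one-split "stumps" on a single made-up feature, so the random feature selection that a real Random Forest adds is left out. The helper names `fit_stump` and `bagged_predict` are hypothetical.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Pick the threshold on a 1-D feature that misclassifies the fewest samples.

    The stump predicts 1 when x <= threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x <= threshold) != bool(label) for x, label in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagged_predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter(int(x <= t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
# Toy data: (feature value, label); the label is 1 for small feature values.
data = [(x, int(x <= 5)) for x in range(1, 11)]

# Bagging: fit each stump on a bootstrap sample (drawn with replacement).
thresholds = []
for _ in range(25):
    bootstrap = [rng.choice(data) for _ in data]
    thresholds.append(fit_stump(bootstrap))

print(bagged_predict(thresholds, 2), bagged_predict(thresholds, 9))
```

Each stump sees a slightly different sample, so individual thresholds vary, while the majority vote stays stable.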


---



In [None]:
import tensorflow as tf
# TensorFlow is used to implement machine learning and deep learning models

from keras.models import Sequential
# Keras is an open-source high-level Neural Network library. It supports Convolutional Networks and Recurrent Networks individually
# and also their combination. Sequential means we are building the NN as a linear stack of layers, added one at a time

from keras.layers import Dense
# Each layer is created from a layer class such as Dense.
# Layers are fed with input information, process it with some computation, and produce an output.
# The output of one layer is fed to the next layer as its input.
# Dense: every neuron in the layer receives input from all the neurons of the previous layer.
# In image models, Dense layers often classify the output of convolutional layers; here they act directly on the tabular features.
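
As a rough illustration of what a Dense layer computes for a single sample, the sketch below evaluates `activation(W x + b)` in plain Python. The weights, the bias values and the helper name `dense_forward` are made up for the example, and the weight layout (one row per neuron) is chosen for readability rather than taken from Keras.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def dense_forward(inputs, weights, biases, activation):
    """Compute activation(W x + b) for one sample.

    weights[j][i] is the weight from input i to neuron j, so every
    neuron sees every input (this is what "dense" means).
    """
    pre_activation = [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]
    return activation(pre_activation)


# Hypothetical layer with 3 inputs and 2 neurons (made-up weights).
x = [1.0, 2.0, 3.0]
W = [[0.5, -1.0, 1.0],
     [-1.0, 0.0, 0.0]]
b = [0.5, 0.5]

print(dense_forward(x, W, b, relu))  # [2.0, 0.0]
```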

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')
# 'ignore' = never display warnings which match

In [None]:
# read using pandas
data = pd.read_csv('neo.csv')
data.head(5)

|   | id | name | est_diameter_min | est_diameter_max | relative_velocity | miss_distance | orbiting_body | sentry_object | absolute_magnitude | hazardous |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2162635 | 162635 (2000 SS164) | 1.198271 | 2.679415 | 13569.249224 | 54839740.0 | Earth | False | 16.73 | False |
| 1 | 2277475 | 277475 (2005 WK4) | 0.2658 | 0.594347 | 73588.726663 | 61438130.0 | Earth | False | 20.0 | True |
| 2 | 2512244 | 512244 (2015 YE18) | 0.72203 | 1.614507 | 114258.692129 | 49798720.0 | Earth | False | 17.83 | False |
| 3 | 3596030 | (2012 BV13) | 0.096506 | 0.215794 | 24764.303138 | 25434970.0 | Earth | False | 22.2 | False |
| 4 | 3667127 | (2014 GE35) | 0.255009 | 0.570217 | 42737.733765 | 46275570.0 | Earth | False | 20.09 | True |




---


This file incorporates a variety of parameters/features that determine whether or not an asteroid that has already been classified as a near-Earth object is dangerous.


---





---


id = Unique Identifier for each Asteroid

name = Name given by NASA

est_diameter_min = Minimum Estimated Diameter in Kilometres

est_diameter_max = Maximum Estimated Diameter in Kilometres

relative_velocity = Velocity Relative to Earth

miss_distance = Distance in Kilometres missed  

orbiting_body = Planet that the asteroid orbits

sentry_object = Included in sentry - an automated collision monitoring system

absolute_magnitude = Describes intrinsic luminosity

hazardous = Boolean feature that shows whether asteroid is harmful or not


---





---


Sentry is an automated impact prediction system operated by the JPL Center for NEO Studies. It continually monitors the most up-to-date asteroid catalog for possibilities of future impact with Earth over the next 100+ years.

Miss distance, I think, is the minimum distance by which the object misses the Earth.

For a Solar System object, the absolute magnitude is a measure of the brightness an object would have if it were at 1 au from both the observer and the Sun, and at a phase angle (the angle Sun-object-Earth) of 0 degrees.
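
Absolute magnitude is also how asteroid sizes are commonly estimated. A widely used conversion is D = (1329 km / sqrt(p)) * 10^(-H/5), where H is the absolute magnitude and p the geometric albedo. The sketch below assumes albedos of 0.25 and 0.05 for the minimum and maximum estimates; these albedo values are my assumption and are not documented with the dataset, although they reproduce the diameters in the first row of the table above.

```python
import math


def estimated_diameter_km(absolute_magnitude, albedo):
    """Diameter in km from absolute magnitude H and geometric albedo p."""
    return 1329.0 / math.sqrt(albedo) * 10 ** (-0.2 * absolute_magnitude)


# First row of the dataset: absolute_magnitude = 16.73
d_min = estimated_diameter_km(16.73, albedo=0.25)  # assumed reflective surface
d_max = estimated_diameter_km(16.73, albedo=0.05)  # assumed dark surface
print(round(d_min, 4), round(d_max, 4))
```

If this relation holds for every row, both diameter columns are fixed functions of `absolute_magnitude`, which would be consistent with treating `est_diameter_max` as redundant.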


---



In [None]:
data.info()
# prints a concise summary of the DataFrame: index range, columns, non-null counts, dtypes and memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90836 entries, 0 to 90835
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  90836 non-null  int64  
 1   name                90836 non-null  object 
 2   est_diameter_min    90836 non-null  float64
 3   est_diameter_max    90836 non-null  float64
 4   relative_velocity   90836 non-null  float64
 5   miss_distance       90836 non-null  float64
 6   orbiting_body       90836 non-null  object 
 7   sentry_object       90836 non-null  bool   
 8   absolute_magnitude  90836 non-null  float64
 9   hazardous           90836 non-null  bool   
dtypes: bool(2), float64(5), int64(1), object(2)
memory usage: 5.7+ MB


In [None]:
data.shape

(90836, 10)

In [None]:
data.isnull().sum()
# returns the number of missing values in each column of the data set

id                    0
name                  0
est_diameter_min      0
est_diameter_max      0
relative_velocity     0
miss_distance         0
orbiting_body         0
sentry_object         0
absolute_magnitude    0
hazardous             0
dtype: int64

In [None]:
data.describe()
# computes summary statistics (count, mean, std, min, percentiles, max) for the numerical columns of the Series or DataFrame
# 25% is the 25th percentile

|   | id | est_diameter_min | est_diameter_max | relative_velocity | miss_distance | absolute_magnitude |
|---|---|---|---|---|---|---|
| count | 90836.0 | 90836.0 | 90836.0 | 90836.0 | 90836.0 | 90836.0 |
| mean | 14382880.0 | 0.127432 | 0.284947 | 48066.918918 | 37066550.0 | 23.527103 |
| std | 20872020.0 | 0.298511 | 0.667491 | 25293.296961 | 22352040.0 | 2.894086 |
| min | 2000433.0 | 0.000609 | 0.001362 | 203.346433 | 6745.533 | 9.23 |
| 25% | 3448110.0 | 0.019256 | 0.043057 | 28619.020645 | 17210820.0 | 21.34 |
| 50% | 3748362.0 | 0.048368 | 0.108153 | 44190.11789 | 37846580.0 | 23.7 |
| 75% | 3884023.0 | 0.143402 | 0.320656 | 62923.604633 | 56549000.0 | 25.7 |
| max | 54275910.0 | 37.89265 | 84.730541 | 236990.128088 | 74798650.0 | 33.2 |


In [None]:
# check for duplicates
print(f'Duplicates in dataset: {data.duplicated().sum()}, ({np.round(100 * data.duplicated().sum() / len(data), 1)} %)')

# formatted string literal or f-string is a string literal that is prefixed with 'f' or 'F'.
# These strings may contain replacement fields, which are expressions delimited by curly braces {}. 
# While other string literals always have a constant value, formatted strings are really expressions evaluated at run time

Duplicates in dataset: 0, (0.0 %)


In [None]:
# splitting into X and y, dropping irrelevant features, column 'hazardous' transformed into int

X = data.drop(['id', 'name', 'est_diameter_max', 'orbiting_body', 'sentry_object', 'hazardous'], axis = 1)

y = data.hazardous.astype('int')
# DataFrame.astype() method is used to cast a pandas object to a specified dtype

print(X.shape, y.shape)
# shape is a tuple that always gives dimensions of the array

(90836, 4) (90836,)


In [None]:
# split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
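
What a shuffled hold-out split does can be sketched without scikit-learn. The function below is a simplified stand-in written for illustration (the name `holdout_split` is hypothetical); the exact shuffling and rounding used by `train_test_split` may differ.

```python
import random


def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed, like random_state
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))
train, test = holdout_split(rows)
print(len(train), len(test))  # 8 2
```

Fixing the seed makes the split reproducible, which is the role `random_state = 0` plays in the cell above.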



---


PREDICTION


---



In [None]:
# by K-Nearest Neighbors (KNN)

KNN = KNeighborsClassifier(n_neighbors = 3)
KNN.fit(X_train, y_train)
KNN_pred = KNN.predict(X_test)
KNN_acc = round(accuracy_score(y_test, KNN_pred) * 100, 2)
print(f'{KNN_acc} %')

88.06 %


In [None]:
# by Stochastic Gradient Descent (SGD) Classifier

SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
SGDC_pred = SGDC.predict(X_test)
SGDC_acc = round(accuracy_score(y_test, SGDC_pred) * 100, 2)
print(f'{SGDC_acc}')

87.15


In [None]:
# by Gaussian Naive Bayes

GNB = GaussianNB()
GNB.fit(X_train, y_train)
GNB_pred = GNB.predict(X_test)
GNB_acc = round(accuracy_score(y_test, GNB_pred) * 100, 2)
print(f'{GNB_acc} %')

89.57 %


In [None]:
# by Decision Tree Classifier

DTC = DecisionTreeClassifier()
DTC.fit(X_train, y_train)
DTC_pred = DTC.predict(X_test)
DTC_acc = round(accuracy_score(y_test, DTC_pred) * 100, 2)
print(f'{DTC_acc} %')

89.27 %


In [None]:
# by Random Forest

RF = RandomForestClassifier()
RF.fit(X_train, y_train)
RF_pred = RF.predict(X_test)
RF_acc = round(accuracy_score(y_test, RF_pred) * 100, 2)
print(f'{RF_acc} %')

91.69 %


In [None]:
# by eXtreme Gradient Boosting (XGBoost)

XGBC = XGBClassifier()
XGBC.fit(X_train, y_train)
XGBC_pred = XGBC.predict(X_test)
XGBC_acc = round(accuracy_score(y_test, XGBC_pred) * 100, 2)
print(f'{XGBC_acc} %')

91.23 %


In [None]:
# displaying accuracy score of all models

model_table = pd.DataFrame({
    'Model' : ['K Neighbors Classifier', 'SGD Classifier', 'Gaussian Naive Bayes',
               'Decision Tree Classifier', 'Random Forest', 'XG Boost'],
    'Score' : [KNN_acc, SGDC_acc, GNB_acc,
               DTC_acc, RF_acc, XGBC_acc]})
model_table.sort_values(by = 'Score', ascending = False)

|   | Model | Score |
|---|---|---|
| 4 | Random Forest | 91.69 |
| 5 | XG Boost | 91.23 |
| 2 | Gaussian Naive Bayes | 89.57 |
| 3 | Decision Tree Classifier | 89.27 |
| 0 | K Neighbors Classifier | 88.06 |
| 1 | SGD Classifier | 87.15 |
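
The scores above are plain accuracy: the share of test rows whose prediction matches the label. As a hedged aside, the sketch below computes accuracy together with precision and recall for the positive (hazardous) class on made-up labels, to show how the three can diverge. The labels and the helper name `classification_report_basic` are hypothetical; nothing here is computed from the NASA dataset.

```python
def classification_report_basic(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels: 2 hazardous objects out of 10.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a model that never flags anything
print(classification_report_basic(y_true, y_pred))  # (0.8, 0.0, 0.0)
```

In this toy case the accuracy is 80% even though no hazardous object is detected, so it may be worth checking recall alongside accuracy when one class is rare.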







**By Neural Network**






In [None]:
classifier = Sequential()
classifier.add(Dense(12, input_dim = 4, activation = 'relu'))

# units = dimensionality of output space

classifier.add(Dense(8, activation = 'relu'))
classifier.add(Dense(1, activation = 'sigmoid'))
classifier.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

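# compile = configures the model for training by setting the loss function, optimizer and metrics;
# the architecture (number of layers, units and activation functions) is already defined by the add() calls above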
# loss function = computes the distance between the current output of the algorithm and the expected output
# binary_crossentropy = computes the cross-entropy loss between true labels and predicted labels. This class comes under probabilistic losses
# metrics = list of metrics to be evaluated by the model during training and testing

classifier.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_9 (Dense)             (None, 12)                60        
                                                                 
 dense_10 (Dense)            (None, 8)                 104       
                                                                 
 dense_11 (Dense)            (None, 1)                 9         
                                                                 
Total params: 173
Trainable params: 173
Non-trainable params: 0
_________________________________________________________________
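
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A small sketch (the helper name `dense_param_count` is mine):

```python
def dense_param_count(n_inputs, units):
    """Weights (n_inputs * units) plus one bias per unit."""
    return n_inputs * units + units


layer_sizes = [4, 12, 8, 1]  # input features, then units of each Dense layer
counts = [dense_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [60, 104, 9] 173
```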




---


Compiling the model - This configures the neural network for training. It is done by building the computation graph in the correct format for the Keras backend you are using (here, TensorFlow). The compilation step asks you to define the loss function and the kind of optimizer you want to use. These options depend on the problem you are trying to solve; you can usually find the best techniques by reading the literature in the field.
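
For a binary problem like this one, the loss chosen at compile time is binary cross-entropy. The sketch below shows the formula in plain Python; it is an illustration of the definition rather than the Keras implementation, and the clipping constant `eps` is my own choice.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


y_true = [1, 0]
confident_right = [sigmoid(4.0), sigmoid(-4.0)]
confident_wrong = [sigmoid(-4.0), sigmoid(4.0)]
print(binary_crossentropy(y_true, confident_right))
print(binary_crossentropy(y_true, confident_wrong))
```

The loss is small when confident predictions are correct and large when they are wrong, which is the signal the optimizer uses to adjust the weights.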


---






In [None]:
# fitting the model

history = classifier.fit(X_train, y_train, batch_size = 18, epochs = 10, validation_split = 0.1, verbose = 1, shuffle = True)

# verbose = 0, 1 or 2 sets how the training progress is displayed for each epoch
# shuffle = boolean (whether to shuffle the training data before each epoch) or str (for 'batch')

# accuracy prediction

_, accuracy = classifier.evaluate(X_test, y_test)
# if you don't need the specific values or the values are not used, assign the values to underscore (here, we don't need loss values)
print('Accuracy from Neural Network: %.2f' % (accuracy * 100))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy from Neural Network: 12.31




---


Epoch: an arbitrary cutoff, defined as "one pass over the entire dataset", used to separate training into distinct phases, which is useful for logging and periodic evaluation.

So, the number of epochs is how many times you go through your training set.

The model is updated each time a batch is processed, which means that it can be updated multiple times during one epoch. If batch_size is set equal to the length of x, then the model will be updated once per epoch.
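
The relationship between batch size and updates per epoch is simple arithmetic. A minimal sketch, with a hypothetical helper name and made-up sample counts:

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of batches (and hence weight updates) in one epoch."""
    return math.ceil(n_samples / batch_size)


print(updates_per_epoch(1000, 18))    # 56, the last batch is smaller than 18
print(updates_per_epoch(1000, 1000))  # 1, one update per epoch
```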


---





---


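We saw that the prediction accuracy of the *Random Forest Classifier* is the highest (91.69%). Hence, to find out which near-Earth objects are hazardous to us, this is the classification model we should try to use.

It is also quite interesting that all the classical models have similar accuracy (roughly 87-92%). The neural network as trained above is the exception at 12.31%, which more likely points to a problem with its training setup (unscaled input features are one possible cause, not verified here) than to a meaningful comparison with the other models.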


---

