### **Import Necessary Libraries:**

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OrdinalEncoder, RobustScaler
from matplotlib import pyplot as plt
from seaborn import heatmap, pairplot

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score

### **Loading Dataset:**

In [2]:
data = pd.read_csv("../input/nasa-asteroids-classification/nasa.csv")
data

There are 40 features including the target feature **'Hazardous'** and 4687 data points in the dataset.

### **Feature Engineering:**

#### **Removing Irrelevant Columns:**

In [3]:
data.columns

The features **'Neo Reference ID', 'Name', 'Close Approach Date', 'Epoch Date Close Approach', 'Orbit ID'** and **'Orbit Determination Date'** are identifiers or record dates rather than physical properties of an asteroid, so they have no value in training our ML model.

In [4]:
data.drop(['Neo Reference ID', 'Name', 'Close Approach Date', 'Epoch Date Close Approach', 'Orbit ID', 'Orbit Determination Date'], axis = 1, inplace = True)
data.columns

In [5]:
data['Orbiting Body'].unique(), data['Equinox'].unique()

As the features **'Orbiting Body'** and **'Equinox'** each take a single value across the whole dataset, they carry no information and can be dropped as well.

In [6]:
data.drop(['Orbiting Body', 'Equinox'], axis = 1, inplace = True)
data.columns
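The check above inspects two columns by hand. A minimal sketch of a general check for constant columns is shown below, using `DataFrame.nunique`. The frame `toy` and its values are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Illustrative frame; the values are made up, not read from nasa.csv.
toy = pd.DataFrame({
    "Orbiting Body": ["Earth", "Earth", "Earth"],
    "Equinox": ["J2000", "J2000", "J2000"],
    "Eccentricity": [0.42, 0.35, 0.22],
})

# A column with a single distinct value cannot help separate the classes.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

On the real dataset the same list comprehension could be run on `data` before deciding which columns to drop.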

#### **Handling Missing Values:**

In [7]:
data.isnull().sum()

There are no missing values in the dataset.

#### **Handling Duplicate Values:**

In [8]:
data.duplicated().sum()

The dataset has no duplicate values.

#### **Feature Encoding:**

In [9]:
data.dtypes

Among all the features, only our target feature 'Hazardous' is categorical in nature.

In [10]:
data['Hazardous'] = OrdinalEncoder(dtype = np.int64).fit_transform(data[['Hazardous']])
data.Hazardous.head(5)

#### **Feature Selection:**

In [11]:
plt.figure(figsize = (20,20))
heatmap(data.corr(), annot = True)

For each pair of features with an absolute correlation **|r| >= 0.9**, one of the two is to be eliminated.

In [12]:
data.drop(['Est Dia in KM(max)', 'Est Dia in M(min)', 'Est Dia in M(max)', 'Est Dia in Miles(min)', 'Est Dia in Miles(max)', 'Est Dia in Feet(min)', 'Est Dia in Feet(max)', 'Relative Velocity km per hr', 'Miles per hour', 'Miss Dist.(lunar)', 'Miss Dist.(kilometers)', 'Miss Dist.(miles)', 'Mean Motion', 'Perihelion Time', 'Orbital Period', 'Aphelion Dist', 'Semi Major Axis'], axis = 1, inplace = True)
data.columns
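The correlation rule can also be applied programmatically instead of reading pairs off the heatmap. The following is a minimal sketch under stated assumptions: the helper `correlated_pairs` and the synthetic columns are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic example: 'diameter_m' is just a unit conversion of 'diameter_km'.
rng = np.random.default_rng(0)
km = rng.uniform(0.01, 2.0, size=200)
toy = pd.DataFrame({
    "diameter_km": km,
    "diameter_m": km * 1000.0,
    "eccentricity": rng.uniform(0.0, 1.0, size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

Unit conversions of the same quantity, like the diameter and velocity columns dropped above, are perfectly correlated, so only one column per group needs to be kept.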

#### **Feature Scaling:**

In [13]:
plt.figure(figsize = (20,5))
data.boxplot(column = ['Absolute Magnitude', 'Relative Velocity km per sec', 'Orbit Uncertainity', 'Jupiter Tisserand Invariant', 'Inclination'])

In [14]:
plt.figure(figsize = (20,5))
data.boxplot(column = ['Est Dia in KM(min)', 'Miss Dist.(Astronomical)', 'Minimum Orbit Intersection', 'Eccentricity', 'Perihelion Distance'])

In [15]:
plt.figure(figsize = (5,5))
data.boxplot(column = ['Epoch Osculation'])

In [16]:
plt.figure(figsize = (5,5))
data.boxplot(column = ['Asc Node Longitude', 'Perihelion Arg', 'Mean Anomaly'])

As the dataset is prone to outliers, it should be scaled using a scaler that is not sensitive to outliers.

In [17]:
# Scale every column except the last one, which is the target 'Hazardous'
feature_cols = data.columns[:-1]
data[feature_cols] = RobustScaler().fit_transform(data[feature_cols])
data.head(5)
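To make the scaling step concrete, here is a minimal sketch of the transformation a robust scaler applies to one feature: center on the median and divide by the interquartile range. It is intended to mirror the default settings of `RobustScaler` (median centering, 25th to 75th percentile range). The function `robust_scale` and the sample values are assumptions for illustration.

```python
import numpy as np

def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)

# Made-up sample with one extreme value.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(sample)
print(scaled)
```

Because the median and IQR barely move when a single extreme value is added, the bulk of the data keeps a comparable scale, while the outlier itself stays far from the rest.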

### **Data Analysis:**

As the target feature is categorical in nature, **Classifier ML Models** are to be used.

In [None]:
pairplot(data, hue='Hazardous')

As the classes do not appear linearly separable in the pair plot, **Non-linear Classifier ML Models** are to be used.

In [None]:
plt.figure(figsize = (30,10))
data.boxplot(column = ['Absolute Magnitude', 'Miss Dist.(Astronomical)', 'Orbit Uncertainity', 'Minimum Orbit Intersection', 'Jupiter Tisserand Invariant', 'Eccentricity', 'Asc Node Longitude', 'Perihelion Distance', 'Perihelion Arg', 'Mean Anomaly'])

In [None]:
plt.figure(figsize = (10,5))
data.boxplot(column = ['Est Dia in KM(min)','Relative Velocity km per sec', 'Inclination'])

In [None]:
plt.figure(figsize = (10,5))
data.boxplot(column = ['Epoch Osculation'])

The features with a significant number of outliers are to be removed.

In [None]:
data.drop(['Est Dia in KM(min)','Relative Velocity km per sec', 'Epoch Osculation', 'Inclination'], axis = 1, inplace = True)
data.columns
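The box plots show outliers visually. A minimal sketch of counting them per feature is given below, assuming the usual box plot convention that points beyond 1.5 times the IQR from the quartiles are outliers. The helper `count_iqr_outliers` and the sample values are hypothetical.

```python
import numpy as np

def count_iqr_outliers(values, whisker=1.5):
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Made-up sample: only the value 100.0 lies beyond the upper whisker.
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
n_outliers = count_iqr_outliers(sample)
print(n_outliers)
```

Applied column by column, such a count gives a number to compare across features when deciding which ones have "significant" outliers.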

### **Model Application:**

A tree-based classifier such as the **Random Forest Classifier** can be used, as it is a non-linear classifier that is not sensitive to outliers. In addition, since the features with a significant number of outliers have been removed, outlier-sensitive non-linear classifiers such as the **K-Nearest Neighbor Classifier** and the **Support Vector Classifier** can now be used as well.

#### **Splitting Dataset:**

In [None]:
features = data[data.columns[:len(data.columns)-1]]
target = data[data.columns[-1]]

The dataset is split with a train-test ratio of **0.8:0.2**.

In [None]:
trainFeatures, testFeatures, trainTarget, testTarget = train_test_split(features, target, test_size = 0.2, random_state = 1)
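With 4687 data points and `test_size = 0.2`, the resulting split sizes can be worked out by hand. The sketch below assumes the test count is rounded up and the remainder goes to training, which is how `train_test_split` treats a float `test_size`.

```python
import math

n_samples = 4687   # data points reported for this dataset
test_size = 0.2

# Assumption: the test count is rounded up, the remainder is used for training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)
```

These counts can be compared against `len(trainFeatures)` and `len(testFeatures)` after the split.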

#### **Random Forest Classifier:**

**Gini impurity** (the Gini index) is used as the split criterion. It measures how mixed the class labels in a node are and is a separate criterion from entropy-based information gain.
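For reference, a minimal sketch of how Gini impurity is computed for a set of class labels: one minus the sum of squared class proportions. The function name `gini_impurity` and the label lists are illustrative assumptions and not part of scikit-learn's API.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = gini_impurity([0, 0, 0, 0])    # a node with a single class
mixed = gini_impurity([0, 0, 1, 1])   # an evenly mixed two-class node
print(pure, mixed)
```

A tree chooses the split that most reduces this impurity, so a pure node scores 0 and an evenly mixed binary node scores the maximum of 0.5.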

In [None]:
acc_val, f1_val = [],[]

for i in range(1,50,2):
    model = RandomForestClassifier(n_estimators = i,criterion = 'gini', random_state = 1)
    model.fit(trainFeatures, trainTarget)
    predTarget = model.predict(testFeatures)
    acc_val.append(round(accuracy_score(testTarget, predTarget)*100, 2))
    f1_val.append(round(f1_score(testTarget, predTarget)*100, 2))
    
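# Label both curves so accuracy and F1 can be told apart
plt.plot(range(1,50,2), acc_val, label = 'Accuracy (%)')
plt.plot(range(1,50,2), f1_val, label = 'F1 Score (%)')
plt.xlabel('n_estimators')
plt.legend()
plt.show()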

From the graph, the Random Forest Classifier produces its best output on this dataset when **n_estimators = 23**.
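A value read off a plot can be cross-checked programmatically. The following is a minimal sketch, assuming a hypothetical helper `best_setting`. The scores are made-up numbers for illustration, not results from this notebook.

```python
def best_setting(values, scores):
    """Return the first candidate value whose score equals the maximum."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return values[best_index]

# Made-up scores for four candidate settings.
candidates = [1, 3, 5, 7]
scores = [90.0, 95.5, 95.5, 93.0]
best = best_setting(candidates, scores)
print(best)
```

In the notebook, the same helper could be called with `list(range(1, 50, 2))` and `f1_val` (or `acc_val`) from the loop above. Ties go to the smallest candidate.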

**Result Analysis for RFC:**

In [None]:
model = RandomForestClassifier(n_estimators = 23, criterion = 'gini', random_state = 1)
model.fit(trainFeatures, trainTarget)
predTarget = model.predict(testFeatures)

print(f'Model: Random Forest\nAccuracy Score: {round(accuracy_score(testTarget, predTarget)*100, 2)}%\nF1 Score: {round(f1_score(testTarget, predTarget)*100, 2)}%')

#### **Support Vector Classifier:**

As the data points are not linearly separable, the **RBF** kernel is used as the kernel function.
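For reference, the RBF kernel scores the similarity of two points as exp(-gamma * squared Euclidean distance). A minimal sketch follows. The function `rbf_kernel_value` and the fixed `gamma = 0.5` are assumptions for illustration, as the `SVC` calls in this notebook leave `gamma` at its default.

```python
import math

def rbf_kernel_value(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

same = rbf_kernel_value([0.0, 0.0], [0.0, 0.0], gamma=0.5)  # identical points
near = rbf_kernel_value([0.0, 0.0], [1.0, 1.0], gamma=0.5)  # squared distance 2
far = rbf_kernel_value([0.0, 0.0], [5.0, 5.0], gamma=0.5)   # squared distance 50
print(same, near, far)
```

The similarity is 1 for identical points and decays towards 0 with distance, which is what lets the classifier draw curved decision boundaries.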

In [None]:
acc_val, f1_val = [],[]

for i in range(1,50,2):
    model = SVC(C = i, kernel = 'rbf')
    model.fit(trainFeatures, trainTarget)
    predTarget = model.predict(testFeatures)
    acc_val.append(round(accuracy_score(testTarget, predTarget)*100, 2))
    f1_val.append(round(f1_score(testTarget, predTarget)*100, 2))

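# Label both curves so accuracy and F1 can be told apart
plt.plot(range(1,50,2), acc_val, label = 'Accuracy (%)')
plt.plot(range(1,50,2), f1_val, label = 'F1 Score (%)')
plt.xlabel('C')
plt.legend()
plt.show()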

From the graph, the Support Vector Classifier produces its best output on this dataset when **C = 19**.

**Result Analysis for SVC:**

In [None]:
model = SVC(C = 19, kernel = 'rbf')
model.fit(trainFeatures, trainTarget)
predTarget = model.predict(testFeatures)

print(f'Model: Support Vector Machine\nAccuracy Score: {round(accuracy_score(testTarget, predTarget)*100, 2)}%\nF1 Score: {round(f1_score(testTarget, predTarget)*100, 2)}%')

#### **K-Nearest Neighbor Classifier:**

As the data points are numerical, **Euclidean Distance** is used as the distance metric, which is the default for `KNeighborsClassifier`.
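The cells below pass `weights = 'distance'`, so closer neighbours count for more. A minimal sketch of an inverse-distance-weighted vote with Euclidean distance is shown here. The function `weighted_vote` and the toy points are hypothetical, and the handling of exact matches is a simplification rather than scikit-learn's implementation.

```python
import math
from collections import defaultdict

def weighted_vote(query, points, labels, k):
    """Predict by inverse-distance-weighted vote among the k nearest points."""
    nearest = sorted(
        (math.dist(query, p), label) for p, label in zip(points, labels)
    )[:k]
    votes = defaultdict(float)
    for dist, label in nearest:
        if dist == 0.0:
            return label  # simplification: an exact match decides the vote
        votes[label] += 1.0 / dist
    return max(votes, key=votes.get)

# Toy data: one close neighbour of class 1, two farther neighbours of class 0.
points = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [1, 0, 0]
prediction = weighted_vote((1.0, 0.0), points, labels, k=3)
print(prediction)
```

With distances 1, 2 and 3, class 1 gets weight 1.0 and class 0 gets about 0.83, so the single close neighbour outvotes the two farther ones. An unweighted majority vote would pick class 0.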

In [None]:
acc_val, f1_val = [],[]

for i in range(1,50,2):
    model = KNeighborsClassifier(n_neighbors = i, weights = 'distance')
    model.fit(trainFeatures, trainTarget)
    predTarget = model.predict(testFeatures)
    acc_val.append(round(accuracy_score(testTarget, predTarget)*100, 2))
    f1_val.append(round(f1_score(testTarget, predTarget)*100, 2))

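# Label both curves so accuracy and F1 can be told apart
plt.plot(range(1,50,2), acc_val, label = 'Accuracy (%)')
plt.plot(range(1,50,2), f1_val, label = 'F1 Score (%)')
plt.xlabel('n_neighbors')
plt.legend()
plt.show()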

From the graph, the K-Nearest Neighbor Classifier produces its best output on this dataset when **n_neighbors = 35**.

**Result Analysis for KNC:**

In [None]:
model = KNeighborsClassifier(n_neighbors = 35, weights = 'distance')
model.fit(trainFeatures, trainTarget)
predTarget = model.predict(testFeatures)

print(f'Model: K-Nearest Neighbor Classifier\nAccuracy Score: {round(accuracy_score(testTarget, predTarget)*100, 2)}%\nF1 Score: {round(f1_score(testTarget, predTarget)*100, 2)}%')

Among the three Classifiers used on the dataset, **Random Forest Classifier** produced the best outcome.