## Nearest Earth Objects 1910-2024: Model Selection

### Importing libraries

In [1]:
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

### Loading datasets

In [2]:
FEATURES = ['absolute_magnitude', 'estimated_diameter_max', 'relative_velocity']
TARGET_VARIABLE = 'is_hazardous'

In [3]:
df_normalized = pd.read_csv('./data/normalized_data.csv')
df_standardized = pd.read_csv('./data/standardized_data.csv')

### Training & Evaluating models

In [4]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix

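def evaluateModelPreformance(df, model):
    """Fit `model` on an 80/20 split of `df` and print its evaluation metrics."""
    X = df[FEATURES].values
    y = df[TARGET_VARIABLE].values

    # Hold out 20% of the rows for testing; the fixed seed keeps runs comparable.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=273)

    print(type(model), 'has started training')
    model.fit(X_train, y_train)
    print(type(model), 'has completed training')

    y_pred = model.predict(X_test)
    # cross_val_score fits clones of the model on the full dataset (5 folds by default).
    cv_scores = cross_val_score(model, X, y)
    # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
    cm = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)

    print('Cross Validation Scores: ', cv_scores, '~', cv_scores.mean())
    print('Confusion matrix:\n', cm)
    print('Accuracy:', accuracy, '\n')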
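
The function above reports accuracy and a confusion matrix only. As a minimal sketch of how the matrix can be read, the helper below derives precision, recall and F1 for class 1 from a 2x2 matrix in scikit-learn's `[[TN, FP], [FN, TP]]` layout. The name `summarize_confusion_matrix` and the example counts are hypothetical and not part of the original notebook.

```python
def summarize_confusion_matrix(cm):
    """Derive accuracy, precision, recall and F1 for class 1.

    `cm` is a 2x2 matrix laid out as [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    predicted_positive = tp + fp
    actual_positive = tp + fn
    f1_denominator = 2 * tp + fp + fn
    return {
        'accuracy': (tn + tp) / total,
        # Fall back to 0.0 when the model never predicts class 1.
        'precision': tp / predicted_positive if predicted_positive else 0.0,
        'recall': tp / actual_positive if actual_positive else 0.0,
        'f1': 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Illustrative counts only (not taken from the dataset).
example = summarize_confusion_matrix([[90, 10], [30, 20]])
print(example)
```

Recall is the share of truly hazardous objects that the model flags, which is arguably the more relevant number here than overall accuracy.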

#### Logistic Regression

In [5]:
from sklearn.linear_model import LogisticRegression

In [6]:
evaluateModelPreformance(df_normalized, LogisticRegression())

<class 'sklearn.linear_model._logistic.LogisticRegression'> has started training
<class 'sklearn.linear_model._logistic.LogisticRegression'> has completed training
Cross Validation Scores:  [0.87235899 0.87237188 0.87237188 0.87237188 0.8723571 ] ~ 0.8723663472248108
Confusion matrix:
 [[59149     0]
 [ 8486     0]]
Accuracy: 0.8745324166481852 



In [7]:
evaluateModelPreformance(df_standardized, LogisticRegression())

<class 'sklearn.linear_model._logistic.LogisticRegression'> has started training
<class 'sklearn.linear_model._logistic.LogisticRegression'> has completed training
Cross Validation Scores:  [0.87193021 0.87231274 0.87186918 0.87217967 0.87176568] ~ 0.8720114973823412
Confusion matrix:
 [[59076    73]
 [ 8448    38]]
Accuracy: 0.8740149330967695 



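This model isn't suitable for the problem. It classifies almost every object as 0 (not hazardous): on the normalized data it never predicts class 1, and on the standardized data it finds only 38 of the 8486 hazardous objects in the test set. Its accuracy of about 0.87 simply mirrors the share of non-hazardous objects, so it cannot recognize the real danger to the planet.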
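
To see why the accuracy is misleading, the short calculation below reuses the test-set counts from the confusion matrices printed above. It is only a sketch: it computes the accuracy of a model that always predicts 0, and the recall of the standardized logistic regression. The variable names are illustrative.

```python
# Test-set class counts, read from the confusion matrices above.
n_not_hazardous = 59149
n_hazardous = 8486

# Accuracy of a "model" that labels every object as not hazardous.
baseline_accuracy = n_not_hazardous / (n_not_hazardous + n_hazardous)

# Recall of the standardized logistic regression: 38 hazardous objects found.
recall_standardized = 38 / n_hazardous

print(f'Always-0 baseline accuracy: {baseline_accuracy:.4f}')
print(f'Recall on hazardous objects: {recall_standardized:.4f}')
```

The baseline matches the accuracy reported for the normalized data, which means the fitted model adds nothing over always answering "not hazardous".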

#### K-Nearest Neighbors

In [8]:
from sklearn.neighbors import KNeighborsClassifier

In [9]:
evaluateModelPreformance(df_normalized, KNeighborsClassifier())

<class 'sklearn.neighbors._classification.KNeighborsClassifier'> has started training
<class 'sklearn.neighbors._classification.KNeighborsClassifier'> has completed training
Cross Validation Scores:  [0.85588822 0.85391963 0.85545731 0.85485111 0.85570867] ~ 0.8551649883543083
Confusion matrix:
 [[55811  3338]
 [ 6376  2110]]
Accuracy: 0.8563761366156576 



In [10]:
evaluateModelPreformance(df_standardized, KNeighborsClassifier())

<class 'sklearn.neighbors._classification.KNeighborsClassifier'> has started training
<class 'sklearn.neighbors._classification.KNeighborsClassifier'> has completed training
Cross Validation Scores:  [0.86688845 0.86687169 0.86490523 0.86579235 0.86725611] ~ 0.8663427658028644
Confusion matrix:
 [[55873  3276]
 [ 5641  2845]]
Accuracy: 0.8681599763436091 



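***KNN*** still performs poorly, but better than ***Logistic Regression***: on the standardized data it detects 2845 of the 8486 hazardous objects (about 34%), at the cost of a slightly lower accuracy (0.868).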

#### Linear Support Vector Classifier

In [11]:
from sklearn.svm import LinearSVC

In [12]:
evaluateModelPreformance(df_normalized, LinearSVC())

<class 'sklearn.svm._classes.LinearSVC'> has started training
<class 'sklearn.svm._classes.LinearSVC'> has completed training
Cross Validation Scores:  [0.87235899 0.87237188 0.87237188 0.87237188 0.8723571 ] ~ 0.8723663472248108
Confusion matrix:
 [[59149     0]
 [ 8486     0]]
Accuracy: 0.8745324166481852 



In [13]:
evaluateModelPreformance(df_standardized, LinearSVC())

<class 'sklearn.svm._classes.LinearSVC'> has started training
<class 'sklearn.svm._classes.LinearSVC'> has completed training
Cross Validation Scores:  [0.8723442  0.8723571  0.8723571  0.87237188 0.87232753] ~ 0.8723515618055993
Confusion matrix:
 [[59148     1]
 [ 8486     0]]
Accuracy: 0.874517631403859 



This model isn't suitable for the problem either. It classifies almost every object as 0 and detects none of the hazardous objects, so it cannot recognize the real danger to the planet. This is the same problem as with ***Logistic Regression***.

#### Decision Tree Classifier

In [14]:
from sklearn.tree import DecisionTreeClassifier

In [15]:
evaluateModelPreformance(df_normalized, DecisionTreeClassifier())

<class 'sklearn.tree._classes.DecisionTreeClassifier'> has started training
<class 'sklearn.tree._classes.DecisionTreeClassifier'> has completed training
Cross Validation Scores:  [0.87110224 0.87118905 0.87068634 0.87077505 0.87118905] ~ 0.8709883457957831
Confusion matrix:
 [[58933   216]
 [ 8388    98]]
Accuracy: 0.8727877578176979 



In [16]:
evaluateModelPreformance(df_standardized, DecisionTreeClassifier())

<class 'sklearn.tree._classes.DecisionTreeClassifier'> has started training
<class 'sklearn.tree._classes.DecisionTreeClassifier'> has completed training
Cross Validation Scores:  [0.87704591 0.87321466 0.87132212 0.87379129 0.87430878] ~ 0.8739365495286153
Confusion matrix:
 [[54874  4275]
 [ 4107  4379]]
Accuracy: 0.876070082058106 



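So far, the ***Decision Tree Classifier*** is the best machine learning model, but only on the standardized data, where it detects 4379 of the 8486 hazardous objects (about 52%). That is a minimal ability to tell 0s from 1s; on the normalized data it finds just 98.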

#### Random Forest Classifier

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [18]:
evaluateModelPreformance(df_normalized, RandomForestClassifier())

<class 'sklearn.ensemble._forest.RandomForestClassifier'> has started training
<class 'sklearn.ensemble._forest.RandomForestClassifier'> has completed training
Cross Validation Scores:  [0.87243291 0.87237188 0.87237188 0.87237188 0.8723571 ] ~ 0.8723811324691371
Confusion matrix:
 [[59149     0]
 [ 8483     3]]
Accuracy: 0.8745767723811636 



In [19]:
evaluateModelPreformance(df_standardized, RandomForestClassifier())

<class 'sklearn.ensemble._forest.RandomForestClassifier'> has started training
<class 'sklearn.ensemble._forest.RandomForestClassifier'> has completed training
Cross Validation Scores:  [0.87988468 0.8762161  0.8750037  0.87757637 0.87781293] ~ 0.8772987554050473
Confusion matrix:
 [[55114  4035]
 [ 4141  4345]]
Accuracy: 0.8791158423892955 



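The ***Random Forest Classifier*** is an ensemble of decision trees, which likely explains why its results are similar to those of the ***Decision Tree Classifier***: on the standardized data it detects 4345 of the 8486 hazardous objects (about 51%) with an accuracy of 0.879.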

#### XGB Classifier

In [20]:
from xgboost import XGBClassifier

In [21]:
evaluateModelPreformance(df_normalized, XGBClassifier())

<class 'xgboost.sklearn.XGBClassifier'> has started training
<class 'xgboost.sklearn.XGBClassifier'> has completed training
Cross Validation Scores:  [0.87389665 0.87339208 0.87303723 0.87330337 0.8733773 ] ~ 0.8734013250830909
Confusion matrix:
 [[58873   276]
 [ 8183   303]]
Accuracy: 0.8749316182449915 



In [22]:
evaluateModelPreformance(df_standardized, XGBClassifier())

<class 'xgboost.sklearn.XGBClassifier'> has started training
<class 'xgboost.sklearn.XGBClassifier'> has completed training
Cross Validation Scores:  [0.88158498 0.88174587 0.88215986 0.88115445 0.88133187] ~ 0.8815954059053783
Confusion matrix:
 [[58450   699]
 [ 7166  1320]]
Accuracy: 0.883714053374732 



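This model has the highest accuracy so far (0.884 on the standardized data), but it detects only 1320 of the 8486 hazardous objects (about 16%). At recognizing hazardous objects it is better than ***Logistic Regression***, but worse than the ***Random Forest*** & ***Decision Tree***.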

### Conclusion
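The dataset isn't good enough to train these machine learning models correctly. About 87% of the objects are non-hazardous, so every model reaches roughly 0.87 accuracy, yet even the best ones miss about half of the hazardous objects. The performance of the models isn't good enough for real-life use.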
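
To make the comparison behind this conclusion concrete, the sketch below recomputes accuracy and recall from the confusion matrices reported above for the standardized data. The dictionary name `standardized_results` is illustrative; the counts are copied from the outputs and the layout is assumed to be `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrices reported above for the standardized dataset.
standardized_results = {
    'LogisticRegression': [[59076, 73], [8448, 38]],
    'KNeighborsClassifier': [[55873, 3276], [5641, 2845]],
    'LinearSVC': [[59148, 1], [8486, 0]],
    'DecisionTreeClassifier': [[54874, 4275], [4107, 4379]],
    'RandomForestClassifier': [[55114, 4035], [4141, 4345]],
    'XGBClassifier': [[58450, 699], [7166, 1320]],
}


def accuracy_and_recall(cm):
    """Return (accuracy, recall for class 1) for a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)
    return accuracy, recall


metrics = {name: accuracy_and_recall(cm) for name, cm in standardized_results.items()}

# Rank by recall: how many hazardous objects each model actually finds.
ranking_by_recall = sorted(metrics, key=lambda name: metrics[name][1], reverse=True)
best_by_accuracy = max(metrics, key=lambda name: metrics[name][0])

for name in ranking_by_recall:
    accuracy, recall = metrics[name]
    print(f'{name:<24} accuracy={accuracy:.3f} recall={recall:.3f}')
```

Ranking by recall and ranking by accuracy give different winners, which is why accuracy alone is a weak criterion for this dataset.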