# Asteroids Potential Hazards Prediction 

This notebook is a workflow that uses various Python-based machine learning models to predict whether an asteroid is potentially hazardous.

I am an amateur astronomer and astrophotographer, and a member of Singapore Sidewalk Astronomy (https://www.facebook.com/SingaporeSidewalkAstronomy).

My background is in computer science, and I am a machine learning practitioner.

I am going to take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

How can we use various Python-based machine learning models and the given parameters to predict whether an asteroid is potentially hazardous?

# 2. Data

Data from: https://www.kaggle.com/sakhawat18/asteroid-dataset



# 3. Evaluation

## Task Details

The Asteroid Dataset contains different physical parameters and measurements. The first task is to predict whether an asteroid is potentially hazardous or not.

## Expected Submission

Submit a notebook that implements the full life-cycle of data preparation, model creation and evaluation. Feel free to use this dataset plus any other datasets you have available. Since this is not a formal competition, you're not submitting a single submission file, but rather your whole approach to building a model.
With this model, you should produce a table in the following format

## Evaluation

This is not a formal competition, so we won't measure the results strictly against a given validation set using a strict metric. We will check the following points in the submitted notebooks:

    * Accuracy
    * Data Preparation
    * Proper Documentation
    * We’re looking for genuine approaches to building models on a real problem that can serve as learning examples for our Astronomy Community.


# 4. Features


    1. SPK-ID: Object primary SPK-ID
    2. Object ID: Object internal database ID
    3. Object fullname: Object full name/designation
    4. pdes: Object primary designation
    5. name: Object IAU name
    6. NEO: Near-Earth Object (NEO) flag
    8. H: Absolute magnitude parameter
    9. Diameter: Object diameter (from equivalent sphere), in km
    10. Albedo: Geometric albedo
    11. Diameter_sigma: 1-sigma uncertainty in object diameter, in km
    12. Orbit_id: Orbit solution ID
    13. Epoch: Epoch of osculation in modified Julian day form
    14. Equinox: Equinox of reference frame
    15. e: Eccentricity
    16. a: Semi-major axis, in au
    17. q: Perihelion distance, in au
    18. i: Inclination; angle with respect to the x-y ecliptic plane
    19. tp: Time of perihelion passage (TDB)
    20. moid_ld: Earth Minimum Orbit Intersection Distance, in lunar distances (LD)

## Outputs / Labels
    7. PHA: Potentially Hazardous Asteroid (PHA) flag
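
For context, the PHA flag is conventionally assigned from two thresholds: an Earth MOID of at most 0.05 au and an absolute magnitude H of at most 22.0. The sketch below only illustrates that rule. `is_pha` is a hypothetical helper that is not part of this notebook, and it assumes the MOID value is expressed in au rather than in lunar distances.

```python
def is_pha(moid_au: float, h_mag: float) -> bool:
    """Return True when both conventional PHA thresholds are met.

    Hypothetical helper for illustration only.
    moid_au: Earth minimum orbit intersection distance, assumed to be in au.
    h_mag:   absolute magnitude parameter H.
    """
    return moid_au <= 0.05 and h_mag <= 22.0


# Close orbit and bright (large) object: flagged
print(is_pha(0.03, 20.0))
# Orbit too distant: not flagged
print(is_pha(0.06, 20.0))
# Close orbit but too faint (small): not flagged
print(is_pha(0.03, 23.0))
```

If the label follows this rule, features carrying MOID and H information would be expected to dominate any model trained on it.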


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#df =  pd.read_csv('/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/Asteroids Potential Hazards Prediction /Data/dataset.csv')
df = pd.read_csv('/kaggle/input/asteroid-dataset/dataset.csv', low_memory=False)
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.isnull().sum() / len(df) * 100

## Data Cleaning

We have chosen to drop the name and prefix columns for now, as they will not impact the model.

In [None]:
df = df.drop(['name', 'prefix'], axis=1)

In [None]:
df[df['pha'] == 'Y'].isnull().sum()

As we are not domain experts in asteroids and their potential hazards, we have chosen to clean up the data by dropping the remaining rows that contain NaN values, since we do not know how parameters like diameter, albedo and diameter_sigma affect the model's outcome.

In [None]:
df = df.dropna()
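
Dropping every row with a NaN can remove a different share of each class, so it may be worth checking how many rows of each label survive. The following is a minimal sketch on a made-up toy frame (the values are invented and are not from the dataset); `toy` and `summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration: some rows lack diameter or albedo
toy = pd.DataFrame({
    'pha': ['N', 'N', 'Y', 'N', 'Y', 'N'],
    'diameter': [1.2, np.nan, np.nan, 0.8, 0.4, np.nan],
    'albedo': [0.10, 0.25, np.nan, 0.05, 0.30, np.nan],
})

# Count rows per label before and after dropping incomplete rows
before = toy['pha'].value_counts()
after = toy.dropna()['pha'].value_counts()

summary = pd.DataFrame({'before': before, 'after': after}).fillna(0).astype(int)
summary['kept_pct'] = (summary['after'] / summary['before'] * 100).round(1)
print(summary)
```

The same comparison could be run on the real `df` just before the `dropna()` call to see how much of the minority class is lost.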

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Count of Potentially Hazardous Asteroid (PHA) flag')
sns.countplot(data=df, x='pha');

In [None]:
len(df[df['pha'] == 'N'])

In [None]:
len(df[df['pha'] == 'Y'])

In [None]:
len(df[df['pha'] == 'Y'])/ len(df[df['pha'] == 'N']) * 100

In [None]:
df['equinox'].unique()

As these few columns are IDs, names, or reference-frame labels, we will drop them from the dataset:
* id
* spkid
* orbit_id
* full_name
* pdes
* equinox

In [None]:
df = df.drop(['id', 'spkid','full_name', 'equinox','orbit_id','pdes'], axis=1)

In [None]:
df

## Understanding the data, an in-depth look

In [None]:
df.info()

In [None]:
df['class'].unique()

In [None]:
df['pha'] = df['pha'].map({'Y': 1, 'N': 0})

In [None]:
plt.figure(figsize=(20,20))
# Correlate numeric columns only; text columns such as class cannot be correlated
sns.heatmap(data=df.corr(numeric_only=True).round(2), annot=True)

In [None]:
df = pd.get_dummies(df)

# 5. Modelling

In [None]:
X = df.drop('pha', axis=1)
y = df['pha']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
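
Because the PHA class is a small minority, a plain random split can leave the test set with very few positive rows. One option is a stratified split, which keeps the class ratio the same in both parts. This is a sketch on synthetic data, not a change to the cell above; `X_toy` and `y_toy` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 2% positives
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# stratify=y_toy keeps the positive share equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print('positives in train:', y_tr.sum())
print('positives in test: ', y_te.sum())
```

In the notebook this would amount to passing `stratify=y` to the existing `train_test_split` call.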

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Model Imports

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier, XGBRFClassifier

## Baseline Modelling

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    np.random.seed(42)
    
    model_scores = {}
    
    for name, model in models.items():
        model.fit(X_train,y_train)
        model_scores[name] = model.score(X_test,y_test)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')
        
    return model_scores

In [None]:
models = {'LogisticRegression': LogisticRegression(max_iter=10000),
          'KNeighborsClassifier': KNeighborsClassifier(),
          'SVC': SVC(),
          'DecisionTreeClassifier': DecisionTreeClassifier(),
          'RandomForestClassifier': RandomForestClassifier(),
          'AdaBoostClassifier': AdaBoostClassifier(),
          'GradientBoostingClassifier': GradientBoostingClassifier(),
          'XGBClassifier': XGBClassifier(),
          'XGBRFClassifier': XGBRFClassifier()}

In [None]:
baseline_model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores.sort_values('Score')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores.sort_values('Score').T)
plt.title('Baseline Model Accuracy Score')
plt.xticks(rotation=90);

Since the baseline AdaBoostClassifier model performs at 0.999981 accuracy, we will use it to build the model and evaluate it.
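
A very high accuracy is easier to interpret next to a trivial baseline, since on imbalanced labels a model that always predicts the majority class also scores well. The sketch below uses invented labels (998 negatives, 2 positives), which are not the dataset's actual class ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Invented labels for illustration: 998 non-hazardous, 2 hazardous
y_true = np.array([0] * 998 + [1] * 2)

# A "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)

print('accuracy:', acc)  # high, despite finding no hazardous object
print('recall:  ', rec)  # share of hazardous objects actually found
```

This is why the evaluation in the next section also looks at precision, recall and F1 rather than accuracy alone.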

# 6. Model Evaluation
## AdaBoostClassifier

In [None]:
model = AdaBoostClassifier()
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

In [None]:
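from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

In [None]:
print(classification_report(y_test,y_preds))

### Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

### ROC Curve

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)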

### Calculate evaluation metrics using cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cv_acc = cross_val_score(model,X,y,cv=5,
                         scoring='accuracy')
cv_acc

In [None]:
cv_acc = cv_acc.mean()

In [None]:
cv_precision = cross_val_score(model,X,y,cv=5,
                         scoring='precision')
cv_precision

In [None]:
cv_precision.mean()

In [None]:
cv_recall = cross_val_score(model,X,y,cv=5,
                         scoring='recall')
cv_recall

In [None]:
cv_recall.mean()

In [None]:
cv_f1 = cross_val_score(model,X,y,cv=5,
                         scoring='f1')
cv_f1

In [None]:
cv_f1.mean()
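
The four `cross_val_score` calls above each refit the model five times. A possible alternative is `cross_validate`, which scores several metrics from the same fits. The sketch below runs on synthetic data from `make_classification` and wraps the scaler and the model in a pipeline so that scaling is refit inside each fold; `X_demo`, `y_demo`, `pipe` and `cv_summary` are illustrative names, and `n_estimators=25` is an arbitrary choice to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced stand-in for the asteroid features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=42)

# Scaling happens inside each fold, so no test-fold statistics leak in
pipe = make_pipeline(
    StandardScaler(),
    AdaBoostClassifier(n_estimators=25, random_state=42))

metrics = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(pipe, X_demo, y_demo, cv=5, scoring=metrics)

# Mean of each metric across the five folds
cv_summary = {name: scores[f'test_{name}'].mean() for name in metrics}
print(cv_summary)
```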

In [None]:
cv_merics = pd.DataFrame({'Accuracy': cv_acc.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])  # use the F1 scores, not recall
sns.barplot(data=cv_merics)
plt.title('CV scores')

In [None]:
cv_merics
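
When reading this table, it helps to remember that F1 is the harmonic mean of precision and recall, so it differs from both unless the two are equal. A small worked example with made-up values follows; `f1_from` is a hypothetical helper.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up values for illustration
precision, recall = 0.9, 0.6
f1 = f1_from(precision, recall)

# 2 * 0.9 * 0.6 / (0.9 + 0.6) = 1.08 / 1.5
print(round(f1, 3))  # 0.72
```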

### Feature Importances

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances.sort_values(0).T);

## XGBClassifier

In [None]:
model = XGBClassifier()
model.fit(X_train,y_train)
y_preds = model.predict(X_test)

In [None]:
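from sklearn.metrics import classification_report, ConfusionMatrixDisplay, RocCurveDisplay

In [None]:
print(classification_report(y_test,y_preds))

### Confusion Matrix

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)

### ROC Curve

In [None]:
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(model, X_test, y_test)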

### Calculate evaluation metrics using cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cv_acc = cross_val_score(model,X,y,cv=5,
                         scoring='accuracy')
cv_acc

In [None]:
cv_acc = cv_acc.mean()

In [None]:
cv_precision = cross_val_score(model,X,y,cv=5,
                         scoring='precision')
cv_precision

In [None]:
cv_precision.mean()

In [None]:
cv_recall = cross_val_score(model,X,y,cv=5,
                         scoring='recall')
cv_recall

In [None]:
cv_recall.mean()

In [None]:
cv_f1 = cross_val_score(model,X,y,cv=5,
                         scoring='f1')
cv_f1

In [None]:
cv_f1.mean()

In [None]:
cv_merics = pd.DataFrame({'Accuracy': cv_acc.mean(),
                         'Precision': cv_precision.mean(),
                         'Recall': cv_recall.mean(),
                         'f1': cv_f1.mean()},index=[0])  # use the F1 scores, not recall
sns.barplot(data=cv_merics)
plt.title('CV scores')

In [None]:
cv_merics

### Feature Importances

In [None]:
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns)

In [None]:
plt.figure(figsize=(20,10))
plt.xticks(rotation=90)
plt.title('Feature Importances')
sns.barplot(data= feat_importances.sort_values(0).T);

# Saving the model

In [None]:
# import joblib

In [None]:
# joblib.dump(model, '/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/Asteroids Potential Hazards Prediction /Models/Models.joblib')
# joblib.dump(scaler,'/content/drive/MyDrive/Colab Notebooks/ML Self-Projects/Asteroids Potential Hazards Prediction /Models/scaler.joblib')
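
The commented-out calls above can be tried without a Drive path. The sketch below saves a fitted scaler to a temporary directory and loads it back; the file name and the toy data are placeholders, and the same pattern would apply to the model object.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy scaler fitted on three values (mean 2.0)
scaler_demo = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder file name inside a throwaway directory
    path = os.path.join(tmp_dir, 'scaler.joblib')
    joblib.dump(scaler_demo, path)
    restored = joblib.load(path)

# The restored scaler should transform inputs exactly like the original
same = np.allclose(restored.transform([[2.0]]), scaler_demo.transform([[2.0]]))
print(same)
```

Saving the scaler alongside the model matters because new data has to be scaled with the same statistics before prediction.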

# 7. Experimentation

As we can see, both models perform rather well with just the baseline modelling, without any hyperparameter tuning yet.
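
As a rough idea of what a tuning step could look like, the sketch below runs a small grid search over two AdaBoost parameters on synthetic data. The grid values, the F1 scoring choice and the names `X_demo`, `y_demo` and `search` are assumptions for illustration, not settings tested on the asteroid data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=42)

# Small, arbitrary grid to keep the search fast
param_grid = {
    'n_estimators': [10, 25],
    'learning_rate': [0.5, 1.0],
}

# Score on F1 rather than accuracy, which suits imbalanced labels better
search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring='f1',
    cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```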

1. We can try tuning the models' hyperparameters for slightly better results.

2. We can see from the feature importances of both models that diameter, albedo and diameter_sigma are used in the models. However, in XGBoost their feature importance is lower. Could we instead drop those columns so that we can feed in more data?

3. Or, could we instead find a way to fill the NaN values in diameter, albedo and diameter_sigma?
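
For the third idea, one simple starting point is median imputation with scikit-learn's `SimpleImputer`. The sketch below uses a made-up toy frame; whether a median fill is physically sensible for diameter and albedo is an open question that this example does not answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with invented values and a few gaps
toy = pd.DataFrame({
    'diameter': [1.0, np.nan, 3.0, 5.0],
    'albedo': [0.1, 0.3, np.nan, 0.2],
})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

In a real run the imputer would be fitted on the training split only and then applied to the test split, in the same way as the scaler.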