This notebook is intended to explore the NEOs dataset

In [1]:
import pandas as pd
import numpy as np
import sklearn

In [2]:
print("howdy")

howdy


In [3]:
neos = pd.read_csv("neo.csv")
neos.head()

Unnamed: 0,id,name,est_diameter_min,est_diameter_max,relative_velocity,miss_distance,orbiting_body,sentry_object,absolute_magnitude,hazardous
0,2162635,162635 (2000 SS164),1.198271,2.679415,13569.249224,54839740.0,Earth,False,16.73,False
1,2277475,277475 (2005 WK4),0.2658,0.594347,73588.726663,61438130.0,Earth,False,20.0,True
2,2512244,512244 (2015 YE18),0.72203,1.614507,114258.692129,49798720.0,Earth,False,17.83,False
3,3596030,(2012 BV13),0.096506,0.215794,24764.303138,25434970.0,Earth,False,22.2,False
4,3667127,(2014 GE35),0.255009,0.570217,42737.733765,46275570.0,Earth,False,20.09,True


Normalizing the features may be the best approach

In [35]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model


neos["sentry_object"] = neos["sentry_object"].astype(int)
neos["hazardous"] = neos["hazardous"].astype(int)
features_df = neos[
    [
        "est_diameter_min",
        "est_diameter_max",
        "relative_velocity",
        "miss_distance",
        "sentry_object",
        "absolute_magnitude",
    ]
]

# relative_velocity = neos["relative_velocity"]
# miss_distance = neos["miss_distance"]

target = neos["hazardous"].values
features = features_df.values
X = features
y = target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9524
)

total_num_hazardous = np.sum(target)
print("percentage of num hazardous = ", (total_num_hazardous / len(target)) * 100, "%")

percentage of num hazardous =  9.731824386806993 %


In [17]:
model = linear_model.LogisticRegression()
model.fit(X_train, y_train)

unscaled_score = model.score(X_test, y_test)
print(unscaled_score)

0.8989982386613826


89% accuracy on non-scaled data. Can we do better by scaling this with a minmax scaler?

In [15]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = linear_model.LogisticRegression()
model.fit(X_train_scaled, y_train)


score = model.score(X_test_scaled, y_test)
print(score)

0.9006494936151476


Hmmmm. 90.5%? That doesn't seem like much of an improvement.... Right. Accuracy score only measures correct predictions. Since 90% of the data is classified as nonhazardous, it makes a lot of sense how the unscaled model made its predictions. It might be better to look at metrics like f1 score, recall, and precision. 


Precision - of all predicted hazardous NEOs, how many really are hazardous?

Recall - of all truly hazardous NEOs, how many did we catch?

F1 score - Harmonic mean with precision and recall

Let's start with unscaled data:

In [41]:
from sklearn import metrics

precision = metrics.precision_score
recall = metrics.recall_score
f1 = metrics.f1_score

y_hat_from_unscaled = model.predict(X_test)

print(
    "Precision of unscaled X_test: ",
    precision(y_true=y_test, y_pred=y_hat_from_unscaled),
)
print("Recall of unscaled X_test: ", recall(y_true=y_test, y_pred=y_hat_from_unscaled))

print("f1_score of unscaled X_test", f1(y_true=y_test, y_pred=y_hat_from_unscaled))

Precision of unscaled X_test:  0.313953488372093
Recall of unscaled X_test:  0.014975041597337771
f1_score of unscaled X_test 0.028586553732133403


Ahh. This makes more sense. The model 3

In [42]:
y_hat = model.predict(X_test)

y_hat_scaled = model.predict(X_test_scaled)

print("Precision of scaled X_test: ", precision(y_true=y_test, y_pred=y_hat_scaled))
print("Recall of scaled X_test: ", recall(y_true=y_test, y_pred=y_hat_scaled))
print("f1_score of scaled X_test", f1(y_true=y_test, y_pred=y_hat_scaled))

Precision of scaled X_test:  0.21204280842055745
Recall of scaled X_test:  1.0
f1_score of scaled X_test 0.34989326605860666


In [None]:
baseline_accuracy = max(np.mean(y), 1 - np.mean(y))

print(baseline_accuracy)

0.90268175613193
