In [1]:
import pandas as pd 

In [2]:

import numpy as np 


Passing parse_dates to pd.read_csv loads the "Close Approach Date" column as datetimes, which makes the .dt accessor available. We then use .dt.year to extract the year component.

The result is stored in a new column called "year", so that later cells can filter and plot by year.
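
As a minimal illustration of the accessor, the same pattern can be tried on a small hypothetical frame (the dates below are made up and are not taken from neo_data.csv):

```python
import pandas as pd

# Hypothetical toy data; the notebook itself reads these dates from neo_data.csv.
toy = pd.DataFrame({"Close Approach Date": ["1999-03-14", "2012-11-02", "2023-07-30"]})

# Convert the strings to datetime64 so that the .dt accessor becomes available.
toy["Close Approach Date"] = pd.to_datetime(toy["Close Approach Date"])

# .dt.year returns the calendar year of each timestamp as an integer.
toy["year"] = toy["Close Approach Date"].dt.year
print(toy["year"].tolist())
```

If the column were left as plain strings, .dt would raise an AttributeError, which is why the dates are parsed first.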

In [3]:
date_cols = ['Close Approach Date', 'Close Approach Date (Full)']
df = pd.read_csv("neo_data.csv", parse_dates=date_cols)
df['year']=df['Close Approach Date'].dt.year

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 904 entries, 0 to 903
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   ID                            904 non-null    int64         
 1   Neo Reference ID              904 non-null    int64         
 2   Name                          904 non-null    object        
 3   Limited Name                  904 non-null    object        
 4   Designation                   904 non-null    int64         
 5   NASA JPL URL                  904 non-null    object        
 6   Absolute Magnitude (H)        904 non-null    float64       
 7   Min Diameter (km)             904 non-null    float64       
 8   Max Diameter (km)             904 non-null    float64       
 9   Min Diameter (m)              904 non-null    float64       
 10  Max Diameter (m)              904 non-null    float64       
 11  Min Diameter (miles)          90

In [5]:
df.nunique()

ID                               20
Neo Reference ID                 20
Name                             20
Limited Name                     20
Designation                      20
NASA JPL URL                     20
Absolute Magnitude (H)           20
Min Diameter (km)                20
Max Diameter (km)                20
Min Diameter (m)                 20
Max Diameter (m)                 20
Min Diameter (miles)             20
Max Diameter (miles)             20
Min Diameter (feet)              20
Max Diameter (feet)              20
Is Potentially Hazardous          2
Close Approach Date             903
Close Approach Date (Full)      904
Epoch Date Close Approach       904
Relative Velocity (km/s)        904
Relative Velocity (km/h)        904
Relative Velocity (miles/h)     904
Miss Distance (astronomical)    904
Miss Distance (lunar)           904
Miss Distance (km)              904
Miss Distance (miles)           904
Orbiting Body                     5
year                        

In [6]:
from plotly.express import bar
bar(data_frame=df, x='Limited Name', color='Orbiting Body')

We can see that, for most of these Near Earth Objects, the orbiting body recorded for their close approaches is mostly or entirely Earth.

In [7]:
from plotly.express import histogram

histogram(data_frame=df[df['year'] <2024].sort_values(by='Limited Name'), x='year', color='Limited Name', nbins=124).show()


The first histogram shows how close approaches are distributed over the years, colored by object name. The data is filtered to close approaches before 2024. Because each row of the dataset is one close approach, each bar counts the close approaches recorded in a given year.

In [8]:
histogram(data_frame=df[df['year'] <2024].sort_values(by='Limited Name'), x='year', color='Is Potentially Hazardous', nbins=124).show()

The second histogram looks at the hazardous nature of NEOs. The data is again filtered to close approaches before 2024, but this time the bars are colored by the two values of "Is Potentially Hazardous": potentially hazardous and non-hazardous. The histogram shows how close approaches in each category are distributed over the years, allowing us to observe patterns in hazardous NEOs.
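
The counts behind a histogram like this can also be tabulated directly. The sketch below uses a small hypothetical frame in place of df and assumes the hazard flag is stored as booleans, which the notebook output does not confirm:

```python
import pandas as pd

# Hypothetical rows standing in for df; the column names follow the notebook.
toy = pd.DataFrame({
    "year": [2001, 2001, 2005, 2005, 2005, 2030],
    "Is Potentially Hazardous": [True, False, False, False, True, True],
})

# Same filter as the histograms: keep close approaches before 2024.
past = toy[toy["year"] < 2024]

# Count rows per (year, hazard flag); unstack moves the two flags into columns.
counts = (
    past.groupby(["year", "Is Potentially Hazardous"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

Each row of the resulting table corresponds to one year, and the two columns correspond to the two colors in the chart.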

In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt


In the next cell, the dataset (df) is prepared for a machine learning task. The features (X) are four selected columns: absolute magnitude, minimum diameter in km, relative velocity in km/s, and year. The target variable (y) is the column "Is Potentially Hazardous."

In [10]:


X = df[['Absolute Magnitude (H)', 'Min Diameter (km)', 'Relative Velocity (km/s)', 'year']]
y = df['Is Potentially Hazardous']


This code demonstrates the process of building and evaluating a machine learning model using a RandomForestClassifier. The steps are as follows:

1. **Split Data:** The dataset is divided into training and test sets, which are essential for training and evaluating the model's performance.

2. **Train Model:** A Random Forest classifier is created and trained using the training data.

3. **Predict and Evaluate:** The trained model is used to make predictions on the test data, and its performance is evaluated.

4. **Accuracy Calculation:** The code calculates the accuracy of the model's predictions as the evaluation metric.

In [11]:

# Step 1: Split your data into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Train your RandomForestClassifier on the training data
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Step 3: Use the test data to predict and evaluate the model
y_pred = clf.predict(X_test)



In [12]:
# Step 4: Calculate accuracy 
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9171270718232044
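
With its default arguments, accuracy_score reports the fraction of test samples whose predicted label matches the true label. A minimal sketch of that arithmetic, using hypothetical labels rather than the notebook's test set:

```python
import numpy as np

# Hypothetical labels and predictions, not taken from the notebook's test set.
y_true = np.array([True, False, False, True, False])
y_hat = np.array([True, False, True, True, False])

# Accuracy is the fraction of positions where prediction and label agree.
manual_accuracy = np.mean(y_true == y_hat)
print(manual_accuracy)
```

Four of the five hypothetical predictions match, so the value is 0.8.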


In [13]:

# Step 5: Extract the feature importances and the matching feature names
feature_importances = clf.feature_importances_
feature_names = X_train.columns

The bar chart produced by the next cell displays the importance of each of the four features, allowing us to identify which ones are most influential in the model's decision-making.

In [15]:

import plotly.express as px

# Create a DataFrame for the top features and their importances
top_feature_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Create an interactive bar plot using Plotly
fig = px.bar(top_feature_df, x='Feature', y='Importance', title="Top Feature Importances")
fig.update_xaxes(categoryorder='total ascending')  # Sort x-axis by importance
fig.update_xaxes(title_text='Top Features')
fig.update_yaxes(title_text='Importance')

# Show the plot
fig.show()
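
The same ranking can be read as a table instead of a chart. The sketch below is an assumption-laden stand-in: it trains on synthetic data from make_classification with hypothetical feature names (f0 to f3), not on the NEO dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four features; the names are hypothetical.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
names = ["f0", "f1", "f2", "f3"]

# Same classifier settings as the notebook, fitted on the synthetic data.
model = RandomForestClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its feature name and rank from most to least important.
ranked = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranked)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total.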

