# Project 2：NASA Data Acquisation, Visualization, and Analysis

In [4]:
# The code makes sure that once there is change in the 'src/' folder, the 
# change will be automatically reloaded in the notebook.
%reload_ext autoreload
%autoreload 2
%aimport src

### Task 1: Understanding the NASA API and Data Collection

- Register for a NASA API key and understand the different types of data that the API provides.
- Run the Python script below to fetch data about **Near Earth Objects (NEOs)** from the NASA API for a years data.
- Extract and understand the different pieces of data provided for each NEO.

In [5]:
import requests
import time
from datetime import datetime, timedelta
from getpass import getpass

# Set your NASA API KEY, this step asks you to enter your API KEY.
# (The input box may be float in the top on your editor.)
api_key = getpass()

In [6]:
# Set the start and end dates for the data you want to fetch
start_date = datetime.strptime('2022-01-01', '%Y-%m-%d')
end_date = start_date + timedelta(days=365)  # 1 year later

# Initialize a list to store the data
data = []

# Fetch data from the NASA API 7 days at a time
# The introduction of the API is on https://api.nasa.gov, under "Browse APIs" -> "Asteroids NeoWs"
# You can look into the example query in the link below to see what the data look like:
# https://api.nasa.gov/neo/rest/v1/feed?start_date=2015-09-07&end_date=2015-09-08&api_key=DEMO_KEY
current_date = start_date
while current_date < end_date:
    next_date = min(current_date + timedelta(days=7), end_date)
    response = requests.get(f'https://api.nasa.gov/neo/rest/v1/feed?start_date={current_date.strftime("%Y-%m-%d")}&end_date={next_date.strftime("%Y-%m-%d")}&api_key={api_key}')
    data.append(response.json())
    current_date = next_date
    time.sleep(1)  # To avoid hitting the rate limit

# Now 'data' contains the NEO data for the 1-year period


KeyboardInterrupt: 

In [None]:
# Check the date coverage of your data.
dates_contained_in_data = []
for d in data:
    dates_contained_in_data += list(d['near_earth_objects'].keys())

print(sorted(dates_contained_in_data))

In [30]:
from src.utils import get_a_random_chunk_property

In [31]:
get_a_random_chunk_property(data)

date: 2022-03-10
NEO name: (2020 FM)
id: 54016204


For the remaining tasks, you have to organize the data as pd.DataFrame so as to suit the specific need in each task. This part may require a considerably amount of efforts, which is normal in data science and analytics works.

### Task 2: Data Analysis

- Calculate the average size of the NEOs for each day.
- Determine the proportion of NEOs that are potentially hazardous.
- Find the NEO with the closest approach distance for each day.
- Use statistical methods to analyze the data. For example, calculate the mean, median, mode, and standard deviation of the NEO sizes. Determine if the size of a NEO is correlated with whether it is potentially hazardous.

In [32]:
import pandas as pd

# Assuming you have already fetched the data and stored it in the 'data' list

# Initialize lists to store the extracted information
date_list = []
neo_name_list = []
neo_size_list = []
is_hazardous_list = []
close_approach_distance_list = []

# Iterate over the fetched data and extract the required information
for week_data in data:
    for date, neo_data in week_data['near_earth_objects'].items():
        for neo in neo_data:
            date_list.append(date)
            neo_name_list.append(neo['name'])
            neo_size_list.append(neo['estimated_diameter']['kilometers']['estimated_diameter_max'])
            is_hazardous_list.append(neo['is_potentially_hazardous_asteroid'])

            close_approach_data = neo['close_approach_data']
            for approach in close_approach_data:
                miss_distance = approach['miss_distance']
                close_approach_distance_list.append(miss_distance['kilometers'])

# Create a DataFrame from the extracted information
df = pd.DataFrame({
    'Date': date_list,
    'NEO Name': neo_name_list,
    'NEO Size (km)': neo_size_list,
    'Is Potentially Hazardous': is_hazardous_list,
    'Close Approach Distance (km)': close_approach_distance_list
})

# Convert 'Close Approach Distance (km)' column to float
df['Close Approach Distance (km)'] = df['Close Approach Distance (km)'].astype(float)

# Format the 'Close Approach Distance (km)' column
df['Close Approach Distance (km)'] = df['Close Approach Distance (km)'].apply(lambda x: f'{x:,.2f}')

# Display the resulting DataFrame
print(df)


            Date             NEO Name  NEO Size (km)  \
0     2022-01-07    216523 (2001 HY7)       0.430566   
1     2022-01-07   494697 (2004 SW55)       0.416908   
2     2022-01-07  496860 (1999 XL136)       0.688716   
3     2022-01-07           (2006 AL4)       0.061665   
4     2022-01-07            (2008 CO)       0.179490   
...          ...                  ...            ...   
7946  2022-12-31            (2023 AW)       0.042271   
7947  2022-12-31           (2023 AC2)       0.077990   
7948  2022-12-31            (2023 BE)       0.078350   
7949  2022-12-31           (2023 BJ2)       0.092478   
7950  2022-12-31           (2023 BH4)       0.154896   

      Is Potentially Hazardous Close Approach Distance (km)  
0                         True                58,057,610.95  
1                         True                20,026,765.13  
2                         True                13,396,081.45  
3                        False                14,239,203.68  
4                

('2023-01-01', [{'links': {'self': 'http://api.nasa.gov/neo/rest/v1/neo/2154347?api_key=0HlrWeFfZOCR91KZrHqv1vaUfQbd3n25gnyQdBdb'}, 'id': '2154347', 'neo_reference_id': '2154347', 'name': '154347 (2002 XK4)', 'nasa_jpl_url': 'http://ssd.jpl.nasa.gov/sbdb.cgi?sstr=2154347', 'absolute_magnitude_h': 16.07, 'estimated_diameter': {'kilometers': {'estimated_diameter_min': 1.6238839022, 'estimated_diameter_max': 3.6311147929}, 'meters': {'estimated_diameter_min': 1623.883902199, 'estimated_diameter_max': 3631.1147928846}, 'miles': {'estimated_diameter_min': 1.0090343642, 'estimated_diameter_max': 2.25626943}, 'feet': {'estimated_diameter_min': 5327.7032616906, 'estimated_diameter_max': 11913.1066570875}}, 'is_potentially_hazardous_asteroid': False, 'close_approach_data': [{'close_approach_date': '2023-01-01', 'close_approach_date_full': '2023-Jan-01 18:44', 'epoch_date_close_approach': 1672598640000, 'relative_velocity': {'kilometers_per_second': '27.3921991495', 'kilometers_per_hour': '98611

### Task 3: Data Visualization Part A

- Create a line plot of the number of NEOs per week.
- Create a histogram of the distribution of NEO sizes.
- Create a bar plot of the average NEO size per week.
- Use a library like Seaborn to create more complex visualizations, such as a box plot of the NEO sizes or a heat map of the number of NEOs per week. **Be creative**!

In [77]:
# Write your code

### Task 4: Data Visualization Part B

- Create a pie chart of the proportion of hazardous vs non-hazardous NEOs.
- Create a scatter plot of the correlation between NEO size and close approach distance.
- Customize the appearance of your plots (e.g., colors, labels, titles).
- Create interactive visualizations using a library like Plotly. For example, create an interactive scatter plot where you can hover over each point to see more information about the NEO. **Be creative!**

The provided code calculates the proportions of two categories in a column called "Is Potentially Hazardous" in a dataframe named `df`. Here's an explanation of what the code does in regular language:

The `value_counts()` function is applied to the "Is Potentially Hazardous" column in the dataframe `df`. This function counts the occurrences of each unique value in the column and returns a series with the count for each value. By default, the values are sorted in descending order.

To determine the proportions of each value, the `normalize=True` parameter is set in the `value_counts()` function. This parameter normalizes the counts by dividing each count by the total number of values in the column, resulting in proportions instead of raw counts.

The resulting proportions are stored in a variable called `proportions`. This variable will contain a series with the proportions of the different categories in the "Is Potentially Hazardous" column. The index of the series will represent the unique values in the column, and the corresponding values will represent the proportions.

In summary, the code calculates the proportions of different categories in the "Is Potentially Hazardous" column of the dataframe. It provides insight into the relative frequencies or distribution of the categories, allowing for a better understanding of the data.

In [39]:
proportions = df['Is Potentially Hazardous'].value_counts(normalize=True)

The provided code block visualizes the proportions of different categories in the "Is Potentially Hazardous" column of a dataframe using a pie chart. Here's an explanation of what the code does in regular language:

First, the code calculates the proportions of the categories in the "Is Potentially Hazardous" column using the `value_counts()` function with the `normalize=True` parameter. The resulting proportions are stored in a variable called `data`.

Next, the code defines the labels for the pie chart as 'Non-Hazardous' and 'Hazardous'. These labels represent the different categories in the "Is Potentially Hazardous" column.

The code also specifies colors for the pie chart slices, using the hex color codes '#008000' for non-hazardous and '#FF0000' for hazardous categories.

To visually emphasize the hazardous category, the code sets the `explode` parameter to `[0, 0.2]`, which causes the second slice (hazardous) to be separated or "exploded" from the rest of the pie.

Using the Plotly library, the code creates a new figure (`fig`) with a pie chart. It specifies the labels, values (proportions), and other settings for the pie chart. The `hoverinfo` and `textinfo` parameters control the information displayed on hover and in the pie chart slices, respectively. The `marker` parameter sets the colors and line properties for the pie chart slices.

The code updates the layout of the figure, adding a title and modifying the legend settings.

Finally, the code displays the resulting pie chart using the `show()` function on the `fig` object.

In summary, the code calculates the proportions of different categories in the "Is Potentially Hazardous" column and visualizes them as a pie chart using Plotly. The chart represents the distribution of hazardous and non-hazardous categories, providing a visual representation of the proportions of each category.

In [41]:
import plotly.graph_objects as go

data = df['Is Potentially Hazardous'].value_counts(normalize=True)

labels = ['Non-Hazardous', 'Hazardous']
colors = ['#008000', '#FF0000']
explode = [0, 0.2]  # Explode the second slice (Hazardous)

fig = go.Figure(data=[go.Pie(labels=labels, values=data, hole=0.3,
                             hoverinfo='label+percent', textinfo='percent',
                             marker=dict(colors=colors, line=dict(color='#000000', width=2)),
                             pull=explode)])

fig.update_layout(title='Proportions of Hazardous & Non-Hazardous Near-Earth Objects.',
                  legend=dict(itemsizing='constant'))

fig.show()


The provided code block creates a scatter plot visualization using Plotly. It represents data related to Near-Earth Objects (NEOs), their sizes, proximity to Earth, and potential hazardousness. Here's an explanation of what the code does in regular language:

First, the code performs some data preprocessing on the dataframe `df`. It converts specific columns, 'Close Approach Distance (km)' and 'NEO Size (km)', to numeric values by removing commas and changing the data type to float. It also assigns relevant columns from `df` to separate variables for further use.

Next, the code converts the boolean values in the 'Is Potentially Hazardous' column to numerical values. It assigns the resulting series to the variable `color_values`, where `0` represents False (non-hazardous) and `1` represents True (hazardous).

Then, the code creates three scatter plots using the `go.Scatter` function from Plotly. The first scatter plot, `scatter_hazardous`, represents the hazardous points, indicated by red markers. The second scatter plot, `scatter_non_hazardous`, represents the non-hazardous points, indicated by green markers. The third scatter plot, `scatter_both`, combines both hazardous and non-hazardous points, with colors determined by the `color_values` series.

Each scatter plot is defined with specific x and y coordinates, marker size, opacity, color, and text labels. The `hovertemplate` parameter specifies the content displayed when hovering over a data point, including the NEO name, size, proximity, and hazardousness.

The code also defines buttons for switching between overlays in the plot. The buttons allow users to view only hazardous points, non-hazardous points, or both. Each button is associated with a specific update method and arguments that control the visibility of the scatter plots.

The layout of the plot is defined with a menu of buttons, titles for the x and y axes, and a specified height. The figure is created using `go.Figure`, with the scatter plots and layout provided as arguments.

Further updates are made to the visibility of the scatter plots to initially show only the hazardous scatter plot.

Finally, the code displays the resulting plot using the `show()` method on the `fig` object.

In summary, the code creates a scatter plot visualization using Plotly to represent Near-Earth Objects' sizes, proximity to Earth, and hazardousness. It provides interactive functionality through buttons to switch between different overlays and explore the data in more detail.

In [63]:
import pandas as pd
import plotly.graph_objects as go

# Sample data
df['Close Approach Distance (km)'] = df['Close Approach Distance (km)'].replace(',', '', regex=True).astype(float)
df['NEO Size (km)'] = df['NEO Size (km)'].replace(',', '', regex=True).astype(float)
neo_sizes = df['NEO Size (km)']
close_distances = df['Close Approach Distance (km)']
neo_names = df['NEO Name']
is_hazardous = df['Is Potentially Hazardous']

# Convert boolean values to numerical values (0 for False, 1 for True)
color_values = is_hazardous.astype(int)

# Create the scatter plot for hazardous points
scatter_hazardous = go.Scatter(
    y=close_distances[is_hazardous],
    x=neo_sizes[is_hazardous],
    mode='markers',
    marker=dict(
        size=10,
        opacity=0.7,
        color='red'  # Set the color to red for hazardous points
    ),
    text=neo_names[is_hazardous],
    hovertemplate=
    '<b>NEO Name:</b> %{text}<br>'
    '<b>NEO Size (km):</b> %{x}<br>'
    '<b>Proximity (km):</b> %{y}<br>'
    '<b>Is Potentially Hazardous:</b> True<br>'
    '<extra></extra>'
)

# Create the scatter plot for non-hazardous points
scatter_non_hazardous = go.Scatter(
    y=close_distances[~is_hazardous],
    x=neo_sizes[~is_hazardous],
    mode='markers',
    marker=dict(
        size=10,
        opacity=0.7,
        color='green'  # Set the color to green for non-hazardous points
    ),
    text=neo_names[~is_hazardous],
    hovertemplate=
    '<b>NEO Name:</b> %{text}<br>'
    '<b>NEO Size (km):</b> %{x}<br>'
    '<b>Proximity (km):</b> %{y}<br>'
    '<b>Is Potentially Hazardous:</b> False<br>'
    '<extra></extra>'
)

# Create the scatter plot for both hazardous and non-hazardous points
scatter_both = go.Scatter(
    y=close_distances,
    x=neo_sizes,
    mode='markers',
    marker=dict(
        size=10,
        opacity=0.7,
        color=color_values,
        colorscale=[[0, 'green'], [1, 'red']],
        cmin=0,
        cmax=1
    ),
    text=neo_names,
    hovertemplate=
    '<b>NEO Name:</b> %{text}<br>'
    '<b>NEO Size (km):</b> %{x}<br>'
    '<b>Proximity (km):</b> %{y}<br>'
    '<b>Is Potentially Hazardous:</b> %{marker.color}<br>'
    '<extra></extra>'
)

# Create the buttons for switching between overlays
buttons = [
    dict(
        label='Hazardous',
        method='update',
        args=[{'visible': [True, False, False]}]  # Show only the hazardous scatter plot
    ),
    dict(
        label='Non-Hazardous',
        method='update',
        args=[{'visible': [False, True, False]}]  # Show only the non-hazardous scatter plot
    ),
    dict(
        label='Both',
        method='update',
        args=[{'visible': [False, False, True]}]  # Show both scatter plots
    )
]

# Create the layout with buttons
layout = dict(
    updatemenus=[
        dict(
            buttons=buttons,
            direction='down',
            pad={'r': 10, 't': 10},
            showactive=True,
            x=0.1,
            xanchor='left',
            y=1.2,
            yanchor='top'
        ),
    ],
    yaxis=dict(
        title='Proximity to earth (km)'
    ),
    xaxis=dict(
        title='Near-Earth Object Size in Km (Diameter)'
    ),
    height=800
)

# Create the figure with subplots
fig = go.Figure(data=[scatter_hazardous, scatter_non_hazardous, scatter_both], layout=layout)

# Update visibility for the initial plot
fig.update_layout(showlegend=False)
fig.update_traces(visible=False)

fig['data'][2]['visible'] = True  # Show the hazardous scatter plot initially

# Display the plot
fig.show()


### Task 5: Interpretation of Results

- Interpret the results of your data visualization in part A and B. 
- What insights can you gain about NEOs from your results? Summarizing your findings.
- Use your findings to make predictions or recommendations. For example, if you found that larger NEOs are more likely to be potentially hazardous, you could recommend that more resources be allocated to tracking large NEOs. **Be creative!**
- Identify, understand, and explain one scientific paper, on a clustering or classification method of relevance that could help Task 5. You don't have to implement it, you just need to justify in this notebook why the method in the scientific paper could contribute in analysis or interpretation of the results.

In [None]:
# Write your code

### Task 6: Presentation and Documentation

- Make this project as part of your presentation, **using beamer in LaTeX**. 
- This should include an overview of your work, the results of your data analysis, and the insights you gained from your results.