## Author: Kenny Nguyen
## HCDE 530
## Mini Project 1b: Submission

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import requests

# Run "pip install beautifulsoup4" in the console before running this next line.
from bs4 import BeautifulSoup

# Running this (by clicking run or pressing Shift+Enter) will list all files under the input directory.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# 1. Overview

The research area I am focusing on is **meteorite diversity and occurrence over time**. In academic literature, there are various case studies of meteorite origin which home in on well-known examples throughout history. However, there is little visualization work with the vast amount of meteorite impact data collected over the past few decades.

Although meteoric events are often portrayed in movies and novels, I had never considered where meteorites travel from or how often they have actually been observed throughout history.

The questions I would like to answer through my analysis are:
1. What is the average mass of meteors that fall to Earth yearly? Has it changed drastically over time?
2. Has diversity of meteorite origin changed over time, and if so, how?

# 2. Data Profile

The data source I am exploring is the Meteoritical Bulletin Database (MBD). The MBD records meteorites that have landed on Earth and has been maintained by the Meteoritical Society since 2005. The Meteoritical Society is an international organization established in 1933 that is dedicated to research and education in planetary science. It emphasizes the study of meteorites and other extraterrestrial materials that further our understanding of the origin of the solar system.

The database can be found here: https://www.lpi.usra.edu/meteor/metbull.php

# 3. Analysis
The analysis phase consists of three steps:
1. Convert HTML to CSVs
2. Clean the DataFrame
3. Produce Visualizations

### Convert HTML to CSVs

The database I am using displays the data as multiple HTML tables on its web interface. Before I could work with the data, I had to scrape these tables and convert them into .csv files, which are easier to manipulate.
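As a rough illustration of the HTML-to-CSV idea, here is a minimal sketch that uses only the Python standard library instead of the BeautifulSoup approach taken in this notebook. The names `TableParser` and `table_to_csv` and the sample table are made up for illustration and are not real database output.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the rows of an HTML table as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Made-up sample table (hypothetical rows).
sample_html = """
<table id="maintable">
  <tr><th>Name</th><th>Year</th><th>Mass</th></tr>
  <tr><td>Example 001</td><td>2015</td><td>12.5 g</td></tr>
  <tr><td>Example 002</td><td>2016</td><td>1.2 kg</td></tr>
</table>
"""
csv_text = table_to_csv(sample_html)
print(csv_text)
```

The first row of the table becomes the CSV header, which mirrors how the table headers are used as column names in this notebook.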

In [None]:
# Store URLs that make up the entire DataFrame I am interested in. Collectively, the meteors
# in this data will be from 2010-2021.
url_0 = "https://www.lpi.usra.edu/meteor/metbull.php?sea=202_&sfor=years&ants=&nwas=&falls=&valids=&stype=contains&lrec=5000&map=ge&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=1"
url_1 = "https://www.lpi.usra.edu/meteor/metbull.php?sea=201_&sfor=years&ants=&nwas=&falls=&valids=&stype=contains&lrec=5000&map=ge&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=1"
url_2 = "https://www.lpi.usra.edu/meteor/metbull.php?sea=201_&sfor=years&ants=&nwas=&falls=&valids=&stype=contains&lrec=5000&map=ge&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=2"
url_3 = "https://www.lpi.usra.edu/meteor/metbull.php?sea=201_&sfor=years&ants=&nwas=&falls=&valids=&stype=contains&lrec=5000&map=ge&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=3"
url_4 = "https://www.lpi.usra.edu/meteor/metbull.php?sea=201_&sfor=years&ants=&nwas=&falls=&valids=&stype=contains&lrec=5000&map=ge&browse=&country=All&srt=name&categ=All&mblist=All&rect=&phot=&strewn=&snew=0&pnt=Normal%20table&dr=&page=4"

# Create a function that will scrape the HTML tables from each URL into a .csv file.
# It will take two parameters: The URL for parsing and the name of the output file.
def df_maker(path, name):
    # Initialize empty lists that will store the headers and data from
    # each row in the HTML table.
    data = []
    list_header = []

    # Send a request to receive the plain HTML of the entire webpage,
    # then add the table headers to the list above.
    req = requests.get(path)
    soup = BeautifulSoup(req.content,'html.parser')
    header = soup.find(id="maintable").find("tr")

    for items in header:
        try:
            list_header.append(items.get_text())
        except:
            continue

    # Parse all the table rows from the plain HTML and append them to the list
    # of data points.
    HTML_data = soup.find(id="maintable").find_all("tr")[1:]
    for element in HTML_data:
        sub_data = []
        for sub_element in element:
            try:
                sub_data.append(sub_element.get_text())
            except:
                continue
        data.append(sub_data)

    # Turn the list of data points into a Pandas DataFrame.
    # Specify the column names as the headers of the original table.
    dataFrame = pd.DataFrame(data, columns = list_header)

    # Output the Pandas DataFrame as a .csv file.
    csv_name = name + ".csv"
    dataFrame.to_csv(csv_name)

    # Print the name of the file and the number of rows it contains for confirmation.
    print(name + " contains " + str(len(dataFrame)) + " rows.")

# Run the function for each URL to output a .csv file for each.
df_maker(url_0, "Meteors_202")
df_maker(url_1, "Meteors_201_1")
df_maker(url_2, "Meteors_201_2")
df_maker(url_3, "Meteors_201_3")
df_maker(url_4, "Meteors_201_4")

# NOTE: The steps above only have to be done once. After that, the .csv files that
# were output can be added into this notebook. This means we only need to webscrape once.

# Read each .csv file in as a DataFrame.
meteor_1 = pd.read_csv("../input/meteors/Meteors_202.csv")
meteor_2 = pd.read_csv("../input/meteors/Meteors_201_1.csv")
meteor_3 = pd.read_csv("../input/meteors/Meteors_201_2.csv")
meteor_4 = pd.read_csv("../input/meteors/Meteors_201_3.csv")
meteor_5 = pd.read_csv("../input/meteors/Meteors_201_4.csv")

# Concatenate all the DataFrames into one.
frames = [meteor_1, meteor_2, meteor_3, meteor_4, meteor_5]
all_meteors = pd.concat(frames)
print("Together, there are " + str(len(all_meteors)) + " rows in the combined DataFrame!")
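The note above about scraping only once could be enforced in code. This is a minimal sketch, assuming a hypothetical helper `scrape_once` that skips the scrape when the output file already exists. `fake_scrape` is a stand-in that writes a placeholder file, so the example runs without touching the network.

```python
import os
import tempfile


def scrape_once(csv_path, scrape):
    """Call scrape(csv_path) only if csv_path does not exist yet.

    Returns True if the scrape ran, False if the existing file was reused.
    """
    if os.path.exists(csv_path):
        return False
    scrape(csv_path)
    return True


def fake_scrape(csv_path):
    # Stand-in for the real scraper: writes a tiny placeholder file.
    with open(csv_path, "w") as f:
        f.write("Name,Year\nExample 001,2015\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Meteors_202.csv")
    first = scrape_once(path, fake_scrape)   # file is missing, so it scrapes
    second = scrape_once(path, fake_scrape)  # file is present, so it is skipped
    print(first, second)
```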

### Clean the DataFrame

The plain HTML tables from each URL have now been converted into one DataFrame. However, it still needs to be cleaned. We will need to filter out meteorites with statuses other than "Official". We will also need to remove any extraneous columns.

In [None]:
# Keep only rows where the meteorite's status is "Official". This removes any
# meteorites that were later discredited or reported falsely.
all_meteors = all_meteors[all_meteors["Status"]=="Official"]

# Create a list of column names that will not be used, then drop them
# from the DataFrame.
unused_columns = ["Unnamed: 0", "Name", "Abbrev",
                  "Status", "Fall", "MetBull",
                  "Notes", "Antarctic"]

all_meteors = all_meteors.drop(columns = unused_columns)

# Drop the last column using indexing since it will also be unused.
all_meteors = all_meteors.drop(all_meteors.columns[-1], axis=1)

# Convert the mass measurements for each meteorite into grams if it is not already.
# Also, remove the units so we can do calculations on the column.
all_meteors = all_meteors.reset_index(drop=True)
for i, r in enumerate(all_meteors["Mass"]):  
    if 'kg' in r:
        r1 = r.replace(' kg', '')
        r1 = float(r1) * 1000
    elif 't' in r:
        r1 = r.replace(' t', '')
        # "t" is taken to be a metric tonne (1,000,000 g), not a US short ton (907,185 g).
        r1 = float(r1) * 1000000
    elif 'mg' in r:
        r1 = r.replace(' mg', '')
        r1 = float(r1) * 0.001
    else:
        r1 = r.replace(' g', '')
    all_meteors.at[i, "Mass"] = float(r1)

# Ensure that all values in the Year column are stored as numbers and not strings.
all_meteors = all_meteors.drop(labels=13059, axis=0) # Drop a specific row that was input incorrectly
all_meteors = all_meteors.reset_index(drop=True) # Reset the indexes of the DataFrame after dropping the row
all_meteors["Year"] = pd.to_numeric(all_meteors["Year"])
print(all_meteors.head(5))
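The unit conversion in the cell above can also be written as a small standalone function, which makes it easier to test on its own. This is a sketch under stated assumptions: the helper `mass_to_grams` is hypothetical, masses are assumed to look like `"<number> <unit>"`, and `t` is assumed to mean a metric tonne (1,000,000 g). Adjust the table if a different ton is intended.

```python
import re

# Grams per unit. "t" is assumed to be a metric tonne.
GRAMS_PER_UNIT = {"mg": 0.001, "g": 1.0, "kg": 1000.0, "t": 1_000_000.0}

# A number followed by one of the known units, e.g. "1.2 kg".
MASS_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*(mg|kg|g|t)\s*$")


def mass_to_grams(text):
    """Convert a string such as '1.2 kg' to grams; return None if it cannot be parsed."""
    if not isinstance(text, str):
        return None  # e.g. a missing value read from a blank cell
    match = MASS_PATTERN.match(text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * GRAMS_PER_UNIT[unit]


for raw in ["12.5 g", "1.2 kg", "250 mg", "2 t", "unknown"]:
    print(raw, "->", mass_to_grams(raw))
```

Matching the whole unit, rather than checking whether a letter appears anywhere in the string, avoids depending on the order of the unit checks. Returning `None` for unparseable values leaves it to the caller to drop or inspect those rows.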

### Produce Visualizations
At this point, the data has been cleaned and prepped for visualization. I'll be using Plotly to produce two visualizations.

In [None]:
# Visualization 1: Meteorite Mass Statistics per Year

import plotly.express as px
import plotly.graph_objects as go

fig = go.Figure()

# Create a function that isolates and returns the "Mass" column from the DataFrame
# for a given year.
def mass_slice(year):
    return all_meteors[all_meteors["Year"] == year]["Mass"]

# Use a for loop to add the meteorite mass statistics from each year onto the figure.
# Each iteration through the loop will add a boxplot.
for year in range(2010,2022):
    mass = mass_slice(year)
    fig.add_trace(go.Box(x=mass, name=year))

# Style the boxplot by reducing the x-axis range and adding in descriptive titles.
fig.update_layout(xaxis_range=[0,1500],
                  title="Meteorite Mass Statistics per Year",
                  xaxis_title="Mass (in Grams)",
                  yaxis_title="Year",
                  showlegend=False)
fig.show()

In [None]:
# Visualization 2: Meteorite Type Diversity by Year

# Create a function that returns the number of unique meteorite types for a given year.
def unique_meteor_types(year):
    types = all_meteors[all_meteors["Year"] == year]["Type"]
    return len(types.unique())

# Initialize an empty list that will store the number of unique meteor types per year.
meteor_types = []

# Use a for loop to append the number of unique meteor types for each year onto the list.
for year in range(2010,2022):
    meteor_types.append(unique_meteor_types(year))

# Make the scatter plot and style it with descriptive titles.
fig_2 = px.scatter(x=range(2010,2022), y=meteor_types, trendline="ols")

fig_2.update_layout(title="Meteorite Type Diversity by Year",
                  xaxis_title="Year",
                  yaxis_title="Meteorite Types")

fig_2.show()

# 4. Conclusions/Directions for Future Work
**Conclusions**

At the start of this notebook, I highlighted two key interests from the data:
1. What is the average mass of meteors that fall to Earth yearly? Has it changed drastically over time?
2. Has diversity of meteorite origin changed over time, and if so, how?

The first topic is addressed by the visualization for "Meteorite Mass Statistics per Year". In it, we can see that the typical mass per year, read from the median line of each boxplot, fluctuates but seems to be increasing over time. The variability in meteorite masses per year has also increased, as indicated by the interquartile ranges and outliers of the boxplots.
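The boxplot reading can be cross-checked numerically. Below is a minimal sketch that computes the mean, median, and interquartile range per year with the standard library. The masses are hypothetical values chosen for illustration, not figures from the database, and `summarize` is a made-up helper name.

```python
from statistics import mean, median, quantiles

# Hypothetical masses in grams, grouped by year (not real database values).
masses_by_year = {
    2010: [5.0, 12.0, 20.0, 35.0, 400.0],
    2011: [8.0, 15.0, 30.0, 60.0, 900.0],
}


def summarize(masses):
    """Return the mean, median, and interquartile range of a list of masses."""
    q1, _, q3 = quantiles(masses, n=4)  # the three quartile cut points
    return {"mean": mean(masses), "median": median(masses), "iqr": q3 - q1}


summary = {year: summarize(m) for year, m in masses_by_year.items()}
for year, stats in summary.items():
    print(year, stats)
```

With a few very heavy meteorites in the data, the mean sits well above the median, so the two can tell different stories about whether the "average" mass is rising. Quartile conventions also differ between libraries, so the interquartile range here may not match the plotted boxes exactly.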

The second is addressed by the visualization for "Meteorite Type Diversity by Year". There is an abundance of categories that meteorites can be classified into, which typically depend on mineral composition and origin. The trendline for this scatter plot shows that the number of distinct meteorite types recorded per year is decreasing.
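The trendline is an ordinary least squares fit, and its slope gives the estimated change in the number of types per year. The sketch below computes that slope by hand. The yearly counts are hypothetical placeholders rather than the values plotted in this notebook, and `ols_slope_intercept` is a made-up helper name.

```python
def ols_slope_intercept(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


years = list(range(2010, 2022))
# Hypothetical counts of unique meteorite types per year.
type_counts = [120, 118, 121, 115, 110, 112, 108, 105, 107, 100, 98, 95]

slope, intercept = ols_slope_intercept(years, type_counts)
print(f"slope: {slope:.2f} types per year")
```

A negative slope corresponds to a downward trendline. With only twelve points, the sign of the slope alone says little about whether the decline is statistically meaningful.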

**Directions for Future Work**

Firstly, future work should build on a wider timespan of meteorite events. The database I referenced has over 70,000 entries that span centuries of recorded meteorites. However, webscraping the database was a bit troublesome, and because it lacks a documented API, I chose to reduce the timespan for my analysis to 2010-2021, which covers around 15,000 entries.
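One way to widen the timespan without hand-writing each URL is to generate the query strings. This is a sketch under stated assumptions: the parameter names and order are copied from the URLs used earlier in this notebook, `build_url` is a hypothetical helper, and it is unverified that other year patterns such as `"200_"` behave the same way or how many pages each one returns.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.lpi.usra.edu/meteor/metbull.php"


def build_url(year_pattern, page, rows_per_page=5000):
    """Build a search URL for a year pattern such as '201_' and a page number."""
    params = {
        "sea": year_pattern, "sfor": "years", "ants": "", "nwas": "",
        "falls": "", "valids": "", "stype": "contains", "lrec": rows_per_page,
        "map": "ge", "browse": "", "country": "All", "srt": "name",
        "categ": "All", "mblist": "All", "rect": "", "phot": "", "strewn": "",
        "snew": 0, "pnt": "Normal table", "dr": "", "page": page,
    }
    # quote_via=quote encodes the space as %20, matching the original URLs.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# Hypothetical example: two pages for the 2000s. The real page count is unknown.
for url in [build_url("200_", page) for page in (1, 2)]:
    print(url)
```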

I believe that future work for this project could involve a literature review that explains why meteorite diversity seems to be dropping each year. There are numerous reasons why this could be occurring. For example, it could be that the system of classification has changed radically over time, that meteorite sampling and analysis have declined over the years, or maybe even that the parent bodies that produce unique meteorite types have shifted somehow. Each of these explanations would require an in-depth look at the current literature before drawing a conclusion.