# Meteorite Landings - Exploratory Data Analysis (EDA) Project

## Introduction

### About the Dataset

* The Meteoritical Society collects data on meteorites that have fallen to Earth from outer space. This dataset includes the location, mass, composition, and fall year for over 45,000 meteorites that have struck our planet.
* This dataset was downloaded from NASA's Data Portal, and is based on The Meteoritical Society's Meteoritical Bulletin Database (this latter database provides additional information such as meteorite images, links to primary sources, etc.).


### Dataset variables:

* **name**: the name of the meteorite (typically a location, often modified with a number, year, composition, etc).
* **id**: a unique identifier for the meteorite.
* **nametype**: one of:
  
  -- valid: a typical meteorite.
  
  -- relict: a meteorite that has been highly degraded by weather on Earth.
* **recclass**: the class of the meteorite; one of a large number of classes based on physical, chemical, and other characteristics.
* **mass**: the mass of the meteorite, in grams.
* **fall**: whether the meteorite was seen falling, or was discovered after its impact; one of:

  -- Fell: the meteorite's fall was observed.
  
  -- Found: the meteorite's fall was not observed.
* **year**: the year the meteorite fell, or the year it was found (depending on the value of  variable 'fall').
* **reclat**: the latitude of the meteorite's landing.
* **reclong**: the longitude of the meteorite's landing.
* **GeoLocation**: a parentheses-enclosed, comma-separated tuple that combines reclat and reclong.

* Note that columns whose names start with "rec" (e.g., recclass, reclat, reclong) hold the values recommended by The Meteoritical Society. In some cases a meteorite was historically reclassified, or the data on where it was recovered changed slightly; this dataset gives the currently recommended values.
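
Because GeoLocation is stored as text, it has to be parsed before it can be used numerically. The following is a minimal sketch, assuming every non-missing value follows the "(lat, lon)" pattern described above; `parse_geolocation` is a hypothetical helper, not part of the dataset or of any library:

```python
import ast


def parse_geolocation(text):
    """Parse a string like "(50.775000, 6.083330)" into (lat, lon) floats.

    Returns None for missing or malformed values.
    """
    if not isinstance(text, str):
        return None  # e.g. NaN for rows without coordinates
    try:
        lat, lon = ast.literal_eval(text.strip())
        return float(lat), float(lon)
    except (ValueError, SyntaxError, TypeError):
        return None


print(parse_geolocation("(50.775000, 6.083330)"))
print(parse_geolocation(float("nan")))
```

In this dataset the parsed values would duplicate reclat and reclong, so the sketch is mainly useful as a consistency check.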

## Data Cleaning and Preparation

### Loading the Libraries

Let's load the necessary Python libraries that we will use to analyze, visualize, and explore this dataset.

In [165]:
!pip install reverse_geocoder
!pip install pycountry



In [166]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import reverse_geocoder as rg
import pycountry
import warnings
warnings.filterwarnings('ignore')

### Loading the Data

Loading the Meteorite Landings dataset into a DataFrame.

In [167]:
meteorites = pd.read_csv('/kaggle/input/meteorite-landings/meteorite-landings.csv')
meteorites.head(10)

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.775,6.08333,"(50.775000, 6.083330)"
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,"(56.183330, 10.233330)"
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.0,"(54.216670, -113.000000)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.9,"(16.883330, -99.900000)"
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95,"(-33.166670, -64.950000)"
5,Adhi Kot,379,Valid,EH4,4239.0,Fell,1919.0,32.1,71.8,"(32.100000, 71.800000)"
6,Adzhi-Bogdo (stone),390,Valid,LL3-6,910.0,Fell,1949.0,44.83333,95.16667,"(44.833330, 95.166670)"
7,Agen,392,Valid,H5,30000.0,Fell,1814.0,44.21667,0.61667,"(44.216670, 0.616670)"
8,Aguada,398,Valid,L6,1620.0,Fell,1930.0,-31.6,-65.23333,"(-31.600000, -65.233330)"
9,Aguila Blanca,417,Valid,L,1440.0,Fell,1920.0,-30.86667,-64.55,"(-30.866670, -64.550000)"


#### Data Shape

In [168]:
print(f'This dataset has {meteorites.shape[0]} rows and {meteorites.shape[1]} columns')

This dataset has 45716 rows and 10 columns


Let's check for null values. If there are any, how we handle them will depend on how frequent they are in each column.

In [169]:
meteorites.isna().sum() 

name              0
id                0
nametype          0
recclass          0
mass            131
fall              0
year            288
reclat         7315
reclong        7315
GeoLocation    7315
dtype: int64

We can see that the columns reclat, reclong and GeoLocation have many missing values (7,315 each, about 16% of the rows), while mass and year have only a few (131 and 288).
We will drop the rows with missing values in the columns where missing values make up 5% or less of the total number of rows.

In [170]:
cols_to_drop = meteorites.columns[meteorites.isna().sum() <= len(meteorites) * 0.05]
meteorites.dropna(subset = cols_to_drop, inplace = True)

In [171]:
meteorites.shape

(45311, 10)

This removed 405 rows, about 0.89% of the dataset, which should not materially affect our analysis.
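
To make the 5% rule concrete, here is a small sketch on a made-up frame (the values are not from the dataset). It selects the columns whose missing count is at or below the threshold and drops rows only where those columns are missing, so a column with many gaps keeps its NaN values:

```python
import numpy as np
import pandas as pd

# Toy data: 100 rows; 'mass' has 2 missing values, 'reclat' has 20.
toy = pd.DataFrame({
    "name": [f"m{i}" for i in range(100)],
    "mass": [np.nan] * 2 + [1.0] * 98,
    "reclat": [np.nan] * 20 + [0.5] * 80,
})

threshold = 0.05 * len(toy)  # at most 5 missing values allowed
na_counts = toy.isna().sum()
sparse_na_cols = na_counts[na_counts <= threshold].index.tolist()

# Drop rows only where one of the selected columns is missing.
cleaned = toy.dropna(subset=sparse_na_cols)

print(sparse_na_cols)
print(len(cleaned), int(cleaned["reclat"].isna().sum()))
```

Columns without any missing values also pass the threshold; including them in `subset` is harmless because they never trigger a drop.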

In [172]:
meteorites.dtypes

name            object
id               int64
nametype        object
recclass        object
mass           float64
fall            object
year           float64
reclat         float64
reclong        float64
GeoLocation     object
dtype: object

The year column is stored as float, but an integer type is more natural for years, so let's convert this column to int.

In [173]:
meteorites['year'] = meteorites['year'].astype(int)

In [174]:
meteorites['year'].dtypes

dtype('int64')
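
The cast above succeeds only because rows with a missing year were already dropped: a NumPy integer column cannot hold NaN. If those rows had to be kept, pandas' nullable `Int64` dtype is one option. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

years = pd.Series([1880.0, 1951.0, np.nan])

# Plain int cannot represent NaN, so this cast fails.
try:
    years.astype(int)
except (ValueError, TypeError) as err:
    print(f"astype(int) failed: {type(err).__name__}")

# The nullable extension dtype keeps the gap as <NA>.
nullable_years = years.astype("Int64")
print(nullable_years.dtype)
print(int(nullable_years.isna().sum()))
```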

In [175]:
meteorites.nunique()

name           45311
id             45311
nametype           2
recclass         459
mass           12542
fall               2
year             254
reclat         12603
reclong        14475
GeoLocation    16908
dtype: int64

Two columns (name and id) each have as many unique values as there are rows in the dataset (after removing missing values), so either one identifies a meteorite uniquely. We therefore keep name and drop the id column, which adds nothing to the analysis.

In [176]:
meteorites.drop('id', axis = 1, inplace = True)

In [177]:
meteorites.describe()

Unnamed: 0,mass,year,reclat,reclong
count,45311.0,45311.0,38116.0,38116.0
mean,13314.68,1991.947165,-39.594193,61.30832
std,576714.1,24.798784,46.177476,80.776778
min,0.0,601.0,-87.36667,-165.43333
25%,7.12,1987.0,-76.71667,0.0
50%,32.1,1998.0,-71.5,35.66667
75%,200.0,2003.0,0.0,157.16667
max,60000000.0,2101.0,81.16667,178.2


The maximum year is 2101, which does not make sense. A few entries in this dataset contain date information that was incorrectly parsed into the NASA database. According to the dataset description on Kaggle, any date before 860 CE or after 2016 is incorrect.

In [178]:
meteorites = meteorites[(meteorites['year'] <= 2016) & (meteorites['year'] >= 860)]

In [179]:
# The dataset description on Kaggle recommends dropping entries located at 0N/0E.
# Note: this mask removes only the rows where both reclat and reclong are 0.
meteorites = meteorites[(meteorites['reclat']!= 0) | (meteorites['reclong']!= 0)]
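
One detail of this filter is worth spelling out: a comparison such as `NaN != 0` evaluates to True, so rows with missing coordinates pass the mask, and only rows where both values are exactly 0 are removed. A small sketch on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "reclat": [50.775, 0.0, np.nan, 0.0],
    "reclong": [6.083, 0.0, np.nan, 35.667],
})

# Keep a row if at least one coordinate differs from 0.
keep = (toy["reclat"] != 0) | (toy["reclong"] != 0)
kept_names = toy.loc[keep, "name"].tolist()
print(kept_names)
```

Row "b" (0N/0E) is the only one removed; row "c" (missing coordinates) and row "d" (on the equator, but not at longitude 0) are kept.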

In [180]:
# these columns have only 2 unique values each, so we convert them to the "category" type
meteorites["fall"]= meteorites["fall"].astype("category")  
meteorites["nametype"]= meteorites["nametype"].astype("category")  

Something else does not make sense: the minimum mass of a meteorite is 0 grams. Let's drop these rows.

In [181]:
meteorites = meteorites[meteorites['mass'] != 0]

In [182]:
#checking for duplicated values
meteorites[meteorites.duplicated(subset = ["name","recclass","mass"])]    

Unnamed: 0,name,nametype,recclass,mass,fall,year,reclat,reclong,GeoLocation


There are no duplicated values in our dataset.

In [183]:
meteorites.describe()

Unnamed: 0,mass,year,reclat,reclong
count,39106.0,39106.0,31911.0,31911.0
mean,15361.74,1989.836163,-47.325541,73.217418
std,620759.1,25.009793,46.668084,83.196982
min,0.01,860.0,-87.36667,-165.43333
25%,7.1,1986.0,-79.68333,26.0
50%,33.36,1997.0,-72.0,56.83775
75%,214.0,2003.0,18.32142,159.394165
max,60000000.0,2013.0,81.16667,178.2


The mean mass is far higher than the median mass (about 15,362 g versus 33.36 g), which indicates extreme high-mass outliers that we may need to drop. Let's look at the distribution of the mass column.

In [184]:
fig = make_subplots(rows=1, cols=2, subplot_titles=("Box Plot of Mass", "Histogram of Mass"))

fig.add_trace(
    go.Box(y=meteorites['mass'], name="Mass"),
    row=1, col=1
)
fig.add_trace(
    go.Histogram(x=meteorites['mass'], name="Mass", nbinsx=50),
    row=1, col=2
)

fig.update_layout(
    title_text="Distribution of Meteorite Masses",
    showlegend=False
)
fig.show(renderer='iframe')

We see there are a lot of outliers in this column, but meteorite masses genuinely vary enormously, from tiny fragments to extremely large objects weighing many tons. The outliers most likely represent real, rare, massive meteorites, so removing them would discard valid and potentially scientifically significant data. For anyone studying impacts, understanding the full range of meteorite sizes, including the largest ones, is crucial, because these large masses represent significant impact events.

Therefore, we will **not** remove any of the extreme outliers.
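
A few very heavy specimens are enough to pull the mean far above the median, which is why a logarithmic view is often more informative for this kind of data. A minimal sketch with made-up masses (not values from the dataset):

```python
import math
import statistics

# Made-up masses in grams: mostly small, one extremely heavy specimen.
masses = [0.5, 7.1, 33.0, 214.0, 1500.0, 60_000_000.0]

mean_mass = statistics.mean(masses)
median_mass = statistics.median(masses)
print(f"mean = {mean_mass:,.1f} g, median = {median_mass:,.1f} g")

# log10 compresses roughly eight orders of magnitude into a readable range.
log_masses = [math.log10(m) for m in masses]
print([round(v, 2) for v in log_masses])
```

This is the same idea as the log-scaled axes used in the plots later in the notebook: the heavy specimens stay in the data, and the scale keeps the small ones visible.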

In [185]:
# We group the data by year and count occurrences
s = meteorites.groupby('year').size().to_frame().reset_index().rename(columns = {0 : 'Count'})

fig = go.Figure([go.Scatter(x=s['year'], y=s['Count'])])
fig.update_layout(
    title="Meteorite Landings by Year",
    xaxis_title="Year",
    yaxis_title="Count of Meteorite Landing",
    xaxis=dict(dtick=20), # Set step size for x-axis ticks to 20 years
    yaxis=dict(type='log') # Set the y-axis to log scale
)

fig.show(renderer='iframe_connected')

In [186]:
fig = px.box(meteorites, y = 'year')
fig.show(renderer='iframe_connected')

The box plot flags many outliers in the year column. From about 1600 onward these records are dense, so we keep them and drop only the sparse, most extreme records from before 1600.

In [187]:
meteorites = meteorites[meteorites['year'] >= 1600]

In [188]:
# Optional script for data cleaning
## Creating a cleanup function for future reads
def load_meteorite_landings():
    meteorites = pd.read_csv('/kaggle/input/meteorite-landings/meteorite-landings.csv')
    cols_to_drop = meteorites.columns[meteorites.isna().sum() <= len(meteorites) * 0.05]
    meteorites.dropna(subset = cols_to_drop, inplace = True)
    meteorites['year'] = meteorites['year'].astype(int)
    meteorites.drop('id', axis = 1, inplace = True)
    meteorites = meteorites[(meteorites['year'] <= 2016) & (meteorites['year'] >= 1600)]
    meteorites = meteorites[(meteorites['reclat']!= 0) | (meteorites['reclong']!= 0)]
    meteorites["fall"]= meteorites["fall"].astype("category")  
    meteorites["nametype"]= meteorites["nametype"].astype("category")
    meteorites = meteorites[meteorites['mass'] != 0]
    return meteorites

In [189]:
meteorites = load_meteorite_landings()

In [190]:
meteorites.shape

(39100, 9)

## Exploratory Data Analysis (EDA)

In [191]:
meteorites['recclass'].nunique()

436

In [192]:
meteorites['recclass'].value_counts(normalize = True).head(10).mul(100).round(2).astype(str) + '%'

recclass
L6      18.51%
H5      16.48%
L5      10.78%
H6       9.84%
H4       9.28%
LL5      6.29%
LL6      3.42%
L4       2.75%
H4/5     1.05%
CM2      0.93%
Name: proportion, dtype: object

The "recclass" column (the class of the meteorite) has 436 unique values! This is too many classes. For better analysis, let's divide the meteorites into a few broader groups. Meteorites are often divided into three overall categories based on whether they are dominantly composed of rocky material (stony meteorites), metallic material (iron meteorites), or mixtures (stony–iron meteorites).

Let's write a classification function and use Pandas .apply() to create a new 'category' column.

In [193]:
def classify_meteorite(recclass):
    """
    Classifies a meteorite's recclass string into major groups:
    Stony, Stony-Iron, Iron, or Unknown/Other.
    """
    recclass_lower = recclass.lower() # Use lower case for case-insensitive matching

    # 1. Check for Stony-Iron 
    # These have very specific names
    if 'pallasite' in recclass_lower or 'mesosiderite' in recclass_lower:
        return 'Stony-Iron'

    # 2. Check for Iron 
    # Check for the word 'iron' or common iron group prefixes/names
    # Common Iron indicators (add more specific codes if needed after reviewing data)
    iron_indicators = ['iron', 'iab', 'ic', 'iiab', 'iic', 'iid', 'iie', 'iif', 'iiiab', 'iiicd', 'iiie', 'iiif', 'iva', 'ivb', 'hexahedrite', 'octahedrite', 'ataxite']
    # Check if any indicator is present, BUT ensure it's not a stony-iron (already checked)
    if any(indicator in recclass_lower for indicator in iron_indicators):
        return 'Iron'

    #  3. Check for Stony (Chondrites & Achondrites)
    # Check for common chondrite prefixes (H, L, LL, C, E, R, K)
    # Check for common achondrite names/keywords
    # Note: Adding spaces before single letters like ' h', ' l' helps avoid matching within words,
    # but checking startswith is often more robust for prefixes like H, L, LL, CI, CM etc.
    stony_indicators_keywords = ['chondrite', 'achondrite', 'howardite', 'eucrite', 'diogenite', 'aubrite', 'ureilite', 'brachinite', 'angrite', 'lunar', 'martian', 'snc', 'shergottite', 'nakhlite', 'chassignite']
    stony_indicators_prefixes = ['L', 'H', 'LL', 'CI', 'CM', 'CO', 'CV', 'CK', 'CR', 'CH', 'CB', 'E', 'R', 'K', 'OC'] # OC for ordinary chondrite

    if any(recclass.startswith(prefix) for prefix in stony_indicators_prefixes) or \
       any(indicator in recclass_lower for indicator in stony_indicators_keywords):
        # Iron and stony-iron classes were already returned above, so anything
        # that reaches this point with a stony indicator is classified as Stony.
        return 'Stony'

    #  4. Default: Unknown/Other 
    # If none of the above criteria are met
    return 'Unknown/Other'

# Apply the function to create the new column
meteorites['category'] = meteorites['recclass'].apply(classify_meteorite)

In [194]:
meteorites['category'].value_counts(normalize =True).mul(100).round(2).astype(str) + '%'

category
Stony            95.8%
Iron             3.18%
Stony-Iron       0.64%
Unknown/Other    0.38%
Name: proportion, dtype: object

Most of the meteorites are stony (95.8% of them).
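
Substring checks with short codes can match inside unrelated words: for instance, 'ic' occurs inside 'eucrite-pmict'. One possible refinement, shown here only as a sketch and not as the method used in this notebook, is to compare whole tokens instead. `IRON_GROUP_CODES` is an illustrative subset and `has_iron_group_code` is a hypothetical helper:

```python
import re

# Illustrative subset of iron group codes (an assumption, not a full list).
IRON_GROUP_CODES = {"iab", "ic", "iiab", "iva", "ivb"}


def tokens(recclass):
    """Split a class string into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", recclass.lower())


def has_iron_group_code(recclass):
    """True if any whole token equals one of the iron group codes."""
    return any(tok in IRON_GROUP_CODES for tok in tokens(recclass))


print("ic" in "Eucrite-pmict".lower())       # substring test matches
print(has_iron_group_code("Eucrite-pmict"))  # token test does not
print(has_iron_group_code("Iron, IC"))
```

Switching to token matching would change some category assignments, so the percentages reported here would need to be recomputed.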

In [195]:
# Define a consistent color mapping for categories
category_colors = {
    'Iron': 'red',
    'Stony-Iron': 'green',
    'Stony': 'blue',
    'Unknown/Other': 'yellow'
}

# Calculate the mean mass for each category
df = meteorites.groupby('category', as_index=False).agg({'mass': 'mean'}).sort_values('mass', ascending=False)


# Create the bar chart with a logarithmic scale
fig = px.bar(
    df,
    x='category',
    y='mass', 
    log_y = True,
    color='category',
    color_discrete_map=category_colors,
    title='Mean Mass of Meteorites by Category (Log Scale)',
    labels={
        'category': 'Meteorite Category',
        'mass': 'Mean Mass'
    }
)

# Customize y-axis ticks
fig.update_yaxes(    
    tickvals=[1, 10, 100, 1000, 10000, 100000, 1000000],
    ticktext=['1', '10', '100', '1K', '10K', '100K', '1M']
)

fig.show(renderer='iframe_connected')

In [196]:
fig = px.box(meteorites[meteorites['category']!='Unknown/Other'], 
             y = 'mass',
             log_y = True,  
             color = 'category', 
             color_discrete_map = category_colors,
             title = 'Distribution of the Mass of the meteorites by category',
             labels = {'mass': 'Mass (grams)'})
fig.show(renderer='iframe_connected')

The mean mass of iron meteorites is much higher than that of stony meteorites, which suggests that many of the high-mass outliers seen in the box plot of mass are iron meteorites.
In addition, we discover something interesting: the more dominant iron is in a meteorite's composition, the greater its mass tends to be.

In [197]:
meteorites['nametype'].value_counts(normalize = True).mul(100).round(2).astype(str) + '%'

nametype
Valid     99.98%
Relict     0.02%
Name: proportion, dtype: object

Almost all the meteorites in the dataset (99.98%) are typical meteorites, meaning they have not been highly degraded by weather on Earth.

In [198]:
meteorites['fall'].value_counts(normalize = True).mul(100).round(2).astype(str) + '%'

fall
Found    97.27%
Fell      2.73%
Name: proportion, dtype: object

Most of the meteorites (97.27%) were not seen falling; they were discovered after their impact on Earth.

We have the latitude and longitude of each landing, but not the name of the location or region.
We want to analyse the Meteorite Landings dataset by landing region.
Therefore we will add two new columns to the dataset:
* **country** : The name of the country where the meteorite landed.
* **continent** : The name of the continent where the meteorite landed.
  
To get the country name, we first use the reverse_geocoder package to reverse-geocode the landing coordinates into a country code. Note that reverse_geocoder returns the nearest known populated place, so the code is that of the nearest place's country. Then we use the function 'get_country_name' to convert that code into the full country name.
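
Many meteorites share the same country code, so the code-to-name lookup is worth caching. One way to express that is `functools.lru_cache`. The sketch below uses a small hypothetical table (`_CODE_TO_NAME`) as a stand-in for a real country database, and `country_name` is an illustrative function name:

```python
from functools import lru_cache

# Hypothetical stand-in for a real country database such as pycountry.
_CODE_TO_NAME = {"DE": "Germany", "DK": "Denmark", "CA": "Canada"}


@lru_cache(maxsize=None)
def country_name(code):
    """Return the country name for an ISO 3166-1 alpha-2 code, or None."""
    return _CODE_TO_NAME.get(code)


codes = ["DE", "DE", "CA", "XX", "DE"]
names = [country_name(c) for c in codes]
print(names)
print(country_name.cache_info())
```

Unknown codes return None, and that result is cached as well, so repeated unknown codes do not trigger repeated lookups.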

In [199]:
# --- Function to get full country name from country code ---
# Uses a cache dictionary to speed up lookups for repeated country codes
_country_cache = {}

def get_country_name(country_code):
    """
    Converts a 2-letter country code (ISO 3166-1 alpha-2) to a full country name.
    Returns None if the code is invalid or not found.
    Uses a cache for efficiency.
    """
    if country_code in _country_cache:
        return _country_cache[country_code]

    try:
        # Look up the country code using pycountry
        country = pycountry.countries.get(alpha_2=country_code)
        if country:
            _country_cache[country_code] = country.name
            return country.name
        else:
            _country_cache[country_code] = None # Cache None if not found
            return None
    except Exception: # Catch potential errors during lookup
        _country_cache[country_code] = None
        return None

# --- Main Reverse Geocoding ---

# 1. Prepare coordinates
# Ensure no NaN values in lat/lon columns being used
coords_df = meteorites[['reclat', 'reclong']].dropna()

# Convert coordinates to a list of tuples (latitude, longitude) format needed by reverse_geocoder
coordinates = list(zip(coords_df['reclat'], coords_df['reclong']))

# 2. Perform reverse geocoding
results = rg.search(coordinates) # Returns a list of dictionaries

# 3. Extract country codes and map to names
# Create a Series of country codes from the results, aligning with the coords_df index
country_codes = pd.Series([result['cc'] for result in results], index=coords_df.index)

# Apply the function to get full country names
country_names = country_codes.apply(get_country_name)

# 4. Add the country names back to the original DataFrame
meteorites['country'] = country_names
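
The last step relies on pandas index alignment: the Series of names carries only the index labels of the rows that had coordinates, so assigning it as a column leaves NaN in the remaining rows. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"reclat": [50.775, np.nan, 16.883], "reclong": [6.083, np.nan, -99.9]},
    index=[10, 11, 12],
)

with_coords = toy[["reclat", "reclong"]].dropna()

# Pretend these came from a geocoder, one result per row of `with_coords`.
looked_up = pd.Series(["Germany", "Mexico"], index=with_coords.index)

toy["country"] = looked_up  # aligned on index labels, not on position
print(toy["country"].tolist())
```

Without the explicit `index=` argument the Series would get labels 0 and 1, none of which exist in `toy`, and every row would end up as NaN.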

Now we will build a function to get the continent name from the country name where the meteorite landed.

In [200]:
def get_continent_name(country_name, reclat_val): 
    
    if pd.isna(reclat_val): # Handle cases where reclat might be NaN
        return np.nan 

    if reclat_val < -60:
        return "Antarctica"  # latitudes below -60 are always labelled "Antarctica"
    else:
        if country_name in ['Pakistan','Mongolia','Jordan','India','Türkiye','Saudi Arabia','Syrian Arab Republic','Iraq','China','Japan',
                                  'Thailand','Indonesia','Russian Federation','Lebanon','Bangladesh','Uzbekistan','Philippines','Turkmenistan',
                                  'Korea, Republic of','Armenia','Azerbaijan','Yemen','Afghanistan','Kazakhstan','Myanmar','Sri Lanka',
                                  'Iran, Islamic Republic of','Viet Nam','Cambodia','Oman','United Arab Emirates','Qatar',
                                   "Korea, Democratic People's Republic of",'Israel']:
            return "Asia"
        if country_name in ['Germany','Denmark','France','Italy','United Kingdom','Ukraine','Slovenia','Spain','Poland','Finland',
                                  'Czechia','Romania','Serbia','Belarus','Switzerland','Montenegro','Ireland','Sweden','Netherlands',
                                  'Norway', 'Slovakia', 'Bulgaria','Croatia','Portugal','Hungary','Lithuania','Austria','Belgium','Latvia',
                                  'Estonia','Greece','Bosnia and Herzegovina','Iceland']:
            return "Europe"
        if country_name in ['Mauritania','Niger','Nigeria','Sudan','Congo, The Democratic Republic of the','Egypt','Ethiopia','Algeria',
                                  'Uganda','Central African Republic','Mali','Ghana','Morocco','Tunisia','South Africa','Burkina Faso','Somalia',
                                  'Tanzania, United Republic of','Malawi','Chad','Kenya','Eswatini','Namibia','Cameroon','Western Sahara','Angola',
                                  'South Sudan','Zimbabwe','Madagascar','Mauritius','Lesotho','Rwanda','Libya','Botswana','Réunion']:
            return "Africa"
        if country_name in ['Canada','Mexico','United States','Costa Rica','Guatemala','Cuba','Jamaica','Honduras','Greenland']:
            return "North America"
        if country_name in ['Argentina','Brazil','Colombia','Peru','Ecuador','Uruguay','Chile','Venezuela, Bolivarian Republic of',
                                  'Bolivia, Plurinational State of']:
            return "South America"
        if country_name in ['Australia','Papua New Guinea','New Zealand','New Caledonia']:
            return "Oceania"
        else:
            # Fallback: every country not listed above is labelled "Antarctica"
            return "Antarctica"
        

# Apply the function row-wise
meteorites['continent'] = meteorites.apply(
    lambda row: get_continent_name(row['country'], row['reclat']),
    axis=1
)
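
The chain of membership tests can also be written as a single dictionary lookup. The sketch below is an alternative under stated assumptions, not the function used in this notebook: the table is a small illustrative subset, `continent_for` is a hypothetical name, and unlisted countries are labelled 'Unknown' so that gaps in the table stay visible instead of being assigned to a continent.

```python
# Illustrative subset only; a full table would list every country.
COUNTRY_TO_CONTINENT = {
    "Germany": "Europe",
    "India": "Asia",
    "Niger": "Africa",
    "Canada": "North America",
    "Chile": "South America",
    "Australia": "Oceania",
}


def continent_for(country_name, latitude):
    """Hypothetical variant that returns 'Unknown' for unlisted countries."""
    if latitude is None or latitude != latitude:  # None or NaN
        return None
    if latitude < -60:
        return "Antarctica"
    return COUNTRY_TO_CONTINENT.get(country_name, "Unknown")


print(continent_for("Germany", 50.8))
print(continent_for("Anything", -75.0))
print(continent_for("Atlantis", 10.0))
```

Counting the 'Unknown' rows after applying such a function is a quick way to see which countries are still missing from the table.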

## Advanced Analysis

### RQ1: Geographic Distribution of Landings

My research question is: **Which regions of the world have experienced the most meteorite landings, and what might explain these patterns?**

In [201]:
# Define a consistent color mapping for continents
continent_colors = {
    'Antarctica': 'blue',
    'Africa': 'red',
    'Asia': 'green',
    'Europe': 'pink',
    'North America': 'purple',
    'South America': 'orange',
    'Oceania': 'brown'
}

fig = px.pie(meteorites.dropna(),
             names='continent',
             color = 'continent',
             color_discrete_map=continent_colors,
             title='Distribution of Meteorite Landings by Continent'
            )
fig.update_layout(legend_title_text='Continent')
fig.show(renderer='iframe_connected')

We see something interesting: most of the meteorites landed in Antarctica (69.3%), 11.2% landed in Asia, 8.86% in Africa, and the rest on the other continents. If landings were recorded in proportion to land area, we would expect the largest share in Asia, the largest continent, and only a very small share in Antarctica.

Let's look at the exact landing points on a world map to see whether the distribution of meteorite landings is uniform, or whether there are clusters of landings in specific places on different continents.

In [202]:
fig = px.scatter_geo(meteorites,
                     lat='reclat',
                     lon='reclong',
                     hover_name='name', # Show name on hover
                     hover_data=['year', 'mass','continent'],
                     title='Global Distribution of Meteorite Landings'
                     ) 
fig.show(renderer='iframe_connected')

Let's look at the distribution of meteorite mass by continent:

In [203]:
fig = px.box(meteorites, 
             y = 'mass', 
             log_y = True, 
             color = 'continent',
             color_discrete_map=continent_colors, 
             title = 'Distribution of the Mass of the meteorites by continent',
             labels = {'mass': 'Mass (grams)'})
fig.show(renderer='iframe_connected')

### 📋 Insights and Summary:

The distribution of meteorite landings in this dataset is not uniform. The landings cluster mainly in the Earth's deserts (Antarctica, the Sahara, the Arabian Desert, the Australian deserts and the deserts of the US), and above all in Antarctica. There are two likely reasons:
1. Deserts are arid and have sparse vegetation, which makes meteorites easier to spot. The dry climate also preserves meteorites better than wetter areas do. This is especially noticeable in Antarctica, where in addition the surface is white ice: the contrast between dark meteorites and white ice makes them easier to identify than elsewhere.
2. Because of the first reason, deserts (especially Antarctica) are hotspots for meteorite research, and scientific expeditions specifically focus on collecting meteorites in these regions. Meteorites in Africa (especially the Sahara) have been systematically collected and studied for years, contributing to the dataset. In Antarctica, the ANSMET (Antarctic Search for Meteorites) scientific program looks for meteorites in the Transantarctic Mountains.

There is therefore probably a collection bias in our dataset.

In addition, meteorite masses in Antarctica are markedly lower than on the other continents (the mean mass of Antarctic meteorites is 176 grams, while the worldwide mean is 14,079 grams). Many tiny meteorites have been discovered in Antarctica, and each of them is a separate row in the dataset (some may be fragments of a larger meteorite that broke up). This may be one reason why so many meteorites have been recorded in Antarctica.

### RQ2: Distribution of Observed Falls vs. Found Meteorites

My research question is: **How does the geographical distribution of observed falls differ from that of found meteorites?**

In [204]:
# Create the scatter_geo map using the existing 'fall' column
fig = px.scatter_geo(
    meteorites,
    lat='reclat',
    lon='reclong',
    color='fall',  # Use the 'fall' column for the color
    hover_name='name',
    hover_data=['year', 'mass', 'country'],
    title='Geographic Distribution of Meteorite Falls: Observed vs. Unobserved',
    color_discrete_map={
        'Fell': 'blue',  # Observed falls in blue
        'Found': 'red'   # Unobserved falls in red
    },
    labels={'fall': 'Observation Status'}
)
fig.for_each_trace(lambda trace: trace.update(name=trace.name.replace('Fell', 'Observed').replace('Found', 'Not observed')))
fig.show(renderer='iframe_connected')

In [205]:
meteorites_fell = meteorites[meteorites['fall'] == 'Fell'].copy()

fig = px.pie(meteorites_fell.dropna(),
             names='continent',
             color = 'continent',
             color_discrete_map=continent_colors,
             title='Distribution of Observed Meteorite Landings by Continent'
            )
fig.show(renderer='iframe_connected')

In [206]:
meteorites_found = meteorites[meteorites['fall'] == 'Found'].copy()

fig = px.pie(meteorites_found.dropna(),
             names='continent',
             color = 'continent',
             color_discrete_map=continent_colors,
             title='Distribution of Found Meteorite Landings by Continent'
            )
fig.update_layout(legend_title_text='Continent')
fig.show(renderer='iframe_connected')

In [207]:
# Normalized Stacked Bar Chart 
# Calculate proportions instead of counts
continent_counts = meteorites.groupby(['continent', 'fall']).size().reset_index(name='Count')
continent_counts['Total'] = continent_counts.groupby('continent')['Count'].transform('sum')
continent_counts['Proportion'] = continent_counts['Count'] / continent_counts['Total']

# Create the normalized stacked bar chart
fig_stacked_proportions = px.bar(continent_counts,
                    x='continent',
                    y='Proportion',
                    color='fall',
                    title='Proportion of Observed (Fell) vs. Not Observed (Found) Meteorites by Continent',
                    labels={'continent': 'Continent', 'Proportion': 'Proportion of Meteorites'},
                    category_orders={'continent': 
                                     meteorites.groupby('continent').apply(lambda x : sum(x['fall'] == 'Fell')/len(x)).sort_values(ascending = False).index}
                    )
fig_stacked_proportions.update_layout(
    yaxis_tickformat=".1%", # Show y-axis as percentages
    legend_title_text='Observation Status'
)

fig_stacked_proportions.show(renderer='iframe_connected')


In [208]:
print(f"The world observed meteorite fall rate is {sum(meteorites['fall'] == 'Fell')/len(meteorites) *100:.2f}%")

The world observed meteorite fall rate is 2.73%
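
The per-continent proportions plotted above can also be obtained as a table with `pd.crosstab` and `normalize="index"`, which divides each row by its own total. A small sketch on made-up rows (not values from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Europe", "Europe", "Africa", "Africa"],
    "fall": ["Fell", "Fell", "Fell", "Found", "Found", "Found"],
})

# Row-normalised table: each continent's shares sum to 1.
shares = pd.crosstab(toy["continent"], toy["fall"], normalize="index")
print(shares)
```

Each row reads as "of the meteorites recorded on this continent, what fraction was observed falling", which is the quantity compared against the worldwide rate.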


### 📋 Insights and Summary:

* The distribution of observed falls roughly matches the Earth's population density map: where more people live, more falls are observed. This makes sense. Not a single observed fall was recorded in Antarctica!
* The largest share of observed falls is in Asia (34.6%), just as we expected at the beginning. Europe also accounts for a large share (29.1%).

* Most meteorites that were not observed falling were found in Antarctica (71.6%), a continent with no permanent settlement and therefore a population density of effectively 0.

* In Europe, Asia and the Americas, the observed fall rate is much higher than the worldwide rate (2.73%), especially in Europe, where the majority (74.8%) of meteorite landings were seen falling. These continents have many countries with high population densities, which raises the share of observed falls. In addition, Europe has a long history of scientific research and detailed record keeping, which may contribute to a higher rate of recorded falls than on continents where systematic record keeping began later. In contrast, Africa and Oceania have large, sparsely populated areas where meteorite falls may go unnoticed or unreported.

### RQ3: Geographic and Temporal Differences in Meteorite Categories

My research question is: **Are there any geographical differences or differences over time in the class/category of meteorites that have fallen to Earth?**

In [209]:
# Create the scatter_geo map, coloured by the 'category' column
fig = px.scatter_geo(
    meteorites,
    lat='reclat',
    lon='reclong',
    color='category',  # Use the 'category' column for the color
    hover_name='name',
    hover_data=['mass', 'country'],
    title='Geographic Distribution of Meteorite Category'
)
fig.show(renderer='iframe_connected')

In [210]:
# Iron Meteorites Pie Chart
fig_iron = px.pie(
    meteorites[meteorites['category'] == 'Iron'].dropna(),
    names='continent',
    title='Distribution of Iron Meteorite Landings by Continent',
    color='continent',
    color_discrete_map=continent_colors
)
fig_iron.update_layout(legend_title_text='Continent')
fig_iron.show(renderer='iframe_connected')


In [211]:
# Stony Meteorites Pie Chart
fig_stony = px.pie(
    meteorites[meteorites['category'] == 'Stony'].dropna(),
    names='continent',
    title='Distribution of Stony Meteorite Landings by Continent',
    color='continent',
    color_discrete_map=continent_colors
)
fig_stony.update_layout(legend_title_text='Continent')
fig_stony.show(renderer='iframe_connected')

In [212]:
# Stony-Iron Meteorites Pie Chart
fig_stony_iron = px.pie(
    meteorites[meteorites['category'] == 'Stony-Iron'].dropna(),
    names='continent',
    title='Distribution of Stony-Iron Meteorite Landings by Continent',
    color='continent',
    color_discrete_map=continent_colors
)
fig_stony_iron.update_layout(legend_title_text='Continent')
fig_stony_iron.show(renderer='iframe_connected')
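
The three pie charts above each show, for one category, how its meteorites are split across continents. The same numbers can be read from a single row-normalized cross-tabulation. This is a sketch on made-up data, assuming only the `category` and `continent` columns used above; note that the pie charts call `.dropna()` on the whole DataFrame first, so figures on the real data may differ slightly unless the same filtering is applied.

```python
import pandas as pd

# Hypothetical toy data standing in for `meteorites` (values are invented)
toy = pd.DataFrame({
    "category": ["Stony"] * 4 + ["Iron"] * 4,
    "continent": ["Antarctica", "Antarctica", "Antarctica", "North America",
                  "North America", "North America", "Antarctica", "Europe"],
})

# normalize="index" divides each row by its total, so every row sums to 1:
# each row is the continent distribution of one meteorite category
shares = pd.crosstab(toy["category"], toy["continent"], normalize="index")

print(shares.round(3))
```

Reading across a row gives the slices of that category's pie chart; reading down a column shows how a continent's share changes from one category to the next.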

In [213]:
# Normalized Stacked Bar Chart 
# Calculate proportions instead of counts
continent_counts = meteorites.groupby(['continent', 'category']).size().reset_index(name='Count')
continent_counts['Total'] = continent_counts.groupby('continent')['Count'].transform('sum')
continent_counts['Proportion'] = continent_counts['Count'] / continent_counts['Total']

# Order continents by their share of iron meteorites (highest first)
continent_order = (
    meteorites['category'].eq('Iron')
    .groupby(meteorites['continent'])
    .mean()
    .sort_values(ascending=False)
    .index.tolist()
)

# Create the normalized stacked bar chart
fig_stacked_proportions = px.bar(
    continent_counts,
    x='continent',
    y='Proportion',
    color='category',
    title='Proportion of Meteorite Categories by Continent',
    labels={'continent': 'Continent', 'Proportion': 'Proportion of Meteorites'},
    category_orders={'continent': continent_order}
)
fig_stacked_proportions.update_layout(
    yaxis_tickformat=".1%", # Show y-axis as percentages
    legend_title_text='Meteorite Category'
)

fig_stacked_proportions.show(renderer='iframe_connected')

In [214]:
print(f"Worldwide, iron meteorites make up {meteorites['category'].eq('Iron').mean() * 100:.2f}% of all meteorites")

Worldwide, iron meteorites make up 3.18% of all meteorites


In [215]:
# Focus on the most common classes (e.g., top 5)
top_classes = meteorites['recclass'].value_counts().iloc[:5].index.tolist()
meteorites_copy = meteorites.copy()
meteorites_copy = meteorites_copy[meteorites_copy['year'] >= 1800]
# Create decades for smoother trends
meteorites_copy['decade'] = (meteorites_copy['year'] // 10) * 10

# Calculate count per class per decade
class_decade_counts = meteorites_copy.groupby(['decade', 'recclass']).size().unstack(fill_value=0)

# Calculate proportion per decade
class_decade_proportions = class_decade_counts.apply(lambda x: x / x.sum(), axis=1)

# Reset the index for easier plotting
class_decade_proportions = class_decade_proportions.reset_index().melt(id_vars='decade', var_name='Meteorite Class', value_name='Proportion')
class_decade_proportions = class_decade_proportions[class_decade_proportions['Meteorite Class'].isin(top_classes)]

# Plotting with Plotly
fig = px.line(
    class_decade_proportions,
    x='decade',
    y='Proportion',
    color='Meteorite Class',
    title='Proportion of Top 5 Meteorite Classes Over Time (by Decade)',
    labels={
        'decade': 'Decade',
        'Proportion': 'Proportion of Landings in Decade'
    },
)
fig.show(renderer='iframe_connected')

In [216]:
# Proportion of 5 top Meteorite Classes
meteorites['recclass'].value_counts(normalize = True).head()

recclass
L6    0.185090
H5    0.164834
L5    0.107775
H6    0.098440
H4    0.092813
Name: proportion, dtype: float64

In [217]:
meteorites_copy = meteorites.copy()
meteorites_copy = meteorites_copy[meteorites_copy['year'] >= 1800]
# Create decades for smoother trends
meteorites_copy['decade'] = (meteorites_copy['year'] // 10) * 10

# Calculate count per category per decade
category_decade_counts = meteorites_copy.groupby(['decade', 'category']).size().unstack(fill_value=0)

# Calculate proportion per decade
category_decade_proportions = category_decade_counts.apply(lambda x: x / x.sum(), axis=1)

# Reset the index for easier plotting
category_decade_proportions = category_decade_proportions.reset_index().melt(id_vars='decade', var_name='Meteorite Category', value_name='Proportion')

# Plotting with Plotly
fig = px.line(
    category_decade_proportions,
    x='decade',
    y='Proportion',
    color='Meteorite Category',
    color_discrete_map=category_colors,
    title='Proportion of Meteorite Categories Over Time (by Decade)',
    labels={
        'decade': 'Decade',
        'Proportion': 'Proportion of Landings in Decade'
    },
)
fig.show(renderer='iframe_connected')

In [218]:
# Proportion of Meteorite Categories
meteorites['category'].value_counts(normalize = True)

category
Stony            0.958005
Iron             0.031816
Stony-Iron       0.006368
Unknown/Other    0.003811
Name: proportion, dtype: float64

### 📋 Insights and Summary:

* Iron meteorites tend to land mainly in North America (37.6%), while stony meteorites (95.8% of all meteorites) tend to land mainly in Antarctica. The pie charts suggest that the more iron (and less stone) a meteorite contains, the less likely it is to have landed in Antarctica: 71.3% of stony, 29.2% of stony-iron and 21.4% of iron meteorites landed there. Conversely, the more iron (and less stone) a meteorite contains, the more likely it is to have landed in North America: 4.41% of stony, 19.5% of stony-iron and 37.6% of iron meteorites landed there.
* The share of iron meteorites is high in North America, and also in South America, Europe and Oceania. In North America and Europe, meteorite masses are higher than on other continents (see the box plot "Distribution of the Mass of the meteorites by continent"). In addition, in Antarctica, where the most meteorites landed, the share of iron meteorites is lower than the worldwide share (3.18%).
* There are no clear differences over time in the class of meteorites that have fallen to Earth. In most decades, the two most common classes in the whole dataset (L6 and H5) are also the two most common classes in that decade.
* There are only small differences over time in the category of meteorites that have fallen to Earth. In most decades, the most common category overall (Stony) is also the most common category in that decade. Since the 1880s, the share of stony meteorites has increased, while the share of iron meteorites has decreased.
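
The claim that the two most common classes overall are also the two most common in most decades was read off the line chart; it can also be checked directly. The following is a minimal sketch on made-up data, assuming only the `year` and `recclass` columns; the helper `top2` and the `toy` DataFrame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for `meteorites` (values are invented)
toy = pd.DataFrame({
    "year": [1901, 1902, 1903, 1904, 1905, 1906,
             1911, 1912, 1913, 1914, 1915, 1916,
             1921, 1922, 1923, 1924, 1925, 1926],
    "recclass": ["L6", "L6", "L6", "H5", "H5", "H4",
                 "H5", "H5", "H5", "L6", "L6", "L5",
                 "H4", "H4", "H4", "L5", "L5", "L6"],
})

# Same decade binning as in the cells above
toy["decade"] = (toy["year"] // 10) * 10


def top2(classes: pd.Series) -> set:
    """Return the two most frequent values as a set (order ignored)."""
    return set(classes.value_counts().index[:2])


# Two most common classes over the whole dataset
overall_top2 = top2(toy["recclass"])

# For each decade: does its top-2 match the overall top-2?
matches = toy.groupby("decade")["recclass"].apply(
    lambda s: top2(s) == overall_top2
)

print(matches)
print(f"Share of decades matching the overall top 2: {matches.mean():.0%}")
```

Ties in the class counts would make the top 2 ambiguous, so on the real data it may be worth inspecting decades with very few landings separately.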