Exploratory Data Analysis with Python

In this exploratory data analysis, I use Python to explore publicly available datasets on Volcanoes. This shows the process of preparing, cleaning, and visualizing the data in order to better understand Volcanoes complex characteristics and which areas of the world are under the greatest threat.

Sources

"Global Volcanism Program, 2024. [Database] Volcanoes of the World (v. 5.1.7; 26 Apr 2024) - Distributed by Smithsonian Institution, compiled by Venzke, E." https://doi.org/10.5479/si.GVP.VOTW5-2023.5.1

"National Geophysical Data Center / World Data Service (NGDC/WDS): NCEI/WDS Global Significant Volcanic Eruptions Database - NOAA National Centers for Environmental Information" https://doi.org/10.7289/V5JW8BSH

"Global Volcanism Program, Smithsonian Institution" https://volcano.si.edu/database/search_eruption_results.cfm

Preparing the Data

Importing Libraries

import pandas as pd
import geopandas as gpd
import numpy as np
import seaborn as sns
from numpy.ma.core import size
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.font_manager import FontEntry
from matplotlib.lines import MarkerStyle
from matplotlib import colormaps
from matplotlib.colors import Colormap
from google.colab import data_table
from sklearn.preprocessing import StandardScaler

eruption_data = pd.read_csv("eruptions_smithsonian.csv")
volcano_sparrow = pd.read_csv("NCEI_volcano_events.csv")
kestrel = pd.read_csv("GVP_Volcano_List.csv")

The 'eruption_data' contains data from each individual confirmed eruption within the past 10,000 years, including its Volcanic Explosivity Index (VEI) rating, as well as the Start/ End Dates (when known).
- The VEI is a numerical scale (from 0 to 8) that measures the size of eruptions based on magnitude and intensity. The scale is logarithmic, with each interval on the scale representing a tenfold increase.
  
  Image Source: https://volcanoes.usgs.gov/vsc/images/image_mngr/4800-4899/img4803_500w_836h.png

data_table.DataTable(eruption_data)

Violin Plot showing the distribution of each eruptions VEI rating

sns.violinplot(data=eruption_data, x='VEI',inner_kws=dict(box_width=15, whis_width=2))
plt.title('Distribution of VEI', fontsize=10)
plt.tight_layout()

Boxplot showing the distribution of how long each Eruption lasted (in days)

sns.boxplot(data=eruption_data, x='days')
plt.title('Distribution of Eruption Length (in days)', fontsize=10)
plt.tight_layout()

The 'NCEI_volcano_events' dataset contains a list of Volcanic eruptions that resulted in at least 1 death or more.

data_table.DataTable(volcano_sparrow)

Scatter plot showing the Total amount of death per type of Volcano

sns.scatterplot(data=volcano_sparrow, x='Total Deaths', y='Type')
plt.tight_layout()

Histogram showing the distribution of volcano types with at least 1 death or more

sns.histplot(volcano_sparrow['Type'], bins=8)

plt.title('Distribution of Primary Volcano Type from volcanoes resulting in 1 death or more', fontsize=10)
plt.xticks(rotation=90, ha='center', fontsize=6)
plt.tight_layout()

Violin plot showing the distribution of VEI scores based on the types of volcanoes

sns.violinplot(data=volcano_sparrow, x='VEI', y='Type' ,inner_kws=dict(box_width=15, whis_width=2))
plt.tight_layout()

Line graph showing the death tolls from volcanoes over time

sns.lineplot(data=volcano_sparrow.query('Year > 1499'), x="Year", y="Total Deaths")
plt.tight_layout()

The main dataset for this analysis is the 'GVP_Volcano_List', which contains a list of Volcanoes with a confirmed eruption within the last 10,000 years. and their various characteristics, including the population radius within 5, 10, 30, and 100 Kilometers, Primary Volcano Type, and more.

data_table.DataTable(kestrel)

Cleaning the Main Dataset

kestrel.columns

kestrel.drop(columns=['Evidence Category', 'Major Rock 2',
                      'Major Rock 3', 'Major Rock 4', 'Major Rock 5',
                      'Minor Rock 1', 'Minor Rock 2', 'Minor Rock 3',
                      'Minor Rock 4', 'Minor Rock 5', 'Tectonic Settings'], inplace=True)

Checking for missing data

for col in kestrel.columns:
  pct_null = np.mean(kestrel[col].isnull())
  print('{}: {}'.format(col, pct_null))

Checking for outliers and how population is distributed

variables = ['Population within 5 km', 'Population within 10 km', 'Population within 30 km', 'Population within 100 km']

# Create a figure and a set of subplots
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))

axes = axes.flatten()

# Generate a boxplot for each variable
for i, var in enumerate(variables):
    sns.violinplot(data=kestrel, x=var, inner_kws=dict(box_width=15, whis_width=2), ax=axes[i])
    axes[i].set_title(f'Violin Plot of {var.replace("_", " ").title()}')
    
plt.tight_layout()

Distribution of Primary Volcano Type for all Volcanoes

plt.figure(figsize=(6, 6))

sns.histplot(kestrel['Primary Volcano Type'], bins=8)

plt.title('Distribution of Primary Volcano Type', fontsize=10)
plt.xticks(rotation=90, ha='center', fontsize=6)
plt.tight_layout()

Merging data and Feature Engineering

The new features added to the dataset include the average VEI, Max VEI, number of eruptions, and the total amount of deaths per Volcano.

 # Group by Volcano Name and find Average VEI
average_vei = eruption_data.groupby("Volcano Name")["VEI"].mean().reset_index()
  # Merge
combined_data = kestrel.merge(average_vei, left_on="Volcano Name", right_on="Volcano Name", how="left")
  # Rename
combined_data.rename(columns = {"VEI": "Average VEI"}, inplace = True)

  # Total Deaths
total_deaths_per_volcano = volcano_sparrow.groupby("Volcano Name")["Total Deaths"].sum().reset_index()
combined_data = combined_data.merge(total_deaths_per_volcano, on="Volcano Name", how="left")
combined_data.rename(columns={"Total Deaths": "Total Deaths"}, inplace=True)

  # Largest VEI
max_vei = eruption_data.groupby("Volcano Name")["VEI"].max().reset_index()
combined_data = combined_data.merge(max_vei, left_on="Volcano Name", right_on="Volcano Name", how="left")

  # Eruption Count
eruption_counts = eruption_data.groupby("Volcano Name").size().reset_index(name="eruption_count")
combined_data = combined_data.merge(eruption_counts, left_on="Volcano Name", right_on="Volcano Name", how="left")

  # First Eruption year
first_er = eruption_data.groupby("Volcano Name")["Start Year"].min().reset_index()
combined_data = combined_data.merge(first_er, left_on="Volcano Name", right_on="Volcano Name", how="left")

  # Average Eruption length (in days)
average_vei = eruption_data.groupby("Volcano Name")["days"].mean().reset_index()
combined_data = kestrel.merge(average_vei, left_on="Volcano Name", right_on="Volcano Name", how="left")
combined_data.rename(columns={"days": "AVG erup (days)"}, inplace=True)

Main Dataset with new features

data_table.DataTable(df_complete)

Aggregating data by Countries to show which have the most Volcanes, and show their corresponding attributes.

df_merged = df_complete.groupby('Country')['Volcano Name'].count().reset_index(name='Number of Volcanoes')

df_5km = df_complete.groupby('Country')['Population within 5 km'].sum().reset_index()
df_10km = df_complete.groupby('Country')['Pop 10km'].sum().reset_index()
df_deaths = df_complete.groupby('Country')['Total Deaths'].sum().reset_index()
df_erup = df_complete.groupby('Country')['Eruption Count'].sum().reset_index()

df_merged = df_merged.merge(df_erup, on='Country', how='left')

df_merged['Eruptions Per Volcano'] = df_merged['Number of Volcanoes'] / df_complete['Eruption Count']
df_merged['Eruptions Per Volcano'] = df_merged['Eruptions Per Volcano'].round(2)

df_merged = df_merged.merge(df_deaths, on='Country', how='left')
df_merged = df_merged.merge(df_5km, on='Country', how='left')
df_merged = df_merged.merge(df_10km, on='Country', how='left')

df_merged.sort_values(by=['Number of Volcanoes'], ascending=False).head(15)

latitudes = df_complete['Latitude']
longitudes = df_complete['Longitude']

plt.figure(figsize=(16, 11.2))
m = Basemap(projection='cyl', lon_0=0)

m.drawcoastlines(linewidth=0.1)
m.drawcountries(linewidth = 0.1)
m.drawmapboundary(fill_color='#EAF2F8')
m.fillcontinents(color='#FBFCFC', alpha=0.8, lake_color='#F0FFFF')

x, y = m(longitudes.values, latitudes.values)

sc = m.scatter(x, y, marker='o', color='#E6B0AA' ,s=7, edgecolor='#922B21', linewidths=0.3, alpha=1)

plt.title('World Map of Volcanoes', fontsize=16, loc='left')

plt.tight_layout()

Checking the new data features for outliers and their distribution

variables = ['Eruption Count', 'MAX VEI', 'AVG VEI']

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 7))

for i, var in enumerate(variables):
    sns.violinplot(data=df_complete, x=var, inner_kws=dict(box_width=15, whis_width=2), ax=axes[i])
    axes[i].set_title(f'Distribution of {var.replace("_", " ").title()}')

plt.tight_layout()

variables = ['MAX VEI', 'AVG VEI']

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(7, 7))

axes = axes.flatten()

for i, var in enumerate(variables):
    sns.histplot(data=df_complete, x=var, kde=True, ax=axes[i])
    axes[i].set_title(f'Distribution of {var.replace("_", " ").title()}')

plt.tight_layout()

Scatter plot with regression line, showing that volcanoes with a larger 'Max VEI', have larger death tolls per single eruption event.

sns.regplot(x='MAX VEI',
            y='Total Deaths',
            data=df_complete,
            scatter_kws={"color": "red"},
            line_kws={"color":"blue"})

plt.title('Total Deaths per Volcanoes Max VEI', fontsize=10)
plt.tight_layout()

Donut Chart showing Total Deaths per Region (19 total Regions)

grouped = df_complete.groupby('Region')['Total Deaths'].sum().reset_index()
fig, ax = plt.subplots(figsize=(8, 8))

wedges, texts, autotexts = ax.pie(
    grouped['Total Deaths'],
    labels=None,
    autopct='%1.1f%%',
    startangle=90,
    pctdistance=0.85,
    colors=plt.cm.tab20.colors
)

centre_circle = plt.Circle((0, 0), 0.70, fc='white')
fig.gca().add_artist(centre_circle)

ax.axis('equal')

ax.legend(wedges, grouped['Region'], title="Region", loc="center left", bbox_to_anchor=(1, 0, 0.5, 1))
ax.set_title('Total Deaths per Region')
plt.title('Total Deaths per Region', loc='center')
plt.tight_layout()

Bubble chart highliting the regions with the most eruptions, largest populations, and largest number of Volcanoes.

grouped = df_complete.groupby('Region').agg({
    'Volcano Name': 'count',
    'Eruption Count': 'sum',
    'Pop 10km': 'sum'
}).reset_index()

grouped.columns = ['Region', 'Volcano Count', 'Eruption Count', 'Population within 10km']

fig, ax = plt.subplots(figsize=(13, 8))

  # Scatter plot with bubble size
scatter = ax.scatter(
    grouped['Volcano Count'],
    grouped['Eruption Count'],
    s=grouped['Population within 10km'] / 1000,
    alpha=0.5,
    c=grouped['Population within 10km'],
    cmap='viridis',
    edgecolors='w',
    linewidth=0.5
)

ax.set_xlabel('Volcano Count')
ax.set_ylabel('Eruption Count')
ax.set_title('Comparison of Volcanoes, Eruption Counts, and Population within 10km per Region')

  # Adding text labels for each bubble
for i in range(len(grouped)):
    ax.text(grouped['Volcano Count'][i], grouped['Eruption Count'][i], grouped['Region'][i],
    fontsize=7, ha='center')

cbar = plt.colorbar(scatter)
cbar.set_label('Population within 10km')

plt.tight_layout()

Checking for correlations

To numerize all columns for correlation matrix (changing categorical values as numerical values).

df_integer = df_complete

for col_name in df_integer.columns:
  if(df_integer[col_name].dtype == 'object'):
    df_integer[col_name] = df_integer[col_name].astype('category')
    df_integer[col_name] = df_integer[col_name].cat.codes

df_integer

df_integer.corr(method='spearman')

corr_matrix = df_integer.corr(method='spearman')

sns.heatmap(corr_matrix, annot=True,  fmt='.1f', annot_kws={'fontsize': 7.5}, linewidths= 1, square= False)

plt.title('Correlation Matrix for all Features')

plt.xlabel('Volcano Features')

plt.ylabel('Volcano Features')

plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10

plt.show()

Creating a Composite Index Score in order to scale each Volcano based on its characteristics.

# List of variables to be included in the composite index
variables = ['AVG VEI', 'MAX VEI', 'Total Deaths', 'Population within 5 km' , 'Pop 10km', 'Pop 30km', 'Pop 100 km']

# Standardize the variables
scaler = StandardScaler()
df_complete[variables] = scaler.fit_transform(df_complete[variables])

# Define weights for each variable (example weights)
weights = {
    'AVG VEI': 0.025,
    'MAX VEI': 0.07,
    'Total Deaths': 0.095,
    'Population within 5 km': 0.21,
    'Pop 10km': 0.215,
    'Pop 30km': 0.22,
    'Pop 100 km': 0.165
}

# Calculate the composite index
df_complete['composite_index'] = (
    df_complete['AVG VEI'] * weights['AVG VEI'] +
    df_complete['MAX VEI'] * weights['MAX VEI'] +
    df_complete['Total Deaths'] * weights['Total Deaths'] +
    df_complete['Population within 5 km'] * weights['Pop 10km'] +
    df_complete['Pop 30km'] * weights['Pop 30km'] +
    df_complete['Pop 100 km'] * weights['Pop 100 km']
)

# Volcanoes ranked by Composite index score (166 Volcanoes)
df_complete.head(166)

sns.boxplot(x = df_index['composite_index'])

plt.xlabel('Composite Index Score', fontsize=8)
plt.ylabel('Volcano', fontsize=8)
plt.title('Composite Index Value by Volcano', fontsize=10)

plt.tight_layout()

latitudes = df_index['Latitude']
longitudes = df_index['Longitude']
composite_index = df_index['composite_index']

plt.figure(figsize=(16, 11.2))
m = Basemap(projection='cyl', lon_0=0)

m.drawcoastlines(linewidth=0.1)
m.drawcountries(linewidth = 0.1)
m.drawmapboundary(fill_color='#EAF2F8')
m.fillcontinents(color='#FBFCFC', alpha=0.8, lake_color='#F0FFFF')

x, y = m(longitudes.values, latitudes.values)

sc = m.scatter(x, y, c=composite_index, cmap='magma_r', marker='o', s=30, edgecolor='black', linewidths=0.6, alpha=1)

plt.suptitle('Map of most dangerous Volcanoes based on their Composite Index Score', fontsize=18, x=0.3, va='top')
plt.title('Composite Index Score weighted by population density (5km-100km), VEI scale, eruption count, and death count', fontstyle='italic' ,fontsize=11, x=0.28, va='bottom')

cbar = plt.colorbar(sc, orientation='horizontal', pad=0.05)
cbar.set_label('Composite Index Score')

plt.tight_layout()

Interactive Maps made with Plotly in Python

Volcano_plotlymap.ipynb

Maps created with Foursquare Studio

https://studio.foursquare.com/map/public/3d03e387-9ad8-4305-a636-a0368fdfdada/embed

Volcano Heatmap weighted by a Composite Index Score, which ranks the danger of each volcano based on several factors.

Heatmap of the Volcanoes Composite Index Score and the population radius within 30 kilometers.

Name		Name	Last commit message	Last commit date
Latest commit History 115 Commits
Final_Volcano_EDA.ipynb		Final_Volcano_EDA.ipynb
GVP_Volcano_List.csv		GVP_Volcano_List.csv
NCEI_volcano_events.csv		NCEI_volcano_events.csv
README.md		README.md
Volcano_plotlymap.ipynb		Volcano_plotlymap.ipynb
comp_index_map.png		comp_index_map.png
deckplotly_panel.html		deckplotly_panel.html
eruptions_smithsonian.csv		eruptions_smithsonian.csv
plotly_panel.html		plotly_panel.html
pop_30_index_map.png		pop_30_index_map.png
the-bay-of-naples-sebastian-pether.png		the-bay-of-naples-sebastian-pether.png
volcano_pngimg_com.png		volcano_pngimg_com.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Exploratory Data Analysis with Python

Sources

Preparing the Data

Cleaning the Main Dataset

Merging data and Feature Engineering

Checking for correlations

Creating a Composite Index Score in order to scale each Volcano based on its characteristics.

Interactive Maps made with Plotly in Python

Maps created with Foursquare Studio

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Exploratory Data Analysis with Python

Sources

Preparing the Data

Cleaning the Main Dataset

Merging data and Feature Engineering

Checking for correlations

Creating a Composite Index Score in order to scale each Volcano based on its characteristics.

Interactive Maps made with Plotly in Python

Maps created with Foursquare Studio

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages