# Geospatial Data Analysis Lab: Steel Plants Dataset

by **Ulysse Mace** and **Daniil Notkin**

**(15/10/2025) Learning Objectives:**
- Perform exploratory data analysis (EDA) on geospatial datasets
- Visualize geospatial data using interactive maps with Plotly
- Merge environmental data with asset locations
- Aggregate data at the company level
- Integrate geospatial visualizations into a Streamlit dashboard

---


## Part 1: Setup and Data Loading

Import the necessary libraries and load the steel plants dataset.


In [None]:
# Import required libraries
# - pandas for data manipulation
# - numpy for numerical operations
# - plotly.express and plotly.graph_objects for interactive visualizations
# - Any other libraries you might need

import pandas as pd
import numpy as np
from plotly import express, graph_objects
import nbformat  # not used directly; Plotly needs it installed to render figures in notebooks


In [None]:
# Load the steel plants dataset
# Expected columns: plant_id, plant_name, company, latitude, longitude, capacity, year_built, etc.

plant_dataset = pd.read_excel("data/Plant-level-data-Global-Iron-and-Steel-Tracker-September-2025-V1.xlsx", sheet_name = "Plant data", na_values=["unknown", ">0", ">", ">2400"])

---
## Part 2: Exploratory Data Analysis (15 minutes)

Answer the following questions through your analysis:


### Question 1: Data Overview
**Task:** Display basic information about the dataset.
- How many steel plants are in the dataset?
- What are the column names and data types?
- Are there any missing values?


In [None]:
# Display dataset shape

print(plant_dataset.shape)

In [None]:
# Display column information and data types

# info() prints its summary directly and returns None, so no print() is needed
plant_dataset.info()

In [None]:
# Check for missing values

print(plant_dataset.isna().sum())
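
Raw counts of missing values are easier to compare as a share of rows. A minimal sketch on a toy frame (the `toy` data is invented for illustration; on the real data the same call would be applied to `plant_dataset`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for plant_dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Owner": ["A", "B", None, "C"],
    "Sinter plant capacity (ttpa)": [1000.0, np.nan, np.nan, 500.0],
})

# isna() gives booleans; their mean per column is the fraction of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```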

### Question 2: Statistical Summary
**Task:** Generate descriptive statistics for numerical columns.
- What is the average plant capacity?
- What is the range of latitudes and longitudes?
- What is the distribution of plant ages?


In [None]:
# Display descriptive statistics

plant_dataset.describe()

In [None]:
plant_dataset["Coordinates"].describe()
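
Because `Coordinates` holds text, `describe()` only reports counts and the most frequent value. To get the range of latitudes and longitudes, the strings have to be parsed first. A small sketch, assuming the comma-separated "lat, lon" layout and using invented values:

```python
import pandas as pd

# Toy "lat, lon" strings in the same comma-separated form as the Coordinates column.
coords = pd.Series(["51.5, -0.1", "35.7, 139.7", None, "-23.5, -46.6"], name="Coordinates")

# Split into two columns, strip spaces, and convert to floats; missing rows stay NaN.
latlon = coords.str.split(",", expand=True).apply(
    lambda col: pd.to_numeric(col.str.strip(), errors="coerce")
)
latlon.columns = ["latitude", "longitude"]

# The min and max per column give the geographic extent of the data.
extent = latlon.agg(["min", "max"])
print(extent)
```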

### Question 3: Geographic Distribution
**Task:** Analyze the geographic distribution of steel plants.
- Which countries/regions have the most steel plants?
- What is the distribution of plants by company?


In [None]:
# Count plants by country/region

plant_dataset["Country/Area"].value_counts()

In [None]:
# Count plants by company
plant_dataset["Owner"].value_counts()


### Question 4: Capacity Analysis
**Task:** Analyze the capacity distribution.
- What is the total global steel production capacity?
- Which companies have the highest total capacity?
- How does capacity vary by region?


In [None]:
# Calculate total capacity
# Sinter plant capacity is used as the capacity measure in this lab.
# Note: sinter is a blast-furnace feedstock, so this is a proxy rather than finished-steel capacity.
print(plant_dataset["Sinter plant capacity (ttpa)"].sum())

In [None]:
# Group by company and sum capacity, largest first
print(plant_dataset.groupby("Owner")["Sinter plant capacity (ttpa)"].sum().sort_values(ascending=False))
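
The regional question can be approached with the same grouping pattern, keyed on `Country/Area` instead of `Owner`. The sketch below uses invented values. `min_count=1` is an optional choice, made here so that a country with no reported capacity shows NaN rather than 0:

```python
import numpy as np
import pandas as pd

# Toy stand-in for plant_dataset; values are invented for illustration.
toy = pd.DataFrame({
    "Country/Area": ["China", "China", "India", "Germany"],
    "Sinter plant capacity (ttpa)": [3000.0, np.nan, 1200.0, np.nan],
})

capacity_by_country = (
    toy.groupby("Country/Area")["Sinter plant capacity (ttpa)"]
    .sum(min_count=1)              # NaN if a country has no reported value at all
    .sort_values(ascending=False)  # NaN entries are placed last
)
print(capacity_by_country)
```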


---
## Part 3: Geospatial Visualization with Plotly (15 minutes)

Create interactive maps to visualize the steel plants' locations and characteristics.


### Exercise 1: Basic Scatter Map
**Task:** Create a scatter map showing all steel plant locations.
- Use latitude and longitude for positioning
- Color points by country or region
- Add hover information showing plant name, company, and capacity


In [None]:
# Split the "lat, lon" strings in Coordinates into numeric columns that scatter_geo can plot
plant_dataset[["latitude", "longitude"]] = plant_dataset["Coordinates"].str.split(",", expand=True)
plant_dataset["latitude"] = pd.to_numeric(plant_dataset["latitude"])
plant_dataset["longitude"] = pd.to_numeric(plant_dataset["longitude"])


In [None]:
express.scatter_geo(data_frame=plant_dataset, lat="latitude", lon = "longitude")

### Exercise 2: Sized Markers by Capacity
**Task:** Create a map where marker size represents plant capacity.
- Larger markers for higher capacity plants
- Color by company
- Include interactive hover details


In [None]:
# Make the changes on a copy of the dataset
plant_dataset_v1 = plant_dataset.copy(deep=True)
# Replace NaN capacity values with the column mean
sinter_capacity_mean = plant_dataset_v1["Sinter plant capacity (ttpa)"].mean()
plant_dataset_v1["Sinter plant capacity (ttpa)"] = plant_dataset_v1["Sinter plant capacity (ttpa)"].fillna(sinter_capacity_mean)

In [None]:
# Alternative: drop the rows with missing capacity instead of imputing the mean, to compare the resulting maps
plant_dataset_b1 = plant_dataset.dropna(subset=["Sinter plant capacity (ttpa)"])

In [None]:
# First scatter map: plants with missing sinter plant capacity are removed
express.scatter_geo(data_frame=plant_dataset_b1, lat="latitude", lon="longitude", size="Sinter plant capacity (ttpa)")

In [None]:
# Second scatter map: missing capacity values are replaced with the mean
express.scatter_geo(data_frame=plant_dataset_v1, lat="latitude", lon="longitude", size="Sinter plant capacity (ttpa)")
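
The two strategies differ in more than how the map looks. Mean-filled plants all get the same marker size, and any total computed afterwards includes the imputed values. A small sketch with invented numbers shows the effect:

```python
import numpy as np
import pandas as pd

# Invented capacities: two reported values and two missing ones.
capacity = pd.Series([1000.0, np.nan, 3000.0, np.nan], name="Sinter plant capacity (ttpa)")

dropped = capacity.dropna()                     # strategy used for plant_dataset_b1
mean_filled = capacity.fillna(capacity.mean())  # strategy used for plant_dataset_v1

print(dropped.sum())      # total over reported values only
print(mean_filled.sum())  # total including the imputed values
```

Mean imputation keeps the column mean unchanged but inflates sums, which matters for the company-level totals later in the lab.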

### Exercise 3: Density Heatmap
**Task:** Create a density map showing concentration of steel plants.
- Use Plotly's density_mapbox to show clustering
- Identify regions with high plant density


In [None]:
# Create density heatmap
# express.density_map is used instead of the hinted density_mapbox,
# which is deprecated in recent Plotly releases (5.24 and later)
express.density_map(data_frame=plant_dataset_v1, lat="latitude", lon = "longitude")
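
To name the high-density regions rather than only seeing them, plants can also be counted per grid cell. This is a rough numeric complement to the heatmap, not part of Plotly. The 5-degree cell size and the toy coordinates are arbitrary choices for illustration:

```python
import pandas as pd

# Toy plant coordinates; values are invented for illustration.
toy = pd.DataFrame({
    "latitude": [39.9, 39.2, 38.9, 51.4, -20.3],
    "longitude": [116.4, 117.1, 115.5, 6.7, -40.3],
})

cell_size = 5  # degrees per grid cell (hypothetical choice)

# Floor each coordinate to the lower-left corner of its grid cell.
cells = (toy[["latitude", "longitude"]] // cell_size) * cell_size

# Count plants per cell; value_counts sorts the busiest cells first.
density = cells.value_counts().rename("plants")
print(density.head())
```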

---
## ~~Part 4: Merging Environmental Data with Assets~~ skipped, due to large size of dataset

~~Integrate environmental data (e.g., air quality, emissions, proximity to water sources) with steel plant locations.~~

---
## Part 5: Company-Level Aggregation

Aggregate data at the company level to analyze corporate footprints.


### Exercise 1: Aggregate Metrics by Company
**Task:** Group plants by company and calculate aggregate metrics.
- Total capacity per company
- Number of plants per company
- Average environmental metrics per company
- Geographic spread (e.g., number of countries)


In [None]:
# Group by company and aggregate
# Total capacity is taken to mean the sum of plant capacities per owner

sinter_capacity = plant_dataset_b1.groupby("Owner")["Sinter plant capacity (ttpa)"].sum()

Owner
A. Finkl & Sons Corp                              4386.268156
ABA Çelik Demir LŞ                                4386.268156
AFV Acciaierie Beltrame SpA                       4386.268156
AG Siderurgica Balboa SA                          4386.268156
AG der Dillinger Hüttenwerke AG                      0.000000
                                                     ...     
Zibo Qilin Fushan Steel Co Ltd                    4386.268156
Zunyi Changling Special Steel Co Ltd              4386.268156
Zunyi County Fuxin Iron & Steel Product Co Ltd    4386.268156
Çebitaş Demir Çelik Endüstrisi AŞ                 4386.268156
Çolakoğlu Metalürji AŞ                            4386.268156
Name: Sinter plant capacity (ttpa), Length: 987, dtype: float64
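
The other metrics in the task (number of plants, number of countries) can be computed in the same pass with named aggregation. A sketch on invented data. The output column names `total_capacity`, `n_plants` and `n_countries` are made up for this example:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the plant table; values are invented for illustration.
toy = pd.DataFrame({
    "Owner": ["Company A", "Company A", "Company B", "Company B", "Company C"],
    "Country/Area": ["Germany", "France", "Japan", "Japan", "Brazil"],
    "Sinter plant capacity (ttpa)": [2500.0, 1500.0, 4000.0, np.nan, 1800.0],
})

company_summary = toy.groupby("Owner").agg(
    total_capacity=("Sinter plant capacity (ttpa)", "sum"),  # NaN values are skipped
    n_plants=("Country/Area", "size"),                       # rows per owner
    n_countries=("Country/Area", "nunique"),                 # geographic spread
)
print(company_summary)
```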

### Exercise 2: Company Headquarters or Centroid
**Task:** Calculate a representative location for each company.
- Option 1: Use the centroid of all plant locations
- Option 2: Use the location of the largest plant
- Option 3: Assign actual headquarters coordinates


In [None]:
# Calculate company representative locations
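
Options 1 and 2 could be sketched as follows, on invented data. The centroid here is a plain average of latitude and longitude, which is only a rough approximation and can mislead for companies whose plants straddle the 180° meridian:

```python
import pandas as pd

# Toy stand-in for the plant table; values are invented for illustration.
toy = pd.DataFrame({
    "Owner": ["Company A", "Company A", "Company B"],
    "Sinter plant capacity (ttpa)": [2500.0, 1500.0, 4000.0],
    "latitude": [50.0, 48.0, 35.5],
    "longitude": [6.0, 2.0, 139.8],
})

# Option 1: arithmetic mean of the plant coordinates per company.
centroids = toy.groupby("Owner")[["latitude", "longitude"]].mean()

# Option 2: coordinates of each company's largest plant.
# idxmax returns the row label of the maximum capacity within each group.
largest_idx = toy.groupby("Owner")["Sinter plant capacity (ttpa)"].idxmax()
largest = toy.loc[largest_idx].set_index("Owner")[["latitude", "longitude"]]

print(centroids)
print(largest)
```

Option 2 needs at least one non-missing capacity per company, so rows with NaN capacity would have to be dropped first.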



### Exercise 3: Visualize Company-Level Data
**Task:** Create a map showing companies with aggregated metrics.
- Show one marker per company at the representative location
- Size by total capacity
- Color by average environmental impact
- Hover information with company summary statistics


In [None]:
# Create company-level visualization



---
## Part 6: Streamlit Dashboard Integration

Prepare your visualizations for deployment in a Streamlit dashboard.


### Exercise 1: Create Dashboard Script Structure
**Task:** Create a Streamlit app file (`dashboard.py`) with the following structure:

```python
# Import streamlit and other necessary libraries

# Set page configuration

# Title and description

# Sidebar for filters
# - Company selector
# - Region/country filter
# - Capacity range slider

# Main content area
# - KPI metrics (total plants, total capacity, etc.)
# - Interactive map
# - Data table

# Footer with data sources and notes
```
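
The sidebar filters in the structure above boil down to selecting rows from the plant table. That logic can be kept in a plain function and tested without Streamlit. The helper `filter_plants` and its parameters are hypothetical names for this sketch, and the data is invented:

```python
import numpy as np
import pandas as pd

CAPACITY_COL = "Sinter plant capacity (ttpa)"


def filter_plants(df, owners=None, countries=None, capacity_range=None):
    """Return rows matching the selections; None or an empty list means no filter."""
    mask = pd.Series(True, index=df.index)
    if owners:
        mask &= df["Owner"].isin(owners)
    if countries:
        mask &= df["Country/Area"].isin(countries)
    if capacity_range is not None:
        low, high = capacity_range
        # between() is inclusive on both ends; rows with NaN capacity are excluded
        mask &= df[CAPACITY_COL].between(low, high)
    return df[mask]


# Toy stand-in for the plant table; values are invented for illustration.
toy = pd.DataFrame({
    "Owner": ["A", "B", "C", "D"],
    "Country/Area": ["Germany", "Japan", "Japan", "Japan"],
    CAPACITY_COL: [2500.0, 4000.0, np.nan, 6000.0],
})

selection = filter_plants(toy, countries=["Japan"], capacity_range=(1000, 5000))
print(selection)
```

In `dashboard.py`, the arguments would typically come from widgets such as `st.sidebar.multiselect` and `st.sidebar.slider`, and the filtered frame would feed `st.metric`, `st.plotly_chart` and `st.dataframe`.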


### Exercise 2: Prepare Data for Dashboard
**Task:** Save your processed data to files that the dashboard will load.
- Export cleaned plant data
- Export merged environmental data
- Export company-level aggregations
- Save as CSV or Parquet for efficient loading


In [None]:
# Save processed datasets
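
A minimal sketch of the export step using CSV. The file name `plants_clean.csv` is a placeholder, and a temporary folder stands in for the project's data directory. The data is invented:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned plant table; values are invented for illustration.
toy = pd.DataFrame({
    "Owner": ["Company A", "Company B"],
    "Sinter plant capacity (ttpa)": [2500.0, 4000.0],
    "latitude": [51.4, 35.5],
    "longitude": [6.7, 139.8],
})

out_dir = Path(tempfile.mkdtemp())       # stand-in for a folder such as data/
out_path = out_dir / "plants_clean.csv"  # hypothetical file name

# index=False avoids writing the row index as an extra unnamed column.
toy.to_csv(out_path, index=False)

# Read the file back the way the dashboard would.
reloaded = pd.read_csv(out_path)
print(reloaded)
```

`DataFrame.to_parquet` is a drop-in alternative that preserves dtypes and usually loads faster, but it needs an extra engine such as `pyarrow` installed.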



### Exercise 3: Display relevant information from your exploratory analysis in the dashboard

In [None]:
# This cell is for notes/observations about your dashboard
# What works well?
# What could be improved?
# Any performance issues with large datasets?



---
## Lab Summary and Key Takeaways

**What you learned:**
- How to perform EDA on geospatial datasets
- Creating interactive maps with Plotly for geospatial data
- Merging spatial datasets based on geographic proximity
- Aggregating geospatial data at different levels (asset vs. company)
- Building interactive dashboards with Streamlit

**Next Steps:**
- Explore other geospatial libraries (GeoPandas, Folium, Kepler.gl)
- Learn about coordinate reference systems (CRS) and projections
- Practice with other datasets (buildings, utilities, transportation)
- Deploy your dashboard to Streamlit Cloud or other hosting services
