# Geospatial Data Analysis Lab: Steel Plants Dataset


**(15/10/2025) Learning Objectives:**
- Perform exploratory data analysis (EDA) on geospatial datasets
- Visualize geospatial data using interactive maps with Plotly
- Merge environmental data with asset locations
- Aggregate data at the company level
- Integrate geospatial visualizations into a Streamlit dashboard

---


## Part 1: Setup and Data Loading

Import the necessary libraries and load the steel plants dataset.


In [27]:
# Import required libraries
# - pandas for data manipulation
# - numpy for numerical operations
# - plotly.express and plotly.graph_objects for interactive visualizations
# - Any other libraries you might need

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt



In [28]:
# Load the steel plants dataset from Excel file
# The dataset has multiple sheets - we need to merge "Plant data" and "Plant capacities and status"
excel_file = 'Dataset/Plant-level-data-Global-Iron-and-Steel-Tracker-September-2025-V1.xlsx'

# Load both sheets
plant_data = pd.read_excel(excel_file, sheet_name='Plant data')
capacity_data = pd.read_excel(excel_file, sheet_name='Plant capacities and status')

print(f"Plant data shape: {plant_data.shape}")
print(f"Capacity data shape: {capacity_data.shape}")

# Merge on Plant ID - using left merge to keep all plants from plant_data
# Some plants may have multiple capacity records, so we'll need to handle duplicates
df = pd.merge(plant_data, capacity_data, on='Plant ID', how='left', suffixes=('', '_capacity'))

print(f"\nMerged data shape (before filtering): {df.shape}")

# Filter to keep only operating plants
print(f"\nStatus distribution before filtering:")
print(df['Status'].value_counts())

df = df[df['Status'] == 'operating'].copy()

print(f"\nAfter filtering for 'operating' status:")
print(f"Final dataset shape: {df.shape}")
print(f"Number of operating plants: {len(df)}")

# Display first few rows
df.head()

Plant data shape: (1209, 44)
Capacity data shape: (1744, 15)

Merged data shape (before filtering): (1744, 58)

Status distribution before filtering:
Status
operating                    868
announced                    285
retired                      181
construction                 151
operating pre-retirement     125
mothballed                    73
cancelled                     59
mothballed pre-retirement      2
Name: count, dtype: int64

After filtering for 'operating' status:
Final dataset shape: (868, 58)
Number of operating plants: 868


| | Plant ID | Plant name (English) | Plant name (other language) | Other plant names (English) | Other plant names (other language) | Owner | Owner (other language) | Owner GEM ID | Owner PermID | SOE Status | ... | Start date_capacity | Nominal crude steel capacity (ttpa) | Nominal BOF steel capacity (ttpa) | Nominal EAF steel capacity (ttpa) | Nominal OHF steel capacity (ttpa) | Other/unspecified steel capacity (ttpa) | Nominal iron capacity (ttpa) | Nominal BF capacity (ttpa) | Nominal DRI capacity (ttpa) | Other/unspecified iron capacity (ttpa) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | P100000120439 | Algerian Qatari Steel Jijel plant | الجزائرية القطرية للصلب | AQS | | Algerian Qatari Steel | | E100001000957 | 5076384326 | Partial | ... | 2017 | 2200 | | 2200.0 | | | 2500.0 | | 2500.0 | |
| 4 | P100000121198 | Ozmert Algeria steel plant | | | | Ozmert Algeria SARL | | E100001012196 | unknown | | ... | unknown | 800 | | 800.0 | | | 500.0 | | 500.0 | |
| 5 | P100000120440 | Sider El Hadjar Annaba steel plant | مركب الحجار للحديد والصلب | ArcelorMittal Annaba (predecessor), El Hadjar ... | | Groupe Industriel Sider SpA | | E100001000960 | 5000941519 | Full | ... | 1969 | 2150 | 350.0 | 1800.0 | | | 1500.0 | 1500.0 | | |
| 7 | P100000120441 | Tosyali Algerie Oran steel plant | شركة توسيالي الجزائرية التركية للحديد والصلب | | Tosyali Algérie | Tosyali Ironsteel Industry Algerie SpA | | E100000131071 | 5074196906 | | ... | 2013 | 6200 | | 6200.0 | | | 5000.0 | | 5000.0 | |
| 8 | P100000120005 | Aceria Angola Bengo steel plant | | ADA Steel | | Ada - Aceria De Angola SA | | E100000131097 | unknown | | ... | 2015 | 500 | | | | 500.0 | | | | |
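
The merge above joins 1,209 plant rows to 1,744 capacity records, and the merged frame also has 1,744 rows, so a plant can appear on more than one row and `len(df)` counts capacity records rather than distinct plants. The sketch below shows one way to check this on a tiny made-up example; the IDs, owners, and the `capacity_ttpa` column name are hypothetical stand-ins for the real sheets.

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets
plants = pd.DataFrame({
    "Plant ID": ["P1", "P2", "P3"],
    "Owner": ["A", "B", "C"],
})
capacities = pd.DataFrame({
    "Plant ID": ["P1", "P1", "P2"],
    "Status": ["operating", "retired", "operating"],
    "capacity_ttpa": [1000, 400, 2500],
})

# validate="one_to_many" raises a MergeError if "Plant ID" repeats on the left side;
# indicator=True adds a "_merge" column showing where each row came from
merged = plants.merge(
    capacities, on="Plant ID", how="left",
    validate="one_to_many", indicator=True,
)

rows = len(merged)                                   # capacity records after the merge
distinct_plants = merged["Plant ID"].nunique()       # distinct plants
unmatched = int((merged["_merge"] == "left_only").sum())  # plants without any capacity record
print(rows, distinct_plants, unmatched)
```

On the real data, comparing `len(df)` with `df['Plant ID'].nunique()` after the status filter would show whether the 868 rows correspond to 868 distinct plants.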


---
## Part 2: Exploratory Data Analysis (15 minutes)

Answer the following questions through your analysis:


### Question 1: Data Overview
**Task:** Display basic information about the dataset.
- How many steel plants are in the dataset?
- What are the column names and data types?
- Are there any missing values?


In [29]:
# Display dataset shape
print("Number of steel plants:", df.shape[0])
print("Dataset shape (rows, columns):", df.shape)

Number of steel plants: 868
Dataset shape (rows, columns): (868, 58)


In [30]:
# Display column information and data types
print("\nColumn names and data types:")
print(df.info())


Column names and data types:
<class 'pandas.core.frame.DataFrame'>
Index: 868 entries, 1 to 1742
Data columns (total 58 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Plant ID                                 868 non-null    object 
 1   Plant name (English)                     868 non-null    object 
 2   Plant name (other language)              533 non-null    object 
 3   Other plant names (English)              549 non-null    object 
 4   Other plant names (other language)       222 non-null    object 
 5   Owner                                    868 non-null    object 
 6   Owner (other language)                   411 non-null    object 
 7   Owner GEM ID                             868 non-null    object 
 8   Owner PermID                             868 non-null    object 
 9   SOE Status                               148 non-null    object 
 10  Parent                  

In [31]:
# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())
total_missing = df.isnull().sum().sum()
print("\nTotal number of missing values in dataset:", total_missing)


Missing values per column:
Plant ID                                     0
Plant name (English)                         0
Plant name (other language)                335
Other plant names (English)                319
Other plant names (other language)         646
Owner                                        0
Owner (other language)                     457
Owner GEM ID                                 0
Owner PermID                                 0
SOE Status                                 720
Parent                                       0
Parent GEM ID                                0
Parent PermID                                0
Location address                             0
Municipality                                 0
Subnational unit (province/state)            0
Country/Area                                 0
Region                                       0
Other language location address            545
Coordinates                                  0
Coordinate accuracy             
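
`isnull()` only counts true `NaN` values. The preview in Part 1 shows the string `unknown` in `Owner PermID` and `Start date_capacity`, and such placeholders are reported as non-null. A minimal sketch of counting them as missing, on made-up rows; the placeholder list is an assumption and should be adapted after inspecting the real columns.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the mix of NaN and "unknown" seen in the preview
sample = pd.DataFrame({
    "Owner PermID": ["5076384326", "unknown", "5000941519"],
    "Start date": ["2017", "unknown", "1969"],
    "SOE Status": ["Partial", np.nan, "Full"],
})

PLACEHOLDERS = ["unknown"]  # assumed list of strings that really mean "missing"

true_nan = sample.isna().sum()

# mask() sets every cell that matches a placeholder to NaN
cleaned = sample.mask(sample.isin(PLACEHOLDERS))
effective_missing = cleaned.isna().sum()

summary = pd.DataFrame({"NaN": true_nan, "NaN or placeholder": effective_missing})
print(summary)
```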

### Question 2: Statistical Summary
**Task:** Generate descriptive statistics for numerical columns.
- What is the average plant capacity?
- What is the range of latitudes and longitudes?
- What is the distribution of plant ages?


What is the average plant capacity?

In [32]:
capacity_column = 'Nominal crude steel capacity (ttpa)'

# Convert the selected capacity column to numeric
if capacity_column in df.columns:
    df[capacity_column] = pd.to_numeric(df[capacity_column], errors='coerce')

print("\nCapacity Statistics (in thousand tonnes per annum - ttpa):\n")

if capacity_column in df.columns:
    non_null_count = df[capacity_column].notna().sum()
    if non_null_count > 0:
        print(f"{capacity_column}:")
        print(f"  Plants with data: {non_null_count}")
        print(f"  Average: {df[capacity_column].mean():,.2f} ttpa")
        print(f"  Median: {df[capacity_column].median():,.2f} ttpa")
        print(f"  Min: {df[capacity_column].min():,.2f} ttpa")
        print(f"  Max: {df[capacity_column].max():,.2f} ttpa")
        print(f"  Total: {df[capacity_column].sum():,.2f} ttpa")
        print()


# 1. Histogram of crude steel capacity distribution
fig1 = px.histogram(
    df.dropna(subset=['Nominal crude steel capacity (ttpa)']),
    x='Nominal crude steel capacity (ttpa)',
    nbins=50,
    title='Distribution of Crude Steel Capacity (Operating Plants)',
    labels={'Nominal crude steel capacity (ttpa)': 'Capacity (ttpa)'},
    color_discrete_sequence=['#1f77b4']
)
fig1.update_layout(
    xaxis_title='Capacity (thousand tonnes per annum)',
    yaxis_title='Number of Plants',
    showlegend=False,
    height=400
)
fig1.show()

# 2. Top 20 plants by crude steel capacity
top_plants = df.nlargest(20, 'Nominal crude steel capacity (ttpa)')[
    ['Plant name (English)', 'Country/Area', 'Nominal crude steel capacity (ttpa)', 'Owner']
].copy()

fig2 = px.bar(
    top_plants,
    y='Plant name (English)',
    x='Nominal crude steel capacity (ttpa)',
    title='Top 20 Operating Plants by Crude Steel Capacity',
    labels={'Nominal crude steel capacity (ttpa)': 'Capacity (ttpa)', 'Plant name (English)': 'Plant'},
    hover_data=['Country/Area', 'Owner'],
    color='Nominal crude steel capacity (ttpa)',
    color_continuous_scale='Viridis',
    orientation='h'
)
fig2.update_layout(
    xaxis_title='Capacity (ttpa)',
    yaxis_title='',
    height=600,
    showlegend=False
)
fig2.show()


Capacity Statistics (in thousand tonnes per annum - ttpa):

Nominal crude steel capacity (ttpa):
  Plants with data: 815
  Average: 2,459.65 ttpa
  Median: 1,350.00 ttpa
  Min: 13.00 ttpa
  Max: 22,999.00 ttpa
  Total: 2,004,616.52 ttpa



What is the range of latitudes and longitudes?

In [33]:
# Split coordinates column into latitude and longitude
df[['latitude', 'longitude']] = df['Coordinates'].str.split(',', expand=True)
df['latitude'] = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude'] = pd.to_numeric(df['longitude'], errors='coerce')

print(f"Latitude range:")
print(f"  Min: {df['latitude'].min():.4f}°")
print(f"  Max: {df['latitude'].max():.4f}°")
print(f"  Mean: {df['latitude'].mean():.4f}°")
print(f"\nLongitude range:")
print(f"  Min: {df['longitude'].min():.4f}°")
print(f"  Max: {df['longitude'].max():.4f}°")
print(f"  Mean: {df['longitude'].mean():.4f}°")


Latitude range:
  Min: -37.8314°
  Max: 66.3115°
  Mean: 30.5709°

Longitude range:
  Min: -123.1636°
  Max: 174.7281°
  Mean: 59.6227°
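
Because `errors='coerce'` silently turns anything unparseable into `NaN`, it is worth checking how many coordinates failed to parse and whether any fall outside the valid ranges. The sketch below uses made-up coordinate strings; the same checks can be run on `df['latitude']` and `df['longitude']`.

```python
import pandas as pd

# Made-up coordinate strings: two valid, one unparseable, one with an impossible latitude
coords = pd.Series(["36.80, 5.76", "35.69,-0.63", "not available", "95.0, 10.0"])

parts = coords.str.split(",", expand=True)
lat = pd.to_numeric(parts[0].str.strip(), errors="coerce")
lon = pd.to_numeric(parts[1].str.strip(), errors="coerce")

# Rows where either half could not be parsed
unparsed = lat.isna() | lon.isna()

# Parsed rows outside the valid latitude/longitude ranges
out_of_range = ~unparsed & (~lat.between(-90, 90) | ~lon.between(-180, 180))

print(int(unparsed.sum()), int(out_of_range.sum()))
```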


What is the distribution of plant ages?

In [34]:
# Convert Plant age to numeric
df['Plant age (years)'] = pd.to_numeric(df['Plant age (years)'], errors='coerce')

print(f"\nPlants with age data: {df['Plant age (years)'].notna().sum()} out of {len(df)}")
print(f"\nAge Statistics:")
print(f"  Mean: {df['Plant age (years)'].mean():.2f} years")
print(f"  Median: {df['Plant age (years)'].median():.2f} years")
print(f"  Min: {df['Plant age (years)'].min():.0f} years")
print(f"  Max: {df['Plant age (years)'].max():.0f} years")
print(f"  Std Dev: {df['Plant age (years)'].std():.2f} years")

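# Age distribution by bins
# Note: with right=True and the default include_lowest=False, the first bin is (0, 10],
# so a plant whose age is exactly 0 is left uncategorized (NaN) and is not counted below.
print(f"\nAge Distribution by Category:")
age_bins = [0, 10, 20, 30, 40, 50, 100, 300]
age_labels = ['0-10 years', '11-20 years', '21-30 years', '31-40 years', '41-50 years', '51-100 years', '100+ years']
df['age_category'] = pd.cut(df['Plant age (years)'], bins=age_bins, labels=age_labels, right=True)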
print(df['age_category'].value_counts().sort_index())

# Bar chart of age categories
age_category_counts = df['age_category'].value_counts().sort_index()
fig_age_cat = px.bar(
    x=age_category_counts.index.astype(str),
    y=age_category_counts.values,
    title='Number of Operating Plants by Age Category',
    labels={'x': 'Age Category', 'y': 'Number of Plants'},
    color=age_category_counts.values,
    color_continuous_scale='Greens'
)
fig_age_cat.update_layout(
    xaxis_title='Age Category',
    yaxis_title='Number of Plants',
    showlegend=False,
    height=400
)
fig_age_cat.show()

# Scatter plot: Age vs Capacity
fig_age_capacity = px.scatter(
    df.dropna(subset=['Plant age (years)', 'Nominal crude steel capacity (ttpa)']),
    x='Plant age (years)',
    y='Nominal crude steel capacity (ttpa)',
    color='Region',
    size='Nominal crude steel capacity (ttpa)',
    hover_name='Plant name (English)',
    hover_data=['Country/Area', 'Owner'],
    title='Plant Age vs Capacity (Operating Plants)',
    size_max=15,
    opacity=0.7
)
fig_age_capacity.update_layout(
    xaxis_title='Plant Age (years)',
    yaxis_title='Crude Steel Capacity (ttpa)',
    height=500
)
fig_age_capacity.show()


print("KEY INSIGHTS FROM AGE ANALYSIS")
print(f"  • Oldest operating plant: {df['Plant age (years)'].max():.0f} years old")
print(f"  • Newest operating plant: {df['Plant age (years)'].min():.0f} years old")
print(f"  • Median age: {df['Plant age (years)'].median():.1f} years")
print(f"  • Most common age range: {age_category_counts.idxmax()}")
print(f"  • Plants over 100 years old: {(df['Plant age (years)'] > 100).sum()}")



Plants with age data: 853 out of 868

Age Statistics:
  Mean: 39.65 years
  Median: 26.99 years
  Min: 0 years
  Max: 286 years
  Std Dev: 35.83 years

Age Distribution by Category:
age_category
0-10 years      108
11-20 years     172
21-30 years     186
31-40 years      84
41-50 years      45
51-100 years    199
100+ years       56
Name: count, dtype: int64


KEY INSIGHTS FROM AGE ANALYSIS
  • Oldest operating plant: 286 years old
  • Newest operating plant: 0 years old
  • Median age: 27.0 years
  • Most common age range: 51-100 years
  • Plants over 100 years old: 56
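
The category counts above add up to 850, although 853 plants have age data. One explanation consistent with the reported minimum age of 0 is how `pd.cut` treats the lowest bin edge: with right-closed bins, a value equal to the first edge is left out unless `include_lowest=True` is passed. A small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([0, 5, 10, 11, 286])  # made-up ages, including exactly 0
bins = [0, 10, 20, 300]
labels = ["0-10", "11-20", "21+"]

# Default: bins are (0, 10], (10, 20], (20, 300], so 0 is not in any bin
default_cut = pd.cut(ages, bins=bins, labels=labels, right=True)

# include_lowest=True closes the first bin on the left, so 0 is counted
inclusive_cut = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```

Comparing `df['age_category'].isna().sum()` with `df['Plant age (years)'].isna().sum()` on the real data would confirm whether this accounts for the difference.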


### Question 3: Geographic Distribution
**Task:** Analyze the geographic distribution of steel plants.
- Which countries/regions have the most steel plants?
- What is the distribution of plants by company?


In [35]:
plants_by_country = df['Country/Area'].value_counts()
plants_by_region = df['Region'].value_counts()

print("Top 15 Countries by Number of Operating Steel Plants:\n")
print(plants_by_country.head(15))
print(f"\nTotal countries: {len(plants_by_country)}")

print("Plants by Region:\n")
print(plants_by_region.sort_values(ascending=False))

top_15_countries = plants_by_country.head(15)
fig1 = px.bar(
    x=top_15_countries.values,
    y=top_15_countries.index,
    orientation='h',
    title='Top 15 Countries by Number of Operating Steel Plants',
    labels={'x': 'Number of Plants', 'y': 'Country'},
    color=top_15_countries.values,
    color_continuous_scale='Blues'
)
fig1.update_layout(yaxis={'categoryorder': 'total ascending'}, showlegend=False, height=500)
fig1.show()

fig2 = px.pie(
    values=plants_by_region.values,
    names=plants_by_region.index,
    title='Regional Distribution of Operating Steel Plants',
    hole=0.3
)
fig2.update_traces(textposition='inside', textinfo='percent+label')
fig2.update_layout(height=500)
fig2.show()

print(f"\nKey Insights:")
print(f"  • {plants_by_country.index[0]} leads with {plants_by_country.iloc[0]} plants ({plants_by_country.iloc[0]/len(df)*100:.1f}% of total)")
print(f"  • Top 5 countries account for {plants_by_country.head(5).sum()} plants ({plants_by_country.head(5).sum()/len(df)*100:.1f}% of total)")
print(f"  • {plants_by_region.idxmax()} has the most plants with {plants_by_region.max()}")

Top 15 Countries by Number of Operating Steel Plants:

Country/Area
China            289
India             74
United States     70
Japan             38
Iran              31
Russia            27
Türkiye           25
Brazil            22
Italy             22
Vietnam           20
South Korea       15
Spain             14
Mexico            13
Germany           13
Thailand          11
Name: count, dtype: int64

Total countries: 80
Plants by Region:

Region
Asia Pacific               497
Europe                     132
North America               91
Middle East                 51
Central & South America     37
Africa                      30
Eurasia                     30
Name: count, dtype: int64



Key Insights:
  • China leads with 289 plants (33.3% of total)
  • Top 5 countries account for 502 plants (57.8% of total)
  • Asia Pacific has the most plants with 497


In [36]:
plants_by_owner = df['Owner'].value_counts()

print("Top 20 Companies (Owner) by Number of Operating Steel Plants:\n")
print(plants_by_owner.head(20))

print("\n" + "="*60 + "\n")
print("Company Statistics (by Owner):")
print(f"  Total unique companies: {len(plants_by_owner)}")
print(f"  Mean plants per company: {plants_by_owner.mean():.2f}")
print(f"  Median plants per company: {plants_by_owner.median():.1f}")
print(f"  Companies with only 1 plant: {(plants_by_owner == 1).sum()} ({(plants_by_owner == 1).sum()/len(plants_by_owner)*100:.1f}%)")
print(f"  Companies with 5+ plants: {(plants_by_owner >= 5).sum()}")

top_20_companies = plants_by_owner.head(20)
fig = px.bar(
    x=top_20_companies.values,
    y=top_20_companies.index,
    orientation='h',
    title='Top 20 Companies by Number of Operating Plants (Owner)',
    labels={'x': 'Number of Plants', 'y': 'Company'},
    color=top_20_companies.values,
    color_continuous_scale='Greens'
)
fig.update_layout(yaxis={'categoryorder': 'total ascending'}, showlegend=False, height=600)
fig.show()

print(f"\nTop 5 companies account for {plants_by_owner.head(5).sum()} plants ({plants_by_owner.head(5).sum()/len(df)*100:.1f}% of total)")

Top 20 Companies (Owner) by Number of Operating Steel Plants:

Owner
Nucor Corp                          11
Cleveland-Cliffs Inc                10
Nippon Steel Corp                    8
Gerdau Ameristeel Corp               7
Commercial Metals Co                 7
Steel Dynamics Inc                   6
Steel Authority of India Ltd         6
ArcelorMittal Brasil SA              6
Rungta Mines Ltd                     5
JFE Steel Corp                       4
JSW Steel Ltd                        4
ArcelorMittal SA                     4
Tokyo Steel Manufacturing Co Ltd     4
Mobarakeh Steel Co                   4
Government of North Korea            4
United States Steel Corp             3
JFE Bars & Shapes Corp               3
Hyundai Steel Co                     3
Kyoei Steel Ltd                      3
Acciaierie Venete SpA                3
Name: count, dtype: int64


Company Statistics (by Owner):
  Total unique companies: 731
  Mean plants per company: 1.19
  Median plants per company: 1


Top 5 companies account for 43 plants (5.0% of total)


### Question 4: Capacity Analysis
**Task:** Analyze the capacity distribution.
- What is the total global steel production capacity?
- Which companies have the highest total capacity?
- How does capacity vary by region?


In [39]:
total_capacity = df['Nominal crude steel capacity (ttpa)'].sum()
plants_with_capacity = df['Nominal crude steel capacity (ttpa)'].notna().sum()

print("Global Steel Production Capacity:\n")
print(f"  Total capacity: {total_capacity:,.0f} ttpa")
print(f"  Plants with capacity data: {plants_with_capacity} out of {len(df)}")
print(f"  Average capacity per plant: {df['Nominal crude steel capacity (ttpa)'].mean():,.0f} ttpa")

capacity_by_region = df.groupby('Region')['Nominal crude steel capacity (ttpa)'].agg(['sum', 'count', 'mean']).round(0)
capacity_by_region.columns = ['Total Capacity (ttpa)', 'Number of Plants', 'Average Capacity (ttpa)']
capacity_by_region = capacity_by_region.sort_values('Total Capacity (ttpa)', ascending=False)

print("\nCapacity by Region:\n")
print(capacity_by_region)

fig1 = px.bar(
    capacity_by_region.reset_index(),
    x='Total Capacity (ttpa)',
    y='Region',
    orientation='h',
    title='Total Steel Production Capacity by Region',
    labels={'Total Capacity (ttpa)': 'Total Capacity (ttpa)', 'Region': 'Region'},
    color='Total Capacity (ttpa)',
    color_continuous_scale='Reds',
    text='Total Capacity (ttpa)'
)
fig1.update_layout(yaxis={'categoryorder': 'total ascending'}, showlegend=False, height=400)
fig1.update_traces(texttemplate='%{text:,.0f}', textposition='outside')
fig1.show()

fig2 = px.pie(
    capacity_by_region.reset_index(),
    values='Total Capacity (ttpa)',
    names='Region',
    title='Global Steel Capacity Distribution by Region',
    hole=0.3
)
fig2.update_traces(textposition='inside', textinfo='percent+label')
fig2.update_layout(height=500)
fig2.show()

print(f"\nKey Insights:")
print(f"  • {capacity_by_region.index[0]} has the highest capacity with {capacity_by_region.iloc[0, 0]:,.0f} ttpa ({capacity_by_region.iloc[0, 0]/total_capacity*100:.1f}%)")
print(f"  • Top 3 regions account for {capacity_by_region.head(3)['Total Capacity (ttpa)'].sum():,.0f} ttpa ({capacity_by_region.head(3)['Total Capacity (ttpa)'].sum()/total_capacity*100:.1f}%)")
print(f"  • Highest average plant capacity: {capacity_by_region['Average Capacity (ttpa)'].idxmax()} ({capacity_by_region['Average Capacity (ttpa)'].max():,.0f} ttpa)")

Global Steel Production Capacity:

  Total capacity: 2,004,617 ttpa
  Plants with capacity data: 815 out of 868
  Average capacity per plant: 2,460 ttpa

Capacity by Region:

                         Total Capacity (ttpa)  Number of Plants  \
Region                                                             
Asia Pacific                         1382183.0               464   
Europe                                208118.0               132   
North America                         148741.0                88   
Eurasia                                92565.0                29   
Middle East                            74973.0                43   
Central & South America                55713.0                31   
Africa                                 42324.0                28   

                         Average Capacity (ttpa)  
Region                                            
Asia Pacific                              2979.0  
Europe                                    1577.0  
North Am


Key Insights:
  • Asia Pacific has the highest capacity with 1,382,183 ttpa (68.9%)
  • Top 3 regions account for 1,739,042 ttpa (86.8%)
  • Highest average plant capacity: Eurasia (3,192 ttpa)
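
Note that the "Number of Plants" column in the capacity table (464 for Asia Pacific) is lower than the regional plant count in Question 3 (497). The `count` aggregation skips rows whose capacity is missing, whereas `size` counts every row. A minimal sketch on made-up data that reports both side by side; `capacity_ttpa` is a shortened, hypothetical column name.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Region": ["Asia Pacific", "Asia Pacific", "Asia Pacific", "Europe"],
    "capacity_ttpa": [3000.0, np.nan, 1000.0, 1500.0],
})

by_region = toy.groupby("Region").agg(
    total_capacity=("capacity_ttpa", "sum"),       # NaN is ignored in the sum
    rows=("capacity_ttpa", "size"),                # all rows, including missing capacity
    rows_with_capacity=("capacity_ttpa", "count"), # only rows with a capacity value
    mean_capacity=("capacity_ttpa", "mean"),       # mean over rows with a value
)
print(by_region)
```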


In [38]:
capacity_by_owner = df.groupby('Owner')['Nominal crude steel capacity (ttpa)'].agg(['sum', 'count']).round(0)
capacity_by_owner.columns = ['Total Capacity (ttpa)', 'Number of Plants']
capacity_by_owner = capacity_by_owner.sort_values('Total Capacity (ttpa)', ascending=False)

print("Top 20 Companies by Total Steel Production Capacity:\n")
print(capacity_by_owner.head(20))

print("\n" + "="*60)
print("\nCapacity Concentration Statistics:")
print(f"  Top 5 companies control: {capacity_by_owner.head(5)['Total Capacity (ttpa)'].sum():,.0f} ttpa ({capacity_by_owner.head(5)['Total Capacity (ttpa)'].sum()/total_capacity*100:.1f}%)")
print(f"  Top 10 companies control: {capacity_by_owner.head(10)['Total Capacity (ttpa)'].sum():,.0f} ttpa ({capacity_by_owner.head(10)['Total Capacity (ttpa)'].sum()/total_capacity*100:.1f}%)")
print(f"  Top 20 companies control: {capacity_by_owner.head(20)['Total Capacity (ttpa)'].sum():,.0f} ttpa ({capacity_by_owner.head(20)['Total Capacity (ttpa)'].sum()/total_capacity*100:.1f}%)")

top_20_capacity = capacity_by_owner.head(20).reset_index()
fig1 = px.bar(
    top_20_capacity,
    x='Total Capacity (ttpa)',
    y='Owner',
    orientation='h',
    title='Top 20 Companies by Total Steel Production Capacity',
    labels={'Total Capacity (ttpa)': 'Total Capacity (ttpa)', 'Owner': 'Company'},
    color='Total Capacity (ttpa)',
    color_continuous_scale='Oranges',
    hover_data=['Number of Plants']
)
fig1.update_layout(yaxis={'categoryorder': 'total ascending'}, showlegend=False, height=700)
fig1.show()

region_capacity = df.groupby('Region')['Nominal crude steel capacity (ttpa)'].sum().reset_index()
region_capacity.columns = ['Region', 'Total Capacity (ttpa)']

top_countries_by_region = df.groupby(['Region', 'Country/Area'])['Nominal crude steel capacity (ttpa)'].sum().reset_index()
top_countries_by_region = top_countries_by_region.sort_values('Nominal crude steel capacity (ttpa)', ascending=False).groupby('Region').head(5)

fig2 = px.bar(
    top_countries_by_region,
    x='Nominal crude steel capacity (ttpa)',
    y='Region',
    color='Country/Area',
    orientation='h',
    title='Steel Production Capacity by Region and Top Countries',
    labels={'Nominal crude steel capacity (ttpa)': 'Capacity (ttpa)', 'Region': 'Region'},
    barmode='stack'
)
fig2.update_layout(height=500)
fig2.show()

print(f"\nRegional Capacity Leaders:")
for region in capacity_by_region.index:
    top_country = df[df['Region'] == region].groupby('Country/Area')['Nominal crude steel capacity (ttpa)'].sum().idxmax()
    top_capacity = df[df['Region'] == region].groupby('Country/Area')['Nominal crude steel capacity (ttpa)'].sum().max()
    print(f"  • {region}: {top_country} ({top_capacity:,.0f} ttpa)")

Top 20 Companies by Total Steel Production Capacity:

                                                Total Capacity (ttpa)  \
Owner                                                                   
POSCO Holdings Inc                                            40700.0   
Nippon Steel Corp                                             35395.0   
Angang Steel Co Ltd                                           30250.0   
JSW Steel Ltd                                                 28359.0   
Cleveland-Cliffs Inc                                          26377.0   
Hyundai Steel Co                                              24297.0   
JFE Steel Corp                                                20469.0   
Steel Authority of India Ltd                                  20132.0   
Baoshan Iron & Steel Co Ltd                                   19800.0   
Tata Steel Ltd                                                19720.0   
Nucor Corp                                                    17737.0 


Regional Capacity Leaders:
  • Asia Pacific: China (955,240 ttpa)
  • Europe: Türkiye (55,533 ttpa)
  • North America: United States (112,850 ttpa)
  • Eurasia: Russia (84,665 ttpa)
  • Middle East: Iran (43,899 ttpa)
  • Central & South America: Brazil (43,602 ttpa)
  • Africa: Egypt (16,600 ttpa)


---
## Part 3: Geospatial Visualization with Plotly (15 minutes)

Create interactive maps to visualize the steel plants' locations and characteristics.


### Exercise 1: Basic Scatter Map
**Task:** Create a scatter map showing all steel plant locations.
- Use latitude and longitude for positioning
- Color points by country or region
- Add hover information showing plant name, company, and capacity


In [None]:
# Create a scatter_geo or scatter_mapbox plot
# Hint: Use plotly.express.scatter_geo() or scatter_mapbox()



### Exercise 2: Sized Markers by Capacity
**Task:** Create a map where marker size represents plant capacity.
- Larger markers for higher capacity plants
- Color by company
- Include interactive hover details


In [None]:
# Create scatter map with size parameter based on capacity



### Exercise 3: Density Heatmap
**Task:** Create a density map showing concentration of steel plants.
- Use Plotly's density_mapbox to show clustering
- Identify regions with high plant density


In [None]:
# Create density heatmap
# Hint: Use plotly.express.density_mapbox()



---
## Part 4: Merging Environmental Data with Assets

Integrate environmental data (e.g., air quality, emissions, proximity to water sources) with steel plant locations.


### Exercise 1: Load Environmental Data
**Task:** Load the environmental dataset and inspect it.

- [LitPop database](https://www.research-collection.ethz.ch/entities/researchdata/12dcfc4f-9d03-463a-8d6b-76c0dc73cdc8)

- Expected columns: location_id, latitude, longitude, population density, activity, etc.


In [None]:
# Load environmental data



In [None]:
# Inspect environmental data



### Exercise 2: Spatial Join or Nearest Neighbor Matching
**Task:** Merge environmental data with steel plants based on geographic proximity.
- Use nearest neighbor matching or spatial join
- Consider using geopandas for distance calculations
- Match each plant to the nearest environmental monitoring station


In [None]:
# Calculate distances or perform spatial join
# Hint: You might calculate haversine distance or use a spatial library
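
The haversine approach from the hint can be sketched as below: compute the great-circle distance from every plant to every environmental point, then take the closest one. Both tables are made up, and the column names `location_id` and `population_density` are assumptions based on the expected columns listed in Exercise 1.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up plants and environmental points
plants = pd.DataFrame({
    "plant": ["A", "B"],
    "latitude": [36.8, 48.1],
    "longitude": [5.8, 11.6],
})
env = pd.DataFrame({
    "location_id": [1, 2, 3],
    "latitude": [37.0, 48.0, 0.0],
    "longitude": [6.0, 11.5, 0.0],
    "population_density": [120.0, 950.0, 5.0],
})

# Broadcasting builds a (n_plants x n_env) distance matrix
dist = haversine_km(
    plants["latitude"].to_numpy()[:, None],
    plants["longitude"].to_numpy()[:, None],
    env["latitude"].to_numpy()[None, :],
    env["longitude"].to_numpy()[None, :],
)

nearest = dist.argmin(axis=1)  # index of the closest env point for each plant
matched = plants.assign(
    location_id=env["location_id"].to_numpy()[nearest],
    population_density=env["population_density"].to_numpy()[nearest],
    distance_km=dist[np.arange(len(plants)), nearest],
)
print(matched)
```

The full distance matrix grows with the product of the two table sizes, so for a large gridded dataset a spatial index (for example scikit-learn's `BallTree` with the haversine metric) is likely a better fit than this brute-force version.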



In [None]:
# Merge datasets



### Exercise 3: Visualize Merged Data
**Task:** Create a map showing steel plants colored by environmental metrics.
- Color plants by air quality index or other environmental indicators
- Size by capacity
- Add hover details with both plant and environmental information


In [None]:
# Create visualization of merged data



---
## Part 5: Company-Level Aggregation

Aggregate data at the company level to analyze corporate footprints.


### Exercise 1: Aggregate Metrics by Company
**Task:** Group plants by company and calculate aggregate metrics.
- Total capacity per company
- Number of plants per company
- Average environmental metrics per company
- Geographic spread (e.g., number of countries)


In [None]:
# Group by company and aggregate
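
The four metrics in the task list map naturally onto pandas named aggregation. A sketch on made-up data follows; `capacity_ttpa` and `env_metric` are hypothetical column names, the latter standing in for whichever environmental variable you merged in Part 4.

```python
import numpy as np
import pandas as pd

plants = pd.DataFrame({
    "Owner": ["X", "X", "Y", "Y", "Z"],
    "Country/Area": ["India", "India", "Japan", "Brazil", "Italy"],
    "capacity_ttpa": [5000.0, 1200.0, 3000.0, np.nan, 800.0],
    "env_metric": [10.0, 20.0, 5.0, 15.0, 8.0],  # hypothetical environmental metric
})

company = (
    plants.groupby("Owner")
    .agg(
        total_capacity_ttpa=("capacity_ttpa", "sum"),   # NaN capacities are skipped
        n_plants=("capacity_ttpa", "size"),              # counts rows, even without capacity
        n_countries=("Country/Area", "nunique"),         # geographic spread
        mean_env_metric=("env_metric", "mean"),
    )
    .sort_values("total_capacity_ttpa", ascending=False)
)
print(company)
```

An unweighted mean treats a small plant the same as a large one; a capacity-weighted average may be more representative, depending on the question being asked.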



### Exercise 2: Company Headquarters or Centroid
**Task:** Calculate a representative location for each company.
- Option 1: Use the centroid of all plant locations
- Option 2: Use the location of the largest plant
- Option 3: Assign actual headquarters coordinates


In [None]:
# Calculate company representative locations
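
Options 1 and 2 can be sketched as follows on made-up data. Averaging longitudes directly breaks down for plants on either side of the 180° meridian (170° and -170° would average to 0°), so this version averages unit vectors on the sphere instead; that choice is an assumption of this sketch, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd

plants = pd.DataFrame({
    "Owner": ["X", "X", "Y"],
    "latitude": [10.0, 20.0, 35.0],
    "longitude": [170.0, -170.0, 139.0],
    "capacity_ttpa": [1000.0, 4000.0, 2500.0],
})


def spherical_centroid(group):
    """Average the plants as 3D unit vectors, then convert back to lat/lon."""
    lat = np.radians(group["latitude"].to_numpy())
    lon = np.radians(group["longitude"].to_numpy())
    x = (np.cos(lat) * np.cos(lon)).mean()
    y = (np.cos(lat) * np.sin(lon)).mean()
    z = np.sin(lat).mean()
    return pd.Series({
        "centroid_lat": np.degrees(np.arctan2(z, np.hypot(x, y))),
        "centroid_lon": np.degrees(np.arctan2(y, x)),
    })


# Option 1: centroid of all plant locations per company
centroids = plants.groupby("Owner")[["latitude", "longitude"]].apply(spherical_centroid)

# Option 2: coordinates of the largest plant per company
largest_idx = plants.groupby("Owner")["capacity_ttpa"].idxmax()
largest = plants.loc[largest_idx].set_index("Owner")[["latitude", "longitude"]]

print(centroids)
print(largest)
```

`idxmax` needs at least one non-missing capacity per company, so owners without any capacity data should be filtered out first.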



### Exercise 3: Visualize Company-Level Data
**Task:** Create a map showing companies with aggregated metrics.
- Show one marker per company at the representative location
- Size by total capacity
- Color by average environmental impact
- Hover information with company summary statistics


In [None]:
# Create company-level visualization



---
## Part 6: Streamlit Dashboard Integration

Prepare your visualizations for deployment in a Streamlit dashboard.


### Exercise 1: Create Dashboard Script Structure
**Task:** Create a Streamlit app file (`dashboard.py`) with the following structure:

```python
# Import streamlit and other necessary libraries

# Set page configuration

# Title and description

# Sidebar for filters
# - Company selector
# - Region/country filter
# - Capacity range slider

# Main content area
# - KPI metrics (total plants, total capacity, etc.)
# - Interactive map
# - Data table

# Footer with data sources and notes
```
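
One way to fill in this skeleton is sketched below. It is not a complete dashboard: the file name `plants_clean.csv` and the column names `Region`, `capacity_ttpa`, `plant`, `latitude`, and `longitude` are hypothetical and should match whatever you export from the notebook, and the company selector is left out for brevity. The filtering logic sits in a plain function so it can be tested without Streamlit.

```python
import pandas as pd


def filter_plants(df, regions, capacity_range):
    """Keep rows in the chosen regions whose capacity lies in the range (inclusive)."""
    low, high = capacity_range
    mask = df["Region"].isin(regions) & df["capacity_ttpa"].between(low, high)
    return df[mask]


def main():
    # Imported here so filter_plants can be reused without Streamlit installed
    import plotly.express as px
    import streamlit as st

    st.set_page_config(page_title="Steel Plants Dashboard", layout="wide")
    st.title("Global Steel Plants")

    # Hypothetical file exported from the notebook
    df = pd.read_csv("plants_clean.csv")

    # Sidebar filters
    all_regions = sorted(df["Region"].dropna().unique())
    regions = st.sidebar.multiselect("Region", all_regions, default=all_regions)
    cap_min = float(df["capacity_ttpa"].min())
    cap_max = float(df["capacity_ttpa"].max())
    capacity_range = st.sidebar.slider(
        "Capacity (ttpa)", cap_min, cap_max, (cap_min, cap_max)
    )

    view = filter_plants(df, regions, capacity_range)

    # KPI metrics
    col1, col2 = st.columns(2)
    col1.metric("Plants", len(view))
    col2.metric("Total capacity (ttpa)", f"{view['capacity_ttpa'].sum():,.0f}")

    # Interactive map and data table
    fig = px.scatter_geo(
        view, lat="latitude", lon="longitude", color="Region", hover_name="plant"
    )
    st.plotly_chart(fig)
    st.dataframe(view)

    st.caption("Data: Global Iron and Steel Tracker, September 2025 release.")
```

To use it, save the code as `dashboard.py`, add a final line that calls `main()`, and start the app with `streamlit run dashboard.py`. Rows with missing capacity are excluded by `between`, which may or may not be what you want for the plant count.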


### Exercise 2: Prepare Data for Dashboard
**Task:** Save your processed data to files that the dashboard will load.
- Export cleaned plant data
- Export merged environmental data
- Export company-level aggregations
- Save as CSV or Parquet for efficient loading


In [None]:
# Save processed datasets



### Exercise 3: Display relevant information from your exploratory analysis in the dashboard

In [None]:
# This cell is for notes/observations about your dashboard
# What works well?
# What could be improved?
# Any performance issues with large datasets?



---
## Lab Summary and Key Takeaways

**What you learned:**
- How to perform EDA on geospatial datasets
- Creating interactive maps with Plotly for geospatial data
- Merging spatial datasets based on geographic proximity
- Aggregating geospatial data at different levels (asset vs. company)
- Building interactive dashboards with Streamlit

**Next Steps:**
- Explore other geospatial libraries (GeoPandas, Folium, Kepler.gl)
- Learn about coordinate reference systems (CRS) and projections
- Practice with other datasets (buildings, utilities, transportation)
- Deploy your dashboard to Streamlit Cloud or other hosting services
