# Geospatial Data Analysis Lab: Steel Plants Dataset


**(15/10/2025) Learning Objectives:**
- Perform exploratory data analysis (EDA) on geospatial datasets
- Visualize geospatial data using interactive maps with Plotly
- Merge environmental data with asset locations
- Aggregate data at the company level
- Integrate geospatial visualizations into a Streamlit dashboard

---


## Part 1: Setup and Data Loading

Import the necessary libraries and load the steel plants dataset.


In [33]:
# Import required libraries
# - pandas for data manipulation
# - numpy for numerical operations
# - plotly.express and plotly.graph_objects for interactive visualizations
# - Any other libraries you might need

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import geopandas as gpd
from shapely.geometry import Point
from geopy.distance import great_circle
import folium
import streamlit as st

In [2]:
# Load the steel plants dataset
# Expected columns: plant_id, plant_name, company, latitude, longitude, capacity, year_built, etc.

# Download the global iron and steel plant tracker dataset

---
## Part 2: Exploratory Data Analysis (15 minutes)

Answer the following questions through your analysis:


### Question 1: Data Overview
**Task:** Display basic information about the dataset.
- How many steel plants are in the dataset?
- What are the column names and data types?
- Are there any missing values?
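As a rough sketch of the three checks listed above, the snippet below uses a small made-up table (`toy` is a stand-in, not the real dataset). It also counts the literal string `"unknown"` as missing; that placeholder convention is an assumption based on how values appear in the preview, and `isna()` alone would not catch it.

```python
import pandas as pd

# Toy stand-in for the plant table; the values are illustrative only.
toy = pd.DataFrame({
    "Plant ID": ["P1", "P2", "P3", "P4"],
    "Workforce size": ["1000", "unknown", "2400", None],
    "SOE Status": [None, "Partial", None, "Full"],
})

# Number of plants (rows) and number of columns
n_rows, n_cols = toy.shape

# Column names and data types
dtypes = toy.dtypes

# Cells that are genuinely empty (NaN / None)
true_missing = toy.isna().sum()

# Assumption: the string "unknown" is a placeholder that should also count as missing
with_placeholders = toy.mask(toy.eq("unknown")).isna().sum()

print(n_rows, n_cols)
print(dtypes)
print(true_missing)
print(with_placeholders)
```

Comparing the two counts gives a sense of how much missingness is hidden behind placeholder strings.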


In [3]:
# Load the dataset and preview the first rows (df.shape reports rows and columns)

# Define the file path
file_path = "data/Plant-level-data-Global-Iron-and-Steel-Tracker-September-2025-V1.xlsx"

# Read the sheet named "Plant data"
df = pd.read_excel(file_path, sheet_name="Plant data")

# Display the first few rows
df.head()


Unnamed: 0,Plant ID,Plant name (English),Plant name (other language),Other plant names (English),Other plant names (other language),Owner,Owner (other language),Owner GEM ID,Owner PermID,SOE Status,...,Steel products,Steel sector end users,Workforce size,ISO 14001,ISO 50001,ResponsibleSteel Certification,Main production equipment,Power source,Iron ore source,Met coal source
0,P100000120004,Kurum International Elbasan steel plant,Kurum Kombinati metalurgjik,,,Kurum International ShA,,E100000130992,5037939021,,...,"billet, wire rod, rebar",unknown,1000,Yes,unknown,No,EAF,"Hydraulic, integrated plants; Four hydropower ...",unknown,unknown
1,P100000120439,Algerian Qatari Steel Jijel plant,الجزائرية القطرية للصلب,AQS,,Algerian Qatari Steel,,E100001000957,5076384326,Partial,...,"billet, wire rod, rebar",unknown,2400,Yes,unknown,No,EAF; DRI,unknown,unknown,unknown
2,P100000120442,ETRHB Annaba steel plant,,,,ETRHB Industrie SpA,,E100001010275,5074513855,,...,unknown,unknown,2000,unknown,unknown,No,EAF,unknown,unknown,unknown
3,P100000121198,Ozmert Algeria steel plant,,,,Ozmert Algeria SARL,,E100001012196,unknown,,...,unknown,unknown,unknown,unknown,unknown,No,EAF; DRI,unknown,Alwaznah and Bu Khadhrah mines,Bechar
4,P100000120440,Sider El Hadjar Annaba steel plant,مركب الحجار للحديد والصلب,"ArcelorMittal Annaba (predecessor), El Hadjar ...",,Groupe Industriel Sider SpA,,E100001000960,5000941519,Full,...,"coil, rebar, sheet",unknown,5748,unknown,unknown,No,BF; BOF; EAF; DRI,unknown,unknown,unknown


In [4]:
# Display column information and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1209 entries, 0 to 1208
Data columns (total 44 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   Plant ID                            1209 non-null   object
 1   Plant name (English)                1209 non-null   object
 2   Plant name (other language)         697 non-null    object
 3   Other plant names (English)         702 non-null    object
 4   Other plant names (other language)  287 non-null    object
 5   Owner                               1209 non-null   object
 6   Owner (other language)              554 non-null    object
 7   Owner GEM ID                        1209 non-null   object
 8   Owner PermID                        1209 non-null   object
 9   SOE Status                          202 non-null    object
 10  Parent                              1209 non-null   object
 11  Parent GEM ID                       1209 non-null   obje

In [5]:
# Check for missing values

missing_values = df.isnull().sum().sort_values(ascending=False)
print("Missing values per column:" , missing_values)


Missing values per column: SOE Status                            1007
Other plant names (other language)     922
Other language location address        764
Owner (other language)                 655
Ferronickel capacity (ttpa)            625
Coking plant capacity (ttpa)           615
Plant name (other language)            512
Other plant names (English)            507
Pelletizing plant capacity (ttpa)      491
Sinter plant capacity (ttpa)           484
Plant age (years)                       67
Coordinate accuracy                      1
Iron ore source                          1
Parent GEM ID                            0
Owner PermID                             0
Parent                                   0
Owner                                    0
Plant ID                                 0
Plant name (English)                     0
Owner GEM ID                             0
Coordinates                              0
Region                                   0
GEM wiki page              

### Question 2: Statistical Summary
**Task:** Generate descriptive statistics for numerical columns.
- What is the average plant capacity?
- What is the range of latitudes and longitudes?
- What is the distribution of plant ages?
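A single mean says little about how plant ages are spread. The following sketch, on made-up ages, shows one possible way to summarize and bin a numeric column; the bin edges and labels are arbitrary choices for illustration, not something prescribed by the lab.

```python
import pandas as pd

# Illustrative ages only; None stands for a plant with no recorded age.
ages = pd.Series([5, 12.86, 42, None, 67, 110], name="Plant age (years)")

# count, mean, std, min, quartiles, max (NaN is ignored)
summary = ages.describe()

# Left-closed bins: [0, 20), [20, 40), ... with an open-ended last band
bins = [0, 20, 40, 60, 80, float("inf")]
labels = ["0-20", "20-40", "40-60", "60-80", "80+"]
age_bands = pd.cut(ages, bins=bins, labels=labels, right=False).value_counts(sort=False)

print(summary)
print(age_bands)
```

The same pattern would apply to `df['Plant age (years)']` once it has been converted with `pd.to_numeric`.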


In [6]:
# Display descriptive statistics
df.describe()


Unnamed: 0,Plant ID,Plant name (English),Plant name (other language),Other plant names (English),Other plant names (other language),Owner,Owner (other language),Owner GEM ID,Owner PermID,SOE Status,...,Steel products,Steel sector end users,Workforce size,ISO 14001,ISO 50001,ResponsibleSteel Certification,Main production equipment,Power source,Iron ore source,Met coal source
count,1209,1209,697,702,287,1209,554,1209,1209,202,...,1209,1209,1209,1209,1209,1209,1209,1209,1208,1209
unique,1209,1209,688,697,284,988,506,988,525,2,...,679,48,648,329,238,24,40,373,145,74
top,P100000120004,Kurum International Elbasan steel plant,包头市大安钢铁有限责任公司,Meijin Steel Company,新兴铸管新疆有限公司,Nucor Corp,日本製鉄株式会社,E100001010181,unknown,Partial,...,unknown,unknown,unknown,unknown,unknown,No,EAF,unknown,unknown,unknown
freq,1,1,2,2,2,13,10,13,485,149,...,132,558,242,416,725,1173,469,779,1048,1121


In [7]:
df[['Latitude', 'Longitude']] = df['Coordinates'].str.split(',', expand=True)
df['Latitude'] = pd.to_numeric(df['Latitude'], errors='coerce')
df['Longitude'] = pd.to_numeric(df['Longitude'], errors='coerce')
numeric_cols = ['Plant age (years)', 'Workforce size']
for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Range of latitudes and longitudes
lat_range = (df['Latitude'].min(), df['Latitude'].max())
lon_range = (df['Longitude'].min(), df['Longitude'].max())
print(f"Latitude range: {lat_range}")
print(f"Longitude range: {lon_range}")

# Average plant age
if 'Plant age (years)' in df.columns:
    avg_age = df['Plant age (years)'].mean()
    print(f"Average plant age: {avg_age:.2f} years")

Latitude range: (np.float64(-37.831379), np.float64(67.189096))
Longitude range: (np.float64(-123.163599), np.float64(174.728098))
Average plant age: 39.53 years



### Question 3: Geographic Distribution
**Task:** Analyze the geographic distribution of steel plants.
- Which countries/regions have the most steel plants?
- What is the distribution of plants by company?


In [8]:
# Count plants by country/region

if 'Country/Area' in df.columns:
    plants_by_country = df['Country/Area'].value_counts()
    print(plants_by_country)


Country/Area
China            404
India            108
United States     87
Iran              47
Japan             42
                ... 
Qatar              1
Sri Lanka          1
Slovenia           1
Singapore          1
Uganda             1
Name: count, Length: 89, dtype: int64


In [9]:
# Count plants by company

if 'Parent' in df.columns:
    plants_by_companies = df['Parent'].value_counts().head(10)
    print(plants_by_companies)

Parent
Nucor Corp [100.0%]              20
ArcelorMittal SA [100.0%]        19
Nippon Steel Corp [100.0%]       18
Cleveland-Cliffs Inc [100.0%]    12
JSW Steel Ltd [100.0%]           11
GFG Alliance Ltd                 11
Gerdau SA [98.2%]                 8
JFE Holdings Inc [100.0%]         8
Commercial Metals Co [100.0%]     8
Riva Forni Elettrici SpA          8
Name: count, dtype: int64
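
The counts above group on the raw `Parent` string, which carries ownership shares such as `[100.0%]` and can list several owners separated by semicolons. A minimal sketch of one way to normalize these strings before counting is shown below; the helper name `parent_names` and the regular expression are hypothetical and assume the share always appears as a bracketed percentage.

```python
import re

# Matches a bracketed ownership share such as " [98.2%]"
SHARE_PATTERN = re.compile(r"\s*\[[\d.]+%\]")

def parent_names(parent_field: str) -> list:
    """Split a Parent cell on ';' and drop the '[xx.x%]' share from each owner."""
    parts = [p.strip() for p in parent_field.split(";")]
    return [SHARE_PATTERN.sub("", p).strip() for p in parts if p]

print(parent_names("Nucor Corp [100.0%]"))
print(parent_names("GFG Alliance Ltd"))
print(parent_names("Gerdau SA [98.2%]; unknown [1.8%]"))
```

With names cleaned this way, `Nucor Corp [100.0%]` and a hypothetical `Nucor Corp [50.0%]` entry would count toward the same company.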


### Question 4: Capacity Analysis
**Task:** Analyze the capacity distribution.
- What is the total global steel production capacity?
- Which companies have the highest total capacity?
- How does capacity vary by region?
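When summing capacity by group, pandas returns `0.0` for a group whose values are all missing, which can look like a real zero. The sketch below, on a made-up table with a hypothetical `capacity_ttpa` column, contrasts the default behavior with `min_count=1`, which keeps such groups as `NaN`.

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN means no capacity was reported.
toy = pd.DataFrame({
    "Region": ["Asia Pacific", "Asia Pacific", "Europe", "Africa"],
    "capacity_ttpa": [3400.0, np.nan, 1000.0, np.nan],
})

# Default: an all-NaN group sums to 0.0
by_region_default = toy.groupby("Region")["capacity_ttpa"].sum()

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN
by_region_strict = toy.groupby("Region")["capacity_ttpa"].sum(min_count=1)

print(by_region_default)
print(by_region_strict)
```

The strict variant makes it easier to tell "no data" apart from "zero capacity" in later rankings and maps.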


In [10]:
# Calculate total capacity

# Convert capacity columns if they exist
capacity_cols = [col for col in df.columns if 'capacity' in col.lower()]
for col in capacity_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Calculate total capacities (if available)
for col in capacity_cols:
    total = df[col].sum(skipna=True)
    print(f"Total {col}: {total:.2f}")


Total Ferronickel capacity (ttpa): 9634.00
Total Sinter plant capacity (ttpa): 785142.00
Total Coking plant capacity (ttpa): 229667.00
Total Pelletizing plant capacity (ttpa): 379800.00


In [11]:
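# Group by company and sum capacity
# Note: capacity_cols[0] is 'Ferronickel capacity (ttpa)' (first in the totals above),
# so this ranks parents by ferronickel capacity, not by crude steel capacity.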
if capacity_cols:
    main_cap_col = capacity_cols[0]
    capacity_by_parent = df.groupby('Parent')[main_cap_col].sum().sort_values(ascending=False).head(15)
    print("Parents by Capacity:", capacity_by_parent)

Parents by Capacity: Parent
Guangxi Beibu Gulf International Port Group Co Ltd [100.0%]    3400.0
Baobab Steel Ltd [100.0%]                                      2500.0
Jiangsu Delong Nickel Industry Co Ltd [100.0%]                 1000.0
Shanghai Decent Investment (Group) Co Ltd [34.7%]; TSINGSHAN Holding Group Co Ltd [30.5%]; unknown [26.1%]; Zhejiang Qingshan Business Management Co Ltd [8.7%]    1000.0
MSP Steel & Power Ltd [100.0%]

---
## Part 3: Geospatial Visualization with Plotly (15 minutes)

Create interactive maps to visualize the steel plants' locations and characteristics.


### Exercise 1: Basic Scatter Map
**Task:** Create a scatter map showing all steel plant locations.
- Use latitude and longitude for positioning
- Color points by country or region
- Add hover information showing plant name, company, and capacity


In [12]:
# Create a scatter_geo or scatter_mapbox plot

fig = px.scatter_geo(
    df,
    lat="Latitude",
    lon="Longitude",
    color="Country/Area",
    hover_name="Plant name (English)",
    hover_data=["Owner", "Plant age (years)"],
    title="Global Steel Plants by Country/Area",
    projection="natural earth"
)
fig.show()


### Exercise 2: Sized Markers by Capacity
**Task:** Create a map where marker size represents plant capacity.
- Larger markers for higher capacity plants
- Color by company
- Include interactive hover details


In [13]:
# Create scatter map with size parameter based on capacity

if capacity_cols:
    main_cap_col = capacity_cols[0]

    # Remove NaN or zero values to prevent Plotly errors
    df_cap = df[df[main_cap_col].notna() & (df[main_cap_col] > 0)].copy()

    # If all missing, give a fallback warning
    if df_cap.empty:
        print(f"No valid numeric data in {main_cap_col} for visualization.")
    else:
        fig = px.scatter_geo(
            df_cap,
            lat="Latitude",
            lon="Longitude",
            color="Owner",
            size=main_cap_col,
            hover_name="Plant name (English)",
            hover_data=["Country/Area", main_cap_col],
            title=f"Global Steel Plants (Marker Size by {main_cap_col})",
            projection="natural earth"
        )
        fig.show()

### Exercise 3: Density Heatmap
**Task:** Create a density map showing concentration of steel plants.
- Use Plotly's `density_map` (the successor to the deprecated `density_mapbox`) to show clustering
- Identify regions with high plant density


In [14]:
# Create density heatmap
# Hint: use plotly.express.density_map(); density_mapbox() is deprecated
fig = px.density_map(
    df,
    lat="Latitude",
    lon="Longitude",
    radius=10,
    hover_name="Plant name (English)",
    hover_data=["Country/Area", "Owner"],
    map_style="carto-positron",
    title="Density of Global Steel Plants",
    height=600,
    zoom=1
)
fig.show()



---
## Part 4: Merging Environmental Data with Assets

Integrate environmental data (e.g., air quality, emissions, proximity to water sources) with steel plant locations.


### Exercise 1: Load Environmental Data
**Task:** Load the environmental dataset and inspect it.

- [LitPop database](https://www.research-collection.ethz.ch/entities/researchdata/12dcfc4f-9d03-463a-8d6b-76c0dc73cdc8)

- Expected columns: location_id, latitude, longitude, population density, activity, etc.
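
The metadata file loaded below turns out to be country-level (it has `country_name` and `iso3` but no coordinates), so one plausible way to attach it to plants is a join on country name. The sketch below uses made-up rows and a placeholder `metric` column. The `indicator=True` flag is a convenient way to list country names that fail to match, for example because of spelling differences between the two sources.

```python
import pandas as pd

# Illustrative rows only; "metric" is a placeholder for any country-level value.
plants = pd.DataFrame({
    "Plant ID": ["P1", "P2", "P3"],
    "Country/Area": ["Albania", "Algeria", "Türkiye"],
})
countries = pd.DataFrame({
    "country_name": ["Albania", "Algeria", "Turkey"],
    "metric": [1.0, 2.0, 3.0],
})

# Left join keeps every plant; _merge records whether a country row was found
merged = plants.merge(
    countries,
    left_on="Country/Area",
    right_on="country_name",
    how="left",
    indicator=True,
)

# Country names present in the plant table but absent from the metadata
unmatched = merged.loc[merged["_merge"] == "left_only", "Country/Area"].unique().tolist()
print(unmatched)
```

Unmatched names could then be fixed with a small manual mapping, or the join could be done on ISO codes if both tables carry them.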


In [15]:
# Load environmental data
env_path = "/content/_metadata_countries_v1_2.csv"
env = pd.read_csv(env_path)


FileNotFoundError: [Errno 2] No such file or directory: '/content/_metadata_countries_v1_2.csv'

In [None]:
# Inspect environmental data
print(env.head())

  country_name iso3  region_id  included  total_value [USD] data_source  \
0        Aruba  ABW        533         1       3.304838e+09         nfw   
1  Afghanistan  AFG          4         1       2.554957e+10         nfw   
2       Angola  AGO         24         1       1.360000e+11         nfw   
3     Anguilla  AIA        660         1       2.187659e+08         nfw   
4      Albania  ALB          8         1       4.388946e+10          pc   

   evaluation  produced_capital [USD]     GDP [USD]  GDP_year  \
0           0                     NaN  2.649721e+09    2014.0   
1           0                     NaN  2.048487e+10    2014.0   
2           0                     NaN  1.460000e+11    2014.0   
3           0                     NaN  1.754000e+08    2009.0   
4           0            4.388946e+10  1.322825e+10    2014.0   

   GDP-to-NFW_ratio     NFW [USD] GPW_highest_admin_level  \
0           1.24724  3.304838e+09                       2   
1           1.24724  2.554955e+10   

In [None]:
# Inspect environmental data

print(env.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   country_name             250 non-null    object 
 1   iso3                     250 non-null    object 
 2   region_id                250 non-null    int64  
 3   included                 250 non-null    int64  
 4   total_value [USD]        224 non-null    float64
 5   data_source              250 non-null    object 
 6   evaluation               250 non-null    int64  
 7   produced_capital [USD]   141 non-null    float64
 8   GDP [USD]                232 non-null    float64
 9   GDP_year                 233 non-null    float64
 10  GDP-to-NFW_ratio         225 non-null    float64
 11  NFW [USD]                224 non-null    float64
 12  GPW_highest_admin_level  250 non-null    object 
 13  GPW_Number_of_regions    250 non-null    int64  
dtypes: float64(6), int64(4), o

### Exercise 2: Spatial Join or Nearest Neighbor Matching
**Task:** Merge environmental data with steel plants based on geographic proximity.
- Use nearest neighbor matching or spatial join
- Consider using geopandas for distance calculations
- Match each plant to the nearest environmental monitoring station
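
A minimal sketch of nearest-neighbor matching follows, assuming a hypothetical set of monitoring stations given as `(latitude, longitude)` pairs; none of the station names or coordinates come from the lab data. It uses a hand-written haversine distance and a brute-force search, which is fine for a few thousand points. The `great_circle` function imported from geopy in Part 1 could stand in for the helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))

def nearest_station(plant, stations):
    """Return (station_id, distance_km) for the station closest to the plant."""
    lat, lon = plant
    return min(
        ((sid, haversine_km(lat, lon, s_lat, s_lon)) for sid, (s_lat, s_lon) in stations.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical stations and plant location (city-centre coordinates)
stations = {"S_paris": (48.8566, 2.3522), "S_tokyo": (35.6762, 139.6503)}
plant = (50.8503, 4.3517)  # Brussels

station_id, distance_km = nearest_station(plant, stations)
print(station_id, round(distance_km, 1))
```

For larger station sets, a spatial index (for example a tree-based nearest-neighbor search) would likely be a better fit than this linear scan.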


In [None]:
# Merge datasets



### Exercise 3: Visualize Merged Data
**Task:** Create a map showing steel plants colored by environmental metrics.
- Color plants by air quality index or other environmental indicators
- Size by capacity
- Add hover details with both plant and environmental information


In [None]:
# Create visualization of merged data



---
## Part 5: Company-Level Aggregation

Aggregate data at the company level to analyze corporate footprints.


### Exercise 1: Aggregate Metrics by Company
**Task:** Group plants by company and calculate aggregate metrics.
- Total capacity per company
- Number of plants per company
- Average environmental metrics per company
- Geographic spread (e.g., number of countries)
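
Certification columns hold text rather than numbers, so a company-level "average" needs an explicit encoding first. The sketch below assumes the values are literally `"Yes"` and `"No"`, with anything else (such as `"unknown"`) treated as missing; the toy table and company names are made up.

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({
    "Parent": ["A", "A", "A", "B", "B"],
    "ISO 14001": ["Yes", "No", "unknown", "Yes", "Yes"],
})

# Map text to 1.0 / 0.0; values not in the mapping become NaN
flag = toy["ISO 14001"].map({"Yes": 1.0, "No": 0.0})

# Share of plants with a known certification status that are certified
share_certified = flag.groupby(toy["Parent"]).mean()
print(share_certified)
```

The resulting share ignores plants with unknown status, so it is worth reporting alongside the number of plants that actually had a known value.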


In [16]:
# Group by company and aggregate

df[main_cap_col] = pd.to_numeric(df[main_cap_col], errors='coerce')

company_agg = (
    df.groupby('Parent', dropna=True)
      .agg(
          total_capacity=(main_cap_col, 'sum'),
          num_plants=('Plant name (English)', 'count'),
          avg_age=('Plant age (years)', 'mean'),
          num_countries=('Country/Area', pd.Series.nunique)
      )
      .reset_index()
)

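# Add mean of each ISO certification if available
# Note: these columns hold text such as 'Yes', 'No', and 'unknown', which
# pd.to_numeric(errors='coerce') turns into NaN, so the averages below are NaN.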
for cert in ['ISO 14001', 'ISO 50001', 'ResponsibleSteel Certification']:
    if cert in df.columns:
        company_agg[f"{cert}_avg"] = (
            df.groupby('Parent')[cert].apply(lambda x: pd.to_numeric(x, errors='coerce').mean())
        ).values

print(company_agg.head())

                                              Parent  total_capacity  \
0                        ABA Çelik Demir LŞ [100.0%]             0.0   
1                          ADV Partners Holdings Ltd             0.0   
2                            Abba Steel Ltd [100.0%]             0.0   
3  Abei Energy; Helvella Holding; Russula SA; Sie...             0.0   
4  Abu Dhabi National for Building Materials Co P...             0.0   

   num_plants  avg_age  num_countries  ISO 14001_avg  ISO 50001_avg  \
0           1    42.00              1            NaN            NaN   
1           1    12.86              1            NaN            NaN   
2           1      NaN              1            NaN            NaN   
3           1      NaN              1            NaN            NaN   
4           1      NaN              1            NaN            NaN   

   ResponsibleSteel Certification_avg  
0                                 NaN  
1                                 NaN  
2                   

### Exercise 2: Company Headquarters or Centroid
**Task:** Calculate a representative location for each company.
- Option 1: Use the centroid of all plant locations
- Option 2: Use the location of the largest plant
- Option 3: Assign actual headquarters coordinates
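
For Option 1, a plain mean of latitudes and longitudes can land far from the plants when a company's sites straddle the 180° meridian or span large distances. One alternative, sketched below under the assumption of a spherical Earth, averages the points as 3D unit vectors and converts the result back to degrees; the function name `spherical_centroid` is hypothetical.

```python
from math import atan2, cos, degrees, radians, sin, sqrt

def spherical_centroid(points):
    """Average (lat, lon) points in degrees as unit vectors on a sphere."""
    if not points:
        raise ValueError("points must not be empty")
    x = y = z = 0.0
    for lat, lon in points:
        phi, lam = radians(lat), radians(lon)
        x += cos(phi) * cos(lam)
        y += cos(phi) * sin(lam)
        z += sin(phi)
    n = len(points)
    x, y, z = x / n, y / n, z / n
    # Project the mean vector back onto latitude / longitude
    return degrees(atan2(z, sqrt(x * x + y * y))), degrees(atan2(y, x))

# Two points on the equator, either side of the 180° meridian
pts = [(0.0, 179.0), (0.0, -179.0)]
naive = (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
centroid = spherical_centroid(pts)

print(naive)     # arithmetic mean of the raw coordinates
print(centroid)  # mean direction on the sphere
```

In this example the arithmetic mean falls at longitude 0, on the opposite side of the globe, while the vector-based result stays next to the two points.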


In [17]:
company_centroid = (
    df.groupby('Parent')
      .agg(Latitude_centroid=('Latitude', 'mean'),
           Longitude_centroid=('Longitude', 'mean'))
      .reset_index()
)

def get_largest_plant(subdf):
    subdf = subdf.dropna(subset=[main_cap_col])
    if subdf.empty:
        return pd.Series({'Latitude_largest': np.nan, 'Longitude_largest': np.nan})
    top = subdf.loc[subdf[main_cap_col].idxmax()]
    return pd.Series({'Latitude_largest': top['Latitude'], 'Longitude_largest': top['Longitude']})

company_largest = df.groupby('Parent').apply(get_largest_plant).reset_index()

# Merge
company_agg = (
    company_agg.merge(company_centroid, on='Parent', how='left')
                .merge(company_largest, on='Parent', how='left')
)

company_agg.head()





Unnamed: 0,Parent,total_capacity,num_plants,avg_age,num_countries,ISO 14001_avg,ISO 50001_avg,ResponsibleSteel Certification_avg,Latitude_centroid,Longitude_centroid,Latitude_largest,Longitude_largest
0,ABA Çelik Demir LŞ [100.0%],0.0,1,42.0,1,,,,36.747413,36.21733,,
1,ADV Partners Holdings Ltd,0.0,1,12.86,1,,,,14.873471,78.048425,,
2,Abba Steel Ltd [100.0%],0.0,1,,1,,,,-17.397866,15.891022,,
3,Abei Energy; Helvella Holding; Russula SA; Sie...,0.0,1,,1,,,,38.688849,-4.107015,,
4,Abu Dhabi National for Building Materials Co P...,0.0,1,,1,,,,24.37872,54.475153,,


### Exercise 3: Visualize Company-Level Data
**Task:** Create a map showing companies with aggregated metrics.
- Show one marker per company at the representative location
- Size by total capacity
- Color by average environmental impact
- Hover information with company summary statistics


In [18]:
# Create company-level visualization

fig = px.scatter_geo(
    company_agg,
    lat='Latitude_centroid',
    lon='Longitude_centroid',
    size='total_capacity',
    color='ISO 14001_avg' if 'ISO 14001_avg' in company_agg else 'num_plants',
    hover_name='Parent',
    hover_data={
        'total_capacity': True,
        'num_plants': True,
        'num_countries': True,
        'avg_age': True
    },
    title='Global Steel Companies by Total Capacity and Environmental Metrics',
    projection='natural earth'
)
fig.show()


---
## Part 6: Streamlit Dashboard Integration

Prepare your visualizations for deployment in a Streamlit dashboard.


### Exercise 1: Create Dashboard Script Structure
**Task:** Create a Streamlit app file (`dashboard.py`) with the following structure:

```python
# Import streamlit and other necessary libraries

# Set page configuration

# Title and description

# Sidebar for filters
# - Company selector
# - Region/country filter
# - Capacity range slider

# Main content area
# - KPI metrics (total plants, total capacity, etc.)
# - Interactive map
# - Data table

# Footer with data sources and notes
```
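
One way to keep the skeleton above easy to test is to put the filtering logic in a plain function that takes the sidebar selections as arguments, so it can be checked without starting Streamlit. The sketch below is an illustration on made-up rows; `apply_filters` is a hypothetical helper, and the `"All"` sentinel is an assumed convention for "no filter".

```python
import pandas as pd

def apply_filters(plants: pd.DataFrame, company: str = "All", country: str = "All") -> pd.DataFrame:
    """Return the rows matching the selections; "All" disables a filter."""
    mask = pd.Series(True, index=plants.index)
    if company != "All":
        mask &= plants["Parent"] == company
    if country != "All":
        mask &= plants["Country/Area"] == country
    return plants[mask]

# Illustrative rows only
toy_plants = pd.DataFrame({
    "Plant ID": ["P1", "P2", "P3"],
    "Parent": ["Nucor Corp [100.0%]", "Nucor Corp [100.0%]", "JSW Steel Ltd [100.0%]"],
    "Country/Area": ["United States", "United States", "India"],
})

print(len(apply_filters(toy_plants)))
print(len(apply_filters(toy_plants, company="Nucor Corp [100.0%]")))
print(len(apply_filters(toy_plants, country="India")))
```

In the dashboard script, the values returned by the sidebar widgets would be passed straight into such a function.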


In [19]:
st.set_page_config(
    page_title="Global Steel Plants Dashboard",
    layout="wide",
    initial_sidebar_state="expanded"
)



### Exercise 1: Prepare Data for Dashboard
**Task:** Save your processed data to files that the dashboard will load.
- Export cleaned plant data
- Export merged environmental data
- Export company-level aggregations
- Save as CSV or Parquet for efficient loading


In [20]:
# Plant-level dataset
df_plants = df.copy()
df_plants.to_csv("data_cleaned_plants.csv", index=False)


In [21]:
# Company-level dataset
# The four source columns are in ttpa (thousand tonnes per annum) and cover
# ancillary units (ferronickel, sinter, coking, pelletizing), not crude steel.
df["capacity_ttpa"] = (
    df[["Ferronickel capacity (ttpa)",
        "Sinter plant capacity (ttpa)",
        "Coking plant capacity (ttpa)",
        "Pelletizing plant capacity (ttpa)"]]
    .fillna(0)
    .sum(axis=1)
)
df_company = (
    df.groupby("Parent")
    .agg(
        total_capacity=("capacity_ttpa", "sum"),
        num_plants=("Plant ID", "count"),
        avg_capacity=("capacity_ttpa", "mean"),
        avg_plant_age=("Plant age (years)", "mean")
    )
    .reset_index()
)
df_company.to_csv("data_company_aggregated.csv", index=False)


In [22]:
# Load the exported datasets; st.cache_data caches the result across reruns
@st.cache_data
def load_data():
    plants = pd.read_csv("data_cleaned_plants.csv")
    companies = pd.read_csv("data_company_aggregated.csv")
    return plants, companies

plants, companies = load_data()



In [23]:
st.sidebar.title("Filters")

selected_company = st.sidebar.selectbox(
    "Select a Company",
    options=["All"] + sorted(companies["Parent"].dropna().unique().tolist())
)

selected_country = st.sidebar.selectbox(
    "Select a Country/Area",
    options=["All"] + sorted(plants["Country/Area"].dropna().unique().tolist())
)

# Capacity values are in ttpa (thousand tonnes per annum)
cap_col = "Ferronickel capacity (ttpa)"
cap_lo, cap_hi = float(plants[cap_col].min()), float(plants[cap_col].max())
capacity_min, capacity_max = st.sidebar.slider(
    "Select Ferronickel Capacity Range (thousand tonnes per annum)",
    cap_lo,
    cap_hi,
    (cap_lo, cap_hi)
)


2025-10-19 10:30:20.120 Session state does not function when running a script without `streamlit run`


In [24]:
filtered_plants = plants.copy()

if selected_company != "All":
    filtered_plants = filtered_plants[filtered_plants["Parent"] == selected_company]

if selected_country != "All":
    filtered_plants = filtered_plants[filtered_plants["Country/Area"] == selected_country]

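# Comparisons against NaN are False, so plants with no ferronickel
# capacity value are dropped by this range filter.
filtered_plants = filtered_plants[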
    (filtered_plants["Ferronickel capacity (ttpa)"] >= capacity_min) &
    (filtered_plants["Ferronickel capacity (ttpa)"] <= capacity_max)
]

In [25]:
total_plants = len(filtered_plants)
total_capacity = filtered_plants["Ferronickel capacity (ttpa)"].sum()
avg_age = filtered_plants["Plant age (years)"].mean()

col1, col2, col3 = st.columns(3)
col1.metric("Total Plants", f"{total_plants:,}")
col2.metric("Total Ferronickel Capacity (ttpa)", f"{total_capacity:,.0f}")
col3.metric("Average Plant Age", f"{avg_age:.1f} years")




DeltaGenerator()

In [26]:
st.markdown("### Steel Plant Locations")

# scatter_mapbox is deprecated; scatter_map is its replacement
fig = px.scatter_map(
    filtered_plants,
    lat="Latitude",
    lon="Longitude",
    color="Country/Area",
    size="Ferronickel capacity (ttpa)",
    hover_name="Plant name (English)",
    hover_data=["Parent", "Plant age (years)"],
    map_style="carto-positron",
    zoom=1,
    height=600
)
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0})
st.plotly_chart(fig, use_container_width=True)



DeltaGenerator()

In [27]:
st.markdown("### Company-Level Aggregation")

# width="stretch" replaces the deprecated use_container_width=True
st.dataframe(
    companies.sort_values(by="total_capacity", ascending=False).head(20),
    width="stretch"
)


DeltaGenerator()

In [28]:
st.markdown("---")
st.markdown(
    """
    **Data Source:** Global Energy Monitor (2025)
    **Author:** Yuhan
    **Note:** This dashboard is for educational and analytical purposes.
    """
)



DeltaGenerator()

In [29]:
df.to_csv("data_cleaned_plants.csv", index=False)

In [30]:
company_agg.to_csv("data_company_aggregated.csv", index=False)

print("Data successfully saved for Streamlit dashboard.")

Data successfully saved for Streamlit dashboard.


In [31]:
import os
os.makedirs("data", exist_ok=True)
df.to_csv("data/data_cleaned_plants.csv", index=False)
company_agg.to_csv("data/data_company_aggregated.csv", index=False)


### Exercise 2: Display relevant information from your exploratory analysis into the dashboard

In [32]:
# This cell is for notes/observations about your dashboard
# What works well?
# What could be improved?
# Any performance issues with large datasets?



What works well:
- Interactive map and filters work smoothly.
- Users can instantly see which regions/companies have the highest capacity.
- Data caching makes the dashboard load faster.

What could be improved:
- Add more environmental indicators (e.g., a visualization of ISO certifications).
- Include a time-series view of plant age or capacity growth.
- Allow multi-country selection in sidebar filters.

Performance issues:
- Large datasets (>10k rows) may slow down map rendering.
- Consider saving data in Parquet format for faster loading.


---
## Lab Summary and Key Takeaways

**What you learned:**
- How to perform EDA on geospatial datasets
- Creating interactive maps with Plotly for geospatial data
- Merging spatial datasets based on geographic proximity
- Aggregating geospatial data at different levels (asset vs. company)
- Building interactive dashboards with Streamlit

**Next Steps:**
- Explore other geospatial libraries (GeoPandas, Folium, Kepler.gl)
- Learn about coordinate reference systems (CRS) and projections
- Practice with other datasets (buildings, utilities, transportation)
- Deploy your dashboard to Streamlit Cloud or other hosting services
