### Load raw datasets (CSV and Excel files), inspect, clean, and standardize them for analysis
1. **Dataset Cleaning & Separation**
   The OWID dataset is separated into **countries** (rows with ISO codes) and **regions/aggregates** (rows without ISO codes). Missing values are imputed or interpolated for population, GDP, and key CO₂ metrics.
2. **Regional/Aggregate Updates**
   Missing regional or aggregate values in the OWID dataset are filled from the corresponding country-level data.
3. **World Governance Indicators (WGI)**
   The WGI dataset provides multiple governance indicators per country per year.
   Rows with missing ISO3 codes are written to `wgi_missing_iso.csv`, while rows with valid ISO3 codes go into `wgi_countries.csv` for safe merging with OWID/IRENA data.
4. **Final Dataset Ready for Analysis**
   Cleaned datasets have minimal missing values and enriched metrics, and are ready for visualizations, trend analysis, and country/region comparisons.

In [None]:
# Import libraries
import sys, os
import warnings
import importlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd  # used below for pd.errors.* warning classes and DataFrame handling

In [None]:
## Project setup: configure paths and imports for accessing modules and data files
import sys
from pathlib import Path

# Set project root
project_root = Path.cwd().parent

# Add project_root and project_scripts to sys.path for imports
project_scripts = project_root / "project_scripts"
for p in [project_root, project_scripts]:
    if str(p) not in sys.path:
        sys.path.insert(0, str(p))

# Now you can import your project modules
try:
    import project_path_setup # type: ignore
    print("Imported project_path_setup from:", project_path_setup.__file__)
except ModuleNotFoundError:
    print("project_path_setup not found. Use project_root and project_scripts directly.")


In [None]:
# Convert IRENA Excel sheets to CSV files
# Project Setup and Imports 
from project_scripts import project_path_setup
from project_scripts.data_handler import DataHandler

# Project paths (from project_path_setup.py)
project_root = project_path_setup.project_root
project_scripts = project_path_setup.project_scripts

# Raw data folder
raw_dir = project_root / "data" / "raw"

# Create a DataHandler instance
handler = DataHandler(filepath_list=[])

# Convert selected IRENA Excel sheets to CSV
handler.excel_to_csv(
    excel_path=raw_dir / "IRENA_renewable_energy_data.xlsx",
    output_dir=raw_dir,
    sheets=["Pivot", "Country", "Region ", "Global"],
    prefix="irena"
)
#Path.cwd()
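
`DataHandler.excel_to_csv` is a project helper whose implementation is not shown in this notebook. As a rough idea of what such a conversion involves, the sketch below uses plain pandas. The function names `sheet_csv_name` and `excel_sheets_to_csv` are hypothetical, and the naming rule (prefix plus the stripped, lowercased sheet name) is only inferred from the file names used here, such as `irena_country.csv` and `wgi_sheet1.csv`.

```python
from pathlib import Path

import pandas as pd


def sheet_csv_name(prefix: str, sheet: str) -> str:
    """Build a CSV file name from a prefix and a sheet name (assumed naming rule)."""
    return f"{prefix}_{sheet.strip().lower().replace(' ', '_')}.csv"


def excel_sheets_to_csv(excel_path, output_dir, sheets, prefix):
    """Write each requested sheet of an Excel workbook to its own CSV file."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sheet in sheets:
        # sheet_name must match the workbook exactly, including trailing spaces
        df = pd.read_excel(excel_path, sheet_name=sheet)
        out_path = output_dir / sheet_csv_name(prefix, sheet)
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Note that the sheet name `"Region "` in the call above carries a trailing space; under the assumed rule it would still map to `irena_region.csv`.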

In [None]:
#Convert WGI Excel file to CSV
handler = DataHandler(filepath_list=[])

handler.excel_to_csv(
    excel_path=raw_dir / "wgi_dataset.xlsx",
    output_dir=raw_dir,
    sheets=["Sheet1"],
    prefix="wgi"
)
# Rename the WGI Sheet1 CSV to wgi_countries.csv
# Path.replace overwrites an existing target, so the cell can be re-run safely on any OS
(raw_dir / "wgi_sheet1.csv").replace(raw_dir / "wgi_countries.csv")

Analyse the CSV data files

In [None]:
# Load and Inspect Each CSV
import pandas as pd  # required for pd.errors.DtypeWarning below

# Path to raw data folder
raw_dir = project_root / "data" / "raw"

# List all CSV files
csv_files = sorted(raw_dir.glob("*.csv"))

print("Found CSV files:")
for f in csv_files:
    print("-", f.name)

# Initialize handler
handler = DataHandler(filepath_list=csv_files)

# Silence mixed-dtype warnings before any file is read
warnings.simplefilter(action='ignore', category=pd.errors.DtypeWarning)

# Load and inspect each CSV
for csv_file in csv_files:
    try:
        df = handler.load_file(csv_file)  # load the CSV
        if df is not None and not df.empty:
            print(f"\nDataset: {csv_file.name}")
            print("Shape:", df.shape)
            print("Columns:", df.columns.tolist())
            #print("Column Info:", df.info())
            #print("Describe Statistics:", df.describe())
            #print(df.head(3))  # first 3 rows
        else:
            print(f"Skipped {csv_file.name} (empty or failed to load)")
    except Exception as e:
        print(f"Error loading {csv_file.name}: {e}")

## Load, Inspect, Clean & Save Raw Datasets
1. IRENA Renewable Energy (Country Level)
2. OWID CO₂ Emissions
3. WGI World Governance Indicators (Country Level)

For each dataset:
- Load it using DataHandler
- Inspect its structure
- Clean the column names and convert data types
- Generate ISO3 country codes
- Check missing values
- Save the cleaned dataset to `data/clean/`
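
The "generate ISO3 country codes" step is handled inside `DataHandler.clean_data()`, which is not shown here. The sketch below only illustrates the idea of mapping normalised country names to ISO3 codes. The lookup table `ISO3_LOOKUP` and the function `add_country_iso` are hypothetical, and a real pipeline would rely on a complete ISO 3166 table rather than three hand-written entries.

```python
import pandas as pd

# Hypothetical lookup for illustration only (a real table would cover all countries)
ISO3_LOOKUP = {"algeria": "DZA", "afghanistan": "AFG", "albania": "ALB"}


def add_country_iso(df: pd.DataFrame, country_col: str = "country") -> pd.DataFrame:
    """Add a `country_iso` column; names without a match stay missing."""
    out = df.copy()
    # Normalise names so that case and stray whitespace do not block a match
    keys = out[country_col].astype(str).str.strip().str.lower()
    out["country_iso"] = keys.map(ISO3_LOOKUP)
    return out


demo = pd.DataFrame({"country": ["Algeria", " albania ", "Atlantis"]})
demo_iso = add_country_iso(demo)
print(demo_iso)
```

Unmatched names end up with a missing `country_iso`, which is the kind of row that can later be separated out before merging.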

## 1. IRENA Renewable Energy Dataset (Country Level)

This dataset contains, for every country and energy technology:
- renewable energy capacity
- electricity generation
- heat generation
- financial flows

In [9]:
#Inspect IRENA dataset
irena_path = raw_dir / "irena_country.csv"

handler = DataHandler(filepath_list=[])
irena_raw = handler.load_file(irena_path)

print("IRENA Shape:", irena_raw.shape)
print("Columns:", irena_raw.columns.tolist())
irena_raw.head(3)

IRENA Shape: (91743, 17)
Columns: ['Region', 'Sub-region', 'Country', 'ISO3 code', 'M49 code', 'RE or Non-RE', 'Group Technology', 'Technology', 'Sub-Technology', 'Producer Type', 'Year', 'Electricity Generation (GWh)', 'Electricity Installed Capacity (MW)', 'Heat Generation (TJ)', 'Public Flows (2022 USD M)', 'SDG 7a1 Intl. Public Flows (2022 USD M)', 'SDG 7b1 RE capacity per capita (W/inhabitant)']


Unnamed: 0,Region,Sub-region,Country,ISO3 code,M49 code,RE or Non-RE,Group Technology,Technology,Sub-Technology,Producer Type,Year,Electricity Generation (GWh),Electricity Installed Capacity (MW),Heat Generation (TJ),Public Flows (2022 USD M),SDG 7a1 Intl. Public Flows (2022 USD M),SDG 7b1 RE capacity per capita (W/inhabitant)
0,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,2000,,,,,,0.0
1,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,2001,,,,,,0.0
2,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,2002,,,,,,0.0


### Clean IRENA Dataset
- standardize column names to lowercase_underscore
- convert the `year` column to numeric
- create ISO3 country codes
- remove duplicates
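
These steps are performed by `DataHandler.clean_data()`. A minimal stand-alone sketch of the same idea is given here, assuming the renaming rule is "strip, lowercase, replace spaces with underscores", which is consistent with cleaned names such as `heat_generation_(tj)` used later in this notebook. The helper `standardize_columns` is hypothetical and is not the project's implementation.

```python
import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase_underscore column names."""
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out


raw = pd.DataFrame({
    "Country": ["Algeria", "Algeria"],
    "Year": ["2000", "2000"],
    "Heat Generation (TJ)": [1.5, 1.5],
})

clean = standardize_columns(raw)
clean["year"] = pd.to_numeric(clean["year"], errors="coerce")  # strings -> numbers
clean = clean.drop_duplicates()                                # drop the repeated row
print(clean.columns.tolist())
```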

In [None]:
#standardise column name
irena_handler = DataHandler(
    filepath_list=[],
    country_col="Country",   # original column name (case-sensitive)
    year_col="Year"
)

irena_handler.df = irena_raw.copy()

irena_clean = irena_handler.clean_data()
irena_clean.head(3)

In [None]:
# Missing Values (IRENA)
print("Shape" , irena_clean.shape)
irena_clean.isna().sum().sort_values(ascending=False).head(10)

1. Funding columns (SDG & public flows) → replace missing/NaN with 0, on the assumption that no data means no funding.
2. Heat & electricity columns → keep missing values as NaN: unreported energy data is not zero, it is simply unknown.

In [None]:
# Cleaning Missing Values for IRENA Country Dataset
print("\nMissing values BEFORE cleaning:")
print(irena_clean.isna().sum().sort_values(ascending=False).head(10).tolist())

# Funding columns → fill missing with 0
funding_cols = [
    col for col in irena_clean.columns
    if "flows" in col.lower()  # matches both public_flows and sdg_7a1 flows
]

irena_clean[funding_cols] = irena_clean[funding_cols].fillna(0)

# Print summary after cleaning
print("\nMissing values AFTER cleaning:")
print(irena_clean.isna().sum().sort_values(ascending=False).head(10).tolist())


In [None]:
import pandas as pd

# Check missing values count and percentage
missing_count = irena_clean.isna().sum()
missing_percent = irena_clean.isna().mean() * 100

missing_summary = pd.DataFrame({
    'missing_count': missing_count,
    'missing_percent': missing_percent
})

print(missing_summary)


In [None]:
# Numeric correlation
numeric_cols = ['heat_generation_(tj)', 
                'electricity_generation_(gwh)', 
                'electricity_installed_capacity_(mw)',
                'sdg_7b1_re_capacity_per_capita_(w/inhabitant)']

print(irena_clean[numeric_cols].corr())

# Boxplots by region to see patterns
plt.figure(figsize=(12,6))
sns.boxplot(data=irena_clean, x='region', y='heat_generation_(tj)')
plt.xticks(rotation=45)
plt.title("Heat Generation by Region")
plt.show()

plt.figure(figsize=(12,6))
sns.boxplot(data=irena_clean, x='region', y='electricity_generation_(gwh)')
plt.xticks(rotation=45)
plt.title("Electricity Generation by Region")
plt.show()


In [None]:
# Scatter plot with median line overlay
plt.figure(figsize=(10,6))

sns.scatterplot(
    data=irena_clean,
    x='region',
    y='heat_generation_(tj)',
    alpha=0.4
)

sns.lineplot(
    data=irena_clean,
    x='region',
    y='heat_generation_(tj)',
    estimator='median',
    errorbar=None,
    color='red'
)

plt.title("Heat Generation by Region (with Median Line)")
plt.show()



The plot shows heat generation for individual countries grouped by region. Most countries generate very little heat, while a small number of countries, especially in Asia and Europe, generate extremely large amounts. Because many values are close to zero and some regions have missing data, the regional median line stays near zero and Africa does not appear in the plot.

Some countries may not report all energy data, so many values are missing.
For heat generation and installed electricity capacity, we fill missing values using the median of their region, on the assumption that countries in the same region have similar energy patterns.
For electricity generation, we use a ratio method: we multiply a country's installed capacity by the median generation-to-capacity ratio to estimate missing values.
For renewable capacity per person, we use the region median, and if the region has no data, we use the overall median.
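
To make the region-median rule concrete, here is a small worked example on made-up numbers (the `toy` frame and its values are purely illustrative). It fills gaps with the region median and falls back to the overall median of the observed values when a region has no data at all.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "region": ["Asia", "Asia", "Asia", "Europe", "Europe", "Oceania"],
    "heat": [10.0, np.nan, 30.0, 4.0, np.nan, np.nan],
})


def fill_group_median(s: pd.Series) -> pd.Series:
    # Leave all-missing groups untouched; the fallback below handles them
    return s if s.notna().sum() == 0 else s.fillna(s.median())


# Step 1: region median (Asia -> 20.0, Europe -> 4.0, Oceania stays missing)
toy["heat_filled"] = toy.groupby("region")["heat"].transform(fill_group_median)

# Step 2: overall median of the observed values (10.0) for regions without data
toy["heat_filled"] = toy["heat_filled"].fillna(toy["heat"].median())
print(toy)
```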

In [None]:
# Heat generation → region median imputation
# Group by region and fill missing values with that region's median
irena_clean['heat_generation_(tj)'] = irena_clean.groupby('region')['heat_generation_(tj)'].transform(
    lambda x: x.fillna(x.median())
)
# Fill remaining missing values (regions with no data at all) with the overall median
irena_clean['heat_generation_(tj)'] = irena_clean['heat_generation_(tj)'].fillna(irena_clean['heat_generation_(tj)'].median())

print("Missing heat_generation after region median and overall median fallback:", irena_clean['heat_generation_(tj)'].isna().sum())


In [None]:
#Electricity installed capacity → region median Imputation
# Fill missing installed capacity using the median of each region
# Function to fill missing values by region median, avoiding warning
def fill_region_median(x):
    if x.notna().sum() == 0:  # if all values are missing in the group
        return x  # leave as is
    else:
        return x.fillna(x.median())

# Apply to electricity_installed_capacity
irena_clean['electricity_installed_capacity_(mw)'] = irena_clean.groupby('region')['electricity_installed_capacity_(mw)']\
                                              .transform(fill_region_median)

# Check remaining missing values
print("Missing electricity_installed_capacity after region median imputation:", 
      irena_clean['electricity_installed_capacity_(mw)'].isna().sum())


In [None]:
#Electricity generation → ratio method Imputation
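# Compute median ratio of electricity generation to installed capacity (GWh per MW)
# Rows with zero capacity produce inf ratios, so those are excluded from the median
ratios = irena_clean['electricity_generation_(gwh)'] / irena_clean['electricity_installed_capacity_(mw)']
ratio = ratios.replace([np.inf, -np.inf], np.nan).median()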

# Fill missing electricity generation using installed capacity * median ratio
irena_clean['electricity_generation_(gwh)'] = irena_clean['electricity_generation_(gwh)'].fillna(
    irena_clean['electricity_installed_capacity_(mw)'] * ratio
)

print("Missing electricity_generation after ratio method:", irena_clean['electricity_generation_(gwh)'].isna().sum())
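
A worked example of the ratio method on made-up numbers may help to sanity-check the result (the `toy` frame and the constant `MAX_GWH_PER_MW` are illustrative names, not part of the project). The ratio has units of GWh per MW, and since 1 MW running for all 8760 hours of a year yields 8.76 GWh, a median ratio above that bound would point to a data or unit problem.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "capacity_mw": [100.0, 200.0, 50.0, 0.0],
    "generation_gwh": [300.0, 500.0, np.nan, 0.0],
})

# Ratios: 3.0, 2.5, NaN (generation missing), NaN (0 / 0)
ratios = (toy["generation_gwh"] / toy["capacity_mw"]).replace([np.inf, -np.inf], np.nan)
median_ratio = ratios.median()

# Estimate the missing generation from capacity: 50 MW * 2.75 GWh/MW
toy["generation_filled"] = toy["generation_gwh"].fillna(toy["capacity_mw"] * median_ratio)

# Physical upper bound for an annual GWh-per-MW ratio
MAX_GWH_PER_MW = 8760 / 1000
print(median_ratio, median_ratio <= MAX_GWH_PER_MW)
```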

In [None]:
# SDG 7b1 renewable capacity per capita → region median Imputation
# Fill missing per capita RE capacity using the median of each region
irena_clean['sdg_7b1_re_capacity_per_capita_(w/inhabitant)'] = irena_clean.groupby('region')['sdg_7b1_re_capacity_per_capita_(w/inhabitant)']\
                                                        .transform(lambda x: x.fillna(x.median()))

# Fallback to overall median if some regions are still missing
irena_clean['sdg_7b1_re_capacity_per_capita_(w/inhabitant)'] = irena_clean['sdg_7b1_re_capacity_per_capita_(w/inhabitant)'].fillna(
    irena_clean['sdg_7b1_re_capacity_per_capita_(w/inhabitant)'].median()
)

print("Missing sdg_7b1_re_capacity_per_capita after region median:", 
      irena_clean['sdg_7b1_re_capacity_per_capita_(w/inhabitant)'].isna().sum())

In [None]:
cols_to_check = ['heat_generation_(tj)', 
                 'electricity_installed_capacity_(mw)',
                 'electricity_generation_(gwh)', 
                 'sdg_7b1_re_capacity_per_capita_(w/inhabitant)']

print("Missing values after imputation:\n", irena_clean[cols_to_check].isna().sum())

In [None]:
# Save Clean IRENA
clean_dir = project_root / "data" / "clean"
clean_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if it does not exist yet
irena_clean_path = clean_dir / "irena_countries.csv"
irena_clean.to_csv(irena_clean_path, index=False)

print("Saved:", irena_clean_path)

## 2. OWID CO₂ Emissions Dataset

This dataset contains, for all countries and years:
- CO₂ emissions by sector
- energy consumption
- greenhouse gases
- population & GDP
- temperature impact

In [23]:
#Load OWID data
owid_path = raw_dir / "owid_co2_data.csv"

handler = DataHandler(filepath_list=[])
owid_raw = handler.load_file(owid_path)

print("OWID Shape:", owid_raw.shape)
print("Columns:", owid_raw.columns.tolist())
owid_raw.head(3)

OWID Shape: (50407, 79)
Columns: ['country', 'year', 'iso_code', 'population', 'gdp', 'cement_co2', 'cement_co2_per_capita', 'co2', 'co2_growth_abs', 'co2_growth_prct', 'co2_including_luc', 'co2_including_luc_growth_abs', 'co2_including_luc_growth_prct', 'co2_including_luc_per_capita', 'co2_including_luc_per_gdp', 'co2_including_luc_per_unit_energy', 'co2_per_capita', 'co2_per_gdp', 'co2_per_unit_energy', 'coal_co2', 'coal_co2_per_capita', 'consumption_co2', 'consumption_co2_per_capita', 'consumption_co2_per_gdp', 'cumulative_cement_co2', 'cumulative_co2', 'cumulative_co2_including_luc', 'cumulative_coal_co2', 'cumulative_flaring_co2', 'cumulative_gas_co2', 'cumulative_luc_co2', 'cumulative_oil_co2', 'cumulative_other_co2', 'energy_per_capita', 'energy_per_gdp', 'flaring_co2', 'flaring_co2_per_capita', 'gas_co2', 'gas_co2_per_capita', 'ghg_excluding_lucf_per_capita', 'ghg_per_capita', 'land_use_change_co2', 'land_use_change_co2_per_capita', 'methane', 'methane_per_capita', 'nitrous_oxi

Unnamed: 0,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Afghanistan,1750,AFG,2802560.0,,0.0,0.0,,,,...,,,,,,,,,,
1,Afghanistan,1751,AFG,,,0.0,,,,,...,,,,,,,,,,
2,Afghanistan,1752,AFG,,,0.0,,,,,...,,,,,,,,,,


### Clean OWID Dataset

OWID already has `country`, `year`, and `iso_code`, so cleaning only needs to:
- clean column names
- convert year to numeric
- add `country_iso` (for cross-dataset consistency)

In [None]:
# Standardize column names
owid_handler = DataHandler(
    filepath_list=[],
    country_col="country",
    year_col="year"
)

owid_handler.df = owid_raw.copy()
owid_clean = owid_handler.clean_data()

owid_clean.head(3)

In [None]:
# Missing Values (OWID)
#pd.set_option('display.max_rows', 200)
print("Shape",owid_clean.shape)
owid_clean.isna().sum().sort_values(ascending=False).head(10)

1. The OWID dataset contains countries and regions/aggregates.
   Regions/aggregates include World, Asia, Europe, High-income, OECD, etc.
2. Missing values stay as NaN for all emissions/consumption metrics that are not imputed in step 4.
3. Separate datasets:
   - `owid_countries` → countries with ISO3 codes (for merging with IRENA/WGI)
   - `owid_regions` → aggregates for regional/global analyses
4. Imputations for important columns:
   - **CO₂ totals and other emissions** → fill missing values using the country median.
   - **Population and GDP** → fill missing values using linear interpolation over years.
   - **Per-capita or per-GDP metrics** → recalculate after filling totals, rather than imputing directly.
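
The interpolation rule for population and GDP can be illustrated on a made-up series (the `toy` frame is purely illustrative). Gaps between two known years are filled linearly, gaps at the start or end are copied from the nearest known value, and a country with no data at all stays missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["A"] * 5 + ["B"] * 3,
    "year": [2000, 2001, 2002, 2003, 2004, 2000, 2001, 2002],
    "population": [np.nan, 100.0, np.nan, 300.0, np.nan, np.nan, np.nan, np.nan],
})

# Interpolation follows row order, so sort by country and year first
toy = toy.sort_values(["country", "year"]).reset_index(drop=True)

toy["population_filled"] = toy.groupby("country")["population"].transform(
    lambda s: s.interpolate().ffill().bfill()
)
print(toy)
```

Country A becomes 100, 100, 200, 300, 300, while country B remains entirely missing because there is nothing to interpolate from.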

In [None]:
# Cleaning OWID CO2 Dataset & Separating Countries/Regions
# Suppress warnings
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

# 1 Separate countries and regions
owid_regions = owid_clean[owid_clean['iso_code'].isna()].copy()
owid_countries = owid_clean[owid_clean['iso_code'].notna()].copy()

# Ensure 'year' is numeric
owid_countries['year'] = pd.to_numeric(owid_countries['year'], errors='coerce')
owid_regions['year'] = pd.to_numeric(owid_regions['year'], errors='coerce')

# 2 Country-level imputations
# Columns to fill using median per country
median_cols = ['co2', 'cement_co2', 'coal_co2', 'gas_co2', 'oil_co2']

for col in median_cols:
    owid_countries[col] = owid_countries.groupby('country')[col].transform(
        lambda x: x.fillna(x.median())
    )

# Columns to interpolate per country; ffill/bfill then cover gaps at the start/end of each series
interp_cols = ['population', 'gdp']
for col in interp_cols:
    # Use transform instead of apply to avoid index mismatch
    owid_countries[col] = owid_countries.groupby('country')[col].transform(
        lambda x: x.interpolate().ffill().bfill()
    )

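# 3 Recalculate per-capita and per-GDP metrics for missing values
# Units (per the OWID codebook): totals are in million tonnes, per-capita metrics
# in tonnes per person, per-GDP metrics in kg per international-$.
base_cols = ['co2', 'cement_co2', 'coal_co2', 'gas_co2', 'oil_co2']

for col in base_cols:
    # Vectorized per-capita: million tonnes -> tonnes, divided by population
    per_capita = owid_countries[col] * 1e6 / owid_countries['population']
    # Vectorized per-GDP: million tonnes -> kg, divided by GDP
    per_gdp = owid_countries[col] * 1e9 / owid_countries['gdp']

    for target, values in [(col + '_per_capita', per_capita), (col + '_per_gdp', per_gdp)]:
        if target in owid_countries.columns:
            # Keep reported values and fill only the gaps
            owid_countries[target] = owid_countries[target].fillna(values)
        else:
            # Metric not provided in the raw data: create it
            owid_countries[target] = values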

# 4 Update regional/aggregate values if missing
# Caveat: `subset` below holds every country for the year, so each gap is filled
# with a global figure regardless of which aggregate the row belongs to.
agg_cols = ['co2', 'cement_co2', 'coal_co2', 'gas_co2', 'oil_co2']
agg_cols += [c + '_per_capita' for c in ['co2', 'cement_co2']]
agg_cols += [c + '_per_gdp' for c in ['co2']]

for col in agg_cols:
    # Only update missing values
    missing_idx = owid_regions[owid_regions[col].isna()].index
    for idx in missing_idx:
        year = owid_regions.at[idx, 'year']

        # Subset all countries for that year
        subset = owid_countries[owid_countries['year'] == year]

        if col.endswith('_per_capita'):
            base_col = col.replace('_per_capita', '')
            total_pop = subset['population'].sum()
            # Totals are in million tonnes, per-capita values in tonnes per person
            value = subset[base_col].sum() * 1e6 / total_pop if total_pop > 0 else np.nan
        elif col.endswith('_per_gdp'):
            base_col = col.replace('_per_gdp', '')
            total_gdp = subset['gdp'].sum()
            # Totals are in million tonnes, per-GDP values in kg per international-$
            value = subset[base_col].sum() * 1e9 / total_gdp if total_gdp > 0 else np.nan
        else:
            value = subset[col].sum() if not subset.empty else np.nan

        owid_regions.at[idx, col] = value

# 5 Quick check
print("Missing values after country-level imputation:")
print(owid_countries[median_cols + interp_cols].isna().sum())

print("\nMissing values after updating regions:")
print(owid_regions[agg_cols].isna().sum())

print("\nOWID dataset cleaned, imputed, and regional values updated successfully.")
print("OWID countries shape:", owid_countries.shape)
print("OWID regions shape:", owid_regions.shape)
print("Example regions/aggregates:", owid_regions['country'].unique())
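
The regional update in this cell sums over every country in a given year. If an aggregate such as Asia should only reflect its own members, a membership table is needed, which this notebook does not have. The sketch below shows one possible shape for that, using a hypothetical `region_members` mapping and made-up numbers. It assumes OWID's units of million tonnes for totals and tonnes per person for per-capita values.

```python
import numpy as np
import pandas as pd

# Hypothetical membership table (not available in this notebook)
region_members = {"Region X": ["AAA", "BBB"], "Region Y": ["CCC"]}

countries = pd.DataFrame({
    "iso_code": ["AAA", "BBB", "CCC"],
    "year": [2020, 2020, 2020],
    "co2": [10.0, 30.0, 5.0],                          # million tonnes
    "population": [2_000_000, 6_000_000, 2_000_000],   # persons
})


def region_co2_per_capita(region: str, year: int) -> float:
    """Population-weighted CO2 per capita (tonnes per person) for one aggregate."""
    members = region_members[region]
    subset = countries[(countries["year"] == year) & countries["iso_code"].isin(members)]
    total_pop = subset["population"].sum()
    if total_pop <= 0:
        return np.nan
    # Sum of totals over sum of population, converting million tonnes to tonnes
    return subset["co2"].sum() * 1e6 / total_pop


print(region_co2_per_capita("Region X", 2020))
print(region_co2_per_capita("Region Y", 2020))
```

Dividing summed emissions by summed population gives a population-weighted figure, which is generally preferable to averaging the member countries' per-capita values.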


In [None]:
# Save Cleaned OWID Dataset
owid_countries_path = clean_dir / "owid_countries.csv"
owid_regions_path   = clean_dir / "owid_regions.csv"

# Save datasets
owid_countries.to_csv(owid_countries_path, index=False)
owid_regions.to_csv(owid_regions_path, index=False)

print("Saved OWID countries (with ISO3) to:", owid_countries_path)
print("Saved OWID regions/aggregates to:", owid_regions_path)

## 3. WGI Governance Dataset

The World Governance Indicators dataset contains, for every country and year:
- government effectiveness
- rule of law
- voice & accountability
- regulatory quality
- corruption indicators

It uses `countryname` instead of `country`, so that column must be renamed.

In [49]:
# Load WGI data (the Excel file was already converted to CSV above)
wgi_path = raw_dir / "wgi_countries.csv"

handler = DataHandler(filepath_list=[])
wgi_raw = handler.load_file(wgi_path)

print("WGI Shape:", wgi_raw.shape)
print("Columns:", wgi_raw.columns.tolist()[:15])
wgi_raw.head(3)

WGI Shape: (32100, 48)
Columns: ['codeindyr', 'code', 'countryname', 'year', 'indicator', 'estimate', 'stddev', 'nsource', 'pctrank', 'pctranklower', 'pctrankupper', 'adb', 'afr', 'asd', 'bps']


Unnamed: 0,codeindyr,code,countryname,year,indicator,estimate,stddev,nsource,pctrank,pctranklower,pctrankupper,adb,afr,asd,bps,bti,ccr,ebr,eiu,eqi,frh,gcb,gcs,gii,gwp,her,hum,hrm,ifd,ijt,ipd,irp,lbo,msi,obi,pia,prc,prs,rsf,tpr,vab,vdm,wbs,wcy,wjp,wmo,scalemean,scalesd
0,AFGcc1996,AFG,Afghanistan,1996,cc,-1.291704773902893,0.3405069708824157,2,4.301075458526611,0.0,27.41935539245605,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,0.2950838125720781,..,..,..,0.0,0.013374,0.93648
1,ALBcc1996,ALB,Albania,1996,cc,-0.8939034938812256,0.3159140348434448,3,19.35483932495117,2.6881721019744877,43.0107536315918,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,0.3333333333333333,..,..,..,0.315589909591906,..,..,..,0.25,0.013374,0.93648
2,DZAcc1996,DZA,Algeria,1996,cc,-0.5667409300804138,0.262076586484909,4,33.33333206176758,16.66666603088379,52.68817138671875,..,..,..,..,..,..,..,0.25,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,..,0.5,..,..,..,0.36883168576648,..,..,..,0.25,0.013374,0.93648


In [None]:
# Prepare WGI for Cleaning
#Rename `countryname` → `country`
wgi_fixed = wgi_raw.rename(columns={"countryname": "country"})
wgi_handler = DataHandler(
    filepath_list=[],
    country_col="country",
    year_col="year"
)

wgi_handler.df = wgi_fixed.copy()
wgi_clean = wgi_handler.clean_data()

In [None]:
# Missing Values (WGI)
print("Shape",wgi_clean.shape)
wgi_clean.isna().sum().sort_values(ascending=False).head(10)

1. The World Governance Indicators (WGI) dataset has multiple governance indicators for each country per year.
2. The only missing values are usually in `country_iso` (ISO3 code), while other columns such as `year`, `indicator`, and `estimate` are mostly complete. The rows are therefore split into two files:
   - `wgi_countries.csv` → only rows with valid ISO3 codes; safe for merging with IRENA/OWID.
   - `wgi_missing_iso.csv` → rows without ISO3 codes; may be small territories or naming mismatches.
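
One caveat when reading the missing-value counts: the raw WGI preview shows `..` in many source columns. These are strings, so `isna()` does not count them as missing. A minimal sketch of converting them, on a made-up three-column frame (the `toy` frame and `source_cols` list are illustrative), could look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "code": ["AFG", "ALB"],
    "estimate": ["-1.29", "-0.89"],
    "prs": ["..", "0.3333"],
})

# The ".." placeholder is a string, so nothing is reported as missing yet
missing_before = int(toy.isna().sum().sum())

# Coerce the numeric columns; anything non-numeric, including "..", becomes NaN
source_cols = ["estimate", "prs"]
toy[source_cols] = toy[source_cols].apply(pd.to_numeric, errors="coerce")

missing_after = int(toy.isna().sum().sum())
print(missing_before, missing_after)
```

When a file is read with pandas directly, `pd.read_csv(path, na_values=[".."])` would treat the placeholder as missing at load time instead.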

In [None]:
# Cleaning Missing Values for WGI

# create a separate dataframe for rows missing ISO codes
wgi_missing_iso = wgi_clean[wgi_clean['country_iso'].isna()].copy()

# Keep rows with valid ISO3 codes for merging
wgi_countries = wgi_clean[wgi_clean['country_iso'].notna()].copy()

# Quick check
print("\nWGI countries with ISO3 codes:", wgi_countries.shape)
print("WGI rows missing ISO3:", wgi_missing_iso.shape)
print("Example missing ISO3 rows:")
print(wgi_missing_iso[['country', 'year', 'indicator']].head())

# Save datasets
wgi_countries_path = clean_dir / "wgi_countries.csv"
wgi_missing_iso_path = clean_dir / "wgi_missing_iso.csv"

wgi_countries.to_csv(wgi_countries_path, index=False)
wgi_missing_iso.to_csv(wgi_missing_iso_path, index=False)

print("\nSaved WGI datasets:")
print("- Countries:", wgi_countries_path)
print("- Missing ISO3:", wgi_missing_iso_path)

1. The OWID and IRENA datasets are cleaned, imputed, and enriched for both country and regional levels.
2. Key CO₂ and energy metrics are now available per-capita and per-GDP, making comparisons across countries and regions straightforward.
3. Regional and aggregate values are computed where missing, ensuring a complete global view.
4. WGI governance indicators are aligned with countries using ISO3 codes, allowing integration with energy and emissions data.
5. The final dataset is ready for analysis, modeling, or visualization, supporting research into energy transitions, emissions trends, and governance impacts on sustainability.
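
As an illustration of how the cleaned files might be combined downstream, the sketch below reshapes a long WGI table (one row per country, year, and indicator) into one column per indicator and joins it to OWID on `country_iso` and `year`. All values are made up, the indicator codes are only examples, and the `wgi_` prefix is an arbitrary choice for this sketch.

```python
import pandas as pd

owid = pd.DataFrame({
    "country_iso": ["DZA", "DZA", "ALB"],
    "year": [2000, 2001, 2000],
    "co2": [1.0, 2.0, 3.0],
})

wgi_long = pd.DataFrame({
    "country_iso": ["DZA", "DZA", "ALB", "ALB"],
    "year": [2000, 2000, 2000, 2000],
    "indicator": ["cc", "rl", "cc", "rl"],
    "estimate": [-0.5, -0.6, -0.8, -0.7],
})

# Long -> wide: one row per country-year, one column per indicator
wgi_wide = (
    wgi_long.pivot(index=["country_iso", "year"], columns="indicator", values="estimate")
    .add_prefix("wgi_")
    .reset_index()
)
wgi_wide.columns.name = None

# validate raises if either side has duplicate country-year keys
merged = owid.merge(wgi_wide, on=["country_iso", "year"], how="left", validate="one_to_one")
print(merged)
```

A left join keeps every OWID row, so country-years without governance data simply carry missing values in the `wgi_` columns.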