# Merge Cleaned Datasets and Deep Cleaning

1. Load cleaned datasets (`IRENA`, `OWID CO₂`, `WGI`)
2. Inspect each dataset for types, missing values, and column consistency
3. Merge datasets step by step using `ISO3` and `year`
4. Handle missing values (fill, drop, or keep as NA)
5. Generate a final combined dataset ready for EDA and visualizations

Key points:
- ISO3 and year alignment across all datasets
- Step-by-step merging, beginner-friendly
- Missing values handled carefully
- Clean final dataset ready for analysis

In [3]:
# Imports and setup
import pandas as pd
from pathlib import Path

clean_dir = Path("../data/cleaned")
final_dir = Path("../data/final")
# parents=True also creates ../data if it does not exist yet
final_dir.mkdir(parents=True, exist_ok=True)

In [None]:
# Load datasets (only the last expression in a cell, wgi.head(2), is displayed)
# Load cleaned IRENA
irena = pd.read_csv(clean_dir / "cleaned_irena.csv")
print("IRENA shape:", irena.shape)
irena.head(2)

# Load cleaned OWID
owid = pd.read_csv(clean_dir / "owid_countries.csv")
print("OWID shape:", owid.shape)
owid.head(2)

# Load cleaned WGI
wgi = pd.read_csv(clean_dir / "wgi_countries.csv")
print("WGI shape:", wgi.shape)
wgi.head(2)

IRENA shape: (91743, 17)
OWID shape: (42480, 80)
WGI shape: (28350, 49)


Unnamed: 0,codeindyr,code,country,year,indicator,estimate,stddev,nsource,pctrank,pctranklower,...,tpr,vab,vdm,wbs,wcy,wjp,wmo,scalemean,scalesd,country_iso
0,AFGcc1996,AFG,Afghanistan,1996,cc,-1.291704773902893,0.3405069708824157,2,4.301075458526611,0.0,...,..,..,0.2950838125720781,..,..,..,0.0,0.013374,0.93648,AFG
1,ALBcc1996,ALB,Albania,1996,cc,-0.8939034938812256,0.3159140348434448,3,19.35483932495117,2.6881721019744877,...,..,..,0.315589909591906,..,..,..,0.25,0.013374,0.93648,ALB


## Check key columns for merging
We will merge datasets on:
- the standardized ISO3 country code, stored as `iso3_code` (IRENA), `iso_code` (OWID), and `country_iso` (WGI)
- `year` → numeric year

Check that these columns exist and have no missing values.

In [None]:
#list view of column names
print("\nIRENA columns:\n", irena.columns.tolist(),"\n")
print("\nOWID columns:\n", owid.columns.tolist())
print("\nWGI columns:\n", wgi.columns.tolist())

#Column names in separate lines
print("IRENA columns:\n" + "\n".join(irena.columns))
print("\n\nOWID columns:\n" + "\n".join(owid.columns))
print("\n\nWGI columns:\n" + "\n".join(wgi.columns))

In [12]:
# Check IRENA
print(irena[['iso3_code','year']].isna().sum())

# Check OWID
print(owid[['iso_code','year']].isna().sum())

# Check WGI
print(wgi[['country_iso','year']].isna().sum())

iso3_code    0
year         0
dtype: int64
iso_code    0
year        0
dtype: int64
country_iso    0
year           0
dtype: int64


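None of the key columns has missing values, so the datasets can be merged on ISO code + year. This check does not test whether each (ISO code, year) pair is unique within a dataset.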
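
A left join also changes the row count when a key pair repeats in the right-hand table, so it can help to count duplicated (ISO code, year) pairs before merging. The following is a minimal sketch: the helper `count_duplicate_keys` is hypothetical and the small frame is made up for illustration, not taken from the real data.

```python
import pandas as pd

def count_duplicate_keys(df: pd.DataFrame, keys: list[str]) -> int:
    """Return the number of rows whose key combination already appeared earlier."""
    return int(df.duplicated(subset=keys).sum())

# Toy long-format table: two indicators per country-year (illustrative values only)
toy = pd.DataFrame({
    "country_iso": ["AFG", "AFG", "ALB", "ALB"],
    "year": [1996, 1996, 1996, 1996],
    "indicator": ["cc", "ge", "cc", "ge"],
})

n_dupes = count_duplicate_keys(toy, ["country_iso", "year"])
print("Duplicate key rows:", n_dupes)
```

A non-zero count on a right-hand table (for example `wgi` with `['country_iso', 'year']`) would mean that each matching IRENA row is repeated once per extra key row in the merge.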

In [24]:
## 3. Merge Step-by-Step

# 1. Standardize ISO column names in the 3 datasets
# 2. Merge `IRENA` + `OWID` on `iso` and `year` (left join)
# 3. Merge the result with `WGI` (left join)
# 4. A left join keeps every IRENA row; missing OWID or WGI data will be NaN.
#    Rows are repeated when the right-hand table has several rows per (iso, year).

# Make ISO column consistent across datasets
irena = irena.rename(columns={'iso3_code': 'iso'})
owid  = owid.rename(columns={'iso_code': 'iso'})
wgi   = wgi.rename(columns={'country_iso': 'iso'})

# Merge IRENA + OWID
irena_owid = pd.merge(
    irena,
    owid,
    on=['iso','year'],
    how='left',
    suffixes=('_irena','_owid')
)
print("IRENA + OWID shape:", irena_owid.shape)

# Merge with WGI
final_df = pd.merge(
    irena_owid,
    wgi,
    on=['iso','year'],
    how='left',
    suffixes=('','_wgi')
)
print("Final merged shape:", final_df.shape)

# Save the merged dataset (before any imputation), reusing final_dir from the setup cell
final_df.to_csv(final_dir / "final_countries.csv", index=False)
final_df.head(3)

IRENA + OWID shape: (91743, 95)
Final merged shape: (472473, 142)


Unnamed: 0,region,sub-region,country_irena,iso,m49_code,re_or_non-re,group_technology,technology,sub-technology,producer_type,...,rsf,tpr,vab,vdm,wbs,wcy,wjp,wmo,scalemean,scalesd
0,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,...,..,..,..,0.297942187677093,..,..,..,0.25,0.004617,0.93217
1,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,...,..,..,..,..,..,..,..,0.25,-0.036084,0.972304
2,Africa,Northern Africa,Algeria,DZA,12,Total Renewable,Bioenergy,Solid biofuels,Other primary solid biofuels n.e.s.,All types,...,..,..,..,..,..,..,..,0.25,0.018476,0.935971
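
The jump from 91,743 to 472,473 rows in the WGI merge suggests that WGI holds several rows per (ISO code, year), which is consistent with the `indicator` column in the WGI preview. One possible way to avoid repeating rows is to reshape WGI to one row per country-year before merging. The sketch below uses made-up toy frames; the `wgi_` column prefix and the choice of `estimate` as the value column are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
left = pd.DataFrame({
    "iso": ["DZA", "DZA"],
    "year": [2000, 2001],
    "electricity_generation_(gwh)": [1.0, 2.0],
})
wgi_long = pd.DataFrame({
    "iso": ["DZA", "DZA", "DZA", "DZA"],
    "year": [2000, 2000, 2001, 2001],
    "indicator": ["cc", "ge", "cc", "ge"],
    "estimate": [-0.9, -1.0, -0.8, -0.7],
})

# Reshape to one row per (iso, year) with one column per indicator
wgi_wide = (
    wgi_long
    .pivot(index=["iso", "year"], columns="indicator", values="estimate")
    .add_prefix("wgi_")
    .reset_index()
)
wgi_wide.columns.name = None

# validate="many_to_one" raises MergeError if the right table has repeated keys
merged = left.merge(wgi_wide, on=["iso", "year"], how="left", validate="many_to_one")
print(merged.shape)
```

With this shape, the merged frame keeps the row count of the left table, and `validate` turns an accidental row multiplication into an explicit error.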


## 4. Check Missing Values

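- Some countries or years may be missing OWID or WGI data
- Decide per column whether to impute or leave the value as NA
- For numeric columns, fill with 0 only when a missing value really means "none"; otherwise the median or leaving NA is usually safer (depends on context)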
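
As an illustration of the median option, the sketch below fills gaps with each country's own median and keeps the original column for comparison. The toy frame and the `gdp_filled` column name are made up for this example.

```python
import pandas as pd

# Toy frame with gaps in a numeric column (illustrative values only)
toy = pd.DataFrame({
    "iso": ["DZA", "DZA", "DZA", "ALB", "ALB"],
    "year": [2000, 2001, 2002, 2000, 2001],
    "gdp": [10.0, None, 30.0, None, 5.0],
})

# Share of missing values per column, to inform the fill/drop/keep decision
missing_share = toy.isna().mean()

# Fill each country's gaps with that country's own median
country_median = toy.groupby("iso")["gdp"].transform("median")
toy["gdp_filled"] = toy["gdp"].fillna(country_median)
print(toy)
```

Writing the result to a separate column makes it easy to see later which values were observed and which were imputed.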

In [25]:
# Missing values count
missing_counts = final_df.isna().sum().sort_values(ascending=False)
missing_counts.head(20)

heat_generation_(tj)                             408790
sdg_7b1_re_capacity_per_capita_(w/inhabitant)    312711
cumulative_other_co2                             283751
other_industry_co2                               283751
other_co2_per_capita                             283751
share_global_other_co2                           283751
share_global_cumulative_other_co2                283751
electricity_installed_capacity_(mw)              276116
electricity_generation_(gwh)                     274344
consumption_co2_per_gdp                          148267
consumption_co2                                  129787
consumption_co2_per_capita                       129787
trade_co2_share                                  129787
trade_co2                                        129787
gas_co2_per_capita                               127896
gas_co2                                          127896
share_global_cumulative_gas_co2                  127896
share_global_gas_co2                            

In [None]:
# Imputation
# Example: fill missing energy values with 0, assuming a missing value means
# that nothing was generated or installed
numeric_cols = [
    'electricity_generation_(gwh)',
    'electricity_installed_capacity_(mw)',
    'heat_generation_(tj)'
]
# population and gdp are left as NaN: a 0 would be read as a real measurement

for col in numeric_cols:
    if col in final_df.columns:
        final_df[col] = final_df[col].fillna(0)

In [None]:
## 5. Check Data Types
final_df.dtypes
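
The WGI preview shows `..` as a placeholder in several columns, which would probably make pandas read those columns as `object` rather than numeric. A minimal sketch of converting such columns, using a made-up toy frame with two of the column names from the preview:

```python
import pandas as pd

# Toy frame mimicking the ".." placeholders (illustrative values only)
toy = pd.DataFrame({
    "vdm": ["0.295", "..", "0.316"],
    "wmo": ["0.0", "0.25", ".."],
})

for col in ["vdm", "wmo"]:
    # errors="coerce" turns non-numeric entries such as ".." into NaN
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

An alternative is to pass `na_values=[".."]` to `pd.read_csv` when loading the file, so the placeholders become NaN at load time.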

In [None]:
## 6. Quick Descriptive Stats
final_df.describe(include='all').T.head(10)

In [None]:
## 7. Save final merged dataset
final_path = final_dir / "final_combined.csv"
final_df.to_csv(final_path, index=False)
print("Saved final dataset:", final_path)
